dbGaP security and SoM risk classifications

The Database of Genotypes and Phenotypes (dbGaP) is an NIH-maintained database developed to archive and distribute the results of studies that have investigated the interaction of genotype and phenotype. dbGaP provides two types of data: open access and controlled access.

In common parlance, dbGaP security refers to the NIH security best practices for controlled-access data subject to the NIH Genomic Data Sharing (GDS) policy. A researcher must comply with dbGaP security practices if their research involves access to dbGaP controlled-access data. These best practices are practically impossible to implement unless the researcher is IT-savvy enough to set up firewalls, encryption, and other NIST security practices.

dbGaP security has two key components:

  1. Authentication: You are who you say you are
  2. Authorization: You have been granted access to dbGaP data (your PI has to fill out a form requesting the controlled-access data on your behalf)
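
The distinction between the two components can be illustrated with a minimal sketch. This is not how dbGaP or any Stanford system implements access control; the `User` record, its field names, and the dataset labels are hypothetical and exist only to show that both checks must pass independently.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    """Hypothetical user record; the field names are illustrative only."""
    name: str
    # True once the login (e.g. password plus second factor) has succeeded.
    identity_verified: bool = False
    # Labels of datasets covered by an approved access request.
    approved_datasets: set = field(default_factory=set)


def is_authenticated(user: User) -> bool:
    # Authentication: the user is who they say they are.
    return user.identity_verified


def is_authorized(user: User, dataset_id: str) -> bool:
    # Authorization: the user has been granted access to this dataset.
    return dataset_id in user.approved_datasets


def can_access(user: User, dataset_id: str) -> bool:
    # Both checks must pass; neither one alone is enough.
    return is_authenticated(user) and is_authorized(user, dataset_id)
```

In this sketch, a verified user with no approved request is refused, and so is an unverified session that presents the name of an approved user.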

All of the dbGaP best practices fundamentally ensure that the compute environment supports authentication and authorization (aka authx). Security practices such as firewalls, two-factor authentication, and encryption are important so that no malware can pretend to be you. It is equally important that the data does not fall into the hands of an unauthorized user. On Linux systems, access to data is limited by setting up proper groups and directory permissions. It is also important to limit access to web services, even useful analytical ones such as the UCSC Genome Browser.
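
As a rough illustration of the Linux group and directory permission point above, the following sketch checks that a data directory belongs to the expected group and is closed to all other users. It is an assumption about what a basic check could look like, not an official dbGaP or GBSC compliance tool; the function name `audit_directory` is hypothetical, and the check covers only the top-level directory, not its contents or any POSIX ACLs.

```python
import os
import stat


def audit_directory(path: str, expected_gid: int) -> list:
    """Return a list of permission problems found on a single directory."""
    st = os.stat(path)
    mode = stat.S_IMODE(st.st_mode)
    problems = []

    if not stat.S_ISDIR(st.st_mode):
        problems.append("not a directory")

    # Any read, write, or execute bit for "other" lets users outside
    # the owner and the group reach the data.
    if mode & stat.S_IRWXO:
        problems.append("accessible to users outside the owner and group")

    # The directory should belong to the group of approved users.
    if st.st_gid != expected_gid:
        problems.append(f"group id is {st.st_gid}, expected {expected_gid}")

    return problems
```

One might call it as `audit_directory("/path/to/project", grp.getgrnam("approved_lab").gr_gid)` using the standard `grp` module, where both the path and the group name are placeholders. An empty list means none of these specific problems were found, which is far from a full compliance review.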

Stanford School of Medicine (SoM) risk classifications are guidance for protecting Protected Health Information (PHI) and Personally Identifiable Information (PII). These classifications are not directly applicable to dbGaP requirements. Appearing on the List of Approved Services does not mean that you can use a service without further IT effort. For example, AWS infrastructure is currently approved for High Risk (non-PHI) data, but an IT-savvy and trusted organization within Stanford still needs to set up the firewalls and access control so that there is no accidental breach of authx. Whether the SoM requirements are strictly applicable to your specific compute environment (e.g. worker nodes on a compute cluster) is usually determined by your infrastructure manager.

Note that for dbGaP security compliance, meeting the SoM moderate risk classification is necessary but not sufficient. GBSC-managed systems are dbGaP security compliant. If you have any questions about GBSC infrastructure security, please contact us.

De-mystifying Data Security

We recommend that you read our peer-reviewed Nature Biotechnology commentary ("Secure cloud computing for genomic data"). While the publication is written to guide cloud implementation, it will give you a good overview of the pertinent behind-the-scenes critical points.

Genomic Data Security FAQ

What is PHI/PII and what are the consequences of a PHI/PII breach?

Please refer to your HIPAA training for the definitions of PHI and PII. Your HIPAA training will also guide you regarding the consequences of a breach. Needless to say, a PHI/PII breach has serious consequences for you, your lab, and Stanford.

My genomic data is de-identified. Do I need HIPAA compliance?

No. However, your IRB or collaborator may wish you to treat the data as if it were PII (aka "highly sensitive non-PII"). Genomic data is like a molecular fingerprint and, under certain circumstances, can result in privacy loss (see the resources section on the website, "Genomics and Patient Privacy", for further information).

Privacy experts agree that NIH dbGaP security compliance is sufficient to support de-identified genomic datasets. Complete your HIPAA training to familiarize yourself with how you should handle PHI/PII data.

What are the consequences of a dbGaP security breach?

You will lose access to dbGaP data and will be asked to delete all dbGaP data you hold. This will severely affect your ability to complete any project that uses dbGaP data.

Who is the IT Officer on my dbGaP application?

It depends on the system you are planning to use to host your data. For the GBSC-managed SCG cluster, list Ruth Marinshaw (CTO, Stanford Research Computing Center). For the GBSC-managed GCP cloud, list Dr. Somalee Datta (Director, GBSC). Please contact Somalee Datta if you are planning to use DNAnexus for your dbGaP data.

For other systems, please contact whoever is responsible for system administration.

If my IRB permits, should I make de-identified genomic datasets publicly available?

You may. However, we recommend that de-identified genomic datasets be treated as controlled-access data and handled according to the NIH dbGaP security best practices for controlled-access data.