In a nutshell

What is Stanford OMOP?

In 2017, SoM Informatics leadership decided to invest in new infrastructure for modern data science, AI, and population health research. The key elements of this infrastructure are:

  1. STARR: The STAnford medicine Research data Repository (STARR) was created to hold different data modalities from the two hospitals. The data is petascale, so it became important to augment the on-premises data center with a cloud data center.
  2. OMOP: Stanford decided to adopt the OHDSI OMOP Common Data Model (CDM) to engage in reproducible and interoperable research.
  3. Nero: Since the patient data in STARR is typically High Risk or PHI, SoM also decided to invest in a secure computational enclave, Nero, where researchers can focus on research and get seamless access to data without worrying about technology scalability and data security.

 

Summary of datasets

There are three essential data sets:

  1. Pre-IRB OMOP-deidentified (aka OMOP-deid), only accessible via SQL on Nero
  2. Pre-IRB OMOP-deidentified-lite (aka OMOP-deid-lite), accessible via SQL on Nero or via the ATLAS cohort tool
  3. Post-IRB OMOP identified, only accessible via the concierge service

A synthetic dataset called SynPUF is also available. In addition, we provide 1% samples of the de-identified datasets to ease query and algorithm development. OMOP and OMOP-deid are identical except that the latter is de-identified using the HIPAA Safe Harbor method. Both contain clinical text and flowsheets. The deid-lite dataset contains the same data as deid, except that clinical text and flowsheets are removed. The ATLAS cohort tool uses the deid-lite dataset.
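
The dataset summary above can be captured as a small lookup structure. The following is a minimal sketch in Python; the `Dataset` class, its field names, and the `datasets_via` helper are illustrative assumptions and not part of any Stanford tooling. Only the access routes and content flags come from the descriptions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical record describing one of the three OMOP datasets."""
    name: str
    stage: str                # "pre-IRB" or "post-IRB"
    identified: bool          # True only for the identified OMOP dataset
    has_clinical_text: bool   # clinical text and flowsheets included?
    access: tuple             # access routes named in the summary


DATASETS = (
    Dataset("OMOP-deid", "pre-IRB", False, True, ("SQL on Nero",)),
    Dataset("OMOP-deid-lite", "pre-IRB", False, False, ("SQL on Nero", "ATLAS")),
    Dataset("OMOP", "post-IRB", True, True, ("concierge service",)),
)


def datasets_via(route):
    """Return the names of the datasets reachable through a given access route."""
    return [d.name for d in DATASETS if route in d.access]


print(datasets_via("ATLAS"))
print(datasets_via("SQL on Nero"))
```

Keeping the flags in one place makes it easy to check, for example, that only the lite dataset is exposed through ATLAS.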

Understanding risk classification: De-identified data does not imply Low Risk data. Studies have shown that fully de-identified data, when combined with public data, can result in re-identification. Stanford UPO deems the OMOP-deid dataset to be High Risk. This dataset contains algorithmically de-identified clinical notes and may contain incidental PHI.

SQL access to OMOP dataset

The pre-IRB dataset STARR-OMOP-deid can be accessed by researchers directly. However, this dataset is classified as High Risk and is only accessible to researchers via a Nero GCP account. 

The dataset, STARR-OMOP-deid, contains clinical notes and medical concepts derived from these clinical notes. The overall dataset is ~7 terabytes. In order to support scalable and performant access for researchers, Research IT leverages Google Cloud Platform (GCP) BigQuery (BQ). BQ is a highly performant analytical data warehousing technology. BQ supports ANSI-compliant SQL and a powerful Application Programming Interface (API). For more information on BQ database performance, please review our manuscript, Supplementary Materials, Section 3: Database Technologies.
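
As an illustration of the kind of SQL a researcher might write against OMOP tables, here is a minimal sketch in Python. It uses an in-memory SQLite database as a stand-in for BigQuery so that it runs anywhere. The rows, the concept IDs (1001 and 1002), and the `count_patients` helper are made up for the example. The table and column names (`person`, `condition_occurrence`, `person_id`, `condition_concept_id`) follow the OMOP CDM. On BigQuery, the table names would additionally be qualified with a project and dataset, which are not shown here.

```python
import sqlite3

# In-memory SQLite stands in for BigQuery; all rows below are fabricated toy data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    year_of_birth INTEGER
);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER,
    condition_concept_id INTEGER
);
INSERT INTO person VALUES (1, 1950), (2, 1975), (3, 1990);
INSERT INTO condition_occurrence VALUES
    (10, 1, 1001), (11, 1, 1001), (12, 2, 1001), (13, 3, 1002);
""")

# Count distinct patients with at least one record of a given condition concept.
COHORT_SQL = """
SELECT COUNT(DISTINCT p.person_id) AS n_patients
FROM person AS p
JOIN condition_occurrence AS co
  ON co.person_id = p.person_id
WHERE co.condition_concept_id = ?
"""


def count_patients(connection, concept_id):
    """Return the number of distinct persons with the given condition concept."""
    row = connection.execute(COHORT_SQL, (concept_id,)).fetchone()
    return row[0]


print(count_patients(conn, 1001))
```

Because the person with `person_id` 1 has two records for the same concept, `COUNT(DISTINCT ...)` is what keeps the result a patient count rather than a record count.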

To access a BQ dataset, a researcher needs a GCP account. Furthermore, the GCP account needs to be secured for High Risk data. Research IT has partnered with Stanford Research Computing to offer Nero, a managed, HIPAA-compliant data science platform for Stanford researchers. Researchers need access to Nero GCP in order to get access to STARR-OMOP-deid. Think of Nero as a secure research enclave. When you request access to the data, we will walk you through the steps for access to Nero.

Pre- and post-IRB workflow for researchers using OMOP

The data is designed for two stages of research, pre-IRB and post-IRB.
  1. In the pre-IRB stage, the data is self-service and you can directly access:
    1. STARR-OMOP-deid
    2. STARR-OMOP-deid-lite
  2. For post-IRB, you are currently required to go through the concierge service. The concierge service team members are Stanford Honest Brokers. They have access to STARR-OMOP. The data is identical between STARR-OMOP and STARR-OMOP-deid except that the latter is de-identified using the HIPAA Safe Harbor method.

The OMOP-deid and OMOP-deid-lite datasets are non-human-subjects datasets. To use them, you must sign a Data Privacy Attestation. When you access this data in the pre-IRB stage, you cannot request identification of specific person IDs. This means you cannot request a correspondence like the following: {person_ID1 = MRN1, person_ID2 = MRN2, ...}

If your research needs are satisfied with anonymized data, you may not need an IRB. If, however, you are ready to move to the post-IRB stage, you can request the following:

  • You can share the query that you developed using OMOP-deid and request output from the identified OMOP dataset. Your IRB must permit you access to the tables you requested.
  • Given a list of person_IDs, you can get a list of MRNs back. Your DUA prohibits you from matching person_ID with MRN.
  • You can request an IRB-specific de-identification service. This also allows you to get your IRB-specific codebook/crosswalk between de-identified person IDs and identified MRNs.
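
To illustrate the difference between the second and third options above, here is a toy sketch in Python. The crosswalk contents, the MRN strings, and both function names are hypothetical, and this is not the concierge team's actual process. It only shows the idea that a list of MRNs returned in an order unrelated to the input does not reveal which person_ID maps to which MRN, whereas an IRB-specific codebook makes that mapping explicit.

```python
def mrns_without_correspondence(person_ids, crosswalk):
    """Return the MRNs for the given person_ids as a sorted list.

    Sorting the MRNs (an assumption made for this sketch) means the output
    order carries no information about the order of the input person_ids.
    """
    return sorted({crosswalk[pid] for pid in person_ids})


def irb_codebook(person_ids, crosswalk):
    """Return an explicit person_id -> MRN mapping for the given person_ids."""
    return {pid: crosswalk[pid] for pid in person_ids}


# Hypothetical crosswalk; in this sketch only an honest broker would hold it.
crosswalk = {101: "MRN-C", 102: "MRN-A", 103: "MRN-B"}

print(mrns_without_correspondence([101, 102], crosswalk))
print(irb_codebook([103], crosswalk))
```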

 

In some cases, a researcher may want access to a Limited Data Set (LDS). This requires filing an eProtocol. Once the eProtocol is approved, ask the concierge service to generate an eProtocol-specific LDS. A Data Use Agreement (not the same as the Data Privacy Attestation) is additionally required for access to an LDS. Note that a Limited Data Set contains some types of PHI and is designated as High Risk data.

 

Is OMOP right for me?

At Stanford, we start with Epic Clarity and derive downstream research data warehouses like STRIDE and STARR-OMOP. The following figure presents the landscape and helps illustrate how well the various data warehouses fit different use cases and researcher skill sets. STARR-OMOP is a great solution for data scientists who are familiar with data science tools like SQL, Python/R, and Jupyter Notebooks, or who wish to develop these skills. Furthermore, OMOP is a common data model, which means that learning OMOP takes you beyond the idiosyncrasies of the Stanford EHR.

The OMOP CDM demonstrates strong results in comparative effectiveness research (Ogunyemi et al.) with minimal information loss during data transformation (Voss et al.), speeds up implementation of clinical phenotypes across networks (Hripcsak et al.), and promotes research reproducibility (Zhao et al.). For more information on OMOP, please review our manuscript.

 

Participation in network studies using STARR-OMOP

Since its launch in 2019, STARR-OMOP has supported a number of network studies. While early participation came from labs with long-term experience with OMOP, other Stanford labs are also getting engaged. Network studies do not require data sharing. We expect most studies to leverage de-identified data, so an IRB is not required. If you are participating in a European study, a DUA specific to GDPR may be required. Please work with your department's IRB liaison on any paperwork requirements.

2021 YTD

  1. Use of repurposed and adjuvant drugs in hospital patients with covid-19: multinational network cohort study, Prats-Uribe, A., ..., Posada J.D., ..., Shah N.H., ..., Prieto-Alhambra, D., BMJ 2021; 373, doi: https://doi.org/10.1136/bmj.n1038, May 2021
  2. Unraveling COVID-19: a large-scale characterization of 4.5 million COVID-19 cases using CHARYBDIS, Prieto-Alhambra D., ..., Posada J.D., ..., Shah N.H., ..., Suchard M., doi: 10.21203/rs.3.rs-279400/v1, Mar 2021
  3. COVID-19 in patients with autoimmune diseases: characteristics and outcomes in a multinational network of cohorts across three countries, Tan E.H., ..., Posada J.D., ..., Shah N.H., ..., Prieto-Alhambra, D., Rheumatology, keab250, https://doi.org/10.1093/rheumatology/keab250, Mar 2021
  4. Characteristics and outcomes of 118,155 COVID-19 individuals with a history of cancer in the United States, Roel E., ..., Posada J.D., ..., Shah N.H., ..., Duarte-Salles, T., https://doi.org/10.1101/2021.01.12.21249672, Jan 2021

In 2020

  1. Prediction of Major Depressive Disorder Following Beta-Blocker Therapy in Patients with Cardiovascular Diseases, Jin S., ..., Posada J.D., ..., Shah N.H., ..., You S.C., J. Pers. Med. 10(4), 288, Dec 2020
  2. Use of dialysis, tracheostomy, and extracorporeal membrane oxygenation among 240,392 patients hospitalized with COVID-19 in the United States, Burn, E., ..., Posada J.D., ..., Duarte-Salles T., doi: 10.1101/2020.11.25.20229088, Nov 2020
  3. Characteristics, outcomes, and mortality amongst 133,589 patients with prevalent autoimmune diseases diagnosed with, and 48,418 hospitalised for COVID-19: a multinational distributed network cohort analysis, Tan E.H., ..., Posada J.D., ..., Shah N.H., ..., Prieto-Alhambra, D., doi: 10.1101/2020.11.24.20236802, Nov 2020
  4. Deep phenotyping of 34,128 adult patients hospitalised with COVID-19 in an international network study, Burn E., ..., Posada J.D., ..., Shah N.H., ..., Ryan P., Nature Communications volume 11, Oct 2020
  5. Baseline phenotype and 30-day outcomes of people tested for COVID-19: an international network cohort including >3.32 million people tested with real-time PCR and >219,000 tested positive for SARS-CoV-2 in South Korea, Spain and the United States, Golozar, A., ..., Posada J.D., ..., Shah N.H., ..., Prieto-Alhambra, D., doi: 10.1101/2020.10.25.20218875, Oct 2020
  6. Baseline characteristics, management, and outcomes of 55,270 children and adolescents diagnosed with COVID-19 and 1,952,693 with influenza in France, Germany, Spain, South Korea and the United States: an international network cohort study, Duarte-Salles T., ..., Posada J.D., ..., Shah N.H., ..., Prieto-Alhambra, D., doi: 10.1101/2020.10.29.20222083, Oct 2020
  7. Clinical characteristics, symptoms, management and health outcomes in 8,598 pregnant women diagnosed with COVID-19 compared to 27,510 with seasonal influenza in France, Spain and the US: a network cohort analysis, Lai, L.Y.H., ..., Posada J.D., ..., Shah N.H., ..., Prieto-Alhambra, D., doi: https://doi.org/10.1101/2020.10.13.20211821, Oct 2020
  8. Heterogeneity and temporal variation in the management of COVID-19: a multinational drug utilization study including 71,921 hospitalized patients from China, South Korea, Spain, and the United States of America, Prats-Uribe, A., ..., Posada J.D., ..., Shah N.H., ..., Prieto-Alhambra, D., https://doi.org/10.1101/2020.09.15.20195545, Sep 2020
  9. Characteristics and outcomes of 627 044 COVID-19 patients with and without obesity in the United States, Spain, and the United Kingdom, Recalde M., ..., Posada J.D., ..., Shah N.H., ..., Duarte-Salles T., https://doi.org/10.1101/2020.09.02.20185173, Sep 2020
  10. An international characterisation of patients hospitalised with COVID-19 and a comparison with those previously hospitalised with influenza, Burn, E., ..., Posada J.D., ..., Shah N.H., ..., Ryan, P., doi: 10.1101/2020.04.22.20074336, Apr 2020