A new Clinical Dataset in OHDSI OMOP Common Data Model launched
Research IT launches phase I of a powerful new clinical data platform
Oct 13, 2019: Research IT is building a new-generation, cloud-scale clinical data platform. Phase I of the new platform has launched and contains the STARR-OMOP dataset: Electronic Health Records (EHR) data from Epic Clarity in the Observational Health Data Sciences and Informatics (OHDSI) Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM). This is Stanford's first fully de-identified EHR dataset made available to researchers in a secure computational environment prior to IRB approval.
In the STARR-OMOP dataset, aside from the standard encounter tables in the OMOP CDM, we populate the clinical notes and note annotations. All data in STARR-OMOP-deid are de-identified, including the clinical notes. For text de-identification, we use sophisticated NLP, Safe Harbor, and other approaches such as Hiding in Plain Sight (HIPS). For mapping clinical text to concepts, we use a pipeline developed by LePendu et al. that incorporates both negation detection and history detection. These contextual cues are based on NegEx and ConText and enable us to discern whether a term should not be attributed to the patient's current status (e.g., lack of valvular dysfunction, or sister has muscular dystrophy).
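To make the idea of contextual cues concrete, here is a minimal Python sketch of trigger-based context detection in the spirit of NegEx and ConText. It is not the LePendu et al. pipeline: the trigger lists, the `classify_mention` function, and the six-token look-back window are all hypothetical simplifications chosen for illustration, and the real algorithms use much larger curated lexicons and scoping rules.

```python
import re

# Hypothetical, heavily abbreviated trigger lists (assumptions for illustration).
NEGATION_TRIGGERS = ("no", "denies", "without", "lack of", "negative for")
FAMILY_TRIGGERS = ("sister", "brother", "mother", "father", "family history of")


def _has_trigger(text, triggers):
    """Return True if any trigger phrase occurs as whole words in text."""
    return any(re.search(r"\b" + re.escape(t) + r"\b", text) for t in triggers)


def classify_mention(sentence, term, window=6):
    """Label a term mention as 'negated', 'family', or 'current'.

    Returns None when the term does not occur in the sentence.
    Only the `window` tokens preceding the term are inspected.
    """
    lowered = sentence.lower()
    idx = lowered.find(term.lower())
    if idx == -1:
        return None
    preceding = " ".join(lowered[:idx].split()[-window:])
    if _has_trigger(preceding, NEGATION_TRIGGERS):
        return "negated"
    if _has_trigger(preceding, FAMILY_TRIGGERS):
        return "family"
    return "current"
```

Applied to the two examples above, `classify_mention("Lack of valvular dysfunction.", "valvular dysfunction")` yields `"negated"` and `classify_mention("Sister has muscular dystrophy.", "muscular dystrophy")` yields `"family"`, so neither term would be attributed to the patient's current status.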
The resulting database contains patient encounter data from ~2.67M patients. Over 60% of the patients have a diagnosis (ICD 9/10), over 40% have medication information (RxNorm), ~80% have lab information (LOINC), and over 95% have clinical notes. The dataset is classified as a Stanford High Risk dataset.
The underlying technology for the STARR-OMOP-deid dataset is Google Cloud Platform (GCP) BigQuery (BQ). BQ is a highly performant analytical data warehousing technology. BQ supports ANSI-compliant SQL and a powerful Application Programming Interface (API). At launch, access to the STARR-OMOP-deid data is via BigQuery APIs. A future release will support the OHDSI cohort tool, ATLAS.
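As an illustration of API-based access, the following Python sketch builds a standard SQL query against the OMOP CDM `condition_occurrence` table and runs it with the `google-cloud-bigquery` client library. The function names, the project ID, the dataset path, and the concept IDs are hypothetical placeholders, not actual STARR-OMOP-deid identifiers; the table and column names come from the OMOP CDM itself.

```python
def build_condition_count_query(dataset, concept_ids):
    """Build SQL counting distinct patients per condition concept.

    `dataset` is a fully qualified BigQuery dataset such as
    "my-project.my_omop_dataset" (a hypothetical placeholder).
    """
    if not concept_ids:
        raise ValueError("concept_ids must not be empty")
    # Casting to int keeps arbitrary strings out of the SQL text.
    ids = ", ".join(str(int(c)) for c in concept_ids)
    return (
        "SELECT condition_concept_id, COUNT(DISTINCT person_id) AS n_patients\n"
        f"FROM `{dataset}.condition_occurrence`\n"
        f"WHERE condition_concept_id IN ({ids})\n"
        "GROUP BY condition_concept_id\n"
        "ORDER BY n_patients DESC"
    )


def run_query(sql, project):
    """Run the query and return rows as dictionaries.

    Requires the google-cloud-bigquery package and valid credentials;
    the import is deferred so the query builder works without them.
    """
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    return [dict(row.items()) for row in client.query(sql).result()]
```

A call such as `run_query(build_condition_count_query("my-project.my_omop_dataset", [111, 222]), "my-project")` would return one row per matching concept, assuming the caller has been granted query access to the dataset.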
The data platform also brings a secure infrastructure for Big Data analytics: the Stanford Nero platform. Researchers can request query access on Nero to the STARR-OMOP-deid BigQuery dataset, a High Risk dataset. Nero is a HIPAA and High Risk compliant Big Data analytical platform, built in collaboration between Research IT and the Stanford Research Computing Center, and designed to support analysis of datasets such as STARR-OMOP-deid. Nero provides a powerful, modern data science environment based on Jupyter Notebooks for collaborative research. The Research IT team provides training material so users can get familiar with the Nero computing environment and the OMOP CDM using synthetic and STARR-OMOP-deid 1% datasets. There is an active Slack channel, starrdatausers, for the STARR on Nero community. You also need a valid Data Privacy Attestation (DPA) to access de-identified data.
The clinical data platform is built on the guidelines of a Data Commons, which brings together the data resources (STARR datasets), compute infrastructure (Nero), and associated tools and software to access and analyze the data securely. The platform is being developed in collaboration with the Research IT faculty committee and is made possible by the Stanford School of Medicine Office of Research.
The STARR data lake also supports other data models and tools, such as the STRIDE database with its associated cohort and chart review tools, and the Informatics for Integrating Biology and the Bedside (i2b2) SHRINE tool on the CTSA Accrual to Clinical Trials (ACT) network. We are in the process of developing a PEDSNet dataset for our pediatric community. Research IT is also integrating Radiology PACS and LPCH bedside monitoring data into the STARR-Radiology and STARR-Waveform datasets for greater accessibility. In the meantime, for access to these new data types, please request a data concierge service.