Clinical Data Warehouse Reimagined
Powered by OMOP Common Data Model and Google Cloud Platform
Nov 20, 2020: Three years ago, we embarked on the journey of re-imagining our Clinical Data Warehouse and related ecosystem of services in view of growing data science needs. It has been a year since we launched the comprehensive set of services and today, we will take the opportunity to present our journey and honor our partners who have helped clear the path on this journey.
Journey in a nutshell
- Sep 2017: STARR effort launched
- Aug 2018: Nero research computing, first customers on-boarded in Aug 2018. Now over 120 labs and 700 researchers are on Nero (link)
- Sep 2019: STARR-OMOP, alpha launched in Sep 2019 and beta launched in Nov 2019. Now over 120 data scientists have access to OMOP. (link)
- Nov 2019: OMOP data science training, alpha launched in Nov 2019 and beta launched in Feb 2020. Now over 50 users have completed the training. (link)
- Mar 2020: OMOP manuscript submitted to arXiv in Mar 2020 (link)
- Oct 2020: First peer-reviewed publication using STARR-OMOP, a COVID-19 network study, published in Nature Communications in Oct 2020 (link)
Foundations of the next generation Clinical Data Warehouse:
Fast forward another dozen years to 2016, the year SoM envisions STARR (STAnford medicine Research data Repository). By now, it has become clear that hospital data is growing at a petascale rate (imaging, pathology, computer vision) and that we need a brand new set of technologies and capabilities to push the new frontiers of AI in Medicine.
We looked at data stewardship for lessons learned, specifically at complex efforts to manage and analyze data such as Genomic Data Commons and Sage Bionetworks’ Synapse platform. What emerged is the need for a comprehensive one-stop data science platform where researchers can:
- access data seamlessly from STARR,
- do computation, and
- access services such as data consultation, training and support.
Our first product offering was the second pillar of this data science platform: Nero, a HIPAA-compliant research computing platform that went live in the summer of 2018. Nero is built and managed by our partners at the Stanford Research Computing Center (SRCC). The SRCC team has in-depth expertise across the spectrum of HPC and data management services; it develops and manages the Nero platform and provides secure workspaces and research computing support to users. Nero has a private cloud infrastructure based on open source components, hosted at the Stanford Research Computing Facility (SRCF), with a direct 100 gigabit connection to available Research and Education Networks via CENIC’s California Research and Education Network. Nero also integrates with Google Cloud Platform (GCP). The JupyterHub interface provides a unified experience across on-premise and cloud environments. Google services such as BigQuery are accessed from on-premise via APIs.
Knowing that we were going to take on many data types in STARR, reaching petascale proportions, we were also keen to use a public cloud to help us experiment faster. In particular, the data warehouse and related technologies are crucial to STARR. What we had seen in genomics was unprecedented scalability and performance with BigQuery for queries spanning terabytes of genomics data. Downstream integration of BigQuery with a range of analytical tools is a big draw. Even better, it is a managed service that scales with our users, automagically. In addition, BigQuery allows for cost sharing with end users. Storage costs are billed to the billing account attached to the project where the data resides, i.e., the Research IT billing account pays for storage. Charges for BigQuery jobs (queries) are billed to the billing account attached to the project from which the user runs the jobs, i.e., the user pays for queries.
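The billing split described above follows from which project a query job is created in. Here is a minimal sketch, assuming the `google-cloud-bigquery` Python client. The project and dataset names (`research-it-data`, `starr_omop_deid`, and whatever is passed as `user_project`) are hypothetical placeholders, not the actual STARR identifiers.

```python
def qualified_table(data_project: str, dataset: str, table: str) -> str:
    """Return a fully qualified BigQuery table name, quoted for standard SQL."""
    return f"`{data_project}.{dataset}.{table}`"


def count_persons_sql(data_project: str, dataset: str) -> str:
    """Build a query against the OMOP `person` table held in the data-owning project."""
    table = qualified_table(data_project, dataset, "person")
    return f"SELECT COUNT(*) AS n_persons FROM {table}"


def run_in_user_project(user_project: str, sql: str):
    """Run `sql` as a job in `user_project`, so query charges go to that project.

    Storage for the referenced tables is still billed to the project that holds them.
    Requires the google-cloud-bigquery package and valid credentials.
    """
    from google.cloud import bigquery  # imported lazily so the helpers above work without it

    client = bigquery.Client(project=user_project)  # jobs are created (and billed) here
    return list(client.query(sql).result())


# Hypothetical names, for illustration only.
SQL = count_persons_sql("research-it-data", "starr_omop_deid")
print(SQL)
```

The table reference names the data-owning project explicitly, while the client is constructed with the researcher's own project. That separation is what lets one team pay for storage and another pay for queries.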
We continued borrowing from genomics for our workflow orchestration framework, adopting the Common Workflow Language (CWL) and the Cromwell workflow execution engine. Cromwell was selected for its broad range of platform support, its maturity on GCP, and its ease of deployment. For genomics data, we had secured GCP to meet dbGaP requirements. For STARR (and Nero), we had to secure GCP to meet HIPAA requirements. Guided by ISO’s clear requirements, we partnered with Biarca for infrastructure security support, and with Atredis for penetration testing.
On-premise vs Cloud differentiators
- Cohort queries run 10-100x faster on BigQuery than on the on-premise Oracle database (Manuscript, Supplementary Table S3)
- Burst computing during weekly production runs boots up 100s of compute cores, as opposed to dozens on-premise
- Ability to integrate a variety of diverse technology stacks quickly and efficiently, such as Kubernetes, DataProc, DataFlow, leading to a faster pace of experimentation and therefore higher staff productivity and better consumer-facing products.
- Ability to engage in new models of research collaborations where we can bring our collaborators (and deploy their diverse software stack) within the perimeters of our secure cloud infrastructure. This results in improved collaboration experience and enhanced patient privacy.
- Managed cloud services take away the tedium of IT, leaving the development team more time to deliver value added services.
Unlocking clinical text:
We collaborated closely with data scientist and Natural Language Processing (NLP) expert Jose Posada, a member of Shah lab, to augment text processing features in STARR-OMOP. In particular, we have incorporated a text mining algorithm in our production pipeline for STARR-OMOP that was originally developed and used extensively by Shah lab scientists. The algorithm finds medical concepts and can annotate whether the experience is happening to the patient (e.g. patient has diabetes), whether the experiencer is the patient or someone else (e.g. patient’s father has diabetes), and whether the experience is current at the time of the patient’s visit (e.g. patient has no reported symptoms of diabetic retinopathy at this time).
Our production pipeline processes the 100 million clinical notes in our EHR using burst computing to find 30 billion medical concepts. These 30 billion concepts and annotations are stored in the NOTE_NLP table in OMOP and made searchable via BigQuery (Manuscript, Supplementary Section 7 for a slightly older result).
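As an illustration of what searching these annotations might look like, here is a minimal sketch that builds a parameterized query against NOTE_NLP. The column names come from the OMOP CDM NOTE_NLP specification. The table path is a hypothetical placeholder, and filtering on `term_exists = 'Y'` is an assumption about how the annotation flags are encoded, not a documented STARR convention.

```python
def note_nlp_sql(table: str, limit: int = 10) -> str:
    """Build a query for affirmed mentions of one concept in an OMOP NOTE_NLP table."""
    return (
        "SELECT note_id, lexical_variant, term_temporal, term_modifiers\n"
        f"FROM `{table}`\n"
        "WHERE note_nlp_concept_id = @concept_id\n"
        "  AND term_exists = 'Y'\n"  # assumed encoding of the 'concept is present' flag
        f"LIMIT {int(limit)}"
    )


def find_mentions(client, table: str, concept_id: int, limit: int = 10):
    """Run the query with `concept_id` bound as a named parameter.

    `client` is expected to be a google.cloud.bigquery.Client with valid credentials.
    """
    from google.cloud import bigquery  # imported lazily; only needed when running

    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("concept_id", "INT64", concept_id)
        ]
    )
    return list(client.query(note_nlp_sql(table, limit), job_config=job_config).result())


# Hypothetical table path, for illustration only.
EXAMPLE_SQL = note_nlp_sql("my-project.my_omop_dataset.note_nlp", limit=5)
print(EXAMPLE_SQL)
```

Binding the concept ID as a query parameter, rather than formatting it into the SQL string, keeps the query text reusable across concepts.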
Jose, in collaboration with our engineering team, also developed an open-source clinical text de-identification pipeline, TiDE (Text DE-identification), that incorporates regular expression search, Named Entity Recognition and Hiding in Plain Sight (HIPS). Results show best-in-class precision and recall, but most importantly, the HIPS methodology brings state-of-the-art privacy protection (Manuscript, Supplementary Section 6).
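To make the idea behind hiding in plain sight concrete, here is a toy sketch. It is not TiDE's implementation. It assumes a single hand-written regular expression for US-style phone numbers and a seeded random generator for surrogates, and the helper names `surrogate_phone` and `hide_in_plain_sight` are invented for this example. A real pipeline would combine many patterns with named entity recognition.

```python
import random
import re

# Toy pattern for US-style phone numbers. A real pipeline would need many more
# patterns plus named entity recognition for names, addresses, and so on.
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def surrogate_phone(rng: random.Random) -> str:
    """Return a random but realistic-looking phone number."""
    return f"{rng.randint(200, 999)}-{rng.randint(200, 999)}-{rng.randint(0, 9999):04d}"


def hide_in_plain_sight(text: str, seed: int = 0) -> str:
    """Replace every detected phone number with a surrogate of the same shape."""
    rng = random.Random(seed)
    return PHONE.sub(lambda match: surrogate_phone(rng), text)


note = "Pt reachable at 650-555-0100; daughter at 650-555-0199."
print(hide_in_plain_sight(note))
```

Because the surrogates look like real values, a reader of the output cannot tell whether a given phone number is a surrogate or a real one that the detector missed. That ambiguity is where the added privacy protection comes from, compared with replacing detections with an obvious mask.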
TiDE enabled us to build the first pillar of our data science platform: a frequently updated, pre-IRB, non-human-subject dataset (STARR-OMOP-deid) accessible to any Stanford user with minimal paperwork (only a data use agreement needs to be signed) but under maximally secure conditions (the dataset is only accessible from a HIPAA-compliant Nero Google account). The STARR-OMOP-deid dataset brings all 100 million clinical notes and 30 billion standardized medical concepts within easy query access.
Key text processing metrics
- ~3 million patients with 100 million notes; 22 million notes have no PHI (Manuscript, Supplementary Figure S 6.4)
- 100 million notes contain ~33 billion words; nearly 4% of the words are PHI (Manuscript, Supplementary Figure S 6.4)
- 100 million notes de-identified in ~7 hours with 800 DataFlow workers at a total cost of $440. The total wall-clock processing time translates to about 0.00025s/note, roughly 3 orders of magnitude less than the recently reported fastest process (0.24s/note) by Heider et al.
- 100 million notes are processed for medical concepts using 400 Compute Engine nodes in 4 hours, resulting in 30 billion searchable medical concepts in BigQuery
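The de-identification throughput figures above can be checked with a few lines of arithmetic. This sketch uses only the numbers quoted in the list. The worker-seconds figure additionally assumes that all 800 workers were busy for the full 7 hours, which the source does not state, so treat it as an upper bound.

```python
NOTES = 100_000_000
WALL_CLOCK_HOURS = 7
WORKERS = 800
TOTAL_COST_USD = 440
REFERENCE_SECONDS_PER_NOTE = 0.24  # figure attributed to Heider et al. in the text

wall_seconds = WALL_CLOCK_HOURS * 3600                    # total elapsed seconds
seconds_per_note = wall_seconds / NOTES                   # elapsed time per note
speedup = REFERENCE_SECONDS_PER_NOTE / seconds_per_note   # ratio to the reference
cost_per_million_notes = TOTAL_COST_USD / (NOTES / 1_000_000)

# Assumption: every worker was active for the entire run (upper bound).
worker_seconds_per_note = wall_seconds * WORKERS / NOTES

print(f"{seconds_per_note:.6f} s/note elapsed")
print(f"{speedup:.0f}x faster than the reference, by elapsed time")
print(f"${cost_per_million_notes:.2f} per million notes")
print(f"{worker_seconds_per_note:.4f} worker-seconds per note (upper bound)")
```

The per-note figure is elapsed time divided by note count, so it reflects the parallelism of the run as well as the speed of the algorithm on a single worker.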
Changing cost-benefit ratio with cloud
Several of our engineering team members have been with us since the early days of STRIDE. Adoption of cloud has resulted in a significant change in what we do and the services we bring to our research community. It is essentially about the effort it takes to do something new and the effort it takes to sustain what we have built. The cost-benefit ratio of cloud is different from that of on-premise: it is not about reducing cost, it is about creating value.
Here are some of the key impacts of cloud on our capabilities:
Replace tedium with innovation
"Don't underestimate the value of zero administration (with no tuning, indexing, managing statistics, query plans, upgrades, security patching, index corruptions due to compression bugs, motherboard power failures due to PCI boards full of SSDs, ...). We have historically spent ridiculous amounts of time optimizing databases. We have other challenges now, but our BigQuery database is not one of them". -- Garrick Olson, Infrastructure and Platform team lead, Research IT
A new data sharing paradigm
"There is also a really important external value in BigQuery being a truly universal resource. I can share a dataset in my project with anyone, anywhere, and they can join my dataset with their own (or anyone else’s) datasets. This means you never have to tackle those expensive but mundane initiatives of copying data from your database to mine. This is why I like BigQuery as a data delivery mechanism, turning what used to be data integration projects into simple permission granting." -- Garrick Olson, Infrastructure and Platform team lead, Research IT
A new ETL paradigm
"What really impressed me about BigQuery is its ability to perform transformations on large tables. The design of BigQuery makes it very efficient to do a query and modify the results in some way, even in parallel. That enables a more modern ELT* (extract, load, transform many times) model rather than the traditional ETL (extract, transform, load). While this has obvious performance and cost benefits, it is also critical for enabling research, where we need to provide different data (different multiple Ts in the ELT*) for each research endeavor. Previously it was exceedingly challenging to do this, and now we are beginning to be able to do this relatively cheaply and easily, allowing us to support research projects that previously would have simply failed for lack of resources." -- Garrick Olson, Infrastructure and Platform team lead, Research IT
Optimizing ATLAS for Google Cloud
In partnership with the Stanford Center for Population Health Sciences (PHS), we have made the Optum dataset, Optum DOD (Date of Death) v8.0 database, available in ATLAS. Researchers who have access to both Optum (via PHS) and STARR can essentially run the same cohort analysis across the two datasets with the touch of a button.
Notable ATLAS metrics
- The ATLAS benchmarking suite using SynPUF runs 3 to 10x faster on BigQuery than on PostgreSQL (Manuscript, Supplementary Table S9.3)
- Achilles queries run in ATLAS using STARR-OMOP data deliver a near-real-time user experience. Out of 725 total queries available in Achilles, 660 queries took less than 17 seconds, and the median execution time was 3 seconds (Manuscript, Supplementary Table S9.1)
Bringing researchers to data:
The third pillar of our data science platform is a series of training and support initiatives, a tour de force effort from our R&D lead, Priya Desai. She joined our group as the product manager of STARR-OMOP. A data scientist by training, she focused from day one on the fundamental needs of a data scientist: data quality, data re-usability and data transparency. As her first commitment to the informatics community, she credited Nigam’s “BIOMEDIN 215: Data Science for Medicine” course. As part of the release, she generated product documentation. She also created Python notebooks in Stanford GitLab that show users how to work with the OMOP data model; the notebooks are written for SynPUF and run seamlessly on STARR-OMOP data. The Nero team offers user support via email, office hours, a Slack channel, and YouTube videos. Priya’s team shares the Nero office hours. Priya launched a companion Stanford Slack channel to support the STARR user community on Nero. Her team now offers additional ATLAS office hours in collaboration with Odysseus. She then launched a new data science training program, a series of day-long sessions that start with the basic building blocks of doing data science and build up to more complex skills using STARR clinical text. With COVID-19 shelter-in-place, the hands-on curriculum is being converted to a series of short videos on the Stanford STARR YouTube channel.
Data quality and user support metrics
- STARR-OMOP is refreshed weekly and the NOTE and NOTE_NLP tables are fully populated. The pre-IRB dataset is refreshed monthly. ATLAS data is refreshed weekly.
- Data Quality Dashboard runs on the OMOP datasets with every release and the data is reviewed by the data science team prior to release.
- Over 140 data scientists now have access to pre-IRB dataset on Nero
- 50 researchers have completed the data science training program in 2020
- Over 30 STARR and ATLAS office hours have been offered since Q1 2020.
- Research IT has recently launched the Stanford STARR YouTube channel for training material.