Clinical Data Warehouse Reimagined
Powered by OMOP Common Data Model and Google Cloud Platform
Nov 20, 2020: Three years ago, we embarked on the journey of re-imagining our Clinical Data Warehouse and related ecosystem of services in view of growing data science needs. It has been a year since we launched the comprehensive set of services and today, we will take the opportunity to present our journey and honor our partners who have helped clear the path on this journey.
Journey in a nutshell
- Sep 2017: STARR effort launched
- Aug 2018: Nero research computing launched; first customers on-boarded. Now over 120 labs and 700 researchers are on Nero (link)
- Sep 2019: STARR-OMOP alpha launched; beta followed in Nov 2019. Now over 120 data scientists have access to OMOP. (link)
- Nov 2019: OMOP data science training alpha launched; beta followed in Feb 2020. Now over 50 users have completed the training. (link)
- Mar 2020: OMOP manuscript submitted to arXiv (link)
- Oct 2020: First peer-reviewed publication using STARR-OMOP, a COVID-19 network study, published in Nature Communications (link)
In the beginning:
Stanford School of Medicine (SoM) efforts to explore the use of Artificial Intelligence (AI) in Medicine (AIM) began in the 1970s with the Stanford University Medical EXperimental computer for Artificial Intelligence in Medicine (SUMEX-AIM) project, a national computer resource (1973-1992) funded by NIH to promote applications of AI to biological and medical problems, reached via the early communications networks of the 1970s such as ARPANET. The report, “Seeds of Artificial Intelligence”, now a historical artifact, was named the top federal technical publication of 1980 by the National Association of Government Communicators. The SUMEX-AIM resource resided administratively within the SoM, provided computing facilities specifically tuned to the needs of AI research, and developed many tools for encouraging and facilitating community relationships among collaborating projects and medical researchers.
Fast forward to 2003: ahead of the HITECH Act of 2009, SoM decided to invest in a research Clinical Data Warehouse. This effort resulted in the HIPAA-compliant STRIDE platform – an Integrated Standards-Based Translational Research Informatics Platform. STRIDE, based on an on-premise Oracle technology stack, was built and is managed by Research IT in partnership with the Technology & Digital Solutions Platform Services team. The Platform Services team has in-depth expertise in HIPAA regulatory requirements, Oracle database administration, data center security, networking, and storage. It supports the Clarity and STRIDE on-premise infrastructure and development workspaces for our team, and assures 24x7 availability and data security of the STRIDE web tool. STRIDE is still going strong at Stanford, serving 600-800 researchers annually. It integrates data from the two hospitals, adult and children’s, and a multiplicity of feeds (Clarity, HL7) to provide an intuitive user interface for researchers seeking self-service chart review.
“The full benefit of the use of computers as tools of thought can come only when we learn to dissect intelligence into a portion best suited to the human being, and a portion best suited to the computer, and then find a way to mesh the two processes. The science of Artificial Intelligence is concerned with that very important task.” - Dr. Ralph Engle, Columbia University, a pioneer in computer diagnosis and developer of the HEME/HEME-2 systems for the diagnosis of blood disorders, first Rutgers AIM Workshop in New Brunswick, New Jersey, in 1975
Foundations of the next generation Clinical Data Warehouse:
Fast forward another dozen years to 2016, the year SoM envisioned STARR (STAnford medicine Research data Repository). By then, it had become clear that hospital data was growing at a petascale rate (imaging, pathology, computer vision) and that we needed a brand-new set of technologies and capabilities to push the new frontiers of AI in Medicine.
We looked to data stewardship efforts for lessons learned, specifically complex efforts to manage and analyze data such as the Genomic Data Commons and Sage Bionetworks’ Synapse platform. What emerged is the need for a comprehensive one-stop data science platform where researchers can:
- access data seamlessly from STARR,
- do computation, and
- access services such as data consultation, training and support.
Our first product offering was the second pillar of this data science platform: Nero, a HIPAA-compliant research computing platform that went live in the summer of 2018. Nero is built and managed by our partners at the Stanford Research Computing Center (SRCC). The SRCC team has in-depth expertise across the spectrum of HPC and data management services; it develops and manages the Nero platform and provides secure workspaces and research computing support to users. Nero has a private cloud infrastructure based on open-source components, hosted at the Stanford Research Computing Facility (SRCF) with a direct 100-gigabit connection to Research and Education Networks via CENIC’s California Research and Education Network. Nero also integrates with Google Cloud Platform (GCP): the JupyterHub interface provides a unified experience across on-premise and cloud environments, and Google services such as BigQuery are accessed from on-premise via APIs.
Knowing that we were going to take on many data types in STARR, reaching petascale proportions, we were also keen to use a public cloud to gain speed of experimentation. In particular, the data warehouse and related technologies are crucial to STARR. In genomics we had seen unprecedented scalability and performance with BigQuery for queries spanning terabytes of data. Downstream integration of BigQuery with a range of analytical tools is a big draw. Even better is a managed service that scales with our users, automagically. In addition, BigQuery allows for cost sharing with end users: storage costs are billed to the billing account attached to the project where the data resides (i.e., Research IT pays for storage), while charges for BigQuery jobs are billed to the billing account attached to the project from which the user runs the query (i.e., the user pays for the query).
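This user-pays query model can be sketched with a back-of-the-envelope cost estimator. The $5-per-TiB on-demand rate below is an illustrative assumption, not a quote of current GCP pricing:

```python
def estimate_query_cost_usd(bytes_scanned: int, price_per_tib: float = 5.0) -> float:
    """Estimate the on-demand cost of a BigQuery query.

    Assumes ~$5 per TiB scanned (illustrative; check current GCP pricing).
    The job charge lands on the billing account of the project that RUNS
    the query; storage is billed to the project that HOSTS the dataset.
    """
    tib = bytes_scanned / 2**40
    return tib * price_per_tib

# A dry run (job_config.dry_run=True in the BigQuery client) reports
# total_bytes_processed without cost; e.g., a 2 TiB scan:
cost = estimate_query_cost_usd(2 * 2**40)
print(f"${cost:.2f}")  # → $10.00
```

A dry run before a large scan lets a user see the bytes (and hence the charge) their project will absorb before committing.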
We continued borrowing from genomics: for our workflow orchestration framework, we adopted the Common Workflow Language (CWL) and the Cromwell workflow execution engine. Cromwell was selected for its broad range of platform support, its maturity on GCP, and its ease of deployment. For genomics data, we had already secured GCP to meet dbGaP requirements; for STARR (and Nero), we had to secure GCP to meet HIPAA requirements. Guided by ISO’s clear requirements, we partnered with Biarca for infrastructure security support and with Atredis for penetration testing.
On-premise vs Cloud differentiators
- Cohort queries run 10-100x faster on BigQuery when compared to on-premise Oracle database (Manuscript, Supplementary Table S3)
- Burst computing during weekly production runs boots up hundreds of compute cores, as opposed to dozens on-premise
- Ability to integrate a variety of diverse technology stacks quickly and efficiently, such as Kubernetes, Dataproc, and Dataflow, leading to a faster pace of experimentation and therefore higher staff productivity and better consumer-facing products.
- Ability to engage in new models of research collaborations where we can bring our collaborators (and deploy their diverse software stack) within the perimeters of our secure cloud infrastructure. This results in improved collaboration experience and enhanced patient privacy.
- Managed cloud services take away the tedium of IT, leaving the development team more time to deliver value added services.
OHDSI partnership and OMOP Common Data Model
This year, the Research IT team presented at the OHDSI 2020 symposium and participated in the study-a-thon.
Last December we launched our next-generation analytical clinical data warehouse, STARR-OMOP, using the Observational Health Data Sciences and Informatics (OHDSI) Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) v5.3.1. We chose the OMOP CDM for its demonstrated applicability to many different use cases, including claims, EHR, longitudinal registries, and hospital transactional databases. The OMOP CDM has also demonstrated strong results in comparative effectiveness research with minimal information loss during data transformation; it speeds up implementation of clinical phenotypes across networks and promotes research reproducibility.
The ETL effort, from Epic Clarity to OMOP, started nearly 18 months prior to launch. Our internal support came from Prof. Nigam Shah’s lab, an early adopter of OMOP. Lab member Juan Banda, now a faculty member in computer science at Georgia State University, had developed ETLs from STRIDE to OMOP, and the lab had several peer-reviewed publications demonstrating the success of OMOP using Stanford EHR data. Juan has since developed the Automated PHenotype Routine for Observational Definition, Identification, Training and Evaluation (APHRODITE), an R-package phenotyping framework that combines noisy labeling and anchor learning; a significant amount of its prototyping, development, and testing happened on Nero and GCP using STARR-OMOP. Juan guided our baby steps when we started our OMOP journey. We subsequently leaned heavily on our data partners, Odysseus, for best practices with the OMOP CDM and for enabling and deploying the OHDSI ATLAS tool on BigQuery. Shah lab members also volunteered as STARR-OMOP alpha users and helped us mature from alpha to beta.
"Participation in the OHDSI consortium has been a delight. We could not have felt more supported. We have leveraged community feedback via forums, leaned on subject matter expertise in the broader community, training material and much more. The public-private nature of partnership keeps the CDM and tools real and meaningful for broad spectrum analytics and that works well for an academic medical center like ours." - Priya Desai, R&D Manager, Research IT
Unlocking clinical text:
We collaborated closely with data scientist and Natural Language Processing (NLP) expert Jose Posada, a member of the Shah lab, to augment text processing features in STARR-OMOP. In particular, we have incorporated into our production pipeline for STARR-OMOP a text mining algorithm originally developed and used extensively by Shah lab scientists. The algorithm finds medical concepts and annotates whether the concept is affirmed or negated (e.g., the patient has no reported symptoms of diabetic retinopathy at this time), whether the experiencer is the patient or someone else (e.g., the patient’s father has diabetes), and whether the mention is current at the time of the patient’s visit or historical.
Our production pipeline processes the 100 million clinical notes in our EHR using burst computing to find 30 billion medical concepts. These 30 billion concepts and annotations are stored in the NOTE_NLP table in OMOP and made searchable via BigQuery (Manuscript, Supplementary Section 7 for a slightly older result).
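Searching those annotations amounts to a filter on NOTE_NLP. A minimal sketch follows; the project and dataset names are hypothetical, while the column names (note_nlp_concept_id, term_exists) follow the OMOP CDM v5.3 NOTE_NLP specification:

```python
# Hypothetical project/dataset identifiers for illustration only.
PROJECT = "som-rit-starr"
DATASET = "starr_omop_deid"

def concept_mention_sql(concept_id: int) -> str:
    """Build a BigQuery query counting notes with an affirmed mention
    of a concept in NOTE_NLP. term_exists = 'Y' keeps only mentions
    that are not negated/hypothetical per the NLP annotations."""
    return f"""
    SELECT COUNT(DISTINCT note_id) AS n_notes
    FROM `{PROJECT}.{DATASET}.note_nlp`
    WHERE note_nlp_concept_id = {int(concept_id)}
      AND term_exists = 'Y'
    """

# 201826 is the standard OMOP concept for type 2 diabetes mellitus.
sql = concept_mention_sql(201826)
```

The string would then be passed to the BigQuery client (e.g., `client.query(sql)`), with the job billed to the researcher's own project as described above.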
Jose, in collaboration with our engineering team, also developed an open-source clinical text de-identification pipeline, TiDE (Text DE-identification), that incorporates regular expression search, Named Entity Recognition, and Hiding in Plain Sight (HIPS). Results show best-in-class precision and recall, but most importantly, the HIPS methodology brings state-of-the-art privacy protection (Manuscript, Supplementary Section 6).
TiDE enabled us to build the first pillar of our data science platform: a frequently updated, pre-IRB, non-human-subjects dataset (STARR-OMOP-deid) accessible to any Stanford user with minimal paperwork (only a data use agreement needs to be signed) but under maximally secure conditions (the dataset is only accessible from a HIPAA-compliant Nero Google account). The STARR-OMOP-deid dataset brings all 100 million clinical notes and 30 billion standardized medical concepts within easy query access.
Key text processing metrics
- ~3 million patients with 100 million notes; 22 million notes have no PHI (Manuscript, Supplementary Figure S6.4)
- 100 million notes contain ~33 billion words, of which nearly 4% are PHI (Manuscript, Supplementary Figure S6.4)
- 100 million notes de-identified in ~7 hours with 800 Dataflow workers at a total cost of $440 USD. The total processing time translates to 0.00025 s/note, three orders of magnitude less than the fastest recently reported process (0.24 s/note) by Heider et al.
- 100 million notes processed for medical concepts using 400 Compute Engine nodes in 4 hours, resulting in 30 billion searchable medical concepts in BigQuery
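The de-identification throughput figures can be sanity-checked with simple arithmetic: the 0.00025 s/note number is wall-clock time per note across the whole fleet, while each of the 800 workers still spends a more intuitive ~0.2 s per note:

```python
notes = 100_000_000
wall_seconds = 7 * 3600   # ~7 hours of wall-clock time
workers = 800             # Dataflow workers in the burst

per_note = wall_seconds / notes        # fleet-wide wall clock per note
per_worker_note = per_note * workers   # effective time on one worker

print(f"{per_note:.5f} s/note")        # → 0.00025 s/note
print(f"{per_worker_note:.2f} s/note per worker")  # ~0.20 s/note
```

So the three-orders-of-magnitude speedup over a 0.24 s/note single-stream process comes almost entirely from horizontal scale, not from a faster per-note algorithm.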
Changing cost-benefit ratio with cloud
Several of our engineering team members have been with us since the early days of STRIDE. Adoption of cloud has resulted in a significant change in what we do and the services we bring to our research community. It comes down to the effort it takes to do something new and the effort it takes to sustain what we have built. The cost-benefit ratio of cloud is different from on-premise: it is not about reducing cost, it is about creating value.
Here are some of the key impacts of cloud on our capabilities:
Replace tedium with innovation
"Don't underestimate the value of zero administration (with no tuning, indexing, managing statistics, query plans, upgrades, security patching, index corruptions due to compression bugs, motherboard power failures due to PCI boards full of SSDs, ...). We have historically spent ridiculous amounts of time optimizing databases. We have other challenges now, but our BigQuery database is not one of them". -- Garrick Olson, Infrastructure and Platform team lead, Research IT
A new data sharing paradigm
"There is also a really important external value in BigQuery being a truly universal resource. I can share a dataset in my project with anyone, anywhere, and they can join my dataset with their own (or anyone else’s) datasets. This means you never have to tackle those expensive but mundane initiatives of copying data from your database to mine. This is why I like BigQuery as a data delivery mechanism, turning what used to be data integration projects into simple permission granting." -- Garrick Olson, Infrastructure and Platform team lead, Research IT
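Garrick's point about turning data integration into permission granting can be sketched as a single query: BigQuery resolves fully qualified table names across any projects the caller can read, so a join across organizations needs no copy. All project, dataset, and table names below are hypothetical:

```python
# Once the owner of `stanford-starr.omop` grants my account read access,
# I can join their table against my own dataset directly -- no export,
# no transfer pipeline, just a permission grant.
sql = """
SELECT p.person_id, c.total_paid
FROM `stanford-starr.omop.person` AS p          -- their shared dataset
JOIN `my-lab-project.claims.claims_2020` AS c   -- my own dataset
  ON p.person_id = c.person_id
"""
```

The query runs in my project (so I pay for the scan), while the shared data never leaves the owner's project.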
A new ETL paradigm
"What really impressed me about BigQuery is its ability to perform transformations on large tables. The design of BigQuery makes it very efficient to do a query and modify the results in some way, even in parallel. That enables a more modern ELT* (extract, load, transform many times) model rather than the traditional ETL (extract, transform, load). While this has obvious performance and cost benefits, it is also critical for enabling research, where we need to provide different data (different multiple Ts in the ELT*) for each research endeavor. Previously it was exceedingly challenging to do this, and now we are beginning to be able to do this relatively cheaply and easily, allowing us to support research projects that previously would have simply failed for lack of resources." -- Garrick Olson, Infrastructure and Platform team lead, Research IT
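The ELT pattern described above can be sketched in a few lines. Here sqlite stands in for BigQuery purely for illustration; in BigQuery each "T" would be a `CREATE TABLE ... AS SELECT` over the same raw load:

```python
import sqlite3

# ELT sketch: load raw data once, then run transformations as queries
# inside the warehouse. Each research project can materialize its own
# "T" from the shared raw load without re-extracting from the source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_measure (person_id INT, value TEXT)")
conn.executemany(
    "INSERT INTO raw_measure VALUES (?, ?)",
    [(1, "7.2"), (2, "n/a"), (3, "5.9")],
)

# One project's transform: keep only parseable numeric values.
conn.execute("""
    CREATE TABLE clean_measure AS
    SELECT person_id, CAST(value AS REAL) AS value
    FROM raw_measure
    WHERE value GLOB '[0-9]*'
""")

rows = conn.execute("SELECT COUNT(*) FROM clean_measure").fetchone()[0]
print(rows)  # → 2
```

A second project with different inclusion rules would simply issue its own transform against `raw_measure`, which is the "transform many times" half of ELT.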
Optimizing ATLAS for Google Cloud and Data Science
OHDSI toolkit performance on BigQuery has been the single most challenging aspect of our journey. While direct SQL querying of BigQuery is highly performant, the toolkits do not use BigQuery directly. Instead, the tools use shared libraries such as DatabaseConnector and SqlRender that translate queries to the BigQuery SQL dialect.
Odysseus, developers of ATLAS, dedicated significant effort to fixing the performance issues to support Stanford in a COVID-19 study. Aside from bug fixes to the shared libraries, Odysseus and Google collaborated to raise BigQuery limits, including the number of concurrent inserts and table metadata updates. Gregory Klebanov, Founder-CTO of Odysseus, summarizes these changes on the OHDSI forum.
With help from Odysseus we implemented a new testing method for ATLAS that incorporates data science use cases beyond basic functionality; the cases focus on evaluating whether the results make scientific sense. These cases are part of our standard operating procedure (SOP) when testing and deploying a new version of ATLAS released by OHDSI, and they have enabled us to move BigQuery closer to being a fully supported backend for the entire OHDSI tool ecosystem, not only ATLAS. We have also embraced the ATLAS execution engine, which allows us to execute estimation and prediction studies fully inside ATLAS. As part of developing and adapting the execution engine for BigQuery, we have released public Docker images so data scientists can have a fully operational working environment in a matter of minutes. Finally, we have embraced the newly released ROhdsiWebApi package, which has enabled us to bulk import more than 1,000 pre-defined cohorts, including those from the OHDSI phenotype library. Some have been actively used in network studies that have produced peer-reviewed publications and numerous pre-prints, e.g., the LEGEND study, whose results are now published in The Lancet.
"Fast and furious participation in network studies ... being able to do team science with the world has been a privilege. And, I am drinking less coffee because I am getting fewer coffee breaks ... my analyses are running in near real time" - Jose Posada, PhD, Sr. Clinical Data Scientist, Biomedical Informatics Center
In partnership with the Stanford Center for Population Health Sciences (PHS), we have made the Optum DOD (Date of Death) v8.0 database available in ATLAS. Researchers who have access to both Optum (via PHS) and STARR can essentially run the same cohort analysis across the two datasets with the touch of a button.
Notable ATLAS metrics
- The ATLAS benchmarking suite using SynPUF runs 3-10x faster on BigQuery than on PostgreSQL (Manuscript, Supplementary Table S9.3)
- Achilles queries run in ATLAS using STARR-OMOP data offer a near real-time user experience: of the 725 queries available in Achilles, 660 took less than 17 seconds, with a median execution time of 3 seconds (Manuscript, Supplementary Table S9.1)
Bringing researchers to data:
The third pillar of our data science platform is a series of training and support initiatives, a tour de force effort from our R&D lead, Priya Desai. She joined our group as the product manager of STARR-OMOP. A data scientist by training, she focused from day one on the fundamental needs of a data scientist: data quality, data re-usability, and data transparency. As her first commitment to the informatics community, she credited Nigam’s “BIOMEDIN 215: Data Science for Medicine” course. As part of the release, she generated product documentation and created Python notebooks in Stanford GitLab that show users how to work with the OMOP data model; the notebooks are written for SynPUF and run seamlessly on STARR-OMOP data. The Nero team offers user support via email, office hours, a Slack channel, and YouTube videos, and Priya’s team shares Nero office hours. Priya launched a companion Stanford Slack channel to support the STARR user community on Nero, and her team now offers additional ATLAS office hours in collaboration with Odysseus. She then proceeded to launch a new data science training program, a series of day-long sessions that start with the basic building blocks of doing data science and build up to more complex skills using STARR clinical text. With COVID-19 shelter-in-place, the hands-on curriculum is being converted to a series of short videos on the Stanford STARR YouTube channel.
Data quality and user support metrics
- STARR-OMOP is refreshed weekly and the NOTE and NOTE_NLP tables are fully populated. The pre-IRB dataset is refreshed monthly. ATLAS data is refreshed weekly.
- Data Quality Dashboard runs on the OMOP datasets with every release and the data is reviewed by the data science team prior to release.
- Over 140 data scientists now have access to pre-IRB dataset on Nero
- 50 researchers have graduated the data science training program in 2020
- Over 30 STARR and ATLAS office hours have been offered since Q1 2020.
- Research IT has recently launched the Stanford STARR YouTube channel for training material.
COVID-19 network studies
Pre-print and peer review articles in 2020 using STARR-OMOP
- Deep phenotyping of 34,128 adult patients hospitalised with COVID-19 in an international network study, Burn E., ..., Posada J. D., ..., Ryan P., Nature Communications volume 11, 2020 (link)
- Heterogeneity and temporal variation in the management of COVID-19: a multinational drug utilization study including 71,921 hospitalized patients from China, South Korea, Spain, and the United States of America, Prats-Uribe, A., ..., Posada J.D., ..., Prieto-Alhambra, D., https://doi.org/10.1101/2020.09.15.20195545 (link)
- Characteristics and outcomes of 627 044 COVID-19 patients with and without obesity in the United States, Spain, and the United Kingdom, Recalde M., ..., Posada J.D., ..., Duarte-Salles T., https://doi.org/10.1101/2020.09.02.20185173 (link)
- Characteristics, outcomes, and mortality amongst 133,589 patients with prevalent autoimmune diseases diagnosed with, and 48,418 hospitalised for COVID-19: a multinational distributed network cohort analysis, Tan E.H., ..., Posada J.D., ..., Prieto-Alhambra, D., doi: 10.1101/2020.11.24.20236802 (link)
- Use of dialysis, tracheostomy, and extracorporeal membrane oxygenation among 240,392 patients hospitalized with COVID-19 in the United States, Burn, E., ..., Posada J.D., ..., Duarte-Salles T., DOI: 10.1101/2020.11.25.20229088 (link)
- Baseline characteristics, management, and outcomes of 55,270 children and adolescents diagnosed with COVID-19 and 1,952,693 with influenza in France, Germany, Spain, South Korea and the United States: an international network cohort study, Duarte-Salles T., ..., Posada J.D., ..., Prieto-Alhambra, D., DOI: 10.1101/2020.10.29.20222083 (link)
- Baseline phenotype and 30-day outcomes of people tested for COVID-19: an international network cohort including >3.32 million people tested with real-time PCR and >219,000 tested positive for SARS-CoV-2 in South Korea, Spain and the United States, Golozar, A., ..., Posada J.D., ..., Prieto-Alhambra, D., doi: 10.1101/2020.10.25.20218875 (link)
- Clinical characteristics, symptoms, management and health outcomes in 8,598 pregnant women diagnosed with COVID-19 compared to 27,510 with seasonal influenza in France, Spain and the US: a network cohort analysis, Lai, L. Y. H., ..., Posada J.D., ..., Prieto-Alhambra, D., doi: https://doi.org/10.1101/2020.10.13.20211821 (link)
- An international characterisation of patients hospitalised with COVID-19 and a comparison with those previously hospitalised with influenza, Burn, E., ..., Posada J.D., ..., Ryan, P., doi: 10.1101/2020.04.22.20074336 (link)
What lies ahead
In the last year, the Research IT team has become part of Technology and Digital Solutions, a single IT organization for Stanford Healthcare and the Stanford School of Medicine. This merger allows Research IT to build a better STARR platform in partnership with our clinical teams. We have integrated STARR-OMOP with radiology and bedside monitoring data (from our Children's hospital) and continue to work on other data types such as pathology.
We have started to devote our efforts to unlocking the data in flowsheets and mapping it to standard concepts in the measurement table. We are also working on oncology data mapping. Finally, we are looking into data augmentation for health equity research, e.g., socioeconomic status, non-binary gender, and gender identity.
“The real testament of any infrastructure is when it brings experimentation velocity and lets the team focus on innovation. Google Cloud Platform has been foundational to Research IT being able to re-imagine our nextgen CDW and help our data science community push the frontiers of data driven research.” - Somalee Datta, PhD, Director of Research IT