OHDSI 2021 Symposium
Exploration, learning and participation
Research IT presented at the OHDSI 2021 Symposium and participated in a study-a-thon.
Oct 30, 2021: Jose Posada, PhD, and Priya Desai presented two posters showcasing Research IT work.
Linking Analysis-Ready Multi-modal Clinical Data
Priya Desai, Somalee Datta
Abstract: The STAnford medicine Research data Repository (STARR) is a research ecosystem comprising a collection of linked, research-ready data warehouses built from disparate clinical ancillary systems, together with a secure data science facility. The ecosystem is designed on Data Commons principles and contains reusable data processing pipelines, cohort and analysis tools, training, user support, and much more. STARR data currently include electronic medical records data, clinical images (radiology, cardiology) and text, bedside monitoring data, and near-real-time HL7 messages. Processed, “analysis ready” linked data are available to all Stanford researchers in a “self-service” mode and currently consist of:
- De-identified Electronic Health Records (EHR) from the two Stanford hospitals and clinics in the OMOP Common Data Model (CDM).
- De-identified bedside monitoring (waveform) data from Stanford Children’s Hospital.
Linked patient data in the ecosystem are primarily anchored on person_id, the auto-generated patient identifier in the OHDSI community’s CDM. The person_id stays stable across data refreshes. Other data, such as imaging metadata from radiology (including MRIs, X-rays, ultrasounds, and CT scans) and cardiology, are coming soon. These analysis-ready datasets reside in BigQuery, a cloud-based data warehouse that leverages the infrastructure of the Google Cloud Platform and offers rapid SQL queries and interactive analysis of massive datasets.
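The person_id-based linkage described above can be sketched with a minimal example. The person and condition_occurrence tables and the concept IDs follow OMOP CDM conventions, but the imaging_metadata table is a hypothetical placeholder for the forthcoming imaging data, and an in-memory SQLite database merely stands in for BigQuery:

```python
import sqlite3

# In-memory database standing in for the BigQuery warehouse (illustration only).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Minimal OMOP-style tables plus a hypothetical imaging-metadata table,
# all keyed on the same person_id, mirroring how STARR links modalities.
cur.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, year_of_birth INTEGER);
CREATE TABLE condition_occurrence (person_id INTEGER, condition_concept_id INTEGER);
CREATE TABLE imaging_metadata (person_id INTEGER, modality TEXT);
INSERT INTO person VALUES (1, 1980), (2, 1975);
INSERT INTO condition_occurrence VALUES (1, 201826), (2, 316866);
INSERT INTO imaging_metadata VALUES (1, 'MR'), (1, 'CT');
""")

# Join EHR and imaging records through the shared, stable person_id.
rows = cur.execute("""
SELECT p.person_id, c.condition_concept_id, i.modality
FROM person p
JOIN condition_occurrence c ON c.person_id = p.person_id
JOIN imaging_metadata i ON i.person_id = p.person_id
ORDER BY p.person_id, i.modality
""").fetchall()

for row in rows:
    print(row)
```

Because person_id stays stable when the data are refreshed, joins like this remain valid from one data release to the next.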
ATLAS with a BigQuery backend running Execution Engine – a Software demo
Jose Posada, Priya Desai, Konstantin Yaroshovets, Gregory Klebanov
Abstract: Stanford has adopted an ecosystem view of modern clinical research tools. Built on the foundation of STRIDE, the ecosystem has since expanded into the STAnford medicine Research data Repository (STARR) ecosystem. The ecosystem’s overall design is based on Data Commons principles and includes compute and storage infrastructure, a data lake, data warehouses, data processing pipelines, APIs, tools, user training, and support. Our overarching goal is to streamline science for researchers.
The backbone of the STARR ecosystem is STARR-OMOP, an analytical clinical data warehouse that uses the OMOP Common Data Model. One of the reasons Stanford chose OMOP was OHDSI in its entirety, not just the data model: we wanted the tools, the network, and the community. Another critical part of our ecosystem is our data center. The compute and storage infrastructure has grown from an on-premises data center to embrace the cloud, not just for its larger storage and compute capacity but also for specialized solutions. One such specialized solution is Google BigQuery, a managed, distributed data warehousing solution. Stanford had previously implemented Google Cloud BigQuery for a big data genetics initiative, so it was natural to try BigQuery for STARR-OMOP. BigQuery brings two very significant features. First, it is a managed service and, unlike traditional databases, does not require DBA tinkering for performance; it is performant out of the box. The data engineering team can focus on data standardization, completeness, and quality instead of indexing, sharding, and scaling. Second, it offers data-science-friendly APIs: researchers can work from their laptops or HPC environments in Jupyter Notebooks and never leave the tools they already use for data science.
In a previously published manuscript, we show that the ATLAS benchmarking suite using SynPUF runs 3 to 10x faster on BigQuery than on PostgreSQL (Manuscript, Supplementary Table S9.3). We also show that Achilles queries run in ATLAS against STARR-OMOP data provide a near-real-time user experience: of the 725 queries available in Achilles, 660 took less than 17 seconds, and the median execution time was 3 seconds (Manuscript, Supplementary Table S9.1). While direct or API-based SQL querying of BigQuery is highly performant, the OHDSI toolkits do not issue BigQuery SQL directly. Instead, the tools use shared libraries such as DatabaseConnector and SqlRender, which translate queries into the BigQuery SQL dialect. Optimizing the OHDSI toolkits to run on BigQuery is a journey we embarked on nearly two years ago, and it has since led to the successful deployment and utilization of ATLAS at Stanford. We have also embraced the execution of ATLAS PLE and PLP analyses through the ARACHNE Execution Engine, which allows us to fully execute estimation and prediction studies right inside ATLAS. This presentation will demonstrate Stanford ATLAS running on top of STARR-OMOP, including the ARACHNE Execution Engine.
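The dialect translation mentioned above is performed by SqlRender, an R package; the toy function below is not SqlRender but only illustrates the idea, assuming a single rewrite rule: SQL Server-style DATEADD calls (common in OHDSI-generated SQL) become BigQuery's DATE_ADD syntax.

```python
import re

def to_bigquery_dialect(sql: str) -> str:
    """Toy sketch of SQL dialect translation (not SqlRender itself):
    rewrite DATEADD(part, n, expr) into DATE_ADD(expr, INTERVAL n PART)."""
    pattern = re.compile(r"DATEADD\(\s*(\w+)\s*,\s*(-?\d+)\s*,\s*([^)]+)\)",
                         re.IGNORECASE)
    return pattern.sub(
        lambda m: f"DATE_ADD({m.group(3).strip()}, "
                  f"INTERVAL {m.group(2)} {m.group(1).upper()})",
        sql,
    )

ohdsi_sql = "SELECT DATEADD(day, 30, index_date) FROM cohort"
print(to_bigquery_dialect(ohdsi_sql))
# Prints: SELECT DATE_ADD(index_date, INTERVAL 30 DAY) FROM cohort
```

The real libraries handle far more than one function (temp tables, data types, string functions, and so on), which is why optimizing the toolkits for BigQuery was a multi-year effort rather than a simple find-and-replace.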