Research IT 2021 Year in Review

Celebrating our team's achievements

Collaborating and building remotely

Jan 15, 2022: With shelter-in-place continuing, the Research IT team has settled into the new remote work culture. We always had remote members, and more of us chose to go remote to be closer to family. Our team members grew their families: partners, babies, and pets. We continue to do cool things together. Here is a summary of some of the things we celebrated this year!



First summer internship offered by our team

Ms. Sonam Welekar joined us for a three-month internship led by our NLP expert Dr. Jose Posada, now CS faculty at Universidad Del Norte in Barranquilla. We were delighted to be able to offer an internship, Research IT's first such offering, during the pandemic.

Sonam took on the task of ML-based reclassification of MyHealth messages. The current workflow classification of these messages (e.g., patient advice or appointment request) is imperfect. For example, a patient who asks for medical advice in a message may also be asking to schedule an appointment in the same message. Our physician community has expressed keen interest in AI-based MyHealth message classification to reduce these errors but, to the best of our knowledge, no one had looked at its feasibility before.

The dataset used for this research consisted of 45 million unique messages, labeled by the hospital and de-identified with our TiDE pipeline. Her research aimed to find the error rate in pre-labeled messages by performing an error analysis of the message classification model. The state-of-the-art Sentence BERT transformer from HuggingFace was used for text processing and feature extraction. To our knowledge, this is by far the largest such study in terms of scale.

You go, Sonam! And our deep gratitude to Jose.

STARR-wave, bedside monitoring service launched

We completed the build-out of our pediatric bedside monitoring data lake and data warehouse capabilities. To help disseminate information about this service, we now have a new website. We also published a manuscript on a preprint server to share our approach with the broader community.

Specifically, the STARR data lake now has high-density data from the Philips Patient Information Center iX (PIC iX) patient monitoring and hospital surveillance system. PIC iX devices capture patient vitals such as Heart Rate, Blood Pressure, Pulse Oximetry (SpO2), and Electrocardiogram (ECG). We get data from ~500 LPCH beds in units such as the Post-Anesthesia Care Unit (PACU), Neonatal Intensive Care Unit (NICU), Operating Room (OR), and Emergency Department (ED).

LPCH clinical systems store about 38 weeks of data. The data is archived to STARR, where it is available for downstream research. STARR now has all data, both metadata and waveforms, since the launch of PIC iX in 2017.

A big shout out to Sanjay Malunjkar, our principal engineer on the project and Joe McCullagh, our product manager. Joe has since left Stanford for a new adventure.

STARR Tools moves to Cloud

The STARR "Cohort Discovery and Chart Review" Tools have been ported to Google Cloud Platform as of Aug 28. Clinical data comes from the newly re-architected SHC Clarity, augmented by historical data from Cerner and Carecast/Lastword. The cohort and chart discovery tools have offered our community an intuitive GUI and low barrier to entry. The look and feel of STARR-Tools have been modernized without any substantial changes to the functionality. In a typical year, STARR-Tools are used by 800+ researchers who work on 500+ distinct IRBs. These IRBs represent 50% of total approved IRBs at SoM at any given time.  

STARR-Tools (fka STRIDE) had been running on on-premise Oracle infrastructure since its launch in 2008. The new solution uses two databases: BigQuery for the EHR data and CloudSQL for storing user interactions. The ability to use BigQuery for EHR data allows the Research IT team to run fast nightly ETLs from source Clarity in BigQuery.

A big shout out to Joe Pallas, our principal engineer behind the migration. He started working on the ETLs back in Feb 2020 and started working on the apps in Sep 2020. 

Digital Imaging integration with Data Lake completed

Did you say DICOM? Yes, we got it.

The STARR data lake achieved complete integration with Stanford Hospital's Vendor Neutral Archive (VNA). We now have the last 10 years of historical radiology data, and new data pours into the STARR data lake daily. The overall radiology DICOM assets exceed 2 petabytes and grow by 300-400 TB annually. The VNA also aggregates data from the adult ENT mini-PACS, and retinal fundus images are available as well.

We have also achieved Query/Retrieve access to the pediatric and adult cardiology PACS. This Q/R access continues to be just in time.

My gratitude to Joe Mesterhazy for getting us here! And to Ryan for helping us get the data to the research community.

SHC Clarity re-architected for SoM 

For the last dozen years, Research IT had maintained a parallel workflow in Stanford School of Medicine (SoM) datacenter for generating a copy of SHC Clarity for research use. This workflow required Research IT staff to operate the Epic Clarity Console on nightly incremental updates sent by SHC team via SFTP and was high maintenance. 

In FY21, the Research IT team collaborated with our TDS counterparts to re-architect the solution. The new solution delivers data that is a) complete, b) high fidelity, and c) low latency, and d) it is less resource intensive to maintain. The new solution went into effect in Aug 2021 after 4 months of intensive testing.

In this new solution, we leverage the SHC Clarity Disaster Recovery (DR) node in a resourceful manner. We add an Oracle Active Data Guard license to make the DR database readable. We then extract data from the Clarity DR server to a compressed AVRO format and push the AVRO payload to the STARR data lake on Google Cloud Platform (GCP). In GCP, we leverage BigQuery's native AVRO support to regenerate Clarity. It takes <12 hrs to extract and push all Clarity data (~3 million patients with clinical text, flowsheets and more) to STARR on a 32 vCPU server with 64 GB RAM.

A huge shoutout to Glenn Drayer (TDS Analytics), and Research IT's Deepa Balraj who were at the heart of this effort.

Open Source TiDE, clinical text de-identification pipeline

In 2019, with development of our new STARR-OMOP Clinical Data Warehouse, we developed a clinical text de-identification pipeline (TiDE) that leveraged hiding in plain sight to reduce re-identification probability. TiDE is now open source. The algorithm is containerized and detailed documentation is provided for end users. 

In TiDE, "hiding in plain sight" is used where surrogates replace names and addresses in the clinical text. Names and addresses are found using Named Entity Recognition (NER) as well as by pattern matching to known PHI for the patient. We also replace other HIPAA identifiers, such as telephone numbers and MRNs, with realistic surrogates using text processing approaches. The clinical text looks near real to a human or AI. The open source version can be run on a laptop or a powerful server. Apache Beam is used for batch processing.
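As a rough illustration of the "hiding in plain sight" idea, the sketch below replaces known patient names and phone-number patterns with surrogates. It is not TiDE's code: the surrogate list, the regular expression, and the function name replace_phi are assumptions made for this example, and the NER step is left out entirely.

```python
import random
import re

# Hypothetical surrogate pool; a real pipeline would draw from much larger lists.
SURROGATE_NAMES = ["Alex Morgan", "Jamie Rivera", "Taylor Brooks"]

# Simple US-style phone pattern, e.g. 650-555-0100 or (650) 555-0100.
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")


def replace_phi(text, known_names, seed=0):
    """Replace known names and phone numbers with realistic-looking surrogates."""
    rng = random.Random(seed)  # seeded so the example is repeatable

    # Pattern matching against PHI already known for this patient.
    for name in known_names:
        surrogate = rng.choice(SURROGATE_NAMES)
        text = re.sub(re.escape(name), surrogate, text, flags=re.IGNORECASE)

    # Replace anything that looks like a phone number with a fictional one.
    def fake_phone(_match):
        return f"{rng.randint(200, 999)}-555-01{rng.randint(0, 99):02d}"

    return PHONE_RE.sub(fake_phone, text)


note = "Jane Doe called from 650-555-0199 about her refill."
print(replace_phi(note, known_names=["Jane Doe"]))
```

Because the replacements look like ordinary names and numbers, any identifier the detector misses is harder to tell apart from the surrogates around it.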

In the Research IT OMOP-deid Clinical Data Warehouse pipeline, TiDE runs in a distributed computing framework on GCP, where 800 worker nodes are booted up to de-identify 100 million clinical notes in <6 hrs. The open source version excludes the distributed computing framework used by Research IT, since most end users are unlikely to be GCP users at this time. Users can parallelize by sharding their clinical text across multiple servers.
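One simple way to shard clinical text across servers is to hash each note ID to a shard number. The sketch below assumes every note has a string ID; the function names and the choice of SHA-256 are illustrative and are not part of TiDE.

```python
import hashlib


def shard_for(note_id: str, num_shards: int) -> int:
    """Map a note ID to a shard number in a stable, evenly spread way."""
    digest = hashlib.sha256(note_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def shard_notes(note_ids, num_shards):
    """Group note IDs into num_shards lists, one list per server."""
    shards = [[] for _ in range(num_shards)]
    for note_id in note_ids:
        shards[shard_for(note_id, num_shards)].append(note_id)
    return shards


note_ids = [f"note-{i}" for i in range(1000)]
shards = shard_notes(note_ids, num_shards=4)
print([len(s) for s in shards])  # sizes of the four shards
```

A hash-based assignment is stable across runs, so a failed shard can be rerun on its own without reshuffling the others.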

A big shout out to Jose Posada and Wencheng Li, who were our primary method and software developers. Alas, they are now both on their new adventures, away from Stanford. We are also grateful to our partners, Vertisystems, who were key to the open sourcing effort.

Open Source db-to-avro, moving large database from on-premise to Cloud

Moving large Oracle and MSSQL databases from on-premise to cloud is a challenging problem. We extract the data from the database to AVRO, a compressed binary format, and then upload the AVRO files to cloud buckets. Cloud-native data warehouses like BigQuery can load the AVRO trivially.
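The upload and load steps can be sketched with the standard gsutil and bq command-line tools. The helper below only builds the commands and does not run them; the directory, bucket, dataset and table names are placeholders, and this is not the db-to-avro code itself.

```python
def build_load_commands(local_dir, bucket, dataset, table):
    """Return the upload and load commands as argument lists (not executed)."""
    gcs_prefix = f"gs://{bucket}/{table}"
    # Copy the extracted AVRO files to a Cloud Storage bucket in parallel.
    upload = ["gsutil", "-m", "cp", f"{local_dir}/*.avro", f"{gcs_prefix}/"]
    # Load the AVRO files into a BigQuery table; the schema comes from the files.
    load = ["bq", "load", "--source_format=AVRO",
            f"{dataset}.{table}", f"{gcs_prefix}/*.avro"]
    return upload, load


# Placeholder names, for illustration only.
upload_cmd, load_cmd = build_load_commands(
    "/data/avro/PATIENT", "example-avro-bucket", "clarity_example", "PATIENT"
)
print(" ".join(upload_cmd))
print(" ".join(load_cmd))
```

Since AVRO files carry their own schema, the load step needs no separate table definition, which is a large part of why the load is easy.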

Since its inception, we have optimized and now open sourced the db-to-avro code. For our adult Clarity, which is a 10-12 terabyte database, we can extract the AVRO (full dump, not incremental) in 4-6 hours. The code is containerized for ease of execution and supports both Oracle and MSSQL sources.

A big shout out to Joe Mesterhazy and our business partners, Vertisystems, for making this happen.

Open Source MIRC-CTP, our DICOM anonymization and filter scripts

Research IT hosts and manages one of the largest repositories of DICOM images (approx 2 petabytes of data). These images can be linked to EHR data (OMOP/Clarity etc) and other clinical modalities. To support IRB driven research, we support just-in-time anonymization of the DICOMs.

We have optimized MIRC-CTP scripts to remove metadata and pixel PHI from a large number of modalities, including X-ray, MRI, ultrasound and cardiology images. The codebase is open source. These scripts are oriented towards removing PHI as well as images that are not useful for machine learning. Image types that are "DERIVED" or "SECONDARY" are excluded, as they are generally not useful for machine learning and are far more likely to contain pixel PHI. The anonymization scripts are based on the DICOM-PS3.15E-Basic profile, with additional rules for tags known to contain PHI. All vendor-specific (e.g., odd-numbered) tags are also removed.
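The two filtering rules above (dropping DERIVED or SECONDARY images and removing odd-numbered tags) can be sketched in a few lines of Python. The real rules live in MIRC-CTP scripts; here a DICOM header is modeled as a plain dictionary keyed by (group, element), and the function names are made up for this example.

```python
def keep_for_research(image_type):
    """Return False for DERIVED or SECONDARY images, which are filtered out."""
    values = {v.upper() for v in image_type}
    return not ({"DERIVED", "SECONDARY"} & values)


def strip_private_tags(tags):
    """Drop vendor-specific tags, i.e. those whose group number is odd."""
    return {
        (group, elem): value
        for (group, elem), value in tags.items()
        if group % 2 == 0
    }


header = {
    (0x0008, 0x0008): ["ORIGINAL", "PRIMARY", "AXIAL"],  # ImageType
    (0x0008, 0x0060): "CT",                              # Modality
    (0x0029, 0x1010): b"vendor blob",                    # odd group: vendor tag
}

if keep_for_research(header[(0x0008, 0x0008)]):
    cleaned = strip_private_tags(header)
    print(sorted(cleaned))  # the (0x0029, 0x1010) entry is gone
```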

A big shout out to Joe Mesterhazy, for making this feasible.


Manuscript on bedside monitoring data integration with STARR data lake

For several years, Research IT has helped LPCH archive pediatric bedside monitoring data. Earlier this year, we released the database of historical bedside monitoring metadata and associated waveforms. We have now published our manuscript that describes the data processing and the linking to the clinical data warehouse.

The metrics for data collected between Feb 2017 and March 2021 show the following:

  • Total studies: ~620,000 
  • Compressed size of study folders: ~14 TB 
  • Average daily count of studies: ~400 
  • Total number of patients: ~48,000 
  • Average daily count of patients: ~280 
  • Uncompressed (Compressed) daily extract size: ~75 GB (~21 GB) 
  • Daily Philips database size: ~220 GB 
  • Average daily count of rows in alert table: ~180,000 
  • Average daily count of rows in wave sample table: ~10 million 
  • Average daily count of rows in enumeration value table: ~60 million 
  • Average daily count of rows in numeric value table: ~120 million


A huge shoutout to Sanjay Malunjkar, our technical lead, for going the extra mile.

Manuscript on flowsheet data integration with STARR-OMOP

We mapped several vital flowsheets and integrated the mapping into the OMOP measurement table. This manuscript describes the method and its impact on the CDM.

We have mapped the 28 most requested flowsheets to the OMOP measurement table. The newly included measurements are vitals such as blood pressure, oxygen level, heart rate and respiratory rate, as well as measurements from the Sequential Organ Failure Assessment (SOFA) score, Glasgow Coma Scale score, Deterioration Index score etc.
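To give a feel for what such a mapping does, the sketch below turns one flowsheet reading into a row shaped like the OMOP measurement table. The column names follow the OMOP CDM, but the flowsheet names, units and concept IDs are placeholders and do not reflect the actual STARR-OMOP mapping.

```python
from datetime import datetime

# Hypothetical mapping from flowsheet row names to OMOP concepts and units.
# The concept IDs below are placeholders, not real OMOP vocabulary IDs.
FLOWSHEET_MAP = {
    "Heart Rate": {"concept_id": 900001, "unit": "beats/min"},
    "Resp Rate": {"concept_id": 900002, "unit": "breaths/min"},
}


def to_measurement(person_id, flowsheet_name, value, recorded_at):
    """Convert one flowsheet reading into an OMOP-style measurement row."""
    mapping = FLOWSHEET_MAP.get(flowsheet_name)
    if mapping is None:
        return None  # unmapped flowsheets are skipped in this sketch
    return {
        "person_id": person_id,
        "measurement_concept_id": mapping["concept_id"],
        "measurement_datetime": recorded_at,
        "value_as_number": float(value),
        "unit_source_value": mapping["unit"],
        "measurement_source_value": flowsheet_name,
    }


row = to_measurement(42, "Heart Rate", "88", datetime(2021, 3, 1, 8, 30))
print(row)
```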

A huge shout out to Tina Seto, our lead developer, for going this extra mile of publishing her methods.

First STARR Informatics Summit hosted

Priya Desai, R&D Manager Biomedical Informatics, Research IT, Innovation and Translation (IaT) at TDS, and Nigam Shah, Professor of Biomedical Informatics and Data Science at IaT, hosted the first ever STARR Informatics Summit.

Over 175 unique participants joined us to listen to the keynote, STARR wins, and the live panel, meet with Stanford service providers, participate in workshops and much more. We thank all the participants, speakers and organizers for making this event a tremendous success. Here are the links to some of the key events:

  1. George Hripcsak, Chair and Vivian Beaumont Allen Professor of Biomedical Informatics, Columbia University, Keynote (video)
  2. Priya Desai, Manager of Biomedical Informatics R&D, Celebrating Wins (video)
  3. Birju Patel, Stanford Fellow and panel moderator, Linking multimodal clinical data - Successes,   Challenges, and Barriers  (video)

Thank you, Priya! 

OHDSI 2021 Collaborator Showcase

At the 2021 OHDSI Global Symposium, Sept. 12-15, 2021, our team contributed to two collaborator showcase presentations:

  1. Linking Analysis Ready Multi-modal data (Link) in the category Observational Data Standards and Management.
  2. ATLAS with a BigQuery backend running Execution Engine – a Software demo (Link), in the category Open Source Analytics Development

Big shout out to Priya and Jose Posada (now faculty at Univ del Norte, Colombia).

OMOP enhancement, Social Determinants of Health

In collaboration with the University Privacy Office (UPO), Research IT has made the five-digit ZIP code (ZCTA) available in the de-identified STARR OMOP data. The de-identified OMOP data is designed to reduce the startup barrier, and our goal has been to retain the data richness that is typically expected from identified data.

Shout out to Joe Mesterhazy, and Wencheng (wishing you the best on your new adventures).

STARR Tools enhancement, external death data integrated

STARR Tools (fka STRIDE) has historically provided Social Security Administration (SSA) death data to augment the in-hospital death information available via EHRs. This SSA death data is now part of the NTIS Limited Access Death Master File (LADMF), a file that is maintained and made available to certified organizations through the U.S. Department of Commerce. LADMF regulations place certain restrictions on the use and disclosure of the date of death (including individual data elements, such as month, day and year of death) of individuals during the three-calendar-year period beginning on the date of the individual’s death. STARR Tools, after a gap of several years during which the data was unavailable, has now integrated the LADMF file and meets the required regulations. Note that researchers too must follow the associated regulations; inappropriate use is subject to penalties under federal regulations (15 CFR 1110.200).
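The three-calendar-year window can be expressed as a small date check. The sketch below is one reading of the rule as summarized above, treating the window as running from the date of death up to, but not including, the same date three years later. It is not how STARR Tools implements the restriction, and the function names are made up for this example.

```python
from datetime import date


def restriction_end(date_of_death: date) -> date:
    """First day on which the assumed three-year restricted window is over."""
    try:
        return date_of_death.replace(year=date_of_death.year + 3)
    except ValueError:
        # Feb 29 has no counterpart three years later; use Mar 1 instead.
        return date(date_of_death.year + 3, 3, 1)


def is_restricted(date_of_death: date, as_of: date) -> bool:
    """True if as_of falls inside the restricted window for this death date."""
    return date_of_death <= as_of < restriction_end(date_of_death)


print(is_restricted(date(2020, 5, 1), as_of=date(2021, 12, 31)))  # True
print(is_restricted(date(2017, 5, 1), as_of=date(2021, 12, 31)))  # False
```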

A big shout out to Joe Pallas, for the effort in making an intuitive GUI to host and disseminate the regulated data for research use.

Note that a research study, titled "Alive or dead: Validity of the Social Security Administration Death Master File after 2011 (link)" concluded that researchers using the DMF may underestimate mortality.

More about STARR Tools

Support for PEDSNet Common Data Model for Pediatric Research

In close collaboration with the PI, Grace Lee, Research IT built a reusable ETL and data delivery pipeline and submitted the first data payload to CHOP in the PEDSNet CDM. Since then, several payloads have been delivered, representing CDM updates, vocabulary updates, data quality improvements and more.

A big shout out to Priya Desai, Maria Diaz and Deepa Balraj.

Atropos Service launch at Stanford Health Care

Our OMOP is an integral part of Atropos Health's new service at SHC. Atropos is the commercialization of Stanford's GreenButton program. It is a digital consultation service that helps physicians and other providers answer previously unanswerable clinical questions using real-world clinical and administrative data.

The Research IT team built an OMOP payload delivery pipeline to support the Atropos service at Stanford. We are excited that this is the first time Research IT generated data (not software) is being used for a clinical service. A big shout out to Garrick Olson for making this possible.

CARE-IT mobile study launched

The CARE-IT mobile study went live on the mHealth Platform. As a HIPAA compliant backend system, our platform enables patients to share their content with their family members or friends through the CARE-IT mobile app and web app.

Shout out to Lei Wang and Garrick Olson.

REDCap external module, use of Google Cloud Storage buckets

The Research IT team has developed a new REDCap external module (EM; think of these as plug, configure and play reusable software) that allows Stanford REDCap users to upload files directly from a REDCap form/survey to a Google Cloud Storage bucket. The EM is easy to configure: it works on forms and surveys, and users can upload to multiple buckets or upload multiple files in the same field, set action tags to customize the file path, upload and download using signed URLs, and more.

Shout out to Ihab Zeedia!

SEAL Lab launches two calculators

Stroke Risk Calculator: The SEAL team works directly with clinicians to rapidly deploy clinical workflow efficiency improvements in Epic. In collaboration with hospitalist Shreya Shah, a clinical risk calculator for stroke, called CHA2DS2-VASc, was launched. With this release, rather than switching to an external web browser to input data into WebMD, clinicians can call up the SEAL “CHA2DS2-VASc Score for Atrial Fibrillation” app from within Epic Hyperspace. The app is launched within the current patient context and displays relevant information from the patient chart on the same screen as the calculator, as well as pre-selecting some of the settings. After completing the risk assessment, the resulting score and formal clinical justification can be pasted into the patient chart with just a few more clicks of a button.
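For readers unfamiliar with the score, the standard CHA2DS2-VASc point assignment can be written as a short function. This is a sketch of the published scoring rule only, not the SEAL app's code; the function and parameter names are made up for this example.

```python
def cha2ds2_vasc(age, female, chf, hypertension, diabetes,
                 stroke_or_tia, vascular_disease):
    """Return the CHA2DS2-VASc score (0 to 9) from the standard risk factors."""
    score = 0
    score += 1 if chf else 0               # C: congestive heart failure
    score += 1 if hypertension else 0      # H: hypertension
    score += 2 if age >= 75 else (1 if age >= 65 else 0)  # A2 / A: age bands
    score += 1 if diabetes else 0          # D: diabetes mellitus
    score += 2 if stroke_or_tia else 0     # S2: prior stroke, TIA or thromboembolism
    score += 1 if vascular_disease else 0  # V: vascular disease
    score += 1 if female else 0            # Sc: sex category (female)
    return score


# A 72-year-old woman with hypertension and diabetes.
print(cha2ds2_vasc(age=72, female=True, chf=False, hypertension=True,
                   diabetes=True, stroke_or_tia=False,
                   vascular_disease=False))  # 4
```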

Pulmonary Embolism Risk: In collaboration with hospitalist Shreya Shah, a clinical risk calculator called Wells’ Criteria for Pulmonary Embolism was launched. With this release, rather than switching to an external web browser to input data into WebMD, clinicians can call up the SEAL “Wells’ Criteria for PE” app from within Epic Hyperspace. The app is launched within the current patient context and displays relevant information from the patient chart on the same screen as the calculator, as well as pre-selecting some of the settings. After completing the risk assessment, the resulting score and formal clinical justification can be pasted into the patient chart with just a few more clicks of a button.
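Wells' Criteria for PE likewise reduce to a weighted checklist. The sketch below uses the commonly published point values and the three-tier interpretation; it is not the SEAL app's code, and the finding keys and function names are assumptions for this example.

```python
# Published Wells' Criteria point values; the dictionary keys are made-up labels.
WELLS_PE_POINTS = {
    "clinical_signs_of_dvt": 3.0,
    "pe_most_likely_diagnosis": 3.0,
    "heart_rate_over_100": 1.5,
    "immobilization_or_recent_surgery": 1.5,
    "previous_pe_or_dvt": 1.5,
    "hemoptysis": 1.0,
    "malignancy": 1.0,
}


def wells_pe_score(findings):
    """Sum the points for the findings present (each finding counted once)."""
    present = set(findings)
    unknown = present - set(WELLS_PE_POINTS)
    if unknown:
        raise ValueError(f"Unknown findings: {sorted(unknown)}")
    return sum(WELLS_PE_POINTS[f] for f in present)


def wells_pe_risk(score):
    """Three-tier interpretation: low (<2), moderate (2 to 6), high (>6)."""
    if score < 2:
        return "low"
    if score <= 6:
        return "moderate"
    return "high"


score = wells_pe_score(["heart_rate_over_100", "hemoptysis"])
print(score, wells_pe_risk(score))  # 2.5 moderate
```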

Shout out to Srini and Susan.

REDCap external module, deploying TensorFlow models

Our partners at AIMI would like to present Stanford ML models as downloadable, plug-and-play external modules (EMs) for all REDCap users, which also precludes the need to deal with integrating external models or code. The customer aims to write a paper encapsulating the ‘how to’ for REDCap and ML models, which involves either setting up a local REDCap instance or installing the AIMI EM on REDCap instances at outside institutions.

Our team designed the UX and workflow for storing the Stanford models (in GitHub) and implemented a general UI that can hot swap and cache different ML models to predict various ailments in chest X-rays (to start). Once the predictions are made on the local device, the predictions and the user ground truths are sent back to a Stanford hosted master REDCap project.

Big shoutout to Irvin Szeto.


REDCap support for StudyPages, Bayer DeTAP study 

The purpose of the Decentralized Trial in Atrial Fibrillation Patients (DeTAP) study is to validate an approach to decentralize, or virtualize, the clinical trial experience for enrolled subjects, through the coordinated use of multiple digital health and telehealth technologies. The study aims to validate the feasibility, acceptability and best practices of coordinating and integrating several individual digital health technologies to execute high compliance, cost-efficient, and scientifically sound clinical trials. StudyPages was used to market the study and collect initial screening information, which was fed back into REDCap for further screening and consent. Participants then used the Huma app over 6 months to answer questionnaires and collect electrocardiogram data. Aside from building the REDCap project, we imported the data back into Stanford REDCap at 3 and 6 months for analysis.

Shout out to Lee Ann Yasukawa and Ryan Valentine!

CHOIR launched for pediatric patients

We launched pediatric CHOIR for Chiari in Stanford Children’s Health Neurosurgery Clinic. The pediatric neurosurgery clinic has been collecting before and after surgery data in REDCap which also includes reminders that go to the family for electronic data capture. Collecting data in CHOIR will allow tracking of patient scores and provide more efficient real time analysis of parent and child data prior to surgery and during follow-up.

Chiari malformation causes a small part of the brain to impede the normal flow of cerebrospinal fluid (CSF) from the head into the cervical spine. This abnormality can increase pressure in the brain and cause waves of CSF to pulse down the spinal column. If left untreated, Chiari can cause painful, disruptive and sometimes disabling symptoms.

We also built and implemented the CHOIR platform for the Oregon Health & Science University (OHSU) Pediatrics department. They will be registering patients to participate in multiple parent/child assessments.

Big shout out to Teresa Pacht, Lei Wang, Stephanie Byrne and Garrick Olson!

CHOIR launched for Valley Care Orthopedics

We have built and implemented the CHOIR platform for the orthopedics department at Valley Care.  The department will be using CHOIR to collect information from patients who have had knee or hip replacements. This data will be visible to clinicians treating the patients, and combined with Epic data for submission to the American Academy of Orthopedic Surgeons (AAOS) American Joint Replacement Registry (AJRR).

We are proud to support Valley Care, a hospital launched by residents and leaders of the Tri-Valley eager for robust local health care. In 2021, Valley Memorial Hospital, now Stanford Health Care – ValleyCare, celebrates 60 years of service to its community.

Big shout out to Teresa Pacht, Stephanie Byrne and Garrick Olson!

CHOIR enhancements for health equity

We improved CHOIR's image map handling to allow for customizing of the image.  A smaller image of the survey is now displayed on smaller devices such as phones and small tablets versus the larger display on computers.

We created a LANGUAGE table to hold EMR language code, display name and the associated ISO language code so that clinics can request CHOIR behavior based on patient’s preferred language.
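A lookup table of this shape can be sketched with Python's built-in sqlite3 module. The column names, the EMR codes and the fallback to English are assumptions for this example and do not reflect the actual CHOIR schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE language (
        emr_language_code TEXT PRIMARY KEY,
        display_name      TEXT NOT NULL,
        iso_language_code TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO language VALUES (?, ?, ?)",
    [("ENG", "English", "en"), ("SPA", "Spanish", "es")],  # made-up EMR codes
)


def iso_code_for(emr_code, default="en"):
    """Look up the ISO language code for a patient's EMR language code."""
    row = conn.execute(
        "SELECT iso_language_code FROM language WHERE emr_language_code = ?",
        (emr_code,),
    ).fetchone()
    return row[0] if row else default


print(iso_code_for("SPA"))  # es
```

Keeping the EMR code and the ISO code side by side lets survey logic key off a standard code while clinics continue to send whatever code their EMR uses.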

Thank you, Teresa!

Our Voice enhancements

Our team developed a cloud function using Google Cloud Vision that autonomously blurs faces and license plates on user uploaded images for improved privacy. 

Our Voice is centered on an easy-to-use Discovery Tool app written by Research IT, which allows neighborhood residents to become “citizen scientists” who walk their neighborhoods, take geotagged photos, and make audio recordings. Uploaded data is stored in a REDCap database customized for the project by Research IT. The data can be discussed with other citizen scientists, and potential solutions can be presented to city planners, policymakers, and others to determine where repairs and other changes would make the most difference. The data is also shared with researchers.

We are delighted that the Our Voice research initiative was awarded the ISP STAR award in 2021 by Stanford. This honor recognizes individuals and teams from the Stanford Medicine community who, through their extraordinary efforts, embody the strategic priorities of our Integrated Strategic Plan (ISP): Value Focused, Digitally Driven, and Uniquely Stanford.

A huge shout out to Irvin Szeto, and Jordan Schultz. 

Alliance Sleep Questionnaire (ASQ) modernized

ASQ has been collecting patient sleep surveys since 2011. It was rewritten from the ground up to use a modern, more secure technology stack and the latest UI paradigm, and to provide multilingual support.

Poor sleep is linked to many medical consequences: depression, cardiovascular problems, diabetes, obesity, and more. Sleepiness leads to costly accidents and workplace errors. The mysteries of sleep limit our ability to identify etiologies and treat sleep disorders effectively. The ASQ (Alliance Sleep Questionnaire) is an online branching logic questionnaire developed at Stanford that patients fill out prior to their visit at the Stanford Sleep Clinic. The ASQ gives clinicians a comprehensive overview of an individual’s sleep complaints before the appointment begins. An abbreviated version of the questionnaire allows monitoring of patient symptoms over time. 

Big thanks to Glenna Mayo and Sanjay Malunjkar.

R2P2, our new customer portal launched

To facilitate better tracking, communication and support of research projects, Research IT, in collaboration with the TDS PMO, has launched a new customer facing portal. From this portal, researchers will be able to see all their research requests with Research IT, manage collaborators, track issues, and communicate with RIT personnel.

A big shout out to Andy, Stephanie and Ihab.