Defining & Managing Data Extracts
A data extract associates a data delivery list with a specific data model, data options, and delivery method. Before creating a data extract, you must already have a saved list as described in Creating & Managing Lists.
Defining a Data Extract
There are several ways to create a new extract:
- Through the "New Extract" button on the Extracts Management page
- Through the "Save and Create Extract" button on Create a List
- Through the "Create new extract" button from a saved cohort on the Lists Management page
- Through editing a previously created extract
- Through STARR Chart Review. If Data Delivery is invoked through STARR Chart Review, a Data Delivery cohort will automatically be created.
Step 1 - Select a List
Data extracts must be associated with a list of patients or accession numbers. The "List" dropdown gives you access to all the lists that have been created by you or by other users who are listed on your IRBs and DPAs.
Step 2 - Select a Data Model
Three data models are available through Data Delivery. OMOP and STRIDE are data models for electronic medical records and are discussed in depth here. In general, smaller studies may find STRIDE easier to work with, while large studies will require OMOP. The DICOM data model allows users to request 10 or fewer DICOM images, scrubbed of PHI, for download.
Requirements & Limitations
- See Data Delivery FAQs for data model size limits.
- OMOP
  - At this time, OMOP extracts are fully identified, so projects that wish to use OMOP must be approved for all PHI categories. Projects that do not wish to use identified data must use STRIDE.
  - OMOP extracts may only be delivered to Nero datasets.
- DICOM
  - DICOM extracts are only available for cohorts that are composed of Accession Number List sources.
  - If a DICOM extract includes multiple accession number sources, the union of these sources will be used as the list for the DICOM extract.
  - DICOM extracts will be scrubbed for PHI.
  - DICOM extracts cannot be delivered to Nero.
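The requirements above can be read as a small set of compatibility rules. The following Python sketch is illustrative only and is not part of the Data Delivery tool: the ExtractRequest class, its field names, and the "download"/"nero" labels are hypothetical, chosen just to show how the rules combine.

```python
from dataclasses import dataclass


@dataclass
class ExtractRequest:
    """Hypothetical container for the choices made in Steps 1 to 3."""
    data_model: str                  # "OMOP", "STRIDE", or "DICOM"
    delivery: str                    # "download" or "nero" (labels are assumptions)
    approved_for_all_phi: bool = False
    accession_sources: tuple = ()    # each source is a collection of accession numbers


def dicom_accession_list(sources):
    """Union of all accession number sources, as used for a DICOM extract."""
    merged = set()
    for source in sources:
        merged |= set(source)
    return merged


def check_request(req):
    """Return a list of rule violations; an empty list means no rule was broken."""
    problems = []
    if req.data_model == "OMOP":
        if not req.approved_for_all_phi:
            problems.append("OMOP requires approval for all PHI categories")
        if req.delivery != "nero":
            problems.append("OMOP extracts may only be delivered to Nero")
    elif req.data_model == "DICOM":
        if req.delivery == "nero":
            problems.append("DICOM extracts cannot be delivered to Nero")
        if not req.accession_sources:
            problems.append("DICOM requires Accession Number List sources")
    return problems
```

For example, check_request(ExtractRequest("OMOP", "download")) reports two problems, while a STRIDE download request reports none.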
Step 3 - Select the Delivery/Destination (OMOP or STRIDE only)
DICOM extracts must be downloaded.
For STRIDE or OMOP data, there are two delivery options: Download or Nero datasets.
Download
EMR data is downloaded in CSV format compatible with all spreadsheet applications. While this is convenient, CSV downloads have the following limitations:
- Can only be used for the In-House STRIDE data model
- PHI download has extra compliance requirements
For extracts that elect to use Download to deliver their data, the system will send you an email when your files are ready for download. You will then be able to download your CSV files from the Extracts Management page.
Nero BigQuery Dataset
Nero is a highly secure compute environment for large data sets that is available to the Stanford community. Nero offers several benefits over downloads:
- Support for OMOP extracts
- Better PHI security compared to laptops and local compute environments
- Support for large cohorts
  - Up to 100K records for the STRIDE data model
  - Unlimited cohort sizes for the OMOP data model
To deliver to a Nero dataset:
- Select your Nero project. All the Nero projects associated with your SUNet ID should be available in the Nero project dropdown. If the data contains PHI, the selected Nero project must be PHI approved, as indicated by "-phi-" in the project name.
- Give your dataset a name. The dataset name may contain only letters, numbers, and underscores ("_"). The data delivery tool will automatically generate a prefix for your dataset indicating the IRB or DPA number and whether the extracted data contains PHI.
- Select an overwrite policy. If the dataset already exists in the Nero project you have selected, choose whether this extract should fail in order to preserve the existing dataset or overwrite it.
The data will be delivered to your named dataset automatically once the data extract process is complete. You will receive an email when the data has been delivered.
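If you want to check a dataset name before entering it, the naming rule and the overwrite choice can be sketched as follows. This is a minimal illustration, not the tool's own validation: the function names are hypothetical, and the automatically generated prefix is not modeled because its exact format is not described here.

```python
import re

# Letters, numbers, and underscores only, per the naming rule above.
_DATASET_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_dataset_name(name: str) -> bool:
    """True if the whole name consists of letters, digits, and underscores."""
    return _DATASET_NAME.fullmatch(name) is not None


def resolve_overwrite(dataset_exists: bool, overwrite: bool) -> str:
    """Return what happens on delivery: "create", "overwrite", or "fail"."""
    if not dataset_exists:
        return "create"
    return "overwrite" if overwrite else "fail"
```

Under these assumptions, "cohort_2024" is accepted, while "my-data" and "my data" are rejected because of the hyphen and the space.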
Step 4 - PHI Options (STRIDE Only)
Select only the minimum amount of PHI needed for your research in accordance with the HIPAA principle of Minimum Necessary.
At this time, PHI cannot be requested for DICOM extracts.
OMOP extracts are fully identified and require access to ALL PHI categories that may be present.
PHI options require the following conditions to be met:
- Your IRB or DPA has been approved for PHI
- Your destination/delivery is approved for PHI
- For CSV download, you must be listed on the IRB and have been approved for a PHI download exemption
- For Nero project, your project is PHI approved and has "phi" in the project name
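The conditions above combine as a simple all-must-hold check. The sketch below is one way to express that logic in Python; the function and parameter names are hypothetical, and testing the project name for "phi" is only a stand-in for the actual approval status of the Nero project.

```python
def phi_options_available(protocol_phi_approved, delivery, *,
                          on_irb=False, download_exemption=False,
                          nero_project=""):
    """Return True if PHI options could be offered under the stated conditions.

    protocol_phi_approved: the IRB or DPA has been approved for PHI.
    delivery: "download" or "nero" (labels are assumptions for this sketch).
    """
    if not protocol_phi_approved:
        return False
    if delivery == "download":
        # CSV download: listed on the IRB and holding a PHI download exemption.
        return on_irb and download_exemption
    if delivery == "nero":
        # Nero: a PHI-approved project, which has "phi" in its name.
        return "phi" in nero_project
    return False
```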
Some PHI options may be disabled, as denoted by strikethrough font. In the example above, "Other PHI" is not available as an option because the DPA associated with this IRB does not have the "Other PHI" box checked. As a result, you will not be able to view Accession Numbers or any other identifiers that fall into the "Other" category in HIPAA.
If you wish to enable a PHI option, you must modify your IRB, as described in this STARR Tools compliance walkthrough.
Step 5 - Date Mask Options (STRIDE Only)
If you choose not to work with real dates (i.e. you did not check the "Dates" option for PHI), you can choose how dates are scrubbed from the data.
Dates can either be systematically shifted from their original value, or replaced by the patient age at event along with the year the event took place. OMOP and DICOM dates are always scrubbed by date shifting.
When date shifting is selected as the scrubbing technique for dates, all dates for a given patient are shifted by the same amount, in order to preserve the exact timeline for that patient. Different shift values are used for different patients.
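To see why a single per-patient offset preserves the timeline, consider the following Python sketch. It is an illustration of the general technique only: the offset range (max_days) and the way offsets are drawn are assumptions, not the values or method used by STARR.

```python
import random
from datetime import date, timedelta


def assign_shifts(patient_ids, max_days=30, seed=None):
    """Draw one offset per patient. The range is an assumption for illustration.

    Offsets are drawn independently, so in this sketch two patients can
    occasionally receive the same value.
    """
    rng = random.Random(seed)
    return {pid: timedelta(days=rng.randint(-max_days, max_days))
            for pid in patient_ids}


def shift_events(events, shifts):
    """Shift each (patient_id, date) pair by that patient's own offset."""
    return [(pid, day + shifts[pid]) for pid, day in events]


events = [
    ("p1", date(2020, 1, 1)),
    ("p1", date(2020, 1, 11)),
    ("p2", date(2020, 1, 1)),
]
shifts = assign_shifts(["p1", "p2"], seed=0)
shifted = shift_events(events, shifts)

# The gap between the two p1 events is still exactly 10 days after shifting.
gap = shifted[1][1] - shifted[0][1]
```

Because both p1 dates move by the same amount, intervals within a patient are unchanged, while dates can no longer be compared reliably across patients.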
When "age at event" is selected, the date of service or encounter is converted into patient age in years, represented as a floating point number with sufficient precision to pin down to the minute when the event occurred in the patient timeline.
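As a rough guide to the precision involved, one minute is about 1/525,960 of a year (roughly 0.0000019 years), so ages need around six decimal places to resolve individual minutes. The sketch below shows one possible conversion; the 365.25-day year is an assumption, and the convention actually used in the delivered data may differ.

```python
from datetime import datetime, timedelta

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed year length for this sketch


def age_at_event(birth: datetime, event: datetime) -> float:
    """Patient age at the event, in fractional years."""
    return (event - birth).total_seconds() / SECONDS_PER_YEAR


def minutes_between(age_a: float, age_b: float) -> float:
    """Elapsed minutes between two events given as ages in years."""
    return (age_b - age_a) * SECONDS_PER_YEAR / 60


birth = datetime(1980, 1, 1)
visit = datetime(2020, 6, 15, 9, 30)
follow_up = visit + timedelta(minutes=45)

elapsed = minutes_between(age_at_event(birth, visit),
                          age_at_event(birth, follow_up))
```

Here elapsed comes out at 45 minutes to within floating point error, which shows that the order and spacing of events for a patient can be recovered from ages alone.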
Further information on the techniques used to scrub PHI from free text is available on this page of the STARR Tools site and in this white paper.
Step 6 - SSA Death Data (OMOP or STRIDE)
STARR contains death data obtained from the Social Security Administration (SSA).
This data cannot be disclosed to outside collaborators.
The SSA Death Data Option is available as an option for the In-House STRIDE data model, but is included by default in OMOP data. If you plan on disclosing any of the data obtained with this tool outside of Stanford, we encourage you to use the STRIDE data model with the SSA Death Data box unchecked.
STARR contains dates of death recorded at both hospitals, which can be disclosed to collaborators once suitable legal agreements are in place. However, if the patient died after leaving Stanford, it can be challenging to determine their current vital status, even with the SSA death data, which is less complete than you might expect.
You cannot assume your patients are still living if neither Stanford nor SSA has a record of their death.
Step 7 - Clinical Category Options (OMOP or STRIDE)
Select "All Available EMR Data" to request all available EMR data allowable by your DPA. Note that "Confidential/Psych notes" are not included in this option, even if allowable by your DPA. To access confidential notes, you must explicitly select them under "Selected Data Types".
OMOP Data Options
The OMOP data model does not deliver clinical notes by default. If you require clinical notes, check "Clinical Notes" under "Selected Data Types". "Confidential/Psych Notes" require "Clinical Notes" to be selected as a prerequisite.
In-House STRIDE Clinical Categories
If you have selected STRIDE as your data model, you can specify which types of clinical data you want to obtain. The categories are the same as offered by data export in the STARR Chart Review Tool.
Some categories may be disabled. "Confidential/Psych Notes" are always disabled by default. If you require "Confidential/Psych Notes", the "Clinical Notes" category must be selected first as a prerequisite. If your DPA is approved for confidential notes, then the "Confidential/Psych Notes" option will be enabled once "Clinical Notes" has been selected.
Other categories that are disabled were not included in your DPA. To enable these categories, update your IRB or DPA.
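The enabling rules for clinical categories can be summarized in a few lines of Python. This is a sketch of the logic as described above, not the tool's implementation; the function name is hypothetical, and "Labs" in the example is a made-up category name.

```python
CLINICAL = "Clinical Notes"
CONFIDENTIAL = "Confidential/Psych Notes"


def enabled_categories(dpa_categories, selected):
    """Return the categories a user could select right now.

    dpa_categories: categories included in the DPA.
    selected: categories the user has already checked.
    """
    dpa = set(dpa_categories)
    # Categories outside the DPA are never enabled; confidential notes
    # start out disabled even when the DPA allows them.
    enabled = dpa - {CONFIDENTIAL}
    if CONFIDENTIAL in dpa and CLINICAL in selected:
        enabled.add(CONFIDENTIAL)
    return enabled


dpa = {CLINICAL, CONFIDENTIAL, "Labs"}  # "Labs" is a hypothetical category
before = enabled_categories(dpa, selected=set())
after = enabled_categories(dpa, selected={CLINICAL})
```

In this example, "Confidential/Psych Notes" is absent from before and present in after.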
Step 8 - Limit by Date Range (STRIDE Only)
The next option is to set a date range on the clinical elements of the extract. This option applies only to STRIDE extracts. It is useful for splitting a particularly large extract into more manageable chunks.
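If you plan to split a long study period across several extracts, a helper like the following can generate the date ranges to enter. It is a planning aid written for this guide, not a feature of the tool, and it assumes that both ends of each range are inclusive.

```python
from datetime import date, timedelta


def split_date_range(start, end, days_per_chunk):
    """Yield consecutive, non-overlapping (start, end) windows covering the range."""
    if days_per_chunk < 1:
        raise ValueError("days_per_chunk must be at least 1")
    one_day = timedelta(days=1)
    current = start
    while current <= end:
        chunk_end = min(current + timedelta(days=days_per_chunk) - one_day, end)
        yield current, chunk_end
        current = chunk_end + one_day


windows = list(split_date_range(date(2020, 1, 1), date(2020, 1, 10), 4))
```

With these inputs, windows holds three ranges: January 1 to 4, January 5 to 8, and January 9 to 10.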
Step 9 - Save and Run
You can now save and run your data extract. You will be asked for a preferred email address to receive notifications regarding your extract request. You will be emailed when the extraction process completes, as it can take a while, particularly for large patient lists. Once you have saved your extract, you will be rerouted to the Extracts Management page where you can track your extract status.
Managing Extracts
All of the extracts created by you, as well as those created by collaborators on your IRBs and DPAs, are listed on the Extracts Management page.
From this page, you can create new extracts as described in Defining a Data Extract. You can also search your previously created extracts, edit and resubmit them, download or redeliver to Nero, or delete them.
Extract data is deleted after five days. This means that data will no longer be available for download or redelivery. Extracts will also be automatically removed from the default view, unless set as a "Favorite". See the "Display Options" section below for more information about "Favorite" extracts.
Extracts are ordered by submission date (most recent first) by default; however, the table can be sorted by "favorite" status, extract ID, cohort name, data model, or status. The search box allows you to filter the table by keyword. In addition, the "Submitted By" and "List" columns have filters that let you filter by creator and IRB/DPA.
Display Options
There are several display options for the extract summary table.
- "Recent" extracts are those that have been created within the last five days and are either in queue, in process, or have data available.
- "Favorites" are extracts that have been marked by the user with a yellow star. This is useful for extracts that are frequently re-run. Note that the data for favorite extracts will still be deleted after five days.
- "Expired" extracts are older than five days. Data from these extracts is no longer available for download or redelivery.
- "All" displays both Recent and Expired extracts.
By default, the extract list displays Recent and Favorite extracts.
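The relationship between the display options and the five-day retention period can be sketched as below. This is a simplified model for illustration: the Extract class and its fields are hypothetical, the five days are assumed to count from the submission time, and the processing-status condition on "Recent" extracts is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

RETENTION = timedelta(days=5)  # extract data is deleted after five days


@dataclass
class Extract:
    """Hypothetical record for one row of the extract table."""
    extract_id: int
    submitted: datetime
    favorite: bool = False


def is_expired(extract, now):
    """Favorites expire too; the star only affects what is displayed."""
    return now - extract.submitted > RETENTION


def visible_extracts(extracts, view, now):
    """Filter extracts for one display option ("Default" is Recent plus Favorites)."""
    if view == "All":
        return list(extracts)
    if view == "Recent":
        return [e for e in extracts if not is_expired(e, now)]
    if view == "Expired":
        return [e for e in extracts if is_expired(e, now)]
    if view == "Favorites":
        return [e for e in extracts if e.favorite]
    if view == "Default":
        return [e for e in extracts if e.favorite or not is_expired(e, now)]
    raise ValueError(f"unknown view: {view}")
```

Note that a favorite extract older than five days still appears in the default view even though its data is no longer available.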
Status
The status column combines several pieces of information: processing status, download or delivery status, and "stale" status.
Processing Status
Extracts go through several stages during processing:
- Not yet started - The extract is waiting to begin processing. There may be other extracts ahead of it.
- In progress - The extract has begun processing. This could take several hours.
- Ready - The extract has completed successfully.
- Failed - The extract was not able to run to completion.
Download Status
If you selected "Download" as the delivery method, you will be sent an email once extract processing is complete. The status message will include "Download needed" to indicate that the extract has not yet been downloaded. DICOM extracts will also indicate the number of files that need to be downloaded.
During download, the download status displays the message "Download in progress."
After a successful download, the status reports the date and time of the last download along with the SUNet ID of the last downloader.
Delivery Status (Nero Only)
Extracts that selected a Nero project as the destination will not be sent an email until delivery is complete.
"Stale" Status
Extracts are labeled as "Stale" when the cohort associated with the extract was updated while the extract was in process or after the extract was complete. You may want to rerun your extract to ensure that it contains data from the updated cohort.
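In other words, an extract is stale whenever its cohort changed at any point after extraction began. A minimal sketch of that comparison follows; the function and its timestamp arguments are hypothetical and only illustrate the rule.

```python
from datetime import datetime
from typing import Optional


def is_stale(extract_started: datetime, cohort_updated: Optional[datetime]) -> bool:
    """True if the cohort was modified during or after the extract run."""
    return cohort_updated is not None and cohort_updated > extract_started


started = datetime(2024, 3, 1, 8, 0)
stale = is_stale(started, datetime(2024, 3, 1, 9, 15))   # updated mid-run
fresh = is_stale(started, datetime(2024, 2, 20, 12, 0))  # updated beforehand
```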
Actions
Extract actions include:
- Redeliver to Nero (Nero only)
- Download files (Download delivery only)
- Rerun extract with same parameters. This will create a new extract.
- Change the extract settings and rerun. This will create a new extract.
- Remove the extract from "Recent"
DICOM Downloads
DICOM extracts require downloading multiple files. When the "Download" button is clicked for a DICOM extract, a list of the files in the DICOM extract is displayed as a subtable. Each DICOM request generates a manifest file that summarizes the status of each accession number request, as well as a codebook.csv file that maps the original accession numbers to their anonymized counterparts. Each requested accession number results in an anonymization report, plus a zip file if the DICOM image was successfully scrubbed for PHI.
Some accession numbers may result in a "Not Found" or "Filtered" status. "Not Found" DICOM files may be located in a different PACS server and inaccessible to Data Delivery. "Filtered" DICOM files were found but could not be scrubbed of PHI, so they could not be delivered. "Success" status indicates that the DICOM was found and at least some of the images were successfully scrubbed for PHI. Please note that delivered zip files may not include all original images if some could not be scrubbed successfully. Anonymization reports include statistics on the number of images that could not be included in the delivered DICOM.
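Once you have the per-accession statuses from the manifest, a short script can tally them and list which accession numbers need follow-up. The sketch below works on plain (accession number, status) pairs because the manifest's file layout is not described here; the accession numbers shown are made up, and reading the pairs out of the actual manifest file is left to you.

```python
from collections import Counter


def summarize_statuses(rows):
    """Tally statuses and separate delivered accessions from the rest.

    rows: iterable of (accession_number, status) pairs, where status is
    "Success", "Not Found", or "Filtered".
    """
    rows = list(rows)
    counts = Counter(status for _, status in rows)
    delivered = [acc for acc, status in rows if status == "Success"]
    follow_up = [acc for acc, status in rows if status != "Success"]
    return counts, delivered, follow_up


# Made-up accession numbers, for illustration only.
rows = [
    ("ACC001", "Success"),
    ("ACC002", "Not Found"),
    ("ACC003", "Filtered"),
    ("ACC004", "Success"),
]
counts, delivered, follow_up = summarize_statuses(rows)
```

Remember that "Success" only means at least some images were delivered, so the anonymization report for each delivered accession number is still worth checking.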