PHS offers a diverse collection of high-value datasets and partners with several data custodians around the world.
Effective population health research requires rich and diverse data with opportunities for linkage to social and environmental determinants of health and long-term follow-up. Stanford PHS offers access to a growing portfolio of population-level data to Stanford researchers and affiliates. These diverse data are a catalyst for transdisciplinary research and enable researchers to study a myriad of vital outcomes.
An overview of the datasets available at PHS is below.
DATASET | DATASET TYPE | SMALLEST GEO UNIT | SAMPLE SIZE | DATE RANGE |
American Family Cohort (AFC) | EMR - Primary Care | Census Block | 8 million | 2010 - 2024 |
MarketScan | Claims - Commercially Insured | Metropolitan Area | 149 million | 2006 - 2022 |
Medicare 20% RIF | Claims - Medicare | 9 digit zip | 11 million | 2006 - 2020 |
Medicaid 100% RIF | Claims - Medicaid | 5 digit zip | Over 100 million | 2011 - 2019 |
SEER and CA Cancer Registry - CMS linked data | SEER and CA Cancer Registry will do linkages with CMS | 5 digit zip | Varies | Varies |
Aarhus Danish Registers | National cohort, Surveys, Administrative data, Biologic samples | Census Block | 5 million | 1968 - 2020 |
Data Portal
The Data Core at the Stanford Center for Population Health Sciences offers researchers:
A central hub to efficiently access, link, visualize and analyze data from a wide variety of sources;
A library of data assets to facilitate transdisciplinary population health science projects and collaboration.
Powered by Redivis, our Data Portal includes tools optimized for large health datasets that can query billions of records in seconds. Because these are high-value health data, you will need to complete several requirements to ensure responsible use of sensitive data. You can read more about requirements in the access section of each dataset on the PHS Data Portal.
Getting started
On our Data Portal, you can apply for membership and access, explore datasets, and use the Redivis tool to identify your analytical sample. Once you have cut your analytical sample and familiarized yourself with the data, you can run your analyses in a variety of compliant, secure computational environments. We encourage you to use the Redivis Native Jupyter Notebooks. Please consult our PHS Documentation resources to get started.
New Data Portal Training
Learn how to utilize Redivis, a data platform used to store and query data on the PHS Data Portal, for every stage of your analytical workflow. This presentation showcases common methodologies in working with large claims datasets, including scalable cohort generation and analytical workflows in R, Python, Stata and SAS. The session concludes with an exploration of using modern ML techniques to classify patient notes and other unstructured data.
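To give a rough sense of what cohort generation from claims data involves, here is a minimal sketch in plain Python. It does not use the Redivis API or any PHS dataset schema: the `Claim` record, its field names, the `build_cohort` helper, and the sample rows are all hypothetical and exist only for illustration. It assumes a common rule of thumb in which a patient enters the cohort after a minimum number of claims carrying a qualifying diagnosis code within a study window.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Claim:
    # Hypothetical fields; real claims tables use dataset-specific names.
    patient_id: str
    service_date: date
    diagnosis_code: str


def build_cohort(claims, code_prefixes, start, end, min_claims=2):
    """Return sorted patient IDs with at least `min_claims` qualifying
    claims whose service date falls in [start, end] (inclusive)."""
    prefixes = tuple(code_prefixes)
    counts = {}
    for claim in claims:
        in_window = start <= claim.service_date <= end
        if in_window and claim.diagnosis_code.startswith(prefixes):
            counts[claim.patient_id] = counts.get(claim.patient_id, 0) + 1
    return sorted(pid for pid, n in counts.items() if n >= min_claims)


# Synthetic example rows (not real data).
claims = [
    Claim("p1", date(2019, 1, 5), "E11.9"),
    Claim("p1", date(2019, 6, 1), "E11.65"),
    Claim("p2", date(2019, 3, 2), "E11.9"),   # only one qualifying claim
    Claim("p3", date(2017, 3, 2), "E11.9"),   # outside the study window
    Claim("p3", date(2019, 3, 2), "I10"),     # non-qualifying code
    Claim("p4", date(2018, 3, 2), "E11.9"),
    Claim("p4", date(2020, 12, 31), "E11.65"),
]

# ICD-10-CM codes beginning with E11 denote type 2 diabetes mellitus.
cohort = build_cohort(claims, ["E11"], date(2018, 1, 1), date(2020, 12, 31))
print(cohort)
```

On tables with billions of rows, the same filter-then-count logic would normally be expressed as a query on the data platform rather than as an in-memory loop; the loop here only makes the inclusion rule explicit.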
Getting help
We offer several sources of support.
PHS data docs
Read all about working with PHS data from start to finish in this step-by-step guide, which includes more information about our systems and FAQs.
Slack user channel
Your second line of support is more interactive and great for quick questions that you can't resolve from the PHS data docs alone. You can also search the channels for your question, as it may have been asked before.
Office hours
Your third line of support is to schedule a meeting with us. We are happy to sit down with you for more complicated questions and issues that are best resolved in conversation.
We welcome your data questions and suggestions.