The Cancer Data Science (CDS) Shared Resources Core provides researchers with access to specialized oncology clinical databases, integrated clinical and molecular data, and advanced tools like large language models for efficient data extraction. CDS offers support for data analysis, cohort development, and manuscript review, along with subsidized technology assistance to enhance research capabilities and drive innovation.
Develop and Manage Central Oncology Research Databases
- Build and maintain specialized oncology databases, including the Stanford Cancer Institute Research Database (SCIRDB) on Google Cloud Platform, to advance STARR Oncology Research.
Create and Support Data Infrastructure for Comprehensive Cancer Research
- Design, develop, and maintain integrated disease-specific databases (e.g., hematology, bone marrow).
- Develop and expand Oncoshare models
- Oncoshare is a unique data linkage model for building integrated cancer research databases. It links electronic health records (EHRs) to a state cancer registry (part of the national SEER registry) to capture comprehensive tumor characteristics, first-course treatments, and long-term follow-up outcomes.
- Oncoshare models currently exist for breast and lung cancer. Stanford Cancer Institute investigators interested in using the curated Oncoshare databases can submit a data request form; the growing Oncoshare network is intended to strengthen collaboration among Stanford Cancer Institute members.
- We plan to expand the Oncoshare model to additional cancer types (e.g., prostate, colorectal, gastric, brain) to enable pilot studies and to support R01 submissions by Stanford Cancer Institute investigators.
- Integrate, improve, and maintain existing specialized databases (e.g., biospecimen, pathology, or radiation-related data) developed locally across different labs and departments by linking them to comprehensive clinical data within the core’s central infrastructure.
- Develop a unified data science platform that merges molecular and clinical data with bioinformatics pipelines, featuring analysis in R and Python and interactive dashboards in R Shiny.
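For readers unfamiliar with registry linkage, the idea behind an Oncoshare-style database can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the core's actual method: the record types, field names (`mrn`, `registry_id`, and so on), and the matching rule (normalized last name plus birth date) are all hypothetical, and the data are synthetic. Real EHR-to-registry linkage generally relies on more identifiers and often on probabilistic matching.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class EhrPatient:
    mrn: str                 # hypothetical EHR identifier
    last_name: str
    birth_date: date


@dataclass(frozen=True)
class RegistryCase:
    registry_id: str         # hypothetical registry identifier
    last_name: str
    birth_date: date
    stage: str
    first_course_treatment: str


def link_key(last_name: str, birth_date: date) -> Tuple[str, date]:
    """Build a simple deterministic match key (an assumption for this sketch)."""
    return (last_name.strip().lower(), birth_date)


def link_records(
    ehr_patients: List[EhrPatient], registry_cases: List[RegistryCase]
) -> Dict[str, List[RegistryCase]]:
    """Map each EHR patient's MRN to the registry cases sharing its match key."""
    index: Dict[Tuple[str, date], List[RegistryCase]] = {}
    for case in registry_cases:
        index.setdefault(link_key(case.last_name, case.birth_date), []).append(case)
    return {
        patient.mrn: index.get(link_key(patient.last_name, patient.birth_date), [])
        for patient in ehr_patients
    }


# Synthetic example data; no real patients.
ehr = [
    EhrPatient("MRN001", "Garcia", date(1960, 5, 17)),
    EhrPatient("MRN002", "Lee", date(1972, 11, 2)),
]
registry = [
    RegistryCase("REG-9", " garcia ", date(1960, 5, 17), "II", "surgery"),
]

linked = link_records(ehr, registry)
for mrn, cases in linked.items():
    print(mrn, [case.registry_id for case in cases])
```

Patients without a registry match keep an empty list, so unlinked records remain visible for review rather than being dropped silently.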
Integrate Large Language Models (LLMs) to Enhance Natural Language Processing and Curate Detailed Data from Free-Text Clinical Notes for Cancer Research
The Core will develop and deploy cutting-edge LLMs to curate detailed clinical variables (e.g., ECOG performance status, PD-L1, detailed smoking history, lines of therapy) from pathology, radiology, and progress notes.
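To make the shape of this task concrete, the sketch below turns free-text notes into structured rows for one variable, ECOG performance status. It is an assumption-laden illustration, not the Core's pipeline: the extractor here is a simple regular-expression baseline standing in for an LLM, the function names and output fields are hypothetical, and the notes are synthetic. The extractor is passed in as a plain callable so that an LLM-backed function with the same signature could be substituted.

```python
import re
from typing import Callable, Dict, List, Optional

# Rule-based stand-in, NOT an LLM: matches phrases such as "ECOG PS 2"
# or "ECOG performance status: 1" and captures a score from 0 to 5.
ECOG_PATTERN = re.compile(
    r"\bECOG(?:\s+(?:performance\s+status|PS))?\s*(?:of|is|:|=)?\s*([0-5])\b",
    re.IGNORECASE,
)


def extract_ecog_baseline(note_text: str) -> Optional[int]:
    """Return the first ECOG score found in the note, or None if absent."""
    match = ECOG_PATTERN.search(note_text)
    return int(match.group(1)) if match else None


def curate_variable(
    notes: Dict[str, str],
    variable: str,
    extractor: Callable[[str], Optional[int]],
) -> List[dict]:
    """Apply an extractor to every note and return one structured row per note."""
    return [
        {"note_id": note_id, "variable": variable, "value": extractor(text)}
        for note_id, text in notes.items()
    ]


# Synthetic notes; no real patient text.
notes = {
    "note-1": "Patient seen in clinic. ECOG performance status: 1.",
    "note-2": "ECOG PS 2, tolerating therapy.",
    "note-3": "No performance status documented.",
}

rows = curate_variable(notes, "ecog_performance_status", extract_ecog_baseline)
for row in rows:
    print(row)
```

Keeping missing values as `None` rather than guessing makes it easy to route undocumented cases to manual review.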
Requesting Specific Key Variables: Core members are encouraged to reach out for assistance in extracting specific key variables that are critical for their research. Please submit your requests using the following form.
We welcome contributions from core members who have developed new LLMs that can enhance our capabilities. If you have a model that you believe could be beneficial for large-scale extraction of variables, please contact us to discuss how it can be integrated into our processes.
Collaboration Opportunities: By collaborating with the core, members can leverage advanced LLM technologies to streamline data extraction processes, ultimately enhancing the quality and impact of their cancer research.
Access to Subsidized TDS Research Technology Support
- Stanford Cancer Institute members receive up to eight hours of subsidized support annually from the TDS Research Technology team.
- Services include cohort derivation, dataset provisioning, exploratory data analysis, manuscript review, and access to the SCI oncology diagnosis biostatistics console.