The Han Lab Research

Our research aims to develop innovative computational methods for assessing optimal screening and surveillance strategies for patients with complex diseases, particularly cancer. We focus on identifying the etiologic and causal factors that contribute to disease risk and progression. By emphasizing personalized screening, we create comprehensive risk prediction models that integrate genetic, environmental, and clinical risk factors, utilizing diverse data sources such as electronic health records, cancer registries, national claims databases, epidemiologic cohort data, and population-level genomic databases. Our lab encompasses various research areas, including statistical genetics, microsimulation modeling, molecular epidemiology, counterfactual analysis for health policy, machine learning, and risk prediction modeling for survival data. Ultimately, we aim to develop efficient computational models that integrate these factors to effectively stratify patient risks and enhance monitoring over time.

Our methodological interests span a variety of areas, including dynamic risk prediction modeling for time-to-event data (statistical learning and machine learning, including deep-learning approaches), the development of microsimulation techniques for health policy modeling, and statistical methods to identify gene-gene and gene-environment interactions using germline mutation data, as well as ligand-receptor interactions using spatial transcriptomic data. Additionally, we explore advanced machine learning methods for causal discovery, including a robust deep learning-based causal estimation framework that integrates local scores, conditional independence, and variable attributes in complex biological systems. More details on each domain are as follows:

Electronic Phenotyping, Data Integration, and LLM-based Pipeline:
We are developing a comprehensive electronic phenotyping platform that integrates electronic health record (EHR) data from both an academic health system (Stanford Health Care) and a community-based health system (Sutter Health) across Northern California, linked to the California Cancer Registry through the Oncoshare-Lung initiative. By consolidating diverse clinical sources, including structured diagnosis codes, laboratory results, and medication records as well as unstructured pathology and radiology reports, we have built a unified database that overcomes the fragmentation, missingness, and heterogeneity inherent to multi-system EHR environments. Building on this foundation, we have implemented a large language model (LLM)-based pipeline that extracts intricate cancer-related phenotypes, such as biomarkers, actionable tumor mutations, and progression events, from unstructured notes. This pipeline standardizes data across modalities, improves phenotype completeness and accuracy, and enables scalable phenotyping across large cohorts. Uniquely, we further enrich patient-level data by linking longitudinal geocoded address histories to external sources such as the U.S. Census, the American Community Survey (ACS), and Environmental Protection Agency (EPA) datasets. This allows us to incorporate neighborhood-level exposures, including air pollution, housing stability, and social determinants of health (SDOH), into the phenotyping process, offering a more holistic view of patients' clinical and environmental risk profiles. Our framework substantially expands the phenotyping landscape beyond conventional clinical variables and supports advanced cancer risk modeling and real-world evidence generation.

Dynamic risk prediction modeling for time-to-event data: Integrating longitudinal patient data from multiple sources, including tumor registries, geospatial data, treatment histories, medical records, and patient outcome surveys, forms the foundation of our dynamic risk prediction modeling efforts. We utilize statistical learning approaches, such as landmark supermodels, to capture time-dependent effects and longitudinal changes in risk factors, along with high-dimensional feature selection through regularization techniques. Additionally, we employ deep learning-based models that incorporate various embedding techniques to harmonize data that are irregularly measured across features and modalities, including imaging, text, procedures, diagnoses, and laboratory results within electronic health records (EHRs). By analyzing this complex and diverse dataset, we continuously update observed predictor trajectories, allowing for more accurate and timely risk assessments. Our efforts also include evaluating dynamic predictive performance using time-dependent metrics. Ultimately, this research aims to enhance our understanding of cancer dynamics and improve patient care through data-driven insights, enabling more personalized and effective treatment strategies.
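The data-construction step behind a landmark supermodel can be sketched as follows: at each landmark time s, patients still at risk are re-indexed to time t - s and administratively censored at the horizon s + w, producing a stacked dataset on which a model with landmark-dependent effects (e.g., a Cox supermodel) would then be fit. The toy cohort, landmark grid, and window below are illustrative values, not real data.

```python
# Build the stacked "super" dataset used by landmark supermodels:
# at each landmark time s, keep patients still at risk, reset the
# clock to t - s, and administratively censor at the horizon s + w.
def stack_landmarks(patients, landmarks, w):
    rows = []
    for s in landmarks:
        for pid, (t, event, cov) in patients.items():
            if t <= s:                      # not at risk at landmark s
                continue
            t_cens = min(t, s + w)          # administrative censoring
            e_cens = event if t <= s + w else 0
            rows.append({"id": pid, "landmark": s,
                         "time": t_cens - s, "event": e_cens,
                         "x": cov(s)})      # covariate value known at s
    return rows

# toy cohort: id -> (event/censoring time, event indicator,
#                    time-varying covariate as a function of landmark time)
patients = {
    1: (4.0, 1, lambda s: 0.2 + 0.1 * s),
    2: (9.0, 0, lambda s: 1.0),
    3: (2.5, 1, lambda s: 0.5),
}
stacked = stack_landmarks(patients, landmarks=[0.0, 2.0, 4.0], w=3.0)
for r in stacked:
    print(r)
```

Fitting the supermodel then amounts to a survival regression on this stacked dataset with smooth functions of the landmark time interacting with the covariates.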

Microsimulation modeling for counterfactual analysis and health policy: Developing a comprehensive microsimulation model to evaluate the effectiveness and cost-effectiveness of various cancer screening strategies is a key focus of our research. By simulating the natural history of cancer and the impact of screening interventions on population health, these models help inform evidence-based guidelines and policies. The goals of microsimulation include identifying optimal screening protocols, understanding the trade-offs between benefits and harms, and assessing the long-term outcomes of different screening approaches. The insights gained from such models have significant impacts on public health by guiding decision-making, improving cancer prevention strategies, and ultimately reducing cancer morbidity and mortality in the population.
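A minimal microsimulation of this kind might look as follows, assuming a toy healthy/preclinical/clinical natural history and a screening test that shifts detection earlier. All transition probabilities are placeholder values for illustration, not calibrated estimates.

```python
import random

# Minimal natural-history microsimulation: healthy -> preclinical ->
# clinical cancer -> cancer death. Annual screening detects preclinical
# disease with some sensitivity; early detection lowers the annual
# cancer-death hazard. All rates are illustrative placeholders.
def simulate_person(rng, screen, years=40,
                    p_onset=0.01, p_clinical=0.2, sens=0.8,
                    p_death_early=0.02, p_death_late=0.15):
    state = "healthy"
    for _ in range(years):
        if state == "healthy" and rng.random() < p_onset:
            state = "preclinical"
        elif state == "preclinical":
            if screen and rng.random() < sens:
                state = "detected_early"
            elif rng.random() < p_clinical:
                state = "clinical"
        elif state == "detected_early" and rng.random() < p_death_early:
            return "cancer_death"
        elif state == "clinical" and rng.random() < p_death_late:
            return "cancer_death"
    return "alive"

def mortality(screen, n=20000, seed=1):
    rng = random.Random(seed)
    deaths = sum(simulate_person(rng, screen) == "cancer_death"
                 for _ in range(n))
    return deaths / n

print(f"cancer mortality, no screening: {mortality(False):.3f}")
print(f"cancer mortality, screening:    {mortality(True):.3f}")
```

A policy-grade model adds competing mortality, costs, quality-of-life weights, and calibration to registry data, and compares candidate screening schedules counterfactually on the same simulated population.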

Statistical Methods for Detecting Biological Interactions: Gene-Gene, Gene-Environment, and Ligand-Receptor Dynamics: Our efforts to develop statistical methods for detecting biological interactions focus on both gene-gene and gene-environment interactions using germline mutation or genome-wide association study (GWAS) data, as well as ligand-receptor interactions through spatial transcriptomic data. For gene-gene and gene-environment interactions, we employ a unified framework that integrates a class of disease risk models to capture the joint effects of genes and environmental exposures. We consider joint-effect models beyond the traditional logit model, including an additive model, and leverage the gene-environment independence assumption to further improve statistical power, similar to case-only designs. To minimize false positives, we employ shrinkage empirical Bayes-type estimators and incorporate biologically plausible constraints, which also enhance the power of the tests. In the context of ligand-receptor interactions, we develop a graph-based cell-cell communication simulator for spatial transcriptomics data that integrates gene regulatory networks. Furthermore, we create tools for screening ligand-receptor interactions across multiple spatial transcriptomic samples using consensus metrics, enabling a comprehensive analysis of cellular communication dynamics. Collectively, these methods provide robust frameworks for understanding complex biological interactions and their implications for health and disease.
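The case-only idea mentioned above can be sketched directly: under gene-environment independence in the source population, the G-E odds ratio among cases estimates the multiplicative interaction odds ratio, with a standard Wald test on the log scale. The counts below are hypothetical.

```python
import math

# Case-only test for multiplicative gene-environment interaction:
# assuming G and E are independent in the source population, the G-E
# odds ratio among cases estimates the interaction odds ratio.
def case_only_interaction(a, b, c, d):
    """2x2 table among cases: rows = genotype (carrier/non-carrier),
    columns = exposure (exposed/unexposed).
    a = G+E+, b = G+E-, c = G-E+, d = G-E-."""
    or_ge = (a * d) / (b * c)                 # odds ratio among cases
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)     # SE of log odds ratio
    z = math.log(or_ge) / se                  # Wald statistic
    p = math.erfc(abs(z) / math.sqrt(2))      # two-sided normal p-value
    return or_ge, z, p

or_ge, z, p = case_only_interaction(a=40, b=60, c=100, d=300)
print(f"interaction OR = {or_ge:.2f}, z = {z:.2f}, p = {p:.4f}")
```

When the independence assumption is doubtful, shrinkage estimators of the kind described above trade off between this case-only estimate and the standard case-control interaction estimate.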

Causal estimation through machine learning-based approaches: As part of exploratory research efforts, we are developing a novel causal estimation framework that leverages the powerful classification capabilities of deep neural networks to identify causal patterns across diverse data types, addressing the limitations of traditional scoring techniques and independence tests. Our framework integrates multiple local causality estimation scores, independence tests, and variable attributes to capture a wide range of causal mechanisms. By providing a comprehensive and adaptable approach to causal relationship identification, our framework has the potential to advance research and enhance our understanding of complex biological systems.
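One local ingredient such a framework consumes is a conditional independence test. A minimal partial-correlation version with Fisher's z transform is sketched below on simulated data with a common-cause structure, where X and Y are marginally dependent but independent given Z; the data and structure are illustrative.

```python
import math
import random

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

def ci_test(x, y, z):
    """p-value for X independent of Y given a single conditioning
    variable Z, via partial correlation and Fisher's z transform."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    r = (rxy - rxz * ryz) / math.sqrt((1 - rxz**2) * (1 - ryz**2))
    n = len(x)
    fz = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - 4)
    return math.erfc(abs(fz) / math.sqrt(2))   # two-sided p-value

# common cause Z -> X and Z -> Y: X, Y dependent marginally,
# but conditionally independent given Z
rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(500)]
x = [v + rng.gauss(0, 1) for v in z]
y = [v + rng.gauss(0, 1) for v in z]
print(f"p-value for X indep. of Y given Z: {ci_test(x, y, z):.3f}")
```

In the framework described above, the outputs of many such local tests and scores become features for a neural classifier rather than hard accept/reject decisions.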

Want to learn more?

Read our publications

Research Resources