Current Research and Scholarly Interests
My laboratory develops innovative machine learning methods to predict and decode biological sequences, molecular interactions, and genetic variation. We have pioneered deep learning models and interpretation frameworks that decipher the DNA and RNA sequence syntax governing context-specific transcription factor binding, RNA binding protein interactions, chromatin accessibility, histone modifications, transcription initiation, gene expression, alternative polyadenylation, and RNA editing. Using these approaches, we have built regulatory models across thousands of cellular contexts in humans and mice, elucidating dynamic regulation during differentiation and cellular reprogramming. Our methodological contributions span mapping regulatory elements, deciphering the cis-regulatory code, modeling long-range regulatory interactions, and constructing predictive regulatory networks. We have also adapted protein language models to predict and design transcription factor effector domains, and we have developed machine learning frameworks that leverage T-cell and B-cell repertoire sequences for disease diagnostics.
I have extensive leadership experience in collaborative genomics consortia. As principal investigator, I led integrative analyses for the Encyclopedia of DNA Elements (ENCODE) consortium and the Roadmap Epigenomics Project. Currently, I serve as steering committee co-chair of the Impact of Genomic Variation on Function (IGVF) consortium and co-lead the Data Analysis and Coordination Center for the Multi-omics in Health and Disease (MOHD) consortium. My team has developed standardized processing and quality control pipelines for bulk and single-cell molecular profiling data across ENCODE, Roadmap, IGVF, and MOHD initiatives.
Translating our regulatory models to biomedical applications, we dissect functional genetic variation in rare and complex diseases using large biobanks and genome sequencing projects. Our disease-focused collaborations span colorectal cancer (GECCO and HTAN consortia), cardiometabolic disorders (AMP-CMD, CZI Seed networks), neurodegenerative diseases (ADSP consortium), and neuropsychiatric conditions (PsychENCODE consortium).
We have also developed widely used software tools and web portals for mining and visualizing large-scale regulatory genomics data, facilitating community access to our resources and findings.
I have mentored more than 35 graduate students and postdocs who have gone on to leadership positions in academia (faculty at Carnegie Mellon, Michigan State, and Memorial Sloan Kettering) and industry (Genentech, Illumina, and NVIDIA), reflecting our lab's commitment to training the next generation of computational biologists.