Interpretable Machine Learning for Variants in Disease

Ian Dunham (EMBL-EBI) & Anshul Kundaje (Stanford)

Background

Recent analyses have shown that drug targets whose therapeutic hypothesis is supported by genetic data are more likely to lead to drugs that advance through clinical trials. Human genetic data is therefore a valuable piece of causal evidence for improving the choice of targets that enter drug discovery.

Years of progress in human genetics and functional genomics have provided a rich resource of disease and trait associations from genome-wide association studies (GWAS), together with extensive data on gene regulatory elements in many cell types. Most variants underlying GWAS signals are not protein-coding changes; instead, they are presumed to affect regulatory elements and thereby alter gene expression. However, a fundamental problem remains: how to bring these data together effectively to identify the causal gene underlying each GWAS trait association.

Project

Machine learning approaches to decipher cell context-specific effects of human disease variation

The aim of this project is to develop methods that use neural network models of diverse regulatory profiling experiments to score and interpret the cell context-specific effects of common, rare and de novo variants.

Open Targets has developed a high-throughput approach to analysing GWAS data that identifies putative causal genes and nominates them as drug targets. In brief, Open Targets Genetics combines genetic fine mapping from GWAS summary statistics with colocalization analysis to identify causal variants, and then uses a machine learning (ML) model to implicate likely causal genes. The current locus-to-gene (L2G) model is trained on distance, variant effect predictions, quantitative gene expression variation (eQTLs) and functional genomics data, including promoter capture Hi-C. However, the model takes into account neither the likely tissue or cell type of interest for the disease nor the cellular origin of the functional genomics data.

The Kundaje lab has developed state-of-the-art deep learning models and interpretation frameworks that can map regulatory DNA sequence to regulatory profiling experiments at single-nucleotide resolution, dissect higher-order cis-regulatory sequence syntax, and predict the impact of mutations and variants in regulatory DNA sequence. These models can be trained on compendia of bulk and single-cell regulatory profiling experiments spanning diverse disease-relevant cell types and tissues. Such models could be used to derive highly context-specific variant effect scores for common, rare and de novo variants associated with disease. The goal of this project will be to design new approaches to integrate these scores into variant and target gene prioritization models in the context of Open Targets.
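
As an illustration only, a context-specific variant effect score can be thought of as the difference between a model's prediction for the alternate-allele sequence and its prediction for the reference sequence, computed separately for each cell context. The sketch below is not the Kundaje lab's models or the Open Targets pipeline: the toy motif-counting "models", the context names and the function names are all hypothetical stand-ins for trained neural networks.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a trained sequence model: maps a DNA sequence to a
# single predicted regulatory activity value for one cell context.
ContextModel = Callable[[str], float]


def motif_model(motif: str, weight: float) -> ContextModel:
    """Toy model (assumption): activity = weight * number of motif occurrences."""
    def predict(seq: str) -> float:
        k = len(motif)
        count = sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == motif)
        return weight * count
    return predict


def apply_variant(seq: str, pos: int, ref: str, alt: str) -> str:
    """Return seq with ref replaced by alt at 0-based position pos."""
    if seq[pos:pos + len(ref)] != ref:
        raise ValueError("reference allele does not match the sequence")
    return seq[:pos] + alt + seq[pos + len(ref):]


def variant_effect_scores(
    models: Dict[str, ContextModel], seq: str, pos: int, ref: str, alt: str
) -> Dict[str, float]:
    """Score a variant per context as prediction(alt) - prediction(ref)."""
    alt_seq = apply_variant(seq, pos, ref, alt)
    return {ctx: model(alt_seq) - model(seq) for ctx, model in models.items()}


# Two made-up cell contexts, each sensitive to a different sequence motif.
models = {
    "context_A": motif_model("GATA", 2.0),
    "context_B": motif_model("CACGTG", 1.5),
}
reference = "TTGATAACCACGTGTT"

# A T>C substitution at position 4 disrupts the GATA motif only.
scores = variant_effect_scores(models, reference, pos=4, ref="T", alt="C")
print(scores)
```

In this toy case the variant lowers predicted activity in context_A and leaves context_B unchanged, which is the kind of per-context signal that could, in principle, be passed as a feature to a gene prioritization model.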

References: