We study estimation of causal effects when the dependence of treatment assignments on unobserved confounding factors is bounded. First, we quantify bounds on the conditional average treatment effect under a bounded unobserved confounding model, first studied by Rosenbaum for the average treatment effect. Then, we propose a semi-parametric model to bound the average treatment effect and provide a corresponding inferential procedure, allowing us to derive confidence intervals for the true average treatment effect. Our semi-parametric method extends Chernozhukov et al.'s double machine learning method for the average treatment effect, which assumes all confounding variables are observed. As a result, our method accommodates problems with covariates of higher dimension than traditional sensitivity analyses (e.g., covariate matching) allow. We complement our methodological development with optimality results showing that in certain cases, our proposed bounds are tight. In addition to our theoretical results, we perform simulation and real-data analyses to investigate the performance of the proposed method, demonstrating the accuracy of the new confidence intervals in practical finite-sample regimes.
Droplet single-cell RNA-sequencing (dscRNA-seq) has enabled rapid, massively parallel profiling of transcriptomes. However, assessing differential expression across multiple individuals has been hampered by inefficient sample processing and technical batch effects. Here we describe a computational tool, demuxlet, that harnesses natural genetic variation to determine the sample identity of each droplet containing a single cell (singlet) and detect droplets containing two cells (doublets). These capabilities enable multiplexed dscRNA-seq experiments in which cells from unrelated individuals are pooled and captured at higher throughput than in standard workflows. Using simulated data, we show that 50 single-nucleotide polymorphisms (SNPs) per cell are sufficient to assign 97% of singlets and identify 92% of doublets in pools of up to 64 individuals. Given genotyping data for each of eight pooled samples, demuxlet correctly recovers the sample identity of >99% of singlets and identifies doublets at rates consistent with previous estimates. We apply demuxlet to assess cell-type-specific changes in gene expression in 8 pooled lupus patient samples treated with interferon (IFN)-β and perform eQTL analysis on 23 pooled samples.
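To make the core idea concrete, here is a minimal, hypothetical sketch of likelihood-based droplet demultiplexing, not demuxlet's actual model: each droplet's SNP-overlapping reads are scored against every pooled sample's genotypes, and against equal-mixture doublet hypotheses, with the best-scoring hypothesis determining the call. The error rate, genotypes, and reads below are illustrative placeholders.

```python
import numpy as np
from itertools import combinations

def allele_prob(genotype, err=0.01):
    # P(ALT base observed | genotype); genotype = ALT allele count in {0, 1, 2}
    return {0: err, 1: 0.5, 2: 1.0 - err}[genotype]

def droplet_loglik(reads, geno_a, geno_b=None):
    """Log-likelihood of a droplet's SNP-overlapping reads.
    reads: list of (snp_index, is_alt) pairs.
    geno_a / geno_b: arrays of 0/1/2 genotypes; if geno_b is given,
    the droplet is modeled as an equal doublet mixture of two samples."""
    ll = 0.0
    for snp, is_alt in reads:
        p = allele_prob(geno_a[snp])
        if geno_b is not None:
            p = 0.5 * p + 0.5 * allele_prob(geno_b[snp])
        ll += np.log(p if is_alt else 1.0 - p)
    return ll

def demultiplex(reads, genotypes):
    # Score every singlet and doublet hypothesis; return the best one.
    hypotheses = {(s,): droplet_loglik(reads, g) for s, g in genotypes.items()}
    for (s1, g1), (s2, g2) in combinations(genotypes.items(), 2):
        hypotheses[(s1, s2)] = droplet_loglik(reads, g1, g2)
    return max(hypotheses.items(), key=lambda kv: kv[1])

# Toy example: two pooled samples, one droplet with five informative reads
genotypes = {"sampleA": np.array([0, 2, 1, 0]), "sampleB": np.array([2, 0, 1, 2])}
reads = [(0, 1), (0, 1), (1, 0), (3, 1), (3, 1)]   # (snp, is_alt)
print(demultiplex(reads, genotypes))               # -> (('sampleB',), loglik)
```

In this toy run the singlet hypothesis for sampleB dominates both the other singlet and the doublet mixture, mirroring how consistency with one genotype set identifies a singlet while intermediate allele fractions flag doublets.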
The cost of developing a new drug has been increasing dramatically over the last forty years. Many factors contribute to this; the major challenge is that easy-to-solve diseases have already been tackled, and more advanced technologies and scientific breakthroughs are now needed to treat the diseases for which there is high medical need. In this talk, I'll showcase how we are using human genetic data from large-scale studies to identify opportunities to validate and repurpose already existing drugs.
As data becomes the fuel driving technological and economic growth, a fundamental challenge is how to quantify the value of data in algorithmic predictions and decisions. For example, Gov. Newsom recently proposed a "data dividend" whereby consumers are compensated by companies for the data they generate. In this work, we develop a principled framework to address data valuation in the context of supervised machine learning. Given a learning algorithm trained on n data points to produce a predictor, we propose data Shapley as a metric to quantify the value of each training datum to the predictor's performance. Data Shapley uniquely satisfies several natural properties of equitable data valuation. We develop Monte Carlo and gradient-based methods to efficiently estimate data Shapley values in practical settings where complex learning algorithms, including neural networks, are trained on large datasets. In addition to being equitable, our experiments across biomedical, image, and synthetic data demonstrate that data Shapley has several other benefits: 1) it gives actionable insights into what types of data benefit or harm the prediction model; 2) weighting training data by Shapley value improves domain adaptation. This is joint work with Amirata Ghorbani.
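As a rough illustration of the Monte Carlo (permutation-sampling) estimator described above, the sketch below scores each training point by its average marginal contribution to validation accuracy over random orderings. The model, data, and the chance-level baseline for the empty set are illustrative assumptions, and the truncation trick used in practice is omitted for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

def perf(model, X_tr, y_tr, X_val, y_val):
    # Performance metric V(S): validation accuracy of the model trained on S
    if len(np.unique(y_tr)) < 2:
        return 0.5                       # degenerate subset: score as chance
    return model.fit(X_tr, y_tr).score(X_val, y_val)

def monte_carlo_shapley(X_tr, y_tr, X_val, y_val, n_perm=50, seed=0):
    """Permutation-sampling estimate of data Shapley values:
    phi_i = average, over random orderings, of the marginal gain in
    performance from adding point i to the points preceding it."""
    rng = np.random.default_rng(seed)
    n = len(y_tr)
    phi = np.zeros(n)
    model = LogisticRegression(max_iter=1000)
    for _ in range(n_perm):
        order = rng.permutation(n)
        prev = 0.5                       # assumed score of the empty set
        for k, i in enumerate(order, start=1):
            cur = perf(model, X_tr[order[:k]], y_tr[order[:k]], X_val, y_val)
            phi[i] += cur - prev
            prev = cur
    return phi / n_perm

X, y = make_classification(n_samples=80, n_features=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)
values = monte_carlo_shapley(X_tr, y_tr, X_val, y_val)
print(values.round(3))                   # high values = most helpful points
```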
In the second part, Abubakar Abid, my PhD student, will present Gradio, a new framework to efficiently share and test ML models in the wild.
We are using functional genomics (i.e., gene expression, methylation, RNA-seq) to identify large-effect rare variants that influence human traits. These variants would be challenging to identify through genome-wide association studies (GWAS) due to the low numbers of observations of each allele and the large number of variants. Our general approach is to look at functional genomics outliers: we identify rare variants in the specific genes and individuals showing outlier molecular levels and use them as a candidate subset for trait association testing, thereby reducing the multiple-testing burden. This approach has been promising, but its application remains ad hoc. When is regression sufficient? When should one consider using molecular outliers? What constitutes an outlier? How do we effectively integrate across multiple layers of molecular data? Let’s discuss!
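A minimal sketch of the candidate-subset idea, with simulated placeholder data; the z-score threshold and data structures below are illustrative assumptions, not a prescription:

```python
import numpy as np
from scipy import stats

# Hypothetical setup: expr[i, g] = expression of gene g in individual i,
# rare[i, g] = True if individual i carries a rare variant near gene g.
rng = np.random.default_rng(0)
expr = rng.normal(size=(500, 1000))
rare = rng.random((500, 1000)) < 0.01

z = stats.zscore(expr, axis=0)             # per-gene expression z-scores
outlier = np.abs(z) > 3                    # flag molecular outliers
candidates = np.argwhere(outlier & rare)   # (individual, gene) pairs whose
print(len(candidates))                     # rare variants go on to trait tests
```

Only the (individual, gene) pairs where a rare variant coincides with a molecular outlier proceed to association testing, which is what shrinks the multiple-testing burden.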
Health is a fundamental human need, and a long and healthy life is one of the primary subjects of human health research. However, it is difficult to accurately assess health status at a very early stage with the aim of determining appropriate interventions to maintain good health and wellbeing. It is therefore essential to optimize human health management policies and assess the risk factors associated with health status. Human health management is the process and means of health risk factor monitoring, prognostics, intervention, and control, based on our knowledge of human health and prevention, using linked non-clinical and clinical data. Symptoms that could indicate potential advanced or chronic disease can often be ignored or missed, leading to serious delays in clinical diagnosis and timely treatment intervention. This in turn increases medical treatment costs as well as the patient’s physical, mental, and financial burden.
Our study aims to develop a systematic approach that integrates statistical and artificial-intelligence-based health big-data modeling into optimal health management decision-making with a mobile application. Building on statistical modeling methods for health big data on early diagnosis, prevention, and intervention, we are developing a multi-stage delay-time model to investigate risk factors and predict health status at an earlier stage of disease/illness progression using linked clinical and non-clinical data. In this talk, we will present our recent research outcomes and discuss the challenges for future study.
Over the past decade, genome-wide association studies (GWAS) have found thousands of variants associated with hundreds of phenotypes. The conventional GWAS approach evaluates each variant individually; however, individual variants almost always have a small effect on a given phenotype. To address this limitation, one can instead combine variants into a polygenic risk score (PRS), which can be more strongly associated with phenotypes. This suggests that a PRS may be useful for predicting phenotypes, for example, indicating which individuals are at a substantially increased risk of cancer and should undergo more active screening. While there is much hope and excitement surrounding the potential use of PRS, there also remain a number of unanswered questions and concerns with this approach. I will consider some of the critical issues, including how best to construct PRS from GWAS data, with an application to multiple different cancers in the UK Biobank cohort.
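In its simplest form, a PRS is just a weighted allele count, as in the toy sketch below; real construction (variant selection, LD clumping, threshold tuning) is considerably more involved and is part of what the talk addresses. The genotype matrix and weights here are simulated placeholders.

```python
import numpy as np

# Hypothetical inputs: G[i, j] = count of effect alleles (0/1/2) for person i
# at variant j, and w[j] = per-allele effect size (e.g., a GWAS log odds ratio).
rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(5, 100))      # 5 individuals, 100 variants
w = rng.normal(0.0, 0.05, size=100)        # toy effect-size estimates

prs = G @ w                                # PRS_i = sum_j w_j * G_ij
print(np.argsort(prs)[::-1])               # individuals ranked by polygenic risk
```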
We present a new method for design problems wherein the goal is to maximize or specify the value of one or more properties of interest. For example, in protein design, one may wish to find the protein sequence that maximizes fluorescence. We assume access to one or more, potentially black-box, stochastic "oracle" predictive functions, each of which maps from the input design space (e.g., protein sequences) to a distribution over a property of interest (e.g., protein fluorescence). At first glance, this problem can be framed as one of optimizing the oracle(s) with respect to the input. However, many state-of-the-art predictive models, such as neural networks, are known to suffer from pathologies, especially for data far from the training distribution. Thus we need to modulate the optimization of the oracle inputs with prior knowledge about what makes 'realistic' inputs (e.g., proteins that stably fold). Herein, we propose a new method to solve this problem, Conditioning by Adaptive Sampling, which yields state-of-the-art results on a simulated protein fluorescence problem, as compared to other recently published approaches. Formally, our method achieves its success by using model-based adaptive sampling to estimate the conditional distribution of the input sequences given the desired properties.
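The sketch below conveys the flavor of model-based adaptive sampling with a simple Gaussian search model refit to elite samples each round. It is a cross-entropy-style simplification that omits the conditioning on a prior generative model which distinguishes the actual Conditioning by Adaptive Sampling method; the oracle, dimensions, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def oracle(x):
    # Stand-in stochastic "oracle": noisy property prediction for inputs x
    return -np.sum((x - 2.0) ** 2, axis=1) + np.random.normal(0, 0.1, len(x))

def adaptive_sampling(dim=5, n=500, iters=30, quantile=0.9):
    """Iteratively refit a Gaussian search model to the elite samples,
    steering it toward the conditional distribution of high-scoring inputs."""
    mu, sigma = np.zeros(dim), np.ones(dim)             # initial search model
    for _ in range(iters):
        x = np.random.normal(mu, sigma, size=(n, dim))  # sample candidates
        scores = oracle(x)
        elite = x[scores >= np.quantile(scores, quantile)]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mu

print(adaptive_sampling().round(2))   # converges near the optimum at x = 2
```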
23andMe’s mission is to help people access, understand, and benefit from the human genome. In this talk, I will provide an overview of research studies conducted at 23andMe, and outline how we engage our customers in scientific research via the 23andMe service. Using this approach, 23andMe has developed the world’s largest consented, re-contactable database for genetic research, with more than 5 million customers, a research consent rate over 80%, and over one billion phenotypic data points. I will discuss how the 23andMe Research team has leveraged this database to drive scientific discovery that can lead to novel therapies offering benefits for patients.
Recent years have witnessed increasing empirical successes in reinforcement learning. However, many statistical questions about reinforcement learning are not well understood even in the most basic setting. For example, how many sample transitions are necessary and sufficient for estimating a near-optimal policy for a Markov decision problem (MDP)? In the first part, we survey recent advances on the methods and complexity for MDPs with finitely many states and actions, the most basic model for reinforcement learning. In the second part, we study the statistical state compression of general finite-state Markov processes. We propose a spectral state compression method for learning state features and aggregation structures from data. The state compression method is able to “sketch” a black-box Markov process from its empirical data, for which both minimax statistical guarantees and scalable computational tools are provided. In the third part, we propose a bilinear primal-dual pi-learning method for learning the optimal policy of an MDP, which utilizes given state features. The method is motivated by a saddle-point formulation of the Bellman equation. Its sample complexity depends only on the number of parameters and is invariant with respect to the dimension of the problem, making high-dimensional reinforcement learning possible using “small” data.
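As a bare-bones illustration of the state-compression idea in the second part (not the paper's exact estimator or its minimax guarantees), the sketch below estimates a transition matrix from a sample path, forms low-dimensional state features from its leading singular vectors, and clusters states into aggregated blocks. The toy chain and all parameters are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_compress(trajectory, n_states, rank):
    """Estimate the transition matrix from a sample path, take its rank-r
    SVD, and cluster states by their leading (scaled) singular vectors."""
    counts = np.zeros((n_states, n_states))
    for s, t in zip(trajectory[:-1], trajectory[1:]):
        counts[s, t] += 1
    P_hat = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    U, S, Vt = np.linalg.svd(P_hat)
    features = U[:, :rank] * S[:rank]          # low-dimensional state features
    labels = KMeans(n_clusters=rank, n_init=10).fit_predict(features)
    return features, labels

# Toy chain with two lumpable blocks of states, {0,1,2} and {3,4,5}
P = np.full((6, 6), 0.02)
P[:3, :3] += 0.3; P[3:, 3:] += 0.3
P /= P.sum(axis=1, keepdims=True)
rng = np.random.default_rng(0)
path = [0]
for _ in range(20000):
    path.append(rng.choice(6, p=P[path[-1]]))
print(spectral_compress(np.array(path), 6, 2)[1])   # two blocks recovered
                                                    # (label order arbitrary)
```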
Randomized controlled trials (RCTs) are an indispensable source of information about the efficacy of treatments in almost any disease area. With the availability of multiple treatment options, comparative effectiveness research (CER) is gaining importance for better and more informed health care decisions. However, the design and analysis of an effectiveness trial are much more complex than those of an efficacy trial. The effect of including one or more active comparator arms in an RCT is immense, giving rise to superiority and non-inferiority trials. The non-inferiority (NI) RCT design plays a fundamental role in CER and will also be a focus of this talk. In the past decade many statistical methods have been developed, though largely in the frequentist setup. However, historical placebo-controlled trials, if integrated into the current NI trial design, can provide better precision for CER; this may reduce the sample size burden and significantly improve statistical power in the current trial. The Bayesian paradigm provides a natural path to integrate historical as well as current trial data via sequential learning in the NI setup. In this talk we will discuss both fraction-margin and fixed-margin based Bayesian approaches for three-arm NI trials. We will also discuss some interesting open problems related to CER using NI trials.
Tissue composition is a major determinant of phenotypic variation and a key factor influencing disease outcomes. Although single-cell RNA sequencing has emerged as a powerful technique for characterizing cellular heterogeneity, it is currently impractical for large sample cohorts and cannot be applied to fixed specimens collected as part of routine clinical care. Over the last decade, a number of computational techniques have been described for dissecting cellular content directly from genomic profiles of mixture samples. In this talk, I will review key computational and statistical considerations for “digital cytometry” applications. I will also discuss basic and translational efforts from our group to leverage cell signatures derived from diverse sources, including single-cell reference profiles, to infer cell type abundance and cell type-specific gene expression profiles from bulk tissue transcriptomes. Digital cytometry has the potential to augment single cell profiling efforts, enabling cost-effective, high throughput tissue characterization without the need for antibodies, disaggregation, or viable cells.
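The core deconvolution step behind “digital cytometry” can be illustrated with a simple non-negative least-squares baseline: model the bulk profile as a signature matrix times unknown cell-type fractions. The methods discussed in the talk use considerably more sophisticated machinery; the signature matrix and mixture below are simulated placeholders.

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical signature matrix S (genes x cell types) and a bulk profile b
rng = np.random.default_rng(0)
S = rng.gamma(2.0, 1.0, size=(500, 4))            # 500 genes, 4 cell types
true_frac = np.array([0.5, 0.3, 0.15, 0.05])
b = S @ true_frac + rng.normal(0, 0.05, 500)      # noisy mixture profile

coef, _ = nnls(S, b)                              # non-negative least squares
frac = coef / coef.sum()                          # normalize to proportions
print(frac.round(3))                              # close to true_frac
```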
Across modern biology, open-source scientific software is increasingly critical for progress. At the Chan Zuckerberg Initiative, we are supporting this ecosystem through both grantmaking and software development. I'll describe specific ongoing efforts for the storage, analysis, and visualization of cell biology sequencing and imaging data. I'll also highlight our broader ideas for building and supporting data sharing and open ecosystems for computational biology more generally.
The HIV Prevention Trials Network (HPTN) 052 study is a Phase III, controlled, randomized clinical trial to assess the effectiveness of immediate versus delayed antiretroviral therapy strategies on the sexual transmission of HIV-1 (Cohen et al., 2016). It was selected by Science magazine as the 2011 Breakthrough of the Year (Alberts, 2011). In this talk, we will focus on the design and methods that underlie this landmark study in HIV treatment-as-prevention, and discuss the lessons that we have learned for future prevention research. References: Alberts, B (2011) Science breakthroughs, Science, 334: 1604; Cohen, MS, Chen, YQ, McCauley, M, et al. (2016) Antiretroviral therapy for the prevention of HIV transmission. New England Journal of Medicine, 375: 830-839.
Recently, there has been a surge in applying statistical learning methods in healthcare to build models for predicting adverse outcomes from patient covariates. These predictions are then used to optimize the allocation of scarce resources or treatment decisions. However, when the treatments are new, decision-making should optimize a trade-off between two objectives: (1) learning decision outcomes as functions of individual-specific covariates (exploration) and (2) maximizing the benefit of the decisions. The current literature on this problem, the theory of contextual multi-armed bandits, focuses on algorithms that rely on forced exploration to address this trade-off. However, forced exploration can be considered costly or unethical in certain decision-making tasks (e.g., hospital quality improvement initiatives). In this talk, we first introduce an algorithm that leverages free exploration from patient covariates and achieves rate-optimal performance. We also show, empirically, that the algorithm significantly reduces exploration compared to existing benchmarks. Next, we focus on settings where past data on decision outcomes is available. Motivated by recent literature on low-rank matrix estimation, we design algorithms that avoid unnecessary exploration by targeting the learning towards shared similarities among decisions or patients. We then demonstrate the performance of the proposed methods by estimating the personalized effect of a glucose inhibitor drug (Metformin) for pre-diabetic treatment.
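To fix ideas, below is a generic greedy linear contextual bandit sketch: each arm's reward model is updated online via ridge regression, and the algorithm always pulls the arm with the highest predicted reward, so exploration arrives “for free” from the diversity of patient covariates. This illustrates the phenomenon rather than the talk's specific algorithm or its rate-optimality analysis; all names and parameters are assumptions.

```python
import numpy as np

def greedy_bandit(contexts, reward_fn, n_arms, reg=1.0):
    """Greedy linear contextual bandit: fit a ridge model per arm online
    and always pull the arm with the highest predicted reward."""
    d = contexts.shape[1]
    A = [reg * np.eye(d) for _ in range(n_arms)]   # per-arm Gram matrices
    b = [np.zeros(d) for _ in range(n_arms)]
    rewards = []
    for t, x in enumerate(contexts):
        theta = [np.linalg.solve(A[a], b[a]) for a in range(n_arms)]
        # a few forced initial pulls to seed each arm, then purely greedy
        a = int(np.argmax([x @ th for th in theta])) if t >= n_arms else t % n_arms
        r = reward_fn(x, a)
        A[a] += np.outer(x, x)
        b[a] += r * x
        rewards.append(r)
    return np.array(rewards)

rng = np.random.default_rng(1)
true_theta = rng.normal(size=(3, 8))               # 3 arms, 8 covariates
X = rng.normal(size=(2000, 8))                     # diverse patient contexts
reward = lambda x, a: true_theta[a] @ x + rng.normal(0, 0.5)
print(greedy_bandit(X, reward, 3).mean().round(3)) # average realized reward
```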
Rapid advances in genomic technologies have led to a wealth of diverse data, from which novel discoveries can be gleaned through the application of robust statistical and computational methods. Here we describe GeneFishing, a computational approach to reconstruct context-specific portraits of biological processes by leveraging gene-gene co-expression information. GeneFishing incorporates multiple high-dimensional statistical ideas, including dimensionality reduction, clustering, subsampling, and results aggregation, to produce robust results. To illustrate the power of our method, we applied it using 21 genes involved in cholesterol metabolism as “bait” to “fish out” (or identify) genes not previously identified as being connected to cholesterol metabolism. Using simulated and real datasets, we found the results obtained through GeneFishing were more interesting for our study than those provided by related gene-prioritization methods. In particular, application of GeneFishing to the GTEx liver RNA-seq data not only re-identified many known cholesterol-related genes, but also pointed to glyoxalase I (GLO1) as a novel gene implicated in cholesterol metabolism. In a follow-up experiment, we found that GLO1 knockdown in human hepatoma cell lines increased levels of cellular cholesterol ester, validating a role for GLO1 in cholesterol metabolism. In addition, we performed a pan-tissue analysis by applying GeneFishing to various tissues and identified many potential tissue-specific cholesterol-metabolism-related genes. GeneFishing appears to be a powerful tool for identifying novel components of complex biological systems and may be employed across a wide range of applications.
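A schematic sketch of the subsample-cluster-aggregate loop described above; this is a simplification, not the published implementation, and the clustering choices and toy data are assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def gene_fishing(corr, bait_idx, cand_idx, n_rounds=100, pool_size=100, seed=0):
    """Repeatedly pool the bait genes with a random subsample of candidates,
    cluster the co-expression submatrix, and record how often each candidate
    lands in the bait-dominated cluster ('capture frequency')."""
    rng = np.random.default_rng(seed)
    caught = np.zeros(len(cand_idx)); tried = np.zeros(len(cand_idx))
    for _ in range(n_rounds):
        pick = rng.choice(len(cand_idx), size=pool_size, replace=False)
        idx = np.concatenate([bait_idx, np.asarray(cand_idx)[pick]])
        sub = np.abs(corr[np.ix_(idx, idx)])           # affinity submatrix
        labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                                    random_state=0).fit_predict(sub)
        bait_cluster = np.bincount(labels[:len(bait_idx)]).argmax()
        caught[pick] += labels[len(bait_idx):] == bait_cluster
        tried[pick] += 1
    return caught / np.maximum(tried, 1)               # capture frequency rate

# Toy data: genes 0-39 share a latent factor; bait = genes 0-19
rng = np.random.default_rng(1)
expr = rng.normal(size=(200, 500))                     # 200 samples, 500 genes
expr[:, :40] += rng.normal(size=(200, 1))
corr = np.corrcoef(expr.T)
rates = gene_fishing(corr, np.arange(20), np.arange(20, 500))
print(np.argsort(rates)[::-1][:10] + 20)               # top "fished-out" genes
```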
With the development of rapid, low-cost and readily available sequencing technologies, there is a need for quantitative methods to help interpret sequence datasets and relate them to the dynamics of biological systems. Trees (in the sense of graphs with no cycles) are a mainstay of how we represent and understand sequence data. I will introduce several flavours of trees with their motivating applications, and will describe a metric -- in the sense of a true distance function -- on unlabelled binary trees; this metric is derived from polynomials on the unlabelled trees. In the second part of the talk I will describe inference tools using trees, in the context of infectious disease: we use a mapping between phylogenetic trees and transmission trees to construct a Bayesian MCMC approach to estimate who infected whom and when. I will describe extensions of this inference approach to simultaneous reconstruction of outbreaks in different clusters, and conclude with a description of open problems and challenges in this area.
Brain functional connectivity maps the intrinsic functional architecture of the brain through correlations in neurophysiological measures of brain activity. Accumulating evidence suggests that it holds crucial insights into the pathologies of a wide range of neurological disorders. Brain functional connectivity analysis is at the foreground of neuroscience research and is drawing increasing attention in the statistics field as well. A connectivity network is characterized by a graph, where nodes represent brain regions and links represent statistical dependence that is often encoded by partial correlation. Such a graph is inferred from matrix-valued neuroimaging data such as electroencephalography and functional magnetic resonance imaging. In this talk, we examine a number of statistical problems arising in brain connectivity analysis, including multi-graph penalized estimation, graph-based hypothesis testing, and dynamic connectivity network modeling.
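The single-network building block can be sketched with off-the-shelf tools: estimate a sparse precision matrix via the graphical lasso and convert it to partial correlations, whose nonzero entries define the edges. The talk's multi-graph estimation, testing, and dynamic modeling go well beyond this baseline; the simulated signals below are placeholders for region-level time series.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

# Hypothetical input: X[t, r] = fMRI signal at time t for brain region r
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
X[:, 1] += 0.7 * X[:, 0]; X[:, 2] += 0.7 * X[:, 1]    # induce dependence

model = GraphicalLassoCV().fit(X)
Theta = model.precision_                              # sparse precision matrix
d = np.sqrt(np.diag(Theta))
partial_corr = -Theta / np.outer(d, d)                # partial correlations
np.fill_diagonal(partial_corr, 1.0)

# Edge (i, j) is present when the partial correlation is nonzero
edges = [(i, j) for i in range(10) for j in range(i + 1, 10)
         if abs(partial_corr[i, j]) > 1e-4]
print(edges)    # expect links 0-1 and 1-2 among the recovered edges
```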
In this talk, I will first present a nonparametric time-varying coefficient model for the analysis of panel count data. We extend the traditional panel count data models by incorporating B-spline estimates of time-varying coefficients. We show that the proposed model can be implemented using a nonparametric maximum pseudo-likelihood method. We further examine the theoretical properties of the estimators of the model parameters. The operational characteristics of the proposed method are evaluated through a simulation study. For illustration, we analyze data from a study of childhood wheezing, and describe the time-varying effect of an inflammatory marker on the risk of wheezing. I will also present some collaborative research work as a biostatistician in China.
As genetic technology advances, much more information about each individual is collected in clinical trials, and the ideal of personalized or precision medicine has become an important consideration in drug development. How can we identify the right drug for the right population based on the growing body of information (genetic, disease history, demographic, etc.)? Gene signature development is a critical component in addressing this question. Many statistical methods, such as ridge regression, elastic net, classification trees, and machine learning, have been used for this purpose, but reproducibility of the findings remains a major challenge. On another front, the traditional clinical trial design is inefficient in the sense that one trial only tests one drug in one population. More advanced designs, such as population adaptation, basket trials, and perpetual platform designs, are needed to test multiple drugs in multiple populations simultaneously to advance precision medicine. These challenges make statistics and statisticians more critical in drug development. This presentation will introduce the challenges in modern drug development with some solutions and many unsolved questions.
The false discovery rate (FDR) is a popular error criterion for large-scale multiple testing problems. A notable pitfall of the FDR is that filtering (i.e. subsetting) the rejection set post hoc might invalidate the FDR guarantee. In some applied settings, however, filtering is standard practice. For example, post hoc filtering is often employed in gene ontology enrichment analysis (where hypotheses have a directed acyclic graph structure) to remove redundancy among the set of rejected hypotheses (for example, via the REVIGO software). We propose Focused BH, a filter-aware extension of the BH procedure. Assuming the filter can be specified in advance, Focused BH takes as input this filter as well as a set of p-values and outputs a rejection set. This rejection set, when filtered, provably controls the FDR. Existing domain-specific filters can be easily integrated into Focused BH, allowing scientists to continue the practice of filtering without sacrificing rigorous Type I error control.
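One plausible reading of the procedure, sketched below: a BH-style step-up rule in which the step-up count is the size of the filtered rejection set rather than the raw one. This is a schematic consistent with the abstract, not necessarily the paper's exact algorithm or conditions; the toy grouping filter is an assumption standing in for domain-specific filters like REVIGO.

```python
import numpy as np

def focused_bh(pvals, filt, q=0.1):
    """Filter-aware BH step-up sketch: find the largest k such that the
    *filtered* rejection set at threshold q*k/m still has at least k
    hypotheses, then report that filtered set. `filt` maps a set of
    hypothesis indices to its filtered subset and is fixed in advance."""
    m = len(pvals)
    for k in range(m, 0, -1):
        R = set(map(int, np.flatnonzero(pvals <= q * k / m)))
        if len(filt(R)) >= k:
            return filt(R)
    return set()

# Toy redundancy filter: keep only the first member of each group
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])
def one_per_group(R):
    best = {}
    for i in sorted(R):                   # lowest index = smallest p here
        best.setdefault(groups[i], i)
    return set(best.values())

pvals = np.array([0.001, 0.002, 0.004, 0.8, 0.015, 0.9, 0.02, 0.95])
print(sorted(focused_bh(pvals, one_per_group, q=0.1)))   # -> [0, 2, 4, 6]
```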
Stochastic processes are widely used to model the dynamics of biological processes evolving on networks. Complexity reduction for such models aims to capture the essential dynamics of the process via a simpler representation, with minimal loss of accuracy. The stochastic shielding approximation is a novel dimension reduction method that has been used to simplify stochastic network models arising in neuroscience, such as randomly gated ion channel models, but it applies broadly to many biological systems. In this talk, I will describe the stochastic shielding approximation and our related edge importance measure, which allows us to rank each noise source according to its contribution to the observed variability. The approximation works by replacing the lowest-ranked stochastic transitions with deterministic ones, and does not significantly affect the variability of the observed variables. I will also explore the robustness of the method under conditions of timescale separation and population sparsity.
For the past 15 years, with colleagues at Kaiser Permanente Northern California Division of Research, we have developed the Research Program on Genes, Environment and Health. Starting in 2009, we performed genome-wide genotyping on over 100,000 KPNC members. These individuals have been KPNC members for over 23 years, on average. During the same period of time, KPNC has employed comprehensive electronic health records, which we have linked to the genetic data for a variety of studies. These data can be used to address a broad array of questions of interest in genetic epidemiology, such as: population demographics and the relationship between self-identified race/ethnicity and genetic ancestry; genetic ancestry and disease prevalence; heritability of diseases and risk factors; gene discovery; gene characterization; and pharmacogenetics. Examples of each will be provided.
Molecular biology is now a leading example of a data-intensive science, with both pragmatic and theoretical challenges raised by the volume and dimensionality of the data. These changes are present in both “large-scale” consortium science and small-scale science, and now across a broad range of applications, from human health through to agriculture and ecosystems. All of molecular life science is feeling this effect.
As molecular techniques (from genomics through transcriptomics and metabolomics) drop in price and turnaround time, there is a wealth of opportunity for clinical research and, in some cases, active changes to clinical practice even at this early stage. The development of this work requires interdisciplinary teams spanning basic research, bioinformatics, and clinical expertise.
This shift in modality is creating a wealth of new opportunities and has some accompanying challenges. In particular there is a continued need for a robust information infrastructure for molecular biology and clinical research. This ranges from the physical aspects of dealing with data volume through to the more statistically challenging aspects of interpreting it.
A particular opportunity is the switch from research-commissioned genomic measurement to healthcare-centric genomic measurement. This is occurring in a number of countries worldwide, including Australia, Denmark, Finland, France, the United Kingdom, and the United States.
Coupled with this, though, are important aspects of communicating these results, in particular in areas closer to policy and politics, for example the concepts of “ethnicity and race” with respect to genetics. Here the scientific endeavour interacts with many active societal discussions.
I will provide an overview of this area, highlighting the role of EMBL’s European Bioinformatics Institute and the Global Alliance for Genomics and Health, and then present an exemplar from my own research group on imaging genetics. Finally, I will discuss my view on how to present the intersection of this field with broad societal concepts such as ethnicity or race, and the potential pitfalls in these discussions; I welcome feedback and discussion, in particular on this latter topic.
Sequencing of ancient human DNA (aDNA) samples has helped delineate our demographic and migratory history, as well as our relationship with extinct subspecies. A challenge of aDNA sequencing is that DNA is subject to various damage processes post-mortem. One of the most prevalent damage signatures, deamination of cytosines, is a blessing in disguise due to its dependence on DNA methylation (DNAme). DNAme, which occurs predominantly on the C of CpGs (a cytosine-guanine dinucleotide), is an important DNA modification involved in gene regulation, specifically silencing. Careful computational analyses of these damage patterns in different genomic regions are therefore informative of gene regulatory status pre-mortem in aDNA samples. This opens up the possibility of investigating gene regulatory changes between ancient and modern humans, as well as estimating age at death, since there is substantial evidence for coordinated age-related changes in DNAme. I will discuss the statistical challenges, especially for low-coverage sequencing data, and present our method, MAnDIBlE (Methylation of ANcient Dna Inferred by BinomiaL Expectation-propagation).
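Schematically, and only as one reading of the abstract rather than the paper's exact likelihood, the damage-methylation link can be captured by a binomial observation model in which, at a CpG site covered by n reads, the C-to-T mismatch count k depends on the methylation level m through a deamination rate d:

```latex
% Schematic observation model (an assumption based on the abstract, not the
% paper's exact likelihood): at a CpG covered by n aDNA reads, the C-to-T
% mismatch count k reflects the pre-mortem methylation level m through the
% post-mortem deamination rate d of methylated cytosines.
\[
  k \mid m \;\sim\; \operatorname{Binomial}\!\left(n,\; m\,d\right),
  \qquad m \in [0,1],\; d \in (0,1).
\]
```

With low coverage, n is small at any one site, so inference on m must pool information across sites; per the method's name, an expectation-propagation approximation presumably handles the resulting posterior computation.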
Whereas genetic studies of complex traits have primarily examined populations of European ancestry, a few recent studies have generated an abundance of whole-genome data from multi-ethnic populations. These data offer an unprecedented opportunity for the identification of novel rare and common genetic variation underlying phenotypic diversity among populations, as well as the potential to provide new insights into health disparities of minority populations for many complex diseases. However, beyond the obvious computational issues of analyzing millions of genetic variants from whole-genome data in tens of thousands of individuals, there are substantial statistical challenges for complex trait mapping in multi-ethnic samples with heterogeneous genomes from a variety of study designs. In this talk, some existing challenges and new statistical approaches for improved complex trait mapping in multi-ethnic populations will be presented, with applications to the two largest multi-ethnic genetic studies in the U.S. to date: the Trans-Omics for Precision Medicine (TOPMed) program and the Population Architecture using Genomics and Epidemiology (PAGE) study.
We will present multiple ways in which healthcare data is acquired and machine learning methods are currently being introduced into clinical settings:
Modelling the prediction of disease, including sepsis, and ways in which the best treatment decisions for sepsis patients can be made, from electronic health record (EHR) data using Gaussian processes and deep learning methods.
Predicting surgical complications and transfer learning methods for combining databases.
Using mobile apps and integrated sensors for improving the granularity of recorded health data for chronic conditions.
Current work in these areas will be presented and the future of machine learning contributions to the field will be discussed.
Logistic regression is arguably the most widely used and studied non-linear model in statistics. Classical maximum likelihood theory provides asymptotic distributions for the maximum likelihood estimate (MLE) and the likelihood ratio test (LRT), which are universally used for inference. Our findings reveal, however, that when the number of features p and the sample size n both diverge, with the ratio p/n converging to a positive constant, classical results are far from accurate. For a certain class of logistic models, we observe that (1) the MLE is biased, (2) the variability of the MLE is much higher than classical results suggest, and (3) the LRT is not distributed as a chi-squared. We develop a new theory that quantifies the asymptotic bias and variance of the MLE, and characterizes the asymptotic distribution of the LRT under certain assumptions on the distribution of the covariates. Empirical results demonstrate that our predictions are extremely accurate in finite samples. These novel predictions depend on the underlying regression coefficients through a single scalar, the overall signal strength, which can be estimated efficiently. This theory also yields tools for characterizing the asymptotic properties of penalized-likelihood-based estimators in the aforementioned high-dimensional regime. This is based on joint work with Emmanuel Candes and Yuxin Chen.
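The bias phenomenon is easy to see in simulation. The sketch below (an illustration only, not the paper's theoretical setup or exact scalings) fits plain maximum likelihood with p/n = 0.2 and shows the fitted coefficients systematically overshooting the truth; all constants are illustrative choices.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 1000, 200                                  # p/n = 0.2, far from fixed p
beta = np.zeros(p); beta[:20] = 3.0               # true signal coefficients
est = []
for _ in range(20):
    X = rng.normal(size=(n, p)) / np.sqrt(p)      # keeps var(x'beta) order one
    y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta))).astype(float)
    fit = sm.Logit(y, X).fit(disp=0)              # unpenalized MLE
    est.append(fit.params[:20].mean())            # average estimate of 3.0
print(round(float(np.mean(est)), 2))              # noticeably above 3.0
```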