Doctor of Philosophy, Tsinghua University, Computer Science and Technology (2017)
Computational Biology, Machine Learning
Innovative analytical frameworks are required to capture the complex gene-environment interactions. We investigate the insufficiency of commonly used models for disease genome analysis and suggestconsidering genetic interactions in complex diseases. For non-genetic factors, we study the emerging wearable technologies that have enabled quantification of physiological and environmental factors at an unprecedented breadth and depth. We propose a Bayesian framework to hierarchically model personalized gene-environmental interaction to enable precision health and medicine.
View details for PubMedID 30901546
A key aspect of genomic medicine is to make individualized clinical decisions from personal genomes. We developed a machine-learning framework to integrate personal genomes and electronic health record (EHR) data and used this framework to study abdominal aortic aneurysm (AAA), a prevalent irreversible cardiovascular disease with unclear etiology. Performing whole-genome sequencing on AAA patients and controls, we demonstrated its predictive precision solely from personal genomes. By modeling personal genomes with EHRs, this framework quantitatively assessed the effectiveness of adjusting personal lifestyles given personal genome baselines, demonstrating its utility as a personal health management tool. We showed that this new framework agnostically identified genetic components involved in AAA, which were subsequently validated in human aortic tissues and in murine models. Our study presents a new framework for disease genome analysis, which can be used for both health management and understanding the biological architecture of complex diseases. VIDEO ABSTRACT.
View details for PubMedID 30193110
Human immunodeficiency virus type 1 (HIV-1) genome integration is closely related to clinical latency and viral rebound. In addition to human DNA sequences that directly interact with the integration machinery, the selection of HIV integration sites has also been shown to depend on the heterogeneous genomic context around a large region, which greatly hinders the prediction and mechanistic studies of HIV integration.We have developed an attention-based deep learning framework, named DeepHINT, to simultaneously provide accurate prediction of HIV integration sites and mechanistic explanations of the detected sites. Extensive tests on a high-density HIV integration site dataset showed that DeepHINT can outperform conventional modeling strategies by automatically learning the genomic context of HIV integration from primary DNA sequence alone or together with epigenetic information. Systematic analyses on diverse known factors of HIV integration further validated the biological relevance of the prediction results. More importantly, in-depth analyses of the attention values output by DeepHINT revealed intriguing mechanistic implications in the selection of HIV integration sites, including potential roles of several DNA-binding proteins. These results established DeepHINT as an effective and explainable deep learning framework for the prediction and mechanistic study of HIV integration.DeepHINT is available as an open-source software and can be downloaded from https://github.com/nonnerdling/DeepHINT.Supplementary data are available at Bioinformatics online.
View details for PubMedID 30295703
Decoding the spatial organizations of chromosomes has crucial implications for studying eukaryotic gene regulation. Recently, chromosomal conformation capture based technologies, such as Hi-C, have been widely used to uncover the interaction frequencies of genomic loci in a high-throughput and genome-wide manner and provide new insights into the folding of three-dimensional (3D) genome structure. In this paper, we develop a novel manifold learning based framework, called GEM (Genomic organization reconstructor based on conformational Energy and Manifold learning), to reconstruct the three-dimensional organizations of chromosomes by integrating Hi-C data with biophysical feasibility. Unlike previous methods, which explicitly assume specific relationships between Hi-C interaction frequencies and spatial distances, our model directly embeds the neighboring affinities from Hi-C space into 3D Euclidean space. Extensive validations demonstrated that GEM not only greatly outperformed other state-of-art modeling methods but also provided a physically and physiologically valid 3D representations of the organizations of chromosomes. Furthermore, we for the first time apply the modeled chromatin structures to recover long-range genomic interactions missing from original Hi-C data.
View details for DOI 10.1093/nar/gky065
View details for PubMedID 29408992
Characterizing the binding behaviors of RNA-binding proteins (RBPs) is important for understanding their functional roles in gene expression regulation. However, current high-throughput experimental methods for identifying RBP targets, such as CLIP-seq and RNAcompete, usually suffer from the false negative issue. Here, we develop a deep boosting based machine learning approach, called DeBooster, to accurately model the binding sequence preferences and identify the corresponding binding targets of RBPs from CLIP-seq data. Comprehensive validation tests have shown that DeBooster can outperform other state-of-the-art approaches in RBP target prediction. In addition, we have demonstrated that DeBooster may provide new insights into understanding the regulatory functions of RBPs, including the binding effects of the RNA helicase MOV10 on mRNA degradation, the potentially different ADAR1 binding behaviors related to its editing activity, as well as the antagonizing effect of RBP binding on miRNA repression. Moreover, DeBooster may provide an effective index to investigate the effect of pathogenic mutations in RBP binding sites, especially those related to splicing events. We expect that DeBooster will be widely applied to analyze large-scale CLIP-seq experimental data and can provide a practically useful tool for novel biological discoveries in understanding the regulatory mechanisms of RBPs. The source code of DeBooster can be downloaded from http://github.com/dongfanghong/deepboost.
View details for DOI 10.1093/nar/gkx492
View details for PubMedID 28575488
View details for DOI 10.1007/s40484-017-0092-7
Ribosome stalling is manifested by the local accumulation of ribosomes at specific codon positions of mRNAs. Here, we present ROSE, a deep learning framework to analyze high-throughput ribosome profiling data and estimate the probability of a ribosome stalling event occurring at each genomic location. Extensive validation tests on independent data demonstrated that ROSE possessed higher prediction accuracy than conventional prediction models, with an increase in the area under the receiver operating characteristic curve by up to 18.4%. In addition, genome-wide statistical analyses showed that ROSE predictions can be well correlated with diverse putative regulatory factors of ribosome stalling. Moreover, the genome-wide ribosome stalling landscapes of both human and yeast computed by ROSE recovered the functional interplays between ribosome stalling and cotranslational events in protein biogenesis, including protein targeting by the signal recognition particles and protein secondary structure formation. Overall, our study provides a novel method to complement the ribosome profiling techniques and further decipher the complex regulatory mechanisms underlying translation elongation dynamics encoded in the mRNA sequence.
View details for DOI 10.1016/j.cels.2017.08.004
View details for PubMedID 28957655
Translation initiation is a key step in the regulation of gene expression. In addition to the annotated translation initiation sites (TISs), the translation process may also start at multiple alternative TISs (including both AUG and non-AUG codons), which makes it challenging to predict TISs and study the underlying regulatory mechanisms. Meanwhile, the advent of several high-throughput sequencing techniques for profiling initiating ribosomes at single-nucleotide resolution, e.g. GTI-seq and QTI-seq, provides abundant data for systematically studying the general principles of translation initiation and the development of computational method for TIS identification.We have developed a deep learning-based framework, named TITER, for accurately predicting TISs on a genome-wide scale based on QTI-seq data. TITER extracts the sequence features of translation initiation from the surrounding sequence contexts of TISs using a hybrid neural network and further integrates the prior preference of TIS codon composition into a unified prediction framework.Extensive tests demonstrated that TITER can greatly outperform the state-of-the-art prediction methods in identifying TISs. In addition, TITER was able to identify important sequence signatures for individual types of TIS codons, including a Kozak-sequence-like motif for AUG start codon. Furthermore, the TITER prediction score can be related to the strength of translation initiation in various biological scenarios, including the repressive effect of the upstream open reading frames on gene expression and the mutational effects influencing translation initiation efficiency.TITER is available as an open-source software and can be downloaded from https://github.com/zhangsaithu/titer .firstname.lastname@example.org or email@example.com.Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btx247
View details for PubMedID 28881981
Modeling the structural ensemble of intrinsically disordered proteins (IDPs), which lack fixed structures, is essential in understanding their cellular functions and revealing their regulation mechanisms in signaling pathways of related diseases (e.g., cancers and neurodegenerative disorders). Though the ensemble concept has been widely believed to be the most accurate way to depict 3D structures of IDPs, few of the traditional ensemble-based approaches effectively address the degeneracy problem that occurs when multiple solutions are consistent with experimental data and is the main challenge in the IDP ensemble construction task. In this article, based on a predefined conformational library, we formalize the structure ensemble construction problem into a least squares framework, which provides the optimal solution when the data constraints outnumber unknown variables. To deal with the degeneracy problem, we further propose a regularized regression approach based on the elastic net technique with the assumption that the weights to be estimated for individual structures in the ensemble are sparse. We have validated our methods through a reference ensemble approach as well as by testing the real biological data of three proteins, including alpha-synuclein, the translocation domain of Colocin N, and the K18 domain of Tau protein.
View details for DOI 10.1089/cmb.2015.0184
View details for Web of Science ID 000376080500002
View details for PubMedID 27159632
View details for PubMedCentralID PMC4876552
RNA-binding proteins (RBPs) play important roles in the post-transcriptional control of RNAs. Identifying RBP binding sites and characterizing RBP binding preferences are key steps toward understanding the basic mechanisms of the post-transcriptional gene regulation. Though numerous computational methods have been developed for modeling RBP binding preferences, discovering a complete structural representation of the RBP targets by integrating their available structural features in all three dimensions is still a challenging task. In this paper, we develop a general and flexible deep learning framework for modeling structural binding preferences and predicting binding sites of RBPs, which takes (predicted) RNA tertiary structural information into account for the first time. Our framework constructs a unified representation that characterizes the structural specificities of RBP targets in all three dimensions, which can be further used to predict novel candidate binding sites and discover potential binding motifs. Through testing on the real CLIP-seq datasets, we have demonstrated that our deep learning framework can automatically extract effective hidden structural features from the encoded raw sequence and structural profiles, and predict accurate RBP binding sites. In addition, we have conducted the first study to show that integrating the additional RNA tertiary structural features can improve the model performance in predicting RBP binding sites, especially for the polypyrimidine tract-binding protein (PTB), which also provides a new evidence to support the view that RBPs may own specific tertiary structural binding preferences. In particular, the tests on the internal ribosome entry site (IRES) segments yield satisfiable results with experimental support from the literature and further demonstrate the necessity of incorporating RNA tertiary structural information into the prediction model. The source code of our approach can be found in https://github.com/thucombio/deepnet-rbp.
View details for DOI 10.1093/nar/gkv1025
View details for Web of Science ID 000371519700003
View details for PubMedID 26467480
View details for PubMedCentralID PMC4770198
Online social networks (OSNs) are changing the way in which the information spreads throughout the Internet. A deep understanding of the information spreading in OSNs leads to both social and commercial benefits. In this paper, we characterize the dynamic of information spreading (e.g., how fast and widely the information spreads against time) in OSNs by developing a general and accurate model based on the Interactive Markov Chains (IMCs) and mean-field theory. This model explicitly reveals the impacts of the network topology on information spreading in OSNs. Further, we extend our model to feature the time-varying user behaviors and the ever-changing information popularity. The complicated dynamic patterns of information spreading are captured by our model using six key parameters. Extensive tests based on Renren's dataset validate the accuracy of our model, which demonstrate that it can characterize the dynamic patterns of video sharing in Renren precisely and predict future spreading tendency successfully.