Doctor of Philosophy, Peking University (2016)
Bachelor of Science, Wuhan University (2011)
Hua Tang, Postdoctoral Faculty Sponsor
My research aims to develop efficient computational approaches for constructing meaningful networks from multi-omics data. Rapid advances in sequencing technologies are producing large amounts of multi-omics data, from which biological networks can be constructed that are more stable and accurate than those built from a single omics data type. These networks can then be used to analyze the association between networks and diseases from a systems perspective.
Approximately 13% of African American individuals carry two copies of the APOL1 risk alleles G1 or G2, which are associated with a 1.5- to 2.5-fold increased risk of chronic kidney disease (CKD). There have been conflicting reports as to whether an association exists between APOL1 risk alleles and cardiovascular disease, independent of the effects of APOL1 on kidney disease. We sought to test the association of APOL1 G1/G2 alleles with coronary artery disease (CAD), peripheral artery disease (PAD), and stroke among African American individuals in the Million Veteran Program (MVP). We performed a time-to-event analysis of retrospective electronic health record (EHR) data using Cox proportional hazards and Fine and Gray competing-risks sub-distribution hazard models. The primary exposure was APOL1 risk allele status. The primary outcome was incident CAD among individuals without CKD during the 12.5-year follow-up period. Separately, we analyzed the cross-sectional association of APOL1 risk allele status with lipid traits and 115 cardiovascular diseases using phenome-wide association. Among 30,903 African American MVP participants, 3,941 (13%) carried the high-risk genotype with two APOL1 risk alleles. Individuals with normal kidney function at baseline and two risk alleles had a slightly higher risk of developing CAD than those with no risk alleles (hazard ratio (HR): 1.11, 95% confidence interval (CI): 1.01-1.21, p=0.039). Similarly, modest associations were identified with incident stroke (HR: 1.20, 95% CI: 1.05-1.36, p=0.007) and PAD (HR: 1.15, 95% CI: 1.01-1.29, p=0.031). When modeling both cardiovascular and renal outcomes, APOL1 was strongly associated with incident renal disease, while no significant association with the cardiovascular disease endpoints could be detected.
Cardiovascular phenome-wide association analyses did not identify additional significant associations with cardiovascular disease subsets. APOL1 risk variants display a modest association with cardiovascular disease, and this association is likely mediated by the known APOL1 association with CKD.
View details for DOI 10.1161/CIRCULATIONAHA.118.036589
View details for PubMedID 31337231
Large-scale multi-ethnic cohorts offer unprecedented opportunities to elucidate the genetic factors influencing complex traits related to health and disease among minority populations. At the same time, the genetic diversity in these cohorts presents new challenges for analysis and interpretation. We consider the utility of race and/or ethnicity categories in genome-wide association studies (GWASs) of multi-ethnic cohorts. We demonstrate that race/ethnicity information enhances the ability to understand population-specific genetic architecture. To address the practical issue that self-identified racial/ethnic information may be incomplete, we propose a machine learning algorithm that produces a surrogate variable, termed HARE. We use height as a model trait to demonstrate the utility of HARE and ethnicity-specific GWASs.
View details for DOI 10.1016/j.ajhg.2019.08.012
View details for PubMedID 31564439
View details for DOI 10.4310/SII.2018.v11.n4.a10
Building gene co-expression networks (GCNs) from gene expression data is an important field of bioinformatics research. RNA-seq data now provide high-dimensional information that quantifies gene expression in terms of read counts for individual exons of genes. This increase in the dimension of expression data during the transition from the microarray to the RNA-seq era has made many previous co-expression analysis algorithms based on simple univariate correlation no longer applicable. Recently, two vector-based methods, SpliceNet and RNASeqNet, have been proposed to build GCNs. However, they fail when the sample size is smaller than the number of exons. We develop an algorithm called VCNet to construct GCNs from RNA-seq data and overcome this dimensionality problem. VCNet performs a new statistical hypothesis test based on the correlation matrix of a gene-gene pair using the Frobenius norm. The asymptotic distribution of the new test is obtained under the null model. Simulation studies demonstrate that VCNet outperforms SpliceNet and RNASeqNet in detecting the edges of a GCN. We also apply VCNet to two expression datasets from the TCGA database, normal breast tissue and kidney tumour tissue, and the results show that the GCNs constructed by VCNet contain more biologically meaningful interactions than those of existing methods. VCNet is a useful tool for constructing co-expression networks. VCNet is open source and freely available from https://github.com/wangzengmiao/VCNet under GNU LGPL.
View details for DOI 10.1093/bioinformatics/btx131
View details for PubMedID 28334366
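As a rough illustration of the kind of test described in the VCNet abstract above, the following Python sketch scores a gene pair by the squared Frobenius norm of the exon-by-exon cross-correlation matrix. It is not the VCNet implementation: the exact form of the statistic is an assumption, the asymptotic null distribution is swapped for a simple permutation test, and the function names and simulated data are hypothetical.

```python
import numpy as np

def cross_corr_frobenius(x, y):
    """Squared Frobenius norm of the cross-correlation block between columns of x and y."""
    xs = (x - x.mean(axis=0)) / x.std(axis=0)   # standardise each exon of gene A
    ys = (y - y.mean(axis=0)) / y.std(axis=0)   # standardise each exon of gene B
    cross = xs.T @ ys / x.shape[0]              # Pearson correlations, exon by exon
    return float(np.sum(cross ** 2))

def permutation_pvalue(x, y, n_perm=500, seed=0):
    """Permutation p-value: shuffle the samples of x to break any link with y."""
    rng = np.random.default_rng(seed)
    observed = cross_corr_frobenius(x, y)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(x.shape[0])
        if cross_corr_frobenius(x[perm], y) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

# Simulated data (hypothetical): 30 samples, gene A with 4 exons, gene B with 3 exons.
rng = np.random.default_rng(1)
factor = rng.normal(size=(30, 1))                         # shared expression driver
gene_a = factor + 0.5 * rng.normal(size=(30, 4))
gene_b_linked = factor + 0.5 * rng.normal(size=(30, 3))   # co-expressed with gene A
gene_b_indep = rng.normal(size=(30, 3))                   # unrelated to gene A

p_linked = permutation_pvalue(gene_a, gene_b_linked)
p_indep = permutation_pvalue(gene_a, gene_b_indep)
print(p_linked, p_indep)
```

Because this statistic involves no matrix inversion, it can still be computed when the exons outnumber the samples.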
The increasing quality and decreasing cost of high-throughput sequencing technologies for 16S rRNA gene profiling enable researchers to directly analyze microbial communities in natural environments. The direct interactions among microbial species of a given ecological system can help us understand the principles of community assembly and maintenance under various conditions. The compositionality and dimensionality of microbiome data are two main challenges for inferring the direct interaction network of microbes. In this article, we use the logistic normal distribution to model the underlying mechanism of microbiome data, which appropriately handles the compositional nature of the data. The direct interaction relationships are then modeled via the conditional dependence network under this logistic normal assumption. We then propose a novel penalized maximum likelihood method called gCoda to estimate the sparse structure of the inverse covariance for latent normal variables to address the high dimensionality of the microbiome data. An effective Majorization-Minimization algorithm is proposed to solve the optimization problem in gCoda. Simulation studies show that gCoda outperforms existing methods (e.g., SPIEC-EASI) in edge recovery of the inverse covariance for compositional data under a variety of scenarios. gCoda also performs better than SPIEC-EASI for inferring direct microbial interactions of mouse skin microbiome data.
View details for DOI 10.1089/cmb.2017.0054
View details for PubMedID 28489411
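The logistic normal model named in the gCoda abstract above can be illustrated with a short Python sketch. This is a minimal illustration of the data model only, not the gCoda estimator; the function names, covariance matrix, and sample size are made up for the example. Latent normal vectors are mapped to compositions, and the centered log-ratio transform recovers the latent values only up to a per-sample shift, which illustrates why the latent dependence structure has to be estimated rather than read directly off the observed proportions.

```python
import numpy as np

def logistic_normal_sample(mean, cov, n_samples, seed=0):
    """Draw latent normal vectors and map them to compositions (rows sum to 1)."""
    rng = np.random.default_rng(seed)
    latent = rng.multivariate_normal(mean, cov, size=n_samples)
    expd = np.exp(latent - latent.max(axis=1, keepdims=True))  # numerically stable
    return latent, expd / expd.sum(axis=1, keepdims=True)

def clr(composition):
    """Centered log-ratio transform of strictly positive compositions."""
    logs = np.log(composition)
    return logs - logs.mean(axis=1, keepdims=True)

# Hypothetical latent covariance for four taxa (two dependent pairs).
mean = np.zeros(4)
cov = np.array([[1.0, 0.6, 0.0, 0.0],
                [0.6, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, -0.4],
                [0.0, 0.0, -0.4, 1.0]])
latent, comp = logistic_normal_sample(mean, cov, n_samples=500)

# clr(composition) equals the latent vector minus its own per-sample mean.
centered_latent = latent - latent.mean(axis=1, keepdims=True)
print(np.allclose(clr(comp), centered_latent))
```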
View details for DOI 10.1214/17-EJS1331
Epistatic miniarray profile (EMAP) studies have enabled the mapping of large-scale genetic interaction networks and generated large amounts of data in model organisms. They provide a powerful set of molecular tools and advanced technologies that should help in efficiently understanding the relationship between the genotypes and phenotypes of individuals. However, the network information gained from EMAP cannot be fully exploited using traditional statistical network models, because the genetic network is always heterogeneous: for example, the structural features of one subset of nodes differ from those of the remaining nodes. Exponential-family random graph models (ERGMs) are a family of statistical models that provide a principled and flexible way to describe the structural features (e.g. the density, centrality and assortativity) of an observed network. However, a single ERGM is not enough to capture this heterogeneity of networks. In this paper, we consider mixture ERGM (MixtureERGM) networks, which model a network with several communities, where each community is described by a single ERGM.
View details for DOI 10.1109/TCBB.2017.2743711
View details for PubMedID 28858811
View details for DOI 10.1111/1462-2920.14004
An essential component of precision medicine is the ability to predict an individual's risk of disease based on genetic and non-genetic factors. For complex traits and diseases, assessing the risk due to genetic factors is challenging because it requires knowledge of both the identity of variants that influence the trait and their corresponding allelic effects. Although the set of risk variants and their allelic effects may vary between populations, a large proportion of these variants were identified based on studies in populations of European descent. Heterogeneity in genetic architecture underlying complex traits and diseases, while broadly acknowledged, remains poorly characterized. Ignoring such heterogeneity likely reduces predictive accuracy for minority individuals. In this study, we propose an approach, called XP-BLUP, which ameliorates this ethnic disparity by combining trans-ethnic and ethnic-specific information. We build a polygenic model for complex traits that distinguishes candidate trait-relevant variants from the rest of the genome. The set of candidate variants are selected based on studies in any human population, yet the allelic effects are evaluated in a population-specific fashion. Simulation studies and real data analyses demonstrate that XP-BLUP adaptively utilizes trans-ethnic information and can substantially improve predictive accuracy in minority populations. At the same time, our study highlights the importance of the continued expansion of minority cohorts.
View details for PubMedID 28757202
View details for PubMedCentralID PMC5544393
Direct analysis of microbial communities in the environment and human body has become more convenient and reliable owing to the advancements of high-throughput sequencing techniques for 16S rRNA gene profiling. Inferring the correlation relationship among members of microbial communities is of fundamental importance for genomic survey studies. Traditional Pearson correlation analysis, which treats the observed data as absolute abundances of the microbes, may lead to spurious results because the data only represent relative abundances. Special care and appropriate methods are required prior to correlation analysis for these compositional data. In this article, we first discuss the correlation definition of latent variables for compositional data. We then propose a novel method called CCLasso, based on least squares with an ℓ1 penalty, to infer the correlation network for latent variables of compositional data from metagenomic data. An effective alternating direction algorithm from the augmented Lagrangian method is used to solve the optimization problem. The simulation results show that CCLasso outperforms existing methods, e.g. SparCC, in edge recovery for compositional data. It also compares well with SparCC in estimating the correlation network of microbe species from the Human Microbiome Project. CCLasso is open source and freely available from https://github.com/huayingfang/CCLasso under GNU LGPL. Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btv349
View details for Web of Science ID 000362845400013
View details for PubMedID 26048598
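The spurious-correlation problem mentioned in the CCLasso abstract above can be reproduced with a small simulation. The Python sketch below uses simulated data and hypothetical function names, and it does not implement CCLasso. It draws independent absolute abundances, converts them to relative abundances, and compares the average pairwise Pearson correlation before and after the conversion.

```python
import numpy as np

def closure(counts):
    """Convert absolute abundances (rows = samples) to relative abundances."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

def mean_offdiag_corr(data):
    """Average Pearson correlation over all distinct pairs of columns."""
    corr = np.corrcoef(data, rowvar=False)
    p = corr.shape[0]
    return (corr.sum() - np.trace(corr)) / (p * (p - 1))

# Three taxa with independent, identically distributed absolute abundances.
rng = np.random.default_rng(0)
absolute = rng.lognormal(mean=0.0, sigma=0.5, size=(2000, 3))
relative = closure(absolute)

print(mean_offdiag_corr(absolute))  # close to zero: the taxa are independent
print(mean_offdiag_corr(relative))  # clearly negative: induced by the sum-to-one constraint
```

With three identically distributed taxa, the sum-to-one constraint pushes the average pairwise correlation of the relative abundances toward about -0.5, even though the underlying abundances are independent.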
Understanding of the RNA editing process has been broadened considerably by next-generation sequencing technology; however, several issues regarding this regulatory step remain unresolved: the strategies to accurately delineate the editome, the mechanism by which its profile is maintained, and its evolutionary and functional relevance. Here we report an accurate and quantitative profile of the RNA editome for rhesus macaque, a close relative of human. By combining genome and transcriptome sequencing of multiple tissues from the same animal, we identified 31,250 editing sites, of which 99.8% are A-to-G transitions. We verified 96.6% of editing sites in coding regions and 97.5% of randomly selected sites in non-coding regions, as well as the corresponding levels of editing by multiple independent means, demonstrating the feasibility of our experimental paradigm. Several lines of evidence supported the notion that adenosine deamination is associated with the macaque editome: A-to-G editing sites were flanked by sequences with the attributes of ADAR substrates, and both the sequence context and the expression profile of ADARs are relevant factors in determining the quantitative variance of RNA editing across different sites and tissue types. In support of the functional relevance of some of these editing sites, a substitution valley of decreased divergence was detected around the editing site, suggesting an evolutionary constraint in maintaining some of these editing substrates with their double-stranded structure. These findings thus complement the "continuous probing" model that postulates tinkering-based origination of a small proportion of functional editing sites. In conclusion, the macaque editome reported here highlights RNA editing as a widespread functional regulation in primate evolution, and provides an informative framework for further understanding RNA editing in human.
View details for DOI 10.1371/journal.pgen.1004274
View details for Web of Science ID 000335499600032
View details for PubMedID 24722121