Methods to increase reproducibility in differential gene expression via meta-analysis.
Nucleic acids research
2017; 45 (1)
Multicohort analysis reveals baseline transcriptional predictors of influenza vaccination responses.
2017; 2 (14)
Findings from clinical and biological studies are often not reproducible when tested in independent cohorts. Due to the testing of a large number of hypotheses and relatively small sample sizes, results from whole-genome expression studies in particular are often not reproducible. Compared to single-study analysis, gene expression meta-analysis can improve reproducibility by integrating data from multiple studies. However, there are multiple choices in designing and carrying out a meta-analysis. Yet, clear guidelines on best practices are scarce. Here, we hypothesized that studying subsets of very large meta-analyses would allow for systematic identification of best practices to improve reproducibility. We therefore constructed three very large gene expression meta-analyses from clinical samples, and then examined meta-analyses of subsets of the datasets (all combinations of datasets with up to N/2 samples and K/2 datasets) compared to a 'silver standard' of differentially expressed genes found in the entire cohort. We tested three random-effects meta-analysis models using this procedure. We showed relatively greater reproducibility with more-stringent effect size thresholds with relaxed significance thresholds; relatively lower reproducibility when imposing extraneous constraints on residual heterogeneity; and an underestimation of actual false positive rate by Benjamini-Hochberg correction. In addition, multivariate regression showed that the accuracy of a meta-analysis increased significantly with more included datasets even when controlling for sample size.
View details for DOI 10.1093/nar/gkw797
View details for PubMedID 27634930
View details for PubMedCentralID PMC5224496
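As a rough illustration of the random-effects pooling that the abstract above (DOI 10.1093/nar/gkw797) evaluates, the sketch below implements the DerSimonian-Laird estimator for a single gene. It is one common random-effects model, assumed here for illustration; the abstract does not name the three models tested, and the function name and inputs are hypothetical.

```python
import math

def dersimonian_laird(effects, variances):
    """Pool per-study effect sizes for one gene (hypothetical helper).

    effects   -- per-study effect sizes (e.g. standardized mean differences)
    variances -- per-study sampling variances of those effect sizes
    Returns (pooled effect, standard error, between-study variance tau^2).
    """
    k = len(effects)
    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Method-of-moments estimate of between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add tau^2 to each study's variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2
```

In a multi-gene setting one would call this per gene and then filter on the pooled effect size and a multiple-testing-adjusted p-value; the thresholds themselves are the design choices the abstract discusses.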
EMPOWERING MULTI-COHORT GENE EXPRESSION ANALYSIS TO INCREASE REPRODUCIBILITY.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2016; 22: 144-153
Annual influenza vaccinations are currently recommended for all individuals 6 months and older. Antibodies induced by vaccination are an important mechanism of protection against infection. Despite the overall public health success of influenza vaccination, many individuals fail to induce a substantial antibody response. Systems-level immune profiling studies have discerned associations between transcriptional and cell subset signatures with the success of antibody responses. However, existing signatures have relied on small cohorts and have not been validated in large independent studies. We leveraged multiple influenza vaccination cohorts spanning distinct geographical locations and seasons from the Human Immunology Project Consortium (HIPC) and the Center for Human Immunology (CHI) to identify baseline (i.e., before vaccination) predictive transcriptional signatures of influenza vaccination responses. Our multicohort analysis of HIPC data identified nine genes (RAB24, GRB2, DPP3, ACTB, MVP, DPP7, ARPC4, PLEKHB2, and ARRB1) and three gene modules that were significantly associated with the magnitude of the antibody response, and these associations were validated in the independent CHI cohort. These signatures were specific to young individuals, suggesting that distinct mechanisms underlie the lower vaccine response in older individuals. We found an inverse correlation between the effect size of signatures in young and older individuals. Although the presence of an inflammatory gene signature, for example, was associated with better antibody responses in young individuals, it was associated with worse responses in older individuals. These results point to the prospect of predicting antibody responses before vaccination and provide insights into the biological mechanisms underlying successful vaccination responses.
View details for DOI 10.1126/sciimmunol.aal4656
View details for PubMedID 28842433
META-ANALYSIS OF CONTINUOUS PHENOTYPES IDENTIFIES A GENE SIGNATURE THAT CORRELATES WITH COPD DISEASE STATUS.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2016; 22: 266-275
A major contributor to the scientific reproducibility crisis has been that the results from homogeneous, single-center studies do not generalize to heterogeneous, real world populations. Multi-cohort gene expression analysis has helped to increase reproducibility by aggregating data from diverse populations into a single analysis. To make the multi-cohort analysis process more feasible, we have assembled an analysis pipeline which implements rigorously studied meta-analysis best practices. We have compiled and made publicly available the results of our own multi-cohort gene expression analysis of 103 diseases, spanning 615 studies and 36,915 samples, through a novel and interactive web application. As a result, we have made both the process of and the results from multi-cohort gene expression analysis more approachable for non-technical users.
View details for PubMedID 27896970
View details for PubMedCentralID PMC5167529
Validating single-cell genomics for the study of renal development
2014; 86 (5): 1049-1055
The utility of multi-cohort two-class meta-analysis to identify robust differentially expressed gene signatures has been well established. However, many biomedical applications, such as gene signatures of disease progression, require one-class analysis. Here we describe an R package, MetaCorrelator, that can identify a reproducible transcriptional signature that is correlated with a continuous disease phenotype across multiple datasets. We successfully applied this framework to extract a pattern of gene expression that can predict lung function in patients with chronic obstructive pulmonary disease (COPD) in both peripheral blood mononuclear cells (PBMCs) and tissue. Our results point to a dysregulation in the oxidation state of the lungs of patients with COPD, and underscore the classically recognized inflammatory state that underlies this disease.
View details for PubMedID 27896981
Origin and Consequences of the Relationship between Protein Mean and Variance
2014; 9 (7)
Single-cell genomics will enable studies of the earliest events in kidney development, although it is unclear if existing technologies are mature enough to generate accurate and reproducible data on kidney progenitors. Here we designed a pilot study to validate a high-throughput assay to measure the expression levels of key regulators of kidney development in single cells isolated from embryonic mice. Our experiment produced 4608 expression measurements of 22 genes, made in small cell pools, and 28 single cells purified from the RET-positive ureteric bud. There were remarkable levels of concordance with expression data generated by traditional microarray analysis on bulk ureteric bud tissue with the correlation between our average single-cell measurements and GUDMAP measurements for each gene of 0.82-0.85. Nonetheless, a major motivation for single-cell technology is to uncover dynamic biology hidden in population means. There was evidence for extensive and surprising variation in expression of Wnt11 and Etv5, both downstream targets of activated RET. The variation for all genes in the study was strongly consistent with burst-like promoter kinetics. Thus, our results can inform the design of future single-cell experiments, which are poised to provide important insights into kidney development and disease.
View details for DOI 10.1038/ki.2014.104
View details for Web of Science ID 000344446000025
View details for PubMedID 24759149
Performance of Common Analysis Methods for Detecting Low-Frequency Single Nucleotide Variants in Targeted Next-Generation Sequence Data
JOURNAL OF MOLECULAR DIAGNOSTICS
2014; 16 (1): 75-88
Cell-to-cell variance in protein levels (noise) is a ubiquitous phenomenon that can increase fitness by generating phenotypic differences within clonal populations of cells. An important challenge is to identify the specific molecular events that control noise. This task is complicated by the strong dependence of a protein's cell-to-cell variance on its mean expression level through a power-law like relationship (σ² ∝ μ^1.69). Here, we dissect the nature of this relationship using a stochastic model parameterized with experimentally measured values. This framework naturally recapitulates the power-law like relationship (σ² ∝ μ^1.6) and accurately predicts protein variance across the yeast proteome (r² = 0.935). Using this model we identified two distinct mechanisms by which protein variance can be increased. Variables that affect promoter activation, such as nucleosome positioning, increase protein variance by changing the exponent of the power-law relationship. In contrast, variables that affect processes downstream of promoter activation, such as mRNA and protein synthesis, increase protein variance in a mean-dependent manner following the power-law. We verified our findings experimentally using an inducible gene expression system in yeast. We conclude that the power-law-like relationship between noise and protein mean is due to the kinetics of promoter activation. Our results provide a framework for understanding how molecular processes shape stochastic variation across the genome.
View details for DOI 10.1371/journal.pone.0102202
View details for Web of Science ID 000339992600010
View details for PubMedID 25062021
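The power-law relationship between protein variance and mean described in the abstract above (DOI 10.1371/journal.pone.0102202) can be made concrete with a small fitting sketch. This is not the authors' stochastic model; it only shows, under the assumption of a simple log-log least-squares fit, how an exponent could be estimated from paired mean and variance measurements. The function name and inputs are hypothetical.

```python
import math

def fit_power_law(means, variances):
    """Fit variance = a * mean**b by least squares in log-log space.

    Returns (a, b), where b is the power-law exponent.
    All inputs must be strictly positive.
    """
    xs = [math.log(m) for m in means]
    ys = [math.log(v) for v in variances]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least-squares slope and intercept on the log-transformed data.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b
```

In this picture, a change in the fitted exponent `b` would correspond to the promoter-activation effects the abstract describes, while movement along a fixed exponent would correspond to the mean-dependent downstream effects.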
Population-based rare variant detection via pooled exome or custom hybridization capture with or without individual indexing
Next-generation sequencing (NGS) is becoming a common approach for clinical testing of oncology specimens for mutations in cancer genes. Unlike inherited variants, cancer mutations may occur at low frequencies because of contamination from normal cells or tumor heterogeneity and can therefore be challenging to detect using common NGS analysis tools, which are often designed for constitutional genomic studies. We generated high-coverage (>1000×) NGS data from synthetic DNA mixtures with variant allele fractions (VAFs) of 25% to 2.5% to assess the performance of four variant callers, SAMtools, Genome Analysis Toolkit, VarScan2, and SPLINTER, in detecting low-frequency variants. SAMtools had the lowest sensitivity and detected only 49% of variants with VAFs of approximately 25%; whereas the Genome Analysis Toolkit, VarScan2, and SPLINTER detected at least 94% of variants with VAFs of approximately 10%. VarScan2 and SPLINTER achieved sensitivities of 97% and 89%, respectively, for variants with observed VAFs of 1% to 8%, with >98% sensitivity and >99% positive predictive value in coding regions. Coverage analysis demonstrated that >500× coverage was required for optimal performance. The specificity of SPLINTER improved with higher coverage, whereas VarScan2 yielded more false positive results at high coverage levels, although this effect was abrogated by removing low-quality reads before variant identification. Finally, we demonstrate the utility of high-sensitivity variant callers with data from 15 clinical lung cancers.
View details for DOI 10.1016/j.jmoldx.2013.09.003
View details for Web of Science ID 000328926500010
View details for PubMedID 24211364
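The sensitivity and positive predictive value figures quoted in the abstract above (DOI 10.1016/j.jmoldx.2013.09.003) are comparisons of a caller's output against a known truth set. A minimal sketch of that comparison follows, assuming variants are represented as hashable tuples; the representation and function name are illustrative and not taken from any of the named tools.

```python
def sensitivity_ppv(called, truth):
    """Compare a set of called variants against a set of known true variants.

    called, truth -- sets of hashable variant keys, e.g. (chrom, pos, ref, alt)
    Returns (sensitivity, positive predictive value); NaN when undefined.
    """
    true_pos = len(called & truth)    # called and real
    false_pos = len(called - truth)   # called but not real
    false_neg = len(truth - called)   # real but missed
    sens = true_pos / (true_pos + false_neg) if truth else float("nan")
    ppv = true_pos / (true_pos + false_pos) if called else float("nan")
    return sens, ppv
```

Stratifying the truth set by variant allele fraction before calling this function would give per-VAF sensitivities of the kind the abstract reports.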
Detection of rare genomic variants from pooled sequencing using SPLINTER.
Journal of visualized experiments : JoVE
Rare genetic variation in the human population is a major source of pathophysiological variability and has been implicated in a host of complex phenotypes and diseases. Finding disease-related genes harboring disparate functional rare variants requires sequencing of many individuals across many genomic regions and comparing against unaffected cohorts. However, despite persistent declines in sequencing costs, population-based rare variant detection across large genomic target regions remains cost prohibitive for most investigators. In addition, DNA samples are often precious and hybridization methods typically require large amounts of input DNA. Pooled sample DNA sequencing is a cost and time-efficient strategy for surveying populations of individuals for rare variants. We set out to 1) create a scalable, multiplexing method for custom capture with or without individual DNA indexing that was amenable to low amounts of input DNA and 2) expand the functionality of the SPLINTER algorithm for calling substitutions, insertions and deletions across either candidate genes or the entire exome by integrating the variant calling algorithm with the dynamic programming aligner, Novoalign. We report methodology for pooled hybridization capture with pre-enrichment, indexed multiplexing of up to 48 individuals or non-indexed pooled sequencing of up to 92 individuals with as little as 70 ng of DNA per person. Modified solid phase reversible immobilization bead purification strategies enable no sample transfers from sonication in 96-well plates through adapter ligation, resulting in 50% less library preparation reagent consumption. Custom Y-shaped adapters containing novel 7 base pair index sequences with a Hamming distance of ≥2 were directly ligated onto fragmented source DNA, eliminating the need for PCR to incorporate indexes, and this was followed by a custom blocking strategy using a single oligonucleotide regardless of index sequence.
These results were obtained by aligning raw reads against the entire genome using Novoalign followed by variant calling of non-indexed pools using SPLINTER or SAMtools for indexed samples. With these pipelines, we find sensitivity and specificity of 99.4% and 99.7% for pooled exome sequencing. Sensitivity, and to a lesser degree specificity, proved to be a function of coverage. For rare variants (≤2% minor allele frequency), we achieved sensitivity and specificity of ≥94.9% and ≥99.99% for custom capture of 2.5 Mb in multiplexed libraries of 22-48 individuals with only ≥5-fold coverage/chromosome, but these parameters improved to ≥98.7% and 100% with 20-fold coverage/chromosome. This highly scalable methodology enables accurate rare variant detection, with or without individual DNA sample indexing, while reducing the amount of required source DNA and total costs through less hybridization reagent consumption, multi-sample sonication in a standard PCR plate, multiplexed pre-enrichment pooling with a single hybridization and lesser sequencing coverage required to obtain high sensitivity.
View details for DOI 10.1186/1471-2164-13-683
View details for Web of Science ID 000312962400001
View details for PubMedID 23216810
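The abstract above (DOI 10.1186/1471-2164-13-683) specifies 7 base pair index sequences with a Hamming distance of ≥2. A short sketch of how such a constraint could be checked for a candidate index set follows. The index sequences in the usage note are made up for illustration and are not the published adapters.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def min_pairwise_hamming(indexes):
    """Smallest Hamming distance over all pairs in a set of index sequences."""
    return min(hamming(a, b) for a, b in combinations(indexes, 2))
```

For example, `min_pairwise_hamming(["ACGTACG", "ACGTTGG", "TTGTACG"])` can be compared against the required minimum of 2 before a set is accepted. A minimum distance of 2 means a single sequencing error in an index cannot turn it into another valid index, so such errors are detectable, although not correctable.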
Rare Variants in APP, PSEN1 and PSEN2 Increase Risk for AD in Late-Onset Alzheimer's Disease Families
2012; 7 (2)
As DNA sequencing technology has markedly advanced in recent years(2), it has become increasingly evident that the amount of genetic variation between any two individuals is greater than previously thought(3). In contrast, array-based genotyping has failed to identify a significant contribution of common sequence variants to the phenotypic variability of common disease(4,5). Taken together, these observations have led to the evolution of the Common Disease / Rare Variant hypothesis suggesting that the majority of the "missing heritability" in common and complex phenotypes is instead due to an individual's personal profile of rare or private DNA variants(6-8). However, characterizing how rare variation impacts complex phenotypes requires the analysis of many affected individuals at many genomic loci, and is ideally compared to a similar survey in an unaffected cohort. Despite the sequencing power offered by today's platforms, a population-based survey of many genomic loci and the subsequent computational analysis required remains prohibitive for many investigators. To address this need, we have developed a pooled sequencing approach(1,9) and a novel software package(1) for highly accurate rare variant detection from the resulting data. The ability to pool genomes from entire populations of affected individuals and survey the degree of genetic variation at multiple targeted regions in a single sequencing library provides excellent cost and time savings over traditional single-sample sequencing methodology. With a mean sequencing coverage per allele of 25-fold, our custom algorithm, SPLINTER, uses an internal variant calling control strategy to call insertions, deletions and substitutions up to four base pairs in length with high sensitivity and specificity from pools of up to 1 mutant allele in 500 individuals.
Here we describe the method for preparing the pooled sequencing library followed by step-by-step instructions on how to use the SPLINTER package for pooled sequencing analysis (http://www.ibridgenetwork.org/wustl/splinter). We show a comparison between pooled sequencing of 947 individuals, all of whom also underwent genome-wide array, at over 20kb of sequencing per person. Concordance between genotyping of tagged and novel variants called in the pooled sample was excellent. This method can be easily scaled up to any number of genomic loci and any number of individuals. By incorporating the internal positive and negative amplicon controls at ratios that mimic the population under study, the algorithm can be calibrated for optimal performance. This strategy can also be modified for use with hybridization capture or individual-specific barcodes and can be applied to the sequencing of naturally heterogeneous samples, such as tumor DNA.
View details for DOI 10.3791/3943
View details for PubMedID 22760212
Rare missense variants in CHRNB4 are associated with reduced risk of nicotine dependence
HUMAN MOLECULAR GENETICS
2012; 21 (3): 647-655
Pathogenic mutations in APP, PSEN1, PSEN2, MAPT and GRN have previously been linked to familial early onset forms of dementia. Mutation screening in these genes has been performed in either very small series or in single families with late onset AD (LOAD). Similarly, studies in single families have reported mutations in MAPT and GRN associated with clinical AD but no systematic screen of a large dataset has been performed to determine how frequently this occurs. We report sequence data for 439 probands from late-onset AD families with a history of four or more affected individuals. Sixty sequenced individuals (13.7%) carried a novel or pathogenic mutation. Eight pathogenic variants (one each in APP and MAPT, two in PSEN1 and four in GRN), three of which are novel, were found in 14 samples. Thirteen additional variants, present in 23 families, did not segregate with disease, but the frequency of these variants is higher in AD cases than controls, indicating that these variants may also modify risk for disease. The frequency of rare variants in these genes in this series is significantly higher than in the 1000 Genomes Project (p = 5.09 × 10⁻⁵; OR = 2.21; 95%CI = 1.49-3.28) or an unselected population of 12,481 samples (p = 6.82 × 10⁻⁵; OR = 2.19; 95%CI = 1.347-3.26). Rare coding variants in APP, PSEN1 and PSEN2 increase risk for or cause late onset AD. The presence of variants in these genes in LOAD and early-onset AD demonstrates that factors other than the mutation can impact the age at onset and penetrance of at least some variants associated with AD. MAPT and GRN mutations can be found in clinical series of AD most likely due to misdiagnosis. This study clearly demonstrates that rare variants in these genes could explain an important proportion of genetic heritability of AD, which is not detected by GWAS.
View details for DOI 10.1371/journal.pone.0031039
View details for Web of Science ID 000301977500027
View details for PubMedID 22312439
High-throughput discovery of rare insertions and deletions in large cohorts
2010; 20 (12): 1711-1718
Genome-wide association studies have identified common variation in the CHRNA5-CHRNA3-CHRNB4 and CHRNA6-CHRNB3 gene clusters that contribute to nicotine dependence. However, the role of rare variation in risk for nicotine dependence in these nicotinic receptor genes has not been studied. We undertook pooled sequencing of the coding regions and flanking sequence of the CHRNA5, CHRNA3, CHRNB4, CHRNA6 and CHRNB3 genes in African American and European American nicotine-dependent smokers and smokers without symptoms of dependence. Carrier status of individuals harboring rare missense variants at conserved sites in each of these genes was then compared in cases and controls to test for an association with nicotine dependence. Missense variants at conserved residues in CHRNB4 are associated with lower risk for nicotine dependence in African Americans and European Americans (AA P = 0.0025, odds-ratio (OR) = 0.31, 95% confidence-interval (CI) = 0.31-0.72; EA P = 0.023, OR = 0.69, 95% CI = 0.50-0.95). Furthermore, these individuals were found to smoke fewer cigarettes per day than non-carriers (AA P = 6.6 × 10⁻⁵, EA P = 0.021). Given the possibility of stochastic differences in rare allele frequencies between groups, replication of this association is necessary to confirm these findings. The functional effects of the two CHRNB4 variants contributing most to this association (T375I and T91I) and a missense variant in CHRNA3 (R37H) in strong linkage disequilibrium with T91I were examined in vitro. The minor allele of each polymorphism increased cellular response to nicotine (T375I P = 0.01, T91I P = 0.02, R37H P = 0.003), but the largest effect on in vitro receptor activity was seen in the presence of both CHRNB4 T91I and CHRNA3 R37H (P = 2 × 10⁻⁶).
View details for DOI 10.1093/hmg/ddr498
View details for Web of Science ID 000299351000015
View details for PubMedID 22042774
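The odds ratios and 95% confidence intervals reported in the abstract above (DOI 10.1093/hmg/ddr498) summarize carrier counts in cases and controls. The sketch below shows one standard way to compute such a summary from a 2×2 table, using the Wald (Woolf) interval on the log odds ratio. The abstract does not state which method the authors used, so treat this as an assumption; the function name and argument order are hypothetical.

```python
import math

def odds_ratio_ci(carrier_cases, carrier_controls,
                  noncarrier_cases, noncarrier_controls, z=1.96):
    """Odds ratio with a Wald confidence interval from a 2x2 table.

    z=1.96 gives an approximate 95% interval. All four counts must be > 0.
    Returns (odds ratio, lower bound, upper bound).
    """
    a, b = carrier_cases, carrier_controls
    c, d = noncarrier_cases, noncarrier_controls
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper
```

An odds ratio below 1 with an interval that excludes 1, as reported for CHRNB4 carriers, indicates that carriers are under-represented among cases. With the small counts typical of rare variants, an exact method may be preferable to this approximation.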
TATA is a modular component of synthetic promoters
2010; 20 (10): 1391-1397
Pooled-DNA sequencing strategies enable fast, accurate, and cost-effective detection of rare variants, but current approaches are not able to accurately identify short insertions and deletions (indels), despite their pivotal role in genetic disease. Furthermore, the sensitivity and specificity of these methods depend on arbitrary, user-selected significance thresholds, whose optimal values change from experiment to experiment. Here, we present a combined experimental and computational strategy that combines a synthetically engineered DNA library inserted in each run and a new computational approach named SPLINTER that detects and quantifies short indels and substitutions in large pools. SPLINTER integrates information from the synthetic library to select the optimal significance thresholds for every experiment. We show that SPLINTER detects indels (up to 4 bp) and substitutions in large pools with high sensitivity and specificity, accurately quantifies variant frequency (r = 0.999), and compares favorably with existing algorithms for the analysis of pooled sequencing data. We applied our approach to analyze a cohort of 1152 individuals, identifying 48 variants and validating 14 of 14 (100%) predictions by individual genotyping. Thus, our strategy provides a novel and sensitive method that will speed the discovery of novel disease-causing rare variants.
View details for DOI 10.1101/gr.109157.110
View details for Web of Science ID 000284835000010
View details for PubMedID 21041413
Cardiac signaling genes exhibit unexpected sequence diversity in sporadic cardiomyopathy, revealing HSPB7 polymorphisms associated with disease
JOURNAL OF CLINICAL INVESTIGATION
2010; 120 (1): 280-289
The expression of most genes is regulated by multiple transcription factors. The interactions between transcription factors produce complex patterns of gene expression that are not always obvious from the arrangement of cis-regulatory elements in a promoter. One critical element of promoters is the TATA box, the docking site for the RNA polymerase holoenzyme. Using a synthetic promoter system coupled to a thermodynamic model of combinatorial regulation, we analyze the effects of different strength TATA boxes on various aspects of combinatorial cis-regulation. The thermodynamic model explains 75% of the variance in gene expression in synthetic promoter libraries with different strength TATA boxes, suggesting that many of the salient aspects of cis-regulation are captured by the model. Our results demonstrate that the effect of changing the TATA box on gene expression is the same for all synthetic promoters regardless of the arrangement of cis-regulatory sites we studied. Our analysis also showed that in our synthetic system the strength of the RNA polymerase-TATA interaction does not alter the combinatorial interactions between transcription factors, or between transcription factors and RNA polymerase. Finally, we show that although stronger TATA boxes increase expression in a predictable fashion, stronger TATA boxes have very little effect on noise in our synthetic promoters, regardless of the arrangement of cis-regulatory sites. Our results support a modular model of promoter function, where cis-regulatory elements can be mixed and matched (programmed) with outcomes on expression that are predictable based on the rules of simple protein-protein and protein-DNA interactions.
View details for DOI 10.1101/gr.106732.110
View details for Web of Science ID 000282375000009
View details for PubMedID 20627890
The RhoU/Wrch1 Rho GTPase gene is a common transcriptional target of both the gp130/STAT3 and Wnt-1 pathways
2009; 421: 283-292
Sporadic heart failure is thought to have a genetic component, but the contributing genetic events are poorly defined. Here, we used ultra-high-throughput resequencing of pooled DNAs to identify SNPs in 4 biologically relevant cardiac signaling genes, and then examined the association between allelic variants and incidence of sporadic heart failure in 2 large Caucasian populations. Resequencing of DNA pools, each containing DNA from approximately 100 individuals, was rapid, accurate, and highly sensitive for identifying common and rare SNPs; it also had striking advantages in time and cost efficiencies over individual resequencing using conventional Sanger methods. In 2,606 individuals examined, we identified a total of 129 separate SNPs in the 4 cardiac signaling genes, including 23 nonsynonymous SNPs that we believe to be novel. Comparison of allele frequencies between 625 Caucasian nonaffected controls and 1,117 Caucasian individuals with systolic heart failure revealed 12 SNPs in the cardiovascular heat shock protein gene HSPB7 with greater proportional representation in the systolic heart failure group; all 12 SNPs were confirmed in an independent replication study. These SNPs were found to be in tight linkage disequilibrium, likely reflecting a single genetic event, but none altered amino acid sequence. These results establish the power and applicability of pooled resequencing for comparative SNP association analysis of target subgenomes in large populations and identify an association between multiple HSPB7 polymorphisms and heart failure.
View details for DOI 10.1172/JCI39085
View details for Web of Science ID 000273495700031
View details for PubMedID 20038796
Quantification of rare allelic variants from pooled genomic DNA
2009; 6 (4): 263-265
STAT3 (signal transducer and activator of transcription 3) is a transcription factor activated by cytokines, growth factors and oncogenes, whose activity is required for cell survival/proliferation of a wide variety of primary tumours and tumour cell lines. Prominent among its multiple effects on tumour cells is the stimulation of cell migration and metastasis, whose functional mechanisms are however not completely characterized. RhoU/Wrch1 (Wnt-responsive Cdc42 homologue) is an atypical Rho GTPase thought to be constitutively bound to GTP. RhoU was first identified as a Wnt-1-inducible mRNA and subsequently shown to act on the actin cytoskeleton by stimulating filopodia formation and stress fibre dissolution. It was in addition recently shown to localize to focal adhesions and to Src-induced podosomes and enhance cell migration. RhoU overexpression in mammary epithelial cells stimulates quiescent cells to re-enter the cell cycle and morphologically phenocopies Wnt-1-dependent transformation. In the present study we show that Wnt-1-mediated RhoU induction occurs at the transcriptional level. Moreover, we demonstrate that RhoU can also be induced by gp130 cytokines via STAT3, and we identify two functional STAT3-binding sites on the mouse RhoU promoter. RhoU induction by Wnt-1 is independent of beta-catenin, but does not involve STAT3. Rather, it is mediated by the Wnt/planar cell polarity pathway through the activation of JNK (c-Jun N-terminal kinase). Both the so-called non-canonical Wnt pathway and STAT3 are therefore able to induce RhoU, which in turn may be involved in mediating their effects on cell migration.
View details for DOI 10.1042/BJ20090061
View details for Web of Science ID 000268088100015
View details for PubMedID 19397496
Genome-wide discovery of functional transcription factor binding sites by comparative genomics: The case of Stat3
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2009; 106 (13): 5117-5122
We report a targeted, cost-effective method to quantify rare single-nucleotide polymorphisms from pooled human genomic DNA using second-generation sequencing. We pooled DNA from 1,111 individuals and targeted four genes to identify rare germline variants. Our base-calling algorithm, SNPSeeker, derived from large deviation theory, detected single-nucleotide polymorphisms present at frequencies below the raw error rate of the sequencing platform.
View details for DOI 10.1038/NMETH.1307
View details for Web of Science ID 000264738800013
View details for PubMedID 19252504
The identification of direct targets of transcription factors is a key problem in the study of gene regulatory networks. However, the use of high throughput experimental methods, such as ChIP-chip and ChIP-sequencing, is limited by their high cost and strong dependence on cellular type and context. We developed a computational method for the genome-wide identification of functional transcription factor binding sites based on positional weight matrices, comparative genomics, and gene expression profiling. The method was applied to Stat3, a transcription factor playing crucial roles in inflammation, immunity and oncogenesis, and able to induce distinct subsets of target genes in different cell types or conditions. A newly generated positional weight matrix enabled us to assign affinity scores of high specificity, as measured by EMSA competition assays. Phylogenetic conservation with 7 vertebrate species was used to select the binding sites most likely to be functional. Validation was carried out on predicted sites within genes identified as differentially expressed in the presence or absence of Stat3 by microarray analysis. Twelve of the fourteen sites tested were bound by Stat3 in vivo, as assessed by Chromatin Immunoprecipitation, allowing us to identify 9 Stat3 transcriptional targets. Given its high validation rate, and the availability of large transcription factor-dependent gene expression datasets obtained under diverse experimental conditions, our approach appears to be a valid alternative to high-throughput experimental assays for the discovery of novel direct targets of transcription factors.
View details for DOI 10.1073/pnas.0900473106
View details for Web of Science ID 000264790600031
View details for PubMedID 19282476
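The positional weight matrix scoring step mentioned in the abstract above (DOI 10.1073/pnas.0900473106) can be sketched as a log-odds scan over a sequence. This covers only the scoring idea; the comparative genomics and expression filtering steps are not shown. The pseudocount, the uniform background frequency, and the function names are assumptions for illustration and are not taken from the paper.

```python
import math

BASES = "ACGT"

def pwm_from_counts(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into a log2-odds weight matrix.

    counts -- list of dicts, one per motif position, mapping base -> count
    """
    pwm = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def best_site(pwm, sequence):
    """Return (start index, score) of the highest-scoring window, or None."""
    width = len(pwm)
    best = None
    for start in range(len(sequence) - width + 1):
        # Sum the log-odds contribution of each base in the window.
        score = sum(pwm[j][sequence[start + j]] for j in range(width))
        if best is None or score > best[1]:
            best = (start, score)
    return best
```

A fuller implementation would scan both strands and use genome-specific background frequencies. Candidate sites above a score cutoff would then be passed to a conservation filter of the kind the abstract describes.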