Blog and lab updates

Please stay tuned for blog posts and research-related updates from the lab.

By Manuel A. Rivas

We spent some time exploring database options for uploading summary statistics, performing inference, and supporting web querying simultaneously. Below is a summary of performance comparisons. TL;DR: SciDB exceeded our expectations.

Read more here.

By Chris DeBoever and Manuel A. Rivas

Protein truncating variants (PTVs) are genetic variants that are predicted to shorten the protein-coding sequence of genes. PTVs can cause loss of function of the gene that harbors them but may also lead to dominant negative or gain-of-function effects. When PTVs cause loss of function, they provide naturally occurring gene knock-downs or knockouts that offer a great opportunity to explore which genes are important for different diseases. For instance, landmark papers in 2005 and 2006 by Helen Hobbs, Jonathan Cohen, and colleagues found that PTVs in PCSK9 were associated with severely reduced levels of LDL cholesterol and decreased risk of coronary heart disease. This led to the development of a drug that successfully reduces LDL cholesterol levels and the risk of adverse cardiovascular events. Naturally occurring PTVs such as these, which cause loss of function of a gene and are protective against disease, are particularly interesting because they provide in vivo validation of safety and efficacy and may be relatively simple to target with drugs. See the timeline for a review of what we’ve learned about PTVs over the last decade.

Identifying PTV associations with human diseases is a useful way to identify drug targets and gain disease insights. To that end, we have developed an app for the scientific community, as part of the Global Biobank Engine, that evaluates power to detect PTV associations in genome-wide association studies (GWAS) under different study designs and across different populations. Because PTVs may knock down or knock out genes or radically alter protein sequence, they can have strong effects on disease predisposition and will therefore be relatively infrequent due to selection. However, due to human demographic history, we expect PTVs to be at different frequencies in different populations. The advent of large genotyping and sequencing studies such as the UK Biobank and the NHGRI’s Genome Sequencing Program provides a novel opportunity to study PTVs in different populations. Our app estimates the power to detect associations for different genes based on empirical estimates, from gnomAD and the UK Biobank, of the collapsed allele frequency of PTVs in those genes in different populations. We anticipate that this app will be useful for study design, and we look forward to improving it with input from the community and as additional reference data resources become available.
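To give a feel for the kind of calculation involved, here is a minimal sketch of a power approximation for a case-control test on a collapsed PTV allele frequency. It is not the app's implementation: the function name `ptv_power`, the allelic odds-ratio model, the normal approximation to a two-proportion test, and the example numbers are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def ptv_power(caf, odds_ratio, n_cases, n_controls, alpha=5e-8):
    """Approximate power of a two-sided allelic test (normal approximation).

    caf        -- collapsed PTV allele frequency in controls (assumed input)
    odds_ratio -- assumed allelic odds ratio for disease
    alpha      -- significance threshold (default: a common genome-wide level)
    """
    norm = NormalDist()
    p0 = caf
    # Allele frequency in cases implied by the allelic odds ratio.
    odds1 = odds_ratio * p0 / (1 - p0)
    p1 = odds1 / (1 + odds1)
    # Each individual contributes two alleles.
    m1, m0 = 2 * n_cases, 2 * n_controls
    p_bar = (m1 * p1 + m0 * p0) / (m1 + m0)
    se_null = sqrt(p_bar * (1 - p_bar) * (1 / m1 + 1 / m0))
    se_alt = sqrt(p1 * (1 - p1) / m1 + p0 * (1 - p0) / m0)
    z_crit = norm.inv_cdf(1 - alpha / 2)
    diff = abs(p1 - p0)
    # Probability that the test statistic lands in either rejection tail.
    upper = 1 - norm.cdf((z_crit * se_null - diff) / se_alt)
    lower = norm.cdf((-z_crit * se_null - diff) / se_alt)
    return upper + lower


# Hypothetical scenario: protective PTVs (OR 0.5) with collapsed frequency 0.5%.
small_study = ptv_power(0.005, 0.5, n_cases=5_000, n_controls=5_000)
large_study = ptv_power(0.005, 0.5, n_cases=50_000, n_controls=50_000)
```

Under these assumptions, power depends strongly on both the collapsed allele frequency and the sample size, which is why population-specific frequency estimates matter for study design.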

We are also excited to announce some new resources for studying PTVs in the UK Biobank, including browsing association results for PTVs on the Global Biobank Engine (with support for inspecting PTV cluster plots); PTV annotations and filters for the PTVs included on the UK Biobank array (see here); and a new preprint offering a first look at PTV associations in the biobank (see here). Stay tuned for more features, detailed methods descriptions, and results in the near future! Please drop by our gitter or tweet us (@manuelrivascruz, @cdeboever3, @yk_tani, @Vashishtrv) with your suggestions. Please follow us with the hashtags #biobankengine and #globalbiobankengine.

Mosaic Mutations in Blood DNA Sequence Are Associated with Solid Tumor Cancers

By Mykyta Artomov and Manuel A. Rivas

On “Mosaic Mutations in Blood DNA Sequence Are Associated with Solid Tumor Cancers”, Mykyta Artomov, Manuel A. Rivas, Giulio Genovese, Mark J. Daly. Genomic Medicine, In Press.

Traditionally, DNA samples for genetic studies are derived from blood. Only white blood cell lineages contribute DNA to the sample and, given the structure of hematopoiesis in humans, all of them originate from hematopoietic stem cell (HPSC) progenitors (Fig. 1).



Figure 1. Scheme of human hematopoiesis.

Sporadic, age-related, or environmentally induced mutations in long-term HPSCs give rise to cell sub-populations with a mutated genotype [1]. Such non-inherited mutations, which are found in only some sub-populations of cells, are called mosaic. The emergence of these new sub-populations is called clonal expansion and has recently been causally linked to the development of leukemia (specifically with mutations in four genes: PPM1D, TET2, DNMT3A, ASXL1) [2]. Furthermore, analysis of mosaic mutations in the blood of breast and ovarian cancer patients demonstrated that mosaic protein-truncating variants (PTVs) in PPM1D are strongly associated with cancer status [3]. Interestingly, the vast majority of PTVs clustered in the last exon of this gene, and functional credentialing confirmed a “gain-of-function” effect of such mutations, resulting in suppression of p53 activity due to increased functionality of mutated PPM1D.

As a side note, the PPM1D PTVs in Ruark et al. were mosaic variants detected in a pooled sequencing study design by Syzygy, a slow program that nonetheless usually exhibited impressive performance (see here and here and here). Syzygy was developed by Mark and me (MR) during the emergence of "next-generation" sequencing technologies, when we were looking at 36 base-pair single-end reads with pretty high error rates. It was a pretty active time at the Broad Institute, where a group of us, informally called "The Math Team", spent a good deal of time modeling and analyzing errors, which led to the popular Genome Analysis Toolkit (GATK) (see here and here and here for an early look at the data). It was an awesome time with a common shared mission across the field, i.e. "how do we interpret this noisy data?" :-)

OK, now back to the mosaicism story. Coherent observations in blood and solid-tumor cancer types prompt questions about how far such observations generalize across cancer types. We used TCGA exome sequencing (~8,000 samples) to compare 22 different cancer phenotypes with more than 6,000 controls in a case-control study design and demonstrated that mosaic protein truncating variants in these genes are also associated with solid-tumor cancers.

Since our controls were, on average, roughly 10 years younger than the cancer cohort, and age has been shown to be a strong predictor of the existence of somatic mosaic mutations, including age in the association model is critical. As expected, samples from older individuals have a higher probability of carrying a mosaic variant (Fig. 2).

Figure 2. Probability of observing a mosaic protein truncating variant with respect to age at DNA sampling.
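To see why the age adjustment matters, consider the following toy sketch. It uses a stratified Mantel-Haenszel odds ratio as a simpler stand-in for including age as a covariate in the association model; it is not the model used in the study, and the counts are invented purely for illustration (they do not come from TCGA or any real cohort).

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for one 2x2 table.

    a, b -- carrier / non-carrier counts among cases
    c, d -- carrier / non-carrier counts among controls
    """
    return (a * d) / (b * c)


def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio over a list of (a, b, c, d) tables."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den


# Hypothetical counts per age stratum:
# (carrier cases, non-carrier cases, carrier controls, non-carrier controls)
strata = [
    (1, 99, 9, 891),     # younger: few cases, 1% carriers in both groups
    (90, 810, 10, 90),   # older: many cases, 10% carriers in both groups
]

# Crude analysis ignores age by summing the tables column by column.
crude_or = odds_ratio(*[sum(column) for column in zip(*strata)])
# Stratified analysis compares cases and controls within each age group.
adjusted_or = mantel_haenszel_or(strata)
```

In this made-up example, carriers are no more common among cases than among controls within either age group, yet the crude odds ratio is far above 1 simply because cases are older. The stratified estimate removes that artifact, which is the same concern that motivates adjusting for age.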

We looked into the tumor DNA of the mosaic PTV carriers and, consistent with Ruark et al. [3], did not find evidence that the blood-detected variants were present in the tumor. Thus, the classical definition of cancer driver mutations is not applicable in this case.

We followed up on the “gain-of-function” hypothesis for mosaic PTVs and observed the same enrichment in our dataset: of 18 mosaic PTVs in PPM1D, 17 were in the last exon of the gene. ASXL1 follows the same pattern as PPM1D: 35 out of 40 PTVs in this gene are found in the last exon. TET2 shows strong enrichment in exon 3, with 44 out of 50 PTVs. This is intriguing because TET2 transcript ENST00000305737 has 3 exons, so this is again enrichment in the last exon. Moreover, this transcript is mostly expressed in whole blood and EBV-transformed lymphocytes according to the GTEx database. DNMT3A has no known pattern of mosaic PTV distribution within the gene. Genovese et al. reported enrichment of mosaic missense mutations in the last exons of DNMT3A in leukemia cases. We observed similar enrichment in exons 17-23. However, no further studies are available to confirm whether missense mutations in this region also have a “gain-of-function” effect similar to the other candidate genes (Fig. 3).

Figure 3. Exon specificity of mosaic protein truncating variants.

A key question raised in the community about the association of mosaic mutations with cancer is whether such events are precursors of the disease or a result of disease onset (or disease treatment). This question is testable; however, there are several pitfalls that cannot be overcome at this moment. No cohorts are available in which DNA from cancer patients was collected both before and after treatment. Given how rarely we observe mosaic mutations, making a conclusive statement will require a large cohort of such samples, which is unlikely to become available in the near future. Alternatively, very large cohorts with long-term clinical follow-up after DNA sample collection could be a solution. The two largest such cohorts available to date are FINRISK and the Swedish Biobank, and we attempted to use both in our study. The Swedish Biobank data replicate our observation: people with a prior history of solid-tumor cancer have a higher burden of mosaic mutations in the candidate genes. However, because the cohort is not ascertained for cancer and contains a substantial number of middle-aged samples (too young to have mosaic mutations), we did not have enough statistical power to differentiate pre- and post-treatment conditions.

It is hard to reach conclusive results with the available clinical datasets at this point because previous reports conflict: clinical studies suggest a strong relevance of chemotherapy to the mosaic mutation burden [4,5], whereas genetic analysis of GWAS data shows no association between mosaic events and cancer treatment regimens [6].

While the etiology of cancers is broadly similar, each specific cancer type still has contributions from disease-specific biological factors. If so, we would expect a non-random rate of mosaic PTVs across cancer types in each of the four candidate genes. We first evaluated the burden of mosaic mutations in each cancer type compared to the expected binomial distribution (Fig. 4A), then looked into gene specificity (Fig. 4B). Mosaic PTVs are very rare, so we are limited by statistical power in this analysis. However, we do see that the burden in ovarian cancer is specific to PPM1D, as observed earlier, and that the cutaneous melanoma cohort has a major contribution from ASXL1, a known member of the BAP1 complex recurrently mutated in melanoma tumors.

Figure 4. Testing for unusual burden of mosaic protein truncating variants. (A) Empirical significance of burden observed in all genes. (B) Empirical significance of burden observed in individual genes.
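One way to read "compared to the expected binomial distribution" is as an exact one-sided binomial test of the carrier count in a single cancer type against the pooled carrier rate. The sketch below illustrates that reading only; it is not the procedure used for Fig. 4 (which reports empirical significance), and the function names and example counts are hypothetical.

```python
from math import exp, lgamma, log


def binomial_pmf(i, n, p):
    """P(X = i) for X ~ Binomial(n, p) with 0 < p < 1, computed in log space."""
    log_pmf = (
        lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
        + i * log(p) + (n - i) * log(1 - p)
    )
    return exp(log_pmf)


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))


def burden_pvalue(carriers_in_type, samples_in_type, carriers_total, samples_total):
    """One-sided p-value for an excess of carriers in one cancer type.

    The null carrier rate is the pooled rate across all cancer types.
    """
    p_null = carriers_total / samples_total
    return binomial_upper_tail(carriers_in_type, samples_in_type, p_null)


# Hypothetical numbers: 5 carriers among 100 samples of one cancer type,
# against 20 carriers among 4,000 samples overall.
p_excess = burden_pvalue(5, 100, 20, 4000)
```

Because mosaic PTVs are rare, the counts per cancer type are small, so an exact tail probability is a more natural choice here than a normal approximation.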

The observed gene specificity of clonal expansion events in different cancer types is likely to be driven by disease-specific biological pathways, yet to be discovered, that link mosaic mutation burden and cancer, and we look forward to continuing research in this direction.


1. Morrison, S. J. & Weissman, I. L. The long-term repopulating subset of hematopoietic stem cells is deterministic and isolatable by phenotype. Immunity 1, 661–73 (1994).

2. Genovese, G. et al. Clonal Hematopoiesis and Blood-Cancer Risk Inferred from Blood DNA Sequence. N. Engl. J. Med. 371, 2477–2487 (2014).

3. Ruark, E. et al. Mosaic PPM1D mutations are associated with predisposition to breast and ovarian cancer. Nature 493, 406–410 (2012).

4. Pharoah, P. D. P. et al. PPM1D Mosaic Truncating Variants in Ovarian Cancer Cases May Be Treatment-Related Somatic Mutations. J. Natl. Cancer Inst. 108, djv347 (2016).

5. Swisher, E. M. et al. Somatic Mosaic Mutations in PPM1D and TP53 in the Blood of Women With Ovarian Carcinoma. JAMA Oncol. 2, 370 (2016).

6. Jacobs, K. B. et al. Detectable clonal mosaicism and its relationship to aging and cancer. Nat. Genet. 44, 651–658 (2012).



Using the Global Biobank Engine: Case Study I (Asthma genetics)

By Greg McInnes and Manuel Rivas

Asthma is a chronic inflammatory disease of the airways that affects about 25 million people in the United States. The CDC observed an increase in asthma prevalence among Americans from 7.3% to 8.4% between 2001 and 2010 [1]. It is a disease that can be lifelong, and it can be life-threatening if not properly treated or if an individual with asthma gets sick with another disease, such as the flu.

Over the last decade, numerous genome-wide association studies (GWAS) have been performed on individuals with asthma, and many genes have been implicated [2]. All sorts of interesting studies have been done, including a GWAS testing for wheeze phenotypes [3]. However, to date much of the heritability of asthma has not been characterized, and the culprit genes or variants have not been properly identified. This has been a major challenge with GWAS in general, not just with this particular disease. One factor that has likely played a role is that genotyping chips used for GWAS have mainly tested common variants, where LD patterns make it difficult to fine-map to a single causal gene or variant.

Recently, several large-scale biobank initiatives have begun making data available to researchers that may help us explore these hard-to-study rare variants. One such biobank is the UK Biobank, which offers genotyping data for roughly half a million individuals with rich phenotype data. This resource offers a unique opportunity to study common disorders from the perspective of rare variants because the genotyping was done using the Affymetrix Biobank Array (which includes many rare, coding variants). Using the Global Biobank Engine (GBE) we can explore the data to get a sense of what is there.

In this post, we will use GBE to dive into an asthma GWAS using the UK Biobank data and see if we're able to uncover any new associations that have not been observed before.

Asthma genome-wide association study (GWAS)

The Global Biobank Engine offers the capability to explore genetic associations across numerous phenotypes that we have manually grouped from the biobank participants. Individual variants, genes, and phenotypes can all be explored from the search bar, but since we're interested in asthma, we'll navigate to the custom asthma grouping by clicking the `Browseable Phenotypes` button below the search bar and selecting `Common Diseases -> Asthma`. This will redirect us to a Manhattan plot of the GWAS results from the 10,608 asthma cases identified in the UK Biobank cohort. We can immediately see that there are several peaks in the Manhattan plot.


The obvious place to start investigating is that really high peak on chromosome 6. Clicking directly on the variant takes us to the variant page, where we can see more information about this particular variant.

Initial Protein truncating variant (PTV) browsing

An activity that is of interest to the Rivas Lab is the identification of predicted protein-truncating genetic variants (PTVs) with protective effects (analogous to the p.R179X PTV in RNF186 which confers protection against ulcerative colitis). From the Manhattan page we can click on the 'PTV' button to include PTVs only in the table. 



Now here's something interesting: we see an asthma-associated variant with a p-value of 2.4×10⁻⁵, an odds ratio of 0.55, and an alternate allele frequency of 0.0037 in ExAC. This suggests that rs146597587 is a rare variant, significantly associated with asthma, that confers a protective effect to carriers. Clicking the link to the variant page, we can find links to ClinVar, ExAC, UCSC, dbSNP, and the IBD browser. I'm interested in the population frequency for this variant, so let's go take a look at the ExAC browser.



Although IL33 has previously been associated with asthma, it appears we may have identified a rare variant in a splice acceptor region that confers some protective effect to carriers. While we were exploring the UK Biobank data we were made aware of a wonderful study by DeCODE Genetics (which did not include UK Biobank) reporting an association of rs146597587-C with lower eosinophil counts (β = -0.21 SD, P = 2.5×10⁻¹⁶, N = 103,104) and reduced risk of asthma (OR = 0.47; 95% CI: 0.32, 0.70; P = 1.8×10⁻⁴; N cases = 6,465, N controls = 302,977). Based on the genetic evidence from these two studies, it appears we have a strong replication signal for the protective effect of a PTV in IL33 against asthma. Interestingly, a search for 'IL33 inhibitors pharmaceutical companies' returns results indicating that a few pharmaceutical companies are actively interested in inhibiting IL33 for the treatment of asthma and COPD.
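As a quick sanity check on summary statistics like these, one can back out the standard error of the log odds ratio from the confidence interval and recompute the p-value. The sketch below assumes a Wald-type interval that is symmetric on the log scale, which is our assumption and not something stated in the report; the function name is hypothetical.

```python
from math import log
from statistics import NormalDist


def p_from_or_ci(odds_ratio, ci_low, ci_high, level=0.95):
    """Two-sided p-value implied by an odds ratio and its confidence interval.

    Assumes a Wald interval: log(OR) +/- z * SE.
    """
    norm = NormalDist()
    z_level = norm.inv_cdf(0.5 + level / 2)             # about 1.96 for 95%
    se = (log(ci_high) - log(ci_low)) / (2 * z_level)   # SE of log(OR)
    z = log(odds_ratio) / se
    return 2 * norm.cdf(-abs(z))


# Asthma association values for rs146597587-C as quoted in the text.
implied_p = p_from_or_ci(0.47, 0.32, 0.70)
```

Under this assumption the implied p-value is on the order of 10⁻⁴, in line with the reported value; small differences are expected from rounding of the published interval.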

By Manuel Rivas

The Rivas Lab is happy to release the Global Biobank Engine, which will be continuously updated with new inference methods from the lab.


The aim of GBE is to provide researchers the ability to explore the effect of genetic variation across multiple diseases, analogous to what was done by the ExAC consortium with variant frequency data across multiple populations, and to what is shaping up to be impressive sharing of resources by the inflammatory bowel disease community (IBD exomes browser, IBD fine-mapping browser), the type 2 diabetes genetics community (T2D portal), and the Genotype-Tissue Expression project (GTEx portal).


The strategy for methods development, data analysis, and publication in the lab will evolve as we bring more members on board. In general, we will provide an initial pre-alpha stage rollout of methods, tools, and results. Then, we will post on bioRxiv. Finally, we will submit to a peer-reviewed journal. Our hope is that the community will be able to engage with us across a broad range of activities.

Why "Engine"?

The initial release of GBE supports only browser-like features. In the next version of the release (anticipated in the next couple of months) we will have support for researchers to interactively explore the summary statistic data by applying inference methods currently under development.

Demo exploration