Global science consortium hails completion of massive $288 million effort to create genomic encyclopedia

- By Krista Conger

SEPT. 5, 2012

The completion of the human genome sequencing project in 2003 revealed what many scientists already knew to be true: more than 90 percent of the newly sequenced DNA had no known function. In fact, only about 1.5 percent encodes instructions for proteins that do the work of the cell.

Now a five-year collaboration of more than 440 scientists in 32 labs around the world has pulled back the curtain on this so-called “junk DNA” to reveal a complex interplay among regulatory regions, proteins and RNA molecules that governs when and how genes are expressed.

“Until now, we had only the nucleotide sequence,” said Stanford geneticist Michael Snyder, PhD, one of the leaders of the mammoth effort, which was funded and coordinated by the National Human Genome Research Institute. “Now we have the beginnings of a regulatory network, or wiring diagram, for a human being. This global overview will help us understand how changes in the genome cause disease, and also to see how an individual’s unique genetic code may affect his or her health in meaningful ways.” In fact, more than 80 percent of so-called “junk DNA” was found to have some type of biological function.

Annotating these regions will not only help researchers map and understand the function of disease-causing variants, but will also significantly advance the ability of researchers and clinicians to interpret the whole-genome sequences of individuals. Earlier this year, Snyder made headlines with a study in which he used his whole-genome sequence and many other biological measurements to predict that he would develop type-2 diabetes.

The completion of the project, known as the Encyclopedia of DNA Elements, or ENCODE, was announced Sept. 5 with the simultaneous publication of 30 papers in three journals: Nature, Genome Biology and Genome Research. Six review articles will also appear in the Journal of Biological Chemistry, as well as other, affiliated papers in Science, Cell and other journals.

In particular, Snyder and his Stanford colleagues published a number of papers, including two that describe the creation and use of a database that integrates ENCODE and other DNA data with genetic variations known or suspected to be associated with specific diseases or health conditions. The database, called RegulomeDB, allows researchers to quickly identify which of several potential culprits is likely to be the cause of a disease based on their predicted function as outlined by ENCODE. The method not only identifies the variation, but also suggests a functional hypothesis for researchers to test. (An example would be, “The binding of this protein is likely to be affected by this change in the DNA sequence; therefore this protein may be important in this disease.”)

“Until now, everyone has just been looking under the proverbial lamppost for the causes of disease,” Snyder, a senior member of the ENCODE Consortium and chair of genetics at the Stanford University School of Medicine, said of the focus on well-studied protein-coding regions. “But 85 percent of variants identified through genome-wide association studies, or GWAS, lie outside these regions. Now we can greatly expand our studies to the rest of the genome.”

Although nearly every cell in the body contains an identical complement of genes in the form of DNA, cells differ in how they use that DNA. In particular, no cell expresses, or makes, every one of the 20,000 possible proteins encoded by the entire genome: Some cells express proteins necessary to make and extrude hair, while others become bone, muscle or nerve cells. The key to these differences lies in the control sequences (the switches that turn protein production on and off) that surround and envelop the protein-coding regions.

This is, of course, not the first time researchers have looked outside of protein-coding sequences to understand biology and disease. On a case-by-case basis, they’ve been doing this — and making meaningful discoveries — for years. Since the completion of the Human Genome Project, we’ve been learning about epigenetics (a term describing how modifications to DNA or its packaging proteins affect gene expression) and about a dizzying variety of new types of RNA molecules (from tiny microRNAs, to long non-coding RNA) that regulate genes. But these advances have been made on a piecemeal basis by individual groups of scientists.

Now, the ENCODE project has incorporated multiple types of data from across the genome to create a comprehensive, multipurpose reference source for scientists. As its name indicates, it’s an encyclopedia of information that will advance research in an untold number of areas. Researchers can now simply look up their region of interest in the genome to find, among other things, whether it contains protein-binding sites, areas that affect DNA structure, or regions that code for proteins or regulatory RNA molecules.

ENCODE scientists incorporated more than 1,600 individual experiments in 147 cell types. Their results paint a picture of a genome that, rather than existing as a static sequence of nucleotides, is instead as complex and as busy as our nation’s largest airports. Casual observers will notice first the airplanes taking off and landing, but a closer inspection reveals the planes’ activities are dependent on myriad other workers, from air traffic controllers, to ticketing agents, baggage cart drivers and all the other support systems that keep things running smoothly.

ENCODE researchers mapped more than 4 million binding sites throughout the genome. In the process, they discovered that more than 80 percent of so-called “junk DNA” has some type of function: serving as binding sites for transcription factors or other regulatory molecules, acting as locations for DNA modifications, or playing a role in managing chromatin structure.

An effort to understand some of these behind-the-scenes functions in the human genome began in early 2001 in Snyder’s lab, then at Yale University, using microarray technology he developed in collaboration with the laboratory of Stanford biochemist Patrick Brown, MD, PhD. “We began using these techniques to map regulatory regions on a large scale, first in yeast, and then in humans,” said Snyder. “Now we can understand the regulatory code, we can map genetic variants and begin to assign functional meaning to them, and, as whole-genome sequencing becomes more common, we can use this information to annotate our personal genomes.”

ENCODE was launched as a pilot project in 2003, and as a full-scale effort in 2007. The NHGRI has invested approximately $288 million in the project and in ENCODE-related technology development and model organism research since 2003.

The ENCODE Consortium placed the resulting data sets — as soon as they were verified for accuracy, prior to publication — in several databases that can be freely accessed on the Internet. These data sets can be accessed through the ENCODE project portal (www.encodeproject.org) as well as at the University of California-Santa Cruz genome browser (http://genome.ucsc.edu/ENCODE/), the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/geo/info/ENCODE.html/) and the European Bioinformatics Institute (http://useast.ensembl.org/Homo_sapiens/encode.html).

“The ENCODE catalog is like Google Maps for the human genome,” Elise Feingold, PhD, an NHGRI program director who helped start the ENCODE Project, said in a press release by the institute. “Simply by selecting the magnification in Google Maps, you can see countries, states, cities, streets, even individual intersections, and by selecting different features, you can get directions, see street names and photos, and get information about traffic and even weather. The ENCODE maps allow researchers to inspect the chromosomes, genes, functional elements and individual nucleotides in the human genome in much the same way.”

According to the press release from the NHGRI, “The ENCODE data are so complex that the three journals have developed a pioneering way to present the information in an integrated form that they call ‘threads.’ Since the same topics were addressed in different ways in different papers, the new website, http://www.nature.com/encode/, will allow anyone to follow a topic through all of the papers in the ENCODE publication set in which it appears, by clicking on the relevant ‘thread’ at the Nature ENCODE explorer page. For example, thread No. 1 compiles figures, tables, and text relevant to genetic variation and disease from several papers and displays them all on one page. ENCODE scientists believe this will illuminate many biological themes emerging from the analyses.”

In addition to the primary paper in Nature summarizing the findings, of which Snyder is one of several hundred authors, Stanford postdoctoral scholar Alan Boyle, PhD, and graduate student Marc Schaub are the first authors of two of the 30 total papers published about the project on Sept. 5. Their papers describe the creation and use of the database RegulomeDB, which integrates ENCODE and other DNA data with genetic variations known or suspected to be associated with specific diseases or health conditions.
