Genomics Archives

February 21, 2008

Top-down mapping of gene regulatory pathways

In a very recent lecture (see full video from NIH VideoCasting) given for the NIH Systems Biology Special Interest Group, Trey Ideker presents a great overview of the strategies his group has been developing in recent years to integrate multiple types of large-scale datasets. While one of the most pervasive 'memes' about high-throughput measurements is that they are "notoriously unreliable" (see Hakes et al, 2008, for a recent example), Trey beautifully illustrates how predictive computational models and novel biological insights can be generated by sophisticated data integration strategies. Three types of applications are presented in his talk:

  1. mapping of transcriptional response pathways
  2. functional mapping of protein complexes
  3. disease diagnosis and stratification

In the last section, Trey presents the study recently published in Molecular Systems Biology (Chuang et al, 2007, video: 00hr:39min:15sec), in which microarray expression profiles are superimposed on a protein-protein physical interaction network to identify 'subnetwork' biomarkers that classify metastatic vs non-metastatic breast tumors.
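As a rough illustration of the idea behind 'subnetwork' biomarkers, the sketch below scores a candidate subnetwork by aggregating the z-scored expression of its member genes in each sample, then grows the subnetwork greedily along interaction edges for as long as the separation between tumor classes improves. The gene names, expression values, interactions and the t-like separation score are all invented for illustration; Chuang et al used their own scoring and significance tests, which this toy example does not attempt to reproduce.

```python
import math
from statistics import mean, stdev

# Hypothetical z-scored expression: gene -> one value per tumor sample.
expression = {
    "G1": [1.2, 0.9, 1.1, -1.0, -0.8, -1.3],
    "G2": [0.8, 1.1, 0.7, -0.9, -1.2, -0.6],
    "G3": [0.1, -0.2, 0.3, 0.0, 0.2, -0.1],
    "G4": [0.9, 0.6, 1.0, -0.7, -1.1, -0.9],
}
# Hypothetical class labels: 1 = metastatic, 0 = non-metastatic.
labels = [1, 1, 1, 0, 0, 0]
# Hypothetical undirected physical interactions.
interactions = {
    "G1": {"G2", "G3"},
    "G2": {"G1", "G4"},
    "G3": {"G1"},
    "G4": {"G2"},
}

def activity(genes):
    """Per-sample subnetwork activity: summed z-scores scaled by sqrt(size)."""
    k = len(genes)
    return [
        sum(expression[g][j] for g in genes) / math.sqrt(k)
        for j in range(len(labels))
    ]

def separation(genes):
    """Absolute Welch-style t statistic of the activity between the two classes."""
    act = activity(genes)
    a = [x for x, y in zip(act, labels) if y == 1]
    b = [x for x, y in zip(act, labels) if y == 0]
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return abs(mean(a) - mean(b)) / se

def grow(seed):
    """Greedily add the interacting neighbor that most improves separation."""
    sub = {seed}
    best = separation(sub)
    while True:
        neighbors = set().union(*(interactions[g] for g in sub)) - sub
        scored = [(separation(sub | {n}), n) for n in sorted(neighbors)]
        if not scored or max(scored)[0] <= best:
            return sub, best
        best, added = max(scored)
        sub.add(added)

sub, score = grow("G1")
print(sorted(sub))  # ['G1', 'G2']
```

In this toy dataset the uninformative gene G3 is never added, and G4 is left out because adding it does not improve the separation already achieved by G1 and G2.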

November 20, 2007

Personal genomics for a fistful of dollars

The wave of personal genomics is progressing rapidly. A string of four papers appeared recently (Porreca et al, 2007, Albert et al, 2007, Okou et al, 2007, Hodges et al, 2007) reporting on microarray-based technologies that enable the enrichment of selected genomic fragments in a single massively multiplexed reaction, thus greatly facilitating subsequent resequencing of pre-defined portions of the human genome (eg all coding exons). These technologies are expected to dramatically reduce the cost of targeted resequencing of individual genomes.

On the commercial front, deCODE and 23andMe have launched personal genome services offering genome-wide SNP profiling for a little less than $1,000 (NYT articles: Nicholas Wade, Amy Harmon, or Wired, ScienceRoll, Sandra, DNA and You).

The chips used by 23andMe are the "Illumina HumanHap550+ BeadChip, which reads more than 550,000 SNPs (single nucleotide polymorphisms) plus a 23andMe custom-designed set that analyzes more than 30,000 additional SNPs." The profile provided by deCODEme includes "over one million variants across the genome."

So what do you think?

October 26, 2007

The broken double helix

Contrary to what Charlotte Hunt-Grubbe predicted in her interview published in the Sunday Times (The elementary DNA of Dr Watson), James Watson did not "enthusiastically counter the inevitable criticisms" that arose from his unacceptable comments on racial differences in intelligence. After being suspended, he apologized and finally resigned yesterday as Chancellor of Cold Spring Harbor Laboratory (Watson's statement).

It is striking to observe that these very sad events occur in the current context of a literal explosion of studies in human genetics and genomics. Watson's and Venter's personal genome sequences were released only a few months ago, while an uninterrupted stream of new genome-wide association studies is being published. If we just consider some of the papers that appeared in Nature and Nature Genetics in the last few weeks, we see an impressive concentration of genome-wide studies on human genetic variation, addressing the genetic basis of highly visible phenotypes like skin, eye, and hair color, the impact of geographical location, revealing evidence of positive selection and analyzing heritability of gene expression in human populations.

The extraordinary development of the field of human genomics will inevitably lead to important questions on the social and ethical implications of this research. If anything, "Watson's folly" might be a warning that we may expect to see more confrontations between racist ideologies and scientific discoveries in the future. Beyond the issues surrounding ethnicity, one can also anticipate that tense debates will arise over how to define the line that separates "patient stratification" from mere genetic "discrimination" of human beings.

A cardinal value in Science, perhaps even above openness, is the capacity for critical reasoning. This implies rigor and depth, with very little place for unsubstantiated provocations. In this regard, I disagree with PZ Myers (Pharyngula) when he writes that the prompt decision of CSHL to suspend Watson appears as a "declaration that their director must be an inoffensive, mealy-mouthed mumbler who never challenges (even stupidly)". I do hope that there is an alternative to inoffensiveness, but debates on these very sensitive issues and at this level of responsibility and visibility require the highest scientific and ethical standards, and we should definitely expect much more from our prestigious leaders than being "challenging" just by making outrageous statements...

Note: publication of this post was unfortunately delayed due to technical problems

(Illustration drawn after http://www.nlm.nih.gov/visibleproofs/media/detailed/vi_a_206.jpg)

September 7, 2007

How do we get from the Jimome & Craigome to systems biology?

by George M Church, live from the 9th International Meeting on Human Genome Variation and Complex Genome Analysis, Sep 6-8, 2007 in Barcelona.

Although Jim Watson's genome hasn't been through peer review yet, and Craig Venter’s genome doesn’t have a slick web browser like Jim’s genome yet, we’ve seen enough to ask – what next? Someone at the meeting today got some laughs accidentally when they said that they were comparing Craig’s genome to the human genome. Clearly this is a time requiring great caution. So our first question is: where are we with these first two complete diploid genomes? Well, they’re neither complete nor the first. The Craigome has over 4500 gaps (a bit more than the 341 gaps in the haploid 2004 HGP genome). The first human diploid sequence nod goes to the 269 HapMap genomes published in Oct 2005. Nevertheless we now have the first two non-anonymous personal genomes (hopefully millions someday). Oh, and what is it with the press-release claim that our genomes have higher variation than previously thought? The 0.5% variation observed includes a near-perfect fit to the long-known 0.06% SNP frequency, a 0.08% frequency of smaller indels about twice that seen in 330 genes from Seattle studies, and the remainder being copy-number variants (CNVs), 87% of which have been described previously. Just like the number of genes in the genome in 2001, the beauty and the news is in the details, not in the summary stats.

We can get from genome variations to systems biology “with focused population association studies, animal models, and functional genomics on the cells from the subjects” (Church 2005). To do genome-wide association studies (GWAS), we must ask where the technology costs are leading. Given the drop in price between the arrivals of the two genomes in the NCBI Personal Genomics directory -- Craig on June 27 at a cost of $70M, and Jim nine days later at a cost of $1M -- an over-zealous extrapolationist might be disappointed that the $1K genome did not arrive on July 25. Seriously now, the point is that neither study is inexpensive enough to scale to genome-wide association studies. SNP-chips at $250 each are scalable, but tend to miss new and/or rare SNPs and small indels. Next-generation sequencing and short-read-pairs (Shendure et al 2005) may bring down costs by a factor of 10. Read-pairs seem ripe to become the method of choice for CNVs, smaller indels, and even inversions. Enrichment by hybridization for at least one read to be in an exon or cis-regulatory site might bring costs down another factor of 50. Even if these GWAS studies efficiently get us beyond “linked alleles” to “causative alleles”, they will generate gloriously more hypotheses than they test.
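To make the cost reasoning above concrete, here is a back-of-the-envelope sketch of the two projected reductions. It assumes, purely for illustration, that both factors apply to the roughly $1M figure quoted for Jim Watson's genome; the post itself does not state which baseline the factors refer to.

```python
# Assumed baseline: the ~$1M figure quoted above for Jim Watson's genome.
baseline_usd = 1_000_000
read_pair_factor = 10    # "a factor of 10" from next-generation read-pairs
enrichment_factor = 50   # "another factor of 50" from hybridization enrichment
snp_chip_usd = 250       # SNP-chip price quoted above

after_read_pairs = baseline_usd / read_pair_factor
after_enrichment = after_read_pairs / enrichment_factor

print(f"after read-pairs:  ${after_read_pairs:,.0f}")   # $100,000
print(f"after enrichment:  ${after_enrichment:,.0f}")   # $2,000
print(f"ratio to SNP chip: {after_enrichment / snp_chip_usd:.0f}x")  # 8x
```

Under that assumption, the combined 500-fold reduction would still leave targeted sequencing several times more expensive per subject than a SNP chip.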

So, back to the other routes to systems biology, animal or cell models could be made to test the 4 million variants per genome (and combinations; oh my!) -- clearly indicating a need for automated homologous recombination methods and/or prioritization of these tests using the third route to systems biology -- “personal functional genomics”. Unlike the HapMap genomes, the Jimome and Craigome are not yet accompanied by extensive phenotypic trait data, nor by any cell lines with which to collect such data. SNPs and CNVs that affect RNA levels have been elegantly mapped by Spielman et al. 2007 and Stranger et al. 2007. Most effects map close to the transcription start sites. Assaying RNA by these standard assays or next-generation sequencing (Kim et al. 2007) from individuals enables comparisons of the sum of the two allelic expression levels from the two types of homozygotes (AA & aa) and the heterozygote (Aa) in a variety of different genetic backgrounds and cell-states. In contrast, genome-wide, allele-specific RNA assays would measure the expression from each haplotype in a heterozygote under what is the most ideally identical background state arrangeable. The missing technology is one to gain access to all human tissues (since the list of volunteers for brain biopsies is short). This is yet another reason that we will be watching for methods to derive pluripotent stem cells from adult human tissues. Personal functional genomics assays on such personal cell lines are likely to arrive much earlier than (indeed pave the way for) therapeutic applications.


Church GM (2005) The Personal Genome Project. Mol Syst Biol 1:2005.0030

IHGSC et al. (2004) Finishing the euchromatic sequence of the human genome. Nature 431:931-945.

International HapMap Consortium. (2005) A haplotype map of the human genome. Nature 437(7063):1299-320.

Kim JB, Porreca GJ, Gorham JM, Church GM, Seidman CE, Seidman JG (2007) Polony multiplex analysis of gene expression (PMAGE) in a mouse model of hypertrophic cardiomyopathy. Science 316(5830):1481-4.

Levy et al. (2007) The Diploid Genome Sequence of an Individual Human. PLoS Biol 5:e254.

Shendure J, Porreca GJ, Reppas NB, Lin X, McCutcheon JP, Rosenbaum AM, Wang MD, Zhang K, Mitra RD, Church GM (2005) Accurate Multiplex Polony Sequencing of an Evolved Bacterial Genome. Science 309(5741):1728-32.

Spielman RS, Bastone LA, Burdick JT, Morley M, Ewens WJ, Cheung VG. (2007) Common genetic variants account for differences in gene expression among ethnic groups. Nat Genet. 39(2):226-31.

Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, Thorne N, Redon R, Bird CP, de Grassi A, Lee C, Tyler-Smith C, Carter N, Scherer SW, Tavare S, Deloukas P, Hurles ME, Dermitzakis ET (2007) Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 315(5813):848-53.

September 5, 2007

J Craig Venter's Genome

Many others have abundantly commented on the publication of Craig Venter's genome this week in PLOS Biology (Levy et al, 2007). The sequence of his full diploid genome (HuRef) reveals that the degree of genetic variability between maternal and paternal chromosomes is much higher (0.5%) than expected. Part of this variability is due to insertions/deletions (indels represent only 22% of variant events but amount to up to 75% of the variant nucleotides), alterations that are typically missed by SNP genotyping (SNPs represent 78% of the variants). Copy number variations (62, amounting to 10Mb) are also reported, albeit not determined by sequencing but via microarray genomic hybridization. With regard to variability in gene exons, the analysis shows that at least 44% of the genes are in the heterozygous state.

Besides its scientific content, the psychological impact of this study is considerable. The small interactive "toy" map (see illustration, modified from Levy et al, 2007) published on the PLOS Biology website is a particularly strong symbol: the entire genome of an individual human being displayed on a single page! The fact that a genome is a beautiful linear structure and can be displayed in such a simple and compact way, on a single poster, inevitably triggers the reflex to zoom in, focus on a favorite gene and speculate on the resulting phenotype. This almost unavoidable fascination for a linear interpretation of a linear structure (one gene maps to one disease) is illustrated by the many comments on Craig Venter's "genetic destiny", "wet earwax", predispositions (or lack thereof) to "alcoholism, coronary artery disease, obesity, Alzheimer’s disease, antisocial behavior and conduct disorder" and last but not least, to his blue eyes. Even if Venter himself makes it clear that it "is an impressive array of large sets of genes together with environmental conditions that will determine life outcomes" (Anderson Cooper Blog), it remains very hard to visualize this reality in an intuitive way. The PLoS poster shows a linear map, not an intricate probabilistic network of any sort. The educational effort required to move the general public from the easy linear representation to a more "integrative" view is certainly going to be a major but decisive challenge if, with the advent of personal genomics, individuals are expected to exert more control and responsibility over their own health. Will Systems Biology manage to enter our daily lives?

Of course, this would imply deciphering the multigenic basis of complex human traits in the first place. But it is precisely this lack of current knowledge on the genotype-phenotype relationship which represents one of the strongest scientific incentives to sequence many more individual human genomes and correlate them with the respective medical, physiological and environmental parameters. In this regard, the availability of much cheaper and more efficient sequencing technologies (eg enabling sequencing, within 10 days, of 100 human genomes at 98% completion, 10^-5 accuracy and for $10'000 per genome, as challenged by the Archon X prize foundation, or allowing sequencing 1% of the genome for $1000 as in George Church's Personal Exome Project) may well represent an even more revolutionary advance than the first individual human genome published this week. As George Church wrote in his Editorial (The Personal Genome Project, Church (2005), Molecular Systems Biology 2005.0030),

Ready access to highly integrated and comprehensive human genome and phenome data sets is extremely important and increasingly feasible technically [...] As DNA is only a small part of destiny, personal genomics might fruitfully de-emphasize 'prediction' and focus on augmenting systems biology interpretations and prioritizations of actual day-to-day measurements of our physiological states.

June 19, 2007

E. coli counts in base 117

Finding general laws governing the organization of living organisms is a particularly difficult task in biology but certainly a central one in systems biology. Part of the difficulty in this endeavor is probably linked to the fact that "by its very nature, life is both contingent and particular, each organism the product of eons of tinkering, of building on what had accumulated over the course of a particular evolutionary trajectory" (Keller, 2007, see also our post). Such laws are thus particularly significant when they emerge from evolutionary constraints alone. In a recent paper published in PNAS, Matthew Wright and colleagues may well provide such an example by looking at the "chromosomal periodicity of evolutionarily conserved gene pairs" (Wright et al, 2007).

Using a comparative genomic approach, Wright and colleagues selected pairs of genes based on two simple criteria: 1) the genes of a pair must tend to lie close together; 2) one gene of the pair should tend to be present only if the other gene is also present. Searching more than 100 bacterial genomes, they identified 22'500 statistically conserved gene pairs. Looking at the distribution of distances between the genes in a pair and at the density of conserved pairs along the E. coli chromosome, a strikingly regular pattern emerged: conserved pairs appear to be localized in clusters that are regularly spaced over the entire chromosome, with a regular inter-cluster interval of 117kb. In addition, this regular positional pattern correlates with the pattern of log-phase transcriptional activity along the chromosome: both positional and transcriptional grids are almost perfectly aligned, with the same 117kb periodicity (see figure below, from Figure 3b in Wright et al, Copyright 2007 The National Academy of Sciences of the USA).

[Figure: Figure 3b from Wright et al, 2007]
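To give a feel for how a regular spacing can be picked out of a list of chromosomal positions, here is a minimal sketch that scans candidate periods and keeps the one for which the positions line up best in phase. The positions are synthetic (generated with a built-in 117kb spacing plus jitter) and the phase-coherence score is a generic stand-in chosen for brevity, not the statistical analysis used by Wright and colleagues.

```python
import cmath
import math
import random

def coherence(positions, period):
    """Phase coherence in [0, 1]; close to 1 when positions share one phase."""
    total = sum(cmath.exp(2j * math.pi * x / period) for x in positions)
    return abs(total) / len(positions)

# Synthetic data, not taken from the paper: 39 clusters spaced 117 kb apart
# (about 4.6 Mb in total), 10 positions per cluster, jittered by a few kb.
random.seed(0)
true_period_kb = 117
positions_kb = [
    k * true_period_kb + random.gauss(0, 5)
    for k in range(39)
    for _ in range(10)
]

# Scan integer candidate periods (in kb) and keep the most coherent one.
candidates = range(50, 301)
best = max(candidates, key=lambda p: coherence(positions_kb, p))
print(best)
```

With this synthetic input the scan should recover the built-in 117kb spacing; on real data, the score would need to be compared against a suitable null model before reading anything into a peak.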

The interpretation offered to explain these findings is that the regular spacing of conserved gene pairs may reveal underlying regularities in the spatial organization of the E. coli chromosome. Specifically, a solenoid-like model with regular 117kb loops would imply that conserved pairs are preferentially located on one face of the chromosome. Correlation between the positional grid and the longitudinal profile of transcriptional activity suggests that this arrangement is coupled to functionally important characteristics (eg diffusion properties of the RNA polymerase or existence of transcription factories).

A patterned structure with a similar periodicity has been suggested by previous studies based on the analysis of sequence features or on the profile of transcriptional activity along the bacterial chromosome (Jeong et al, 2004, Carpentier et al, 2005, Allen et al, 2006). What is remarkable in the study by Wright and colleagues is that the 117kb periodicity emerges so clearly by using solely evolutionary conservation criteria: chromosomal proximity and phylogenetic co-occurrence. The evolutionary forces that operate on a wide variety of genomes are thus able to reveal constraints on the overall structural organization of an entire bacterial chromosome. In turn, this finding implies that strong evolutionary selective pressures operate to shape the long-range organization of chromosomes. How general is this 117kb-periodicity law? Wright and colleagues were able to find a similar arrangement in C. crescentus and it will be interesting to see if a similar organization is observed in other genomes. Direct investigation of chromosomal conformation in vivo may also shed more light on the physical and functional mechanisms that explain the deep link between evolutionary conservation of local properties and a global architectural principle of a bacterial genome.

May 16, 2007

The Human (Genetic) Disease Network

The relationship between genetic mutations and human diseases is often complex and ambiguous: a given disease can be associated with mutations in distinct genes and, conversely, mutations in a given gene can be associated with several diseases. Can this many-to-many relationship be exploited to construct a human disease network and extract information on the human disease landscape?

In their work just published in PNAS, Albert-László Barabasi, Marc Vidal and colleagues reconstruct such a "diseasome" network in which disorders are linked to the respective associated disease genes (Goh et al, 2007 PNAS). Two projections of the network are presented: a) the Human Disease Network (HDN), in which diseases are connected to each other if they share a common disease gene; b) the Disease Gene Network (DGN), in which genes are connected if they are associated with a common disease. The HDN has a giant component comprising almost half of the diseases, in which some classes of disorders cluster naturally (eg cancers or neurological disorders, but not metabolic disorders). The DGN, when integrated with functional annotations, expression and protein-protein interaction data, provides a first step towards a "network-based explanation of the emergence of complex polygenic disorders" in the sense that it reveals, perhaps not too surprisingly, how functionally related genes can lead to similar disorders.
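The two projections can be sketched in a few lines. The disorder and gene names below are placeholders rather than entries from the actual dataset, and the functions only illustrate the projection rule described above (a shared gene links two disorders, a shared disorder links two genes).

```python
from itertools import combinations

# Hypothetical disorder -> associated genes (placeholder names).
disease_genes = {
    "disorder_A": {"GENE1", "GENE2"},
    "disorder_B": {"GENE2", "GENE3"},
    "disorder_C": {"GENE4"},
}

def project_diseases(assoc):
    """HDN projection: link two disorders if they share at least one gene."""
    edges = set()
    for d1, d2 in combinations(sorted(assoc), 2):
        if assoc[d1] & assoc[d2]:
            edges.add((d1, d2))
    return edges

def project_genes(assoc):
    """DGN projection: link two genes if they share at least one disorder."""
    edges = set()
    for genes in assoc.values():
        edges.update(combinations(sorted(genes), 2))
    return edges

hdn = project_diseases(disease_genes)
dgn = project_genes(disease_genes)
print(sorted(hdn))  # [('disorder_A', 'disorder_B')]
print(sorted(dgn))  # [('GENE1', 'GENE2'), ('GENE2', 'GENE3')]
```

Note that disorder_C, which shares no gene with any other disorder, ends up isolated in the disease projection, just as disorders outside the giant component do in the real network.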

The authors also look at the centrality of human disease genes in the protein-protein interaction network. An interesting twist comes when human disease genes are separated into essential and non-essential classes, according to the lethal or non-lethal mouse phenotype resulting from the knockout of the respective orthologous genes. While essential genes tend to be associated with hubs in the interactome, disease genes that are non-essential (representing 78% of all disease genes) do not display a higher connectivity than non-disease genes. A somewhat complementary conclusion was recently reached by Lu and colleagues when looking at changes in gene expression in a mouse model of asthma: genes whose expression is the most affected by the disease have low connectivity while genes coding for hub proteins tend to display stable expression levels (Lu et al, 2007 Mol Syst Biol 3:98).

Reading this work, two main questions come to my mind:

First, if a majority of disease genes are not more central than non-disease genes, what will be the "network-based explanation" for the mere fact that they are implicated in a human disease? What kind of model will be needed to achieve this fundamental prediction?

Second, and on a more general note, it seems to me that system-level approaches will be needed to integrate the environmental causes of human disease. While there is no question about the power of genetics and genomics to provide a global view on human diseases, I find it useful to remember that, as Jeremy Nicholson emphasizes,

the majority of people in the world die from what are, in the broadest sense, environmental causes. (Nicholson 2006, Mol Syst Biol 2:52)

Concrete achievements of Systems Biology in addressing significant human health problems may well require strong research efforts to bring system-level understanding into the impact of environmental factors on disease. This way, the Human Genetic Disease Network might ultimately be extended to a true Human Disease Network.

April 26, 2007

A Human Microbiome Project?

(via Jonathan A. Eisen, The Tree of Life)
What are the areas that will deeply transform biomedical research over the next decade? One of the possible areas identified for inclusion in the NIH Roadmap is research on the Microbiome (the entire set of microbial species living in the human body). A string of recent studies has revealed a profound impact of the enormously complex mammalian microbiome (Gill et al, 2006) on the metabolism and immune status of the host (for a few examples: Backhed et al, 2004, Dumas et al, 2006, Turnbaugh et al, 2006, Kitano & Oda, 2006, Nicholson et al, 2005). In his blog, J Eisen reports on some of the discussions held at an NIH-sponsored workshop on the necessity of a Human Microbiome Project and lists possible research avenues for such a program. From his post:

1. Sequence many "reference genomes." By reference genomes here I mean genomes of cultured isolates that are closely related to organisms known in various human locations.
2. Do metagenomic sequencing of a variety of human microbiome samples.
3. Conduct large scale human microbiome diversity studies. This could involve rRNA PCR surveys as well as some amount of genome sequencing.
4. Develop the computational tools needed to analyze the massive amounts of data that will come out.
5. Encourage the development of new methods to aid in studies of the microbiome.

Perhaps one would like to add that an understanding of the symbiotic relationship between host and microbiome will also require the development of experimental approaches to manipulate the microbiome and measure its impact on the host physiology.

A friend of mine asked me recently what field might strike the popular consciousness in the coming years. Could it be that it will be the realization that we are all "superorganisms" (Lederberg, 2000) and that our health does not only depend on our personal genome (Church 2005) and our environment, but also on the extended genome provided by our very private microbiome?

March 5, 2007

Catch me if you can: VelociMouse unleashed

"Germline transmission": how many times have mouse geneticists prayed to the Gods to decipher this magical message from their Southern blots and PCRs when trying to generate a knockout line?

Hopefully the recent technology (the "VelociMouse method") developed by Regeneron Pharmaceuticals will definitively overcome this hurdle (Poueymirou et al, 2007): by disrupting the zona pellucida with the help of a laser, ES cells could be injected into 8-cell stage embryos, resulting in F0 generation founder mice entirely derived from the ES cells, thus directly amenable to phenotypic analysis and breeding.

Together with the large-scale gene targeting in ES cells (Valenzuela et al, 2003, Auwerx et al, 2004, Austin et al, 2004, Grimm, 2006), this technological advance may well represent a major step towards high-throughput functional genomics in a mammalian species.

February 2, 2007

Functional genomics of the neuron

Several recent publications seem to give a clear signal that the time has come for a functional genomic approach to key neuronal functions, such as neuronal differentiation or synaptic plasticity.

  • The Allen Institute for Brain Science in Seattle has completed the Allen Brain Atlas (Lein et al, 2007, see also our N&V by Sebastian Jessberger and Fred H Gage), cataloging the expression patterns of 20'000 genes in serial in situ hybridization sections and providing an exemplary web interface to query and retrieve the information
  • Neurons are notoriously difficult to cultivate and transfect, making it difficult to probe gene function in a high-throughput fashion. Michael Greenberg describes in Neuron (Paradis et al, 2007) the results of a systematic RNAi screen to evaluate the function of roughly 150 genes in synapse formation.
  • Aplysia californica has been extremely useful as a model in identifying the signaling processes underlying synaptic plasticity and delineating the molecular mechanisms involved in learning and memory. The laboratory of Eric Kandel at Columbia University has now characterized the neuronal transcriptome of Aplysia (Moroz et al, 2006) not only in several distinct ganglia but also in individual identified neurons and even in neuronal processes from cells known to support local protein synthesis at their synaptic terminals.

January 19, 2007

Analyzing time-series expression data

Ziv Bar-Joseph and colleagues describe their new method, Dynamic Regulatory Events Miner (DREM), to analyze time-series gene expression data and combine them with static ChIP-chip experiments. The expression profiles are modeled using an extension of the Hidden Markov Model that enforces a tree structure onto the expression profiles. The technique makes it possible to deduce the condition-specific or time-dependent activity of the transcription factors that explain the observed expression profiles.

In their analysis of developmental time-series of gene expression in Drosophila, Peer Bork and colleagues apply a more drastic principle to identify robust groups of genes that correlate with major developmental phases. They required "four points of low expression and four subsequent points of high expression (or vice versa) even if the amplitude change was relatively low (see Materials and methods). This type of convolution not only requires a sharp increase or decrease of expression, but also that the change in transcript level is consistent over a period of time, thereby reducing the rate of false positives owing to individual outliers."
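The "four low, then four high" requirement quoted above can be illustrated with a small sliding-window check. This is only a sketch: the threshold separating low from high expression, the function name and the example profiles are assumptions made here, and the authors' exact procedure is the one described in their Materials and methods.

```python
def sharp_transitions(profile, window=4, threshold=0.0):
    """Find indices where `window` consecutive points on one side of
    `threshold` are followed by `window` consecutive points on the other."""
    hits = []
    for i in range(window, len(profile) - window + 1):
        before = profile[i - window:i]
        after = profile[i:i + window]
        up = all(x < threshold for x in before) and all(x > threshold for x in after)
        down = all(x > threshold for x in before) and all(x < threshold for x in after)
        if up:
            hits.append((i, "up"))
        elif down:
            hits.append((i, "down"))
    return hits

# Hypothetical profiles (eg centered log-expression over ten time points).
profile = [-0.4, -0.3, -0.5, -0.2, 0.3, 0.4, 0.2, 0.5, 0.4, 0.3]
noisy = [-0.4, 2.0, -0.5, -0.2, 0.3, 0.4, 0.2, 0.5, 0.4, 0.3]

print(sharp_transitions(profile))  # [(4, 'up')]
print(sharp_transitions(noisy))    # []
```

The first profile switches cleanly from low to high even though the amplitude is small, while the second is rejected because a single outlier breaks the run of low points, which is the kind of false positive the consistency requirement is meant to guard against.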