About Computational approaches

This page contains an archive of all entries posted to The Seven Stones in the Computational approaches category. They are listed from oldest to newest.

Creative Commons License
This weblog is licensed under a Creative Commons License.

Computational approaches Archives

June 4, 2010

[Research highlight] Cis-regulatory evolution, not so mysterious after all?

Animal genomes are littered with conserved non-coding elements (CNEs), most of which represent evolutionarily constrained cis-regulatory sequences. However, it is often not clear why these sequences are so exceptionally conserved, since anecdotal examples have shown that orthologous CNEs can have divergent functions in vivo (Strähle and Rastegar 2008; Elgar and Vavouri 2008). In an article recently published in Molecular Biology & Evolution, Ritter et al. (2010) compare the functional activities of 41 pairs of orthologous CNEs from humans and zebrafish. Interestingly, sequence similarity was found to be a poor predictor of which CNEs had conserved function. In contrast, the authors found that measuring changes in transcription factor binding sites, rather than simple sequence divergence, improved their ability to predict functional conservation. While this set of tested CNEs remains relatively small, these results are encouraging because they suggest that as scientists move from phenomenological measures of CNE evolution to models based explicitly on binding site evolution, the patterns of cis-regulatory evolution observed within animal genomes should become far less mysterious.


Elgar G, Vavouri T (2008) Tuning in to the signals: noncoding sequence conservation in vertebrate genomes. Trends Genet 24: 344–352

Ritter DI, Li Q, Kostka D, Pollard KS, Guo S, Chuang JH (2010) The Importance of Being Cis: Evolution of Orthologous Fish and Mammalian Enhancer Activity. Mol Biol Evol advance online publication May 21

Strähle U, Rastegar S (2008) Conserved non-coding sequences and transcriptional regulation. Brain Res Bull 75: 225–230

February 2, 2009

Keystone Symposium - Omics Meets Cell Biology (II)

Before I carry on with a summary of the second part of the Keystone Symposium 'Omics Meets Cell Biology', I should clarify that this post and the previous one dedicated to this conference are not intended to provide a comprehensive account of all the talks but rather to communicate some general (and subjective) impressions of the meeting. To keep these posts reasonably short (and sometimes due to a lack of memory...), I had to omit several of the excellent presentations given at this meeting. The full program and complete list of speakers are available at the Keystone Symposium website.

Many of the presentations given during the second part of the meeting reported findings derived from cell-based high- or medium-throughput functional screens, most of them relying on RNAi-mediated knock-down. Here is an overview of the screens presented during this meeting, illustrating by their diversity in scope and scale the versatility of this method:

Focus                      # genes tested   Type                    Speaker
autophagy                  21'000?          RNAi                    M Lipinski
sensory organ dev.         20'000           RNAi                    J Mummery-Widmer
cell polarity              16'000           RNAi                    J Ahringer
imatinib modifiers         9500 (pooled)    RNAi                    D Sabatini
viral entry                4000             RNAi                    L Pelkmans
cell-cell contacts         2000             RNAi                    T Pawson
cell migration             1000             RNAi                    J Brugge
centrosome                 113              RNAi                    L Pelletier
bipolar spindle            45               RNAi                    R Medema
DNA repair                 -                RNAi                    D Durocher
neuronal differentiation   700              TF overexpression       M Snyder
gene-centered TF location  -                yeast 1-hybrid library  M Walhout
protein degradation        -                reporter library        S Elledge

Perhaps not surprisingly, many speakers emphasized that RNAi screens invariably need to be followed up by time-consuming and tedious validations. The off-target problem in mammalian cell-based RNAi screens also appears to be taken very seriously, and it was reported that 4 to 7 siRNAs directed against the same gene were necessary to reach a good level of confidence. In view of the increasing number of RNAi-based functional screens, standards for the description of such experiments (e.g. MIARE, MIACA) are likely to become increasingly useful.

In systems biology, network models are often central to the interpretation of omics data related to molecular interactions, and they make it possible to generate biological insights that differ from those derived from the more classical screening-mechanistic-dissection paradigm. In this regard, Uwe Sauer presented exciting work on the relationship between transcriptional regulatory networks, protein expression and the state of the yeast metabolic network. Using a combination of genetic approaches and drug perturbations, a series of parallel 'fluxomic' and metabolomic measurements revealed that metabolic fluxes, in contrast to metabolite concentrations, remain robust to perturbations and are apparently affected only by a handful of transcription factors in a given condition at steady state. At the computational level, integration of different types of data represents a significant challenge. For example, it is far from trivial to find ways to exploit the information contained in interaction networks and integrate it with other types of large-scale molecular measurements. Trey Ideker presented an efficient solution to this problem within the context of microarray profiling of breast cancers and showed that expression data can be combined with information on physical protein interactions to define improved and biologically meaningful pathway-based biomarkers for the classification of metastatic vs non-metastatic tumors.

While superposing parallel datasets leads to a 'vertical' integration of networks, Marian Walhout presented an approach to integrate 'horizontally' transcriptional and miRNA-dependent regulatory links and map a composite transcription factor/miRNA regulatory network in Caenorhabditis elegans. In this elegant work, the yeast one-hybrid assay was used as a gene-centric screening method to identify regulatory links between hundreds of transcription factors and promoters of both miRNA genes and genes encoding transcription factors. Closing the loop, the network was completed by computationally predicting the transcription factors potentially targeted by miRNAs. Interestingly, the resulting network showed numerous composite motifs including negative feedback loops (TF --> miR --| TF), which are otherwise under-represented in pure transcriptional regulatory networks.
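As a rough illustration of what searching for such a composite motif could look like, here is a minimal Python sketch. It assumes the loop TF --> miR --| TF closes on the same transcription factor, and it treats the network as two simple edge maps. All node names, edges and the helper `composite_feedback_loops` are invented for the example and are not taken from the talk.

```python
# Toy illustration only: node names and edges are hypothetical.

def composite_feedback_loops(tf_to_mir, mir_to_tf):
    """Return sorted (TF, miRNA) pairs where the TF regulates the miRNA
    and the miRNA in turn targets that same TF."""
    loops = []
    for tf, mirs in tf_to_mir.items():
        for mir in mirs:
            # The loop closes only if the miRNA targets the TF that drives it.
            if tf in mir_to_tf.get(mir, set()):
                loops.append((tf, mir))
    return sorted(loops)


# TF --> miRNA edges (the kind of link a yeast one-hybrid screen could report)
tf_to_mir = {
    "tfA": {"mir-1", "mir-2"},
    "tfB": {"mir-2"},
    "tfC": {"mir-3"},
}

# miRNA --| TF edges (the kind of link a target prediction could report)
mir_to_tf = {
    "mir-1": {"tfA"},
    "mir-2": {"tfC"},
    "mir-3": {"tfB"},
}

loops = composite_feedback_loops(tf_to_mir, mir_to_tf)
print(loops)
```

In this toy network only tfA and mir-1 form a two-node feedback loop. Judging whether such loops are over- or under-represented would additionally require a comparison against randomized networks, which is not shown here.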

Completion of network models may require tedious and repetitive work. To the question "who will fill the gaps?", Steve Oliver replied: "a Robot Scientist". He showed that an actual implementation of such a robot is able to iteratively use a computational model of the yeast metabolic network to automatically design informative experiments, perform them and use the results to extend the model. In an effort to provide a genome-scale overview of the molecular interactions that underlie regulation of gene expression, Tim Hughes presented a variety of microarray-based technologies to systematically map transcription factor-DNA, nucleosome-DNA and protein-RNA interactions. The latter results were particularly intriguing given that the high-throughput identification of targets of RNA-binding proteins remains a relatively unexplored route and may reveal novel insights into the complexity of post-transcriptional regulation.

To conclude on a somewhat different note, it was also interesting to observe that an increasing number of studies were accompanied by extensive web resources providing access to the respective datasets:

Resource                       Lab
PhosphoPep                     R Aebersold
Human Protein Atlas            M Uhlen
3Dcomplexes.org                S Teichmann
Nature Cell Migration Gateway  J Brugge
EDGEdb.org                     M Walhout
CellCircuits                   T Ideker
STRING                         C von Mering

This situation underscores the need for a proper infrastructure to host and share (or publish?) large datasets in biology and the central role of web technologies in this regard. In view of the proliferation of biological databases, I wonder whether it might be helpful to have general recommendations on some minimal requirements for this type of database, e.g. types of searching, visualization, data integration functionalities, existence of (web) APIs, download of datasets, possibility to integrate external datasets, etc.? Or would perhaps something like a 'Minimum Information About a Biological Database' be useful to specify the capabilities of databases? One may also dream that these databases will become progressively interoperable and eventually include web-based APIs facilitating programmatic access to the information stored, ultimately sending Omics in the Cloud...

And, oh yes, the slopes were very nice, even though I have to admit the air was thin and a little fresh...

January 30, 2009

Keystone Symposium - Omics Meets Cell Biology (I)

At the Keystone Symposium 'OMICS Meets Cell Biology', held this week in Breckenridge, Colorado, attendees initially had to face two major challenges: the first was to survive the cocktail mixing jet lag and altitude sickness and the second one (oh, it hurts!) was to resist the temptation to just forget all about science and focus exclusively on the concepts revolving around snow, slopes and fun sports...

In any case, those who survived this harsh test were highly rewarded by attending an extremely exciting meeting, organized by Ruedi Aebersold and Tony Pawson, showcasing the impact of genome-wide and high-throughput technologies, the so-called 'omics', in cell biology.

After the first two days of the meeting, dedicated to 'cell signaling' and 'sub-cellular organization', a series of impressive talks had already delivered a clear and strong message: beyond generating comprehensive 'part lists', omics data lead to important and novel biological insights when integrated with functional and phenotypic data and when applied in experiments addressing well-defined aspects of the biology of the system under study. This was particularly well illustrated in the talks dedicated to signaling, which all reported on analyses of well-defined systems: ephrin-Eph receptor bidirectional signaling in cell-cell contact (T. Pawson), insulin signaling and growth regulation (E. Hafen), Notch signaling and sensory organ development (J. Mummery-Widmer), cytokines and hepatotoxicity (B. Cosgrove), Rho signaling and cell migration (C. Bakal).

I have the feeling that this transition from descriptive catalogs to functional and mechanistic insights can be envisioned as the result, at least in part, of two series of developments:

First, experimental design is evolving and an increasing number of projects combine and integrate functional readouts with genetic approaches and high-throughput molecular measurements. For example, Tony Pawson illustrated how the integration of quantitative (SILAC) proteomics, phenotypic siRNA screens and protein complex identification could shed light on the components and mechanisms involved in ephrin-Eph receptor bidirectional signaling and their impact on cell-cell contacts. A combination of quantitative proteomics and genetic approaches was illustrated by Ruedi Aebersold, whose lab is charting a comprehensive kinase-substrate network in yeast by systematically performing quantitative proteomics on deletion mutants of all kinases and phosphatases. Other experiments link even more intimately, by design, systematic perturbations and molecular measurements to phenotypic outcome. Ben Cosgrove presented such work in the context of the study of drug hepatotoxicity. Systematic measurements of the phosphorylation status of 17 signaling proteins and monitoring of cell death rates were performed in HepG2 cells under a variety of cytokine stimulation conditions. Multivariate statistical analysis then made it possible to construct correlative models, which not only have predictive power but also reveal key players in the process and provide insight into how signaling components contribute to the phenotypic outcome. The power of data integration was also beautifully demonstrated in the work of Jennifer Mummery-Widmer, who performed genome-wide and tissue-specific RNAi screens in Drosophila to identify modifiers of the Notch signaling pathway. Integration of the genes identified in the screen with a map of known genetic and physical interactions resulted in a network model whose predictive power was exploited to identify and validate in vivo novel regulators of Notch signaling.

Second, the technological platforms are maturing, data quality is increasing and protocols are being streamlined, making these technologies progressively more accessible. This might be particularly relevant for mass spectrometry proteomic approaches, which were omnipresent in the signaling talks. One of the consequences of a relative and progressive 'democratization' of MS proteomics platforms is that their application is no longer necessarily restricted to an initial exploratory phase traditionally aimed at providing an unbiased view of a particular system, but can now also be engaged in follow-up, often more focused, investigations to gain deeper mechanistic insights. An example of this was provided by Ernst Hafen, who presented his work on growth regulation in Drosophila and showed data on a genome-wide and tissue-specific mutagenesis screen aimed at the identification of modifiers of growth regulation. Selected hits of the screen were then analyzed further in time course experiments upon insulin stimulation, and mass spectrometry identification of TAP co-immunoprecipitated protein complexes could reveal the nature and dynamics of signaling complex assembly. One can thus predict that further development of optimized omics technologies for targeted follow-up experimentation will have a profound impact in molecular and cell biology.

Mass spectrometry based proteomics was clearly one of the predominant platforms in many of the studies presented during the sessions devoted to signaling. It was therefore particularly fascinating to listen to the talk by Mathias Uhlen, who emphasized the need for complementary approaches based on affinity probes and presented foundational work towards antibody-based proteomics. The scale of this work is such that it is hardly possible to summarize it in just a few sentences. Fortunately, the resource resulting from this enormous effort can be consulted directly online at the Human Protein Atlas portal. I will only add that Mathias Uhlen estimated that this resource will be able to provide quality-controlled antibodies for 50% of human proteins within the coming years and that a first draft of the complete human proteome might be ready around 2014!

Beyond omics based on high-throughput measurements at the molecular level, one very exciting development is the application of imaging techniques for automated measurements of cellular and cytological parameters. Lucas Pelkmans showed that measurements of local cellular features (e.g. nucleus size, local density, mitotic stage, cell edges etc.) at the single cell level could be correlated to various cellular activities such as viral entry, clathrin distribution etc. He insisted that accounting for such local population parameters may have considerable implications for the interpretation of siRNA screens given the unavoidable heterogeneity of cellular populations. This strategy was then applied in the context of a large-scale siRNA screen for modifiers of viral entry performed on 8 different viruses. Cluster analysis of the resulting hits beautifully revealed a hierarchical 'functional phylogenetic' tree of the various virus strains according to the subset of cellular activities required for their entry. This information could in turn be used for the identification of a novel regulatory mechanism of viral entry essential for most of the viruses tested.
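To make the idea of a 'functional phylogenetic' tree a little more concrete, here is a small Python sketch of one way viruses could be grouped by the host genes required for their entry. The choice of Jaccard distance and average linkage is an assumption made for illustration, not the method used in the talk, and the virus names, gene names and the helper functions `jaccard_distance` and `average_linkage_tree` are all hypothetical.

```python
from itertools import combinations

# Toy illustration only: virus names and hit lists are invented.


def jaccard_distance(a, b):
    """1 - |intersection| / |union| for two sets of hit genes."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def average_linkage_tree(hit_sets):
    """Agglomerative clustering of viruses by their hit sets.

    Returns the tree as nested tuples of virus names."""
    # Each cluster is a pair: (subtree, list of member names).
    clusters = [(name, [name]) for name in sorted(hit_sets)]

    def dist(c1, c2):
        # Average pairwise Jaccard distance between the members.
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        total = sum(jaccard_distance(hit_sets[x], hit_sets[y]) for x, y in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


hits = {
    "virus1": {"g1", "g2", "g3"},
    "virus2": {"g1", "g2", "g4"},
    "virus3": {"g5", "g6"},
    "virus4": {"g5", "g6", "g7"},
}

tree = average_linkage_tree(hits)
print(tree)
```

With these invented hit lists, virus3 and virus4 are joined first, then virus1 and virus2, and the two pairs are merged last, which mirrors the notion that viruses sharing entry requirements end up on the same branch.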

November 25, 2008

The role of neutral mutations in the evolution of phenotypes

Research highlight by Pedro Beltrao, University of California, San Francisco

In a recent opinion piece, Andreas Wagner tries to reconcile the tension between proponents of neutral evolution and selectionism (Wagner 2008). He argues that “neutral mutations prepare the ground for later evolutionary innovation”. Wagner illustrates this point using a network model of genotype-phenotype relationships (Wagner 2005). In a so-called ‘neutral network’, nodes correspond to distinct genotypes associated with the same phenotype and are connected by an edge if the respective genotypes differ only by a single mutation event (e.g. a point mutation). Examples of neutral networks include different genotypes coding for RNA or protein structures. In this representation, highly connected networks correspond to robust phenotypes that are not very sensitive to changes in genotype. Wagner notes the zinc finger fold as an impressive example of a highly connected neutral network as its structure remains essentially the same even after mutating all but seven of its 26 residues to alanine.
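The definition of a neutral network given above can be sketched in a few lines of Python. This is only a toy under stated assumptions: genotypes are binary strings of length 4, a point mutation flips one position, and the genotype-phenotype map `toy_phenotype` is invented (real maps would come from RNA or protein structure prediction). The robustness measure shown, the average fraction of single mutations that keep the phenotype, is one simple choice among several.

```python
from itertools import product

# Toy illustration only: the genotype-phenotype map below is hypothetical.


def hamming(a, b):
    """Number of positions at which two equal-length genotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def neutral_network(phenotype_of, genotypes, target):
    """Nodes: genotypes with the target phenotype.
    Edges: pairs of such genotypes that differ by one point mutation."""
    nodes = [g for g in genotypes if phenotype_of(g) == target]
    edges = [
        (a, b)
        for i, a in enumerate(nodes)
        for b in nodes[i + 1:]
        if hamming(a, b) == 1
    ]
    return nodes, edges


def robustness(nodes, edges, length):
    """Average fraction of single mutations that leave the phenotype unchanged."""
    degree = {n: 0 for n in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sum(degree.values()) / (len(nodes) * length)


def toy_phenotype(genotype):
    # Hypothetical rule: the phenotype is "on" if at least two positions are "1".
    return "on" if genotype.count("1") >= 2 else "off"


LENGTH = 4
genotypes = ["".join(bits) for bits in product("01", repeat=LENGTH)]
nodes, edges = neutral_network(toy_phenotype, genotypes, "on")
r = robustness(nodes, edges, LENGTH)
print(len(nodes), len(edges), round(r, 3))
```

In this toy map, 11 of the 16 genotypes share the "on" phenotype and are linked by 16 single-mutation edges. A more densely connected network of this kind corresponds to a phenotype that tolerates more mutations.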

Using this model, Wagner describes how highly robust phenotypes can lead to faster exploration of the genotype space. He further proposes that evolution of innovation occurs via cycles of exploration of nearly neutral spaces (dubbed neutralist regime) followed by a reduction in diversity once a new phenotype of higher fitness is discovered (selectionist regime).

Although these models and ideas were mostly developed using models of sequence to structure relationships, Wagner cites several examples suggesting that these concepts are equally valid for cellular phenotypes that depend on molecular interactions (e.g. gene expression patterns).

As Wagner points out, in order to understand the evolution of innovation we must fully understand the mapping from genotypes to phenotypes. This is why it is important to continue to develop richer evolutionary models to link changes at the DNA level with changes in molecular structures, interactions and ultimately phenotypes with a quantifiable impact on fitness. This is an area where systems biology should play an important role.

Models of RNA and protein structure stability upon mutation have existed for some time now (Hofacker et al. 1994, Guerois et al. 2002). More recently, the study of large amounts of genomic information and systematic interaction studies have been providing us with accurate models for different types of molecular interactions (Berger et al. 2008, Burger & van Nimwegen 2008, Chen et al. 2008). In parallel to these, theoretical analysis has been used to aid in the understanding of cellular phenotypes (e.g. cell cycle, signaling pathways etc.) (Tyson et al. 2003). Connecting these different layers of abstraction is an important challenge that will allow us to better understand the origins of biological innovation.


Berger MF et al. (2008). Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences. Cell 133:1266-76

Burger L & van Nimwegen E (2008). Accurate prediction of protein-protein interactions from sequence alignments using a Bayesian method. Mol Syst Biol 4:165

Chen JR et al. (2008). Predicting PDZ domain-peptide interactions from primary sequences. Nat Biotechnol 26:1041-5

Guerois R, Nielsen JE & Serrano L (2002). Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J Mol Biol 320:369-87

Hofacker IL et al. (1994). Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie / Chemical Monthly 125:167-188

Tyson JJ, Chen KC & Novak B (2003). Sniffers, buzzers, toggles and blinkers: dynamics of regulatory and signaling pathways in the cell. Curr Opin Cell Biol 15:221-31

Wagner A (2005). Robustness and Evolvability in Living Systems. Princeton University Press

Wagner A (2008). Neutralism and selectionism: a network-based reconciliation. Nat Rev Genet 9:965-974

August 18, 2008

SciFoo: scientific fireworks

In his list of eight 'generative' values (Better Than Free), Kevin Kelly includes 'embodiment'–the actual physical realization of an item or event which could be otherwise freely distributed over the web. While we are all 'hyperlinked' on the Internet, the value of those unique qualities that cannot be generated or "copied" on the web is dramatically increased. The type of intense emulation and shared excitement sparked at the recent Science Foo Camp (SciFoo 2008), organized by Nature, Google and O'Reilly, gave a wonderful example of the unique value of direct human exchange during an exclusive event bringing together roughly 200 top scientists, 'geeks' and other technologists at the Googleplex in Mountain View, California.

SciFoo is a so-called 'unconference': there is no program or more precisely, as Timo Hannay explained during the opening of the conference, the attendees are the 'program'. The actual schedule was defined only on the first evening in a purposefully chaotic process by anyone who wished to organize a session on any topic. For the next two days, in a festival of parallel sessions, astrophysicists, 'googlers', technologists, molecular biologists, taxonomists, game designers, flying car constructors, publishers, thinkers and (some) dreamers discussed and exchanged ideas with great enthusiasm and a rare intensity and openness.

Needless to say, deciding which session to attend was close to impossible... In any case, I ended up following three types of talks: a series on systems biology related topics (data integration, machine learning, personal genomics, baroque structure of the transcribed genome), several (of many) sessions focused on the theme of open data/science and finally some more eclectic sessions (only from my standpoint, of course) on diverse topics such as the foundations of the concept of time in physics, a demonstration of very simple yet powerful Python scripting exercises to analyze text, and the potential of game design to harness our 'cognitive surplus'. I cannot possibly summarize all the talks, interactions and impressions gathered at this meeting, but here are a few subjective excerpts:

  • There were quite a few sessions on open science and open data. Ernst Hafen made a strong case for the need for a unique AuthorID that would help in tracking the multiple aspects of researchers' scientific activities. With regard to data, Google announced that a new service will soon be launched, Google Research Datasets, offering to host, for free, large datasets of any type. The service will allow inclusion of some minimal meta-data about the submitted datasets and will provide a mechanism to define a delay before the dataset is made publicly visible. This will probably become a very simple and convenient way of storing data (in particular if a useful API is developed), so convenient in fact that we may have to be a little careful that it does not turn into a temptation to bypass the 'minimal information...' standards usually required by traditional public databases.
  • George Church provided an overview of the Personal Genome Project (PGP) and described the type of biological data that will be integrated with the genomic and genetic information collected from consenting PGP volunteers: analysis of the transcriptome of pluripotent stem cells derived from the subjects; sequence of the repertoire of recombined V-D-J regions in immune cells ('VDJome') to exploit correlations between given V-D-J sequences and antigen-specific stimulations; characterization of the microbiome used as a tracer of the environmental and physiological conditions; record of phenotypic traits and disease conditions using controlled vocabularies. Finally, George also emphasized the exponentially decreasing cost of sequencing, which will not only make large scale sequencing of full personal genomes feasible but will also potentially open entire new fields of applications based on massive DNA sequencing.
  • Lee Smolin talked about the nature of the concept of time in physics and investigated the question of whether our perception of time as the 'experience of successive present moments' is 'real' or, alternatively, an emergent property of the laws of physics. I cannot pretend I followed the entire argument, but I learned that the mathematical representation of physical reality involves the geometrization of time (as one of the state space's dimensions), leading in fact to a representation devoid of temporal flow (somehow the clock has to be outside the system). Physical laws are associated with this geometrical representation and applied to initial conditions. If I did not misunderstand it, it appears that this approach used in physics might have to be considered as approximate because it may only be valid for subsystems of the universe whereas it might not be appropriate for a true cosmological theory of the entire universe, with possibly disturbing consequences on the nature of physical laws...
  • Believe it or not, music can be 'geekified' as well: Chris DiBona, later in the evening, brought his tenori-on for a fun demonstration. I want one of those!

The meeting ended with some final scientific fireworks, when some of the speakers gave a series of brilliant 2 min summary talks, providing a colorful overview of the many sessions we inevitably had missed. I have to admit that I like fireworks and I would certainly have enjoyed having a little more of this final kaleidoscopic view of science. Clearly, the authentic value of this conference lies in the unique and direct human interactions, but I nevertheless wish there were some way (perhaps by using this last session in some form of outreach action) to disseminate this pure joy of scientific diversity and curiosity to a broader audience.

Credits: illustrations from Bob Lee, Flickr, some rights reserved

July 23, 2008

ISMB 2008: micro-blogging at its best

Probably like many others, I have often been puzzled by the phenomenon of 'micro-blogging', which consists in posting very short messages on the web (typically via sites such as Twitter) with the goal of providing an instantaneous description of the activity, state of mind or thoughts of the writer. The last few days, a small group of bloggers attending the ISMB 2008 Conference in Toronto used a form of collective micro-blogging on FriendFeed in an intensive way to cover many of the talks held at the conference.

Particularly interesting was the coverage of several keynote lectures, often commented on simultaneously on a single 'feed' by several bloggers in the audience, providing, so to speak, a real-time example of 'crowdsourcing'. The result is a surprisingly useful set of notes, where the combined attention and complementary knowledge of the participants allow some gaps to be filled, provide additional information (including references or links) and follow the flow of the presentation as it unfolds. I provide below a few picks, relevant to systems biology, while the rest can be consulted (and, importantly, searched!) in the 'ISMB 2008 Room' on FriendFeed. Good job & many thanks!

July 10, 2008

Fascinating correlations or elegant theories?

Chris Anderson, Editor-in-Chief of Wired , wrote a few weeks ago a provocative piece "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete", arguing that in our Google-driven data-rich era ("The Petabyte Age") the good old "approach to science —hypothesize, model, test — is becoming obsolete", leaving place to a purely correlative vision of the world. There is a good dose of provocation in the essay and it was quite successful in spurring a flurry of skeptical reactions in the blogosphere, FriendFeed-land and lately in Edge's Reality Club.

I know that it is a bit late to write a post on this but this debate reminds me of the bottom-up vs top-down dialectic in (systems) biology. The tradition in molecular biology has been to focus on molecular mechanisms–a series of molecular events–that explain given biological functions. With detailed knowledge on the properties of an increasing number of components, bottom-up mechanistic descriptions–or models–can be constructed, which account for the experimental observations.

Of course, the purpose of models, at least for insightful ones, is more than merely providing mechanistic descriptions. As William Bialek writes, "Given a progressively more complete microscopic description of proteins and their interactions, how do we understand the emergence of function?" (Aguera y Arcas et al, 2003). There is therefore some subsequent subtle transition from description to insight, from model to theory, from detailed and specific to simple and general (watch Murray Gell-Mann's TEDTalk on "Beauty and truth in physics").

Theories are elegant.

On the other hand, high-throughput technologies (microarrays, proteomics, metabolomics, ultra high throughput sequencing, etc.) are indeed profoundly changing molecular biology and flooding the field with experimental data like never before. Currently, only part of this data can be explained within the context of mechanistic models. Still, and this is probably Chris Anderson's main point, it turns out that if the data is rich enough, one can exploit it by looking at the data globally, from the 'top', to reveal statistical patterns and correlations. Even if there are no mechanistic explanations (yet) for these correlations, they may reveal new worlds and novel structures, and detect relationships between processes that were previously considered unlinked.

Correlations are fascinating.

Correlations resulting from data-driven analysis may well in turn stimulate new mechanistic investigations and hopefully new understanding. On Edge, Sean Carroll summarizes it all: "Sometimes it will be hard, or impossible, to discover simple models explaining huge collections of messy data taken from noisy, nonlinear phenomenon. But it doesn't mean we shouldn't try. Hypotheses aren't simply useful tools in some potentially-outmoded vision of science; they are the whole point. Theory is understanding, and understanding our world is what science is all about."

BUT, what is true for fundamental science is not necessarily a rule for more applied fields, where the priority might be less on understanding than on acting. In particular, in medically related fields, top-down data-driven correlative approaches offer a pragmatic way to obtain predictive models without waiting for still elusive fully mechanistic models that would encompass the entire complexity of human physiology (Nicholson, 2006).

As often in science, as in other human activities, different but complementary views are championed by people with different temperaments: there are those who like to build an edifice piece by piece and those who want to explore new territories. I think, and I hope, that progress in systems biology on both fronts, top-down and bottom-up, demonstrates that there is no need to turn this complementarity into an opposition.

March 3, 2008

Fewer papers to read, more data to use...

In a nice post at bbgm, Deepak writes:

...historical online literature lacks the relevant structure and metadata to make our task easier, but it is time that publishers thought ahead about some of the advantages of online publishing.

I can't agree more. I have sometimes heard the claim that within 5-10 years, more than 95% of the scientific literature is going to be read by computers only. Possible. However, the converse alternative might be interesting to consider: what if 95% of scientific papers could be 'written' by computers? Even if this formulation is obviously provocative and unrealistic, the point is that harnessing the 'network effect' of the web may have two complementary components, one community-driven, the other computer-driven. On one hand, web 2.0 functionalities enable community-driven commenting, rating and even writing of scientific publications. On the other hand, semantic web technologies are expected to facilitate computer-driven integration of scientific data from multiple sources, which is likely to play an increasingly important role in science. Rather than mining thousands of unread papers, the scientist of the future may instead search the web for relevant data first and integrate it to generate (or 'write') novel insight. In fact, integration of large datasets already represents a major field of research in systems biology (see Chuang et al 2007, Xue et al 2007 or Mani et al 2008 as recent examples published in Mol Syst Biol).

It thus seems that, in addition to being web 2.0 enabled, new publishing models should 'embed' more structured data into online publications. In short, 'papers' could progressively transform into hybrid online objects that more closely resemble database records (see Timo Hannay's post on this topic) or highly structured documents. At the extreme, one could even imagine publishing 'naked' datasets, without any 'stories' around them. Of course, efficient data integration will require the data to be in a standard and structured format, and its quality will have to be well characterized. These are all far from trivial requirements.

Good old-fashioned papers are probably not going to disappear as publication units, in particular for high-impact studies reporting novel and deep insights. Nor is the point here to propose dumping every scientist's hard drive onto the web. Data-rich publications would be published only when the authors feel it appropriate. There might thus be some equilibrium to find between papers that will never be read except by a text mining engine and pure datasets, published as a resource, easier to search, to mine and to integrate. This dialectic may ultimately boil down to the issue of how well text mining and data integration technologies will perform in the future.

In any case, within the context of the current debate about the saturation of the peer-review system, I wonder whether a data-centric form of scientific publishing could help relieve some of the pressure. Reviewing of datasets might be quicker and could rely more on standardized evaluation parameters. If accompanied by proper credit attribution mechanisms and metrics of impact, data-rich (or even data-only) publications may represent an alternative model complementing the traditional 'paper' format. It would prevent the loss of useful data otherwise buried in verbal descriptions and, most importantly, would hopefully stimulate web-wide integration of disparate datasets.

February 26, 2008

A refreshing model: peppermint terpenoids

Research highlight by Doron Lancet, Crown Human Genome Center, Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel

Living cells are typically asymmetric, having tens of thousands of different biopolymers (proteins and polynucleotides), but fewer than 1000 types of small molecules, such as amino acids and lipids. An exception is certain plant cells that harbor members of a ~40,000-strong group of low molecular weight terpenoids, often displaying a complex compositional balance essential for plant growth and survival (Aharoni et al, 2005). Understanding the intricacies of biosynthesis and interconversion of such unusual cellular components appears to require the full power of Systems Biology. In a recent paper, Rios-Estepa et al (2008) harness a systems approach, including iterative cycles of mathematical modeling and experimental testing, to help elucidate the metabolic dynamics of the terpenoid universe.

Specifically, they ask how plants vary their monoterpene profiles in response to an environmental stress, namely changing levels of illumination. A highlight of their results is that the variation of terpene metabolic fluxes is mediated by specific events in which members of the terpenoid repertoire exert a regulatory effect on terpene biosynthesis enzymes. Rewardingly, this is predicted by a computer simulation and subsequently verified by experiment. The broader conclusion, applicable to all living organisms, is that as the power of computing grows, it will become possible to make increasingly specific and accurate predictions that will allow both a better global understanding and the successful engineering of cellular networks.


Aharoni A, Jongsma MA, Bouwmeester HJ (2005) Volatile science? Metabolic engineering of terpenoids in plants. Trends Plant Sci. 10:594-602.

Rios-Estepa R, Turner GW, Lee JM, Croteau RB, Lange BM (2008) A systems biology approach identifies the biochemical mechanisms regulating monoterpenoid essential oil composition in peppermint. Proc Natl Acad Sci U S A. 105:2818-2823

February 21, 2008

Top-down mapping of gene regulatory pathways

In a very recent lecture (see full video from NIH VideoCasting) given for the NIH Systems Biology Special Interest Group, Trey Ideker presents a great overview of the various strategies his group has been developing in recent years in order to integrate multiple types of large scale datasets. While one of the most pervasive 'memes' about high-throughput measurements is that they are "notoriously unreliable" (see Hakes et al, 2008, for a recent example), Trey beautifully illustrates how predictive computational models and novel biological insights can be generated by sophisticated data integration strategies. Three types of applications are presented in his talk:

  1. mapping of transcriptional response pathways
  2. functional mapping of protein complexes
  3. disease diagnosis and stratification

In the last section, Trey presents the study recently published in Molecular Systems Biology (Chuang et al, 2007, video: 00hr:39min:15sec) where the information provided by microarray expression profiling is superimposed on a protein-protein physical interaction network to identify 'subnetwork' biomarkers that classify metastatic vs non-metastatic breast tumors.

February 15, 2008

Transcription paused and poised for regulation

Research highlight by Frank C.P. Holstege, Department of Physiological Chemistry, University Medical Center Utrecht, the Netherlands.

For eukaryotes, it is widely thought that transcription is primarily regulated through recruitment of the essential machinery to transcription start-sites. Previous hints challenging this paradigm have been confirmed by recent analyses showing that transcription regulation of a large number of genes actually occurs after recruitment. Mechanistically, such studies have gone furthest in Drosophila melanogaster (Muse et al, 2007; Zeitlinger et al, 2007). Here, conservative estimates indicate that more than 10% of genes are regulated through promoter-proximal pausing. On such genes, RNA polymerase II is recruited and initiates transcription, but then pauses around 50 bp downstream of the transcription start-site where it awaits further signals to resume elongation and complete transcription proper. These observations tie in with other observations made in yeast (Radonjic et al, 2005), embryonic stem cells (Bernstein et al, 2006; Lee et al, 2006) and differentiated mammalian cells (Guenther et al, 2007). There are numerous implications to these findings. For example, the widely assumed link between the presence of gene-specific transcription activators and full-length transcription appears to be much looser than expected. These results also underscore the importance of testing established models on a genome-wide scale. Indeed, other such surveys (Birney et al, 2007) indicate that to understand transcription, we may need to take into account even more surprises (such as the presence of ten times more start-sites than protein-coding genes, overlapping transcription units, etc.) than the post-recruitment mechanisms demonstrated in Drosophila.

Bernstein BE, Mikkelsen TS, Xie X, Kamal M, Huebert DJ, Cuff J, Fry B, Meissner A, Wernig M, Plath K, et al. (2006) A bivalent chromatin structure marks key developmental genes in embryonic stem cells. Cell 125: 315-326

Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, et al. (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447: 799-816

Guenther MG, Levine SS, Boyer LA, Jaenisch R, and Young RA (2007) A chromatin landmark and transcription initiation at most promoters in human cells. Cell 130: 77-88

Lee TI, Jenner RG, Boyer LA, Guenther MG, Levine SS, Kumar RM, Chevalier B, Johnstone SE, Cole MF, Isono K, et al. (2006) Control of developmental regulators by Polycomb in human embryonic stem cells. Cell 125: 301-313

Muse GW, Gilchrist DA, Nechaev S, Shah R, Parker JS, Grissom SF, Zeitlinger J, and Adelman K (2007) RNA polymerase is poised for activation across the genome. Nat Genet 39: 1507-1511

Radonjic M, Andrau JC, Lijnzaad P, Kemmeren P, Kockelkorn TT, van Leenen D, van Berkum NL, and Holstege FC (2005) Genome-wide analyses reveal RNA polymerase II located upstream of genes poised for rapid response upon S. cerevisiae stationary phase exit. Mol Cell 18: 171-183

Zeitlinger J, Stark A, Kellis M, Hong JW, Nechaev S, Adelman K, Levine M, and Young RA (2007) RNA polymerase stalling at developmental control genes in the Drosophila melanogaster embryo. Nat Genet 39: 1512-1516

February 12, 2008

Information processing in signaling networks

Research highlight by Charles Auffray, Functional Genomics and Systems Biology for Health, UMR7091, CNRS and Pierre & Marie Curie University—Paris VI, Villejuif, France

The work presented by Helikar et al. (2008) in a paper recently published in PNAS represents a promising new step in the development of computational cellular physiology in eukaryotes. From curated cellular and biochemical data available in the literature, the authors have assembled a discrete Boolean model of signal transduction comprising 130 nodes, and examined in a systematic and controlled manner how varying combinations of external inputs translate into a range of cellular responses. The qualitative model is not only able to reproduce known input-output relationships representative of major transduction pathways, but it also provides evidence in support of the emergence of information-processing functions from the complex cellular network of molecular interactions. This is strikingly demonstrated by the fact that a large sample of randomly selected input combinations results in a very limited fraction of the possible outputs, which correspond to well-characterized global biological responses, a result which is obtained irrespective of the level of noise introduced in the inputs of the model. Moreover, similar input combinations are neatly clustered by the model into equivalence classes of global outputs, reflecting the ability of the cell to integrate complex environmental signals and translate them into robust specific responses and behaviours through common intracellular pathways. While discrete Boolean modelling makes it possible to highlight emergent properties of transduction networks, overcoming the hurdle of parameter estimation, very much as in classical physiology, it provides only high-order views in the form of black boxes with limited predictive and explanatory power. Integration with continuous models will be essential to unravel and engineer the underlying mechanisms.
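To make the idea of "many inputs, few outputs" concrete, here is a minimal sketch of a Boolean network in Python. It is not the Helikar et al. model: the five nodes, their update rules and the three external inputs are all invented for illustration, and the network is updated synchronously from an all-off initial state, which is only one of several possible update schemes.

```python
from itertools import product

# Hypothetical toy network: node names and rules are invented for illustration
# and are NOT taken from the 130-node model of Helikar et al.
NODES = ("receptor", "kinase", "inhibitor", "proliferate", "apoptosis")


def step(state, inputs):
    """Synchronously update every node from the previous state and fixed inputs."""
    growth_factor, stress, nutrient = inputs
    return {
        "receptor": growth_factor and not stress,
        "kinase": state["receptor"] or (nutrient and state["kinase"]),
        "inhibitor": stress and not state["kinase"],
        "proliferate": state["kinase"] and not state["inhibitor"],
        "apoptosis": state["inhibitor"] and not state["kinase"],
    }


def settle(inputs, max_steps=50):
    """Iterate from an all-off state until a state repeats; return the output nodes."""
    state = dict.fromkeys(NODES, False)
    seen = set()
    for _ in range(max_steps):
        key = tuple(state[name] for name in NODES)
        if key in seen:  # the trajectory has entered its attractor
            break
        seen.add(key)
        state = step(state, inputs)
    return state["proliferate"], state["apoptosis"]


def classify_all_inputs():
    """Group every input combination by the global output it produces."""
    classes = {}
    for inputs in product((False, True), repeat=3):
        classes.setdefault(settle(inputs), []).append(inputs)
    return classes


if __name__ == "__main__":
    for output, members in classify_all_inputs().items():
        print(output, "<-", len(members), "input combinations")
```

Even in this tiny example the eight possible input combinations collapse onto a smaller set of output classes, which is the qualitative behaviour described above; whether and how this scales to a realistic network is exactly what the paper investigates.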

Helikar T, Konvalina J, Heidel J, Rogers JA (2008). Emergent decision-making in biological signal transduction networks. PNAS 105, 1913-1918

January 18, 2008

Will probiotics bring systems biology to our table?

(via Scintilla)

The article on "Probiotics modulation of mammalian metabolism" published this week in Molecular Systems Biology by Jeremy Nicholson and colleagues (Martin et al, 2008) has attracted some attention (read the nice summary in Science News) in some (very) popular media (here, here, here and here).

In this follow-up study of the paper published last year (Martin et al, 2007), the team led by Jeremy Nicholson, in collaboration with Nestlé, demonstrates clear physiological effects of oral probiotics administration on mice harbouring a humanized microbiome. The effects are intricate: both the host flora and metabolism are altered. By analyzing metabolite pools in several compartments (liver, blood, urine, feces, gut), and following in parallel the host microbiota, patterns of correlations between microbial species and metabolites start to be visible and reveal the probiotics-induced modulation of the microbial-mammalian interactions. But the actual paper is really just next door (synopsis), so have a look...

How will these results translate to humans? What will be the best way to influence our microbiome? Drugs or yoghurt? These are fascinating questions and the understanding of how our physiology depends on the microbial flora could have profound consequences, particularly in these times when we seem to be in a "rush to gene-based solutions to all our problems" (Wilson, 2007). Will personal genomics have to ultimately develop into personal metagenomics to include our "extended" microbial genome?

Even if I usually prefer to resist the temptation of a self-promoting section in this blog, I find the attention of the media for this topic interesting (despite the usual variable accuracy of newspaper reports) because it points to an area where systems biology provides insights into topics of immediate interest to the general public.

The NIH has recently started its Human Microbiome Project. In this context, this study also underscores the importance of developing model systems and tools to manipulate the microbiome and to analyze the incredibly dense and intricate interactions that connect host and microbial species. A field where top-down systems biology seems indeed a very pragmatic and promising approach.

January 14, 2008

Morphogen Paradoxes

A controversy seems to be brewing over some recent theories and quantitative analyses addressing the fundamental question of how the Bicoid morphogen gradient is established and decoded in early Drosophila embryos. The transcription factor Bicoid controls the anterior-posterior patterning of the developing embryo. It is translated from maternal mRNA localized at the anterior pole of the egg and its graded distribution activates, in a concentration-dependent manner, the expression of gap genes, thus determining their spatial domain of expression. Synthesis from a localized source combined with diffusion and uniform degradation of the Bicoid morphogen provides one of the simplest models to explain the approximately exponential shape of its gradient. While, historically, patterning has been thought to rely on the gradient at its steady state – that is when synthesis, transport and degradation processes balance each other – the question arose as to whether steady-state can be reached rapidly enough in the quickly developing embryo (Lander, 2007).
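For readers who like to see this simple model in action, the sketch below simulates synthesis at one end of a one-dimensional domain, diffusion with coefficient D and uniform degradation with rate k, using an explicit finite-difference scheme. At steady state this textbook model gives an exponential profile with length constant sqrt(D/k). All parameter values are arbitrary, dimensionless choices made for illustration; they are not measured Bicoid values, and the function name is hypothetical.

```python
import math


def simulate_gradient(D=1.0, k=0.01, source_flux=1.0,
                      n_cells=100, dx=1.0, dt=0.2, t_end=1000.0):
    """Explicit finite-difference solution of dc/dt = D c'' - k c.

    Molecules are produced at x = 0 and both ends are reflecting.
    Parameter values are illustrative assumptions (arbitrary units).
    """
    c = [0.0] * n_cells
    for _ in range(int(round(t_end / dt))):
        new = c[:]
        for i in range(n_cells):
            left = c[i - 1] if i > 0 else c[i]             # reflecting boundary
            right = c[i + 1] if i < n_cells - 1 else c[i]  # reflecting boundary
            laplacian = (left - 2.0 * c[i] + right) / dx ** 2
            new[i] = c[i] + dt * (D * laplacian - k * c[i])
        new[0] += dt * source_flux / dx  # localized synthesis at the "anterior pole"
        c = new
    return c


if __name__ == "__main__":
    profile = simulate_gradient()
    # Estimate the decay length from two positions and compare with sqrt(D/k).
    fitted = (20 - 10) / math.log(profile[10] / profile[20])
    print("fitted length constant:", round(fitted, 2))
    print("sqrt(D/k):", math.sqrt(1.0 / 0.01))
```

Calling the same function with a smaller `t_end` gives the pre-steady-state profiles that are at the heart of the debate: far from the source, the gradient is still building up long after it has settled near the source.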

In February last year, Naama Barkai and colleagues published a study (Bergmann et al, 2007) in which they propose that the gradient would in fact be interpreted before it has reached its steady-state, when the gradient is still "moving". Experimental evidence for a dynamic evolution of the Bcd profile between cleavage cycles 11 and 12 is provided using a reporter gene driven by bicoid-binding sites. These authors further show that a pre-steady-state model implies a reduced sensitivity of the gradient readout to variations in the production of morphogen at its source. One biologically relevant example of this robustness is the observation that the domain of expression of hunchback, a Bicoid target gene, shifts much less in embryos from mothers with altered bicoid gene dosage than would be predicted by a steady-state model.

A few months later, Thomas Gregor and colleagues published two papers (Gregor et al, 2007a, 2007b) reporting a detailed analysis of the profile and dynamics of the Bicoid gradient. Quantitative in vivo imaging of a transgenic bicoid-eGFP reporter revealed several paradoxes. While a stable gradient of nuclear Bicoid is quickly established (within 90 min, approx. cleavage cycle 9), the (local) diffusion coefficient of Bicoid, as deduced from photobleaching experiments, appears to be far too small (D=0.3 μm²/s, much less than expected from previous estimations made by injecting labeled dextran molecules) to be compatible with such a rapid establishment of the (long-range) gradient by diffusion alone. These experiments further show that nuclear Bicoid is under a highly dynamic nucleocytoplasmic equilibrium, pointing to a fundamental role for the nucleus in gradient establishment and stability. Finally, the precision with which the Bicoid gradient is transformed into Hunchback expression (see illustration, after Gregor et al 2007b) is estimated to be around 10%. This remarkable level of precision would not only be close to the physical limits of the system, but also strikingly matches the accuracy required to detect changes of Bicoid expression between adjacent cells (10%, equivalent to a difference of only 70 Bicoid molecules per nucleus) and the level of reproducibility of the absolute morphogen concentration from embryo to embryo (10% as well).
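A back-of-the-envelope calculation shows why such a small diffusion coefficient is puzzling. The sketch below uses the one-dimensional relation between root-mean-square displacement and time, sqrt(2Dt), with the D and the 90 min quoted above. The embryo length of roughly 500 μm is an assumption added here for scale and does not come from the papers discussed; the function names are likewise just illustrative.

```python
import math


def rms_displacement_um(diffusion_um2_per_s, seconds):
    """One-dimensional root-mean-square displacement, sqrt(2 D t), in micrometres."""
    return math.sqrt(2.0 * diffusion_um2_per_s * seconds)


def time_to_spread_s(distance_um, diffusion_um2_per_s):
    """Time for the 1D rms displacement to reach a given distance, L^2 / (2 D)."""
    return distance_um ** 2 / (2.0 * diffusion_um2_per_s)


D_BICOID = 0.3             # um^2/s, value quoted in the text
ESTABLISHMENT_S = 90 * 60  # 90 min, value quoted in the text
EMBRYO_LENGTH_UM = 500.0   # ASSUMPTION: approximate embryo length, for scale only

if __name__ == "__main__":
    spread = rms_displacement_um(D_BICOID, ESTABLISHMENT_S)
    hours = time_to_spread_s(EMBRYO_LENGTH_UM, D_BICOID) / 3600.0
    print("rms displacement after 90 min: %.0f um" % spread)
    print("time to spread over %.0f um: %.0f h" % (EMBRYO_LENGTH_UM, hours))
```

Under these assumptions a molecule typically travels only a few tens of micrometres in 90 minutes, far short of the scale of the embryo, which is the mismatch the authors point out.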

In a Correspondence published last week, Bergmann and colleagues (2008) dispute these interpretations and claim that a "reanalysis of their [Gregor et al's] data demonstrates that their findings are consistent with the well-accepted paradigm of diffusion-based patterning and provides further support for the notion that the Bicoid profile is decoded prior to reaching its steady state". Thus, according to these authors, constant nuclear Bicoid levels are not indicative of steady-state of the gradient itself given that cytoplasmic levels may still be changing. The small diffusion coefficient of Bicoid would then be an additional argument in favor of the necessity of a pre-steady-state decoding mechanism. If this is the case, the differences in Bicoid levels between adjacent cells would be much bigger at cleavage cycle 9 (50% instead of 10% at cycle 14), thus resolving the paradox of the high precision of the hunchback response.

In their response (Bialek et al, 2008), Gregor and colleagues reply that if cells made a decision by reading Bicoid concentration at cycle 9, the boundary between expression domains would be 5 cells wide at cycle 14 (=\sqrt{2^{14}/2^{9}}), while in reality it is only a single cell wide. While they agree that the overall gradient might not be at steady-state at these early stages, they argue that the stability of nuclear Bicoid levels is functionally highly relevant given that Bicoid is a transcription factor. Finally, they also point out that the deduced local diffusion constant is so small that it is in fact incompatible with observing any Bicoid in the middle of the embryo in the first place, thus suggesting the existence of additional mechanisms to explain establishment of the gradient at the scale of the entire embryo. These and some additional arguments lead Bialek et al to conclude that "the small values of the diffusion constant for Bcd we reported are superficially consistent with their model, but the model provides no basis for understanding any of our observations."
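The square-root expression in this argument can be checked in a few lines. One way to read it (my interpretation, not a quote from Bialek et al) is that the number of nuclei doubles at each cleavage cycle and that, for nuclei spread over a surface, the number of nuclei along one direction grows as the square root of the total. The function name below is hypothetical.

```python
import math


def boundary_width_in_cells(read_cycle, final_cycle):
    """Width, in nuclei, that one nucleus at `read_cycle` spans by `final_cycle`.

    Assumes (as an interpretation of the formula in the text) that nuclei
    double at every cycle and are arranged on a two-dimensional surface.
    """
    fold_increase = 2 ** final_cycle / 2 ** read_cycle
    return math.sqrt(fold_increase)


if __name__ == "__main__":
    width = boundary_width_in_cells(9, 14)
    print("fold increase in nuclei:", 2 ** 14 // 2 ** 9)
    print("boundary width: %.2f cells" % width)
```

The result lies between 5 and 6 cells, consistent with the "5 cells wide" quoted in the response and clearly larger than the single-cell boundary observed.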

Mmmmh... not an easy one. Those who have additional insights into these subtle but fascinating questions, please let us know!

January 11, 2008

Consumer Health Information Technology

I highly recommend visiting the NIH VideoCasting page, which hosts many interesting video/podcasts. Even if I realize that this is a bit old according to the blogosphere time scale, I would like to point to this one: "The Future: Consumer Health Information Technology", featuring talks given at an NCI-sponsored meeting on Dec 10, 2007 by Adam Bosworth (formerly "Google Health architect", now starting his own company Keas), Bern Shen (Intel) and Bill Crounse (Microsoft).

In his introduction to the meeting, Bradford Hesse (NCI) colorfully summarizes one of the main concepts exposed by the speakers (the video is very long, so I give some pointers: 0h16min43sec) by comparing the future of healthcare to...an "IKEA flat pack": patients will progressively be empowered to assemble their own care from home, like they would build a piece of (cheap) furniture.

Adam Bosworth (0h25min53sec) presents his very pragmatic vision of how IT could concretely help healthcare (0h39min07sec): a) help consumers to own and control their personal health data, starting with very simple basic information; b) provide tools for doctors so that they can deliver personalized care as easily as producing a spreadsheet; c) develop tools for researchers to facilitate the design and implementation of new protocols and clinical trials.

Bill Crounse (Microsoft's other Bill...1h14min30sec) sees 5 major current trends that will increasingly challenge the healthcare system and call for IT solutions (1h26min22sec): a) increasing personal responsibility ("the end of health insurance"); b) progressive "retailization" of healthcare services (eg appearance of "retail minute clinics"); c) commoditization of healthcare providers; d) globalization of access to information (through the web of course); e) globalization of healthcare services. I recommend his funny little anecdote on the high-tech GPS wireless-connected plumber (1h25min30sec) who appears to be better equipped than any practicing physician...

The speakers also all insist on the need for massive data integration promoted by the interoperability of formats and coding information, themes that probably sound familiar to many systems biologists.

Toward the end of his talk (1h35min00sec), Bill Crounse shows a short "science-fiction" movie on Microsoft's vision of the future of healthcare: a world full of credit-card sized tablet PCs, touch screens and many other very exciting gadgets (I love gadgets!). But I can't help missing the warmth of human-to-human interactions within this jungle of virtual consultations, retail clinics, remote controlled metabolic parameters, etc... and I didn't quite see in that movie that the doctor would spend more time with his patient or the daughter with her sick Grandma. But this may of course only reflect some old-fashioned side of my temperament...

November 20, 2007

Personal genomics for a fistful of dollars

The wave of personal genomics is progressing rapidly. A string of four papers appeared recently (Porreca et al, 2007, Albert et al, 2007, Okou et al 2007, Hodges et al, 2007) reporting on microarray-based technologies that enable the enrichment of selected genomic fragments in a single massively multiplexed reaction, thus greatly facilitating subsequent resequencing of pre-defined portions of the human genome (eg all coding exons). These technologies are expected to reduce dramatically the cost of targeted resequencing of individual genomes.

On the commercial front, deCODE and 23andMe have launched their personal genome services offering genome-wide SNP profiling for a little less than $1,000 (NYT articles: Nicholas Wade, Amy Harmon, or Wired, ScienceRoll, Sandra, DNA and You).

The chips used by 23andMe are the "Illumina HumanHap550+ BeadChip, which reads more than 550,000 SNPs (single nucleotide polymorphisms) plus a 23andMe custom-designed set that analyzes more than 30,000 additional SNPs." The profile provided by deCODEme includes "over one million variants across the genome."

So what do you think?

November 18, 2007

Glia-neuron interactions

Nature Neuroscience has a nice special focus on glia and disease. The featured reviews and perspective articles discuss multiple aspects of neuron-glia interactions and their role in disease. The reason why I am highlighting this collection here is that I have the feeling that this field could potentially be a nice playground for systems biology.

For example, Rossi and colleagues (2007) review the various metabolic processes affected during brain ischemia. Several of the examples discussed illustrate very well how the extent of brain damage is determined by the concurrent dynamics of both harmful and protective processes engaging complex interactions between neurons and astrocytes. A critical determinant for ischemic damage is the catastrophic loss of ATP levels caused by deficient glucose and oxygen delivery. Astrocytes have glycogen stores that can normally be converted to lactate, which is exported to neurons to provide energy during phases of high activity. In the absence of oxygen, however, lactate can no longer be oxidized. In this case, glucose may then help delay loss of ATP levels, via anaerobic glycolysis. But this beneficial effect might be counteracted by lactic acidosis caused by continued glycolysis in the absence of O2, which is known to accentuate ischemic damage in the case of hyperglycemia. Moreover, acidosis may activate Na+-H+ exchange, leading to cytosolic Na+ accumulation and reversal of Na+-Ca2+ exchange, which results in astrocyte Ca2+ overload, either impairing their protective functions or even killing them.

A similar complexity is seen in the events underlying ischemic glutamate release. Loss of cellular ATP levels impairs the function of the Na+-K+ ATPase and thus disrupts ionic gradients. The resulting depolarization leads to a large increase in extracellular glutamate that is amplified by positive feedback, ultimately resulting in neuronal death by excitotoxicity. Astrocytes may contribute to increased extracellular glutamate levels via direct vesicular glutamate release and vesicular ATP release that in turn activates glutamate-permeable P2X receptors. Glutamate reuptake is normally carried out by five high-affinity sodium-dependent glutamate transporters. Disruption of transmembrane potential and of ionic gradients can cause transporter reversal, thus further contributing to glutamate release. This depends in turn on the intracellular glutamate concentration, which is much higher in astrocytes than in neurons, determining the relative kinetics of neuronal and astrocytic reuptake/release as the ischemic perturbations progress. Further details are visible in Figure 3 from Rossi et al (2007):

Even if this short overview is condensed and incomplete, it suggests to me that quantitative measurements and integrated modeling could be quite helpful, if feasible, to understand the various contributions of the many processes involved and to identify potential points of protective synergies or characterize regimes under which the stability of the astrocyte-neuron system is catastrophically compromised. Perhaps this type of model and its calibration could even serve as a starting point to investigate the involvement of astrocytes in computational aspects of neuronal functions (Wang et al, 2006).