About Data integration

This page contains an archive of all entries posted to The Seven Stones in the Data integration category. They are listed from newest to oldest.

This weblog is licensed under a Creative Commons License.

Data integration Archives

March 3, 2008

Less papers to read, more data to use...

In a nice post at bbgm, Deepak writes:

...historical online literature lacks the relevant structure and metadata to make our task easier, but it is time that publishers thought ahead about some of the advantages of online publishing.

I can't agree more. I have sometimes heard the claim that within 5-10 years, more than 95% of the scientific literature will be read by computers only. Possible. However, the converse might be interesting to consider: what if 95% of scientific papers could be 'written' by computers? Even if this formulation is obviously provocative and unrealistic, the point is that harnessing the 'network effect' of the web may have two complementary components, one community-driven and the other computer-driven. On one hand, web 2.0 functionalities enable community-driven commenting, rating and even writing of scientific publications. On the other hand, semantic web technologies are expected to facilitate computer-driven integration of scientific data from multiple sources, which is likely to play an increasingly important role in science. Rather than mining thousands of unread papers, the scientist of the future may instead search the web for relevant data first and integrate it to generate, or 'write', novel insight. In fact, integration of large datasets already represents a major field of research in systems biology (see Chuang et al 2007, Xue et al 2007 or Mani et al 2008 as recent examples published in Mol Syst Biol).

It thus seems that, in addition to being web 2.0 enabled, new publishing models should 'embed' more structured data into online publications. In short, 'papers' could progressively transform into hybrid online objects that more closely resemble database records (see Timo Hannay's post on this topic) or highly structured documents. At the extreme, one could even imagine publishing 'naked' datasets, without any 'stories' around them. Of course, efficient data integration will require the data to be in a standard and structured format, and its quality will have to be well characterized. These are all far from trivial requirements.

The good old-fashioned papers are probably not going to disappear as publication units, in particular for high-impact studies reporting novel and deep insights. Nor is the point here to propose dumping every scientist's hard drive onto the web. Data-rich publications would be published only when the authors feel it appropriate. There might thus be some equilibrium to find between papers that will never be read except by a text mining engine and pure datasets, published as a resource, that are easier to search, mine and integrate. This dialectic may ultimately boil down to how well text mining and data integration technologies will perform in the future.

In any case, within the context of the current debate about the saturation of the peer-review system, I wonder whether a data-centric form of scientific publishing could help relieve some of the pressure. Reviewing of datasets might be quicker and could rely more on standardized evaluation parameters. If accompanied by proper credit attribution mechanisms and metrics of impact, data-rich (or even data-only) publications may represent an alternative model complementing the traditional 'paper' format. It would prevent the loss of useful data otherwise buried in verbal descriptions and, most importantly, would hopefully stimulate web-wide integration of disparate datasets.

February 21, 2008

Top-down mapping of gene regulatory pathways

In a very recent lecture (see full video from NIH VideoCasting) given for the NIH Systems Biology Special Interest Group, Trey Ideker presents a great overview of the various strategies his group has been developing in recent years to integrate multiple types of large-scale datasets. While one of the most pervasive 'memes' about high-throughput measurements is that they are "notoriously unreliable" (see Hakes et al, 2008, for a recent example), Trey beautifully illustrates how predictive computational models and novel biological insights can be generated by sophisticated data integration strategies. Three types of applications are presented in his talk:

  1. mapping of transcriptional response pathways
  2. functional mapping of protein complexes
  3. disease diagnosis and stratification

In the last section, Trey presents the study recently published in Molecular Systems Biology (Chuang et al, 2007, video: 00hr:39min:15sec), in which the information provided by microarray expression profiling is superimposed on a protein-protein physical interaction network to identify 'subnetwork' biomarkers that classify metastatic vs non-metastatic breast tumors.
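To make the idea of a 'subnetwork' biomarker more concrete, here is a minimal sketch of overlaying expression data on an interaction network. It is not the scoring scheme of the published study: the greedy search, the Welch t-statistic used as a stand-in for class separation, the function names and the toy data are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def zscore_rows(expr):
    """Standardize each gene's expression values across samples."""
    out = {}
    for gene, values in expr.items():
        mu, sd = mean(values), stdev(values)
        out[gene] = [(v - mu) / sd for v in values]
    return out

def activity(z, genes):
    """Per-sample activity of a gene set: summed z-scores scaled by sqrt(size)."""
    genes = list(genes)
    n_samples = len(z[genes[0]])
    return [sum(z[g][i] for g in genes) / sqrt(len(genes)) for i in range(n_samples)]

def separation(scores, labels):
    """Absolute Welch t-statistic between the two classes (labels are 0 or 1)."""
    a = [s for s, l in zip(scores, labels) if l == 1]
    b = [s for s, l in zip(scores, labels) if l == 0]
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return abs(mean(a) - mean(b)) / se

def grow_subnetwork(seed, network, z, labels):
    """Greedily add interaction partners while class separation improves."""
    members = {seed}
    best = separation(activity(z, members), labels)
    while True:
        frontier = {n for g in members for n in network.get(g, ())} - members
        scored = [(separation(activity(z, members | {n}), labels), n)
                  for n in sorted(frontier) if n in z]
        if not scored or max(scored)[0] <= best:
            return members, best
        best, pick = max(scored)
        members.add(pick)

# Hypothetical toy data: 3 samples of class 1 followed by 3 of class 0.
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "A": [5.0, 6.0, 5.5, 3.0, 2.5, 3.5],
    "B": [4.0, 4.5, 5.0, 2.0, 2.5, 1.5],
    "C": [1.0, 3.0, 2.0, 3.0, 1.0, 2.0],  # uninformative gene
}
network = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}

members, score = grow_subnetwork("A", network, zscore_rows(expr), labels)
```

In this toy case the search keeps the two informative, interacting genes and leaves out the uninformative neighbor, which is the intuition behind scoring subnetworks rather than single genes.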

February 15, 2008

Transcription paused and poised for regulation

Research highlight by Frank C.P. Holstege, Department of Physiological Chemistry, University Medical Center Utrecht, the Netherlands.

For eukaryotes, it is widely thought that transcription is primarily regulated through recruitment of the essential machinery to transcription start-sites. Previous hints challenging this paradigm have been confirmed by recent analyses showing that transcription regulation of a large number of genes actually occurs after recruitment. Mechanistically, such studies have gone furthest in Drosophila melanogaster (Muse et al, 2007; Zeitlinger et al, 2007). Here, conservative estimates indicate that more than 10% of genes are regulated through promoter-proximal pausing. On such genes, RNA polymerase II is recruited and initiates transcription, but then pauses around 50 bp downstream of the transcription start-site where it awaits further signals to resume elongation and complete transcription proper. These observations tie in with other observations made in yeast (Radonjic et al, 2005), embryonic stem cells (Bernstein et al, 2006; Lee et al, 2006) and differentiated mammalian cells (Guenther et al, 2007). There are numerous implications to these findings. For example, the widely assumed link between the presence of gene-specific transcription activators and full-length transcription appears to be much looser than expected. These results also underscore the importance of testing established models on a genome-wide scale. Indeed, other such surveys (Birney et al, 2007) indicate that to understand transcription, we may need to take into account even more surprises than the post-recruitment mechanisms demonstrated in Drosophila, such as the presence of ten times more start-sites than protein-coding genes, overlapping transcription units, etc.

Bernstein BE, Mikkelsen TS, Xie X, Kamal M, Huebert DJ, Cuff J, Fry B, Meissner A, Wernig M, Plath K, et al. (2006) A bivalent chromatin structure marks key developmental genes in embryonic stem cells. Cell 125: 315-326

Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, et al. (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447: 799-816

Guenther MG, Levine SS, Boyer LA, Jaenisch R, and Young RA (2007) A chromatin landmark and transcription initiation at most promoters in human cells. Cell 130: 77-88

Lee TI, Jenner RG, Boyer LA, Guenther MG, Levine SS, Kumar RM, Chevalier B, Johnstone SE, Cole MF, Isono K, et al. (2006) Control of developmental regulators by Polycomb in human embryonic stem cells. Cell 125: 301-313

Muse GW, Gilchrist DA, Nechaev S, Shah R, Parker JS, Grissom SF, Zeitlinger J, and Adelman K (2007) RNA polymerase is poised for activation across the genome. Nat Genet 39: 1507-1511

Radonjic M, Andrau JC, Lijnzaad P, Kemmeren P, Kockelkorn TT, van Leenen D, van Berkum NL, and Holstege FC (2005) Genome-wide analyses reveal RNA polymerase II located upstream of genes poised for rapid response upon S. cerevisiae stationary phase exit. Mol Cell 18: 171-183

Zeitlinger J, Stark A, Kellis M, Hong JW, Nechaev S, Adelman K, Levine M, and Young RA (2007) RNA polymerase stalling at developmental control genes in the Drosophila melanogaster embryo. Nat Genet 39: 1512-1516

January 11, 2008

Consumer Health Information Technology

I highly recommend visiting the NIH VideoCasting page, which hosts many interesting video/podcasts. Even if I realize that this is a bit old on the blogosphere time scale, I would like to point to this one: "The Future: Consumer Health Information Technology", featuring talks given at an NCI-sponsored meeting on Dec 10, 2007 by Adam Bosworth (formerly "Google Health architect", now starting his own company Keas), Bern Shen (Intel) and Bill Crounse (Microsoft).

In his introduction to the meeting, Bradford Hesse (NCI) colorfully summarizes one of the main concepts presented by the speakers (the video is very long, so I give some pointers: 0h16min43sec) by comparing the future of healthcare to...an "IKEA flat pack": patients will progressively be empowered to assemble their own care from home, as they would build a piece of (cheap) furniture.

Adam Bosworth (0h25min53sec) presents his very pragmatic vision of how IT could concretely help healthcare (0h39min07sec): a) help consumers own and control their personal health data, starting with even very simple, basic information; b) provide tools for doctors so that they can deliver personalized care as easily as producing a spreadsheet; c) develop tools for researchers to facilitate the design and implementation of new protocols and clinical trials.

Bill Crounse (Microsoft's other Bill...1h14min30sec) sees 5 major current trends that will increasingly challenge the healthcare system and call for IT solutions (1h26min22sec): a) increasing personal responsibility ("the end of health insurance"); b) progressive "retailization" of healthcare services (eg appearance of "retail minute clinics"); c) commoditization of healthcare providers; d) globalization of access to information (through the web of course); e) globalization of healthcare services. I recommend his funny little anecdote about the high-tech, GPS-equipped, wirelessly connected plumber (1h25min30sec) who appears to be better equipped than any practicing physician...

The speakers also all insist on the need for massive data integration promoted by the interoperability of formats and coding information, themes that probably sound familiar to many systems biologists.

Toward the end of his talk (1h35min00sec), Bill Crounse shows a short "science-fiction" movie on Microsoft's vision of the future of healthcare: a world full of credit-card sized tablet PCs, touch screens and many other very exciting gadgets (I love gadgets!). But I can't help missing the warmth of human-to-human interactions a bit within this jungle of virtual consultations, retail clinics, remote controlled metabolic parameters, etc... and I didn't quite see in that movie that the doctor would spend more time with his patient or the daughter with her sick Grandma. But this may of course only reflect some old-fashioned side of my temperament...

May 5, 2007

Semantic zooming of networks

One can only agree with Euan Adie that "the way we present genomic and proteomic data on the web sucks" (read post on Nascent). And this holds for biological networks: depiction of protein-protein interactions as colorful hairballs results in impressive figures but is not necessarily very useful. While the network representation is a powerful abstract representation of biological processes, it is trivial to say that a graph (with its jungle of nodes and edges) is far from even remotely resembling an actual living cell as you see it under the microscope... In the crude visualization of biological processes as simple graphs, space, time, multi-scale structure and biological context are missing.

Charles DeLisi makes an attempt to tackle the problem of visualization of complex multi-scale biological networks by introducing the use of metagraphs (Hu et al, 2007, Nature Biotech 25:547). Metagraphs have so-called metanodes in addition to simple nodes. A metanode contains a subgraph composed of child (meta)nodes, which are revealed only when the metanode is in its "expanded" state. Edges link simple nodes while metaedges link "contracted" metanodes and are inferred from the links carried by nodes of the underlying subgraph. A key distinctive feature of metagraphs is that several instances (carrying different "labels") of a node can be shared between distinct metanodes (eg when a protein belongs to different complexes).
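The expand/contract behavior described above can be sketched in a few lines. This is only an illustration of the metagraph idea, not VisANT's implementation: the class names, the `representatives` and `visible_edges` helpers and the toy complexes are all hypothetical, and a shared node is simply represented once per metanode that contains it.

```python
from dataclasses import dataclass, field

@dataclass
class MetaNode:
    name: str
    children: set = field(default_factory=set)  # names of child nodes or metanodes
    expanded: bool = False

class MetaGraph:
    def __init__(self):
        self.edges = set()      # undirected edges between simple nodes
        self.metanodes = {}     # metanode name -> MetaNode

    def add_edge(self, a, b):
        if a != b:              # self-loops are ignored in this sketch
            self.edges.add(frozenset((a, b)))

    def add_metanode(self, name, children):
        self.metanodes[name] = MetaNode(name, set(children))

    def representatives(self, name):
        """Names under which a node is currently drawn (itself, or contracted ancestors)."""
        parents = [m for m in self.metanodes.values() if name in m.children]
        if not parents:
            return {name}
        reps = set()
        for parent in parents:
            for shown in self.representatives(parent.name):
                if shown != parent.name:
                    reps.add(shown)        # parent is hidden in a contracted ancestor
                elif parent.expanded:
                    reps.add(name)         # parent is open, so the child is visible
                else:
                    reps.add(parent.name)  # parent is contracted and stands in for the child
        return reps

    def visible_edges(self):
        """Edges and inferred metaedges for the current expanded/contracted state."""
        shown = set()
        for edge in self.edges:
            a, b = tuple(edge)
            for ra in self.representatives(a):
                for rb in self.representatives(b):
                    if ra != rb:
                        shown.add(frozenset((ra, rb)))
        return shown

# Toy example: P2 is shared between two hypothetical complexes.
g = MetaGraph()
for a, b in [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]:
    g.add_edge(a, b)
g.add_metanode("ComplexA", ["P1", "P2"])
g.add_metanode("ComplexB", ["P2", "P3"])

contracted = g.visible_edges()            # both complexes contracted
g.metanodes["ComplexA"].expanded = True
expanded = g.visible_edges()              # ComplexA opened up
```

With both complexes contracted, only the metaedges ComplexA-ComplexB and ComplexB-P4 remain; expanding ComplexA reveals the P1-P2 edge while P2's second instance stays folded inside ComplexB.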

Metanodes can directly represent the multi-scale modular hierarchy of a network, incorporate biological context (eg sets of proteins sharing the same GO annotation) or even represent groups of orthologous genes. With this representation, implemented in the software VisANT (http://visant.bu.edu/), "semantic zooming" into the network is made possible. This would be similar to zooming into Google Maps, where not only the scale of the map changes but also the resolution of the labels and various abstract annotations, as is best seen using the "hybrid" mode superposing annotations with the satellite picture.

This analogy with Google Maps also illustrates the limits of the current network representation as "maps" of cellular processes. There is still a long way to go before the graphs representing biological networks can really be mapped onto cellular structures, yielding better visualization tools but also more realistic computational models of the whole cell. In a sense, a "Google Cell" should also have a "hybrid" mode, where the abstract representation can be superposed onto the "satellite image" version of the biological object visualized, as if tiny networks were folded inside each voxel of a full 3D reconstruction of a cell, such as the one recently published by Antony and colleagues (Höög et al, 2007, see post). Something like integrating interaction networks, "ORFeome"-like datasets and electron tomography...

January 19, 2007

Analyzing time-series expression data

Ziv Bar-Joseph and colleagues describe their new method, Dynamic Regulatory Events Miner (DREM), to analyze time-series gene expression data and combine them with static ChIP-chip experiments. The expression profiles are modeled using an extension of the Hidden Markov Model that enforces a tree structure onto the expression profiles. The technique makes it possible to deduce the condition-specific or time-dependent activity of the transcription factors that explain the observed expression profiles.

In their analysis of developmental time-series of gene expression in Drosophila, Peer Bork and colleagues apply a more drastic principle to identify robust groups of genes that correlate with major developmental phases. They required "four points of low expression and four subsequent points of high expression (or vice versa) even if the amplitude change was relatively low (see Materials and methods). This type of convolution not only requires a sharp increase or decrease of expression, but also that the change in transcript level is consistent over a period of time, thereby reducing the rate of false positives owing to individual outliers."
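One simple reading of this criterion can be sketched as a sliding-window filter. The function below is an assumption-laden illustration, not the convolution described in the paper's Materials and methods: it merely asks that every one of the four points before a position lies below (or above) every one of the four points after it, whatever the amplitude. The name `sharp_transitions` and the example profiles are hypothetical.

```python
def sharp_transitions(profile, run=4):
    """Return (index, direction) for each consistent step in a time series.

    A step at index i means the `run` points before i are all lower
    (direction "up") or all higher (direction "down") than the `run`
    points starting at i, regardless of how large the change is.
    """
    hits = []
    for i in range(run, len(profile) - run + 1):
        before, after = profile[i - run:i], profile[i:i + run]
        if max(before) < min(after):
            hits.append((i, "up"))
        elif min(before) > max(after):
            hits.append((i, "down"))
    return hits

# A small but sustained increase is detected...
step_profile = [1.0, 1.2, 0.9, 1.1, 2.0, 2.1, 1.9, 2.2, 2.0]
step_hits = sharp_transitions(step_profile)

# ...whereas a single large outlier is not.
outlier_profile = [1.0, 1.0, 5.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
outlier_hits = sharp_transitions(outlier_profile)
```

Requiring agreement across a run of points, rather than a large amplitude at a single point, is what makes such a filter robust to individual outliers.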