Discovery of cellular networks from high dimensional data
Organisms use a combination of cis- and trans-acting elements to respond to intra- and extra-cellular environmental stimuli by regulating gene and protein expression. The biological process of transcription begins with the binding of transcription factors to specific sequence motifs upstream of a gene's transcription initiation site. This induces conformational changes in the DNA and initiates assembly of the RNA polymerase complex. The process is complex, with promoters, inhibitors and enhancers all playing a role in regulating the level of gene expression. One consequence of transcriptional activation is that the levels of transcription factors themselves can be affected through the same promotion and repression mechanisms. What emerges is not a single set of interactions, or even a single pathway, but a complex network of interacting genes and gene products. In principle, it is this network and the interactions between its components that we would like to understand, because the network underlies the way in which organisms respond to environmental and other cues.
Understanding the stochastic nature of biological processes
There are now many reports indicating that protein production has a stochastic component that gives rise to very different rates of protein synthesis in genetically identical cells in essentially identical environments. However, the network models developed to date do not allow for this variability. The models are deterministic, meaning that if the right initial conditions are met and the right interactions are represented, then the model will predict a single, specific outcome. If gene expression is fundamentally a stochastic process, then what we are seeing in most experiments is an ensemble of individual cells, and the properties we measure are an average over that ensemble. Although we may be able to predict what happens on average, individual cells may well follow other outcomes that are nearly as probable. In 1995, McAdams and Shapiro attempted to model one of the simplest organisms, lambda phage, and realized that stochastic inputs to the system made reliable prediction of outcome nearly impossible. We believe that we are only scratching the surface and that stochastic processes have a much more pervasive and profound influence on gene expression than we imagine at present.
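The gap between an ensemble average and the behavior of individual cells can be illustrated with a small simulation. The sketch below is not a model from the work described here. It assumes the simplest possible birth-death process (constant protein production, first-order degradation) and simulates it with the Gillespie stochastic simulation algorithm. The function names and the rate values (k_prod, k_deg) are hypothetical and chosen only for illustration.

```python
import random


def simulate_cell(k_prod, k_deg, t_end, rng):
    """Gillespie simulation of one cell; returns the protein count at t_end.

    Assumed toy model (not from the text): proteins are produced at a
    constant rate k_prod and each molecule is degraded at rate k_deg.
    """
    t, n = 0.0, 0
    while True:
        rate_total = k_prod + k_deg * n          # total event rate
        t += rng.expovariate(rate_total)         # waiting time to next event
        if t > t_end:
            return n
        # Choose which event fires, in proportion to its rate.
        if rng.random() * rate_total < k_prod:
            n += 1                               # production event
        else:
            n -= 1                               # degradation event


def simulate_ensemble(n_cells, k_prod=10.0, k_deg=1.0, t_end=10.0, seed=0):
    """Simulate n_cells independent, 'genetically identical' cells."""
    rng = random.Random(seed)
    return [simulate_cell(k_prod, k_deg, t_end, rng) for _ in range(n_cells)]


counts = simulate_ensemble(500)
mean = sum(counts) / len(counts)

# The deterministic rate equation dn/dt = k_prod - k_deg * n predicts a
# single steady-state value, k_prod / k_deg, for every cell.
print("deterministic prediction:", 10.0 / 1.0)
print("ensemble mean:", mean)
print("range across cells:", min(counts), "to", max(counts))
```

Under these assumptions the ensemble mean sits close to the deterministic prediction, while individual cells spread widely around it. That spread is the cell-to-cell variability a purely deterministic network model cannot represent.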
Heterogeneous data integration
Increasingly, we are coming to understand that drawing useful conclusions from complex datasets requires integrating information from a wide range of sources. One of the most revolutionary but conceptually simple advances in the analysis of microarray expression data was the application of categorical statistics. This approach groups genes into classes, defined for example by Gene Ontology term or metabolic pathway assignments, and uses Fisher's Exact Test to identify those classes that are represented in a gene list at a higher frequency than one would expect by chance. Similar approaches, such as Gene Set Enrichment Analysis, allow new data sets to be analyzed in relation to others and help build a consensus across studies that can provide a high degree of confidence in the results. We are in the process of creating a series of databases to facilitate aggregation of data and information from a wide range of sources into a central warehouse and to make them easily available to DFCI researchers. What is still needed, however, is the development of new methods to extract the maximal biological insight from those data.
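The categorical test described above can be sketched in a few lines. For over-representation, the one-sided Fisher's Exact Test on the 2x2 table (in list or not, in class or not) reduces to the upper tail of the hypergeometric distribution, which is what the code below computes using only the Python standard library. The function name, argument names and example counts are hypothetical, and a real analysis would also correct for testing many classes at once.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided Fisher's exact test p-value for over-representation.

    N: number of genes in the background (e.g. all genes on the array)
    K: background genes annotated to the class (e.g. one GO term)
    n: genes in the list of interest (e.g. differentially expressed)
    k: genes in the list of interest that are annotated to the class

    Returns P(X >= k) for X hypergeometric with parameters (N, K, n).
    """
    lowest, highest = max(0, n + K - N), min(n, K)
    if not (0 <= K <= N and 0 <= n <= N and lowest <= k <= highest):
        raise ValueError("inconsistent counts")
    # Count the ways to draw n genes with i of them annotated, for i >= k.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, highest + 1))
    return tail / comb(N, n)


# Hypothetical example: 20000 background genes, 200 annotated to a class,
# and a list of 100 genes of which 10 carry the annotation.
# By chance we would expect about 100 * 200 / 20000 = 1 annotated gene.
p = enrichment_p_value(k=10, n=100, K=200, N=20000)
print("one-sided p-value:", p)
```

Exact integer arithmetic via math.comb is used here to avoid floating-point underflow in the individual terms. A statistics library routine for Fisher's Exact Test would serve the same purpose.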