GCJ: What is the task at hand? In other words, what are computational biologists trying to accomplish using grid?
Maltsev / Sulakhe: In recent years there has been an unprecedented accumulation of biological information. The genomes of over 500 organisms have been completely sequenced, and over 800 more are at various stages of completion. To analyze this information for the needs of biomedical research, bioremediation and agriculture, one needs high-throughput computational environments that integrate large amounts of genomic and experimental data with bioinformatics tools for knowledge discovery and data mining. Most of these tools and algorithms are very CPU-intensive and require substantial computational resources. The large-scale, distributed computational and storage infrastructure of the Grid offers an ideal platform for mining such large volumes of biological information, which amount to hundreds of terabytes of data. Because of recent advances in genetics, bioinformatics, and grid technology, we can now learn more about the way genes have adapted to changing ecosystems.
One of the major approaches in bioinformatics is comparative analysis. The availability of large volumes of genomic data now allows for systematic comparison of the genomes of organisms that originate from different taxonomic groups, display various levels of complexity in their biological organization, and reside in a variety of environments. Comparative analysis lets us investigate a variety of scientific questions, such as "what is the difference between the genomes of microbes that live in freezing temperatures and those of organisms living at temperatures above 100 degrees Celsius?" or "which genes differ between pathogenic and nonpathogenic strains of E. coli?" Understanding the evolutionary patterns that have emerged in the course of adaptation of organisms to their environments is essential for the future of genetic engineering, medicine and environmental research.
To assist comparative analyses we have developed two computational systems, PUMA2 and GNARE, for high-throughput analysis of genomes and metabolic reconstructions from sequence data. PUMA2 is an interactive system that contains analyses of all publicly available genomes in an integrated comparative framework. The GNARE system allows for interactive analysis of genomic data provided by users. PUMA2 and GNARE are used by researchers worldwide.
To do this, we are comparing a wide variety of organisms that differ phylogenetically and phenotypically, that is, organisms with different genes and different bodies. For example, we might want to compare an unknown gene to a bank of known genes, or we might want to discover the genetic basis for physiological differences that allow one organism to live in freezing conditions and another at hundreds of degrees. Much genetics work is fundamentally comparative in this regard.
But there is also an evolutionary component: previous studies have shown that analysis of phylogenetic data enables us to infer relationships among genes and reconstruct evolutionary events. From there, we can investigate the evolutionary mechanisms that have led to the development of a particular biological function. So for example, we could predict a protein's function and then compare its role in various taxonomic and phenotypic groups. Obviously, this kind of knowledge is very valuable from a basic science perspective, and I'm sure you can imagine all kinds of applications in drug design and diagnostics.
GCJ: What are the current computational needs of researchers in bio-informatics?
Maltsev / Sulakhe: As you probably guessed, genetics research is computationally intensive. GADU (Genome Analysis and Database Update) is the Grid-based automated engine that drives data analysis in both PUMA2 and GNARE. Besides these two large applications, GADU, our high-throughput computational system, supports a variety of projects, including the National Institutes of Health's (NIH) initiative on the Great Lakes Center of Excellence in Bio-defense and Emerging Infectious Diseases (the Pathos database), the NIH Center for Structural Genomics (the TarGet database), the Department of Energy's (DOE) Shewanella Federation, one of the DOE Genomes to Life projects, which investigates a particular bacterium of interest, the DOE Hanford site metagenome analysis project, and others (e.g. Sentra, Chisel). So there is demand for our computational resources in a number of contexts.
Right now we are integrating genomic data from 25 different databases. And by 2010 there will be literally thousands of complete genomes, as well as large amounts of experimental, biochemical and phenotypic data available online. The growth has been absolutely explosive. The problem was that, by our estimate, one CPU would need at least 66 minutes to analyze 100 genomic sequences and over forty hours to process one genome of approximately 4,000 sequences. At that rate, it would take more than three years to analyze the 3.1 million sequences we have in our BLAST database.
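The single-CPU estimate above can be checked with a few lines of arithmetic. This is only a sketch: it assumes analysis time scales linearly with the number of sequences, which the interview implies but does not state, and the constant and function names are illustrative.

```python
# Figures quoted in the interview.
MINUTES_PER_100_SEQUENCES = 66      # "at least 66 minutes" per 100 sequences
SEQUENCES_PER_GENOME = 4_000        # "approximately 4,000 sequences"
TOTAL_SEQUENCES = 3_100_000         # sequences in the BLAST database


def single_cpu_minutes(n_sequences: int) -> float:
    """Estimated minutes on one CPU, assuming linear scaling (an assumption)."""
    return n_sequences / 100 * MINUTES_PER_100_SEQUENCES


genome_hours = single_cpu_minutes(SEQUENCES_PER_GENOME) / 60
database_days = single_cpu_minutes(TOTAL_SEQUENCES) / 60 / 24
database_years = database_days / 365

print(f"one genome:     {genome_hours:.1f} hours")
print(f"whole database: {database_days:.0f} days ({database_years:.2f} years)")
```

Under these assumptions one genome comes to 44 hours and the full database to a little under four years, consistent with the "over forty hours" and "more than three years" quoted above.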
GCJ: How can grid enable genetics research?
Maltsev / Sulakhe: Ultimately, grid has helped us move toward our goal of providing researchers with an environment for co-evolutionary and comparative analysis of genomes, metabolic networks, and enzymes, in the environmental-physiological framework I described.
Using the Open Science Grid (OSG), the TeraGrid, and the DOE Science Grid, we can perform the computations I mentioned previously in a much more timely fashion. The number of CPUs available on those grids varies, of course, but we were able to use up to 1,200 CPUs from various Grid resources at a given time during our last update. With roughly that much computing power, it takes only five and a half days to analyze our 3.1 million sequences using BLAST, or about 10 minutes per genome.
Grid-enabled GADU performs several functions. First, it acquires data from a variety of publicly available databases and stores them temporarily. Then it runs analyses with a mix of public and in-house tools, using the captured data as well as information from our own integrated database. The existing tools we have for performing genetic sequence analysis, resources like TigrFam and PIR iProClass, are not always sufficiently high-resolution. So we've built some new applications, such as PUMA2, Chisel, and TarGet, to support work being done at the NIH and DOE. These tools, used for things like structural and functional analysis, are web-based applications that operate on the data stored in our integrated database.
Finally, GADU stores all the data generated by the workflows executed on the Grid in a relational database, along with the preliminary data acquired from external sources. A few tools in GADU use the stored data as input for further analysis.
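As a rough cross-check of the figures in this answer, the sketch below derives the per-genome time on the Grid and compares the run against the single-CPU rate quoted earlier in the interview (66 minutes per 100 sequences). The linear-scaling assumption and all names are illustrative, not part of GADU.

```python
# Figures quoted in the interview.
TOTAL_SEQUENCES = 3_100_000
SEQUENCES_PER_GENOME = 4_000
GRID_RUN_DAYS = 5.5                 # "five and a half days" on the Grid
MINUTES_PER_100_SEQUENCES = 66      # single-CPU rate quoted earlier

# Wall-clock minutes per genome-sized batch during the Grid run.
genome_equivalents = TOTAL_SEQUENCES / SEQUENCES_PER_GENOME
grid_minutes_per_genome = GRID_RUN_DAYS * 24 * 60 / genome_equivalents

# Speedup relative to one CPU, assuming linear scaling on that CPU.
single_cpu_days = TOTAL_SEQUENCES / 100 * MINUTES_PER_100_SEQUENCES / 60 / 24
speedup = single_cpu_days / GRID_RUN_DAYS

print(f"genome-sized batches: {genome_equivalents:.0f}")
print(f"Grid time per genome: {grid_minutes_per_genome:.1f} minutes")
print(f"speedup vs. one CPU:  {speedup:.0f}x")
```

The per-genome figure lands close to the quoted 10 minutes. The computed speedup is well below 1,200, which is unsurprising given that 1,200 was the peak CPU count rather than a sustained one, and that queuing and data transfer also take time.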
GCJ: What are the logistical or technical challenges you faced when using grid resources?
Maltsev / Sulakhe: One of the initial problems we had was representing our tools as workflows that are resource-independent. Given that we are using heterogeneous grid resources such as OSG, TeraGrid and DOE Science Grid, we wanted to implement something that can be used in any type of Grid environment. Using the Virtual Data System (VDS), Condor-G and Globus, GADU generates site-specific workflows to execute the tools.
Another significant challenge we faced was the site selection from the various Grids we had access to. We needed to monitor the state of all our Grid sites and track how well they were responding to job requests. So we built a site selection mechanism inside GADU that tested all our grids for response time and the like. In the end we only used "available" grids that were communicating and responding to our Globus job submissions with minimal queuing delays.
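The interview does not describe how GADU's site selector is implemented, so the following is only a minimal sketch of the general idea: probe each site, time the response, and keep the sites that answer quickly. The `probe` callables, site names and latency threshold are all hypothetical; a real probe would submit a small test job through the grid middleware, which is not shown here.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SiteStatus:
    name: str
    responded: bool
    latency_s: float


def check_site(name: str, probe: Callable[[], bool]) -> SiteStatus:
    """Run one probe and record whether it succeeded and how long it took."""
    start = time.monotonic()
    try:
        ok = bool(probe())
    except Exception:
        ok = False  # a probe that raises counts as an unresponsive site
    return SiteStatus(name, ok, time.monotonic() - start)


def select_sites(probes: Dict[str, Callable[[], bool]],
                 max_latency_s: float) -> List[str]:
    """Return responsive sites within the latency limit, fastest first."""
    statuses = [check_site(name, probe) for name, probe in probes.items()]
    usable = [s for s in statuses
              if s.responded and s.latency_s <= max_latency_s]
    return [s.name for s in sorted(usable, key=lambda s: s.latency_s)]


# Hypothetical probes standing in for real test-job submissions.
def healthy_probe() -> bool:
    return True


def failing_probe() -> bool:
    raise ConnectionError("site unreachable")


def slow_probe() -> bool:
    time.sleep(0.2)  # simulates a long queuing delay
    return True


available = select_sites(
    {"site_a": healthy_probe, "site_b": failing_probe, "site_c": slow_probe},
    max_latency_s=0.1,
)
print(available)
```

Passing the probe in as a callable keeps the selection logic independent of any particular middleware, in the same spirit as the resource-independent workflows described above.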
GCJ: How have the Globus Toolkit and Grid computing helped make your job easier?
Maltsev / Sulakhe: This is best summed up in five points:
- Computational biologists routinely need to manipulate large amounts of data, and that data is increasing at an exponential rate. By using Grid we are ensuring scalability and preparing for the future.
- We generally don't need to concern ourselves with scheduling, so with Grid we are scalable on demand.
- Computational biologists need absolutely up-to-date information. With Grid we can run updates often, thereby keeping the data fresh.
- In computational biology there are complex analytical pipelines. These often involve mixed and complex media such as video. Grid is an ideal framework for moving and making available complex, heterogeneous aggregations of data and media.
- Grid is more than high-throughput computation; it's also a leading mechanism for the use of web services. Using web services interfaces and the distributed infrastructure of Grid, we plan to make some computational biology services available. Other scientists will be able to use these services and even build them into their own analytical applications.
GCJ: What comes next? Is this data publicly available? How can people learn more if they're interested?
Maltsev / Sulakhe: We are indeed building a set of web services to make GADU available for public usage and other collaborative projects. Stay tuned for their availability.