Guest Expert
Peter Covitz
Director for Infrastructure
NCI Center for Bioinformatics
caGrid

One of the characteristics of life science and biomedical science is the diversity of data types - the heterogeneity of data and the way it's described. In cancer research in particular, this presents interesting challenges for collaboration between different scientists and data sets. Recently, the Globus Consortium Journal had the opportunity to speak with Peter Covitz, director for infrastructure at the NCI Center for Bioinformatics, and to learn more about the growing use of caBIG (the Cancer Biomedical Informatics Grid) and what it means for a data set to be "caBIG-enabled."

GCJ: Tell us a little bit about this "heterogeneity" issue that's prevalent in life sciences and biomedical sciences.

Covitz: Even within a given type of data from, say, a measurement technology, or a theoretical description of biology - even within an area that is "the same" from a conceptual standpoint, there is often a diversity of either terminology or a subtlety of meaning. This is a problem that commonly confronts informaticists who want to integrate resources in life sciences.

Other scientists - such as those in high energy physics, for example - may have tremendous amounts of data and separate challenges with large computational loads, but they tend to deal with a relatively modest number of "well understood" data types in their domains. They don't have this diversity and heterogeneity problem that we face.

So that's exactly what we're addressing with caBIG. We're taking the best possible technology for integrating and sharing resources - namely, the Grid technology that's evolved over the years, driven by use cases in physics and astronomy - and we've extended it to a common base of needs for the life sciences community. The extensions that we've put in have been largely about better support for descriptions of data and diverse data types, and semantic control of those data types by binding them to structured ontologies. We've bound all of that into the Grid framework that Globus already provided. We have thus created what is known as a data Grid. caGrid is probably one of the more sophisticated data Grid architectures out there.

Other people are solving the problems of large data set transfer. When we get confronted with that bottleneck of large size data manipulation, data transfer and federation - we can use the technologies that have been devised and optimized by those that tackled the problem before we did. It's the data complexity and heterogeneity problem that we focus our energies on.

GCJ: So how do different scientists go about discovering these new sets of data that are available to them? Is there some way to search and index?

Covitz: One of the great things about caGrid is that it provides for a service registry or index service. That is a service that's on the Grid that essentially lets you query the Grid about itself. In other words, it lets you discover what's on the Grid.

GCJ: Does someone have to manually register new sets of data with that service when they put something on there?

Covitz: Yes, there is a process by which new services get registered. And we've set up a set of requirements that have to be met before you even get to the point of being able to register your service on the Grid. The requirements have to do with data descriptions and semantics. And once you've satisfied all of those requirements - which are largely about providing structured metadata that describe your service - you are ready to register your service with our Grid service registry. And then once that's done, that metadata becomes available through the service registry for others to discover what your resource provides.
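
The register-then-discover flow described above can be illustrated with a minimal Python sketch. This is not the caGrid or Globus API: the class names, the metadata fields, and the example endpoint and ontology term are all hypothetical, chosen only to show a registry that refuses services lacking structured metadata and then answers queries from that metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceMetadata:
    """Structured description of a service (hypothetical fields)."""
    name: str
    endpoint: str
    data_types: tuple       # data types the service exposes
    ontology_terms: tuple   # terms tying those types to a controlled vocabulary


class RegistrationError(ValueError):
    """Raised when a service does not meet the metadata requirements."""


class ServiceRegistry:
    """Toy in-memory index service: register first, then discover."""

    REQUIRED = ("name", "endpoint", "data_types", "ontology_terms")

    def __init__(self):
        self._services = {}

    def register(self, meta):
        # Requirements are checked before the service may be registered.
        missing = [f for f in self.REQUIRED if not getattr(meta, f)]
        if missing:
            raise RegistrationError("missing metadata: " + ", ".join(missing))
        self._services[meta.name] = meta

    def discover(self, data_type):
        # Discovery is answered purely from the registered metadata.
        return sorted(
            m.name for m in self._services.values() if data_type in m.data_types
        )


registry = ServiceRegistry()
registry.register(
    ServiceMetadata(
        name="tissue-bank",
        endpoint="https://example.org/tissue",   # placeholder URL
        data_types=("Specimen",),
        ontology_terms=("EX:0001",),             # made-up term identifier
    )
)
print(registry.discover("Specimen"))
```

In this sketch a service with no ontology terms is rejected at `register`, which mirrors the point that the semantic requirements come before registration rather than after it.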

GCJ: It looks like you guys are getting a tremendous amount of traction with the community.

Covitz: More than 500 researchers and 80 different organizations were working with caBIG as of last spring. That number has really grown since then too.

GCJ: I've seen announcements from other science groups where they have identified their project as 'caBIG-enabled.' Is that carrying a certain cachet with audiences now that they're having their project 'caBIG-enabled'?

Covitz: I think so. And we want it to be a meaningful label. One of the things that we're actually working on is putting some more official parameters around what it means to be allowed to claim that. Right now it's great that people want to claim it - but one of the things that's on the table for the future that's not in place right now is a more formal mechanism for certifying systems as being caBIG-compatible.

Today, we determine caBIG compatibility by looking at two things. First, it's evaluated in terms of semantic issues and terminology issues. The other review angle is the compliance with architectural and interface requirements for caBIG. We have a lot of work pending to formalize and standardize that review process, and that will be happening over the next year.

GCJ: Are there any threats of people putting data onto this system that would compromise others? Any security issues that are of particular concern for you guys?

Covitz: For us, the security issues are more about appropriate access to resources by people who have appropriate permission.

GCJ: And how do you define those permissions?

Covitz: They vary, and they depend on the nature of the data being shared. So in some cases, it's merely a scientific group that would like to have some control over who sees their data, and when - in other words, they may want to start off just sharing with specific collaborators. In others, the sharing issues are more legal and regulatory in nature. For example, data that's derived from human subject research is subject to specific oversight by institutional review boards at the particular institutions that are involved. Further, data that can be used to identify a human being based on some medical measurement or patient information is subject to legal requirements of the HIPAA statute. So all of these things come into play when we talk about security.
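
As a rough illustration of how permissions can depend on the nature of the data, here is a small Python sketch. The categories, the requester attributes, and the rule that identifiable data needs both flags are simplifying assumptions made for this example; they do not describe caBIG's actual security model or what IRB oversight and HIPAA legally require.

```python
from dataclasses import dataclass
from enum import Enum


class Sensitivity(Enum):
    OPEN = "open"
    COLLABORATORS_ONLY = "collaborators_only"  # group controls who sees it
    HUMAN_SUBJECT = "human_subject"            # under IRB oversight
    IDENTIFIABLE = "identifiable"              # can identify a person


@dataclass(frozen=True)
class Requester:
    """Hypothetical attributes a policy check might consult."""
    name: str
    is_collaborator: bool = False
    irb_approved: bool = False
    hipaa_authorized: bool = False


def may_access(requester, sensitivity):
    """Return True if this toy policy allows the requester to see the data."""
    if sensitivity is Sensitivity.OPEN:
        return True
    if sensitivity is Sensitivity.COLLABORATORS_ONLY:
        return requester.is_collaborator
    if sensitivity is Sensitivity.HUMAN_SUBJECT:
        return requester.irb_approved
    if sensitivity is Sensitivity.IDENTIFIABLE:
        # Assumption for this sketch: both conditions must hold.
        return requester.irb_approved and requester.hipaa_authorized
    return False


collaborator = Requester("lab-partner", is_collaborator=True)
print(may_access(collaborator, Sensitivity.COLLABORATORS_ONLY))
print(may_access(collaborator, Sensitivity.IDENTIFIABLE))
```

The point of the sketch is only that the access decision is keyed on the kind of data as well as on who is asking.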

GCJ: Any next steps for the project or upcoming milestones that you'd like to talk about?

Covitz: So we recently put up our first test bed version of this Globus Grid - that we call caGrid. We call it caGrid to refer to the Grid itself, versus caBIG, the program. That went up in September, and that was a huge milestone for us. So what we're doing now is encouraging groups to take a look at it, to learn about this technology and to expand it. It's still at the point where it's aimed at development teams that want to Grid-ify their work, more than it's aimed at end user scientists and researchers. We're not quite at the point yet where Grid is delivering value to the scientists, but we hope to achieve that in the coming year.

GCJ: Do you have any hard stats about the amount of resources under caGrid, or how many campuses are connected?

Covitz: There are four locations and approximately 6 or 7 nodes total. Some locations have more than one node. The locations are the NCI, the Georgetown Lombardi Cancer Center, the Duke Cancer Center and the University of Pittsburgh Medical Center. That's the initial implementation. We have at least one other node that's coming online - here at the NCI, although a different group. So it's still in the early days.

GCJ: How do these different science Grid folks interact? Is your perception that these are cohesive communities, or a lot of independent work?

Covitz: I should point out that caGrid is developed by a team of people that have a lot of experience with participation in these other Grid projects. Arnie and Steve at Georgetown are certainly in this group. The group under Joel Saltz at Ohio State University is particularly experienced with Grid computing, and has been involved with NSF-funded Grid projects. Most recently, the new members of the caGrid team include Ian Foster's group from Argonne National Laboratory. They're one of the key groups that have been involved in developing a number of those other Grids that we've referred to. We are very much engaged with these other groups.

Each Grid is targeted at a particular community of users, and optimized for that group of users. So it's OK that there is more than one Grid. But it is important that everyone share experience and tools so that you don't reinvent the low-level technologies that are involved. We've tried very hard not to do that. That's why we're using Globus. That's why we're using OGSA-DAI, which stands for Open Grid Services Architecture Data Access and Integration - and it comes out of the work being done by the e-Science program in Europe and the UK. They are the ones that made the first efforts to extend Grid technology to accommodate data services, in addition to just compute cycle or CPU services. So we've built on their very good work, and not retraced their steps.

GCJ: How do you collaborate with IT vendors? How do they get involved?

Covitz: We absolutely engage the private sector in a lot of work. Private companies have been involved in building caGrid - it's not just been an academic and government project. We look to the private sector for software engineering, for project management - and certainly for hardware; we're not building our own hardware. We do issue RFPs and do competitive bids for those services, when we need them.
