Guest Expert
Carl Kesselman
Director of the Center for Grid Computing
University of Southern California School of Engineering

In this Globus Consortium Journal Q&A, guest expert Carl Kesselman clarifies the perceived dichotomy between data Grids and compute Grids, the role data virtualization plays in Globus, and how derived data products differ between enterprise and research/science. Kesselman is the Director of the Center for Grid Computing at the University of Southern California School of Engineering.

GCJ: Tell us your thoughts on the confluence of compute Grids with data Grids in Globus environments.

Kesselman: I don't actually believe in this dichotomy between compute Grids and data Grids. My feeling is that there's a basic set of protocols and services and that there are some data-oriented services and some compute-oriented services. You don't do much computing without data, you don't generally have too much data without doing some computing. It's a false dichotomy. There are different classes of services and perhaps different types of things you're going to end up doing, but deep down there's nothing that fundamentally distinguishes the type of Grid infrastructure you need for doing analysis and management of data from the core things that you need for doing computation.

GCJ: Where does data virtualization fit in?

Kesselman: Data virtualization was a term coined for, at least in our context, an NSF-funded activity that has been ongoing for the last couple of years exploring the duality of computation and data. There is some basic fundamental underlying data which cannot be "ground truth." Rather, it's coming from a sensor, it's collected from a survey, or it's results such as images from a microscope. So a lot of what we're interested in is not looking at that raw data as it gets generated from ground truth, but derived data products, things that are resulting from analyzing it, data mining it, summarizing it, and transforming it in various ways.

With the duality between data and computing, I can either go all the way back to the source from which it came and reproduce the entire analysis or computational chain that resulted in the particular data product that I'm interested in... or I can go out and look and see if that product already exists somewhere in the network, and just reuse it or maybe transfer it (which might involve remote access or moving the data towards me). For the end user this distinction, in terms of the actual bits that are seen, is completely transparent. Answers may be stored somewhere, computed from the original raw data or some combination in between. That's the idea behind virtual data. Someone can look at all the values that are already available within a virtual organization within that Grid environment, figure out whether they're good enough and figure out what additional computations they may or may not need to do, which things they might re-compute, which things they'll move -- and then put in place the calculations and the data operations to then deliver what they're actually interested in.
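The reuse-or-recompute choice described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Globus code: the `VirtualDataCatalog` and `Derivation` names, the in-memory dictionaries standing in for a virtual organization's storage, and the sample values are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Derivation:
    """Hypothetical recipe for a derived product: a function over named inputs."""
    func: Callable[..., object]
    inputs: Tuple[str, ...]


@dataclass
class VirtualDataCatalog:
    """Hypothetical catalog of stored products plus recipes for deriving others."""
    stored: Dict[str, object] = field(default_factory=dict)
    recipes: Dict[str, Derivation] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def resolve(self, name: str) -> object:
        # Reuse the product if a copy already exists in the catalog.
        if name in self.stored:
            self.log.append(f"reuse {name}")
            return self.stored[name]
        # Otherwise walk back along the derivation chain and recompute it.
        if name not in self.recipes:
            raise KeyError(f"no stored copy or recipe for {name!r}")
        recipe = self.recipes[name]
        args = [self.resolve(dep) for dep in recipe.inputs]
        self.log.append(f"compute {name}")
        value = recipe.func(*args)
        self.stored[name] = value  # keep the result so later requests can reuse it
        return value


# Made-up sample data: one raw product and two derived products.
catalog = VirtualDataCatalog()
catalog.stored["raw"] = [3.0, 1.0, 2.0]
catalog.recipes["sorted"] = Derivation(sorted, ("raw",))
catalog.recipes["median"] = Derivation(lambda xs: xs[len(xs) // 2], ("sorted",))

first = catalog.resolve("median")   # recomputed from the raw data
second = catalog.resolve("median")  # served from the stored copy
```

The caller gets the same value from both requests; only `catalog.log` reveals that the first was computed and the second reused, which is the transparency the answer describes. A real system would also weigh transfer cost against recomputation cost, which this sketch leaves out.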

GCJ: How are the derived data products different in enterprise than in research and science? Do business intelligence and data warehousing products in enterprise present any different challenges?

Kesselman: There is some overlap. Scientific data tends to be reasonably large in size and tends to be numeric data, images, or experimental data. I think there certainly are enterprise level data and data mining types of activities that have those characteristics. For example, we're looking at large scale data mining in inventory management and "decision support" types of data operations.

But then there's also other types of more traditional enterprise data, which I think maybe are a bit different - salary and payroll and the like - which are less similar to the science data. The other distinguishing feature of science data from significant amounts of enterprise types of data is that science data tends not to exist in databases. A lot of science data is not relational. It tends to be more file-oriented data. This is changing though as we're seeing more and more databases showing up in the scientific domains.

GCJ: Do you have any insights about what's been done in Globus to reconcile the different protocols and interfaces into storage systems?

Kesselman: Within Globus, there are two major data solutions. There's the GridFTP solution, which tends to be agnostic about the content or format of the information being moved. It's really oriented towards moving around blobs of data as efficiently, reliably, and quickly as possible. Then there's another set of services, contributed from the UK e-Science project, called the data access and integration services (DAI), which are focused more on interfacing to structured data, particularly various types of databases, and how those are combined and integrated in general and flexible ways.

GCJ: Any new updates or perspectives on what Globus is doing with the commercial virtualization players like VMware or Xen?

Kesselman: There's certainly continued evaluation and integration. There's starting to be some deployment activities and experimentation. We realize the importance of these virtualization technologies from the perspective of better utilization in that it gives us a handle on how to do software management, deployment, and configuration in a more effective way. It also gives us a better handle on resource management and containment.

GCJ: What's your particular area of focus within Globus these days?

Kesselman: I'm involved in a lot of areas. One high level agenda I'm focused on is how to establish a virtual organization in a more dynamic and effective way and the life cycle issues associated with that. How do you define security and policy across virtual organizations? How do you identify the structure of the resources available to those organizations? How do you specify and drive activities for those organizations across those resources?

Drilling down, we're looking into the notion of data provenance and its relation to infrastructure. The data itself is only half the story in that you also need to know where it came from and where it's going. How do you keep track of where the data was generated, who generated it, what programs were used to generate it, when it was generated, and what parameters were used to cause it to be generated?
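One way to picture the provenance questions listed above is as a record attached to each data product. The Python sketch below is an assumption for illustration only: the `ProvenanceRecord` fields, the `lineage` helper, and every sample value are hypothetical and do not reflect any actual Globus schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Tuple


@dataclass
class ProvenanceRecord:
    """Hypothetical provenance record for one data product."""
    product: str
    generated_where: str                # where the data was generated
    generated_by: str                   # who generated it
    programs: Tuple[str, ...]           # what programs were used
    generated_at: datetime              # when it was generated
    parameters: Dict[str, str]          # parameters that caused it to be generated
    derived_from: Tuple[str, ...] = ()  # names of the input products


def lineage(records: Dict[str, ProvenanceRecord], product: str) -> List[str]:
    """Return the products that `product` was derived from, nearest first."""
    seen: List[str] = []
    queue = list(records[product].derived_from)
    while queue:
        parent = queue.pop(0)
        if parent in seen:
            continue
        seen.append(parent)
        if parent in records:
            queue.extend(records[parent].derived_from)
    return seen


# Made-up sample records: a raw scan, a filtered version, and a summary.
when = datetime(2006, 1, 1, tzinfo=timezone.utc)
records = {
    "raw_scan": ProvenanceRecord(
        "raw_scan", "example-microscope", "alice", ("acquire",), when, {}),
    "filtered": ProvenanceRecord(
        "filtered", "example-cluster", "bob", ("denoise",), when,
        {"threshold": "0.5"}, derived_from=("raw_scan",)),
    "summary": ProvenanceRecord(
        "summary", "example-cluster", "bob", ("summarize",), when,
        {"bins": "10"}, derived_from=("filtered",)),
}

ancestors = lineage(records, "summary")
```

Here `ancestors` lists `filtered` and then `raw_scan`, tracing the summary back to the raw data it came from.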

GCJ: Would this information be contained within the MDS directories?

Kesselman: Some of it, but more likely it would live within metadata catalogues and be accessed through things like DAI, which is part of Globus Toolkit Version 4.
