Guest Expert
David Martin
IBM / GGF

'Data' versus 'Storage'

Are today's storage systems ready to participate in the Grid? Where do the complexities reside, and where is progress being made? GCJ went to David Martin -- co-director of the Data Group in the GGF, and program director, Internet Standards and Technology at IBM -- for some answers.

GCJ: Within the GGF, how do you distinguish between "data" and "storage" issues for the Grid?

Martin: The way we've viewed it, data is about the structure and handling of things based on their information content. So we do a lot of things in GGF around databases and file systems where we're operating on the ones and zeros based on knowledge of what's in those ones and zeros. We know it's a file system, or a database, or an XML query, or a file that's being FTP'ed. Where we draw the line is that storage is where we're really just storing the raw bits. The storage systems tend not to know the difference between a file, a database, a video, or anything of that sort.

GCJ: As enterprises move towards intelligent Grid infrastructures, where the right data has to be delivered to the right application or computation when it is needed, is your perception that today's storage systems are equipped to deal with that reality?

Martin: The storage systems that I've worked with in an enterprise capacity tend to be a lot more focused on the single data center. They have a lot of the facilities that we'd need in a Grid, things like replication or mirroring or the ability to divide up storage among different users with different requirements. But almost all the storage technology that's out there right now is very much oriented toward local networks, so it tends to run over Fibre Channel, SANs, FDDI connections, or something else that is really a local area network technology. It doesn't really take into account the fact that there are speed-of-light delays if you're going from Chicago to Tokyo, or that you could have packet loss over distributed networks.

The storage technologies are somewhat similar, but they aren't quite really what the Grid needs right now. So we've tended to do those types of things at the data level rather than the storage level, and so we have efforts to mirror file systems or to be able to distribute queries across redundant databases, and things like that. Currently, we're not relying on the underlying storage to do that... we're relying on the higher-level data mechanisms.
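To put rough numbers on the speed-of-light point, here is a minimal sketch in Python. It is an illustration rather than anything from the interview: the city coordinates are approximate, the figure of 200,000 km/s for signal speed in optical fiber is an assumed round number (about two thirds of the speed of light in vacuum), and the function names are made up for this example. It estimates the great-circle distance between Chicago and Tokyo and the best-case round-trip time over that distance, ignoring routing, switching, and retransmission after packet loss.

```python
import math

EARTH_RADIUS_KM = 6371.0       # mean Earth radius
FIBER_SPEED_KM_S = 200_000.0   # assumption: roughly 2/3 of c in glass fiber

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def min_round_trip_ms(distance_km, speed_km_s=FIBER_SPEED_KM_S):
    """Lower bound on round-trip time: propagation delay only."""
    return 2 * distance_km / speed_km_s * 1000.0

# Approximate coordinates (latitude, longitude), assumed for illustration.
CHICAGO = (41.88, -87.63)
TOKYO = (35.68, 139.69)

distance = great_circle_km(*CHICAGO, *TOKYO)
rtt = min_round_trip_ms(distance)
local_rtt = min_round_trip_ms(0.1)  # a 100 m run inside one data center

print(f"Chicago-Tokyo: {distance:,.0f} km, round trip at least {rtt:.0f} ms")
print(f"Inside a data center (100 m): round trip at least {local_rtt:.4f} ms")
```

Under these assumptions the wide-area round trip comes out on the order of 100 ms, several orders of magnitude above the propagation delay inside a single data center, which is why a protocol that waits for an acknowledgement on every write behaves so differently once it leaves the local network.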

GCJ: It seems like out of the different categories of IT vendors, that the storage guys have been relatively quiet in the Grid discussions. It's been the big systems guys, like IBM, HP, Sun, etc., and the hardware and server guys that have made the most noise. And it seems like even the networking guys have made a little more noise than the storage guys with respect to the Grid evolution. Do you see any indication that the 'storage requirements for Grid' discussion is going to get a little louder or that more people are going to participate?

Martin: Well, there has been a lot of participation by the storage vendors in the Grid evolution, even though they perhaps haven't been quite as publicly outspoken about the new storage directions for Grid environments. I know, for example, that EMC has been involved a good bit in GGF for quite a while, but they're looking at it more from an architectural perspective and how they fit their solutions into the service-oriented architecture that we're doing, rather than really pushing the hardcore storage stuff.

And we've got a liaison relationship between GGF and SNIA (the Storage Networking Industry Association). They're the ones you think of - Hitachi and all those companies that make disk drives and big network appliances that do storage.

To date, almost all of the Grids have big storage systems that are fairly complex behind them. But you really don't expose that to the Grid management systems or to the Grid applications. So the next step in the evolution is "storage Grids," per se, which is where the storage is actually integrated, so that the underlying storage system starts to take advantage of a lot of the Grid capabilities that we have. That's when you start to see data replication running over secure links, with tie-ins to the authentication systems that the Grid already has... and they'll start making use of some of the management systems that people are already using on the Grid.

That's kind of a really early effort right now. We still have a lot of trouble even agreeing on the same concepts. So the idea of security within a storage system right now, for example, is just pretty much a physical security thing. People don't have physical access to it. In the Grid it's exactly the opposite, where you have to assume everything is open to everybody and you have to secure every link and every application.

So there are a lot of discussions that need to be ironed out before the Grid world and storage world are truly on the same page.

GCJ: Don't a lot of these SAN systems have their own management consoles, and aren't there interoperability issues in just hooking up multiple SANs today, without even thinking about a Grid environment?

Martin: At the very base level, I think they've pretty much got it worked out, to where if you buy a storage system from one company and storage from another, you can pretty much put them together. But you're right, where the real issue comes is the management. And so pretty much everybody who has a storage system has their own management system.

SNIA's trying to solve that some. There's a project that they have, called SMI (Storage Management Initiative), which is their attempt to provide a common management framework around storage. It's very storage-specific, and so they're talking to people like DMTF (Distributed Management Task Force), which has a broader view of manageable entities. DMTF developed CIM (Common Information Model), and SMI-S (Storage Management Initiative Specification) uses it to model storage. The idea is that by using CIM they can start populating a common information management system.

GCJ: What has the GGF's relationship been like historically with DMTF?

Martin: Well, we've had a relationship where the DMTF does the real hardcore-type descriptions of schemas and things like that, and where GGF then maps those onto the Grid and comes up with very specific things that can be used in a Grid context. So the DMTF people are very good at knowing CIM. And we've relied on DMTF a huge amount for expertise in that. A colleague of mine at IBM, Ellen Stokes, has been very active in GGF and DMTF, and has been a bridge between the two, to educate both sides.

So we've got a little bit of joint work going on. There's also a GGF working group, called SCRM (Standards development organizations Collaboration on networked Resources Management), and it is intended to be a place where a whole bunch of different organizations come together to talk about management issues. And DMTF is pretty active in that.

GCJ: One of the recurring themes we see in Grid is pushing the intelligence ever lower. We see it with the networking folks, pushing more of the intelligence in the system down to the network level. And I think we're going to start seeing more and more of that in the data level. What are your thoughts about virtualization of data? Do you see this as a trend? Do you see this as something that the industry is concentrating on more and more?

Martin: Yes, virtualization is a big topic for networks, processors, and storage. The typical way that Grid people have treated any hardware is as being very dumb. And so the traditional Grid way of handling processes is just to allocate a single processor to a single user until they're done with it. And the way they allocate storage is just by handing over a chunk of temporary storage and then taking it away when the job is done. And even on networking, they've just kind of assumed that the networks are best-effort networks, and so they do tricks like GridFTP does, where it goes up and talks to ten different servers, hoping one of them will have good network performance.
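The "talk to several servers and hope one is fast" approach can be sketched roughly as follows. This is only an illustration of the general idea in Python, not GridFTP's actual protocol or API: `pick_fastest`, `fake_probe`, the site names, and the throughput figures are all hypothetical, and the probe is simulated rather than a real network measurement.

```python
from concurrent.futures import ThreadPoolExecutor

def pick_fastest(servers, probe):
    """Probe every server concurrently; return (server, throughput) for the best.

    `probe` is a hypothetical callable that returns a measured throughput
    in MB/s, or raises an exception if the server cannot be reached.
    """
    def safe_probe(server):
        try:
            return server, probe(server)
        except Exception:
            # On a best-effort network a failed probe just means "skip this one".
            return server, 0.0

    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        results = list(pool.map(safe_probe, servers))
    return max(results, key=lambda pair: pair[1])

# Simulated measurements standing in for real network probes.
SIMULATED_MBPS = {"site-a": 12.0, "site-b": 85.5, "site-c": 40.2}

def fake_probe(server):
    if server not in SIMULATED_MBPS:
        raise ConnectionError(f"{server} is unreachable")
    return SIMULATED_MBPS[server]

best = pick_fastest(["site-a", "site-b", "site-c", "site-down"], fake_probe)
print(best)
```

The point of the sketch is where the intelligence sits: the client does all the measuring and choosing because it assumes the network promises nothing.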

But I think that it's changing some, and I think that, like you said, if they push the intelligence down into the lower level hardware, then the Grid can start taking advantage of saying things like they want a gigabit of bandwidth for the next hour, or something like that, and rely on the intelligence of the networks to really provide that.

And I think that's happening in storage as well. The storage virtualization that's going on right now in the industry is still very much data center-oriented. So it's the idea that you can have a couple of customers in a data center that are sharing one big thing, but they both see it as their own individual SAN that they can do whatever they want with.

As soon as that starts working in a wider area or across multiple domains of control, then the Grid middleware can get a lot simpler, because the Grid middleware can just say things like "I want this data to be replicated" or "I want this data to be spread to these five sites." Whereas right now, that's kind of a manual operation. It's an evolution that I would have expected to move faster than it has, in a way.

GCJ: What about from the presentation standpoint? Will the data present itself a little differently depending on which application is accessing it, and have that taken care of at the storage layer?

Martin: It's still pretty early. I think most storage systems are still treating the data as just a random collection of bits. And I think the ultimate thing for storage systems is that they'll eventually know what they're holding. So they'll know that this is a database, and so therefore there are certain characteristics of database access that they can start taking advantage of. And you see that a little bit. IBM has something called GPFS (General Parallel File System) that has some quality of service capabilities to it, so you can actually talk to the underlying storage and say this is a video file, and when we're retrieving it, we need to be able to retrieve it at this data rate. Because if you can't pull it off the disk fast enough, then it can't be served out fast enough and there'll be glitches in the video. So there are things like that happening in pretty proprietary ways right now.

But I think it'll be a while before the underlying mechanisms, especially in storage, really know what's going on. You're seeing that a little bit more in networks, where networks are intelligent enough to look at the flows that are going by and say, all right, this is a voice over IP call, so we'll give it priority bandwidth or priority forwarding on each router.

And processes are actually getting that way as well. One of the things that the virtualization layers in really big hardware are doing is looking at what jobs are running and then dynamically allocating resources to them. So if one job is really compute-intensive, it gets more CPU, while another that is more IO-intensive gets shuttled around until it finds a better site for the IO.

I think the storage people are probably a little bit behind in that, in that they're still not really looking at the data at all. They're still just looking at it as bits on the wire.
