Is "Grid Storage" For Real?
Is the concept of "Grid Storage" legitimate, or is it just fuzzy, vendor 'marketecture'? Leading storage industry analyst Jon Toigo separates fact from fiction, and illuminates some challenges that large enterprises are facing with today's storage systems.
GCJ: In a Network World article a couple of years ago, you pulled the reins back on the hysteria around "Grid Storage" and suggested that it was still just a rather fuzzy, vendor "marketecture" type of concept.
Toigo: Yes, the concept of Grid Storage has to do with the dynamic allocation and de-allocation of storage resources based on what an application requires. Quite simply, that's what it's all about. It is driven at the university level by the fact that they don't have deep pockets, yet they need to build massive infrastructure and get massive compute cycles to support huge scientific research projects: things like visualizing nuclear explosions, predicting weather, the human genome project. These are really massive research projects, where they need to bring online lots and lots of compute cycles in order to accomplish a task, and then they de-allocate those when they're done with the task. Now, that is typically achieved in research and academia with a massively parallel infrastructure.
But you haven't yet seen massive parallelization of storage in enterprise. You will see it. It is coming, but it is not here yet in the products that are out there today.
And I would still say that instead of "Grid Storage," enterprise IT pros should probably refer to the "horizontal scalability" and "vertical scalability" of storage.
With horizontal scalability, the question is: how do we deploy a lot of storage as tiers in a flat configuration? Vertical scalability, by contrast, looks at how much storage we can reasonably put behind a single head or a single controller. What we're seeing today is the vendor community trying to line up behind new mechanisms that will allow you to scale both vertically and horizontally. And that, in theory, is one of the concepts of Grid... the idea of infinite scalability.
What you run into in vertical scaling is the problem of putting too many disks behind a single controller or backplane, and you end up creating choke points. So there's all kinds of technology innovation going into how we can improve those backplanes and what we can do at the "head," which is the industry term for the controller or the interface to the outside world.
GCJ: Are there any compelling new technologies you've seen that are pushing the envelope with respect to parallelization of storage?
Toigo: Probably the clearest example that I could reference right now in the sense of a true parallelization of storage would be in the Zetera technology. Imagine you take a disk drive and you snap a little piece of plastic on the end of it, called a "tailgater." And then you pop an Ethernet cable into the one port that's available on the front of the tailgater. And now DHCP (Dynamic Host Configuration Protocol) assigns that disk drive an IP address. And it becomes a node in the network.
And you can do this many thousands or millions of times with many thousands or millions of disk drives, and you can stripe across all the drives using multicasting. Now all of a sudden, you can create RAID sets on the fly. You can virtualize on the fly, because it's a function of multicasting. And you've captured the commodity price of the underlying disk drive, which drops at a rate of 50% per year. So the technology just leverages what's already there in IP networking.
To me, that's truly "Grid Storage," because I can add nodes by virtue of their IP addresses and then delete those nodes by virtue of their IP addresses, selectively adding them to and removing them from working sets that are allocated to improve a function on the fly. This is what the MIT Media Lab is doing now. They announced a week or a week and a half ago that, in their project on human learning, where they're collecting massive amounts of video, they are deploying 1.5 - or 1.4 - petabytes of Zetera-enabled storage over IP. It allows them to scale dramatically using the standard IP networking technologies that are already out there. I think it's a dynamite concept, and whether Zetera succeeds as a company or not is irrelevant. The genie's out of the bottle.
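As a rough illustration of the idea of adding and removing drives from a working set purely by IP address, here is a minimal Python sketch. It is not Zetera's protocol: the `StripeSet` class, its method names, and the 10.0.0.x addresses are hypothetical, and it models only round-robin block placement, not multicasting, RAID parity, or rebalancing of data already written.

```python
from dataclasses import dataclass, field


@dataclass
class StripeSet:
    """Hypothetical working set of network-attached drives, keyed by IP address."""
    nodes: list = field(default_factory=list)

    def add_node(self, ip):
        # Adding a drive is just adding its address to the working set.
        if ip not in self.nodes:
            self.nodes.append(ip)

    def remove_node(self, ip):
        # Removing a drive de-allocates it from this working set.
        self.nodes.remove(ip)

    def place(self, block_index):
        # Round-robin striping: block i lands on node (i mod N).
        if not self.nodes:
            raise ValueError("stripe set has no nodes")
        return self.nodes[block_index % len(self.nodes)]


pool = StripeSet()
for ip in ("10.0.0.11", "10.0.0.12", "10.0.0.13"):
    pool.add_node(ip)

# Where the first six blocks of a striped write would land.
layout = [pool.place(i) for i in range(6)]
print(layout)
```

Removing an address with `pool.remove_node(...)` immediately changes where new blocks are placed, which is the "selectively add and remove" behavior described above. A real system would also have to migrate or rebuild the blocks that lived on the departed drive.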
GCJ: How expensive is it?
Toigo: It's whatever the underlying disk drive costs, and disk drives drop in price on a per-gigabyte basis at a rate of 50% per year. It's the only storage technology I've ever seen that captures the falling commodity price of disk.
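To make the 50%-per-year figure concrete, the short Python sketch below compounds that decline over several years. The starting price of 1.00 per gigabyte is a placeholder unit chosen for illustration, not a quoted market price, and the function name `cost_per_gb` is hypothetical.

```python
def cost_per_gb(initial_cost, years, annual_decline=0.5):
    """Per-gigabyte cost after `years` of a constant annual price decline."""
    return initial_cost * (1 - annual_decline) ** years


# Placeholder starting price of 1.00 per gigabyte (illustrative unit only).
for year in range(4):
    print(year, cost_per_gb(1.00, year))
```

Under that assumption, storage bought three years from now costs one eighth of today's price per gigabyte, which is the saving a design captures if it lets you add commodity drives incrementally instead of buying capacity up front.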
GCJ: What would you say average storage utilization is today in enterprise?
Toigo: I would say that companies that buy very large arrays are finding that they're using less than 40% of the capacity of the array.
In fact, some vendors even obfuscate the layout of the data inside their array. They don't give you the tools to see how you're using the disk inside. Partly that's because they're sticking a bunch of their own software on there and they're holding a bunch of capacity in reserve for themselves. So you buy a 15 terabyte array and you only have actual use of ten terabytes. Five of it is reserved for the vendor.
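The arithmetic behind that example can be sketched in a few lines of Python. The 15 TB and 5 TB figures come from the answer above; the function names and the price of 1.0 per raw terabyte are hypothetical placeholders.

```python
def usable_fraction(raw_tb, reserved_tb):
    """Share of purchased capacity that the customer can actually use."""
    return (raw_tb - reserved_tb) / raw_tb


def cost_per_usable_tb(price_per_raw_tb, raw_tb, reserved_tb):
    """Effective price per terabyte once vendor-reserved capacity is excluded."""
    return price_per_raw_tb * raw_tb / (raw_tb - reserved_tb)


fraction = usable_fraction(15, 5)               # 10 of 15 TB usable
effective = cost_per_usable_tb(1.0, 15, 5)      # placeholder price of 1.0 per raw TB
print(round(fraction, 3), effective)
```

In other words, with a third of the array held back, each usable terabyte costs one and a half times the advertised per-terabyte price, before any question of how well the remaining capacity is utilized.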
GCJ: But that's still far better than the average utilization rate for servers, right?
Toigo: I wouldn't make the comparison on an apples-to-apples basis like that. I'd say that there are two ways that end users typically look at storage. One is "capacity allocation efficiency," and the other is "capacity utilization efficiency." Capacity allocation efficiency is, OK, of the physical spindles that I've got, how much am I actually able to use, and use efficiently, without creating a choke point that slows down the overall performance of the box? Capacity utilization efficiency focuses on whether we are parking the right kind of data on this array, based on the access characteristics of that data or its volatility, meaning how often it's updated.
GCJ: With all the talk about service-oriented architectures and the 'dynamic provisioning' of resources... in your opinion, are today's enterprise storage systems up to the task of actually participating in these dynamic environments, where x-quanta of data would need to be available at y-point on the network, and with extremely low latency or disruption to whatever service might be requesting it?
Toigo: No, I don't. We've seen a lot of angry backlash through the years from customers who hate the big iron storage arrays, where you build up these gigantic, monolithic storage platforms, and they basically just feed the vendors. That's because they typically become huge lock-in junk drawers, where it's very difficult for the customer to sort it out and clean it out, let alone make these storage systems participate in an SOA or Grid-type architecture.
When the first disk drives were introduced into the market, they were extremely expensive and they didn't have a lot of capacity. You may recall Tom Hanks in the Apollo 13 movie, he points to a big, multi-storied building, and he says something like "we can store over 8K of data in that building." Obviously we've gone through leaps and bounds of capacity improvement of the underlying disk infrastructure since it was first introduced some 50 odd years ago.
But at the time that the original disk came out, the original file systems came out. The file systems were trying to economize on disk usage, so they were self-destructive, meaning every time you saved a file, you had to overwrite the last version of that file. And all the talk right now about things like "continuous data protection" is actually an effort to heal that particular design choice that was made 40 or 50 years ago.
The other thing that's missing is the file systems were not designed to make data self-describing or associative. You can't tell what application created data. Now, you can in a Microsoft environment, because we attached little tags on them, like doc for a Word document, or ext or whatever. But you go to the Unix environment or go into even some of the Linux environments that are out there, and you can't tell what application created them. You haven't got a clue.
As far as the file system regards it, it's just an anonymous set of ones and zeros. So we have no idea of what data's important, what data needs to be retained for compliance, what data needs to be protected for security reasons because of sensitive information, what data needs to be protected and replicated for disaster recovery purposes. We have no clue from the file name. Now, that's a people process issue, to a certain extent, because file systems aren't going to help you out at all.
GCJ: So organizations typically don't have their data catalogued properly to participate in these types of dynamic compute environments?
Toigo: Correct. The metadata that describes data today is totally inadequate to the management of data. And there are a lot of different types of approaches that customers are taking today to try to make the data self-describing.
One thing that's clear to me though is that any sort of approach that's disruptive to the individual user's normal way of doing things is not going to succeed. For example, one of my clients is a major oil company that decided that they wanted to classify all the files that were coming out of their users. So they created these three-ring binders, and inside the binders are all these laminated pages. And what you're supposed to do as a user, before you save your file, is look up in this red book until you find a laminated page that kind of looks like what you're saving. And once you find this statement that kind of looks like what you're saving, you're supposed to type in a 16 to 32 character string into the file name. So for the rest of its useful life, it can be moved around and selected by a cherry-picker out there, and then can migrate into different kinds of disk over time, per policy, until finally it's discarded.
Now, that's all well and fine, but you know how many users actually do it? None.
There was a guy from a big five accounting firm about a year and a half ago who wrote in an article somewhere that the only way to get users to cooperate in a data naming schema is to kill one user in each environment and let his corpse rot until the smell scares the heck out of everybody else. And then maybe a few others will comply.
GCJ: That easy, huh?
Toigo: Yes, anything that is disruptive to the user's normal way of doing things is not going to succeed.
But there are a whole bunch of vendors trying to create workarounds. One set of workarounds, I would say, falls into the category of replacing the file system with a database. That's been talked about since the mid-'90s. Oracle has a solution in that space. IBM has a solution in that space. And of course, Microsoft is now pushing WinFS, which is a kind of quasi-solution. They need a clustered database underneath the whole vault of data, which they don't have right now because their current version of SQL Server is not cluster-able. And they're not about to invite Oracle to be the repository of all metadata for files that are stored in a Windows environment. It's just not going to happen. But Microsoft's in a unique position to actually force that down everybody's throat, if they can get the technology together to do it, because they own 87% of the server market that's out there right now.
GCJ: Are Web Services making it any easier to pull the right information out of storage systems?
Toigo: No, Web services are, to me, another one of those marketecture terms. It'd be nice if there really were Web standards to facilitate the management area. I don't believe that there are. We can wrap everything in XML, and that's been proposed by various people, but the overhead that's associated with XML as a description scheme on data is even more painful than all the other methods that people are trying.
What we have seen being brought in from the Web world into the world of storage are various access retrieval schemes, if you will, trying to use Google-like searches to find key words in data. That unfortunately produces an awful lot of false positives, just like the early days of the Web when I tried various indexing schemes to manage Websites and so forth. You'd ask for stock finish, and it would bring up gunstock oils, which were not the same as the paper stock finish that you were trying to find.
You talk to Index Engines and to several others that are out there trying to do this in the storage realm, and they're usually crowing about things like 'proximity search engines' that basically look for a couple of words in close proximity to each other. Supposedly that reduces the number of false positive hits that you get. But that doesn't mean that it eliminates them, and it's still nearly impossible to use those types of engines to do the practical things an enterprise wants to do with its data... for example, finding all the data or documents that meet certain criteria, so they can be deleted or used for some other purpose.
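A proximity search of the kind described here can be sketched in a few lines of Python. This is a generic illustration, not any vendor's engine: the function names, the five-word window, and the three sample documents are all made up for the example.

```python
import re


def tokens(text):
    # Lowercase the text and split it into simple alphanumeric words.
    return re.findall(r"[a-z0-9]+", text.lower())


def proximity_match(text, term_a, term_b, window=5):
    """True if term_a and term_b occur within `window` words of each other."""
    words = tokens(text)
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)


# Hypothetical documents: only "paper" is what the searcher actually wants.
docs = {
    "paper": "Choose a paper stock with a matte finish for the brochure.",
    "gun": "Rub oil into the gun stock to protect the finish.",
    "market": ("The stock market closed higher. Analysts expect the quarter "
               "to finish strong after a long and volatile summer."),
}

hits = [name for name, text in docs.items()
        if proximity_match(text, "stock", "finish")]
print(hits)
```

The proximity window screens out the stock-market document, where the two words are far apart, but the gunstock document still matches alongside the paper-stock one. Fewer false positives, in other words, but not none.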
GCJ: Any other storage-related trends or directions that you'd like to point out?
Toigo: I'd just say that some of the IP-based technologies that are coming out today are really interesting. Just as applications became Web-enabled, I think that storage will eventually become Web-enabled as well. It isn't yet. Fibre Channel is the first effort in network storage, but Fibre Channel is not a network protocol. It's a channel protocol. So with Fibre Channel you're still slaving the storage in a direct-attached configuration behind the server and you're letting the server do all the walking and talking, and the storage is kind of just a peripheral device. And that's still the case with Fibre Channel SANs.
What you need to do is move out of the whole channel-attached modality and get into a network modality. And that's just now beginning to happen. I think this MIT Media Lab thing will be a proof of concept from one of the early pioneers in this space.