GCJ: Tell us about EverGrid - who you are, what you do, and what makes you unique?
Anderson: EverGrid is a private software company based in Fremont, CA. We were founded in 2004. We do cluster infrastructure software for next generation data centers. Without any modification to operating systems or applications, we're able to checkpoint a large cluster of processors and capture their state and then restart them at a later time. Big, compute-intensive applications stay running, and companies save time and money. That sounds like a straightforward thing to do, but it happens to be a very tough problem to solve, because for 256, 1,000, even 10,000 processors, getting them into a consistent state that can be restarted when hardware problems inevitably occur is a very challenging thing to do.
GCJ: Is this an automatic type event, without an IT person needing to kick it off? Is that the idea?
Anderson: Exactly. In fact, we also sell a resource manager. You can think of that as a job scheduler for cluster or grid environments, where that resource manager periodically schedules a checkpoint to be taken. As an example, on a long-running job, you might take a checkpoint every 15 minutes. That happens completely in the background, with no modifications required by the users. And you can checkpoint a specific application or a specific job. You can even checkpoint all the processors in the entire machine so that, for example, if there's a failure anywhere, you could restart every application and every job running back to that checkpoint.
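As a rough illustration of the periodic checkpointing described above, the sketch below works out which checkpoint a job would restart from and how much work it would lose. It assumes checkpoints land exactly on multiples of the interval and take no time; the function names are hypothetical and are not part of EverGrid's software.

```python
def last_checkpoint(failure_minute: float, interval_minutes: float = 15.0) -> float:
    """Return the time (in minutes) of the most recent checkpoint before a failure.

    Assumes checkpoints are taken at exact multiples of the interval.
    """
    return (failure_minute // interval_minutes) * interval_minutes


def lost_work(failure_minute: float, interval_minutes: float = 15.0) -> float:
    """Minutes of computation that must be redone after restarting."""
    return failure_minute - last_checkpoint(failure_minute, interval_minutes)


if __name__ == "__main__":
    # A failure 100 minutes into the job, with a checkpoint every 15 minutes.
    print(last_checkpoint(100.0), lost_work(100.0))
```

Under these assumptions, the work lost to a failure is always less than one checkpoint interval, instead of the whole run.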
GCJ: On your website, you have a label called application abstraction software. Is that how you're encapsulating what you are doing?
Anderson: Yes, exactly. When we take a checkpoint, we have to capture the state of memory, the state of the IO system, and the like. But we also have to get all of the checkpoints from these various processors synchronized. What we do is we log every system call. And that's what happens in this abstraction layer. We're intercepting every system call between the application and the operating system, logging that, and then passing it on. And we use those logs to resynchronize all of the checkpoints.
That sounds like it would be fairly high overhead, but it's not, when you compare it with virtual machine technology. The bandwidth at the application-to-operating system interface is less than a thousandth of the bandwidth between the operating system and the hardware. So it's a fairly straightforward thing for us to do.
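The intercept, log, and pass-on pattern described above can be sketched in a few lines. This is only an illustration of the idea at the level of Python function calls, not EverGrid's implementation, which operates at the system call boundary; `CallLog`, `intercept`, and `demo` are hypothetical names.

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class CallLog:
    """Append-only record of intercepted calls (hypothetical structure)."""
    entries: list = field(default_factory=list)

    def record(self, name: str, args: tuple) -> None:
        # Each entry carries a sequence number so calls can be ordered later.
        self.entries.append((len(self.entries), name, args))


def intercept(log: CallLog, name: str, fn: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap fn so every call is logged first, then passed through unchanged."""
    def wrapper(*args: Any) -> Any:
        log.record(name, args)
        return fn(*args)
    return wrapper


def demo() -> CallLog:
    """Run a tiny 'application' whose OS calls go through the logging layer."""
    log = CallLog()
    write = intercept(log, "write", os.write)
    close = intercept(log, "close", os.close)
    fd, path = tempfile.mkstemp()
    try:
        write(fd, b"step 1\n")
        write(fd, b"step 2\n")
        close(fd)
    finally:
        os.remove(path)
    return log


if __name__ == "__main__":
    for seq, name, args in demo().entries:
        print(seq, name)
```

The application code calls `write` and `close` as usual and is unaware of the layer in between; the ordered log is the kind of record that could later be used to line up checkpoints taken on different processors.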
GCJ: This strikes me, the way you describe it, similar to what Windows had with that "Last Known Good" option. I know I'm oversimplifying, but that's kind of the thought, right?
Anderson: Well, that is the thought, but of course the challenge here is being able to do this for a large number of processors that are running a single application. That's what's so challenging, because for most of these large applications that are running on grid or cluster machines, it takes a long time to stop them in a consistent state. One of our prospects that we were just talking to, a very large engineering company, said their typical application takes about 40 minutes to stop and about another 40 minutes to get it running again at reasonable speed.
Needless to say, at that rate, stopping it, gathering the state, and then continuing on is not a practical thing to do. We're able to do that without stopping the application at all. We don't block IO. We don't block the application. And typically, the overhead is - well, we say the overhead's less than 5%. The reality is the most we've ever seen was 3%. And that was on a benchmark. On real world jobs, it's 1.5% when you are doing checkpoints every 15 minutes.
GCJ: What are the platforms you're dealing with and what are the operating environments?
Anderson: Our initial platform is Linux, pretty much all varieties of Linux. Our software is operating system call-specific, so any version of Linux is going to be fairly straightforward. We've certified the last two versions of both SUSE and Red Hat. For other versions of UNIX, we have not done them yet, but they're fairly straightforward to do and, as we have customer demand, we'll go ahead and do them.
We've not yet done the Microsoft environment. We're looking at it. We're not sure how difficult it will be. We know, of course, that it's more difficult than doing a new version of UNIX or Linux, because there are somewhat different system calls, but certainly the concept should work.
GCJ: Our president of the Globus Consortium, Greg Nawrocki, is calling for 2007 to be The Year of the Grid Application. It's time for grids and apps to really take off - or applications using grids to really take off. I'd like to get your thoughts, from a market perspective, on what that means to you, how it happens, and how EverGrid helps out.
Anderson: Well, looking at what's happening in the marketplace, it's very clear that clusters in particular have literally taken over in high performance computing. IDC is projecting that clusters will be greater than 80% of all high performance computing.
One thing that I really haven't pointed out is the set of problems this solves. The big problem is that when you're running a large number of processors for a long period of time, the probability that you can do that without having one of the processors fail becomes very, very low as the number of processors goes up and the amount of time goes up. We see customers that are very much constrained by the failure rates of the processors in how long they can run a job, how many processors they can put on a job, and what kind of resolution they can model to. And so checkpointing is critical when it comes to creating high availability for these applications.
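To put rough numbers on that scaling argument, here is a minimal sketch under a standard simplifying assumption that does not come from the interview: processors fail independently, each with an exponentially distributed lifetime and the same mean time between failures (MTBF). The 100,000-hour MTBF used in the example is hypothetical.

```python
import math


def survival_probability(processors: int, hours: float, mtbf_hours: float) -> float:
    """Probability that no processor fails during the run.

    Assumes independent failures with exponentially distributed lifetimes,
    so P = exp(-processors * hours / mtbf_hours).
    """
    return math.exp(-processors * hours / mtbf_hours)


if __name__ == "__main__":
    mtbf = 100_000.0  # hypothetical per-processor MTBF, in hours
    for n in (256, 1_000, 10_000):
        p = survival_probability(n, hours=24.0, mtbf_hours=mtbf)
        print(f"{n:>6} processors, 24 h: P(no failure) = {p:.3f}")
```

With these assumed figures, a 24-hour job on 256 processors finishes without a failure about 94% of the time, while the same job on 10,000 processors does so only about 9% of the time.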
The second thing is that since you can't interrupt these jobs because they're so difficult to stop, scheduling becomes problematic. If I'm running a job and I get a higher priority job request, pretty much the only choice I have is either to wait until enough resources come free to run that high priority job, or kill one of the lower priority jobs and then rerun it from the beginning. With checkpointing, you have the ability to do preemptive scheduling. When a high priority job arrives, you checkpoint enough jobs so that you'll have the resources you need to run the high priority job, stop the jobs that you've checkpointed, run the high priority job, and then when the high priority job is done, resume the jobs from where you were before.
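The preemption sequence just described (checkpoint enough lower priority jobs, stop them, run the urgent job, then resume) can be sketched as follows. This is an illustrative model only: the `Job` class, the `preempt_for` and `resume` functions, and the lowest-priority-first policy are assumptions, and setting a state field stands in for taking a real checkpoint.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int            # larger number = more important
    processors: int
    state: str = "running"   # "running" or "checkpointed"


def preempt_for(urgent: Job, jobs: list, total_processors: int) -> list:
    """Checkpoint lowest-priority jobs until `urgent` fits; return those jobs."""
    free = total_processors - sum(j.processors for j in jobs if j.state == "running")
    suspended = []
    for job in sorted(jobs, key=lambda j: j.priority):
        if free >= urgent.processors:
            break
        if job.state != "running" or job.priority >= urgent.priority:
            continue
        job.state = "checkpointed"   # stand-in for checkpointing and stopping the job
        free += job.processors
        suspended.append(job)
    if free < urgent.processors:
        resume(suspended)            # undo: nothing useful was gained
        raise RuntimeError("not enough preemptible processors")
    return suspended


def resume(suspended: list) -> None:
    """Restart checkpointed jobs from where they left off."""
    for job in suspended:
        job.state = "running"


if __name__ == "__main__":
    cluster = [Job("a", 1, 128), Job("b", 2, 64), Job("c", 5, 64)]
    paused = preempt_for(Job("urgent", 9, 150), cluster, total_processors=256)
    print([j.name for j in paused])
    resume(paused)  # called once the urgent job has finished
```

The point of the sketch is that the lower priority jobs lose no completed work: they are suspended at a checkpoint and resumed later, not killed and rerun from the beginning.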
GCJ: Who is the target market? Who's trying to use this?
Anderson: The initial target market is anyone in high performance computing. Down time for big number-crunching type applications is a very serious problem. You would have only had to look at our booth in November at Supercomputing 2006 in Tampa, immediately after we emerged from stealth mode, to realize how serious a problem this is. We had a huge amount of customer and partner interest when we announced our company.
Because we have the ability to checkpoint clusters and because we have the ability to do preemptive scheduling, our resource manager has led us to doing resource management for large clusters that are in the enterprise space as well. So it isn't just high performance computing, but also for centers using large clusters of machines to effectively share servers among multiple applications.
GCJ: What verticals are interested?
Anderson: Financial services, of course. Certainly oil and gas. Anyone using Electronic Design Automation (EDA). Chip simulation is an example of a job that runs for long, long periods of time with large numbers of processors. Also, military and government. But in fact, any kind of high performance computing problem where the problem runs for a fairly long period of time applies to us. So really we have a horizontal solution. We're focusing on verticals only because there is a set of verticals that have this problem worse than others.
GCJ: What about when an organization starts to virtualize their resources? How does EverGrid handle that, how does that all fit into the mix?
Anderson: Our resource manager is designed to control a cluster, all the way from booting it up to doing preemptive scheduling. When someone is using a cluster, do they want to boot up a machine as a Linux machine, say SUSE or Red Hat, or do they want it to come up with Unix, say Solaris, or do they want to come up with a Microsoft operating system, or do they want to come up with a hypervisor, say from either VMware or Xen? And, in fact, was the request to schedule a job on a large stack of machines? So maybe you want 256 physical processors to run a job? Or maybe you want one-tenth of one processor to run back-end for something like a thin client on a desktop. There are a lot of possibilities.
In that case, we actually call the virtual machine supervisors, such as VMware or Xen, from our resource manager and have them provision virtual machines. So then our job scheduling software is sitting on top of this pool of machines where some of the machines in the pool are physical and some of the machines are virtual.
GCJ: And is that true with Service-Oriented Architecture (SOA) as well, or is there a different approach for SOA?
Anderson: Well, from a SOA perspective, that's more a characteristic of the applications. We're agnostic about what applications are running. The applications that are running can be databases. You could be running Google. You could be running long high performance compute jobs. You could be running a browser. You name it. You run what you want.
GCJ: Who are your competitors in this space, and how does EverGrid stack up?
Anderson: Well, in the high availability space, where we're doing checkpointing for grids of machines, there isn't anyone else that does this without modifications to either the application or the operating system. Meiosys does it with modifications to the operating system. They were purchased by IBM. DataSynapse has you rewrite your application to essentially their API, as an example. There are a couple of people that do single processor checkpoint. But nobody else does this without modifying either the operating system or the application. You never want to say you don't have any competition, but our customers and our partners are telling us that we don't have any competition at our level of functionality. That's a good thing.
In the resource management space, of course, especially in high performance computing, there are a couple of leaders - Platform and Altair would be two that come to mind, with LSF and PBS Pro. And then you've got people like OpForce et al in the hardware provisioning space, and of course VMware in the virtual machine space. In many of those cases we call those companies' systems to control the environment. But we haven't found anyone that has the breadth, all the way from managing their hardware up to managing jobs and doing job-load preemption. So we've got a number of competitors at various levels in the stack, but we don't really appear to have any competitors that control as many levels of the stack as we do.
GCJ: What about partnerships?
Anderson: Well, we're in the middle of negotiating a number of partnerships. For example, with other resource managers. Customers already have resource managers in place. We know we're not going to go out and replace those, but customers do want to have the checkpointing capability. And so, we will undoubtedly do partnerships with some of the existing resource management people to offer that capability through them.
We're also talking to a number of systems vendors and a number of application vendors. If you've got, say, a long-running EDA application, we know of at least one EDA vendor that wrote their own checkpoint. But it's the most miserable part of their code to maintain, because it's very, very difficult to do that sort of thing. And they just want to get out of the business of having to fool around with checkpointing, and would like to glue our software on the bottom of their software so that we checkpoint it for them. And so there's an opportunity to sell through those guys as well.
So we don't have any partnerships that are actually closed, but that's largely because we only announced the company a little less than a month ago. We're in the middle of negotiations right now.
GCJ: Would you have a customer or a case study that you could talk to, describing real world use?
Anderson: Let me talk about two customers that are utilizing our software in various stages. One is a university. It's a university high performance computing center that has installed our software and is using our software to do checkpoints as they run jobs. As a matter of fact, they've had some failures, had some problems with their file system, and used us to recover. So it's nice to know this stuff actually works in a real world environment.
Another is a large financial services organization that is centralizing all of their servers, with the intent of ending hard provisioning between applications and servers. Our software's been demonstrated in their resource management environment, and we're right in the middle now of going into production, and expect to be in production at the end of the first quarter.
GCJ: What's on the docket for EverGrid next year? What big announcements or developments will our audience be looking for?
Anderson: Well, we haven't actually announced the products yet! We've announced the company. We'll be doing the announcement of the products in the first quarter. And as I said, initially we're focused on high performance computing, but we will be announcing the ability to do checkpointing of online applications, in particular, so that we can do preemptive scheduling initially with databases like Oracle and other online apps that you would find within the enterprise.
And then towards the end of the year, you can expect us to make an announcement about having the ability to checkpoint online applications, so that for an enterprise database, we can do checkpointing. We'll be extending our capabilities to all online applications.