It's been approximately three years since the National Science Foundation kicked off the TeraGrid project -- a massive Grid delivering computational resources to support scientific discovery. Through the collaborative effort of numerous government agencies, universities and IT vendors (and roughly $100 million in total funding to date), TeraGrid deployment was completed in September of 2004, and today delivers more than 50 teraflops of computing power and more than 2 petabytes of rotating storage. In a recent interview with Globus Consortium Journal, TeraGrid Director (and former Chair of the GGF), Charlie Catlett, discusses the capabilities of TeraGrid, how it's being used -- and where the additional $150 million in funding recently secured for the project will be applied.
GCJ: The TeraGrid has evolved quite a bit over the last few years ... tell us a little bit about what's been accomplished.
Catlett: Well, to begin with we have created a base "cyberinfrastructure" that integrates some of the nation's most powerful resources, thus enabling scientists to explore new ways of doing computational science. We have users getting good results using TeraGrid in a variety of ways, ranging from coupled supercomputers to complex workflows, and at a scale that would not be possible with stand-alone data centers. We've also expanded from an original project that involved four Linux clusters. What we have today -- across eight sites [including the original four] -- is 16 different platforms and seven different operating systems. And of those 16 platforms, only four of them look similar to one another, so we're really supporting seven operating systems and a dozen distinct system architectures.
GCJ: So a truly heterogeneous infrastructure ...
Catlett: It's extreme heterogeneity. And I say that because not only do we have heterogeneity among commodity-style platforms, but we also have supercomputers as part of that heterogeneity.
GCJ: What sort of usage has there been and who's using it?
Catlett: There's been a tremendous amount of usage -- and it's been very broad in terms of who's using the TeraGrid. Going back to January 2004 when we began early access, and including the past nine months of production use, we have delivered over 21 million "service units," which are units of allocation roughly equivalent to a processor-hour. The top three disciplines using TeraGrid are chemistry, physics, and molecular and cell biology. Together those three account for about half the use of TeraGrid, and about another two dozen disciplines make up the remaining 50%.
One of the innovations that we have put into place with TeraGrid is a new way of providing users with allocations. We still use a national peer-review board to evaluate proposals and grant allocations, and scientists can request an allocation on a particular supercomputer. But we have added a new type of allocation that we call a "roaming" allocation. Such an allocation can be used on any of the TeraGrid machines, allowing the scientist to roam between resources just as mobile phone users can roam between service providers. In the past, a scientist received an allocation on a particular computer at a particular site, and his or her use was limited to just that system.
With this freedom to use an allocation on any TeraGrid resource, scientists can select a machine on which to run their job based on current availability or predicted turnaround. Further, a scientist can experiment on different TeraGrid platforms without having to write separate proposals for allocations on those platforms.
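As a rough illustration of the roaming idea described above, the following Python sketch picks a machine by predicted turnaround and charges the job against a single allocation, treating one service unit as one processor-hour as the interview describes. The class names, resource names, and wait estimates are all hypothetical; this is not TeraGrid's accounting system, only a minimal model of the concept.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """A hypothetical machine; the names and numbers are illustrative only."""
    name: str
    predicted_wait_hours: float  # assumed queue-wait estimate
    available: bool = True


@dataclass
class RoamingAllocation:
    """An allocation in service units (SUs) that is not tied to one machine."""
    service_units: float

    def charge(self, processors: int, hours: float) -> float:
        # Assumption: one SU is exactly one processor-hour
        # (the interview says "roughly equivalent").
        cost = processors * hours
        if cost > self.service_units:
            raise ValueError("job would exceed the remaining allocation")
        self.service_units -= cost
        return cost


def pick_resource(resources):
    """Return the available resource with the shortest predicted wait."""
    candidates = [r for r in resources if r.available]
    if not candidates:
        raise RuntimeError("no resource is currently available")
    return min(candidates, key=lambda r: r.predicted_wait_hours)


if __name__ == "__main__":
    machines = [
        Resource("site-a", predicted_wait_hours=6.0),
        Resource("site-b", predicted_wait_hours=1.5),
        Resource("site-c", predicted_wait_hours=0.5, available=False),
    ]
    allocation = RoamingAllocation(service_units=10_000)
    target = pick_resource(machines)
    cost = allocation.charge(processors=64, hours=2.0)
    print(target.name, cost, allocation.service_units)
```

Because the allocation object carries no reference to a particular machine, the same balance is drawn down wherever the job runs, which is the property that distinguishes a roaming allocation from a per-site one.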
GCJ: So users can choose from a free resource pool -- a utility kind of model?
Catlett: Yes. So it's taking an existing community of users that are very accustomed to the client-server way of accessing supercomputers and saying, well, what are some Grid resources that we could offer that would be interesting to those kinds of users? That's where we came up with this initial set of services across TeraGrid that allows the user to have an allocation that can be used on any of the machines. To encourage this sort of "roaming" use we have also created a user environment with a base set of common tools and settings that a user can expect: a "common environment" that is documented and consistent across all of the platforms. While each platform has unique features, and there may be added value of specific applications or tools on various systems, we ensure that the base environment is there as well. This means that there are no accidental or arbitrary differences between the different user environments on the different resources. By way of analogy, you can select an automobile based on engine size, stereo system, color, body style, etc. But you expect the pedal on the right to be the accelerator and the pedal on the left to be the brake. This allows you to drive different cars effectively without having to learn to drive from scratch. That's what we've done - given users a common environment while encouraging our resource providers to pursue differentiation atop this common environment.
GCJ: Are you using GRAM? What's the scheduler for allocating free resources?
Catlett: We do use GRAM, and this couples together with schedulers on the various platforms. TeraGrid has some very high-performance systems that require quite sophisticated scheduling, and we provide GRAM as a common interface to the specialized schedulers and job management systems on the resources. We are beginning to look at ways to provide meta-scheduling, and we do co-scheduling today with human intervention.
We're also deploying a technology from the Texas Advanced Computing Center called "Gridshell" that allows a user to view TeraGrid resources as extensions to their Condor pool. This has enabled some of our users to literally run millions of jobs on TeraGrid for ensemble-type jobs and similar operations.
Another area that we're experimenting with is to expose the TeraGrid system details to portals and other scheduling tools so that those tools can schedule on behalf of the user. For example, we just started a project with Mark Green at the University of Buffalo, one of the Open Science Grid partners. They're taking their portal and they're going in and getting information about TeraGrid resources and doing their own interface for their users, and all we're doing at the TeraGrid side is providing them with a service that says, here's a compute service, here's how you get information about it, and here's how you submit a job. And all of this submission would be based on the GT4 Globus Toolkit, either using traditional Globus capabilities or web services that GT4 enables.
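The role described above for GRAM, a common interface in front of each site's specialized scheduler, resembles a facade over interchangeable back ends. The Python sketch below shows only that general pattern. It does not use or imitate the Globus Toolkit API; every class, method, and job-ID format here is a hypothetical stand-in.

```python
from abc import ABC, abstractmethod


class LocalScheduler(ABC):
    """Hypothetical interface each site's job manager is assumed to offer."""

    @abstractmethod
    def submit(self, executable: str, processors: int) -> str:
        """Queue a job and return a site-specific job identifier."""


class BatchQueueScheduler(LocalScheduler):
    """Stand-in for a batch system on a supercomputer."""

    def __init__(self):
        self._next_id = 1

    def submit(self, executable: str, processors: int) -> str:
        job_id = f"batch-{self._next_id}"
        self._next_id += 1
        return job_id


class PoolScheduler(LocalScheduler):
    """Stand-in for a high-throughput pool of commodity nodes."""

    def __init__(self):
        self._next_id = 1

    def submit(self, executable: str, processors: int) -> str:
        job_id = f"pool-{self._next_id}"
        self._next_id += 1
        return job_id


class CommonJobInterface:
    """Hypothetical facade: one submit call, many schedulers behind it."""

    def __init__(self):
        self._schedulers = {}

    def register(self, resource: str, scheduler: LocalScheduler) -> None:
        self._schedulers[resource] = scheduler

    def submit(self, resource: str, executable: str, processors: int = 1) -> str:
        if resource not in self._schedulers:
            raise KeyError(f"unknown resource: {resource}")
        return self._schedulers[resource].submit(executable, processors)


if __name__ == "__main__":
    grid = CommonJobInterface()
    grid.register("site-a", BatchQueueScheduler())
    grid.register("site-b", PoolScheduler())
    print(grid.submit("site-a", "/bin/simulate", processors=128))
    print(grid.submit("site-b", "/bin/analyze"))
```

A portal or meta-scheduler written against the facade would not need to know which scheduler a site runs, which is the kind of decoupling the answer describes.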
GCJ: TeraGrid recently announced having been awarded an additional $150 million in funding. Where is that money going?
Catlett: This award, from the National Science Foundation's Office of Cyberinfrastructure, follows a $100 million three-year project to construct the current TeraGrid system. The majority of the construction funds went to purchasing equipment - clusters, storage, and network infrastructure.
The $150 million just secured is for a five-year period, and it is primarily to operate, manage, and evolve the TeraGrid system, so the funding is mostly for staffing. Two thirds of the funding will go toward operation and management of the resources at the eight resource provider sites, along with user support and related services provided by those partners. The remaining $50 million [over five years] is for software integration and engaging the user community through programs such as the Science Gateways initiative and a coordinated support program called Advanced Support for TeraGrid Applications, or ASTA.
The Science Gateways program involves partnerships with projects that are putting infrastructure in place for entire communities, where we are working on a set of rapid prototypes aimed at developing a capability that will allow any number of such "science gateways" to integrate with TeraGrid. In some cases, that would be a portal project such as the nanoHUB at Purdue University. This is a portal that is offering tools and data and applications to a community of people interested in microelectronics and nanotechnology. That community includes about 40 or 50 university courses at six or seven universities. We're putting the TeraGrid behind the nanoHUB as a provider of computing and storage and other resources. And by doing that, we get to work with the nanoHUB group, and provide services to an entire community.
Another example of a Science Gateway involves an NIH project at the University of Chicago, called the National Microbial Pathogen Data Resource Center. They're providing desktop applications to several hundred pathogen researchers, funded by NIH. And while it's not a portal, we're embedding within those applications the capability to reach out and use TeraGrid resources from the desktop. So that's another kind of a science gateway.
And the third kind of science gateway is a Grid-to-Grid project, and we're doing that with Open Science Grid.
GCJ: What about the IT operations aspect of managing the resources in TeraGrid? When a box goes down, is there any kind of switching over or automatic configuration type stuff, or does someone manually have to go and discover which box is down and pull it off the system? Is there a heavy administrative burden to running TeraGrid?
Catlett: They're all run by existing data centers who have been doing this for quite a long time, so they are quite sophisticated in providing extremely reliable and robust services. We have three layers of information services that help us manage TeraGrid. The first is that we have a 24/7 TeraGrid Operations Center that is operated in partnership with SDSC and NCSA. The second is a test and monitoring infrastructure called Inca, designed by Pete Beckman (Director of Engineering for the TeraGrid construction project) from the University of Chicago and implemented by a team of developers led by Shava Smallen at SDSC. Inca provides status about the health of the TeraGrid user environment and tracks this status over time, allowing us to look at trends. The third layer is something we are working on as part of the GT4 deployment, which is to use the information services provided by GT4 to provide operational and status information via web services.
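To make the idea of tracking status over time concrete, here is a minimal Python sketch that records pass/fail test results per resource and reports a pass rate over a recent window. It is not Inca's data model or interface, which the interview does not describe; the class, method names, and sample data are hypothetical.

```python
from collections import defaultdict
from datetime import date


class StatusHistory:
    """Hypothetical store of pass/fail test results per resource per day."""

    def __init__(self):
        self._results = defaultdict(list)  # resource -> [(day, passed), ...]

    def record(self, resource: str, day: date, passed: bool) -> None:
        self._results[resource].append((day, passed))

    def pass_rate(self, resource: str, since: date) -> float:
        """Fraction of recorded tests that passed on or after `since`."""
        recent = [ok for day, ok in self._results.get(resource, []) if day >= since]
        if not recent:
            raise ValueError("no results recorded in that period")
        return sum(recent) / len(recent)


if __name__ == "__main__":
    history = StatusHistory()
    history.record("site-a", date(2005, 9, 1), True)
    history.record("site-a", date(2005, 9, 2), False)
    history.record("site-a", date(2005, 9, 3), True)
    history.record("site-a", date(2005, 9, 4), True)
    print(history.pass_rate("site-a", since=date(2005, 9, 1)))
```

Comparing the pass rate across successive windows is one simple way to see a trend rather than a single snapshot.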
GCJ: In what ways are you using the public network for TeraGrid? Is the Internet going to be a sufficient interconnect for Grids moving forward, or will we see a lot of these large Grids create their own private networks?
Catlett: I think that moving forward, we will see a hierarchy of complementary networks, not unlike what we see today. If you look at TeraGrid, we already have a three-level hierarchy that's reflected in the network. The broadest part of that hierarchy is between the end user and the TeraGrid resources, and that requires a network like the Abilene network, the Internet2 infrastructure. That's what we depend on for integrating the end users with the TeraGrid. So that requires a certain amount of bandwidth for the end user.