The Globus GRAM Story
Stuart Martin
GRAM Technology Coordinator
Argonne National Laboratory

Grid applications frequently require mechanisms for executing remote jobs. While this requirement might appear straightforward, its practical realization can be challenging due to the need to (for example) address security, reliability, and performance concerns; enable both client and server management of resources consumed by remote jobs; propagate error messages and failure notifications; and move data to/from remote computations. Thus, in practice, the creation of secure, reliable, and performant job execution services is difficult.

In Globus Toolkit version 4 (GT4), remote job execution is supported by the Grid Resource Allocation and Management (GRAM) service, which defines mechanisms for submitting requests to execute jobs (defined in a job description language) and for monitoring and controlling the resulting job executions. More precisely, GT4 includes two different GRAM services: the "pre-WS GRAM," or GRAM2, first introduced in GT2, and the newer Web Services-based "WS GRAM," or GRAM4, first included in GT4.

GRAM2 has been widely deployed on grids around the world for many years. GRAM2 has its limitations, but they are known and workarounds have been implemented. Grid operators know how to run GRAM2. As such, many production grids still rely on GRAM2 for remote job submission and management.

GRAM4 has been deployed on a number of grids, and with recent improvements in performance and reliability, we expect this number to grow rapidly. The GRAM team has been hard at work over the last two years, hardening the GRAM4 version in GT 4.0.x and designing and implementing new features for the upcoming GT 4.2.x series, while still supporting GRAM2 users.

Functionally, at a high level, the two services are still quite similar: stage in files, execute the user application, stage out files, and clean up files. The service interfaces and implementations, however, are quite different. In a recent paper, we presented a detailed functional and performance comparison of GRAM2 and GRAM4. A big win for GRAM4 clients was the move to a standard Web Services interface. Web Services have proven to be an effective building block for distributed computing, in commercial development as well as in Grid research. From the start, Globus and GRAM4 have been leaders in the evolution to Web Services.

Another GRAM4 highlight in the GT 4.0.x series is improved scalability and reliability. First, reliability: the GRAM4 service can be throttled to limit the amount of work it performs at a time. This allows an administrator to limit the load GRAM4 places on a service host. Thus GRAM4 provides the foundation for a reliable grid job execution service stack. Next, scalability: the Scheduler Event Generator (SEG) is an internal GRAM4 component used to efficiently monitor all jobs GRAM4 has submitted to a local resource manager. The SEG approach is much more scalable than the previous GRAM2 or GRAM3 approach. As a result, when a job is not changing state (for example, while it is executing or pending in the resource manager's queue), the memory the GRAM4 service needs to maintain that job's state can be freed. With this, GRAM4 can scale much higher than GRAM2. These are just a few highlights; please read the paper for more. What has now become clear is that the GRAM4 service is overall superior to GRAM2. Now is the time for grids to make the move to GRAM4!
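To make the throttling idea concrete, here is a minimal sketch in Python. It is not GRAM4 code: the class name ThrottledJobService, the max_active parameter, and the job ids are all hypothetical, and the real service's throttling mechanism may work quite differently. The sketch only shows the general pattern of capping the number of jobs processed at once and queuing the rest.

```python
from collections import deque


class ThrottledJobService:
    """Toy model (hypothetical, not GRAM4 code) of a service that caps
    how many jobs it processes at the same time."""

    def __init__(self, max_active):
        self.max_active = max_active  # admin-configured limit (assumed name)
        self.active = set()           # job ids currently being processed
        self.waiting = deque()        # job ids accepted but not yet started

    def submit(self, job_id):
        # Start the job if there is capacity; otherwise hold it in the queue.
        if len(self.active) < self.max_active:
            self.active.add(job_id)
        else:
            self.waiting.append(job_id)

    def finish(self, job_id):
        # Release the slot and promote the oldest waiting job, if any.
        self.active.remove(job_id)
        if self.waiting:
            self.active.add(self.waiting.popleft())


# Made-up job ids, used only to exercise the sketch.
service = ThrottledJobService(max_active=2)
for job in ["job-a", "job-b", "job-c"]:
    service.submit(job)
```

With a limit of two, the third submission waits until one of the first two finishes, so the load on the host stays bounded no matter how many requests arrive.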

One of the key feature additions we're working on for GT 4.2 is support for JSDL job submissions. The Open Grid Forum's (OGF) Job Submission Description Language (JSDL) specification has been adopted by various projects, and requests have been made to add support for JSDL in GRAM4. We will be announcing an alpha release sometime in Q1 07. In addition, we are monitoring and participating in other OGF compute area working groups, in particular OGSA-BES and OGSA-HPCP. We do not yet have specific plans to support these last two specifications, but we anticipate that we will at some point.

Another key feature, recently designed and implemented based on requirements from TeraGrid, is the new GRAM service auditing capability. This was added to both GRAM2 and GRAM4. In general, production grid service providers need service auditing to investigate various forms of suspected intrusion and abuse. In such cases, we may need to access an audit trail of the actions performed by a service. When accessing this audit trail, it will frequently be important to relate specific actions to the user requests that caused them. When auditing is turned on, GRAM inserts one audit record per job directly into a database. The record includes fields such as the grid job id, local job id, grid user id, local user id, stage in grid id, stage out grid id, ... For TeraGrid, the GRAM job audit record is used as the basis for retrieving accounting information from TeraGrid's central accounting database for a client, given a grid job id. The grid job id is generated from a GRAM4 EPR. OGSA-DAI is used to create a "virtual database" spanning the GRAM audit DB and TeraGrid's central accounting DB, and to expose a remote "getChargeForJob" operation.
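The audit lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the actual GRAM audit schema or the OGSA-DAI service: the table names, column names, the get_charge_for_job function, and the sample values are all hypothetical. For simplicity, both tables live in one in-memory SQLite database, whereas the real deployment joins two separate databases.

```python
import sqlite3

# Hypothetical schema: the column names echo the fields mentioned in the
# text, but this is not the real GRAM audit table definition.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gram_audit (
    grid_job_id   TEXT PRIMARY KEY,
    local_job_id  TEXT,
    grid_user_id  TEXT,
    local_user_id TEXT
);
CREATE TABLE accounting (
    local_job_id TEXT PRIMARY KEY,
    charge       REAL
);
""")


def record_job(conn, grid_job_id, local_job_id, grid_user_id, local_user_id):
    """Insert one audit record per job, as the auditing feature does."""
    conn.execute(
        "INSERT INTO gram_audit VALUES (?, ?, ?, ?)",
        (grid_job_id, local_job_id, grid_user_id, local_user_id),
    )


def get_charge_for_job(conn, grid_job_id):
    """Map a grid job id to its local job id, then look up the charge.

    Returns None when the grid job id has no audit or accounting entry.
    """
    row = conn.execute(
        "SELECT a.charge FROM gram_audit AS g "
        "JOIN accounting AS a ON a.local_job_id = g.local_job_id "
        "WHERE g.grid_job_id = ?",
        (grid_job_id,),
    ).fetchone()
    return None if row is None else row[0]


# Made-up sample data, used only to exercise the sketch.
record_job(conn, "grid-job-1", "local-4711", "grid-user-x", "jdoe")
conn.execute("INSERT INTO accounting VALUES (?, ?)", ("local-4711", 12.5))
```

The point of the design is that the audit record is the link between the two worlds: the client only knows the grid job id, the accounting system only knows the local job id, and the audit table relates one to the other.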

I am extremely pleased with the progress we have made and very excited about the potential of GRAM4. And I'd also like to add that if you'd like to join the Globus team in this work, we have open developer positions.
