Grid applications frequently require mechanisms for executing remote
jobs. While this requirement might appear straightforward, its
practical realization can be challenging due to the need to (for
example) address security, reliability, and performance concerns;
enable both client and server management of resources consumed by
remote jobs; propagate error messages and failure notifications; and
move data to/from remote computations. Thus, in practice, the
creation of secure, reliable, and performant job execution services
is difficult.
In Globus Toolkit version 4 (GT4), remote job execution is supported
by the Grid Resource Allocation and Management (GRAM) service, which
defines mechanisms for submitting requests to execute jobs (defined
in a job description language) and for monitoring and controlling the
resulting job executions. More precisely, GT4 includes two different
GRAM services: the "pre-WS GRAM," or GRAM2, first introduced in GT2,
and the newer Web Services-based "WS GRAM," or GRAM4, first included
in GT4.
GRAM2 has been widely deployed on grids around the world for many
years. GRAM2 has its limitations, but they are known and
workarounds have been implemented. Grid operators know how to run
GRAM2. As a result, many production grids still rely on GRAM2 for
remote job submission and management.
GRAM4 has been deployed on a number of grids, and with recent
improvements in performance and reliability, we expect this number to
grow rapidly. The GRAM team has been hard at work over the
last two years, hardening GRAM4 in GT 4.0.x and designing
and implementing new features for the upcoming GT 4.2.x series, while
still supporting GRAM2 users.
Functionally, at a high level, the two services are still quite
similar: stage in files, execute the user application, stage out
files, and clean up files. The service interfaces and
implementations, however, are quite different. In a recent paper, we
presented a detailed functional and performance comparison of GRAM2
and GRAM4. A big win for GRAM4 clients was the move to a standard
Web Service interface. Web Services have proven effective as a
fundamental building block for distributed computing, in commercial
development as well as in Grid research. From the start,
Globus and GRAM4 have been leaders in the move to Web Services.
Another GRAM4 highlight in the GT 4.0.x series is improved
scalability and reliability. First, reliability: the GRAM4 service
can be throttled to limit the amount of work it performs at one
time. This allows an administrator to limit the load GRAM4 places on
a service host. GRAM4 thus provides the foundation for a reliable
grid job execution service stack. Next, scalability: the Scheduler
Event Generator (SEG) is an internal GRAM4 component used to
efficiently monitor all jobs GRAM4 has submitted to a local resource
manager. The SEG approach is much more scalable than the previous
GRAM2 or GRAM3 approach. As a result, when a job is not changing
state (for example, while it is executing or pending in the resource
manager's queue), the memory the GRAM4 service needs to maintain that
job's state can be freed. This lets GRAM4 scale much higher than
GRAM2. These are just a few highlights; please read the paper for
more. What has now become clear is that the GRAM4 service is
superior overall to GRAM2. Now is the time for grids to make the
move to GRAM4!
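
The throttling and event-driven monitoring ideas above can be illustrated with a small sketch. This is a toy model, not GRAM4 code: the class name, method names, and state labels are all assumptions made for illustration. It shows a cap on how many jobs are processed at once, and how a job's in-memory slot can be released once the job is handed to the local resource manager, with later state changes arriving as events rather than through polling.

```python
from collections import deque


class ThrottledJobService:
    """Toy model of a throttled, event-driven job service (not GRAM4 code)."""

    def __init__(self, max_active):
        self.max_active = max_active  # throttle: cap on jobs processed at once
        self.active = set()           # jobs currently held in memory
        self.waiting = deque()        # accepted jobs waiting for a free slot
        self.persisted = {}           # job id -> last known state (stand-in for disk)

    def submit(self, job_id):
        self.persisted[job_id] = "Unsubmitted"
        self.waiting.append(job_id)
        self._fill_slots()

    def _fill_slots(self):
        # Start waiting jobs only while the throttle allows it.
        while self.waiting and len(self.active) < self.max_active:
            job_id = self.waiting.popleft()
            self.active.add(job_id)
            self.persisted[job_id] = "StageIn"

    def hand_off(self, job_id):
        # The job is now queued in the local resource manager, so its
        # in-memory slot is released and another waiting job may start.
        self.active.discard(job_id)
        self.persisted[job_id] = "Pending"
        self._fill_slots()

    def on_scheduler_event(self, job_id, new_state):
        # Called only when the scheduler reports a change; no polling loop.
        self.persisted[job_id] = new_state


# Hypothetical usage: three submissions against a limit of two.
service = ThrottledJobService(max_active=2)
for job in ("job-a", "job-b", "job-c"):
    service.submit(job)
service.hand_off("job-a")                      # frees a slot, so job-c starts
service.on_scheduler_event("job-a", "Active")  # state updated without reloading the job
```

In this sketch a job that is idle in the scheduler's queue costs only one dictionary entry, which is the general property the SEG approach is described as providing.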
One of the key feature additions we're working on for GT 4.2 is
support for JSDL job submissions. The Open Grid Forum's (OGF) Job
Submission Description Language (JSDL) specification has been adopted
by various projects, and requests have been made to add support for
JSDL in GRAM4. We will be announcing an
alpha release sometime in Q1 07. In addition, we are monitoring and
participating in other OGF compute area working groups, in particular
OGSA-BES and OGSA-HPCP. We do not yet have specific plans to support
these last two specifications, but we anticipate that we will at some
point.
Another key feature, recently designed and implemented based on
requirements from TeraGrid, is the new GRAM service auditing
capability. It was added to both GRAM2 and GRAM4. In general,
service auditing is needed so that production grid service providers
can investigate various forms of suspected intrusion and abuse. In
such cases, we may need to access an audit trail of the actions
performed by a service. When accessing this audit trail, it will
frequently be important to be able to relate specific actions to the
user requests that caused them to be performed. When auditing is
turned on, GRAM inserts one audit record per job directly into a
database. The record includes fields such as the grid job id, local
job id, grid user id, local user id, stage in grid id, stage out grid
id, ... For TeraGrid, the GRAM job audit record is used as the basis
for retrieving accounting information from TeraGrid's central
accounting database for a client, given a grid job id. The grid job
id is generated from a GRAM4 EPR. OGSA-DAI is used to create a
"virtual database" spanning the GRAM audit DB and TeraGrid's central
accounting DB, and to expose a remote "getChargeForJob" operation.
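
To make the audit-to-accounting lookup concrete, here is a minimal sketch using an in-memory SQLite database. It is not the TeraGrid or OGSA-DAI implementation: the table names, column names, and the choice to join on the local job id are assumptions for illustration, and both tables live in one database here, whereas the real setup federates two separate databases.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with hypothetical audit and accounting tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gram_audit (
            grid_job_id   TEXT PRIMARY KEY,
            local_job_id  TEXT,
            grid_user_id  TEXT,
            local_user_id TEXT
        );
        CREATE TABLE accounting (
            local_job_id TEXT PRIMARY KEY,
            charge       REAL
        );
        """
    )
    return conn


def record_job(conn, grid_job_id, local_job_id, grid_user_id, local_user_id):
    """Insert one audit record per job, as the text describes."""
    with conn:
        conn.execute(
            "INSERT INTO gram_audit VALUES (?, ?, ?, ?)",
            (grid_job_id, local_job_id, grid_user_id, local_user_id),
        )


def get_charge_for_job(conn, grid_job_id):
    """Return the charge for a grid job id, or None if it cannot be resolved.

    Assumption: the audit record's local job id is the key into accounting.
    """
    row = conn.execute(
        """
        SELECT a.charge
        FROM gram_audit AS g
        JOIN accounting AS a ON a.local_job_id = g.local_job_id
        WHERE g.grid_job_id = ?
        """,
        (grid_job_id,),
    ).fetchone()
    return None if row is None else row[0]
```

The point of the sketch is that the audit record is what links the identifier a client holds (the grid job id) to the identifier the accounting system knows, so the client never needs to learn the local job id.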
I am extremely pleased with the progress we have made and very
excited about the potential of GRAM4. And I'd also like to add that if
you'd like to join the Globus team in this work, we have
open developer positions.