Globus Toolkit Developer's Forum
Bill Allcock
Globus Alliance
Argonne National Laboratory

GridFTP

In the Grid community, there's a popular expression that "access to the data is as important as access to the compute resources."

And no Globus Toolkit component is more central to Grid data access than the Globus implementation of GridFTP, an extension of the File Transfer Protocol (FTP). Globus GridFTP is a high-performance data transfer protocol and software suite optimized for the gamut of data access issues -- from bulk file transfer, to the nitty-gritty details of getting data out of complex storage systems within virtual organizations on the Grid, and pretty much every data requirement in between.

GridFTP Origins

"It's important to remember that the GridFTP protocol was built on the existing FTP protocol," said Bill Allcock, technology coordinator at Argonne National Laboratory, and one of the authors of GridFTP, both the protocol (developed in the GGF) and the Globus implementation.

The FTP protocol originated with the ARPAnet community all the way back in 1971. FTP has seen many new specification twists and turns through the years.

By 1973, a number of initial 'requests for comment' (RFCs) for FTP specs had been published. The version that perhaps signaled the maturity of the protocol arrived in 1985, when Jon Postel and Joyce Reynolds (of ISI) authored RFC 959. RFC 959 stated the objectives of FTP as "1) to promote sharing of files (computer programs and/or data), 2) to encourage indirect or implicit (via programs) use of remote computers, 3) to shield a user from variations in file storage systems among hosts, and 4) to transfer data reliably and efficiently."

FTP became a pervasive protocol with the arrival of the commercial Internet. But as Grid computing usage accelerated in e-Science in the late '90s, new challenges arose for Grid users who needed to access different storage systems across virtual organizations. Storage systems had become increasingly customized to serve specific user needs -- and the FTP protocol in its existing form was unable to reconcile this explosion of disparate, incompatible systems for accessing data.

According to the white paper "GridFTP: Universal Data Transfer for the Grid," written by some of the same Globus and GGF teams that authored the GridFTP protocol: "Most customized storage systems utilize incompatible protocols for accessing data and require the use of their own clients. The use of multiple incompatible protocols for data storage effectively partitions the datasets available on the Grid. Applications that require access to data stored in different storage systems must use different methods to retrieve data from each system. It can be challenging to transfer a dataset from one system to another."

So in a draft submitted to the IETF in 2001, the GridFTP authors (Allcock, Bester, Bresnahan, Chervenak, Liming and Tuecke) laid out the new features of the proposed GridFTP protocol:

  • Grid Security Infrastructure (GSI) and Kerberos support
  • Third-party control of data transfer
  • Parallel data transfer
  • Striped data transfer
  • Partial file transfer
  • Automatic negotiation of TCP buffer/window sizes
  • Support for reliable and restartable data transfer
  • Integrated instrumentation
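
Two of the features above, partial file transfer and parallel data transfer, both come down to addressing a file by byte range. As a rough illustration only (this is not the GridFTP wire protocol), the Python sketch below divides a file of a given size into contiguous byte ranges, one per stream. The function name `split_ranges` and its parameters are hypothetical, chosen for this example.

```python
def split_ranges(file_size, streams):
    """Divide [0, file_size) into contiguous (offset, length) ranges.

    Hypothetical helper for illustration; it is not part of any GridFTP API.
    Streams that would receive zero bytes are left out of the result.
    """
    if file_size < 0 or streams < 1:
        raise ValueError("file_size must be >= 0 and streams >= 1")
    base, extra = divmod(file_size, streams)
    ranges = []
    offset = 0
    for i in range(streams):
        # The first `extra` streams each carry one additional byte.
        length = base + (1 if i < extra else 0)
        if length:
            ranges.append((offset, length))
        offset += length
    return ranges


if __name__ == "__main__":
    for offset, length in split_ranges(1_000_000, 4):
        print(f"stream reads bytes {offset}..{offset + length - 1}")
```

A single entry from the result is, in effect, a partial file transfer; all entries sent at once approximate a parallel one.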

Globus GridFTP Today

Today, by default, the Globus implementation of GridFTP will work with any storage device that presents a POSIX file system, over any TCP/IP network.

"It doesn't matter whether you're running RAID or not, EXT3 versus XFS, PVFS or GPFS," said Allcock. "We work fine on all of those. The one caveat is that certain configuration parameters can have a much larger impact on some of those than the others. For instance, GPFS wants big reads -- they want large sequential reads. Whereas PVFS wants you to match whatever the stride size is. But Globus GridFTP will work on all of them, just out of the box, regardless of system type."

With the release of GT4 about a year ago came several performance enhancements for GridFTP. According to Allcock, the GridFTP server in GT 4.0 is 100% Globus code, written from scratch by the Globus Alliance.

"It's much easier to maintain, much more extensible," said Allcock. "It's a brand new code base. It's very clean, very modular. We developed something called the data storage interface (DSI) -- which completely abstracts the storage system. It can talk to whatever your storage device might be. We have DSIs for a POSIX file system, the IBM HPSS tape system, and the Storage Resource Broker (SRB). It should work for virtually any storage system."
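
The idea Allcock describes, a server that reaches storage only through a narrow interface, can be sketched in a few lines. The Python below illustrates that design pattern only; it is not the Globus DSI, and the names `StorageInterface`, `read_block`, `write_block`, `MemoryStorage` and `copy` are all hypothetical.

```python
from abc import ABC, abstractmethod


class StorageInterface(ABC):
    """Hypothetical stand-in for a storage abstraction layer."""

    @abstractmethod
    def read_block(self, name, offset, length):
        """Return up to `length` bytes of `name` starting at `offset`."""

    @abstractmethod
    def write_block(self, name, offset, data):
        """Write `data` into `name` starting at `offset`."""


class MemoryStorage(StorageInterface):
    """Toy backend that keeps objects in a dict of bytearrays."""

    def __init__(self):
        self._objects = {}

    def read_block(self, name, offset, length):
        return bytes(self._objects[name][offset:offset + length])

    def write_block(self, name, offset, data):
        buf = self._objects.setdefault(name, bytearray())
        if len(buf) < offset:
            # Pad with zeros if the write starts past the current end.
            buf.extend(b"\x00" * (offset - len(buf)))
        buf[offset:offset + len(data)] = data


def copy(src, dst, name, block_size=4):
    """Transfer logic that knows only the interface, not the backend."""
    offset = 0
    while True:
        block = src.read_block(name, offset, block_size)
        if not block:
            break
        dst.write_block(name, offset, block)
        offset += len(block)
    return offset  # total bytes copied


if __name__ == "__main__":
    a, b = MemoryStorage(), MemoryStorage()
    a.write_block("dataset", 0, b"hello grid")
    print(copy(a, b, "dataset"), b.read_block("dataset", 0, 100))
```

In this sketch, supporting a new storage system would mean writing another subclass; `copy` itself would not change.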

On the security front, Globus GridFTP uses an authz callout, and ships with a default authorization system that works out of the box.

"If you have your own authorization mechanism that you want to use, you can write your own authz callout, put it in the LD library path, and you can do your own authorization," said Allcock. "This turns out to be fairly common and necessary in big Grids, because there's a movement in the Grid community towards dynamic accounts, shared accounts and things like that. Files have lifetimes... if you come back next time and you're under a different account, standard file permissions don't work, you don't have read permission. So people are doing custom authorization systems with GridFTP and making interesting progress on these new security requirements."
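
The problem Allcock raises, a returning user mapped to a different account, is easier to see with a toy model. The Python below is a sketch of the plug-in idea only; it is not the Globus callout interface, and `Request`, `owner_policy`, `identity_policy`, `authorize` and the account and identity strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    grid_identity: str  # who the user is on the Grid (assumed: a certificate subject)
    local_account: str  # the account this session happened to be mapped to
    action: str         # e.g. "read" or "write"
    path: str


def owner_policy(owners):
    """Default-style check: the local account must own the file."""
    def check(request):
        return owners.get(request.path) == request.local_account
    return check


def identity_policy(grants):
    """Custom check: permissions follow the Grid identity across accounts."""
    def check(request):
        allowed = grants.get(request.grid_identity, set())
        return (request.action, request.path) in allowed
    return check


def authorize(callout, request):
    """The server consults whichever policy was plugged in."""
    return "allowed" if callout(request) else "denied"


if __name__ == "__main__":
    # The file was written under pool account "grid017"; the same user
    # returns later and is mapped to "grid042".
    request = Request("/O=Grid/CN=Example User", "grid042", "read", "/data/run1")
    by_owner = owner_policy({"/data/run1": "grid017"})
    by_identity = identity_policy(
        {"/O=Grid/CN=Example User": {("read", "/data/run1")}}
    )
    print(authorize(by_owner, request))
    print(authorize(by_identity, request))
```

The ownership check refuses the returning user, while the identity-based check lets the same person back in, which is the behavior the custom systems described above are after.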

And, of course, Globus GridFTP continues its legacy of moving data very fast over high-bandwidth, high-latency networks... allowing users to drop large data sets into the network and get them to their destinations with minimal fuss.
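
Why high bandwidth combined with high latency is hard comes down to the bandwidth-delay product: the amount of data that must be in flight to keep a path full. The Python sketch below works one example. The 10 Gb/s link, 100 ms round trip and 64 KiB window are assumed values chosen for illustration, not figures from this article, and the function names are hypothetical.

```python
def bandwidth_delay_product(bits_per_second, rtt_ms):
    """Bytes that must be in flight to keep the path full."""
    return bits_per_second * rtt_ms // 8000


def window_limited_rate(window_bytes, rtt_ms):
    """Upper bound on one TCP stream's throughput, in bits per second."""
    return window_bytes * 8 * 1000 / rtt_ms


if __name__ == "__main__":
    link = 10_000_000_000   # assumed: 10 Gb/s path
    rtt = 100               # assumed: 100 ms round-trip time
    window = 64 * 1024      # assumed: 64 KiB TCP window

    print("bytes in flight needed:", bandwidth_delay_product(link, rtt))
    print("one-stream ceiling (b/s):", window_limited_rate(window, rtt))
```

With these assumed values the path needs 125,000,000 bytes in flight, while a 64 KiB window caps a single stream at roughly 5.2 Mb/s. That gap is what larger negotiated buffers and multiple parallel streams are meant to close.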

"The Internet and IP networks were designed to do a very good job of aggregating a lot of small traffic flows into very big pipes -- for example, thousands of bits of Web traffic shared over multiple 10-GB links within the backbone of the network," Steve Tuecke said in an interview a few months ago. "What IP networks are not so good at handling is sustained high-bandwidth traffic... for example, if you really need to drive 10-GB/s or more for a single purpose. This is where Globus and its work with GridFTP has really been a leader in terms of allowing the reliable and secure movement of a lot of data on big network pipes. You can have 20 nodes sending 1 GB/s, rather than one big computer sending 20 GB/s. But managing data transfers using 20 nodes, rather than just one, requires more sophisticated protocols and software, which is one of the issues GridFTP and its Globus Toolkit V4 implementation addresses."
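
Tuecke's "20 nodes sending 1 GB/s" is the striping idea in miniature. The Python sketch below shows one simple way to divide the work, handing fixed-size blocks to nodes in round-robin order. It is an assumption made for illustration rather than the layout GridFTP actually negotiates, and the names `stripe_blocks` and `aggregate_rate` are hypothetical.

```python
def stripe_blocks(file_size, block_size, nodes):
    """Assign fixed-size blocks to nodes in round-robin order.

    Returns one list per node; each list holds the (offset, length)
    blocks that node is responsible for sending.
    """
    if block_size < 1 or nodes < 1:
        raise ValueError("block_size and nodes must be >= 1")
    plan = [[] for _ in range(nodes)]
    for index, offset in enumerate(range(0, file_size, block_size)):
        length = min(block_size, file_size - offset)  # last block may be short
        plan[index % nodes].append((offset, length))
    return plan


def aggregate_rate(per_node_rate, nodes):
    """Ideal combined rate if every node sustains `per_node_rate`."""
    return per_node_rate * nodes


if __name__ == "__main__":
    plan = stripe_blocks(file_size=10, block_size=4, nodes=2)
    for node, blocks in enumerate(plan):
        print(f"node {node}: {blocks}")
    print("ideal aggregate, GB/s:", aggregate_rate(1, 20))
```

The ideal aggregate ignores the coordination cost Tuecke mentions; keeping 20 senders and their receivers in step is the part that needs the more sophisticated protocol.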

Globus GridFTP Users

Today, Globus GridFTP is used pervasively in the e-Science Grid community. The high-energy physics community in particular has been a huge user from the start. A notable recent use was by the Relativistic Heavy Ion Collider (RHIC) community at Brookhaven, which used Globus GridFTP to sustain 600 megabytes per second of data transfer (from Long Island, New York, to Japan) over 11 days.
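
For a sense of scale, the RHIC figure can be turned into a total volume. The short Python calculation below assumes decimal units (1 MB = 10^6 bytes) and that the 600 MB/s rate held for the full 11 days; the article states neither, so the result is an upper-end estimate rather than a reported number.

```python
RATE_BYTES_PER_SECOND = 600 * 10**6      # 600 MB/s, assuming 1 MB = 10**6 bytes
DURATION_SECONDS = 11 * 24 * 60 * 60     # 11 days, assuming no interruptions


def total_bytes(rate, seconds):
    """Data moved at a constant rate over the given time."""
    return rate * seconds


if __name__ == "__main__":
    moved = total_bytes(RATE_BYTES_PER_SECOND, DURATION_SECONDS)
    print(f"{moved} bytes, or about {moved / 10**12:.2f} TB")
```

Under those assumptions the transfer comes to roughly 570 terabytes.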

The British Broadcasting Corporation (BBC) has frequent large-file demands (the typical broadcast hour today requires 280 GB for all pre-processed media streams), and these are met by GridFTP. "Everything in Gridcast is built using Globus Toolkit," said Terry Harmer, Technical Director at the Belfast e-Science Centre, in a 2005 interview with the Globus Consortium Journal. "We use it as a means by which we create, define, and deploy services. We are big users of GridFTP."

"The Reliable File Transfer Service (RFT) also uses GridFTP," said Allcock. "RFT is an alternative client to GridFTP. What RFT brings you is the 'R' part -- reliability. It's a service that writes everything to a database and can recover on its own if there's a transmission failure. It's a service that allows you to do 'fire and forget.'"
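
The "writes everything to a database and can recover" behavior can be illustrated with a small checkpointing loop. The Python below is a sketch of that general technique using SQLite. It does not reflect RFT's actual schema or interfaces, and `open_state`, `run_transfer`, the `done` table and the `send` callback are all hypothetical.

```python
import sqlite3


def open_state(path=":memory:"):
    """Create (or reopen) the table that records finished chunks."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS done ("
        "transfer TEXT, chunk INTEGER, PRIMARY KEY (transfer, chunk))"
    )
    return db


def run_transfer(db, transfer, chunks, send):
    """Send every chunk not yet recorded as done, committing after each.

    `send` is a caller-supplied function that may raise on failure.
    Returns the chunk numbers sent during this call.
    """
    finished = {
        row[0]
        for row in db.execute(
            "SELECT chunk FROM done WHERE transfer = ?", (transfer,)
        )
    }
    sent = []
    for chunk in chunks:
        if chunk in finished:
            continue  # delivered before the failure, so skip it
        send(chunk)
        db.execute("INSERT INTO done VALUES (?, ?)", (transfer, chunk))
        db.commit()
        sent.append(chunk)
    return sent


if __name__ == "__main__":
    db = open_state()
    failures = {2}  # chunk 2 fails once, then succeeds

    def flaky_send(chunk):
        if chunk in failures:
            failures.discard(chunk)
            raise ConnectionError(f"chunk {chunk} failed")

    try:
        run_transfer(db, "job-1", range(4), flaky_send)
    except ConnectionError:
        pass  # a real service would retry on its own
    print(run_transfer(db, "job-1", range(4), flaky_send))
```

Because progress is committed after every chunk, the second call resumes at the chunk that failed instead of starting over. Pointing `open_state` at a file rather than `:memory:` would let the record survive a restart of the process as well.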

Allcock notes, however, that some Globus GridFTP users are still running old versions.

"There are some folks out there still clinging on to the 2.43 version of Globus GridFTP," said Allcock. "But you really want to use the new server. There's just no reason not to do it, because protocol-wise, there's no difference. You can literally drop it onto the new server, and your clients will never know the difference. We've done this a thousand times."
