The February 2012 issue of the “Communications of the ACM” features an article by Bryce Allen et al., “Software as a service for data scientists” (or see PDF).
The main idea of the article is as follows. The amount of research data is growing exponentially. (Can we say: “the quantity of data is turning into a new quality”?) Then, one day, you need to transfer a large amount of data from one data storage facility to another. The data is in files that can be either big or small, and there are many of them: more than you can handle manually. Moreover, some files don’t need to be transferred because they are already present on the target system. Also, you are not sitting in either of the facilities: you are at home, in front of your notebook. And you want to start the transfer and go to bed, knowing that every failed transfer will be retried automatically.
So, the solution that the Globus Online team came up with is a sophisticated system that launches the transfer directly between the two facilities, using the GridFTP protocol. Your notebook is only a console to control the process with a Web interface (or a command line). Rather elegant, indeed.
Besides, “the brains” of the system that performs the transfer between the two endpoints are hosted in the Amazon EC2 cloud for the highest availability. Of course, this means the system is dependent on the cloud, and will not function without it. But at least it is resilient to failures.
(What made me smile, though, is that the authors are discussing the possibility of integrating the Sneakernet solution into Globus Online, in the era and in the country where research network backbones are running at speeds of 10 Gbps or more. Despite all those examples given in the Sneakernet Wikipedia article, I would still prefer a cable to be laid, even to some remote facilities like telescopes. I believe in the long run the cable would be cheaper to build and maintain.)
However, there is something that I feel is missing in this sort of solution. Think of military and civil aviation, or racing cars versus civil cars. The civil aircraft and cars adopt the best practices from their hi-tech counterparts. For example, modern cars borrow their smooth contours from early race cars to ensure that airflow around the car is more laminar, which reduces fuel consumption.
Do we see the same kind of adoption when talking about file transfers? I would argue that we do not. Remember the tool that you use every day, the Web browser. How do you download big sets of files with your browser? Usually one by one, or you hope that the website will provide an archive of all files bundled together. Ability to retry a failed download? In Mozilla Firefox, this still requires user interaction. Ability to download many files in parallel with a single click, or only update changed files (“synchronize directories”), or (automatically) tune TCP parameters? None.
But all the features required to implement this have been here for years. Persistent connections (“Keep-alive”) appeared as an informal extension to HTTP/1.0 (1996) and became the default in HTTP/1.1 (1999). HTTP/1.1 also brought HTTP pipelining (useful for fetching small files) and byte serving, which allows a client to request only part of a document, e.g., the remainder of a failed transfer. The mechanism behind the File eXchange Protocol (FXP), transferring data directly between two remote endpoints, was described in RFC 959 (see Figure 2) back in 1985.
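To make the byte serving point concrete, here is a minimal sketch in Python of how a client could ask a server for only the remainder of a failed transfer. The function names range_header and resume_download are hypothetical, chosen for illustration, and the sketch assumes the server supports the HTTP Range header; it is not how any particular browser or download manager does it.

```python
import os
import urllib.request


def range_header(bytes_on_disk: int) -> dict:
    """Build the request header that asks only for the bytes we lack."""
    if bytes_on_disk <= 0:
        return {}  # nothing saved yet: request the whole document
    return {"Range": f"bytes={bytes_on_disk}-"}


def resume_download(url: str, path: str, chunk_size: int = 64 * 1024) -> None:
    """Continue a partial download of `url` into the file at `path`.

    Assumption: if the local file is already complete, the server will
    typically answer 416 and urlopen raises HTTPError; a real tool
    would catch that case.
    """
    have = os.path.getsize(path) if os.path.exists(path) else 0
    request = urllib.request.Request(url, headers=range_header(have))
    with urllib.request.urlopen(request) as response:
        # 206 Partial Content: the server honoured the range, so append.
        # Any other success status: it is sending everything, so start over.
        mode = "ab" if response.status == 206 else "wb"
        with open(path, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
```

Calling resume_download again after an interruption picks up from the current size of the local file, which is all the "retry" a browser would need to offer.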
Parallel FTP or HTTP transfers are now implemented at a higher level, the application level, and are a distinct feature of download managers. In the good old days you could download an entire directory by FTP as one big compressed file. Nowadays you can use the “mget” command of your FTP client to download multiple files with wildcards. Or you could configure wget to mirror the remote site for you, but transfers will be queued, not parallel.
So many helpful protocols, tools and techniques, but never combined in a single program! I remember that several years ago, out of despair, TACC suggested transferring files to their Ranger machine with FTP, using an intricate scheme of tunneling the control connection via SSH but sending the file contents in clear text to improve throughput.
Finally, after the HPC community tried these stillborn approaches, it settled on GridFTP. It wasn’t easy to use, and required manual tuning for best results, but at least you could hope to substantially utilize that fat Internet link that your institution had. And now we have that wonderful solution that this blog post started with.
But! How are we going to achieve the cross-pollination of ideas? How are we going to introduce that high-bandwidth, automatic and resilient directory mirroring into every desktop computer in the world? (Think of sending your entire photo collection to a family member.) Are we going to build GridFTP into Web browsers? Most certainly not.
Besides, as you could see from the above analysis, most mechanisms required to achieve that goal were already implemented in separate software tools many years ago. GridFTP and Globus Online will likely remain the tools of the HPC community and will not spread to desktops.
The community could have taken a different path. It could have created a new download manager, capable of parallel downloads (or uploads) and of initiating third-party transfers, with automatic retries of failed transfers, directory synchronization and automatic TCP optimizations. All the goodies that GridFTP has (and more), but simple to use. WinSCP is already very intuitive in terms of user interface.
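The core of such a tool is not complicated. The following Python sketch combines three of the features listed above: synchronization (skip what the target already has), automatic retries, and parallel transfers. Everything in it is an assumption made for illustration: the names plan_sync, with_retries and mirror are hypothetical, files are compared by size only, and the actual transfer is whatever function you pass in (it is expected to raise OSError on failure). Third-party transfers and TCP tuning are left out.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def plan_sync(source: Dict[str, int], target: Dict[str, int]) -> List[str]:
    """Names of files that are missing on the target or differ in size.

    Both arguments map a file name to its size in bytes.
    """
    return sorted(n for n, size in source.items() if target.get(n) != size)


def with_retries(transfer: Callable[[str], None], name: str,
                 attempts: int = 3, delay: float = 1.0) -> bool:
    """Run transfer(name); on OSError retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            transfer(name)
            return True
        except OSError:
            if attempt + 1 < attempts:
                time.sleep(delay * 2 ** attempt)
    return False


def mirror(source: Dict[str, int], target: Dict[str, int],
           transfer: Callable[[str], None], workers: int = 4,
           attempts: int = 3, delay: float = 1.0) -> Dict[str, bool]:
    """Transfer every out-of-date file in parallel; report success per file."""
    todo = plan_sync(source, target)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(
            lambda n: with_retries(transfer, n, attempts, delay), todo))
    return dict(zip(todo, results))
```

With a real transfer function plugged in (for example, one that resumes partial HTTP downloads), mirror returns a per-file report that you can read in the morning.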
This tool could use security mechanisms like TLS when necessary. For use in grid environments, the Grid Security Infrastructure and certificates would be used instead. And the tool would not rely on the cloud as its central management post.
Just dreams, but I still hope that the lessons learned from Globus Online will somehow prove useful for building better data transfer solutions for the rest of us.
How about Aspera? Although it is a commercial protocol (faspx), it's a snap to use (browser plugin). It's perhaps as fast as Globus, but again the fact that it's proprietary is limiting its spread.
The article quoted at the beginning of the post has a section called “Other approaches”. Aspera is not mentioned there, but it is yet another approach. For those looking for something that works “out of the box” this could be a good solution, so thanks for sharing this! Ease of use is very important.