Getting Serious with Amazon Glacier

After playing around last weekend, I’m ready to push real content up into Amazon Glacier for long-term preservation. I’ve worked up an Ant script to do the work, and I’ve posted it on GitHub as pbinkley/glacier-ant-bagit.

Here’s what I’m doing:

Work at the directory level: each Glacier archive will be a tar file containing the contents of a single directory.
Package the content with BagIt so that it will be self-verifying if I ever download it again.
Keep metadata locally: at the moment the script maintains a CSV file, and eventually I’ll load that into a database. It also saves the bag manifest to provide a full list of files.
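
As a rough illustration of those three steps (this is not the Ant script itself), here is a minimal Python sketch using only the standard library. The function names, the CSV column names, and the choice of MD5 for the manifest are assumptions made for the example. The upload step is left out, so the archive ID passed to record() is whatever the upload tool reports back.

```python
import csv
import hashlib
import shutil
import tarfile
from pathlib import Path


def make_bag(src_dir: Path, work_dir: Path) -> Path:
    """Copy src_dir into a minimal BagIt layout under work_dir; return the bag path."""
    bag = work_dir / src_dir.name
    data = bag / "data"
    shutil.copytree(src_dir, data)  # the payload lives under data/

    # Bag declaration file required by the BagIt layout.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "<md5>  <path relative to the bag>" line per file.
    lines = []
    for f in sorted(p for p in data.rglob("*") if p.is_file()):
        digest = hashlib.md5(f.read_bytes()).hexdigest()
        lines.append(f"{digest}  {f.relative_to(bag).as_posix()}")
    (bag / "manifest-md5.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return bag


def tar_bag(bag: Path) -> Path:
    """Pack one bag into one uncompressed tar file next to it."""
    tar_path = bag.parent / (bag.name + ".tar")
    with tarfile.open(tar_path, "w") as tar:
        tar.add(bag, arcname=bag.name)
    return tar_path


def record(csv_path: Path, src_dir: Path, tar_path: Path, archive_id: str) -> None:
    """Append one row of local metadata; column names are hypothetical."""
    is_new = not csv_path.exists()
    with csv_path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        if is_new:
            writer.writerow(["directory", "tar_file", "bytes", "archive_id"])
        writer.writerow(
            [str(src_dir), tar_path.name, tar_path.stat().st_size, archive_id]
        )
```

Keeping a copy of manifest-md5.txt alongside the CSV gives a local file list for each archive without having to retrieve anything from Glacier.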

I’m using a new command-line interface to do the uploading: carlossg/glacier-cli. I liked glacierFreezer’s use of SimpleDB to store file metadata in Amazon’s cloud, but at the moment that package only connects to the us-east-1 region, and I wanted to use something a little closer to home. The glacierFreezer code isn’t currently open, either.

It’s ready to go, so I’ll let a large job run overnight and see how it works out.

Update (next day)

The script is working on the Ubuntu family fileserver. Overnight it uploaded 70 directories with almost 3 GB of stuff (the limiting factor being, obviously, our home internet connection). I fixed a couple of bugs; if you’ve taken a copy from GitHub, you should do a pull.

So: I’m on the hook for 3¢/month so far, in perpetuity, and 36¢ if I ever need to retrieve those directories (if I go over my 5%/month retrieval allowance). In exchange, the risk of loss of those old images has been reduced … by how much? How could you calculate that? I maintain four local copies already, three at home and one in my office, but a fire could easily bring me down to depending on a single spindle; and I’m getting older and will become increasingly likely to make mistakes with this stuff. Eventually these objects will become a digital inheritance. The more serious dangers they must survive to get through the next fifty years are probably those of human error, indifference, and absence of the right technical skills at the right moment. Multiplying copies onto different platforms, beyond the number required by a continuity-of-service calculation, makes sense: creating impediments to managing this stuff comprehensively reduces the potential consequences of massive failures of management. I hope.
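
For anyone who wants to check the arithmetic behind the 3¢ and 36¢ figures, here is a small Python sketch. The two rates are assumptions back-derived from those figures (about 3 GB stored), not quoted from Amazon's price list, and the free retrieval allowance is not modeled.

```python
# Assumed rates, inferred from the figures in the post rather than a price list.
STORAGE_USD_PER_GB_MONTH = 0.01
RETRIEVAL_USD_PER_GB = 0.12


def monthly_storage_cost(gb: float) -> float:
    """Ongoing storage cost in USD per month."""
    return gb * STORAGE_USD_PER_GB_MONTH


def retrieval_cost(gb: float) -> float:
    """One-off cost in USD to pull everything back, ignoring any free allowance."""
    return gb * RETRIEVAL_USD_PER_GB


stored_gb = 3
print(f"storage: {monthly_storage_cost(stored_gb) * 100:.0f} cents/month")
print(f"full retrieval: {retrieval_cost(stored_gb) * 100:.0f} cents")
```

At these rates the cost scales linearly, so a terabyte of old images would run about ten dollars a month.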