Digital Preservation « Quædam cuiusdam
Getting Serious with Amazon Glacier
Wednesday 29 August 2012 @ 6:53 pm

After playing around last weekend, I’m ready to push real content up into Amazon Glacier for long-term preservation. I’ve worked up an Ant script to do the work, and I’ve posted it on GitHub as pbinkley/glacier-ant-bagit.

Here’s what I’m doing:

  • work at the directory level: each Glacier archive will be a tar file containing the contents of a single directory.
  • package the content with Bagit so that it will be self-verifying when (if ever) I download it again.
  • keep metadata locally: at the moment the script maintains a CSV file, and eventually I’ll load that into a database. It also saves the bag manifest to provide a full list of files.

I’m using a new command-line interface to do the uploading: carlossg/glacier-cli. I liked glacierFreezer‘s use of SimpleDB to store file metadata in Amazon’s cloud, but this package at the moment only connects to the us-east-1 zone, and I wanted to use something a little closer to home. The code of glacierFreezer isn’t currently open, either.

It’s ready to go, so I’ll let some a large job run overnight and see how it works out.

Update (next day)

Gary and Dave

The script is working on the Ubuntu family fileserver. Overnight it uploaded 70 directories with almost 3gb of stuff (the limiting factor being, obviously, our home internet connection). I fixed a couple of bugs; if you’ve taken a copy from Github, you should do a pull.

So: I’m on the hook for 3¢/month so far, in perpetuity, and 36¢ if I ever need to retrieve those directories (if I go over my 5%/month retrieval allowance). In exchange, the risk of loss of those old images has been reduced … by how much? How could you calculate that? I maintain four local copies already, three at home and one in my office, but a fire could easily bring me down to depending on a single spindle; and I’m getting older and will become increasingly likely to make mistakes with this stuff. Eventually these objects will become a digital inheritance. The more serious dangers they must survive to get through the next fifty years are probably those of human error, indifference, and absence of the right technical skills at the right moment. Multiplying copies onto different platforms, beyond the number required by a continuity-of-service calculation, makes sense: creating impediments to managing this stuff comprehensively reduces the potential consequences of massive failures of management. I hope.

Playing with Amazon Glacier
Saturday 25 August 2012 @ 2:17 pm

This week Amazon released its new digital preservation platform Glacier. It is similar to the S3 storage service, but optimized for long-term, low-access storage. You pay a penny per GB per month, and you accept that access will be slow (four hours or more) and expensive (12 cents/GB, with free access to 5% of your content each month). I’ve been storing family digital assets on S3 as a remote backup, and Glacier will save me a few bucks each month. I’ll only need to access the content if my local backups fail, so I can accept the barriers to access. And, of course, at work we’re interested in low-cost off-site replication. So, let’s check it out.

The initial offering from Amazon has a web management console and Java and .NET SDKs and a REST API, but no user-friendly client. Third parties are starting to release clients, though, and there’s enough there to work with. Within a few days there will be more.

Getting started is easy: just activate Glacier in your AWS account. The data model is simple: “vaults” contain “archives”, which as far as I’m concerned are simply files. You can create vaults through the web console and tie them into Amazon’s SNS notification service, but that’s as far as you can get; to upload a file you need client.

I started with the glacierFreezer command-line client, which is based on the Java SDK. It makes use of an Amazon SimpleDB domain to store information about your archives, so you need to create one for it first. Then gather your access key and secret key (from the “Security Credentials” tab in the web console), and run it:

java -jar glacierFreezer.jar <accessKey> <secretKey> <simpleDbDomainName> <vaultName> <fileName>

Up the file goes, and the results are stored by glacierFreezer in the SimpleDB domain:

      <Value>Thu Aug 23 12:36:42 MDT 2012</Value>

That’s a lot of information you’re going to need to keep track of, because Glacier won’t keep track of it for you. If you want to be able to restore an archive to a local file with the same name as it had before you uploaded it, you need to remember the mapping of the archiveID to the fileName.

At this point I was blocked again, since glacierFreezer doesn’t yet have the functionality to do anything with an archive in a vault (give it a few days). The day after the upload, when Glacier had done its daily job of generating inventories, I could at least see in the web console that the vault had been populated:

We’ve got an archive! The file I uploaded was only a few bytes, so the 32kb size presumably represents the block size of Glacier’s file system.

This morning I looked again for Glacier clients, and found that the Node.js project node-awssum had added Glacier to the list of supported Amazon APIs. I’ve been meaning to play with Node.js for a while so I jumped on it. I installed Node.js and its package manager npm according to these instructions, then installed node-awssum (and the required package fmt) with a lovely simple

npm install fmt
npm -d install awssum

The Glacier examples that come with node-awssum cover fetching vault descriptions and such, but not the job-oriented tasks that I need at this point. To fetch an inventory of a vault, or an archive from that vault, you need to initiate a job and wait for Glacier to let you know it’s done (which they say takes four hours). Not to worry, though, node-awssum is easy to work with. I copied one of the examples and created a script “inventory-retrieval.js” like this:

var fmt = require('fmt');
var awssum = require('awssum');
var amazon = awssum.load('amazon/amazon');
var Glacier = awssum.load('amazon/glacier').Glacier;

var accessKeyId = 'xxxxxxxxxxxxxxxx';
var secretAccessKey = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx';
var awsAccountId = 'xxxxxxxxxxxxx'; // note: omit hyphens

var glacier = new Glacier({
'accessKeyId' : accessKeyId,
'secretAccessKey' : secretAccessKey,
'awsAccountId' : awsAccountId, // required
'region' : amazon.US_EAST_1

fmt.field('Region', glacier.region() );
fmt.field('EndPoint', );
fmt.field('AccessKeyId', glacier.accessKeyId().substr(0,3) + '...' );
fmt.field('SecretAccessKey', glacier.secretAccessKey().substr(0,3) + '...' );
fmt.field('AwsAccountId', glacier.awsAccountId() );

glacier.InitiateJob({ VaultName : 'test', Type: 'inventory-retrieval' }, function(err, data) {
fmt.msg("describing vault - expecting success");
fmt.dump(err, 'Error');
fmt.dump(data, 'Data');

I run that and get the following output:

$ node inventory-retrieval.js
Region : us-east-1
EndPoint :
AccessKeyId : AKI...
SecretAccessKey : 3HO...
AwsAccountId : xxxxxxxxxxxxxxx
describing vault - expecting success
Error : null
Data : { StatusCode: 202,
{ &#039;x-amzn-requestid&#039;: &#039;xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&#039;,
location: &#039;/xxxxxxxxxxxxxxx/vaults/test/jobs/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&#039;,
&#039;x-amz-job-id&#039;: &#039;xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&#039;,
&#039;content-type&#039;: &#039;application/json&#039;,
&#039;content-length&#039;: &#039;2&#039;,
date: &#039;Sat, 25 Aug 2012 18:36:32 GMT&#039; },
Body: &#039;&#039; }

So, I’ve successfully created the job. And now I wait, savoring the full experience of Glacier’s slow retrieval which is going to save me so much money compared to S3. I’ll update this post when I get Glacier’s notification that it’s complete. Meanwhile, I’ll contemplate David Rosenthal’s analysis of Glacier’s costs.

Update 1

The job took just over four hours; I didn’t get a notification (have to look into that, probably my misconfiguration) but the job description shows the time. The next step is retrieve the job output, and it turns out that node-awssum hasn’t finished this function: it generates the uri for a job description rather than the job output. The code was easy to patch so I was able to retrieve the inventory:

 { VaultARN: 'arn:aws:glacier:us-east-1:xxxxxxxxxx:vaults/test',
   InventoryDate: '2012-08-24T13:54:52Z',
    [ { ArchiveId: 'E4lnahK_rbGeenbN07Yc3Myl3FLuJ6IhlmnLSHeAlYmfilRiiJ-3aCCs8C2lPgocUvmBIYpY2lIR1tWmVfXeji73WJrHKqIw9snU8ADWBkPO92Dp688E-mMyLCTMT-s1A7_D2bxxOQ',
        ArchiveDescription: 'Archived file readme-pbinkley.txt Thu Aug 23 12:36:41 MDT 2012',
        CreationDate: '2012-08-23T18:36:42Z',
        Size: 36,
        SHA256TreeHash: 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx' } ] } }

So it does have my original file name, but only in a text description. This response gives me the archive ID of my file (which I had anyway because glacierFreezer saved it for me – nice to see that they agree). Just to close the loop I’ve initiated the archive-retrieval job.

Update 2

I posted an issue about the problem with get-job-output in node-awssum. Heard a couple of hours later that it’s been fixed in master; tried it, it works. God I love open-source.

Update 3 (next day)

The archive retrieval job finished after 4 1/2 hours, and I can retrieve my file. I get a sha256 hash in the header to let me verify the content. For some reason Amazon doesn’t pay attention to my byte range request if I try to retrieve less than the whole file; perhaps it’s because the file is so short, just 36 bytes. I’ll try that again when I’ve uploaded something bigger. And it turns out the notifications were coming through to my email after all: dunno how I overlooked them. So all is good.

It’s easy to imagine a full-scale retrieval process that would manage the initiation of retrieval jobs, monitor the notification stream (which uses Amazon’s Simple Notification Service and can therefore push notifications using a variety of protocols), and fetch the output when it receives notification that a job is ready. Amazon says that output is available for at least 24 hours after the job completes, so you would want to manage the chunking of jobs in such a way as to avoid retrieving more content than you can download in a day, taking into account the somewhat convoluted calculations required to avoid overrunning your 5% monthly free download allowance.

I’m currently using S3 for offsite backup of my personal digital archive, and moving it to Glacier is a no-brainer. I don’t expect ever to retrieve this stuff, since I keep multiple local copies: it’s fire insurance. After a disaster, I’d be willing to pay the download costs to retrieve the family photos. In my professional role (where this is all speculative), I’d take David Rosenthal’s concerns seriously and avoid lock-in: as long as we’ve got local copies, we could move our content to a competitor of Amazon’s without incurring the retrieval costs.

Finally, my first experience with node.js has been great, and I’ll definitely be putting some time into learning more.