Quædam cuiusdam
Getting Serious with Amazon Glacier
Wednesday 29 August 2012 @ 6:53 pm

After playing around last weekend, I’m ready to push real content up into Amazon Glacier for long-term preservation. I’ve worked up an Ant script to do the work, and I’ve posted it on GitHub as pbinkley/glacier-ant-bagit.

Here’s what I’m doing (there’s a rough sketch of the flow after the list):

  • work at the directory level: each Glacier archive will be a tar file containing the contents of a single directory.
  • package the content with BagIt so that it will be self-verifying when (if ever) I download it again.
  • keep metadata locally: at the moment the script maintains a CSV file, and eventually I’ll load that into a database. It also saves the bag manifest to provide a full list of files.
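
The real work is done by the Ant script linked above; just to make the flow concrete, here’s a minimal Node-flavoured sketch of the same per-directory cycle. The helper names are hypothetical placeholders, and the upload step stands in for whatever client you use (see the next paragraph), so treat it as a sketch rather than the actual script:

// Hypothetical sketch of the per-directory cycle: tar the bagged directory,
// upload it as a single Glacier archive, and record the metadata locally.
var fs = require('fs');
var execSync = require('child_process').execSync;

// Placeholder for whatever client actually does the upload (glacier-cli, an
// SDK call, etc.); it should return the archive ID that Glacier assigns.
function uploadToGlacier(vault, tarball) {
    throw new Error('wire up your Glacier client here');
}

function archiveDirectory(dir, vault, csvPath) {
    var tarball = dir.replace(/\/$/, '') + '.tar';

    // 1. tar the directory (assumed to be a BagIt bag already)
    execSync('tar -cf ' + tarball + ' ' + dir);

    // 2. upload and capture the archive ID
    var archiveId = uploadToGlacier(vault, tarball);

    // 3. record the mapping locally, since Glacier won't do it for you
    var row = [dir, tarball, archiveId, new Date().toISOString()].join(',');
    fs.appendFileSync(csvPath, row + '\n');
}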

I’m using a new command-line interface to do the uploading: carlossg/glacier-cli. I liked glacierFreezer’s use of SimpleDB to store file metadata in Amazon’s cloud, but at the moment glacierFreezer only connects to the us-east-1 region, and I wanted to use something a little closer to home. The code of glacierFreezer isn’t currently open, either.

It’s ready to go, so I’ll let a large job run overnight and see how it works out.


Update (next day)

[Photo: Gary and Dave]

The script is working on the Ubuntu family fileserver. Overnight it uploaded 70 directories with almost 3 GB of stuff (the limiting factor being, obviously, our home internet connection). I fixed a couple of bugs; if you’ve taken a copy from GitHub, you should do a pull.

So: I’m on the hook for 3¢/month so far, in perpetuity, and 36¢ if I ever need to retrieve those directories (if I go over my 5%/month retrieval allowance). In exchange, the risk of loss of those old images has been reduced … by how much? How could you calculate that? I maintain four local copies already, three at home and one in my office, but a fire could easily bring me down to depending on a single spindle; and I’m getting older and will become increasingly likely to make mistakes with this stuff. Eventually these objects will become a digital inheritance. The more serious dangers they must survive to get through the next fifty years are probably those of human error, indifference, and absence of the right technical skills at the right moment. Multiplying copies onto different platforms, beyond the number required by a continuity-of-service calculation, makes sense: creating impediments to managing this stuff comprehensively reduces the potential consequences of massive failures of management. I hope.
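
For the record, the arithmetic behind those numbers is just the advertised rates applied to the roughly 3 GB uploaded so far:

// Back-of-envelope costs for ~3 GB at Glacier's 2012 rates.
var storedGB = 3;
var storagePerMonth = storedGB * 0.01; // 1¢/GB-month -> $0.03/month
var fullRetrieval = storedGB * 0.12;   // 12¢/GB      -> $0.36 if I blow past the free allowance
console.log('$' + storagePerMonth.toFixed(2) + '/month to store, $' + fullRetrieval.toFixed(2) + ' to pull it all back');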




Playing with Amazon Glacier
Saturday 25 August 2012 @ 2:17 pm

This week Amazon released its new digital preservation platform Glacier. It is similar to the S3 storage service, but optimized for long-term, low-access storage. You pay a penny per GB per month, and you accept that access will be slow (four hours or more) and expensive (12 cents/GB, with free access to 5% of your content each month). I’ve been storing family digital assets on S3 as a remote backup, and Glacier will save me a few bucks each month. I’ll only need to access the content if my local backups fail, so I can accept the barriers to access. And, of course, at work we’re interested in low-cost off-site replication. So, let’s check it out.

The initial offering from Amazon has a web management console, Java and .NET SDKs, and a REST API, but no user-friendly client. Third parties are starting to release clients, though, and there’s enough there to work with. Within a few days there will be more.

Getting started is easy: just activate Glacier in your AWS account. The data model is simple: “vaults” contain “archives”, which as far as I’m concerned are simply files. You can create vaults through the web console and tie them into Amazon’s SNS notification service, but that’s as far as you can get; to upload a file you need a client.

I started with the glacierFreezer command-line client, which is based on the Java SDK. It makes use of an Amazon SimpleDB domain to store information about your archives, so you need to create one for it first. Then gather your access key and secret key (from the “Security Credentials” tab in the web console), and run it:

java -jar glacierFreezer.jar <accessKey> <secretKey> <simpleDbDomainName> <vaultName> <fileName>

Up the file goes, and the results are stored by glacierFreezer in the SimpleDB domain:

<Item>
   <Name>readme-pbinkley.txt</Name>
   <Attribute>
      <Name>sizeBytes</Name>
      <Value>36</Value>
   </Attribute>
   <Attribute>
      <Name>archiveId</Name>
      <Value>E4lnahK_rbGeenbN07Yc3Myl3FLuJ6IhlmnLSHeAlYmfilRiiJ-3aCCs8C2lPgocUvmBIYpY2lIR1tWmVfXeji73WJrHKqIw9snU8ADWBkPO92Dp688E-mMyLCTMT-s1A7_D2bxxOQ</Value>
   </Attribute>
   <Attribute>
      <Name>fileName</Name>
      <Value>readme-pbinkley.txt</Value>
   </Attribute>
   <Attribute>
      <Name>localDateTime</Name>
      <Value>Thu Aug 23 12:36:42 MDT 2012</Value>
   </Attribute>
</Item>

That’s a lot of information you’re going to need to keep track of, because Glacier won’t keep track of it for you. If you want to be able to restore an archive to a local file with the same name as it had before you uploaded it, you need to remember the mapping of the archiveId to the fileName.

At this point I was blocked again, since glacierFreezer doesn’t yet have the functionality to do anything with an archive in a vault (give it a few days). The day after the upload, when Glacier had done its daily job of generating inventories, I could at least see in the web console that the vault had been populated:

We’ve got an archive! The file I uploaded was only a few bytes, so the 32 KB size presumably represents the block size of Glacier’s file system.

This morning I looked again for Glacier clients, and found that the Node.js project node-awssum had added Glacier to the list of supported Amazon APIs. I’ve been meaning to play with Node.js for a while so I jumped on it. I installed Node.js and its package manager npm according to these instructions, then installed node-awssum (and the required package fmt) with a lovely simple

npm install fmt
npm -d install awssum

The Glacier examples that come with node-awssum cover fetching vault descriptions and such, but not the job-oriented tasks that I need at this point. To fetch an inventory of a vault, or an archive from that vault, you need to initiate a job and wait for Glacier to let you know it’s done (which they say takes four hours). Not to worry, though: node-awssum is easy to work with. I copied one of the examples and created a script “inventory-retrieval.js” like this:

var fmt = require('fmt');
var awssum = require('awssum');
var amazon = awssum.load('amazon/amazon');
var Glacier = awssum.load('amazon/glacier').Glacier;

var accessKeyId = 'xxxxxxxxxxxxxxxx';
var secretAccessKey = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx';
var awsAccountId = 'xxxxxxxxxxxxx'; // note: omit hyphens

var glacier = new Glacier({
    'accessKeyId' : accessKeyId,
    'secretAccessKey' : secretAccessKey,
    'awsAccountId' : awsAccountId, // required
    'region' : amazon.US_EAST_1
});

fmt.field('Region', glacier.region() );
fmt.field('EndPoint', glacier.host() );
fmt.field('AccessKeyId', glacier.accessKeyId().substr(0,3) + '...' );
fmt.field('SecretAccessKey', glacier.secretAccessKey().substr(0,3) + '...' );
fmt.field('AwsAccountId', glacier.awsAccountId() );

glacier.InitiateJob({ VaultName : 'test', Type: 'inventory-retrieval' }, function(err, data) {
    fmt.msg("describing vault - expecting success");
    fmt.dump(err, 'Error');
    fmt.dump(data, 'Data');
});

I run that and get the following output:

$ node inventory-retrieval.js
Region : us-east-1
EndPoint : glacier.us-east-1.amazonaws.com
AccessKeyId : AKI...
SecretAccessKey : 3HO...
AwsAccountId : xxxxxxxxxxxxxxx
describing vault - expecting success
Error : null
Data : { StatusCode: 202,
  Headers:
   { 'x-amzn-requestid': 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
     location: '/xxxxxxxxxxxxxxx/vaults/test/jobs/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
     'x-amz-job-id': 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
     'content-type': 'application/json',
     'content-length': '2',
     date: 'Sat, 25 Aug 2012 18:36:32 GMT' },
  Body: '' }

So, I’ve successfully created the job. And now I wait, savoring the full experience of Glacier’s slow retrieval, which is going to save me so much money compared to S3. I’ll update this post when I get Glacier’s notification that it’s complete. Meanwhile, I’ll contemplate David Rosenthal’s analysis of Glacier’s costs.
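
(Rather than waiting for the notification, the job can also be polled directly. I haven’t tried this, and I’m assuming node-awssum exposes DescribeJob with VaultName and JobId parameters in the same style as InitiateJob, mirroring the REST API; the job ID is the x-amz-job-id header in the response above.)

// Untested polling sketch, reusing the glacier object from inventory-retrieval.js.
// Assumption: node-awssum exposes DescribeJob({ VaultName, JobId }, callback).
glacier.DescribeJob({
    VaultName : 'test',
    JobId : 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx' // the x-amz-job-id from above
}, function(err, data) {
    fmt.dump(err, 'Error');
    fmt.dump(data, 'Data'); // look for Completed: true / StatusCode: 'Succeeded'
});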


Update 1

The job took just over four hours; I didn’t get a notification (have to look into that, probably my misconfiguration), but the job description shows the time. The next step is to retrieve the job output, and it turns out that node-awssum hasn’t finished this function: it generates the URI for a job description rather than the job output. The code was easy to patch, so I was able to retrieve the inventory:


Body:
 { VaultARN: 'arn:aws:glacier:us-east-1:xxxxxxxxxx:vaults/test',
   InventoryDate: '2012-08-24T13:54:52Z',
   ArchiveList:
    [ { ArchiveId: 'E4lnahK_rbGeenbN07Yc3Myl3FLuJ6IhlmnLSHeAlYmfilRiiJ-3aCCs8C2lPgocUvmBIYpY2lIR1tWmVfXeji73WJrHKqIw9snU8ADWBkPO92Dp688E-mMyLCTMT-s1A7_D2bxxOQ',
        ArchiveDescription: 'Archived file readme-pbinkley.txt Thu Aug 23 12:36:41 MDT 2012',
        CreationDate: '2012-08-23T18:36:42Z',
        Size: 36,
        SHA256TreeHash: 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx' } ] } }

So it does have my original file name, but only in a text description. This response gives me the archive ID of my file (which I had anyway, because glacierFreezer saved it for me – nice to see that they agree). Just to close the loop, I’ve initiated the archive-retrieval job.
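
The call looks just like the inventory job with the archive ID added. This is a guess at the parameters: I’m assuming node-awssum passes ArchiveId through as a top-level parameter the way it does VaultName and Type, so check the library’s examples before relying on it.

// Sketch of the archive-retrieval job, reusing the setup from inventory-retrieval.js.
// Assumption: ArchiveId is accepted alongside VaultName and Type.
glacier.InitiateJob({
    VaultName : 'test',
    Type : 'archive-retrieval',
    ArchiveId : 'E4lnahK_rbGeenbN07Yc3Myl3FLuJ6IhlmnLSHeAlYmfilRiiJ-3aCCs8C2lPgocUvmBIYpY2lIR1tWmVfXeji73WJrHKqIw9snU8ADWBkPO92Dp688E-mMyLCTMT-s1A7_D2bxxOQ'
}, function(err, data) {
    fmt.msg("initiating archive retrieval - expecting success");
    fmt.dump(err, 'Error');
    fmt.dump(data, 'Data');
});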


Update 2

I posted an issue about the problem with get-job-output in node-awssum. Heard a couple of hours later that it’s been fixed in master; tried it, it works. God, I love open source.


Update 3 (next day)

The archive retrieval job finished after 4½ hours, and I can retrieve my file. I get a SHA-256 hash in the header to let me verify the content. For some reason Amazon doesn’t pay attention to my byte-range request if I try to retrieve less than the whole file; perhaps it’s because the file is so short, just 36 bytes. I’ll try that again when I’ve uploaded something bigger. And it turns out the notifications were coming through to my email after all: dunno how I overlooked them. So all is good.
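
Verifying the download is straightforward: for an archive of 1 MB or less, the tree hash Glacier returns is simply the SHA-256 of the file (larger archives need the full 1 MB-chunk tree-hash computation), so a check like this does it. The expected value is a placeholder for the x-amz-sha256-tree-hash header from the job output:

// Compute the SHA-256 of a downloaded file and compare it with the hash
// Glacier returned (valid as-is only for archives of 1 MB or less).
var crypto = require('crypto');
var fs = require('fs');

function sha256Hex(path) {
    return crypto.createHash('sha256').update(fs.readFileSync(path)).digest('hex');
}

var expected = 'xxxxxxxx'; // placeholder: the x-amz-sha256-tree-hash header value
console.log(sha256Hex('readme-pbinkley.txt') === expected ? 'hash OK' : 'hash MISMATCH');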

It’s easy to imagine a full-scale retrieval process that would manage the initiation of retrieval jobs, monitor the notification stream (which uses Amazon’s Simple Notification Service and can therefore push notifications using a variety of protocols), and fetch the output when it receives notification that a job is ready. Amazon says that output is available for at least 24 hours after the job completes, so you would want to manage the chunking of jobs in such a way as to avoid retrieving more content than you can download in a day, taking into account the somewhat convoluted calculations required to avoid overrunning your 5% monthly free download allowance.
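
As a first pass at that calculation (ignoring the peak-rate fine print in Amazon’s pricing), the free allowance works out to 5% of what you store per month, pro-rated daily:

// Rough daily free-retrieval allowance: 5% of stored data per month, pro-rated
// over ~30 days. A planning estimate only; the billed cost actually depends on
// your peak retrieval rate.
function dailyFreeAllowanceGB(totalStoredGB) {
    return totalStoredGB * 0.05 / 30;
}

// e.g. with 500 GB in Glacier you could pull back about 0.83 GB per day free,
// so a full restore wants to be split into many small daily jobs, each fetched
// within the ~24 hours its output stays available.
console.log(dailyFreeAllowanceGB(500).toFixed(2) + ' GB/day');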

I’m currently using S3 for offsite backup of my personal digital archive, and moving it to Glacier is a no-brainer. I don’t expect ever to retrieve this stuff, since I keep multiple local copies: it’s fire insurance. After a disaster, I’d be willing to pay the download costs to retrieve the family photos. In my professional role (where this is all speculative), I’d take David Rosenthal’s concerns seriously and avoid lock-in: as long as we’ve got local copies, we could move our content to a competitor of Amazon’s without incurring the retrieval costs.

Finally, my first experience with Node.js has been great, and I’ll definitely be putting some time into learning more.




Contributing a Book to Internet Archive
Sunday 4 March 2012 @ 4:27 pm

[Image: Page 8, with photograph by Frances Binkley]

Yesterday I ran across an interesting booklet in my grandparents’ papers; and noting that it was published before 1977 without a copyright notice and was therefore in the public domain in the US, and having an interest in user-contributed items for digitization projects at the moment, I thought I would scan it and contribute it to the Internet Archive. After a hasty reading of the FAQ I proceeded to mess up the process, so I thought I’d document it here.

The booklet is a prospectus for an art photography course offered by the New York photographer Rabinovitch (who went by his surname alone). My grandmother was a student of his in 1938, and a photograph of hers is on p.8, one of the ones I posted on my other blog. The booklet is in WorldCat (from the Getty) and Google Books, but hasn’t yet been digitized, as far as I found.

My item was a 32-page booklet with covers, so 36 page images in all. To upload a scanned book, you need each page in a separate image. I scanned two-page openings at 300 dpi, 24-bit colour, saved as TIFFs. I cropped each one to the two-page spread, then used ImageMagick to split each one into left and right pages:

convert -crop 50%x100% +repage Rabinovitch-00.tif output/Rabinovitch-00-%d.tif

(That produced output/Rabinovitch-00-0.tif and output/Rabinovitch-00-1.tif.) Having tested it, I ran a quick awk job to generate a script to split all the images:

ls -1 *.tif | awk -F. '{ print "convert -crop 50%x100% +repage " $1 ".tif output/" $1 "-%d.tif" ;}' > split.sh

I had scanned the cover as a two-page spread, so I had to rename the image of the back cover to put it at the end of the sequence: filename sort-order seems to determine the order of the images in the final product. Finally (and this is the bit I missed the first time), I combined all the tiffs into a pdf:

convert Rabinovitch-*.tif rabinovitch.pdf

Now I had a 730 MB PDF. Off to the Internet Archive Contribution page, where I logged in with the account I created a while ago when I uploaded a video. Click the “Upload” button at the upper right to get to the main uploading page, then click “Share” at the upper right, and select your PDF. A Flash app uploads it; it took a couple of hours on my home connection. Meanwhile you have a form to fill in. It’s more oriented to A/V uploads than to books, and you don’t actually get to tell it what you’re uploading, but the process will figure it out in the end. You can fill in a title (which gets converted into an Internet Archive ID), a description, and an author, and that’s it for metadata. You can also apply a license. If you choose to dedicate your item to the public domain, you’ll be given the option in the next step to mark it as already in the public domain, which was appropriate for my item.

When the upload completed I submitted the form, and after a minute or so it told me my item was available and gave me a URL. And it was available, but it hadn’t finished processing yet: all I could get was the original PDF that I had uploaded. There is a link on the upload page to a list of your unfinished tasks: go there, and you’ll see the processing task still running. Click it to see a log of the ongoing processing (here’s mine, after completion). You can see which node in Internet Archive’s storage array received and processed the item, the job priority (-1: hey, I don’t mind), the generation of the XML (including the assignment of an ARK), the replication of the files to another node with rsync, and finally the generation of the various derivatives (OCR, images, PDFs, etc.), with the results being rsynced to the mirror node. The whole process took a little more than ten minutes.

The outcome is http://www.archive.org/details/RabinovitchProspectus1938. It’s in the Community Texts collection (formerly known as the Open Source Collection), and it has the full Internet Archive treatment. The photographs in the offset-print original have been nicely descreened, and look quite good in the book-viewer presentation. All it lacked was an Open Library page, so I activated my mad cataloguing skillz and created one.

I also have the option to edit this item; this brings up another form with more fields, including date, notes, and rights — things I couldn’t fill in from the upload screen. I can also modify the files if I like.

And here’s the result, all ready to help some future historian of the Depression-era New York photography scene decide whether they need a trip to the Getty:



