Digitization « Quædam cuiusdam
Contributing a Book to Internet Archive
Sunday 4 March 2012 @ 4:27 pm

Page 8, with photograph by Frances Binkley

Yesterday I ran across an interesting booklet in my grandparents’ papers; and noting that it was published before 1977 without a copyright notice and was therefore in the public domain in the US, and having an interest in user-contributed items for digitization projects at the moment, I thought I would scan it and contribute it to the Internet Archive. After a hasty reading of the FAQ I proceeded to mess up the process, so I thought I’d document it here.

The booklet is a prospectus for an art photography course offered by the New York photographer Rabinovitch (who went by his surname alone). My grandmother was a student of his in 1938, and a photograph of hers is on p.8, one of the ones I posted on my other blog. The booklet is in WorldCat (from the Getty) and Google Books, but hasn’t yet been digitized, as far as I found.

My item was a 32-page booklet with covers, so 36 page images in all. To upload a scanned book, you need each page in a separate image. I scanned two-page openings at 300 dpi, 24-bit colour, saved as tiffs. I cropped each one to the two-page spread, then used ImageMagick to split each one into left and right pages:

convert -crop 50%x100% +repage Rabinovitch-00.tif output/Rabinovitch-00-%d.tif

(That produced output/Rabinovitch-00-1.tif and output/Rabinovitch-00-2.tif.) Having tested it, I ran a quick awk job to generate a script to split all the images:

ls -1 *.tif | awk -F. '{ print "convert -crop 50%x100% +repage " $1 ".tif output/" $1 "-%d.tif" ;}' > split.sh
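A plain shell function could generate the same commands. This is only a sketch: split_commands is a made-up helper name, and it merely prints the commands so they can be looked over before anything is run. One small difference from the awk -F. version is that it strips only the trailing .tif, so a filename with extra dots in it would survive intact.

```shell
# Sketch: print one ImageMagick split command per scanned spread.
# split_commands is a hypothetical helper name, not part of ImageMagick.
split_commands() {
    for f in "$@"; do
        base="${f%.tif}"    # strip only the trailing .tif extension
        printf 'convert -crop 50%%x100%% +repage "%s" "output/%s-%%d.tif"\n' "$f" "$base"
    done
}

# Review the printed commands first, then pipe them to sh to run them:
#   mkdir -p output
#   split_commands *.tif          # just prints the commands
#   split_commands *.tif | sh     # actually splits the images
```

Printing first and running second keeps the same two-step habit as writing split.sh and inspecting it.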

I had scanned the cover as a two-page spread, so I had to rename the image of the back cover to put it at the end of the sequence: filename sort-order seems to determine the order of the images in the final product. Finally (and this is the bit I missed the first time), I combined all the tiffs into a pdf:

convert Rabinovitch-*.tif rabinovitch.pdf
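The renaming step that has to happen before the pdf is built might be sketched as follows. Everything named here is an assumption for illustration: move_to_end is a made-up helper, the file name in the example is a guess at which half of the cover spread was the back cover, and 99 is just an arbitrary number chosen to sort after every real page.

```shell
# Sketch: rename a page image so that it sorts after all the others.
# move_to_end and the "-99-back" naming are hypothetical choices.
move_to_end() {
    dir=$(dirname "$1")
    name=$(basename "$1")
    # Keep the part before the first hyphen (e.g. "Rabinovitch") so the
    # renamed file still matches a glob like Rabinovitch-*.tif.
    mv "$1" "$dir/${name%%-*}-99-back.tif"
}

# Example with a hypothetical file name:
#   move_to_end output/Rabinovitch-00-1.tif
#   ls output/    # the back cover is now listed last
```

Since filename sort order seems to decide page order, it is worth checking the ls listing before running the convert command above.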

Now I had a 730 MB pdf. Off to the Internet Archive Contribution page, where I logged in with the account I created a while ago when I uploaded a video. Click the “Upload” button at the upper right to get to the main uploading page, then click “Share” at the upper right, and select your pdf. A Flash app uploads it; it took a couple of hours on my home connection. Meanwhile you have a form to fill in. It’s more oriented to a/v uploads than to books, and you don’t actually get to tell it what you’re uploading, but the process will figure it out in the end. You can fill in a title (which gets converted into an Internet Archive ID) and a description and an author, and that’s it for metadata. You can also apply a license. If you choose to dedicate your item to the public domain, you’ll be given the option in the next step to mark it as already in the public domain, which was appropriate for my item.

When the upload completed I submitted the form, and after a minute or so it told me my item was available and gave me a url. And it was available, but it hadn’t finished processing yet: all I could get was the original pdf that I had uploaded. There is a link on the upload page to a list of your unfinished tasks: go there, and you’ll see the processing task still running. Click to see a log of the ongoing processing (here’s mine, after completion). You can see which node in Internet Archive’s storage array received and processed the item, the job priority (-1: hey, I don’t mind), the generation of the xml (including the assignment of an ARK), replication of the files to another node with rsync, and finally the generation of the various derivatives: OCR, images, pdfs, etc., with the results being rsynced to the mirror node. The whole process took a little more than ten minutes.

The outcome is http://www.archive.org/details/RabinovitchProspectus1938. It’s in the Community Texts collection (formerly known as the Open Source Collection), and it has the full Internet Archive treatment. The photographs in the offset-print original have been nicely descreened, and look quite good in the book-viewer presentation. All it lacked was an Open Library page, so I activated my mad cataloguing skillz and created one.

I also have the option to edit this item; this brings up another form with more fields, including date, notes, and rights — things I couldn’t fill in from the upload screen. I can also modify the files if I like.

And here’s the result, all ready to help some future historian of the Depression-era New York photography scene decide whether they need a trip to the Getty:

Google and the Digital Border
Friday 1 February 2008 @ 3:39 pm

Here’s something I had been waiting for a chance to quantify. Dian Schaffhauser writes:

Type “sonoma” and “mission” into books.google.com and choose “Full view” to eliminate those books that haven’t granted permission to be fully displayed or that are still in copyright because they were published post-1923. About 550 titles show up…

Or try it from the frozen wastes north of the 49th parallel, and 166 titles show up. Google is so careful about copyright, it hurts.

Camouflage class
Thursday 17 January 2008 @ 2:29 pm

Camouflage class at N[ew] Y[ork] University, where men and women are preparing for jobs in the Army or in industry, New York, N.Y. This model has been camouflaged and photographed. The girl is correcting oversights detected in the camouflaging of a model

Originally uploaded by The Library of Congress

Because I can, here’s a photo from the wonderful set uploaded to Flickr by the Library of Congress. Note that they’re not just sharing; they’re also hoping to harvest user-created metadata. Camouflage class – how come that wasn’t offered when I was at school?
