Contributing a Book to Internet Archive

Yesterday I ran across an interesting booklet in my grandparents’ papers. It was published before 1977 without a copyright notice, and is therefore in the public domain in the US. Since I have an interest in user-contributed items for digitization projects at the moment, I thought I would scan it and contribute it to the Internet Archive. After a hasty reading of the FAQ I proceeded to mess up the process, so I thought I’d document it here.
The booklet is a prospectus for an art photography course offered by the New York photographer Rabinovitch (who went by his surname alone). My grandmother was a student of his in 1938, and a photograph of hers is on p. 8, one of the ones I posted on my other blog. The booklet is listed in WorldCat (from the Getty) and in Google Books, but as far as I could find it hasn’t yet been digitized.
My item was a 32-page booklet with covers, so 36 page images in all. To upload a scanned book, you need each page in a separate image. I scanned two-page openings at 300 dpi in 24-bit colour and saved them as TIFFs. I cropped each scan to the two-page spread, then used ImageMagick to split each one into left and right pages:
convert -crop 50%x100% +repage Rabinovitch-00.tif output/Rabinovitch-00-%d.tif
(That produced output/Rabinovitch-00-0.tif and output/Rabinovitch-00-1.tif; ImageMagick’s %d numbering starts at 0 unless you tell it otherwise.) Having tested it, I ran a quick awk job to generate a script to split all the images:
ls -1 *.tif | awk -F. '{ print "convert -crop 50%x100% +repage " $1 ".tif output/" $1 "-%d.tif" ;}' > split.sh
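The same batch split can also be written as a plain shell loop, which avoids generating an intermediate script and copes with spaces in filenames. This is only a sketch under a few assumptions: `split_spreads` and the `DRY_RUN` variable are made-up names for illustration, the scans sit in the current directory, and the halves go into a subdirectory (default `output`) that the function creates if needed.

```shell
#!/bin/sh
# Sketch: split every two-page scan in the current directory into halves.
# split_spreads and DRY_RUN are hypothetical names, not part of ImageMagick.
# Set DRY_RUN=1 to print the convert commands instead of running them.
split_spreads() {
    outdir=${1:-output}
    mkdir -p "$outdir"                 # the convert step needs this to exist
    for f in *.tif; do
        [ -e "$f" ] || continue        # no .tif files: the glob stayed literal
        base=${f%.tif}                 # filename without the extension
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "convert -crop 50%x100% +repage $f $outdir/$base-%d.tif"
        else
            convert -crop 50%x100% +repage "$f" "$outdir/$base-%d.tif"
        fi
    done
}

# Usage (from the directory holding the scans):
#   DRY_RUN=1 split_spreads output    # preview the commands
#   split_spreads output              # actually split the images
```

Running it once with `DRY_RUN=1` is a cheap way to check the generated filenames before committing to a long batch.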
I had scanned the cover as a two-page spread, so I had to rename the image of the back cover to put it at the end of the sequence: filename sort-order seems to determine the order of the images in the final product. Finally (and this is the bit I missed the first time), I combined all the tiffs into a pdf:
convert Rabinovitch-*.tif rabinovitch.pdf
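Since filename sort order seems to decide page order, it can be worth fixing the back cover's name and counting the page images before building the PDF. The sketch below is one way to do that, not a record of what was actually run: `move_to_end` and `check_page_count` are hypothetical helpers, the example filename is a guess at which half holds the back cover, and the `-99-back` name assumes no scan number reaches 99.

```shell
#!/bin/sh
# Sketch: put one page image at the end of the sort order, then check the
# total. Both function names are hypothetical helpers for illustration.

# Rename a page image so it sorts after the others.
# Assumes names like Prefix-NN-N.tif with scan numbers below 99.
move_to_end() {
    src=$1
    dir=$(dirname "$src")
    name=$(basename "$src")
    prefix=${name%%-*}                 # text before the first hyphen
    mv "$src" "$dir/$prefix-99-back.tif"
}

# Succeed only if the directory holds the expected number of .tif files.
check_page_count() {
    dir=$1
    expected=$2
    count=0
    for f in "$dir"/*.tif; do
        if [ -e "$f" ]; then
            count=$((count + 1))
        fi
    done
    if [ "$count" -eq "$expected" ]; then
        echo "ok: $count page images"
    else
        echo "expected $expected page images, found $count" >&2
        return 1
    fi
}

# Usage (the filename is an assumption about which half is the back cover):
#   move_to_end output/Rabinovitch-00-0.tif
#   check_page_count output 36
```

Listing the directory with `ls` afterwards shows the order in which a glob such as `Rabinovitch-*.tif` hands the pages to `convert`.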
Now I had a 730 MB pdf. Off to the Internet Archive Contribution page, where I logged in with the account I created a while ago when I uploaded a video. Click the “Upload” button at the upper right to get to the main uploading page, then click “Share” at the upper right, and select your pdf. A Flash app uploads it; it took a couple of hours on my home connection. Meanwhile you have a form to fill in. It’s more oriented to a/v uploads than to books, and you don’t actually get to tell it what kind of item you’re uploading, but the process will figure it out in the end. You can fill in a title (which gets converted into an Internet Archive ID), a description and an author, and that’s it for metadata. You can also apply a license. If you choose to dedicate your item to the public domain, you’ll be given the option in the next step to mark it as already in the public domain, which was appropriate for my item.
When the upload completed I submitted the form, and after a minute or so it told me my item was available and gave me a url. And it was available, but it hadn’t finished processing yet: all I could get was the original pdf that I had uploaded. There is a link on the upload page to a list of your unfinished tasks: go there, and you’ll see the processing task still running. Click to see a log of the ongoing processing (here’s mine, after completion). You can see which node in Internet Archive’s storage array received and processed the item, the job priority (-1: hey, I don’t mind), the generation of the xml (including the assignment of an ARK), the replication of the files to another node with rsync, and finally the generation of the various derivatives (OCR, images, pdfs, etc.), with the results being rsynced to the mirror node. The whole process took a little more than ten minutes.
The outcome is http://www.archive.org/details/RabinovitchProspectus1938. It’s in the Community Texts collection (formerly known as the Open Source Collection), and it has the full Internet Archive treatment. The photographs in the offset-print original have been nicely descreened, and look quite good in the book-viewer presentation. All it lacked was an Open Library page, so I activated my mad cataloguing skillz and created one.
I also have the option to edit this item; this brings up another form with more fields, including date, notes, and rights – things I couldn’t fill in from the upload screen. I can also modify the files if I like.
And here’s the result, all ready to help some future historian of the Depression-era New York photography scene decide whether they need a trip to the Getty:
Thanks for documenting this process. I've got some materials to get up there as well. PS: I added the cover to the OL page for you. :-)