MarcEdit Internet Archive/HathiTrust Data Packager Plugin

The Internet Archive does a lot of wonderful things, including digitizing books for libraries.  At Ohio State University, the Libraries worked with other CIC members to contribute roughly a million items as part of the Google Books project and, ultimately, for inclusion in the HathiTrust.  When this project wrapped up, we still wanted to add materials to the HathiTrust, but were looking for a way to maximize access to the materials (and simplify some of the digitization processing).  Enter the Internet Archive.  Like many other cultural heritage organizations, we provide them with content that they host.  You can see the collection here: https://archive.org/details/OhioStateUniversityLibrary.

One of the added benefits for Ohio State of working with the Internet Archive is that a simple, well-documented path exists for moving content between the Internet Archive and the HathiTrust.  Using HathiTrust's metadata specifications (https://www.hathitrust.org/bib_specifications), users can create MARCXML files that include the information HathiTrust needs to retrieve content digitized into the Internet Archive.

To simplify this process at OSU, I’ve created a plugin for MarcEdit, available in all versions of MarcEdit.  It lets a user point to a collection or contributor and a date range, and generates a single MARCXML file that will facilitate the upload of that content into the HathiTrust.  For information on downloading and managing plugins in MarcEdit, see: Managing Plugins in MarcEdit

The Plugin

 

So what exactly does the tool do?  Basically, the tool runs a query against the Internet Archive using either the Contributor information or Collection identifier.  The tool then collects the identifiers of all the materials found between the noted dates, and extracts the MARC and the structural metadata at the Internet Archive.  The tool then reads the data files, converting them to MARCXML and concatenating all records together into a single MARCXML file.
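To make those steps a little more concrete, here is a rough Python sketch of the same idea. It is not the plugin's source code, just an illustration under some assumptions: it uses the Internet Archive's public advancedsearch.php endpoint, it assumes the date range is applied to the publicdate field (the plugin may well use a different date field), and the function names build_search_url and merge_marcxml are made up for this example. Downloading each item's MARC file is left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

MARC_NS = "http://www.loc.gov/MARC21/slim"
# Serialize MARCXML elements with a default namespace instead of ns0: prefixes.
ET.register_namespace("", MARC_NS)


def build_search_url(field, value, start_date, end_date, rows=100):
    """Build a search URL that lists item identifiers for one grouping.

    `field` is "collection" or "contributor"; dates are YYYY-MM-DD strings.
    Filtering on `publicdate` is an assumption made for this sketch.
    """
    if field not in ("collection", "contributor"):
        raise ValueError("field must be 'collection' or 'contributor'")
    query = f'{field}:"{value}" AND publicdate:[{start_date} TO {end_date}]'
    params = [
        ("q", query),
        ("fl[]", "identifier"),  # only return the item identifiers
        ("rows", rows),
        ("output", "json"),
    ]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


def merge_marcxml(documents):
    """Concatenate MARCXML documents (strings) into one <collection>.

    Each input may be a bare <record> or a <collection> of records.
    """
    collection = ET.Element(f"{{{MARC_NS}}}collection")
    for doc in documents:
        root = ET.fromstring(doc)
        # iter() also yields the root itself when it is a <record>.
        collection.extend(root.iter(f"{{{MARC_NS}}}record"))
    return ET.tostring(collection, encoding="unicode")


if __name__ == "__main__":
    print(build_search_url("collection", "OhioStateUniversityLibrary",
                           "2015-01-01", "2015-12-31"))
```

In this sketch, the identifiers returned by the search would be used to download each item's MARC data, and the resulting record strings would be passed to merge_marcxml to produce the single file.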

At this point, I’m looking for a couple of folks who use the Internet Archive, contribute (or would like to contribute) to the HathiTrust, and use MarcEdit, and who might be able to put this through its paces.  We know that it works here at OSU; I’d like to validate that these assumptions hold more broadly.

For those interested in the source code, it will be available in my GitHub repo as soon as I get a chance to package it.

Troubleshooting

Probably the most difficult part of the process is figuring out what value the plugin needs in order to group the data.  MarcEdit provides two options for grouping data: by collection or by contributor.  Both values are easily found on any book display page in the Internet Archive.

By Contributor: 

By Collection: