Chapter 3: Slice, Dice, and Join your Records Again

Say I have that same file of 100,000 records and I’d like to generate 5 files from this larger record set.  What I cannot do is check # of files, set it to 5, and expect MarcEdit’s MARCSplit tool to automatically calculate how many records belong in each file.  The tool does not know how many records a file contains until the process has completed.  In this case, you’d have to know that you have 100,000 records, and set Records per File to 20,000 (100,000 divided by 5) in order to get exactly 5 smaller files.
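The arithmetic can also be worked out ahead of time, outside of MarcEdit.  The Python below is a minimal sketch and not part of MarcEdit: it assumes a binary MARC (ISO 2709) file in which every record ends with the record terminator byte 0x1D, and the function names count_records and records_per_file are hypothetical names chosen for illustration.

```python
import math

# Assumption: binary MARC (ISO 2709), where each record ends with byte 0x1D.
RECORD_TERMINATOR = b"\x1d"


def count_records(path, chunk_size=1 << 20):
    """Count records by counting record terminators, reading in chunks."""
    total = 0
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            total += chunk.count(RECORD_TERMINATOR)
    return total


def records_per_file(total_records, file_count):
    """Return the Records per File value that yields at most file_count files."""
    if file_count < 1:
        raise ValueError("file_count must be at least 1")
    # Round up so leftover records do not spill into an extra file.
    return math.ceil(total_records / file_count)
```

For the example above, records_per_file(100000, 5) gives 20000, which is the number to enter in Records per File.  Rounding up matters when the total does not divide evenly; otherwise the remainder would land in a sixth file.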

Finally, when MarcEdit’s MARCSplit tool generates new files, it names them using the pattern msplit0000000.mrc and increments the file number until the process completes.  However, if you are splitting a large file into individual files with one record per file, MarcEdit will prompt you during processing to ask whether you’d like to use one of the values in the MARC record to name each file (e.g., a control number like the 001, or the 245$a).
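To make the splitting and naming behavior concrete, here is a rough Python sketch of the same idea.  It is not how MarcEdit is implemented.  It assumes well-formed binary MARC records whose first five bytes hold the record length, and it assumes a zero-based, seven-digit counter, which is inferred only from the example filename msplit0000000.mrc.  The names read_records and split_file are hypothetical.

```python
from pathlib import Path


def read_records(fh):
    """Yield raw records using the 5-digit record length at the start of each leader."""
    while True:
        length_field = fh.read(5)
        if len(length_field) < 5:
            return  # end of file
        length = int(length_field)
        # Assumption: the stated length is correct and the record is not truncated.
        yield length_field + fh.read(length - 5)


def split_file(source, out_dir, records_per_file):
    """Write source into numbered files holding records_per_file records each."""
    if records_per_file < 1:
        raise ValueError("records_per_file must be at least 1")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out = None
    with open(source, "rb") as fh:
        for index, record in enumerate(read_records(fh)):
            if index % records_per_file == 0:
                if out is not None:
                    out.close()
                # Assumed pattern: msplit + seven-digit counter starting at zero.
                name = out_dir / f"msplit{len(written):07d}.mrc"
                out = open(name, "wb")
                written.append(name)
            out.write(record)
    if out is not None:
        out.close()
    return written
```

Note that the sketch, like the tool, only learns the total number of records by reading to the end of the file, which is why the number of output files cannot be fixed in advance.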

Joining MARC Data

It would stand to reason that if catalogers want to split large files into smaller ones, they may need to be able to join smaller files into a large one.  The MARCJoin tool provides a simple interface for joining individual files or entire directories of files into a single file.  The tool also provides the ability to append data into an existing file, not just overwrite existing data.  For example:

As a cataloger, I need to join files in 10 directories together.  Using the MARCJoin tool, I join all the files in my first directory into a new file.  I then join the files in each subsequent directory, appending the data to the existing file created in step one.  When finished, I’ve run the join tool 10 times, but I’ve only created one file, because each run appends the new data to the existing file.
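The difference between overwriting and appending can be sketched in a few lines of Python.  This is an illustration only, not MarcEdit’s code: it relies on the fact that binary MARC records are self-delimiting, so files can be joined by concatenating their bytes, and join_files is a hypothetical name.

```python
import shutil


def join_files(sources, target, append=False):
    """Concatenate binary MARC files into target and return how many were joined."""
    # "wb" replaces any existing target; "ab" adds to the end of it.
    mode = "ab" if append else "wb"
    joined = 0
    with open(target, mode) as out:
        for source in sources:
            with open(source, "rb") as src:
                shutil.copyfileobj(src, out)
            joined += 1
    return joined
```

In the scenario above, the first run would call join_files with append left at False to create the file, and each of the nine later runs would pass append=True so that the existing data is kept.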

Figure 2: MARC Join Window

By default, MARCJoin is configured to join individual files together.  When the Join Individual Files checkbox is checked, the File(s) to Join button opens a file selection dialog box that allows users to select multiple individual files to join.  However, if I have a directory of content that I want to join, I can forgo the file selection by unchecking the Join Individual Files checkbox.  This reconfigures the File(s) to Join button to open a folder selection dialog box.  I’m also given the option to define the file extension of the files I wish to join together (Figure 3).

Figure 3: MARCJoin by directory

This option is much more efficient when working with large sets of files that you wish to join.  Although large, the File(s) to Join textbox has a limit on the number of characters that it can contain.  This limit is approximately 2 GB (in theory), but the point is that there are limits (and performance implications) when using the select individual files option and then selecting thousands of files to join.  In those cases, isolating the data to join into a directory and then joining everything in that directory is the most efficient option.
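The directory approach avoids building one long list of paths by hand: the only inputs are a folder and a file extension.  A minimal Python sketch of that idea follows.  It is not MarcEdit’s implementation; join_directory is a hypothetical name, and the alphabetical ordering and case-insensitive extension match are assumptions made for the example.

```python
import os
import shutil


def join_directory(directory, target, extension=".mrc", append=False):
    """Join every file in directory with the given extension into target."""
    wanted = extension.lower()
    target_path = os.path.abspath(target)
    # Collect matching files first, skipping the target if it lives in the same folder.
    paths = sorted(
        entry.path
        for entry in os.scandir(directory)
        if entry.is_file()
        and entry.name.lower().endswith(wanted)
        and os.path.abspath(entry.path) != target_path
    )
    with open(target, "ab" if append else "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return len(paths)
```

Because the files are discovered by scanning the folder, the same call works whether the directory holds ten files or ten thousand.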

Beyond Split and Join

MarcEdit provides a number of other options for users looking to isolate specific records for editing.  Probably the most versatile is the Extract Selected Records tool.