Editing MarcEdit’s Linked Data Rules File

In creating the linked data platform in MarcEdit, I decided that nearly all interactions with linked data endpoints and bibliographic data would be controlled via a rules file.  With this approach, rules files can be created for different flavors of MARC and work equally well, without recompiling any code.  So, while I am actively adding and testing new endpoints as they become available, users can customize their linked data rules files and add their own endpoints to MarcEdit.  This document provides some details on how this customization works.

Rules file:

The rules file is located in the application configuration path, in the file:  linked_data_profile.xml.  The structure of the file is fairly straightforward.  The rules are broken into two parts.

  • Part 1: Field Rules
    Field rules are defined within the <rules> block, with the rules for each field to be processed defined in its own <field> block.  The <field> block is where the user defines how the data should be processed.  The <rules> and <field> blocks use the following structure:

 

    rules block:
        top level: field
            Attributes:
                type: authority, bibliographic, authority|bibliographic
            tag (required): 
                Value: Field value
                Description: field to process
            subfield (required): 
                Value: Subfield codes 
                Description: subfields to use for matching
            ind2 (optional):
                Attributes:
                  value: second indicator value
                  vocab: matches a valid vocabulary option
            index (optional): 
                Values: subfield code or empty
                Description: subfield that identifies the index (vocabulary) for the field
            sticky (optional):
                Values: subfield codes that should always be part of an atomized field (e.g., abc)
                Description: When processing atomized data, some subfields are applicable to all information within 
                a field block.  These are sticky values, and should be marked to ensure that they replicate between 
                atomized versions of a field set.
            atomize (optional): 
                Values: 1 or empty
                Description: determines if field should be broken up for URI disambiguation
            special_instructions (optional): 
                Values: name|subject|mixed|linking
                Description: special instructions to improve normalization for names and subjects.  
            uri (required):
                Values: subfield code that will hold the URI
                Description: determines which subfield is used to embed the URI
            vocab (optional): 
                Values: see the supported vocabularies section
                Description: when no index is supplied, you can predefine a supported index
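
As a rough illustration of the layout described above, the sketch below parses a made-up field rule and checks that the required settings are present.  It assumes each setting is a child element of <field>, that type is an attribute of <field>, and that ind2 carries value and vocab as attributes; the sample values (including the "lcsh" vocabulary name) and the helper name read_field_rules are hypothetical, so compare against your installed linked_data_profile.xml before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule for a 650 field. The element layout and every value here
# (including the "lcsh" vocabulary name) are assumptions for illustration.
SAMPLE_RULES = """
<rules>
  <field type="bibliographic">
    <tag>650</tag>
    <subfield>abvxyz</subfield>
    <ind2 value="0" vocab="lcsh"/>
    <index>2</index>
    <atomize>1</atomize>
    <special_instructions>subject</special_instructions>
    <uri>0</uri>
  </field>
</rules>
"""

REQUIRED = ("tag", "subfield", "uri")
OPTIONAL = ("index", "sticky", "atomize", "special_instructions", "vocab")


def read_field_rules(xml_text):
    """Return one dict per <field> block, raising if a required element is missing."""
    rules = []
    for field in ET.fromstring(xml_text).iter("field"):
        missing = [name for name in REQUIRED if field.find(name) is None]
        if missing:
            raise ValueError("field rule is missing: " + ", ".join(missing))
        rule = {"type": field.get("type", "")}
        for name in REQUIRED + OPTIONAL:
            # Absent or empty elements are read as an empty string.
            rule[name] = (field.findtext(name) or "").strip()
        # ind2 carries its settings as attributes rather than text.
        rule["ind2"] = [(e.get("value"), e.get("vocab")) for e in field.findall("ind2")]
        rules.append(rule)
    return rules


for rule in read_field_rules(SAMPLE_RULES):
    print(rule["tag"], rule["subfield"], "URI goes in $" + rule["uri"])
```

Running a check like this after editing the file is a quick way to catch a missing tag, subfield, or uri setting before loading the rules in MarcEdit.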
                
  • Part 2: Collections
    The collection definitions define how MarcEdit can interact with a web service to extract data.  Currently, for an endpoint to be user configured, it must be a SPARQL endpoint that returns JSON.  The <collection> block uses the following rules:

 

    collections block
      top level: collection
        attributes: none
        
        name (required):
            Value: Defines the name of the service
         label (required):
            Value: Defines the index name used within the record to identify the vocabulary.  
                 Example: =650  \7$aTest$2fast
                 The label would be defined as: fast
         uri (required):
            Value: The URL MarcEdit will use to query the endpoint.  Use {search_terms} to denote 
                   the placeholder where MarcEdit should include search terms.  Please note that arguments 
                   need to be encoded.  Use: http://string-functions.com/urlencode.aspx or other tools to 
                   URL encode the string to ensure proper communication.  MarcEdit will only encode the 
                   search terms as it injects them into the URI.  The tool assumes you, the user, have properly 
                   encoded any other required data.  For an example, see the Japanese Diet Library configuration below:
                       
                          name: Japanese Diet Library
                          uri:  http://id.ndl.go.jp/auth/ndla?query=PREFIX%20rdfs%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%20SELECT%20*%20WHERE%20%7B%20%3Fsubj%20rdfs%3Alabel%20%22{search_terms}%22%7D&output=json
                          path: results.bindings[0].subj.value
          
         pattern (optional):
            Value: For use when replacing an identifier in the $0 with a full URI.  For examples, see FAST and GND.

         path (required): 
            Value: For user defined collections, path is required.  This defines the JSON object path
                   to the URI within the response returned by the endpoint.  
                   Example: results.bindings[0].subj.value
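
To make the path setting more concrete, here is a minimal sketch of how a path such as results.bindings[0].subj.value could be walked through a JSON response.  How MarcEdit evaluates the path internally is not documented here, so the helper resolve_path is a hypothetical stand-in, and the sample response (including its example.org URI) is invented to follow the usual SPARQL JSON results layout.

```python
import json
import re

# Matches either a key name or a bracketed list index such as [0].
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def resolve_path(document, path):
    """Walk a dotted path with optional [n] indexes through parsed JSON."""
    current = document
    for key, index in _TOKEN.findall(path):
        current = current[int(index)] if index else current[key]
    return current


# Made-up response in the SPARQL JSON results layout; the URI is a placeholder.
SAMPLE_RESPONSE = json.loads("""
{
  "head": {"vars": ["subj"]},
  "results": {
    "bindings": [
      {"subj": {"type": "uri", "value": "http://example.org/auth/12345"}}
    ]
  }
}
""")

print(resolve_path(SAMPLE_RESPONSE, "results.bindings[0].subj.value"))
```

Saving a real response from your endpoint and running your candidate path against it this way can confirm that the path lands on the URI and not on a surrounding object.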

When developing collections, it is recommended that you first test your profiles in a browser or SPARQL tool to determine the proper construction of the URI.
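
One way to prepare a uri value for such a test is sketched below.  It percent-encodes a SPARQL query while leaving the {search_terms} placeholder untouched, then fills in a sample term to produce a URL that can be pasted into a browser.  The query is the decoded form of the Japanese Diet Library example above; the helper names encode_query and fill_terms are hypothetical, and fill_terms only approximates the substitution MarcEdit performs, whose exact encoding may differ.

```python
from urllib.parse import quote

PLACEHOLDER = "{search_terms}"


def encode_query(query):
    """Percent-encode a query but keep the {search_terms} placeholder literal."""
    parts = query.split(PLACEHOLDER)
    return PLACEHOLDER.join(quote(part, safe="*") for part in parts)


def fill_terms(uri_template, terms):
    """Stand-in for the substitution MarcEdit performs; its encoding may differ."""
    return uri_template.replace(PLACEHOLDER, quote(terms, safe=""))


SPARQL = (
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
    'SELECT * WHERE { ?subj rdfs:label "{search_terms}"}'
)

uri_template = (
    "http://id.ndl.go.jp/auth/ndla?query=" + encode_query(SPARQL) + "&output=json"
)
print(uri_template)                      # value for the collection's uri setting
print(fill_terms(uri_template, "Test"))  # URL to paste into a browser
```

Splitting on the placeholder before encoding keeps its braces literal, while the braces that belong to the SPARQL query itself are still encoded.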

Contributing endpoints:

If you create a definition for a new endpoint, or would like to have a major controlled vocabulary added to the MarcEdit rules file, please contact reeset@gmail.com and provide documentation for the service along with sample data for testing.