Chapter 2: Working with Regular Expressions | The MarcEdit Field Guide

=245.*[^\W]$

The $ anchor tells us that the expression requires something to appear immediately before the end of the line. The example above looks at the 245 field and flags any field that does not end in a punctuation mark. It does this with a character class mnemonic, \W (match a non-word character), which stands in for all potential punctuation characters. Because the class is negated, [^\W] matches only when the final character is a word character.
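The behavior of this expression can be sketched outside MarcEdit. The example below uses Python's re module as a stand-in for the .NET engine, on the assumption that the two behave alike for this pattern; the sample 245 fields are hypothetical.

```python
import re

# Hypothetical 245 fields in MarcEdit's mnemonic format (illustrative data only).
records = [
    "=245  10$aSample title /$cA. Author.",  # ends in a period
    "=245  10$aSample title /$cA. Author",   # ends in a letter
    "=245  00$aWhat is MARC?",               # ends in a question mark
]

# [^\W] means "not a non-word character", i.e. a word character, so the
# pattern matches only 245 fields whose last character is a letter,
# digit, or underscore.
missing_punctuation = re.compile(r"=245.*[^\W]$")

flagged = [line for line in records if missing_punctuation.search(line)]
print(flagged)
```

Only the second field, the one with no trailing punctuation, is flagged.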

Within the .NET regular expression language, a range of different characters can be represented by a character escape mnemonic. Primarily, these are unprintable characters such as a tab (\t), new line (\n), or carriage return (\r), but an escape can also represent the literal version of a character that has special meaning in the regular expression language. For example, \$ represents a literal dollar sign, \^ a literal caret, and \[, \], and \\ the literal forms of other characters that have special meaning within the regular expression language. So, if I wanted to match on a subfield a, which looks like $a in MarcEdit’s mnemonic format, I would need to remember to escape the $: in a regular expression, you would match using \$a.
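To make the difference between $a and \$a concrete, here is a minimal sketch using Python's re module in place of the .NET engine (an assumption made for illustration; the sample field is hypothetical).

```python
import re

# Hypothetical field in MarcEdit's mnemonic format (illustrative data only).
field = "=245  10$aSample title /$cA. Author."

# Unescaped, $ is the end-of-line anchor, so "$a" asks for an "a" that
# follows the end of the line and finds nothing in this field.
unescaped = re.search(r"$a", field)

# Escaped, \$ is a literal dollar sign, so the pattern finds subfield a.
escaped = re.search(r"\$a", field)

# re.escape builds the escaped form of a literal string for you.
auto_escaped = re.escape("$a")

print(unescaped, escaped.group(), auto_escaped)
```

A helper such as re.escape is handy when the text to match comes from user input. .NET offers Regex.Escape for the same purpose.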

Character Classes

Character classes are used within the regular expression language to match a set of characters that share some characteristic. The most common of these constructions are the positive character group [characters] and the negative character group [^characters]. Within a group, users can search for individual characters like 0, 1, 2, and 9 by listing them, [0129], or by combining a range with a literal, [0-29] (the range 0-2 plus the character 9). Likewise, users can look for items that don’t have specific characters. For example, if I was evaluating the second indicator and only wanted to match fields where that indicator was not a 7 or 9, I would use [^79]. Included in this set is also the period “.”, which matches any character except a new line. In regular expressions, you will often find this value paired with a quantifier that specifies how many times any character should be matched. You could match any number of characters: .*, or match exactly 3 characters: .{3}, or match a minimum of 1 and a maximum of 3 characters: .{1,3}. Quantifiers are “greedy” by default, meaning they match as much as they can, so pairing one with the “.” can consume far more of a field than intended. Users should take care that their expressions aren’t too “greedy”, because replacing an overly broad match is how data is lost.
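The grouping constructs and the effect of greedy matching can be sketched as follows. This uses Python's re module as a stand-in for the .NET engine, and the sample fields are hypothetical. The lazy quantifier .*? is not covered in the text above; it is shown only as a contrast to the greedy form.

```python
import re

# [0129] and [0-29] describe the same set: 0, 1, 2 (the range 0-2) and 9.
digits = "0123456789"
as_list = re.findall(r"[0129]", digits)
as_range = re.findall(r"[0-29]", digits)

# Hypothetical 650 fields; in the mnemonic format a backslash marks a
# blank indicator.
fields = [
    r"=650  \0$aLibraries.",
    r"=650  \7$aLibraries$2fast",
]
# "." steps over the first indicator; [^79] accepts any second indicator
# except 7 or 9.
not_7_or_9 = [f for f in fields if re.search(r"^=650  .[^79]", f)]

# Greedy vs. lazy: capture what sits between $a and a following "$".
title = "=245  10$aSample title :$ba subtitle /$cA. Author."
greedy = re.search(r"\$a(.*)\$", title).group(1)   # runs to the LAST "$"
lazy = re.search(r"\$a(.*?)\$", title).group(1)    # stops at the FIRST "$"

print(as_list == as_range, not_7_or_9, greedy, lazy, sep="\n")
```

The greedy capture swallows subfield b along with subfield a. If that match were used in a replacement, subfield b would be overwritten too.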

Finally, character classes include a set of common mnemonics that can be used to stand in for large classes of characters. The most common are:

\w: any word character
\W: any non-word character
\s: any white-space character
\S: any non-white-space character
\d: any decimal digit
\D: any character that is not a decimal digit

Character classes give users the ability to write extremely complicated expressions targeting particular types of characters without having to list every character that falls into a specific class. MarcEdit also implements these elements as language-neutral, so if you are working with Chinese data, the expression will match any word character within that language, rather than just English.
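A short sketch of the mnemonics at work on mixed-script data follows. It relies on Python 3's re module, which is Unicode-aware for text patterns, and assumes that MarcEdit's .NET engine behaves comparably for these classes; the sample value is hypothetical.

```python
import re

# Hypothetical mixed-script value (illustrative data only).
text = "图书馆 library, 2024."

words = re.findall(r"\w+", text)    # runs of word characters, any script
digits = re.findall(r"\d+", text)   # runs of decimal digits
non_word = re.findall(r"\W", text)  # spaces and punctuation
tokens = re.split(r"\s+", text)     # split on white space

print(words, digits, non_word, tokens, sep="\n")
```

Here \w+ picks up the Chinese word as readily as the English one, without either script's characters having to be listed.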
