Chapter 2: Working with Regular Expressions

Anchors

Anchors introduce positioning into an expression.  The two most common are the caret “^” and the dollar sign “$”, which denote that a match must start at the beginning of the line or finish at the end of the line, respectively.
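As a rough illustration of anchoring, here is a minimal sketch written with Python's re module rather than MarcEdit's .NET engine (the two behave the same way for this simple case).  The sample lines are made-up data in MarcEdit's mnemonic format, and the variable names are only for illustration.

```python
import re

# Hypothetical sample lines in MarcEdit's mnemonic format.
lines = [
    r"=245  10$aThe title /$cAn author.",
    r"=500  \\$aA note that mentions =245",
    r"=650  \0$aRegular expressions.",
]

# "^" pins the match to the start of the line.
starts_with_245 = [l for l in lines if re.search(r"^=245", l)]

# Without an anchor, "=245" is found anywhere in the line.
anywhere_245 = [l for l in lines if re.search(r"=245", l)]

# "$" pins the match to the end of the line (the period is escaped).
ends_with_period = [l for l in lines if re.search(r"\.$", l)]

print(starts_with_245)
print(anywhere_245)
print(ends_with_period)
```

Only the first line starts with =245, while the unanchored expression also finds the mention inside the 500 note.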

Grouping

Essentially, regular expressions are about parsing a line or string into groups for processing.  Groups are established by placing matching expressions within parentheses “()”.  Matched groups are then referenced by their number, that is, the order in which they appear within the expression.  For example:

(=245.{4})(\$a[^$])

This expression has two groups.  The first, (=245.{4}), is group 1 and the second, (\$a[^$]), is group 2.  Sometimes groups need to be named, for a variety of reasons, and we do that by adding a naming element to the group.  If we take the same example, (=245.{4})(\$a[^$]), we can name the first group by updating the expression to:

(?<one>=245.{4})(\$a[^$])

In .NET’s expression language, the ?<..> syntax assigns a name to the group.  This also changes how we reference the other groups in the expression.  .NET numbers the unnamed groups first, from left to right, and numbers the named groups only after them.  So, in this case, the named group would be referenced by its name, one, and the second group would become ordinal number 1.
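The group behavior can be tried out with a short sketch.  It uses Python's re module, not MarcEdit's .NET engine, so two things differ and should be treated as assumptions of the example: Python spells a named group (?P<one>...) instead of (?<one>...), and it numbers all groups strictly left to right, so it does not reproduce the .NET numbering described above.  The sample line is made-up data.

```python
import re

# Hypothetical sample line in MarcEdit's mnemonic format.
line = "=245  10$aThe title /$cAn author."

# Two unnamed groups, referenced by position.
unnamed = re.search(r"(=245.{4})(\$a[^$])", line)
group_1 = unnamed.group(1)  # the tag plus the next four characters
group_2 = unnamed.group(2)  # "$a" plus ONE character, since [^$] has no quantifier

# The same expression with the first group named "one".
# Python syntax: (?P<one>...); the .NET form in the text is (?<one>...).
named = re.search(r"(?P<one>=245.{4})(\$a[^$])", line)
by_name = named.group("one")

print(repr(group_1), repr(group_2), repr(by_name))
```

Note that [^$] on its own captures a single character; a quantifier is needed to capture the rest of the subfield.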

Quantifiers

Quantifiers set the number of times a character or character group can match.  They also establish how “greedy” a match will be: by default a quantifier matches as much as it can, and adding ? after it makes it match as little as it can.  Common characters in this class are *, +, ?, and the {} construct.
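To make greediness concrete, here is a minimal sketch using Python's re module (an assumption of the example; for these quantifiers it behaves like the .NET engine MarcEdit uses).  The sample line and variable names are made up for illustration.

```python
import re

# Hypothetical sample line in MarcEdit's mnemonic format.
line = "=245  10$aThe title :$ba subtitle /$cAn author."

# Greedy: ".*" takes as much as possible, so the match runs to the LAST "$".
greedy = re.search(r"\$a.*\$", line).group(0)

# Lazy: ".*?" takes as little as possible, so the match stops at the NEXT "$".
lazy = re.search(r"\$a.*?\$", line).group(0)

# "{3}" asks for exactly three digits.
counted = re.search(r"=\d{3}", line).group(0)

# "+" asks for one or more characters that are not "$".
one_or_more = re.search(r"\$a[^$]+", line).group(0)

print(greedy)
print(lazy)
print(counted)
print(one_or_more)
```

The greedy expression swallows subfield b on its way to the last dollar sign, which is a common source of surprises when editing subfields.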

Substitutions

Substitutions are used in replacement patterns and are how expressions output data captured in groups.  Take the named example above:

(?<one>=245.{4})(\$a[^$])

If this expression were used in a “Find”, then to output the data again in the “Replace”, the user would use a “$number” or “${name}” construct to reference the captured group.  For example:

${one} to output the named group, or $1 to output the unnamed group.
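A find-and-replace of this kind can be sketched outside MarcEdit as well.  The example below uses Python's re module, so the notation is an assumption that differs from the text: Python writes \g<one> where .NET writes ${one}, and because Python numbers groups left to right, the unnamed group is \2 here rather than $1.  The sample line and the "|" marker are made up for illustration.

```python
import re

# Hypothetical sample line in MarcEdit's mnemonic format.
line = "=245  10$aThe title /$cAn author."

# Python spelling of (?<one>=245.{4})(\$a[^$])
pattern = re.compile(r"(?P<one>=245.{4})(\$a[^$])")

# Writing both groups straight back leaves the line unchanged.
# (.NET replacement: ${one}$1)
unchanged = pattern.sub(r"\g<one>\2", line)

# Writing a marker between the groups shows where each capture ends up.
# (.NET replacement: ${one}|$1)
marked = pattern.sub(r"\g<one>|\2", line)

print(unchanged)
print(marked)
```

Text outside the match, here everything after the first character of subfield a, is carried over untouched.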

Most Common Expression Elements used with MarcEdit

The .NET regular expression language is large and can be confusing.  With so many options, one might ask: which are the most important?  Based on a review of the expressions most often written in response to questions on the MarcEdit listserv, the most common elements are:

  • Character escapes: \d\r\n\$\x##
  • Character classes: [] & [^]
  • Grouping elements: ()
  • Anchors: ^$
  • Quantifiers: *?+{#}
  • Substitutions: $#

While other elements are certainly used, these are the ones that appear most often.
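Several of these elements can be seen working together in a short sketch.  It is written with Python's re module rather than the .NET engine, which is an assumption of the example; the escapes and character classes shown here are written the same way in both.  The sample line is made-up data.

```python
import re

# Hypothetical sample line in MarcEdit's mnemonic format.
line = r"=260  \\$aNew York :$bExample Press,$c2015."

# Character escape plus quantifier: \d{4} finds a run of four digits.
year = re.search(r"\d{4}", line).group(0)

# Negated character class inside a group: everything in subfield b
# up to (but not including) the next "$".
subfield_b = re.search(r"\$b([^$]+)", line).group(1)

# Hex escape: \x24 is another way to write a literal "$".
hex_dollar = re.search(r"\x24c", line).group(0)

# Character class plus anchor: drop a trailing comma or period.
cleaned = re.sub(r"[,.]$", "", line)

print(year)
print(subfield_b)
print(hex_dollar)
print(cleaned)
```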

Examples

Regular expressions represent one of the most powerful tools in your MarcEdit arsenal.  Learning how to use them, and when they can make a difference, will turn a nearly impossible editing task into one that can be accomplished with a single expression.  How can you get to that point?  As with any language, exposure and practice will help.  Ask questions on the listserv.  I’ve included some examples below that demonstrate different regular expressions and how they might be used within MarcEdit.