Chapter 2: Working with Regular Expressions
In this Chapter:
- Getting Started with Regular Expressions in MarcEdit
- Overview of the .NET Regular Expression Language
- Most Common Expression Elements
- Examples
- Scoping
Getting Started with Regular Expressions in MarcEdit
For as long as I can remember, MarcEdit has included some flavor of regular expression support as part of its editing toolkit. In earlier versions of the application, I wrote the regular expression engine myself, in assembly language, to replicate some of the functions found in common Unix command-line tools. This approach was handy, especially for folks already used to tools like sed and awk, because the syntax within MarcEdit was nearly the same. From a maintenance perspective, however, it was tedious, in part because the expression language was difficult to write and expand. What’s more, since I was only able to support a subset of the language, large portions of really useful functionality were never going to be created.
This changed in MarcEdit 5. Previous versions of MarcEdit (versions 0-4) were coded in a variety of languages: assembly code for the libraries, Visual Basic for the GUI, and Delphi for the COM components. MarcEdit 5 represented a clean break from the original codebase, and an opportunity to rethink how different parts of the application, regular expression support included, were implemented. The application was re-coded in C# and, as such, was able to take advantage of the Microsoft .NET Regular Expression language.
The Regular Expression language in .NET has a very Perl-like feel to it, and provides MarcEdit with a full-featured regular expression processing language. What’s more, the language includes the ability to add extensions, embedding code directly into an expression that will be executed as part of the process. Overall, it has represented a significant step forward. At the same time, the .NET language has its quirks. Backreferences and substitutions can be odd, and the assertion language definitely has a unique flavor to it. But with practice, knowing how to utilize regular expressions within MarcEdit can enable someone to move beyond simply editing records to reshaping them.
Overview of the .NET Regular Expression Language
Microsoft provides a quick reference tutorial for individuals getting familiar with the regular expression language found in .NET. I highly recommend that users review the documentation (https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx) and become familiar with its concepts.
Like any regular expression language, the .NET regular expression language is divided into different language components. For the purposes of working with MarcEdit, the components that are most advantageous to understand are the Character Escapes, the Character Classes, Anchors, Grouping Constructs, Quantifiers, and Substitutions. There are other parts of the language, and they can be useful, but I’ll focus on these particular parts of the language.
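To make these components a little more concrete, here is a minimal sketch that touches each of them once. It uses Python's re module as a stand-in rather than MarcEdit or the .NET engine, and the record line is invented for illustration. The pattern syntax shown is common to both engines, but the replacement syntax differs: Python refers to a captured group as \1, while .NET writes $1.

```python
import re

# Hypothetical mnemonic-style record line, invented for this illustration.
line = "=245  10$aRegular expressions in practice /$cby A. Author."

# Anchor:           ^ ties the match to the start of the line.
# Character class:  \d matches a digit; [^$] matches anything except "$".
# Quantifier:       {3} means exactly three; + means one or more.
# Character escape: \$ matches a literal dollar sign, not "end of line".
# Grouping:         (...) captures the field tag and the subfield a text.
pattern = r"^=(\d{3})  \d\d\$a([^$]+)"

match = re.search(pattern, line)
tag, title = match.group(1), match.group(2)

# Substitution: reuse the captured group in the replacement text.
# Python writes the group as \1; the .NET equivalent is $1.
swapped = re.sub(r"\$c(.+)$", r"$c[\1]", line)

print(tag)
print(title)
print(swapped)
```

Note how the dollar sign appears in two roles in the last expression: escaped as \$ it is a literal character, while the unescaped $ at the end is an anchor.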
Character Escapes
Character escapes are essentially mnemonics that stand in for literal characters. If you think about it, a regular expression is just a piece of marked-up text that is later interpreted by an expression engine to perform an action. Within this markup are characters that have special meaning. For example, the dollar-sign character “$” is a special character within the regular expression markup. When found in a search string, the “$” translates as “the end of the line”. So, when you see a regular expression like the following: