{"id":520,"date":"2017-03-10T08:17:28","date_gmt":"2017-03-10T08:17:28","guid":{"rendered":"http:\/\/marcedit.reeset.net\/learning_marcedit\/?page_id=520"},"modified":"2018-01-03T21:32:25","modified_gmt":"2018-01-03T21:32:25","slug":"chapter-2-regular-expressions-01","status":"publish","type":"page","link":"https:\/\/marcedit.reeset.net\/learning_marcedit\/book-iii-the-marceditor\/chapter-2-regular-expressions-01\/","title":{"rendered":"Chapter 2: Working with Regular Expressions"},"content":{"rendered":"<p>Chapter 2: Working with Regular Expressions<\/p>\n<p>&nbsp;<\/p>\n<h1>In this Chapter:<\/h1>\n<ul>\n<li><strong>Getting Started with Regular Expressions in MarcEdit<\/strong><\/li>\n<li><strong>Overview of the .NET Expression Language<\/strong><\/li>\n<li><strong>Most Common Expression Elements<\/strong><\/li>\n<li><strong>Examples<\/strong><\/li>\n<li><strong>Scoping<\/strong><\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<h3>Getting Started with Regular Expressions in MarcEdit<\/h3>\n<p>For as long as I can remember, MarcEdit has included some flavor of regular expression support as part of its editing toolkit.\u00a0 In earlier versions of the application, the regular expression engine was written by me, in assembly language, to replicate some of the common functions found within common Unix command line tools.\u00a0 This approach was handy, especially for folks already used to tools like sed and awk, because the syntax within MarcEdit was nearly the same.\u00a0 From a maintenance perspective, however, this approach was tedious, in part because the expression language was difficult to write and expand.\u00a0 What\u2019s more, since I was only able to support a subset of the language, large portions of really useful functionality were never going to be created.<\/p>\n<p>This changed in MarcEdit 5.\u00a0 MarcEdit 5 represented a new code-base, and an opportunity to rethink how regular expression support was implemented.\u00a0 In previous versions of MarcEdit (versions 0-4), 
MarcEdit was coded in a variety of languages: assembly code for the libraries, Visual Basic for the GUI, and Delphi for the COM components.\u00a0 MarcEdit 5 represented a clean break from the original codebase, and an opportunity to rethink how different parts of the application were implemented.\u00a0 The application was re-coded in C# and, as such, was able to take advantage of the Microsoft .NET Regular Expression language.<\/p>\n<p>The Regular Expression language in .NET has a very Perl-like feel to it, and provides MarcEdit with a full-featured regular expression processing language.\u00a0 What\u2019s more, the language includes the ability to add extensions; embedding code directly into an expression which will be executed as part of the process.\u00a0 Overall, it has represented a significant step forward.\u00a0 At the same time, the .NET language has its quirks.\u00a0 Backreferences and substitutions can be odd, and the assertion language definitely has a unique flavor to it.\u00a0 But with practice, knowing how to utilize regular expressions within MarcEdit can enable someone to move beyond simply editing records, to reshaping them.<\/p>\n<h3>Overview of the .NET Regular Expression Language<\/h3>\n<p>Microsoft provides a quick reference tutorial for individuals getting familiar with the regular expression language found in .NET.\u00a0 I highly recommend that users review the documentation (<a href=\"https:\/\/msdn.microsoft.com\/en-us\/library\/az24scfc(v=vs.110).aspx\">https:\/\/msdn.microsoft.com\/en-us\/library\/az24scfc(v=vs.110).aspx<\/a>) and become familiar with its concepts.<\/p>\n<p>Like any regular expression language, the .NET regular expression language is divided into different language components.\u00a0 For the purposes of working with MarcEdit, the components that are most advantageous to understand are the Character Escapes, the Character Classes, Anchors, Grouping Constructs, Quantifiers, and Substitutions.\u00a0 There are 
other parts of the language, and they can be useful, but I\u2019ll focus on these components.<\/p>\n<p><strong>Character Escapes<\/strong><\/p>\n<p>Character escapes are essentially mnemonics that stand in for literal characters.\u00a0 If you think about it, a regular expression is just a piece of marked up text that is later interpreted by an expression engine to perform an action.\u00a0 Within this markup are characters that have special meaning within the markup.\u00a0 For example, the dollar-sign character \u201c$\u201d is a special character within the regular expression markup.\u00a0 When found in a search string, the \u201c$\u201d translates as \u201cthe end of the line\u201d.\u00a0 So, when you see a regular expression like the following:<!--nextpage--><\/p>\n<p>=245.*[^\\W]$<\/p>\n<p>We know that the regular expression is making a statement that something must show up right before the end of the line.\u00a0 The above example would be an expression that looked at the 245 field, and noted any elements that didn\u2019t end in a punctuation mark.\u00a0 In fact, this expression uses a character class mnemonic, the \\W or match on a non-word character, to stand in for all potential punctuation characters, and negates it with [^ ] so that the last character must be a word character.<\/p>\n<p>Within the .NET language, a range of different types of characters may find themselves represented as a character escape mnemonic.\u00a0 Primarily, these would be unprintable characters like a tab (\\t) or new line (\\n) or carriage return (\\r) \u2013 or it could represent a literal version of a character that has special meaning in the regular expression language.\u00a0 For example, a \\$ to represent a literal dollar-sign, or a \\^ to represent a caret, or a \\[ or \\] or \\\\ to represent other characters that have special meaning within the regular expression language, but can also be represented as a literal value.\u00a0 So, if I wanted to match on a subfield a, that would look like $a in MarcEdit\u2019s mnemonic 
format.\u00a0 To match this as an expression, we would need to remember to escape the $, so in a regular expression, you would match using \\$a.<\/p>\n<p><strong>Character Classes<\/strong><\/p>\n<p>Character classes are used within the regular expression language to match a set of characters having some kind of similar set of characteristics.\u00a0 The most common of these constructions is the character grouping [characters] construct or negative [^characters] character grouping construct.\u00a0 Within the grouping construction, users can search for individual characters like 0, 1, 2, 9 using a grouping construct like [0129] or as a range [0-29].\u00a0 Likewise, users could look for items that don\u2019t have specific characters. For example, if I was evaluating the second indicator, and only wanted to match fields where that second indicator was not a 7 or 9, I would use [^79].\u00a0 Included in this set is also the period \u201c.\u201d, which can be used to match any character.\u00a0 In regular expressions, you will often find this value paired with a quantifier to specify how many times any character should be matched.\u00a0 You could match any number of characters: .*, or match any 3 characters: .{3}, or match a minimum of 1 and up to 3 characters: .{1,3}.\u00a0 When paired with a quantifier, the \u201c.\u201d or any character value allows an expression to perform \u201cgreedy\u201d matching, and because of that, users should take care to make sure their expressions aren\u2019t too \u201cgreedy\u201d as this is how data is lost.<\/p>\n<p>Finally, character classes include a set of common mnemonics that can be used to stand in for large classes of characters.\u00a0 The most common are:<\/p>\n<ul>\n<li>\\w: any word character<\/li>\n<li>\\W: any non-word character<\/li>\n<li>\\s: any white-space character<\/li>\n<li>\\S: any non-white-space character<\/li>\n<li>\\d: any decimal digit<\/li>\n<li>\\D: any character other than a decimal digit<\/li>\n<\/ul>\n<p>Character 
classes give users the ability to write extremely complicated expressions targeting particular types of characters, without having to create lists of all the characters that may fall into a specific class.\u00a0 Likewise, MarcEdit implements these elements as language neutral, so if you are working with Chinese data, the expression will match any word character within that language, rather than just English.<!--nextpage--><\/p>\n<p><strong>Anchors:<\/strong><\/p>\n<p>Anchors introduce positioning into an expression.\u00a0 The two most common are the caret \u201c^\u201d and dollar-sign \u201c$\u201d, which note that a match must start at the beginning of the line or finish at the end.<\/p>\n<p><strong>Grouping<\/strong><\/p>\n<p>Essentially, regular expressions are about parsing a line or string into groups for processing.\u00a0 Groups are established by placing matching expressions within parentheses \u201c()\u201d.\u00a0 Matched groups are then referenced by their number, the order in which they appear within the expression.\u00a0 For example:<\/p>\n<p>(=245.{4})(\\$a[^$])<\/p>\n<p>Has two groups.\u00a0 The first group (=245.{4}) would be represented as group 1 and the (\\$a[^$]) would be represented as group 2.\u00a0 Sometimes groups need to be named, for a variety of reasons, and we do that by adding a naming element to the group.\u00a0 If we take the same example: (=245.{4})(\\$a[^$]) , we can name the first group by updating this expression to:<\/p>\n<p>(?&lt;one&gt;=245.{4})(\\$a[^$])<\/p>\n<p>In .NET\u2019s expression language, the ?&lt;..&gt; syntax creates an assertion around the data \u2013 in this case, the assertion is a name.\u00a0 This also changes how we reference the other groups in the expression.\u00a0 .NET numbers the unnamed groups first, from left to right, so once a group is named, it no longer takes its positional number.\u00a0 So, in this case, the named expression would be referenced by its name: one, and the second group would become ordinal number 
1.<\/p>\n<p><strong>Quantifiers<\/strong><\/p>\n<p>Quantifiers are used to set the number of times a character or character group can match.\u00a0 These special values establish how \u201cgreedy\u201d a match will be. \u00a0Common characters in this class would be: *, +, ?, and the {} construct.<\/p>\n<p><strong>Substitutions<\/strong><\/p>\n<p>Substitutions are utilized for replacement patterns and are how expressions output data captured in groups.\u00a0 In the named example above:<\/p>\n<p>(?&lt;one&gt;=245.{4})(\\$a[^$])<\/p>\n<p>If this expression was used in a \u201cFind\u201d, when re-outputting the data back into a \u201cReplace\u201d, the user would utilize a \u201c$number\u201d or \u201c${name}\u201d construct to reference the captured group.\u00a0 For example:<\/p>\n<p>${one} to output the named group, or $1 to output the unnamed group.<\/p>\n<h3>Most Common Expression Elements used with MarcEdit<\/h3>\n<p>The .NET regular expression language is large, and can be confusing.\u00a0 With so many options, one might ask which are the most important.\u00a0 Based on an evaluation of the most common expressions written in response to questions on the MarcEdit listserv, the most common elements are:<\/p>\n<ul>\n<li>Character escapes: \\d\\r\\n\\$\\x##<\/li>\n<li>Character Classes [] &amp; [^]<\/li>\n<li>Grouping Elements ()<\/li>\n<li>Anchors: ^$<\/li>\n<li>Quantifiers: *?+{#}<\/li>\n<li>Substitutions: $#<\/li>\n<\/ul>\n<p>While other elements are certainly utilized, these represent the values that are most often utilized.<\/p>\n<h3>Examples<\/h3>\n<p>Regular expressions represent one of the most powerful tools in your MarcEdit arsenal.\u00a0 Learning how to use them, and when they can make a difference, will turn a nearly impossible editing task into one that can be accomplished with a single expression. How can you get to that point?\u00a0 Like any language, exposure and practice will help. 
Ask questions on the listserv.\u00a0 I\u2019ve included some examples below that demonstrate different regular expressions and how they might be used within MarcEdit.<!--nextpage--><\/p>\n<p><em>Example 1:<\/em><\/p>\n<p>Add a period to the 500 if it is missing<\/p>\n<p>Find: (=500.*[^\\W])$ Replace: $1.<\/p>\n<p><em>Why does this work?<\/em><\/p>\n<p>The key part of this expression is found at the end \u2013 the [^\\W]$.\u00a0 This tells the expression that I explicitly want to evaluate the last character in the string, and that the expression should evaluate as true if, and only if, the last character in the string is not a non-word character; that is, the field ends in a word character rather than punctuation.<\/p>\n<p><em>Example 2:<\/em><\/p>\n<p>Delete all non-LC Subjects using the Add\/Delete Field Function<\/p>\n<p>Field: 6xx Field Data: ^=6[0-9]{2}.{3}[^7]<\/p>\n<p><em>Why this works:<\/em><\/p>\n<p>The field part of this is not part of the regular expression.\u00a0 In the Add\/Delete Field function, fields can be referenced as groups.\u00a0 The expression is what follows in the field data.\u00a0 When working with the Add\/Delete Field function, all data is exposed to the regular expression engine.\u00a0 So, the expression sets an anchor that says the start of the field must begin with an equal-sign \u201c=\u201d then match any 6xx field.\u00a0 Next, we use the .{3} syntax to match the next three characters, which would be the two spaces required by MarcEdit\u2019s mnemonic format, and the first indicator.\u00a0 The last value, [^7], tells the expression engine to match any line that does not have a second indicator of 7.<\/p>\n<h3>Scoping<\/h3>\n<p>Within MarcEdit, how much data is exposed for regular expression evaluation will depend on the global editing function being utilized.\u00a0 Generally, the global editing functions are scoped as follows:<\/p>\n<ul>\n<li>Find\/Replace: Regular Expression engine can read all record data. 
By default, expressions are scoped to a single field, but multi-line expressions can be written, changing the scope from a single field to the entire record.<\/li>\n<li>Add\/Delete Field Function: The expression engine can see all data in a field, including the equal-sign at the start of the field.<\/li>\n<li>Edit Field Function: The Expression engine can see subfield data.\u00a0 No field or indicator data is visible to the Edit Field Function.\u00a0 If you need to evaluate indicator data as part of an expression, you should use the Replace Function.<\/li>\n<li>Edit Subfield Function: The Expression engine can only see the selected subfield, including the actual subfield code.<\/li>\n<li>Copy Field Data: The Expression engine can see all field data.<\/li>\n<li>Build New Field: Regular Expressions are limited to the specific parts of a record being used to build a new field.<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Chapter 2: Working with Regular Expressions &nbsp; In this Chapter: Getting Started with Regular Expressions in MarcEdit Overview of the .NET Expression Language Most Common Expression Elements Examples Scoping &nbsp; Getting Started with Regular Expressions in MarcEdit For as long as I can remember, MarcEdit has included some flavor of regular expression support as part 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":369,"menu_order":1,"comment_status":"closed","ping_status":"closed","template":"","meta":[],"_links":{"self":[{"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/pages\/520"}],"collection":[{"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/comments?post=520"}],"version-history":[{"count":6,"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/pages\/520\/revisions"}],"predecessor-version":[{"id":553,"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/pages\/520\/revisions\/553"}],"up":[{"embeddable":true,"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/pages\/369"}],"wp:attachment":[{"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/media?parent=520"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}