Character Escapes<\/strong><\/p>\nCharacter escapes are essentially mnemonics that stand in for literal characters.\u00a0 If you think about it, a regular expression is just a piece of marked-up text that is later interpreted by an expression engine to perform an action.\u00a0 Within this markup are characters that have special meaning.\u00a0 For example, the dollar-sign character \u201c$\u201d is a special character within the regular expression markup.\u00a0 When found in a search string, the \u201c$\u201d translates as \u201cthe end of the line\u201d.\u00a0 So, when you see a regular expression like the following:<\/p>\n
=245.*[^\\W]$<\/p>\n
We know that the regular expression is making a statement that something must show up right before the end of the line.\u00a0 The above example would be an expression that looked at the 245 field, and noted any elements that didn\u2019t end in a punctuation mark.\u00a0 In fact, this expression uses a character class mnemonic, the \\W or match on a non-word character, to stand in for all potential punctuation characters.<\/p>\n
Within the .NET regular expression language, a range of different types of characters may be represented by a character escape mnemonic.\u00a0 Primarily, these are unprintable characters like a tab (\\t), new line (\\n), or carriage return (\\r), but an escape can also represent a literal version of a character that has special meaning in the regular expression language.\u00a0 For example, \\$ represents a literal dollar-sign, \\^ represents a literal caret, and \\[, \\], and \\\\ represent other characters that have special meaning within the regular expression language but can also be matched as literal values.\u00a0 So, if I wanted to match on a subfield a, that would look like $a in MarcEdit\u2019s mnemonic format.\u00a0 To match this in an expression, we need to remember to escape the $, so in a regular expression you would match using \\$a.<\/p>\n
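As a rough illustration of why the escape matters, here is a minimal sketch using Python's re module rather than MarcEdit's .NET engine. The sample field is invented for illustration, and the patterns are written as plain Python raw strings, so each backslash appears once.

```python
import re

# A field in MarcEdit's mnemonic format (sample data, invented for illustration).
field = "=245  10$aRegular expressions :$ba primer."

# Unescaped, "$" is the end-of-line anchor, so "$a" can never match:
# no character can follow the end of the line.
unescaped = re.search(r"$a", field)

# Escaped, "\$" is a literal dollar sign, so "\$a" finds subfield a.
escaped = re.search(r"\$a", field)

# re.escape() produces the escaped form of a literal string for you.
built = re.escape("$a")

print(unescaped)
print(escaped.group(), escaped.start())
print(built)
```

The same distinction between an anchor and an escaped literal applies in the .NET language that MarcEdit uses, although the helper for escaping a literal string has a different name there.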
Character Classes<\/strong><\/p>\nCharacter classes are used within the regular expression language to match a set of characters that share some characteristic.\u00a0 The most common of these constructions are the character grouping construct [characters] and the negative character grouping construct [^characters].\u00a0 Within the grouping construction, users can search for individual characters like 0, 1, 2, 9 using a grouping construct like [0129] or using a range, [0-29].\u00a0 Likewise, users can look for items that don\u2019t have specific characters. For example, if I was evaluating the second indicator and only wanted to match fields where that second indicator was not a 7 or 9: [^79].\u00a0 Included in this set is also the period \u201c.\u201d, which can be used to match any character.\u00a0 In regular expressions, you will often find this value paired with a quantifier to specify how many times any character should be matched.\u00a0 You could match all characters: .*, or match any 3 characters: .{3}, or match a minimum of 1 and a maximum of 3 characters: .{1,3}.\u00a0 When paired with a quantifier, the \u201c.\u201d or any-character value allows an expression to perform \u201cgreedy\u201d matching, which consumes as much text as it can.\u00a0 Because of that, users should take care to make sure their expressions aren\u2019t too \u201cgreedy\u201d, as this is how data is lost.<\/p>\n
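The difference between a greedy any-character match and a negated character group can be sketched as follows. This uses Python's re module as a stand-in for the .NET engine, and the sample field is invented for illustration.

```python
import re

# Sample field with three subfields (invented for illustration).
field = "=245  10$aTitle :$bsubtitle /$cauthor."

# Greedy: ".*" runs as far as it can, so the match ends at the LAST "$".
greedy = re.search(r"\$a.*\$", field).group()

# Negated group: "[^$]*" stops before the next "$", so only subfield a is matched.
bounded = re.search(r"\$a[^$]*", field).group()

# [0-29] is the range 0-2 plus the single character 9, the same set as [0129].
picked = re.findall(r"[0-29]", "0123456789")

print(greedy)
print(bounded)
print(picked)
```

If the greedy pattern were used in a replace, subfield b would be swallowed along with subfield a, which is the kind of data loss the paragraph above warns about.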
Finally, character classes include a set of common mnemonics that can be used to stand in for large classes of characters.\u00a0 The most common are:<\/p>\n
\n- \\w: any word character<\/li>\n
- \\W: any non-word character<\/li>\n
- \\s: any white-space character<\/li>\n
- \\S: any non-white-space character<\/li>\n
- \\d: any decimal digit<\/li>\n
- \\D: any character that is not a decimal digit<\/li>\n<\/ul>\n
Character classes give users the ability to write extremely complicated expressions targeting particular types of characters, without having to create lists of all the characters that may fall into a specific class.\u00a0 Likewise, MarcEdit implements these elements as language neutral, so if you are working with Chinese data, the expression will match any word character within that language, rather than just English.<\/p>\n
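A small sketch of the class mnemonics on mixed-script text follows. It uses Python's re module, which is also Unicode-aware by default for text patterns. That is an observation about Python, not a statement about how MarcEdit implements the classes. The sample text is invented.

```python
import re

text = "中文 title, 2nd ed."

# \w+ matches runs of word characters in any script, not only A-Z.
words = re.findall(r"\w+", text)

# For comparison, the re.ASCII flag restricts \w to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, re.ASCII)

# \d matches decimal digits; \s matches white-space characters.
digits = re.findall(r"\d", text)
spaces = re.findall(r"\s", text)

print(words)
print(ascii_words)
print(digits, len(spaces))
```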
Anchors:<\/strong><\/p>\nAnchors introduce positioning into an expression.\u00a0 The two most common are the caret \u201c^\u201d and the dollar-sign \u201c$\u201d, which note that a match must start at the beginning of the line or finish at the end of the line, respectively.<\/p>\n
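To see what the anchors change, here is a minimal sketch in Python's re module. The three-line record is invented for illustration. In Python, the re.MULTILINE flag is what makes ^ and $ apply to every line rather than only to the whole string. How MarcEdit configures its .NET engine is not shown here.

```python
import re

# A tiny record in mnemonic format (sample data, invented for illustration).
record = "\n".join([
    "=245  10$aA title.",
    "=246  30$aSee field =245.",
    "=650  10$aTopic",
])

# Anchored: only tags at the start of a line are captured.
anchored = re.findall(r"^=(\d{3})", record, re.MULTILINE)

# Unanchored: the "=245" inside the second line's text is captured too.
unanchored = re.findall(r"=(\d{3})", record)

# Without MULTILINE, "^" only matches at the very start of the string.
first_only = re.findall(r"^=(\d{3})", record)

# "$" anchors to the end of a line: count the lines that end in a period.
ending_periods = len(re.findall(r"\.$", record, re.MULTILINE))

print(anchored, unanchored, first_only, ending_periods)
```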
Grouping<\/strong><\/p>\nEssentially, regular expressions are about parsing a line or string into groups for processing.\u00a0 Groups are established by placing matching expressions within parentheses \u201c()\u201d.\u00a0 Matched groups are then referenced by their number, which reflects the order in which they appear within the expression.\u00a0 For example:<\/p>\n
(=245.{4})(\\$a[^$])<\/p>\n
Has two groups.\u00a0 The first group (=245.{4}) would be represented as group 1 and the (\\$a[^$]) would be represented as group 2.\u00a0 Sometimes groups need to be named, for a variety of reasons, and we do that by adding a naming element to the group.\u00a0 If we take the same example: (=245.{4})(\\$a[^$]) , we can name the first group by updating this expression to:<\/p>\n
(?<one>=245.{4})(\\$a[^$])<\/p>\n
In .NET\u2019s expression language, the ?<..> syntax attaches a name to the group.\u00a0 This also changes how we reference the other groups in the expression.\u00a0 In .NET, unnamed groups are numbered first, and named groups are numbered only after all of the unnamed groups.\u00a0 So, in this case, the named expression would be referenced by its name: one, and the second group would become ordinal number 1.<\/p>\n
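The two grouping forms can be tried out with a short sketch in Python's re module. Two differences from .NET are assumptions to keep in mind: Python spells a named group (?P<one>...) instead of (?<one>...), and Python numbers every group strictly left to right, so the unnamed group here stays number 2 rather than becoming number 1. The sample field is invented.

```python
import re

field = "=245  10$aTitle"

# Two unnamed groups, referenced by position.
numbered = re.match(r"(=245.{4})(\$a[^$])", field)

# The same expression with the first group named "one" (Python spelling).
named = re.match(r"(?P<one>=245.{4})(\$a[^$])", field)

print(numbered.group(1), "|", numbered.group(2))
# In Python the unnamed group keeps position 2; in .NET it would be group 1.
print(named.group("one"), "|", named.group(2))
print(named.groupdict())
```

Note that [^$] matches exactly one character, so the second group captures only the subfield code and the first character after it.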
Quantifiers<\/strong><\/p>\nQuantifiers are used to set the number of times a character or character group can match.\u00a0 These special values establish how \u201cgreedy\u201d a match will be.\u00a0 Common characters in this class are *, +, ?, and the {} construct.<\/p>\n
Substitutions<\/strong><\/p>\nSubstitutions are utilized for replacement patterns and are how expressions output data captured in groups.\u00a0 In the named example above:<\/p>\n
(?<one>=245.{4})(\\$a[^$])<\/p>\n
If this expression was used in a \u201cFind\u201d, when re-outputting the data back into a \u201creplace\u201d, the user would utilize a \u201c$ number\u201d or \u201c$ name\u201d construct to reference the captured group.\u00a0 For example:<\/p>\n
${one} to output the named group, or $1 to output the unnamed group.<\/p>\n
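Here is a sketch of the same find-and-replace idea in Python's re module. The replacement syntax is different from .NET and is the main assumption to watch: Python writes \g<one> where .NET writes ${one}, and \2 where the .NET expression above would use $1. The sample field and the 246 retagging are invented for illustration.

```python
import re

field = "=245  10$aTitle"
# Same expression as above, with Python's spelling of the named group.
pattern = r"(?P<one>=245.{4})(\$a[^$])"

# Re-output both captured groups: the field comes back unchanged.
unchanged = re.sub(pattern, r"\g<one>\2", field)

# Keep only the second group and supply a new tag and indicators in place
# of the first, turning the 245 into a 246.
retagged = re.sub(pattern, r"=246  30\2", field)

print(unchanged)
print(retagged)
```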
Most Common Expression Elements used with MarcEdit<\/h3>\n
The .NET regular expression language is large, and can be confusing.\u00a0 With so many options, one might ask, which are the most important?\u00a0 Based on an evaluation of the most common expressions written in response to questions on the MarcEdit listserv, the most common elements are:<\/p>\n
\n- Character escapes: \\d\\r\\n\\$\\x##<\/li>\n
- Character Classes [] & [^]<\/li>\n
- Grouping Elements ()<\/li>\n
- Anchors: ^$<\/li>\n
- Quantifiers: *?+{#}<\/li>\n
- Substitutions: $#<\/li>\n<\/ul>\n
While other elements are certainly used, these are the ones that appear most often.<\/p>\n
Examples<\/h3>\n
Regular expressions represent one of the most powerful tools in your MarcEdit arsenal.\u00a0 Learning how to use them, and when they can make a difference, will turn a nearly impossible editing task into one that can be accomplished with a single expression. How can you get to that point?\u00a0 Like any language, exposure and practice will help. Ask questions on the listserv.\u00a0 I\u2019ve included some examples below that demonstrate different regular expressions and how they might be used within MarcEdit.<\/p>\n
Example 1:<\/em><\/p>\nAdd a period to the 500 if it is missing<\/p>\n
Find: (=500.*[^\\W])$ Replace: $1.<\/p>\n
Why does this work?<\/em><\/p>\nThe key part of this expression is found at the end \u2013 the [^\\W]$.\u00a0 This tells the expression that I explicitly want to evaluate the last character in the string, and that the expression should evaluate as true if, and only if, the last character in the string is a word character, that is, not punctuation or any other non-word character.<\/p>\n
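Example 1 can be tried outside MarcEdit with a short Python sketch. The function name add_final_period and the sample fields are invented for illustration, and the replacement is written \1. because Python uses a backslash where the .NET replace pattern uses $1.

```python
import re

def add_final_period(field: str) -> str:
    """Append a period to a 500 field whose last character is a word character."""
    # [^\W] is a double negative: "not a non-word character", i.e. a word character.
    return re.sub(r"(=500.*[^\W])$", r"\1.", field)

# Sample fields in mnemonic format (raw strings, so each backslash is literal).
missing = add_final_period(r"=500  \\$aIncludes index")
present = add_final_period(r"=500  \\$aIncludes index.")
bracket = add_final_period(r"=500  \\$aReprint (1990)")
other   = add_final_period("=245  10$aTitle")

print(missing)
print(present)
print(bracket)
print(other)
```

Fields that already end in a period, or in any other non-word character such as a closing parenthesis, are left alone, as are fields with a different tag.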
Example 2:<\/em><\/p>\nDelete all non-LC Subjects using the Add\/Delete Field Function<\/p>\n
Field: 6xx Field Data: ^=6[0-9]{2}.{3}[^7]<\/p>\n
Why this works:<\/em><\/p>\nThe field part of this is not part of the regular expression.\u00a0 In the Add\/Delete Field function, fields can be referenced as groups.\u00a0 The expression is what follows in the field data.\u00a0 When working with the Add\/Delete Field function, all data is exposed to the regular expression engine.\u00a0 So, the expression sets an anchor that says the field must begin with an equal-sign \u201c=\u201d, then matches any 6xx field.\u00a0 Next, we use the .{3} syntax to match the next three characters, which are the two spaces required by MarcEdit\u2019s mnemonic format and the first indicator.\u00a0 The last value, [^7], tells the expression engine to match any line that does not have a second indicator of 7.<\/p>\n
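To check which lines the Example 2 expression selects, here is a minimal Python sketch. It only reports which sample fields match. The deletion itself is MarcEdit's job and is not modelled here, and the sample fields are invented for illustration.

```python
import re

# The Field Data expression from Example 2, unchanged.
pattern = re.compile(r"^=6[0-9]{2}.{3}[^7]")

# Sample fields in mnemonic format (raw strings keep the blank-indicator
# backslash literal).
fields = [
    r"=650  \0$aLibraries",            # second indicator 0
    r"=650  \7$aBibliotheken$2gnd",    # second indicator 7
    r"=600  17$aDoe, Jane$2local",     # second indicator 7
    r"=245  10$aTitle",                # not a 6xx field
]

# Keep the fields the expression matches: any 6xx field whose second
# indicator is anything other than 7.
matched = [f for f in fields if pattern.search(f)]

print(matched)
```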
Scoping<\/h3>\n
Within MarcEdit, how much data is exposed for regular expression evaluation will depend on the global editing function being utilized.\u00a0 Generally, the global editing functions are scoped as follows:<\/p>\n
\n- Find\/Replace: Regular Expression engine can read all record data. By default, expressions are scoped to a single field, but multi-line expressions can be written, changing the scope from a single field to the entire record.<\/li>\n
- Add\/Delete Field Function: The expression engine can see all data in a field, including the equal-sign at the start of the field.<\/li>\n
- Edit Field Function: The Expression engine can see subfield data.\u00a0 No field or indicator data is visible to the Edit Field Function.\u00a0 If you need to evaluate indicator data as part of an expression, you should use the Replace Function.<\/li>\n
- Edit Subfield Function: The Expression engine can only see the selected subfield, including the actual subfield code.<\/li>\n
- Copy Field Data: The Expression engine can see all field data.<\/li>\n
- Build New Field: Regular Expressions are limited to the specific parts of a record being used to build a new field.<\/li>\n<\/ul>\n
<\/p>\n","protected":false},"excerpt":{"rendered":"
Chapter 2: Working with Regular Expressions In this Chapter: Getting Started with Regular Expressions in MarcEdit Overview of the .NET Expression Language Most Common Expression Elements Examples Scoping Getting Started with Regular Expressions in MarcEdit For as long as I can remember, MarcEdit has included some flavor of regular expression support as part […]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":369,"menu_order":1,"comment_status":"closed","ping_status":"closed","template":"","meta":[],"_links":{"self":[{"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/pages\/520"}],"collection":[{"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/comments?post=520"}],"version-history":[{"count":6,"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/pages\/520\/revisions"}],"predecessor-version":[{"id":553,"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/pages\/520\/revisions\/553"}],"up":[{"embeddable":true,"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/pages\/369"}],"wp:attachment":[{"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/media?parent=520"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}