Characterset support has long been a thorn in the side of many technical services librarians, because MARC data comes in so many different flavors and local character encodings. For catalogers in North America, two character encodings are dominant: MARC8 and UTF8. MARC8, of course, is an imaginary characterset that libraries invented in the 1970s to enable extended Latinate characters to be represented. As additional characters were added, the MARC8 format was extended to provide access to Greek, Hebrew, CJK, etc. These extensions turned MARC8 into an escape-based character encoding, which means that it’s virtually useless and unreadable to anyone outside of the library community.

Sadly, things don’t get much better when dealing with UTF8 data. In order to preserve round-trip ability with MARC8, most library specifications require the use of the KD notation for UTF8 data, a notation that separates characters from their diacritics and continues the trend of making non-English language materials difficult to search and find.

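To make the effect of separating characters from their diacritics concrete, here is a minimal sketch using Python's standard `unicodedata` module. It assumes that the KD notation corresponds to Unicode Normalization Form KD (NFKD); the helper names `to_kd` and `to_kc` are made up for this illustration and are not part of MarcEdit.

```python
import unicodedata


def to_kd(text: str) -> str:
    """Decompose text (assumed equivalent of the KD notation: NFKD)."""
    return unicodedata.normalize("NFKD", text)


def to_kc(text: str) -> str:
    """Recompose text into precomposed characters where possible (NFKC)."""
    return unicodedata.normalize("NFKC", text)


def code_points(text: str) -> list:
    """List the code points in a string, e.g. ['U+0063', ...]."""
    return [f"U+{ord(ch):04X}" for ch in text]


composed = "caf\u00e9"          # 'é' stored as one precomposed code point
decomposed = to_kd(composed)    # 'e' followed by a combining acute accent

print(code_points(composed))
print(code_points(decomposed))

# A naive search for the precomposed character misses the decomposed form,
# which is why decomposed data can be harder to search.
print("\u00e9" in composed, "\u00e9" in decomposed)

# Normalizing both sides to the same form before comparing avoids the mismatch.
print(to_kc(decomposed) == composed)
```

The practical point is that two strings that look identical on screen can differ at the code point level, so both the data and the search term need to be normalized the same way before they are compared.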
Working with Charactersets

MarcEdit provides users with a method to move data between various charactersets and character encodings. Character encoding options (like the UTF8 notation) are set in the MARCEngine Preferences, as noted in Book 1, Chapter 3: Understanding the MarcEdit Preferences (http://marcedit.reeset.net/learning_marcedit/welcome-to-marcedit/chapter-3-understanding-the-marcedit-preferences/3/). For characterset conversions, users should use the Characterset Conversion tool found within the MARC Tools window.

Figure 1: Characterset Conversions Window

The MarcEdit Characterset Conversions tool provides a batch method for changing data from one characterset to another. The program makes a couple of assumptions:

Note: MarcEdit does not have an automated tool for determining the characterset of your file. This is partly because, at the binary level, many charactersets look the same.

By default, MarcEdit’s Original Encoding and Final Encoding dropdown boxes are populated with the most common charactersets encountered by the MarcEdit user community.

Figure 2: Available Charactersets

Charactersets are defined by the codepage (e.g., 1252) and then their human-readable description. The codepage part is important: while this list represents the most common charactersets encountered by MarcEdit users, it is by no means an exhaustive list. Each operating system provides a set of supported charactersets. MarcEdit can use any characterset supported by the operating system so long as the codepage number is entered into the Original Encoding or Final Encoding box. For Windows users, Microsoft makes a list of supported codepages available in the Windows knowledge-base (https://msdn.microsoft.com/en-us/library/dd317756(VS.85).aspx).

To use this function, follow these steps:

Important Notes

Switching between charactersets can be tricky, and a handful of issues can cause hard-to-find problems.

Getting Help

So you have a data file and you want to encode it in UTF8, but you don’t know the original character encoding. That is a problem. Unfortunately, MarcEdit can’t automatically detect the file’s characterset. This is partly because many charactersets look identical at the binary level. However, I work with enough of them that I may be able to give you some help.

Finally, the best place to get information about a file’s characterset encoding is to ask the individual or agency that provided you with the file. Good luck!
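To illustrate why a file's original characterset usually cannot be detected automatically, the sketch below decodes the same bytes under several candidate codepages using Python's built-in codecs. It is only an illustration on plain bytes, not a description of how MarcEdit works internally; the function name `try_decodings` and the sample text are assumptions made for this example.

```python
def try_decodings(raw: bytes, candidates=("utf-8", "cp1250", "cp1251", "cp1252")):
    """Decode raw bytes under each candidate encoding.

    Returns a dict mapping the encoding name to the decoded text,
    or to None when the bytes are not valid in that encoding.
    """
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Sample data: a name encoded with codepage 1250 (Central European).
raw = "Dvořák".encode("cp1250")

for name, text in try_decodings(raw).items():
    print(f"{name:8} -> {text!r}")

# Once the original codepage is known, converting to UTF-8 is a
# decode followed by an encode.
utf8_bytes = raw.decode("cp1250").encode("utf-8")
print(utf8_bytes.decode("utf-8"))
```

Several single-byte codepages decode these bytes without raising an error, yet only one of them yields the intended text. Nothing in the bytes themselves says which one is right, which is why knowing the source of the file matters so much.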