{"id":343,"date":"2016-01-04T20:46:48","date_gmt":"2016-01-04T20:46:48","guid":{"rendered":"http:\/\/marcedit.reeset.net\/learning_marcedit\/?page_id=343"},"modified":"2018-01-03T20:20:55","modified_gmt":"2018-01-03T20:20:55","slug":"dealing-with-character-encodings-in-marcedit","status":"publish","type":"page","link":"https:\/\/marcedit.reeset.net\/learning_marcedit\/9-2\/dealing-with-character-encodings-in-marcedit\/","title":{"rendered":"Chapter 2:  Dealing with Character encodings in MarcEdit"},"content":{"rendered":"<h3>In this Chapter<\/h3>\n<ul>\n<li><strong>Getting Started<\/strong><\/li>\n<li><strong>Working with Charactersets<\/strong><\/li>\n<li><strong>Available Charactersets<\/strong><\/li>\n<li><strong>Getting Help<\/strong><\/li>\n<\/ul>\n<h3>Getting Started<\/h3>\n<p>Characterset support has long been a thorn in the side of many technical services librarians, because MARC data comes in so many different flavors and local character encodings. \u00a0For catalogers in North America &#8212; two primary character encodings are dominant: MARC8 and UTF8. \u00a0MARC8, of course, is an imaginary characterset that libraries invented in the 1970s to enable extended Latinate characters to be represented. \u00a0As additional characters were added, the MARC8 format was extended to provide access to Greek, Hebrew, CJK, etc. \u00a0These extensions turned MARC8 into an escape-based character encoding, which means that it&#8217;s virtually useless and unreadable by anyone outside of the library.<\/p>\n<p>Sadly, things don&#8217;t get much better when dealing with UTF8 data. \u00a0In order to preserve round-trip ability with MARC8, most library specifications require the use of the KD notation around UTF8 data &#8212; a notation that separates characters from their diacritics, continuing the trend of making non-English language materials difficult to search and find.<\/p>\n<p>MarcEdit gives users a way to move data between various charactersets and character encodings. 
\u00a0Character encoding options (like UTF8 notation) are set in the MARCEngine Preferences, as noted in Book 1, Chapter 3: Understanding the MarcEdit Preferences[ref]<a href=\"http:\/\/marcedit.reeset.net\/learning_marcedit\/welcome-to-marcedit\/chapter-3-understanding-the-marcedit-preferences\/3\/\" target=\"_blank\" rel=\"noopener\">http:\/\/marcedit.reeset.net\/learning_marcedit\/welcome-to-marcedit\/chapter-3-understanding-the-marcedit-preferences\/3\/<\/a>[\/ref]. \u00a0For characterset conversions, users should use the Characterset Conversion tool found in the MARC Tools window.<\/p>\n<h3>Working with Charactersets<\/h3>\n<div id=\"attachment_347\" style=\"width: 655px\" class=\"wp-caption alignnone\"><a href=\"http:\/\/marcedit.reeset.net\/learning_marcedit\/wp-content\/uploads\/2016\/01\/character_conversions1.png\" rel=\"attachment wp-att-347\"><img aria-describedby=\"caption-attachment-347\" decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-347\" src=\"http:\/\/marcedit.reeset.net\/learning_marcedit\/wp-content\/uploads\/2016\/01\/character_conversions1.png\" alt=\"Figure 1: Characterset Conversions Window\" width=\"645\" height=\"363\" srcset=\"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-content\/uploads\/2016\/01\/character_conversions1.png 645w, https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-content\/uploads\/2016\/01\/character_conversions1-300x169.png 300w, https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-content\/uploads\/2016\/01\/character_conversions1-624x351.png 624w\" sizes=\"(max-width: 645px) 100vw, 645px\" \/><\/a><p id=\"caption-attachment-347\" class=\"wp-caption-text\">Figure 1: Characterset Conversions Window<\/p><\/div>\n<p>The MarcEdit Characterset Conversions tool provides a batch method for changing data from one characterset to another. 
\u00a0The program makes a couple of assumptions:<\/p>\n<ol>\n<li>That all the data in the file is in the same characterset<\/li>\n<li>That the user knows the original characterset encoding<\/li>\n<\/ol>\n<p>[table]<a href=\"http:\/\/marcedit.reeset.net\/learning_marcedit\/wp-content\/uploads\/2013\/05\/tip.png\"><img decoding=\"async\" loading=\"lazy\" src=\"http:\/\/marcedit.reeset.net\/learning_marcedit\/wp-content\/uploads\/2013\/05\/tip.png\" alt=\"tip\" width=\"85\" height=\"75\" \/><\/a>\u00a0[attr style=&#8221;width:90px&#8221;], &#8220;MarcEdit does not have an automated tool for determining the characterset for your file. \u00a0This is partly because at the binary level, many charactersets look the same. \u00a0&#8220;[\/table]<\/p>\n<p>By default, MarcEdit&#8217;s Original Encoding and Final Encoding dropdown boxes are populated with the most common charactersets encountered by the MarcEdit user community.<\/p>\n<div id=\"attachment_349\" style=\"width: 755px\" class=\"wp-caption alignnone\"><a href=\"http:\/\/marcedit.reeset.net\/learning_marcedit\/wp-content\/uploads\/2016\/01\/available_chara_sets.png\" rel=\"attachment wp-att-349\"><img aria-describedby=\"caption-attachment-349\" decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-349\" src=\"http:\/\/marcedit.reeset.net\/learning_marcedit\/wp-content\/uploads\/2016\/01\/available_chara_sets.png\" alt=\"Figure 2: Available Charactersets\" width=\"745\" height=\"642\" srcset=\"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-content\/uploads\/2016\/01\/available_chara_sets.png 745w, https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-content\/uploads\/2016\/01\/available_chara_sets-300x259.png 300w, https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-content\/uploads\/2016\/01\/available_chara_sets-624x538.png 624w\" sizes=\"(max-width: 745px) 100vw, 745px\" \/><\/a><p id=\"caption-attachment-349\" class=\"wp-caption-text\">Figure 2: Available Charactersets<\/p><\/div>\n<p>Charactersets 
are defined by the codepage (e.g., 1252) and then their human-readable description. \u00a0The codepage part is important &#8212; while this list represents the most common set of charactersets encountered by MarcEdit users, it is by no means an exhaustive list. \u00a0Each operating system provides a set of supported charactersets, and MarcEdit can use any characterset supported by the operating system so long as the codepage number is entered into the Original or Final Encoding box. \u00a0For Windows users, Microsoft makes a list of supported codepages available in the Windows knowledge-base[ref]<a href=\"https:\/\/msdn.microsoft.com\/en-us\/library\/dd317756(VS.85).aspx\" target=\"_blank\" rel=\"noopener\">https:\/\/msdn.microsoft.com\/en-us\/library\/dd317756(VS.85).aspx<\/a>[\/ref].<\/p>\n<p>To use this function, follow these steps:<\/p>\n<ol>\n<li>Select the file to process and set it in the Source Text Box.<\/li>\n<li>Set a save path in the Destination Text Box. \u00a0This cannot be the same as the source file.<\/li>\n<li>Set the Original File Encoding. \u00a0This is the encoding of the Source File.<\/li>\n<li>Set the Final Encoding. \u00a0This is the encoding that the data should end up in.<\/li>\n<\/ol>\n<h4>Important Notes<\/h4>\n<p>Switching between charactersets can be tricky, and a handful of issues can cause hard-to-find problems.<\/p>\n<ol>\n<li>When converting to a characterset that is not UTF8 or MARC8, it is generally better to convert your data to UTF8 first, and then convert to the final encoding. \u00a0You can go straight to your desired encoding, but if something goes wrong, it&#8217;s easier to debug the two-step process.<\/li>\n<li>MarcEdit&#8217;s character conversion tool will not work if any structural errors exist in the record. 
Encoding changes require the evaluation of very precise sequences. If the file structure is incorrect, the conversion will very likely be incorrect as well.<\/li>\n<li>Remember, MARC field length and record length are calculated in bytes, not characters. \u00a0In MARC8, these two are generally the same thing for Latin-script data. \u00a0When dealing with UTF8 data, they are not. \u00a0For records already approaching record or field limits, translating large numbers of diacritics to UTF8 could cause structural errors (though this doesn&#8217;t happen often, and mostly occurs when moving data from XML into MARC).<\/li>\n<\/ol>\n<h3>Getting Help<\/h3>\n<p>So you have a data file and you want to encode it into UTF8 but don&#8217;t know the original character encoding&#8230; that is a problem. \u00a0Unfortunately, MarcEdit can&#8217;t automatically detect the file&#8217;s characterset. \u00a0This is partly because many charactersets look identical at the binary level. However, I work with enough of them that I may be able to give you some help.<\/p>\n<ul>\n<li><em>From Z39.50<\/em>: Surprisingly, many Z39.50 servers return data in\u00a028591 (ISO-8859-1) format (especially Innovative Interfaces servers, unless requested otherwise). \u00a0This looks a lot like MARC8, but it isn&#8217;t. \u00a0If you try processing data to UTF8 using MARC8 and the data isn&#8217;t correct, try using this setting (or 1252 (US\/Western European) if the 28591 codepage doesn&#8217;t work).<\/li>\n<li><em>My codepage is no longer in use<\/em>: \u00a0This sometimes happens. \u00a0If you look at the MarcEdit supported codepages, you&#8217;ll see one such codepage on this list:\u00a05426 (ISO-5426). 
\u00a0Codepage 5426[ref]<a href=\"http:\/\/www.iso.org\/iso\/iso_catalogue\/catalogue_tc\/catalogue_detail.htm?csnumber=11468\" target=\"_blank\" rel=\"noopener\">http:\/\/www.iso.org\/iso\/iso_catalogue\/catalogue_tc\/catalogue_detail.htm?csnumber=11468<\/a>[\/ref] is an extension of the Latin alphabet used primarily in minor European languages and obsolete typography. \u00a0However, there is a significant number of UNIMARC records in France that are stuck in this legacy format. \u00a0To support the transition to UTF8, I added this to MarcEdit. \u00a0So, if you run across a codepage that is no longer supported, let me know. \u00a0There may be options.<\/li>\n<\/ul>\n<p>Finally, the best way to get information about a file&#8217;s characterset encoding is to ask the individual or agency that provided you with the file. \u00a0Good luck!<!--nextpage--><\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this Chapter Getting Started Working with Charactersets Available Charactersets Getting Help Getting Started Characterset support has long been a thorn in the side of many technical services librarians, because MARC data comes in so many different flavors and local character encodings. 
\u00a0For catalogers in North America &#8212; two primary character encodings are dominant: MARC8 [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":9,"menu_order":1,"comment_status":"closed","ping_status":"closed","template":"","meta":[],"_links":{"self":[{"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/pages\/343"}],"collection":[{"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/comments?post=343"}],"version-history":[{"count":13,"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/pages\/343\/revisions"}],"predecessor-version":[{"id":552,"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/pages\/343\/revisions\/552"}],"up":[{"embeddable":true,"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/pages\/9"}],"wp:attachment":[{"href":"https:\/\/marcedit.reeset.net\/learning_marcedit\/wp-json\/wp\/v2\/media?parent=343"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}