{"id":1210,"date":"2018-10-30T16:24:15","date_gmt":"2018-10-30T23:24:15","guid":{"rendered":"http:\/\/cknotes.com\/?p=1210"},"modified":"2020-05-04T05:21:00","modified_gmt":"2020-05-04T12:21:00","slug":"never-handle-non-text-binary-data-as-a-string","status":"publish","type":"post","link":"https:\/\/cknotes.com\/never-handle-non-text-binary-data-as-a-string\/","title":{"rendered":"Never Handle non-Text Binary Data as a String"},"content":{"rendered":"<p>The bytes of a binary file, such as a JPG, PDF, etc. should never be treated as a string.\u00a0 Loading a binary file into a string, and then saving back to a binary file will surely result in a different file that is corrupted. \u00a0 This rule should be followed for all programming languages.\u00a0 <strong>Don&#8217;t treat binary non-text bytes as text characters.<\/strong><\/p>\n<p>Consider this Visual FoxPro code:<\/p>\n<pre>LOCAL x\r\nx = FILETOSTR( \"in.pdf\" )\r\nSTRTOFILE( x, \"out.pdf\" )\r\n<\/pre>\n<p>It is likely that out.pdf is corrupt. Here&#8217;s why:<\/p>\n<p>When reading a text file, the bytes must be interpreted according to some character encoding.<\/p>\n<p>For example, consider this character: \u00c9<br \/>\nIn the windows-1252 character encoding, it is represented by a single byte: 0xC9<br \/>\nIn the utf-8 character encoding, it is represented by two bytes: 0xC3 0x89<br \/>\nIn the utf-16 character encoding, it is represented by two bytes: 0x00 0xC9 (big-endian byte order)<\/p>\n<p>In this case, FoxPro is probably assuming ANSI (i.e. Windows-1252 for USA\/Western European computers, 1 byte per char). Internally, FoxPro most likely holds strings in the utf-16 byte representation. 
Therefore, each incoming ANSI byte is converted to 2-byte per char utf-16.<\/p>\n<p>Now have a look at the Windows-1252 charset:<\/p>\n<p><a href=\"https:\/\/cknotes.com\/wp-content\/uploads\/2018\/10\/windows1252.gif\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1211\" src=\"https:\/\/cknotes.com\/wp-content\/uploads\/2018\/10\/windows1252.gif\" alt=\"windows-1252\" width=\"640\" height=\"845\" \/><\/a><\/p>\n<p>Notice the &#8220;NOT USED&#8221; bytes, such as 8D, 9D, 8E, 9E, etc.<\/p>\n<p>These byte values will never appear in valid Windows-1252 text.\u00a0 However, they will likely appear in a binary non-text file.\u00a0 If the binary file is large enough, you can be sure these bytes will be present.\u00a0 They&#8217;ll likely get converted to a &#8220;?&#8221; char.\u00a0 That&#8217;s why you see &#8220;?&#8221; or some other standard char when non-text is loaded.<\/p>\n<p>When you write the text back to the file, all of the &#8220;NOT USED&#8221; bytes are written as &#8220;?&#8221; chars.\u00a0 This is the corruption.\u00a0 By trying to handle binary data as text, incoming bytes are implicitly converted to the byte representation used to hold strings (likely utf-8 or utf-16).\u00a0 Writing the file (in the case of STRTOFILE) involves an implicit conversion to the 1-byte per char ANSI representation.<\/p>\n<p>The round-trip of ANSI &#8211;&gt; Internal Representation &#8211;&gt; ANSI corrupts the data.<\/p>\n<p>Also, it doesn&#8217;t matter what charsets are involved.\u00a0 It could be utf-8, utf-16, etc.\u00a0 Reading a text file implicitly involves interpreting bytes according to a charset, and if those bytes don&#8217;t actually represent text in the given charset, impossible byte values or byte sequences will be present that cause some sort of error char to be substituted (or the error sequences are simply dropped), and the round-trip always results in corruption.<\/p>\n<p>The rule that should never be broken is: never 
treat binary data as text.\u00a0 Don&#8217;t use string data types to hold non-text binary data.<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The bytes of a binary file, such as a JPG, PDF, etc. should never be treated as a string.\u00a0 Loading a binary file into a string, and then saving back to a binary file will surely result in a different file that is corrupted. \u00a0 This rule should be followed for all programming languages.\u00a0 Don&#8217;t [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[752],"tags":[666,664,665,667],"class_list":["post-1210","post","type-post","status-publish","format-standard","hentry","category-character-encoding","tag-binary","tag-filetostr","tag-strtofile","tag-text"],"_links":{"self":[{"href":"https:\/\/cknotes.com\/wp-json\/wp\/v2\/posts\/1210","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cknotes.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cknotes.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cknotes.com\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/cknotes.com\/wp-json\/wp\/v2\/comments?post=1210"}],"version-history":[{"count":1,"href":"https:\/\/cknotes.com\/wp-json\/wp\/v2\/posts\/1210\/revisions"}],"predecessor-version":[{"id":1212,"href":"https:\/\/cknotes.com\/wp-json\/wp\/v2\/posts\/1210\/revisions\/1212"}],"wp:attachment":[{"href":"https:\/\/cknotes.com\/wp-json\/wp\/v2\/media?parent=1210"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cknotes.com\/wp-json\/wp\/v2\/categories?post=1210"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cknotes.com\/wp-json\/wp\/v2\/tags?post=1210"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}