windows-1253 is not detected #197
Description
Problem:
The program does not detect the "windows-1253" encoding. Any text encoded in either "ISO-8859-7" or "windows-1253" is reported as "ISO-8859-7", which makes any reference to "windows-1253" useless.
The only real differences between "ISO-8859-7" and "windows-1253" are at the following positions of the Character Mapping Table:
0xA2, 0xB5, 0xB6
In the Character Mapping Table for "ISO-8859-7", '\u0386' (GREEK CAPITAL LETTER ALPHA WITH TONOS) is at position 0xB6, while in the table for "windows-1253" the same letter is at position 0xA2.
In the Character Mapping Table for "ISO-8859-7", position 0xA2 holds the value 90, which indicates that '\u2019' (RIGHT SINGLE QUOTATION MARK), the character at that position in "ISO-8859-7", is not treated as punctuation.
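To make the byte positions above concrete, here is a small sketch that decodes the three bytes with Python's built-in codecs (`iso8859_7` and `cp1253` are the standard library's names for the two encodings). The helper name `compare_positions` is made up for illustration and is not part of the detector.

```python
def compare_positions(positions=(0xA2, 0xB5, 0xB6)):
    """Map each byte value to the pair (ISO-8859-7 char, windows-1253 char)."""
    result = {}
    for byte in positions:
        raw = bytes([byte])
        # Decode the same single byte with both standard-library codecs.
        result[byte] = (raw.decode("iso8859_7"), raw.decode("cp1253"))
    return result


def format_positions(table):
    """Render the comparison as one line per byte, using code points."""
    return [
        f"0x{byte:02X}  ISO-8859-7: U+{ord(iso):04X}  windows-1253: U+{ord(win):04X}"
        for byte, (iso, win) in sorted(table.items())
    ]
```

Calling `print("\n".join(format_positions(compare_positions())))` lists the code point that each encoding assigns to 0xA2, 0xB5 and 0xB6.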
How to repeat:
Save a UTF-8 text, written in Greek and containing '\u0386' (GREEK CAPITAL LETTER ALPHA WITH TONOS) at least once, to two different files, one encoded as "ISO-8859-7" and the other as "windows-1253" (three texts are included as attachments). Then run the detection on both files: each one is reported as "ISO-8859-7".
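The reproduction steps could be scripted roughly as follows. This is only a sketch: the sample sentence, the file names and the helper names `encode_samples` and `write_samples` are assumptions made for illustration and are not the attached texts.

```python
from pathlib import Path

# Hypothetical sample; any Greek text containing U+0386 should do.
SAMPLE = "Άνθρωπος και Άλογο."

# Python codec name -> hypothetical output file name.
TARGETS = {
    "iso8859_7": "greek-iso-8859-7.txt",
    "cp1253": "greek-windows-1253.txt",
}


def encode_samples(text=SAMPLE):
    """Return the sample text encoded once per target codec."""
    return {codec: text.encode(codec) for codec in TARGETS}


def write_samples(directory, text=SAMPLE):
    """Write one file per encoding into `directory` and return the paths."""
    paths = {}
    for codec, data in encode_samples(text).items():
        path = Path(directory) / TARGETS[codec]
        path.write_bytes(data)
        paths[codec] = path
    return paths
```

Assuming the detector under test is chardet, each file written by `write_samples` can then be checked with `chardet.detect(path.read_bytes())` and the reported `'encoding'` values compared.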
Possible solutions:
- In the Character Mapping Table for "ISO-8859-7", the value at position 0xA2 should be changed from 90 to 253.
- When a good «positive_ratio» is found for "ISO-8859-7", the code should also check the "windows-1253" encoding.
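One possible shape for the second suggestion is a post-check that runs only after "ISO-8859-7" has been guessed. The sketch below is not the detector's actual code: the function `refine_greek_guess` and its voting heuristic are assumptions for illustration, built on the 0xA2 / 0xB6 difference described in this report.

```python
def refine_greek_guess(data: bytes, guess: str) -> str:
    """Hypothetical post-check: re-examine an ISO-8859-7 guess for windows-1253."""
    if guess.upper() != "ISO-8859-7":
        return guess  # Only Greek ISO guesses are re-examined.

    votes_iso = 0
    votes_win = 0
    for current, following in zip(data, data[1:]):
        # The Greek alphabet sits in the range 0xC0-0xFE in both code pages.
        followed_by_letter = following >= 0xC0
        if current == 0xA2 and followed_by_letter:
            # windows-1253 reading: U+0386 starting a word.
            votes_win += 1
        elif current == 0xB6 and followed_by_letter:
            # ISO-8859-7 reading: U+0386 starting a word.
            votes_iso += 1

    return "windows-1253" if votes_win > votes_iso else guess
```

For example, `refine_greek_guess("Άνθρωπος".encode("cp1253"), "ISO-8859-7")` switches the guess to "windows-1253", while the same word encoded as ISO-8859-7 keeps the original guess. A real fix would more likely compare the two language models' confidence values than count single bytes.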