windows-1253 is not detected #197

@Xoristzatziki

Problem:
The program does not detect the "windows-1253" encoding. Any text encoded with either "ISO-8859-7" or "windows-1253" is reported as "ISO-8859-7", which makes any reference to "windows-1253" useless.

The only real differences between "ISO-8859-7" and "windows-1253" lie in the Character Mapping Table at positions 0xA2, 0xB5 and 0xB6.
In the Character Mapping Table for "ISO-8859-7", '\u0386' (GREEK CAPITAL LETTER ALPHA WITH TONOS) is at position 0xB6, while in the table for "windows-1253" the same letter is at position 0xA2.
In the Character Mapping Table for "ISO-8859-7", position 0xA2 holds the value 90, which indicates that '\u2019' (RIGHT SINGLE QUOTATION MARK), the character at that position in "ISO-8859-7", is not treated as punctuation.
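
The byte positions mentioned above can be checked with Python's built-in codecs. This is only a small illustration; `iso8859_7` and `cp1253` are the standard-library codec names, and the variable names are mine.

```python
# Decode the three byte values discussed above with both Greek codecs.
POSITIONS = (0xA2, 0xB5, 0xB6)

decoded = {
    pos: (bytes([pos]).decode("iso8859_7"), bytes([pos]).decode("cp1253"))
    for pos in POSITIONS
}

for pos, (iso_char, win_char) in decoded.items():
    print(f"0x{pos:02X}: ISO-8859-7 -> U+{ord(iso_char):04X}, "
          f"windows-1253 -> U+{ord(win_char):04X}")
```

At 0xA2 the ISO codec gives U+2019 and the Windows codec gives U+0386, while at 0xB6 the ISO codec gives U+0386.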

How to repeat:
Save a UTF-8 text written in Greek that contains '\u0386' (GREEK CAPITAL LETTER ALPHA WITH TONOS) at least once to two different files, one encoded as "ISO-8859-7" and the other as "windows-1253" (three texts are included as attachments).
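
A rough in-memory version of these steps, for anyone who wants to try it without the attachments. The sample word is my own and is far shorter than the attached texts, so the detector may well answer differently on it. The `chardet.detect` call is skipped if the package is not installed.

```python
# Encode the same Greek text (starting with U+0386) in both encodings.
sample = "\u0386\u03bd\u03b8\u03c1\u03c9\u03c0\u03bf\u03c2"  # "Άνθρωπος"

iso_bytes = sample.encode("iso8859_7")
win_bytes = sample.encode("cp1253")

# Only the first byte (the accented capital alpha) differs.
print(iso_bytes.hex(" "))
print(win_bytes.hex(" "))

try:
    import chardet  # optional; only needed for the detection step
except ImportError:
    chardet = None

if chardet is not None:
    for label, data in (("ISO-8859-7", iso_bytes), ("windows-1253", win_bytes)):
        print(label, "->", chardet.detect(data))
```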

Possible solutions:

  1. In the Character Mapping Table for "ISO-8859-7", the value at position 0xA2 should be changed from 90 to 253.
  2. If a good «positive_ratio» is found for "ISO-8859-7", the code should also check the "windows-1253" encoding.
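
To make solution 2 more concrete, here is one possible shape for such a second check. It is only a sketch and not how the library works: `refine_greek_guess` and `GREEK_LETTER_BYTES` are hypothetical names, and the heuristic (counting which candidate byte for '\u0386' is directly followed by a Greek letter) is my own assumption.

```python
# Bytes 0xC1..0xFE hold Greek letters in both code pages (approximation:
# the range also covers the unassigned position 0xD2).
GREEK_LETTER_BYTES = range(0xC1, 0xFF)


def refine_greek_guess(data: bytes, guess: str) -> str:
    """Hypothetical post-check run after "ISO-8859-7" has been detected."""
    if guess.upper() != "ISO-8859-7":
        return guess
    pairs = list(zip(data, data[1:]))
    # 0xA2 before a letter looks like windows-1253 capital alpha with tonos.
    win_hits = sum(1 for a, b in pairs if a == 0xA2 and b in GREEK_LETTER_BYTES)
    # 0xB6 before a letter looks like ISO-8859-7 capital alpha with tonos.
    iso_hits = sum(1 for a, b in pairs if a == 0xB6 and b in GREEK_LETTER_BYTES)
    return "windows-1253" if win_hits > iso_hits else guess
```

For example, `refine_greek_guess(data, "ISO-8859-7")` would switch to "windows-1253" only when 0xA2 appears in letter position more often than 0xB6; any other guess is passed through unchanged.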
