Do not use chardet to detect encoding because of poor accuracy #18

@zgoda

Please do not use chardet to detect document encoding. For UTF-8 text it works more or less reliably only for the Basic Latin and Latin-1 Supplement Unicode blocks. For Latin Extended-A and Latin Extended-B it fails in about 50% of cases, wrongly detecting Windows encodings. For example, for a UTF-8 document with Latin Extended-A content:

$ chardet docs/index.rst 
docs/index.rst: Windows-1252 with confidence 0.594336283186
$ file docs/index.rst 
docs/index.rst: UTF-8 Unicode text

To be fair, chardet does report low confidence (about 0.59) in this case.
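
One possible way to avoid this kind of misdetection is to attempt a strict UTF-8 decode before guessing anything else, since text in a legacy single-byte encoding with non-ASCII characters only rarely happens to be valid UTF-8. The snippet below is a minimal sketch under that assumption, not what this project currently does. The function name `decode_utf8_first` and the `cp1252` fallback are hypothetical choices made for illustration.

```python
from typing import Tuple


def decode_utf8_first(data: bytes, fallback: str = "cp1252") -> Tuple[str, str]:
    """Return (text, encoding_used), preferring strict UTF-8.

    Hypothetical helper: the fallback encoding is an assumption and
    should be whatever makes sense for the documents being processed.
    """
    try:
        # Strict decoding raises UnicodeDecodeError on any invalid byte sequence.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so fall back and replace anything undecodable.
        return data.decode(fallback, errors="replace"), fallback


# Latin Extended-A content encoded as UTF-8, similar to the case reported above.
sample = "Zażółć gęślą jaźń".encode("utf-8")
text, encoding = decode_utf8_first(sample)
```

With this approach a detector would only ever be consulted for input that has already failed UTF-8 validation.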
