
UTF-8 files with a single accented character are incorrectly detected as ISO-8859-1 or MacRoman #308

@afontenot

Description


I encountered a bug where chardet detects a UTF-8 XML file containing a single non-ASCII character (é) as MacRoman. This was very surprising to me, because the byte sequence \xc3\xa9 is obviously very uncommon in MacRoman text.

chardet.detect(b'___!" (Pok\xc3\xa9mon slogan)')
{'encoding': 'MacRoman', 'confidence': 0.5087878787878788, 'language': ''}

If the problem were limited to a very short example like this, it might be a non-issue, because the sample is simply too small to detect accurately. But that's not the case: this example is shortened from a real XML file containing a crossword in the Crossword Compiler XML format, and the whole file is detected as MacRoman with even higher confidence:

chardetect xword.bin
xword.bin: MacRoman with confidence 0.7265238095238096

I did some debugging. I was able to reproduce the problem with a random source of ASCII text (the Project Gutenberg version of the Declaration of Independence).

import chardet
import requests
r = requests.get("https://gutenberg.org/cache/epub/16780/pg16780.txt")
c = r.content
# skip the 3-byte BOM, replace the byte at offset 5000 with é, and stop
# before the footer text (which contains a trademark character)
print(chardet.detect(c[3:5000] + "é".encode() + c[5001:10000]))
{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}

I quickly modified chardet to print the prober confidences, which are as follows:

utf-8 0.505
ISO-8859-9 0.6261635566264946
ISO-8859-1 0.73
MacRoman 0.7292705835331734

I see two issues here:

  • MacRoman recovers from its low prior probability far too quickly. It should not be even close to competitive with latin-1, IMO, unless there are very clear signs that it is the correct encoding.
  • The single byte encodings should not be rewarded for rare sequences like \xc3\xa9. This decodes to √© in MacRoman and é in latin-1. I doubt there are more than a few legitimate files in existence that contain those characters in that sequence. Meanwhile, UTF-8 is an overwhelmingly popular format; a valid two-byte sequence encoding an accented character is very strong evidence that the file is UTF-8.
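To make the second point concrete, here is what the byte pair \xc3\xa9 decodes to under each of the candidate encodings (standard library only):

```python
# Decode the byte pair 0xC3 0xA9 under each encoding chardet considered.
seq = b"\xc3\xa9"
for enc in ("utf-8", "latin-1", "mac_roman"):
    print(f"{enc:10} -> {seq.decode(enc)}")
# utf-8 yields the plausible "é"; latin-1 yields "Ã©"; mac_roman yields "√©".
```

Only the UTF-8 interpretation produces text you would expect to see in a real file.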

If you add another é, the problem starts to go away: the confidence appears to double with every valid accented character you add. But that scaling seems wrong. A file with two é characters is not dramatically more likely to be UTF-8 than a file containing just one.
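To illustrate the kind of evidence I mean, here is a rough stdlib-only sketch (not chardet's actual logic; it also ignores overlong and surrogate edge cases) that counts well-formed versus malformed multibyte UTF-8 sequences. A single well-formed sequence with zero malformed ones is already strong evidence for UTF-8, so confidence should not need to keep doubling:

```python
def utf8_multibyte_evidence(data: bytes):
    """Count well-formed vs malformed multibyte UTF-8 sequences (sketch)."""
    valid = invalid = i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                     # plain ASCII byte
            i += 1
            continue
        # expected number of continuation bytes for each lead byte
        if 0xC2 <= b <= 0xDF:
            n = 1
        elif 0xE0 <= b <= 0xEF:
            n = 2
        elif 0xF0 <= b <= 0xF4:
            n = 3
        else:                            # not a legal UTF-8 lead byte
            invalid += 1
            i += 1
            continue
        tail = data[i + 1:i + 1 + n]
        if len(tail) == n and all(0x80 <= t <= 0xBF for t in tail):
            valid += 1
            i += 1 + n
        else:
            invalid += 1
            i += 1
    return valid, invalid

print(utf8_multibyte_evidence(b'___!" (Pok\xc3\xa9mon slogan)'))  # (1, 0)
```

With one valid sequence and no invalid ones, a detector could already weight UTF-8 heavily instead of letting MacRoman and latin-1 stay competitive.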
