UTF-8 files with a single accented character are incorrectly detected as ISO-8859-1 or MacRoman #308
Description
I encountered a bug due to chardet detecting a UTF-8 XML file with a single non-ASCII character (é) as MacRoman. This was very surprising to me, because the byte sequence \xc3\xa9 is obviously very uncommon in MacRoman text.
```python
>>> chardet.detect(b'___!" (Pok\xc3\xa9mon slogan)')
{'encoding': 'MacRoman', 'confidence': 0.5087878787878788, 'language': ''}
```
If the problem were limited to a very short example like this, it might be a non-issue because the sample is just not large enough to be detected accurately. But that's not the case - this example is shortened from a real XML file containing a crossword in the Crossword Compiler XML format, and the whole file detects as MacRoman with an even higher confidence:
```
$ chardetect xword.bin
xword.bin: MacRoman with confidence 0.7265238095238096
```
I did some debugging. I was able to reproduce the problem with a random source of ASCII text (the Project Gutenberg version of the Declaration of Independence).
```python
import chardet
import requests

r = requests.get("https://gutenberg.org/cache/epub/16780/pg16780.txt")
c = r.content
# trim off BOM and footer text (which contains a trademark character)
print(chardet.detect(c[3:5000] + "é".encode() + c[5001:10000]))
```

This prints:

```
{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
```
I quickly modified chardet to print the prober confidences, which are as follows:
```
utf-8       0.505
ISO-8859-9  0.6261635566264946
ISO-8859-1  0.73
MacRoman    0.7292705835331734
```
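As an aside, recent chardet releases (4.0 and later, an assumption about the installed version) expose `chardet.detect_all()`, which returns every candidate ranked by confidence, so the competing probers can be inspected without patching the library:

```python
import chardet

# detect_all() (available in chardet >= 4.0) returns all candidate
# encodings ranked by confidence, not just the single best guess.
results = chardet.detect_all(b'___!" (Pok\xc3\xa9mon slogan)')
for r in results:
    print(r["encoding"], r["confidence"])
```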
I see two issues here:
- MacRoman recovers from its low prior probability far too quickly. It should not be even close to competitive with latin-1, IMO, unless there are very clear signs that it is the correct encoding.
- The single-byte encodings should not be rewarded for rare sequences like \xc3\xa9, which decodes to √© in MacRoman and Ã© in latin-1. I doubt there are more than a few legitimate files in existence that contain those characters in that sequence. Meanwhile, UTF-8 is an overwhelmingly popular encoding; a valid two-byte sequence encoding an accented character is very strong evidence that the file is UTF-8.
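To make the "very strong evidence" point concrete, here is a minimal sketch (my own illustration, not chardet code) of how rigid a two-byte UTF-8 sequence is: the lead byte must fall in 0xC2–0xDF and the continuation byte in 0x80–0xBF, so fewer than 3% of random byte pairs qualify:

```python
def is_two_byte_utf8(lead: int, cont: int) -> bool:
    """True if the byte pair forms a valid two-byte UTF-8 sequence.

    Lead bytes 0xC0 and 0xC1 are excluded because they would
    encode overlong (invalid) forms.
    """
    return 0xC2 <= lead <= 0xDF and 0x80 <= cont <= 0xBF

# b"\xc3\xa9" is exactly the UTF-8 encoding of é
assert is_two_byte_utf8(0xC3, 0xA9)
assert b"\xc3\xa9".decode("utf-8") == "é"

# Count how many of the 65536 possible byte pairs have this shape.
valid = sum(is_two_byte_utf8(a, b) for a in range(256) for b in range(256))
print(valid / 256**2)  # 1920 / 65536 ≈ 0.029
```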
If you add another é, the problem starts to go away. The confidence appears to double with every valid accented character you add. But this is a weird jump, right? A file with two é characters is not way more likely to be UTF-8 than a file containing just one.
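A quick way to reproduce the jump (assuming chardet is installed; the exact confidence values will vary by version):

```python
import chardet

# Mostly-ASCII text, with an increasing number of accented characters
# appended, to watch the reported confidence change.
base = b"The quick brown fox jumps over the lazy dog. " * 40
for n in (1, 2, 3):
    sample = base + "é".encode("utf-8") * n
    print(n, chardet.detect(sample))
```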