UTF-8 files with a single accented character are incorrectly detected as ISO-8859-1 or MacRoman #308
Description
I encountered a bug due to chardet detecting a UTF-8 XML file with a single non-ASCII character (é) as MacRoman. This was very surprising to me, because the byte sequence \xc3\xa9 is obviously very uncommon in MacRoman text.
```python
>>> chardet.detect(b'___!" (Pok\xc3\xa9mon slogan)')
{'encoding': 'MacRoman', 'confidence': 0.5087878787878788, 'language': ''}
```
If the problem were limited to a very short example like this, it might be a non-issue because the sample is just not large enough to be detected accurately. But that's not the case - this example is shortened from a real XML file containing a crossword in the Crossword Compiler XML format, and the whole file detects as MacRoman with an even higher confidence:
```
$ chardetect xword.bin
xword.bin: MacRoman with confidence 0.7265238095238096
```
I did some debugging. I was able to reproduce the problem with a random source of ASCII text (the Project Gutenberg version of the Declaration of Independence).
```python
import chardet
import requests

r = requests.get("https://gutenberg.org/cache/epub/16780/pg16780.txt")
c = r.content
# trim off BOM and footer text (which contains a trademark character)
print(chardet.detect(c[3:5000] + "é".encode() + c[5001:10000]))
```

This prints:

```
{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
```
I quickly modified chardet to print the prober confidences, which are as follows:
```
utf-8       0.505
ISO-8859-9  0.6261635566264946
ISO-8859-1  0.73
MacRoman    0.7292705835331734
```
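As an aside, recent chardet releases (4.0 and later, an assumption about the installed version) expose `chardet.detect_all()`, which returns every candidate ranked by confidence, so the competing probers can be inspected without patching the library:

```python
import chardet

# detect_all() (available in chardet >= 4.0) returns all candidate
# encodings ranked by confidence, not just the single best guess.
results = chardet.detect_all(b'___!" (Pok\xc3\xa9mon slogan)')
for r in results:
    print(r["encoding"], r["confidence"])
```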
I see two issues here:
- MacRoman recovers from its low prior probability far too quickly. It should not be even close to competitive with latin-1, IMO, unless there are very clear signs that it is the correct encoding.
- The single-byte encodings should not be rewarded for rare sequences like \xc3\xa9, which decodes to √© in MacRoman and Ã© in latin-1. I doubt there are more than a few legitimate files in existence that contain those characters in that sequence. Meanwhile, UTF-8 is an overwhelmingly popular encoding; a valid two-byte sequence encoding an accented character is very strong evidence that the file is UTF-8.
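To make the "very strong evidence" point concrete, here is a minimal sketch (my own illustration, not chardet code) of how rigid a two-byte UTF-8 sequence is: the lead byte must fall in 0xC2–0xDF and the continuation byte in 0x80–0xBF, so fewer than 3% of random byte pairs qualify:

```python
def is_two_byte_utf8(lead: int, cont: int) -> bool:
    """True if the byte pair forms a valid two-byte UTF-8 sequence.

    Lead bytes 0xC0 and 0xC1 are excluded because they would
    encode overlong (invalid) forms.
    """
    return 0xC2 <= lead <= 0xDF and 0x80 <= cont <= 0xBF

# b"\xc3\xa9" is exactly the UTF-8 encoding of é
assert is_two_byte_utf8(0xC3, 0xA9)
assert b"\xc3\xa9".decode("utf-8") == "é"

# Count how many of the 65536 possible byte pairs have this shape.
valid = sum(is_two_byte_utf8(a, b) for a in range(256) for b in range(256))
print(valid / 256**2)  # 1920 / 65536 ≈ 0.029
```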
If you add another é, the problem starts to go away. The confidence appears to double with every valid accented character you add. But this is a weird jump, right? A file with two é characters is not way more likely to be UTF-8 than a file containing just one.
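A quick way to reproduce the jump (assuming chardet is installed; the exact confidence values will vary by version):

```python
import chardet

# Mostly-ASCII text, with an increasing number of accented characters
# appended, to watch the reported confidence change.
base = b"The quick brown fox jumps over the lazy dog. " * 40
for n in (1, 2, 3):
    sample = base + "é".encode("utf-8") * n
    print(n, chardet.detect(sample))
```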