Using unicode purple heart should not cause detection to lose confidence about utf-8 #128

@blueyed

I've noticed that the following file is detected quite differently by chardet 3.0.3 than by 2.3.0:

scriptencoding utf-8
" :purple_heart: 💜
" set list listchars=tab:»·,trail:·,eol:¬,nbsp:_,extends:❯,precedes:❮

chardet 2.3.0:
{'encoding': 'ISO-8859-2', 'confidence': 0.6680924803464797}

chardet 3.0.3:
{'encoding': 'Windows-1254', 'confidence': 0.5658124254347925, 'language': 'Turkish'}

This seems to be related to ISO-8859-2 currently being disabled, but I think the file should be detected as utf-8 in either case.
Removing the Unicode glyph (purple heart, <💜> 128156, Hex 0001f49c, Octal 372234) changes the result to {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}.

(I noticed this through the chardet.detect call used in vint.)
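
A minimal reproduction sketch, in case it helps with triage. The sample text is copied from the file shown above, but the exact bytes of the original file (line endings, trailing newline) are an assumption, so the numbers chardet reports may differ from the ones quoted here. The helper name `is_strict_utf8` is hypothetical and only illustrates that the input is valid UTF-8 under a strict decode; chardet is imported optionally so the script still runs without it.

```python
# Sample text copied from this report; exact original bytes are an assumption.
SAMPLE = (
    'scriptencoding utf-8\n'
    '" :purple_heart: 💜\n'
    '" set list listchars=tab:»·,trail:·,eol:¬,nbsp:_,extends:❯,precedes:❮\n'
)


def is_strict_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as UTF-8 without errors (hypothetical helper)."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


data = SAMPLE.encode("utf-8")
heart = "\U0001F49C"  # purple heart, the glyph mentioned above

# Code point in decimal, hex, and octal, then its UTF-8 byte sequence.
print(ord(heart), hex(ord(heart)), oct(ord(heart)))
print(heart.encode("utf-8"))

# The whole sample is well-formed UTF-8.
print(is_strict_utf8(data))

try:
    import chardet  # third-party; optional for this sketch
except ImportError:
    chardet = None

if chardet is not None:
    print(getattr(chardet, "__version__", "unknown"))
    # Detection with the purple heart present.
    print(chardet.detect(data))
    # Detection with only the purple heart's bytes removed.
    print(chardet.detect(data.replace(heart.encode("utf-8"), b"")))
```

Running this under each chardet version should show whether the result changes only when the four-byte sequence for the purple heart is present.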
