No support for UHC for Korean #164

@cslycord

Unified Hangul Code (UHC), which Python also calls CP949/949/MS949, is an encoding for Korean text. It's a superset of EUC-KR (which chardet supports) and covers over 8000 additional Hangul syllables that can't be encoded in EUC-KR.

Because of this, some files detected as EUC-KR end up with the occasional character displaying incorrectly, because they contain characters that exist only in UHC. I've also seen other UHC subtitle files of mine detected as Turkish (Windows-1254), in which case decoding fails completely.
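To illustrate the failure, here is a minimal sketch using only Python's built-in codecs (no chardet involved). The sample string and the helper name `decodes_cleanly` are my own choices for illustration; "똠" is, as far as I know, one of the syllables that UHC adds on top of EUC-KR.

```python
# Sample text chosen for illustration: "똠" is assumed to be one of the
# Hangul syllables that UHC adds and EUC-KR (KS X 1001) lacks.
sample = "똠방각하"
uhc_bytes = sample.encode("cp949")


def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes under `encoding` in strict mode."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# The UHC bytes decode with the cp949 codec but not with euc_kr.
print("cp949 :", decodes_cleanly(uhc_bytes, "cp949"))
print("euc_kr:", decodes_cleanly(uhc_bytes, "euc_kr"))

# Text that fits in EUC-KR decodes to the same string under cp949.
plain_bytes = "한글".encode("euc_kr")
print("same text:", plain_bytes.decode("cp949") == plain_bytes.decode("euc_kr"))
```

A file with only a few such syllables would decode correctly everywhere else, which matches the "occasional character" symptom.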

Also, since EUC-KR is a strict subset of UHC, anything that is detected as EUC-KR can safely be decoded as UHC.
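Until UHC is detected directly, a caller-side workaround could look like the sketch below. The helper name `codec_for` is hypothetical and not part of chardet. It assumes the detected name comes from something like `chardet.detect(raw)["encoding"]`, and it only covers files already reported as EUC-KR, not the ones misdetected as Windows-1254.

```python
from typing import Optional


def codec_for(detected: Optional[str]) -> Optional[str]:
    """Map a detector's reported encoding name to the codec to decode with.

    Hypothetical helper: widens EUC-KR to its superset cp949 (UHC) and
    passes every other name through unchanged.
    """
    if detected is not None and detected.replace("_", "-").upper() == "EUC-KR":
        return "cp949"
    return detected


# Assumed usage: `detected` would normally be chardet.detect(raw)["encoding"].
raw = "한글 똠".encode("cp949")
detected = "EUC-KR"  # stand-in for the detector's answer
text = raw.decode(codec_for(detected))
print(text)
```

Because every valid EUC-KR byte sequence is also valid UHC, the widening does not change how genuine EUC-KR files decode.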

For what it's worth, cchardet/uchardet detects these UHC files with no issues. It does report UHC for files that chardet detects as EUC-KR, but that works perfectly because EUC-KR is a strict subset of UHC.

https://en.wikipedia.org/wiki/Unified_Hangul_Code
