No support for UHC for Korean

Unified Hangul Code (UHC), which Python also calls CP949/949/MS949, is an encoding for writing Korean characters. It's a superset of EUC-KR (which chardet supports) and covers over 8000 additional Hangul characters that can't be encoded in EUC-KR.

Because of this, some files detected as EUC-KR end up with the occasional character displaying strangely, since they contain some UHC-only characters. I've also seen other UHC subtitle files of mine detected as Turkish/Windows-1254, in which case decoding fails completely.

Also, since EUC-KR is a strict subset of UHC, anything that is detected as EUC-KR can safely be treated, encoded, and decoded as UHC.
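As a workaround until UHC is detected directly, here is a minimal sketch of that idea using only Python's built-in codecs. The helper name `decode_korean` and the `detected` argument (standing in for whatever encoding name the detector reported) are made up for illustration and are not part of chardet.

```python
def decode_korean(raw: bytes, detected: str) -> str:
    """Decode bytes, upgrading a detected EUC-KR to its superset CP949 (UHC).

    `detected` is assumed to be the encoding name reported by the detector;
    this helper is hypothetical and not part of chardet.
    """
    # Normalize spellings such as "EUC-KR", "euc_kr", "EUCKR".
    normalized = detected.lower().replace("-", "").replace("_", "")
    encoding = "cp949" if normalized == "euckr" else detected
    return raw.decode(encoding)


# "똠" is one of the Hangul syllables that exists in UHC but not in EUC-KR.
uhc_bytes = "똠방각하".encode("cp949")

try:
    uhc_bytes.decode("euc_kr")
    strict_euc_kr_ok = True
except UnicodeDecodeError:
    # Python's euc_kr codec rejects the UHC extension bytes.
    strict_euc_kr_ok = False

# Treating the "EUC-KR" result as CP949 decodes the same bytes cleanly.
text = decode_korean(uhc_bytes, "EUC-KR")

# Plain EUC-KR data is unaffected by the upgrade.
plain = decode_korean("한글".encode("euc_kr"), "EUC-KR")
```

Upgrading is safe in this direction only: every valid EUC-KR byte sequence decodes to the same text under CP949, but not the other way around.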

For what it's worth, cchardet/uchardet detects these UHC files with no issues. It also reports UHC for files that chardet detects as EUC-KR, but that works perfectly well because EUC-KR is a strict subset of UHC.

https://en.wikipedia.org/wiki/Unified_Hangul_Code

No support for UHC for Korean #164
