No support for UHC for Korean #164
Description
Unified Hangul Code (UHC), which Python also calls CP949/949/MS949, is an encoding for writing Korean characters. It's a superset of EUC-KR (which chardet supports) and covers over 8000 additional Hangul characters that can't be encoded in EUC-KR.
Because of this, some files detected as EUC-KR end up with the occasional character displaying strangely, since they contain a few UHC-only characters. I've also seen other UHC subtitle files of mine detected as Turkish (Windows-1254), in which case decoding fails completely.
Also, since EUC-KR is a strict subset of UHC, anything that is detected as EUC-KR can safely be encoded or decoded as UHC instead.
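As a rough illustration of that workaround, here is a minimal sketch using only Python's built-in codecs. The helper names `normalize_korean_encoding` and `decode_detected` are made up for this example and are not part of chardet, and the sample syllable "뷁" is assumed to be one of the Hangul characters outside the EUC-KR (KS X 1001) set.

```python
def normalize_korean_encoding(detected):
    """Hypothetical helper: promote a detected EUC-KR label to CP949 (UHC)."""
    if detected and detected.lower().replace("_", "-") == "euc-kr":
        return "cp949"
    return detected


def decode_detected(data, detected):
    """Decode bytes using the detected label, with EUC-KR promoted to CP949."""
    return data.decode(normalize_korean_encoding(detected))


# Text made only of characters that EUC-KR can represent.
euc_kr_bytes = "한국어".encode("euc_kr")

# A syllable assumed to exist in UHC but not in the EUC-KR character set.
uhc_only_bytes = "뷁".encode("cp949")

# Both decode fine once the EUC-KR label is treated as CP949.
print(decode_detected(euc_kr_bytes, "EUC-KR"))
print(decode_detected(uhc_only_bytes, "EUC-KR"))

# Decoding the UHC-only bytes with the plain euc_kr codec is expected to fail.
try:
    uhc_only_bytes.decode("euc_kr")
except UnicodeDecodeError as exc:
    print("euc_kr could not decode:", exc.reason)
```

Promoting the label this way only helps files that were already detected as EUC-KR. It does nothing for the files misdetected as Windows-1254, which still need proper UHC support in the detector.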
For what it's worth, cchardet/uchardet detects these UHC files with no issues. It does report UHC for some files that chardet detects as EUC-KR, but that works perfectly because one encoding is a strict superset of the other.