Return GB18030 for Simplified Chinese files. #33
atbest wants to merge 1 commit into chardet:master from atbest:patch-1
GB2312 is a subset of GB18030 and covers fewer Chinese characters, so decoding a file with the reported GB2312 encoding frequently raises `UnicodeDecodeError`. Returning GB18030 instead can fix the problem.
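To illustrate the failure described above, here is a minimal sketch using only Python's built-in codecs. The sample string and the `try_decode` helper are assumptions chosen for illustration, not part of this PR: 龍 is a traditional character that falls outside GB2312, while 井 and 茶 are inside it.

```python
from typing import Optional

# Sample text (an assumption for illustration): 龍 is not in GB2312,
# while 井 and 茶 are.
text = "龍井茶"

# GB18030 can encode all three characters.
data = text.encode("gb18030")


def try_decode(raw: bytes, encoding: str) -> Optional[str]:
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# The GB2312 codec rejects the bytes for 龍; GB18030 round-trips the text.
print(try_decode(data, "gb2312"))
print(try_decode(data, "gb18030"))
```

Text that stays within GB2312 produces the same bytes under both codecs, which is why decoding with GB18030 does not break files that really are GB2312.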
We've had this discussion at length on other pull requests and issues over on sigmavirus24/charade. This is not sufficient to "fix" this issue. To do this properly, we need a separate state machine and prober for GB18030, which we cannot do yet.
Yes, I know this is not a perfect solution. But at least it can help to solve many practical problems. The java port
I'm not in favor of cheating a solution. It's not the right way to do it, and I won't approve of it.
I'm going to close this PR in light of the objections from @sigmavirus24. If you want to create a new PR that adds an actual state machine and prober for GB18030, we'll gladly accept it.
W3C's technical recommendation specifies a GBK encoding to be inferred for streams labeled gb2312, which in turn uses a GB18030 decoder.
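The label handling described above could be sketched roughly as follows. The helper name `decoder_for_label` is hypothetical, the label set is my reading of the Encoding specification's GBK labels, and Python's `gb18030` codec stands in for the decoder the specification describes.

```python
from typing import Optional

# Labels that select the GBK encoding (assumed from the Encoding
# specification's label table; verify against the spec before relying on it).
GBK_LABELS = {
    "chinese", "csgb2312", "csiso58gb231280", "gb2312", "gb_2312",
    "gb_2312-80", "gbk", "iso-ir-58", "x-gbk",
}


def decoder_for_label(label: str) -> Optional[str]:
    """Map a declared label to the Python codec to decode with.

    Hypothetical helper: returns None for labels this sketch does not cover.
    """
    # The spec strips ASCII whitespace and lowercases the label;
    # str.strip() is a close approximation of that step.
    normalized = label.strip().lower()
    if normalized in GBK_LABELS or normalized == "gb18030":
        # GBK-labeled streams are decoded with the GB18030 decoder.
        return "gb18030"
    return None
```

With this mapping, a stream declared as `gb2312` that actually contains GBK-only characters still decodes.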
Any news?

    cur_encoding = chardet.detect(content)['encoding']
    if cur_encoding == 'GB2312':
        cur_encoding = 'GBK'
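A slightly more defensive variant of the workaround above might look like this. It is only a sketch: `decode_detected` and its `fallback` parameter are hypothetical names, it maps to GB18030 rather than GBK (a deliberate substitution, since GB18030 is a superset of both), and it handles the case where the detector reports no encoding at all.

```python
from typing import Optional


def decode_detected(content: bytes, detected: Optional[str],
                    fallback: str = "utf-8") -> str:
    """Decode content using a detector-reported encoding name.

    Hypothetical helper: `detected` is the value a detector reported,
    which may be None when nothing was detected.
    """
    encoding = detected or fallback
    # Treat the legacy names as GB18030, which is a superset of both.
    if encoding.upper() in ("GB2312", "GBK"):
        encoding = "gb18030"
    return content.decode(encoding)
```

It would be called as `decode_detected(content, chardet.detect(content)['encoding'])`.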
It took me a long time to notice this comment (sorry), but I'm adding a new flag to apply the W3C-suggested legacy encoding renaming.