Return GB18030 for Simplified Chinese files.#33

Closed
atbest wants to merge 1 commit into chardet:master from atbest:patch-1

@atbest
Contributor

atbest commented Sep 12, 2014

GB2312 is a subset of GB18030 and covers fewer Chinese characters, so decoding with it frequently raises `UnicodeDecodeError`. Returning GB18030 instead fixes the problem.
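
As a minimal sketch of the failure described above, the following uses only Python's built-in codecs. The helper `try_decode` and the sample string are illustrative assumptions, not code from this pull request. The character "镕" is outside the GB2312 repertoire but is encodable in GB18030.

```python
def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# "镕" (U+9555) is not part of GB2312, but GB18030 can encode it.
raw = "朱镕基".encode("gb18030")

as_gb2312 = try_decode(raw, "gb2312")    # the gb2312 codec rejects these bytes
as_gb18030 = try_decode(raw, "gb18030")  # the gb18030 codec decodes them

print(as_gb2312, as_gb18030)
```

Text that a detector labels GB2312 can therefore still fail to decode whenever it contains a character from the larger GB18030 repertoire.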
@sigmavirus24
Member

We've had this discussion at length on other pull requests and issues over on sigmavirus24/charade. This is not sufficient to "fix" this issue. To do this properly, we need a separate state machine and prober for GB18030, which we cannot do yet.

@atbest
Contributor Author

atbest commented Sep 12, 2014

Yes, I know this is not a perfect solution. But at least it can help solve many practical problems. The Java port juniversalchardet also uses this 'fix'.

@sigmavirus24
Member

I'm not in favor of cheating our way to a solution. It's not the right way to do it, and I won't approve it. juniversalchardet is a separate project whose decisions do not affect this project's.

@dan-blanchard
Member

I'm going to close this PR in light of the objections from @sigmavirus24. If you want to create a new PR that adds an actual state machine and prober for GB18030, we'll gladly accept it.

@ericlingit

W3C's technical recommendation specifies a GBK encoding to be inferred for streams labeled gb2312, which in turn uses a GB18030 decoder.
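
The rule described above can be sketched roughly as follows, using only Python's built-in codecs. The two tables and the function `decode_labeled` are hypothetical names for illustration, and the label table is a small assumed subset rather than the full list in the recommendation.

```python
# Assumed subset of the label table: streams labeled gb2312 are treated as GBK.
LABEL_TO_ENCODING = {
    "gb2312": "GBK",
    "gbk": "GBK",
    "gb18030": "gb18030",
}

# GBK content is in turn decoded with the GB18030 decoder.
ENCODING_TO_CODEC = {
    "GBK": "gb18030",
    "gb18030": "gb18030",
}


def decode_labeled(data: bytes, label: str) -> str:
    """Decode data using the codec inferred from its declared label."""
    encoding = LABEL_TO_ENCODING[label.strip().lower()]
    return data.decode(ENCODING_TO_CODEC[encoding])


# Bytes containing a character outside GB2312, but declared as gb2312.
raw = "朱镕基".encode("gbk")
text = decode_labeled(raw, "GB2312")
print(text)
```

Because the label is only a hint, mapping it to the widest compatible decoder avoids errors on characters that the narrower codec does not cover.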

@honglei

honglei commented Nov 28, 2020

Any news?

```python
import chardet

# content holds the raw bytes whose encoding is being detected
cur_encoding = chardet.detect(content)['encoding']
if cur_encoding == 'GB2312':
    cur_encoding = 'GBK'
```

@dan-blanchard
Member

> W3C's technical recommendation specifies a GBK encoding to be inferred for streams labeled gb2312, which in turn uses a GB18030 decoder.

It took me a long time to notice this comment (sorry), but I'm adding a new flag to apply the W3C suggested legacy encoding renaming.
