Return GB18030 for Simplified Chinese files. by atbest · Pull Request #33 · chardet/chardet

atbest · 2014-09-12T02:06:20Z

GB2312 is a subset of GB18030 and covers fewer Chinese characters, which frequently leads to `UnicodeDecodeError`. Changing to GB18030 can fix the problem.
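To illustrate the failure described above, here is a minimal sketch using only Python's built-in codecs. The sample string and the helper name `try_decode` are illustrative assumptions, not part of the pull request. The character "镕" exists in GBK and GB18030 but not in the GB2312 character set, so bytes containing it cannot be decoded as GB2312.

```python
# Sample text (an assumption for illustration): "镕" is outside GB2312,
# while the other two characters are inside it.
text = "朱镕基"

# Encode with GB18030, as a Simplified Chinese file might be stored.
data = text.encode("gb18030")


def try_decode(raw: bytes, encoding: str):
    """Return the decoded string, or None if `raw` is invalid for `encoding`."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Decoding as GB2312 fails, while GB18030 round-trips the text.
print(try_decode(data, "gb2312"))
print(try_decode(data, "gb18030"))
```

A detector that reports GB2312 for such a file leads the caller straight into the first case.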

sigmavirus24 · 2014-09-12T02:39:43Z

We've had this discussion at length on other pull requests and issues over on sigmavirus24/charade. This is not sufficient to "fix" this issue. In order to do this properly we need a separate State Machine and Prober for GB18030 which we cannot do yet.

atbest · 2014-09-12T02:55:52Z

Yes, I know this is not a perfect solution. But at least it can help solve many practical problems. The Java port, juniversalchardet, also uses this 'fix'.

sigmavirus24 · 2014-09-12T02:58:27Z

I'm not in favor of cheating our way to a solution. It's not the right way to do it, and I won't approve it. juniversalchardet is a separate project whose decisions do not impact this project's.

dan-blanchard · 2014-12-02T16:31:01Z

I'm going to close this PR in light of the objections from @sigmavirus24. If you want to create a new PR that adds an actual state machine and prober for GB18030, we'll gladly accept it.

ericlingit · 2018-08-27T06:52:37Z

W3C's technical recommendation specifies a GBK encoding to be inferred for streams labeled gb2312, which in turn uses a GB18030 decoder.
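As a rough sketch of what that recommendation implies for a caller, the snippet below maps a few legacy labels to GBK and decodes them with Python's `gb18030` codec. The function name `decode_labeled` and the `GBK_LABELS` set are hypothetical, and the set is only a partial list of labels, not the full table from the specification.

```python
# Partial, illustrative set of labels that are treated as GBK
# (an assumption for this sketch; the specification lists more).
GBK_LABELS = {"gb2312", "gbk", "x-gbk", "csgb2312", "chinese"}


def decode_labeled(raw: bytes, label: str) -> str:
    """Decode `raw` according to `label`, treating GB2312-style labels as GBK."""
    normalized = label.strip().lower()
    if normalized in GBK_LABELS:
        # GBK shares its decoder with GB18030, so use the broader codec.
        return raw.decode("gb18030")
    # Any other label is passed to Python's codec lookup unchanged.
    return raw.decode(normalized)


# A stream labeled "GB2312" that actually contains a GBK-only character.
payload = "朱镕基".encode("gbk")
print(decode_labeled(payload, "GB2312"))
```

This only changes how a label is interpreted at decode time. It does not make the detector itself distinguish GB2312 from GB18030.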

honglei · 2020-11-28T18:29:16Z

Any news?

import chardet

# Workaround: treat a GB2312 result as GBK, which is a superset of GB2312.
cur_encoding = chardet.detect(content)['encoding']
if cur_encoding == 'GB2312':
    cur_encoding = 'GBK'

dan-blanchard · 2022-06-29T03:34:33Z

> W3C's technical recommendation specifies a GBK encoding to be inferred for streams labeled gb2312, which in turn uses a GB18030 decoder.

It took me a long time to notice this comment (sorry), but I'm adding a new flag to apply the W3C suggested legacy encoding renaming.

Return GB18030 for Simplified Chinese files.

caf0676

dan-blanchard closed this Dec 2, 2014

dan-blanchard reopened this Dec 2, 2014

dan-blanchard closed this Dec 2, 2014

dan-blanchard mentioned this pull request Nov 18, 2015

Is it OK to replace GB2312 by GB18030? #79

Closed

atbest mentioned this pull request Nov 16, 2016

GB18030 for Chinese #94

Closed

atbest deleted the patch-1 branch December 13, 2016 07:05

grzhan mentioned this pull request May 13, 2018

On how the Python chardet library handles GB2312, GBK, and GB18030 grzhan/keng#1

Open

x1angli mentioned this pull request Dec 21, 2018

GB18030 encoded file incorrectly classified as GB2312 #168

Closed

RaiKoHoff mentioned this pull request Mar 7, 2019

Encoding detection detects GB18030 instead of GB2312 rizonesoft/Notepad3#998

Closed

atbest wants to merge 1 commit into chardet:master from atbest:patch-1

5 participants
