-
Notifications
You must be signed in to change notification settings - Fork 291
Incorrect Detection of HZ-GB-2312 with ASCII Text #82
Copy link
Copy link
Closed
Description
Recently had an issue with chardet usage in the in the requests module where it was incorrectly detecting the encoding on a JSON Blob. I discovered the response.apparent_encoding and that chardet is used to set it.
I was able to identify what in my data was causing the wrong detection and distill it down to the occurrence of these simple strings.
$ cat test_chardet
~{,
~},
$ file test_chardet
test_chardetect: ASCII text
$ chardetect test_chardet
test_chardetect: HZ-GB-2312 with confidence 0.99
Originally it was a JSON Blog of all ASCII characters encoded as UTF-8. As a work around I have set up the service to specify the following in the header -
{'Content-Type': 'application/json; charset=utf-8'}
Which will have requests set the encoding properly.
I can add ASCII characters to the file, white space, quotes and numbers. It will still detect as HZ-GB-2312.
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels