GB18030 BOM confuses detection #178
Closed
While it isn't common for text to start with a GB18030 BOM (\uFEFF encoded as GB18030), when it does, chardet either fails to detect the encoding or mis-detects it.
https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding
import chardet
text = '我没有埋怨,磋砣的只是一些时间。'
print(chardet.detect(('\uFEFF' + text).encode('GB18030')))
The result is {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}, whereas without the BOM the result is {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}.
See also http://www.0x08.org/posts/UTF8-BOM
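Until detection handles this case, one possible workaround is to strip the BOM before calling the detector. The sketch below is only an illustration, not chardet's own behavior: the helper name strip_gb18030_bom is hypothetical, and it relies on the GB18030 BOM being the four bytes 84 31 95 33, as listed on the Wikipedia page linked above.

```python
# Hypothetical helper (not part of chardet): drop a leading GB18030 BOM
# so that a detector sees only the text bytes.
GB18030_BOM = b'\x84\x31\x95\x33'  # U+FEFF encoded as GB18030


def strip_gb18030_bom(data: bytes) -> bytes:
    """Return data without a leading GB18030 BOM, if one is present."""
    if data.startswith(GB18030_BOM):
        return data[len(GB18030_BOM):]
    return data


text = '我没有埋怨,磋砣的只是一些时间。'
raw = ('\uFEFF' + text).encode('GB18030')
cleaned = strip_gb18030_bom(raw)
# cleaned can now be passed to chardet.detect() in place of raw
```

The caller has to suspect GB18030 in advance for this to help, so it is a stopgap rather than a fix for the detection itself.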