chardet detects UTF-8 XML file as EUC_KR - Possibility to exclude encodings? #287
Closed as duplicate of #301
Description
Hello,
I have a UTF-8 XML file that chardet detects as EUC_KR. Is there a way to exclude encodings from detection?
I would like to exclude EUC_KR and EUC_JP from the encodings that can be detected, but I can't find any method for excluding encodings.
This is my code:
Python 3.8.10 (default, May 26 2023, 14:05:08)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
import sys,os,csv
import chardet
from chardet.universaldetector import UniversalDetector
from datetime import datetime
from datetime import date
print(chardet.__version__)
3.0.4
def detect_encode_generic(file):
    detector = UniversalDetector()
    detector.reset()
    with open(file, 'rb') as f:
        for row in f:
            detector.feed(row)
            if detector.done:
                break
    detector.close()
    return detector.result
infile=os.path.realpath("./2023-01-08_C12_DE35435545485488415265_EUR_000123.xml")
result_gen = detect_encode_generic(infile)
print(f" {infile} is encoded in '{result_gen['encoding']}' with confidence level of {result_gen['confidence']}")
/foo/bar/2023-01-08_C12_DE35435545485488415265_EUR_000123.xml is encoded in 'EUC-KR' with confidence level of 0.99
The UTF-8 XML file that gets detected as EUC_KR is attached:
2023-01-08_C12_DE35435545485488415265_EUR_000123.zip
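For illustration, here is a minimal sketch of a possible workaround, not something chardet offers as an exclusion option in the version shown above. It assumes the file should be treated as UTF-8 whenever it decodes cleanly as UTF-8, and only falls back to chardet otherwise; results naming an unwanted encoding are then discarded after the fact. The function names `is_valid_utf8` and `detect_prefer_utf8`, the `excluded` parameter, and the confidence value of 1.0 are all hypothetical choices made for this example.

```python
import codecs


def is_valid_utf8(path, chunk_size=64 * 1024):
    """Return True if the whole file decodes as strict UTF-8."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                decoder.decode(chunk)
        # final=True raises if the file ends in a truncated multi-byte sequence
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


def detect_prefer_utf8(path, excluded=("EUC-KR", "EUC-JP")):
    """Hypothetical helper: prefer UTF-8, then filter chardet's answer."""
    if is_valid_utf8(path):
        # Assumption: a clean strict decode is treated as certain UTF-8
        # (pure ASCII files also end up here).
        return {"encoding": "utf-8", "confidence": 1.0, "language": ""}

    # Imported lazily so the UTF-8 fast path works without chardet installed.
    from chardet.universaldetector import UniversalDetector

    detector = UniversalDetector()
    with open(path, "rb") as f:
        for row in f:
            detector.feed(row)
            if detector.done:
                break
    detector.close()

    result = dict(detector.result)
    encoding = result.get("encoding")
    # Post-filter: chardet itself still considered these encodings,
    # we only reject the answer afterwards.
    if encoding and encoding.upper() in {name.upper() for name in excluded}:
        result["encoding"] = None
        result["confidence"] = 0.0
    return result
```

Called as `detect_prefer_utf8(infile)`, this should report `utf-8` for the attached file if it really is valid UTF-8. Note that the post-filter does not make chardet pick a second-best guess; it only signals that the detected encoding was one of the unwanted ones.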
Best Regards,
Thomas