I noticed a very large difference in execution time between calling chardet.detect on a complete bytes string and feeding the same data to UniversalDetector.feed in 1 MB chunks.
-
With a 100 MB file, composed only of "tests tests tests tests [....]":
chardet.detect takes ~64 seconds.
UniversalDetector.feed takes ~3 seconds.
-
With the previous file, to which I appended ~10 KB of MacRoman-encoded text (containing the character ’):
chardet.detect: I interrupted the execution after 20 minutes...
UniversalDetector.feed takes ~3 seconds.
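If you want to reproduce the inputs, something like this should be close (sizes are approximate; the exact files are not attached):

```python
# Rough reconstruction of the two test inputs described above.
ascii_part = b"tests " * (100 * 1024 * 1024 // 6)           # ~100 MB of pure ASCII
macroman_part = "tests \u2019 ".encode("mac_roman") * 1280  # ~10 KB; MacRoman encodes '’' as 0xD5
original_txt = ascii_part + macroman_part                   # the second (slow) test input
```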
In case you wonder what code I used, I compared the execution time of the following:
from chardet import detect
from chardet.universaldetector import UniversalDetector

CHUNK_SIZE = 1024 * 1024  # 1 MB

# Whole-buffer detection
print(detect(original_txt))

# Chunked detection (plain slicing instead of my private chunking helper)
detector = UniversalDetector()
for start in range(0, len(original_txt), CHUNK_SIZE):
    detector.feed(original_txt[start:start + CHUNK_SIZE])
detector.close()
print(detector.result)