Universal character encoding detector.
chardet 7 is a ground-up, 0BSD-licensed rewrite of chardet. It keeps the same package name and public API, so it is a drop-in replacement for chardet 5.x/6.x that is much faster and more accurate. It requires Python 3.10+, has zero runtime dependencies, and works on PyPy.
99.3% accuracy on 2,517 test files. 47x faster than chardet 6.0.0 and 1.5x faster than charset-normalizer 3.4.6. Language detection for every result. MIME type detection for binary files. 0BSD licensed.
| | chardet 7.4.0 (mypyc) | chardet 6.0.0 | charset-normalizer 3.4.6 |
|---|---|---|---|
| Accuracy (2,517 files) | 99.3% | 88.2% | 85.4% |
| Speed | 551 files/s | 12 files/s | 376 files/s |
| Language detection | 95.7% | 40.0% | 59.2% |
| Peak memory | 52.9 MiB | 29.5 MiB | 78.8 MiB |
| Streaming detection | yes | yes | no |
| Encoding era filtering | yes | no | no |
| Encoding filters | yes | no | yes |
| MIME type detection | yes | no | no |
| Supported encodings | 99 | 84 | 99 |
| License | 0BSD | LGPL | MIT |
pip install chardet

import chardet
chardet.detect(b"Hello, world!")
# {'encoding': 'ascii', 'confidence': 1.0, 'language': 'en', 'mime_type': 'text/plain'}
# UTF-8 with typographic punctuation
chardet.detect("It\u2019s a lovely day \u2014 let\u2019s grab coffee.".encode("utf-8"))
# {'encoding': 'utf-8', 'confidence': 0.99, 'language': 'es', 'mime_type': 'text/plain'}
# Japanese EUC-JP
chardet.detect("これは日本語のテストです。文字コードの検出を行います。".encode("euc-jp"))
# {'encoding': 'EUC-JP', 'confidence': 1.0, 'language': 'ja', 'mime_type': 'text/plain'}
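A detection result is usually consumed by decoding the bytes with the reported encoding. The sketch below shows one way to do that, assuming the result dictionary shape shown above. `decode_detected`, its `min_confidence` threshold, and the UTF-8 fallback are illustrative choices, not part of chardet. The `detect` parameter exists only so the helper can be exercised without the library installed.

```python
from typing import Callable, Optional


def decode_detected(
    data: bytes,
    min_confidence: float = 0.5,
    fallback: str = "utf-8",
    detect: Optional[Callable[[bytes], dict]] = None,
) -> str:
    """Decode bytes with the detected encoding, or a fallback (hypothetical helper)."""
    if detect is None:
        import chardet  # imported lazily so the helper can be tested with a stub

        detect = chardet.detect

    result = detect(data)
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0

    # Assumption: below the threshold the guess is not trusted and the fallback is used.
    if encoding is None or confidence < min_confidence:
        encoding = fallback

    try:
        # errors="replace" keeps decoding from raising if the guess turns out to be wrong.
        return data.decode(encoding, errors="replace")
    except LookupError:
        # The reported name is not a codec Python knows; use the fallback instead.
        return data.decode(fallback, errors="replace")
```

Called as `decode_detected(raw_bytes)`, the helper uses `chardet.detect` by default. The 0.5 threshold is arbitrary and worth tuning against your own data.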
# Get all candidate encodings ranked by confidence
text = "Le café est une boisson très populaire en France et dans le monde entier."
results = chardet.detect_all(text.encode("windows-1252"))
for r in results[:4]:
    print(r["encoding"], round(r["confidence"], 2))
# Windows-1252 0.44
# iso8859-15 0.44
# ISO-8859-1 0.44
# MacRoman 0.42

For large files or network streams, use UniversalDetector to feed data incrementally:
from chardet import UniversalDetector
detector = UniversalDetector()
with open("unknown.txt", "rb") as f:
    for line in f:
        detector.feed(line)
        if detector.done:
            break
result = detector.close()
print(result)

Restrict detection to specific encoding eras to reduce false positives:
from chardet import detect_all
from chardet.enums import EncodingEra
data = "Москва является столицей Российской Федерации и крупнейшим городом страны.".encode("windows-1251")
# All encoding eras are considered by default — 4 candidates across eras
for r in detect_all(data):
    print(r["encoding"], round(r["confidence"], 2))
# Windows-1251 0.5
# MacCyrillic 0.47
# KZ1048 0.22
# ptcp154 0.22
# Restrict to modern web encodings — 1 confident result
for r in detect_all(data, encoding_era=EncodingEra.MODERN_WEB):
    print(r["encoding"], round(r["confidence"], 2))
# Windows-1251 0.5

Restrict detection to specific encodings, or exclude encodings you don't want:
# Only consider UTF-8 and Windows-1252
chardet.detect(data, include_encodings=["utf-8", "windows-1252"])
# Consider everything except EBCDIC
chardet.detect(data, exclude_encodings=["cp037", "cp500"])

chardetect somefile.txt
# somefile.txt: utf-8 with confidence 0.99
chardetect --minimal somefile.txt
# utf-8
# Include detected language
chardetect -l somefile.txt
# somefile.txt: utf-8 en (English) with confidence 0.99
# Only consider specific encodings
chardetect -i utf-8,windows-1252 somefile.txt
# somefile.txt: utf-8 with confidence 0.99
# Pipe from stdin
cat somefile.txt | chardetect
# stdin: utf-8 with confidence 0.99

- 0BSD license (previous versions were LGPL)
- Ground-up rewrite: 13-stage detection pipeline using BOM detection, magic number identification, structural probing, byte validity filtering, and bigram statistical models
- 47x faster than chardet 6.0.0 with mypyc, 1.5x faster than charset-normalizer 3.4.6
- 99.3% accuracy: +11.1pp vs chardet 6.0.0, +13.9pp vs charset-normalizer 3.4.6
- Language detection: 95.7% accuracy across 49 languages, returned with every result
- MIME type detection: identifies 40+ binary file formats (images, audio/video, archives, documents, executables, fonts) via magic number signatures, plus `text/html`, `text/xml`, and `text/x-python` for markup
- Encoding filters: `include_encodings` and `exclude_encodings` parameters to restrict or exclude specific encodings from the candidate set
- 99 encodings: full coverage including EBCDIC, Mac, DOS, and Baltic/Central European families
- Optional mypyc compilation: 1.67x additional speedup on CPython
- Thread-safe: `detect()` and `detect_all()` are safe to call concurrently; scales on free-threaded Python
- Same API: `detect()`, `detect_all()`, `UniversalDetector`, and the `chardetect` CLI all work as before
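
Because `detect()` is described above as safe to call concurrently, a batch of byte strings could be processed with a thread pool. This is a minimal sketch using only the standard library's `concurrent.futures`. `detect_many` and `max_workers` are hypothetical names, and the `detect` parameter is there only so the helper can be tested without chardet installed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional


def detect_many(
    blobs: Iterable[bytes],
    max_workers: int = 4,
    detect: Optional[Callable[[bytes], dict]] = None,
) -> List[dict]:
    """Run detection on each byte string in a thread pool (hypothetical helper)."""
    if detect is None:
        import chardet  # imported lazily so the helper can be tested with a stub

        detect = chardet.detect

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, regardless of completion order.
        return list(pool.map(detect, blobs))
```

Results come back in the same order as the inputs, so they can be zipped with a list of file names. How much a thread pool helps will depend on the interpreter build; the list above only claims scaling on free-threaded Python.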
Full documentation is available at chardet.readthedocs.io.
chardet was originally created by Mark Pilgrim in 2006 as a Python port of Mozilla's universal charset detection library. He released versions 1.0 (2006) and 1.0.1 (2008) on PyPI, then developed an unreleased Python 3 port (2.0.1) on Google Code. After Mark deleted his online accounts in 2011, the project was continued by David Cramer, Erik Rose, Toshio Kuratomi, Ian Cordasco, and Dan Blanchard.
In 2026, Dan Blanchard rewrote chardet using Claude, releasing chardet 7.0
under a new license. Releases from 7.0 onward are not derivative of the original
chardet code, but are released under the same name to allow an easier
transition for users who can immediately benefit from the speed and accuracy
improvements. For historical preservation and to allow easier comparison with
the other releases, Dan has restored Mark's lost commits to this repository
in the history/pilgrim branch.
To see the full history from 2006 to present in git log, fetch the graft
refs:
git fetch origin 'refs/replace/*:refs/replace/*'