Releases: chardet/chardet
chardet 7.4.0
chardet 7.4.0 brings accuracy up to 99.3% (from 98.6% in 7.3.0) and significantly faster cold start thanks to a new dense model format.
What's New
Performance:
- New dense zlib-compressed model format (v2) drops cold start (import + first detect) from ~75ms to ~13ms with mypyc
Accuracy (98.6% → 99.3%):
- Eliminated train/test data overlap via content fingerprinting
- Added MADLAD-400 and Wikipedia as supplemental training sources
- Improved non-ASCII bigram scoring: high-byte bigrams are now preserved during training and weighted by per-bigram IDF
- Encoding-aware substitution filtering (substitutions only apply for characters the target encoding can't represent)
- Increased training samples from 15K to 25K per language/encoding pair
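The per-bigram IDF weighting mentioned above can be illustrated with a toy sketch. This is an assumed form of the idea (the notes don't spell out chardet's exact formula): bigrams that appear in every language's training data carry little signal, while bigrams unique to one language are weighted up.

```python
# Toy illustration of per-bigram IDF weighting (assumed formula, not
# chardet's actual implementation).
import math

# language -> set of bigrams observed in that language's training data
docs = {
    "fr": {("é", "e"), ("e", "s"), ("t", "h")},
    "en": {("t", "h"), ("e", "s"), ("h", "e")},
    "de": {("c", "h"), ("e", "s"), ("t", "h")},
}

def idf(bigram: tuple[str, str]) -> float:
    n = len(docs)
    df = sum(bigram in bigrams for bigrams in docs.values())
    return math.log(n / df) if df else math.log(n)

# ("t", "h") appears in all three languages -> uninformative, weight 0
# ("é", "e") appears only in French -> discriminative, higher weight
assert idf(("é", "e")) > idf(("t", "h"))
```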
Bug fixes:
- Added dedicated structural analyzers for CP932, CP949, and Big5-HKSCS (these were previously sharing their base encoding's byte-range analyzer, missing extended ranges)
Metrics
| | chardet 7.4.0 (mypyc) | chardet 6.0.0 | charset-normalizer 3.4.6 |
|---|---|---|---|
| Accuracy (2,517 files) | 99.3% | 88.2% | 85.4% |
| Speed | 551 files/s | 12 files/s | 376 files/s |
| Language detection | 95.7% | 40.0% | 59.2% |
Full changelog: https://chardet.readthedocs.io/en/latest/changelog.html
7.3.0
License
- 0BSD license — the project license has been changed from MIT to 0BSD, a maximally permissive license with no attribution requirement. All prior 7.x releases should also be considered 0BSD licensed as of this release.
Features
- Added `mime_type` field to detection results — identifies file types for both binary (via magic number matching) and text content. Returned in all `detect()`, `detect_all()`, and `UniversalDetector` results. (#350)
- New `pipeline/magic.py` module detects 40+ binary file formats including images, audio/video, archives, documents, executables, and fonts. ZIP-based formats (XLSX, DOCX, JAR, APK, EPUB, wheel, OpenDocument) are distinguished by entry filenames. (#350)
Bug Fixes
- Fixed incorrect equivalence between UTF-16-LE and UTF-16-BE in accuracy testing — these are distinct encodings with different byte order, not interchangeable
Performance
- Added 4 new modules to mypyc compilation (orchestrator, confusion, magic, ascii), bringing the total to 11 compiled modules
- Capped statistical scoring at 16 KB — bigram models converge quickly, so large files no longer score the full 200 KB. Worst-case detection time dropped from 62ms to 26ms with no accuracy loss.
- Replaced `dataclasses.replace()` with direct `DetectionResult` construction on hot paths, eliminating ~354k function calls per full test suite run
chardet 7.2.0
Features
- Added `include_encodings` and `exclude_encodings` parameters to `detect()`, `detect_all()`, and `UniversalDetector` — restrict or exclude specific encodings from the candidate set, with corresponding `-i`/`--include-encodings` and `-x`/`--exclude-encodings` CLI flags (#343)
- Added `no_match_encoding` (default `"cp1252"`) and `empty_input_encoding` (default `"utf-8"`) parameters — control which encoding is returned when no candidate survives the pipeline or the input is empty, with corresponding CLI flags (#343)
- Added `-l`/`--language` flag to the `chardetect` CLI — shows the detected language (ISO 639-1 code and English name) alongside the encoding (#342)
Full changelog: https://chardet.readthedocs.io/en/latest/changelog.html
chardet 7.1.0
Features
- Added PEP 263 encoding declaration detection — `# -*- coding: ... -*-` and `# coding=...` declarations on lines 1–2 of Python source files are now recognized with confidence 0.95 (#249)
- Added `chardet.universaldetector` backward-compatibility stub so that `from chardet.universaldetector import UniversalDetector` works with a deprecation warning (#341)
Fixes
- Fixed false UTF-7 detection of ASCII text containing `++` or `+word` patterns (#332)
- Fixed 0.5s startup cost on first `detect()` call — model norms are now computed during loading instead of lazily iterating 21M entries (#333)
- Fixed undocumented encoding name changes between chardet 5.x and 7.0 — `detect()` now returns chardet 5.x-compatible names by default (#338)
- Improved ISO-2022-JP family detection — recognizes ESC sequences for ISO-2022-JP-2004 (JIS X 0213) and ISO-2022-JP-EXT (JIS X 0201 Kana)
- Fixed silent truncation of corrupt model data (`iter_unpack` yielded fewer tuples instead of raising)
- Fixed incorrect date in LICENSE
Performance
- 5.5x faster first-detect time (~0.42s → ~0.075s) by computing model norms as a side-product of `load_models()`
- ~40% faster model parsing via `struct.iter_unpack` for bulk entry extraction (eliminates ~305K individual `unpack` calls)
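The `iter_unpack` change above trades many per-entry `unpack` calls for a single bulk iterator. A minimal comparison of the two strategies, using a toy record format rather than chardet's actual model layout:

```python
# Toy comparison: parsing fixed-size binary records with one
# struct.iter_unpack call vs. one struct.unpack_from call per record.
import struct

entries = [(i, i * 2) for i in range(1000)]
blob = b"".join(struct.pack("<HI", a, b) for a, b in entries)

# Bulk parse: a single call returns an iterator over all records
bulk = list(struct.iter_unpack("<HI", blob))

# Per-entry parse: one function call per record (the slow path)
size = struct.calcsize("<HI")
loop = [struct.unpack_from("<HI", blob, i * size) for i in range(len(entries))]

assert bulk == loop == entries
```

Both produce identical tuples; the bulk form simply avoids the per-call overhead, which dominates when parsing hundreds of thousands of entries.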
New API parameters
- Added `compat_names` parameter (default `True`) to `detect()`, `detect_all()`, and `UniversalDetector` — set to `False` to get raw Python codec names instead of chardet 5.x/6.x-compatible display names
- Added `prefer_superset` parameter (default `False`) — remaps legacy ISO/subset encodings to their modern Windows/CP superset equivalents (e.g., ASCII → Windows-1252, ISO-8859-1 → Windows-1252). This will default to `True` in the next major version (8.0).
- Deprecated `should_rename_legacy` in favor of `prefer_superset` — a deprecation warning is emitted when it is used
Improvements
- Switched internal canonical encoding names to Python codec names (e.g., `"utf-8"` instead of `"UTF-8"`), with `compat_names` controlling the public output format
- Added `lookup_encoding()` to `registry` for case-insensitive resolution of arbitrary encoding name input to canonical names
- Achieved 100% line coverage across all source modules (+31 tests)
- Updated benchmark numbers: 98.2% encoding accuracy, 95.2% language accuracy on 2,510 test files
- Pinned test-data cloning to chardet release version tags for reproducible builds
Full changelog: https://chardet.readthedocs.io/en/latest/changelog.html
7.0.1
Fixes
- Fixed false UTF-7 detection of SHA-1 git hashes (#324, fixing #323) — requirements files with VCS pins (e.g., `+4bafdea3...`) were misdetected as UTF-7, breaking tools like tox
- Fixed `_SINGLE_LANG_MAP` missing aliases for single-language encoding lookup (e.g., `big5` → `big5hkscs`)
- Fixed PyPy `TypeError` in UTF-7 codec handling
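The root cause of these UTF-7 false positives is that pure-ASCII bytes containing `+...` runs are also valid UTF-7, where the run is interpreted as modified base64 for UTF-16 code units. A small stdlib demonstration (the `+ACE-` sequence is chosen for illustration; it is UTF-7 for `!`):

```python
# Plain ASCII bytes that are simultaneously valid (but different) UTF-7:
# the "+ACE-" run is modified base64 for U+0021, i.e. "!".
raw = b"git checkout +ACE-"

as_ascii = raw.decode("ascii")
as_utf7 = raw.decode("utf-7")

assert as_ascii == "git checkout +ACE-"
assert as_utf7 == "git checkout !"  # same bytes, different text under UTF-7
```

Since decoding succeeds under both codecs, a detector needs heuristics beyond "does it decode" to avoid misclassifying ordinary ASCII containing `+` runs, such as git hashes.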
Improvements
- Retrained bigram models — 24 previously failing test cases now pass
- Updated language equivalences for mutual intelligibility (Slovak/Czech, East Slavic + Bulgarian, Malay/Indonesian, Scandinavian languages)
7.0.0
Ground-up, MIT-licensed rewrite of chardet. Same package name, same public API — drop-in replacement for chardet 5.x/6.x. Just way faster and more accurate!
Highlights:
- MIT license (previous versions were LGPL)
- 96.8% accuracy on 2,179 test files (+2.3pp vs chardet 6.0.0, +7.7pp vs charset-normalizer)
- 41x faster than chardet 6.0.0 with mypyc (28x pure Python), 7.5x faster than charset-normalizer
- Language detection for every result (90.5% accuracy across 49 languages)
- 99 encodings across six eras (MODERN_WEB, LEGACY_ISO, LEGACY_MAC, LEGACY_REGIONAL, DOS, MAINFRAME)
- 12-stage detection pipeline — BOM, UTF-16/32 patterns, escape sequences, binary detection, markup charset, ASCII, UTF-8 validation, byte validity, CJK gating, structural probing, statistical scoring, post-processing
- Bigram frequency models trained on CulturaX multilingual corpus data for all supported language/encoding pairs
- Optional mypyc compilation — 1.49x additional speedup on CPython
- Thread-safe `detect()` and `detect_all()` with no measurable overhead; scales on free-threaded Python 3.13t+
- Negligible import memory (96 B)
- Zero runtime dependencies
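The bigram frequency models in the highlights above are the statistical core of the pipeline. A toy sketch of the idea, with a made-up model and scoring floor (chardet's real models are trained offline on corpus data and stored in a binary format):

```python
# Toy bigram-frequency scoring: text matching the training distribution
# scores higher (mean log-probability) than text that doesn't.
import math
from collections import Counter

def train(text: str) -> dict[tuple[str, str], float]:
    counts = Counter(zip(text, text[1:]))
    total = sum(counts.values())
    return {bigram: math.log(c / total) for bigram, c in counts.items()}

def score(text: str, model: dict, floor: float = -12.0) -> float:
    bigrams = list(zip(text, text[1:]))
    if not bigrams:
        return floor
    # Unseen bigrams get a fixed penalty (the floor)
    return sum(model.get(bg, floor) for bg in bigrams) / len(bigrams)

english = train("the quick brown fox jumps over the lazy dog " * 50)
assert score("the brown dog", english) > score("zzqxj", english)
```

In a real detector, candidate decodings of the input are scored against per-language/per-encoding models like this, and the best-scoring candidate wins.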
Breaking changes vs 6.0.0:
- `detect()` and `detect_all()` now default to `encoding_era=EncodingEra.ALL` (6.0.0 defaulted to `MODERN_WEB`)
- Internal architecture is completely different (probers replaced by pipeline stages). Only the public API is preserved.
- `LanguageFilter` is accepted but ignored (a deprecation warning is emitted)
- `chunk_size` is accepted but ignored (a deprecation warning is emitted)
6.0.0.post1
- Fixed version number in chardet/version.py still being set to `6.0.0dev0`. Otherwise identical to 6.0.0.
6.0.0
Features
- Unified single-byte charset detection: Instead of only having trained language models for a handful of languages (Bulgarian, Greek, Hebrew, Hungarian, Russian, Thai, Turkish) and relying on special-case `Latin1Prober` and `MacRomanProber` heuristics for Western encodings, chardet now treats all single-byte charsets the same way: every encoding gets proper language-specific bigram models trained on CulturaX corpus data. This means chardet can now accurately detect both the encoding and the language for all supported single-byte encodings.
- 38 new languages: Arabic, Belarusian, Breton, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, German, Icelandic, Indonesian, Irish, Italian, Kazakh, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Scottish Gaelic, Serbian, Slovak, Slovene, Spanish, Swedish, Tajik, Ukrainian, Vietnamese, and Welsh. Existing models for Bulgarian, Greek, Hebrew, Hungarian, Russian, Thai, and Turkish were also retrained with the new pipeline.
- `EncodingEra` filtering: The new `encoding_era` parameter to `detect()` accepts an `EncodingEra` flag enum (`MODERN_WEB`, `LEGACY_ISO`, `LEGACY_MAC`, `LEGACY_REGIONAL`, `DOS`, `MAINFRAME`, `ALL`), allowing callers to restrict detection to encodings from a specific era. `detect()` and `detect_all()` default to `MODERN_WEB`. The new `MODERN_WEB` default should drastically improve accuracy for users who are not working with legacy data. The tiers are:
  - `MODERN_WEB`: UTF-8/16/32, Windows-125x, CP874, CJK multi-byte (widely used on the web)
  - `LEGACY_ISO`: ISO-8859-x, KOI8-R/U (legacy but well-known standards)
  - `LEGACY_MAC`: Mac-specific encodings (MacRoman, MacCyrillic, etc.)
  - `LEGACY_REGIONAL`: Uncommon regional/national encodings (KOI8-T, KZ1048, CP1006, etc.)
  - `DOS`: DOS/OEM code pages (CP437, CP850, CP866, etc.)
  - `MAINFRAME`: EBCDIC variants (CP037, CP500, etc.)
- `--encoding-era` CLI flag: The `chardetect` CLI now accepts `-e`/`--encoding-era` to control which encoding eras are considered during detection.
- `max_bytes` and `chunk_size` parameters: `detect()`, `detect_all()`, and `UniversalDetector` now accept `max_bytes` (default 200 KB) and `chunk_size` (default 64 KB) parameters for controlling how much data is examined. (#314, @bysiber)
- Encoding era preference tie-breaking: When multiple encodings have very close confidence scores, the detector now prefers more modern/Unicode encodings over legacy ones.
- Charset metadata registry: New `chardet.metadata.charsets` module provides structured metadata about all supported encodings, including their era classification and language filter.
- `should_rename_legacy` now defaults intelligently: When set to `None` (the new default), legacy renaming is automatically enabled when `encoding_era` is `MODERN_WEB`.
- Direct GB18030 support: Replaced the redundant GB2312 prober with a proper GB18030 prober.
- EBCDIC detection: Added CP037 and CP500 EBCDIC model registrations for mainframe encoding detection.
- Binary file detection: Added basic binary file detection to abort analysis earlier on non-text files.
- Python 3.12, 3.13, and 3.14 support (#283, @hugovk; #311)
- GitHub Codespace support (#312, @oxygen-dioxide)
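The `EncodingEra` flag enum described above lends itself to a short sketch. Member names are taken from these notes; the numeric values and combination logic here are illustrative, not chardet's actual definitions:

```python
# Sketch of an era filter as a Flag enum: callers can combine eras with |
# and the detector checks membership to decide which encodings to consider.
from enum import Flag

class EncodingEra(Flag):
    MODERN_WEB = 1
    LEGACY_ISO = 2
    LEGACY_MAC = 4
    LEGACY_REGIONAL = 8
    DOS = 16
    MAINFRAME = 32
    ALL = 63  # union of all eras

# e.g. restrict detection to modern web plus well-known ISO standards
eras = EncodingEra.MODERN_WEB | EncodingEra.LEGACY_ISO
assert EncodingEra.MODERN_WEB in eras
assert EncodingEra.DOS not in eras
```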
Fixes
- Fix CP949 state machine: Corrected the state machine for Korean CP949 encoding detection. (#268, @nenw)
- Fix SJIS distribution analysis: Fixed `SJISDistributionAnalysis` discarding the valid second-byte range >= 0x80. (#315, @bysiber)
- Fix UTF-16/32 detection for non-ASCII-heavy text: Improved detection of UTF-16/32 encoded CJK and other non-ASCII text by adding a `MIN_RATIO` threshold alongside the existing `EXPECTED_RATIO`.
- Fix `get_charset` crash: Resolved a crash when looking up unknown charset names.
- Fix GB18030 `char_len_table`: Corrected the character length table for GB18030 multi-byte sequences.
- Fix UTF-8 state machine: Updated to be more spec-compliant.
- Fix `detect_all()` returning inactive probers: Results from probers that determined "definitely not this encoding" are now excluded.
- Fix early cutoff bug: Resolved an issue where detection could terminate prematurely.
- Default UTF-8 fallback: If UTF-8 has not been ruled out and nothing else is above the minimum threshold, UTF-8 is now returned as the default.
Breaking changes
- Dropped Python 3.7, 3.8, and 3.9 support: Now requires Python 3.10+. (#283, @hugovk)
- Removed `Latin1Prober` and `MacRomanProber`: These special-case probers have been replaced by the unified model-based approach described above. Latin-1, MacRoman, and all other single-byte encodings are now detected by `SingleByteCharSetProber` with trained language models, giving better accuracy and language identification.
- Removed EUC-TW support: EUC-TW encoding detection has been removed as it is extremely rare in practice.
- `LanguageFilter.NONE` removed: Use specific language filters or `LanguageFilter.ALL` instead.
- Enum types changed: `InputState`, `ProbingState`, `MachineState`, `SequenceLikelihood`, and `CharacterCategory` are now `IntEnum` (previously plain classes or `Enum`). `LanguageFilter` values changed from hardcoded hex to `auto()`.
- `detect()` default behavior change: `detect()` now defaults to `encoding_era=EncodingEra.MODERN_WEB` and `should_rename_legacy=None` (auto-enabled for `MODERN_WEB`), whereas previously it defaulted to considering all encodings with no legacy renaming.
Misc changes
- Switched from Poetry/setuptools to uv + hatchling: Build system modernized with `hatch-vcs` for version management.
- License text updated: Updated LGPLv2.1 license text and FSF notices to use a URL instead of a mailing address. (#304, #307, @musicinmybrain)
- CulturaX-based model training: The `create_language_model.py` training script was rewritten to use the CulturaX multilingual corpus instead of Wikipedia, producing higher-quality bigram frequency models.
- `Language` class converted to frozen dataclass: The language metadata class now uses `@dataclass(frozen=True)` with `num_training_docs` and `num_training_chars` fields replacing `wiki_start_pages`.
- Test infrastructure: Added `pytest-timeout` and `pytest-xdist` for faster parallel test execution. Reorganized test data directories.
Contributors
Thank you to everyone who contributed to this release!
- @dan-blanchard (Dan Blanchard)
- @bysiber (Kadir Can Ozden)
- @musicinmybrain (Ben Beasley)
- @hugovk (Hugo van Kemenade)
- @oxygen-dioxide
- @nenw
And a special thanks to @helour, whose earlier Latin-1 prober work from an abandoned PR helped inform the approach taken in this release.
chardet 5.2.0
Adds support for running the chardet CLI via `python -m chardet` (0e9b7bc, @dan-blanchard)
chardet 5.1.0
Features
- Add `should_rename_legacy` argument to most functions, which will rename older encodings to their more modern equivalents (e.g., `GB2312` becomes `GB18030`) (#264, @dan-blanchard)
- Add capital letter sharp S and ISO-8859-15 support (#222, @SimonWaldherr)
- Add a prober for MacRoman encoding (#5 updated as c292b52, Rob Speer and @dan-blanchard)
- Add `--minimal` flag to the `chardetect` command (#214, @dan-blanchard)
- Add type annotations to the project and run mypy on CI (#261, @jdufresne)
- Add support for Python 3.11 (#274, @hugovk)
Fixes
- Clarify LGPL version in License trove classifier (#255, @musicinmybrain)
- Remove support for EOL Python 3.6 (#260, @jdufresne)
- Remove unnecessary guards for non-falsey values (#259, @jdufresne)
Misc changes
- Switch to Python 3.10 release in GitHub actions (#257, @jdufresne)
- Remove setup.py in favor of build package (#262, @jdufresne)
- Run tests on macOS, Windows, and 3.11-dev (#267, @dan-blanchard)