Releases: chardet/chardet
chardet 7.4.0
chardet 7.4.0 brings accuracy up to 99.3% (from 98.6% in 7.3.0) and significantly faster cold start thanks to a new dense model format.
What's New
Performance:
- New dense zlib-compressed model format (v2) drops cold start (import + first detect) from ~75ms to ~13ms with mypyc
Accuracy (98.6% → 99.3%):
- Eliminated train/test data overlap via content fingerprinting
- Added MADLAD-400 and Wikipedia as supplemental training sources
- Improved non-ASCII bigram scoring: high-byte bigrams are now preserved during training and weighted by per-bigram IDF
- Encoding-aware substitution filtering (substitutions only apply for characters the target encoding can't represent)
- Increased training samples from 15K to 25K per language/encoding pair
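The per-bigram IDF weighting mentioned above can be illustrated with a toy sketch. This is an assumed form of the idea (the notes don't spell out chardet's exact formula): bigrams that appear in every language's training data carry little signal, while bigrams unique to one language are weighted up.

```python
# Toy illustration of per-bigram IDF weighting (assumed formula, not
# chardet's actual implementation).
import math

# language -> set of bigrams observed in that language's training data
docs = {
    "fr": {("é", "e"), ("e", "s"), ("t", "h")},
    "en": {("t", "h"), ("e", "s"), ("h", "e")},
    "de": {("c", "h"), ("e", "s"), ("t", "h")},
}

def idf(bigram: tuple[str, str]) -> float:
    n = len(docs)
    df = sum(bigram in bigrams for bigrams in docs.values())
    return math.log(n / df) if df else math.log(n)

# ("t", "h") appears in all three languages -> uninformative, weight 0
# ("é", "e") appears only in French -> discriminative, higher weight
assert idf(("é", "e")) > idf(("t", "h"))
```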
Bug fixes:
- Added dedicated structural analyzers for CP932, CP949, and Big5-HKSCS (these were previously sharing their base encoding's byte-range analyzer, missing extended ranges)
Metrics
| | chardet 7.4.0 (mypyc) | chardet 6.0.0 | charset-normalizer 3.4.6 |
|---|---|---|---|
| Accuracy (2,517 files) | 99.3% | 88.2% | 85.4% |
| Speed | 551 files/s | 12 files/s | 376 files/s |
| Language detection | 95.7% | 40.0% | 59.2% |
Full changelog: https://chardet.readthedocs.io/en/latest/changelog.html
7.3.0
License
- 0BSD license — the project license has been changed from MIT to 0BSD, a maximally permissive license with no attribution requirement. All prior 7.x releases should also be considered 0BSD licensed as of this release.
Features
- Added `mime_type` field to detection results — identifies file types for both binary (via magic number matching) and text content. Returned in all `detect()`, `detect_all()`, and `UniversalDetector` results. (#350)
- New `pipeline/magic.py` module detects 40+ binary file formats including images, audio/video, archives, documents, executables, and fonts. ZIP-based formats (XLSX, DOCX, JAR, APK, EPUB, wheel, OpenDocument) are distinguished by entry filenames. (#350)
Bug Fixes
- Fixed incorrect equivalence between UTF-16-LE and UTF-16-BE in accuracy testing — these are distinct encodings with different byte order, not interchangeable
Performance
- Added 4 new modules to mypyc compilation (orchestrator, confusion, magic, ascii), bringing the total to 11 compiled modules
- Capped statistical scoring at 16 KB — bigram models converge quickly, so large files no longer score the full 200 KB. Worst-case detection time dropped from 62ms to 26ms with no accuracy loss.
- Replaced `dataclasses.replace()` with direct `DetectionResult` construction on hot paths, eliminating ~354k function calls per full test suite run
chardet 7.2.0
Features
- Added `include_encodings` and `exclude_encodings` parameters to `detect()`, `detect_all()`, and `UniversalDetector` — restrict or exclude specific encodings from the candidate set, with corresponding `-i`/`--include-encodings` and `-x`/`--exclude-encodings` CLI flags (#343)
- Added `no_match_encoding` (default `"cp1252"`) and `empty_input_encoding` (default `"utf-8"`) parameters — control which encoding is returned when no candidate survives the pipeline or the input is empty, with corresponding CLI flags (#343)
- Added `-l`/`--language` flag to the `chardetect` CLI — shows the detected language (ISO 639-1 code and English name) alongside the encoding (#342)
Full changelog: https://chardet.readthedocs.io/en/latest/changelog.html
chardet 7.1.0
Features
- Added PEP 263 encoding declaration detection — `# -*- coding: ... -*-` and `# coding=...` declarations on lines 1–2 of Python source files are now recognized with confidence 0.95 (#249)
- Added `chardet.universaldetector` backward-compatibility stub so that `from chardet.universaldetector import UniversalDetector` works with a deprecation warning (#341)
Fixes
- Fixed false UTF-7 detection of ASCII text containing `++` or `+word` patterns (#332)
- Fixed 0.5s startup cost on first `detect()` call — model norms are now computed during loading instead of lazily iterating 21M entries (#333)
- Fixed undocumented encoding name changes between chardet 5.x and 7.0 — `detect()` now returns chardet 5.x-compatible names by default (#338)
- Improved ISO-2022-JP family detection — recognizes ESC sequences for ISO-2022-JP-2004 (JIS X 0213) and ISO-2022-JP-EXT (JIS X 0201 Kana)
- Fixed silent truncation of corrupt model data (`iter_unpack` yielded fewer tuples instead of raising)
- Fixed incorrect date in LICENSE
Performance
- 5.5x faster first-detect time (~0.42s → ~0.075s) by computing model norms as a side-product of `load_models()`
- ~40% faster model parsing via `struct.iter_unpack` for bulk entry extraction (eliminates ~305K individual `unpack` calls)
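The `iter_unpack` change above trades many per-entry `unpack` calls for a single bulk iterator. A minimal comparison of the two strategies, using a toy record format rather than chardet's actual model layout:

```python
# Toy comparison: parsing fixed-size binary records with one
# struct.iter_unpack call vs. one struct.unpack_from call per record.
import struct

entries = [(i, i * 2) for i in range(1000)]
blob = b"".join(struct.pack("<HI", a, b) for a, b in entries)

# Bulk parse: a single call returns an iterator over all records
bulk = list(struct.iter_unpack("<HI", blob))

# Per-entry parse: one function call per record (the slow path)
size = struct.calcsize("<HI")
loop = [struct.unpack_from("<HI", blob, i * size) for i in range(len(entries))]

assert bulk == loop == entries
```

Both produce identical tuples; the bulk form simply avoids the per-call overhead, which dominates when parsing hundreds of thousands of entries.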
New API parameters
- Added `compat_names` parameter (default `True`) to `detect()`, `detect_all()`, and `UniversalDetector` — set to `False` to get raw Python codec names instead of chardet 5.x/6.x-compatible display names
- Added `prefer_superset` parameter (default `False`) — remaps legacy ISO/subset encodings to their modern Windows/CP superset equivalents (e.g., ASCII → Windows-1252, ISO-8859-1 → Windows-1252). This will default to `True` in the next major version (8.0).
- Deprecated `should_rename_legacy` in favor of `prefer_superset` — a deprecation warning is emitted when it is used
Improvements
- Switched internal canonical encoding names to Python codec names (e.g., `"utf-8"` instead of `"UTF-8"`), with `compat_names` controlling the public output format
- Added `lookup_encoding()` to `registry` for case-insensitive resolution of arbitrary encoding name input to canonical names
- Achieved 100% line coverage across all source modules (+31 tests)
- Updated benchmark numbers: 98.2% encoding accuracy, 95.2% language accuracy on 2,510 test files
- Pinned test-data cloning to chardet release version tags for reproducible builds
Full changelog: https://chardet.readthedocs.io/en/latest/changelog.html
7.0.1
Fixes
- Fixed false UTF-7 detection of SHA-1 git hashes (#324, fixing #323) — requirements files with VCS pins (e.g., `+4bafdea3...`) were misdetected as UTF-7, breaking tools like tox
- Fixed `_SINGLE_LANG_MAP` missing aliases for single-language encoding lookup (e.g., `big5` → `big5hkscs`)
- Fixed PyPy `TypeError` in UTF-7 codec handling
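The root cause of these UTF-7 false positives is that pure-ASCII bytes containing `+...` runs are also valid UTF-7, where the run is interpreted as modified base64 for UTF-16 code units. A small stdlib demonstration (the `+ACE-` sequence is chosen for illustration; it is UTF-7 for `!`):

```python
# Plain ASCII bytes that are simultaneously valid (but different) UTF-7:
# the "+ACE-" run is modified base64 for U+0021, i.e. "!".
raw = b"git checkout +ACE-"

as_ascii = raw.decode("ascii")
as_utf7 = raw.decode("utf-7")

assert as_ascii == "git checkout +ACE-"
assert as_utf7 == "git checkout !"  # same bytes, different text under UTF-7
```

Since decoding succeeds under both codecs, a detector needs heuristics beyond "does it decode" to avoid misclassifying ordinary ASCII containing `+` runs, such as git hashes.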
Improvements
- Retrained bigram models — 24 previously failing test cases now pass
- Updated language equivalences for mutual intelligibility (Slovak/Czech, East Slavic + Bulgarian, Malay/Indonesian, Scandinavian languages)
7.0.0
Ground-up, MIT-licensed rewrite of chardet. Same package name, same public API — drop-in replacement for chardet 5.x/6.x. Just way faster and more accurate!
Highlights:
- MIT license (previous versions were LGPL)
- 96.8% accuracy on 2,179 test files (+2.3pp vs chardet 6.0.0, +7.7pp vs charset-normalizer)
- 41x faster than chardet 6.0.0 with mypyc (28x pure Python), 7.5x faster than charset-normalizer
- Language detection for every result (90.5% accuracy across 49 languages)
- 99 encodings across six eras (MODERN_WEB, LEGACY_ISO, LEGACY_MAC, LEGACY_REGIONAL, DOS, MAINFRAME)
- 12-stage detection pipeline — BOM, UTF-16/32 patterns, escape sequences, binary detection, markup charset, ASCII, UTF-8 validation, byte validity, CJK gating, structural probing, statistical scoring, post-processing
- Bigram frequency models trained on CulturaX multilingual corpus data for all supported language/encoding pairs
- Optional mypyc compilation — 1.49x additional speedup on CPython
- Thread-safe `detect()` and `detect_all()` with no measurable overhead; scales on free-threaded Python 3.13t+
- Negligible import memory (96 B)
- Zero runtime dependencies
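The bigram frequency models in the highlights above are the statistical core of the pipeline. A toy sketch of the idea, with a made-up model and scoring floor (chardet's real models are trained offline on corpus data and stored in a binary format):

```python
# Toy bigram-frequency scoring: text matching the training distribution
# scores higher (mean log-probability) than text that doesn't.
import math
from collections import Counter

def train(text: str) -> dict[tuple[str, str], float]:
    counts = Counter(zip(text, text[1:]))
    total = sum(counts.values())
    return {bigram: math.log(c / total) for bigram, c in counts.items()}

def score(text: str, model: dict, floor: float = -12.0) -> float:
    bigrams = list(zip(text, text[1:]))
    if not bigrams:
        return floor
    # Unseen bigrams get a fixed penalty (the floor)
    return sum(model.get(bg, floor) for bg in bigrams) / len(bigrams)

english = train("the quick brown fox jumps over the lazy dog " * 50)
assert score("the brown dog", english) > score("zzqxj", english)
```

In a real detector, candidate decodings of the input are scored against per-language/per-encoding models like this, and the best-scoring candidate wins.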
Breaking changes vs 6.0.0:
- `detect()` and `detect_all()` now default to `encoding_era=EncodingEra.ALL` (6.0.0 defaulted to `MODERN_WEB`)
- Internal architecture is completely different (probers replaced by pipeline stages). Only the public API is preserved.
- `LanguageFilter` is accepted but ignored (a deprecation warning is emitted)
- `chunk_size` is accepted but ignored (a deprecation warning is emitted)
6.0.0.post1
- Fixed version number in chardet/version.py still being set to `6.0.0dev0`. Otherwise identical to 6.0.0.
6.0.0
Features
- Unified single-byte charset detection: Instead of only having trained language models for a handful of languages (Bulgarian, Greek, Hebrew, Hungarian, Russian, Thai, Turkish) and relying on special-case `Latin1Prober` and `MacRomanProber` heuristics for Western encodings, chardet now treats all single-byte charsets the same way: every encoding gets proper language-specific bigram models trained on CulturaX corpus data. This means chardet can now accurately detect both the encoding and the language for all supported single-byte encodings.
- 38 new languages: Arabic, Belarusian, Breton, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, German, Icelandic, Indonesian, Irish, Italian, Kazakh, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Scottish Gaelic, Serbian, Slovak, Slovene, Spanish, Swedish, Tajik, Ukrainian, Vietnamese, and Welsh. Existing models for Bulgarian, Greek, Hebrew, Hungarian, Russian, Thai, and Turkish were also retrained with the new pipeline.
- `EncodingEra` filtering: The new `encoding_era` parameter to `detect()` accepts an `EncodingEra` flag enum (`MODERN_WEB`, `LEGACY_ISO`, `LEGACY_MAC`, `LEGACY_REGIONAL`, `DOS`, `MAINFRAME`, `ALL`), allowing callers to restrict detection to encodings from a specific era. `detect()` and `detect_all()` default to `MODERN_WEB`. The new `MODERN_WEB` default should drastically improve accuracy for users who are not working with legacy data. The tiers are:
  - `MODERN_WEB`: UTF-8/16/32, Windows-125x, CP874, CJK multi-byte (widely used on the web)
  - `LEGACY_ISO`: ISO-8859-x, KOI8-R/U (legacy but well-known standards)
  - `LEGACY_MAC`: Mac-specific encodings (MacRoman, MacCyrillic, etc.)
  - `LEGACY_REGIONAL`: Uncommon regional/national encodings (KOI8-T, KZ1048, CP1006, etc.)
  - `DOS`: DOS/OEM code pages (CP437, CP850, CP866, etc.)
  - `MAINFRAME`: EBCDIC variants (CP037, CP500, etc.)
- `--encoding-era` CLI flag: The `chardetect` CLI now accepts `-e`/`--encoding-era` to control which encoding eras are considered during detection.
- `max_bytes` and `chunk_size` parameters: `detect()`, `detect_all()`, and `UniversalDetector` now accept `max_bytes` (default 200 KB) and `chunk_size` (default 64 KB) parameters for controlling how much data is examined. (#314, @bysiber)
- Encoding era preference tie-breaking: When multiple encodings have very close confidence scores, the detector now prefers more modern/Unicode encodings over legacy ones.
- Charset metadata registry: New `chardet.metadata.charsets` module provides structured metadata about all supported encodings, including their era classification and language filter.
- `should_rename_legacy` now defaults intelligently: When set to `None` (the new default), legacy renaming is automatically enabled when `encoding_era` is `MODERN_WEB`.
- Direct GB18030 support: Replaced the redundant GB2312 prober with a proper GB18030 prober.
- EBCDIC detection: Added CP037 and CP500 EBCDIC model registrations for mainframe encoding detection.
- Binary file detection: Added basic binary file detection to abort analysis earlier on non-text files.
- Python 3.12, 3.13, and 3.14 support (#283, @hugovk; #311)
- GitHub Codespace support (#312, @oxygen-dioxide)
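The `EncodingEra` flag enum described above lends itself to a short sketch. Member names are taken from these notes; the numeric values and combination logic here are illustrative, not chardet's actual definitions:

```python
# Sketch of an era filter as a Flag enum: callers can combine eras with |
# and the detector checks membership to decide which encodings to consider.
from enum import Flag

class EncodingEra(Flag):
    MODERN_WEB = 1
    LEGACY_ISO = 2
    LEGACY_MAC = 4
    LEGACY_REGIONAL = 8
    DOS = 16
    MAINFRAME = 32
    ALL = 63  # union of all eras

# e.g. restrict detection to modern web plus well-known ISO standards
eras = EncodingEra.MODERN_WEB | EncodingEra.LEGACY_ISO
assert EncodingEra.MODERN_WEB in eras
assert EncodingEra.DOS not in eras
```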
Fixes
- Fix CP949 state machine: Corrected the state machine for Korean CP949 encoding detection. (#268, @nenw)
- Fix SJIS distribution analysis: Fixed `SJISDistributionAnalysis` discarding the valid second-byte range >= 0x80. (#315, @bysiber)
- Fix UTF-16/32 detection for non-ASCII-heavy text: Improved detection of UTF-16/32 encoded CJK and other non-ASCII text by adding a `MIN_RATIO` threshold alongside the existing `EXPECTED_RATIO`.
- Fix `get_charset` crash: Resolved a crash when looking up unknown charset names.
- Fix GB18030 `char_len_table`: Corrected the character length table for GB18030 multi-byte sequences.
- Fix UTF-8 state machine: Updated to be more spec-compliant.
- Fix `detect_all()` returning inactive probers: Results from probers that determined "definitely not this encoding" are now excluded.
- Fix early cutoff bug: Resolved an issue where detection could terminate prematurely.
- Default UTF-8 fallback: If UTF-8 has not been ruled out and nothing else is above the minimum threshold, UTF-8 is now returned as the default.
Breaking changes
- Dropped Python 3.7, 3.8, and 3.9 support: Now requires Python 3.10+. (#283, @hugovk)
- Removed `Latin1Prober` and `MacRomanProber`: These special-case probers have been replaced by the unified model-based approach described above. Latin-1, MacRoman, and all other single-byte encodings are now detected by `SingleByteCharSetProber` with trained language models, giving better accuracy and language identification.
- Removed EUC-TW support: EUC-TW encoding detection has been removed as it is extremely rare in practice.
- `LanguageFilter.NONE` removed: Use specific language filters or `LanguageFilter.ALL` instead.
- Enum types changed: `InputState`, `ProbingState`, `MachineState`, `SequenceLikelihood`, and `CharacterCategory` are now `IntEnum` (previously plain classes or `Enum`). `LanguageFilter` values changed from hardcoded hex to `auto()`.
- `detect()` default behavior change: `detect()` now defaults to `encoding_era=EncodingEra.MODERN_WEB` and `should_rename_legacy=None` (auto-enabled for `MODERN_WEB`), whereas previously it defaulted to considering all encodings with no legacy renaming.
Misc changes
- Switched from Poetry/setuptools to uv + hatchling: Build system modernized with `hatch-vcs` for version management.
- License text updated: Updated LGPLv2.1 license text and FSF notices to use a URL instead of a mailing address. (#304, #307, @musicinmybrain)
- CulturaX-based model training: The `create_language_model.py` training script was rewritten to use the CulturaX multilingual corpus instead of Wikipedia, producing higher-quality bigram frequency models.
- `Language` class converted to frozen dataclass: The language metadata class now uses `@dataclass(frozen=True)` with `num_training_docs` and `num_training_chars` fields replacing `wiki_start_pages`.
- Test infrastructure: Added `pytest-timeout` and `pytest-xdist` for faster parallel test execution. Reorganized test data directories.
Contributors
Thank you to everyone who contributed to this release!
- @dan-blanchard (Dan Blanchard)
- @bysiber (Kadir Can Ozden)
- @musicinmybrain (Ben Beasley)
- @hugovk (Hugo van Kemenade)
- @oxygen-dioxide
- @nenw
And a special thanks to @helour, whose earlier Latin-1 prober work from an abandoned PR helped inform the approach taken in this release.
chardet 5.2.0
Adds support for running the chardet CLI via `python -m chardet` (0e9b7bc, @dan-blanchard)
chardet 5.1.0
Features
- Add `should_rename_legacy` argument to most functions, which will rename older encodings to their more modern equivalents (e.g., `GB2312` becomes `GB18030`) (#264, @dan-blanchard)
- Add capital letter sharp S and ISO-8859-15 support (#222, @SimonWaldherr)
- Add a prober for MacRoman encoding (#5 updated as c292b52, Rob Speer and @dan-blanchard)
- Add `--minimal` flag to the `chardetect` command (#214, @dan-blanchard)
- Add type annotations to the project and run mypy on CI (#261, @jdufresne)
- Add support for Python 3.11 (#274, @hugovk)
Fixes
- Clarify LGPL version in License trove classifier (#255, @musicinmybrain)
- Remove support for EOL Python 3.6 (#260, @jdufresne)
- Remove unnecessary guards for non-falsey values (#259, @jdufresne)
Misc changes
- Switch to Python 3.10 release in GitHub actions (#257, @jdufresne)
- Remove setup.py in favor of build package (#262, @jdufresne)
- Run tests on macOS, Windows, and 3.11-dev (#267, @dan-blanchard)