
chardet 7.0: ground-up MIT-licensed rewrite#322

Merged
dan-blanchard merged 264 commits into main from rewrite-7.0
Mar 2, 2026

Conversation

@dan-blanchard
Member

@dan-blanchard dan-blanchard commented Mar 2, 2026

Summary

This PR is a ground-up, MIT-licensed rewrite of chardet. It maintains API compatibility with chardet 5.x and 6.x, but with a 27x improvement in detection speed and highly accurate support for even more encodings. It fixes numerous longstanding issues (like poor accuracy on short strings and poor performance), and is all-around better than previous versions of chardet in every respect. It's even faster, more memory efficient, and more accurate than charset-normalizer, which is something I'm particularly proud of. The test data has also been moved to a separate repo to prevent any licensing issues that having it here might have led to.

Highlights

| | chardet 7.0 | chardet 6.0.0 | charset-normalizer |
| --- | --- | --- | --- |
| Accuracy (2,161 files) | 96.6% | 94.5% | 89.0% |
| Speed (pure Python) | 334 files/s | 12 files/s | 66 files/s |
| Speed (mypyc compiled) | 484 files/s | -- | 66 files/s |
| Language detection accuracy | 90.9% | 47.0% | -- |
| Peak memory | 22.5 MiB | 16.4 MiB | 102.2 MiB |
| Streaming detection | yes | yes | no |
| Encoding era filtering | yes | yes | no |
| Supported encodings | 99 | 84 | 99 |
| Optional mypyc compilation | yes | no | yes |
| License | MIT | LGPL | MIT |

What's New

  • MIT license — previous versions were LGPL
  • Ground-up rewrite — 12-stage detection pipeline using BOM detection, structural probing, byte validity filtering, and bigram statistical models
  • 27x faster than chardet 6.0.0, 5x faster than charset-normalizer (pure Python)
  • 96.6% accuracy — +2.1pp vs chardet 6.0.0, +7.6pp vs charset-normalizer
  • Language detection — 90.9% accuracy across 48 languages, returned with every result
  • 99 encodings — full coverage including EBCDIC, Mac, DOS, and Baltic/Central European families
  • EncodingEra filtering — scope detection to modern web encodings, legacy ISO/Mac/DOS, mainframe, or all
  • Optional mypyc compilation — 1.45x additional speedup on CPython
  • Thread-safe — detect() and detect_all() are safe to call concurrently; scales on free-threaded Python
  • Same API — detect(), detect_all(), UniversalDetector, and the chardetect CLI all work as before

Testing

  • 2,161 test files across 99 encodings and 48 languages
  • 26 test modules covering API, pipeline stages, CLI, thread safety, streaming, and edge cases
  • Accuracy tests dynamically parametrized from chardet/test-data (auto-cloned on first run)
  • Parallel execution via pytest-xdist
  • Language detection tests, thread safety tests, and benchmark tests
  • Test data lives in a separate repo to avoid licensing concerns

Issues Closed

Closes #19 — Merge with cChardet (rewrite is 27x faster, optional mypyc makes cChardet unnecessary)
Closes #24 — Misdetects iso8859-1 as windows-1251 (improved statistical models + confusion groups)
Closes #45 — Interface to see confidence of all encodings (detect_all() returns ranked candidates)
Closes #48 — Retraining and storing data (compact binary models, train.py script)
Closes #62 — Incorrectly detecting UTF-16 as UTF-32LE (proper BOM validation + utf1632 stage)
Closes #64 — iso-8859-7 NBSP detection (byte validity filtering)
Closes #71 — Warning about encodings not supported by Python (registry validates all python_codec values)
Closes #77 — ISO-8859-2 not recognised (in registry with statistical models)
Closes #80 — No support for Arabic/cp1256 (windows-1256 in registry with Arabic/Farsi models)
Closes #82 — False HZ-GB-2312 on ASCII (validated HZ regions)
Closes #87 — windows-1250 detection (in registry with MODERN_WEB era)
Closes #105 — UTF-16 without BOM detection (dedicated utf1632.py stage)
Closes #122 — Support for EBCDIC detection (6 EBCDIC codepages in MAINFRAME era)
Closes #124 — IndexError on ISO-8859-7 (no manual array indexing)
Closes #128 — Purple heart loses UTF-8 confidence (4-byte UTF-8 validation)
Closes #132 — Misdetect win1251 as MacCyrillic (confusion groups + era filtering)
Closes #134 — Wrong UTF-8 detection (structural validator)
Closes #135 — ISO-8859-1 instead of SHIFT-JIS (structural probing for CJK)
Closes #136 — English text detected as Turkish (niche Latin demotion)
Closes #137 — Add windows-1256 support (same as #80)
Closes #138 — ISO-8859-1 instead of UTF-8 (UTF-8 validation runs before single-byte)
Closes #148 — windows-1254 instead of UTF-8 (UTF-8 priority + 1254 demotion)
Closes #149 — Use big5-2003 instead of big5 (big5hkscs is canonical name)
Closes #164 — No support for UHC for Korean (cp949/UHC in registry)
Closes #168 — GB18030 classified as GB2312 (gb18030 is canonical, gb2312 is alias)
Closes #170 — CP949 illegal multibyte sequence (validated python_codec values)
Closes #177 — logging NullHandler (rewrite doesn't use logging module)
Closes #178 — GB18030 BOM confuses detection (proper BOM handling + multi-stage pipeline)
Closes #183 — chardet stuck in infinity loop (linear pipeline, no recursive probers)
Closes #185 — UTF-8 sentence detected as Windows-1252 (early UTF-8 stage)
Closes #196 — Reads entire ASCII file without threshold (ASCII detected in single O(n) pass)
Closes #197 — windows-1253 not detected (in registry with Greek models)
Closes #231 — Problematic licensing of tests (test data in separate repo)
Closes #271 — Documentation licensing (MIT license, test data in separate repo)
Closes #280 — Failed to detect CP932 (in registry with structural probing)
Closes #286 — detect() slower than UniversalDetector.feed (same pipeline now)
Closes #288 — Wrong detection with ö (UTF-8 structural validation)
Closes #289 — detect() quadratic complexity DoS (linear pipeline + max_bytes cap)
Closes #292 — Invalid windows-1254 chars not caught (byte validity filtering)
Closes #293 — Chinese encoding to be UTF-8 (improved CJK gating + UTF-8 priority)
Closes #294 — Wrong encoding with 2-char Chinese (CJK gating improvements)
Closes #305 — Degree character detection (UTF-8 structural validation)
Closes #308 — Single accented char misdetected (UTF-8 baseline confidence 0.80)
Closes #317 — Single euro sign byte detected as None (byte validity + statistical scoring)
Closes #321 — Loops in coverage tool on Orange Pi (no per-language Python model files)

Development note

I put this together using Claude Code with Opus 4.6 and the amazing https://github.com/obra/superpowers plugin in less than a week. It took a fair amount of iteration to get it dialed in the way I wanted, but it turned a project I had been putting off for many years into roughly four days of work.

dan-blanchard and others added 30 commits March 1, 2026 12:34
Add binary model format (models.bin) for encoding bigram weights,
runtime loading/scoring utilities, and a training script that builds
models from Wikipedia articles via Hugging Face datasets. The binary
format avoids giant dict literals that trigger CPython 3.12 bugs.

- src/chardet/models/__init__.py: load_models() and score_bigrams()
- src/chardet/models/models.bin: trained models for 73 encodings (308 KB)
- scripts/train.py: Wikipedia-based training with caching and HTML samples
- tests/test_models.py: 6 tests for model loading and scoring
- pyproject.toml: add datasets dev dependency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wire together binary detection, BOM, markup charset extraction, ASCII,
UTF-8 validation, byte validity filtering, structural probing, and
statistical scoring into a single run_pipeline() entry point. Markup
charset extraction is checked before ASCII/UTF-8 so explicit encoding
declarations in HTML/XML are honoured even when content bytes are pure
ASCII.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add command-line interface for character encoding detection supporting
file arguments, stdin input, --minimal output, --version flag, and
encoding era filtering. Also add __main__.py for python -m chardet support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a shared conftest.py fixture that resolves chardet test data (from
local tests/data/ or by sparse-cloning from GitHub), and an accuracy
test that runs chardet.detect() against all test files. Current baseline
accuracy is 31.6% (682/2161) with threshold set at 30% as a regression
guard to raise as detection improves.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Treat functionally equivalent encodings (e.g., utf-16 vs utf-16-le,
gb18030 vs gb2312, shift_jis vs cp932) as correct matches. This
eliminates ~322 false failures and raises measured accuracy from
31.6% to 73.4%, with the minimum threshold raised to 55%.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Detects ISO-2022-JP, ISO-2022-KR, and HZ-GB-2312 by checking for their
characteristic escape/tilde sequences early in the pipeline, before
binary detection (which would reject ESC bytes) and ASCII detection
(which would match HZ-GB-2312's printable-only byte range).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e classes

Weight high-byte bigrams (0x80+) 8x more heavily than ASCII-only bigrams in
statistical scoring. This focuses the scoring on the byte ranges that actually
distinguish single-byte encodings, dramatically improving Western European
encoding detection.

Add encoding equivalence classes for iso-8859-11/tis-620, koi8-t/koi8-r,
EBCDIC variants (cp037/cp500/cp1026), DOS (cp850/cp858), and Hebrew DOS
(cp856/cp862).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
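The 8x high-byte weighting can be sketched as a small counting function. The weight matches the commit description, but the function name and structure are illustrative, not chardet internals:

```python
from collections import Counter

HIGH_BYTE_WEIGHT = 8.0  # per the commit: high-byte bigrams count 8x

def weighted_bigram_counts(data: bytes) -> Counter:
    """Count adjacent byte pairs, weighting pairs that contain a byte
    >= 0x80 more heavily, since those are the ranges that actually
    distinguish single-byte encodings from one another."""
    counts: Counter = Counter()
    for a, b in zip(data, data[1:]):
        weight = HIGH_BYTE_WEIGHT if (a >= 0x80 or b >= 0x80) else 1.0
        counts[(a, b)] += weight
    return counts
```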
Retrain bigram models using the updated training pipeline with CulturaX
dataset, expanded language mappings, and character substitution tables.
Use 5000 samples per language for broader coverage.

Remove questionable equivalence classes (cp856/cp862, koi8-t/koi8-r)
that grouped genuinely different encodings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add cp856, cp862, and cp864 back to ENCODING_LANG_MAP so they have
bigram models and can be detected by statistical scoring. Remove unused
_cached_article_count function. Add unit test for high-byte bigram
weighting behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Design targets matching/exceeding charset-normalizer's accuracy via
CJK gating, era-based tiebreaking, training improvements, and
windows-1252 fallback. Includes diagnostic scripts for per-encoding
accuracy analysis, encoding equivalence verification, and strict
library comparison.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
8-task plan covering directional equivalence classes, CJK gating,
era-based tiebreaking, Cyrillic model retraining, windows-1252
fallback, and diagnostic script updates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…asses

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a gate between Stage 2a (byte validity) and Stage 2b (structural
probing) that eliminates CJK multi-byte candidates when the data lacks
actual multi-byte sequences. This prevents permissive encodings like
gb18030 from winning as false positives for EBCDIC, Latin, and DOS
codepage data.

Key changes:
- Orchestrator: gate multi-byte candidates using compute_structural_score
  with a minimum threshold of 0.05 (valid sequences / lead bytes)
- Structural scorer: tighten _score_gb18030 to only count strict GB2312
  2-byte pairs and 4-byte sequences, excluding the overly permissive GBK
  extension range that caused EBCDIC data to score 1.0
- Accuracy improves from 73.9% to 75.6% (1634/2161); threshold raised
  from 0.73 to 0.74

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
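The gate described here (ratio of valid multi-byte sequences to lead bytes, with a 0.05 minimum) can be sketched for the strict-GB2312 case. The threshold and byte ranges follow the commit description; the function names are hypothetical:

```python
MIN_STRUCTURAL_SCORE = 0.05  # valid sequences / lead bytes, per the commit

def gb2312_structural_score(data: bytes) -> float:
    """Ratio of valid strict-GB2312 2-byte pairs to lead bytes seen.
    Returns 0.0 when there are no lead bytes at all, so data with no
    multi-byte structure (e.g. EBCDIC or Latin text) is gated out."""
    leads = valid = 0
    i = 0
    while i < len(data):
        if 0xA1 <= data[i] <= 0xF7:  # GB2312 lead-byte range
            leads += 1
            if i + 1 < len(data) and 0xA1 <= data[i + 1] <= 0xFE:
                valid += 1
                i += 2
                continue
        i += 1
    return valid / leads if leads else 0.0

def passes_cjk_gate(data: bytes) -> bool:
    return gb2312_structural_score(data) >= MIN_STRUCTURAL_SCORE
```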
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When statistical scores are within 10% relative, prefer encodings from
higher-priority eras (MODERN_WEB > LEGACY_ISO > LEGACY_REGIONAL > DOS >
LEGACY_MAC > MAINFRAME). This helps resolve ambiguity between close-scoring
encodings like mac-latin2 vs windows-1250.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
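The tiebreak rule (prefer a higher-priority era when scores are within 10% relative) can be sketched like this. The era ordering is taken from the commit message; the priority values and function signature are illustrative:

```python
# Era priority, higher is preferred (order from the commit message).
ERA_PRIORITY = {
    "MODERN_WEB": 6, "LEGACY_ISO": 5, "LEGACY_REGIONAL": 4,
    "DOS": 3, "LEGACY_MAC": 2, "MAINFRAME": 1,
}
RELATIVE_TIE_WINDOW = 0.10

def era_tiebreak(candidates: list[tuple[str, str, float]]) -> str:
    """candidates: (encoding, era, score) triples. Among candidates
    within 10% relative of the best score, return the encoding from
    the highest-priority era."""
    best = max(score for _, _, score in candidates)
    close = [c for c in candidates if c[2] >= best * (1 - RELATIVE_TIE_WINDOW)]
    return max(close, key=lambda c: ERA_PRIORITY[c[1]])[0]
```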
This reverts commit 580d15ac9f9ad3c75887308bce444ecdeb3488c6.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds --encodings argument to specify which encodings to retrain.
When specified, existing models for other encodings are preserved
by loading and merging with the existing models.bin. Defaults to
all encodings when none are specified.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…mination

koi8-r trained on Russian only (was all Cyrillic languages).
cp866 trained without Ukrainian (moved to cp1125).
koi8-r accuracy: 69.2% → 100%. Overall: 74.9% → 75.3%.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Added fi, is, id, ms to iso-8859-1, windows-1252, iso-8859-15, and
mac-roman training sets. Overall accuracy: 75.3% → 76.8%, exceeding
the 76.0% target. Threshold raised to 0.76.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deduplicates ~280 lines of identical directional equivalence logic
that was copied across test_accuracy.py, diagnose_accuracy.py,
compare_strict.py, and compare_detectors.py. All four files now
import from the single source of truth.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dan-blanchard and others added 10 commits March 1, 2026 23:00
The README and supported-encodings.rst were understating coverage by
counting only REGISTRY entries (86) rather than the unique Python codecs
reachable via aliases (99). Updated generate_encoding_table.py to count
alias-inclusive encodings and regenerated the docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix score_best_language to not short-circuit when a pre-computed
BigramProfile is provided with empty data. Move _CATEGORY_TO_INT from
runtime confusion.py to scripts/confusion_training.py since it is only
needed at training time; inline _INT_TO_CATEGORY directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ALL is more accurate than MODERN_WEB (96.6% vs 78.9%) because the
confusion resolver only examines the top 2 candidates — removing
encodings via era filtering changes which pair triggers, sometimes
causing wrong resolutions.

- Change default encoding_era from MODERN_WEB to ALL in detect(),
  detect_all(), and UniversalDetector
- Change should_rename_legacy default from None to True, removing
  the _resolve_rename() indirection entirely
- Move test data helpers from conftest.py to scripts/utils.py
- Use pytest.mark.parametrize for independent xfail sets per test
- Add test_detect_era_filtered with its own known-failure tracking
- Update xfail sets: remove 4 files now passing with ALL default,
  add iso-8859-16-romanian/_ude_1.txt to era-filtered failures

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hardcoded version with dynamic versioning using hatch-vcs
(setuptools-scm wrapper for Hatchling). The git tag is now the single
source of truth: tagged commits get clean versions (e.g., 7.0.0),
and dev builds get auto-incremented versions (e.g., 7.0.1.dev3+g...).

- Add hatch-vcs to build-system requires and configure VCS version source
- Import __version__ from auto-generated _version.py
- Add fetch-depth: 0 to all CI/release checkout steps for tag access
- Relax hardcoded version assertions in tests
- Gitignore the generated _version.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
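For reference, a hatch-vcs setup like the one described usually amounts to a few lines of pyproject.toml. This is a sketch of the standard configuration, not the project's actual file (the `version-file` path is assumed from the commit's mention of an auto-generated `_version.py`):

```toml
[build-system]
requires = ["hatchling", "hatch-vcs"]
build-backend = "hatchling.build"

[project]
dynamic = ["version"]

[tool.hatch.version]
source = "vcs"

[tool.hatch.build.hooks.vcs]
# Assumed location; gitignored since it is generated at build time.
version-file = "src/chardet/_version.py"
```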
- Replace hardcoded "Version 6.1.0+" with hatch-vcs versioning section
- Fix mypyc compiled modules list (7 modules, not 3)
- Add post-processing pipeline stage (confusion groups, niche Latin, KOI8-T)
- Note mypyc exception for `from __future__ import annotations`

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…d missing pipeline stages

- usage.rst: reframe encoding eras as restrict (not broaden), ALL is default, add CLI note
- faq.rst: fix accuracy (96.6%), speed (27x), memory (51 B), comparisons
- how-it-works.rst: add CJK Gating (stage 9) and Post-processing (stage 12)
- api/index.rst: add DetectionResult, DetectionDict, DEFAULT_MAX_BYTES, MINIMUM_THRESHOLD
- supported-encodings.rst: fix era default reference

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The default encoding_era is now ALL, not MODERN_WEB. Update the
EncodingEra filtering example to show ALL as the default and
MODERN_WEB as the explicit restriction.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add top-level permissions block to limit the CI workflow token
to contents:read, following the principle of least privilege.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dan-blanchard dan-blanchard marked this pull request as ready for review March 2, 2026 05:48
dan-blanchard and others added 2 commits March 2, 2026 01:11
…nd tests

Improve error handling for corrupt/empty bundled data files: warn on empty
models.bin and confusion.bin, wrap confusion.bin deserialization in
try-except, add detect() error handling in CLI, and narrow overly broad
ValueError catches in UTF-16/32 detection.

Harden the public API by clamping confidence to [0.0, 1.0] at the
run_pipeline boundary, adding an assertion for non-empty results, removing
the dead UnicodeDecodeError catch from _to_utf8, and computing weight_sum
inside BigramProfile.from_weighted_freq instead of accepting it on trust.

Fix inaccurate comments: Windows-1254 C1 range, bidirectional equivalents
description, stage numbering across pipeline modules, sentinel value
rationale, DETERMINISTIC_CONFIDENCE user list, text-quality score range,
and stale language-count references.

Add 12 new tests covering fallback paths, niche Latin demotion, KOI8-T
promotion, language filling, confidence clamping, max_bytes=True rejection,
detect_all with should_rename_legacy=False, and CLI partial failure.
Strengthen two weak assertions in existing tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- how-it-works.rst: correct confidence tiers (UTF-8 is 0.80-0.99, not
  0.95; binary is 0.95, not None), add clamping note, fix language count
- faq.rst: fix binary detection example confidence from None to 0.95
- README.md: fix 11-stage to 12-stage, 48 to 49 languages

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dan-blanchard dan-blanchard merged commit 6ebd090 into main Mar 2, 2026
7 of 14 checks passed
@dan-blanchard dan-blanchard deleted the rewrite-7.0 branch March 2, 2026 06:22
@jkyeung

jkyeung commented Mar 2, 2026

Kudos! It sounds like you leveraged AI tools the way they were "meant to be used" - as a way to augment and accelerate development, rather than to replace human expertise, ingenuity, and understanding. Your extensive work with previous versions of this project lends confidence that it's not a sloppy hack job.

It's only fair for me to point out that I have not looked at or tested the code personally, and I am not an expert in this problem domain. My hope and expectation is that this project is used often enough and extensively enough in the wild that people more knowledgeable than me will soon corroborate or disprove my assessment.

@Jak2k

Jak2k commented Mar 4, 2026

LLM code is very likely not copyrightable. Please update the license accordingly.

@williewillus

How much of the previous codebase was supplied to the LLM in this "rewrite"? Did you audit the output for verbatim snippets of the old code?

Doing a "ground-up rewrite" of LGPL code to launder it into MIT seems highly unethical and/or illegal unless the rewrite was carried out with no reference to the original, but even then this repo is probably in the LLM's training data.

@tomoyoirl

@williewillus yeah, this presents an interesting form of software-supply-chain risk (especially as the MIT-licensed edition of the code is being supplied without warranty, including warranty that its claim of originality is accurate). it's not a new risk, but it might be exacerbated by this new technology.

i don't suppose there's any good sites for tracking these kind of risks yet? something like CVEs, but for license-shenanigans or other software-quality problems in software projects that won't recognize them as problematic.

(i've been using chardet shims that delegate to PyYoshi/cChardet (MPL) in $PROJECT for a while for entirely unrelated license reasons, and will probably be keeping them)

@fallbackerik

This rewrite very likely does not hold up to the standard of a "clean room design" required to relicense the code under a less strict license.

I state this here again so clearly, because then you can't even claim you didn't know. Please don't move forward with it.

What does Claude say to the relicensing attempt?

And why do you want to relicense it in the first place? Maybe there's a solution that can gain more community support and fulfill your needs.

@chardet chardet locked and limited conversation to collaborators Mar 8, 2026