chardet 7.0: ground-up MIT-licensed rewrite#322
Conversation
Add binary model format (models.bin) for encoding bigram weights, runtime loading/scoring utilities, and a training script that builds models from Wikipedia articles via Hugging Face datasets. The binary format avoids giant dict literals that trigger CPython 3.12 bugs.

- src/chardet/models/__init__.py: load_models() and score_bigrams()
- src/chardet/models/models.bin: trained models for 73 encodings (308 KB)
- scripts/train.py: Wikipedia-based training with caching and HTML samples
- tests/test_models.py: 6 tests for model loading and scoring
- pyproject.toml: add datasets dev dependency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
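Since the commit above introduces `score_bigrams()` without showing its shape, here is a minimal sketch of how bigram-weight scoring typically works. The `weights` table, signature, and normalization below are illustrative assumptions, not chardet's actual model format.

```python
from collections import Counter


def score_bigrams(data: bytes, weights: dict[tuple[int, int], float]) -> float:
    """Score data against a trained bigram weight table; higher means a
    better fit. `weights` maps (first_byte, second_byte) pairs to trained
    weights. Illustrative sketch only, not chardet's real model.
    """
    if len(data) < 2:
        return 0.0
    # Count every adjacent byte pair in one pass.
    counts = Counter(zip(data, data[1:]))
    total = sum(counts.values())
    # Average the trained weight over all observed bigrams; unknown
    # bigrams contribute 0.
    return sum(weights.get(pair, 0.0) * n for pair, n in counts.items()) / total
```

The real models.bin packs tables like this into a compact binary blob precisely to avoid the huge dict literals the commit message mentions.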
Wire together binary detection, BOM, markup charset extraction, ASCII, UTF-8 validation, byte validity filtering, structural probing, and statistical scoring into a single run_pipeline() entry point. Markup charset extraction is checked before ASCII/UTF-8 so explicit encoding declarations in HTML/XML are honoured even when content bytes are pure ASCII. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add command-line interface for character encoding detection supporting file arguments, stdin input, --minimal output, --version flag, and encoding era filtering. Also add __main__.py for python -m chardet support. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a shared conftest.py fixture that resolves chardet test data (from local tests/data/ or by sparse-cloning from GitHub), and an accuracy test that runs chardet.detect() against all test files. Current baseline accuracy is 31.6% (682/2161) with threshold set at 30% as a regression guard to raise as detection improves. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Treat functionally equivalent encodings (e.g., utf-16 vs utf-16-le, gb18030 vs gb2312, shift_jis vs cp932) as correct matches. This eliminates ~322 false failures and raises measured accuracy from 31.6% to 73.4%, with the minimum threshold raised to 55%. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
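The equivalence-class idea can be sketched as a set of frozensets consulted when grading results. The group memberships below are taken from the examples in the commit message; `is_match` is a hypothetical helper, not the project's actual API.

```python
# Encodings treated as interchangeable when grading detection results.
# Groups are the examples named in the commit message.
EQUIVALENT = [
    frozenset({"utf-16", "utf-16-le"}),
    frozenset({"gb18030", "gb2312"}),
    frozenset({"shift_jis", "cp932"}),
]


def is_match(detected: str, expected: str) -> bool:
    """True if detected equals expected, or both belong to one group."""
    if detected == expected:
        return True
    return any(detected in g and expected in g for g in EQUIVALENT)
```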
Detects ISO-2022-JP, ISO-2022-KR, and HZ-GB-2312 by checking for their characteristic escape/tilde sequences early in the pipeline, before binary detection (which would reject ESC bytes) and ASCII detection (which would match HZ-GB-2312's printable-only byte range). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
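A hedged sketch of what such an early-stage check might look like. The byte signatures are the well-known ISO-2022 designator sequences and the HZ tilde-brace opener; the function name and mapping are assumptions, not the project's real implementation.

```python
# Characteristic sequences for 7-bit escape encodings. These must be
# checked before binary detection (which rejects ESC bytes) and ASCII
# detection (which would swallow HZ-GB-2312's printable-only range).
ESCAPE_SIGNATURES = {
    b"\x1b$)C": "iso-2022-kr",  # ESC $ ) C designates KS X 1001
    b"\x1b$B": "iso-2022-jp",   # ESC $ B switches to JIS X 0208
    b"~{": "hz-gb-2312",        # tilde-brace opens a GB-encoded region
}


def detect_escape_encoding(data: bytes) -> "str | None":
    """Return an escape-encoding name if a signature appears, else None."""
    for signature, name in ESCAPE_SIGNATURES.items():
        if signature in data:
            return name
    return None
```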
…e classes

Weight high-byte bigrams (0x80+) 8x more heavily than ASCII-only bigrams in statistical scoring. This focuses the scoring on the byte ranges that actually distinguish single-byte encodings, dramatically improving Western European encoding detection.

Add encoding equivalence classes for iso-8859-11/tis-620, koi8-t/koi8-r, EBCDIC variants (cp037/cp500/cp1026), DOS (cp850/cp858), and Hebrew DOS (cp856/cp862).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
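The high-byte weighting rule is small enough to show directly. `bigram_weight` and its parameter names are illustrative, though the 8x factor and the 0x80 boundary come from the commit message above.

```python
def bigram_weight(b1: int, b2: int, high_byte_factor: float = 8.0) -> float:
    """Weight applied to one bigram observation during statistical scoring.

    Any bigram touching the 0x80-0xFF range counts `high_byte_factor`
    times as much as a pure-ASCII bigram, since high bytes are what
    actually distinguish single-byte encodings from one another.
    """
    return high_byte_factor if (b1 >= 0x80 or b2 >= 0x80) else 1.0
```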
Retrain bigram models using the updated training pipeline with CulturaX dataset, expanded language mappings, and character substitution tables. Use 5000 samples per language for broader coverage. Remove questionable equivalence classes (cp856/cp862, koi8-t/koi8-r) that grouped genuinely different encodings. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add cp856, cp862, and cp864 back to ENCODING_LANG_MAP so they have bigram models and can be detected by statistical scoring. Remove unused _cached_article_count function. Add unit test for high-byte bigram weighting behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Design targets matching/exceeding charset-normalizer's accuracy via CJK gating, era-based tiebreaking, training improvements, and windows-1252 fallback. Includes diagnostic scripts for per-encoding accuracy analysis, encoding equivalence verification, and strict library comparison. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
8-task plan covering directional equivalence classes, CJK gating, era-based tiebreaking, Cyrillic model retraining, windows-1252 fallback, and diagnostic script updates. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…asses Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a gate between Stage 2a (byte validity) and Stage 2b (structural probing) that eliminates CJK multi-byte candidates when the data lacks actual multi-byte sequences. This prevents permissive encodings like gb18030 from winning as false positives for EBCDIC, Latin, and DOS codepage data.

Key changes:

- Orchestrator: gate multi-byte candidates using compute_structural_score with a minimum threshold of 0.05 (valid sequences / lead bytes)
- Structural scorer: tighten _score_gb18030 to only count strict GB2312 2-byte pairs and 4-byte sequences, excluding the overly permissive GBK extension range that caused EBCDIC data to score 1.0
- Accuracy improves from 73.9% to 75.6% (1634/2161); threshold raised from 0.73 to 0.74

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
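A simplified sketch of such a gate, assuming a generic 0x81-0xFE lead-byte range and a strict high trail byte. The real scoring is per-encoding and goes through compute_structural_score; this only illustrates the ratio-threshold idea.

```python
def passes_multibyte_gate(data: bytes, min_ratio: float = 0.05) -> bool:
    """Keep a CJK multi-byte candidate only if enough lead bytes actually
    begin a valid two-byte pair (valid sequences / lead bytes >= min_ratio).
    Byte ranges here are a simplification of real per-encoding tables.
    """
    leads = valid = 0
    i = 0
    while i < len(data):
        if 0x81 <= data[i] <= 0xFE:  # candidate lead byte
            leads += 1
            if i + 1 < len(data) and 0xA1 <= data[i + 1] <= 0xFE:
                valid += 1  # plausible two-byte sequence
                i += 2
                continue
        i += 1
    if leads == 0:
        return False  # no multi-byte material at all: reject CJK candidates
    return valid / leads >= min_ratio
```

The key effect is that EBCDIC-like data, which is full of high bytes that never pair up, now fails the gate instead of scoring as permissive GB data.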
When statistical scores are within 10% relative, prefer encodings from higher-priority eras (MODERN_WEB > LEGACY_ISO > LEGACY_REGIONAL > DOS > LEGACY_MAC > MAINFRAME). This helps resolve ambiguity between close-scoring encodings like mac-latin2 vs windows-1250. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
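The tiebreak rule can be sketched as follows. The `Era` ordering mirrors the priority list in the commit message; `pick_winner` and the candidate-tuple shape are hypothetical, not the project's actual API.

```python
from enum import IntEnum


class Era(IntEnum):
    # Higher value = higher priority, per the ordering in the commit message.
    MAINFRAME = 0
    LEGACY_MAC = 1
    DOS = 2
    LEGACY_REGIONAL = 3
    LEGACY_ISO = 4
    MODERN_WEB = 5


def pick_winner(candidates: "list[tuple[str, float, Era]]") -> str:
    """Among (name, score, era) candidates, treat anything within 10%
    relative of the best score as a tie and break it by era priority.
    """
    best_score = max(score for _, score, _ in candidates)
    contenders = [c for c in candidates if c[1] >= best_score * 0.9]
    # Prefer higher era first, then higher score within the same era.
    return max(contenders, key=lambda c: (c[2], c[1]))[0]
```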
This reverts commit 580d15ac9f9ad3c75887308bce444ecdeb3488c6.
Adds --encodings argument to specify which encodings to retrain. When specified, existing models for other encodings are preserved by loading and merging with the existing models.bin. Defaults to all encodings when none are specified. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…mination

koi8-r trained on Russian only (was all Cyrillic languages). cp866 trained without Ukrainian (moved to cp1125). koi8-r accuracy: 69.2% → 100%. Overall: 74.9% → 75.3%. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Added fi, is, id, ms to iso-8859-1, windows-1252, iso-8859-15, and mac-roman training sets. Overall accuracy: 75.3% → 76.8%, exceeding the 76.0% target. Threshold raised to 0.76. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deduplicates ~280 lines of identical directional equivalence logic that was copied across test_accuracy.py, diagnose_accuracy.py, compare_strict.py, and compare_detectors.py. All four files now import from the single source of truth. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The README and supported-encodings.rst were understating coverage by counting only REGISTRY entries (86) rather than the unique Python codecs reachable via aliases (99). Updated generate_encoding_table.py to count alias-inclusive encodings and regenerated the docs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix score_best_language to not short-circuit when a pre-computed BigramProfile is provided with empty data. Move _CATEGORY_TO_INT from runtime confusion.py to scripts/confusion_training.py since it is only needed at training time; inline _INT_TO_CATEGORY directly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ALL is more accurate than MODERN_WEB (96.6% vs 78.9%) because the confusion resolver only examines the top 2 candidates: removing encodings via era filtering changes which pair triggers, sometimes causing wrong resolutions.

- Change default encoding_era from MODERN_WEB to ALL in detect(), detect_all(), and UniversalDetector
- Change should_rename_legacy default from None to True, removing the _resolve_rename() indirection entirely
- Move test data helpers from conftest.py to scripts/utils.py
- Use pytest.mark.parametrize for independent xfail sets per test
- Add test_detect_era_filtered with its own known-failure tracking
- Update xfail sets: remove 4 files now passing with ALL default, add iso-8859-16-romanian/_ude_1.txt to era-filtered failures

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hardcoded version with dynamic versioning using hatch-vcs (setuptools-scm wrapper for Hatchling). The git tag is now the single source of truth: tagged commits get clean versions (e.g., 7.0.0), and dev builds get auto-incremented versions (e.g., 7.0.1.dev3+g...).

- Add hatch-vcs to build-system requires and configure VCS version source
- Import __version__ from auto-generated _version.py
- Add fetch-depth: 0 to all CI/release checkout steps for tag access
- Relax hardcoded version assertions in tests
- Gitignore the generated _version.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace hardcoded "Version 6.1.0+" with hatch-vcs versioning section
- Fix mypyc compiled modules list (7 modules, not 3)
- Add post-processing pipeline stage (confusion groups, niche Latin, KOI8-T)
- Note mypyc exception for `from __future__ import annotations`

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…d missing pipeline stages

- usage.rst: reframe encoding eras as restrict (not broaden), ALL is default, add CLI note
- faq.rst: fix accuracy (96.6%), speed (27x), memory (51 B), comparisons
- how-it-works.rst: add CJK Gating (stage 9) and Post-processing (stage 12)
- api/index.rst: add DetectionResult, DetectionDict, DEFAULT_MAX_BYTES, MINIMUM_THRESHOLD
- supported-encodings.rst: fix era default reference

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The default encoding_era is now ALL, not MODERN_WEB. Update the EncodingEra filtering example to show ALL as the default and MODERN_WEB as the explicit restriction. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add top-level permissions block to limit the CI workflow token to contents:read, following the principle of least privilege. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nd tests

Improve error handling for corrupt/empty bundled data files: warn on empty models.bin and confusion.bin, wrap confusion.bin deserialization in try-except, add detect() error handling in CLI, and narrow overly broad ValueError catches in UTF-16/32 detection.

Harden the public API by clamping confidence to [0.0, 1.0] at the run_pipeline boundary, adding an assertion for non-empty results, removing the dead UnicodeDecodeError catch from _to_utf8, and computing weight_sum inside BigramProfile.from_weighted_freq instead of accepting it on trust.

Fix inaccurate comments: Windows-1254 C1 range, bidirectional equivalents description, stage numbering across pipeline modules, sentinel value rationale, DETERMINISTIC_CONFIDENCE user list, text-quality score range, and stale language-count references.

Add 12 new tests covering fallback paths, niche Latin demotion, KOI8-T promotion, language filling, confidence clamping, max_bytes=True rejection, detect_all with should_rename_legacy=False, and CLI partial failure. Strengthen two weak assertions in existing tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
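The confidence clamping mentioned in this commit is simple enough to show directly; `clamp_confidence` is a hypothetical name for the boundary check applied at run_pipeline's return.

```python
def clamp_confidence(value: float) -> float:
    """Clamp a raw confidence score into [0.0, 1.0] at the API boundary,
    so internal scoring quirks can never leak out-of-range values to users.
    """
    return min(1.0, max(0.0, value))
```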
- how-it-works.rst: correct confidence tiers (UTF-8 is 0.80-0.99, not 0.95; binary is 0.95, not None), add clamping note, fix language count
- faq.rst: fix binary detection example confidence from None to 0.95
- README.md: fix 11-stage to 12-stage, 48 to 49 languages

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Kudos! It sounds like you leveraged AI tools the way they were "meant to be used" - as a way to augment and accelerate development, rather than to replace human expertise, ingenuity, and understanding. Your extensive work with previous versions of this project lends confidence that it's not a sloppy hack job. It's only fair for me to point out that I have not looked at or tested the code personally, and I am not an expert in this problem domain. My hope and expectation is that this project is used often enough and extensively enough in the wild that people more knowledgeable than me will soon corroborate or disprove my assessment.
LLM code is very likely not copyrightable. Please update the license accordingly.
How much of the previous codebase was supplied to the LLM in this "rewrite"? Did you audit the output for verbatim snippets of the old code? Doing a "ground-up rewrite" of LGPL code to launder it into MIT seems highly unethical and/or illegal unless the rewrite was carried out with no reference to the original, but even then this repo is probably in the LLM's training data.
@williewillus yeah, this presents an interesting form of software-supply-chain risk (especially as the MIT-licensed edition of the code is being supplied without warranty, including warranty that its claim of originality is accurate). It's not a new risk, but it might be exacerbated by this new technology. I don't suppose there's any good sites for tracking these kinds of risks yet? Something like CVEs, but for license shenanigans or other software-quality problems in software projects that won't recognize them as problematic. (I've been using chardet shims that delegate to PyYoshi/cChardet (MPL) in $PROJECT for a while for entirely unrelated license reasons, and will probably be keeping them.)
This rewrite is very likely not holding up to the standard of a "clean room design" required to relicense the code under a less strict license. I state this here again clearly, so that you can't claim you didn't know. Please don't move forward with it. What does Claude say to the relicensing attempt? And why do you want to relicense it in the first place? Maybe there's a solution that can gain more community support and fulfill your needs.
Summary
This PR is a ground-up, MIT-licensed rewrite of chardet. It maintains API compatibility with chardet 5.x and 6.x, but with a 27x improvement in detection speed and highly accurate support for even more encodings. It fixes numerous longstanding issues (like poor accuracy on short strings and poor performance), and is all-around better than previous versions of chardet in every respect. It's even faster, more memory-efficient, and more accurate than charset-normalizer, which is something I'm particularly proud of. The test data has also been moved to a separate repo to help prevent any licensing issues that keeping it here might have led to.
Highlights
What's New
- `EncodingEra` filtering — scope detection to modern web encodings, legacy ISO/Mac/DOS, mainframe, or all
- `detect()` and `detect_all()` are safe to call concurrently; scales on free-threaded Python
- `detect()`, `detect_all()`, `UniversalDetector`, and the `chardetect` CLI all work as before

Testing
Issues Closed
Closes #19 — Merge with cChardet (rewrite is 27x faster, optional mypyc makes cChardet unnecessary)
Closes #24 — Misdetects iso8859-1 as windows-1251 (improved statistical models + confusion groups)
Closes #45 — Interface to see confidence of all encodings (`detect_all()` returns ranked candidates)

Closes #48 — Retraining and storing data (compact binary models, `train.py` script)

Closes #62 — Incorrectly detecting UTF-16 as UTF-32LE (proper BOM validation + utf1632 stage)
Closes #64 — iso-8859-7 NBSP detection (byte validity filtering)
Closes #71 — Warning about encodings not supported by Python (registry validates all `python_codec` values)

Closes #77 — ISO-8859-2 not recognised (in registry with statistical models)
Closes #80 — No support for Arabic/cp1256 (windows-1256 in registry with Arabic/Farsi models)
Closes #82 — False HZ-GB-2312 on ASCII (validated HZ regions)
Closes #87 — windows-1250 detection (in registry with MODERN_WEB era)
Closes #105 — UTF-16 without BOM detection (dedicated `utf1632.py` stage)

Closes #122 — Support for EBCDIC detection (6 EBCDIC codepages in MAINFRAME era)
Closes #124 — IndexError on ISO-8859-7 (no manual array indexing)
Closes #128 — Purple heart loses UTF-8 confidence (4-byte UTF-8 validation)
Closes #132 — Misdetect win1251 as MacCyrillic (confusion groups + era filtering)
Closes #134 — Wrong UTF-8 detection (structural validator)
Closes #135 — ISO-8859-1 instead of SHIFT-JIS (structural probing for CJK)
Closes #136 — English text detected as Turkish (niche Latin demotion)
Closes #137 — Add windows-1256 support (same as #80)
Closes #138 — ISO-8859-1 instead of UTF-8 (UTF-8 validation runs before single-byte)
Closes #148 — windows-1254 instead of UTF-8 (UTF-8 priority + 1254 demotion)
Closes #149 — Use big5-2003 instead of big5 (`big5hkscs` is the canonical name)

Closes #164 — No support for UHC for Korean (cp949/UHC in registry)
Closes #168 — GB18030 classified as GB2312 (`gb18030` is canonical, `gb2312` is an alias)

Closes #170 — CP949 illegal multibyte sequence (validated `python_codec` values)

Closes #177 — logging NullHandler (rewrite doesn't use the logging module)
Closes #178 — GB18030 BOM confuses detection (proper BOM handling + multi-stage pipeline)
Closes #183 — chardet stuck in infinity loop (linear pipeline, no recursive probers)
Closes #185 — UTF-8 sentence detected as Windows-1252 (early UTF-8 stage)
Closes #196 — Reads entire ASCII file without threshold (ASCII detected in single O(n) pass)
Closes #197 — windows-1253 not detected (in registry with Greek models)
Closes #231 — Problematic licensing of tests (test data in separate repo)
Closes #271 — Documentation licensing (MIT license, test data in separate repo)
Closes #280 — Failed to detect CP932 (in registry with structural probing)
Closes #286 — detect() slower than UniversalDetector.feed (same pipeline now)
Closes #288 — Wrong detection with ö (UTF-8 structural validation)
Closes #289 — detect() quadratic complexity DoS (linear pipeline + `max_bytes` cap)

Closes #292 — Invalid windows-1254 chars not caught (byte validity filtering)
Closes #293 — Chinese encoding to be UTF-8 (improved CJK gating + UTF-8 priority)
Closes #294 — Wrong encoding with 2-char Chinese (CJK gating improvements)
Closes #305 — Degree character detection (UTF-8 structural validation)
Closes #308 — Single accented char misdetected (UTF-8 baseline confidence 0.80)
Closes #317 — Single euro sign byte detected as None (byte validity + statistical scoring)
Closes #321 — Loops in coverage tool on Orange Pi (no per-language Python model files)
Development note
I put this together using Claude Code with Opus 4.6 and the amazing https://github.com/obra/superpowers plugin in less than a week. It took a fair amount of iteration to get it dialed in the way I wanted, but it turned a project I had been putting off for many years into roughly four days of work.