
[WIP] Retrain SBCS Models and some refactoring#99

Merged
dan-blanchard merged 115 commits into main from feature/retrain_sbcs_models
Dec 18, 2025
Conversation


@dan-blanchard dan-blanchard commented Apr 10, 2017

This branch is not ready to go yet because I still have several test failures, but I opened this WIP PR so that people can see what I'm working on (instead of me reiterating it in comments on issues).

The main changes are:

  • Cleans up the abandoned PR #52 ("New language models added; old inacurate models was rebuilded. Hungarian test files changed. Script for language model building added")
  • Adds SBCS language model training script that can train from text files or wikipedia data
  • Adds support for several languages we were missing (will enumerate them all when the WIP tag is removed from this)
  • Makes test.py test all languages and encodings that we have data for, since we now have models for them.
  • Retrains all SBCS models, and even adds support for an English language model that we might be able to use to get rid of the latin-1 specific prober (more testing is needed here).
  • Fixes a bug in the XML tag filter where parts of the XML tags themselves would be retained.
  • Adds language to UniversalDetector output.
  • Eliminates wrap_ord usage, which provides a nice speedup.
  • All SBCS models are now stored as dicts of dicts, because that is much faster than storing them as giant lists. The model files are much longer (and a bit harder to read), but no one really needs to read them manually except when retraining anyway.
  • Adds a languages metadata module that contains the information necessary for training all of the SBCS models (language name, supported encodings, alphabet, does it use ASCII, etc.).
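To make the dict-of-dicts point above concrete, here is a minimal sketch of the two storage layouts. The names, sizes, and data are invented for illustration and are not chardet's real tables; the point is only that the nested-dict lookup avoids the index arithmetic and stores sparse models compactly.

```python
# Hypothetical illustration of the two SBCS model storage layouts.

# Old layout: one giant flat list indexed by prev_order * SAMPLE_SIZE + cur_order.
SAMPLE_SIZE = 64
flat_model = [0] * (SAMPLE_SIZE * SAMPLE_SIZE)
flat_model[5 * SAMPLE_SIZE + 7] = 3  # likelihood of the character-order pair (5, 7)

def flat_lookup(prev_order, cur_order):
    return flat_model[prev_order * SAMPLE_SIZE + cur_order]

# New layout: dict of dicts keyed by character order; missing keys mean
# "never seen", so sparse models stay small and lookups skip the math.
nested_model = {5: {7: 3}}

def nested_lookup(prev_order, cur_order):
    return nested_model.get(prev_order, {}).get(cur_order, 0)
```

Both lookups return the same likelihood for a seen pair, and the nested form falls back to 0 for unseen pairs instead of storing an explicit zero for every cell.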

I am well aware that this monstrosity is very hard to review given its size, so I may try to pull some parts out into separate PRs where possible. For example, the change that recapitalizes all the enum attributes (since they're class attributes and we're not using the Python-3-style enums because of the extra dependency that would require us to add) could certainly be pulled out.

- unlikely = occurred at least 3 times in training data
- negative = did not occur at least 3 times in training data

We should probably allow tweaking these thresholds when training models, as 64

These notes are slightly out of date, as we now use the length of the alphabet for the language instead of 64 here.
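As a rough illustration of how bigram counts might be bucketed into likelihood levels with a tweakable minimum-occurrence threshold, here is a sketch. The function name, level numbering, and rank cutoffs are invented; per the note above, the rank cutoff is derived from the language's alphabet length rather than a fixed 64.

```python
from collections import Counter

def categorize_sequences(counts: Counter, alphabet_len: int, min_count: int = 3):
    """Bucket bigram counts into likelihood levels (illustrative sketch).

    'negative' (0): seen fewer than `min_count` times in training data.
    'positive' (3) / 'likely' (2): among the most frequent, by rank,
    using `alphabet_len` instead of a hard-coded 64.
    'unlikely' (1): seen at least `min_count` times but ranked low.
    """
    ranked = [seq for seq, _ in counts.most_common()]
    levels = {}
    for rank, seq in enumerate(ranked):
        if counts[seq] < min_count:
            levels[seq] = 0  # negative
        elif rank < alphabet_len:
            levels[seq] = 3  # positive
        elif rank < alphabet_len * 4:
            levels[seq] = 2  # likely
        else:
            levels[seq] = 1  # unlikely
    return levels
```

Exposing `min_count` as a parameter is exactly the kind of threshold tweaking the comment above suggests the training script should allow.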

…because the state machines are too simple to actually do that
This commit fixes 470 out of 472 UTF-16/32 test failures (99.6% success rate)
by improving binary file detection and making the UTF1632Prober more flexible
for CJK and other non-ASCII heavy text.

Changes:
1. universaldetector.py:
   - Increased binary detection sample size from 100 to 200 bytes
   - Lowered UTF-16/32 pattern detection threshold from 20 to ~12 nulls
   - Simplified logic to skip binary check when UTF-16/32 pattern detected

2. utf1632prober.py:
   - Added MIN_RATIO (0.08) for handling non-ASCII heavy text
   - Modified UTF-16/32 detection to require:
     * 94% non-zeros in expected positions (unchanged)
     * Only 8% zeros in expected positions (was 94%)
   - This handles CJK text where Chinese/Japanese characters have few null bytes

3. test.py:
   - Added 2 Japanese UTF-16 files as expected failures
   - These extreme cases have >95% non-ASCII and <5% null bytes

Results:
- UTF-16/32 failures: 472 → 2 (0 actual failures, 2 expected)
- Overall test failures: 590 → 59 (90% reduction)
- Successfully handles UTF-16/32 for Chinese, Japanese, Korean, Arabic,
  Russian, and other languages with varying ASCII content ratios
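The positional null-byte check described above can be sketched as follows. This is not the actual UTF1632Prober code; the function name and structure are invented, but the thresholds mirror the commit message: nearly all bytes in the "character" positions must be non-zero, while only a small fraction of zeros is required in the expected null positions, so CJK text with few null bytes still qualifies.

```python
def looks_like_utf16(data: bytes, nonzero_ratio: float = 0.94, zero_ratio: float = 0.08):
    """Rough sketch of positional null-byte detection for BOM-less UTF-16.

    For UTF-16LE Latin text, odd byte positions are mostly 0x00; for CJK
    text those columns thin out, so only `zero_ratio` (8%) is required.
    Returns "UTF-16LE", "UTF-16BE", or None.
    """
    if len(data) < 4:
        return None
    even, odd = data[0::2], data[1::2]
    for zeros, nonzeros, label in ((odd, even, "UTF-16LE"), (even, odd, "UTF-16BE")):
        zero_frac = zeros.count(0) / len(zeros)
        nonzero_frac = 1 - nonzeros.count(0) / len(nonzeros)
        if nonzero_frac >= nonzero_ratio and zero_frac >= zero_ratio:
            return label
    return None
```

Plain ASCII bytes fail both checks (no zeros at all in either column), which is why the expected-failure Japanese files above, with under 5% null bytes, fall below even the relaxed 8% floor.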

Fixes 16 CP037 test failures by importing and registering all generated
CP037 language models that were previously missing.

Changes:
- Added CP037 model imports for: Breton, Danish, Dutch, Finnish, French,
  German, Icelandic, Indonesian, Irish, Italian, Malay, Norwegian,
  Portuguese, Scottish Gaelic, Spanish, Swedish, Welsh
- Registered all CP037 models in SBCSGroupProber's probers list
- Models are properly filtered by encoding_era (MAINFRAME)

Before: 59 failures (including 18 CP037 failures)
After: 59 failures (CP037 failures fixed, matches expected EBCDIC behavior)

CP037 files now detect correctly when tested with MAINFRAME or ALL encoding era.
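The encoding_era filtering mentioned above might look something like this sketch. The enum values, model records, and function are invented stand-ins, not the PR's actual API; they just show how probers for legacy mainframe encodings like CP037 can be excluded unless the caller opts in.

```python
from enum import Enum

class EncodingEra(Enum):
    """Invented stand-in for the PR's encoding_era values."""
    MODERN = "modern"
    MAINFRAME = "mainframe"
    ALL = "all"

# Hypothetical (name, era) records standing in for registered SBCS models.
MODELS = [
    ("windows-1252:French", EncodingEra.MODERN),
    ("cp037:French", EncodingEra.MAINFRAME),
    ("cp500:French", EncodingEra.MAINFRAME),
]

def probers_for_era(era):
    """Return the model names active for the requested era (illustrative)."""
    if era is EncodingEra.ALL:
        return [name for name, _ in MODELS]
    return [name for name, model_era in MODELS if model_era is era]
```

Under this scheme, the CP037 files detect only when the caller asks for MAINFRAME or ALL, matching the behavior described above.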

Adds imports and registrations for CP500 models in languages that were
missing: Breton, Icelandic, Indonesian, Irish, Malay, Scottish Gaelic, Welsh.

CP500 is International EBCDIC and is very similar to CP037 (US EBCDIC).
Most tests now pass, with remaining failures being expected EBCDIC ambiguity
(CP037 vs CP500 confusion on short texts).

Before: 59 failures
After: 59 failures (CP500 models added, EBCDIC confusion expected)

Adds test_all_single_byte_encodings_have_probers() which checks that:
1. All single-byte encodings listed in metadata/charsets.py have
   corresponding probers registered in SBCSGroupProber
2. All single-byte encodings from language charsets in metadata/languages.py
   are covered

This test will catch cases where we generate new language models but forget
to import and register them in SBCSGroupProber.

Also fixes:
- Mark CP949 as multi-byte (it's Korean, extension of EUC-KR)
- Add CP1006 to known exceptions (Urdu encoding with limited Python support)

The test passes, confirming all expected single-byte encodings have probers.
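The shape of such a coverage check is roughly this. The metadata sets here are tiny invented stand-ins for metadata/charsets.py and the registered prober list; the real test would build these sets from the actual modules.

```python
# Sketch of a coverage check like test_all_single_byte_encodings_have_probers().
# These sets are invented minimal stand-ins for the real metadata.

METADATA_SINGLE_BYTE = {"windows-1252", "iso-8859-1", "cp037", "cp1006"}
KNOWN_EXCEPTIONS = {"cp1006"}  # e.g. limited Python codec support
REGISTERED_PROBER_CHARSETS = {"windows-1252", "iso-8859-1", "cp037"}

def missing_probers():
    """Encodings listed in metadata but lacking a registered prober."""
    return METADATA_SINGLE_BYTE - KNOWN_EXCEPTIONS - REGISTERED_PROBER_CHARSETS
```

Asserting that `missing_probers()` is empty is what catches the "generated a model but forgot to register it" failure mode described above.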

Adds smarter tie-breaking logic to handle MacRoman vs ISO-8859-1/Windows-1252
ambiguities when confidence scores are very close.

Changes:
1. When MacRoman wins but ISO/Windows alternatives are within 99.5% confidence
   and no Mac letter patterns or Windows bytes exist, prefer ISO/Windows
   (handles files using only common chars >0x9F that are identical across encodings)

2. When ISO-8859-* wins and has Windows bytes (0x80-0x9F), only remap to Windows
   if MacRoman isn't a close contender (within 99.5% confidence with Mac letter patterns)
   (prevents incorrect Windows remapping when MacRoman is the better choice)

Results:
- Fixes ISO-8859-1 Indonesian detection (was incorrectly MacRoman)
- Preserves correct MacRoman Irish/Welsh detection
- Failures: 59 → 58

Remaining failures are mostly EBCDIC ambiguities and inherent encoding confusions.
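The two tie-break rules above can be sketched as a small resolver. Everything here is invented for illustration (function name, the confidence-map shape, the single-entry remap table); the real prober logic differs, but the 99.5% margin and the two conditions mirror the commit message.

```python
# Partial, illustrative remap table; the real mapping covers more pairs.
ISO_TO_WINDOWS = {"ISO-8859-1": "Windows-1252"}

def resolve_close_call(results, mac_evidence, windows_bytes, margin=0.995):
    """Sketch of the MacRoman vs ISO/Windows tie-break (names invented).

    `results` maps charset name to confidence; `mac_evidence` is True when
    MacRoman-specific letter patterns were seen, `windows_bytes` when any
    byte in 0x80-0x9F appeared.
    """
    winner = max(results, key=results.get)
    best = results[winner]
    close = {name for name, conf in results.items()
             if name != winner and conf >= best * margin}
    if winner == "MacRoman" and close and not mac_evidence and not windows_bytes:
        # Rule 1: the bytes used are identical across the candidates,
        # so prefer the more common ISO/Windows interpretation.
        winner = max(close, key=results.get)
    elif winner in ISO_TO_WINDOWS and windows_bytes:
        # Rule 2: only remap ISO to Windows when MacRoman is not a close
        # contender backed by Mac letter patterns.
        if not ("MacRoman" in close and mac_evidence):
            winner = ISO_TO_WINDOWS[winner]
    return winner
```

This is how a file using only the common high bytes shared by both encodings ends up reported as ISO-8859-1 rather than MacRoman, while genuinely Mac-patterned Irish/Welsh files keep their MacRoman verdict.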

@dan-blanchard dan-blanchard merged commit 664a6d2 into main Dec 18, 2025
2 of 38 checks passed
@dan-blanchard dan-blanchard deleted the feature/retrain_sbcs_models branch December 18, 2025 02:15