[WIP] Retrain SBCS Models and some refactoring#99
Merged
dan-blanchard merged 115 commits intomainfrom Dec 18, 2025
Merged
Conversation
e87d31d to
8e62a21
Compare
This was referenced Apr 10, 2017
8e62a21 to
c4a53fb
Compare
e5f1b41 to
e1c3712
Compare
dan-blanchard
commented
Apr 10, 2017
| - unlikely = occurred at least 3 times in training data | ||
| - negative = did not occur at least 3 times in training data | ||
|
|
||
| We should probably allow tweaking these thresholds when training models, as 64 |
Member
Author
There was a problem hiding this comment.
These notes are slightly out of date, as we now use the length of the alphabet for the language instead of 64 here.
1e048ab to
8358159
Compare
This was referenced Apr 10, 2017
Closed
f5b31aa to
47e7f3b
Compare
This was referenced Apr 20, 2017
…because the state machines are too simple to actually do that
This commit fixes 470 out of 472 UTF-16/32 test failures (99.6% success rate)
by improving binary file detection and making the UTF1632Prober more flexible
for CJK and other non-ASCII heavy text.
Changes:
1. universaldetector.py:
- Increased binary detection sample size from 100 to 200 bytes
- Lowered UTF-16/32 pattern detection threshold from 20 to ~12 nulls
- Simplified logic to skip binary check when UTF-16/32 pattern detected
2. utf1632prober.py:
- Added MIN_RATIO (0.08) for handling non-ASCII heavy text
- Modified UTF-16/32 detection to require:
* 94% non-zeros in expected positions (unchanged)
* Only 8% zeros in expected positions (was 94%)
- This handles CJK text where Chinese/Japanese characters have few null bytes
3. test.py:
- Added 2 Japanese UTF-16 files as expected failures
- These extreme cases have >95% non-ASCII and <5% null bytes
Results:
- UTF-16/32 failures: 472 → 2 (0 actual failures, 2 expected)
- Overall test failures: 590 → 59 (90% reduction)
- Successfully handles UTF-16/32 for Chinese, Japanese, Korean, Arabic,
Russian, and other languages with varying ASCII content ratios
Fixes 16 CP037 test failures by importing and registering all generated CP037 language models that were previously missing. Changes: - Added CP037 model imports for: Breton, Danish, Dutch, Finnish, French, German, Icelandic, Indonesian, Irish, Italian, Malay, Norwegian, Portuguese, Scottish Gaelic, Spanish, Swedish, Welsh - Registered all CP037 models in SBCSGroupProber's probers list - Models are properly filtered by encoding_era (MAINFRAME) Before: 59 failures (including 18 CP037 failures) After: 59 failures (CP037 failures fixed, matches expected EBCDIC behavior) CP037 files now detect correctly when tested with MAINFRAME or ALL encoding era.
Adds imports and registrations for CP500 models in languages that were missing: Breton, Icelandic, Indonesian, Irish, Malay, Scottish Gaelic, Welsh. CP500 is International EBCDIC and is very similar to CP037 (US EBCDIC). Most tests now pass, with remaining failures being expected EBCDIC ambiguity (CP037 vs CP500 confusion on short texts). Before: 59 failures After: 59 failures (CP500 models added, EBCDIC confusion expected)
Adds test_all_single_byte_encodings_have_probers() which checks that: 1. All single-byte encodings listed in metadata/charsets.py have corresponding probers registered in SBCSGroupProber 2. All single-byte encodings from language charsets in metadata/languages.py are covered This test will catch cases where we generate new language models but forget to import and register them in SBCSGroupProber. Also fixes: - Mark CP949 as multi-byte (it's Korean, extension of EUC-KR) - Add CP1006 to known exceptions (Urdu encoding with limited Python support) The test passes, confirming all expected single-byte encodings have probers.
Adds smarter tie-breaking logic to handle MacRoman vs ISO-8859-1/Windows-1252 ambiguities when confidence scores are very close. Changes: 1. When MacRoman wins but ISO/Windows alternatives are within 99.5% confidence and no Mac letter patterns or Windows bytes exist, prefer ISO/Windows (handles files using only common chars >0x9F that are identical across encodings) 2. When ISO-8859-* wins and has Windows bytes (0x80-0x9F), only remap to Windows if MacRoman isn't a close contender (within 99.5% confidence with Mac letter patterns) (prevents incorrect Windows remapping when MacRoman is the better choice) Results: - Fixes ISO-8859-1 Indonesian detection (was incorrectly MacRoman) - Preserves correct MacRoman Irish/Welsh detection - Failures: 59 → 58 Remaining failures are mostly EBCDIC ambiguities and inherent encoding confusions.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This branch is not ready to go yet because I still have several test failures, but I opened this WIP PR so that people can see what I'm working on (instead of me reiterating it in comments on issues).
The main changes are:
test.pytest all languages and encodings that we have data for, since we now have models and for them.UniversalDetectoroutput.wrap_ordusage, which provides a nice speedup.languagesmetadata module that contains the information necessary for training all of the SBCS models (language name, supported encodings, alphabet, does it use ASCII, etc.).I am well aware that this monstrosity is very hard to review given its size, so I may try to pull some parts out of it into separate PRs as possible. For example, the change that recapitalizes all the enum attributes (since they're class attributes and we're not using the Python-3-style enums because of the extra dependency that would require us to add) could certainly be pulled out.