Fix SJISDistributionAnalysis discarding valid second-byte range and wrong first-byte offset #315
Merged
dan-blanchard merged 1 commit into chardet:main on Feb 21, 2026
Conversation
The SJIS get_order() method rejects all two-byte characters whose second byte is >= 0x80, but the valid SJIS trail byte ranges are 0x40-0x7E and 0x80-0xFC. This discards roughly half of all valid SJIS kanji from the frequency analysis, reducing detection confidence. Also fixes the first-byte offset for the 0xE0-0xEF range — it was using (first_char - 0x81) instead of (first_char - 0xE0 + 31), which produced overlapping order values with the 0x81-0x9F range. The fix mirrors the approach used by Big5DistributionAnalysis, which correctly handles its similar dual-range second byte.
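As a rough illustration of the trail-byte handling described above, the sketch below maps the two valid ranges onto one contiguous index. The function name `trail_index` is hypothetical and is not part of chardet; only the byte ranges come from the description.

```python
def trail_index(second_char: int) -> int:
    """Map a valid SJIS trail byte to a contiguous index 0..187, else -1.

    Hypothetical helper for illustration; not chardet's API.
    """
    if 0x40 <= second_char <= 0x7E:
        return second_char - 0x40
    if 0x80 <= second_char <= 0xFC:
        # 0x7F is not a valid trail byte, so shift down by one to close the gap.
        return second_char - 0x40 - 1
    return -1


# Collect the index of every valid trail byte, in byte order.
indices = [trail_index(b) for b in range(0x100) if trail_index(b) >= 0]
print(len(indices))                  # 188
print(indices == list(range(188)))   # True: no gap and no duplicates
print(trail_index(0x7E), trail_index(0x80))  # 62 63
```

With this mapping, 63 indices come from the 0x40–0x7E range and 125 from the 0x80–0xFC range, which is where the per-row stride of 188 comes from.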
Member
Thanks for the fix! I'll add a regression test directly.
SJISDistributionAnalysis.get_order() has two issues:

1. Second-byte range >= 0x80 rejected entirely

The valid SJIS trail byte ranges are 0x40–0x7E and 0x80–0xFC, but the current code sets order = -1 for any second byte > 0x7F. This discards roughly half of all valid SJIS two-byte characters from the frequency analysis, significantly reducing detection confidence for Shift_JIS text.

For comparison, Big5DistributionAnalysis.get_order() correctly handles its similar dual-range second byte (0x40–0x7E and 0xA1–0xFE) by adjusting the offset for the second range.

The fix replaces order = -1 with order -= 1 to account for the gap at 0x7F (which is not a valid SJIS trail byte), producing a continuous index across both ranges.

2. First-byte offset wrong for 0xE0–0xEF range

The second else-if branch uses first_char - 0x81 instead of first_char - 0xE0 + 31. Since the 0x81–0x9F range has 31 first-byte values, the 0xE0 range should offset by 31 to avoid overlapping indices. The comment in the code also had a typo in the second byte range (0x81 -- oxfe → 0x80 -- 0xfc).
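Putting both changes together, a minimal standalone sketch of the corrected computation might look like the following. The function name `sjis_order` and the constant `TRAIL_COUNT` are hypothetical, and this is not the actual chardet source. It assumes the trail byte has already been validated, so 0x7F is never passed in.

```python
TRAIL_COUNT = 188  # 63 values in 0x40-0x7E plus 125 in 0x80-0xFC


def sjis_order(byte_str: bytes) -> int:
    """Hypothetical standalone version of the corrected order computation."""
    first_char, second_char = byte_str[0], byte_str[1]
    if 0x81 <= first_char <= 0x9F:
        row = first_char - 0x81        # rows 0..30 (31 lead bytes)
    elif 0xE0 <= first_char <= 0xEF:
        row = first_char - 0xE0 + 31   # rows 31..46, continuing after row 30
    else:
        return -1
    order = TRAIL_COUNT * row + second_char - 0x40
    if second_char > 0x7F:
        order -= 1                     # close the gap left by the unused 0x7F
    return order


# The last cell of the first lead-byte range and the first cell of the
# second range end up adjacent.
print(sjis_order(bytes([0x9F, 0xFC])))  # 5827
print(sjis_order(bytes([0xE0, 0x40])))  # 5828
```

Under these assumptions, every valid lead/trail pair in the two lead-byte ranges gets a distinct order value between 0 and 8835.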