Skip to content

Fix SJISDistributionAnalysis discarding valid second-byte range and wrong first-byte offset#315

Merged
dan-blanchard merged 1 commit intochardet:mainfrom
bysiber:fix/sjis-distribution-second-byte
Feb 21, 2026
Merged

Fix SJISDistributionAnalysis discarding valid second-byte range and wrong first-byte offset#315
dan-blanchard merged 1 commit intochardet:mainfrom
bysiber:fix/sjis-distribution-second-byte

Conversation

@bysiber
Copy link
Copy Markdown
Contributor

@bysiber bysiber commented Feb 20, 2026

SJISDistributionAnalysis.get_order() has two issues:

1. Second-byte range >= 0x80 rejected entirely

The valid SJIS trail byte ranges are 0x40–0x7E and 0x80–0xFC, but the current code sets order = -1 for any second byte > 0x7F. This discards roughly half of all valid SJIS two-byte characters from the frequency analysis, significantly reducing detection confidence for Shift_JIS text.

For comparison, Big5DistributionAnalysis.get_order() correctly handles its similar dual-range second byte (0x40–0x7E and 0xA1–0xFE) by adjusting the offset for the second range.

The fix replaces order = -1 with order -= 1 to account for the gap at 0x7F (which is not a valid SJIS trail byte), producing a continuous index across both ranges.

2. First-byte offset wrong for 0xE0–0xEF range

The second else-if branch uses first_char - 0x81 instead of first_char - 0xE0 + 31. Since the 0x81–0x9F range has 31 first-byte values, the 0xE0 range should offset by 31 to avoid overlapping indices. The comment in the code also had a typo in the second byte range (0x81 -- oxfe0x80 -- 0xfc).

The SJIS get_order() method rejects all two-byte characters whose
second byte is >= 0x80, but the valid SJIS trail byte ranges are
0x40-0x7E and 0x80-0xFC. This discards roughly half of all valid
SJIS kanji from the frequency analysis, reducing detection confidence.

Also fixes the first-byte offset for the 0xE0-0xEF range — it was
using (first_char - 0x81) instead of (first_char - 0xE0 + 31),
which produced overlapping order values with the 0x81-0x9F range.

The fix mirrors the approach used by Big5DistributionAnalysis, which
correctly handles its similar dual-range second byte.
@dan-blanchard
Copy link
Copy Markdown
Member

Thanks for the fix! I'll add a regression test directly.

@dan-blanchard dan-blanchard enabled auto-merge (squash) February 21, 2026 15:42
@dan-blanchard dan-blanchard merged commit 85bf440 into chardet:main Feb 21, 2026
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants