Fix SJISDistributionAnalysis discarding valid second-byte range and wrong first-byte offset by bysiber · Pull Request #315 · chardet/chardet

bysiber · 2026-02-20T13:03:55Z

SJISDistributionAnalysis.get_order() has two issues:

1. Second-byte range >= 0x80 rejected entirely

The valid SJIS trail byte ranges are 0x40–0x7E and 0x80–0xFC, but the current code sets order = -1 for any second byte > 0x7F. This discards roughly half of all valid SJIS two-byte characters from the frequency analysis, significantly reducing detection confidence for Shift_JIS text.

For comparison, Big5DistributionAnalysis.get_order() correctly handles its similar dual-range second byte (0x40–0x7E and 0xA1–0xFE) by adjusting the offset for the second range.

The fix replaces order = -1 with order -= 1 to account for the gap at 0x7F (which is not a valid SJIS trail byte), producing a continuous index across both ranges.

2. First-byte offset wrong for 0xE0–0xEF range

The second else-if branch uses first_char - 0x81 instead of first_char - 0xE0 + 31. Since the 0x81–0x9F range has 31 first-byte values, the 0xE0 range should offset by 31 to avoid overlapping indices. The comment in the code also had a typo in the second byte range (0x81 -- oxfe → 0x80 -- 0xfc).

The SJIS get_order() method rejects all two-byte characters whose second byte is >= 0x80, but the valid SJIS trail byte ranges are 0x40-0x7E and 0x80-0xFC. This discards roughly half of all valid SJIS kanji from the frequency analysis, reducing detection confidence. Also fixes the first-byte offset for the 0xE0-0xEF range — it was using (first_char - 0x81) instead of (first_char - 0xE0 + 31), which produced overlapping order values with the 0x81-0x9F range. The fix mirrors the approach used by Big5DistributionAnalysis, which correctly handles its similar dual-range second byte.

dan-blanchard · 2026-02-21T15:42:50Z

Thanks for the fix! I'll add a regression test directly.

dan-blanchard enabled auto-merge (squash) February 21, 2026 15:42

dan-blanchard disabled auto-merge February 21, 2026 15:46

dan-blanchard merged commit 85bf440 into chardet:main Feb 21, 2026
19 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix SJISDistributionAnalysis discarding valid second-byte range and wrong first-byte offset#315

Fix SJISDistributionAnalysis discarding valid second-byte range and wrong first-byte offset#315
dan-blanchard merged 1 commit intochardet:mainfrom
bysiber:fix/sjis-distribution-second-byte

bysiber commented Feb 20, 2026

Uh oh!

dan-blanchard commented Feb 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

bysiber commented Feb 20, 2026

1. Second-byte range >= 0x80 rejected entirely

2. First-byte offset wrong for 0xE0–0xEF range

Uh oh!

dan-blanchard commented Feb 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants