Fix null-separated ASCII misdetected as UTF-16-BE by dan-blanchard · Pull Request #347 · chardet/chardet

dan-blanchard · 2026-03-17T17:58:07Z

Summary

Fix ASCII text with null byte separators (e.g., find -print0, git ls-tree -z) being misdetected as utf-16-be with confidence 0.95
Add a guard in the UTF-16 detector that rejects candidates when null bytes are sparse and all other bytes are printable ASCII — these are field separators, not encoding artifacts
Extend ASCII detection to tolerate up to 5% null bytes, returning confidence 0.99 (vs 1.0 for pure ASCII) so consumers can distinguish null-separated ASCII from pure ASCII
Reorder the pipeline to compute ASCII before binary detection (matching the existing UTF-8 precheck pattern) so null-containing ASCII isn't falsely classified as binary
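
To make the UTF-16 guard described above concrete, here is a minimal sketch. It is not the code in utf1632.py: the function name, the byte set, and the use of the 15% figure as the "sparse" cutoff are all assumptions for illustration.

```python
# Hypothetical sketch of a "sparse nulls + printable ASCII" guard.
# Names, byte set, and cutoff are assumptions, not chardet's actual code.

# Assumed text bytes: printable ASCII plus tab, newline, carriage return.
_TEXT_BYTES = frozenset(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}

# Assumed cutoff below which nulls count as "sparse".
SPARSE_NULL_LIMIT = 0.15


def looks_like_null_separated_ascii(data: bytes) -> bool:
    """Return True when a UTF-16 candidate should be rejected."""
    nulls = data.count(b"\x00")
    # No nulls, or too many nulls: the guard does not apply.
    if nulls == 0 or nulls / len(data) >= SPARSE_NULL_LIMIT:
        return False
    # Every remaining byte must be ordinary ASCII text.
    return all(b == 0 or b in _TEXT_BYTES for b in data)


print(looks_like_null_separated_ascii(b"src/main.py\x00src/util.py\x00README.md\x00"))
print(looks_like_null_separated_ascii("hello".encode("utf-16-be")))
```

Real UTF-16 text made of Latin characters is roughly half null bytes, so it never reaches the printable-ASCII check in this sketch.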

Fixes #346

Test plan

Exact byte string from #346 (ASCII text with null separators detected as utf-16-be) returns ascii at 0.99
find -print0 style output returns ascii at 0.99
Real UTF-16-BE/LE text (Latin and CJK) still detected correctly
High null fraction (>5%) rejected by ASCII detector
Nulls mixed with non-ASCII high bytes rejected
Boundary tests at exactly 5% null fraction
Full accuracy suite: 7459 passed, 0 regressions
Full test suite: 8092 passed, 0 failures
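
The ASCII-side cases in the test plan can be illustrated with a small self-contained sketch. It is only an approximation of the behavior described here: `detect_ascii`, its return shape, the byte set, and the choice to treat exactly 5% as accepted are assumptions, not the API of ascii.py.

```python
from typing import Optional, Tuple

# Assumed text bytes: printable ASCII plus tab, newline, carriage return.
ASCII_TEXT = bytes(range(0x20, 0x7F)) + b"\t\n\r"

# Tolerance stated in this PR, expressed as a percentage.
MAX_NULL_PERCENT = 5


def detect_ascii(data: bytes) -> Optional[Tuple[str, float]]:
    """Hypothetical detector: ("ascii", confidence) or None."""
    # Delete every allowed text byte; what remains is nulls or disallowed bytes.
    rest = data.translate(None, ASCII_TEXT)
    if not rest:
        return ("ascii", 1.0)  # pure ASCII
    nulls = rest.count(b"\x00")
    if nulls != len(rest):
        return None  # at least one byte is neither ASCII text nor a null
    # Integer comparison avoids float rounding at the exact 5% boundary.
    if nulls * 100 > MAX_NULL_PERCENT * len(data):
        return None
    return ("ascii", 0.99)


print(detect_ascii(b"a" * 19 + b"\x00"))  # 1 null in 20 bytes: exactly 5%
print(detect_ascii(b"a" * 18 + b"\x00"))  # 1 null in 19 bytes: above 5%
```

The lower confidence for the null-tolerant case mirrors the 0.99 vs 1.0 distinction in the summary.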

🤖 Generated with Claude Code

- Clarify UTF-16 guard applies in both single/dual candidate paths
- Note mypyc compilation constraint for utf1632.py
- Detail ASCII implementation using existing _ALLOWED_ASCII table
- Clarify pipeline reorder: computation order vs return order
- Note UniversalDetector propagation
- Fix language=None vs "" discrepancy in test expectations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Real-world null-separator data (find -print0, git ls-tree -z) is 1-3.5% nulls. 5% covers all realistic cases while staying well below the UTF-16 guard threshold (15%). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
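
As a rough check of the arithmetic behind that range, the null fraction of `find -print0` style output can be computed directly. The paths below are invented for illustration and `null_fraction` is a hypothetical helper, not part of chardet.

```python
def null_fraction(data: bytes) -> float:
    """Fraction of bytes in `data` that are 0x00 (0.0 for empty input)."""
    return data.count(b"\x00") / len(data) if data else 0.0


# Made-up paths standing in for `find -print0` output.
paths = [
    b"./project/docs/readme.txt",
    b"./project/src/main_module.py",
    b"./project/tests/test_main_module.py",
]

# -print0 terminates every path with a single null byte.
sample = b"".join(p + b"\x00" for p in paths)
fraction = null_fraction(sample)

print(f"{len(paths)} nulls in {len(sample)} bytes: {fraction:.1%}")
```

With one null per path, the fraction is 1 / (average path length + 1), so output whose paths average 19 bytes or more stays at or under 5%.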

codecov · 2026-03-17T17:58:56Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (a98f097) to head (43421c3).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files

@@            Coverage Diff            @@
##              main      #347   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           23        23           
  Lines         1436      1449   +13     
=========================================
+ Hits          1436      1449   +13

Replace Python-level all() loop with C-level bytes.translate(), matching the pattern used in binary.py and ascii.py. Cross-references the shared ASCII byte set with ascii.py's _ALLOWED_ASCII. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
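
For readers unfamiliar with the idiom, the sketch below contrasts a Python-level `all()` loop with the `bytes.translate()` form. The byte set and function names are assumptions for illustration; the actual constant and call sites in chardet may differ.

```python
# Assumed allowed set: printable ASCII plus tab, newline, carriage return.
ALLOWED = bytes(range(0x20, 0x7F)) + b"\t\n\r"
_ALLOWED_SET = frozenset(ALLOWED)


def all_allowed_loop(data: bytes) -> bool:
    """Python-level check: one membership test per byte."""
    return all(b in _ALLOWED_SET for b in data)


def all_allowed_translate(data: bytes) -> bool:
    """C-level check: delete allowed bytes and see whether anything is left."""
    # With table=None, translate() only removes the bytes listed in `delete`.
    return not data.translate(None, ALLOWED)


for sample in (b"plain text\n", b"field\x00separated", b"caf\xe9", b""):
    assert all_allowed_loop(sample) == all_allowed_translate(sample)
```

Both functions return the same answer; the translate form does its scan inside the interpreter's C implementation of `bytes` rather than in a Python-level generator.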

jkugler · 2026-03-17T19:10:32Z

Excellent, thank you!

dan-blanchard · 2026-03-18T01:03:24Z

Happy to help! 7.2.0 is out now with the fix

dan-blanchard and others added 10 commits March 17, 2026 13:08

docs: add design spec for null separator tolerance

1c15b87

Addresses #346 — ASCII text with null byte separators (common in Unix CLI output) being misdetected as utf-16-be. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

docs: add implementation plan for null separator tolerance

e80e7fa

6-task TDD plan covering UTF-16 guard, null-tolerant ASCII detection, and pipeline reorder. Addresses #346. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

test: add failing tests for null-separator UTF-16 false positive (#346)

bc50bc5

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: reject null-separator false positives in UTF-16 detector (#346)

f638b41

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test: add failing tests for null-tolerant ASCII detection (#346)

6614864

feat: tolerate sparse null separators in ASCII detection (#346)

627f86e

fix: reorder pipeline so ASCII precheck prevents false binary classification (#346)

a9d6313

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore: remove planning docs from branch

abc35d8

Spec and plan are preserved in git history but don't need to be in the final merge diff. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

dan-blanchard and others added 2 commits March 17, 2026 14:11

refactor: share ASCII byte-set constant across pipeline modules

43421c3

Extract ASCII_TEXT_BYTES to pipeline/__init__.py and use it in both ascii.py and utf1632.py to prevent drift between the two definitions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

dan-blanchard merged commit 89a9a4c into main Mar 17, 2026
17 checks passed

dan-blanchard deleted the null-separator-tolerance branch March 17, 2026 18:29
