Fix null-separated ASCII misdetected as UTF-16-BE #347

Merged: dan-blanchard merged 12 commits into main from null-separator-tolerance on Mar 17, 2026

@dan-blanchard
Member

Summary

  • Fix ASCII text with null byte separators (e.g., find -print0, git ls-tree -z) being misdetected as utf-16-be with confidence 0.95
  • Add a guard in the UTF-16 detector that rejects candidates when null bytes are sparse and all other bytes are printable ASCII — these are field separators, not encoding artifacts
  • Extend ASCII detection to tolerate up to 5% null bytes, returning confidence 0.99 (vs 1.0 for pure ASCII) so consumers can distinguish
  • Reorder the pipeline to compute ASCII before binary detection (matching the existing UTF-8 precheck pattern) so null-containing ASCII isn't falsely classified as binary
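The null-tolerant ASCII check in the third bullet can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function name `ascii_confidence`, the exact contents of `ASCII_TEXT_BYTES`, the handling of empty input, and the inclusive treatment of exactly 5% are all assumptions.

```python
from typing import Optional

# Assumed set of "text" ASCII bytes (printable characters plus tab, LF, CR).
# The real shared table in chardet may include other bytes.
ASCII_TEXT_BYTES = bytes(range(0x20, 0x7F)) + b"\t\n\r"
MAX_NULL_PERCENT = 5  # "up to 5% null bytes" from the summary


def ascii_confidence(data: bytes) -> Optional[float]:
    """Return 1.0 for pure ASCII, 0.99 for ASCII with sparse nulls, else None."""
    if not data:
        return None  # empty-input handling is an assumption of this sketch
    # Deleting every allowed byte leaves only the bytes that rule out pure ASCII.
    leftover = data.translate(None, ASCII_TEXT_BYTES)
    if not leftover:
        return 1.0
    nulls = leftover.count(0)
    if nulls != len(leftover):
        return None  # something other than nulls is present (high bytes, controls)
    # Integer comparison avoids floating-point surprises at the 5% boundary.
    if nulls * 100 > len(data) * MAX_NULL_PERCENT:
        return None
    return 0.99
```

Returning 0.99 rather than 1.0 is what lets a caller tell null-separated output apart from pure ASCII without a second pass over the data.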

Fixes #346

Test plan

  • Exact byte string from #346 (ASCII text with null separators detected as utf-16-be) returns ascii at 0.99
  • find -print0 style output returns ascii at 0.99
  • Real UTF-16-BE/LE text (Latin and CJK) still detected correctly
  • High null fraction (>5%) rejected by ASCII detector
  • Nulls mixed with non-ASCII high bytes rejected
  • Boundary tests at exactly 5% null fraction
  • Full accuracy suite: 7459 passed, 0 regressions
  • Full test suite: 8092 passed, 0 failures
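Inputs like the ones in this test plan are easy to build by hand. The helpers below are a hypothetical sketch (none of these names come from the chardet test suite) showing how `find -print0` style data and an exact-boundary sample could be constructed.

```python
def print0_join(paths: list[str]) -> bytes:
    """Join paths the way `find -print0` emits them: each path followed by a NUL."""
    return b"".join(p.encode("ascii") + b"\x00" for p in paths)


def null_fraction(data: bytes) -> float:
    """Fraction of bytes in `data` that are NUL (0.0 for empty input)."""
    return data.count(0) / len(data) if data else 0.0


def sample_with_nulls(total: int, nulls: int) -> bytes:
    """Build `total` bytes of ASCII containing exactly `nulls` evenly spaced NULs."""
    buf = bytearray(b"a" * total)
    if nulls:
        step = total // nulls
        for i in range(nulls):
            buf[i * step] = 0
    return bytes(buf)


at_boundary = sample_with_nulls(100, 5)    # exactly 5% nulls
over_boundary = sample_with_nulls(100, 6)  # just above 5%
```

With chardet 7.2.0 or later installed, passing such samples to `chardet.detect` should, per the test plan above, report ascii at 0.99 for the boundary case. That call is left out of the block so it runs without the library.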

🤖 Generated with Claude Code

dan-blanchard and others added 10 commits March 17, 2026 13:08
Addresses #346 — ASCII text with null byte separators
(common in Unix CLI output) being misdetected as utf-16-be.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Clarify UTF-16 guard applies in both single/dual candidate paths
- Note mypyc compilation constraint for utf1632.py
- Detail ASCII implementation using existing _ALLOWED_ASCII table
- Clarify pipeline reorder: computation order vs return order
- Note UniversalDetector propagation
- Fix language=None vs "" discrepancy in test expectations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
6-task TDD plan covering UTF-16 guard, null-tolerant ASCII detection,
and pipeline reorder. Addresses #346.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ication (#346)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Spec and plan are preserved in git history but don't need to be in the
final merge diff.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Real-world null-separator data (find -print0, git ls-tree -z) is
1-3.5% nulls. 5% covers all realistic cases while staying well below
the UTF-16 guard threshold (15%).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
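To make the relationship between the two thresholds concrete, here is a rough sketch of a UTF-16 guard of the kind the summary describes. The function name, the byte set, and the exact comparison (strictly below 15%) are assumptions made for illustration; the merged code in utf1632.py may differ.

```python
# Assumed set of "text" ASCII bytes; the real shared constant may differ.
ASCII_TEXT_BYTES = bytes(range(0x20, 0x7F)) + b"\t\n\r"
UTF16_GUARD_NULL_PERCENT = 15  # "UTF-16 guard threshold (15%)" from the commit message


def looks_like_null_separated_ascii(data: bytes) -> bool:
    """True when nulls are sparse and every other byte is ASCII text.

    A UTF-16 detector could skip its candidates when this returns True,
    since the nulls are then more likely separators than encoding artifacts.
    """
    if not data:
        return False
    nulls = data.count(0)
    if nulls == 0:
        return False  # no nulls, so the guard has nothing to say
    if nulls * 100 >= len(data) * UTF16_GUARD_NULL_PERCENT:
        return False  # nulls are not sparse; could be genuine UTF-16
    # Delete nulls and allowed ASCII; anything left is a high or control byte.
    return not data.translate(None, ASCII_TEXT_BYTES + b"\x00")
```

UTF-16 text that consists mostly of Latin characters carries a null in roughly every other byte, so it sits far above 15%, while the 1-3.5% null rates quoted above for separator data fall comfortably inside both limits.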
@codecov

codecov bot commented Mar 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (a98f097) to head (43421c3).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #347   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           23        23           
  Lines         1436      1449   +13     
=========================================
+ Hits          1436      1449   +13     

dan-blanchard and others added 2 commits March 17, 2026 14:11
Replace Python-level all() loop with C-level bytes.translate(), matching
the pattern used in binary.py and ascii.py. Cross-references the shared
ASCII byte set with ascii.py's _ALLOWED_ASCII.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
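The change described in this commit message follows a common Python idiom: instead of testing each byte in a Python-level loop, delete every allowed byte with `bytes.translate` and check whether anything remains. The sketch below contrasts the two forms; the function names and the contents of `ASCII_TEXT_BYTES` are assumptions, not the merged code.

```python
# Assumed set of "text" ASCII bytes; the real shared constant may differ.
ASCII_TEXT_BYTES = bytes(range(0x20, 0x7F)) + b"\t\n\r"
_ALLOWED = frozenset(ASCII_TEXT_BYTES)


def all_allowed_loop(data: bytes) -> bool:
    """Python-level check: one membership test per byte."""
    return all(b in _ALLOWED for b in data)


def all_allowed_translate(data: bytes) -> bool:
    """C-level check: delete the allowed bytes and see if anything is left."""
    return not data.translate(None, ASCII_TEXT_BYTES)
```

Both functions return the same answer for any input. The `translate` form does its scanning inside the interpreter's C implementation, which is why it is usually the faster choice for large buffers.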
Extract ASCII_TEXT_BYTES to pipeline/__init__.py and use it in both
ascii.py and utf1632.py to prevent drift between the two definitions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dan-blanchard dan-blanchard merged commit 89a9a4c into main Mar 17, 2026
17 checks passed
@dan-blanchard dan-blanchard deleted the null-separator-tolerance branch March 17, 2026 18:29
@jkugler

jkugler commented Mar 17, 2026

Excellent, thank you!

@dan-blanchard
Member Author

Happy to help! 7.2.0 is out now with the fix
