Add include/exclude encoding filters by dan-blanchard · Pull Request #343 · chardet/chardet

dan-blanchard · 2026-03-15T01:49:06Z

Summary

Add include_encodings and exclude_encodings parameters to detect(), detect_all(), and UniversalDetector for fine-grained control over which encodings are considered during detection
Add no_match_encoding and empty_input_encoding parameters to customize what's returned when detection is inconclusive or input is empty
Add -i/--include-encodings, -x/--exclude-encodings, --no-match-encoding, --empty-input-encoding CLI flags
All four parameters are keyword-only with backward-compatible defaults
Empty include_encodings=[] raises ValueError (almost certainly a user error)
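
The parameter semantics listed above can be sketched in plain Python. This is a minimal illustration, not chardet's implementation: `resolve_candidates` and its `era_encodings` argument are hypothetical names, and the order of operations (intersect the era's encodings with the include list, then remove exclusions) is an assumption based on the "intersection semantics" commit note in this PR.

```python
from __future__ import annotations

from collections.abc import Iterable


def resolve_candidates(
    era_encodings: Iterable[str],
    *,  # filters are keyword-only, mirroring the PR summary
    include_encodings: Iterable[str] | None = None,
    exclude_encodings: Iterable[str] | None = None,
) -> frozenset[str]:
    """Hypothetical helper: narrow an era's encodings with include/exclude filters."""
    candidates = set(era_encodings)

    if include_encodings is not None:
        include = set(include_encodings)
        if not include:
            # An empty allow-list would rule out every encoding, so it is
            # treated as a caller mistake rather than a valid filter.
            raise ValueError("include_encodings must not be empty")
        candidates &= include  # assumed: intersection with the era's encodings

    if exclude_encodings is not None:
        candidates -= set(exclude_encodings)

    return frozenset(candidates)
```

Leaving both filters at `None` returns the era's encodings unchanged, which is how a backward-compatible default could behave under these assumptions.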

Closes #301

Details

The candidate set is built once from get_candidates() (which incorporates encoding_era + include_encodings + exclude_encodings) and used to gate all pipeline stages uniformly — both early-exit stages (BOM, UTF-8, ASCII, markup charset, escape sequences) and the statistical scoring pipeline. This means encoding_era now gates early-exit stages too, which is more consistent than the previous behavior (where only escape sequences were era-gated).
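
To make the uniform gating concrete, here is a rough sketch of two early-exit stages that consult a precomputed candidate set. The function names, the stage order, and the encoding names returned (`utf-8-sig`, `utf-16`, `utf-32`) are illustrative assumptions, not chardet's actual pipeline; only the `codecs` BOM constants are real standard-library values.

```python
from __future__ import annotations

import codecs

# BOM prefixes, longest first, so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def detect_bom(data: bytes, candidates: frozenset[str]) -> str | None:
    """Hypothetical early-exit stage: report a BOM match only if it is allowed."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # The gate: a filtered-out encoding is never returned, even on a match.
            return encoding if encoding in candidates else None
    return None


def detect_utf8(data: bytes, candidates: frozenset[str]) -> str | None:
    """Hypothetical early-exit stage: strict UTF-8 decode, gated the same way."""
    if "utf-8" not in candidates:
        return None
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "utf-8"


def run_early_exit_stages(data: bytes, candidates: frozenset[str]) -> str | None:
    """Try each gated stage in order and return the first allowed result."""
    for stage in (detect_bom, detect_utf8):
        result = stage(data, candidates)
        if result is not None:
            return result
    return None
```

Because every stage receives the same `candidates` set, a filter applied once up front holds for all of them, and no stage needs its own era or include/exclude logic.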

Binary detection (encoding=None) is intentionally exempt from filtering. When a fallback or empty-input encoding is itself filtered out, a UserWarning is emitted and encoding=None is returned.
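
The fallback rule described above might look roughly like the following. `resolve_fallback` is a hypothetical helper and the warning text is invented for illustration; the only behavior taken from the PR description is that a filtered-out fallback produces a `UserWarning` and a `None` result, while a `None` (binary) result passes through untouched.

```python
from __future__ import annotations

import warnings


def resolve_fallback(fallback: str | None, candidates: frozenset[str]) -> str | None:
    """Hypothetical helper: apply the encoding filters to a fallback result."""
    if fallback is None:
        # None stands for binary data and is exempt from filtering.
        return None
    if fallback not in candidates:
        warnings.warn(
            f"fallback encoding {fallback!r} is excluded by the encoding "
            "filters; returning None",
            UserWarning,
            stacklevel=2,
        )
        return None
    return fallback
```

Callers that prefer an exception over a silent `None` can promote the warning with `warnings.simplefilter("error", UserWarning)`.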

Test plan

🤖 Generated with Claude Code

codecov · 2026-03-15T01:49:52Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (63e90b5) to head (a6b99ba).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files

@@            Coverage Diff            @@
##              main      #343   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           23        23           
  Lines         1390      1436   +46     
=========================================
+ Hits          1390      1436   +46

Addresses #301. Adds include_encodings and exclude_encodings parameters to the public API for fine-grained control over which encodings are considered during detection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- include_encodings now gates early-exit stages too (BOM, UTF-8, etc.)
- Add fallback_encoding and empty_encoding customizable parameters
- Filtered fallback/empty results emit UserWarning and return None
- Binary detection explicitly unaffected by encoding filters
- Clarify encoding_era + include_encodings intersection semantics
- Add CLI flags for fallback-encoding and empty-encoding

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Clarify prefer_superset/compat_names interaction with filters
- Note cache growth characteristics for get_candidates()
- Document escape stage era gate preserved alongside new filter
- Note era gating of other early-exit stages is out of scope
- Add post-processing invariant (never introduces new encodings)
- Clarify overlap behavior path through fallback

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

9 tasks across 4 chunks: registry layer, pipeline orchestrator, public API, and CLI. TDD approach with tests before implementation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>