Add include/exclude encoding filters #343

Merged
dan-blanchard merged 21 commits into main from filter-by-lister
Mar 15, 2026

Conversation

@dan-blanchard
Member

@dan-blanchard dan-blanchard commented Mar 15, 2026

Summary

  • Add include_encodings and exclude_encodings parameters to detect(), detect_all(), and UniversalDetector for fine-grained control over which encodings are considered during detection
  • Add no_match_encoding and empty_input_encoding parameters to customize what's returned when detection is inconclusive or input is empty
  • Add -i/--include-encodings, -x/--exclude-encodings, --no-match-encoding, --empty-input-encoding CLI flags
  • All four parameters are keyword-only with backward-compatible defaults
  • Empty include_encodings=[] raises ValueError (almost certainly a user error)
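
To make the validation rule concrete, here is a minimal sketch of how encoding names might be normalized before filtering. It is not chardet's implementation: the function name `normalize_encodings_sketch` is hypothetical, and resolving aliases through the standard library's `codecs.lookup` is an assumption made only so the example runs without chardet installed.

```python
import codecs


def normalize_encodings_sketch(names):
    """Hypothetical stand-in: resolve aliases and reject bad input.

    Returns None when no filter was requested, otherwise a frozenset of
    canonical codec names as reported by Python's codec registry.
    """
    if names is None:
        return None  # no filter requested; the caller keeps its default behavior
    names = list(names)
    if not names:
        # An empty list would filter out everything, which is almost
        # certainly a caller mistake, so fail loudly.
        raise ValueError("encoding list must not be empty")
    normalized = set()
    for name in names:
        try:
            normalized.add(codecs.lookup(name).name)
        except LookupError:
            raise ValueError(f"unknown encoding: {name!r}") from None
    return frozenset(normalized)
```

In this sketch, aliases such as `utf8` and `UTF-8` collapse to a single entry, while an unknown name or an empty list raises `ValueError`.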

Closes #301

Details

The candidate set is built once from get_candidates() (which incorporates encoding_era + include_encodings + exclude_encodings) and used to gate all pipeline stages uniformly — both early-exit stages (BOM, UTF-8, ASCII, markup charset, escape sequences) and the statistical scoring pipeline. This means encoding_era now gates early-exit stages too, which is more consistent than the previous behavior (where only escape sequences were era-gated).
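
The gating idea can be sketched in a few lines. Everything here is illustrative: the era table, the `_sketch` function names, and the "first codec that decodes" stand-in for statistical scoring are assumptions, not chardet's code. Only the overall shape (build the allowed set once, then check every stage against it) follows the description above.

```python
from functools import lru_cache

# Hypothetical era table; chardet's real eras and their members differ.
ERA_ENCODINGS = {
    "modern": frozenset({"ascii", "utf-8", "windows-1252"}),
    "legacy": frozenset({"ascii", "iso-8859-1", "windows-1250"}),
}


@lru_cache(maxsize=256)
def get_candidates_sketch(era, include=None, exclude=None):
    """Intersect the era with include, then subtract exclude.

    include and exclude must be hashable (frozenset) so the result can be cached.
    """
    allowed = ERA_ENCODINGS[era]
    if include is not None:
        allowed = allowed & include
    if exclude is not None:
        allowed = allowed - exclude
    return allowed


def detect_sketch(data, era="modern", include=None, exclude=None):
    """Toy pipeline in which every stage consults the same allowed set."""
    allowed = get_candidates_sketch(era, include, exclude)
    # Early-exit stage: a UTF-8 BOM only wins if utf-8 is allowed.
    if data.startswith(b"\xef\xbb\xbf") and "utf-8" in allowed:
        return "utf-8"
    # Early-exit stage: pure ASCII only wins if ascii is allowed.
    if data.isascii() and "ascii" in allowed:
        return "ascii"
    # Stand-in for statistical scoring: first allowed codec that decodes.
    for name in sorted(allowed):
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None  # nothing allowed matched
```

With this toy, passing the same name in both `include` and `exclude` leaves an empty allowed set, so `detect_sketch` returns `None`. The same happens when `include` names an encoding outside the chosen era.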

Binary detection (encoding=None) is intentionally exempt from filtering. When a fallback or empty-input encoding is itself filtered out, a UserWarning is emitted and encoding=None is returned.
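
A rough sketch of that warn-and-return-None rule, assuming a hypothetical helper named `fallback_or_none_sketch` (the PR's own helper is `_make_fallback_or_none`, whose real signature is not shown here):

```python
import warnings


def fallback_or_none_sketch(encoding, allowed):
    """Return the fallback encoding if the filters permit it, else warn.

    encoding=None (the binary result) passes through untouched, which
    mirrors the exemption of binary detection from filtering.
    """
    if encoding is None or encoding in allowed:
        return encoding
    warnings.warn(
        f"{encoding!r} is excluded by the active encoding filters; "
        "returning encoding=None instead",
        UserWarning,
        stacklevel=2,  # assumption: point the warning at the direct caller
    )
    return None
```

The `stacklevel` value is a guess for a one-level call chain; a real pipeline with more frames between the user and the warning would need a larger number.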

Test plan

  • normalize_encodings() — valid names, aliases, unknown names raise ValueError, empty raises ValueError
  • get_candidates() — include-only, exclude-only, both, era intersection, empty result
  • detect() / detect_all() integration — include narrows, exclude removes, BOM/ASCII/UTF-8 gating, overlap returns None
  • Fallback/empty customization — custom values, filtered fallback emits warning
  • Binary detection unaffected by filters
  • UniversalDetector — all four params work through streaming interface
  • CLI — -i, -x, --no-match-encoding, --empty-input-encoding, comma-separated parsing, invalid encoding error
  • Performance — no regression (8.43s vs 8.42s pure-vs-pure against 7.1.0, benchmark threshold tests pass)
  • Accuracy — 98.2% encoding, 95.2% language (unchanged from main)
  • Full suite: 8070 passed, 76 xfailed
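
For the CLI items in the test plan, comma-separated parsing might look roughly like the following. The flag names come from the summary above. The parser name, the `comma_separated` helper, the `None` defaults, and the whitespace handling are assumptions rather than the project's actual CLI code.

```python
import argparse


def comma_separated(value):
    """Split 'a, b,c' into ['a', 'b', 'c'], rejecting an empty list."""
    names = [part.strip() for part in value.split(",") if part.strip()]
    if not names:
        raise argparse.ArgumentTypeError("expected at least one encoding name")
    return names


def build_parser():
    # Hypothetical program name; only the flag spellings are from the PR.
    parser = argparse.ArgumentParser(prog="chardetect-sketch")
    parser.add_argument("-i", "--include-encodings", type=comma_separated, default=None)
    parser.add_argument("-x", "--exclude-encodings", type=comma_separated, default=None)
    parser.add_argument("--no-match-encoding", default=None)
    parser.add_argument("--empty-input-encoding", default=None)
    parser.add_argument("files", nargs="*")
    return parser
```

Calling `build_parser().parse_args(["-i", "utf-8, windows-1252", "data.txt"])` yields an `include_encodings` list with two entries. Validation of the names themselves would happen in a later step.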

🤖 Generated with Claude Code

@dan-blanchard dan-blanchard changed the title feat: add include/exclude encoding filters Add include/exclude encoding filters Mar 15, 2026
@codecov

codecov bot commented Mar 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (63e90b5) to head (a6b99ba).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #343   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           23        23           
  Lines         1390      1436   +46     
=========================================
+ Hits          1390      1436   +46     

dan-blanchard and others added 21 commits March 14, 2026 23:35
Addresses #301 — adds include_encodings and
exclude_encodings parameters to the public API for fine-grained
control over which encodings are considered during detection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- include_encodings now gates early-exit stages too (BOM, UTF-8, etc.)
- Add fallback_encoding and empty_encoding customizable parameters
- Filtered fallback/empty results emit UserWarning and return None
- Binary detection explicitly unaffected by encoding filters
- Clarify encoding_era + include_encodings intersection semantics
- Add CLI flags for fallback-encoding and empty-encoding

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Clarify prefer_superset/compat_names interaction with filters
- Note cache growth characteristics for get_candidates()
- Document escape stage era gate preserved alongside new filter
- Note era gating of other early-exit stages is out of scope
- Add post-processing invariant (never introduces new encodings)
- Clarify overlap behavior path through fallback

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
9 tasks across 4 chunks: registry layer, pipeline orchestrator,
public API, and CLI. TDD approach with tests before implementation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ator

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t_all()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- UniversalDetector: unknown exclude/fallback/empty encoding raises
- detect_all: custom empty_encoding and fallback_encoding
- CLI: invalid encoding name in -i reports error

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move all `from chardet.detector import UniversalDetector` imports
from inside test functions to the top-level imports.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add validate_encoding() helper in registry.py, used by
  normalize_encodings() and all three entry points
- Add filtered_out() and fallback_or_none() closures in
  _run_pipeline_core() to avoid threading include/exclude through
  every call site
- Fix _make_fallback_or_none() docstring

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…acklevel)

- Rename fallback_encoding -> no_match_encoding and empty_encoding ->
  empty_input_encoding across the entire codebase (API, CLI, detector,
  orchestrator, tests)
- Remove fallback_or_none closure in _run_pipeline_core; call
  _make_fallback_or_none directly and update stacklevel from 4 to 5
- Swap @functools.cache to @functools.lru_cache(maxsize=256) on
  get_candidates()
- Raise ValueError for empty include_encodings in normalize_encodings()
- Rename abbreviated variables in detect()/detect_all(): inc->include,
  exc->exclude, fb->no_match, em->empty
- Mark test_detect_default_params_no_regression as @pytest.mark.benchmark

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Explicit None takes the same code path as implicit None — the test
was measuring nothing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…k filter

Build the allowed encoding set once from get_candidates() (which
incorporates era + include + exclude) and use it to gate all
early-exit stages. This is simpler and more consistent — era now
gates early exits too, not just escape sequences.

Removes _is_filtered_out() helper (no longer needed), removes
REGISTRY import from orchestrator, and simplifies _make_fallback_or_none
to take the allowed set directly.

One accuracy test newly fails with era filtering (markup charset
declares windows-1250 but era is LEGACY_ISO), one previously-failing
test now passes (era gating helps the Finnish UTF-8/Latin-1 case).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dan-blanchard dan-blanchard merged commit e1428c3 into main Mar 15, 2026
17 checks passed
@dan-blanchard dan-blanchard deleted the filter-by-lister branch March 15, 2026 03:59

Development

Successfully merging this pull request may close these issues.

[Request] Add option to Exclude encodings(or specify list of encodings)