Add include/exclude encoding filters #343
Merged
dan-blanchard merged 21 commits into main on Mar 15, 2026
Codecov Report
✅ All modified and coverable lines are covered by tests.

| | main | #343 | +/- |
|---|---|---|---|
| Coverage | 100.00% | 100.00% | |
| Files | 23 | 23 | |
| Lines | 1390 | 1436 | +46 |
| Hits | 1390 | 1436 | +46 |
Addresses #301 — adds include_encodings and exclude_encodings parameters to the public API for fine-grained control over which encodings are considered during detection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- include_encodings now gates early-exit stages too (BOM, UTF-8, etc.)
- Add fallback_encoding and empty_encoding customizable parameters
- Filtered fallback/empty results emit UserWarning and return None
- Binary detection explicitly unaffected by encoding filters
- Clarify encoding_era + include_encodings intersection semantics
- Add CLI flags for fallback-encoding and empty-encoding

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Clarify prefer_superset/compat_names interaction with filters
- Note cache growth characteristics for get_candidates()
- Document escape stage era gate preserved alongside new filter
- Note era gating of other early-exit stages is out of scope
- Add post-processing invariant (never introduces new encodings)
- Clarify overlap behavior path through fallback

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
9 tasks across 4 chunks: registry layer, pipeline orchestrator, public API, and CLI. TDD approach with tests before implementation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ator Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t_all() Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- UniversalDetector: unknown exclude/fallback/empty encoding raises
- detect_all: custom empty_encoding and fallback_encoding
- CLI: invalid encoding name in -i reports error

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move all `from chardet.detector import UniversalDetector` imports from inside test functions to the top-level imports. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add validate_encoding() helper in registry.py, used by normalize_encodings() and all three entry points
- Add filtered_out() and fallback_or_none() closures in _run_pipeline_core() to avoid threading include/exclude through every call site
- Fix _make_fallback_or_none() docstring

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…acklevel)
- Rename fallback_encoding -> no_match_encoding and empty_encoding -> empty_input_encoding across the entire codebase (API, CLI, detector, orchestrator, tests)
- Remove fallback_or_none closure in _run_pipeline_core; call _make_fallback_or_none directly and update stacklevel from 4 to 5
- Swap @functools.cache to @functools.lru_cache(maxsize=256) on get_candidates()
- Raise ValueError for empty include_encodings in normalize_encodings()
- Rename abbreviated variables in detect()/detect_all(): inc->include, exc->exclude, fb->no_match, em->empty
- Mark test_detect_default_params_no_regression as @pytest.mark.benchmark

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Explicit None takes the same code path as implicit None — the test was measuring nothing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…k filter Build the allowed encoding set once from get_candidates() (which incorporates era + include + exclude) and use it to gate all early-exit stages. This is simpler and more consistent — era now gates early exits too, not just escape sequences. Removes _is_filtered_out() helper (no longer needed), removes REGISTRY import from orchestrator, and simplifies _make_fallback_or_none to take the allowed set directly. One accuracy test newly fails with era filtering (markup charset declares windows-1250 but era is LEGACY_ISO), one previously-failing test now passes (era gating helps the Finnish UTF-8/Latin-1 case). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from b74ba98 to a6b99ba
Summary
- Add `include_encodings` and `exclude_encodings` parameters to `detect()`, `detect_all()`, and `UniversalDetector` for fine-grained control over which encodings are considered during detection
- Add `no_match_encoding` and `empty_input_encoding` parameters to customize what's returned when detection is inconclusive or input is empty
- Add `-i`/`--include-encodings`, `-x`/`--exclude-encodings`, `--no-match-encoding`, `--empty-input-encoding` CLI flags
- `include_encodings=[]` raises `ValueError` (almost certainly a user error)

Closes #301
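The filter semantics summarized above can be illustrated with a small standalone sketch. It does not import chardet and is not the code from this PR: the function bodies, the `allow_empty` keyword, the `era_encodings` argument, and the use of `codecs.lookup` for alias resolution are all assumptions made for illustration. Only the names `normalize_encodings()` and `get_candidates()`, the `ValueError` behavior, and the `lru_cache(maxsize=256)` decorator come from the PR text.

```python
import codecs
from functools import lru_cache


def normalize_encodings(names, *, allow_empty=True):
    """Illustrative stand-in: map names and aliases to canonical codec names.

    `allow_empty` is a hypothetical switch; pass False to model the
    "empty include_encodings is a user error" rule.
    """
    if names is None:
        return None  # None means "no filter supplied"
    names = list(names)
    if not names and not allow_empty:
        raise ValueError("include_encodings must not be empty")
    canonical = set()
    for name in names:
        try:
            # Assumption: Python's codec registry stands in for chardet's own.
            canonical.add(codecs.lookup(name).name)
        except LookupError:
            raise ValueError(f"unknown encoding name: {name!r}") from None
    return frozenset(canonical)


@lru_cache(maxsize=256)
def get_candidates(era_encodings, include, exclude):
    """Intersect the era's encodings with include, then subtract exclude.

    All arguments are frozensets (or None) so the call is hashable and
    can be cached.
    """
    allowed = set(era_encodings)
    if include is not None:
        allowed &= include
    if exclude is not None:
        allowed -= exclude
    return frozenset(allowed)
```

Because the arguments are frozensets, repeated calls with the same filters hit the cache, and an include list that shares nothing with the era simply yields an empty candidate set.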
Details
The candidate set is built once from `get_candidates()` (which incorporates `encoding_era` + `include_encodings` + `exclude_encodings`) and used to gate all pipeline stages uniformly: both the early-exit stages (BOM, UTF-8, ASCII, markup charset, escape sequences) and the statistical scoring pipeline. This means `encoding_era` now gates early-exit stages too, which is more consistent than the previous behavior (where only escape sequences were era-gated).

Binary detection (`encoding=None`) is intentionally exempt from filtering. When a fallback or empty-input encoding is itself filtered out, a `UserWarning` is emitted and `encoding=None` is returned.

Test plan
- `normalize_encodings()`: valid names, aliases, unknown names raise `ValueError`, empty raises `ValueError`
- `get_candidates()`: include-only, exclude-only, both, era intersection, empty result
- `detect()`/`detect_all()` integration: include narrows, exclude removes, BOM/ASCII/UTF-8 gating, overlap returns None
- `UniversalDetector`: all four params work through streaming interface
- CLI: `-i`, `-x`, `--no-match-encoding`, `--empty-input-encoding`, comma-separated parsing, invalid encoding error

🤖 Generated with Claude Code
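The gating behavior described under Details can be sketched as a toy pipeline. This is a simplified model under stated assumptions, not the orchestrator from this PR: `detect_sketch`, its two early-exit stages, the default values `"cp1252"` and `"utf-8"`, and the bare-string return value are hypothetical. The statistical scoring stage and the binary-detection exemption are not modeled.

```python
import codecs
import warnings


def fallback_or_none(encoding, allowed, param_name):
    """Return the fallback encoding only if the filters allow it."""
    if encoding in allowed:
        return encoding
    # A filtered-out fallback yields a warning and None, per the PR text.
    warnings.warn(
        f"{param_name}={encoding!r} is excluded by the encoding filters; "
        "returning None",
        UserWarning,
        stacklevel=2,
    )
    return None


def detect_sketch(data, allowed,
                  no_match_encoding="cp1252",      # hypothetical default
                  empty_input_encoding="utf-8"):   # hypothetical default
    """Toy pipeline: every stage consults the same precomputed `allowed` set."""
    if not data:
        return fallback_or_none(empty_input_encoding, allowed,
                                "empty_input_encoding")

    # Early-exit stage 1: BOM, taken only when its encoding is allowed.
    if data.startswith(codecs.BOM_UTF8) and "utf-8-sig" in allowed:
        return "utf-8-sig"

    # Early-exit stage 2: strict UTF-8 validity, gated the same way.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        pass
    else:
        if "utf-8" in allowed:
            return "utf-8"

    # Nothing matched: fall back, unless the fallback is itself filtered out.
    return fallback_or_none(no_match_encoding, allowed, "no_match_encoding")
```

For instance, `detect_sketch(b"\xff\xfe\xfa", frozenset({"utf-8"}))` finds no allowed match, and because the `"cp1252"` fallback is outside the allowed set it emits a `UserWarning` and returns `None`.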