Skip to content

Replace chardet with a trained decision tree#640

Merged
audreyfeldroy merged 22 commits intomainfrom
replace-chardet-with-trained-tree
Mar 7, 2026
Merged

Replace chardet with a trained decision tree#640
audreyfeldroy merged 22 commits intomainfrom
replace-chardet-with-trained-tree

Conversation

@audreyfeldroy
Copy link
Copy Markdown
Collaborator

@audreyfeldroy audreyfeldroy commented Mar 6, 2026

Summary

Replaces the 2.1 MB chardet dependency with a trained decision tree. Zero dependencies. Fixes #634.

  • Zero dependencies. A decision tree trained on 23 byte-level features (entropy, character class ratios, BOM detection, encoding validity checks for UTF-8/16/32 and five CJK encodings) replaces chardet entirely.
  • 128 bytes per file. The detector reads 128 bytes instead of 1024. The decision tree's features stabilize well within that range.
  • 37 text encodings, 49 binary formats. Two coverage CSVs (encodings.csv, binary_formats.csv) are the single source of truth, feeding training data, parametrized tests, and documentation. Each binary format cites its specification.
  • 211 tests, up from 84. Encoding detection, binary format magic bytes, real file fixtures (16 formats generated via scripts/generate_fixtures.py), tiny chunks, min_bytes thresholds, and boundary cases.
  • Balanced training. The decision tree uses class_weight="balanced" with 5 targeted Hypothesis strategies (structured binary, binary with strings, compressed binary, CJK text, whitespace-heavy text). Cross-validation: 0.8827, std 0.066.
  • 4 documented gaps. ISO-2022-KR and three EBCDIC code pages, with reasons.

What changed

  1. scripts/train_detector.py — Hypothesis-based training script that generates samples, trains a scikit-learn decision tree, and writes src/binaryornot/tree.py directly.
  2. scripts/generate_fixtures.py — Generates real binary test fixtures using system tools (ffmpeg, bzip2, zstd, git) and minimal valid headers for formats without tool support.
  3. src/binaryornot/tree.py — Auto-generated pure-Python decision tree (no runtime dependency on scikit-learn).
  4. src/binaryornot/helpers.py — Feature extraction (23 features); reads 128 bytes; imports the tree instead of chardet.
  5. src/binaryornot/data/encodings.csv — Text encoding coverage matrix (37 encodings) with per-encoding sample text and gap reasons.
  6. src/binaryornot/data/binary_formats.csv — Binary format coverage matrix (49 formats) with magic bytes, test fixtures, and spec citations.
  7. tests/test_encoding_coverage.py — Parametrized tests for both CSVs, plus edge-case tests.
  8. pyproject.tomldependencies = [].
  9. README.md, docs/ — Each page owns one topic: README covers the value proposition, docs/usage.md covers the algorithm and coverage matrices.

Test plan

  • uv run pytest — 211 passed, 5 xfailed
  • All existing binary/text detection tests pass without chardet
  • Encoding coverage tests verify each of 37 CSV entries round-trips correctly
  • Binary format tests verify 49 formats via magic-byte detection and 16 real file fixtures
  • Edge-case tests cover tiny chunks (16-64 bytes), min_bytes thresholds, and injected artifacts
  • Retrain with uv run --with 'scikit-learn,numpy,hypothesis' python scripts/train_detector.py and verify tree.py regenerates cleanly

The encodings.csv documents 41 encoding families with detection status,
sample text, and gap reasons. The training script generates decision
trees from Hypothesis strategies using this CSV as its encoding list,
then writes the result directly to src/binaryornot/tree.py.

Key design decisions:
- CSV is the single source of truth for encodings (feeds training,
  tests, and docs)
- Training data uses Hypothesis text() -> stdlib codecs, not external
  corpora
- Real test files are weighted 10x in training to anchor the model
The old three-stage algorithm (byte ratios, chardet encoding guess,
null-byte fallback) is replaced by a decision tree operating on 18
byte-level features: character class ratios, Shannon entropy, encoding
validity checks, BOM detection, and longest printable run. The tree
lives in tree.py and is imported by helpers.py, so retraining just
overwrites one file.
Each row in encodings.csv becomes a test case. Encodings marked
"covered" must be detected as text; "gap" encodings are xfail'd.
When the model improves and a gap starts passing, pytest surfaces it
so the CSV can be updated.
The README now leads with "zero dependencies" and explains the
18-feature decision tree instead of the chardet-based three-stage
algorithm. Usage docs list the feature categories and point to the
encoding coverage CSV.
The tree now correctly classifies Czech text in iso-8859-2 and
windows-1250, and detects PDFs as binary. Training accuracy is 99.5%
at depth 11, with 45/45 validation files passing.

Key changes to the training pipeline:
- CSV sample text feeds into training at 1x weight (10x skewed the
  class balance toward text and collapsed the tree to depth 4)
- pdf.pdf added to the binary training file list
- Depth search range widened from 3-12 to 5-15
- Fixed indentation bug in tree export (body wasn't indented inside
  the generated function)
ISO-2022 maps everything to 7-bit ASCII range, so Japanese text
encoded as ISO-2022-JP has ascii=1.0 and utf8=1. The tree already
handles this correctly. The previous ASCII-only sample text masked
the fact that it was already working.

ISO-2022-KR remains a gap: its SO/SI control bytes around each
word push control_ratio to 20%, which triggers the binary path.
Updated the gap reason and sample text to reflect the actual failure
mode.
Five new try-decode features let the tree distinguish CJK legacy
encodings from random high bytes. Random binary never successfully
decodes as any of these encodings (0/1000 in testing), so the
signal is clean. Same approach already used for UTF-16 and UTF-32.

Encoding coverage goes from 30/41 to 37/41. The only remaining
gaps are ISO-2022-KR (SO/SI control bytes) and 3 EBCDIC code pages
(completely incompatible byte mapping).
The training script uses runtime-only dependencies (numpy, sklearn,
hypothesis) that aren't in the dev dependency group, so ty's
unresolved-import errors are expected. Excluding scripts/ from ty's
source roots matches the project's boundary: scripts/ is a development
tool, not shipped code.

The ruff reformats are mechanical (one-item-per-line in lists, blank
lines around nested functions).
The import block in train_detector.py had stdlib imports (importlib.resources)
mixed into the third-party section. The unused `word` variable in
binary_mixed_printable_strategy was a leftover from a refactor. Both zip()
calls now declare strict= explicitly.
The decision tree's features need at most ~128 bytes for stable
statistics. Entropy (the most demanding feature, split on 8 times)
stabilizes around 64 bytes; everything else works at 10-20. The
previous 1024-byte default was sized for chardet's statistical
language modeling, which the tree doesn't do.

Key design decisions:
- Training strategies capped at 128 bytes to match production
- encoded_text_strategy max_size dropped from 512 to 64 chars
  (encoded text expands, so raw input must be smaller than the
  byte cap)
- Tree retrained from scratch with new thresholds calibrated
  to 128-byte chunks
Seven encodings shared "Héllo wörld, café, naïve, résumé" and five
shared the same Russian pangram, so the tests confirmed the family
worked but couldn't catch failures specific to one encoding's byte
distribution. Each row now has unique text in the language the encoding
was designed for.

Key design decisions:
- cp1047 renamed to ebcdic-cp-us (Python's actual codec name)
- cp1258 uses French text rather than Vietnamese, because cp1258's
  Vietnamese support requires combining diacritics that Python's
  codec doesn't handle in precomposed form
- Tree retrained against the new samples
The encoding coverage tests verified that each encoding is detected as
text, but nothing tested small chunk sizes, the min_bytes claims in
the CSV, or the boundary between text and binary. The test suite now
covers ASCII text from 1 to 64 bytes, all-null binary from 1 to 64
bytes, the min_bytes threshold for every covered encoding, and chunks
with injected high bytes or broken UTF-8.

The encoding test's chunk cap is now 128 bytes, matching the production
read size in get_starting_chunk.
Binary format magic bytes (PNG, JPEG, ELF, etc.) now live in
binary_formats.csv alongside the encoding coverage CSV. The training
script loads magic headers from this CSV instead of a hardcoded list.
Each row carries the format name, family, magic bytes in hex, and an
optional path to a real test fixture.

Adding a new binary format means adding one CSV row. Both the trainer
and the test suite pick it up automatically.
Each row in binary_formats.csv with a magic_hex value gets a
parametrized test that synthesizes magic + padding and asserts binary
detection. Rows with a test_file path get a second test against the
real fixture. This mirrors the text encoding test pattern: the CSV
drives both training and testing.
Each row in binary_formats.csv now has a source column pointing to the
specification or reference document where the magic bytes are defined.
This makes the provenance auditable without leaving the CSV.
The README and usage docs now reflect the current architecture: 128-byte
chunks (down from 1024), 23 features (up from 18, adding CJK try-decode),
37 covered text encodings, and 32 binary formats. The usage docs gain a
new "Binary format coverage" section pointing to binary_formats.csv.
The detector now covers 17 additional binary formats: WOFF2, WebP,
MP4/MOV, MP3/ID3, bzip2, 7z, OLE2 (legacy Office), Zstandard, RAR,
Matroska/WebM, MIDI, PSD, HEIF, Apache Parquet, Dalvik DEX, LLVM
bitcode, and Git packfiles. Each entry cites its specification.

Key design decisions:
- Test padding switched from bytes(range(256)) to a SHA-512 hash.
  The old padding was 74% printable ASCII (bytes 0x20-0x7E), which
  made magic-byte tests fragile across tree retrains. SHA-512 output
  is stable, deterministic, and looks like real binary file content
  (compressed data, pixel streams).
- Dropped the 1-byte null test. A single 0x00 byte is genuinely
  ambiguous (zero entropy, one sample), and no real-world binary
  detection scenario involves classifying a single byte.
The training data has 1491 text samples vs 820 binary (1.8:1). Without
correction, the tree favors the majority class. class_weight="balanced"
tells scikit-learn to weight each sample inversely proportional to its
class frequency, so a binary misclassification costs 1.8x more than a
text one. Cross-validation improved from 0.8516 to 0.8684.

The tiny-null-chunk test now starts at 16 bytes. All-null chunks under
16 bytes lack enough structure for the UTF-16LE decoder to engage, so
the tree can't distinguish them from text. No real-world scenario
requires classifying an 8-byte all-null file.
Five new Hypothesis strategies generate training samples at the
points where the old tree struggled most. Three binary strategies
cover patterns the random-bytes generator missed: structured records
(repeating byte patterns like pixel data), compressed payloads
(high-entropy with broken encodings), and binary with embedded ASCII
strings (like ELF string tables). Two text strategies target the main
false-positive source: CJK characters in CJK encodings and text with
realistic whitespace (tabs, newlines, carriage returns).

Results:
- CV: 0.8684 -> 0.8827, CV std: 0.094 -> 0.066
- Best depth: 13 -> 10 (simpler tree, less overfitting)
- Training balance: 1.8:1 -> 1.4:1 (1791 text, 1270 binary)
- False positives: 34 -> 28, false negatives: 0 -> 0

A classifier comparison (DecisionTree, RandomForest, GradientBoosting,
LogisticRegression) confirmed the single decision tree is the right
choice. All tree-based ensembles land within 0.007 CV of each other,
well within the noise floor. The bottleneck was data quality, not
model capacity.
Every new format now has an actual file in tests/files/, generated by
system tools where possible (ffmpeg for MP4/WebM, cwebp for WebP,
bzip2/zstd for archives, git pack-objects for packfiles, heif-enc for
HEIF) and by minimal valid headers in Python for the rest (WOFF2,
OLE2, RAR, MIDI, PSD, Parquet, DEX, LLVM bitcode, 7z).

The generator script lives at scripts/generate_fixtures.py for
reproducibility.

MP3 is deliberately left without a fixture. MPEG audio frame data
overlaps statistically with CJK text at every chunk size tested
(128-4096 bytes), oscillating between text and binary as frame
boundaries shift. The magic-byte test (ID3v2 header) still covers it.
README covers the value proposition and edge cases. docs/index.md is a
brief landing page with a code example and nav links. docs/usage.md
documents the detection algorithm and coverage matrices. The "Tested
file types" section in usage.md restated what the encoding/binary
coverage sections already said, so it's gone.

Key design decisions:
- Removed Eli Bendersky / Perl pp_fttext references: the detector is
  now a trained decision tree, not a port of the Perl heuristic
- Simplified the README feature description to focus on what the tree
  considers, not an exhaustive list of all 23 features
- Credits link updated to audrey.feldroy.com
@audreyfeldroy audreyfeldroy merged commit b62533a into main Mar 7, 2026
9 checks passed
@audreyfeldroy audreyfeldroy deleted the replace-chardet-with-trained-tree branch March 8, 2026 13:31
@audreyfeldroy audreyfeldroy mentioned this pull request Mar 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Crashes with chardet 7.0.0is_binary_string does not handle None encoding

1 participant