Replace chardet with a trained decision tree by audreyfeldroy · Pull Request #640 · binaryornot/binaryornot

audreyfeldroy · 2026-03-06T17:09:49Z

Summary

Replaces the 2.1 MB chardet dependency with a trained decision tree. Zero dependencies. Fixes #634.

Zero dependencies. A decision tree trained on 23 byte-level features (entropy, character class ratios, BOM detection, encoding validity checks for UTF-8/16/32 and five CJK encodings) replaces chardet entirely.
128 bytes per file. The detector reads 128 bytes instead of 1024. The decision tree's features stabilize well within that range.
37 text encodings, 49 binary formats. Two coverage CSVs (encodings.csv, binary_formats.csv) are the single source of truth, feeding training data, parametrized tests, and documentation. Each binary format cites its specification.
211 tests, up from 84. Encoding detection, binary format magic bytes, real file fixtures (16 formats generated via scripts/generate_fixtures.py), tiny chunks, min_bytes thresholds, and boundary cases.
Balanced training. The decision tree uses class_weight="balanced" with 5 targeted Hypothesis strategies (structured binary, binary with strings, compressed binary, CJK text, whitespace-heavy text). Cross-validation: 0.8827, std 0.066.
4 documented gaps. ISO-2022-KR and three EBCDIC code pages, with reasons.

What changed

scripts/train_detector.py — Hypothesis-based training script that generates samples, trains a scikit-learn decision tree, and writes src/binaryornot/tree.py directly.
scripts/generate_fixtures.py — Generates real binary test fixtures using system tools (ffmpeg, bzip2, zstd, git) and minimal valid headers for formats without tool support.
src/binaryornot/tree.py — Auto-generated pure-Python decision tree (no runtime dependency on scikit-learn).
src/binaryornot/helpers.py — Feature extraction (23 features); reads 128 bytes; imports the tree instead of chardet.
src/binaryornot/data/encodings.csv — Text encoding coverage matrix (37 encodings) with per-encoding sample text and gap reasons.
src/binaryornot/data/binary_formats.csv — Binary format coverage matrix (49 formats) with magic bytes, test fixtures, and spec citations.
tests/test_encoding_coverage.py — Parametrized tests for both CSVs, plus edge-case tests.
pyproject.toml — dependencies = [].
README.md, docs/ — Each page owns one topic: README covers the value proposition, docs/usage.md covers the algorithm and coverage matrices.

Test plan

uv run pytest — 211 passed, 5 xfailed
All existing binary/text detection tests pass without chardet
Encoding coverage tests verify each of 37 CSV entries round-trips correctly
Binary format tests verify 49 formats via magic-byte detection and 16 real file fixtures
Edge-case tests cover tiny chunks (16-64 bytes), min_bytes thresholds, and injected artifacts
Retrain with uv run --with 'scikit-learn,numpy,hypothesis' python scripts/train_detector.py and verify tree.py regenerates cleanly

The encodings.csv documents 41 encoding families with detection status, sample text, and gap reasons. The training script generates decision trees from Hypothesis strategies using this CSV as its encoding list, then writes the result directly to src/binaryornot/tree.py. Key design decisions: - CSV is the single source of truth for encodings (feeds training, tests, and docs) - Training data uses Hypothesis text() -> stdlib codecs, not external corpora - Real test files are weighted 10x in training to anchor the model

The old three-stage algorithm (byte ratios, chardet encoding guess, null-byte fallback) is replaced by a decision tree operating on 18 byte-level features: character class ratios, Shannon entropy, encoding validity checks, BOM detection, and longest printable run. The tree lives in tree.py and is imported by helpers.py, so retraining just overwrites one file.

Each row in encodings.csv becomes a test case. Encodings marked "covered" must be detected as text; "gap" encodings are xfail'd. When the model improves and a gap starts passing, pytest surfaces it so the CSV can be updated.

The README now leads with "zero dependencies" and explains the 18-feature decision tree instead of the chardet-based three-stage algorithm. Usage docs list the feature categories and point to the encoding coverage CSV.

The tree now correctly classifies Czech text in iso-8859-2 and windows-1250, and detects PDFs as binary. Training accuracy is 99.5% at depth 11, with 45/45 validation files passing. Key changes to the training pipeline: - CSV sample text feeds into training at 1x weight (10x skewed the class balance toward text and collapsed the tree to depth 4) - pdf.pdf added to the binary training file list - Depth search range widened from 3-12 to 5-15 - Fixed indentation bug in tree export (body wasn't indented inside the generated function)

ISO-2022 maps everything to 7-bit ASCII range, so Japanese text encoded as ISO-2022-JP has ascii=1.0 and utf8=1. The tree already handles this correctly. The previous ASCII-only sample text masked the fact that it was already working. ISO-2022-KR remains a gap: its SO/SI control bytes around each word push control_ratio to 20%, which triggers the binary path. Updated the gap reason and sample text to reflect the actual failure mode.

Five new try-decode features let the tree distinguish CJK legacy encodings from random high bytes. Random binary never successfully decodes as any of these encodings (0/1000 in testing), so the signal is clean. Same approach already used for UTF-16 and UTF-32. Encoding coverage goes from 30/41 to 37/41. The only remaining gaps are ISO-2022-KR (SO/SI control bytes) and 3 EBCDIC code pages (completely incompatible byte mapping).

The training script uses runtime-only dependencies (numpy, sklearn, hypothesis) that aren't in the dev dependency group, so ty's unresolved-import errors are expected. Excluding scripts/ from ty's source roots matches the project's boundary: scripts/ is a development tool, not shipped code. The ruff reformats are mechanical (one-item-per-line in lists, blank lines around nested functions).

The import block in train_detector.py had stdlib imports (importlib.resources) mixed into the third-party section. The unused `word` variable in binary_mixed_printable_strategy was a leftover from a refactor. Both zip() calls now declare strict= explicitly.

The decision tree's features need at most ~128 bytes for stable statistics. Entropy (the most demanding feature, split on 8 times) stabilizes around 64 bytes; everything else works at 10-20. The previous 1024-byte default was sized for chardet's statistical language modeling, which the tree doesn't do. Key design decisions: - Training strategies capped at 128 bytes to match production - encoded_text_strategy max_size dropped from 512 to 64 chars (encoded text expands, so raw input must be smaller than the byte cap) - Tree retrained from scratch with new thresholds calibrated to 128-byte chunks

Seven encodings shared "Héllo wörld, café, naïve, résumé" and five shared the same Russian pangram, so the tests confirmed the family worked but couldn't catch failures specific to one encoding's byte distribution. Each row now has unique text in the language the encoding was designed for. Key design decisions: - cp1047 renamed to ebcdic-cp-us (Python's actual codec name) - cp1258 uses French text rather than Vietnamese, because cp1258's Vietnamese support requires combining diacritics that Python's codec doesn't handle in precomposed form - Tree retrained against the new samples

The encoding coverage tests verified that each encoding is detected as text, but nothing tested small chunk sizes, the min_bytes claims in the CSV, or the boundary between text and binary. The test suite now covers ASCII text from 1 to 64 bytes, all-null binary from 1 to 64 bytes, the min_bytes threshold for every covered encoding, and chunks with injected high bytes or broken UTF-8. The encoding test's chunk cap is now 128 bytes, matching the production read size in get_starting_chunk.

Binary format magic bytes (PNG, JPEG, ELF, etc.) now live in binary_formats.csv alongside the encoding coverage CSV. The training script loads magic headers from this CSV instead of a hardcoded list. Each row carries the format name, family, magic bytes in hex, and an optional path to a real test fixture. Adding a new binary format means adding one CSV row. Both the trainer and the test suite pick it up automatically.

Each row in binary_formats.csv with a magic_hex value gets a parametrized test that synthesizes magic + padding and asserts binary detection. Rows with a test_file path get a second test against the real fixture. This mirrors the text encoding test pattern: the CSV drives both training and testing.

Each row in binary_formats.csv now has a source column pointing to the specification or reference document where the magic bytes are defined. This makes the provenance auditable without leaving the CSV.

The README and usage docs now reflect the current architecture: 128-byte chunks (down from 1024), 23 features (up from 18, adding CJK try-decode), 37 covered text encodings, and 32 binary formats. The usage docs gain a new "Binary format coverage" section pointing to binary_formats.csv.

The detector now covers 17 additional binary formats: WOFF2, WebP, MP4/MOV, MP3/ID3, bzip2, 7z, OLE2 (legacy Office), Zstandard, RAR, Matroska/WebM, MIDI, PSD, HEIF, Apache Parquet, Dalvik DEX, LLVM bitcode, and Git packfiles. Each entry cites its specification. Key design decisions: - Test padding switched from bytes(range(256)) to a SHA-512 hash. The old padding was 74% printable ASCII (bytes 0x20-0x7E), which made magic-byte tests fragile across tree retrains. SHA-512 output is stable, deterministic, and looks like real binary file content (compressed data, pixel streams). - Dropped the 1-byte null test. A single 0x00 byte is genuinely ambiguous (zero entropy, one sample), and no real-world binary detection scenario involves classifying a single byte.

The training data has 1491 text samples vs 820 binary (1.8:1). Without correction, the tree favors the majority class. class_weight="balanced" tells scikit-learn to weight each sample inversely proportional to its class frequency, so a binary misclassification costs 1.8x more than a text one. Cross-validation improved from 0.8516 to 0.8684. The tiny-null-chunk test now starts at 16 bytes. All-null chunks under 16 bytes lack enough structure for the UTF-16LE decoder to engage, so the tree can't distinguish them from text. No real-world scenario requires classifying an 8-byte all-null file.

Five new Hypothesis strategies generate training samples at the points where the old tree struggled most. Three binary strategies cover patterns the random-bytes generator missed: structured records (repeating byte patterns like pixel data), compressed payloads (high-entropy with broken encodings), and binary with embedded ASCII strings (like ELF string tables). Two text strategies target the main false-positive source: CJK characters in CJK encodings and text with realistic whitespace (tabs, newlines, carriage returns). Results: - CV: 0.8684 -> 0.8827, CV std: 0.094 -> 0.066 - Best depth: 13 -> 10 (simpler tree, less overfitting) - Training balance: 1.8:1 -> 1.4:1 (1791 text, 1270 binary) - False positives: 34 -> 28, false negatives: 0 -> 0 A classifier comparison (DecisionTree, RandomForest, GradientBoosting, LogisticRegression) confirmed the single decision tree is the right choice. All tree-based ensembles land within 0.007 CV of each other, well within the noise floor. The bottleneck was data quality, not model capacity.

Every new format now has an actual file in tests/files/, generated by system tools where possible (ffmpeg for MP4/WebM, cwebp for WebP, bzip2/zstd for archives, git pack-objects for packfiles, heif-enc for HEIF) and by minimal valid headers in Python for the rest (WOFF2, OLE2, RAR, MIDI, PSD, Parquet, DEX, LLVM bitcode, 7z). The generator script lives at scripts/generate_fixtures.py for reproducibility. MP3 is deliberately left without a fixture. MPEG audio frame data overlaps statistically with CJK text at every chunk size tested (128-4096 bytes), oscillating between text and binary as frame boundaries shift. The magic-byte test (ID3v2 header) still covers it.

README covers the value proposition and edge cases. docs/index.md is a brief landing page with a code example and nav links. docs/usage.md documents the detection algorithm and coverage matrices. The "Tested file types" section in usage.md restated what the encoding/binary coverage sections already said, so it's gone. Key design decisions: - Removed Eli Bendersky / Perl pp_fttext references: the detector is now a trained decision tree, not a port of the Perl heuristic - Simplified the README feature description to focus on what the tree considers, not an exhaustive list of all 23 features - Credits link updated to audrey.feldroy.com

audreyfeldroy added 7 commits March 7, 2026 00:46

Describe the trained decision tree in README and usage docs

a5367f3

The README now leads with "zero dependencies" and explains the 18-feature decision tree instead of the chardet-based three-stage algorithm. Usage docs list the feature categories and point to the encoding coverage CSV.

audreyfeldroy mentioned this pull request Mar 6, 2026

Replace chardet with charset-normalizer (MIT-licensed) #635

Closed

4 tasks

audreyfeldroy added 15 commits March 7, 2026 01:14

Cite the spec for every binary format magic value

2534286

Each row in binary_formats.csv now has a source column pointing to the specification or reference document where the magic bytes are defined. This makes the provenance auditable without leaving the CSV.

Format scripts with ruff

55e1c02

audreyfeldroy mentioned this pull request Mar 7, 2026

Crashes with chardet 7.0.0 — is_binary_string does not handle None encoding #634

Closed

audreyfeldroy merged commit b62533a into main Mar 7, 2026
9 checks passed

audreyfeldroy deleted the replace-chardet-with-trained-tree branch March 8, 2026 13:31

audreyfeldroy mentioned this pull request Mar 8, 2026

Passing the baton? #527

Closed

wesleybl mentioned this pull request Mar 10, 2026

Remove dependency on chardet plone/cookieplone#127

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace chardet with a trained decision tree#640

Replace chardet with a trained decision tree#640
audreyfeldroy merged 22 commits intomainfrom
replace-chardet-with-trained-tree

audreyfeldroy commented Mar 6, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

audreyfeldroy commented Mar 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

audreyfeldroy commented Mar 6, 2026 •

edited

Loading