Replace chardet with a trained decision tree#640
Merged
audreyfeldroy merged 22 commits intomainfrom Mar 7, 2026
Merged
Conversation
The encodings.csv documents 41 encoding families with detection status, sample text, and gap reasons. The training script generates decision trees from Hypothesis strategies using this CSV as its encoding list, then writes the result directly to src/binaryornot/tree.py. Key design decisions: - CSV is the single source of truth for encodings (feeds training, tests, and docs) - Training data uses Hypothesis text() -> stdlib codecs, not external corpora - Real test files are weighted 10x in training to anchor the model
The old three-stage algorithm (byte ratios, chardet encoding guess, null-byte fallback) is replaced by a decision tree operating on 18 byte-level features: character class ratios, Shannon entropy, encoding validity checks, BOM detection, and longest printable run. The tree lives in tree.py and is imported by helpers.py, so retraining just overwrites one file.
Each row in encodings.csv becomes a test case. Encodings marked "covered" must be detected as text; "gap" encodings are xfail'd. When the model improves and a gap starts passing, pytest surfaces it so the CSV can be updated.
The README now leads with "zero dependencies" and explains the 18-feature decision tree instead of the chardet-based three-stage algorithm. Usage docs list the feature categories and point to the encoding coverage CSV.
The tree now correctly classifies Czech text in iso-8859-2 and windows-1250, and detects PDFs as binary. Training accuracy is 99.5% at depth 11, with 45/45 validation files passing. Key changes to the training pipeline: - CSV sample text feeds into training at 1x weight (10x skewed the class balance toward text and collapsed the tree to depth 4) - pdf.pdf added to the binary training file list - Depth search range widened from 3-12 to 5-15 - Fixed indentation bug in tree export (body wasn't indented inside the generated function)
ISO-2022 maps everything to 7-bit ASCII range, so Japanese text encoded as ISO-2022-JP has ascii=1.0 and utf8=1. The tree already handles this correctly. The previous ASCII-only sample text masked the fact that it was already working. ISO-2022-KR remains a gap: its SO/SI control bytes around each word push control_ratio to 20%, which triggers the binary path. Updated the gap reason and sample text to reflect the actual failure mode.
Five new try-decode features let the tree distinguish CJK legacy encodings from random high bytes. Random binary never successfully decodes as any of these encodings (0/1000 in testing), so the signal is clean. Same approach already used for UTF-16 and UTF-32. Encoding coverage goes from 30/41 to 37/41. The only remaining gaps are ISO-2022-KR (SO/SI control bytes) and 3 EBCDIC code pages (completely incompatible byte mapping).
4 tasks
The training script uses runtime-only dependencies (numpy, sklearn, hypothesis) that aren't in the dev dependency group, so ty's unresolved-import errors are expected. Excluding scripts/ from ty's source roots matches the project's boundary: scripts/ is a development tool, not shipped code. The ruff reformats are mechanical (one-item-per-line in lists, blank lines around nested functions).
The import block in train_detector.py had stdlib imports (importlib.resources) mixed into the third-party section. The unused `word` variable in binary_mixed_printable_strategy was a leftover from a refactor. Both zip() calls now declare strict= explicitly.
The decision tree's features need at most ~128 bytes for stable statistics. Entropy (the most demanding feature, split on 8 times) stabilizes around 64 bytes; everything else works at 10-20. The previous 1024-byte default was sized for chardet's statistical language modeling, which the tree doesn't do. Key design decisions: - Training strategies capped at 128 bytes to match production - encoded_text_strategy max_size dropped from 512 to 64 chars (encoded text expands, so raw input must be smaller than the byte cap) - Tree retrained from scratch with new thresholds calibrated to 128-byte chunks
Seven encodings shared "Héllo wörld, café, naïve, résumé" and five shared the same Russian pangram, so the tests confirmed the family worked but couldn't catch failures specific to one encoding's byte distribution. Each row now has unique text in the language the encoding was designed for. Key design decisions: - cp1047 renamed to ebcdic-cp-us (Python's actual codec name) - cp1258 uses French text rather than Vietnamese, because cp1258's Vietnamese support requires combining diacritics that Python's codec doesn't handle in precomposed form - Tree retrained against the new samples
The encoding coverage tests verified that each encoding is detected as text, but nothing tested small chunk sizes, the min_bytes claims in the CSV, or the boundary between text and binary. The test suite now covers ASCII text from 1 to 64 bytes, all-null binary from 1 to 64 bytes, the min_bytes threshold for every covered encoding, and chunks with injected high bytes or broken UTF-8. The encoding test's chunk cap is now 128 bytes, matching the production read size in get_starting_chunk.
Binary format magic bytes (PNG, JPEG, ELF, etc.) now live in binary_formats.csv alongside the encoding coverage CSV. The training script loads magic headers from this CSV instead of a hardcoded list. Each row carries the format name, family, magic bytes in hex, and an optional path to a real test fixture. Adding a new binary format means adding one CSV row. Both the trainer and the test suite pick it up automatically.
Each row in binary_formats.csv with a magic_hex value gets a parametrized test that synthesizes magic + padding and asserts binary detection. Rows with a test_file path get a second test against the real fixture. This mirrors the text encoding test pattern: the CSV drives both training and testing.
Each row in binary_formats.csv now has a source column pointing to the specification or reference document where the magic bytes are defined. This makes the provenance auditable without leaving the CSV.
The README and usage docs now reflect the current architecture: 128-byte chunks (down from 1024), 23 features (up from 18, adding CJK try-decode), 37 covered text encodings, and 32 binary formats. The usage docs gain a new "Binary format coverage" section pointing to binary_formats.csv.
The detector now covers 17 additional binary formats: WOFF2, WebP, MP4/MOV, MP3/ID3, bzip2, 7z, OLE2 (legacy Office), Zstandard, RAR, Matroska/WebM, MIDI, PSD, HEIF, Apache Parquet, Dalvik DEX, LLVM bitcode, and Git packfiles. Each entry cites its specification. Key design decisions: - Test padding switched from bytes(range(256)) to a SHA-512 hash. The old padding was 74% printable ASCII (bytes 0x20-0x7E), which made magic-byte tests fragile across tree retrains. SHA-512 output is stable, deterministic, and looks like real binary file content (compressed data, pixel streams). - Dropped the 1-byte null test. A single 0x00 byte is genuinely ambiguous (zero entropy, one sample), and no real-world binary detection scenario involves classifying a single byte.
The training data has 1491 text samples vs 820 binary (1.8:1). Without correction, the tree favors the majority class. class_weight="balanced" tells scikit-learn to weight each sample inversely proportional to its class frequency, so a binary misclassification costs 1.8x more than a text one. Cross-validation improved from 0.8516 to 0.8684. The tiny-null-chunk test now starts at 16 bytes. All-null chunks under 16 bytes lack enough structure for the UTF-16LE decoder to engage, so the tree can't distinguish them from text. No real-world scenario requires classifying an 8-byte all-null file.
Five new Hypothesis strategies generate training samples at the points where the old tree struggled most. Three binary strategies cover patterns the random-bytes generator missed: structured records (repeating byte patterns like pixel data), compressed payloads (high-entropy with broken encodings), and binary with embedded ASCII strings (like ELF string tables). Two text strategies target the main false-positive source: CJK characters in CJK encodings and text with realistic whitespace (tabs, newlines, carriage returns). Results: - CV: 0.8684 -> 0.8827, CV std: 0.094 -> 0.066 - Best depth: 13 -> 10 (simpler tree, less overfitting) - Training balance: 1.8:1 -> 1.4:1 (1791 text, 1270 binary) - False positives: 34 -> 28, false negatives: 0 -> 0 A classifier comparison (DecisionTree, RandomForest, GradientBoosting, LogisticRegression) confirmed the single decision tree is the right choice. All tree-based ensembles land within 0.007 CV of each other, well within the noise floor. The bottleneck was data quality, not model capacity.
Every new format now has an actual file in tests/files/, generated by system tools where possible (ffmpeg for MP4/WebM, cwebp for WebP, bzip2/zstd for archives, git pack-objects for packfiles, heif-enc for HEIF) and by minimal valid headers in Python for the rest (WOFF2, OLE2, RAR, MIDI, PSD, Parquet, DEX, LLVM bitcode, 7z). The generator script lives at scripts/generate_fixtures.py for reproducibility. MP3 is deliberately left without a fixture. MPEG audio frame data overlaps statistically with CJK text at every chunk size tested (128-4096 bytes), oscillating between text and binary as frame boundaries shift. The magic-byte test (ID3v2 header) still covers it.
README covers the value proposition and edge cases. docs/index.md is a brief landing page with a code example and nav links. docs/usage.md documents the detection algorithm and coverage matrices. The "Tested file types" section in usage.md restated what the encoding/binary coverage sections already said, so it's gone. Key design decisions: - Removed Eli Bendersky / Perl pp_fttext references: the detector is now a trained decision tree, not a port of the Perl heuristic - Simplified the README feature description to focus on what the tree considers, not an exhaustive list of all 23 features - Credits link updated to audrey.feldroy.com
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replaces the 2.1 MB chardet dependency with a trained decision tree. Zero dependencies. Fixes #634.
encodings.csv,binary_formats.csv) are the single source of truth, feeding training data, parametrized tests, and documentation. Each binary format cites its specification.scripts/generate_fixtures.py), tiny chunks, min_bytes thresholds, and boundary cases.class_weight="balanced"with 5 targeted Hypothesis strategies (structured binary, binary with strings, compressed binary, CJK text, whitespace-heavy text). Cross-validation: 0.8827, std 0.066.What changed
scripts/train_detector.py— Hypothesis-based training script that generates samples, trains a scikit-learn decision tree, and writessrc/binaryornot/tree.pydirectly.scripts/generate_fixtures.py— Generates real binary test fixtures using system tools (ffmpeg, bzip2, zstd, git) and minimal valid headers for formats without tool support.src/binaryornot/tree.py— Auto-generated pure-Python decision tree (no runtime dependency on scikit-learn).src/binaryornot/helpers.py— Feature extraction (23 features); reads 128 bytes; imports the tree instead of chardet.src/binaryornot/data/encodings.csv— Text encoding coverage matrix (37 encodings) with per-encoding sample text and gap reasons.src/binaryornot/data/binary_formats.csv— Binary format coverage matrix (49 formats) with magic bytes, test fixtures, and spec citations.tests/test_encoding_coverage.py— Parametrized tests for both CSVs, plus edge-case tests.pyproject.toml—dependencies = [].README.md,docs/— Each page owns one topic: README covers the value proposition, docs/usage.md covers the algorithm and coverage matrices.Test plan
uv run pytest— 211 passed, 5 xfaileduv run --with 'scikit-learn,numpy,hypothesis' python scripts/train_detector.pyand verify tree.py regenerates cleanly