Releases: binaryornot/binaryornot
BinaryOrNot 0.6.0: Three Layers of Detection
BinaryOrNot identifies binary files three ways: by extension, by file signature, and by content analysis. Pass it any file path and it tells you binary or text, accurately, across PNGs, PDFs, executables, archives, fonts, CJK-encoded text, and hundreds of other formats.
uv pip install --upgrade binaryornot
What's new
131 file types recognized by name. is_binary() checks the filename extension against a curated list of binary types (images, audio, video, archives, executables, fonts, documents, databases, 3D models, CAD files, scientific data formats, game ROMs) before reading any bytes. A .png or .mp4 is classified instantly with zero file I/O. The extension list ships as binary_extensions.csv and is easy to inspect or extend. (#648)
If you need pure content-based classification, pass check_extensions=False:
from binaryornot.check import is_binary
# Extension says binary, but let's check the actual bytes
is_binary("mystery_file.pyc", check_extensions=False)
55 binary format signatures. The detector checks file headers against known magic bytes for PNG, JPEG, PDF, ZIP, ELF, Mach-O, WebAssembly, SQLite, Parquet, Arrow IPC, and 45 more formats. Files that match a known signature are classified as binary immediately, before the statistical model runs. The signature table ships as binary_formats.csv. (#647)
Type annotations on the public API. is_binary(), is_binary_string(), and get_starting_chunk() all have inline type annotations. Editors and type checkers know that is_binary() accepts str, bytes, or pathlib.Path and returns bool. Credit to @smheidrich for the initial type stubs proposal (#627) and @AlJohri for requesting pathlib.Path support (#628). (#643)
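To make the signature layer described above concrete, here is a minimal sketch of prefix matching against magic bytes. It is not BinaryOrNot's implementation: the SIGNATURES table is a small hand-picked subset of widely documented magic numbers rather than the contents of binary_formats.csv, and match_signature is a hypothetical helper name.

```python
from typing import Optional

# Illustrative subset of well-known magic bytes (assumption: not the
# library's actual table, which ships as binary_formats.csv).
SIGNATURES = {
    "PNG": b"\x89PNG\r\n\x1a\n",
    "PDF": b"%PDF-",
    "ZIP": b"PK\x03\x04",
    "ELF": b"\x7fELF",
    "SQLite": b"SQLite format 3\x00",
    "WebAssembly": b"\x00asm",
}


def match_signature(chunk: bytes) -> Optional[str]:
    """Return the name of the first signature that prefixes chunk, else None."""
    for name, magic in SIGNATURES.items():
        if chunk.startswith(magic):
            return name
    return None


print(match_signature(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))  # PNG
print(match_signature(b"plain text\n"))  # None
```

A match short-circuits the decision: the file is binary and no statistical features need to be computed. Only files with no matching prefix fall through to the next layer.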
What's better
Completely retrained decision tree on 4x more data. The detector reads 512 bytes per file instead of 128, and the decision tree was rebuilt from scratch on those larger samples. A new feature, has_magic_signature, gives the tree a second path to the right answer when statistical features are ambiguous. Byte ratios and entropy calculations reflect actual file content rather than header artifacts. (#647)
Python 3.10+ compatibility. BinaryOrNot installs on Python 3.10 through 3.14, supporting Cookiecutter, cookieplone, and other tools that run on older interpreters. Thanks @wesleybl for raising this. (#645)
Test fixtures ship in the sdist. .pyc and .DS_Store test fixtures are force-included in the source distribution so tests pass when run from the sdist. (#646)
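The byte ratios and entropy mentioned for the retrained tree can be illustrated with a short sketch. The two features below are common choices for this kind of classifier and are assumptions for illustration; the release notes do not list the actual feature definitions, and the function names are hypothetical.

```python
import math
from collections import Counter

CHUNK_SIZE = 512  # bytes read per file in 0.6.0, per the notes above


def shannon_entropy(chunk: bytes) -> float:
    """Entropy in bits per byte: 0 for constant data, 8 for uniformly spread bytes."""
    if not chunk:
        return 0.0
    total = len(chunk)
    counts = Counter(chunk)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def control_byte_ratio(chunk: bytes) -> float:
    """Fraction of bytes below 0x20, excluding tab, line feed, and carriage return."""
    if not chunk:
        return 0.0
    allowed = {0x09, 0x0A, 0x0D}
    control = sum(1 for b in chunk if b < 0x20 and b not in allowed)
    return control / len(chunk)


def features(data: bytes) -> dict:
    """Compute both illustrative features on the first CHUNK_SIZE bytes only."""
    chunk = data[:CHUNK_SIZE]
    return {
        "entropy": shannon_entropy(chunk),
        "control_ratio": control_byte_ratio(chunk),
    }
```

Reading a larger chunk matters for features like these because the first bytes of many formats are fixed header fields, so a short sample can reflect the header more than the content.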
What's fixed
PNGs with ambiguous headers are correctly classified. A 512x512 grayscale+alpha PNG has an IHDR chunk with enough null bytes that the first 128 bytes accidentally decode as UTF-16. Extension checking, signature matching, and the retrained tree each independently prevent this misclassification. Closes #642. (#647)
What's changed
is_binary() has a new keyword argument. check_extensions (default True) controls whether the extension check runs. Existing code that calls is_binary(path) gets the extension check automatically. Code that passes check_extensions=False gets the previous content-only behavior.
Contributors
@audreyfeldroy (Audrey M. Roy Greenfeld) designed and built this release: the extension detection system, file signature matching, decision tree retraining, type annotations, Python 3.10 compatibility, and sdist fixes.
Thanks to @smheidrich for the type stubs proposal, @AlJohri for requesting pathlib.Path support, and @wesleybl for raising Python 3.10 compatibility.
BinaryOrNot 0.5.0: Zero Dependencies, 128 Bytes, One Trained Classifier
This is the biggest release in BinaryOrNot's history. I rebuilt the detection engine from the ground up. The original used byte ratio heuristics with chardet as a second opinion for ambiguous files. I replaced all of that with a trained decision tree operating on 23 features, covering 49 binary formats and 37 text encodings, with zero external dependencies. It's backed by 211 tests and a training pipeline you can re-run yourself. If you've ever had BinaryOrNot misidentify a UTF-16 file, choke on a CJK-encoded document, or crash because chardet changed its API, this release is for you.
BinaryOrNot now has zero dependencies. The chardet library (2.1 MB installed) is gone, replaced by a decision tree that reads 128 bytes of a file and classifies it as binary or text using 23 features computed from those bytes alone. The API is unchanged: is_binary("file.png") still returns True.
pip install --upgrade binaryornot
By the numbers
| Before (0.4.4) | After (0.5.0) |
|---|---|
| 1 dependency (chardet, 2.1 MB) | 0 dependencies |
| 1024 bytes read per file | 128 bytes read per file |
| Byte ratio heuristics + chardet | Trained classifier, 23 features |
| ~12 binary formats | 49 binary formats |
| ASCII + whatever chardet detected | 37 text encodings |
| 48 tests | 211 tests |
What's new
- CLI tool. Run binaryornot myfile.png from the command line and get True or False. Thanks @moluwole! (#49)
- 49 binary formats recognized. PNG, JPEG, GIF, BMP, TIFF, ICO, WebP, PSD, HEIF, PDF, OLE2 (.doc/.xls), SQLite, ZIP, gzip, xz, bzip2, 7z, RAR, Zstandard, ELF, Mach-O, MZ/PE, Java class, WebAssembly, Dalvik DEX, RIFF, Ogg, FLAC, MP4/MOV, MP3, Matroska/WebM, MIDI, WOFF, WOFF2, OTF, TTF, EOT, Apache Parquet, .pyc, .DS_Store, LLVM bitcode, Git packfiles, and more. Every format cites its specification and is verified by magic-byte tests and real file fixtures.
- 37 text encodings covered. UTF-8, UTF-16, UTF-32, all major single-byte encodings (ISO-8859, Windows code pages, KOI8-R, Mac encodings), and CJK encodings (GB2312, GBK, GB18030, Big5, Shift-JIS, EUC-JP, EUC-KR, ISO-2022-JP). A Big5-encoded Chinese document is correctly identified as text, not binary.
- Encoding and format coverage tracked in CSVs. encodings.csv and binary_formats.csv are the single source of truth, feeding training data, parametrized tests, and documentation. Four gaps are documented with reasons (ISO-2022-KR and three EBCDIC code pages).
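The Big5 case in the list above is a good illustration of why CJK text is hard for simple heuristics. The sketch below uses only Python's standard big5 codec; the sample string and the high_byte_ratio helper are chosen for illustration and are not part of BinaryOrNot.

```python
def high_byte_ratio(data: bytes) -> float:
    """Fraction of bytes with the high bit set (values 0x80 and above)."""
    if not data:
        return 0.0
    return sum(1 for b in data if b >= 0x80) / len(data)


text = "繁體中文文件"  # sample Traditional Chinese string, for illustration
raw = text.encode("big5")

# Every character becomes two bytes, and each lead byte is >= 0x80, so at
# least half of the bytes look "non-ASCII" to a naive ratio check.
print(len(raw), high_byte_ratio(raw) >= 0.5)

# The bytes are still perfectly valid text: they round-trip through the codec.
print(raw.decode("big5") == text)
```

A rule such as "mostly non-ASCII bytes means binary" would misclassify this document, which is why text in these encodings needs explicit coverage.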
What's better
- 8x fewer bytes read per file. The detector reads 128 bytes instead of 1024. The decision tree's features stabilize well within that range.
- 211 tests, up from 48. Encoding round-trips, binary format magic bytes, real file fixtures for 16 formats, tiny-chunk edge cases, and boundary conditions. The decision tree is trained with balanced class weights and 5 targeted Hypothesis strategies (structured binary, binary with embedded strings, compressed binary, CJK text, whitespace-heavy text).
- SQLite databases correctly detected as binary. Thanks @pombredanne! (#44)
- Proper error logging for file I/O issues. Uses logger.exception() for better diagnostics when a file can't be read. Thanks @MarshalX! (#629)
What's fixed
- chardet 7.0.0 crash (#634). chardet 7 returns {'encoding': None, 'confidence': 0.99}, which crashed is_binary_string() with a TypeError, then crashed the error handler with a NameError from a Python 2 unicode() call. Both crash paths are structurally impossible now because chardet is gone. Thanks @wesleybl for the report!
- Unreadable files raise instead of returning False. is_binary() on a nonexistent or permission-denied file now raises FileNotFoundError or PermissionError. Previously it silently returned False, making broken paths indistinguishable from text files.
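Callers that relied on the old silent False for unreadable paths may want an explicit wrapper. The following is one possible sketch, not an official migration recipe: classify_or_none and null_byte_classifier are hypothetical names, and the stand-in classifier exists only so the example runs without BinaryOrNot installed.

```python
from typing import Callable, Optional


def classify_or_none(path: str, classifier: Callable[[str], bool]) -> Optional[bool]:
    """Return the classifier's verdict, or None when the file cannot be read."""
    try:
        return classifier(path)
    except (FileNotFoundError, PermissionError):
        # Keep "could not read" distinct from "is text" instead of
        # collapsing both into False.
        return None


# Stand-in classifier (assumption, for illustration only). In real code you
# would pass binaryornot.check.is_binary here instead.
def null_byte_classifier(path: str) -> bool:
    with open(path, "rb") as handle:
        return b"\x00" in handle.read(512)


print(classify_or_none("no/such/file.bin", null_byte_classifier))  # None
```

Returning None keeps the three outcomes (binary, text, unreadable) separate, which is the distinction the new exceptions were introduced to preserve.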
What's changed
- Zero dependencies. pip install binaryornot installs nothing else. chardet is no longer needed.
- Python 3.12+ only. Python 2 and older Python 3 versions are no longer supported. All Python 2 compatibility code has been removed.
- MIT license (previously BSD).
- src/ layout with hatchling build system, replacing setup.py/setup.cfg.
Contributors
@audreyfeldroy (Audrey M. Roy Greenfeld) designed and built this release: the trained decision tree, encoding and binary format coverage matrices, Hypothesis-based training pipeline, fixture generation, documentation, and the complete modernization from Cookiecutter PyPackage.
Thanks to @pombredanne (Philippe Ombredanne) for SQLite detection and binary stream improvements, @moluwole for the CLI tool, @MarshalX (Ilya Siamionau) for better error logging, @thebaptiste for pyproject.toml migration (#633), @wesleybl for reporting the chardet 7 crash (#634), @alcuin2 for binary detection improvements (#48), @olaoluwa-98 for CI updates (#50), and @cosmic-byte for test fixes (#52).
0.4.0
- Enhanced detection for some binary streams and UTF texts. (#10, #11) Thanks @pombredanne.
- Set up Appveyor for continuous testing on Windows. Thanks @pydanny.
- Update link to Perl source implementation. (#9) Thanks @asmeurer @pombredanne @audreyr.
- Handle UnicodeDecodeError in check. (#12) Thanks @DRMacIver.
- Add very simple Hypothesis based tests. (#13) Thanks @DRMacIver.
- Use setup to determine requirements and remove redundant requirements.txt. (#14) Thanks @hackebrot.
- Add documentation status badge to README.rst. (#15) Thanks @hackebrot.
- Run tox in travis.yml. Add pypy and Python 3.4 to tox environments. (#16) Thanks @hackebrot @pydanny.
- Handle LookupError when detecting encoding. (#17) Thanks @DRMacIver.