Add extension-based binary detection #648
Merged
audreyfeldroy merged 3 commits into main on Mar 8, 2026
is_binary() checks the filename against a CSV of 131 known binary extensions (images, audio, video, archives, executables, fonts, documents, databases, 3D models, disk images, and more) before opening the file. This is the fastest detection path: no I/O, no byte statistics, just a frozenset lookup on the suffix.

The original extension check (2015, Audrey) covered only .pyc and had no way to opt out. Alcuin2 commented it out in 2019 because a text file named .pyc was misclassified. The new version ships 131 extensions and a check_extensions=False keyword argument for callers who need pure content-based classification.

Key design decisions:

- Extensions live in binary_extensions.csv, same pattern as the signatures in binary_formats.csv. Single source of truth, easy to audit and extend.
- Case-insensitive matching (handles .PNG, .Jpg, etc.)
- Keyword-only argument prevents accidental positional use.
- The test_negative_binary test (alcuin2's case) is no longer an expected failure: .pyc is binary by extension, text by content with check_extensions=False. Both assertions are now explicit.
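The extension-first flow described in this commit message can be sketched roughly as follows. This is a minimal illustration and not the code merged in this PR: the CSV column names, the five sample rows, and the helpers `load_extensions`, `looks_binary_by_content`, and `is_binary_sketch` are all assumptions made for the example, and the content check is a deliberately crude placeholder.

```python
import csv
import io
from pathlib import Path

# Hypothetical stand-in for binary_extensions.csv. The real file's columns
# and its 131 entries are not shown in this conversation.
_SAMPLE_CSV = """extension,category
.png,image
.jpg,image
.gif,image
.zip,archive
.pyc,executable
"""


def load_extensions(csv_text):
    """Read the 'extension' column into a lowercase frozenset."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return frozenset(row["extension"].lower() for row in reader)


BINARY_EXTENSIONS = load_extensions(_SAMPLE_CSV)


def looks_binary_by_content(filename):
    """Placeholder content check (assumption): a NUL byte in the first 1 KiB."""
    with open(filename, "rb") as f:
        return b"\x00" in f.read(1024)


def is_binary_sketch(filename, *, check_extensions=True):
    """Extension lookup first, content check second.

    The bare '*' makes check_extensions keyword-only, so a stray
    positional argument raises TypeError instead of silently toggling it.
    """
    if check_extensions:
        # Lowercasing the suffix gives case-insensitive matching (.PNG, .Jpg).
        if Path(filename).suffix.lower() in BINARY_EXTENSIONS:
            return True  # decided without opening the file
    return looks_binary_by_content(filename)
```

Under these assumptions, `is_binary_sketch("photo.PNG")` returns `True` without touching the disk, while a text file with a `.pyc` suffix is reported as binary by default and as text when called with `check_extensions=False`.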
Path() doesn't accept bytes. Decode with os.fsdecode() first so the type checker (ty) is satisfied and bytes paths work correctly.
CJK locales on Windows produce Shift-JIS, GBK, or EUC-KR filenames that aren't valid UTF-8. When those files land on Linux (or Docker, or WSL), callers that list a directory by bytes path, e.g. os.listdir(b"."), get bytes filenames back and may pass them straight to is_binary(). The comment records the use case so future readers don't remove the bytes handling.
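A small sketch of the decoding step the two commit messages above describe, assuming a hypothetical helper named `suffix_of` (not a function from this PR). `os.fsdecode()` accepts `str`, `bytes`, and path-like objects; on POSIX systems it decodes bytes with the `surrogateescape` error handler, so names that are not valid UTF-8 still decode and the ASCII suffix survives.

```python
import os
from pathlib import Path


def suffix_of(filename):
    """Return the lowercase suffix for a str, bytes, or path-like filename.

    Path() rejects bytes, so decode first. os.fsdecode() returns str input
    unchanged and decodes bytes using the filesystem encoding.
    """
    return Path(os.fsdecode(filename)).suffix.lower()


# A filename whose stem is Shift-JIS encoded: these bytes are not valid UTF-8.
legacy_name = b"\x8e\xca\x90\x5e.jpg"

print(suffix_of(legacy_name))
print(suffix_of("notes.TXT"))
```

The example is written with POSIX behaviour in mind; how undecodable bytes are handled on Windows differs and is not covered here.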
Summary
`is_binary()` now checks the filename against 131 known binary extensions before opening the file. Images, audio, video, archives, executables, fonts, documents, databases, 3D models, disk images, CAD files, scientific data formats, and game ROMs are all classified by their suffix alone, with no file I/O.

The extension list lives in `binary_extensions.csv`, following the same CSV-as-source-of-truth pattern as the format signatures. Callers who need pure content-based classification (the use case that motivated commenting out the original `.pyc`-only check in 2019) pass `check_extensions=False`.

This restores and completes a feature Audrey introduced in 2015 that only ever covered `.pyc` before being disabled. The original had no opt-out mechanism, which meant a text file named `.pyc` was misclassified with no workaround. The new version handles that case explicitly: `.pyc` is binary by extension, text by content with `check_extensions=False`.

Related: #642 (PNG misclassified as text would also be caught by the extension check)
Test plan
- `test_negative_binary` (alcuin2's case): no longer `@expectedFailure`, now asserts both modes
- `test_binary_gif2` (empty `.gif`): updated to reflect extension-first behavior
- `this_is_not_a_bin.pyc` fixture created (text content, `.pyc` extension)