Add extension-based binary detection by audreyfeldroy · Pull Request #648 · binaryornot/binaryornot

audreyfeldroy · 2026-03-08T16:04:18Z

Summary

is_binary() now checks the filename against 131 known binary extensions before opening the file. Images, audio, video, archives, executables, fonts, documents, databases, 3D models, disk images, CAD files, scientific data formats, and game ROMs are all classified by their suffix alone, with no file I/O.

The extension list lives in binary_extensions.csv, following the same CSV-as-source-of-truth pattern as the format signatures. Callers who need pure content-based classification (the use case that motivated commenting out the original .pyc-only check in 2019) pass check_extensions=False.
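As a rough illustration of the CSV-as-source-of-truth idea, the sketch below loads extensions from CSV text into a frozenset. The column names (`extension`, `category`), the sample rows, and the `load_binary_extensions` helper are assumptions made for this example; the actual layout of binary_extensions.csv is not shown in this PR description.

```python
import csv
import io

# Hypothetical stand-in for binary_extensions.csv. The real file's columns
# and rows are not reproduced here, so this layout is an assumption.
SAMPLE_CSV = """extension,category
.png,image
.PYC,executable
.zip,archive
"""


def load_binary_extensions(text: str) -> frozenset:
    """Return the lowercased extensions found in CSV text as a frozenset."""
    reader = csv.DictReader(io.StringIO(text))
    return frozenset(row["extension"].strip().lower() for row in reader)


BINARY_EXTENSIONS = load_binary_extensions(SAMPLE_CSV)
print(sorted(BINARY_EXTENSIONS))
```

Normalizing to lowercase at load time means each later lookup only has to lowercase the filename's suffix.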

This restores and completes a feature Audrey introduced in 2015 that only ever covered .pyc before being disabled. The original had no opt-out mechanism, which meant a text file named .pyc was misclassified with no workaround. The new version handles that case explicitly: .pyc is binary by extension, text by content with check_extensions=False.
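The two modes described above can be sketched as follows. This is not the library's implementation: `is_binary_sketch` is a hypothetical function, the extension set is a three-entry subset, and the NUL-byte test is only a placeholder for binaryornot's real content heuristics.

```python
import tempfile
from pathlib import Path

# Tiny illustrative subset; the real list has 131 entries.
BINARY_EXTENSIONS = frozenset({".pyc", ".png", ".gif"})


def is_binary_sketch(filename: str, *, check_extensions: bool = True) -> bool:
    """Hypothetical two-stage check: suffix first, then file content."""
    if check_extensions and Path(filename).suffix.lower() in BINARY_EXTENSIONS:
        return True  # decided from the name alone, file never opened
    # Placeholder content heuristic (NOT the library's): treat a NUL byte
    # in the first 1024 bytes as a sign of binary data.
    with open(filename, "rb") as f:
        return b"\x00" in f.read(1024)


with tempfile.TemporaryDirectory() as tmp:
    # Text content behind a .pyc suffix, like the fixture in the test plan.
    fake_pyc = Path(tmp) / "this_is_not_a_bin.pyc"
    fake_pyc.write_text("plain text, despite the suffix\n")
    by_extension = is_binary_sketch(str(fake_pyc))
    by_content = is_binary_sketch(str(fake_pyc), check_extensions=False)

print(by_extension, by_content)
```

With the default, the suffix wins and the file is reported as binary. With check_extensions=False, only the content is consulted and the same file is reported as text.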

Related: #642 (PNG misclassified as text would also be caught by the extension check)

Test plan

- Red/green TDD: 4 tests fail before implementation (pyc by extension, png by extension, disabled check, case-insensitive); all pass after
- test_negative_binary (alcuin2's case): no longer @expectedFailure, now asserts both modes
- test_binary_gif2 (empty .gif): updated to reflect extension-first behavior
- this_is_not_a_bin.pyc fixture created (text content, .pyc extension)
- Full suite: 222 passed, 4 xfailed

is_binary() checks the filename against a CSV of 131 known binary extensions (images, audio, video, archives, executables, fonts, documents, databases, 3D models, disk images, and more) before opening the file. This is the fastest detection path: no I/O, no byte statistics, just a frozenset lookup on the suffix.

The original extension check (2015, Audrey) covered only .pyc and had no way to opt out. Alcuin2 commented it out in 2019 because a text file named .pyc was misclassified. The new version ships 131 extensions and a check_extensions=False keyword argument for callers who need pure content-based classification.

Key design decisions:
- Extensions live in binary_extensions.csv, same pattern as the signatures in binary_formats.csv. Single source of truth, easy to audit and extend.
- Case-insensitive matching (handles .PNG, .Jpg, etc.)
- Keyword-only argument prevents accidental positional use.
- The test_negative_binary test (alcuin2's case) is no longer an expected failure: .pyc is binary by extension, text by content with check_extensions=False. Both assertions are now explicit.
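The case-insensitive lookup and the keyword-only argument can be illustrated with a short sketch. The names `has_binary_extension` and `classify` are hypothetical and the extension set is a small subset; only the general technique (lowercased suffix, frozenset membership, `*` in the signature) is taken from the description above.

```python
from pathlib import PurePath

BINARY_EXTENSIONS = frozenset({".png", ".jpg", ".pyc"})  # illustrative subset


def has_binary_extension(filename: str) -> bool:
    """Pure string check: lowercase the suffix, never touch the disk."""
    return PurePath(filename).suffix.lower() in BINARY_EXTENSIONS


def classify(filename: str, *, check_extensions: bool = True) -> str:
    """Hypothetical wrapper showing the keyword-only argument."""
    if check_extensions and has_binary_extension(filename):
        return "binary"
    return "needs content check"


# None of these paths exist; the lookup works because nothing is opened.
results = [has_binary_extension(n) for n in ("a.PNG", "b.Jpg", "notes.txt")]

try:
    classify("a.png", False)  # positional use is rejected by the bare *
    positional_rejected = False
except TypeError:
    positional_rejected = True

print(results, positional_rejected)
```

The bare `*` in the signature makes a stray positional boolean a TypeError, so it cannot silently switch modes.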

Path() doesn't accept bytes. Decode with os.fsdecode() first so the type checker (ty) is satisfied and bytes paths work correctly.

CJK locales on Windows produce Shift-JIS, GBK, or EUC-KR filenames that aren't valid UTF-8. When those files land on Linux (or Docker, or WSL), os.listdir() called with a bytes path returns those names as bytes. The comment records the use case so future readers don't remove the bytes handling.
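A minimal sketch of the bytes handling, assuming a POSIX system where os.fsdecode() applies the surrogateescape error handler, so bytes that are invalid in the filesystem encoding survive decoding. The `suffix_of` helper and the sample filename are invented for illustration and are not taken from the PR's code.

```python
import os
from pathlib import PurePath


def suffix_of(filename) -> str:
    """Return the lowercased suffix of a str or bytes filename."""
    # PurePath rejects bytes, so decode first. On POSIX, os.fsdecode()
    # keeps undecodable bytes as lone surrogates instead of failing.
    return PurePath(os.fsdecode(filename)).suffix.lower()


# A Shift-JIS encoded name, as might arrive from a Japanese Windows
# machine; these bytes are not valid UTF-8.
raw = "日本.PNG".encode("shift_jis")

try:
    PurePath(raw)  # passing bytes directly raises TypeError
    bytes_rejected = False
except TypeError:
    bytes_rejected = True

print(bytes_rejected, suffix_of(raw), suffix_of("photo.JPG"))
```

The suffix is plain ASCII, so the extension lookup works even when the rest of the name cannot be decoded cleanly.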

audreyfeldroy added 3 commits March 9, 2026 00:15

Handle bytes filenames in extension check

17540bc


audreyfeldroy force-pushed the check-file-extensions branch from b827afa to c375996 Compare March 8, 2026 16:15

audreyfeldroy merged commit fba3730 into main Mar 8, 2026
11 checks passed

audreyfeldroy deleted the check-file-extensions branch March 8, 2026 16:17

audreyfeldroy merged 3 commits into main from check-file-extensions
