Add extension-based binary detection #648

Merged
audreyfeldroy merged 3 commits into main from check-file-extensions on Mar 8, 2026

Collaborator

@audreyfeldroy audreyfeldroy commented Mar 8, 2026

Summary

is_binary() now checks the filename against 131 known binary extensions before opening the file. Images, audio, video, archives, executables, fonts, documents, databases, 3D models, disk images, CAD files, scientific data formats, and game ROMs are all classified by their suffix alone, with no file I/O.

The extension list lives in binary_extensions.csv, following the same CSV-as-source-of-truth pattern as the format signatures. Callers who need pure content-based classification (the use case that motivated commenting out the original .pyc-only check in 2019) pass check_extensions=False.
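A minimal sketch of the CSV-as-source-of-truth idea, for readers skimming the PR. The column names and the sample rows are assumptions made for illustration; the real binary_extensions.csv may be laid out differently, and `load_extensions` is a hypothetical helper, not a function from this project.

```python
import csv
import io

# Hypothetical sample; the real binary_extensions.csv may use other columns.
SAMPLE_CSV = """extension,category
.png,image
.pyc,executable
.zip,archive
"""


def load_extensions(text: str) -> frozenset:
    """Parse CSV text and return the lowercased extensions as a frozenset."""
    reader = csv.DictReader(io.StringIO(text))
    return frozenset(row["extension"].lower() for row in reader)


# Built once, then every lookup is a constant-time membership test.
BINARY_EXTENSIONS = load_extensions(SAMPLE_CSV)
```

Keeping the list in a data file means adding an extension is a one-line CSV change that can be reviewed without touching code.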

This restores and completes a feature Audrey introduced in 2015 that only ever covered .pyc before being disabled. The original had no opt-out mechanism, which meant a text file named .pyc was misclassified with no workaround. The new version handles that case explicitly: .pyc is binary by extension, text by content with check_extensions=False.
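The two modes can be illustrated roughly as follows. This is a sketch under stated assumptions, not the project's implementation: `looks_binary` is a hypothetical stand-in for is_binary(), the two-entry set stands in for the 131-entry list, and the null-byte test is a deliberately crude substitute for the real content analysis.

```python
import tempfile
from pathlib import Path

# Tiny stand-in for the real extension list.
BINARY_EXTENSIONS = frozenset({".pyc", ".png"})


def looks_binary(path, *, check_extensions=True):
    """Hypothetical stand-in for is_binary(): suffix first, content second."""
    p = Path(path)
    if check_extensions and p.suffix.lower() in BINARY_EXTENSIONS:
        return True  # decided by suffix alone; the file is never opened
    with open(p, "rb") as f:
        chunk = f.read(1024)
    # Crude placeholder for content-based detection (assumption).
    return b"\x00" in chunk


def demo():
    """Classify a text file named .pyc in both modes."""
    with tempfile.TemporaryDirectory() as tmp:
        fake = Path(tmp) / "this_is_not_a_bin.pyc"
        fake.write_text("print('hello')\n")
        by_extension = looks_binary(fake)
        by_content = looks_binary(fake, check_extensions=False)
        return by_extension, by_content
```

Under these assumptions, `demo()` returns `(True, False)`: binary by extension, text by content.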

Related: #642 (a PNG misclassified as text; the extension check would also catch that case)

Test plan

  • Red/green TDD: 4 tests fail before implementation (pyc by extension, png by extension, disabled check, case-insensitive), all pass after
  • test_negative_binary (alcuin2's case): no longer @expectedFailure, now asserts both modes
  • test_binary_gif2 (empty .gif): updated to reflect extension-first behavior
  • this_is_not_a_bin.pyc fixture created (text content, .pyc extension)
  • Full suite: 222 passed, 4 xfailed

is_binary() checks the filename against a CSV of 131 known binary
extensions (images, audio, video, archives, executables, fonts,
documents, databases, 3D models, disk images, and more) before
opening the file. This is the fastest detection path: no I/O, no
byte statistics, just a frozenset lookup on the suffix.

The original extension check (2015, Audrey) covered only .pyc and
had no way to opt out. Alcuin2 commented it out in 2019 because a
text file named .pyc was misclassified. The new version ships 131
extensions and a check_extensions=False keyword argument for callers
who need pure content-based classification.

Key design decisions:
- Extensions live in binary_extensions.csv, same pattern as the
  signatures in binary_formats.csv. Single source of truth, easy
  to audit and extend.
- Case-insensitive matching (handles .PNG, .Jpg, etc.)
- Keyword-only argument prevents accidental positional use.
- The test_negative_binary test (alcuin2's case) is no longer an
  expected failure: .pyc is binary by extension, text by content
  with check_extensions=False. Both assertions are now explicit.
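
The case-insensitive matching and keyword-only argument from the list above can be sketched like this. The function names and the two-entry set are hypothetical, chosen only to illustrate the pattern; in a real implementation the `check_extensions=False` path would fall through to content inspection rather than return False.

```python
from pathlib import PurePath

# Stand-in for the frozenset built from the CSV.
BINARY_EXTENSIONS = frozenset({".png", ".jpg"})


def has_binary_extension(name: str) -> bool:
    """Lowercase the suffix before the lookup so .PNG and .Jpg match."""
    # PurePath never touches the filesystem, so this involves no I/O.
    return PurePath(name).suffix.lower() in BINARY_EXTENSIONS


def classify(name: str, *, check_extensions: bool = True) -> bool:
    """The bare * makes check_extensions keyword-only."""
    return check_extensions and has_binary_extension(name)
```

Calling `classify("photo.PNG", False)` raises TypeError because the flag cannot be passed positionally, so a stray boolean cannot silently disable the check.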

Path() doesn't accept bytes. Decode with os.fsdecode() first so
the type checker (ty) is satisfied and bytes paths work correctly.
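
A short sketch of the decode-first approach. `suffix_of` is a hypothetical helper name; only `os.fsdecode()` and `pathlib.Path` are real APIs here.

```python
import os
from pathlib import Path


def suffix_of(path) -> str:
    """Return the lowercased suffix for a str, bytes, or os.PathLike path."""
    # os.fsdecode() accepts all three input types and always returns str,
    # which is the only string type Path() will take.
    return Path(os.fsdecode(path)).suffix.lower()
```

Passing bytes straight to `Path()` raises TypeError, which is why the decode step comes first.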

CJK locales on Windows produce Shift-JIS, GBK, or EUC-KR filenames
that aren't valid UTF-8. When those files land on Linux (or Docker,
or WSL), os.listdir() called with a bytes path returns the names as
bytes. The comment records the use case so future readers don't
remove the bytes handling.
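
To make the use case concrete, here is a sketch of what such a filename looks like from Python. It assumes a POSIX system, where os.fsdecode() uses the surrogateescape error handler; the sample name is invented for illustration.

```python
import os
from pathlib import PurePath

# "テスト.png" encoded as Shift-JIS, as a Windows CJK locale might store it.
raw = "\u30c6\u30b9\u30c8.png".encode("shift_jis")


def is_valid_utf8(data: bytes) -> bool:
    """Report whether the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# On POSIX, undecodable bytes become lone surrogates instead of raising.
decoded = os.fsdecode(raw)
suffix = PurePath(decoded).suffix

# The surrogates encode back to the original bytes.
round_trip = os.fsencode(decoded)
```

The name is not valid UTF-8, yet the suffix is still readable after decoding and the original bytes are recoverable, so an extension check works without knowing the original encoding.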

@audreyfeldroy audreyfeldroy force-pushed the check-file-extensions branch from b827afa to c375996 on March 8, 2026 16:15
@audreyfeldroy audreyfeldroy merged commit fba3730 into main Mar 8, 2026
11 checks passed
@audreyfeldroy audreyfeldroy deleted the check-file-extensions branch March 8, 2026 16:17