Add mime_type to detection results via magic number matching #350

Merged
dan-blanchard merged 24 commits into main from mime-type-detection
Mar 23, 2026

@dan-blanchard dan-blanchard commented Mar 23, 2026

Summary

Adds a mime_type field to DetectionResult and DetectionDict so callers can tell what kind of file they're looking at, not just whether it's binary or text. Binary files get identified via magic number prefix matching in a new pipeline/magic.py module. Text files get appropriate MIME types from the pipeline stage that identified them (text/html, text/xml, text/x-python) or default to text/plain.

This came out of the fact that detect() was returning encoding=None for binary files with no further information. Now you get mime_type="image/png" instead of just "it's binary."

What changed

  • DetectionResult / DetectionDict: New mime_type: str | None = None field, backward-compatible (existing construction sites unaffected)
  • pipeline/magic.py: Magic number lookup table covering 40+ formats. Fixed-offset prefix matching only, no deep analysis. Includes:
    • Images: PNG, JPEG, GIF, WebP, BMP, TIFF, ICO, AVIF, HEIC/HEIF, JPEG XL, PSD, QOI
    • Audio/Video: MP3 (ID3), MP4/MOV, QuickTime, OGG, FLAC, WAV, AVI, WebM/MKV, MIDI, AIFF
    • Archives: ZIP, GZIP, BZIP2, XZ, 7z, RAR, ZSTD, TAR
    • Documents/Data: PDF, WASM, SQLite, Apache Parquet, Apache Arrow
    • Executables/Bytecode: ELF, Mach-O, PE (MZ), Java class files (disambiguated from Mach-O fat binary via version bytes), Android DEX
    • Fonts: WOFF, WOFF2, OTF, TTF
  • ZIP sub-detection: Scans first 4KB of ZIP files for entry filenames/content to distinguish XLSX/DOCX/PPTX, JAR, APK, EPUB, Python wheels, and OpenDocument formats from plain ZIPs
  • ftyp sub-detection: Distinguishes MP4, QuickTime, AVIF, HEIC/HEIF, and M4A by brand, with box-size validation to prevent false positives on text data
  • RIFF/FORM containers: Distinguishes WAV, WebP, AVI (RIFF) and AIFF (FORM) by subtype
  • Markup stage: Sets text/html, text/xml, or text/x-python based on which pattern matched
  • Orchestrator: Magic stage runs after escape detection, before UTF-8/ASCII prechecks. _fill_metadata() (merged from the old _fill_language()) fills defaults at the API boundary
  • Perf: Replaced dataclasses.replace() with direct construction on hot paths, eliminating ~354k function calls per full test suite run
  • MIME types verified: All MIME types checked against IANA registry, MDN, Wikipedia, and file(1). Three incorrect types fixed (Parquet, Arrow, ICO now use IANA-registered types). Unregistered types (image/qoi, application/vnd.android.dex, application/x-wheel+zip) confirmed in widespread use (VLC, GIMP, JupyterLite, etc.)
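
For readers unfamiliar with the approach, fixed-offset prefix matching plus a RIFF subtype check can be sketched roughly as below. This is a minimal illustration, not the contents of pipeline/magic.py: the names `_PREFIXES`, `_RIFF_SUBTYPES`, and `sniff_magic` are hypothetical, the table covers only a handful of well-known signatures, and the MIME strings are commonly used values that may differ from the ones this PR settled on.

```python
from __future__ import annotations

# Hypothetical, heavily abbreviated signature table (prefix at offset 0 -> MIME type).
_PREFIXES: tuple[tuple[bytes, str], ...] = (
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
    (b"\x1f\x8b", "application/gzip"),
    (b"\x00asm", "application/wasm"),
)

# RIFF containers share a prefix; the form type at bytes 8-11 tells them apart.
_RIFF_SUBTYPES = {
    b"WAVE": "audio/wav",
    b"WEBP": "image/webp",
    b"AVI ": "video/x-msvideo",
}


def sniff_magic(data: bytes) -> str | None:
    """Return a MIME type if the leading bytes match a known signature."""
    for prefix, mime in _PREFIXES:
        if data.startswith(prefix):
            return mime
    if data[:4] == b"RIFF" and len(data) >= 12:
        return _RIFF_SUBTYPES.get(data[8:12])
    return None
```

Anything that matches no signature returns None, so later pipeline stages remain free to treat the input as text.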

Performance

Benchmarked against chardet 7.2.0 with mypyc (--mypyc mode):

  • Detection time: 4528ms (this branch) vs 4640ms (7.2.0 baseline), i.e. -2.4%, faster than baseline
  • Median per-file: 0.57ms vs 0.58ms
  • Accuracy: 98.1% encoding on both (unchanged)

The dataclasses.replace() cleanup is a net improvement over baseline, since we were already paying that cost in _fill_language() before this PR.

dan-blanchard and others added 14 commits March 23, 2026 14:57
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Clarify pipeline ordering, mime_type propagation through reconstruction
sites, TAR offset, MP4 support, markup MIME type scope, and backward
compatibility details.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses review feedback: removed buggy AVIF code, merged task 7 into
task 6 for green builds, dropped redundant stage-level mime_type task,
added UniversalDetector tests, noted mypyc compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…opagation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove dead offset field from _MAGIC_NUMBERS entries, extract RIFF
handling to a standalone check (matching ftyp/TAR pattern), remove
redundant test, and add assertion for test data file count.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both orchestrator.py and detector.py had identical _NONE_RESULT
constants. Move to pipeline/__init__.py and import from there.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single pass over results to fill both language and mime_type, eliminating
the duplicated loop structure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scan the first 4KB of ZIP files for local file headers with entry
filenames starting with xl/, word/, or ppt/ to distinguish Office
Open XML documents from plain ZIP archives.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Generalize ZIP sub-detection to recognize Java JAR (META-INF/MANIFEST.MF),
Android APK (AndroidManifest.xml), EPUB (META-INF/container.xml),
Python wheels (.dist-info/), and OpenDocument formats (mimetype entry
content). Plain ZIPs with no matching entries remain application/zip.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
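
A rough sketch of the ZIP sub-detection idea described in the two commits above: walk the local file headers in the first 4KB and look at each entry name. The function name `sniff_zip`, the constant names, and the abbreviated prefix table are hypothetical and only cover a few of the formats the commits mention; the header offsets follow the standard ZIP local file header layout.

```python
from __future__ import annotations

import struct

_LOCAL_HEADER = b"PK\x03\x04"
_SCAN_LIMIT = 4096  # only the first 4KB is inspected

_OOXML = "application/vnd.openxmlformats-officedocument."
# Hypothetical, abbreviated table: entry-name prefix -> MIME type.
_ENTRY_PREFIXES = (
    ("xl/", _OOXML + "spreadsheetml.sheet"),
    ("word/", _OOXML + "wordprocessingml.document"),
    ("ppt/", _OOXML + "presentationml.presentation"),
    ("META-INF/container.xml", "application/epub+zip"),
    ("META-INF/MANIFEST.MF", "application/java-archive"),
)


def sniff_zip(data: bytes) -> str | None:
    """Classify a ZIP by its entry names; None if data is not a ZIP."""
    if not data.startswith(_LOCAL_HEADER):
        return None
    window = data[:_SCAN_LIMIT]
    pos = 0
    while pos != -1 and pos + 30 <= len(window):
        # Local file header: compressed size at +18, name/extra lengths at +26/+28.
        comp_size, _uncomp, name_len, extra_len = struct.unpack_from(
            "<IIHH", window, pos + 18
        )
        name = window[pos + 30 : pos + 30 + name_len].decode("utf-8", "replace")
        for prefix, mime in _ENTRY_PREFIXES:
            if name.startswith(prefix):
                return mime
        # Skip the name, extra field, and (when the size is recorded) the entry
        # data before searching for the next local file header signature.
        next_pos = pos + 30 + name_len + extra_len + comp_size
        pos = window.find(_LOCAL_HEADER, next_pos)
    return "application/zip"
```

Archives with no matching entry in the scanned window fall through to application/zip, mirroring the behavior the commit message describes for plain ZIPs.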
…ruction

dataclasses.replace() has significant per-call overhead (field introspection,
validation). Direct construction of the 4-field frozen dataclass eliminates
~354k function calls per full test suite run, reducing _fill_metadata
cumulative time by ~10%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
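
To illustrate the trade-off in the commit above, here is a small sketch comparing dataclasses.replace() with direct construction on a 4-field frozen dataclass. The class name `Result`, its field names other than mime_type, and both helper functions are assumptions made for the example; they are not copied from chardet's source.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Result:
    # Field names are illustrative; only mime_type (default None) is from the PR text.
    encoding: str | None
    confidence: float
    language: str | None
    mime_type: str | None = None


def fill_with_replace(r: Result, default_mime: str = "text/plain") -> Result:
    """Fill a missing mime_type via dataclasses.replace() (generic but slower)."""
    return r if r.mime_type is not None else replace(r, mime_type=default_mime)


def fill_direct(r: Result, default_mime: str = "text/plain") -> Result:
    """Same result, but builds the new instance directly from known fields."""
    if r.mime_type is not None:
        return r
    return Result(r.encoding, r.confidence, r.language, default_mime)
```

Both helpers return equal objects; the direct version simply avoids the field introspection that replace() performs on every call. Because mime_type defaults to None, a three-argument call such as `Result("utf-8", 0.99, "en")` keeps working unchanged.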

codecov bot commented Mar 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (a7942a9) to head (d7b3eb6).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##              main      #350    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           23        24     +1     
  Lines         1449      1554   +105     
==========================================
+ Hits          1449      1554   +105     

dan-blanchard and others added 10 commits March 23, 2026 16:06
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ftyp: validate that box size (bytes 0-3, big-endian uint32) is between
8 and len(data). Text files like "The ftypeface..." have ASCII bytes
in positions 0-3 producing box sizes in the billions, far exceeding
any real input length.

ZIP: advance scan offset past extra field and file content (when
compressed_size is available) to avoid matching PK\x03\x04 signatures
that appear inside stored file data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
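
The ftyp box-size check from the commit above can be sketched as follows. The function name `looks_like_ftyp` is hypothetical; the rule itself (big-endian uint32 in bytes 0-3 must lie between 8 and len(data)) is taken from the commit message.

```python
import struct


def looks_like_ftyp(data: bytes) -> bool:
    """True if data starts with a plausibly sized ISO BMFF 'ftyp' box."""
    if len(data) < 8 or data[4:8] != b"ftyp":
        return False
    # Bytes 0-3 hold the box size as a big-endian unsigned 32-bit integer.
    (box_size,) = struct.unpack_from(">I", data, 0)
    return 8 <= box_size <= len(data)
```

For text such as "The ftypeface is nice", the first four bytes "The " decode to a box size of roughly 1.4 billion, so the check rejects it even though bytes 4-7 spell "ftyp".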
…rquet, Arrow

New formats detected:
- Images: HEIC/HEIF (ftyp brands), JPEG XL (container + codestream),
  PSD (Photoshop), QOI
- Audio: MIDI, AIFF/AIFC (FORM container)
- Video: QuickTime (ftyp qt brand)
- Fonts: OTF (OpenType CFF)
- Bytecode: Android DEX
- Data: Apache Parquet, Apache Arrow IPC

AVIF now returns image/heif (shared HEIF container format).
Skipped Java class files (cafebabe conflicts with Mach-O fat binary).
Skipped TTF (signature conflicts with ICO).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TTF (\x00\x01\x00\x00) doesn't actually conflict with ICO (\x00\x00\x01\x00).

Java class files and Mach-O fat binaries share \xca\xfe\xba\xbe but
are distinguished by bytes 4-7: Mach-O fat has nfat_arch (2-5),
Java has minor+major version (major >= 45 for Java 1.1+).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
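
A minimal sketch of the \xca\xfe\xba\xbe disambiguation described above, assuming the rule is simply "major version >= 45 means Java class file, otherwise Mach-O fat binary". The function name and the two MIME strings are placeholders chosen for the example (both are commonly seen but unregistered), not necessarily what the PR returns.

```python
from __future__ import annotations

import struct

_CAFEBABE = b"\xca\xfe\xba\xbe"


def classify_cafebabe(data: bytes) -> str | None:
    """Tell Java class files from Mach-O fat binaries by bytes 4-7."""
    if len(data) < 8 or data[:4] != _CAFEBABE:
        return None
    # Java: minor version (bytes 4-5) then major version (bytes 6-7), big-endian.
    # Mach-O fat: the same four bytes are nfat_arch, a small big-endian count.
    _minor, major = struct.unpack_from(">HH", data, 4)
    if major >= 45:
        return "application/java-vm"  # placeholder MIME string
    return "application/x-mach-binary"  # placeholder MIME string
```

A fat binary with two architectures has bytes 4-7 equal to 00 00 00 02, which reads as major version 2 and therefore falls on the Mach-O side; a Java 8 class file has major version 52.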
AVIF and HEIF are different codecs in the same container family with
distinct IANA-registered MIME types. Split the ftyp brand sets so
avif/avis brands return image/avif and heic/heix/mif1/msf1 return
image/heif.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
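
The brand split described in the commit above might look roughly like this. The set and function names are hypothetical, and the mapping reflects only what this commit message states; the following commit (whose title is truncated on this page) may have adjusted how the generic mif1/msf1 brands are handled.

```python
from __future__ import annotations

# Brand sets as stated in the commit message above (names are illustrative).
_AVIF_BRANDS = frozenset({b"avif", b"avis"})
_HEIF_BRANDS = frozenset({b"heic", b"heix", b"mif1", b"msf1"})


def mime_from_ftyp_brand(data: bytes) -> str | None:
    """Map the major brand of an ftyp box (bytes 8-11) to an image MIME type."""
    if len(data) < 12 or data[4:8] != b"ftyp":
        return None
    brand = data[8:12]
    if brand in _AVIF_BRANDS:
        return "image/avif"
    if brand in _HEIF_BRANDS:
        return "image/heif"
    return None  # other brands (MP4, QuickTime, M4A, ...) are out of scope here
```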
…ric mif1/msf1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- application/x-parquet → application/vnd.apache.parquet (IANA 2024-02-14)
- application/x-apache-arrow-file → application/vnd.apache.arrow.file (IANA 2021-06-23)
- image/x-icon → image/vnd.microsoft.icon (IANA 2003-09-03)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dan-blanchard dan-blanchard enabled auto-merge (squash) March 23, 2026 21:03
@dan-blanchard dan-blanchard disabled auto-merge March 23, 2026 21:05
@dan-blanchard dan-blanchard merged commit e8e8a3a into main Mar 23, 2026
17 checks passed
@dan-blanchard dan-blanchard deleted the mime-type-detection branch March 23, 2026 21:05