Add mime_type to detection results via magic number matching #350
Merged
dan-blanchard merged 24 commits into main, Mar 23, 2026
Clarify pipeline ordering, mime_type propagation through reconstruction sites, TAR offset, MP4 support, markup MIME type scope, and backward compatibility details. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses review feedback: removed buggy AVIF code, merged task 7 into task 6 for green builds, dropped redundant stage-level mime_type task, added UniversalDetector tests, noted mypyc compatibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…opagation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove dead offset field from _MAGIC_NUMBERS entries, extract RIFF handling to a standalone check (matching ftyp/TAR pattern), remove redundant test, and add assertion for test data file count. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both orchestrator.py and detector.py had identical _NONE_RESULT constants. Move to pipeline/__init__.py and import from there. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single pass over results to fill both language and mime_type, eliminating the duplicated loop structure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scan the first 4KB of ZIP files for local file headers with entry filenames starting with xl/, word/, or ppt/ to distinguish Office Open XML documents from plain ZIP archives. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Generalize ZIP sub-detection to recognize Java JAR (META-INF/MANIFEST.MF), Android APK (AndroidManifest.xml), EPUB (META-INF/container.xml), Python wheels (.dist-info/), and OpenDocument formats (mimetype entry content). Plain ZIPs with no matching entries remain application/zip. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ruction dataclasses.replace() has significant per-call overhead (field introspection, validation). Direct construction of the 4-field frozen dataclass eliminates ~354k function calls per full test suite run, reducing _fill_metadata cumulative time by ~10%. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
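For readers who want to see the difference, here is a small sketch of the two approaches. The field names other than mime_type, the `languages` lookup, and the rule that only results with an encoding default to text/plain are assumptions made for illustration; the real `_fill_metadata()` may differ.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DetectionResult:
    # Assumed shape of the 4-field frozen dataclass mentioned above.
    encoding: Optional[str]
    confidence: float
    language: Optional[str] = None
    mime_type: Optional[str] = None


def fill_one_slow(r: DetectionResult, language: Optional[str], mime_type: Optional[str]) -> DetectionResult:
    # dataclasses.replace() introspects the fields on every call.
    return replace(r, language=language, mime_type=mime_type)


def fill_metadata(results: List[DetectionResult], languages: Dict[str, str]) -> List[DetectionResult]:
    """Fill language and mime_type defaults in a single pass over results."""
    filled = []
    for r in results:
        language = r.language or languages.get(r.encoding)
        # Assumption: only text results (encoding is set) default to text/plain.
        mime_type = r.mime_type or ("text/plain" if r.encoding is not None else None)
        if language is r.language and mime_type is r.mime_type:
            filled.append(r)  # nothing to fill, reuse the existing object
        else:
            # Direct construction avoids the per-call overhead of replace().
            filled.append(DetectionResult(r.encoding, r.confidence, language, mime_type))
    return filled
```

Both paths produce equal objects; the direct constructor call simply skips the generic machinery, which matters when the function runs once per candidate result.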
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff            @@
##              main     #350    +/-   ##
==========================================
  Coverage   100.00%  100.00%
==========================================
  Files           23       24      +1
  Lines         1449     1554    +105
==========================================
+ Hits          1449     1554    +105
ftyp: validate that box size (bytes 0-3, big-endian uint32) is between 8 and len(data). Text files like "The ftypeface..." have ASCII bytes in positions 0-3 producing box sizes in the billions, far exceeding any real input length. ZIP: advance scan offset past extra field and file content (when compressed_size is available) to avoid matching PK\x03\x04 signatures that appear inside stored file data. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
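The ftyp sanity check can be illustrated with a few lines. This is a sketch under the assumption that the check is just "bytes 4-7 are ftyp and the big-endian box size in bytes 0-3 is between 8 and len(data)"; the function name is hypothetical.

```python
import struct


def has_plausible_ftyp_box(data: bytes) -> bool:
    """True if data starts with an ftyp box whose declared size is believable."""
    if len(data) < 8 or data[4:8] != b"ftyp":
        return False
    # Bytes 0-3 hold the box size as a big-endian uint32.
    (box_size,) = struct.unpack_from(">I", data, 0)
    # A box cannot be smaller than its own 8-byte header or larger than the input.
    return 8 <= box_size <= len(data)
```

For text such as "The ftypeface...", bytes 0-3 are the ASCII characters "The ", which decode to a size above one billion, so the check rejects it.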
…rquet, Arrow

New formats detected:
- Images: HEIC/HEIF (ftyp brands), JPEG XL (container + codestream), PSD (Photoshop), QOI
- Audio: MIDI, AIFF/AIFC (FORM container)
- Video: QuickTime (ftyp qt brand)
- Fonts: OTF (OpenType CFF)
- Bytecode: Android DEX
- Data: Apache Parquet, Apache Arrow IPC

AVIF now returns image/heif (shared HEIF container format). Skipped Java class files (cafebabe conflicts with Mach-O fat binary). Skipped TTF (signature conflicts with ICO).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TTF (\x00\x01\x00\x00) doesn't actually conflict with ICO (\x00\x00\x01\x00). Java class files and Mach-O fat binaries share \xca\xfe\xba\xbe but are distinguished by bytes 4-7: Mach-O fat has nfat_arch (2-5), Java has minor+major version (major >= 45 for Java 1.1+). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
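A rough sketch of the \xca\xfe\xba\xbe disambiguation follows. The function name, the returned labels, and the exact cut-off are assumptions: the commit only says that Mach-O fat binaries carry a small nfat_arch in bytes 4-7 while Java class files carry a minor and major version with major >= 45.

```python
import struct
from typing import Optional


def classify_cafebabe(data: bytes) -> Optional[str]:
    """Tell Java class files from Mach-O fat binaries using bytes 4-7."""
    if len(data) < 8 or data[:4] != b"\xca\xfe\xba\xbe":
        return None
    # Java layout: minor version (bytes 4-5), major version (bytes 6-7).
    minor, major = struct.unpack_from(">HH", data, 4)
    if major >= 45:
        return "java-class"  # hypothetical label, not a MIME type
    # Mach-O fat layout: nfat_arch as one big-endian uint32 in bytes 4-7.
    nfat_arch = (minor << 16) | major
    if 1 <= nfat_arch < 45:
        return "mach-o-fat"  # hypothetical label, not a MIME type
    return None
```

Because real fat binaries list only a handful of architectures, the two value ranges do not overlap in practice.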
AVIF and HEIF are different codecs in the same container family with distinct IANA-registered MIME types. Split the ftyp brand sets so avif/avis brands return image/avif and heic/heix/mif1/msf1 return image/heif. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
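The brand split might look something like the sketch below. It assumes the major brand sits in bytes 8-11 of the ftyp box and uses exactly the brand sets named in this commit message; the set and function names are hypothetical, and later commits may have adjusted how the generic brands are handled.

```python
from typing import Optional

# Brand sets as listed in the commit message above.
_AVIF_BRANDS = frozenset({b"avif", b"avis"})
_HEIF_BRANDS = frozenset({b"heic", b"heix", b"mif1", b"msf1"})


def mime_from_ftyp_brand(data: bytes) -> Optional[str]:
    """Map the ftyp major brand to image/avif or image/heif, else None."""
    if len(data) < 12 or data[4:8] != b"ftyp":
        return None
    brand = data[8:12]  # major brand follows the 4-byte size and "ftyp" tag
    if brand in _AVIF_BRANDS:
        return "image/avif"
    if brand in _HEIF_BRANDS:
        return "image/heif"
    return None
```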
…ric mif1/msf1 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- application/x-parquet → application/vnd.apache.parquet (IANA 2024-02-14)
- application/x-apache-arrow-file → application/vnd.apache.arrow.file (IANA 2021-06-23)
- image/x-icon → image/vnd.microsoft.icon (IANA 2003-09-03)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Adds a mime_type field to DetectionResult and DetectionDict so callers can tell what kind of file they're looking at, not just whether it's binary or text. Binary files get identified via magic number prefix matching in a new pipeline/magic.py module. Text files get appropriate MIME types from the pipeline stage that identified them (text/html, text/xml, text/x-python) or default to text/plain.

This came out of the fact that detect() was returning encoding=None for binary files with no further information. Now you get mime_type="image/png" instead of just "it's binary."

What changed

- DetectionResult / DetectionDict: New mime_type: str | None = None field, backward-compatible (existing construction sites unaffected)
- pipeline/magic.py: Magic number lookup table covering 40+ formats. Fixed-offset prefix matching only, no deep analysis. Includes: …
- … text/html, text/xml, or text/x-python based on which pattern matched
- _fill_metadata() (merged from the old _fill_language()) fills defaults at the API boundary
- Replaced dataclasses.replace() with direct construction on hot paths, eliminating ~354k function calls per full test suite run
- … file(1). Three incorrect types fixed (Parquet, Arrow, ICO now use IANA-registered types). Unregistered types (image/qoi, application/vnd.android.dex, application/x-wheel+zip) confirmed in widespread use (VLC, GIMP, JupyterLite, etc.)

Performance

Benchmarked against chardet 7.2.0 with mypyc (--mypyc mode).

The dataclasses.replace() cleanup is a net improvement over baseline, since we were already paying that cost in _fill_language() before this PR.
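To make the "fixed-offset prefix matching only" idea from the summary concrete, here is a tiny sketch of a lookup table of that kind. The `_MAGIC_NUMBERS` name appears in the commit messages, but the (signature, MIME type) entry shape, the handful of well-known signatures chosen here, and the `detect_magic` function name are assumptions rather than the module's actual contents.

```python
from typing import Optional

# A few widely documented file signatures, each matched at offset 0.
_MAGIC_NUMBERS = (
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
)


def detect_magic(data: bytes) -> Optional[str]:
    """Return the MIME type of the first matching signature, or None."""
    for signature, mime_type in _MAGIC_NUMBERS:
        if data.startswith(signature):
            return mime_type
    return None
```

Formats that need more than a prefix comparison (ftyp brands, ZIP entry names, the shared \xca\xfe\xba\xbe signature) are the ones the later commits pull out into standalone checks.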