PI: Batch-parse all objects in ObjStm on first access#3677
Merged
stefan6419846 merged 2 commits into py-pdf:main on Mar 11, 2026
2637bbd to 7207d61
On first access to any object in a compressed object stream, parse and cache ALL objects in that stream in one pass. This avoids O(N²) behavior when many objects from the same stream are resolved individually during add_page(). PDFs produced by DMS tools like PDFKit.NET DMV10 pack ~2000 objects into a single ObjStm. The previous code scanned the full header for each lookup, resulting in ~77M function calls and 5s+ parse times for a 370KB invoice. With batch parsing, the same file completes in ~0.08s (61x speedup).
7207d61 to 88b8268
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #3677      +/-   ##
==========================================
+ Coverage   97.36%   97.39%   +0.02%
==========================================
  Files          55       55
  Lines        9949     9964      +15
  Branches     1825     1829       +4
==========================================
+ Hits         9687     9704      +17
+ Misses        152      151       -1
+ Partials      110      109       -1
stefan6419846 added a commit that referenced this pull request on Mar 15, 2026

## What's new

### New Features (ENH)
- Expose /Perms verification result on Encryption object (#3672) by @costajohnt

### Performance Improvements (PI)
- Fix O(n²) performance in NameObject read/write (#3679) by @dmitry-kostin
- Batch-parse all objects in ObjStm on first access (#3677) by @dmitry-kostin

### Bug Fixes (BUG)
- Avoid sharing array-based content streams between pages (#3681) by @stefan6419846
- Avoid accessing invalid page when inserting blank page under some conditions (#3529) by @j-t-1

[Full Changelog](6.8.0...6.9.0)
astahlman added a commit to astahlman/pypdf that referenced this pull request on Mar 25, 2026

The batch-parse optimization (added in py-pdf#3677) caches every object found when decompressing an object stream. The guard intended to skip overridden objects checked `obj_num in self.xref_objStm`, but this passes for any compressed object, not just ones that belong to the current stream.

In incrementally-updated PDFs, the same object can appear in multiple object streams across revisions (per the PDF 1.7 spec, §7.5.6). The xref designates one stream as authoritative. Decompressing a stale stream (e.g. to read a co-located AcroForm dict) would cache the old version of the object, shadowing the current one.

Fix: only cache when `xref_objStm` points the object at the stream being decompressed.

Closes py-pdf#3697

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
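The guard change described in this commit message can be illustrated with a small standalone sketch. It does not use pypdf itself: `cache_stream_objects`, the plain `cache` dict, and the assumed shape of `xref_objstm` (object number mapped to a `(stream_num, index)` tuple) are hypothetical stand-ins chosen for illustration.

```python
from typing import Any


def cache_stream_objects(
    stream_num: int,
    parsed: dict[int, Any],
    xref_objstm: dict[int, tuple[int, int]],
    cache: dict[int, Any],
) -> None:
    """Cache objects batch-parsed from object stream `stream_num`.

    `xref_objstm` maps objnum -> (authoritative stream number, index).
    This shape is an assumption made for the sketch.
    """
    for objnum, value in parsed.items():
        entry = xref_objstm.get(objnum)
        # A bare `objnum in xref_objstm` check is true for ANY compressed
        # object. The stricter check also requires that the xref points
        # this object at the stream currently being decompressed.
        if entry is None or entry[0] != stream_num:
            continue
        cache.setdefault(objnum, value)  # never overwrite an existing entry


# Object 7 exists in stale stream 100 and in current stream 200.
xref_objstm = {7: (200, 0), 8: (100, 1)}
cache: dict[int, Any] = {}

# Decompressing stream 100 (to reach object 8) must not cache the old object 7.
cache_stream_objects(100, {7: "old", 8: "acroform"}, xref_objstm, cache)
# Decompressing stream 200 caches the authoritative object 7.
cache_stream_objects(200, {7: "new"}, xref_objstm, cache)
```

With the weaker membership-only guard, the first call would have stored `"old"` for object 7, and the later authoritative value would then be shadowed.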
stefan6419846 pushed a commit that referenced this pull request on Mar 26, 2026

The batch-parse optimization (added in #3677) caches every object found when decompressing an object stream. The guard intended to skip overridden objects checked `obj_num in self.xref_objStm`, but this passes for any compressed object, not just ones that belong to the current stream.

In incrementally-updated PDFs, the same object can appear in multiple object streams across revisions (per the PDF 1.7 spec, §7.5.6). The xref designates one stream as authoritative. Decompressing a stale stream (e.g. to read a co-located AcroForm dict) would cache the old version of the object, shadowing the current one.

Fix: only cache when `xref_objStm` points the object at the stream being decompressed.

Closes #3697.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes #3676
On first access to any object in a compressed object stream, parse and cache ALL objects in that stream in one pass. This turns O(N²) lookups into O(N).
The problem: `_get_object_from_stream` currently scans the full header (all N objnum/offset pairs) for each call, but only parses the requested object. When `add_page()` clones a page referencing ~2000 objects from the same stream, the total cost is O(N²).

The fix (two parts):
- `_get_object_from_stream`: Split into two phases. Phase 1 reads all (objnum, offset) pairs from the header; Phase 2 seeks to each offset, then parses and caches the object. Already-cached objects are skipped.
- `get_object`: Added a guard to skip the redundant `cache_indirect_object()` call when the batch parse already cached the object.

Benchmarks on real production invoices (PDFKit.NET DMV10, ~2200 ObjStm objects):
Can't include the test PDFs unfortunately — they're real client invoices and I couldn't reconstruct a synthetic one with the same structure (qpdf distributes objects across many small streams instead of packing into one large one). The originals were produced by PDFKit.NET DMV10 and DocuSign DMV10.
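For readers who want to see the two-phase idea in isolation, here is a minimal sketch. It is not the pypdf implementation: `build_objstm`, `parse_objstm`, and `ToyReader` are hypothetical names, objects are kept as raw byte slices rather than parsed PDF objects, and the stream is assumed to be already decompressed, with `n` and `first` playing the roles of the stream dictionary's /N and /First entries.

```python
def build_objstm(objects: dict[int, bytes]) -> tuple[bytes, int, int]:
    """Build a toy decoded object stream: returns (data, n, first)."""
    body = b""
    header_parts = []
    for objnum, raw in objects.items():
        header_parts.append(f"{objnum} {len(body)}")  # offset relative to `first`
        body += raw + b" "
    header = " ".join(header_parts).encode() + b" "
    return header + body, len(objects), len(header)


def parse_objstm(data: bytes, n: int, first: int) -> dict[int, bytes]:
    """Batch-parse a decoded object stream into {objnum: raw object bytes}."""
    # Phase 1: read all N (objnum, offset) pairs from the header in one pass.
    tokens = data[:first].split()
    pairs = [(int(tokens[2 * i]), int(tokens[2 * i + 1])) for i in range(n)]

    # Phase 2: slice out every object. Each object ends where the next one
    # starts; the last one runs to the end of the data.
    by_offset = sorted(pairs, key=lambda p: p[1])
    ends = [off for _, off in by_offset[1:]] + [len(data) - first]
    return {
        objnum: data[first + start:first + end].strip()
        for (objnum, start), end in zip(by_offset, ends)
    }


class ToyReader:
    """Caches every object of a stream on the first access to any of them."""

    def __init__(self, streams: dict[int, tuple[bytes, int, int]]) -> None:
        self.streams = streams
        self.cache: dict[int, bytes] = {}
        self.header_scans = 0  # counts how often a header is actually read

    def get_from_stream(self, stream_num: int, objnum: int) -> bytes:
        if objnum not in self.cache:
            data, n, first = self.streams[stream_num]
            self.header_scans += 1
            for num, raw in parse_objstm(data, n, first).items():
                self.cache.setdefault(num, raw)  # skip already-cached objects
        return self.cache[objnum]


stream = build_objstm({10: b"<< /Type /Font >>", 11: b"42", 12: b"(hello)"})
reader = ToyReader({0: stream})
values = [reader.get_from_stream(0, num) for num in (11, 10, 12)]
```

After the first lookup, the remaining lookups are plain dictionary hits, so the header is scanned once per stream instead of once per object.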