PI: Batch-parse all objects in ObjStm on first access #3677

Merged
stefan6419846 merged 2 commits into py-pdf:main from dmitry-kostin:fix-objstm-batch-parse
Mar 11, 2026

Conversation

@dmitry-kostin
Contributor

Fixes #3676

On first access to any object in a compressed object stream, parse and cache ALL objects in that stream in one pass. This turns O(N²) lookups into O(N).

The problem: _get_object_from_stream currently scans the full header (all N objnum/offset pairs) for each call, but only parses the requested object. When add_page() clones a page referencing ~2000 objects from the same stream, the total cost is O(N²).

The fix (two parts):

  1. _get_object_from_stream: Split into two phases — Phase 1 reads all (objnum, offset) pairs from the header, Phase 2 seeks to each offset and parses + caches. Already-cached objects are skipped.
  2. get_object: Added a guard to skip redundant cache_indirect_object() when the batch parse already cached the object.
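The two phases described above can be sketched in isolation. This is a minimal illustration, not the pypdf implementation: `batch_parse_objstm`, its parameters, and the plain `dict` cache are hypothetical names, the header offsets are assumed to be ascending, and each object body is sliced out as raw bytes where the real code would run a proper PDF object parser.

```python
from typing import Dict, List, Tuple


def batch_parse_objstm(
    data: bytes, n: int, first: int, cache: Dict[int, bytes]
) -> None:
    """Parse every object in a decoded object stream and cache it.

    data  -- decoded stream bytes: header, then the object bodies
    n     -- number of objects in the stream (the /N entry)
    first -- byte offset of the first object body (the /First entry)
    cache -- hypothetical stand-in for the reader's object cache
    """
    # Phase 1: read all (objnum, offset) pairs from the header once.
    tokens = data[:first].split()
    pairs: List[Tuple[int, int]] = [
        (int(tokens[2 * i]), int(tokens[2 * i + 1])) for i in range(n)
    ]

    # Phase 2: visit each offset, "parse" the object, and cache it.
    # Objects that are already cached are skipped.
    for i, (objnum, offset) in enumerate(pairs):
        if objnum in cache:
            continue
        start = first + offset
        # Assumes ascending offsets: an object ends where the next begins.
        end = first + pairs[i + 1][1] if i + 1 < n else len(data)
        cache[objnum] = data[start:end].strip()


# Toy stream: objects 10, 11 and 12 at offsets 0, 5 and 9.
header = b"10 0 11 5 12 9 "
stream = header + b"true 123 (hi)"
demo_cache: Dict[int, bytes] = {11: b"already cached"}
batch_parse_objstm(stream, n=3, first=len(header), cache=demo_cache)
```

After one call, every object in the stream is in the cache, so later lookups never touch the header again. The pre-cached entry for object 11 is left as it was.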

Benchmarks on real production invoices (PDFKit.NET DMV10, ~2200 ObjStm objects):

| PDF | Before | After | Speedup |
|---|---|---|---|
| 298 KB, 4 pages, 1468 ObjStm | 2.29s | 0.029s | 78x |
| 370 KB, 3 pages, 2162 ObjStm | 4.95s | 0.081s | 61x |

Can't include the test PDFs unfortunately — they're real client invoices and I couldn't reconstruct a synthetic one with the same structure (qpdf distributes objects across many small streams instead of packing into one large one). The originals were produced by PDFKit.NET DMV10 and DocuSign DMV10.

On first access to any object in a compressed object stream,
parse and cache ALL objects in that stream in one pass.
This avoids O(N²) behavior when many objects from the same
stream are resolved individually during add_page().

PDFs produced by DMS tools like PDFKit.NET DMV10 pack ~2000
objects into a single ObjStm. The previous code scanned the
full header for each lookup, resulting in ~77M function calls
and 5s+ parse times for a 370KB invoice. With batch parsing,
the same file completes in ~0.08s (61x speedup).
@dmitry-kostin dmitry-kostin force-pushed the fix-objstm-batch-parse branch from 7207d61 to 88b8268 Compare March 10, 2026 16:42
@dmitry-kostin
Contributor Author

dmitry-kostin commented Mar 10, 2026

@stefan6419846 this looks ready, could you rerun that failed step (windows) pls?

@codecov

codecov bot commented Mar 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.39%. Comparing base (2cfcd7e) to head (649cb2a).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3677      +/-   ##
==========================================
+ Coverage   97.36%   97.39%   +0.02%     
==========================================
  Files          55       55              
  Lines        9949     9964      +15     
  Branches     1825     1829       +4     
==========================================
+ Hits         9687     9704      +17     
+ Misses        152      151       -1     
+ Partials      110      109       -1     

@stefan6419846 stefan6419846 linked an issue Mar 11, 2026 that may be closed by this pull request
Collaborator

@stefan6419846 stefan6419846 left a comment

Thanks.

@stefan6419846 stefan6419846 merged commit cf2e518 into py-pdf:main Mar 11, 2026
18 checks passed
stefan6419846 added a commit that referenced this pull request Mar 15, 2026
## What's new

### New Features (ENH)
- Expose /Perms verification result on Encryption object (#3672) by @costajohnt

### Performance Improvements (PI)
- Fix O(n²) performance in NameObject read/write (#3679) by @dmitry-kostin
- Batch-parse all objects in ObjStm on first access (#3677) by @dmitry-kostin

### Bug Fixes (BUG)
- Avoid sharing array-based content streams between pages (#3681) by @stefan6419846
- Avoid accessing invalid page when inserting blank page under some conditions (#3529) by @j-t-1

[Full Changelog](6.8.0...6.9.0)
astahlman added a commit to astahlman/pypdf that referenced this pull request Mar 25, 2026
The batch-parse optimization (added in py-pdf#3677) caches every object
found when decompressing an object stream. The guard intended to
skip overridden objects checked `obj_num in self.xref_objStm`, but
this passes for any compressed object — not just ones that belong
to the current stream.

In incrementally-updated PDFs, the same object can appear in
multiple object streams across revisions (per the PDF 1.7 spec,
§7.5.6). The xref designates one stream as authoritative.
Decompressing a stale stream (e.g. to read a co-located AcroForm
dict) would cache the old version of the object, shadowing the
current one.

Fix: only cache when `xref_objStm` points the object at the stream
being decompressed.
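
The difference between the two guards can be shown with a small sketch. It assumes that `xref_objStm` maps an object number to a `(stream number, index)` pair. `should_cache` and the sample numbers are hypothetical and only illustrate the check, not the actual patch.

```python
from typing import Dict, Tuple


def should_cache(
    obj_num: int, stream_num: int, xref_objstm: Dict[int, Tuple[int, int]]
) -> bool:
    """Return True only if the xref assigns obj_num to this exact stream."""
    entry = xref_objstm.get(obj_num)
    # entry[0] is the number of the stream the xref treats as authoritative.
    return entry is not None and entry[0] == stream_num


# Object 7 lives in the current stream 20. An older revision also stored
# a copy of it in stream 15, which is now stale.
xref = {7: (20, 0), 8: (20, 1)}

old_guard_on_stale = 7 in xref                 # passes: caches the stale copy
new_guard_on_stale = should_cache(7, 15, xref)  # rejects the stale stream
new_guard_on_current = should_cache(7, 20, xref)
```

With the membership-only check, decompressing stream 15 would cache its outdated copy of object 7. Comparing the stream number rejects it.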

Closes py-pdf#3697

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
stefan6419846 pushed a commit that referenced this pull request Mar 26, 2026

The batch-parse optimization (added in #3677) caches every object
found when decompressing an object stream. The guard intended to
skip overridden objects checked `obj_num in self.xref_objStm`, but
this passes for any compressed object — not just ones that belong
to the current stream.

In incrementally-updated PDFs, the same object can appear in
multiple object streams across revisions (per the PDF 1.7 spec,
§7.5.6). The xref designates one stream as authoritative.
Decompressing a stale stream (e.g. to read a co-located AcroForm
dict) would cache the old version of the object, shadowing the
current one.

Fix: only cache when `xref_objStm` points the object at the stream
being decompressed.

Closes #3697.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

_get_object_from_stream is O(N²) on PDFs with large compressed object streams
reader._get_object_from_stream inefficient