PI: Batch-parse all objects in ObjStm on first access by dmitry-kostin · Pull Request #3677 · py-pdf/pypdf

dmitry-kostin · 2026-03-10T16:04:39Z

On first access to any object in a compressed object stream, parse and cache ALL objects in that stream in one pass. This turns O(N²) lookups into O(N).

The problem: _get_object_from_stream currently scans the full header (all N objnum/offset pairs) for each call, but only parses the requested object. When add_page() clones a page referencing ~2000 objects from the same stream, the total cost is O(N²).
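To make the quadratic growth concrete, here is a toy count of header pairs read under each strategy. It is a simplified model, not pypdf code: the function names are hypothetical, and it counts only header reads, assuming every lookup rescans all N pairs as described above.

```python
def header_reads_per_lookup(n: int) -> int:
    """Pairs read when each of n objects triggers its own full header scan."""
    reads = 0
    for _wanted in range(n):      # one call per requested object
        for _pair in range(n):    # each call scans all n header pairs
            reads += 1
    return reads


def header_reads_batch(n: int) -> int:
    """Pairs read when the header is scanned once and every object is cached."""
    return sum(1 for _pair in range(n))


n = 2000  # roughly the stream size mentioned in the description
print(header_reads_per_lookup(n), header_reads_batch(n))
```

The model ignores parsing and seeking costs, so it illustrates only the growth rate, not the measured timings.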

The fix (two parts):

- `_get_object_from_stream`: Split into two phases. Phase 1 reads all (objnum, offset) pairs from the header; Phase 2 seeks to each offset, then parses and caches the object. Already-cached objects are skipped.
- `get_object`: Added a guard to skip the redundant `cache_indirect_object()` call when the batch parse has already cached the object.
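The two phases can be sketched on a toy object stream. This is a minimal illustration under stated assumptions, not the code from this PR: `build_objstm` and `batch_parse` are hypothetical helpers, objects are kept as raw bytes instead of being parsed, the cache is a plain dict, and offsets are assumed to be in increasing order so that each object ends where the next one starts.

```python
def build_objstm(objects: dict[int, bytes]) -> tuple[bytes, int, int]:
    """Pack objects into a toy decompressed ObjStm: header pairs, then bodies.

    Returns (data, n, first), where `first` is the byte offset of the first
    object body, mirroring the /N and /First entries of a real object stream.
    """
    header_parts, body, offset = [], b"", 0
    for objnum, raw in objects.items():
        header_parts.append(f"{objnum} {offset}")
        body += raw + b" "
        offset = len(body)
    header = " ".join(header_parts).encode() + b" "
    return header + body, len(objects), len(header)


def batch_parse(data: bytes, n: int, first: int, cache: dict[int, bytes]) -> None:
    """Fill `cache` with every object in the stream in a single pass."""
    # Phase 1: read all (objnum, offset) pairs from the header once.
    tokens = data[:first].split()
    pairs = [(int(tokens[2 * i]), int(tokens[2 * i + 1])) for i in range(n)]

    # Phase 2: slice each object out by its offset and cache it.
    for i, (objnum, offset) in enumerate(pairs):
        if objnum in cache:
            continue  # already cached: skip, as described above
        end = first + pairs[i + 1][1] if i + 1 < n else len(data)
        cache[objnum] = data[first + offset:end].strip()


data, n, first = build_objstm(
    {10: b"<< /Type /Page >>", 11: b"42", 12: b"(hello)"}
)
cache = {11: b"already cached"}
batch_parse(data, n, first, cache)
print(sorted(cache))
```

After the single call, later lookups for objects 10, 11, and 12 become dictionary hits instead of fresh header scans.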

Benchmarks on real production invoices (PDFKit.NET DMV10, ~2200 ObjStm objects):

| PDF | Before | After | Speedup |
| --- | --- | --- | --- |
| 298 KB, 4 pages, 1468 ObjStm | 2.29s | 0.029s | 78x |
| 370 KB, 3 pages, 2162 ObjStm | 4.95s | 0.081s | 61x |

Can't include the test PDFs unfortunately — they're real client invoices and I couldn't reconstruct a synthetic one with the same structure (qpdf distributes objects across many small streams instead of packing into one large one). The originals were produced by PDFKit.NET DMV10 and DocuSign DMV10.

On first access to any object in a compressed object stream, parse and cache ALL objects in that stream in one pass. This avoids O(N²) behavior when many objects from the same stream are resolved individually during add_page(). PDFs produced by DMS tools like PDFKit.NET DMV10 pack ~2000 objects into a single ObjStm. The previous code scanned the full header for each lookup, resulting in ~77M function calls and 5s+ parse times for a 370KB invoice. With batch parsing, the same file completes in ~0.08s (61x speedup).

dmitry-kostin · 2026-03-10T16:57:30Z

@stefan6419846 ~~this looks ready, could you rerun that failed step (windows) pls?~~

codecov · 2026-03-10T17:10:34Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.39%. Comparing base (2cfcd7e) to head (649cb2a).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3677      +/-   ##
==========================================
+ Coverage   97.36%   97.39%   +0.02%     
==========================================
  Files          55       55              
  Lines        9949     9964      +15     
  Branches     1825     1829       +4     
==========================================
+ Hits         9687     9704      +17     
+ Misses        152      151       -1     
+ Partials      110      109       -1


stefan6419846

Thanks.

@costajohnt

## What's new

### New Features (ENH)
- Expose /Perms verification result on Encryption object (#3672) by @costajohnt

### Performance Improvements (PI)
- Fix O(n²) performance in NameObject read/write (#3679) by @dmitry-kostin
- Batch-parse all objects in ObjStm on first access (#3677) by @dmitry-kostin

### Bug Fixes (BUG)
- Avoid sharing array-based content streams between pages (#3681) by @stefan6419846
- Avoid accessing invalid page when inserting blank page under some conditions (#3529) by @j-t-1

[Full Changelog](6.8.0...6.9.0)

The batch-parse optimization (added in py-pdf#3677) caches every object found when decompressing an object stream. The guard intended to skip overridden objects checked `obj_num in self.xref_objStm`, but this passes for any compressed object, not just ones that belong to the current stream.

In incrementally-updated PDFs, the same object can appear in multiple object streams across revisions (per the PDF 1.7 spec, §7.5.6). The xref designates one stream as authoritative. Decompressing a stale stream (e.g. to read a co-located AcroForm dict) would cache the old version of the object, shadowing the current one.

Fix: only cache when `xref_objStm` points the object at the stream being decompressed.

Closes py-pdf#3697

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
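The difference between the two guards can be sketched as follows. This is an illustration under assumptions, not the actual patch: `xref_objstm` is modeled as a plain dict mapping an object number to a `(stream number, index)` pair, and both function names are hypothetical.

```python
def should_cache_buggy(obj_num: int, xref_objstm: dict[int, tuple[int, int]]) -> bool:
    """Old guard: true for any compressed object, whichever stream owns it."""
    return obj_num in xref_objstm


def should_cache_fixed(
    obj_num: int, stream_num: int, xref_objstm: dict[int, tuple[int, int]]
) -> bool:
    """New guard: true only if the xref points obj_num at this very stream."""
    entry = xref_objstm.get(obj_num)
    return entry is not None and entry[0] == stream_num


# Object 7 was rewritten in a later revision: the xref now says it lives in
# stream 20, but an older copy still sits in stream 5.
xref_objstm = {7: (20, 0)}

print(should_cache_buggy(7, xref_objstm))      # stale copy in stream 5 would be cached
print(should_cache_fixed(7, 5, xref_objstm))   # stale stream is rejected
print(should_cache_fixed(7, 20, xref_objstm))  # authoritative stream is accepted
```

With the stricter check, decompressing stream 5 for some other object no longer shadows the current version of object 7.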


dmitry-kostin mentioned this pull request Mar 10, 2026

_get_object_from_stream is O(N²) on PDFs with large compressed object streams #3676

Closed

dmitry-kostin force-pushed the fix-objstm-batch-parse branch 2 times, most recently from 2637bbd to 7207d61 Compare March 10, 2026 16:27

dmitry-kostin force-pushed the fix-objstm-batch-parse branch from 7207d61 to 88b8268 Compare March 10, 2026 16:42

TST: Add ObjStm batch-parse coverage tests

649cb2a

stefan6419846 linked an issue Mar 11, 2026 that may be closed by this pull request

reader._get_object_from_stream inefficient #3527

Closed

stefan6419846 approved these changes Mar 11, 2026

stefan6419846 merged commit cf2e518 into py-pdf:main Mar 11, 2026
18 checks passed

astahlman mentioned this pull request Mar 25, 2026

Object stream batch-parse caches stale objects from non-authoritative streams #3697

Closed

astahlman mentioned this pull request Mar 25, 2026

BUG: Fix stale object cache from non-authoritative object streams #3698

Merged

stefan6419846 merged 2 commits into py-pdf:main from dmitry-kostin:fix-objstm-batch-parse
