_get_object_from_stream is O(N²) on PDFs with large compressed object streams #3676

@dmitry-kostin

Description

We hit this in production — certain invoice PDFs produced by PDFKit.NET DMV10 and DocuSign DMV10 take 2–5 seconds on add_page() on a fast machine, and time out entirely in our resource-constrained containers.

These producers pack almost all objects into a single ObjStm. For example one 370 KB, 3-page invoice has 2,162 out of 2,214 objects in compressed object streams.

The bottleneck is _get_object_from_stream — each call linearly scans the full header (all N objnum/offset pairs) to find one object, then returns immediately. When add_page() clones a page and resolves ~2000 indirect references, that's O(N²) total.
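To make the quadratic growth concrete, here is a toy model of the lookup pattern described above. It is not pypdf's code: `find_offset_linear` and `total_pairs_read` are hypothetical names, and the header is modelled as a plain list of (objnum, offset) pairs rather than a real decoded stream.

```python
def find_offset_linear(pairs, wanted):
    """Scan (objnum, offset) pairs from the start until `wanted` is found.

    Returns the offset and how many pairs had to be read to reach it.
    """
    pairs_read = 0
    for objnum, offset in pairs:
        pairs_read += 1
        if objnum == wanted:
            return offset, pairs_read
    raise KeyError(wanted)


def total_pairs_read(n):
    """Resolve every object of an n-object stream once, one lookup at a time."""
    # Hypothetical header: object numbers 1..n with made-up offsets.
    pairs = [(objnum, objnum * 10) for objnum in range(1, n + 1)]
    return sum(find_offset_linear(pairs, wanted)[1] for wanted in range(1, n + 1))


if __name__ == "__main__":
    for n in (100, 1000, 2000):
        print(n, total_pairs_read(n))
```

Resolving all n objects reads n(n+1)/2 pairs in this model, so about 2 million pairs for n = 2000. With two integers per pair that is roughly 4 million integer reads, the same order of magnitude as the NumberObject.read_from_stream call count in the profile below, assuming most objects sit in one stream.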

Quick cProfile on the worst file:

77,152,908 function calls in 17.9 seconds

Top by cumulative time:
  _get_object_from_stream  — 2,159 calls — 17.8s
  NumberObject.read_from_stream — 4,614,657 calls — 11.4s

Stepped timing:

  • PdfReader() — 3ms
  • add_page() — 4.5s (all of the time is here)
  • writer.write() — 13ms
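For anyone who wants to collect the same kind of stepped timing on their own files, a minimal sketch follows. The `step` helper is a hypothetical name, not part of pypdf; it only wraps `time.perf_counter`.

```python
import time
from contextlib import contextmanager


@contextmanager
def step(label, results):
    """Record the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


if __name__ == "__main__":
    timings = {}
    with step("example", timings):
        sum(range(100_000))  # stand-in for the work being measured
    for label, seconds in timings.items():
        print(f"{label}: {seconds * 1000:.1f} ms")
```

Wrapping the PdfReader(...) construction, the writer.add_page(...) call, and the writer.write(...) call each in their own `with step(...)` block gives one number per stage.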

I can't share the PDF files unfortunately — they're real client invoices and I wasn't able to properly reconstruct a synthetic one that reproduces the same structure (qpdf distributes objects across many small streams of ~100 each instead of packing them into one large one). The originals were produced by PDFKit.NET 12.3.563.0 DMV10 and Power PDF Create / DocuSign DMV10 if that helps anyone reproduce.

Environment

$ python -m platform
macOS-26.3.1-arm64-arm-64bit

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==6.8.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.3.0

Code

from pypdf import PdfReader, PdfWriter

reader = PdfReader("invoice-with-heavy-objstm.pdf")
writer = PdfWriter()
writer.add_page(reader.pages[0])  # takes 2-5 seconds

The fix is straightforward — on first access to any object in an ObjStm, parse and cache all objects in that stream in one pass. PR: #3677
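A rough sketch of the parse-once idea, for readers who don't want to open the PR. This is not the code from the PR and not pypdf's API: `ObjStmIndex` and `get_raw` are hypothetical names, the input is an already-decoded stream, and the sketch assumes the header offsets are in increasing order and relative to /First. It returns raw bytes instead of parsed PDF objects.

```python
class ObjStmIndex:
    """Toy index over a decoded object stream: parse the header once, look up many times."""

    def __init__(self, data, n, first):
        # data:  decoded stream bytes (header followed by the objects)
        # n:     number of objects in the stream (the /N entry)
        # first: byte offset of the first object (the /First entry)
        numbers = [int(tok) for tok in data[:first].split()]
        if len(numbers) < 2 * n:
            raise ValueError("header is shorter than /N claims")
        pairs = list(zip(numbers[0:2 * n:2], numbers[1:2 * n:2]))
        self._data = data
        self._spans = {}
        for i, (objnum, offset) in enumerate(pairs):
            # Assumption: an object ends where the next one starts.
            end = first + pairs[i + 1][1] if i + 1 < n else len(data)
            self._spans[objnum] = (first + offset, end)

    def get_raw(self, objnum):
        """Return the raw bytes of one object with a single dict lookup."""
        start, end = self._spans[objnum]
        return self._data[start:end].strip()


if __name__ == "__main__":
    header = b"11 0 12 5 13 9 "
    body = b"true (hi) 42"
    index = ObjStmIndex(header + body, n=3, first=len(header))
    print(index.get_raw(12))
```

The header is read once per stream, so resolving all N objects costs O(N) in total instead of O(N²). The trade-off is holding one small dict per object stream in memory.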

    Labels

    nf-performance (Non-functional change: Performance)