Description
We hit this in production — certain invoice PDFs produced by PDFKit.NET DMV10 and DocuSign DMV10 take 2–5 seconds on add_page() on a fast machine, and time out entirely in our resource-constrained containers.
These producers pack almost all objects into a single ObjStm. For example one 370 KB, 3-page invoice has 2,162 out of 2,214 objects in compressed object streams.
The bottleneck is _get_object_from_stream — each call linearly scans the full header (all N objnum/offset pairs) to find one object, then returns immediately. When add_page() clones a page and resolves ~2000 indirect references, that's O(N²) total.
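To make the quadratic cost concrete, here is a toy model of the access pattern described above. It is not pypdf's code: `build_header`, `scan_all_then_pick` and `total_pairs_read` are hypothetical names, and the header is modelled as a flat list of integers rather than real stream bytes.

```python
# Toy model only: an object stream header as N (objnum, offset) pairs.
# All names here are hypothetical and do not exist in pypdf.

def build_header(n):
    """Return a flat list of ints: objnum0, offset0, objnum1, offset1, ..."""
    header = []
    for i in range(n):
        header.extend((i + 1, i * 10))
    return header


def scan_all_then_pick(header, wanted):
    """Read every pair in the header, then return the offset of `wanted`.

    Returns (offset, pairs_read). pairs_read is always N, no matter
    where the wanted object sits in the header.
    """
    pairs_read = 0
    found = None
    for i in range(0, len(header), 2):
        pairs_read += 1
        if header[i] == wanted:
            found = header[i + 1]
    return found, pairs_read


def total_pairs_read(n):
    """Resolve each of the n objects once and count header pairs read in total."""
    header = build_header(n)
    total = 0
    for objnum in range(1, n + 1):
        offset, pairs = scan_all_then_pick(header, objnum)
        assert offset == (objnum - 1) * 10
        total += pairs
    return total
```

In this model, resolving N objects reads N × N header pairs, so doubling the number of objects in the stream quadruples the work.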
Quick cProfile on the worst file:
77,152,908 function calls in 17.9 seconds
Top by cumulative time:
_get_object_from_stream — 2,159 calls — 17.8s
NumberObject.read_from_stream — 4,614,657 calls — 11.4s
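For anyone who wants to collect a comparable profile on their own file, a minimal sketch using only the standard library follows. `profile_call` is a hypothetical helper name, and the commented usage assumes pypdf is installed and a local PDF is available.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


# Usage against pypdf (assumes a local file with this name):
# from pypdf import PdfReader, PdfWriter
# reader = PdfReader("invoice-with-heavy-objstm.pdf")
# _, report = profile_call(PdfWriter().add_page, reader.pages[0])
# print(report)
```

Note that cProfile adds noticeable per-call overhead, so profiled totals come out higher than plain wall-clock timings of the same call.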
Stepped timing:
PdfReader(): 3 ms
add_page(): 4.5 s (all of the time is here)
writer.write(): 13 ms
I can't share the PDF files unfortunately — they're real client invoices and I wasn't able to properly reconstruct a synthetic one that reproduces the same structure (qpdf distributes objects across many small streams of ~100 each instead of packing them into one large one). The originals were produced by PDFKit.NET 12.3.563.0 DMV10 and Power PDF Create / DocuSign DMV10 if that helps anyone reproduce.
Environment
$ python -m platform
macOS-26.3.1-arm64-arm-64bit
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==6.8.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.3.0
Code
from pypdf import PdfReader, PdfWriter
reader = PdfReader("invoice-with-heavy-objstm.pdf")
writer = PdfWriter()
writer.add_page(reader.pages[0])  # takes 2-5 seconds
The fix is straightforward: on first access to any object in an ObjStm, parse and cache all objects in that stream in one pass. PR: #3677
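A rough sketch of the parse-once-and-cache idea, for illustration only. This is not the code in the PR: `ObjStmCache` and its methods are hypothetical, the stream is passed in as already-decoded bytes, and the sketch assumes the header offsets are in increasing order so each object ends where the next one begins. `n` and `first` stand for the stream dictionary's /N and /First entries.

```python
class ObjStmCache:
    """Hypothetical helper: parse each object stream once, then serve lookups from a dict."""

    def __init__(self):
        self._parsed = {}     # stream id -> {objnum: raw object bytes}
        self.parse_count = 0  # how many full stream parses were needed

    def _parse_stream(self, data, n, first):
        # The header is n whitespace-separated "objnum offset" integer pairs.
        tokens = data[:first].split()
        pairs = [(int(tokens[2 * i]), int(tokens[2 * i + 1])) for i in range(n)]
        objects = {}
        for index, (objnum, offset) in enumerate(pairs):
            start = first + offset
            # Assumption: offsets increase, so the next offset marks the end.
            end = first + pairs[index + 1][1] if index + 1 < n else len(data)
            objects[objnum] = data[start:end].strip()
        self.parse_count += 1
        return objects

    def get(self, stream_id, data, n, first, objnum):
        """Return the raw bytes of objnum, parsing the stream on first access only."""
        if stream_id not in self._parsed:
            self._parsed[stream_id] = self._parse_stream(data, n, first)
        return self._parsed[stream_id][objnum]


# Tiny hand-built stream: three objects numbered 11, 12 and 13.
header = b"11 0 12 3 13 8 "
body = b"42 (hi) true"
cache = ObjStmCache()
value = cache.get("stream-1", header + body, 3, len(header), 12)
```

With this shape, the first lookup costs one pass over the header and every later lookup in the same stream is a dictionary access, so resolving N objects is O(N) overall instead of O(N²).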