_get_object_from_stream is O(N²) on PDFs with large compressed object streams

We hit this in production: certain invoice PDFs produced by PDFKit.NET DMV10 and DocuSign DMV10 take 2–5 seconds in `add_page()` on a fast machine, and time out entirely in our resource-constrained containers.

These producers pack almost all objects into a single ObjStm. For example, one 370 KB, 3-page invoice has 2,162 of its 2,214 objects in compressed object streams.

The bottleneck is `_get_object_from_stream`: each call linearly scans the stream header (up to all N objnum/offset pairs) to find one object, returns it, and keeps nothing for the next call. When `add_page()` clones a page and resolves ~2,000 indirect references, the total work is O(N²).
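To make the quadratic growth concrete, here is a toy model of that lookup pattern. It is not pypdf's code: `build_header`, `lookup_linear`, and `total_pairs_examined` are hypothetical names, the header is a plain list of (objnum, offset) tuples, and the scan is assumed to stop at the matching pair.

```python
def build_header(n):
    """Toy ObjStm header: n (object number, byte offset) pairs."""
    return [(objnum, objnum * 10) for objnum in range(1, n + 1)]


def lookup_linear(header, wanted):
    """Scan pairs from the start until `wanted` is found.

    Returns (offset, pairs_examined).
    """
    for examined, (objnum, offset) in enumerate(header, start=1):
        if objnum == wanted:
            return offset, examined
    raise KeyError(wanted)


def total_pairs_examined(n):
    """Resolve every object once, restarting the scan on each lookup."""
    header = build_header(n)
    return sum(lookup_linear(header, objnum)[1] for objnum in range(1, n + 1))


if __name__ == "__main__":
    for n in (100, 1_000, 2_162):
        print(n, total_pairs_examined(n))
```

Under these assumptions the total is N(N+1)/2 pairs, so N = 2,162 gives 2,338,203 pairs. At two numbers per pair that is roughly 4.7 million number reads, which is in the same range as the `NumberObject.read_from_stream` call count in the profile below (the profiled file may not be this exact invoice).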

Quick cProfile on the worst file:

```
77,152,908 function calls in 17.9 seconds

Top by cumulative time:
  _get_object_from_stream  — 2,159 calls — 17.8s
  NumberObject.read_from_stream — 4,614,657 calls — 11.4s
```

Stepped timing:
- `PdfReader()` — 3ms
- `add_page()` — **4.5s** (all of the time is here)
- `writer.write()` — 13ms
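For anyone who wants to take the same kind of stepped timing on their own files, a minimal sketch follows. It is not necessarily how the numbers above were measured: `timed` and `time_add_page` are hypothetical helpers, and the `add_page()` step here also includes constructing the `PdfWriter` and accessing `reader.pages[0]`.

```python
import io
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def time_add_page(path):
    """Time the three steps for one file; `path` is any local PDF."""
    # Imported here so the timing helper above works without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    results = {}
    with timed("PdfReader()", results):
        reader = PdfReader(path)
    with timed("add_page()", results):
        writer = PdfWriter()
        writer.add_page(reader.pages[0])
    with timed("writer.write()", results):
        writer.write(io.BytesIO())  # write to memory to keep disk I/O out
    return results
```

Calling `time_add_page("some-file.pdf")` returns a dict mapping each step label to seconds elapsed.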

Unfortunately I can't share the PDF files: they're real client invoices, and I wasn't able to reconstruct a synthetic file that reproduces the same structure (qpdf distributes objects across many small streams of ~100 objects each instead of packing them into one large one). The originals were produced by **PDFKit.NET 12.3.563.0 DMV10** and **Power PDF Create / DocuSign DMV10**, if that helps anyone reproduce.

## Environment

```bash
$ python -m platform
macOS-26.3.1-arm64-arm-64bit

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==6.8.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.3.0
```

## Code

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("invoice-with-heavy-objstm.pdf")
writer = PdfWriter()
writer.add_page(reader.pages[0])  # takes 2-5 seconds
```

The fix is straightforward: on first access to any object in an ObjStm, parse and cache all objects in that stream in one pass. PR: #3677
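The idea can be sketched on a toy model as follows. This is not the code from the PR: `ToyObjectStream` and `make_stream` are hypothetical, the header is a plain list of (objnum, offset) pairs, and "parsing" an object is reduced to a dict lookup by offset.

```python
class ToyObjectStream:
    """Hypothetical stand-in for a decoded ObjStm; not pypdf's real class."""

    def __init__(self, header, bodies):
        self.header = header      # list of (objnum, offset) pairs
        self.bodies = bodies      # offset -> already-decoded object
        self.pairs_examined = 0   # instrumentation for this demo only
        self._cache = None        # filled on first access

    def _parse_all(self):
        """Walk the header exactly once and index every object by number."""
        cache = {}
        for objnum, offset in self.header:
            self.pairs_examined += 1
            cache[objnum] = self.bodies[offset]
        return cache

    def get(self, objnum):
        """Return one object; the first call pays for parsing the stream."""
        if self._cache is None:
            self._cache = self._parse_all()
        return self._cache[objnum]


def make_stream(n):
    """Build a toy stream holding n objects numbered 1..n."""
    header = [(objnum, objnum * 10) for objnum in range(1, n + 1)]
    bodies = {offset: f"object {objnum}" for objnum, offset in header}
    return ToyObjectStream(header, bodies)


if __name__ == "__main__":
    stream = make_stream(2_162)
    for objnum in range(1, 2_163):
        stream.get(objnum)
    print(stream.pairs_examined)  # one pass over the header, not one per lookup
```

In this model, resolving all N objects examines N header pairs in total, and every lookup after the first is a dictionary access. The trade-off is memory: all objects in the stream stay cached even if only a few are ever requested.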
