PdfWriter.add_page() hangs on PDFs with pathologically encoded Name objects #3678

@dmitry-kostin

Description

PDFs produced by some Apple software can contain Name objects whose UTF-8 characters have been mis-encoded repeatedly. For example, the German word "Hauptbeschäftigung" gets re-encoded multiple times, turning a 20-byte name into a ~786 KB name containing 262,144 hex escape sequences (#xx, 3 bytes each).
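To make the size arithmetic concrete, here is a minimal sketch that reproduces the numbers above. It assumes the mis-encoding is the classic "UTF-8 bytes misread as Latin-1, then written out as UTF-8 again" round trip, which doubles the non-ASCII bytes each time; the actual mechanism in the affected Apple software is not confirmed. The helper names are hypothetical.

```python
def reencode_once(data: bytes) -> bytes:
    """Misread UTF-8 bytes as Latin-1, then encode the result as UTF-8 again.
    (Assumed mechanism; every byte >= 0x80 becomes two bytes.)"""
    return data.decode("latin-1").encode("utf-8")


def hex_escape(data: bytes) -> bytes:
    """Write each non-ASCII byte as a PDF name #xx escape; keep ASCII as-is."""
    return b"".join(b"#%02X" % b if b >= 0x80 else bytes([b]) for b in data)


def blown_up_name(word: str, rounds: int) -> bytes:
    """Build the escaped name after `rounds` mis-encoding round trips."""
    data = word.encode("utf-8")
    for _ in range(rounds):
        data = reencode_once(data)
    return b"/" + hex_escape(data)


original = b"/" + "Hauptbeschäftigung".encode("utf-8")
name = blown_up_name("Hauptbeschäftigung", rounds=17)

print(len(original))      # size of the clean name in bytes
print(name.count(b"#"))   # number of #xx escape sequences
print(len(name))          # total size of the escaped name in bytes
```

Under this assumption, 17 round trips turn the 2-byte "ä" into 2^18 = 262,144 bytes, and escaping each as #xx gives 262,144 × 3 = 786,432 bytes of escapes.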

When calling PdfWriter.add_page() on such a PDF, three functions exhibit O(n²) behavior and effectively hang:

  1. read_until_regex — re-scans the entire accumulated buffer (name + tok) on every 16-byte read
  2. NameObject.unnumber — rebuilds the full bytes object on each #xx replacement
  3. NameObject.renumber — uses out += concatenation in a loop
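For the first item, the quadratic cost comes from searching the whole accumulated buffer after every small read. A hedged sketch of a linear alternative follows; it is not pypdf's code. `read_name_linear` and `NAME_END` are hypothetical names, the leading "/" is assumed to be consumed already, and the terminator pattern is assumed to match a single byte, so a match can never straddle two chunks.

```python
import io
import re
from typing import BinaryIO

# Bytes that end a PDF name: whitespace and delimiter characters.
NAME_END = re.compile(rb"[\x00\t\n\x0c\r ()<>\[\]{}/%]")


def read_name_linear(stream: BinaryIO, chunk_size: int = 16) -> bytes:
    """Read up to (not including) the first terminator byte.

    Each chunk is searched once and stored in a list, so total work is
    proportional to the name length rather than its square.
    """
    chunks = []
    while True:
        tok = stream.read(chunk_size)
        if not tok:
            break  # end of stream: the name runs to EOF
        match = NAME_END.search(tok)  # scan only the newly read bytes
        if match:
            chunks.append(tok[: match.start()])
            # Rewind so the terminator is the next byte the caller reads.
            stream.seek(match.start() - len(tok), 1)
            break
        chunks.append(tok)
    return b"".join(chunks)


stream = io.BytesIO(b"Hauptbesch#C3#A4ftigung 42 0 R")
print(read_name_linear(stream))
print(stream.read())
```

A multi-byte terminator pattern would need a small overlap carried between chunks, which this sketch does not handle.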

pikepdf/QPDF handles the same file in milliseconds because its C++ parser reads names in a single O(n) pass.

The fix is straightforward: use list accumulation + join() instead of repeated concatenation in all three functions.
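As an illustration of the list accumulation + join() pattern for the two name conversions, here is a minimal sketch. It is not the actual pypdf implementation: `unnumber_linear` and `renumber_linear` are hypothetical stand-ins that work on the raw name bytes without the leading "/", and the set of escaped bytes is an assumption based on the PDF rule that delimiters, "#", and bytes outside 0x21 to 0x7E are written as #xx.

```python
import re

_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")
_MUST_ESCAPE = b"()<>[]{}/%#"  # delimiters plus the escape character itself


def unnumber_linear(name: bytes) -> bytes:
    """Decode #xx escapes in one pass, collecting pieces in a list."""
    parts = []
    pos = 0
    for match in _HEX_ESCAPE.finditer(name):
        parts.append(name[pos:match.start()])                # literal run
        parts.append(bytes([int(match.group(1), 16)]))       # decoded byte
        pos = match.end()
    parts.append(name[pos:])                                 # trailing run
    return b"".join(parts)


def renumber_linear(raw: bytes) -> bytes:
    """Escape special bytes as #xx, joining once at the end."""
    parts = []
    for byte in raw:
        if byte in _MUST_ESCAPE or byte <= 0x20 or byte >= 0x7F:
            parts.append(b"#%02X" % byte)
        else:
            parts.append(bytes([byte]))
    return b"".join(parts)


raw = "Hauptbeschäftigung".encode("utf-8")
escaped = renumber_linear(raw)
print(escaped)
print(unnumber_linear(escaped) == raw)
```

Both functions touch each input byte a constant number of times and build the result with a single join(), so a name with 262,144 escapes costs time proportional to its length.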

Labels: nf-performance (Non-functional change: Performance)