Closed
Labels
nf-performance (Non-functional change: Performance)
Description
PDFs produced by some Apple software can contain Name objects with repeatedly mis-encoded UTF-8 characters. For example, the German word "Hauptbeschäftigung" gets re-encoded multiple times, turning a 20-byte name into a ~786KB name with 262,144 hex escape sequences.
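The size figures can be reproduced with a small sketch. It assumes each mis-encoding round decodes the UTF-8 bytes as Latin-1 and encodes them as UTF-8 again, which doubles every non-ASCII byte; the issue does not say which encoding is actually involved. Under that assumption, the 2-byte "ä" needs 17 rounds to reach 262,144 bytes. `misencode` and `escape_name` are hypothetical helpers, not pypdf functions, and the escaping rule is simplified.

```python
def misencode(data: bytes, rounds: int) -> bytes:
    """Assumed failure mode: UTF-8 bytes read as Latin-1, then re-encoded as UTF-8."""
    for _ in range(rounds):
        data = data.decode("latin-1").encode("utf-8")
    return data


def escape_name(raw: bytes) -> bytes:
    """Simplified PDF name escaping: '#' and non-printable/non-ASCII bytes become #XX."""
    parts = [b"/"]
    for byte in raw:
        if 0x21 <= byte <= 0x7E and byte != 0x23:
            parts.append(bytes([byte]))
        else:
            parts.append(b"#%02X" % byte)
    return b"".join(parts)


word = "Hauptbesch\u00e4ftigung".encode("utf-8")  # 19 bytes, 20 with the leading "/"
name = escape_name(misencode(word, rounds=17))

print(name.count(b"#"))  # 262144 hex escape sequences
print(len(name))         # 786450 bytes, roughly 786 KB
```

Each round doubles only the non-ASCII portion, so the 17 ASCII letters stay untouched while the single "ä" accounts for almost the entire name.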
When calling PdfWriter.add_page() on such a PDF, three functions exhibit O(n²) behavior and effectively hang:
- `read_until_regex`: re-scans the entire accumulated buffer (`name + tok`) on every 16-byte read
- `NameObject.unnumber`: rebuilds the full bytes object on each `#xx` replacement
- `NameObject.renumber`: uses `out +=` concatenation in a loop
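For the first item, a minimal sketch of a reader that searches only the newly read chunk and collects chunks in a list. This is not pypdf's implementation: `read_until_regex_linear` is a hypothetical name, and the sketch assumes the pattern matches a single byte (such as a delimiter class), so a match can never straddle two chunks.

```python
import io
import re


def read_until_regex_linear(stream, pattern, chunk_size=16):
    """Read from stream until pattern matches; return the bytes before the match.

    Assumption: pattern matches exactly one byte, so searching each chunk
    on its own cannot miss a match that spans a chunk boundary.
    """
    parts = []
    while True:
        tok = stream.read(chunk_size)
        if not tok:
            break  # end of stream, no delimiter found
        match = pattern.search(tok)  # scans only the new chunk
        if match is not None:
            parts.append(tok[: match.start()])
            # Rewind so the delimiter is the next byte to be read.
            stream.seek(match.start() - len(tok), io.SEEK_CUR)
            break
        parts.append(tok)
    return b"".join(parts)  # single O(n) concatenation


delimiter = re.compile(rb"[\s/\[\]()<>]")
stream = io.BytesIO(b"abcdefghijklmnopqrstuvwxyz /next")
token = read_until_regex_linear(stream, delimiter)
rest = stream.read()
```

Every byte is searched once and copied once, so the total work grows linearly with the length of the name.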
pikepdf/QPDF handles the same file in milliseconds because its C++ parser reads names in a single O(n) pass.
The fix is straightforward: use list accumulation + join() instead of repeated concatenation in all three functions.
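As an illustration of the list + `join()` pattern applied to `#xx` decoding, here is a minimal sketch. `unescape_name_linear` is a hypothetical stand-alone function, not the actual `NameObject.unnumber` code; it only shows the shape of a single-pass version.

```python
HEX_DIGITS = b"0123456789abcdefABCDEF"


def unescape_name_linear(raw: bytes) -> bytes:
    """Decode #xx escapes in one pass, collecting pieces in a list."""
    parts = []
    i, n = 0, len(raw)
    literal_start = 0  # start of the current run of unescaped bytes
    while i < n:
        if (
            raw[i] == 0x23  # '#'
            and i + 2 < n
            and raw[i + 1] in HEX_DIGITS
            and raw[i + 2] in HEX_DIGITS
        ):
            parts.append(raw[literal_start:i])                 # literal run so far
            parts.append(bytes([int(raw[i + 1 : i + 3], 16)]))  # decoded byte
            i += 3
            literal_start = i
        else:
            i += 1
    parts.append(raw[literal_start:])  # trailing literal run
    return b"".join(parts)             # one final concatenation


decoded = unescape_name_linear(b"/Hauptbesch#C3#A4ftigung")
```

Each escape costs a constant amount of work and the output is assembled once at the end, instead of copying the whole buffer for every replacement.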