PI: Fix O(n²) performance in NameObject read/write #3679

Merged
stefan6419846 merged 4 commits into py-pdf:main from dmitry-kostin:fix-name-object-on2-perf on Mar 12, 2026

@dmitry-kostin
Contributor

@dmitry-kostin dmitry-kostin commented Mar 10, 2026

Fixes #3678

  • read_until_regex: search only new chunk instead of rescanning entire buffer; use list accumulation instead of bytes concatenation
  • NameObject.unnumber: use bytearray instead of rebuilding bytes on each #xx replacement
  • NameObject.renumber: use parts.append() + join() instead of out +=
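
The two NameObject changes listed above follow the same pattern: collect output in a mutable buffer and build the result once, instead of rebuilding an immutable `bytes` object per escape. Below is a minimal sketch of that pattern, not pypdf's actual code. The function names `unescape_name` and `escape_name`, and the exact set of bytes that get escaped, are assumptions made for illustration.

```python
HEX_DIGITS = b"0123456789abcdefABCDEF"


def unescape_name(raw: bytes) -> bytes:
    """Decode #xx escapes in a single left-to-right pass (hypothetical helper)."""
    out = bytearray()  # mutable buffer: each append is amortized O(1)
    i = 0
    n = len(raw)
    while i < n:
        # 0x23 is "#"; treat it as an escape only if two hex digits follow.
        if (
            raw[i] == 0x23
            and i + 2 < n
            and raw[i + 1] in HEX_DIGITS
            and raw[i + 2] in HEX_DIGITS
        ):
            out.append(int(raw[i + 1 : i + 3], 16))
            i += 3
        else:
            out.append(raw[i])
            i += 1
    return bytes(out)


def escape_name(name: bytes) -> bytes:
    """Encode special bytes as #XX using a parts list and one join (hypothetical helper)."""
    parts = []
    for byte in name:
        # Assumed rule for this sketch: escape "#", whitespace/control bytes, and non-ASCII.
        if byte == 0x23 or byte <= 0x20 or byte >= 0x7F:
            parts.append(b"#%02X" % byte)
        else:
            parts.append(bytes((byte,)))
    return b"".join(parts)  # a single allocation instead of one copy per "+="
```

For example, `unescape_name(b"/A#20B")` yields `b"/A B"`, and `escape_name` reverses it. Both functions touch each input byte once, so the cost grows linearly with the length of the name.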

A real-world PDF with a ~786KB pathologically encoded name (262,144 hex escapes from repeated UTF-8 mis-encoding) went from hanging indefinitely to completing in ~3 seconds.

@dmitry-kostin force-pushed the fix-name-object-on2-perf branch from b68a96a to 2b7d583 on March 10, 2026 18:36
@dmitry-kostin changed the title from "Fix O(n²) hangs in NameObject read/write" to "PI: Fix O(n²) performance in NameObject read/write" on Mar 10, 2026
@dmitry-kostin force-pushed the fix-name-object-on2-perf branch from 2b7d583 to f629056 on March 10, 2026 18:38
codecov bot commented Mar 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.39%. Comparing base (cf2e518) to head (32d9771).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3679   +/-   ##
=======================================
  Coverage   97.39%   97.39%           
=======================================
  Files          55       55           
  Lines        9964     9977   +13     
  Branches     1829     1830    +1     
=======================================
+ Hits         9704     9717   +13     
  Misses        151      151           
  Partials      109      109           


@dmitry-kostin
Contributor Author

Also added exponential chunk growth in read_until_regex (16 → 32 → 64 → ... → 8192). Saves ~25% on top of the existing fix when benchmarked with add_page() on a large scanned PDF.

And also added more tests for read_until_regex since coverage there was pretty thin.
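
The chunk-growth idea from this comment can be sketched as follows. This is an illustration under stated assumptions, not the merged `read_until_regex`: the function name `read_until`, its parameters, and the rewind-to-delimiter behavior are hypothetical, and searching only the new chunk is safe here only because the pattern is assumed to match a single byte, so a match can never straddle a chunk boundary.

```python
import io
import re


def read_until(stream, pattern, first_chunk=16, max_chunk=8192):
    """Read bytes up to (not including) the first match of a single-byte pattern."""
    parts = []  # accumulate chunks; join once at the end
    chunk_size = first_chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # end of stream without a match
            break
        match = pattern.search(chunk)  # search only the newly read chunk
        if match is not None:
            parts.append(chunk[: match.start()])
            # Rewind so the stream is positioned at the delimiter itself.
            stream.seek(match.start() - len(chunk), io.SEEK_CUR)
            break
        parts.append(chunk)
        # Grow 16 -> 32 -> 64 -> ... and cap at max_chunk.
        chunk_size = min(chunk_size * 2, max_chunk)
    return b"".join(parts)


stream = io.BytesIO(b"A" * 100 + b" rest")
name = read_until(stream, re.compile(rb"\s"))
```

Small first reads keep short tokens cheap, while doubling means a very long token needs far fewer `read()` and `search()` calls than fixed 16-byte chunks would.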

…ames

Three functions had quadratic behavior that caused hangs on PDFs with
extremely long Name objects (e.g. repeatedly mis-encoded UTF-8 names):

- read_until_regex: searched entire accumulated buffer on each 16-byte
  chunk instead of only the new chunk, and used bytes concatenation
- NameObject.unnumber: rebuilt entire bytes object on each # replacement
- NameObject.renumber: used out += concatenation in a loop
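
To see why rescanning the accumulated buffer is quadratic, the sketch below counts how many bytes the regex engine has to look at under each strategy. It assumes, for illustration only, that the name is exactly 786,432 bytes (262,144 escapes of 3 bytes each, consistent with the ~786KB figure in the description) and that no delimiter is found until the end. The function names are hypothetical.

```python
def bytes_scanned_rescanning(total: int, chunk: int = 16) -> int:
    """Bytes examined when every new chunk triggers a search of the whole buffer."""
    scanned = 0
    buffered = 0
    while buffered < total:
        buffered = min(buffered + chunk, total)
        scanned += buffered  # the full buffer is searched again
    return scanned


def bytes_scanned_incremental(total: int) -> int:
    """Bytes examined when only the newly read chunk is searched."""
    return total


NAME_LENGTH = 262_144 * 3  # assumed exact size of the pathological name
old_cost = bytes_scanned_rescanning(NAME_LENGTH)
new_cost = bytes_scanned_incremental(NAME_LENGTH)
```

Under these assumptions the rescanning strategy examines about 1.9 × 10^10 bytes against about 7.9 × 10^5 for the incremental one, which is the gap between a hang and a linear pass.
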
TST: Add read_until_regex coverage tests
@dmitry-kostin force-pushed the fix-name-object-on2-perf branch from 885b246 to 4e74f71 on March 11, 2026 13:13
@stefan6419846 stefan6419846 merged commit 3a4e913 into py-pdf:main Mar 12, 2026
18 checks passed
stefan6419846 added a commit that referenced this pull request Mar 15, 2026
## What's new

### New Features (ENH)
- Expose /Perms verification result on Encryption object (#3672) by @costajohnt

### Performance Improvements (PI)
- Fix O(n²) performance in NameObject read/write (#3679) by @dmitry-kostin
- Batch-parse all objects in ObjStm on first access (#3677) by @dmitry-kostin

### Bug Fixes (BUG)
- Avoid sharing array-based content streams between pages (#3681) by @stefan6419846
- Avoid accessing invalid page when inserting blank page under some conditions (#3529) by @j-t-1

[Full Changelog](6.8.0...6.9.0)

Development

Successfully merging this pull request may close these issues.

PdfWriter.add_page() hangs on PDFs with pathologically encoded Name objects
