PI: Avoid string concatenation with large embedded base64-encoded images#1350
Conversation
|
The correct behaviour is tested by existing unit tests. I am not sure if you want performance tests on large files in your code base or not. Runtime on a 4mb PDF with a single embedded base64 image was several minutes with the old code, whereas with the new code it is roughly a second. Sadly i did not yet manage to create a sample PDF yet, and the one PDF i have contains sensitive data. |
01c1956 to
a41c497
Compare
Certain PDF libraries do embed images as base64 strings. This causes performance issues in `read_string_from_stream` due to incremental string concatenation, byte by byte. PDF Lib in our case is ``` <xmp:CreatorTool>Canon iR-ADV C256 PDF</xmp:CreatorTool> <pdf:Producer>PDF Annotator 8.0.0.826 [Adobe PSL 1.3e for Canon</pdf:Producer> ```
a41c497 to
28306de
Compare
Codecov ReportBase: 94.63% // Head: 94.63% // No change to project coverage 👍
Additional details and impacted files@@ Coverage Diff @@
## main #1350 +/- ##
=======================================
Coverage 94.63% 94.63%
=======================================
Files 30 30
Lines 5140 5140
Branches 1058 1058
=======================================
Hits 4864 4864
Misses 164 164
Partials 112 112
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
|
Thank you for the PR ❤️ |
Yes, I want that! We have the https://github.com/py-pdf/sample-files repository included as a git submodule to ensure the PyPDF2 repository itself remains small (that is important if people install directly from source). I also want tests that are expected to be slow to be marked with a pytest marker. Currently we only have Additionally, we have a benchmark over several PDF libraries: https://github.com/py-pdf/benchmarks |
|
https://stackoverflow.com/a/65019934/562769 is a micro-benchmark that essentially shows this improvement. |
|
Thank you! It is merged to |
|
Thank you for the speedy merge @MartinThoma. We did not yet manage to create offending testdata, but will give it another try. |
New Features (ENH): - Add rotation property and transfer_rotate_to_content (#1348) Performance Improvements (PI): - Avoid string concatenation with large embedded base64-encoded images (#1350) Bug Fixes (BUG): - Format floats using their intrinsic decimal precision (#1267) Robustness (ROB): - Fix merge_page for pages without resources (#1349) Full Changelog: 2.10.8...2.10.9
|
Note to self: found out how the original PDF with embedded base64 image was created, will file a test later. |
There is a saftey margin of a factor of 10 in both directions, so the test should be fairly stable. Tests py-pdf#1350.
Source: py-pdf/pypdf#1350 (comment) Co-authored-by: Michael Karlen <michael.karlen@gmail.com>
|
I've added the example file: https://github.com/py-pdf/sample-files |
There is a saftey margin of a factor of 10 in both directions, so the test should be fairly stable. Tests #1350. Co-authored-by: Michael Karlen <michael.karlen@gmail.com>
Certain PDF libraries do embed images as base64 strings. This causes performance issues in
read_string_from_streamdue to incremental string concatenation, byte by byte.PDF Lib in our case is