Skip to content

PI: Avoid string concatenation with large embedded base64-encoded images#1350

Merged
MartinThoma merged 1 commit intopy-pdf:mainfrom
mergezalot:fix-read_string_from_stream-performance
Sep 17, 2022
Merged

PI: Avoid string concatenation with large embedded base64-encoded images#1350
MartinThoma merged 1 commit intopy-pdf:mainfrom
mergezalot:fix-read_string_from_stream-performance

Conversation

@mergezalot
Copy link
Copy Markdown
Contributor

Certain PDF libraries do embed images as base64 strings. This causes performance issues in read_string_from_stream due to incremental string concatenation, byte by byte.

PDF Lib in our case is

<xmp:CreatorTool>Canon iR-ADV C256  PDF</xmp:CreatorTool>
<pdf:Producer>PDF Annotator 8.0.0.826 [Adobe PSL 1.3e for Canon</pdf:Producer>

@mergezalot
Copy link
Copy Markdown
Contributor Author

The correct behaviour is tested by existing unit tests. I am not sure if you want performance tests on large files in your code base or not. Runtime on a 4mb PDF with a single embedded base64 image was several minutes with the old code, whereas with the new code it is roughly a second.

Sadly i did not yet manage to create a sample PDF yet, and the one PDF i have contains sensitive data.

@mergezalot mergezalot force-pushed the fix-read_string_from_stream-performance branch from 01c1956 to a41c497 Compare September 16, 2022 11:42
Certain PDF libraries do embed images as base64 strings. This causes performance issues
in `read_string_from_stream` due to incremental string concatenation, byte by byte.

PDF Lib in our case is
```
<xmp:CreatorTool>Canon iR-ADV C256  PDF</xmp:CreatorTool>
<pdf:Producer>PDF Annotator 8.0.0.826 [Adobe PSL 1.3e for Canon</pdf:Producer>

```
@mergezalot mergezalot force-pushed the fix-read_string_from_stream-performance branch from a41c497 to 28306de Compare September 16, 2022 11:43
@codecov
Copy link
Copy Markdown

codecov bot commented Sep 16, 2022

Codecov Report

Base: 94.63% // Head: 94.63% // No change to project coverage 👍

Coverage data is based on head (28306de) compared to base (7c96d13).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1350   +/-   ##
=======================================
  Coverage   94.63%   94.63%           
=======================================
  Files          30       30           
  Lines        5140     5140           
  Branches     1058     1058           
=======================================
  Hits         4864     4864           
  Misses        164      164           
  Partials      112      112           
Impacted Files Coverage Δ
PyPDF2/generic/_utils.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@mergezalot mergezalot changed the title Fix performance issues with large embedded base64 images BUG: fix performance issues with large embedded base64 images Sep 16, 2022
@mergezalot mergezalot changed the title BUG: fix performance issues with large embedded base64 images BUG: fix performance issues with large embedded base64 Sep 16, 2022
@MartinThoma
Copy link
Copy Markdown
Member

Thank you for the PR ❤️

@MartinThoma
Copy link
Copy Markdown
Member

I am not sure if you want performance tests on large files in your code base or not.

Yes, I want that! We have the https://github.com/py-pdf/sample-files repository included as a git submodule to ensure the PyPDF2 repository itself remains small (that is important if people install directly from source).

I also want tests that are expected to be slow to be marked with a pytest marker. Currently we only have external (code), but we could add a slow marker as well.

Additionally, we have a benchmark over several PDF libraries: https://github.com/py-pdf/benchmarks
It might make sense to add a few "special" cases there (PRs are welcome :-) )

@MartinThoma MartinThoma changed the title BUG: fix performance issues with large embedded base64 PERF: Avoid repeated string concatenation with large embedded base64 images Sep 17, 2022
@MartinThoma MartinThoma changed the title PERF: Avoid repeated string concatenation with large embedded base64 images PERF: Avoid string concatenation with large embedded base64-encoded images Sep 17, 2022
@MartinThoma
Copy link
Copy Markdown
Member

https://stackoverflow.com/a/65019934/562769 is a micro-benchmark that essentially shows this improvement.

@MartinThoma MartinThoma added the nf-performance Non-functional change: Performance label Sep 17, 2022
@MartinThoma MartinThoma changed the title PERF: Avoid string concatenation with large embedded base64-encoded images PI: Avoid string concatenation with large embedded base64-encoded images Sep 17, 2022
@MartinThoma MartinThoma merged commit 3be01fd into py-pdf:main Sep 17, 2022
@MartinThoma
Copy link
Copy Markdown
Member

Thank you! It is merged to main and will be part of the next release to PyPI (likely today or tomorrow)

@mergezalot
Copy link
Copy Markdown
Contributor Author

Thank you for the speedy merge @MartinThoma. We did not yet manage to create offending testdata, but will give it another try.

MartinThoma added a commit that referenced this pull request Sep 18, 2022
New Features (ENH):
-  Add rotation property and transfer_rotate_to_content (#1348)

Performance Improvements (PI):
-  Avoid string concatenation with large embedded base64-encoded images (#1350)

Bug Fixes (BUG):
-  Format floats using their intrinsic decimal precision (#1267)

Robustness (ROB):
-  Fix merge_page for pages without resources (#1349)

Full Changelog: 2.10.8...2.10.9
@mergezalot
Copy link
Copy Markdown
Contributor Author

Note to self: found out how the original PDF with embedded base64 image was created, will file a test later.
base64image.pdf

mergezalot added a commit to mergezalot/PyPDF2 that referenced this pull request Sep 20, 2022
There is a saftey margin of a factor of 10 in both directions,
so the test should be fairly stable. Tests py-pdf#1350.
MartinThoma added a commit to py-pdf/sample-files that referenced this pull request Sep 24, 2022
Source: py-pdf/pypdf#1350 (comment)

Co-authored-by: Michael Karlen <michael.karlen@gmail.com>
@MartinThoma
Copy link
Copy Markdown
Member

I've added the example file: https://github.com/py-pdf/sample-files

MartinThoma pushed a commit that referenced this pull request Sep 25, 2022
There is a saftey margin of a factor of 10 in both directions,
so the test should be fairly stable.

Tests #1350.

Co-authored-by: Michael Karlen <michael.karlen@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

nf-performance Non-functional change: Performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants