Skip to content

Optimize read_next_end_line#646

Merged
MartinThoma merged 2 commits intopy-pdf:mainfrom
ztravis:feature/fix-readNextEndLine
Jun 9, 2022
Merged

Optimize read_next_end_line#646
MartinThoma merged 2 commits intopy-pdf:mainfrom
ztravis:feature/fix-readNextEndLine

Conversation

@ztravis
Copy link
Copy Markdown
Contributor

@ztravis ztravis commented Nov 16, 2021

This function (which reads the previous line in a binary stream) is pretty inefficient when handling long lines - if the stream is a buffered binary stream, each one-byte "backwards" read may trigger a full buffer read, and (more significantly) iteratively building the line we want to return is quadratic in its total length.

@ztravis ztravis force-pushed the feature/fix-readNextEndLine branch from e60528a to 4779292 Compare November 16, 2021 03:33
PyPDF2/pdf.py Outdated
if x == b_('\n') or x == b_('\r'): # account for CR+LF
stream.seek(-1, 1)
crlf = True
if stream.tell() < 2:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are also some weird quirks here - some might be intentional (e.g. this skips empty lines, which behavior I have preserved), but others are probably not, e.g. this boundary check means that the line before the previous one matters, and this function can also skip past one-character lines when crlf == True:

r = PdfFileReader("whatever.pdf")
x = io.BytesIO(b'foo\nbar\nX\r\nbaz')
x.seek(-1, 2)
r.readNextEndLine(x) # "baz"
r.readNextEndLine(x) # "bar" - "X" has been skipped!

@ztravis
Copy link
Copy Markdown
Contributor Author

ztravis commented Nov 16, 2021

If anyone is interested, this change is relatively easily monkey-patched in by extending PdfFileReader and overriding readNextEndLine.

@MartinThoma
Copy link
Copy Markdown
Member

I was hoping that this would fix #309 but it doesn't 😅

@MartinThoma MartinThoma added the nf-performance Non-functional change: Performance label Apr 7, 2022
@MartinThoma MartinThoma added this to the PyPDF2 version 2.0.0 milestone Apr 9, 2022
@ztravis
Copy link
Copy Markdown
Contributor Author

ztravis commented Jun 8, 2022

@MartinThoma I see #808 has been merged, which addresses one of the issues here (quadratic behavior when building the to return), but not some of the others (bugs in readNextEndLine, poor performance with buffered reads). Are you still interested in merging this? If so, I can bring the branch up to date, and just let me know if there are other changes you'd like (e.g. tests).

I decided to change the read semantics so that you always read content
strictly ahead of your current position (versus including the byte at
your current position).
@ztravis ztravis force-pushed the feature/fix-readNextEndLine branch from 9899cc8 to a8783a6 Compare June 9, 2022 00:40
@codecov
Copy link
Copy Markdown

codecov bot commented Jun 9, 2022

Codecov Report

Merging #646 (cd08bf0) into main (a7dc370) will increase coverage by 0.14%.
The diff coverage is 95.34%.

@@            Coverage Diff             @@
##             main     #646      +/-   ##
==========================================
+ Coverage   84.36%   84.51%   +0.14%     
==========================================
  Files          18       18              
  Lines        4080     4093      +13     
  Branches      858      862       +4     
==========================================
+ Hits         3442     3459      +17     
+ Misses        451      449       -2     
+ Partials      187      185       -2     
Impacted Files Coverage Δ
PyPDF2/_utils.py 98.03% <94.44%> (-1.12%) ⬇️
PyPDF2/_reader.py 81.97% <100.00%> (+0.26%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a7dc370...cd08bf0. Read the comment docs.

@MartinThoma
Copy link
Copy Markdown
Member

@ztravis Thank you for updating the pr 🙏

The problem I have with this PR is that it's pretty hard for me to understand if it would break anything. I promise I'll have a look this evening after work (in ~12h). I will run my benchmarks, read all of the code, and either merge or at the very least ask all questions.

@MartinThoma
Copy link
Copy Markdown
Member

Benchmark of current main:

----------------------------------------------------------------------------------------- benchmark: 3 tests ----------------------------------------------------------------------------------------
Name (time in ms)               Min                   Max                  Mean             StdDev                Median                IQR            Outliers     OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_merge                 116.0322 (1.0)        125.0818 (1.0)        118.9974 (1.0)       2.6988 (1.0)        118.2405 (1.0)       2.3268 (1.0)           2;1  8.4035 (1.0)           9           1
test_page_operations     1,737.7729 (14.98)    1,796.0086 (14.36)    1,773.2439 (14.90)    30.5507 (11.32)    1,795.1839 (15.18)    54.7096 (23.51)         2;0  0.5639 (0.07)          5           1
test_text_extraction     5,123.8949 (44.16)    5,248.9734 (41.96)    5,199.3652 (43.69)    45.9366 (17.02)    5,208.0464 (44.05)    39.5540 (17.00)         2;0  0.1923 (0.02)          5           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Benchmark of this PR:


----------------------------------------------------------------------------------------- benchmark: 3 tests -----------------------------------------------------------------------------------------
Name (time in ms)               Min                   Max                  Mean             StdDev                Median                 IQR            Outliers     OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_merge                 114.4122 (1.0)        119.8161 (1.0)        117.1081 (1.0)       1.8162 (1.0)        117.0650 (1.0)        1.8234 (1.0)           4;0  8.5391 (1.0)           9           1
test_page_operations     1,714.9651 (14.99)    1,751.7516 (14.62)    1,734.0682 (14.81)    13.6474 (7.51)     1,733.4835 (14.81)     17.7491 (9.73)          2;0  0.5767 (0.07)          5           1
test_text_extraction     4,892.0197 (42.76)    5,130.5874 (42.82)    5,019.7758 (42.86)    95.4851 (52.57)    5,034.2829 (43.00)    152.4442 (83.60)         2;0  0.1992 (0.02)          5           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Running it a second time

----------------------------------------------------------------------------------------- benchmark: 3 tests ----------------------------------------------------------------------------------------
Name (time in ms)               Min                   Max                  Mean             StdDev                Median                IQR            Outliers     OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_merge                 117.5484 (1.0)        145.6702 (1.0)        133.7171 (1.0)       8.7337 (1.0)        134.0983 (1.0)       9.5516 (1.0)           3;0  7.4785 (1.0)           9           1
test_page_operations     1,759.7965 (14.97)    1,975.6600 (13.56)    1,830.4813 (13.69)    83.9258 (9.61)     1,804.1183 (13.45)    70.6569 (7.40)          1;1  0.5463 (0.07)          5           1
test_text_extraction     5,207.8912 (44.30)    5,370.3599 (36.87)    5,274.5270 (39.45)    62.7157 (7.18)     5,257.6101 (39.21)    84.9518 (8.89)          2;0  0.1896 (0.03)          5           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The results are not conclusive. It might be a bit faster, it might be a bit slower. At least I'm confident that performance is not changing largely in either direction for the given benchmark. It might still be that the benchmark lacks relevant information.

@MartinThoma
Copy link
Copy Markdown
Member

MartinThoma commented Jun 9, 2022

The benchmark https://github.com/py-pdf/benchmarks did not change in results (as I hoped for). That means things still work.

However, the text extraction times seem to be worse than before:

branch Library Average 1 2 3 4 5 6 7 8 9 10 11 12 13 14
main PyPDF2 3.0s 23.3s 5.4s 6.1s 1.8s 0.7s 0.9s 0.4s 0.5s 0.3s 0.6s 0.5s 0.3s 0.4s 0.1s
#646 PyPDF2 3.3s 24.6s 7.2s 6.9s 2.0s 0.8s 1.2s 0.4s 0.5s 0.3s 0.5s 0.6s 0.3s 0.5s 0.2s
#646 PyPDF2 3.0s 22.9s 5.9s 6.3s 1.8s 0.6s 1.0s 0.4s 0.5s 0.3s 0.5s 0.6s 0.3s 0.4s 0.2s

Similar for merging:

branch Library Average 1 2 3 4 5 6 7 8 9 10 11 12 13 14
main PyPDF2 7.7s 23.3s 5.4s 6.1s 1.8s 0.7s 0.9s 0.4s 0.5s 0.3s 0.6s 0.5s 0.3s 0.4s 0.1s
#646 PyPDF2 9.0s 24.6s 7.2s 6.9s 2.0s 0.8s 1.2s 0.4s 0.5s 0.3s 0.5s 0.6s 0.3s 0.5s 0.2s
#646 PyPDF2 8.3s 22.9s 5.9s 6.3s 1.8s 0.6s 1.0s 0.4s 0.5s 0.3s 0.5s 0.6s 0.3s 0.4s 0.2s

@MartinThoma
Copy link
Copy Markdown
Member

@ztravis Do you have an example PDF where you would expect better performance with your PR?

@ztravis
Copy link
Copy Markdown
Contributor Author

ztravis commented Jun 9, 2022

This is only going to make a difference for PDFs that have a lot of trailing data at the end (since we're only reading backwards looking for the EOF marker). For example, if I take a normal PDF and then append 1MB of null bytes onto the end, it takes me ~0.5s to load it (from an SSD) on master but around 0.05s with this change.
I don't know what's going on in that benchmark; from what I can see it's loading from an io.BytesIO stream of all the data, so any stream operations should be extremely fast (~1ms for me on either branch), way faster than those times you're reporting, but who knows! I would be surprised if this change has much impact normally given that this this utility method is only used during loading and should be very fast either way most of the time.

The other buggy behavior with the current implementation (e.g. that example where a non-empty line is skipped I gave in a comment above) might not matter given how the function is currently used. If it gets used elsewhere to read backwards in streams, it might cause some issues, but I dunno if it's a realistic concern.

@ztravis
Copy link
Copy Markdown
Contributor Author

ztravis commented Jun 9, 2022

Both are pretty minor concerns, and I can certainly understand the preference to leave things as they are (especially since this is a bit tricky and undoubtedly error-prone operation).

@ztravis
Copy link
Copy Markdown
Contributor Author

ztravis commented Jun 9, 2022

Here's an example PDF based on one of the test files.
trailing_junk.pdf

@MartinThoma
Copy link
Copy Markdown
Member

Nice! I can confirm for trailing_junk.pdf that this PR improves

>>> from PyPDF2 import PdfReader; import time
>>> t0=time.time();reader = PdfReader("trailing_junk.pdf");t1=time.time();print(t1-t0)

from about 0.70s to about 0.11s. Very nice 👍

@MartinThoma MartinThoma changed the title Optimize readNextEndLine Optimize read_next_end_line Jun 9, 2022
@MartinThoma MartinThoma merged commit 8cd0cfe into py-pdf:main Jun 9, 2022
@MartinThoma
Copy link
Copy Markdown
Member

Thank you very much for your contribution!

MartinThoma added a commit that referenced this pull request Jun 9, 2022
It was removed with #646, but we need to keep it in order not to
break backwards compatibility.
MartinThoma added a commit that referenced this pull request Jun 9, 2022
It was removed with #646, but we need to keep it in order not to
break backwards compatibility.
MartinThoma added a commit that referenced this pull request Jun 9, 2022
read_next_end_line was removed with #646, but we need to keep it in order to keep backwards compatibility.
@MartinThoma
Copy link
Copy Markdown
Member

Now that the PR is merged, it will be part of the next release. That will probably be next Sunday (12.06.2022).

PyPDF2 still has a couple of things in the public interface (not leading by an underscore) that should probably not be public. For example, the read_next_end_line method is public. This means users of PyPDF2 can expect it not to be removed without warning. Hence I've added it back in #965. It will not be used by PyPDF2, but has to stay around until PyPDF 4.0.0.

MartinThoma added a commit that referenced this pull request Jun 12, 2022
New Features (ENH):
-  Add support for pathlib as input for PdfReader (#979)

Performance Improvements (PI):
-  Optimize read_next_end_line (#646)

Bug Fixes (BUG):
-  Adobe Acrobat \'Would you like to save this file?\' (#970)

Documentation (DOC):
-  Notes on annotations (#982)
-  Who uses PyPDF2
-  intendet \xe2\x9e\x94 in robustness page  (#958)

Maintenance (MAINT):
-  pre-commit / requirements.txt updates (#977)
-  Mark read_next_end_line as deprecated (#965)
-  Export `PageObject` in PyPDF2 root (#960)

Testing (TST):
-  Add MCVE of issue #416 (#980)
-  FlateDecode.decode decodeParms (#964)
-  Xmp module (#962)
-  utils.paeth_predictor (#959)

Code Style (STY):
-  Use more tuples and list/dict comprehensions (#976)

Full Changelog: 2.1.0...2.1.1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

nf-performance Non-functional change: Performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants