Optimize read_next_end_line by ztravis · Pull Request #646 · py-pdf/pypdf

ztravis · 2021-11-16T03:20:36Z

This function (which reads the previous line in a binary stream) is pretty inefficient when handling long lines - if the stream is a buffered binary stream, each one-byte "backwards" read may trigger a full buffer read, and (more significantly) iteratively building the line we want to return is quadratic in its total length.

ztravis · 2021-11-16T03:32:21Z

PyPDF2/pdf.py

-                    if x == b_('\n') or x == b_('\r'): # account for CR+LF
-                        stream.seek(-1, 1)
-                        crlf = True
-                    if stream.tell() < 2:


There are also some weird quirks here - some might be intentional (e.g. this skips empty lines, which behavior I have preserved), but others are probably not, e.g. this boundary check means that the line before the previous one matters, and this function can also skip past one-character lines when crlf == True:

r = PdfFileReader("whatever.pdf") x = io.BytesIO(b'foo\nbar\nX\r\nbaz') x.seek(-1, 2) r.readNextEndLine(x) # "baz" r.readNextEndLine(x) # "bar" - "X" has been skipped!

ztravis · 2021-11-16T03:34:39Z

If anyone is interested, this change is relatively easily monkey-patched in by extending PdfFileReader and overriding readNextEndLine.

MartinThoma · 2022-04-07T15:02:44Z

I was hoping that this would fix #309 but it doesn't 😅

ztravis · 2022-06-08T21:37:46Z

@MartinThoma I see #808 has been merged, which addresses one of the issues here (quadratic behavior when building the to return), but not some of the others (bugs in readNextEndLine, poor performance with buffered reads). Are you still interested in merging this? If so, I can bring the branch up to date, and just let me know if there are other changes you'd like (e.g. tests).

I decided to change the read semantics so that you always read content strictly ahead of your current position (versus including the byte at your current position).

codecov · 2022-06-09T00:49:58Z

Codecov Report

Merging #646 (cd08bf0) into main (a7dc370) will increase coverage by 0.14%.
The diff coverage is 95.34%.

@@            Coverage Diff             @@
##             main     #646      +/-   ##
==========================================
+ Coverage   84.36%   84.51%   +0.14%     
==========================================
  Files          18       18              
  Lines        4080     4093      +13     
  Branches      858      862       +4     
==========================================
+ Hits         3442     3459      +17     
+ Misses        451      449       -2     
+ Partials      187      185       -2

Impacted Files	Coverage Δ
PyPDF2/_utils.py	`98.03% <94.44%> (-1.12%)`	⬇️
PyPDF2/_reader.py	`81.97% <100.00%> (+0.26%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a7dc370...cd08bf0. Read the comment docs.

MartinThoma · 2022-06-09T05:13:36Z

@ztravis Thank you for updating the pr 🙏

The problem I have with this PR is that it's pretty hard for me to understand if it would break anything. I promise I'll have a look this evening after work (in ~12h). I will run my benchmarks, read all of the code, and either merge or at the very least ask all questions.

MartinThoma · 2022-06-09T05:34:37Z

Benchmark of current main:

----------------------------------------------------------------------------------------- benchmark: 3 tests ----------------------------------------------------------------------------------------
Name (time in ms)               Min                   Max                  Mean             StdDev                Median                IQR            Outliers     OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_merge                 116.0322 (1.0)        125.0818 (1.0)        118.9974 (1.0)       2.6988 (1.0)        118.2405 (1.0)       2.3268 (1.0)           2;1  8.4035 (1.0)           9           1
test_page_operations     1,737.7729 (14.98)    1,796.0086 (14.36)    1,773.2439 (14.90)    30.5507 (11.32)    1,795.1839 (15.18)    54.7096 (23.51)         2;0  0.5639 (0.07)          5           1
test_text_extraction     5,123.8949 (44.16)    5,248.9734 (41.96)    5,199.3652 (43.69)    45.9366 (17.02)    5,208.0464 (44.05)    39.5540 (17.00)         2;0  0.1923 (0.02)          5           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Benchmark of this PR:


----------------------------------------------------------------------------------------- benchmark: 3 tests -----------------------------------------------------------------------------------------
Name (time in ms)               Min                   Max                  Mean             StdDev                Median                 IQR            Outliers     OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_merge                 114.4122 (1.0)        119.8161 (1.0)        117.1081 (1.0)       1.8162 (1.0)        117.0650 (1.0)        1.8234 (1.0)           4;0  8.5391 (1.0)           9           1
test_page_operations     1,714.9651 (14.99)    1,751.7516 (14.62)    1,734.0682 (14.81)    13.6474 (7.51)     1,733.4835 (14.81)     17.7491 (9.73)          2;0  0.5767 (0.07)          5           1
test_text_extraction     4,892.0197 (42.76)    5,130.5874 (42.82)    5,019.7758 (42.86)    95.4851 (52.57)    5,034.2829 (43.00)    152.4442 (83.60)         2;0  0.1992 (0.02)          5           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Running it a second time

----------------------------------------------------------------------------------------- benchmark: 3 tests ----------------------------------------------------------------------------------------
Name (time in ms)               Min                   Max                  Mean             StdDev                Median                IQR            Outliers     OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_merge                 117.5484 (1.0)        145.6702 (1.0)        133.7171 (1.0)       8.7337 (1.0)        134.0983 (1.0)       9.5516 (1.0)           3;0  7.4785 (1.0)           9           1
test_page_operations     1,759.7965 (14.97)    1,975.6600 (13.56)    1,830.4813 (13.69)    83.9258 (9.61)     1,804.1183 (13.45)    70.6569 (7.40)          1;1  0.5463 (0.07)          5           1
test_text_extraction     5,207.8912 (44.30)    5,370.3599 (36.87)    5,274.5270 (39.45)    62.7157 (7.18)     5,257.6101 (39.21)    84.9518 (8.89)          2;0  0.1896 (0.03)          5           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The results are not conclusive. It might be a bit faster, it might be a bit slower. At least I'm confident that performance is not changing largely in either direction for the given benchmark. It might still be that the benchmark lacks relevant information.

MartinThoma · 2022-06-09T05:50:36Z

The benchmark https://github.com/py-pdf/benchmarks did not change in results (as I hoped for). That means things still work.

However, the text extraction times seem to be worse than before:

branch	Library	Average	1	2	3	4	5	6	7	8	9	10	11	12	13	14
main	PyPDF2	3.0s	23.3s	5.4s	6.1s	1.8s	0.7s	0.9s	0.4s	0.5s	0.3s	0.6s	0.5s	0.3s	0.4s	0.1s
#646	PyPDF2	3.3s	24.6s	7.2s	6.9s	2.0s	0.8s	1.2s	0.4s	0.5s	0.3s	0.5s	0.6s	0.3s	0.5s	0.2s
#646	PyPDF2	3.0s	22.9s	5.9s	6.3s	1.8s	0.6s	1.0s	0.4s	0.5s	0.3s	0.5s	0.6s	0.3s	0.4s	0.2s

Similar for merging:

branch	Library	Average	1	2	3	4	5	6	7	8	9	10	11	12	13	14
main	PyPDF2	7.7s	23.3s	5.4s	6.1s	1.8s	0.7s	0.9s	0.4s	0.5s	0.3s	0.6s	0.5s	0.3s	0.4s	0.1s
#646	PyPDF2	9.0s	24.6s	7.2s	6.9s	2.0s	0.8s	1.2s	0.4s	0.5s	0.3s	0.5s	0.6s	0.3s	0.5s	0.2s
#646	PyPDF2	8.3s	22.9s	5.9s	6.3s	1.8s	0.6s	1.0s	0.4s	0.5s	0.3s	0.5s	0.6s	0.3s	0.4s	0.2s

MartinThoma · 2022-06-09T06:18:23Z

@ztravis Do you have an example PDF where you would expect better performance with your PR?

ztravis · 2022-06-09T06:19:57Z

This is only going to make a difference for PDFs that have a lot of trailing data at the end (since we're only reading backwards looking for the EOF marker). For example, if I take a normal PDF and then append 1MB of null bytes onto the end, it takes me ~0.5s to load it (from an SSD) on master but around 0.05s with this change.
I don't know what's going on in that benchmark; from what I can see it's loading from an io.BytesIO stream of all the data, so any stream operations should be extremely fast (~1ms for me on either branch), way faster than those times you're reporting, but who knows! I would be surprised if this change has much impact normally given that this this utility method is only used during loading and should be very fast either way most of the time.

The other buggy behavior with the current implementation (e.g. that example where a non-empty line is skipped I gave in a comment above) might not matter given how the function is currently used. If it gets used elsewhere to read backwards in streams, it might cause some issues, but I dunno if it's a realistic concern.

ztravis · 2022-06-09T06:22:08Z

Both are pretty minor concerns, and I can certainly understand the preference to leave things as they are (especially since this is a bit tricky and undoubtedly error-prone operation).

ztravis · 2022-06-09T06:24:17Z

Here's an example PDF based on one of the test files.
trailing_junk.pdf

MartinThoma · 2022-06-09T15:34:30Z

Nice! I can confirm for trailing_junk.pdf that this PR improves

>>> from PyPDF2 import PdfReader; import time
>>> t0=time.time();reader = PdfReader("trailing_junk.pdf");t1=time.time();print(t1-t0)

from about 0.70s to about 0.11s. Very nice 👍

MartinThoma · 2022-06-09T16:42:39Z

Thank you very much for your contribution!

It was removed with #646, but we need to keep it in order not to break backwards compatibility.

read_next_end_line was removed with #646, but we need to keep it in order to keep backwards compatibility.

MartinThoma · 2022-06-09T16:59:47Z

Now that the PR is merged, it will be part of the next release. That will probably be next Sunday (12.06.2022).

PyPDF2 still has a couple of things in the public interface (not leading by an underscore) that should probably not be public. For example, the read_next_end_line method is public. This means users of PyPDF2 can expect it not to be removed without warning. Hence I've added it back in #965. It will not be used by PyPDF2, but has to stay around until PyPDF 4.0.0.

New Features (ENH): - Add support for pathlib as input for PdfReader (#979) Performance Improvements (PI): - Optimize read_next_end_line (#646) Bug Fixes (BUG): - Adobe Acrobat \'Would you like to save this file?\' (#970) Documentation (DOC): - Notes on annotations (#982) - Who uses PyPDF2 - intendet \xe2\x9e\x94 in robustness page (#958) Maintenance (MAINT): - pre-commit / requirements.txt updates (#977) - Mark read_next_end_line as deprecated (#965) - Export `PageObject` in PyPDF2 root (#960) Testing (TST): - Add MCVE of issue #416 (#980) - FlateDecode.decode decodeParms (#964) - Xmp module (#962) - utils.paeth_predictor (#959) Code Style (STY): - Use more tuples and list/dict comprehensions (#976) Full Changelog: 2.1.0...2.1.1

ztravis force-pushed the feature/fix-readNextEndLine branch from e60528a to 4779292 Compare November 16, 2021 03:33

ztravis commented Nov 16, 2021

View reviewed changes

MartinThoma added the nf-performance Non-functional change: Performance label Apr 7, 2022

MartinThoma added this to the PyPDF2 version 2.0.0 milestone Apr 9, 2022

Cleanup/refactoring for 2.0.0, bugfixes

a8783a6

I decided to change the read semantics so that you always read content strictly ahead of your current position (versus including the byte at your current position).

ztravis force-pushed the feature/fix-readNextEndLine branch from 9899cc8 to a8783a6 Compare June 9, 2022 00:40

Flake8

cd08bf0

MartinThoma changed the title ~~Optimize readNextEndLine~~ Optimize read_next_end_line Jun 9, 2022

MartinThoma merged commit 8cd0cfe into py-pdf:main Jun 9, 2022

MartinThoma added a commit that referenced this pull request Jun 9, 2022

MAINT: Mark read_next_end_line as deprecated

0fc52ba

It was removed with #646, but we need to keep it in order not to break backwards compatibility.

MartinThoma mentioned this pull request Jun 9, 2022

MAINT: Mark read_next_end_line as deprecated #965

Merged

MartinThoma added a commit that referenced this pull request Jun 9, 2022

MAINT: Mark read_next_end_line as deprecated

5d9e2f0

It was removed with #646, but we need to keep it in order not to break backwards compatibility.

MartinThoma added a commit that referenced this pull request Jun 9, 2022

MAINT: Mark read_next_end_line as deprecated (#965)

336d659

read_next_end_line was removed with #646, but we need to keep it in order to keep backwards compatibility.

MartinThoma mentioned this pull request Jun 10, 2022

improved ExtractText(3) #969

Merged

Conversation

ztravis commented Nov 16, 2021

Uh oh!

ztravis Nov 16, 2021

Choose a reason for hiding this comment

Uh oh!

ztravis commented Nov 16, 2021

Uh oh!

MartinThoma commented Apr 7, 2022

Uh oh!

ztravis commented Jun 8, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Jun 9, 2022

Codecov Report

Uh oh!

MartinThoma commented Jun 9, 2022

Uh oh!

MartinThoma commented Jun 9, 2022

Uh oh!

MartinThoma commented Jun 9, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MartinThoma commented Jun 9, 2022

Uh oh!

ztravis commented Jun 9, 2022

Uh oh!

ztravis commented Jun 9, 2022

Uh oh!

ztravis commented Jun 9, 2022

Uh oh!

MartinThoma commented Jun 9, 2022

Uh oh!

MartinThoma commented Jun 9, 2022

Uh oh!

MartinThoma commented Jun 9, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ztravis commented Jun 8, 2022 •

edited

Loading

MartinThoma commented Jun 9, 2022 •

edited

Loading