Conversation
e60528a to
4779292
Compare
PyPDF2/pdf.py
Outdated
| if x == b_('\n') or x == b_('\r'): # account for CR+LF | ||
| stream.seek(-1, 1) | ||
| crlf = True | ||
| if stream.tell() < 2: |
There was a problem hiding this comment.
There are also some weird quirks here - some might be intentional (e.g. this skips empty lines, which behavior I have preserved), but others are probably not, e.g. this boundary check means that the line before the previous one matters, and this function can also skip past one-character lines when crlf == True:
r = PdfFileReader("whatever.pdf")
x = io.BytesIO(b'foo\nbar\nX\r\nbaz')
x.seek(-1, 2)
r.readNextEndLine(x) # "baz"
r.readNextEndLine(x) # "bar" - "X" has been skipped!
|
If anyone is interested, this change is relatively easily monkey-patched in by extending |
|
I was hoping that this would fix #309 but it doesn't 😅 |
|
@MartinThoma I see #808 has been merged, which addresses one of the issues here (quadratic behavior when building the to return), but not some of the others (bugs in |
I decided to change the read semantics so that you always read content strictly ahead of your current position (versus including the byte at your current position).
9899cc8 to
a8783a6
Compare
Codecov Report
@@ Coverage Diff @@
## main #646 +/- ##
==========================================
+ Coverage 84.36% 84.51% +0.14%
==========================================
Files 18 18
Lines 4080 4093 +13
Branches 858 862 +4
==========================================
+ Hits 3442 3459 +17
+ Misses 451 449 -2
+ Partials 187 185 -2
Continue to review full report at Codecov.
|
|
@ztravis Thank you for updating the pr 🙏 The problem I have with this PR is that it's pretty hard for me to understand if it would break anything. I promise I'll have a look this evening after work (in ~12h). I will run my benchmarks, read all of the code, and either merge or at the very least ask all questions. |
|
Benchmark of current Benchmark of this PR: The results are not conclusive. It might be a bit faster, it might be a bit slower. At least I'm confident that performance is not changing largely in either direction for the given benchmark. It might still be that the benchmark lacks relevant information. |
|
The benchmark https://github.com/py-pdf/benchmarks did not change in results (as I hoped for). That means things still work. However, the text extraction times seem to be worse than before:
Similar for merging:
|
|
@ztravis Do you have an example PDF where you would expect better performance with your PR? |
|
This is only going to make a difference for PDFs that have a lot of trailing data at the end (since we're only reading backwards looking for the EOF marker). For example, if I take a normal PDF and then append 1MB of null bytes onto the end, it takes me ~0.5s to load it (from an SSD) on master but around 0.05s with this change. The other buggy behavior with the current implementation (e.g. that example where a non-empty line is skipped I gave in a comment above) might not matter given how the function is currently used. If it gets used elsewhere to read backwards in streams, it might cause some issues, but I dunno if it's a realistic concern. |
|
Both are pretty minor concerns, and I can certainly understand the preference to leave things as they are (especially since this is a bit tricky and undoubtedly error-prone operation). |
|
Here's an example PDF based on one of the test files. |
|
Nice! I can confirm for from about |
|
Thank you very much for your contribution! |
It was removed with #646, but we need to keep it in order not to break backwards compatibility.
It was removed with #646, but we need to keep it in order not to break backwards compatibility.
read_next_end_line was removed with #646, but we need to keep it in order to keep backwards compatibility.
|
Now that the PR is merged, it will be part of the next release. That will probably be next Sunday (12.06.2022). PyPDF2 still has a couple of things in the public interface (not leading by an underscore) that should probably not be public. For example, the |
New Features (ENH): - Add support for pathlib as input for PdfReader (#979) Performance Improvements (PI): - Optimize read_next_end_line (#646) Bug Fixes (BUG): - Adobe Acrobat \'Would you like to save this file?\' (#970) Documentation (DOC): - Notes on annotations (#982) - Who uses PyPDF2 - intendet \xe2\x9e\x94 in robustness page (#958) Maintenance (MAINT): - pre-commit / requirements.txt updates (#977) - Mark read_next_end_line as deprecated (#965) - Export `PageObject` in PyPDF2 root (#960) Testing (TST): - Add MCVE of issue #416 (#980) - FlateDecode.decode decodeParms (#964) - Xmp module (#962) - utils.paeth_predictor (#959) Code Style (STY): - Use more tuples and list/dict comprehensions (#976) Full Changelog: 2.1.0...2.1.1
This function (which reads the previous line in a binary stream) is pretty inefficient when handling long lines - if the stream is a buffered binary stream, each one-byte "backwards" read may trigger a full buffer read, and (more significantly) iteratively building the line we want to return is quadratic in its total length.