ROB: ignore_eof everywhere for read_until_regex#1521
Conversation
Codecov ReportBase: 91.79% // Head: 91.79% // Decreases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## main #1521 +/- ##
==========================================
- Coverage 91.79% 91.79% -0.01%
==========================================
Files 33 33
Lines 6093 6091 -2
Branches 1200 1199 -1
==========================================
- Hits 5593 5591 -2
Misses 323 323
Partials 177 177
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
This was initially motivated by `NumberObject.read_from_stream`, which was calling `read_until_regex` with the default value of `ignore_eof=False` and thus raising exceptions like: ``` PyPDF2.errors.PdfStreamError: Stream has ended unexpectedly ``` py-pdf@431ba70 demonstrates a similar fix for `NameObject.read_from_stream`. From discussion in py-pdf#1505, it was realized that the change to `NumberObject.read_from_stream` had now made ALL callers of `read_until_regex` pass `ignore_eof=True`. It's cleaner to remove the parameter entirely and change the default behaviour.
cd70bae to
8417a83
Compare
|
@rraval |
|
That would also help me. At the moment, I don't know what to do with those PRs. |
|
@pubpub-zz yes, I have code that looks like this: from PyPDF2 import PdfMerger
from pathlib import Path
bad_pdf = Path(__file__).with_name("bad.pdf")
output_pdf = Path(__file__).with_name("merged.pdf")
merger = PdfMerger(strict=True)
merger.append(str(bad_pdf))
merger.write(str(output_pdf))Which produces a traceback like this: Unfortunately I cannot provide Interestingly, if I run with from PyPDF2 import PdfMerger
from pathlib import Path
bad_pdf = Path(__file__).with_name("bad.pdf")
output_pdf = Path(__file__).with_name("merged.pdf")
merger = PdfMerger(strict=False) # <-- this line changed
merger.append(str(bad_pdf))
merger.write(str(output_pdf))... the exceptions are turned into warnings: ... and the resulting PDF is valid with the following quirks:
With this patch applied, I get no warnings and no blank pages (all 6 pages are visually equivalent to the original). |
|
Does https://pypdf.readthedocs.io/en/latest/user/robustness.html help you? The warnings exist to let the user know that the PDF is broken. You can mute them in the user code, but I would not want to remove them (except if they are actually wrong). And |
Not directly, no. But it's good to have a clear definition of the semantics that pypdf intends. Quoting from that page:
I do not know the PDF spec well (... or at all really), but does a stream termination in this fashion explicitly violate the PDF specification? Or is pypdf implementation being overly strict here? If this is indeed a PDF specification violation, then it strikes me as odd that 431ba70 for For what it's worth, I have not been able to find a PDF viewer or processor that fails on the motivating PDF file, I've tried processing via |
|
@rraval |
I've done as requested. I apologize for not sharing the file here but the file contains sensitive information and I do not have the rights to publish it broadly. |
|
Thank you @rraval for trusting us. Having example files helps us (well, @pubpub-zz to be honest - I'm mostly just doing sanity checks / releases / answering community questions 😄 ) |
|
@MartinThoma |
|
Thank you for your contributions @rraval 🙏 This change will be part of the release tomorrow @pubpub-zz Thank you for helping me with this one again ❤️ |
New Features (ENH): - Add page label support to PdfWriter (#1558) - Accept inline images with space before EI (#1552) - Add circle annotation support (#1556) - Add polygon annotation support (#1557) - Make merging pages produce a deterministic PDF (#1542, #1543) Bug Fixes (BUG): - Fix error in cmap extraction (#1544) - Remove erroneous assertion check (#1564) - Fix dictionary access of optional page label keys (#1562) Robustness (ROB): - Set ignore_eof=True for read_until_regex (#1521) Documentation (DOC): - Paper size (#1550) Developer Experience (DEV): - Fix broken combination of dependencies of docs.txt - Annotate tests appropriately (#1551) [Full Changelog](3.2.1...3.3.0)
This was initially motivated by
NumberObject.read_from_stream, which was callingread_until_regexwith the default value ofignore_eof=Falseand thus raising exceptions like:431ba70 demonstrates a similar fix for
NameObject.read_from_stream.From discussion in #1505, it was realized that the change to
NumberObject.read_from_streamhad now made ALL callers ofread_until_regexpassignore_eof=True. It's cleaner to remove the parameter entirely and change the default behaviour.