Fix #297 : fix corruption in startxref or xref table#788
Fix #297 : fix corruption in startxref or xref table#788MartinThoma merged 6 commits intopy-pdf:mainfrom pubpub-zz:main
Conversation
pubpub-zz
commented
Apr 20, 2022
Proposal for #297 Issue is due to corrupted startxref pointer and/or xref table Fixed proposed if incorrect pointer look for all objs to rebuild an xref table Tests updated
Codecov Report
@@ Coverage Diff @@
## main #788 +/- ##
==========================================
+ Coverage 75.21% 75.31% +0.09%
==========================================
Files 11 12 +1
Lines 3514 3609 +95
Branches 809 835 +26
==========================================
+ Hits 2643 2718 +75
- Misses 658 670 +12
- Partials 213 221 +8
Continue to review full report at Codecov.
|
Tests/test_reader.py
Outdated
| with pytest.raises(UserWarning): | ||
| reader = PdfFileReader(path, "rb") | ||
| reader.getPage(0) |
There was a problem hiding this comment.
This deactivates the test. At least for strict=False I would expect reader.extractText() to give something reasonable.
The rest of the PR might be fine, but it does not fix issue297. I will not merge this part.
cf comments in #788 (comment) Test in strict=False and True; use PdfReadWarning instead of UserWarniing to be consistant with other raising
|
@MartinThoma |
|
I understand :-) This might help to capture both cases: |
|
@MartinThoma , |
|
@pubpub-zz Thank your for pointing it out - yes, that is ok :-) The parametrize option is better as it shows clearer where an issue is (if there was one) + it avoids a bit of code duplication. But leave it as is - that is just a nitpick; I'll polish that in another commit. Thank you 🤗 I'll not merge it for now. In my roof apartment one of the windows just broke out ... I'm still shocked + cannot concentrate. I'll likely do review it again tomorrow when I can focus again 😅 |
|
@MartinThoma, good luck with your Window(...s 10 or 11 ? That is the most common window which is breaking down 🤣) |
|
Good morning! There seems to be something off with As you can see, the byte string starts with Can you help me understand why the |
|
(minor comment, feel free to ignore: Running black over just the test is fine: |
due to the formatting, the original formatting string includes a double '%%' that is converted to a single during formating. the point is that the .find(b"xref") applies the search on the original function. therefore the 327 points to the r and not the x of the xref. |
applied only on test_reader as recommanded
Thank you - that was what I missed 🙏 👍 Thank you for fixing the tests!
I fear that we might now reject / fail to parse a lot of PDFs that we could parse beforehand. Fixing the xref table if we can is pretty nice - and raising an exception if it's broken as well 👍 |
Test completed to cover old cases loop detected and fix if rebuilding xref table
So did I. I've extended to test to cover both nominal case and incorrect pointer. A loop has been detected and fixed in such cases. |
|
Thank you very much for all the time and effort you've put into this! I appreciate it! |
|
@pubpub-zz I've created a follow-up PR that improves this IMHO a little bit. Nothing substantial, but from a maintainers perspective important. I'll add the reasoning in a second: #830 |
This refactoring aims at making maintenance easier: 1. Too long functions make it hard to grasp the overall behavior. Hence the _get_xref_issues function was split out 2. `_get_xref_issues` is made a static method of the PdfFileReader to show that it belongs to the reader, but doesn't require any of its attributes 3. `_get_xref_issues` makes use of an integer return value instead of raising + catching exceptions. 4. `_rebuild_xref_table` was moved to a method for the same reason.
Use PdfReadWarning instead of UserWarning to be consistent Closes #297
This refactoring aims at making maintenance easier: 1. Too long functions make it hard to grasp the overall behavior. Hence the _get_xref_issues function was split out 2. `_get_xref_issues` is made a static method of the PdfFileReader to show that it belongs to the reader, but doesn't require any of its attributes 3. `_get_xref_issues` makes use of an integer return value instead of raising + catching exceptions. 4. `_rebuild_xref_table` was moved to a method for the same reason.
Use PdfReadWarning instead of UserWarning to be consistent Closes py-pdf#297
This refactoring aims at making maintenance easier: 1. Too long functions make it hard to grasp the overall behavior. Hence the _get_xref_issues function was split out 2. `_get_xref_issues` is made a static method of the PdfFileReader to show that it belongs to the reader, but doesn't require any of its attributes 3. `_get_xref_issues` makes use of an integer return value instead of raising + catching exceptions. 4. `_rebuild_xref_table` was moved to a method for the same reason.
Robustness (ROB): - Handle missing destinations in reader (#840) - warn-only in readStringFromStream (#837) - Fix corruption in startxref or xref table (#788 and #830) Documentation (DOC): - Project Governance (#799) - History of PyPDF2 - PDF feature/version support (#816) - More details on text parsing issues (#815) Developer Experience (DEV): - Add benchmark command to Makefile - Ignore IronPython parts for code coverage (#826) Maintenance (MAINT): - Split pdf module (#836) - Separated CCITTFax param parsing/decoding (#841) - Update requirements files Testing (TST): - Use external repository for larger/more PDFs for testing (#820) - Swap incorrect test names (#838) - Add test for PdfFileReader and page properties (#835) - Add tests for PyPDF2.generic (#831) - Add tests for utils, form fields, PageRange (#827) - Add test for ASCII85Decode (#825) - Add test for FlateDecode (#823) - Add test for filters.ASCIIHexDecode (#822) Code Style (STY): - Apply pre-commit (black, isort) + use snake_case variables (#832) - Remove debug code (#828) - Documentation, Variable names (#839) Full Changelog: 1.27.9...1.27.10