Fix #297 : fix corruption in startxref or xref table by pubpub-zz · Pull Request #788 · py-pdf/pypdf

pubpub-zz · 2022-04-20T20:34:04Z

Proposal for #297

Issue is due to corrupted startxref pointer and/or xref table
Fixed proposed if incorrect pointer look for all objs to rebuild an xref table

Tests updated

Proposal for #297 Issue is due to corrupted startxref pointer and/or xref table Fixed proposed if incorrect pointer look for all objs to rebuild an xref table Tests updated

codecov-commenter · 2022-04-21T04:02:14Z

Codecov Report

Merging #788 (2852248) into main (e7fe707) will increase coverage by 0.09%.
The diff coverage is 95.34%.

@@            Coverage Diff             @@
##             main     #788      +/-   ##
==========================================
+ Coverage   75.21%   75.31%   +0.09%     
==========================================
  Files          11       12       +1     
  Lines        3514     3609      +95     
  Branches      809      835      +26     
==========================================
+ Hits         2643     2718      +75     
- Misses        658      670      +12     
- Partials      213      221       +8

Impacted Files	Coverage Δ
PyPDF2/pdf.py	`81.72% <95.34%> (-0.11%)`	⬇️
PyPDF2/merger.py	`68.07% <0.00%> (-4.87%)`	⬇️
PyPDF2/generic.py	`67.84% <0.00%> (ø)`
PyPDF2/__init__.py	`100.00% <0.00%> (ø)`
PyPDF2/papersizes.py	`100.00% <0.00%> (ø)`
PyPDF2/xmp.py	`50.27% <0.00%> (+7.17%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e7fe707...2852248. Read the comment docs.

MartinThoma · 2022-04-21T06:29:48Z

#297

MartinThoma · 2022-04-22T15:08:43Z

Tests/test_reader.py

+    with pytest.raises(UserWarning):
+        reader = PdfFileReader(path, "rb")
+        reader.getPage(0)


This deactivates the test. At least for strict=False I would expect reader.extractText() to give something reasonable.

The rest of the PR might be fine, but it does not fix issue297. I will not merge this part.

cf comments in #788 (comment) Test in strict=False and True; use PdfReadWarning instead of UserWarniing to be consistant with other raising

pubpub-zz · 2022-04-23T11:34:13Z

@MartinThoma
Thanks for your review. I do not master pytest. Test reviewed to check both Strict True and False.

MartinThoma · 2022-04-23T12:31:50Z

I understand :-)

This might help to capture both cases:

@pytest.mark.parametrize(
    "set_strict, [True, False]
)
def test_issue297(set_strict):
    .... # the test: you pass set_strict to the PdfFileReader and you only do 'with pytest.raises' if set_strict is True

pubpub-zz · 2022-04-23T14:18:37Z

@MartinThoma ,
The expected result not being the same I've included both in the test manually isn't this good for you ?

def test_issue297():
    path = os.path.join(RESOURCE_ROOT, "issue-297.pdf")
    with pytest.raises(PdfReadWarning) as exc:
        print(exc)
        reader = PdfFileReader(path,strict=True)
        reader.getPage(0)
    assert ( exc.value.args[0].find("startxref") > 0)
    reader = PdfFileReader(path,strict=False)
    reader.getPage(0)

MartinThoma · 2022-04-23T17:23:28Z

@pubpub-zz Thank your for pointing it out - yes, that is ok :-) The parametrize option is better as it shows clearer where an issue is (if there was one) + it avoids a bit of code duplication. But leave it as is - that is just a nitpick; I'll polish that in another commit. Thank you 🤗

I'll not merge it for now. In my roof apartment one of the windows just broke out ... I'm still shocked + cannot concentrate. I'll likely do review it again tomorrow when I can focus again 😅

pubpub-zz · 2022-04-23T17:31:03Z

@MartinThoma, good luck with your Window(...s 10 or 11 ? That is the most common window which is breaking down 🤣)

MartinThoma · 2022-04-24T05:29:57Z

Good morning!

There seems to be something off with test_read_unknown_zero_pages. You've commented, that this is necessary due to the double-percentage in the beginning (%%PDF-1.7). However, this is a part of the percentage formatting. The actual formatted byte-string looks like this:

%PDF-1.7
1 0 obj << /Count 1 /Kids [4 0 R] /Type /Pages >> endobj
2 0 obj << >> endobj
3 0 obj << >> endobj
4 0 obj << /Contents 3 0 R /CropBox [0.0 0.0 2550.0 3508.0] /MediaBox [0.0 0.0 2550.0 3508.0] /Parent 1 0 R /Resources << /Font << >> >> /Rotate 0 /Type /Page >> endobj
5 0 obj << /Pages 0 0 R /Type /Catalog >> endobj
xref 1 5
0000000010 00000 n
0000000067 00000 n
0000000088 00000 n
0000000109 00000 n
0000000278 00000 n
trailer << /Root 5 1 R /Size 6 >>
startxref 327
%%EOF'

As you can see, the byte string starts with %PDF-1.7 and not %%PDF-1.7.

Can you help me understand why the -1 is now necessary?

MartinThoma · 2022-04-24T05:32:16Z

(minor comment, feel free to ignore: Running black over just the test is fine: black Tests/test_reader.py. Please don't run it over the complete codebase as this would add to many changes, but I already had it running over the Tests. For this reason it will just beautify your additions)

PyPDF2/pdf.py

pubpub-zz · 2022-04-24T08:26:27Z

As you can see, the byte string starts with %PDF-1.7 and not %%PDF-1.7.

Can you help me understand why the -1 is now necessary?

due to the formatting, the original formatting string includes a double '%%' that is converted to a single during formating. the point is that the .find(b"xref") applies the search on the original function. therefore the 327 points to the r and not the x of the xref.
the original code was not checking for the xref not reporting any error being tolerant to 'invalid' pdf data.
with the new code, strict=true will reject the file - not compliant with the spec - but with string=false, the xref table will be rebuilt, being able to accept the file
@MartinThoma , Do you think it would be better to accept this pointer error as it used to be without rebuilding the xref table?

applied only on test_reader as recommanded

MartinThoma · 2022-04-24T09:28:01Z

the .find(b"xref") applies the search on the original

Thank you - that was what I missed 🙏 👍 Thank you for fixing the tests!

Do you think it would be better to accept this pointer error as it used to be without rebuilding the xref table?

I fear that we might now reject / fail to parse a lot of PDFs that we could parse beforehand. Fixing the xref table if we can is pretty nice - and raising an exception if it's broken as well 👍
The part that I'm worried about is broken PDFs, that we could still parse before.

Test completed to cover old cases loop detected and fix if rebuilding xref table

pubpub-zz · 2022-04-24T14:34:31Z

I fear that we might now reject / fail to parse a lot of PDFs that we could parse beforehand. Fixing the xref table if we can is pretty nice - and raising an exception if it's broken as well 👍
The part that I'm worried about is broken PDFs, that we could still parse before.

So did I. I've extended to test to cover both nominal case and incorrect pointer. A loop has been detected and fixed in such cases.

PyPDF2/pdf.py

Tests/test_reader.py

#788 (comment)

MartinThoma · 2022-04-27T11:30:49Z

Thank you very much for all the time and effort you've put into this! I appreciate it!

MartinThoma · 2022-04-27T11:58:14Z

@pubpub-zz I've created a follow-up PR that improves this IMHO a little bit. Nothing substantial, but from a maintainers perspective important. I'll add the reasoning in a second: #830

This refactoring aims at making maintenance easier: 1. Too long functions make it hard to grasp the overall behavior. Hence the _get_xref_issues function was split out 2. `_get_xref_issues` is made a static method of the PdfFileReader to show that it belongs to the reader, but doesn't require any of its attributes 3. `_get_xref_issues` makes use of an integer return value instead of raising + catching exceptions. 4. `_rebuild_xref_table` was moved to a method for the same reason.

Use PdfReadWarning instead of UserWarning to be consistent Closes #297

This refactoring aims at making maintenance easier: 1. Too long functions make it hard to grasp the overall behavior. Hence the _get_xref_issues function was split out 2. `_get_xref_issues` is made a static method of the PdfFileReader to show that it belongs to the reader, but doesn't require any of its attributes 3. `_get_xref_issues` makes use of an integer return value instead of raising + catching exceptions. 4. `_rebuild_xref_table` was moved to a method for the same reason.

Use PdfReadWarning instead of UserWarning to be consistent Closes py-pdf#297

This refactoring aims at making maintenance easier: 1. Too long functions make it hard to grasp the overall behavior. Hence the _get_xref_issues function was split out 2. `_get_xref_issues` is made a static method of the PdfFileReader to show that it belongs to the reader, but doesn't require any of its attributes 3. `_get_xref_issues` makes use of an integer return value instead of raising + catching exceptions. 4. `_rebuild_xref_table` was moved to a method for the same reason.

Robustness (ROB): - Handle missing destinations in reader (#840) - warn-only in readStringFromStream (#837) - Fix corruption in startxref or xref table (#788 and #830) Documentation (DOC): - Project Governance (#799) - History of PyPDF2 - PDF feature/version support (#816) - More details on text parsing issues (#815) Developer Experience (DEV): - Add benchmark command to Makefile - Ignore IronPython parts for code coverage (#826) Maintenance (MAINT): - Split pdf module (#836) - Separated CCITTFax param parsing/decoding (#841) - Update requirements files Testing (TST): - Use external repository for larger/more PDFs for testing (#820) - Swap incorrect test names (#838) - Add test for PdfFileReader and page properties (#835) - Add tests for PyPDF2.generic (#831) - Add tests for utils, form fields, PageRange (#827) - Add test for ASCII85Decode (#825) - Add test for FlateDecode (#823) - Add test for filters.ASCIIHexDecode (#822) Code Style (STY): - Apply pre-commit (black, isort) + use snake_case variables (#832) - Remove debug code (#828) - Documentation, Variable names (#839) Full Changelog: 1.27.9...1.27.10

pubpub-zz added 2 commits April 20, 2022 22:30

Fix #297 : fix corruption in startxref or xref table

19efa1d

Proposal for #297 Issue is due to corrupted startxref pointer and/or xref table Fixed proposed if incorrect pointer look for all objs to rebuild an xref table Tests updated

cleanup flake8 report

706d4c5

MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Apr 21, 2022

MartinThoma reviewed Apr 22, 2022

View reviewed changes

MartinThoma added the needs-change The PR/issue cannot be handled as issue and needs to be improved label Apr 22, 2022

review Test

e61e469

cf comments in #788 (comment) Test in strict=False and True; use PdfReadWarning instead of UserWarniing to be consistant with other raising

MartinThoma added soon PRs that are almost ready to be merged, issues that get solved pretty soon and removed needs-change The PR/issue cannot be handled as issue and needs to be improved labels Apr 23, 2022

MartinThoma reviewed Apr 24, 2022

View reviewed changes

PyPDF2/pdf.py Show resolved Hide resolved

reformatting with black

cd2df38

applied only on test_reader as recommanded

extend test

c7ecd8c

Test completed to cover old cases loop detected and fix if rebuilding xref table

MartinThoma reviewed Apr 24, 2022

View reviewed changes

PyPDF2/pdf.py Show resolved Hide resolved

MartinThoma reviewed Apr 24, 2022

View reviewed changes

Tests/test_reader.py Outdated Show resolved Hide resolved

Tests/test_reader.py Outdated Show resolved Hide resolved

Tests/test_reader.py Outdated Show resolved Hide resolved

Tests/test_reader.py Outdated Show resolved Hide resolved

cleanup

2852248

#788 (comment)

MartinThoma mentioned this pull request Apr 25, 2022

ENH: Add decrypt support for V5 and AES-128, AES-256 (R5 only) #749

Merged

MartinThoma merged commit 904b0df into py-pdf:main Apr 27, 2022

MartinThoma added a commit that referenced this pull request Apr 27, 2022

MAINT: Small refactoring after #788

e4e67a9

MartinThoma added a commit that referenced this pull request Apr 27, 2022

MAINT: Small refactoring after #788

46fd5dd

MartinThoma pushed a commit that referenced this pull request Apr 28, 2022

ROB: Fix corruption in startxref or xref table (#788)

1df2e4b

Use PdfReadWarning instead of UserWarning to be consistent Closes #297

VictorCarlquist pushed a commit to VictorCarlquist/PyPDF2 that referenced this pull request Apr 29, 2022

ROB: Fix corruption in startxref or xref table (py-pdf#788)

2481e9c

Use PdfReadWarning instead of UserWarning to be consistent Closes py-pdf#297

Conversation

pubpub-zz commented Apr 20, 2022

Uh oh!

codecov-commenter commented Apr 21, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

MartinThoma commented Apr 21, 2022

Uh oh!

MartinThoma Apr 22, 2022

Choose a reason for hiding this comment

Uh oh!

pubpub-zz commented Apr 23, 2022

Uh oh!

MartinThoma commented Apr 23, 2022

Uh oh!

pubpub-zz commented Apr 23, 2022

Uh oh!

MartinThoma commented Apr 23, 2022

Uh oh!

pubpub-zz commented Apr 23, 2022

Uh oh!

MartinThoma commented Apr 24, 2022

Uh oh!

MartinThoma commented Apr 24, 2022

Uh oh!

Uh oh!

pubpub-zz commented Apr 24, 2022

Uh oh!

MartinThoma commented Apr 24, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pubpub-zz commented Apr 24, 2022

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

MartinThoma commented Apr 27, 2022

Uh oh!

MartinThoma commented Apr 27, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov-commenter commented Apr 21, 2022 •

edited

Loading

MartinThoma commented Apr 24, 2022 •

edited

Loading