Skip to content

Fix #297 : fix corruption in startxref or xref table#788

Merged
MartinThoma merged 6 commits intopy-pdf:mainfrom
pubpub-zz:main
Apr 27, 2022
Merged

Fix #297 : fix corruption in startxref or xref table#788
MartinThoma merged 6 commits intopy-pdf:mainfrom
pubpub-zz:main

Conversation

@pubpub-zz
Copy link
Copy Markdown
Collaborator

Proposal for #297

Issue is due to corrupted startxref pointer and/or xref table
Fixed proposed if incorrect pointer look for all objs to rebuild an xref table

Tests updated

    Proposal for #297

    Issue is due to corrupted startxref pointer and/or xref table
    Fixed proposed if incorrect pointer look for all objs to rebuild an xref table

    Tests updated
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 21, 2022

Codecov Report

Merging #788 (2852248) into main (e7fe707) will increase coverage by 0.09%.
The diff coverage is 95.34%.

@@            Coverage Diff             @@
##             main     #788      +/-   ##
==========================================
+ Coverage   75.21%   75.31%   +0.09%     
==========================================
  Files          11       12       +1     
  Lines        3514     3609      +95     
  Branches      809      835      +26     
==========================================
+ Hits         2643     2718      +75     
- Misses        658      670      +12     
- Partials      213      221       +8     
Impacted Files Coverage Δ
PyPDF2/pdf.py 81.72% <95.34%> (-0.11%) ⬇️
PyPDF2/merger.py 68.07% <0.00%> (-4.87%) ⬇️
PyPDF2/generic.py 67.84% <0.00%> (ø)
PyPDF2/__init__.py 100.00% <0.00%> (ø)
PyPDF2/papersizes.py 100.00% <0.00%> (ø)
PyPDF2/xmp.py 50.27% <0.00%> (+7.17%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e7fe707...2852248. Read the comment docs.

@MartinThoma
Copy link
Copy Markdown
Member

#297

@MartinThoma MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Apr 21, 2022
Comment on lines +219 to +222
with pytest.raises(UserWarning):
reader = PdfFileReader(path, "rb")
reader.getPage(0)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This deactivates the test. At least for strict=False I would expect reader.extractText() to give something reasonable.

The rest of the PR might be fine, but it does not fix issue297. I will not merge this part.

@MartinThoma MartinThoma added the needs-change The PR/issue cannot be handled as issue and needs to be improved label Apr 22, 2022
cf comments in #788 (comment)

Test in strict=False and True;
use PdfReadWarning instead of UserWarniing to be consistant with other raising
@pubpub-zz
Copy link
Copy Markdown
Collaborator Author

@MartinThoma
Thanks for your review. I do not master pytest. Test reviewed to check both Strict True and False.

@MartinThoma
Copy link
Copy Markdown
Member

I understand :-)

This might help to capture both cases:

@pytest.mark.parametrize(
    "set_strict, [True, False]
)
def test_issue297(set_strict):
    .... # the test: you pass set_strict to the PdfFileReader and you only do 'with pytest.raises' if set_strict is True

@pubpub-zz
Copy link
Copy Markdown
Collaborator Author

@MartinThoma ,
The expected result not being the same I've included both in the test manually isn't this good for you ?

def test_issue297():
    path = os.path.join(RESOURCE_ROOT, "issue-297.pdf")
    with pytest.raises(PdfReadWarning) as exc:
        print(exc)
        reader = PdfFileReader(path,strict=True)
        reader.getPage(0)
    assert ( exc.value.args[0].find("startxref") > 0)
    reader = PdfFileReader(path,strict=False)
    reader.getPage(0)

@MartinThoma
Copy link
Copy Markdown
Member

@pubpub-zz Thank your for pointing it out - yes, that is ok :-) The parametrize option is better as it shows clearer where an issue is (if there was one) + it avoids a bit of code duplication. But leave it as is - that is just a nitpick; I'll polish that in another commit. Thank you 🤗

I'll not merge it for now. In my roof apartment one of the windows just broke out ... I'm still shocked + cannot concentrate. I'll likely do review it again tomorrow when I can focus again 😅

@pubpub-zz
Copy link
Copy Markdown
Collaborator Author

@MartinThoma, good luck with your Window(...s 10 or 11 ? That is the most common window which is breaking down 🤣)

@MartinThoma MartinThoma added soon PRs that are almost ready to be merged, issues that get solved pretty soon and removed needs-change The PR/issue cannot be handled as issue and needs to be improved labels Apr 23, 2022
@MartinThoma
Copy link
Copy Markdown
Member

Good morning!

There seems to be something off with test_read_unknown_zero_pages. You've commented, that this is necessary due to the double-percentage in the beginning (%%PDF-1.7). However, this is a part of the percentage formatting. The actual formatted byte-string looks like this:

%PDF-1.7
1 0 obj << /Count 1 /Kids [4 0 R] /Type /Pages >> endobj
2 0 obj << >> endobj
3 0 obj << >> endobj
4 0 obj << /Contents 3 0 R /CropBox [0.0 0.0 2550.0 3508.0] /MediaBox [0.0 0.0 2550.0 3508.0] /Parent 1 0 R /Resources << /Font << >> >> /Rotate 0 /Type /Page >> endobj
5 0 obj << /Pages 0 0 R /Type /Catalog >> endobj
xref 1 5
0000000010 00000 n
0000000067 00000 n
0000000088 00000 n
0000000109 00000 n
0000000278 00000 n
trailer << /Root 5 1 R /Size 6 >>
startxref 327
%%EOF'

As you can see, the byte string starts with %PDF-1.7 and not %%PDF-1.7.

Can you help me understand why the -1 is now necessary?

@MartinThoma
Copy link
Copy Markdown
Member

(minor comment, feel free to ignore: Running black over just the test is fine: black Tests/test_reader.py. Please don't run it over the complete codebase as this would add to many changes, but I already had it running over the Tests. For this reason it will just beautify your additions)

@pubpub-zz
Copy link
Copy Markdown
Collaborator Author

As you can see, the byte string starts with %PDF-1.7 and not %%PDF-1.7.

Can you help me understand why the -1 is now necessary?

due to the formatting, the original formatting string includes a double '%%' that is converted to a single during formating. the point is that the .find(b"xref") applies the search on the original function. therefore the 327 points to the r and not the x of the xref.
the original code was not checking for the xref not reporting any error being tolerant to 'invalid' pdf data.
with the new code, strict=true will reject the file - not compliant with the spec - but with string=false, the xref table will be rebuilt, being able to accept the file
@MartinThoma , Do you think it would be better to accept this pointer error as it used to be without rebuilding the xref table?

applied only on test_reader as recommanded
@MartinThoma
Copy link
Copy Markdown
Member

MartinThoma commented Apr 24, 2022

the .find(b"xref") applies the search on the original

Thank you - that was what I missed 🙏 👍 Thank you for fixing the tests!

Do you think it would be better to accept this pointer error as it used to be without rebuilding the xref table?

I fear that we might now reject / fail to parse a lot of PDFs that we could parse beforehand. Fixing the xref table if we can is pretty nice - and raising an exception if it's broken as well 👍
The part that I'm worried about is broken PDFs, that we could still parse before.

Test completed to cover old cases
loop detected and fix if rebuilding xref table
@pubpub-zz
Copy link
Copy Markdown
Collaborator Author

I fear that we might now reject / fail to parse a lot of PDFs that we could parse beforehand. Fixing the xref table if we can is pretty nice - and raising an exception if it's broken as well 👍
The part that I'm worried about is broken PDFs, that we could still parse before.

So did I. I've extended to test to cover both nominal case and incorrect pointer. A loop has been detected and fixed in such cases.

@MartinThoma
Copy link
Copy Markdown
Member

Thank you very much for all the time and effort you've put into this! I appreciate it!

MartinThoma added a commit that referenced this pull request Apr 27, 2022
MartinThoma added a commit that referenced this pull request Apr 27, 2022
@MartinThoma
Copy link
Copy Markdown
Member

@pubpub-zz I've created a follow-up PR that improves this IMHO a little bit. Nothing substantial, but from a maintainers perspective important. I'll add the reasoning in a second: #830

MartinThoma added a commit that referenced this pull request Apr 27, 2022
This refactoring aims at making maintenance easier:

1. Too long functions make it hard to grasp the overall behavior. Hence the _get_xref_issues function was split out
2. `_get_xref_issues` is made a static method of the PdfFileReader to show that it belongs to the reader, but doesn't require any of its attributes
3. `_get_xref_issues` makes use of an integer return value instead of raising + catching exceptions. 
4. `_rebuild_xref_table` was moved to a method for the same reason.
MartinThoma pushed a commit that referenced this pull request Apr 28, 2022
Use PdfReadWarning instead of UserWarning to be consistent

Closes #297
MartinThoma added a commit that referenced this pull request Apr 28, 2022
This refactoring aims at making maintenance easier:

1. Too long functions make it hard to grasp the overall behavior. Hence the _get_xref_issues function was split out
2. `_get_xref_issues` is made a static method of the PdfFileReader to show that it belongs to the reader, but doesn't require any of its attributes
3. `_get_xref_issues` makes use of an integer return value instead of raising + catching exceptions. 
4. `_rebuild_xref_table` was moved to a method for the same reason.
VictorCarlquist pushed a commit to VictorCarlquist/PyPDF2 that referenced this pull request Apr 29, 2022
Use PdfReadWarning instead of UserWarning to be consistent

Closes py-pdf#297
VictorCarlquist pushed a commit to VictorCarlquist/PyPDF2 that referenced this pull request Apr 29, 2022
This refactoring aims at making maintenance easier:

1. Too long functions make it hard to grasp the overall behavior. Hence the _get_xref_issues function was split out
2. `_get_xref_issues` is made a static method of the PdfFileReader to show that it belongs to the reader, but doesn't require any of its attributes
3. `_get_xref_issues` makes use of an integer return value instead of raising + catching exceptions. 
4. `_rebuild_xref_table` was moved to a method for the same reason.
MartinThoma added a commit that referenced this pull request May 1, 2022
Robustness (ROB):
-  Handle missing destinations in reader (#840)
-  warn-only in readStringFromStream (#837)
-  Fix corruption in startxref or xref table (#788 and #830)

Documentation (DOC):
-  Project Governance (#799)
-  History of PyPDF2
-  PDF feature/version support (#816)
-  More details on text parsing issues (#815)

Developer Experience (DEV):
-  Add benchmark command to Makefile
-  Ignore IronPython parts for code coverage (#826)

Maintenance (MAINT):
-  Split pdf module (#836)
-  Separated CCITTFax param parsing/decoding (#841)
-  Update requirements files

Testing (TST):
-  Use external repository for larger/more PDFs for testing (#820)
-  Swap incorrect test names (#838)
-  Add test for PdfFileReader and page properties (#835)
-  Add tests for PyPDF2.generic (#831)
-  Add tests for utils, form fields, PageRange (#827)
-  Add test for ASCII85Decode (#825)
-  Add test for FlateDecode (#823)
-  Add test for filters.ASCIIHexDecode (#822)

Code Style (STY):
-  Apply pre-commit (black, isort) + use snake_case variables (#832)
-  Remove debug code (#828)
-  Documentation, Variable names (#839)

Full Changelog: 1.27.9...1.27.10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF soon PRs that are almost ready to be merged, issues that get solved pretty soon

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants