Skip to content

ROB: Relax flate decoding for too many lookup values#2331

Merged
MartinThoma merged 5 commits intopy-pdf:mainfrom
stefan6419846:relax-flate-lookup
Dec 10, 2023
Merged

ROB: Relax flate decoding for too many lookup values#2331
MartinThoma merged 5 commits intopy-pdf:mainfrom
stefan6419846:relax-flate-lookup

Conversation

@stefan6419846
Copy link
Copy Markdown
Collaborator

When handling flate objects with a lookup table and the image mode 1, we would previously raise a generic AssertionError if the number of lookup values did not match.

This PR proposes to add a more meaningful error message. Additionally, cases where too many values are specified are now considered a warning only as I could not see any real difference.

I might try to find a document version I can use for public test cases later on to at least cover the case where there are too many values.

@codecov
Copy link
Copy Markdown

codecov bot commented Dec 7, 2023

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (8d3f879) 94.37% compared to head (5d3fc66) 94.32%.
Report is 1 commits behind head on main.

Files Patch % Lines
pypdf/_xobj_image_helpers.py 50.00% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2331      +/-   ##
==========================================
- Coverage   94.37%   94.32%   -0.05%     
==========================================
  Files          43       43              
  Lines        7660     7666       +6     
  Branches     1515     1518       +3     
==========================================
+ Hits         7229     7231       +2     
- Misses        267      269       +2     
- Partials      164      166       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@stefan6419846
Copy link
Copy Markdown
Collaborator Author

Example: We have the lookup table b'\x00\x00\x00\xff\xff\xff\n', id est with a trailing whitespace character. This is generated by the SkyPDF Pro Driver for example.

@stefan6419846
Copy link
Copy Markdown
Collaborator Author

Example document: out1.pdf

Copy link
Copy Markdown
Collaborator

@pubpub-zz pubpub-zz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!
Now we just need to improve test coverage😉

@stefan6419846
Copy link
Copy Markdown
Collaborator Author

I might have a look at how to emulate the two remaining error cases where we do not have actual example files for now in the next days.

@MartinThoma
Copy link
Copy Markdown
Member

@stefan6419846 Please correct me if I'm wrong, but I don't think it's a bugfix-PR.

A bug in pypdf would mean that pypdf is either...

  • ... not acting according to the PDF specification, or
  • ... not acting according to the pypdf documentation (directly or indirectly, if the user could get a reasonable expectation for a specific behavior)

This PR has two components:

  • ROB: Not raising an exception but just throwing a warning in case of too many values.
  • STY: Throwing a more specific error message / different types of error message

As robustness improvements (ROB) are typically more important for pypdf users, I'd start the PR with "ROB".

See https://pypdf.readthedocs.io/en/latest/dev/intro.html (I should probably extend those :-) )

@MartinThoma
Copy link
Copy Markdown
Member

From my side this looks good to be merged 👍 Just let me know if I can change the PR title

@stefan6419846
Copy link
Copy Markdown
Collaborator Author

@MartinThoma No worries, I am completely fine with you adjusting the title if required, referrring to robustness instead. And adjusting the dev docs sounds like a nice idea as well to make it more clear.

Feel free to merge this PR after updating the title. If I find some time during the next week, I might provide some more test data for the two new edge cases to increase coverage in a separate PR - for now, I just did not stumble upon such files as in theory they violate the PDF standard anyway, while the whitespace stuff seems more spec-like and thus appears in the real world as well.

@MartinThoma MartinThoma changed the title BUG: Relax flate decoding for too many lookup values ROB: Relax flate decoding for too many lookup values Dec 10, 2023
@MartinThoma MartinThoma merged commit 6dad92a into py-pdf:main Dec 10, 2023
@stefan6419846 stefan6419846 deleted the relax-flate-lookup branch December 10, 2023 11:28
MartinThoma added a commit that referenced this pull request Dec 10, 2023
## What's new

### Bug Fixes (BUG)
-  Cope with deflated images with CMYK Black Only (#2322) by @pubpub-zz
-  Handle indirect objects as parameters for CCITTFaxDecode (#2307) by @stefan6419846
-  check words length in _cmap type1_alternative function (#2310) by @Takher

### Robustness (ROB)
-  Relax flate decoding for too many lookup values (#2331) by @stefan6419846
-  Let _build_destination skip in case of missing /D key (#2018) by @nickryand

### Documentation (DOC)
-  Note in reading form data (#2338) by @MartinThoma
-  Pull Request prefixes and size by @MartinThoma
-  Add https://github.com/zuypt for #2325 as a contributor by @MartinThoma
-  Fix docstring for RunLengthDecode.decode (#2302) by @stefan6419846

### Maintenance (MAINT)
-  Enable `disallow_any_generics` and add missing generics (#2278) by @nilehmann

### Testing (TST)
-  Centralize file downloads (#2324) by @MartinThoma

### Code Style (STY)
-  Fix typo "steam" \xe2\x86\x92 "stream" (#2327) by @stefan6419846
-  Run black by @MartinThoma
-  Make Traceback in bug report template uppercase (#2304) by @stefan6419846

[Full Changelog](3.17.1...3.17.2)
MartinThoma pushed a commit that referenced this pull request Dec 18, 2023
As mentioned in #2331, this will improve the test coverage for the edge cases.

Further refactoring was necessary as iterating over bytes will yield integers instead of single bytes and thus the whitespace check has been broken.

Additionally, the whitespace check has previously always been performed on the shortened bytes data.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants