ENH: Add PageObject.images attribute#1330
Conversation
Codecov ReportBase: 94.71% // Head: 94.55% // Decreases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## main #1330 +/- ##
==========================================
- Coverage 94.71% 94.55% -0.16%
==========================================
Files 30 28 -2
Lines 5181 5016 -165
Branches 1060 1033 -27
==========================================
- Hits 4907 4743 -164
Misses 164 164
+ Partials 110 109 -1
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
|
@pubpub-zz @MasterOdin What do you think about this PR? While I wrote it, I realized that PyPDF2 does something wrong with image extraction in some cases. I marked those tests with xfail. The point of this PR is not to fix those issues, but to provide a convenient interface for getting images from PDF pages. That means:
@pubpub-zz You mentioned that this method might not get all images of a page. For this PR, this would be acceptable to me. We can fix that later. As a follow-up step we might use the I'm uncertain about the The reason why I chose mime-type were spelling inconsistencies like this:
Additionally, I'm uncertain if using extension vs mime_type makes a difference if we use the |
PyPDF2/_page.py
Outdated
| filename = f"{obj[1:]}.{File._mime2extension(mime_type)}" | ||
| images_extracted.append( | ||
| File(name=filename, data=byte_stream, mime_type=mime_type) | ||
| ) |
There was a problem hiding this comment.
This strikes me as an odd abstraction, where we are passing in the mime_type as part of the File constructor, but we also need to construct the full filename, using a private static function to boot, but also that the file_extension method doesn't correspond to the extension of the passed in name, but rather mime_type.
If we go the route of passing in the mime_type for the File, I'd advocate for just passing in name sans extension altogether and we can have a special property function that does the concatenation of name + extension to give a "filename" on demand as needed by users.
The only caveat would be for attachments, it may make sense to pass in the full filename, but I'm not well versed on that part of the spec to even know how that API might look.
There was a problem hiding this comment.
I think I'll go with 'File only has name + data (no mime_type)' for the moment, because it seems to have only advantages:
- Less clutter / less code to maintain
- No potential to discover the wrong mime type
- We could make
_xobj_to_imagejust pass the file extension as before - As
_xobj_to_imageis a private function, we can easily change the behavior if we see a clear advantage
This is consistent with Pillow: https://pillow.readthedocs.io/en/latest/reference/Image.html#PIL.Image.Image.format Co-authored-by: Matthew Peveler <matt.peveler@gmail.com>
21b54ae to
c491881
Compare
New Features (ENH): - Addition of optional visitor-functions in extract_text() (#1252) - Add metadata.creation_date and modification_date (#1364) - Add PageObject.images attribute (#1330) Bug Fixes (BUG): - Lookup index in _xobj_to_image can be ByteStringObject (#1366) - \'IndexError: index out of range\' when using extract_text (#1361) - Errors in transfer_rotation_to_content() (#1356) Robustness (ROB): - Ensure update_page_form_field_values does not fail if no fields (#1346) Testing (TST): - read_string_from_stream performance (#1355) Full Changelog: 2.10.9...2.11.0
No description provided.