Skip to content

ENH : complete attachments functions (get/add)#1611

Merged
MartinThoma merged 17 commits intopy-pdf:mainfrom
pubpub-zz:attachments
Feb 25, 2023
Merged

ENH : complete attachments functions (get/add)#1611
MartinThoma merged 17 commits intopy-pdf:mainfrom
pubpub-zz:attachments

Conversation

@pubpub-zz
Copy link
Collaborator

add_attachments now allows to produce a list of multiple files
get_attachemnts/list_attachements added
fixes #1047 #527 #169

@codecov
Copy link

codecov bot commented Feb 5, 2023

Codecov Report

Base: 92.28% // Head: 92.30% // Increases project coverage by +0.02% 🎉

Coverage data is based on head (8688ba3) compared to base (cc32b59).
Patch coverage: 97.50% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1611      +/-   ##
==========================================
+ Coverage   92.28%   92.30%   +0.02%     
==========================================
  Files          33       33              
  Lines        6376     6410      +34     
  Branches     1271     1280       +9     
==========================================
+ Hits         5884     5917      +33     
  Misses        312      312              
- Partials      180      181       +1     
Impacted Files Coverage Δ
pypdf/_reader.py 90.63% <96.87%> (+0.21%) ⬆️
pypdf/_writer.py 84.61% <100.00%> (+0.02%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@pubpub-zz
Copy link
Collaborator Author

ready for 3.5.0 😊

pubpub-zz and others added 6 commits February 6, 2023 18:43
Co-authored-by: Martin Thoma <info@martin-thoma.de>
Co-authored-by: Martin Thoma <info@martin-thoma.de>
Co-authored-by: Martin Thoma <info@martin-thoma.de>
Co-authored-by: Martin Thoma <info@martin-thoma.de>
Co-authored-by: Martin Thoma <info@martin-thoma.de>
@MartinThoma MartinThoma added the is-feature A feature request label Feb 6, 2023
@MartinThoma
Copy link
Member

MartinThoma commented Feb 6, 2023

That's a pretty awesome new feature! I'm curious to try this out from a users perspective tomorrow!

Regarding the interface, I want to try a couple of things:

  1. Can we reasonably make use of _utils.File?
  2. Can we make this a property (e.g. simply attachments)?
  3. Instead of (1), maybe something that stores the pointer / metadata, but does not actually read the part into memory (not sure if that would be much more complicated / if there would be value in it)
  4. The name "attachment": How does EmbeddedFiles relate to file annotations? (I want to avoid running into a bookmark/outline-renaming kind of situation)

@pubpub-zz
Copy link
Collaborator Author

That's a pretty awesome new feature! I'm curious to try this out from a users perspective tomorrow!

☺️ this was expected a long time ago based on the issues...

Regarding the interface, I want to try a couple of things:

1. Can we reasonably make use of `_utils.File`?

That's feasable however not sure it will be very useful...

2. Can we make this a property (e.g. simply `attachments`)?

PdfReader will have only getter, and for memory usage, the list function will have to be kept.
for PdfWriter, there was no request was raised about a getter.

3. Instead of (1), maybe something that stores the pointer / metadata, but does not actually read the part into memory (not sure if that would be much more complicated / if there would be value in it).

The PDF will have been loaded in memory any way. but it will still be "compressed". I added an optional parameter to get only the file data you are looking for.

4. The name "attachment": How does `EmbeddedFiles` relate to file annotations? (I want to avoid running into a bookmark/outline-renaming kind of situation)

From my test, they are independant (although Acrobat Reader shows them in the attachment list).I agree embedded_files may be clearer
on the other hand there was add_attachment() so don't know what to think...

@MartinThoma
Copy link
Member

This is the file generated with two attachments:

out-attachment.pdf

The following readers seem not to be able to display this attachment / attachments at all:

  • Google Chrome Version 109.0.5414.119 (Official Build) (64-bit)
  • Okular Version 1.9.3

Atril Document Viewer 1.24.0

image

Document Viewer 3.36.10

image

Firefox

image

pubpub-zz and others added 2 commits February 7, 2023 21:05
Co-authored-by: Martin Thoma <info@martin-thoma.de>
Co-authored-by: Martin Thoma <info@martin-thoma.de>
@MartinThoma
Copy link
Member

I'm happy with the implementation, but I'm uncertain about the public interface in the PdfReader.

For the reader, could it be an attribute attachments which maps to something like this:

class _DelayedDict:
    def __init__(self, list_attachments: Callable[[], List[str]], get_attachment: Callable[[str], bytes]]):
        self.list_attachments = list_attachments
        self.get_attachment = get_attachment

    def keys(self) -> List[str]:
        return self.list_attachments()
    
    def __len__(self) -> int:
        return len(self.keys())

    def __getitem__(self, key: str) -> bytes:
        return self.get_attachment(key)

The idea is similar to the _VirtualList we use for the pages attribute. Just giving users something that feels / behaves in a familiar way.

One way to get this merged fast would be to postpone the decision on the public interface by re-naming:

  • get_attachments -> _get_attachments
  • list_attachments -> _list_attachments
    (I'm not saying you should do that ... I'm just uncertain how eager you are to get this into the release this weekend 😄 )

@pubpub-zz
Copy link
Collaborator Author

I've just completed with a commit for a test case I missed: multiple files with the same name (this case is acceptable for Acrobat Reader) : if multiple inputs have the same name, first they will be present many times in the list_attachments. In the dict returned by get_attachements they will be present in an array (in the same order as in the pdf attachment array

@pubpub-zz
Copy link
Collaborator Author

@MartinThoma
I've tried to refactor with your proposal, however, I've got a problem: get_attachments() return the full list but I can not get the same with attachments : attachments[] is incorrect; if i'm using get(), getitem() will be not be accessible. can we consider attachments[None] as acceptable ?

MartinThoma added a commit that referenced this pull request Feb 26, 2023
Add `PdfReader.attachments -> Mapping[str, List[bytes]] as a public interface.

The heavy-lifting was done by @pubpub-zz in #1611 . This PR only adds the interface for the exiting functions.
MartinThoma added a commit that referenced this pull request Feb 26, 2023
New Features (ENH)
-  Add reader.attachments public interface (#1611, #1661)
-  Add PdfWriter.remove_objects_from_page(page: PageObject, to_delete: ObjectDeletionFlag) (#1648)
-  Allow free-text annotation to have transparent border/background (#1664)

Bug Fixes (BUG)
-  Allow decryption with empty password for AlgV5 (#1663)
-  Let PdfWriter.pages return PageObject after calling `clone_document_from_reader()` (#1613)
-  Invalid font pointed during merge_resources (#1641)

Robustness (ROB)
-  Cope with invalid objects in IndirectObject.clone (#1637)
-  Improve tolerance to invalid Names/Dests (#1658)
-  Decode encoded values in get_fields (#1636)
-  Let PdfWriter.merge cope with missing "/Fields" (#1628)

[Full Changelog](3.4.1...3.5.0)
@pubpub-zz pubpub-zz deleted the attachments branch June 24, 2023 08:40
d-fence pushed a commit to odoo-dev/odoo that referenced this pull request Oct 22, 2024
Try to override both old and new API, although for the new version it
might actually be useless as pypdf seems to have added multiple
attachments support circa 3.5? (py-pdf/pypdf#1611)
robodoo pushed a commit to odoo/odoo that referenced this pull request Oct 22, 2024
Try to override both old and new API, although for the new version it
might actually be useless as pypdf seems to have added multiple
attachments support circa 3.5? (py-pdf/pypdf#1611)

Part-of: #183165
Related: odoo/enterprise#71676
Signed-off-by: Christophe Monniez (moc) <moc@odoo.com>
fw-bot pushed a commit to odoo-dev/odoo that referenced this pull request Oct 22, 2024
Try to override both old and new API, although for the new version it
might actually be useless as pypdf seems to have added multiple
attachments support circa 3.5? (py-pdf/pypdf#1611)

X-original-commit: c78870c
robodoo pushed a commit to odoo/odoo that referenced this pull request Oct 25, 2024
Try to override both old and new API, although for the new version it
might actually be useless as pypdf seems to have added multiple
attachments support circa 3.5? (py-pdf/pypdf#1611)

X-original-commit: c78870c
Part-of: #184787
Related: odoo/enterprise#72538
Signed-off-by: Christophe Monniez (moc) <moc@odoo.com>
d-fence pushed a commit to odoo-dev/odoo that referenced this pull request Oct 29, 2024
Try to override both old and new API, although for the new version it
might actually be useless as pypdf seems to have added multiple
attachments support circa 3.5? (py-pdf/pypdf#1611)

X-original-commit: c78870c
robodoo pushed a commit to odoo/odoo that referenced this pull request Oct 29, 2024
Try to override both old and new API, although for the new version it
might actually be useless as pypdf seems to have added multiple
attachments support circa 3.5? (py-pdf/pypdf#1611)

X-original-commit: c78870c
Part-of: #185282
Related: odoo/enterprise#72781
Signed-off-by: Christophe Monniez (moc) <moc@odoo.com>
d-fence pushed a commit to odoo-dev/odoo that referenced this pull request Nov 5, 2024
Try to override both old and new API, although for the new version it
might actually be useless as pypdf seems to have added multiple
attachments support circa 3.5? (py-pdf/pypdf#1611)

X-original-commit: c78870c
robodoo pushed a commit to odoo/odoo that referenced this pull request Nov 5, 2024
Try to override both old and new API, although for the new version it
might actually be useless as pypdf seems to have added multiple
attachments support circa 3.5? (py-pdf/pypdf#1611)

X-original-commit: c78870c
Part-of: #185736
Related: odoo/enterprise#73014
Signed-off-by: Christophe Monniez (moc) <moc@odoo.com>
d-fence pushed a commit to odoo-dev/odoo that referenced this pull request Nov 6, 2024
Try to override both old and new API, although for the new version it
might actually be useless as pypdf seems to have added multiple
attachments support circa 3.5? (py-pdf/pypdf#1611)

X-original-commit: c78870c
robodoo pushed a commit to odoo/odoo that referenced this pull request Nov 6, 2024
Try to override both old and new API, although for the new version it
might actually be useless as pypdf seems to have added multiple
attachments support circa 3.5? (py-pdf/pypdf#1611)

X-original-commit: c78870c
Part-of: #186301
Related: odoo/enterprise#73330
Signed-off-by: Christophe Monniez (moc) <moc@odoo.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

is-feature A feature request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improvements to attachment functions

2 participants