ENH: Support R6 decrypting#1015
ENH: Support R6 decrypting#1015MartinThoma merged 7 commits intopy-pdf:mainfrom exiledkingcc:encryption
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1015 +/- ##
==========================================
+ Coverage 89.50% 89.60% +0.09%
==========================================
Files 24 24
Lines 4432 4425 -7
Branches 919 914 -5
==========================================
- Hits 3967 3965 -2
+ Misses 318 314 -4
+ Partials 147 146 -1
Continue to review full report at Codecov.
|
|
You're on fire, @exiledkingcc 👍 🚀 The tests pass - is this ready from your perspective, @exiledkingcc ? |
|
thanks to qpdf, althoght it's comment shows the algorithm is NOT EXACTLY described in specifiction.
anyway i think you can merge this, it will not make things worse, at least, we can decrypt PDFs encypted by qpdf. |
|
I was just checking the PDF files. While the r2 / r3 / r4 files ask for passwords, the r5 and r6 files don't. I can directly see the content. Are you sure that they are encrypted? |
|
yes, they are encrypted, but not password-protected. |
|
this is how qpdf do about passwords, maybe we should keep the same way: |
|
@exiledkingcc Do you think this is ready to be merged? If yes, I would review it again this evening :-) I'm excited about it :-) |
|
i think yes. 😊 |
PyPDF2/_merger.py
Outdated
| self.inputs.append((stream, reader, my_file)) | ||
| if encryption_obj is not None: | ||
| reader._encryption = encryption_obj | ||
| reader.encryption = encryption_obj |
There was a problem hiding this comment.
Every attribute that does not start with an underscore is part of the public interface. That means we cannot change the behavior / name without deprecation warnings.
Is there a reason why a user would want to access the encryption attribute directly?
PyPDF2/_reader.py
Outdated
| self._override_encryption = False | ||
| if password is not None and self.decrypt(password) == 0: | ||
| raise PdfReadError("Wrong password") | ||
| self.encryption: Optional[Encryption] = None |
There was a problem hiding this comment.
Consistently with my question in _merger, I would prefer if this was the private attribute _encryption instead of the public attribute `encryption. Except, of course, if there is a reason why users would want to access it.
PyPDF2/_reader.py
Outdated
| encryptEntry = cast(DictionaryObject, self.trailer[TK.ENCRYPT].get_object()) | ||
| self.encryption = Encryption.read(encryptEntry, id1_entry) |
There was a problem hiding this comment.
| encryptEntry = cast(DictionaryObject, self.trailer[TK.ENCRYPT].get_object()) | |
| self.encryption = Encryption.read(encryptEntry, id1_entry) | |
| encrypt_entry = cast(DictionaryObject, self.trailer[TK.ENCRYPT].get_object()) | |
| self.encryption = Encryption.read(encrypt_entry, id1_entry) |
PyPDF2/_encryption.py
Outdated
| self._user_keys: Dict = {} | ||
| self._owner_keys: Dict = {} | ||
|
|
||
| def verified(self) -> bool: |
There was a problem hiding this comment.
I guess verified means that the file was decrypted? Would is_decrypted be better?
There was a problem hiding this comment.
Also, I think this should rather be _verified if the user does not need it.
|
I've tried to cross-check the decryption, but...
I guess we are head of our time 😄 |
|
@exiledkingcc Overall, it looks good to me. Good work 👏 👍 🎉 There are a couple of things I want to be changed before I release it. I could do them myself after merging this PR or you could do them now - whatever you prefer. Just let me know :-) I've also noticed that if either the user password or the owner password is set to the empty password, typical viewers will directly show the contents. For this reason I would add the following test PDF: |
|
@MartinThoma all your review comments are great, i will do the update tonight. |
|
I just saw https://github.com/pdfminer/pdfminer.six/tree/master/samples and especially: I think I'll run PyPDF2 over those examples some time today :-) |
|
updated. |
| try: | ||
| pwd = password.encode("latin-1") | ||
| except Exception: # noqa | ||
| pwd = password.encode("utf-8") |
There was a problem hiding this comment.
Why did you choose latin-1 as the default?
There was a problem hiding this comment.
to be honest, i don't think too much about it, just copy it from previous code. 😀
There was a problem hiding this comment.
Hehe, ok. Thanks for the honesty ❤️
|
@exiledkingcc Very nice work! Thank you 🤗 I'll release it today (just have to eat something now 😄 ) |
New Features (ENH): - Support R6 decrypting (#1015) - Add PdfReader.pdf_header (#1013) Performance Improvements (PI): - Remove ord_ calls (#1014) Bug Fixes (BUG): - Fix missing page for bookmark (#1016) Robustness (ROB): - Deal with invalid Destinations (#1028) Documentation (DOC): - get_form_text_fields does not extract dropdown data (#1029) - Adjust PdfWriter.add_uri docstring - Mention crypto extra_requires for installation (#1017) Developer Experience (DEV): - Use /n line endings everywhere (#1027) - Adjust string formatting to be able to use mutmut (#1020) - Update Bug report template Full Changelog: 2.3.1...2.4.0
New Features (ENH): - Add PageObject._get_fonts (#1083) - Add support for indexed color spaces / BitsPerComponent for decoding PNGs (#1067) Performance Improvements (PI): - Use iterative DFS in PdfWriter._sweep_indirect_references (#1072) Bug Fixes (BUG): - Let Page.scale also scale the crop-/trim-/bleed-/artbox (#1066) - Column default for CCITTFaxDecode (#1079) Robustness (ROB): - Guard against None-value in _get_outlines (#1060) Documentation (DOC): - Stamps and watermarks (#1082) - OCR vs PDF text extraction (#1081) - Python Version support - Formatting of CHANGELOG Developer Experience (DEV): - Cache downloaded files (#1070) - Speed-up for CI (#1069) Maintenance (MAINT): - Set page.rotate(angle: int) (#1092) - Issue #416 was fixed by #1015 (#1078) Testing (TST): - Image extraction (#1080) - Image extraction (#1077) Code Style (STY): - Apply black - Typo in Changelog Full Changelog: 2.4.2...2.4.3
Fixes #416