PI: Use iterative DFS in PdfWriter._sweep_indirect_references #1072
MartinThoma merged 20 commits into py-pdf:main
The xref indexes have been updated.
|
Hm, for some unknown reason py37 and py38 fail, but py39 and py310 were OK. |
|
There are two tests with issues:
I investigated them: those tests only "succeeded" because _sweep_indirect_references hit the recursion limit. That happens because the PDF contains a linked list of more than 1000 items. |
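To illustrate the failure mode described here, a minimal sketch follows. It does not use PyPDF2 at all: the `"Next"` key and all function names are assumptions made for the example, with plain dicts standing in for PDF objects that form a linked list.

```python
import sys


def build_chain(length):
    # Build a linked list of dicts iteratively; "Next" is a hypothetical key.
    head = None
    for _ in range(length):
        head = {"Next": head}
    return head


def count_recursive(node):
    # Recursive walk: one Python stack frame per linked item.
    if node is None:
        return 0
    return 1 + count_recursive(node["Next"])


def hits_recursion_limit(length):
    # Report whether the recursive walk fails on a chain of this length.
    try:
        count_recursive(build_chain(length))
    except RecursionError:
        return True
    return False


# A chain longer than the interpreter's limit cannot be walked recursively.
too_long = sys.getrecursionlimit() + 100
print(hits_recursion_limit(too_long))
```

Since CPython's default recursion limit is 1000, a linked list of more than 1000 items is enough to trigger this.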
|
Maybe we should work on getting #351 ready first so that we don't hit the recursion limit anymore? |
|
This can be done with an iterative algorithm like that. |
|
This has now been converted to an iterative version. Some tests needed updating because the warnings are no longer raised. |
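As a rough illustration of the general technique, not the code merged in this PR, an iterative depth-first traversal replaces the call stack with an explicit list. The function name `sweep_iterative` and the use of plain dicts and lists in place of PDF dictionary and array objects are assumptions made for the sketch.

```python
def sweep_iterative(root):
    """Collect all leaf values reachable from root, depth-first, without recursion."""
    visited_ids = set()  # ids of containers already expanded (guards against cycles)
    leaves = []
    stack = [root]       # explicit stack instead of the interpreter's call stack
    while stack:
        obj = stack.pop()
        if isinstance(obj, dict):
            children = list(obj.values())
        elif isinstance(obj, list):
            children = obj
        else:
            leaves.append(obj)
            continue
        if id(obj) in visited_ids:
            continue
        visited_ids.add(id(obj))
        # Push in reverse so children are visited in their original order.
        stack.extend(reversed(children))
    return leaves


# A chain far deeper than the default recursion limit is handled fine.
chain = "end"
for i in range(5000):
    chain = {"Next": chain, "Value": i}
print(len(sweep_iterative(chain)))
```

The depth that can be handled is then bounded by available memory rather than by the interpreter's recursion limit.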
Codecov Report
@@            Coverage Diff             @@
##             main    #1072      +/-   ##
==========================================
+ Coverage   91.50%   91.57%   +0.07%
==========================================
  Files          24       24
  Lines        4530     4524       -6
  Branches      927      926       -1
==========================================
- Hits         4145     4143       -2
+ Misses        245      241       -4
  Partials      140      140
Continue to review full report at Codecov.
|
|
Wow, this is amazing @Hatell! Thank you 🙏 🤗 I will review it today, but it might take until the evening :-) |
|
This is now ready for testing. The main changes are:
One fix still needs to be done: recalculating the hashes of all parents when the value of a dictionary or array object changes. |
If data changes, the keys of all parents are updated. Added checks to the tests to verify that all keys in _idnum_hash are valid.
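The idea of keying objects by a content hash, and why a key goes stale after an in-place change, can be sketched as follows. This is only an illustration under stated assumptions: `DedupRegistry`, `content_hash`, `rehash`, and `keys_are_valid` are hypothetical names, JSON-serialisable dicts stand in for PDF objects, and none of this is PyPDF2's actual `_idnum_hash` implementation.

```python
import hashlib
import json


def content_hash(obj):
    # Canonical JSON text, so equal content always yields an equal digest.
    canonical = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


class DedupRegistry:
    """Hypothetical registry that assigns one number per distinct content."""

    def __init__(self):
        self._by_hash = {}   # digest -> object number
        self._objects = []   # object number - 1 -> object

    def add(self, obj):
        digest = content_hash(obj)
        if digest in self._by_hash:
            return self._by_hash[digest]  # duplicate: reuse the existing number
        self._objects.append(obj)
        number = len(self._objects)
        self._by_hash[digest] = number
        return number

    def rehash(self, number):
        # Refresh the key of an object that was mutated in place.
        stale = [d for d, n in self._by_hash.items() if n == number]
        for digest in stale:
            del self._by_hash[digest]
        self._by_hash[content_hash(self._objects[number - 1])] = number

    def keys_are_valid(self):
        # Every stored digest must still match its object's current content.
        return all(
            content_hash(self._objects[n - 1]) == d
            for d, n in self._by_hash.items()
        )


reg = DedupRegistry()
font = {"Type": "Font", "Name": "F1"}
first = reg.add(font)
second = reg.add({"Name": "F1", "Type": "Font"})  # same content, same number
font["Name"] = "F2"   # in-place change makes the stored key stale
reg.rehash(first)     # recompute it so lookups keep working
```

In this sketch a nested child is hashed as part of its parent, so changing a child means every object that contains it needs the same `rehash` treatment, which mirrors the "all parents" requirement mentioned above.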
|
I think I solved this issue for recalculating hashes when updating a dictionary or array object. |
|
Thank you so much for all the effort @Hatell ! I've adjusted the title of the PR and the first message of it. I will use them for the squash commit to represent all of the changes done here. Feel free to adjust if you think there should be something added / adjusted. |
|
If you want, you can also remove the
in |
|
I'm currently letting a bigger test run through. So far, it looks good. I'm still a tiny bit worried as this is such a core part of PyPDF2 😅 |
|
Great, and thanks for the help. |
|
Thank you for your contribution ❤️ I'll make a release in a couple of hours |
New Features (ENH):
- Add PageObject._get_fonts (#1083)
- Add support for indexed color spaces / BitsPerComponent for decoding PNGs (#1067)

Performance Improvements (PI):
- Use iterative DFS in PdfWriter._sweep_indirect_references (#1072)

Bug Fixes (BUG):
- Let Page.scale also scale the crop-/trim-/bleed-/artbox (#1066)
- Column default for CCITTFaxDecode (#1079)

Robustness (ROB):
- Guard against None-value in _get_outlines (#1060)

Documentation (DOC):
- Stamps and watermarks (#1082)
- OCR vs PDF text extraction (#1081)
- Python Version support
- Formatting of CHANGELOG

Developer Experience (DEV):
- Cache downloaded files (#1070)
- Speed-up for CI (#1069)

Maintenance (MAINT):
- Set page.rotate(angle: int) (#1092)
- Issue #416 was fixed by #1015 (#1078)

Testing (TST):
- Image extraction (#1080)
- Image extraction (#1077)

Code Style (STY):
- Apply black
- Typo in Changelog

Full Changelog: 2.4.2...2.4.3
PdfWriter.external_reference_map and calculate a hash from every referenced object and use that to detect duplicate objects.
Closes #351
Closes #1036