BUG: Using compress_identical_objects on transformed content duplicates differing content #3197
stefan6419846 merged 4 commits into py-pdf:main
Codecov Report

All modified and coverable lines are covered by tests ✅

| Metric | main | #3197 |
|---|---|---|
| Coverage | 96.54% | 96.54% |
| Files | 53 | 53 |
| Lines | 8935 | 8935 |
| Branches | 1642 | 1642 |
| Hits | 8626 | 8626 |
| Misses | 186 | 186 |
| Partials | 123 | 123 |
Thanks for the report and PR. Could you please add a corresponding test as well? The function you provided should be rather easy to transform into an integration test if you run …

As for why … (lines 1003 to 1018 in a548ca1): `_data` is not set unless you call `get_data()` explicitly. (`contents` is `page.get_contents()` in this case.)
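The effect of `_data` staying unset until `get_data()` is called can be illustrated with a toy model. This is only a sketch under stated assumptions: `LazyStream`, `hash_from_cache` and `hash_from_data` are hypothetical names invented for illustration, not pypdf classes or methods, and the actual fix in this PR may look different.

```python
import hashlib


class LazyStream:
    """Toy stand-in (hypothetical, not a pypdf class) for a stream
    whose bytes are only serialized on demand."""

    def __init__(self, operations):
        self.operations = operations  # parsed form, e.g. ["(1) Tj"]
        self._data = b""              # cached bytes, empty until get_data() runs

    def get_data(self):
        # Serialize the parsed operations and fill the cache.
        if not self._data:
            self._data = "\n".join(self.operations).encode()
        return self._data

    def hash_from_cache(self):
        # Reads the cache directly: every unserialized stream hashes alike.
        return hashlib.sha1(self._data).hexdigest()

    def hash_from_data(self):
        # Forces serialization first, so differing content hashes differently.
        return hashlib.sha1(self.get_data()).hexdigest()


page1 = LazyStream(["(1) Tj"])
page2 = LazyStream(["(2) Tj"])

# Both caches are still empty here, so the two hashes collide.
collide = page1.hash_from_cache() == page2.hash_from_cache()
# After forcing serialization the hashes reflect the real content.
distinct = page1.hash_from_data() != page2.hash_from_data()
print(collide, distinct)
```

In this model, a deduplication pass keyed on `hash_from_cache()` would treat the two pages' streams as identical and keep only one of them, which matches the symptom in the report (page 2 showing the content of page 1).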
@stefan6419846 That test would be nice, but I can't see how I can extract the text from the PdfWriter.

@stefan6419846 I've added a test that uses …

The current approach looks fine to me. I have added a comment about a small code style change we should add; afterwards, I see no problem with getting this merged.
## What's new

### New Features (ENH)
- Add support for IndirectObject.__iter__ (#3228) by @bryan-brancotte
- Allow filtering by font when removing text (#3216) by @samuelbradshaw

### Bug Fixes (BUG)
- Add missing named destinations being ByteStringObjects (#3282) by @stefan6419846
- Get font information more reliably when removing text (#3252) by @samuelbradshaw
- T* 2D Translation consistent with PDF 1.7 Spec (#3250) by @hackowitz-af
- Add font stack to q/Q operations in layout mode (#3225) by @hackowitz-af
- Avoid completely hiding image loading issues like exceeding image size limits (#3221) by @stefan6419846
- Using compress_identical_objects on transformed content duplicates differing content (#3197) by @danio
- Consider BlackIs1 parameter for CCITTFaxDecode filter (#3196) by @stefan6419846

### Robustness (ROB)
- Deal with insufficient cm matrix during text extraction (#3283) by @stefan6419846
- Allow merging when annotations miss D entry (#3281) by @stefan6419846
- Fix merging documents if there are no Dests (#3280) by @stefan6419846
- Fix crash on malformed action in outline (#3278) by @larsga
- Fix compression issues for removed images which might be None (#3246) by @stefan6419846
- Attempt to deal with non-rectangular FlateDecode streams (#3245) by @stefan6419846
- Handle some None values for broken PDF files (#3230) by @stefan6419846

### Developer Experience (DEV)
- Multiple style improvements by @j-t-1
- Update ruff to 0.11.0 by @stefan6419846

### Maintenance (MAINT)
- Conform ASCIIHexDecode implementation to specification (#3274) by @j-t-1
- Modify comments of filters that do not use decode_parms (#3260) by @j-t-1

### Code Style (STY)
- Simplify warnings & debugging in layout mode text extraction (#3271) by @hackowitz-af
- Standardize mypy assert statements (#3276) by @j-t-1

[Full Changelog](5.4.0...5.5.0)
`compress_identical_objects()` can result in lost content if `page.add_transformation` is used.

Test that creates bad output:
Input file:
two-different-pages.pdf
This contains "1" on page 1 and "2" on page 2
Result before fix:
result-before.pdf
This contains "1" on page 1 and "1" on page 2
Result after fix:
result-withfix.pdf
This contains "1" on page 1 and "2" on page 2
The issue is around the `EncodedStreamObject`, which gets converted into a `ContentStream` after the transformation. The `ContentStream` objects on the two pages have the same `obj.hash_value()`, even though they have different contents. Printing the `_data` value for the content streams showed it was empty after the transformation.

I tried creating a simple test to reproduce just the hash calculation for the content streams but couldn't get it to work. I'm not sure quite what is happening inside `page.add_transformation` to create the content stream with no data yet calculated. This is what I tried (inspired by `test_contentstream_arrayobject_containing_nullobject`):