STY: Simplify warnings & debugging in layout mode text extraction #3271
Codecov Report

✅ All modified and coverable lines are covered by tests.

| | main | #3271 | +/- |
|---|---|---|---|
| Coverage | 96.62% | 96.61% | -0.01% |
| Files | 53 | 53 | |
| Lines | 8966 | 8963 | -3 |
| Branches | 1661 | 1661 | |
| Hits | 8663 | 8660 | -3 |
| Misses | 181 | 181 | |
| Partials | 122 | 122 | |
## What's new

### New Features (ENH)
- Add support for `IndirectObject.__iter__` (#3228) by @bryan-brancotte
- Allow filtering by font when removing text (#3216) by @samuelbradshaw

### Bug Fixes (BUG)
- Add missing named destinations being ByteStringObjects (#3282) by @stefan6419846
- Get font information more reliably when removing text (#3252) by @samuelbradshaw
- T* 2D Translation consistent with PDF 1.7 Spec (#3250) by @hackowitz-af
- Add font stack to q/Q operations in layout mode (#3225) by @hackowitz-af
- Avoid completely hiding image loading issues like exceeding image size limits (#3221) by @stefan6419846
- Using compress_identical_objects on transformed content duplicates differing content (#3197) by @danio
- Consider BlackIs1 parameter for CCITTFaxDecode filter (#3196) by @stefan6419846

### Robustness (ROB)
- Deal with insufficient cm matrix during text extraction (#3283) by @stefan6419846
- Allow merging when annotations miss D entry (#3281) by @stefan6419846
- Fix merging documents if there are no Dests (#3280) by @stefan6419846
- Fix crash on malformed action in outline (#3278) by @larsga
- Fix compression issues for removed images which might be None (#3246) by @stefan6419846
- Attempt to deal with non-rectangular FlateDecode streams (#3245) by @stefan6419846
- Handle some None values for broken PDF files (#3230) by @stefan6419846

### Developer Experience (DEV)
- Multiple style improvements by @j-t-1
- Update ruff to 0.11.0 by @stefan6419846

### Maintenance (MAINT)
- Conform ASCIIHexDecode implementation to specification (#3274) by @j-t-1
- Modify comments of filters that do not use decode_parms (#3260) by @j-t-1

### Code Style (STY)
- Simplify warnings & debugging in layout mode text extraction (#3271) by @hackowitz-af
- Standardize mypy assert statements (#3276) by @j-t-1

[Full Changelog](5.4.0...5.5.0)
Nice work! Much cleaner... My thought process at the time was to save memory in the 'non-debug' use case by letting
The existing code for warning about rotated text or uninterpretable fonts is slightly overcomplicated. It runs two nested loops to decide whether there is anything to warn about, and needs local variables to flag whether a warning has already been emitted. It also commingles debug-only logic with the operational code by re-checking, for each group of operators, whether they will be debugged later.
This PR clarifies both by waiting until all operators are collected before searching them for warn-able data or deciding whether to debug them. This removes the need for several variables, the nested loops, and several conditional expressions within the loop.
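
For readers who want the shape of the change without opening the diff, here is a minimal sketch of the "collect first, then inspect" pattern described above. It is not the pypdf implementation: `TextOp`, `find_warnings`, `extract`, and the warning messages are all hypothetical names invented for illustration, and the real operator objects carry far more state.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("layout_mode_sketch")


@dataclass
class TextOp:
    """Hypothetical stand-in for one collected text operator (not a pypdf class)."""

    text: str
    rotated: bool = False
    font_known: bool = True


def find_warnings(ops):
    """Scan the fully collected operators; yields at most one message per condition."""
    messages = []
    if any(op.rotated for op in ops):
        messages.append("rotated text found; layout output may be incomplete")
    if any(not op.font_known for op in ops):
        messages.append("uninterpretable font found; some text may be missing")
    return messages


def extract(groups, debug=False):
    # Step 1: collect only. No "already warned" flags or debug checks in the loop.
    ops = [op for group in groups for op in group]

    # Step 2: search the finished list for warn-able data.
    messages = find_warnings(ops)
    for message in messages:
        logger.warning(message)

    # Step 3: decide once, after collection, whether to build debug output.
    debug_dump = [repr(op) for op in ops] if debug else None
    return ops, messages, debug_dump


groups = [
    [TextOp("a"), TextOp("b", rotated=True)],
    [TextOp("c", rotated=True, font_known=False)],
]
ops, messages, debug_dump = extract(groups)
```

Because `any()` short-circuits and each condition is checked once over the flat list, no flag variable is needed to suppress duplicate warnings, and the debug decision no longer sits inside the collection loop.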