BUG: layout mode text extraction ZeroDivisionError #2417
MartinThoma merged 2 commits into py-pdf:main
For fonts without an explicitly defined width for the " " character, it's still possible to generate a ZeroDivisionError when compiling TextStateParams objects in _fixed_width_page.recurs_to_target_op() if the font size or the Tz parameter has been set to 0. Discovered during processing of a "pre-OCR'd" image PDF having `{"/BaseFont": "/GlyphLessFont"}`.
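The failure mode described above can be illustrated with a minimal sketch. This is not pypdf's actual implementation: the helper names `space_tx` and `chars_for_gap`, the 1/1000 glyph-unit convention for the width argument, and the choice to return 0 as a fallback are all assumptions made for illustration. The sketch only shows how a font size or Tz (horizontal scaling) of 0 collapses a derived space width to zero, and how a truthiness guard avoids dividing by it.

```python
def space_tx(space_width: float, font_size: float, tz: float) -> float:
    """Hypothetical helper: rendered width of a space character.

    space_width is given in 1/1000 glyph units; tz is the horizontal
    scaling in percent (100.0 means unscaled).
    """
    return (space_width / 1000.0) * font_size * (tz / 100.0)


def chars_for_gap(gap: float, space_width: float, font_size: float, tz: float) -> int:
    """Hypothetical helper: how many fixed-width cells a horizontal gap spans."""
    width = space_tx(space_width, font_size, tz)
    if not width:
        # font_size or tz was 0, so the derived width is 0 as well.
        # Assumption: treat the gap as empty instead of dividing by zero.
        return 0
    return round(gap / width)
```

Without the guard, `gap / space_tx(...)` raises `ZeroDivisionError` as soon as either the font size or Tz is 0, which matches the situation reported for the `/GlyphLessFont` documents.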
Remove duplicate docstring for layout_mode_strip_rotated
Sorry for the quick patch, @MartinThoma, but we picked up a new client with "pre-OCR'd" image PDFs that contained a lot of handwritten text and this error popped up. Nothing urgent, so feel free to sit on it for a bit. Just wanted to get it out there while it was top of mind.
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2417   +/-   ##
=======================================
  Coverage   94.42%   94.43%
=======================================
  Files          49       49
  Lines        8007     8008    +1
  Branches     1616     1616
=======================================
+ Hits         7561     7562    +1
  Misses        276      276
  Partials      170      170
float 0.0 is already `falsy` and only a "true zero" float results in the ZeroDivisionError. I.e. the int() conversion isn't needed and will likely cause more harm than good.
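A small sketch of the point made in this review comment. The function names and the fallback value of 1.0 are hypothetical and not taken from the patch. The comment does not say exactly what harm the `int()` conversion would cause; truncation of small non-zero values is one plausible reading, and that is what the second function demonstrates.

```python
def ratio_truthy_guard(numerator: float, denominator: float, fallback: float = 1.0) -> float:
    """Guard on truthiness: both 0 and 0.0 are falsy, so either triggers the fallback."""
    return numerator / (denominator or fallback)


def ratio_int_guard(numerator: float, denominator: float, fallback: float = 1.0) -> float:
    """Guard via int(): truncates toward zero, so 0.5 becomes 0 and is wrongly replaced."""
    return numerator / (int(denominator) or fallback)
```

With a denominator of 0.5, `ratio_truthy_guard(1.0, 0.5)` divides by 0.5 as intended, while `ratio_int_guard(1.0, 0.5)` silently substitutes the fallback because `int(0.5)` is 0.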
I guess you tested the change with a private PDF that has this property?
No worries, I will never complain about any contribution that improves pypdf 😄
Yes, sorry. The offenders currently at my disposal all contain protected health information. I'll see if I can get our client to scan something over that doesn't. If so, I'll add a test case, but I'd put the odds of them getting back to me on that at ~50/50.
Thank you! I've merged the change as it provides value and I trust that you have tested it. It will be released next Sunday at the latest. Adding a test (to the sample-files repository) will ensure that we don't re-introduce this issue.
Thanks! Sounds good.
## What's new

### Bug Fixes (BUG)
- layout mode text extraction ZeroDivisionError (#2417) by @shartzog

### Testing (TST)
- Skip tests using fpdf2 if it's not installed (#2419) by @MartinThoma

[Full Changelog](4.0.0...4.0.1)