BUG: Add font stack to q/Q operations in layout mode by hackowitz-af · Pull Request #3225 · py-pdf/pypdf

hackowitz-af · 2025-03-28T17:32:49Z

This is my first contribution to open source - please give feedback!

This addresses issue #3212, and the tested PDF is the upload from that issue. Besides the test file, it fixes hundreds of pages of my team's documents, so it is a much-needed fix.

I'm sure there is a better way to handle the stack of text state(s), but with only 3 lines added, it does not seem worth complicating, in my opinion.

Add failing tests Closes py-pdf#3212

Closes py-pdf#3212 Enters/exits font stack in q/Q operations.

Pass ruff linting, add myself as a contributor Closes py-pdf#3212

codecov · 2025-03-28T17:53:21Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.59%. Comparing base (7f7fd95) to head (746656a).
Report is 66 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #3225   +/-   ##
=======================================
  Coverage   96.59%   96.59%           
=======================================
  Files          53       53           
  Lines        8950     8953    +3     
  Branches     1648     1648           
=======================================
+ Hits         8645     8648    +3     
  Misses        183      183           
  Partials      122      122

stefan6419846

Thanks for the PR. I have added some small remarks about the tests.

For the actual implementation, I see no obvious issue, as all tests still seem to pass, although @shartzog is of course invited to have a look at it before merging.

resources/garbled-font.layout.txt

tests/test_text_extraction.py

pypdf/_text_extraction/_layout_mode/_text_state_manager.py

shartzog

Great catch, @hackowitz-af. I think you're on the right track. Can you test w/ these updates in place?

EDIT: Upon further review, the font state is NOT preserved across graphic state saves/restores, so your original logic should work as submitted. I've already submitted an approval. Please disregard this review.

pypdf/_text_extraction/_layout_mode/_text_state_manager.py

shartzog

After some further reading of the PDF standard, I think this change is OK as is. I originally thought font state was preserved across graphics state saves and restores, but it turns out that's not the case.
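To make the q/Q discussion concrete, here is a minimal sketch of a font stack. It assumes, as I read the PDF specification, that the text state (font and size) is part of the graphics state, so `Q` brings back whatever font was active at the matching `q`. The class `FontStateSketch`, the helper `fonts_used`, and the simplified `(operator, operands)` tuples are hypothetical names for illustration only; this is not the code in `_text_state_manager.py`.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FontStateSketch:
    """Hypothetical stand-in for a text state manager; not pypdf's class."""

    font: Optional[str] = None
    font_size: float = 0.0
    # One saved (font, font_size) pair per unmatched q operator.
    font_stack: List[Tuple[Optional[str], float]] = field(default_factory=list)

    def set_font(self, name: str, size: float) -> None:
        """Handle Tf: select a font and size."""
        self.font, self.font_size = name, size

    def add_q(self) -> None:
        """Handle q: remember the font that is active right now."""
        self.font_stack.append((self.font, self.font_size))

    def remove_q(self) -> None:
        """Handle Q: restore the font saved by the matching q."""
        if self.font_stack:  # tolerate an unbalanced Q in a malformed stream
            self.font, self.font_size = self.font_stack.pop()


def fonts_used(ops):
    """Return the (font, size) active at each Tj in a simplified operator list."""
    state = FontStateSketch()
    seen = []
    for operator, operands in ops:
        if operator == "q":
            state.add_q()
        elif operator == "Q":
            state.remove_q()
        elif operator == "Tf":
            state.set_font(*operands)
        elif operator == "Tj":
            seen.append((state.font, state.font_size))
    return seen


# Simplified operator list (an assumption, not real parser output).
ops = [
    ("Tf", ("/F1", 12.0)),
    ("q", ()),
    ("Tf", ("/F2", 8.0)),
    ("Tj", ("inner",)),
    ("Q", ()),
    ("Tj", ("outer",)),
]
print(fonts_used(ops))  # [('/F2', 8.0), ('/F1', 12.0)]
```

Without the push and pop, the second `Tj` would still be decoded with `/F2`, which is one way text drawn after a `Q` could come out garbled.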

hackowitz-af · 2025-04-01T16:07:13Z

@stefan6419846 I addressed your requested changes and resolved the threads - but the PR is still blocked because of "requested changes". Can you please point me to any outstanding requests and make sure the rest are closed? Thanks!

stefan6419846

Thanks for the PR and your patience. Going to merge this now.

stefan6419846 · 2025-04-02T07:40:19Z

The outstanding request was from my initial review, which I had not updated while a second review was still in progress. Approving the PR myself is usually the last step before a merge.

hackowitz-af · 2025-04-02T16:34:12Z

Understood, thank you!

@bryan-brancotte

## What's new

### New Features (ENH)
- Add support for IndirectObject.__iter__ (#3228) by @bryan-brancotte
- Allow filtering by font when removing text (#3216) by @samuelbradshaw

### Bug Fixes (BUG)
- Add missing named destinations being ByteStringObjects (#3282) by @stefan6419846
- Get font information more reliably when removing text (#3252) by @samuelbradshaw
- T* 2D Translation consistent with PDF 1.7 Spec (#3250) by @hackowitz-af
- Add font stack to q/Q operations in layout mode (#3225) by @hackowitz-af
- Avoid completely hiding image loading issues like exceeding image size limits (#3221) by @stefan6419846
- Using compress_identical_objects on transformed content duplicates differing content (#3197) by @danio
- Consider BlackIs1 parameter for CCITTFaxDecode filter (#3196) by @stefan6419846

### Robustness (ROB)
- Deal with insufficient cm matrix during text extraction (#3283) by @stefan6419846
- Allow merging when annotations miss D entry (#3281) by @stefan6419846
- Fix merging documents if there are no Dests (#3280) by @stefan6419846
- Fix crash on malformed action in outline (#3278) by @larsga
- Fix compression issues for removed images which might be None (#3246) by @stefan6419846
- Attempt to deal with non-rectangular FlateDecode streams (#3245) by @stefan6419846
- Handle some None values for broken PDF files (#3230) by @stefan6419846

### Developer Experience (DEV)
- Multiple style improvements by @j-t-1
- Update ruff to 0.11.0 by @stefan6419846

### Maintenance (MAINT)
- Conform ASCIIHexDecode implementation to specification (#3274) by @j-t-1
- Modify comments of filters that do not use decode_parms (#3260) by @j-t-1

### Code Style (STY)
- Simplify warnings & debugging in layout mode text extraction (#3271) by @hackowitz-af
- Standardize mypy assert statements (#3276) by @j-t-1

[Full Changelog](5.4.0...5.5.0)

hackowitz-af added 3 commits March 28, 2025 10:59

BUG: Garbled Text in Layout Extraction

0ce9e29

Add failing tests Closes py-pdf#3212

BUG: Garbled Text in Layout Extraction

d26295f

Closes py-pdf#3212 Enters/exits font stack in q/Q operations.

BUG: Garbled Text in Layout Extraction

21dab5c

Pass ruff linting, add myself as a contributor Closes py-pdf#3212

hackowitz-af changed the title ~~Add font stack to q/Q operations in layout mode~~ BUG: Add font stack to q/Q operations in layout mode Mar 28, 2025

Mypy is cool! And I did not know about it...

33b19f6

stefan6419846 requested changes Mar 28, 2025

resources/garbled-font.layout.txt (outdated, resolved)

tests/test_text_extraction.py (outdated, resolved)

tests/test_text_extraction.py (outdated, resolved)

stefan6419846 reviewed Mar 28, 2025

pypdf/_text_extraction/_layout_mode/_text_state_manager.py (outdated, resolved)

hackowitz-af added 2 commits March 28, 2025 13:10

address review by @stefan6419846

5990c8d

ruffify the tests too

fbc4372

hackowitz-af requested a review from stefan6419846 March 28, 2025 19:58

Merge branch 'main' into main

746656a

shartzog suggested changes Apr 1, 2025

pypdf/_text_extraction/_layout_mode/_text_state_manager.py (resolved)

pypdf/_text_extraction/_layout_mode/_text_state_manager.py (resolved)

shartzog approved these changes Apr 1, 2025

stefan6419846 approved these changes Apr 2, 2025

stefan6419846 merged commit 499cd9d into py-pdf:main Apr 2, 2025
16 checks passed

stefan6419846 merged 7 commits into py-pdf:main from hackowitz-af:main