BUG: Fix incorrect tm_matrix in call to visitor_text#2060
BUG: Fix incorrect tm_matrix in call to visitor_text#2060MartinThoma merged 4 commits intopy-pdf:mainfrom
Conversation
Supply the old tm_matrix when flushing out `text` to the `visitor_text` in `crlf_space_check`.
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #2060 +/- ##
==========================================
+ Coverage 94.17% 94.28% +0.10%
==========================================
Files 41 41
Lines 7333 7365 +32
Branches 1442 1451 +9
==========================================
+ Hits 6906 6944 +38
+ Misses 266 262 -4
+ Partials 161 159 -2
☔ View full report in Codecov by Sentry. |
|
@troethe |
Check the coordinates computed from the tm_matrix and cm_matrix that are given to `visitor_text` when extracting text. The test computes the coordinates of three letters in different parts of a test page based on the matrices given to the `visitor_text` function and checks, if they are roughly where they should be.
|
@pubpub-zz : All right, I added a new test on a simplified test document where the correct positioning is a bit more apparent. Let me know if you have any feedback. Edit: Also note that my original PR was wrong, because I passed the |
|
@troethe From my side we are good to go :-) Please adjust the top comment of the PR in such a way that it can be the commit message of the squash-merge. For example: One last question:
What do you mean by "preliminary"? Doesn't this fix the mentioned issue? |
|
@pubpub-zz I guess this PR is fine from your perspective? |
Yes, it does, but when I originally submitted this, I didn't quite understand all of the code yet and I had not thoroughly tested it as shown by the fact that it ended up not being quite right. I guess I should have called it a WIP/draft instead. |
|
Ok, then please let me know when you think I can merge :-) |
Supply the old tm_matrix when flushing out `text` to the `visitor_text` in `crlf_space_check`. The new one might already be changed and unrelated to the current text. Also add a test for the tm_matrix and cm_matrix that are given to `visitor_text` when extracting text. The test computes the coordinates of three letters in different parts of a test page based on the matrices and checks, if they are roughly where they should be. Fixes py-pdf#2059
7c4c1c4 to
2e42262
Compare
|
@MartinThoma Ok, I think it's good now! |
|
No problem :-) Thank you for the contribution 🙏 I will make a release that contains it today |
|
If you want, I can add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-) |
## What's new ### Performance Improvements (PI) - optimize _decode_png_prediction (#2068) ### Bug Fixes (BUG) - Fix incorrect tm_matrix in call to visitor_text (#2060) - Writing German characters into form fields (#2047) - Prevent stall when accessing image in corrupted pdf (#2081) - append() fails when articles do not have /T (#2080) ### Robustness (ROB) - Cope with xref not followed by separator (#2083) [Full Changelog](3.15.0...3.15.1)
Sure, why not. :) Thank you! |
Supply the old tm_matrix when flushing out
textto thevisitor_textin
crlf_space_check. The new one might already be changed andunrelated to the current text.
Also add a test for the tm_matrix and cm_matrix that are given to
visitor_textwhen extracting text.The test computes the coordinates of three letters in different
parts of a test page based on the matrices and checks, if they are
roughly where they should be.
Fixes #2059