Skip to content

BUG: invalid cm/tm in visitor functions#2206

Merged
MartinThoma merged 15 commits intopy-pdf:mainfrom
pubpub-zz:iss2200
Oct 8, 2023
Merged

BUG: invalid cm/tm in visitor functions#2206
MartinThoma merged 15 commits intopy-pdf:mainfrom
pubpub-zz:iss2200

Conversation

@pubpub-zz
Copy link
Copy Markdown
Collaborator

@pubpub-zz pubpub-zz commented Sep 19, 2023

closes #2200
closes #2075
reworks and is still valid to close #2059

@codecov
Copy link
Copy Markdown

codecov bot commented Sep 19, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (4b090ba) 94.40% compared to head (8a6a0e5) 94.44%.
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2206      +/-   ##
==========================================
+ Coverage   94.40%   94.44%   +0.04%     
==========================================
  Files          43       43              
  Lines        7599     7621      +22     
  Branches     1501     1502       +1     
==========================================
+ Hits         7174     7198      +24     
+ Misses        262      261       -1     
+ Partials      163      162       -1     
Files Coverage Δ
pypdf/__init__.py 87.50% <100.00%> (ø)
pypdf/_text_extraction/__init__.py 93.70% <100.00%> (+0.36%) ⬆️
pypdf/_page.py 94.67% <88.00%> (+0.31%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@pubpub-zz
Copy link
Copy Markdown
Collaborator Author

@troethe,
From my analysis, the change you've proposed was not suffcient as the prev position may be overriden during the process.
I've created a PR where I've storing the cm/tm matrices when resetting the text variable (this normally occurs when meeting the Td which will occurs/set the position before restarting a new line
Can you have a review of my change?

@pubpub-zz pubpub-zz marked this pull request as ready for review September 20, 2023 19:10
Copy link
Copy Markdown
Contributor

@troethe troethe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I think this approach is currently the best way to handle #2075. The two comments I added are minutiae and not necessarily important.

delta_y = m[5] - m_prev[5]
k = math.sqrt(abs(m[0] * m[3]) + abs(m[1] * m[2]))
f = font_size * k
cm_prev = m
Copy link
Copy Markdown
Contributor

@troethe troethe Sep 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look quite right. I think this line can however be safely removed, because cm_prev doesn't get accessed until it is being assigned to again at the end of this function (i.e. this is currently a noop).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree.
Good point to have noticed it . Actually I've forgot to pass the multiplied cm_matrix to the visitor

pypdf/_page.py Outdated
0.0,
0.0,
] # will store previous tm_matrix
] # will store cm_matrix * tm_matrix
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the old comment was more correct, right? This change was probably a side effect of the revert.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct we store inhere only tm_matrix

@pubpub-zz pubpub-zz changed the title Terminal BUG: invalid cm/tm in visitor functions BUG: invalid cm/tm in visitor functions Sep 21, 2023
@pubpub-zz
Copy link
Copy Markdown
Collaborator Author

@troethe
I've taken into account your comments

@srogmann
I would like to have your opinion too:
I've also completed the PR passing the product of cm_matrix . tm_matrix as the users may not have an easy access to mult() function and because cm_matrix alone has no mean (for me)
also for me it is better to recommand to work on the cm_matrix as it will take into account all transformations()
finally looking at the code I currently see no difference between using visitor_operand_before() and visitor_operand_after().
Shouldn't we await from visitor_operand_before() to return the tuple operator,operands to allow the user to modify/cancel the processing of some operations ?

@srogmann
Copy link
Copy Markdown
Contributor

@pubpub-zz I also have no example of cm_matrix without tm_matrix at hand. PDF-reference, "5.3.3 Text Space Details", uses the product of cm_matrix and tm_matrix you mentioned.

An argument to stay at cm_matrix instead of the product might be the performance if the product is computed but not used. In that case the publish of a mult()-function might be useful.

In text-extraction-test 3 of tests/test_page.py I used visitor_operand_after to verify the Td-operation (see visitor_td).

The return of the operator by visit_operand_before() can support patches or studies like the one in #1389 (comment). Can or should the operand-list be modified inside visit_operand_before() by using append(...), clear(...), ...? Otherwise it would be better to return the tuple (operator, operands).

@pubpub-zz
Copy link
Copy Markdown
Collaborator Author

@pubpub-zz I also have no example of cm_matrix without tm_matrix at hand. PDF-reference, "5.3.3 Text Space Details", uses the product of cm_matrix and tm_matrix you mentioned.

An argument to stay at cm_matrix instead of the product might be the performance if the product is computed but not used. In that case the publish of a mult()-function might be useful.

In text-extraction-test 3 of tests/test_page.py I used visitor_operand_after to verify the Td-operation (see visitor_td).

The return of the operator by visit_operand_before() can support patches or studies like the one in #1389 (comment). Can or should the operand-list be modified inside visit_operand_before() by using append(...), clear(...), ...? Otherwise it would be better to return the tuple (operator, operands).

@MartinThoma
your opinion?

@MartinThoma
Copy link
Copy Markdown
Member

I don't have an opinion on this topic. To be honest, I haven't read the specs so I need to trust you on those changes.

If two of you agree that this PR is good (and CI is green), then I would merge :-)

@pubpub-zz
Copy link
Copy Markdown
Collaborator Author

@MartinThoma / @srogmann
the PR for me is mature.
I left the change of cm_matrix which now reflects the position/orientation/scale of the text in the page user space :
this may be considered as a change in the interface but I don't think many people are using the existing matrix and the change makes things easier to handle (the matrix description was very obscure)
I did not implement the new feature to use visit_operand_before to modify the operations. I will do this in an independant PR for traceability.

@MartinThoma MartinThoma added the soon PRs that are almost ready to be merged, issues that get solved pretty soon label Sep 26, 2023
@pubpub-zz
Copy link
Copy Markdown
Collaborator Author

Plus stdby

@MartinThoma MartinThoma removed the soon PRs that are almost ready to be merged, issues that get solved pretty soon label Oct 1, 2023
@MartinThoma
Copy link
Copy Markdown
Member

Plus stdby

I take this as "please standby", so I should not merge?

@pubpub-zz
Copy link
Copy Markdown
Collaborator Author

pubpub-zz commented Oct 1, 2023

Plus stdby

I take this as "please standby", so I should not merge?

Correct interpretation
@jianfan2012 used the visitor in an interesting (cf #2163 (reply in thread)) way where the multiplication of cm.tm is preventing the use
we need to keep then separate. I'm reverting my change update the documentation

@pubpub-zz
Copy link
Copy Markdown
Collaborator Author

@MartinThoma
All should be good now.

Co-authored-by: Martin Thoma <info@martin-thoma.de>
Co-authored-by: Martin Thoma <info@martin-thoma.de>
@MartinThoma MartinThoma merged commit bcd85c4 into py-pdf:main Oct 8, 2023
@MartinThoma
Copy link
Copy Markdown
Member

The cm/tm parts of the visitor functions are really hard to get right. Thank you for taking care of it @pubpub-zz 🙏

MartinThoma added a commit that referenced this pull request Oct 8, 2023
## What's new

### Bug Fixes (BUG)
-  invalid cm/tm in visitor functions (#2206) by @pubpub-zz
-  Encrypt / decrypt Stream object dictionaries (#2228) by @pilotandy
-  Support nested color spaces for the /DeviceN color space (#2241) by @Stefan
-  images property fails if NullObject in list (#2215) by @pubpub-zz

### Robustness (ROB)
-  XYZ destination to cope with missing left and top param (#2237) by @pubpub-zz

### Documentation (DOC)
-  Add pilotandy for #2228 as a contributor by @MartinThoma
-  Link to CONTRIBUTING.md in README.md (#2232) by @markpfeifle
-  Changelog by @MartinThoma

### Developer Experience (DEV)
-  Exclude tests from mypy checks by @MartinThoma
-  Unify mypy options and warn redundant workarounds (#2223) by @exiledkingcc
-  Stabilize Pillow test with Pillow missing (#2229) by @Stefan

### Maintenance (MAINT)
-  Update pinned packages (#2243) by @MartinThoma

### Testing (TST)
-  Regression test against non-deterministic accidental object reuse (#2244) by @MartinThoma

[Full Changelog](3.16.2...3.16.3)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

4 participants