Skip to content

BUG: Missing new line in extract_text with cm operations#2142

Merged
MartinThoma merged 8 commits intopy-pdf:mainfrom
pubpub-zz:iss2138
Sep 17, 2023
Merged

BUG: Missing new line in extract_text with cm operations#2142
MartinThoma merged 8 commits intopy-pdf:mainfrom
pubpub-zz:iss2138

Conversation

@pubpub-zz
Copy link
Copy Markdown
Collaborator

closes #2138

@pubpub-zz
Copy link
Copy Markdown
Collaborator Author

@MartinThoma
I've met again the random issue with test_watermarking_speed. I've proposed an alternative way which sounds more robust.

thread_time not available in python 3.6
@codecov
Copy link
Copy Markdown

codecov bot commented Sep 3, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (b899d3d) 94.36% compared to head (c5cc158) 94.36%.
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2142   +/-   ##
=======================================
  Coverage   94.36%   94.36%           
=======================================
  Files          43       43           
  Lines        7572     7574    +2     
  Branches     1488     1488           
=======================================
+ Hits         7145     7147    +2     
  Misses        263      263           
  Partials      164      164           
Files Changed Coverage Δ
pypdf/_page.py 94.07% <100.00%> (+<0.01%) ⬆️
pypdf/_text_extraction/__init__.py 93.33% <100.00%> (+0.05%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@pubpub-zz
Copy link
Copy Markdown
Collaborator Author

@MartinThoma all yours. a rerun of the extract_text benchmark could be interesting

@stefan6419846
Copy link
Copy Markdown
Collaborator

Re-running the CI should be as easy as going to the corresponding CI run at https://github.com/py-pdf/pypdf/actions and clicking "Re-run all jobs" in the upper right corner, given the correct repository permissions.

@pubpub-zz
Copy link
Copy Markdown
Collaborator Author

This permission seems to be granted only to a limited of people.🫤



@pytest.mark.enable_socket()
@pytest.mark.timeout(4) # this was a lot slower before PR #2086
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that one is the cleaner option. If we change it, it should be a separate PR (but I'm uncertain if we should change it at all).

You mentioned that you had issues with that test. Was it local or in CI? What exactly was the issue / the error message?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it is not the good place, but wanted to discuss about.

The issue during the checks was just it failed with the timeout. It is not the first time this is why I've proposed this change.
I consider the time is ok but for me it is just because We are measuring too many instructions. That' why I've focused on the specif code of the merge.
If you like it just give me a go and I will generate a dedicated PR dealing with this change only

@pubpub-zz
Copy link
Copy Markdown
Collaborator Author

@MartinThoma
Can you clarify what you expect to merge it ?

@MartinThoma
Copy link
Copy Markdown
Member

@pubpub-zz Sorry for the delay.

The test_writer.py is a change I'm uncertain about in two directions: On the one hand, I'm uncertain if we should do it like this. I have never used thread_time and I'm uncertain about that. On the other hand, we might even want to use it in multiple directions.

Could you please remove the changes from test_writer.py? I would like to have that discussion in #2193 and decide there how we want to do it.

@MartinThoma
Copy link
Copy Markdown
Member

Besides test_writer.py, I'm very happy with the PR :-)

@pubpub-zz
Copy link
Copy Markdown
Collaborator Author

@MartinThoma
reverted

@MartinThoma MartinThoma changed the title BUG : Missing new line in extract_text with cm operations BUG: Missing new line in extract_text with cm operations Sep 17, 2023
@MartinThoma MartinThoma merged commit 5b45785 into py-pdf:main Sep 17, 2023
@MartinThoma
Copy link
Copy Markdown
Member

Thank you 🙏

I noticed that when I have a lot to do at work, I have a harder time taking decisions in my private life (including pypdf). Keeping PRs about one issue makes it easier for me 🙏

MartinThoma added a commit that referenced this pull request Sep 17, 2023
## What's new

### Bug Fixes (BUG)
-  Missing new line in extract_text with cm operations (#2142) by @pubpub-zz
-  _get_fonts not processing properly CIDFonts and annotations (#2194) by @pubpub-zz

### Documentation (DOC)
-  Sort list of contributors by @MartinThoma

### Developer Experience (DEV)
-  Give attribution in release notes (#2196) by @MartinThoma

### Maintenance (MAINT)
-  Update packages (#2195) by @MartinThoma
-  Rename PdfWriter.create_viewer_preferences to PdfWriter.create_viewer_preferences (#2190) by @marcstober
-  Mark `cryptography` as default (#2186) by @exiledkingcc

### Testing (TST)
-  Issue with merging pdfkit (#2191) by @MartinThoma

### Code Style (STY)
-   clean-up overriden variable (#2189) by @pubpub-zz

[Full Changelog](3.16.0...3.16.1)
@pubpub-zz pubpub-zz deleted the iss2138 branch August 9, 2024 08:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Line returns missing in text_extraction()

3 participants