ENH: Add orientation param for text_extraction (# 1071) by pubpub-zz · Pull Request #1175 · py-pdf/pypdf

pubpub-zz · 2022-07-27T20:33:57Z

Add a new capability to filter text extraction by orientation.

Deprecations: PageObject.extract_text no longer uses the Tj_sep and TJ_sep parameters.
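As a rough usage sketch of the new filter: the helper below only forwards a tuple of orientations (in degrees) to `extract_text`. The helper name `extract_by_orientation` is hypothetical, and the `orientations` keyword is an assumption based on this PR's title; check the signature in your installed version before relying on it.

```python
from typing import Any, Tuple


def extract_by_orientation(page: Any, orientations: Tuple[int, ...] = (0,)) -> str:
    """Return the text of `page`, limited to the given orientations in degrees.

    `page` is expected to behave like a PyPDF2 PageObject. The `orientations`
    keyword is assumed from this PR, not verified against every release.
    """
    return page.extract_text(orientations=orientations)


# Possible use with PyPDF2 installed (file name is a placeholder):
#   from PyPDF2 import PdfReader
#   page = PdfReader("example.pdf").pages[0]
#   upright_only = extract_by_orientation(page, (0,))
#   upright_and_rotated = extract_by_orientation(page, (0, 90))
```

Passing a tuple rather than a bare integer keeps the call uniform whether one or several orientations are wanted.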

cf #1071
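To illustrate what filtering by orientation can mean, here is a minimal standalone sketch that classifies a PDF text matrix `(a, b, c, d, e, f)` into one of four orientations and keeps only matching fragments. It is not taken from the PR diff: the function names, the `eps` threshold, and the fragment representation are all assumptions made for illustration. It relies only on the PDF convention that a counterclockwise rotation by θ has the matrix `[cos θ, sin θ, -sin θ, cos θ, 0, 0]`.

```python
from typing import Iterable, List, Tuple

Matrix = Tuple[float, float, float, float, float, float]


def orientation_of(matrix: Matrix, eps: float = 1e-6) -> int:
    """Map a text matrix to 0, 90, 180 or 270 degrees (hypothetical helper)."""
    _a, b, _c, d, _e, _f = matrix
    if d > eps:    # cos(theta) > 0: upright text
        return 0
    if d < -eps:   # cos(theta) < 0: upside-down text
        return 180
    if b > 0:      # sin(theta) > 0: rotated counterclockwise
        return 90
    return 270     # remaining case: rotated clockwise


def filter_fragments(
    fragments: Iterable[Tuple[Matrix, str]],
    orientations: Tuple[int, ...] = (0, 90, 180, 270),
) -> List[str]:
    """Keep the text of fragments whose orientation is in `orientations`."""
    allowed = set(orientations)
    return [text for matrix, text in fragments if orientation_of(matrix) in allowed]


fragments = [
    ((1, 0, 0, 1, 0, 0), "body text"),
    ((0, 1, -1, 0, 0, 0), "margin note"),
    ((-1, 0, 0, -1, 0, 0), "upside down"),
]
print(filter_fragments(fragments, (0,)))
```

Bucketing into four quadrant orientations is a simplification; text rotated by arbitrary angles would fall into whichever bucket the sign tests select.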


codecov · 2022-07-27T20:53:22Z

Codecov Report

Merging #1175 (03057ac) into main (8c532a0) will increase coverage by 0.02%.
The diff coverage is 97.77%.

@@            Coverage Diff             @@
##             main    #1175      +/-   ##
==========================================
+ Coverage   92.08%   92.11%   +0.02%     
==========================================
  Files          24       24              
  Lines        4866     4897      +31     
  Branches      996     1011      +15     
==========================================
+ Hits         4481     4511      +30     
  Misses        242      242              
- Partials      143      144       +1

Impacted Files	Coverage Δ
PyPDF2/_page.py	`92.81% <97.77%> (+0.24%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data


MartinThoma · 2022-07-30T05:13:55Z

Very nice!

It looks good to me - I will merge it tomorrow if the text extraction benchmark looks fine as well. So it should get into the release on Sunday :-)

MartinThoma · 2022-07-30T06:32:23Z

For 2201.00029 the score increased from 96.7% to 97.7%, the rest stayed the same 👍

Interestingly, it seems to have killed a lot of newlines.

I think I need to design a new benchmark which measures how well newlines are captured. At the moment, this is completely ignored for calculating the score.

MartinThoma · 2022-07-30T06:32:52Z

However, getting the spaces in / between words right is way more important. And there was the improvement 👍

New Features (ENH):
- Add ability to add hex encoded colors to outline items (#1186)
- Add support for pathlib.Path in PdfMerger.merge (#1190)
- Add link annotation (#1189)
- Add capability to filter text extraction by orientation (#1175)

Bug Fixes (BUG):
- Named Dest in PDF1.1 (#1174)
- Incomplete Graphic State save/restore (#1172)

Documentation (DOC):
- Update changelog url in package metadata (#1180)
- Table extraction (#1179)
- Mention pyHanko for signing PDF documents (#1178)
- We now have CMAP support (#1177)

Maintenance (MAINT):
- Consistant usage of warnings / log messages (#1164)
- Consistent terminology for outline items (#1156)

Code Style (STY):
- Apply pre-commit (#1188)

Full Changelog: 2.8.1...2.9.0

Introduced by 8a27fa4 (#1175)

ENH : add orientation param for text_extraction (# 1071)

22260eb

add new capability to filter text extraction on orientation

MartinThoma reviewed Jul 27, 2022


fix mypy

44b9108

MartinThoma reviewed Jul 28, 2022


MartinThoma reviewed Jul 28, 2022


pubpub-zz added 5 commits July 28, 2022 22:44

ensure compatibility with deprecated parameters

85495d3

fix flake8

2d0f576

fix mypy

f409aec

comment updated

d62fc3f

clean debug left behind

ae9047f

MartinThoma changed the title ~~ENH : add orientation param for text_extraction (# 1071)~~ ENH: Add orientation param for text_extraction (# 1071) Jul 30, 2022

MartinThoma added the "soon" label (PRs that are almost ready to be merged, issues that get solved pretty soon) Jul 30, 2022

Merge branch 'main' into Orientations

03057ac

MartinThoma merged commit 8a27fa4 into py-pdf:main Jul 30, 2022

MartinThoma pushed a commit that referenced this pull request Aug 5, 2022

DOC: Example for orientation parameter of extract_text (#1206)

a6b8fa6

Introduced by 8a27fa4 (#1175)

pubpub-zz deleted the Orientations branch August 8, 2022 06:55

MartinThoma merged 8 commits into py-pdf:main from pubpub-zz:Orientations
