ENH: Add orientation param for text_extraction (#1071) #1175
MartinThoma merged 8 commits into py-pdf:main
Add a new capability to filter text extraction by orientation.
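To make "filtering on orientation" concrete, here is a minimal sketch of one way text runs could be classified and filtered by the rotation part of their text matrix. It is an illustration under stated assumptions, not the code merged in this PR: the names `orientation_of` and `filter_by_orientation` are hypothetical, and the sketch assumes each run carries a PDF-style matrix `(a, b, c, d, e, f)` and that only the four axis-aligned orientations matter.

```python
from typing import Iterable, List, Sequence, Tuple, Union

Matrix = Sequence[float]  # PDF-style matrix (a, b, c, d, e, f)


def orientation_of(matrix: Matrix) -> int:
    """Classify a text matrix as 0, 90, 180 or 270 degrees (hypothetical helper).

    Assumption: a counter-clockwise rotation by theta has the form
    (cos t, sin t, -sin t, cos t, e, f), so the sign of d separates
    upright from upside-down text and the sign of b separates 90 from 270.
    """
    b, d = matrix[1], matrix[3]
    if d > 1e-6:
        return 0
    if d < -1e-6:
        return 180
    return 90 if b > 0 else 270


def filter_by_orientation(
    runs: Iterable[Tuple[str, Matrix]],
    orientations: Union[int, Tuple[int, ...]] = (0, 90, 180, 270),
) -> List[str]:
    """Keep only the text of runs whose orientation was requested."""
    if isinstance(orientations, int):
        # Treat a single int as a one-element tuple.
        orientations = (orientations,)
    return [text for text, matrix in runs if orientation_of(matrix) in orientations]


# Illustrative data: one run per axis-aligned orientation.
runs = [
    ("upright", (1, 0, 0, 1, 0, 0)),
    ("turned left", (0, 1, -1, 0, 0, 0)),
    ("upside down", (-1, 0, 0, -1, 0, 0)),
    ("turned right", (0, -1, 1, 0, 0, 0)),
]
upright_only = filter_by_orientation(runs, 0)
```

Text at arbitrary angles (for example 45 degrees) would need a policy decision that this sketch does not attempt to make.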
Codecov Report
@@            Coverage Diff             @@
##             main    #1175      +/-   ##
==========================================
+ Coverage   92.08%   92.11%   +0.02%
==========================================
  Files          24       24
  Lines        4866     4897      +31
  Branches      996     1011      +15
==========================================
+ Hits         4481     4511      +30
  Misses        242      242
- Partials      143      144       +1
Very nice! It looks good to me - I will merge it tomorrow if the text extraction benchmark looks fine as well. So it should get into the release on Sunday :-)
Interestingly, it seems to have killed a lot of newlines. I think I need to design a new benchmark which measures how well newlines are captured. At the moment, newlines are completely ignored when calculating the score.
However, getting the spaces in / between words right is way more important. And there was the improvement 👍
New Features (ENH):
- Add ability to add hex encoded colors to outline items (#1186)
- Add support for pathlib.Path in PdfMerger.merge (#1190)
- Add link annotation (#1189)
- Add capability to filter text extraction by orientation (#1175)

Bug Fixes (BUG):
- Named Dest in PDF1.1 (#1174)
- Incomplete Graphic State save/restore (#1172)

Documentation (DOC):
- Update changelog url in package metadata (#1180)
- Table extraction (#1179)
- Mention pyHanko for signing PDF documents (#1178)
- We now have CMAP support (#1177)

Maintenance (MAINT):
- Consistent usage of warnings / log messages (#1164)
- Consistent terminology for outline items (#1156)

Code Style (STY):
- Apply pre-commit (#1188)

Full Changelog: 2.8.1...2.9.0

Deprecations: PageObject.extract_text no longer uses the Tj_sep and TJ_sep parameters (cf. #1071).
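For callers, the change might look roughly like the sketch below. It assumes the new keyword is named `orientations` and accepts either a single int or a tuple of ints, which is my reading of the released `PageObject.extract_text` signature and not something this page states. The wrapper `extract_text_by_orientation` is a hypothetical helper; it relies only on duck typing, so it works with any object exposing a compatible `extract_text` method.

```python
from typing import Any, Iterable, Tuple, Union

Orientations = Union[int, Tuple[int, ...]]


def extract_text_by_orientation(
    pages: Iterable[Any], orientations: Orientations = (0,)
) -> str:
    """Join the text of each page, keeping only the requested orientations.

    Hypothetical helper. Each page is assumed to provide
    extract_text(orientations=...). No Tj_sep / TJ_sep arguments are
    passed, since the commit message says they are no longer used.
    """
    return "\n".join(page.extract_text(orientations=orientations) for page in pages)


# Possible usage, assuming PyPDF2 >= 2.9.0 is installed and that
# "example.pdf" (a placeholder file name) exists:
#
#   from PyPDF2 import PdfReader
#   reader = PdfReader("example.pdf")
#   print(extract_text_by_orientation(reader.pages, orientations=(0, 90)))
```

Defaulting the helper to `(0,)` is a design choice for this sketch only: it keeps upright text and drops rotated watermarks or margin labels, while passing a wider tuple restores them.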