fix at least py-pdf#880
line feeds are applied as stated in the ref
TODO : for segmented text, horizontal translation should be analysed to apply some space or delete some characters (to be analysed in crazyones.pdf)
TODO : font conversion (to be analysed in crazyones.pdf)
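To make the first TODO concrete, here is a minimal sketch of a gap-based spacing heuristic. It is not the code in this PR: the `Segment` tuple, the `join_segments` name, and the `ratio` threshold are all assumptions for illustration, and it only covers inserting spaces, not deleting characters.

```python
from typing import List, Optional, Tuple

# Hypothetical representation of one extracted text run: (text, x_start, x_end).
Segment = Tuple[str, float, float]


def join_segments(segments: List[Segment], space_width: float, ratio: float = 0.5) -> str:
    """Join runs on one line, adding a space where the horizontal gap is large.

    A space is inserted when the gap between two runs exceeds
    ratio * space_width. The threshold is an assumed value, not a measured one.
    """
    parts: List[str] = []
    prev_end: Optional[float] = None
    for text, x_start, x_end in segments:
        if prev_end is not None and x_start - prev_end > ratio * space_width:
            parts.append(" ")
        parts.append(text)
        prev_end = x_end
    return "".join(parts)


line = [("Hel", 0.0, 15.0), ("lo", 15.2, 25.0), ("world", 30.0, 55.0)]
joined = join_segments(line, space_width=5.0)
```

With these made-up coordinates, the small gap inside "Hello" is ignored and the larger gap before "world" becomes a space.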
|
@MartinThoma |
Codecov Report
@@ Coverage Diff @@
## main #881 +/- ##
==========================================
+ Coverage 78.25% 84.09% +5.83%
==========================================
Files 16 18 +2
Lines 4346 4068 -278
Branches 821 854 +33
==========================================
+ Hits 3401 3421 +20
+ Misses 758 460 -298
Partials 187 187
|
fix wrong characters in crazyones
fix space wrongly introduced
|
Deep changes in the text extraction. Some analysis is still needed for some fonts (TrueType, Type 3). Looking for some beta testers. |
|
Nice! I'll let it run with https://github.com/py-pdf/benchmarks this evening if I have some time :-) |
|
I fixed the merge conflicts due to the recent PEP8 renamings + I adjusted the PR to the new syntax. There are still mypy issues though. |
|
Running the benchmark, I get: |
various fixes +add pytest with pypdftest
Thanks for the test and the report. I've fixed it. I've also introduced a test using the testbench dataset, which uses urllib.request to download the files. This increases the test duration, so I've added a way to bypass it. I think this pull request should now be proposed for testing. PS: I merged with the new API, and two functions are missing that are needed to prevent compatibility loss: getPage and getObject |
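A minimal sketch of how such a download-based test with a bypass could be organised. The environment variable name and both helper functions are hypothetical; the mechanism actually used in this PR is not shown in the thread.

```python
import os
import urllib.request
from pathlib import Path
from typing import Mapping

# Hypothetical switch name; the PR's real bypass mechanism may differ.
SKIP_ENV_VAR = "SKIP_DOWNLOAD_TESTS"


def downloads_disabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Return True when the caller asked to bypass tests that need downloads."""
    return environ.get(SKIP_ENV_VAR, "") not in ("", "0")


def fetch_cached(url: str, cache_dir: Path) -> Path:
    """Download url into cache_dir once and return the local file path."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / url.rsplit("/", 1)[-1]
    if not target.exists():
        # Only hit the network (or file system, for file:// URLs) on a cache miss.
        with urllib.request.urlopen(url) as response:
            target.write_bytes(response.read())
    return target
```

In a pytest suite, a test could then be decorated with `pytest.mark.skipif(downloads_disabled(), reason="downloads bypassed")` so that the slow tests only run when wanted.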
also includes a little of rewrite
You need to change |
|
A general question: If I see such things, should I directly adjust it in your PR? (I know some people love it and others hate it) |
|
Thanks for the advice. I've tried to improve myself. |
|
Glancing at the first result of the benchmark, the results look amazing 😍 Left is old, right is new: It will take a while until the benchmark has completed. I'll share the results later :-) |
|
@MartinThoma, b) spacing: c) mypy. d)benchmark |
Full credit to pubpub-zz who introduced this change in #881 Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>
|
I've extracted the spacing improvement heuristic to #922. That should be possible to merge really soon |
|
I've just fixed the merge conflict I've introduced |
|
@pubpub-zz Here is the simplification for the casts: c488734 - essentially just assigning variable names to intermediate results |
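For readers who have not opened the commit, the general idea of naming intermediate results can be illustrated as follows. This is not the content of c488734: the dictionary below is a made-up stand-in for a parsed font object, and the variable names are chosen for the example only.

```python
from typing import Any, Dict, cast

# Made-up stand-in for a parsed font dictionary; keys mimic PDF names.
font: Dict[str, Any] = {"/FontDescriptor": {"/FontName": "/ABCDEF+Helvetica"}}

# Nested casts on one line: type-correct, but heavy to read.
name_nested = cast(str, cast(Dict[str, Any], font["/FontDescriptor"])["/FontName"])

# Same lookup, with each intermediate result given a name.
descriptor = cast(Dict[str, Any], font["/FontDescriptor"])
name_named = cast(str, descriptor["/FontName"])
```

Both forms behave identically at runtime, since `typing.cast` only informs the type checker; the second form is simply easier to review.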
|
Hey man, just want to say thank you for your hard work on this one. I was interested in making this improvement myself and ended up finding this PR. Really nice work, @pubpub-zz. Are you keeping a list of your past and future updates somewhere? I would like to follow along and see if I can help add HTML features to it (while avoiding disturbing your work) |
|
@MartinThoma, your proposal is quite heavy to read. May I propose this solution? |
|
That is completely fine for me as well :-) |
@LucasWolfgang, thanks for your nice comments. There are still some improvements to be made. I don't know when I will be able to release new changes.
|
@pubpub-zz, I was thinking of implementing the functionality available in PyMuPDF's get_text. Of course it would be a LOT of work, so I would start by just creating the blocks (paragraphs) and continue from there to extract font, style and images (as Base64 or maybe as references), one PR at a time. About positions, I am still unsure how I would use them besides ordering the blocks. |
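As a rough illustration of using positions only to order blocks, a sketch could look like this. The `Block` class and `reading_order` function are hypothetical and exist in neither PyPDF2 nor PyMuPDF; the sketch assumes default PDF user space, where y grows upward, and ignores multi-column layouts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """Hypothetical text block with the position of its top-left corner."""
    text: str
    x: float
    y: float


def reading_order(blocks: List[Block]) -> List[Block]:
    """Sort blocks top to bottom, then left to right.

    y is negated because, in default PDF user space, larger y means higher on the page.
    """
    return sorted(blocks, key=lambda block: (-block.y, block.x))


page = [
    Block("footer", 50, 40),
    Block("right column", 300, 700),
    Block("left column", 50, 700),
    Block("title", 50, 780),
]
ordered_texts = [block.text for block in reading_order(page)]
```

A real implementation would likely need a tolerance when comparing y values, since blocks on the same visual line rarely share exactly the same coordinate.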
|
Maybe what you could do is branch from my dev branch https://github.com/pubpub-zz/PyPDF2/tree/extractText |
|
@pubpub-zz #924 got merged 🎉 I'd make a minor release (2.1.0) on Monday. |
|
I've moved the buildCharMap function to its own module: 4baedb2 - I hope this simplifies future PRs / makes it easier for me to understand what they are actually doing. |

