Conversation
New proposal with deeper analysis of font data and text positioning
|
new proposal. |
|
I'll start it and will post the results this evening (might take 1-2h; I need to finish some other stuff) |
|
The average stayed the same. Most files improved, but one became drastically worse: |
|
New draft proposal fixing bugs (which also applied to the first proposal): @MartinThoma, can you rerun the benchmark? I will also have a look at #858 in order to get the best of both.
Includes:
* XObject processing
* choice between encoding and ToUnicode fields
* partial compliance with Identity-H/V encoding (2-byte processing still missing)
* legacy conversion reintroduced as old for comparison
* debug extraction
* typing and tests
Increase tests and refactor; ignore deprecation warnings in tests
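
To make the choice between encoding and ToUnicode fields more concrete, here is a minimal sketch of one way such a lookup could work. It is not the code from this PR: `decode_text`, its parameters, and the toy tables are hypothetical names invented for illustration, and the priority order (ToUnicode first, then the encoding table, then U+FFFD) is an assumption. The `code_width` argument shows how 2-byte codes, as used with Identity-H/V, might be handled.

```python
from typing import Dict, Optional


def decode_text(
    raw: bytes,
    to_unicode: Optional[Dict[int, str]],
    encoding: Optional[Dict[int, str]],
    code_width: int = 1,
) -> str:
    """Decode raw string bytes from a content stream into text.

    Hypothetical helper, not PyPDF2's API: prefer the font's ToUnicode
    map, fall back to its encoding table, and emit U+FFFD otherwise.
    """
    if code_width not in (1, 2):
        raise ValueError("code_width must be 1 or 2")
    out = []
    # Walk the bytes in fixed-size steps; a trailing partial code is ignored.
    for i in range(0, len(raw) - code_width + 1, code_width):
        code = int.from_bytes(raw[i : i + code_width], "big")
        if to_unicode is not None and code in to_unicode:
            out.append(to_unicode[code])  # ToUnicode wins when it has an entry
        elif encoding is not None and code in encoding:
            out.append(encoding[code])  # otherwise use the encoding table
        else:
            out.append("\ufffd")  # unknown code: replacement character
    return "".join(out)


# Toy tables, invented for illustration only.
TO_UNICODE = {0x01: "fi", 0x41: "A"}
ENCODING = {0x41: "A", 0x42: "B"}

print(decode_text(b"\x01AB", TO_UNICODE, ENCODING))
print(decode_text(b"\x00\x41\x00\x01", TO_UNICODE, None, code_width=2))
```

In this sketch a single ToUnicode entry can expand to several characters (the `fi` ligature), which is one reason a plain byte-to-character encoding table alone may not be enough.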
|
@pubpub-zz I would like to get the Charmap support into PyPDF2 soon and give you (+ some others who made very similar PRs before) full credit for your work. For this reason I would like to avoid merging #924. I suggest the following:
|
|
@MartinThoma sorry to bother you; can you rerun the benchmark on this version? |
|
No problem - I'm happy that you're doing the heavy-lifting 😄 I've just started the benchmark run. I'll share the results tomorrow morning (takes ~20 minutes and I'll go to bed now 😄 ) |
|
I get for https://github.com/py-pdf/sample-files/raw/main/009-pdflatex-geotopo/GeoTopo.pdf:
import PyPDF2
reader = PyPDF2.PdfFileReader("GeoTopo.pdf")
page = reader.pages[13]
page.extract_text() |
|
I've added the fallback. With that fallback, your PR currently boosts the average from 86% to Looking at the single files: |
|
@pubpub-zz I love you 🤩 🤗 This is a crazy improvement! Now I really want it to be merged 😄 Please let me know how you would like me to continue. Should I merge pubpub-zz:ExtractText2 into py-pdf:pubpub-zz-extractText and then that one into |
|
The PR you've referenced will surely improve some translations. In my current branch, the legacy function is still present as extract_oldtext, so people can revert to it if they prefer.
|
Sounds good! Then I'll wait for your ok to get started :-) |
|
I think you should be able to merge this release |
|
You mean I can merge this PR now? (just want to be sure :-) ) |
|
Go :) |

New proposal for evaluation for the time being