
ExtractText2 #929

Merged
MartinThoma merged 7 commits into py-pdf:pubpub-zz-extractText from pubpub-zz:ExtractText2
Jun 5, 2022

Conversation

@pubpub-zz
Collaborator

New proposal for evaluation for the time being.

pubpub-zz added 2 commits May 30, 2022 12:59
new proposal with deeper analysis of font data and text positioning
@pubpub-zz
Collaborator Author

New proposal.
@MartinThoma, can you review this proposal with the benchmark test?

@MartinThoma
Member

I'll start it and will post the results this evening (might take 1-2h; I need to finish some other stuff)

@MartinThoma
Member

The average stayed the same. Most files improved, but one became drastically worse:

https://arxiv.org/pdf/1601.03642 : 0.9438654353562005 -> 0.95
https://arxiv.org/pdf/1602.06541 : 0.8978933061501869 -> 0.91
https://arxiv.org/pdf/1707.09725 : 0.9100581720093184 -> 0.94
https://arxiv.org/pdf/2201.00021 : 0.9499215589133845 -> 0.97
https://arxiv.org/pdf/2201.00022 : 0.9102201679631884 -> 0.93
https://arxiv.org/pdf/2201.00029 : 0.0 -> 0.0
https://arxiv.org/pdf/2201.00037 : 0.9155486607869612 -> 0.94
https://arxiv.org/pdf/2201.00069 : 0.8980679211032767 -> 0.91
https://arxiv.org/pdf/2201.00151 : 0.8859883219294902 -> 0.64 <------
https://arxiv.org/pdf/2201.00178 : 0.8927337030785306 -> 0.92
https://arxiv.org/pdf/2201.00200 : 0.9683510183687691 -> 0.98
https://arxiv.org/pdf/2201.00201 : 0.9747879942829919 -> 0.99
https://arxiv.org/pdf/2201.00214 : 0.8850769765492426 -> 0.81 <------
https://github.com/py-pdf/sample-files/raw/main/009-pdflatex-geotopo/GeoTopo-book : 0.7860457992901709 -> 0.86
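
For readers who want to spot regressions like the two flagged above without scanning the list by eye, here is a minimal sketch. The `scores` dictionary just copies the numbers from this comment; the `regressions` helper and its `tolerance` parameter are hypothetical names for illustration and are not part of the benchmark repository.

```python
from typing import Dict, List, Tuple

# (score before, score after) per document, copied from the list above.
scores: Dict[str, Tuple[float, float]] = {
    "1601.03642": (0.9438654353562005, 0.95),
    "1602.06541": (0.8978933061501869, 0.91),
    "1707.09725": (0.9100581720093184, 0.94),
    "2201.00021": (0.9499215589133845, 0.97),
    "2201.00022": (0.9102201679631884, 0.93),
    "2201.00029": (0.0, 0.0),
    "2201.00037": (0.9155486607869612, 0.94),
    "2201.00069": (0.8980679211032767, 0.91),
    "2201.00151": (0.8859883219294902, 0.64),
    "2201.00178": (0.8927337030785306, 0.92),
    "2201.00200": (0.9683510183687691, 0.98),
    "2201.00201": (0.9747879942829919, 0.99),
    "2201.00214": (0.8850769765492426, 0.81),
    "GeoTopo-book": (0.7860457992901709, 0.86),
}


def regressions(
    results: Dict[str, Tuple[float, float]], tolerance: float = 0.0
) -> List[str]:
    """Return the documents whose score dropped by more than `tolerance`."""
    return [
        name
        for name, (before, after) in results.items()
        if after < before - tolerance
    ]


if __name__ == "__main__":
    for name in regressions(scores):
        before, after = scores[name]
        print(f"{name}: {before:.2f} -> {after:.2f}")
```

A non-zero `tolerance` would ignore small fluctuations; with the default of 0.0 any decrease is reported.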

@MartinThoma
Member

This is an excerpt from the file that became so much worse (left is the current PyPDF2==1.28.4 version, right is this PR's version):

image

@pubpub-zz
Collaborator Author

pubpub-zz commented May 30, 2022

New draft proposal in which bugs (which also affected the first proposal) are fixed. @MartinThoma, can you rerun the benchmark?

I will also have a look at #858 in order to get the best of both.

pubpub-zz added 2 commits May 31, 2022 15:06
Includes:
* XObject processing
* choice between encoding and ToUnicode fields
* partial compliance with Identity-H/V encoding (2-byte processing still missing)

* legacy conversion reintroduced as "old" for comparison
* debug extraction
* typing and tests
increase tests and refactor the deprecation-warning ignores in tests
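
As background for the Identity-H/V item in the commit notes above: with these encodings each character code in a string is two bytes wide, so the bytes have to be consumed in big-endian pairs before any Unicode lookup. The following is only an illustrative sketch under that assumption, not the code from this PR; `decode_two_byte_codes` and the `to_unicode` dictionary are hypothetical names.

```python
from typing import Dict


def decode_two_byte_codes(data: bytes, to_unicode: Dict[int, str]) -> str:
    """Decode a string of 2-byte big-endian character codes.

    `to_unicode` is a hypothetical mapping from character code to text,
    standing in for whatever was parsed from the font's ToUnicode data.
    """
    if len(data) % 2 != 0:
        raise ValueError("expected an even number of bytes")
    chars = []
    for i in range(0, len(data), 2):
        code = int.from_bytes(data[i : i + 2], "big")
        # Unknown codes become U+FFFD instead of raising an error.
        chars.append(to_unicode.get(code, "\ufffd"))
    return "".join(chars)


if __name__ == "__main__":
    mapping = {0x0001: "H", 0x0002: "i"}
    print(decode_two_byte_codes(b"\x00\x01\x00\x02", mapping))
```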
@MartinThoma added the workflow-text-extraction label (From a user's perspective, text extraction is the affected feature/workflow) Jun 1, 2022
@MartinThoma
Member

MartinThoma commented Jun 4, 2022

@pubpub-zz I would like to get the Charmap support into PyPDF2 soon and give you (and some others who made very similar PRs before) full credit for your work. For this reason I would like to avoid merging #924.

I suggest the following:

  1. Improve Text Extraction #881 is the PR we merge into main next. Currently the CI is failing - I can take care of that if you want. Also, I need to check that the quality according to the benchmark stays roughly the same. I would add asabramo and VictorCarlquist as co-authored-by as they have done similar PRs in the past. Would that be ok for you?
  2. I close Pubpub zz extract text #924 - I just created that branch to show some minor mypy / style things I would change in Improve Text Extraction #881.
  3. We / I go through the following PRs to check if something is missing:

@pubpub-zz
Collaborator Author

@MartinThoma sorry to bother you, but can you rerun the benchmark on this version?
I will have a look at the others.

@MartinThoma
Member

No problem - I'm happy that you're doing the heavy-lifting 😄

I've just started the benchmark run. I'll share the results tomorrow morning (takes ~20 minutes and I'll go to bed now 😄 )

@MartinThoma
Member

MartinThoma commented Jun 5, 2022

I get

Traceback (most recent call last):
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1331, in buildCharMap
    raise Exception("null width")
Exception: null width

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/moose/Github/py-pdf/benchmarks/benchmark.py", line 530, in <module>
    main(docs, libraries, add_text_extraction_quality=True)
  File "/home/moose/Github/py-pdf/benchmarks/benchmark.py", line 235, in main
    text = lib.text_extraction_function(data)
  File "/home/moose/Github/py-pdf/benchmarks/benchmark.py", line 140, in pypdf2_get_text
    text += page.extractText()
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1506, in extractText
    return self.extract_text(Tj_sep=Tj_sep, TJ_sep=TJ_sep)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1482, in extract_text
    return self._extract_text(self,self.pdf,space_width, PG.CONTENTS)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1357, in _extract_text
    cmaps[f] = buildCharMap(f)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1344, in buildCharMap
    sp_width = m / cpt / 2
ZeroDivisionError: division by zero

for https://github.com/py-pdf/sample-files/raw/main/009-pdflatex-geotopo/GeoTopo.pdf - reader.pages[13]:

import PyPDF2

reader = PyPDF2.PdfFileReader("GeoTopo.pdf")
page = reader.pages[13]
page.extract_text()

@MartinThoma
Member

MartinThoma commented Jun 5, 2022

I've added the fallback

if cpt == 0:
    cpt = 1
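
To show where such a guard sits, here is a standalone sketch of a space-width estimate that averages the non-zero glyph widths and halves the result, mirroring the `sp_width = m / cpt / 2` line from the traceback. It is an assumption about the shape of that computation, not the actual `buildCharMap` code; `estimate_space_width` and its argument are hypothetical.

```python
from typing import Iterable


def estimate_space_width(widths: Iterable[float]) -> float:
    """Estimate a space width as half the mean of the non-zero glyph widths."""
    total = 0.0  # plays the role of `m` in the traceback
    count = 0    # plays the role of `cpt` in the traceback
    for width in widths:
        if width > 0:
            total += width
            count += 1
    # Same fallback as above: avoid dividing by zero when the font
    # reports no usable widths. The estimate is then 0.0.
    if count == 0:
        count = 1
    return total / count / 2


if __name__ == "__main__":
    print(estimate_space_width([500, 0, 700]))
    print(estimate_space_width([]))
```

Returning 0.0 for a font without widths keeps extraction running; whether a different default would give better spacing is left open here.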

With that fallback, your PR currently boosts the average from 86% to 96%!
edit: That means PyPDF2 has better text extraction than pdfminer.six and pdftotext 🎉

Looking at the single files:

            "1601.03642": 0.9789762968052216, -> 99%
            "1602.06541": 0.9607310932031617, -> 98%
            "1707.09725": 0.9160059659313918, -> 94%
            "2201.00021": 0.92414829121734, -> 97%
            "2201.00022": 0.9581322751904328, -> 98%
            "2201.00029": 0.0,  -> 98% -- you managed to do it! You're so awesome! Thank you!
            "2201.00037": 0.9228385160911429, -> 94%
            "2201.00069": 0.9320819588347349, -> 96%
            "2201.00151": 0.8986238392139712, -> 93%
            "2201.00178": 0.9035859338326836, -> 93%
            "2201.00200": 0.9411056870547374, -> 97%
            "2201.00201": 0.9444251579563376, -> 98%
            "2201.00214": 0.9625399637918416, -> 97%
            "GeoTopo-book": 0.7924142197687146 -> 86%
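
The quoted averages can be checked from the numbers listed above. This small sketch assumes the average is a plain unweighted mean over the 14 files and that the left-hand values are the baseline scores; the variable names are mine.

```python
from statistics import mean

# Left-hand scores (fractions) and right-hand scores (percent), as listed above.
before = [
    0.9789762968052216, 0.9607310932031617, 0.9160059659313918,
    0.92414829121734, 0.9581322751904328, 0.0, 0.9228385160911429,
    0.9320819588347349, 0.8986238392139712, 0.9035859338326836,
    0.9411056870547374, 0.9444251579563376, 0.9625399637918416,
    0.7924142197687146,
]
after_percent = [99, 98, 94, 97, 98, 98, 94, 96, 93, 93, 97, 98, 97, 86]

mean_before_percent = mean(before) * 100
mean_after_percent = mean(after_percent)

if __name__ == "__main__":
    print(f"before: {mean_before_percent:.1f}%  after: {mean_after_percent:.1f}%")
```

Under those assumptions the two means round to 86% and 96%, which matches the figures quoted in the comment.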

@MartinThoma
Member

MartinThoma commented Jun 5, 2022

@pubpub-zz I love you 🤩 🤗 This is a crazy improvement! Now I really want it to be merged 😄

Please let me know how you would like me to continue. Should I merge pubpub-zz:ExtractText2 into py-pdf:pubpub-zz-extractText and then that one into main?

@pubpub-zz
Collaborator Author

The PR you've referenced will surely improve some translations.
Here is what I would propose:
a) I clean up the flake8 / mypy issues to confirm that we will pass all tests.
b) You merge Extract2 into pubpub-Extract and then into main.

In my current branch the legacy function is still present as extract_oldtext, so people can revert to it if they prefer.
I will carry on from this branch with a new PR based on the latest main to introduce the other changes.

@MartinThoma
Member

Sounds good! Then I'll wait for your ok to get started :-)

@pubpub-zz
Collaborator Author

I think you should be able to merge this release

@MartinThoma
Member

You mean I can merge this PR now? (just want to be sure :-) )

@pubpub-zz
Collaborator Author

Go :)

@MartinThoma MartinThoma merged commit d957d4d into py-pdf:pubpub-zz-extractText Jun 5, 2022
@pubpub-zz pubpub-zz deleted the ExtractText2 branch June 10, 2022 19:57