ExtractText2 by pubpub-zz · Pull Request #929 · py-pdf/pypdf

pubpub-zz · 2022-05-30T17:27:16Z

New proposal for evaluation for the time being

New proposal with deeper analysis of font data and text positioning

pubpub-zz · 2022-05-30T17:29:35Z

new proposal.
@MartinThoma,
can you review this proposal with the testbench test?

MartinThoma · 2022-05-30T18:42:01Z

I'll start it and will post the results this evening (might take 1-2h; I need to finish some other stuff)

MartinThoma · 2022-05-30T19:03:35Z

The average stayed the same. Most files improved, but one became drastically worse:

https://arxiv.org/pdf/1601.03642 : 0.9438654353562005 -> 0.95
https://arxiv.org/pdf/1602.06541 : 0.8978933061501869 -> 0.91
https://arxiv.org/pdf/1707.09725 : 0.9100581720093184 -> 0.94
https://arxiv.org/pdf/2201.00021 : 0.9499215589133845 -> 0.97
https://arxiv.org/pdf/2201.00022 : 0.9102201679631884 -> 0.93
https://arxiv.org/pdf/2201.00029 : 0.0 -> 0.0
https://arxiv.org/pdf/2201.00037 : 0.9155486607869612 -> 0.94
https://arxiv.org/pdf/2201.00069 : 0.8980679211032767 -> 0.91
https://arxiv.org/pdf/2201.00151 : 0.8859883219294902 -> 0.64 <------
https://arxiv.org/pdf/2201.00178 : 0.8927337030785306 -> 0.92
https://arxiv.org/pdf/2201.00200 : 0.9683510183687691 -> 0.98
https://arxiv.org/pdf/2201.00201 : 0.9747879942829919 -> 0.99
https://arxiv.org/pdf/2201.00214 : 0.8850769765492426 -> 0.81 <----
https://github.com/py-pdf/sample-files/raw/main/009-pdflatex-geotopo/GeoTopo-book : 0.7860457992901709 -> 0.86

MartinThoma · 2022-05-30T19:07:07Z

This is an excerpt from the file that became so much worse (left is the current PyPDF2==1.28.4 version, right is this PR's version):

pubpub-zz · 2022-05-30T21:45:03Z

New draft proposal with bug fixes (the bugs also affected the first proposal). @MartinThoma, can you rerun the benchmark?

I will also have a look at #858 in order to get the best of both.

Includes:
* XObject processing
* choice between encoding and ToUnicode fields
* partial compliance with Identity-H/V encoding (processing of 2-byte codes still missing)
* legacy conversion reintroduced as "old" for comparison
* debug extraction
* typing and tests

Increase tests and refactor the deprecation-warning ignore in tests

MartinThoma · 2022-06-04T09:24:41Z

@pubpub-zz I would like to get the Charmap support into PyPDF2 soon and give you (+ some others who made very similar PRs before) full credit for your work. For this reason I would like to avoid merging #924.

I suggest the following:

1. Improve Text Extraction #881 is the PR we merge into main next. Currently the CI is failing - I can take care of that if you want. Also, I need to check that the quality according to the benchmark stays roughly the same. I would add asabramo and VictorCarlquist as co-authored-by as they have done similar PRs in the past. Would that be ok for you?
2. I close Pubpub zz extract text #924 - I just created that branch to show some minor mypy / style things I would change in Improve Text Extraction #881.
3. We / I go through the following PRs to check if something is missing:
   * ExtractText2 #929 - maybe you can already make sure that Improve Text Extraction #881 contains those improvements?
   * WIP: support font CMAP to translate chars with TJ operator #858 (VictorCarlquist)
   * Advanced text extraction #464 (asabramo)
   * CMap support #805

pubpub-zz · 2022-06-04T18:45:08Z

@MartinThoma, sorry to bother you: can you rerun the benchmark on this version?
I will have a look at the others.

MartinThoma · 2022-06-04T20:56:45Z

No problem - I'm happy that you're doing the heavy-lifting 😄

I've just started the benchmark run. I'll share the results tomorrow morning (takes ~20 minutes and I'll go to bed now 😄 )

MartinThoma · 2022-06-05T07:36:49Z

I get

Traceback (most recent call last):
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1331, in buildCharMap
    raise Exception("null width")
Exception: null width

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/moose/Github/py-pdf/benchmarks/benchmark.py", line 530, in <module>
    main(docs, libraries, add_text_extraction_quality=True)
  File "/home/moose/Github/py-pdf/benchmarks/benchmark.py", line 235, in main
    text = lib.text_extraction_function(data)
  File "/home/moose/Github/py-pdf/benchmarks/benchmark.py", line 140, in pypdf2_get_text
    text += page.extractText()
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1506, in extractText
    return self.extract_text(Tj_sep=Tj_sep, TJ_sep=TJ_sep)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1482, in extract_text
    return self._extract_text(self,self.pdf,space_width, PG.CONTENTS)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1357, in _extract_text
    cmaps[f] = buildCharMap(f)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1344, in buildCharMap
    sp_width = m / cpt / 2
ZeroDivisionError: division by zero

for https://github.com/py-pdf/sample-files/raw/main/009-pdflatex-geotopo/GeoTopo.pdf - reader.pages[13]:

import PyPDF2

reader = PyPDF2.PdfFileReader("GeoTopo.pdf")
page = reader.pages[13]  # the page that triggers the ZeroDivisionError
page.extract_text()

MartinThoma · 2022-06-05T08:31:15Z

I've added the fallback

if cpt == 0:
    cpt = 1

With that fallback, your PR currently boosts the average from 86% to ~~90%~~ 96%!
edit: That means PyPDF2 has better text extraction than pdfminer.six and pdftotext 🎉
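
To illustrate the guard in isolation, here is a minimal sketch. It is not the code from `_page.py`: it assumes, based only on the traceback, that `m` accumulates glyph widths, that `cpt` counts them, and that the space width is taken as half the mean glyph width. The function name `estimate_space_width` and the `default` parameter are hypothetical.

```python
from typing import Iterable


def estimate_space_width(widths: Iterable[float], default: float = 250.0) -> float:
    """Return half the mean of the non-zero glyph widths.

    Hypothetical helper: `default` is returned when the font declares no
    usable widths, which is the case that raised ZeroDivisionError.
    """
    m = 0.0  # running sum of usable widths (assumed meaning of `m`)
    cpt = 0  # number of usable widths (assumed meaning of `cpt`)
    for width in widths:
        if width > 0:
            m += width
            cpt += 1
    if cpt == 0:
        # No width information at all: avoid dividing by zero.
        return default
    return m / cpt / 2


print(estimate_space_width([500, 700, 0]))
print(estimate_space_width([]))
```

Under the same assumption, setting `cpt = 1` when no widths were counted would give a space width of 0; returning an explicit default is one possible alternative, not what the PR does.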

Looking at the single files:

            "1601.03642": 0.9789762968052216, -> 99%
            "1602.06541": 0.9607310932031617, -> 98%
            "1707.09725": 0.9160059659313918, -> 94%
            "2201.00021": 0.92414829121734, -> 97%
            "2201.00022": 0.9581322751904328, -> 98%
            "2201.00029": 0.0,  -> 98% -- you managed to do it! You're so awesome! Thank you!
            "2201.00037": 0.9228385160911429, -> 94%
            "2201.00069": 0.9320819588347349, -> 96%
            "2201.00151": 0.8986238392139712, -> 93%
            "2201.00178": 0.9035859338326836, -> 93%
            "2201.00200": 0.9411056870547374, -> 97%
            "2201.00201": 0.9444251579563376, -> 98%
            "2201.00214": 0.9625399637918416, -> 97%
            "GeoTopo-book": 0.7924142197687146 -> 86%

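The quoted averages can be checked against the per-file numbers. This is a minimal sketch, assuming the benchmark average is a plain arithmetic mean over the 14 files (the thread does not say how it is computed). The values are copied from the list above; the new scores are only available as rounded percentages.

```python
from statistics import mean

# (score before, score with this PR in percent), copied from the comment.
scores = {
    "1601.03642": (0.9789762968052216, 99),
    "1602.06541": (0.9607310932031617, 98),
    "1707.09725": (0.9160059659313918, 94),
    "2201.00021": (0.92414829121734, 97),
    "2201.00022": (0.9581322751904328, 98),
    "2201.00029": (0.0, 98),
    "2201.00037": (0.9228385160911429, 94),
    "2201.00069": (0.9320819588347349, 96),
    "2201.00151": (0.8986238392139712, 93),
    "2201.00178": (0.9035859338326836, 93),
    "2201.00200": (0.9411056870547374, 97),
    "2201.00201": (0.9444251579563376, 98),
    "2201.00214": (0.9625399637918416, 97),
    "GeoTopo-book": (0.7924142197687146, 86),
}

before = mean(old for old, _ in scores.values())
after = mean(new for _, new in scores.values()) / 100

print(f"before: {before:.1%}, after: {after:.1%}")
```

Under that assumption the means round to 86% and 96%, matching the figures quoted in the comment.
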
MartinThoma · 2022-06-05T08:32:41Z

@pubpub-zz I love you 🤩 🤗 This is a crazy improvement! Now I really want it to be merged 😄

Please let me know how you would like me to continue. Should I merge pubpub-zz:ExtractText2 into py-pdf:pubpub-zz-extractText and then that one into main?

pubpub-zz · 2022-06-05T09:01:05Z

The PR you've referenced will surely improve some translations.
What I would propose:
a) I clean up flake8 / mypy to confirm that we will pass all tests.
b) You merge ExtractText2 into pubpub-zz-extractText and then into main.

In my current branch the legacy function is still present as extract_oldtext for people to revert to if they prefer.
I will carry on this work in a new PR based on the latest main to introduce the other changes.

MartinThoma · 2022-06-05T09:49:19Z

Sounds good! Then I'll wait for your ok to get started :-)

pubpub-zz · 2022-06-05T09:49:21Z

I think you should be able to merge this release

MartinThoma · 2022-06-05T09:50:33Z

You mean I can merge this PR now? (just want to be sure :-) )

pubpub-zz · 2022-06-05T09:51:59Z

Go :)

pubpub-zz added 2 commits May 30, 2022 12:59

Simplify mypy writing + merge

121420c

New proposal for spacing/lining

1a3d8ec

New proposal with deeper analysis of font data and text positioning

MartinThoma mentioned this pull request May 30, 2022

WIP: support font CMAP to translate chars with TJ operator #858

Closed

fix some errors reported

e89c205

pubpub-zz added 2 commits May 31, 2022 15:06

improve textExtraction

6cecea6

Includes:
* XObject processing
* choice between encoding and ToUnicode fields
* partial compliance with Identity-H/V encoding (processing of 2-byte codes still missing)
* legacy conversion reintroduced as "old" for comparison
* debug extraction
* typing and tests

refactoring + test

02c739f

Increase tests and refactor the deprecation-warning ignore in tests

pubpub-zz mentioned this pull request May 31, 2022

.extractText() reads / as 1. #789

Closed

MartinThoma added the workflow-text-extraction label ("From a user's perspective, text extraction is the affected feature/workflow") Jun 1, 2022

take into account Tm scale and improve space calculation

945fc21

clean up before deliveries

a70232b

MartinThoma merged commit d957d4d into py-pdf:pubpub-zz-extractText Jun 5, 2022

pubpub-zz deleted the ExtractText2 branch June 10, 2022 19:57
