Improve Text Extraction by pubpub-zz · Pull Request #881 · py-pdf/pypdf

pubpub-zz · 2022-05-17T22:30:18Z

Fixes at least #880.

Line feeds are applied as stated in the reference.

TODO: for segmented text, the horizontal translation should be analysed to insert spaces or delete characters (to be analysed in crazyones.pdf)
TODO: font conversion (to be analysed in crazyones.pdf)
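
The first TODO above, deciding from horizontal positions whether a space belongs between two text segments, could be sketched roughly as follows. This is only an illustration under stated assumptions: `Fragment`, `join_fragments` and the `space_width` threshold are hypothetical names, the coordinates are in arbitrary units, and none of this is taken from the PyPDF2 code in this PR.

```python
from typing import List, NamedTuple, Optional


class Fragment(NamedTuple):
    """A run of text with its horizontal extent (hypothetical structure)."""

    x_start: float
    x_end: float
    text: str


def join_fragments(fragments: List[Fragment], space_width: float = 200.0) -> str:
    """Join fragments left to right, inserting a space when the horizontal
    gap between two neighbours is at least `space_width` (assumed threshold)."""
    parts: List[str] = []
    previous_end: Optional[float] = None
    for frag in sorted(fragments, key=lambda f: f.x_start):
        if previous_end is not None and frag.x_start - previous_end >= space_width:
            parts.append(" ")
        parts.append(frag.text)
        previous_end = frag.x_end
    return "".join(parts)


line = [
    Fragment(0.0, 100.0, "Hel"),
    Fragment(102.0, 160.0, "lo"),     # gap of 2: same word
    Fragment(400.0, 600.0, "world"),  # gap of 240: new word
]
print(join_fragments(line))                     # Hello world
print(join_fragments(line, space_width=300.0))  # Helloworld
```

Making the threshold a parameter keeps the heuristic tunable per document, since the width that counts as a space depends on the font and its size.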

pubpub-zz · 2022-05-17T22:31:38Z

@MartinThoma
For review only; there is still some work to do before merging into 2.0.

codecov · 2022-05-17T22:32:22Z

Codecov Report

Merging #881 (4baedb2) into main (42d4659) will increase coverage by 5.83%.
The diff coverage is 73.78%.

❗ Current head 4baedb2 differs from pull request most recent head 121420c. Consider uploading reports for the commit 121420c to get more accurate results

@@            Coverage Diff             @@
##             main     #881      +/-   ##
==========================================
+ Coverage   78.25%   84.09%   +5.83%     
==========================================
  Files          16       18       +2     
  Lines        4346     4068     -278     
  Branches      821      854      +33     
==========================================
+ Hits         3401     3421      +20     
+ Misses        758      460     -298     
  Partials      187      187

Impacted Files	Coverage Δ
PyPDF2/_adobe_glyphs.py	`100.00% <ø> (ø)`
PyPDF2/pagerange.py	`100.00% <ø> (ø)`
PyPDF2/filters.py	`79.54% <45.45%> (ø)`
PyPDF2/_page.py	`81.54% <71.01%> (+5.04%)`	⬆️
PyPDF2/_reader.py	`81.71% <71.42%> (+4.64%)`	⬆️
PyPDF2/_cmap.py	`72.72% <72.72%> (ø)`
PyPDF2/_writer.py	`88.16% <75.00%> (+9.26%)`	⬆️
PyPDF2/generic.py	`89.82% <82.92%> (+12.05%)`	⬆️
PyPDF2/_merger.py	`69.93% <100.00%> (+4.41%)`	⬆️
PyPDF2/_security.py	`97.40% <100.00%> (+2.73%)`	⬆️
... and 10 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

pubpub-zz · 2022-05-22T22:03:01Z

Deep changes to the text extraction. Some fonts (TrueType, Type 3) still need analysis.

Looking for some beta testers.

MartinThoma · 2022-05-23T11:35:46Z

Nice! I'll let it run with https://github.com/py-pdf/benchmarks this evening if I have some time :-)

MartinThoma · 2022-05-23T21:39:22Z

I fixed the merge conflicts due to the recent PEP8 renamings + I adjusted the PR to the new syntax. There are still mypy issues though.

MartinThoma · 2022-05-24T05:33:10Z

Running the benchmark, I get:

  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1141, in extract_text
    ft,cmap,cmap2 = buildCharMap(self.pdf,operands[0])
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1073, in buildCharMap
    fontType = pdf.pages[0]["/Resources"]["/Font"][font_name]["/Subtype"]
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/generic.py", line 519, in __getitem__
    return dict.__getitem__(self, key).get_object()
KeyError: '/R167'
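
One plausible reading of this traceback (an assumption; the thread does not spell out the cause) is that the font is looked up in the resources of page 0 rather than in those of the page being extracted, so a font that only exists on a later page raises `KeyError`. The snippet posted later in this thread reads from `self`, which is consistent with that reading. A minimal sketch with plain dictionaries standing in for PDF objects; the helper names are hypothetical:

```python
from typing import Any, Dict, List

Page = Dict[str, Any]

# Two stand-in pages; only the second one declares the font "/R167".
pages: List[Page] = [
    {"/Resources": {"/Font": {"/F1": {"/Subtype": "/Type1"}}}},
    {"/Resources": {"/Font": {"/R167": {"/Subtype": "/TrueType"}}}},
]


def font_subtype_from_first_page(all_pages: List[Page], font_name: str) -> str:
    """Mirrors the failing lookup: always consults page 0."""
    return all_pages[0]["/Resources"]["/Font"][font_name]["/Subtype"]


def font_subtype_from_page(page: Page, font_name: str) -> str:
    """Looks the font up on the page that actually uses it."""
    return page["/Resources"]["/Font"][font_name]["/Subtype"]


try:
    font_subtype_from_first_page(pages, "/R167")
except KeyError as exc:
    print("lookup on page 0 failed:", exc)

print(font_subtype_from_page(pages[1], "/R167"))  # /TrueType
```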

pubpub-zz · 2022-05-25T22:47:05Z

@MartinThoma

Thanks for the test and the report. I've fixed it. I've also introduced a test that uses the testbench dataset, fetching the files with urllib.request. This increases the test duration, so I've added a way to bypass it.

I think this pull request should now be proposed for testing.

PS: I merged with the new API, and two functions are missing to prevent compatibility loss: getPage and getObject
PPS: I have some issues with mypy. Can you fix the errors so that I can learn from them?

MartinThoma · 2022-05-27T12:50:08Z

def read_from_stream(stream: StreamType, pdf: Any, forcedEncoding: Union[None,str,List[str],dict[int,str]] = None) -> "ArrayObject":  # PdfReader

E TypeError: 'type' object is not subscriptable

You need to change dict[int,str] to Dict[int,str] (capital D). The syntax you've used was only introduced in Python 3.9 or 3.10, I think
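
For context: subscripting the built-in `dict` (as in `dict[int, str]`) works at runtime from Python 3.9 onward (PEP 585); on earlier versions the annotation is evaluated when the function is defined and raises the `TypeError` shown above. A minimal sketch of a portable spelling using `typing.Dict`; the alias `ForcedEncoding` and the helper `describe_encoding` are hypothetical and only illustrate the annotation:

```python
from typing import Dict, List, Union

# Same union as in the signature above, spelled with typing.Dict so that it
# also works on Python versions older than 3.9.
ForcedEncoding = Union[None, str, List[str], Dict[int, str]]


def describe_encoding(forced_encoding: ForcedEncoding = None) -> str:
    """Report which variant of the union was passed (illustration only)."""
    if forced_encoding is None:
        return "default"
    if isinstance(forced_encoding, str):
        return "named: " + forced_encoding
    if isinstance(forced_encoding, list):
        return "list of %d names" % len(forced_encoding)
    return "map with %d entries" % len(forced_encoding)


print(describe_encoding())                # default
print(describe_encoding("latin-1"))       # named: latin-1
print(describe_encoding({65: "A"}))       # map with 1 entries
```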

MartinThoma · 2022-05-27T12:50:50Z

A general question: If I see such things, should I directly adjust it in your PR? (I know some people love it and others hate it)

pubpub-zz · 2022-05-27T14:55:48Z

Thanks for the advice. I've tried to improve.
I succeeded in fixing it 😊
During your review, I would like your opinion: the solution I used to prevent mypy errors, with cast, makes the code hard for me to read. Can you give me some advice?

MartinThoma · 2022-05-28T10:37:33Z

Glancing at the first result of the benchmark, the results look amazing 😍

Left is old, right is new:

It will take a while until the benchmark has completed. I'll share the results later :-)

pubpub-zz · 2022-05-29T10:46:05Z

@MartinThoma,
My latest status:
a) Issues with Asian characters: I've not been able to reproduce the issue.
My Python code to get the file:
import io
import urllib.request

import PyPDF2

url = "https://arxiv.org/pdf/2201.00151.pdf"
reader = PyPDF2.PdfReader(io.BytesIO(urllib.request.urlopen(url).read()))
text = reader.pages[0].extract_text()
print(text)
with open("e:/out.txt", "w", encoding="utf-8") as f:
    f.write(text)
The result of the comparison between out.txt (left) and a copy/paste from Adobe Reader DC:

b) Spacing:
I've slightly modified the code so that the space width criterion can be passed in. I still do not understand some cases where a narrower width is interpreted as a space in some places but not in others.

c) mypy:
Your proposal seems to have reintroduced some error reporting. Can you check on your side and readjust if needed?

d) Benchmark:
Can you tell me how you obtain the percentage result?

MartinThoma · 2022-05-29T10:47:36Z

I've extracted the spacing improvement heuristic into #922. It should be possible to merge that really soon.

MartinThoma · 2022-05-29T13:02:49Z

I've just fixed the merge conflict I've introduced

MartinThoma · 2022-05-29T14:20:42Z

@pubpub-zz Here is the simplification for the casts: c488734 - essentially just assigning variable names to intermediate results

Wolf359Stella · 2022-05-29T15:46:05Z

Hey man, just want to say thank you for your hard work on this one. I was interested in making this improvement myself and ended up finding this PR. Really nice work, @pubpub-zz. Are you listing your past and future updates somewhere? I would like to follow along and see if I can help add HTML features to it (while avoiding disturbing your work).

pubpub-zz · 2022-05-29T16:45:48Z

@MartinThoma, your proposal is quite heavy to read. May I propose this solution?
font_type: str = self["/Resources"]["/Font"][font_name]["/Subtype"] # type: ignore
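
The two styles under discussion can be compared side by side. The sketch below uses plain dictionaries typed as `Dict[str, object]` in place of PyPDF2 objects, so it only mirrors the idea of commit c488734 (named intermediate results with `cast`), not its actual code; both function names are hypothetical.

```python
from typing import Dict, cast

Obj = Dict[str, object]


def font_subtype_with_casts(page: Obj, font_name: str) -> str:
    """One cast per step: verbose, but every intermediate value is typed."""
    resources = cast(Obj, page["/Resources"])
    fonts = cast(Obj, resources["/Font"])
    font = cast(Obj, fonts[font_name])
    return cast(str, font["/Subtype"])


def font_subtype_with_ignore(page: Obj, font_name: str) -> str:
    """Single expression: mypy would reject indexing `object`, so the
    check is silenced for this line only."""
    font_type: str = page["/Resources"]["/Font"][font_name]["/Subtype"]  # type: ignore
    return font_type


page: Obj = {"/Resources": {"/Font": {"/F1": {"/Subtype": "/Type1"}}}}
print(font_subtype_with_casts(page, "/F1"))   # /Type1
print(font_subtype_with_ignore(page, "/F1"))  # /Type1
```

Both behave identically at runtime, since `cast` does nothing when the program runs. The trade-off is readability against how much of the expression the type checker still verifies.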

MartinThoma · 2022-05-29T18:05:11Z

That is completely fine for me as well :-)

pubpub-zz · 2022-05-29T19:33:10Z

@LucasWolfgang, thanks for your nice comments. There are still some improvements to be made. I don't know if I will be able to release new changes.
What do you intend to propose (font? style? position?)

Wolf359Stella · 2022-05-29T19:47:13Z

@pubpub-zz, I was thinking of implementing the functionality available in PyMuPDF's get_text. Of course it would be a LOT of work. So I would start first by just creating the blocks (paragraphs) and continue from there to extract font, style and images (as Base64 or maybe as references), one PR at a time. About positions, I am still unsure how I would use them besides ordering the blocks.

pubpub-zz · 2022-05-29T20:14:40Z

Maybe what you could do is branch from my dev branch https://github.com/pubpub-zz/PyPDF2/tree/extractText
If you add an as_html boolean parameter and use if branches, you can keep the text and HTML output side by side. This may help you merge my changes later.
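
As a rough illustration of that suggestion, a single function can serve both outputs by branching on a flag. The function `render_lines` and its `as_html` parameter are hypothetical and not part of PyPDF2; the sketch only shows the branching idea.

```python
import html
from typing import List


def render_lines(lines: List[str], as_html: bool = False) -> str:
    """Return extracted lines as plain text, or as escaped HTML paragraphs."""
    if as_html:
        return "\n".join("<p>" + html.escape(line) + "</p>" for line in lines)
    return "\n".join(lines)


sample = ["a < b", "second line"]
print(render_lines(sample))
print(render_lines(sample, as_html=True))
```

Keeping the plain-text path as the default means existing callers are unaffected, which is what makes later merges between the two lines of work easier.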

MartinThoma · 2022-06-05T12:02:36Z

@pubpub-zz #924 got merged 🎉 I'd make a minor release (2.1.0) on Monday.

MartinThoma · 2022-06-06T11:30:54Z

I've moved the buildCharMap function to its own module: 4baedb2 - I hope this simplifies future PRs / makes it easier for me to understand what they are actually doing.

improve extractText

4346612

fix wrong characters in crazyones fix space wrongly introduced

Merge branch '2.0.0-dev' into extractText

88daaa9

MartinThoma reviewed May 23, 2022

Apply suggestions from code review

6d35cd0

MartinThoma reviewed May 23, 2022

Apply suggestions from code review

b386daf

pubpub-zz added 3 commits May 26, 2022 00:08

extractText improvement from testbench tests

de1e103

various fixes +add pytest with pypdftest

Merge remote-tracking branch 'origin/extractText' into extractText

77cf4cc

2.0.0 new API upgrade

617176e

pubpub-zz added 2 commits May 26, 2022 00:50

late fix

8f47080

Late Fix 2

7eee1b7

MartinThoma changed the base branch from 2.0.0-dev to main May 26, 2022 08:13

pubpub-zz added 3 commits May 27, 2022 14:24

make mypy happy

930eaac

also includes a little of rewrite

Merge branch 'main' into extractText

23411f3

fix flake8

05d87a8

pubpub-zz added 3 commits May 27, 2022 16:21

fix Merging main

5d017df

fix mypy 3.8

0a049d1

fix mypy 3.8 (2)

6feebf7

add modifiable space size

a832144

pubpub-zz force-pushed the extractText branch from ace6c31 to a832144 Compare May 29, 2022 09:54

pubpub-zz added 2 commits May 29, 2022 12:08

after black

1b0034b

after black (2)

dae5dd3

MartinThoma added a commit that referenced this pull request May 29, 2022

ENH: Improve space setting for text extraction

fc10e6e

Full credit to pubpub-zz who introduced this change in #881 Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>

MartinThoma mentioned this pull request May 29, 2022

ENH: Improve space setting for text extraction #922

Merged

MartinThoma added a commit that referenced this pull request May 29, 2022

ENH: Improve space setting for text extraction

dee0106

Full credit to pubpub-zz who introduced this change in #881 Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>

MartinThoma added a commit that referenced this pull request May 29, 2022

ENH: Improve space setting for text extraction (#922)

c008b0f

Full credit to pubpub-zz who introduced this change in #881 Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>

Merge branch 'main' into extractText

056fcb3

MartinThoma added 2 commits May 29, 2022 16:06

Merge branch 'main' into extractText

ba44b65

Simplify code; apply black formatter

c488734

MartinThoma mentioned this pull request May 29, 2022

Pubpub zz extract text #924

Merged

VictorCarlquist mentioned this pull request May 29, 2022

WIP: support font CMAP to translate chars with TJ operator #858

Closed

Simplify mypy writing + merge

121420c

MartinThoma added the workflow-text-extraction label (from a user's perspective, text extraction is the affected feature/workflow) Jun 1, 2022

MartinThoma mentioned this pull request Jun 4, 2022

ExtractText2 #929

Merged

pubpub-zz closed this Jun 10, 2022

pubpub-zz deleted the extractText branch June 10, 2022 19:57
