Improve Text Extraction #881

Closed
pubpub-zz wants to merge 26 commits into py-pdf:main from pubpub-zz:extractText

Conversation

@pubpub-zz
Collaborator

Fixes at least #880.

Line feeds are applied as stated in the reference.

TODO: for segmented text, horizontal translation should be analysed to insert spaces or delete characters (to be analysed in crazyones.pdf)
TODO: font conversion (to be analysed in crazyones.pdf)

@pubpub-zz
Collaborator Author

@MartinThoma
For review only; there is still some work to do before merging into 2.0.

@codecov

codecov bot commented May 17, 2022

Codecov Report

Merging #881 (4baedb2) into main (42d4659) will increase coverage by 5.83%.
The diff coverage is 73.78%.

❗ Current head 4baedb2 differs from pull request most recent head 121420c. Consider uploading reports for the commit 121420c to get more accurate results

@@            Coverage Diff             @@
##             main     #881      +/-   ##
==========================================
+ Coverage   78.25%   84.09%   +5.83%     
==========================================
  Files          16       18       +2     
  Lines        4346     4068     -278     
  Branches      821      854      +33     
==========================================
+ Hits         3401     3421      +20     
+ Misses        758      460     -298     
  Partials      187      187              
Impacted Files            Coverage Δ
PyPDF2/_adobe_glyphs.py   100.00% <ø> (ø)
PyPDF2/pagerange.py       100.00% <ø> (ø)
PyPDF2/filters.py         79.54% <45.45%> (ø)
PyPDF2/_page.py           81.54% <71.01%> (+5.04%) ⬆️
PyPDF2/_reader.py         81.71% <71.42%> (+4.64%) ⬆️
PyPDF2/_cmap.py           72.72% <72.72%> (ø)
PyPDF2/_writer.py         88.16% <75.00%> (+9.26%) ⬆️
PyPDF2/generic.py         89.82% <82.92%> (+12.05%) ⬆️
PyPDF2/_merger.py         69.93% <100.00%> (+4.41%) ⬆️
PyPDF2/_security.py       97.40% <100.00%> (+2.73%) ⬆️
... and 10 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 42d4659...121420c. Read the comment docs.

fix wrong characters in crazyones
fix space wrongly introduced
@pubpub-zz
Collaborator Author

Deep changes to the text extraction. Some analysis is still needed for certain fonts (TrueType, Type 3).

Looking for some beta testers.

@MartinThoma
Member

MartinThoma commented May 23, 2022

Nice! I'll let it run with https://github.com/py-pdf/benchmarks this evening if I have some time :-)

@MartinThoma
Member

I fixed the merge conflicts due to the recent PEP8 renamings + I adjusted the PR to the new syntax. There are still mypy issues though.

@MartinThoma
Member

Running the benchmark, I get:

  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1141, in extract_text
    ft,cmap,cmap2 = buildCharMap(self.pdf,operands[0])
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1073, in buildCharMap
    fontType = pdf.pages[0]["/Resources"]["/Font"][font_name]["/Subtype"]
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/generic.py", line 519, in __getitem__
    return dict.__getitem__(self, key).get_object()
KeyError: '/R167'
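
The traceback suggests that the font name `/R167` was looked up in the resources of `pdf.pages[0]` rather than in those of the page being processed. That reading is an assumption, since the thread does not state the cause. A minimal sketch of a more defensive lookup, using plain dictionaries in place of PyPDF2 objects (the helper name `find_font_subtype` and the sample data are hypothetical):

```python
from typing import Any, Dict, Optional


def find_font_subtype(page: Dict[str, Any], font_name: str) -> Optional[str]:
    """Return the /Subtype of font_name from this page's own resources.

    Returns None instead of raising KeyError when the page has no such font.
    """
    fonts = page.get("/Resources", {}).get("/Font", {})
    font = fonts.get(font_name)
    if font is None:
        return None
    return font.get("/Subtype")


# Two pages with different font dictionaries (made-up data for illustration).
first_page = {"/Resources": {"/Font": {"/F1": {"/Subtype": "/Type1"}}}}
later_page = {"/Resources": {"/Font": {"/R167": {"/Subtype": "/TrueType"}}}}

print(find_font_subtype(later_page, "/R167"))  # /TrueType
print(find_font_subtype(first_page, "/R167"))  # None
```

Looking the font up on the page that owns the content stream avoids depending on whatever fonts the first page happens to declare.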

@pubpub-zz
Collaborator Author

pubpub-zz commented May 25, 2022

@MartinThoma

Running the benchmark, I get:

  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1141, in extract_text
    ft,cmap,cmap2 = buildCharMap(self.pdf,operands[0])
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1073, in buildCharMap
    fontType = pdf.pages[0]["/Resources"]["/Font"][font_name]["/Subtype"]
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/generic.py", line 519, in __getitem__
    return dict.__getitem__(self, key).get_object()
KeyError: '/R167'

Thanks for the test and the report. I've fixed it. I've also introduced a test that uses the testbench dataset, fetching the files with urllib.request. This increases the test duration, so I've added a way to bypass it.

I think this pull request should now be proposed for testing.

PS: I merged with the new API, and two functions are missing that would prevent a loss of compatibility: getPage and getObject.
PPS: I have some issues with mypy. Can you fix the errors so that I can learn from them?

@MartinThoma MartinThoma changed the base branch from 2.0.0-dev to main May 26, 2022 08:13
@MartinThoma
Member

def read_from_stream(stream: StreamType, pdf: Any, forcedEncoding: Union[None,str,List[str],dict[int,str]] = None) -> "ArrayObject":  # PdfReader

E TypeError: 'type' object is not subscriptable

You need to change dict[int,str] to Dict[int,str] (capital D). The syntax you've used was only introduced in Python 3.9 or 3.10, I think
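
For reference, subscripting the built-in `dict` in annotations became possible in Python 3.9 (PEP 585); `typing.Dict` also works on older versions. A small sketch of the portable spelling follows. The alias `ForcedEncoding` and the helper `describe_encoding` are hypothetical names, not part of PyPDF2.

```python
from typing import Dict, List, Union

# typing.Dict is subscriptable on older Python 3 versions as well;
# the built-in dict[int, str] form only works from Python 3.9 on.
ForcedEncoding = Union[None, str, List[str], Dict[int, str]]


def describe_encoding(forced_encoding: ForcedEncoding = None) -> str:
    """Hypothetical helper that reports which form of the argument was passed."""
    if forced_encoding is None:
        return "default"
    if isinstance(forced_encoding, str):
        return "named"
    if isinstance(forced_encoding, list):
        return "list"
    return "mapping"


print(describe_encoding({65: "A"}))  # mapping
```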

@MartinThoma
Member

A general question: If I see such things, should I directly adjust it in your PR? (I know some people love it and others hate it)

@pubpub-zz
Collaborator Author

Thanks for the advice. I've tried to improve.
I succeeded in fixing it 😊
During your review, I would like to have your opinion: the solution I've used to prevent mypy errors, with cast, makes the code hard for me to read. Can you give me some advice?

@MartinThoma
Member

MartinThoma commented May 28, 2022

Glancing at the first result of the benchmark the results look amazing 😍

Left is old, right is new:

image

It will take a while until the benchmark has completed. I'll share the results later :-)

@pubpub-zz
Collaborator Author

@MartinThoma,
My latest status:
a) About the issues with Asian characters: I've not been able to reproduce them.
My Python code to get the file:
    import io
    import urllib.request

    import PyPDF2

    url = "https://arxiv.org/pdf/2201.00151.pdf"
    reader = PyPDF2.PdfReader(io.BytesIO(urllib.request.urlopen(url).read()))
    text = reader.pages[0].extract_text()
    print(text)
    # Save the extracted text so it can be compared with other tools.
    with open("e:/out.txt", "w", encoding="utf-8") as f:
        f.write(text)
The result of the comparison between out.txt (left) and a copy/paste from Adobe Reader DC:
image

b) Spacing:
I've slightly modified the code so that the space-width criterion can be passed in. I still do not understand why a smaller width is interpreted as a space in some cases but not in others.

c) mypy:
Your proposal seems to have reintroduced some error reports. Can you check on your side and readjust if necessary?

d) Benchmark:
Can you tell me how you obtain the percentage result?
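
To make the spacing question in (b) concrete, here is a minimal sketch of a gap-based criterion. It is not the heuristic used in this PR: the segment representation, the `ratio` parameter, and the sample coordinates are assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

# Each segment is (text, x_start, x_end) in text-space units; made-up layout.
Segment = Tuple[str, float, float]


def join_segments(segments: List[Segment], space_width: float, ratio: float = 0.5) -> str:
    """Join segments, inserting a space when the gap exceeds ratio * space_width."""
    parts: List[str] = []
    previous_end: Optional[float] = None
    for text, x_start, x_end in segments:
        if previous_end is not None and x_start - previous_end > ratio * space_width:
            parts.append(" ")
        parts.append(text)
        previous_end = x_end
    return "".join(parts)


line = [("Hel", 0.0, 15.0), ("lo", 15.2, 25.0), ("world", 28.0, 55.0)]
print(join_segments(line, space_width=5.0))  # Hello world
```

With a single fixed threshold, the same gap can read as a space under one font and as kerning under another, which may be one reason the results look inconsistent across documents.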

MartinThoma added a commit that referenced this pull request May 29, 2022
Full credit to pubpub-zz who introduced this change in
#881

Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>
@MartinThoma
Member

I've extracted the spacing improvement heuristic to #922. It should be possible to get that merged really soon.

MartinThoma added a commit that referenced this pull request May 29, 2022
Full credit to pubpub-zz who introduced this change in
#881

Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>
MartinThoma added a commit that referenced this pull request May 29, 2022
Full credit to pubpub-zz who introduced this change in
#881

Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>
@MartinThoma
Member

I've just fixed the merge conflict I've introduced

@MartinThoma
Member

@pubpub-zz Here is the simplification for the casts: c488734 - essentially just assigning variable names to intermediate results

@Wolf359Stella

Hey man, just want to say thank you for your hard work on this one. I was interested in making this improvement myself and ended up finding this PR. Really nice work, @pubpub-zz. Are you listing your past and future updates somewhere? I would like to follow along and see if I can help add HTML features to it (while avoiding disturbing your work).

@pubpub-zz
Collaborator Author

@MartinThoma, your proposal is quite heavy to read. May I propose this solution?
font_type: str = self["/Resources"]["/Font"][font_name]["/Subtype"] # type: ignore
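
The three styles discussed here can be compared side by side. The sketch below uses a plain nested dictionary as a stand-in for a page object, so mypy would not actually complain about it; the casts only mirror the shape of the code under discussion and are not the contents of commit c488734.

```python
from typing import Any, Dict, cast

# Stand-in for a page object (made-up data for illustration).
page: Dict[str, Any] = {
    "/Resources": {"/Font": {"/F1": {"/Subtype": "/Type1"}}}
}
font_name = "/F1"

# Nested casts in one expression: explicit for the type checker, hard to read.
subtype_nested = cast(
    str,
    cast(Dict[str, Any], cast(Dict[str, Any], page["/Resources"])["/Font"])[font_name]["/Subtype"],
)

# Named intermediate results: one cast per step.
resources = cast(Dict[str, Any], page["/Resources"])
fonts = cast(Dict[str, Any], resources["/Font"])
font = cast(Dict[str, Any], fonts[font_name])
subtype_steps = cast(str, font["/Subtype"])

# Single expression with the type check silenced on that line.
subtype_ignored: str = page["/Resources"]["/Font"][font_name]["/Subtype"]  # type: ignore

print(subtype_nested, subtype_steps, subtype_ignored)
```

All three produce the same value at runtime; the trade-off is between readability and how much of the expression the type checker still verifies.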

@MartinThoma
Member

That is completely fine for me as well :-)

@pubpub-zz
Collaborator Author

pubpub-zz commented May 29, 2022

Hey man, just want to say thank you for your hard work on this one. I was interested in making this improvement myself and ended up finding this PR. Really nice work, @pubpub-zz. Are you listing your past and future updates somewhere? I would like to follow along and see if I can help add HTML features to it (while avoiding disturbing your work).

@LucasWolfgang, thanks for your nice comments. There are still some improvements to be made. I don't know when I will be able to release new changes.
What do you intend to propose (font? style? position?)

@Wolf359Stella

@pubpub-zz, I was thinking of implementing the functionality available in PyMuPDF's get_text. Of course it would be a LOT of work, so I would start by just creating the blocks (paragraphs) and continue from there to extract font, style and images (as Base64 or maybe as references), one PR at a time. As for positions, I am still unsure how I would use them besides ordering the blocks.

@pubpub-zz
Collaborator Author

Maybe you could branch from my dev branch https://github.com/pubpub-zz/PyPDF2/tree/extractText
You could add an as_html boolean parameter and use if branches, so that text and HTML output are both kept at the same time. This may help you merge my changes later.
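
A minimal sketch of that idea, assuming a hypothetical helper `render_lines` that receives already-extracted lines; only the `as_html` flag comes from the suggestion above, and nothing here is PyPDF2 API.

```python
import html
from typing import List


def render_lines(lines: List[str], as_html: bool = False) -> str:
    """Return extracted lines as plain text, or as HTML paragraphs if as_html."""
    if as_html:
        # Escape the text so characters such as < and & stay valid in HTML.
        return "\n".join(f"<p>{html.escape(line)}</p>" for line in lines)
    return "\n".join(lines)


lines = ["Crazy Ones", "a < b & c"]
print(render_lines(lines))
print(render_lines(lines, as_html=True))
```

Keeping both outputs behind one flag means the plain-text path is untouched, which should make later merges of text-extraction changes less painful.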

@MartinThoma MartinThoma added the workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow label Jun 1, 2022
@MartinThoma MartinThoma mentioned this pull request Jun 4, 2022
@MartinThoma
Member

@pubpub-zz #924 got merged 🎉 I'd make a minor release (2.1.0) on Monday.

@MartinThoma
Member

I've moved the buildCharMap function to its own module: 4baedb2 - I hope this simplifies future PRs / makes it easier for me to understand what they are actually doing.

@pubpub-zz pubpub-zz closed this Jun 10, 2022
@pubpub-zz pubpub-zz deleted the extractText branch June 10, 2022 19:57
Labels

workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow

3 participants