ENH: Extract LaTeX characters (#2016)
Conversation
Closes py-pdf#2009. Note: the code cleanup removed duplicates from adobe_glyphs.
@MartinThoma
Codecov Report

Patch coverage — additional details and impacted files:

@@            Coverage Diff            @@
##             main    #2016     +/-  ##
==========================================
+ Coverage   94.03%   94.07%   +0.03%
==========================================
  Files          33       33
  Lines        7076     7104      +28
  Branches     1413     1421       +8
==========================================
+ Hits         6654     6683      +29
  Misses        263      263
+ Partials      159      158       -1

☔ View full report in Codecov by Sentry.
Except for my comment above, this PR is all yours.
This is amazing 😲 😍 Thank you so much 🤗
I'm looking forward to the release on the weekend + an update of https://github.com/py-pdf/benchmarks/blob/main/benchmark.py 🎉 |
## What's new

### New Features (ENH)
- Accelerate image list keys generation (#2014)
- Use `cryptography` for encryption/decryption as a fallback for PyCryptodome (#2000)
- Extract LaTeX characters (#2016)
- `ASCIIHexDecode.decode` now returns bytes instead of str (#1994)

### Bug Fixes (BUG)
- Add RunLengthDecode filter (#2012)
- Process /Separation ColorSpace (#2007)
- Handle single element ColorSpace list (#2026)
- Process lookup decoded as TextStringObjects (#2008)

### Robustness (ROB)
- Cope with garbage collector during cloning (#1841)

### Maintenance (MAINT)
- Cleanup of annotations (#1745)

[Full Changelog](3.13.0...3.14.0)
@pubpub-zz I've updated the benchmark: the text extraction quality metric increased from 96% to 97%. I've also found a couple of places where the ground truth was wrong 🎉 We are now on par with Tika / PyMuPDF. However, the perceived quality is still slightly worse, as Tika / PyMuPDF typically handle whitespace better. I had a look at what would be necessary to lift text extraction to the next level (from a user's perspective):

**Local optimizations**

- Ligature replacement: I know that this actually moves away from "raw" text extraction, but I think this is what most users want. Maybe we need to re-define what we want to achieve and potentially add flags / methods for common post-processing 🤔
- Composed characters
- Whitespace
- Layout mode

**Advanced text extraction normalization**

This will likely never go into pypdf, as it requires a level of document understanding that is likely only achievable with machine learning. Still interesting to think about, though.
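For the ligature-replacement point above, here is a minimal post-processing sketch (not pypdf's implementation — just Python's stdlib NFKC normalization, which decomposes common ligature code points such as ﬁ and ﬂ):

```python
import unicodedata


def replace_ligatures(text: str) -> str:
    """Replace typographic ligatures (ﬁ, ﬂ, ﬀ, ...) with plain letters.

    NFKC compatibility normalization decomposes ligature code points.
    Note that it also rewrites other compatibility characters
    (e.g. superscripts, fullwidth forms), so apply it deliberately.
    """
    return unicodedata.normalize("NFKC", text)


print(replace_ligatures("ef\ufb01cient work\ufb02ow"))  # efficient workflow
```

This is the kind of opt-in post-processing flag the comment hints at: users who want "raw" extraction would skip it, users who want searchable text would enable it.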
/uniHHHH glyph names seem to be generated by LaTeX, but this is fine for other characters; addressed partially in py-pdf#2016.