ENH: Extract LaTeX characters (#2016)
Conversation
Closes py-pdf#2009. Note: the code cleanup removed duplicates from adobe_glyphs.
@MartinThoma
Codecov Report

Patch coverage — additional details and impacted files:

@@            Coverage Diff            @@
##             main    #2016     +/-  ##
==========================================
+ Coverage   94.03%   94.07%   +0.03%
==========================================
  Files          33       33
  Lines        7076     7104      +28
  Branches     1413     1421       +8
==========================================
+ Hits         6654     6683      +29
  Misses        263      263
+ Partials      159      158       -1

☔ View full report in Codecov by Sentry.
Except for my comment above, this PR is all yours.
This is amazing 😲 😍 Thank you so much 🤗
I'm looking forward to the release on the weekend + an update of https://github.com/py-pdf/benchmarks/blob/main/benchmark.py 🎉 |
## What's new

### New Features (ENH)
- Accelerate image list keys generation (#2014)
- Use `cryptography` for encryption/decryption as a fallback for PyCryptodome (#2000)
- Extract LaTeX characters (#2016)
- `ASCIIHexDecode.decode` now returns bytes instead of str (#1994)

### Bug Fixes (BUG)
- Add RunLengthDecode filter (#2012)
- Process /Separation ColorSpace (#2007)
- Handle single element ColorSpace list (#2026)
- Process lookup decoded as TextStringObjects (#2008)

### Robustness (ROB)
- Cope with garbage collector during cloning (#1841)

### Maintenance (MAINT)
- Cleanup of annotations (#1745)

[Full Changelog](3.13.0...3.14.0)
@pubpub-zz I've updated the benchmark: the text extraction quality metric increased from 96% to 97%. I've also found a couple of places where the ground truth was wrong 🎉 We are now on par with Tika / PyMuPDF. However, the perceived quality is still slightly worse, as Tika / PyMuPDF typically handle whitespace better. I had a look at what would be necessary to lift text extraction to the next level (from a user's perspective):

**Local optimizations**

- Ligature replacement: I know that this actually moves away from "raw" text extraction, but I think this is what most users want. Maybe we need to re-define what we want to achieve and potentially add flags / methods for common post-processing 🤔
- Composed characters
- Whitespace
- Layout mode

**Advanced text extraction normalization**

This will likely never go into pypdf, as it requires a level of document understanding that is likely only achievable with machine learning. Still interesting to think about, though.
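For the ligature-replacement point above, here is a minimal post-processing sketch (not pypdf's implementation — just Python's stdlib NFKC normalization, which decomposes common ligature code points such as ﬁ and ﬂ):

```python
import unicodedata


def replace_ligatures(text: str) -> str:
    """Replace typographic ligatures (ﬁ, ﬂ, ﬀ, ...) with plain letters.

    NFKC compatibility normalization decomposes ligature code points.
    Note that it also rewrites other compatibility characters
    (e.g. superscripts, fullwidth forms), so apply it deliberately.
    """
    return unicodedata.normalize("NFKC", text)


print(replace_ligatures("ef\ufb01cient work\ufb02ow"))  # efficient workflow
```

This is the kind of opt-in post-processing flag the comment hints at: users who want "raw" extraction would skip it, users who want searchable text would enable it.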
/uniHHHH glyph names seem to be generated by LaTeX, but this is fine for other characters; addressed partially in py-pdf#2016.