Closed
Labels: workflow-text-extraction (from a user's perspective, text extraction is the affected feature/workflow)
Description
Explanation
Extracting math content is super hard and at the moment completely out of reach (e.g. fractions, subscripts/superscripts, curly braces for case distinctions, roots). However, maybe we can improve the extraction a little bit by supporting some heavily used single characters:
- hbar: https://www.compart.com/de/unicode/U+0127
- integral: https://www.compart.com/de/unicode/U+222B
- phi
- delta
- alpha
- beta
- gamma
- partial derivative: https://www.compart.com/de/unicode/U+2202
- times
- cdot
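
One way such single-character support might look is a small fallback table keyed by font and character code. The sketch below is hypothetical and not pypdf's implementation: `FALLBACK`, `normalize_font_name` and `map_char` are made-up names, and the font prefixes assume the test PDFs were typeset with TeX's Computer Modern / AMS math fonts, which this issue does not state. The four entries were chosen to match the raw characters pypdf returns in the table below (`~`, `R`, `@`, `'`), and the targets follow the "Expected" column.

```python
import unicodedata

# Hypothetical fallback table: (font-name prefix, character code) -> Unicode.
# Assumption: glyph positions of TeX's CMMI / CMEX / MSBM fonts.
FALLBACK = {
    ("CMMI", 0x40): "\u2202",  # '@' -> PARTIAL DIFFERENTIAL
    ("CMMI", 0x27): "\u03c6",  # "'" -> GREEK SMALL LETTER PHI
    ("CMEX", 0x52): "\u222b",  # 'R' -> INTEGRAL
    ("MSBM", 0x7E): "\u0127",  # '~' -> LATIN SMALL LETTER H WITH STROKE
}


def normalize_font_name(base_font: str) -> str:
    """Drop a leading slash and an 'ABCDEF+' subset prefix, then uppercase."""
    return base_font.lstrip("/").split("+")[-1].upper()


def map_char(base_font: str, char: str) -> str:
    """Return the fallback replacement for char in base_font, else char unchanged."""
    name = normalize_font_name(base_font)
    for (prefix, code), replacement in FALLBACK.items():
        if name.startswith(prefix) and ord(char) == code:
            return replacement
    return char


if __name__ == "__main__":
    samples = [("/ABCDEF+CMMI10", "@"), ("/ABCDEF+CMEX10", "R"), ("/Helvetica", "R")]
    for font, raw in samples:
        mapped = map_char(font, raw)
        # Print code points rather than glyphs so the output is ASCII-safe.
        print(f"{font}: {raw!r} -> U+{ord(mapped):04X} {unicodedata.name(mapped)}")
```

Keying on the font name keeps ordinary text untouched: an `R` set in Helvetica stays an `R`, and only an `R` from a font whose name starts with `CMEX` would be rewritten.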
Expectations and current state
| File | Expected | pypdf | PyMuPDF | PDFium | Tika | Copy-paste from Evince |
|---|---|---|---|---|---|---|
| cdot.pdf | · | · | · | · | · | |
| hbar.pdf | ħ | ~ | ℏ | ~ | ~ | ~ |
| integral.pdf | ∫ | R | � | R | ∫ | R |
| partial-derivative.pdf | ∂ | @ | ∂ | ∂ | ∂ | ∂ |
| phi.pdf | φ | φ | φ | φ | φ | |
| varphi.pdf | φ | ' | ϕ | ϕ | ϕ | φ |
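
Some cells in the table differ from the expected value only by a Unicode compatibility variant: ℏ (U+210F) versus ħ (U+0127), and ϕ (U+03D5) versus φ (U+03C6). A minimal sketch of a more lenient comparison, assuming NFKC normalization is an acceptable notion of "equal" here (the original script prints raw output and does not do this), could look like the following. The helper name `matches` is made up for illustration, and the sample values are transcribed from the table as they appear.

```python
import unicodedata


def matches(expected: str, extracted: str) -> bool:
    """Compare two strings after NFKC normalization and whitespace stripping."""
    def norm(s: str) -> str:
        return unicodedata.normalize("NFKC", s).strip()
    return norm(expected) == norm(extracted)


# (expected, pypdf, PyMuPDF) values transcribed from the table above,
# written as escapes to avoid encoding ambiguity.
rows = {
    "hbar.pdf": ("\u0127", "~", "\u210f"),
    "varphi.pdf": ("\u03c6", "'", "\u03d5"),
    "partial-derivative.pdf": ("\u2202", "@", "\u2202"),
}

if __name__ == "__main__":
    for name, (expected, pypdf_out, pymupdf_out) in sorted(rows.items()):
        print(f"{name:<25}| pypdf ok: {matches(expected, pypdf_out)!s:<5} "
              f"| PyMuPDF ok: {matches(expected, pymupdf_out)}")
```

With this comparison, PyMuPDF's results for hbar.pdf and varphi.pdf count as matches, while pypdf's `~`, `'` and `@` still count as failures.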
Generated via:
from pypdf import PdfReader
import fitz as PyMuPDF
import pypdfium2 as pdfium
import tika
from tika import parser  # pip install tika

tika.initVM()


def pymupdf_get_text(path) -> str:
    with PyMuPDF.open(path) as doc:
        text = ""
        for page in doc:
            text += page.get_text() + "\n"
    return text


def pdfium_get_text(path: str) -> str:
    # The loop below passes a file path, which PdfDocument accepts.
    text = ""
    pdf = pdfium.PdfDocument(path)
    for i in range(len(pdf)):
        page = pdf.get_page(i)
        textpage = page.get_textpage()
        text += textpage.get_text_range() + "\n"
    return text


expected = {
    "integral.pdf": "∫",
    "cdot.pdf": "·",
    "phi.pdf": "φ",
    "varphi.pdf": "φ",
    "partial-derivative.pdf": "∂",
    "hbar.pdf": "ħ",
}

# Print header
file = "File"
expected_str = "Expected"
c_pypdf = "pypdf"
c_pymupdf = "PyMuPDF"
c_pdfium = "PDFium"
c_tika = "Tika"
print(f"{file:<25}|{expected_str:<9}|{c_pypdf:<5}|{c_pymupdf:<7}|{c_pdfium:<6}|{c_tika:<5}")

# Print separator row
file = "-" * 25
expected_str = "-------"
c_pypdf = "-----"
c_pymupdf = "-------"
c_pdfium = "------"
c_tika = "----"
print(f"{file:<25}|{expected_str:<9}|{c_pypdf:<5}|{c_pymupdf:<7}|{c_pdfium:<6}|{c_tika:<5}")

# Print data: one row per test file, one column per extraction tool
for file in sorted(expected.keys()):
    expected_str = expected.get(file, "unknown")
    c_pypdf = PdfReader(file).pages[0].extract_text().strip()
    c_pymupdf = pymupdf_get_text(file).strip()
    c_pdfium = pdfium_get_text(file).strip()
    c_tika = parser.from_file(file)["content"].strip()
    print(f"{file:<25}|{expected_str:<9}|{c_pypdf:<5}|{c_pymupdf:<7}|{c_pdfium:<6}|{c_tika:<5}")

with those files:
Proof that it's relevant