Improve math character extraction #2009

@MartinThoma

Description

Explanation

Extracting math content is very hard and, at the moment, completely out of reach for constructs such as fractions, subscripts / superscripts, curly braces for piecewise (two-case) definitions, and roots. However, maybe we can improve the extraction a little bit by supporting some heavily used single characters:

Expectations and current state

File Expected pypdf PyMuPDF PDFium Tika Copy-paste from Evince
cdot.pdf · · · · ·
hbar.pdf ħ ~ ~ ~ ~
integral.pdf R R R
partial-derivative.pdf @
phi.pdf φ φ φ φ φ
varphi.pdf φ ' ϕ ϕ ϕ φ

Generated via:

from pypdf import PdfReader
import fitz as PyMuPDF
import pypdfium2 as pdfium
import tika
from tika import parser  # pip install tika

tika.initVM()

def pymupdf_get_text(path) -> str:
    with PyMuPDF.open(path) as doc:
        text = ""
        for page in doc:
            text += page.get_text() + "\n"
    return text

def pdfium_get_text(path: str) -> str:
    text = ""
    pdf = pdfium.PdfDocument(path)  # the script below passes a file path, not raw bytes
    for i in range(len(pdf)):
        page = pdf.get_page(i)
        textpage = page.get_textpage()
        text += textpage.get_text_range() + "\n"
    return text

expected = {
    "integral.pdf": "∫",
    "cdot.pdf": "·",
    "phi.pdf": "φ",
    "varphi.pdf": "φ",
    "partial-derivative.pdf": "∂",
    "hbar.pdf": "ħ",
}

# Print header
file = "File"
expected_str = "Expected"
c_pypdf = "pypdf"
c_pymupdf = "PyMuPDF"
c_pdfium = "PDFium"
c_tika = "Tika"
print(f"{file:<25}|{expected_str:<9}|{c_pypdf:<5}|{c_pymupdf:<7}|{c_pdfium:<6}|{c_tika:<5}")
file = "-" * 25
expected_str = "-------"
c_pypdf = "-----"
c_pymupdf = "-------"
c_pdfium = "------"
c_tika = "----"
print(f"{file:<25}|{expected_str:<9}|{c_pypdf:<5}|{c_pymupdf:<7}|{c_pdfium:<6}|{c_tika:<5}")

# Print data
for file in sorted(expected.keys()):
    expected_str = expected.get(file, "unknown")
    c_pypdf = PdfReader(file).pages[0].extract_text().strip()
    c_pymupdf = pymupdf_get_text(file).strip()
    c_pdfium = pdfium_get_text(file).strip()
    # Tika's "content" may be None when nothing was extracted
    c_tika = (parser.from_file(file)["content"] or "").strip()
    print(f"{file:<25}|{expected_str:<9}|{c_pypdf:<5}|{c_pymupdf:<7}|{c_pdfium:<6}|{c_tika:<5}")
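Several cells in the table look alike but are not the same character (for example φ and ϕ in the varphi.pdf row). The following is a minimal sketch, not part of the original script, for inspecting what an extractor actually returned. The helper names `describe` and `matches_loosely` are hypothetical, and using NFKC normalization for a lenient comparison is an assumption about what might count as "close enough", not something pypdf does.

```python
import unicodedata


def describe(text: str) -> str:
    """Return every character of `text` as 'U+XXXX NAME', comma-separated."""
    return ", ".join(
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in text
    )


def matches_loosely(expected: str, actual: str) -> bool:
    """Compare two strings after NFKC normalization.

    NFKC folds compatibility variants onto a base character, so two
    glyph variants of the same letter can compare equal.
    """
    normalize = unicodedata.normalize
    return normalize("NFKC", expected) == normalize("NFKC", actual)


if __name__ == "__main__":
    # The two phi variants that appear in the varphi.pdf row
    print(describe("\u03c6"))  # U+03C6 GREEK SMALL LETTER PHI
    print(describe("\u03d5"))  # U+03D5 GREEK PHI SYMBOL
    # A lenient comparison treats them as the same letter
    print(matches_loosely("\u03c6", "\u03d5"))
    # A plain "R" returned for the integral sign is still a mismatch
    print(matches_loosely("\u222b", "R"))
```

In the loop above, a call such as `describe(c_pypdf)` could be printed next to each cell to make invisible differences explicit.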

with those files:

Proof that it's relevant

Labels: workflow-text-extraction (From a user's perspective, text extraction is the affected feature/workflow)
