Improve math character extraction

## Explanation

Extracting math content in general is very hard and currently out of reach (e.g. fractions, subscripts / superscripts, curly braces for case distinctions, roots). However, we might improve extraction somewhat by supporting a few heavily used single characters:

* hbar: https://www.compart.com/de/unicode/U+0127
* integral: https://www.compart.com/de/unicode/U+222B
* phi
* delta
* alpha
* beta
* gamma
* partial derivative: https://www.compart.com/de/unicode/U+2202
* times
* cdot
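
For reference, the list above can be pinned to concrete code points. This is only a sketch using the standard library. The choice of the plain lowercase Greek letters, U+00D7 for times and U+00B7 for cdot is an assumption, since only three entries link to a specific code point.

```python
import unicodedata

# Candidate target characters. Only hbar, integral and partial derivative are
# linked to a code point in the list; the other entries are assumptions.
CANDIDATES = {
    "hbar": "\u0127",
    "integral": "\u222b",
    "phi": "\u03c6",
    "delta": "\u03b4",
    "alpha": "\u03b1",
    "beta": "\u03b2",
    "gamma": "\u03b3",
    "partial derivative": "\u2202",
    "times": "\u00d7",  # assumption: MULTIPLICATION SIGN
    "cdot": "\u00b7",   # assumption: MIDDLE DOT rather than U+22C5 DOT OPERATOR
}


def describe(char: str) -> str:
    """Return the code point and official Unicode name of a single character."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


if __name__ == "__main__":
    for label, char in CANDIDATES.items():
        print(f"{label:<20}{char}  {describe(char)}")
```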

## Expectations and current state

File                   | Expected | pypdf | PyMuPDF | PDFium | Tika | Copy-paste from Evince
-----------------------|----------|-------|---------|--------|------|-----------------------
cdot.pdf               | ·        |       | ·       | ·      | ·    | ·
hbar.pdf               | ħ        | ~     | ℏ       | ~      | ~    | ~
integral.pdf           | ∫        | R     | �       | R      | ∫    | R
partial-derivative.pdf | ∂        | @     | ∂       | ∂      | ∂    | ∂
phi.pdf                | φ        |       | φ       | φ      | φ    | φ
varphi.pdf             | φ        | '     | ϕ       | ϕ      | ϕ    | φ
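
One plausible reading of the pypdf column, not confirmed anywhere in this issue: the test files were produced by LaTeX with Computer Modern / AMS math fonts and carry no usable ToUnicode mapping, so the raw character codes leak through as ASCII (`R`, `@`, `'`, `~`) or as control characters that get dropped. Under that assumption, a fallback keyed by font family and character code could look like the sketch below. `TEX_MATH_FALLBACK`, `font_family` and `fallback_char` are hypothetical names, not part of pypdf, and the table follows the "Expected" column above (both phi variants map to U+03C6).

```python
import re
from typing import Optional

# Hypothetical fallback table: (TeX font family, character code) -> Unicode.
# The codes follow the classic TeX math font encodings; whether the test PDFs
# really use these fonts is an assumption.
TEX_MATH_FALLBACK = {
    ("CMSY", 0x01): "\u00b7",  # cdot
    ("CMSY", 0x02): "\u00d7",  # times
    ("CMEX", 0x52): "\u222b",  # integral, shows up as "R"
    ("CMMI", 0x0B): "\u03b1",  # alpha
    ("CMMI", 0x0C): "\u03b2",  # beta
    ("CMMI", 0x0D): "\u03b3",  # gamma
    ("CMMI", 0x0E): "\u03b4",  # delta
    ("CMMI", 0x1E): "\u03c6",  # phi
    ("CMMI", 0x27): "\u03c6",  # varphi, shows up as "'"
    ("CMMI", 0x40): "\u2202",  # partial derivative, shows up as "@"
    ("MSBM", 0x7E): "\u0127",  # hbar (amssymb), shows up as "~"
}


def font_family(base_font: str) -> str:
    """Reduce a name like '/ABCDEF+CMMI10' to the family 'CMMI'."""
    name = base_font.lstrip("/").split("+")[-1]  # drop slash and subset prefix
    return re.sub(r"\d+$", "", name).upper()     # drop the design size


def fallback_char(base_font: str, code: int) -> Optional[str]:
    """Return the Unicode fallback for a character code, or None if unknown."""
    return TEX_MATH_FALLBACK.get((font_family(base_font), code))
```

Such a table would only be consulted when a font provides no mapping of its own, so fonts with a correct ToUnicode map stay unaffected.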


The table above was generated with:

```python
from pypdf import PdfReader
import fitz as PyMuPDF
import pypdfium2 as pdfium
import tika
from tika import parser  # pip install tika

tika.initVM()

def pymupdf_get_text(path) -> str:
    with PyMuPDF.open(path) as doc:
        text = ""
        for page in doc:
            text += page.get_text() + "\n"
    return text

def pdfium_get_text(path: str) -> str:
    text = ""
    pdf = pdfium.PdfDocument(path)  # called with a file path below, not bytes
    for i in range(len(pdf)):
        page = pdf.get_page(i)
        textpage = page.get_textpage()
        text += textpage.get_text_range() + "\n"
    return text

expected = {
    "integral.pdf": "∫",
    "cdot.pdf": "·",
    "phi.pdf": "φ",
    "varphi.pdf": "φ",
    "partial-derivative.pdf": "∂",
    "hbar.pdf": "ħ",
}

# Print header
file = "File"
expected_str = "Expected"
c_pypdf = "pypdf"
c_pymupdf = "PyMuPDF"
c_pdfium = "PDFium"
c_tika = "Tika"
print(f"{file:<25}|{expected_str:<9}|{c_pypdf:<5}|{c_pymupdf:<7}|{c_pdfium:<6}|{c_tika:<5}")
file = "-" * 25
expected_str = "-------"
c_pypdf = "-----"
c_pymupdf = "-------"
c_pdfium = "------"
c_tika = "----"
print(f"{file:<25}|{expected_str:<9}|{c_pypdf:<5}|{c_pymupdf:<7}|{c_pdfium:<6}|{c_tika:<5}")

# Print data
for file in sorted(expected.keys()):
    expected_str = expected.get(file, "unknown")
    c_pypdf = PdfReader(file).pages[0].extract_text().strip()
    c_pymupdf = pymupdf_get_text(file).strip()
    c_pdfium = pdfium_get_text(file).strip()
    # Tika returns None as content when it finds no text
    c_tika = (parser.from_file(file)["content"] or "").strip()
    print(f"{file:<25}|{expected_str:<9}|{c_pypdf:<5}|{c_pymupdf:<7}|{c_pdfium:<6}|{c_tika:<5}")
```

using these files:

* [cdot.pdf](https://github.com/py-pdf/pypdf/files/12145089/cdot.pdf)
* [hbar.pdf](https://github.com/py-pdf/pypdf/files/12145090/hbar.pdf)
* [integral.pdf](https://github.com/py-pdf/pypdf/files/12145091/integral.pdf)
* [partial-derivative.pdf](https://github.com/py-pdf/pypdf/files/12145092/partial-derivative.pdf)
* [phi.pdf](https://github.com/py-pdf/pypdf/files/12145093/phi.pdf)
* [varphi.pdf](https://github.com/py-pdf/pypdf/files/12145094/varphi.pdf)
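
The script only prints raw output, so near misses such as ℏ (U+210F) for ħ (U+0127) or ϕ (U+03D5) for φ (U+03C6) look like failures. A possible way to score results, assuming compatibility-equivalent characters should count as a match, is to compare after NFKC normalization. `same_symbol` is a hypothetical helper, not part of any of the libraries above.

```python
import unicodedata


def same_symbol(extracted: str, expected: str) -> bool:
    """Compare two strings after stripping whitespace and NFKC normalization."""
    def normalize(text: str) -> str:
        return unicodedata.normalize("NFKC", text.strip())

    return normalize(extracted) == normalize(expected)


if __name__ == "__main__":
    # PLANCK CONSTANT OVER TWO PI vs. LATIN SMALL LETTER H WITH STROKE
    print(same_symbol("\u210f", "\u0127"))
    # GREEK PHI SYMBOL vs. GREEK SMALL LETTER PHI
    print(same_symbol("\u03d5", "\u03c6"))
    # "R" vs. INTEGRAL: a real extraction error
    print(same_symbol("R", "\u222b"))
```

With this comparison, the PyMuPDF results for hbar.pdf and varphi.pdf would count as correct, while `R`, `@`, `'` and `~` remain mismatches.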

## Proof that it's relevant

* https://stackoverflow.com/q/76539981/562769
* https://stackoverflow.com/q/53386054/562769
* https://stackoverflow.com/q/48587318/562769
* https://stackoverflow.com/q/66454037/562769
