extract_text doesn't extract ligatures correctly

I appreciate that your tool keeps the original reading order of the text and omits the repeated headers and footers that interrupt body text running across pages, which makes conversion from PDF to EPUB convenient. However, there seems to be a gremlin in the extracted text: some strings are wrongly replaced by punctuation marks, for example
- `fi` replaced by `˛` (e.g. `fields` extracted as `˛elds`)
- `ff` replaced by `˙` or `˜` (e.g. `different` extracted as `di˙erent` and `Differential` extracted as `Di˜erential`)
- `ft` replaced by `˚` (e.g. `after` extracted as `a˚er`)
- `th` replaced by `˜` (e.g. `this` extracted as `˜is`)
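
To make the pattern easier to check, here is a minimal sketch that scans already-extracted text for the characters listed above. It is only an illustration: the function names are hypothetical, the mapping is assumed from the examples in this report and nothing else, and it is a post-processing check rather than a fix for the extraction itself. Because `˜` showed up for both `ff` and `th`, it is only counted, never replaced.

```python
from collections import Counter

# Substitutions observed in this report (assumption: other PDFs or fonts
# may map these characters differently).
UNAMBIGUOUS = {
    "˛": "fi",
    "˙": "ff",
    "˚": "ft",
}
# "˜" was seen for both "ff" and "th", so it cannot be repaired blindly.
AMBIGUOUS = {"˜"}


def count_suspects(text):
    """Count how often each suspect character occurs in the text."""
    suspects = set(UNAMBIGUOUS) | AMBIGUOUS
    return Counter(ch for ch in text if ch in suspects)


def repair_unambiguous(text):
    """Replace only the characters that had a single observed meaning."""
    return text.translate(str.maketrans(UNAMBIGUOUS))
```

Running `count_suspects` on the output of `extract_text` gives a rough idea of how many ligatures were lost on a page, and `repair_unambiguous` can serve as a stopgap for the three one-to-one cases.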
For comparison, I also tried the Python `pdftotext` package and found no such problem, but that package keeps the original layout, which puts the text of two columns on the same row and is not ideal for PDF conversion.
Any clue about this issue? Looking forward to a perfect PyPDF because it is so useful!

extract_text doesn't extract ligatures correctly #598
