extract_text doesn't extract ligatures correctly #598

@Baytars

I appreciate that your tool keeps the original reading order of the text and omits the repeated headers and footers that interrupt body text running across pages, which makes conversion from PDF to EPUB convenient. However, there seems to be a gremlin in the extracted text: some ligatures (letter combinations) are replaced by punctuation-like marks, for example:

  • fi is replaced by ˛ (e.g. fields is extracted as ˛elds)
  • ff is replaced by ˙ or ˜ (e.g. different is extracted as di˙erent and Differential as Di˜erential)
  • ft is replaced by ˚ (e.g. after is extracted as a˚er)
  • th is replaced by ˜ (e.g. this is extracted as ˜is)
For comparison, I also tried the Python pdftotext package and it does not have this problem, but that package keeps the original page layout, which puts the text of two columns side by side on the same line and is not ideal for PDF conversion.

Any clue about this issue? Looking forward to a perfect PyPDF because it is so useful!
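
In the meantime, the substitutions listed above could be reversed on the caller's side after extraction. The following is only a minimal sketch and not part of PyPDF: the names `CANDIDATES`, `repair_ligatures` and the `vocabulary` argument are made up for illustration, and the mapping is taken solely from the examples in this report, so other fonts or PDFs may produce different characters. Because ˜ stands for both ff and th, a word containing it is only repaired when exactly this ambiguity can be resolved against a caller-supplied word list; otherwise the word is left untouched.

```python
import re
from itertools import product

# Substitutions observed in this report only (assumption: other PDFs may differ).
# "˜" is ambiguous: it was seen in place of both "ff" and "th".
CANDIDATES = {
    "˛": ["fi"],
    "˙": ["ff"],
    "˚": ["ft"],
    "˜": ["ff", "th"],
}

# A "word" is a run of ASCII letters and/or the garbled characters.
_WORD = re.compile("[A-Za-z˛˙˚˜]+")


def _repair_word(word, vocabulary):
    # For every character, list what it could stand for (itself if not garbled).
    slots = [CANDIDATES.get(ch, [ch]) for ch in word]
    options = ["".join(parts) for parts in product(*slots)]
    if len(options) == 1:
        # No ambiguous character: the single expansion is the answer.
        return options[0]
    for option in options:
        if option.lower() in vocabulary:
            return option
    # Ambiguous and not found in the vocabulary: leave the word as extracted.
    return word


def repair_ligatures(text, vocabulary=frozenset()):
    """Replace the garbled characters with the ligature letters they stand for."""
    return _WORD.sub(lambda m: _repair_word(m.group(0), vocabulary), text)


if __name__ == "__main__":
    words = {"differential", "this"}  # hypothetical word list
    print(repair_ligatures("a˚er ˜is, di˙erent ˛elds", words))
```

The word list is what decides between ff and th, so a larger dictionary resolves more cases. This only hides the symptom in the extracted string; it does not address why the wrong characters are produced in the first place.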

Labels

  • PdfReader: The PdfReader component is affected
  • is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
  • workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
