Closed
Labels
PdfReader: The PdfReader component is affected
is-bug: From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
workflow-text-extraction: From a users perspective, text extraction is the affected feature/workflow
Description
I appreciate that your tool keeps the original reading order of the text and omits the repeated headers and footers that interrupt body text running across pages, which makes converting PDF to EPUB convenient. However, there seems to be a gremlin in the extracted text: some strings are wrongly replaced by punctuation-like marks, for example:
- "fi" is replaced by "˛" (e.g. "fields" is extracted as "˛elds")
- "ff" is replaced by "˙" or "˜" (e.g. "different" is extracted as "di˙erent" and "Differential" as "Di˜erential")
- "ft" is replaced by "˚" (e.g. "after" is extracted as "a˚er")
- "th" is replaced by "˜" (e.g. "this" is extracted as "˜is")
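To help triage, here is a rough sketch of how affected words could be flagged in already-extracted text. It does not touch the extraction library and is not a fix. The character table is copied from the examples above, and `SUSPECT_CHARS`, `candidate_repairs` and `find_suspect_words` are made-up names for illustration. Because "˜" was seen for both "ff" and "th", the helper lists every candidate instead of guessing.

```python
from itertools import product

# Stray characters observed in this report, mapped to the letter pairs they
# appear to stand in for. This table is an assumption built only from the
# examples in this issue; other PDFs may show different substitutions.
SUSPECT_CHARS = {
    "˛": ["fi"],
    "˙": ["ff"],
    "˜": ["ff", "th"],  # ambiguous: seen for both "ff" and "th"
    "˚": ["ft"],
}


def candidate_repairs(word):
    """Return every spelling obtained by expanding the suspect characters."""
    options = [SUSPECT_CHARS.get(ch, [ch]) for ch in word]
    return ["".join(parts) for parts in product(*options)]


def find_suspect_words(text):
    """Map each whitespace-separated word containing a suspect character
    to its list of candidate repairs."""
    return {
        word: candidate_repairs(word)
        for word in text.split()
        if any(ch in SUSPECT_CHARS for ch in word)
    }


if __name__ == "__main__":
    sample = "a˚er ˜is the ˛elds are di˙erent"
    for word, candidates in find_suspect_words(sample).items():
        print(word, "->", candidates)
```

Words with more than one candidate (such as "˜is") would still need a dictionary check or a manual look to pick the right spelling.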
For comparison, I also tried the Python pdftotext package and found no such problem, but that package keeps the original page layout, placing the two columns of text side by side in each row, which is not ideal for PDF conversion.
Any clue about this issue? Looking forward to a perfect PyPDF because it is so useful!