Closed
Labels
PdfReader: The PdfReader component is affected
is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
Description
Hello,
I am using PyPDF2 to extract Spanish text with the code below. The problem occurs with the quotation marks “ ”. In the output of extractText(), quoted text comes out in the form fi + something + fl.
For example, “Quijote” becomes fiQuijotefl, and “De la Mancha” becomes fiDe La Manchafl.
I tried to remove them with page_text = re.sub(r"['',\“\”\Œ]", '', page_text), but it did not work.
Is there any way to prevent this? Thanks.
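import PyPDF2

text = []
# Open the PDF in binary mode; the with block closes the file afterwards.
with open('X.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=False)
    # Skip the first four pages (indices 0-3) and extract the rest.
    for p in range(4, pdfReader.numPages):
        pageObj = pdfReader.getPage(p)
        page_text = pageObj.extractText()
        text.append(page_text)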
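As a possible workaround until the extraction itself is fixed, here is a minimal sketch of a post-processing step. It assumes that the fi and fl in the output are the single-character ligatures U+FB01 and U+FB02 rather than the two-letter sequences "f" + "i" and "f" + "l"; I have not confirmed that for this PDF. That would also explain why the re.sub call above had no effect, since its pattern does not contain those characters. The helper names describe_non_ascii and restore_quotes are made up for illustration and are not part of PyPDF2.

```python
import unicodedata

# Assumption: the extracted text contains the single-character ligatures
# U+FB01 (fi) and U+FB02 (fl) where the PDF shows curly double quotes.
LIGATURE_TO_QUOTE = str.maketrans({
    "\ufb01": "\u201c",  # fi ligature -> left double quotation mark
    "\ufb02": "\u201d",  # fl ligature -> right double quotation mark
})


def describe_non_ascii(page_text):
    """Return (character, Unicode name) pairs for each distinct non-ASCII character."""
    seen = sorted({ch for ch in page_text if ord(ch) > 127})
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in seen]


def restore_quotes(page_text):
    """Replace the ligature characters with the quotation marks they stand in for."""
    return page_text.translate(LIGATURE_TO_QUOTE)


sample = "\ufb01Quijote\ufb02"  # hypothetical extractText() output
print(describe_non_ascii(sample))  # shows which characters are really present
print(restore_quotes(sample))
```

Running describe_non_ascii on a real page first should show whether the assumption holds. Note that this mapping would also rewrite genuine fi/fl ligatures inside words, so it is only safe if the document does not otherwise use them.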