Reading Spanish text - quotation mark conversion issue #635

@senemaktas

Description

Hello,

I am using PyPDF2 to extract Spanish text with the code below. The problem occurs with the quotation marks “ ”. In the output of extractText, they come out in the form fi + something + fl.

For example, “Quijote” comes out as fiQuijotefl, and “De la Mancha” as fiDe la Manchafl.

I tried to remove them like this, but it did not work: page_text = re.sub(r"['',\“\”\Œ]", '', page_text)
Is there any way to prevent this? Thanks.

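import PyPDF2

text = []
# Open the PDF in binary mode; the context manager closes the file afterwards.
with open('X.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=False)

    # Extract text page by page, starting at the fifth page (index 4).
    for p in range(4, pdfReader.numPages):
        pageObj = pdfReader.getPage(p)
        page_text = pageObj.extractText()
        text.append(page_text)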
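
A possible post-processing workaround, sketched under one assumption that the post does not confirm: that the stray "fi" and "fl" in the extracted text are the single Unicode ligature characters U+FB01 and U+FB02 rather than the two-letter sequences. If so, the character class in the re.sub attempt above would never match them, because it does not list those code points. The helper names fix_quotes and strip_quotes are hypothetical, not part of PyPDF2.

```python
# Assumption: the misread quotation marks arrive as the ligature code points
# U+FB01 (fi ligature) and U+FB02 (fl ligature).
LIGATURE_TO_QUOTE = str.maketrans({"\ufb01": "\u201c", "\ufb02": "\u201d"})
LIGATURE_REMOVAL = str.maketrans("", "", "\ufb01\ufb02")


def fix_quotes(page_text: str) -> str:
    """Map the two ligature characters back to opening and closing curly quotes."""
    return page_text.translate(LIGATURE_TO_QUOTE)


def strip_quotes(page_text: str) -> str:
    """Drop the two ligature characters entirely."""
    return page_text.translate(LIGATURE_REMOVAL)


sample = "\ufb01Quijote\ufb02"
print(fix_quotes(sample))
print(strip_quotes(sample))
```

Either function could be applied to page_text before it is appended to text. One caveat: if the PDF also contains genuine fi or fl ligatures inside words, they would be rewritten too, so the result is worth checking on a few pages first.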

Metadata

    Labels

    PdfReader: The PdfReader component is affected
    is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
    workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
