Encoding issue in extract_text() #235

@macabeus

I need to read this PDF.
However, the text is not extracted correctly.

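from PyPDF2 import PdfFileReader  # assumes the legacy PyPDF2 API used in this report

# Open the PDF in binary mode and extract the text of the first page
with open('myfile.pdf', 'rb') as f:
    reader = PdfFileReader(f)
    content = reader.getPage(0).extractText()

print(content)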

This prints

Resultado da Prova de Sele“‰o...

But I expected

Resultado da Prova de Seleção...

According to an answer on Stack Overflow, this is a problem in PyPDF itself.
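
The two wrong characters are consistent with one possible cause, which is my assumption and not something confirmed here: the page's text bytes are in MacRomanEncoding but are being decoded as PDFDocEncoding. Byte 0x8D is `“` in PDFDocEncoding and `ç` in Mac Roman, and byte 0x8B is `‰` in PDFDocEncoding and `ã` in Mac Roman. A minimal sketch of that mapping follows. The table and the function name `redecode_as_mac_roman` are hypothetical, the table only covers the two characters seen in this report, and this is not a general fix.

```python
# Hypothetical partial table: character produced by PDFDocEncoding -> original byte.
# Only the two bytes observed in this report are listed.
PDFDOC_CHAR_TO_BYTE = {
    "\u201c": 0x8D,  # left double quotation mark
    "\u2030": 0x8B,  # per mille sign
}


def redecode_as_mac_roman(text):
    """Re-interpret known mis-decoded characters as Mac Roman bytes."""
    out = []
    for ch in text:
        byte = PDFDOC_CHAR_TO_BYTE.get(ch)
        if byte is None:
            out.append(ch)  # leave characters outside the table untouched
        else:
            out.append(bytes([byte]).decode("mac_roman"))
    return "".join(out)


print(redecode_as_mac_roman("Resultado da Prova de Sele\u201c\u2030o"))
```

If the assumption holds, this turns the extracted string back into the expected one, which suggests the font's declared encoding is not being taken into account during extraction.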

Labels: workflow-text-extraction (from a user's perspective, text extraction is the affected feature/workflow)