Failed extracting text from French texts #524

@martosc

I am trying to extract the text from a PDF written in French, but some accented characters (é, û, ç, etc.) are not extracted correctly.
The documentation for extractText() says: "This works well for some PDF files, but poorly for others, depending on the generator used." I don't know whether the poor behaviour I am getting is because I have not chosen a generator.

I am using the following code:

from PyPDF2 import PdfFileReader

reader = PdfFileReader("example.pdf")
page = reader.getPage(2)  # page indices are zero-based, so this is the third page
print(page.extractText())
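
To make the problem easier to pin down, the extracted string could be inspected character by character. The following is only a minimal sketch using the standard library: `describe_non_ascii` is a hypothetical helper name, and `sample` is a made-up string standing in for the real output of `page.extractText()`. It lists each distinct non-ASCII character with its code point and Unicode name, which should show whether the accents come out as the expected precomposed letters, as separate combining marks, or as something else entirely.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (char, code point, Unicode name) for each distinct non-ASCII character.

    Hypothetical diagnostic helper, not part of PyPDF2.
    """
    distinct = sorted({ch for ch in text if ord(ch) > 127})
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in distinct
    ]


# Made-up sample; in practice this would be the string returned by extractText().
# Escapes are used so the code points are unambiguous: ç, û, ê.
sample = "Le gar\u00e7on a d\u00fb fermer la fen\u00eatre."

for ch, code, name in describe_non_ascii(sample):
    print(ch, code, name)
```

If the report shows combining marks (for example U+0301) next to plain letters instead of precomposed characters, trying `unicodedata.normalize("NFC", text)` on the extracted string may be worthwhile; if the accented letters are missing or replaced altogether, normalization will not help.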

Should I choose a generator somewhere in my code, or is this normal behaviour for certain PDFs?

Thank you in advance.

Labels

is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow