ExtractText yields nothing for apparently good PDF #168

@chrisinmtown

PyPDF2 version 1.23 fails to extract any text from the first 3 pages of this PDF file:
http://emma.msrb.org/EP295293-EP10300-EP632440.pdf

The file seems well-formed to me; both Acrobat and Evince display it correctly. The Linux utility pdftotext converts it to text, and I see the expected content there.
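
For comparison, the pdftotext cross-check can be scripted. This is only a minimal sketch, assuming the Poppler `pdftotext` binary is on the PATH and Python 3 is available; the helper names `pdftotext_command` and `pdftotext_first_pages` are made up for illustration.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, last_page=3):
    """Build the pdftotext argument list for pages 1..last_page."""
    # -f / -l select the first and last page; "-" as the output name means stdout.
    return ["pdftotext", "-f", "1", "-l", str(last_page), pdf_path, "-"]


def pdftotext_first_pages(pdf_path, last_page=3):
    """Return the text pdftotext finds, or None if the binary is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        pdftotext_command(pdf_path, last_page),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if pdftotext reports a failure
    )
    return result.stdout.decode("utf-8", errors="replace")
```

A non-empty result from `pdftotext_first_pages(filename)` alongside an empty `extractText()` would point at the extraction step rather than at the file.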

Here's the relevant bit of my little script:

    from PyPDF2 import PdfFileReader

    with open(filename, "rb") as pdf_file:
        try:
            pdf_obj = PdfFileReader(pdf_file)
            # gather properties
            prop_en = pdf_obj.getIsEncrypted()
            err = ""
            if not prop_en:
                # Look for any text on the first N pages
                prop_img = True
                prop_pg = pdf_obj.getNumPages()
                for i in range(min(prop_pg, 3)):
                    pagei = pdf_obj.getPage(i)
                    pageitext = pagei.extractText()
                    # Set property and stop searching at first text found
                    if len(pageitext) > 0:
                        prop_img = False
                        break
        except Exception as e:
            # record any read/parse failure instead of aborting
            err = str(e)

Is there a gotcha here that I'm missing? Please advise, and thanks in advance for any help.
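
To make the check easier to reproduce, the page test in the script can be separated from PyPDF2 so it runs on plain strings. The following is a sketch under stated assumptions: the name `first_text_page` is hypothetical, and stripping whitespace is an addition not present in the script above, on the assumption that a page yielding only blanks or newlines should not count as text.

```python
def first_text_page(page_texts, max_pages=3):
    """Return the index of the first page with non-whitespace text, else None.

    page_texts is any iterable of strings, one per page, in page order.
    Only the first max_pages entries are examined.
    """
    for index, text in enumerate(page_texts):
        if index >= max_pages:
            break
        # Assumption: whitespace-only output is treated the same as no text.
        if text and text.strip():
            return index
    return None
```

With the reader from the script, it could be fed lazily as `first_text_page(pdf_obj.getPage(i).extractText() for i in range(pdf_obj.getNumPages()))`, so pages past the first hit are never extracted. For the file in this report the result would be `None`, given that no text comes back from the first 3 pages.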

Labels

Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
