ExtractText yields nothing for apparently good PDF #168

PyPDF2 version 1.23 fails to extract any text from the first 3 pages of this PDF file:
   http://emma.msrb.org/EP295293-EP10300-EP632440.pdf

The file seems well-formed to me; both Acrobat and evince display it nicely. The Linux utility pdftotext converts it to text, and I see the expected content just fine.
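For anyone who wants to repeat the pdftotext cross-check from Python, something like the following might do. It is only a sketch: the helper names `pdftotext_command` and `pdftotext_pages` are made up for illustration, it assumes Python 3 and that the poppler `pdftotext` binary is on the PATH, and it returns `None` rather than raising when the tool is missing or fails.

```python
import shutil
import subprocess


def pdftotext_command(path, last_page=3):
    """Build a pdftotext command for pages 1..last_page, writing text to stdout.

    -f / -l select the first and last page; a trailing "-" means stdout.
    """
    return ["pdftotext", "-f", "1", "-l", str(last_page), path, "-"]


def pdftotext_pages(path, last_page=3):
    """Return the text pdftotext extracts, or None if the tool is missing or fails."""
    if shutil.which("pdftotext") is None:
        return None  # poppler's pdftotext is not installed (assumed dependency)
    result = subprocess.run(
        pdftotext_command(path, last_page),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if result.returncode != 0:
        return None
    return result.stdout.decode("utf-8", errors="replace")
```

If `pdftotext_pages(filename)` returns non-blank text while `extractText()` returns an empty string for the same pages, the difference lies in the extraction step rather than in the file itself.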

Here's the relevant bit of my little script:

``` python
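from PyPDF2 import PdfFileReader

# `filename` is set earlier in the script
with open(filename, "rb") as pdf_file:
    try:
        pdf_obj = PdfFileReader(pdf_file)
        # Gather properties
        prop_en = pdf_obj.getIsEncrypted()
        err = ""
        if not prop_en:
            # Assume image-only until text turns up on one of the first N pages
            prop_img = True
            prop_pg = pdf_obj.getNumPages()
            # range() works on both Python 2 and 3 (the script originally used xrange)
            for i in range(min(prop_pg, 3)):
                pagei = pdf_obj.getPage(i)
                pageitext = pagei.extractText()
                # Clear the flag and stop searching at the first text found
                if len(pageitext) > 0:
                    prop_img = False
                    break
    except Exception as exc:
        # Assumed handler (the excerpt stops before it): record the message
        err = str(exc)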
```

Is there a gotcha here that I'm missing? Please advise; thanks in advance for any help.

