Closed
Labels
Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
is-bug: From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
workflow-text-extraction: From a users perspective, text extraction is the affected feature/workflow
Description
PyPDF2 version 1.23 fails to extract any text from the first 3 pages of this PDF file:
http://emma.msrb.org/EP295293-EP10300-EP632440.pdf
The file seems well-formed to me: both Acrobat and evince display it correctly, and the Linux utility pdftotext converts it to text with the expected content.
Here's the relevant bit of my little script:
from PyPDF2 import PdfFileReader

# filename is the path to the PDF, set elsewhere in the script
with open(filename, "rb") as pdf_file:
    try:
        pdf_obj = PdfFileReader(pdf_file)
        # gather properties
        prop_en = pdf_obj.getIsEncrypted()
        err = ""
        if not prop_en:
            # Look for any text on the first N pages
            prop_img = True
            prop_pg = pdf_obj.getNumPages()
            for i in range(min(prop_pg, 3)):
                pagei = pdf_obj.getPage(i)
                pageitext = pagei.extractText()
                # Set property and stop searching at first text found
                if len(pageitext) > 0:
                    prop_img = False
                    break
    except Exception as exc:
        # The handler is not part of this excerpt; storing the message in err is an assumption
        err = str(exc)

Is there a gotcha here that I'm missing? Please advise. Thanks in advance for any help.
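
For comparison, here is a minimal sketch of how the pdftotext check mentioned above could be scripted for the same first 3 pages. The helper names `pdftotext_command` and `extract_with_pdftotext` are hypothetical, and the sketch assumes Python 3.7+ and the poppler `pdftotext` binary on the PATH; `-f` and `-l` select the first and last page, and `-` as the output file sends the text to stdout.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, first_page=1, last_page=3):
    """Build the argv for pdftotext (hypothetical helper, pages are 1-based)."""
    return ["pdftotext", "-f", str(first_page), "-l", str(last_page), pdf_path, "-"]


def extract_with_pdftotext(pdf_path, first_page=1, last_page=3):
    """Return the text pdftotext finds on the given page range."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, first_page, last_page),
        capture_output=True,
        check=True,
    )
    # Decode leniently so odd bytes in the output do not abort the comparison
    return result.stdout.decode("utf-8", errors="replace")
```

If `extract_with_pdftotext(filename)` returns non-empty text while `extractText()` returns an empty string for the same pages, that would point at the PyPDF2 extraction step rather than at the file.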