Closed
Labels
is-bug: From a user's perspective, this is a bug, a violation of the expected behavior with a compliant PDF
needs-pdf: The issue needs a PDF file to show the problem
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
Description
I am using Python 3.6.1 on Windows 8.1 and want to extract certain text from a group of PDF files. To do so, I am using the code below, which works fine and returns the text of the first page as one continuous string:
import PyPDF2
# creating a pdf file object
pdfFileObj = open('C:/Google Drive/Ward 29/data/ndvi.pdf', 'rb')
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=False)
# getting the number of pages in pdf file
number_of_pages = pdfReader.getNumPages()
# creating a page object
pageObj = pdfReader.getPage(0)
page_content = pageObj.extractText()
print(page_content)
# closing the pdf file object
pdfFileObj.close()
However, print(page_content) prints an empty string when I use another PDF file, “55 HARRISON GARDEN.pdf”, which is the one I actually need to extract information from:
# This code works for the ndvi file, but returns an empty string for the
# Harrison Garden file. I need to figure out why.
import PyPDF2
# creating a pdf file object
pdfFileObj = open('C:/Google Drive/Ward 29/data/55 HARRISON GARDEN.pdf', 'rb')
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=False)
# getting the number of pages in pdf file
number_of_pages = pdfReader.getNumPages()
# creating a page object
pageObj = pdfReader.getPage(0)
page_content = pageObj.extractText()
print(page_content)
# closing the pdf file object
pdfFileObj.close()
Can anyone help me figure out how to fix this so that it can read “55 HARRISON GARDEN.pdf” as well?
Here are the files I mentioned: https://drive.google.com/drive/folders/1wzrPsPoeqZolsd7u0NS-I44sliFhxIpu?usp=sharing
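Both snippets above only read page 0, so it may help to check whether the empty result is limited to the first page or applies to every page of the file. The following is a minimal diagnostic sketch, not a fix. It assumes the same legacy PyPDF2 API used above (PdfFileReader, getNumPages, getPage, extractText); the helper names find_empty_pages and check_file are hypothetical and introduced only for illustration.

```python
def find_empty_pages(reader):
    """Return the indices of pages whose extracted text is empty or whitespace.

    `reader` is expected to behave like the PyPDF2.PdfFileReader used above,
    i.e. to provide getNumPages() and getPage(index).extractText().
    """
    empty = []
    for index in range(reader.getNumPages()):
        text = reader.getPage(index).extractText()
        if not text or not text.strip():
            empty.append(index)
    return empty


def check_file(path):
    """Hypothetical helper: open a PDF and report which pages yield no text."""
    import PyPDF2  # imported here so find_empty_pages can be used without PyPDF2

    with open(path, 'rb') as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file, strict=False)
        total = reader.getNumPages()
        empty = find_empty_pages(reader)
    print('{}: {} of {} pages returned no text'.format(path, len(empty), total))
    return empty
```

Calling check_file('C:/Google Drive/Ward 29/data/55 HARRISON GARDEN.pdf') and then check_file on the ndvi file would show whether every page of the first file comes back empty. If so, the issue is with how that file stores its content rather than with the choice of page.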