extract_text works for some PDF files, but not the others #437

@babak-khamsehi

I am using Python 3.6.1 on Windows 8.1 and I want to extract certain text from a group of PDF files. To do so, I am using the code below, which works fine and returns the first page's text as one continuous string:

import PyPDF2
# creating a pdf file object
pdfFileObj = open('C:/Google Drive/Ward 29/data/ndvi.pdf', 'rb')
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=False)
# getting the number of pages in pdf file
number_of_pages = pdfReader.getNumPages()
# creating a page object
pageObj = pdfReader.getPage(0)
page_content = pageObj.extractText()
print(page_content)
# closing the pdf file object
pdfFileObj.close()

However, print(page_content) prints an empty string when I use another PDF file, “55 HARRISON GARDEN.pdf”, which is the one I actually need to extract information from:

# This code works for the ndvi file, but returns an empty string for the
# harrison gdn file! I need to figure out why
import PyPDF2
# creating a pdf file object
pdfFileObj = open('C:/Google Drive/Ward 29/data/55 HARRISON GARDEN.pdf', 'rb')
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=False)
# getting the number of pages in pdf file
number_of_pages = pdfReader.getNumPages()
# creating a page object
pageObj = pdfReader.getPage(0)
page_content = pageObj.extractText()
print(page_content)
# closing the pdf file object
pdfFileObj.close()
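
The code above only looks at page 0, so a small diagnostic that checks every page may help narrow the problem down. This is a minimal sketch, assuming the same legacy PyPDF2 calls used above (PdfFileReader, getNumPages, getPage, extractText); the helper names pages_without_text and check_file are hypothetical and not part of PyPDF2.

```python
def pages_without_text(reader):
    """Return the indices of pages whose extracted text is empty or whitespace.

    `reader` is assumed to expose getNumPages() and getPage(i), and each page
    is assumed to expose extractText(), as in the legacy PyPDF2 API used above.
    """
    empty_pages = []
    for index in range(reader.getNumPages()):
        text = reader.getPage(index).extractText()
        if not text or not text.strip():
            empty_pages.append(index)
    return empty_pages


def check_file(path):
    """Open a PDF at `path` and report which pages yield no text (hypothetical helper)."""
    import PyPDF2  # imported here so pages_without_text stays usable without it

    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file, strict=False)
        return pages_without_text(reader)
```

Calling check_file on each of the two files would show whether only the first page or every page comes back empty. If every page is empty, one possible explanation is that the pages are scanned images with no text layer, but that has not been confirmed for this file.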

Can anyone help me figure out how to fix this so that it also reads “55 HARRISON GARDEN.pdf”?

Here are the files I mentioned: https://drive.google.com/drive/folders/1wzrPsPoeqZolsd7u0NS-I44sliFhxIpu?usp=sharing

Labels: is-bug, needs-pdf, workflow-text-extraction
