Unable to read bullets  #230

@bonsonsm

Hello,
I am converting a PDF file into a text file. Wherever a line of text starts with a bullet point, the bullet is missing from the extracted text.
I need to know where bullets occur so that I can do some post-processing, but the extracted text contains no bullet characters at all.

Below is my code:

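import PyPDF2

def extractTextFromPDF(strDownloadDirectory, fileName, txtFilePath):
    # Uses the legacy (pre-3.0) PyPDF2 names: PdfFileReader, getNumPages, getPage, extractText.
    filePathName = strDownloadDirectory + fileName
    print(fileName)
    # Drop the 4-character extension (".pdf") and build the output path.
    txtFilePath = txtFilePath + fileName[:-4] + '.txt'
    # The context managers close both files even if extraction fails.
    with open(filePathName, 'rb') as pdfFileObj, \
            open(txtFilePath, 'w', encoding='utf-8') as target_file:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        intPages = pdfReader.getNumPages()
        print(intPages)
        for i in range(intPages):
            objPDFObj = pdfReader.getPage(i)
            strText = objPDFObj.extractText().rstrip()
            # Replace non-breaking spaces and collapse all whitespace runs
            # (this also removes line breaks within the page).
            strText = " ".join(strText.replace(u"\xa0", " ").strip().split())
            print(strText)
            # Write inside the loop so every page is saved, not only the last one.
            target_file.write(strText + "\n")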
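
For context, here is a minimal sketch of the post-processing I have in mind, assuming the bullet glyphs were present in the extracted text and the line breaks were kept (the whitespace collapsing in my code above would have to be skipped for that). The helper name `find_bullet_lines` and the set of bullet characters in `BULLET_CHARS` are my own assumptions for illustration and are not part of PyPDF2.

```python
# Hypothetical helper for illustration; not part of PyPDF2.
# Assumed set of glyphs commonly used as bullets; a given PDF may use others.
BULLET_CHARS = ("\u2022", "\u25cf", "\u25e6", "\u25aa", "\u2023", "\u00b7")


def find_bullet_lines(text):
    """Return (line_number, content) pairs for lines starting with a bullet glyph."""
    results = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.lstrip()
        # str.startswith accepts a tuple and matches any of its entries.
        if stripped.startswith(BULLET_CHARS):
            # Drop the bullet glyph itself and any padding after it.
            results.append((number, stripped[1:].strip()))
    return results


sample = "Intro\n\u2022 first item\n  \u25cf second item\nplain line"
for line_number, content in find_bullet_lines(sample):
    print(line_number, content)
```

This only works if the extraction step emits the bullet characters in the first place, which is exactly what is missing right now.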
