Closed
Labels
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
Description
Hello,
I am converting a PDF file into a text file. In the extracted text, the bullet character is missing wherever a line starts with a bullet point.
I need to know where bullets occur so that I can do some post-processing, but the extracted text does not contain them.
Below is my code:
import PyPDF2

def extractTextFromPDF(strDownloadDirectory, fileName, txtFilePath):
    # Open the source PDF in binary mode.
    filePathName = strDownloadDirectory + fileName
    pdfFileObj = open(filePathName, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    intPages = pdfReader.getNumPages()
    print(intPages)
    strText = ''
    print(fileName)
    # Drop the 4-character '.pdf' extension to build the output file name.
    fileName = fileName[:-4]
    txtFilePath = txtFilePath + fileName + '.txt'
    target_file = open(txtFilePath, "w", encoding='utf-8')
    for i in range(intPages):
        objPDFObj = pdfReader.getPage(i)
        strText = objPDFObj.extractText().rstrip()
        # Replace non-breaking spaces and collapse every whitespace run
        # (including newlines) into a single space.
        strText = " ".join(strText.replace(u"\xa0", " ").strip().split())
        print(strText)
        target_file.write(strText)
    target_file.close()
    pdfFileObj.close()
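For the post-processing step, here is a minimal sketch of how bullet glyphs could be located in the extracted text, assuming the extractor emits them at all (which is exactly what is not happening here). `find_bullets` and `BULLET_CHARS` are hypothetical names, not part of PyPDF2, and the list of code points is only a guess at which glyphs a PDF might use for bullets. Because the code above collapses newlines into spaces, the sketch scans the whole string rather than looking only at line starts.

```python
# Hypothetical helper, not part of PyPDF2: locate bullet-like glyphs
# in a string of extracted text.
BULLET_CHARS = {
    "\u2022",  # bullet
    "\u25e6",  # white bullet
    "\u25aa",  # black small square
    "\u2023",  # triangular bullet
    "\u00b7",  # middle dot
    "\uf0b7",  # private-use code point sometimes seen for Symbol-font bullets (assumption)
}


def find_bullets(text):
    """Return (index, character) pairs for every bullet-like glyph in text."""
    return [(i, ch) for i, ch in enumerate(text) if ch in BULLET_CHARS]


sample = "Intro \u2022 first item \u2022 second item"
print(find_bullets(sample))
```

Running this on the `strText` of each page would show whether any bullet code points survive extraction. An empty result for a page that visibly contains bullets would confirm that the glyphs are dropped before the text reaches this code.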