I just want to convert a PDF file to a txt file, but the run fails with a TypeError on the third page.
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-5.10.124-linuxkit-x86_64-with-glibc2.31
$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.11.0
Code + PDF
This is a minimal, complete example that shows the issue:
import PyPDF2

print(PyPDF2.__version__)


def hanleOnePage(pdfreader, pIdx, outputFile):
    print("hanle Page:%s" % (pIdx))
    pageobj=pdfreader.getPage(pIdx)
    text=pageobj.extractText()
    # print(text)
    # append this page's text; the with block closes the file afterwards
    with open(outputFile,"a") as file1:
        file1.writelines(text)


# open the PDF in binary mode ('rb')
# pdffileobj=open('01bool.pdf','rb')
pdffileobj=open('02voc.pdf','rb')
# create the reader that will read pdffileobj
pdfreader=PyPDF2.PdfFileReader(pdffileobj)
# number of pages in this PDF file
x=pdfreader.numPages
print("PDF:numPages:%s" % (x))
# walk through every page index and extract its text
pIndex = 0
while pIndex < x:
    hanleOnePage(pdfreader, pIndex, "all.txt")
    pIndex += 1
pdffileobj.close()
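As a stopgap, the page loop above could skip pages that fail so the rest of the file still gets converted. This is only a sketch: `extract_pages` is a hypothetical helper name, it assumes each page object has an `extract_text()` method (the method the traceback shows `extractText()` forwarding to), and it only catches the `TypeError` reported here.

```python
def extract_pages(pages, output_path):
    """Write each page's text to output_path.

    Returns the indexes of the pages whose extraction raised TypeError.
    `pages` is any iterable of objects with an extract_text() method
    (assumption: PyPDF2 page objects fit this shape).
    """
    failed = []
    with open(output_path, "w", encoding="utf-8") as out:
        for index, page in enumerate(pages):
            try:
                text = page.extract_text()
            except TypeError as exc:
                # Record the failing page and keep going with the next one.
                failed.append(index)
                print("skipping page %s: %s" % (index, exc))
                continue
            out.write(text)
            out.write("\n")
    return failed
```

With PyPDF2 2.11.0 this might be called as `extract_pages(PyPDF2.PdfReader("02voc.pdf").pages, "all.txt")`. It does not fix the underlying problem; it only reports which pages are affected.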
Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!
02voc.pdf
Traceback
This is the complete Traceback I see:
root@0cc46add0ae3:/home/learn/IR/irBooks# python tokens-step1.py
2.11.0
PDF:numPages:29
hanle Page:0
hanle Page:1
hanle Page:2
Traceback (most recent call last):
  File "/home/learn/IR/irBooks/tokens-step1.py", line 39, in <module>
    hanleOnePage(pdfreader, pIndex, "3.txt")
  File "/home/learn/IR/irBooks/tokens-step1.py", line 12, in hanleOnePage
    text=pageobj.extractText()
  File "/usr/local/lib/python3.10/site-packages/PyPDF2/_page.py", line 1865, in extractText
    return self.extract_text()
  File "/usr/local/lib/python3.10/site-packages/PyPDF2/_page.py", line 1818, in extract_text
    return self._extract_text(
  File "/usr/local/lib/python3.10/site-packages/PyPDF2/_page.py", line 1323, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/usr/local/lib/python3.10/site-packages/PyPDF2/_cmap.py", line 27, in build_char_map
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "/usr/local/lib/python3.10/site-packages/PyPDF2/_cmap.py", line 193, in parse_to_unicode
    cm = prepare_cm(ft)
  File "/usr/local/lib/python3.10/site-packages/PyPDF2/_cmap.py", line 210, in prepare_cm
    .replace(b"beginbfchar", b"\nbeginbfchar\n")
TypeError: replace() argument 1 must be str, not bytes
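The last frame can be mimicked without PyPDF2, which may help narrow things down. The sketch below only shows that this exact message appears when `.replace()` is called with bytes arguments on a `str` object. That the data reaching `prepare_cm` for this PDF is a `str` instead of `bytes` is my guess from the message, not something I have verified. `replace_marker` is a hypothetical name used for illustration.

```python
def replace_marker(data):
    """Apply the same replace call as the last traceback frame."""
    return data.replace(b"beginbfchar", b"\nbeginbfchar\n")


# With bytes input the call works as intended.
ok = replace_marker(b"1 beginbfchar")

# With str input, the bytes arguments trigger the TypeError.
try:
    replace_marker("1 beginbfchar")
except TypeError as exc:
    message = str(exc)
    print(message)
```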