I'm working on a script that parses PDF invoices, and I'm getting an exception while reading a PDF. This happens only with one specific type of PDF, issued by a tap water utility provider. However, every PDF from them fails to parse with the same error.
c:\>python --version
Python 3.11.1
c:\>pip show pyPdf2
Name: PyPDF2
Version: 3.0.1
Summary: A pure-python PDF library capable of splitting, merging, cropping, and transforming PDF files
Home-page:
Author:
Author-email: Mathieu Fenniak <biziqe@mathieu.fenniak.net>
License:
Location: C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages
Requires:
Required-by:
I can share the PDF only by email because it contains personal data (it is an invoice). Let me know where to send it.
Traceback (most recent call last):
  File "C:\Users\lenemeth\Documents\AOB\doc\Erőmű 8\albérlet\szamla_parser\EstateManager.py", line 63, in <module>
    em.parse_invoices()
  File "C:\Users\lenemeth\Documents\AOB\doc\Erőmű 8\albérlet\szamla_parser\EstateManager.py", line 22, in parse_invoices
    self.ip.parse_invoices(self.config['input_data']['invoices']['directory_path'])
  File "C:\Users\lenemeth\Documents\AOB\doc\Erőmű 8\albérlet\szamla_parser\InvoiceParser.py", line 47, in parse_invoices
    self.extract_pdf(os.path.join(directory, file))
  File "C:\Users\lenemeth\Documents\AOB\doc\Erőmű 8\albérlet\szamla_parser\InvoiceParser.py", line 63, in extract_pdf
    text = page.extract_text()
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages\PyPDF2\_page.py", line 1851, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages\PyPDF2\_page.py", line 1342, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages\PyPDF2\_cmap.py", line 28, in build_char_map
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages\PyPDF2\_cmap.py", line 196, in parse_to_unicode
    process_rg, process_char, multiline_rg = process_cm_line(
                                             ^^^^^^^^^^^^^^^^
  File "C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages\PyPDF2\_cmap.py", line 264, in process_cm_line
    multiline_rg = parse_bfrange(l, map_dict, int_entry, multiline_rg)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages\PyPDF2\_cmap.py", line 278, in parse_bfrange
    nbi = max(len(lst[0]), len(lst[1]))
                           ~~~^^^
IndexError: list index out of range
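Until the root cause is found, the failure could be isolated per page so that one bad page does not abort the whole batch. The following is only a minimal sketch of that idea, not a fix for the bug itself. `extract_pages_safely` and `FakePage` are hypothetical names introduced here for illustration; neither is part of PyPDF2. The helper only assumes each page object has an `extract_text()` method, as in the traceback above.

```python
from typing import Iterable, List, Optional, Tuple


def extract_pages_safely(
    pages: Iterable,
) -> List[Tuple[int, Optional[str], Optional[str]]]:
    """Call extract_text() on each page and record failures instead of aborting.

    Returns one (page_index, text, error) tuple per page; exactly one of
    text and error is None.
    """
    results = []
    for index, page in enumerate(pages):
        try:
            results.append((index, page.extract_text(), None))
        except IndexError as exc:
            # Same exception type as in the traceback; keep the message
            # so the failing page can be reported.
            results.append((index, None, f"{type(exc).__name__}: {exc}"))
    return results


class FakePage:
    """Hypothetical stand-in for a PDF page, used only to demonstrate the helper."""

    def __init__(self, text: Optional[str] = None):
        self._text = text

    def extract_text(self) -> str:
        if self._text is None:
            # Simulates the failure reported in this issue.
            raise IndexError("list index out of range")
        return self._text


# One page that extracts fine and one that fails like the utility invoices do.
demo = extract_pages_safely([FakePage("Invoice 1"), FakePage(None)])
```

With PyPDF2 installed, the same helper could be called as `extract_pages_safely(PdfReader(path).pages)`, where `path` is a placeholder for the invoice file. Catching only `IndexError` is deliberate: any other exception still surfaces.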
Environment
Windows 10