PyPDF2 throws exception during extract_text() #1533

@lenemeth

Description

I'm working on a script that parses PDF invoices, and I get an exception while reading a PDF. This happens only with one specific type of PDF, issued by a tap water utility provider. However, every PDF from that provider fails to parse with the same error.

Environment

Windows 10

c:\>python --version
Python 3.11.1

c:\>pip show pyPdf2
Name: PyPDF2
Version: 3.0.1
Summary: A pure-python PDF library capable of splitting, merging, cropping, and transforming PDF files
Home-page:
Author:
Author-email: Mathieu Fenniak <biziqe@mathieu.fenniak.net>
License:
Location: C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages
Requires:
Required-by:

Code + PDF

from PyPDF2 import PdfReader

reader = PdfReader(filePath)  # filePath: path to one of the affected invoice PDFs

for page in reader.pages:
    text = page.extract_text()  # raises IndexError, see the traceback

I can share the PDF only by email, because it is an invoice and contains personal data. Let me know where to send it.
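
Until the PDF can be shared, a per-page wrapper may help narrow down which pages trigger the exception. This is a minimal sketch, assuming only that each page object exposes `extract_text()` as in the snippet above. The helper name `extract_text_per_page` is hypothetical, and catching a broad `Exception` is a debugging convenience, not a fix for the underlying problem.

```python
def extract_text_per_page(pages):
    """Return (texts, errors).

    texts[i] is the extracted text of page i, or None if extraction failed.
    errors maps the index of each failing page to a short error description.
    """
    texts = []
    errors = {}
    for index, page in enumerate(pages):
        try:
            texts.append(page.extract_text())
        except Exception as exc:  # broad on purpose: debugging aid only
            texts.append(None)
            errors[index] = f"{type(exc).__name__}: {exc}"
    return texts, errors
```

With the reader from the snippet above, this would be called as `texts, errors = extract_text_per_page(reader.pages)`, and `errors` would then list the pages that fail.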

Traceback

Traceback (most recent call last):
  File "C:\Users\lenemeth\Documents\AOB\doc\Erőmű 8\albérlet\szamla_parser\EstateManager.py", line 63, in <module>
    em.parse_invoices()
  File "C:\Users\lenemeth\Documents\AOB\doc\Erőmű 8\albérlet\szamla_parser\EstateManager.py", line 22, in parse_invoices
    self.ip.parse_invoices(self.config['input_data']['invoices']['directory_path'])
  File "C:\Users\lenemeth\Documents\AOB\doc\Erőmű 8\albérlet\szamla_parser\InvoiceParser.py", line 47, in parse_invoices
    self.extract_pdf(os.path.join(directory, file))
  File "C:\Users\lenemeth\Documents\AOB\doc\Erőmű 8\albérlet\szamla_parser\InvoiceParser.py", line 63, in extract_pdf
    text = page.extract_text()
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages\PyPDF2\_page.py", line 1851, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages\PyPDF2\_page.py", line 1342, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages\PyPDF2\_cmap.py", line 28, in build_char_map
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages\PyPDF2\_cmap.py", line 196, in parse_to_unicode
    process_rg, process_char, multiline_rg = process_cm_line(
                                             ^^^^^^^^^^^^^^^^
  File "C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages\PyPDF2\_cmap.py", line 264, in process_cm_line
    multiline_rg = parse_bfrange(l, map_dict, int_entry, multiline_rg)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages\PyPDF2\_cmap.py", line 278, in parse_bfrange
    nbi = max(len(lst[0]), len(lst[1]))
                               ~~~^^^
IndexError: list index out of range
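
The last frame suggests that the line being handled as a `bfrange` entry produced a list with only one element: `lst[0]` was evaluated without error, but `lst[1]` does not exist. The following is a simplified illustration of that failure mode, not PyPDF2's actual code. The function `widest_token`, the way it splits the line, and the sample byte strings are all assumptions made up for the example.

```python
def widest_token(line: bytes) -> int:
    # Hypothetical simplification: split on spaces and drop empty tokens.
    lst = [tok for tok in line.split(b" ") if tok]
    # Same expression as in the last traceback frame; it needs two tokens.
    return max(len(lst[0]), len(lst[1]))


# A line with at least two tokens is handled without error.
print(widest_token(b"0003 0004 0020"))

# A line with a single token raises the error seen in the traceback.
try:
    widest_token(b"0003")
except IndexError as exc:
    print("IndexError:", exc)
```

If this reading is right, the affected PDFs probably contain a ToUnicode CMap whose `bfrange` entries are laid out differently from what the parser expects, for example split across several lines.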

Labels

is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
