Skip to content

'IndexError: list index out of range' when extracting text #1091

@MartinThoma

Description

@MartinThoma

I've got an IndexError when extracting text. The file opens fine in Chrome.

Environment

$ python -m platform
Linux-5.4.0-121-generic-x86_64-with-glibc2.31

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.4.2

Code + PDF

The file: pdf/5cf3eb1c20fb4bea8654f2a9b64b5a62.pdf

>>> from PyPDF2 import PdfReader
>>> reader = PdfReader('pdf/5cf3eb1c20fb4bea8654f2a9b64b5a62.pdf')
/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py:1229: PdfReadWarning: incorrect startxref pointer(1)
  warnings.warn(
>>> for page in reader.pages: print(page.extract_text())
[...]
Invalid FloatObject b'71.5131592.8861'
Invalid FloatObject b'58.1.5131592.63'
Invalid FloatObject b'71.5131592.8861'
Invalid FloatObject b'58.1.5131592.63'
Invalid FloatObject b'71.5131592.8861'
Invalid FloatObject b'58.1.5131592.63'
Invalid FloatObject b'71.5131592.8861'
Invalid FloatObject b'58.1.5131592.63'
[...]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1507, in extract_text
    return self._extract_text(
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1441, in _extract_text
    process_operation(operator, operands)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1301, in process_operation
    float(operands[5]),
IndexError: list index out of range

It's print(reader.pages[10].extract_text()) to be exact.

Metadata

Metadata

Assignees

No one assigned

    Labels

    MCVE in TestsThe MCVE was added to PyPDF2 test suiteis-robustness-issueFrom a users perspective, this is about robustnessworkflow-text-extractionFrom a users perspective, text extraction is the affected feature/workflow

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions