index out of bounds in pypdf._text_extraction.handle_tj #2320

@rgwood-rely

Description

While extracting text from a PDF, the second of these two lines in handle_tj fails:

if orientation in orientations:
    if isinstance(operands[0], str):

Here len(operands) == 0, so operands[0] raises an IndexError.

The first check should be changed to:

if orientation in orientations and len(operands) > 0:
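
To illustrate the proposed guard in isolation, here is a minimal sketch. The function `first_operand_is_str` and its default `orientations` tuple are hypothetical stand-ins for this example only; they are not pypdf's actual `handle_tj` signature.

```python
from typing import Any, Sequence, Tuple


def first_operand_is_str(
    operands: Sequence[Any],
    orientation: int,
    orientations: Tuple[int, ...] = (0, 90, 180, 270),  # assumed default, for illustration
) -> bool:
    """Hypothetical helper mirroring the two checks quoted in this report."""
    # The added len(operands) > 0 test short-circuits before operands[0]
    # is evaluated, so an empty operand list can no longer raise IndexError.
    if orientation in orientations and len(operands) > 0:
        return isinstance(operands[0], str)
    return False


print(first_operand_is_str([], 0))       # empty operands: no exception
print(first_operand_is_str(["abc"], 0))  # normal case with a str operand
```

With the guard in place, an empty operand list falls through to the non-text branch instead of raising.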

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
macOS-14.1.1-x86_64-i386-64bit

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.1, crypt_provider=('cryptography', '3.3.2'), PIL=10.1.0

Code + PDF

This is a minimal, complete example that shows the issue:

# sorry; PDF is confidential


Traceback

This is the complete traceback I see:

<our software>
    page_text = page_obj.extract_text()
  File "/Users/rgwood/mambaforge/envs/e_16730_w3/lib/python3.9/site-packages/pypdf/_page.py", line 2279, in extract_text
    return self._extract_text(
  File "/Users/rgwood/mambaforge/envs/e_16730_w3/lib/python3.9/site-packages/pypdf/_page.py", line 2115, in _extract_text
    process_operation(b"Tj", operands)
  File "/Users/rgwood/mambaforge/envs/e_16730_w3/lib/python3.9/site-packages/pypdf/_page.py", line 2075, in process_operation
    text, rtl_dir = handle_tj(
  File "/Users/rgwood/mambaforge/envs/e_16730_w3/lib/python3.9/site-packages/pypdf/_text_extraction/__init__.py", line 220, in handle_tj
    if isinstance(operands[0], str):
IndexError: list index out of range
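
Until the check is fixed, a caller can work around the traceback above by catching the IndexError per page. This is only a sketch: `safe_extract_text` and the `SupportsExtractText` protocol are hypothetical names introduced here, and the helper assumes nothing about the page beyond an `extract_text()` method.

```python
from typing import Optional, Protocol


class SupportsExtractText(Protocol):
    """Anything exposing extract_text(), such as a pypdf page object."""

    def extract_text(self) -> str: ...


def safe_extract_text(page: SupportsExtractText) -> Optional[str]:
    """Return the page text, or None if extraction raises IndexError."""
    try:
        return page.extract_text()
    except IndexError:
        # Raised when a text operator has no operands, as in this report.
        return None
```

With pypdf this could be applied as `[safe_extract_text(p) for p in PdfReader("file.pdf").pages]`, where `file.pdf` is a placeholder path. Note that this skips the whole page's text rather than just the malformed operation.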


Labels: is-robustness-issue (From a user's perspective, this is about robustness)
