Improve handling of LZW decoder table overflow #3032

@stefan6419846

Description

Extracting the text of the PDF file below raises an IndexError, which indicates that the LZW decoding table overflows. We should check whether there is something we can do about this, or at least raise a proper pypdf-specific exception.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.11.0-108013-tuxedo-x86_64-with-glibc2.39

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('cryptography', '44.0.0'), PIL=11.0.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader


pdf_file = 'a71cf4dab6840030878d668ae37a9edb10522aec.pdf'
with PdfReader(pdf_file) as reader:
    for index, page in enumerate(reader.pages, start=1):
        page.extract_text()
        list(page.images.items())

An example file is available here. I do not own any rights to this file.

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/_page.py", line 2136, in _extract_text
    text = self.extract_xform_text(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/_page.py", line 2430, in extract_xform_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/_page.py", line 1882, in _extract_text
    content = ContentStream(content, pdf, "bytes")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/generic/_data_structures.py", line 1184, in __init__
    stream_data = stream.get_data()
                  ^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/generic/_data_structures.py", line 1111, in get_data
    decoded.set_data(decode_stream_data(self))
                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/filters.py", line 636, in decode_stream_data
    data = LZWDecode._decodeb(data, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/filters.py", line 402, in _decodeb
    return LZWDecode.Decoder(data).decode()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/filters.py", line 382, in decode
    return _LzwCodec().decode(self.data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/_codecs/_codecs.py", line 237, in decode
    self._add_entry_decode(self.decoding_table[old_code], string[0])
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/_codecs/_codecs.py", line 253, in _add_entry_decode
    self.decoding_table[self._table_index] = new_string
    ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
IndexError: list assignment index out of range
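
The last two frames show an assignment into a fixed-size list at an index past its end. As a rough illustration of what a dedicated exception could look like, here is a minimal sketch. It is not pypdf's actual implementation: `ToyDecodingTable`, `add_entry` and `LzwTableOverflowError` are hypothetical names made up for this example. The only assumption taken from the PDF format is that LZW codes are at most 12 bits wide, so the table holds at most 4096 entries.

```python
MAX_TABLE_SIZE = 4096  # LZW codes in PDF are at most 12 bits wide
FIRST_FREE_INDEX = 258  # 0-255: single bytes, 256: clear-table, 257: end-of-data


class LzwTableOverflowError(Exception):
    """Hypothetical error type for this sketch; not part of pypdf's API."""


class ToyDecodingTable:
    """Simplified stand-in for a preallocated LZW decoding table."""

    def __init__(self) -> None:
        # Preallocate all slots, as a fixed-size list would be.
        self.entries = [bytes([i]) for i in range(256)]
        self.entries += [b""] * (MAX_TABLE_SIZE - 256)
        self.next_index = FIRST_FREE_INDEX

    def add_entry(self, old_string: bytes, new_char: int) -> None:
        # Guard the assignment so a full table produces a descriptive
        # error instead of a bare IndexError.
        if self.next_index >= MAX_TABLE_SIZE:
            raise LzwTableOverflowError(
                f"LZW decoding table is full ({MAX_TABLE_SIZE} entries) "
                "and the stream did not send a clear-table code"
            )
        self.entries[self.next_index] = old_string + bytes([new_char])
        self.next_index += 1
```

Whether the decoder should raise here, or recover in some other way for malformed streams, is a separate design decision; the sketch only shows the "proper exception" variant.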

With another PDF file, which I cannot link here, pypdf additionally uses logger_warning to report the following (the leading whitespace seems to be a typo in the warning message itself):

 impossible to decode XFormObject /Fm2

It is unclear what the underlying issue is, because the actual exception message is omitted in

pypdf/pypdf/_page.py

Lines 2150 to 2154 in c6dcdc6

except Exception:
    logger_warning(
        f" impossible to decode XFormObject {operands[0]}",
        __name__,
    )

Further analysis shows that this is indeed the same LZW issue.
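
To make such warnings easier to diagnose, the handler could include the exception type and message. The following is only a sketch of the idea using the standard library `logging` module in place of pypdf's `logger_warning`; the helper name `decode_or_warn` and its parameters are hypothetical and exist only for this example.

```python
import logging

logger = logging.getLogger(__name__)


def decode_or_warn(decode, name: str):
    """Call decode(); on failure, log a warning that names the exception."""
    try:
        return decode()
    except Exception as exc:
        # Include the exception type and message so the root cause
        # (here: the LZW table overflow) is visible in the log.
        logger.warning(
            "Impossible to decode XFormObject %s: %s: %s",
            name,
            type(exc).__name__,
            exc,
        )
        return None
```

With the failure from this issue, the logged message would read `Impossible to decode XFormObject /Fm2: IndexError: list assignment index out of range`, which also drops the stray leading whitespace.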

Labels

is-robustness-issue: From a user's perspective, this is about robustness
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
