Improve handling of LZW decoder table overflow #3032

Extracting the text of a given PDF file raises an `IndexError`, which indicates that the LZW decoding table overflows. We should check whether there is something we can do about this, or at least report a proper *pypdf*-specific exception.
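
To illustrate the second option, here is a minimal sketch of a bounded table that raises a dedicated error instead of a bare `IndexError`. It is not pypdf's code: `LzwTableOverflowError` and `BoundedLzwTable` are hypothetical names. The only facts it relies on are that LZW codes in PDF streams are at most 12 bits wide (4096 entries) and that codes 256 and 257 are reserved for clear-table and end-of-data.

```python
class LzwTableOverflowError(Exception):
    """Hypothetical error type for illustration; not part of pypdf."""


class BoundedLzwTable:
    """Toy decoding table that refuses to grow past the 12-bit code space."""

    MAX_ENTRIES = 4096  # LZW codes in PDF streams are at most 12 bits wide
    FIRST_FREE_CODE = 258  # 0-255 are literals, 256 clears the table, 257 ends the data

    def __init__(self) -> None:
        # Pre-size the table so that every valid 12-bit code has a slot.
        self.entries = [bytes([i]) for i in range(256)]
        self.entries += [b""] * (self.MAX_ENTRIES - 256)
        self.next_index = self.FIRST_FREE_CODE

    def add(self, value: bytes) -> None:
        # Check the bound explicitly so the caller gets a descriptive error.
        if self.next_index >= self.MAX_ENTRIES:
            raise LzwTableOverflowError(
                f"LZW table is full ({self.MAX_ENTRIES} entries) "
                "and no clear-table code was seen"
            )
        self.entries[self.next_index] = value
        self.next_index += 1
```

Whether a real decoder should raise here, stop adding entries, or reset the table is a separate design question; the sketch only shows the explicit bound check.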

## Environment

Which environment were you using when you encountered the problem?

```bash
$ python -m platform
Linux-6.11.0-108013-tuxedo-x86_64-with-glibc2.39

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('cryptography', '44.0.0'), PIL=11.0.0
```

## Code + PDF

This is a minimal, complete example that shows the issue:

```python
from pypdf import PdfReader


pdf_file = 'a71cf4dab6840030878d668ae37a9edb10522aec.pdf'
with PdfReader(pdf_file) as reader:
    for index, page in enumerate(reader.pages, start=1):
        page.extract_text()
        list(page.images.items())

```

An example file is available [here](https://piatnik.com/uploads/media/default/0001/01/a71cf4dab6840030878d668ae37a9edb10522aec.pdf). I do not own any rights to this file.

## Traceback

This is the complete traceback I see:

```
Traceback (most recent call last):
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/_page.py", line 2136, in _extract_text
    text = self.extract_xform_text(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/_page.py", line 2430, in extract_xform_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/_page.py", line 1882, in _extract_text
    content = ContentStream(content, pdf, "bytes")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/generic/_data_structures.py", line 1184, in __init__
    stream_data = stream.get_data()
                  ^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/generic/_data_structures.py", line 1111, in get_data
    decoded.set_data(decode_stream_data(self))
                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/filters.py", line 636, in decode_stream_data
    data = LZWDecode._decodeb(data, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/filters.py", line 402, in _decodeb
    return LZWDecode.Decoder(data).decode()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/filters.py", line 382, in decode
    return _LzwCodec().decode(self.data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/_codecs/_codecs.py", line 237, in decode
    self._add_entry_decode(self.decoding_table[old_code], string[0])
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/_codecs/_codecs.py", line 253, in _add_entry_decode
    self.decoding_table[self._table_index] = new_string
    ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
IndexError: list assignment index out of range
```
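
Until this is handled inside the library, a caller-side workaround could look like the following sketch. It assumes that the `IndexError` actually propagates to the caller and that skipping the affected page is acceptable. `extract_text_tolerant` is a hypothetical helper name; the only method it expects on each page is `extract_text()`.

```python
from typing import Iterable, List, Optional, Tuple


def extract_text_tolerant(
    pages: Iterable,
) -> Tuple[List[Optional[str]], List[Tuple[int, str]]]:
    """Extract text page by page, recording pages that fail with IndexError."""
    texts: List[Optional[str]] = []
    failures: List[Tuple[int, str]] = []
    for number, page in enumerate(pages, start=1):
        try:
            texts.append(page.extract_text())
        except IndexError as exc:
            # Keep the page slot so that list positions still match page numbers.
            texts.append(None)
            failures.append((number, f"{type(exc).__name__}: {exc}"))
    return texts, failures
```

With the reproduction above, this would be called as `extract_text_tolerant(reader.pages)`.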

Another PDF file, which I cannot link here, additionally triggers a `logger_warning` call that reports the following (the leading whitespace seems to be a typo in the warning message itself):

```
 impossible to decode XFormObject /Fm2
```

Further analysis shows that it is indeed the LZW issue described here. The warning alone does not reveal the actual cause, because the exception message is omitted in https://github.com/py-pdf/pypdf/blob/c6dcdc61386b9f3a30190b9f13aa6a1585b8f93d/pypdf/_page.py#L2150-L2154:


	except Exception:
	    logger_warning(
	        f" impossible to decode XFormObject {operands[0]}",
	        __name__,
	    )
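
One way to make such a warning more useful is to include the exception type and message. The following is a generic sketch using the standard `logging` module rather than pypdf's own `logger_warning` helper; `call_or_warn` and its parameters are hypothetical names chosen for illustration.

```python
import logging

logger = logging.getLogger(__name__)


def call_or_warn(decode, name):
    """Run ``decode()``; on failure, log a warning that names the cause."""
    try:
        return decode()
    except Exception as exc:
        # Report the exception type and message so the root cause is visible.
        logger.warning(
            "impossible to decode XFormObject %s: %s: %s",
            name,
            type(exc).__name__,
            exc,
        )
        return None
```

With the failure from the traceback above, the warning would then end in `IndexError: list assignment index out of range` instead of stopping at the object name.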
