Description
Extracting the text of a given PDF file raises an IndexError, which indicates that the LZW decoding table overflows. Please check whether there is something we can do about this, or at least report a proper pypdf-specific exception.
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-6.11.0-108013-tuxedo-x86_64-with-glibc2.39
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('cryptography', '44.0.0'), PIL=11.0.0
Code + PDF
This is a minimal, complete example that shows the issue:
from pypdf import PdfReader

pdf_file = 'a71cf4dab6840030878d668ae37a9edb10522aec.pdf'
with PdfReader(pdf_file) as reader:
    for index, page in enumerate(reader.pages, start=1):
        page.extract_text()
        list(page.images.items())

An example file is available here. I apparently do not own any rights on this file.
Traceback
This is the complete traceback I see:
Traceback (most recent call last):
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/_page.py", line 2136, in _extract_text
    text = self.extract_xform_text(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/_page.py", line 2430, in extract_xform_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/_page.py", line 1882, in _extract_text
    content = ContentStream(content, pdf, "bytes")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/generic/_data_structures.py", line 1184, in __init__
    stream_data = stream.get_data()
                  ^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/generic/_data_structures.py", line 1111, in get_data
    decoded.set_data(decode_stream_data(self))
                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/filters.py", line 636, in decode_stream_data
    data = LZWDecode._decodeb(data, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/filters.py", line 402, in _decodeb
    return LZWDecode.Decoder(data).decode()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/filters.py", line 382, in decode
    return _LzwCodec().decode(self.data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/_codecs/_codecs.py", line 237, in decode
    self._add_entry_decode(self.decoding_table[old_code], string[0])
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/_codecs/_codecs.py", line 253, in _add_entry_decode
    self.decoding_table[self._table_index] = new_string
    ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
IndexError: list assignment index out of range
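To illustrate what I mean by a proper exception, here is a minimal sketch of the failure pattern and a guarded variant. It does not use pypdf at all: `FixedTable` and `LzwTableOverflowError` are hypothetical names, and I am only assuming that the real decoding table is a preallocated list, which the "list assignment index out of range" message suggests. The limit of 4096 entries follows from LZW codes in PDF being at most 12 bits wide.

```python
class LzwTableOverflowError(Exception):
    """Hypothetical library-specific error; not an existing pypdf exception."""


MAX_ENTRIES = 1 << 12  # 12-bit codes allow at most 4096 table entries


class FixedTable:
    """Toy stand-in for a preallocated LZW decoding table."""

    def __init__(self):
        # Codes 0..255 are single bytes; 256 (clear-table) and 257
        # (end-of-data) are reserved, so new entries start at 258.
        self.entries = [bytes([i]) for i in range(256)]
        self.entries += [b""] * (MAX_ENTRIES - 256)
        self.next_index = 258

    def add_unchecked(self, value):
        # Same pattern as the failing line in the traceback: plain list
        # assignment at an ever-growing index.
        self.entries[self.next_index] = value
        self.next_index += 1

    def add_checked(self, value):
        # Guarded variant: report the overflow with a dedicated exception.
        if self.next_index >= MAX_ENTRIES:
            raise LzwTableOverflowError(
                f"LZW table is full ({MAX_ENTRIES} entries); "
                "the stream is probably corrupt or missing a clear-table code"
            )
        self.add_unchecked(value)
```

Once the table is full, `add_unchecked` fails with the bare IndexError seen above, while `add_checked` raises an error that tells the caller what actually went wrong.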
With another PDF file, which I cannot link here, pypdf additionally uses logger_warning to report the following (the leading whitespace seems to be a typo in the warning message itself):
 impossible to decode XFormObject /Fm2
It is unclear what the actual issue is, because the exception message is omitted in the following code (lines 2150 to 2154 in c6dcdc6):
except Exception:
    logger_warning(
        f" impossible to decode XFormObject {operands[0]}",
        __name__,
    )
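One possible way to keep the cause visible is to bind the exception and append its type and message to the warning. The following is only a sketch built on the standard `logging` module: `logger_warning` here is a local stand-in with the same call shape as in the snippet above, and `decode_xform` and `extract` are hypothetical helpers that exist purely for the demonstration.

```python
import logging


def logger_warning(msg, src):
    """Local stand-in mirroring the (message, source) call shape above."""
    logging.getLogger(src).warning(msg)


def decode_xform(name):
    """Hypothetical decoding step that always fails, for demonstration."""
    raise IndexError("list assignment index out of range")


def extract(name):
    try:
        return decode_xform(name)
    except Exception as exc:
        # Include the exception type and message so the warning explains
        # why decoding failed; the stray leading space is dropped as well.
        logger_warning(
            f"impossible to decode XFormObject {name}: "
            f"{type(exc).__name__}: {exc}",
            __name__,
        )
        return ""
```

With this, a failure on /Fm2 would be logged together with "IndexError: list assignment index out of range" instead of the bare warning.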