Handling of missing newlines before endstream marker #2523
Closed
Labels
PdfReader: The PdfReader component is affected
is-robustness-issue: From a user's perspective, this is about robustness
Description
I just stumbled upon some odd PDF files (generated by Microsoft Word for Microsoft 365 in 2022) in which the newline before the endstream marker is missing.
Neither Ghostscript nor Poppler seems to like this, while pdf.js does accept it. For this reason, I am not sure whether this is something we want to or should fix on our side.
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-5.14.21-150400.24.100-default-x86_64-with-glibc2.31
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.1.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.2.0
Code + PDF
This is a minimal, complete example that shows the issue:
from pypdf import PdfReader

reader = PdfReader('file.pdf')
for page in reader.pages:
    print(page)
    for key in page.images.keys():
        print(key)
        print(page.images[key])

I have no public reproducer for this, but in theory it should be rather easy to reproduce with any crafted PDF file that contains a snippet like this:
Öìendstream
endobj
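To make the difference concrete, here is a minimal sketch of a lenient payload extraction that treats the end-of-line marker before endstream as optional. The helper name `extract_stream_payload` is hypothetical and not part of pypdf, and the bytes 0xD6 0xEC are only an assumed Latin-1 reading of the two characters shown in the snippet above. The sketch simply searches for the first `endstream` keyword, so it would break on binary data that happens to contain that keyword.

```python
import re


def extract_stream_payload(raw: bytes) -> bytes:
    """Return the payload of the first stream in ``raw``.

    Hypothetical helper for illustration only, not pypdf's implementation.
    The end-of-line marker before ``endstream`` is treated as optional.
    """
    start = re.search(rb"stream\r?\n", raw)
    if start is None:
        raise ValueError("no 'stream' keyword found")
    end = raw.find(b"endstream", start.end())
    if end == -1:
        raise ValueError("no 'endstream' keyword found")
    payload = raw[start.end():end]
    # Strip at most one trailing end-of-line marker, if there is one.
    if payload.endswith(b"\r\n"):
        return payload[:-2]
    if payload.endswith((b"\n", b"\r")):
        return payload[:-1]
    return payload


# Assumed example objects: one with and one without the newline.
with_newline = b"1 0 obj\n<< /Length 2 >>\nstream\n\xd6\xec\nendstream\nendobj\n"
without_newline = b"1 0 obj\n<< /Length 2 >>\nstream\n\xd6\xecendstream\nendobj\n"
```

Both example objects yield the same two-byte payload with this helper, which is roughly the tolerant behavior asked about here.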
Traceback
This is the complete traceback I see (line numbers might be slightly off because of local debugging changes):
Traceback (most recent call last):
  File "/home/stefan/tmp/run.py", line 24, in <module>
    for key in page.images.keys():
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/_page.py", line 2397, in keys
    return self.ids_function()
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/_page.py", line 443, in _get_ids_image
    content = self._get_contents_as_bytes() or b""
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/_page.py", line 854, in _get_contents_as_bytes
    return b"".join(x.get_object().get_data() for x in obj)
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/_page.py", line 854, in <genexpr>
    return b"".join(x.get_object().get_data() for x in obj)
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/generic/_base.py", line 284, in get_object
    return self.pdf.get_object(self)
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/_reader.py", line 1296, in get_object
    retval = read_object(self.stream, self)  # type: ignore
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/generic/_data_structures.py", line 1194, in read_object
    return DictionaryObject.read_from_stream(stream, pdf, forced_encoding)
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/generic/_data_structures.py", line 499, in read_from_stream
    data["__streamdata__"] = read_unsized_from_stream(stream, pdf)
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/generic/_data_structures.py", line 393, in read_unsized_from_stream
    raise PdfReadError(
pypdf.errors.PdfReadError: Unable to find 'endstream' marker for obj starting at 807.
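
To check whether a given file is affected, one could scan its raw bytes for endstream keywords that are not directly preceded by an end-of-line character. The following is a rough diagnostic sketch using only the standard library. The function name `endstreams_without_eol` is hypothetical, and the scan is naive: it also reports matches that occur by chance inside binary stream data.

```python
import re


def endstreams_without_eol(raw: bytes) -> list[int]:
    """Return byte offsets of ``endstream`` keywords not preceded by CR or LF.

    Hypothetical diagnostic helper, not part of pypdf.
    """
    offsets = []
    for match in re.finditer(rb"endstream", raw):
        pos = match.start()
        # Slicing keeps the comparison in bytes and is safe at pos == 0.
        if raw[max(pos - 1, 0):pos] not in (b"\n", b"\r"):
            offsets.append(pos)
    return offsets
```

For a file on disk, the bytes could be obtained with `pathlib.Path("file.pdf").read_bytes()` and passed to the function. A non-empty result would suggest that the file contains the pattern described in this report.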