Handling of missing newlines before endstream marker #2523

@stefan6419846

Description

I just stumbled upon some odd PDF files (generated by Microsoft Word for Microsoft 365 in 2022) in which the newline before the endstream marker is missing.

It seems that neither Ghostscript nor Poppler accepts such files, while pdf.js does. For this reason, I am not sure whether this is something we want to or should fix on our side.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.14.21-150400.24.100-default-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.1.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.2.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader


reader = PdfReader('file.pdf')
for page in reader.pages:
    print(page)
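    # Listing the image keys reads the page's content streams; per the traceback, this is where the error is raised.
    for key in page.images.keys():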
        print(key)
        print(page.images[key])

I have no public reproducer for this, but it should be fairly easy to reproduce with any crafted PDF file in which the stream data is directly followed by the endstream marker, like this:

Öìendstream
endobj
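
To check whether a given file shows this pattern, something like the following heuristic might help. It is only a sketch working on the raw bytes, independent of pypdf: the function name `find_endstream_without_eol` is made up for illustration, and the sample bytes assume that the `Öì` above corresponds to the Latin-1 bytes `0xD6 0xEC`. It can report false positives if the byte sequence `endstream` happens to occur inside binary stream data.

```python
import re

# An "endstream" keyword whose preceding byte is neither CR nor LF.
_MISSING_EOL = re.compile(rb"(?<![\r\n])endstream")


def find_endstream_without_eol(data: bytes) -> list[int]:
    """Return the byte offsets of 'endstream' keywords not preceded by CR or LF."""
    return [match.start() for match in _MISSING_EOL.finditer(data)]


# Assumed sample: stream data directly followed by the marker, as in the snippet above.
sample = b"stream\n\xd6\xecendstream\nendobj\n"
offsets = find_endstream_without_eol(sample)
print(offsets)
```

For a real file, the bytes could be read with `pathlib.Path('file.pdf').read_bytes()` and passed to the function; an empty list means no marker without a preceding newline was found.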

Traceback

This is the complete traceback I see (line numbers might be slightly off because of debugging changes):

Traceback (most recent call last):
  File "/home/stefan/tmp/run.py", line 24, in <module>
    for key in page.images.keys():
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/_page.py", line 2397, in keys
    return self.ids_function()
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/_page.py", line 443, in _get_ids_image
    content = self._get_contents_as_bytes() or b""
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/_page.py", line 854, in _get_contents_as_bytes
    return b"".join(x.get_object().get_data() for x in obj)
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/_page.py", line 854, in <genexpr>
    return b"".join(x.get_object().get_data() for x in obj)
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/generic/_base.py", line 284, in get_object
    return self.pdf.get_object(self)
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/_reader.py", line 1296, in get_object
    retval = read_object(self.stream, self)  # type: ignore
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/generic/_data_structures.py", line 1194, in read_object
    return DictionaryObject.read_from_stream(stream, pdf, forced_encoding)
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/generic/_data_structures.py", line 499, in read_from_stream
    data["__streamdata__"] = read_unsized_from_stream(stream, pdf)
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/generic/_data_structures.py", line 393, in read_unsized_from_stream
    raise PdfReadError(
pypdf.errors.PdfReadError: Unable to find 'endstream' marker for obj starting at 807.
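
If we decide to tolerate such files, one possible approach would be to treat the end-of-line marker before endstream as optional. The following is a rough sketch of that idea on a plain bytes buffer, not the pypdf implementation: `read_stream_payload` and its parameters are hypothetical names, and it assumes that `start` already points at the first byte of the stream data. Note that it cannot distinguish a missing EOL from stream data that legitimately ends with a newline byte, so in that case it would drop one byte of data.

```python
def read_stream_payload(buf: bytes, start: int) -> bytes:
    """Return the stream data between `start` and the next 'endstream' keyword.

    At most one trailing end-of-line marker (CRLF, LF or CR) is stripped,
    so the marker is accepted both with and without a preceding newline.
    """
    end = buf.find(b"endstream", start)
    if end < 0:
        raise ValueError("no 'endstream' marker found")
    payload = buf[start:end]
    # Check CRLF first so that both bytes are removed together.
    if payload.endswith(b"\r\n"):
        return payload[:-2]
    if payload.endswith((b"\n", b"\r")):
        return payload[:-1]
    return payload


# Assumed sample without a newline before the marker; data starts at offset 7.
data = read_stream_payload(b"stream\n\xd6\xecendstream\nendobj\n", 7)
print(data)
```

A real implementation would additionally have to consider the /Length entry of the stream dictionary, which this sketch ignores.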

Labels

PdfReader: The PdfReader component is affected
is-robustness-issue: From a users perspective, this is about robustness