Handling of missing newlines before `endstream` marker #2523

I just stumbled upon some odd PDF files (generated by Microsoft Word for Microsoft 365 in 2022) that are missing the newline before the `endstream` marker.

It seems that neither Ghostscript nor Poppler tolerates this, while pdf.js does. For this reason, I am not sure whether this is something we want to or should fix on our side.

## Environment

Which environment were you using when you encountered the problem?

```bash
$ python -m platform
Linux-5.14.21-150400.24.100-default-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.1.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.2.0
```

## Code + PDF

This is a minimal, complete example that shows the issue:

```python
from pypdf import PdfReader


reader = PdfReader('file.pdf')
for page in reader.pages:
    print(page)
    for key in page.images.keys():
        print(key)
        print(page.images[key])
```

I have no public reproducer for this, but it should be rather easy to reproduce with any crafted PDF file in which the stream data is immediately followed by `endstream`, as in this snippet:

```
Öìendstream
endobj
```
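To make the idea of a "crafted PDF file" concrete, here is a minimal sketch that assembles a one-page PDF by hand and omits the newline between the stream data and `endstream`. The function name, the trivial `BT ET` payload and the object layout are illustrative assumptions and are not taken from the original Word-generated files. Because this sketch writes a correct `/Length`, it may or may not trigger the same `PdfReadError`, depending on the pypdf version.

```python
def build_pdf_without_eol_before_endstream() -> bytes:
    """Assemble a tiny PDF whose content stream has no EOL before `endstream`.

    Hypothetical helper for experimenting; not part of pypdf.
    """
    content = b"BT ET"  # trivial (empty) text object as the stream payload
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 200 200] /Contents 4 0 R >>",
        # Note: no b"\n" between the payload and the `endstream` keyword.
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"endstream",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj" for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # each xref entry is exactly 20 bytes
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)
```

Writing the returned bytes to `file.pdf` and running the script above is one way to experiment. The original files may differ in other respects (for example a wrong or indirect `/Length`), so this is only a starting point for a reproducer.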

## Traceback

This is the complete traceback I see (line numbers might be slightly off due to local debugging changes):

```
Traceback (most recent call last):
  File "/home/stefan/tmp/run.py", line 24, in <module>
    for key in page.images.keys():
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/_page.py", line 2397, in keys
    return self.ids_function()
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/_page.py", line 443, in _get_ids_image
    content = self._get_contents_as_bytes() or b""
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/_page.py", line 854, in _get_contents_as_bytes
    return b"".join(x.get_object().get_data() for x in obj)
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/_page.py", line 854, in <genexpr>
    return b"".join(x.get_object().get_data() for x in obj)
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/generic/_base.py", line 284, in get_object
    return self.pdf.get_object(self)
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/_reader.py", line 1296, in get_object
    retval = read_object(self.stream, self)  # type: ignore
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/generic/_data_structures.py", line 1194, in read_object
    return DictionaryObject.read_from_stream(stream, pdf, forced_encoding)
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/generic/_data_structures.py", line 499, in read_from_stream
    data["__streamdata__"] = read_unsized_from_stream(stream, pdf)
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/generic/_data_structures.py", line 393, in read_unsized_from_stream
    raise PdfReadError(
pypdf.errors.PdfReadError: Unable to find 'endstream' marker for obj starting at 807.
```
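For discussion, lenient handling could treat the end-of-line before `endstream` as optional when scanning for the marker. The sketch below is a hypothetical helper, not pypdf's actual implementation. It assumes `raw` holds the bytes that follow the `stream` keyword and its end-of-line.

```python
def split_at_endstream(raw: bytes) -> bytes:
    """Return the stream payload that precedes the first `endstream` keyword.

    Hypothetical helper: accepts the marker with or without a preceding EOL.
    """
    marker = b"endstream"
    pos = raw.find(marker)
    if pos < 0:
        raise ValueError("no 'endstream' marker found")

    data = raw[:pos]
    # Strip at most one trailing EOL; check CRLF first so that both bytes go.
    for eol in (b"\r\n", b"\n", b"\r"):
        if data.endswith(eol):
            return data[: -len(eol)]
    # No EOL before the marker: the payload ends right where the marker starts.
    return data
```

A scan like this is only a fallback. Binary payloads can contain the bytes `endstream` themselves or legitimately end in a newline byte, so a valid `/Length` entry remains the more reliable source for the payload size.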

