Insufficient handling of inline images containing EI sequences #3107

@stefan6419846

pypdf currently cannot correctly handle inline images whose image data itself contains the sequence EI. This breaks text extraction as well.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.4.0-150600.23.33-default-x86_64-with-glibc2.38

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.2.0, crypt_provider=('cryptography', '41.0.7'), PIL=10.1.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

reader = PdfReader('file.pdf')
reader.pages[1].extract_text()

I currently do not have a file that is free of personal data.

Excerpt of the relevant section (... marks redacted content):

...
BI
/IM true
/W 41
/H 41
/BPC 1
/D[1
0]
/F/CCF
/DP<</K -1
/Columns 41>>
ID >...EI E...
EI Q
q
...
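
To illustrate the ambiguity in the excerpt above without the original file, here is a minimal sketch on synthetic bytes. Everything in it is an assumption made for illustration: the byte values are invented, and `naive_image_end`, `looks_like_content` and `lookahead_image_end` are hypothetical helpers, not pypdf functions and not necessarily how pypdf does or should locate the end of an inline image.

```python
import re

# Synthetic stand-in for the redacted stream (byte values are made up): the
# image data contains "EI " before the operator that really ends the image.
CONTENT = (
    b"q\nBI\n/IM true\n/W 41\n/H 41\n/BPC 1\n/F/CCF\n"
    b"ID >\x01\x02EI E\x0e\x1e\x8a\x0b\nEI Q\nq\n"
)

# "EI" followed by a PDF whitespace byte.
EI_CANDIDATE = re.compile(rb"EI(?=[\x00\t\n\x0c\r ])")


def naive_image_end(data: bytes, start: int) -> int:
    """Return the offset of the first EI candidate at or after start."""
    match = EI_CANDIDATE.search(data, start)
    if match is None:
        raise ValueError("no EI found")
    return match.start()


def looks_like_content(tail: bytes) -> bool:
    """Hypothetical check: bytes after a real EI should be plain ASCII."""
    return all(b in b"\t\n\r " or 0x20 < b < 0x7F for b in tail)


def lookahead_image_end(data: bytes, start: int, window: int = 8) -> int:
    """Return the first EI candidate whose following bytes look like operators."""
    for match in EI_CANDIDATE.finditer(data, start):
        if looks_like_content(data[match.end():match.end() + window]):
            return match.start()
    raise ValueError("no plausible EI found")


data_start = CONTENT.index(b"ID ") + 3  # image data begins after "ID "
naive = naive_image_end(CONTENT, data_start)
robust = lookahead_image_end(CONTENT, data_start)

print("naive end, followed by:", CONTENT[naive + 2:naive + 8])
print("lookahead end, followed by:", CONTENT[robust + 2:])
```

Stopping at the first candidate leaves binary image data in the stream for the operand parser, which matches the kind of failure shown in the traceback. The lookahead variant is only a heuristic: it would still pick the wrong position if the image data happened to contain EI followed by printable ASCII.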

Traceback

This is the complete traceback I see (... marks redacted content):

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/stefan/pdf/pypdf/pypdf/_page.py", line 2378, in extract_text
    return self._extract_text(
  File "/home/stefan/pdf/pypdf/pypdf/_page.py", line 2073, in _extract_text
    for operands, operator in content.operations:
  File "/home/stefan/pdf/pypdf/pypdf/generic/_data_structures.py", line 1423, in operations
    self._parse_content_stream(BytesIO(self._data))
  File "/home/stefan/pdf/pypdf/pypdf/generic/_data_structures.py", line 1325, in _parse_content_stream
    operands.append(read_object(stream, None, self.forced_encoding))
  File "/home/stefan/pdf/pypdf/pypdf/generic/_data_structures.py", line 1496, in read_object
    raise PdfReadError(
pypdf.errors.PdfReadError: Invalid Elementary Object starting with b'\x0b' @1495: b'I E\x0e\x1e\x8a\...\xe0\xc7\x0b$;...'

Labels: generic (the generic submodule is affected), is-bug (from a user's perspective, this is a bug: a violation of the expected behavior with a compliant PDF)
