Closed
Labels
generic: The generic submodule is affected
is-bug: From a user's perspective, this is a bug, i.e. a violation of the expected behavior with a compliant PDF
Description
pypdf is currently unable to correctly handle inline images whose image data itself contains the sequence EI, which is also the operator that terminates an inline image. This breaks text extraction for the affected page as well.
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-6.4.0-150600.23.33-default-x86_64-with-glibc2.38
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.2.0, crypt_provider=('cryptography', '41.0.7'), PIL=10.1.0
Code + PDF
This is a minimal, complete example that shows the issue:
from pypdf import PdfReader
reader = PdfReader('file.pdf')
reader.pages[1].extract_text()
I currently do not have a file that is free of personal data.
Excerpt of the relevant section (... marks redacted content):
...
BI
/IM true
/W 41
/H 41
/BPC 1
/D[1
0]
/F/CCF
/DP<</K -1
/Columns 41>>
ID >...EI E...
EI Q
q
...
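To illustrate why the excerpt above is problematic, here is a minimal sketch with an invented byte stream modelled on it. It assumes the failure comes from stopping at the first whitespace-delimited EI; I have not verified that this is exactly what pypdf does internally. STREAM, naive_end and lookahead_end are hypothetical names, and the lookahead rule (a short printable token must follow the real EI) is only a heuristic, not pypdf's method.

```python
import re

# Hypothetical content stream modelled on the excerpt; the image bytes are invented.
STREAM = (
    b"q\nBI\n/IM true\n/W 41\n/H 41\n/BPC 1\n/F/CCF\n"
    b"ID >\x01 EI E\x0e\x1e\x8a\nEI Q\nq\n"
)

# Image data begins right after the "ID" operator and its single whitespace byte.
DATA_START = STREAM.index(b"ID ") + 3


def naive_end(stream: bytes, start: int) -> int:
    """Return the offset where image data ends, using the first whitespace-delimited EI."""
    match = re.compile(rb"\sEI\s").search(stream, start)
    if match is None:
        raise ValueError("no EI marker found")
    return match.start()


def lookahead_end(stream: bytes, start: int) -> int:
    """Return the offset where image data ends, requiring a plausible token after EI.

    Assumption: the real EI is followed by whitespace and then a short
    printable-ASCII token (such as Q) that is itself followed by whitespace
    or the end of the stream.
    """
    pattern = re.compile(rb"\sEI\s+(?=[\x21-\x7e]{1,3}(?:\s|$))")
    match = pattern.search(stream, start)
    if match is None:
        raise ValueError("no EI marker found")
    return match.start()


naive_data = STREAM[DATA_START:naive_end(STREAM, DATA_START)]
better_data = STREAM[DATA_START:lookahead_end(STREAM, DATA_START)]
print(naive_data)   # truncated at the EI that is part of the image data
print(better_data)  # includes the embedded EI and stops at the real terminator
```

With the naive rule, the bytes after the embedded EI are left in the stream and would be parsed as regular operators and operands, which matches the kind of error shown in the traceback below. The lookahead rule can still be fooled by binary data that happens to look like an operator, so it is only meant to show the ambiguity, not to propose a complete fix.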
Traceback
This is the complete traceback I see (... marks redacted content):
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/stefan/pdf/pypdf/pypdf/_page.py", line 2378, in extract_text
return self._extract_text(
File "/home/stefan/pdf/pypdf/pypdf/_page.py", line 2073, in _extract_text
for operands, operator in content.operations:
File "/home/stefan/pdf/pypdf/pypdf/generic/_data_structures.py", line 1423, in operations
self._parse_content_stream(BytesIO(self._data))
File "/home/stefan/pdf/pypdf/pypdf/generic/_data_structures.py", line 1325, in _parse_content_stream
operands.append(read_object(stream, None, self.forced_encoding))
File "/home/stefan/pdf/pypdf/pypdf/generic/_data_structures.py", line 1496, in read_object
raise PdfReadError(
pypdf.errors.PdfReadError: Invalid Elementary Object starting with b'\x0b' @1495: b'I E\x0e\x1e\x8a\...\xe0\xc7\x0b$;...'