Closed
Labels
PdfReader: The PdfReader component is affected
is-robustness-issue: From a user's perspective, this is about robustness
Description
I am currently trying to handle some partially broken PDF files whose root objects do not carry a /Type entry, so they fail the following checks:
Lines 210 to 211 in b7f3811
    cast(DictionaryObject, cast(PdfObject, root).get_object()).get("/Type")
    == "/Catalog"
Line 226 in b7f3811
    if isinstance(o, DictionaryObject) and o.get("/Type") == "/Catalog":
Line 230 in b7f3811
    if self._validated_root is None:
Adding self._validated_root = root.get_object() as a fallback seems to work in this case, but it probably has other side effects.
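To make the proposed fallback concrete, here is a minimal sketch of the lookup order using plain dictionaries instead of pypdf objects. The function name pick_root, the PdfDict alias, and the extra "/Pages" sanity check are assumptions made for illustration; they are not pypdf's actual implementation, and the check is stricter than the unconditional fallback mentioned above.

```python
from typing import Any, Dict, List, Optional

# Hypothetical stand-in for a parsed PDF dictionary object.
PdfDict = Dict[str, Any]


def pick_root(trailer_root: Optional[PdfDict], all_objects: List[PdfDict]) -> PdfDict:
    """Choose a document root, tolerating a trailer root without /Type."""
    # Step 1: accept the trailer's root if it is properly typed.
    if isinstance(trailer_root, dict) and trailer_root.get("/Type") == "/Catalog":
        return trailer_root

    # Step 2: otherwise scan every object for a typed catalog.
    for obj in all_objects:
        if isinstance(obj, dict) and obj.get("/Type") == "/Catalog":
            return obj

    # Step 3 (the proposed fallback): reuse the untyped trailer root.
    # Requiring "/Pages" is an assumption added here to limit side effects.
    if isinstance(trailer_root, dict) and "/Pages" in trailer_root:
        return trailer_root

    raise ValueError("Cannot find Root object in pdf")
```

With this order, a correctly typed catalog found anywhere in the file still wins over the untyped trailer root, so well-formed files would behave as before.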
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-6.4.0-150600.23.38-default-x86_64-with-glibc2.38
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.3.0, crypt_provider=('cryptography', '44.0.0'), PIL=11.0.0
Code + PDF
This is a minimal, complete example that shows the issue:
from pypdf import PdfReader
reader = PdfReader('file.pdf')
page = reader.pages[0]

This should be easy enough to reproduce by tampering with a valid PDF file; I cannot share the original file because it contains confidential information. The relevant root object:
2 0 obj
<<
/Pages 3 0 R
/Metadata 4 0 R
>>
endobj
{'/Pages': IndirectObject(3, 0, 140442733989840), '/Metadata': IndirectObject(4, 0, 140442733989840)}
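Since the original file cannot be shared, a synthetic file with the same defect can be generated from scratch. The sketch below writes a tiny one-page PDF whose root dictionary has no /Type entry. The function name and the object layout are made up for illustration, and whether pypdf 5.3.0 fails on this exact file in the same way as on the original has not been verified here.

```python
def build_pdf_without_catalog_type() -> bytes:
    """Return a tiny one-page PDF whose root dictionary lacks /Type."""
    objects = [
        b"<< /Pages 2 0 R >>",  # object 1: root, deliberately without /Type
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 200 200] >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj" for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry for the free object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


if __name__ == "__main__":
    with open("file.pdf", "wb") as handle:
        handle.write(build_pdf_without_catalog_type())
```

Running the script writes file.pdf to the current directory, which can then be opened with the example code above.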
Traceback
This is the complete traceback I see (line numbers might be off):
WARNING:pypdf._reader:Invalid Root object in trailer
WARNING:pypdf._reader:Searching object with "/Catalog" key
WARNING:pypdf._reader:Object 44 0 found
Traceback (most recent call last):
  File "/home/stefan/tmp/pypdf/run.py", line 12, in <module>
    page = reader.pages[0]
           ~~~~~~~~~~~~^^^
  File "/home/stefan/tmp/pypdf/pypdf/_page.py", line 2524, in __getitem__
    len_self = len(self)
               ^^^^^^^^^
  File "/home/stefan/tmp/pypdf/pypdf/_page.py", line 2505, in __len__
    return self.length_function()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/tmp/pypdf/pypdf/_doc_common.py", line 357, in get_num_pages
    self._flatten(self._readonly)
  File "/home/stefan/tmp/pypdf/pypdf/_doc_common.py", line 1161, in _flatten
    catalog = self.root_object
              ^^^^^^^^^^^^^^^^
  File "/home/stefan/tmp/pypdf/pypdf/_reader.py", line 234, in root_object
    raise PdfReadError("Cannot find Root object in pdf")
pypdf.errors.PdfReadError: Cannot find Root object in pdf