Handling of root objects without a Type #3164

@stefan6419846

Description

I am trying to handle some partially broken PDF files whose root objects do not carry a /Type entry. For these files, the check in

pypdf/pypdf/_reader.py

Lines 210 to 211 in b7f3811

cast(DictionaryObject, cast(PdfObject, root).get_object()).get("/Type")
== "/Catalog"

fails, as does the subsequent search for a catalog object

if isinstance(o, DictionaryObject) and o.get("/Type") == "/Catalog":

so execution ends up in

if self._validated_root is None:

Using self._validated_root = root.get_object() as a fallback seems to work in this case, but it probably has other side effects.
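
To make the proposed fallback concrete, here is a minimal sketch of the three-step lookup described above. It uses plain dicts as stand-ins for pypdf's DictionaryObject, and resolve_root is a hypothetical helper, not pypdf's actual implementation. The "/Pages" guard in the last step is an assumption of this sketch, meant to limit the side effects of accepting an untyped root.

```python
from typing import Any, Dict, List, Optional

PdfDict = Dict[str, Any]  # stand-in for pypdf's DictionaryObject


def resolve_root(trailer_root: Optional[PdfDict], objects: List[Any]) -> PdfDict:
    """Hypothetical root lookup with a fallback for roots lacking /Type."""
    # Step 1: accept the trailer's root if it is a properly typed catalog.
    if isinstance(trailer_root, dict) and trailer_root.get("/Type") == "/Catalog":
        return trailer_root

    # Step 2: otherwise scan all objects for a dictionary typed as a catalog.
    for obj in objects:
        if isinstance(obj, dict) and obj.get("/Type") == "/Catalog":
            return obj

    # Step 3 (assumed guard): fall back to the untyped trailer root, but only
    # if it at least references a page tree.
    if isinstance(trailer_root, dict) and "/Pages" in trailer_root:
        return trailer_root

    raise ValueError("Cannot find Root object in pdf")


# The root object from this report: no /Type, but /Pages is present.
broken_root = {"/Pages": "3 0 R", "/Metadata": "4 0 R"}
root = resolve_root(broken_root, [broken_root, {"/Type": "/Pages"}])
```

With this ordering, a correctly typed catalog found elsewhere in the file still takes precedence over the untyped root. Whether that is the right priority for real-world files is an open question.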

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.4.0-150600.23.38-default-x86_64-with-glibc2.38

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.3.0, crypt_provider=('cryptography', '44.0.0'), PIL=11.0.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

reader = PdfReader('file.pdf')
page = reader.pages[0]

The original file contains confidential information, so I cannot share it, but the issue should be easy to reproduce by tampering with a valid PDF file. The relevant root object:

2 0 obj
<<
/Pages 3 0 R
/Metadata 4 0 R
>>
endobj

As parsed by pypdf:

{'/Pages': IndirectObject(3, 0, 140442733989840), '/Metadata': IndirectObject(4, 0, 140442733989840)}
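
One way to produce such a test file is to blank out the /Type /Catalog entry at the byte level. The sketch below is an illustration under assumptions: strip_catalog_type is a hypothetical helper, and it only works when the catalog is stored as a plain, uncompressed object (not inside an object stream). It overwrites the entry with spaces of the same length so that the byte offsets in the cross-reference table stay valid.

```python
import re

# Matches the catalog type entry, allowing optional whitespace between tokens.
_CATALOG_TYPE = re.compile(rb"/Type\s*/Catalog\b")


def strip_catalog_type(pdf_bytes: bytes) -> bytes:
    """Replace every '/Type /Catalog' entry with spaces of equal length."""
    return _CATALOG_TYPE.sub(lambda m: b" " * len(m.group(0)), pdf_bytes)


# Small in-memory fragment standing in for the root object of a valid PDF.
sample = b"2 0 obj\n<<\n/Type /Catalog\n/Pages 3 0 R\n/Metadata 4 0 R\n>>\nendobj\n"
tampered = strip_catalog_type(sample)
```

Writing the tampered bytes of a complete PDF to disk and opening the result with PdfReader, as in the example above, should then exercise the code path from this report.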

Traceback

This is the complete traceback I see (line numbers might be off):

WARNING:pypdf._reader:Invalid Root object in trailer
WARNING:pypdf._reader:Searching object with "/Catalog" key
WARNING:pypdf._reader:Object 44 0 found
Traceback (most recent call last):
  File "/home/stefan/tmp/pypdf/run.py", line 12, in <module>
    page = reader.pages[0]
           ~~~~~~~~~~~~^^^
  File "/home/stefan/tmp/pypdf/pypdf/_page.py", line 2524, in __getitem__
    len_self = len(self)
               ^^^^^^^^^
  File "/home/stefan/tmp/pypdf/pypdf/_page.py", line 2505, in __len__
    return self.length_function()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/tmp/pypdf/pypdf/_doc_common.py", line 357, in get_num_pages
    self._flatten(self._readonly)
  File "/home/stefan/tmp/pypdf/pypdf/_doc_common.py", line 1161, in _flatten
    catalog = self.root_object
              ^^^^^^^^^^^^^^^^
  File "/home/stefan/tmp/pypdf/pypdf/_reader.py", line 234, in root_object
    raise PdfReadError("Cannot find Root object in pdf")
pypdf.errors.PdfReadError: Cannot find Root object in pdf

Labels: PdfReader (the PdfReader component is affected), is-robustness-issue (from a user's perspective, this is about robustness)
