Linearized PDF with degraded trailer gives "KeyError: '/Root'"  #989

@MartinThoma

Description

When trying to extract the text from a linearized PDF with a degraded trailer, I get a KeyError: '/Root' exception.

Environment

$ python -m platform
Linux-5.4.0-113-generic-x86_64-with-glibc2.31

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.2.0

MCVE: PDF + Code

Using this PDF: https://corpora.tika.apache.org/base/docs/govdocs1/989/989691.pdf

from PyPDF2 import PdfReader
reader = PdfReader("pdf/989691.pdf")  # PdfReadWarning: incorrect startxref pointer(1)
reader.pages[0].extract_text()

I get this traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1462, in __getitem__
    len_self = len(self)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1453, in __len__
    return self.length_function()
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 362, in _get_num_pages
    self._flatten()
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 929, in _flatten
    catalog = self.trailer[TK.ROOT].get_object()
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/generic.py", line 623, in __getitem__
    return dict.__getitem__(self, key).get_object()
KeyError: '/Root'
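
The traceback shows the lookup of /Root in the trailer failing. One way to check whether the file itself still names a catalog anywhere is to scan the raw bytes for `/Root N G R` entries and count the `trailer` keywords. The following is only a rough diagnostic sketch using the standard library, not how PyPDF2 parses trailers; the helper names `find_root_refs` and `count_trailers` are made up for illustration. It will miss entries stored in compressed cross-reference streams, and a match inside a content stream could be a false positive.

```python
import re

# Matches an indirect reference such as "/Root 12 0 R" in raw PDF bytes.
ROOT_REF = re.compile(rb"/Root\s+(\d+)\s+(\d+)\s+R")
# Matches the "trailer" keyword that introduces a classic trailer dictionary.
TRAILER_KW = re.compile(rb"\btrailer\b")


def find_root_refs(data: bytes) -> list[tuple[int, int, int]]:
    """Return (byte_offset, object_number, generation) for each /Root entry.

    Hypothetical helper: a plain byte scan, not a real PDF parser.
    """
    return [
        (m.start(), int(m.group(1)), int(m.group(2)))
        for m in ROOT_REF.finditer(data)
    ]


def count_trailers(data: bytes) -> int:
    """Count occurrences of the 'trailer' keyword in the raw bytes."""
    return len(TRAILER_KW.findall(data))
```

For the file above, this could be run as `data = open("pdf/989691.pdf", "rb").read()` followed by `find_root_refs(data)` and `count_trailers(data)`. If the scan finds a /Root entry in one trailer but the reader ends up with a trailer that lacks it, that would suggest the reader picked up the degraded trailer rather than the intact one.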

Labels

Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
is-robustness-issue: From a user's perspective, this is about robustness
