Raise KeyError: /Parent when attempting to extract the text from an empty page using layout mode #2533
Closed
Labels
key-error: Could be a bug, but also a robustness issue
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
Description
pypdf raises KeyError: '/Parent' when extracting the text of an empty page with extraction_mode="layout". The default mode works fine on the same page.
Environment
Ubuntu 22.04 + Python 3.10 + pypdf 4.1.0
$ python test.py
Code + PDF
from pypdf import PdfReader
reader = PdfReader("test.pdf")
pages = reader.pages
for page in pages:
    text = page.extract_text()# works fine!
    text = page.extract_text(extraction_mode="layout")# raise KeyError
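Until this is fixed, one possible workaround is to catch the KeyError and fall back to the default mode, which the report says works on the same page. This is only a sketch: `extract_text_with_fallback` and `FakeEmptyPage` are hypothetical names, not part of pypdf, and the stand-in class exists only so the snippet runs without the PDF.

```python
class FakeEmptyPage:
    """Hypothetical stand-in that mimics the reported failure; not a pypdf class."""

    def extract_text(self, extraction_mode="plain"):
        if extraction_mode == "layout":
            # Same exception as in the report.
            raise KeyError("/Parent")
        return ""


def extract_text_with_fallback(page, extraction_mode="layout"):
    """Try the requested mode; on KeyError, retry with the default mode."""
    try:
        return page.extract_text(extraction_mode=extraction_mode)
    except KeyError:
        # Assumption: the default mode still works for this page, as reported.
        return page.extract_text()


print(repr(extract_text_with_fallback(FakeEmptyPage())))
```

With a real file, each `page` from `PdfReader("test.pdf").pages` would be passed in place of the stand-in.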
Traceback
This is the complete traceback I see:
Traceback (most recent call last):
  File "/mnt/d/codes/tmp/test.py", line 6, in <module>
    text = page.extract_text(extraction_mode="layout")# raise KeyError
  File "/home/hello/.local/lib/python3.10/site-packages/pypdf/_page.py", line 2050, in extract_text
    return self._layout_mode_text(
  File "/home/hello/.local/lib/python3.10/site-packages/pypdf/_page.py", line 1948, in _layout_mode_text
    fonts = self._layout_mode_fonts()
  File "/home/hello/.local/lib/python3.10/site-packages/pypdf/_page.py", line 1895, in _layout_mode_fonts
    objr = objr["/Parent"].get_object()
  File "/home/hello/.local/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 319, in __getitem__
    return dict.__getitem__(self, key).get_object()
KeyError: '/Parent'
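The failing frame is a `/Parent` lookup inside `_layout_mode_fonts`. Going by the traceback alone (an assumption, not a reading of the pypdf source), the code seems to climb the page tree and eventually reach a node that has no `/Parent`. Below is a rough sketch of a lookup that stops at the root instead of raising. It uses plain dicts as hypothetical stand-ins for PDF objects, and `find_inherited` is not a pypdf function.

```python
def find_inherited(node, key):
    """Walk up /Parent links until `key` is found; return None past the root."""
    while node is not None:
        if key in node:
            return node[key]
        # .get() returns None when there is no /Parent, which ends the loop
        # instead of raising KeyError.
        node = node.get("/Parent")
    return None


# Hypothetical page tree: the root has no /Parent and no /Resources.
root = {"/Type": "/Pages"}
branch = {"/Parent": root, "/Resources": {"/Font": {"/F1": "Helvetica"}}}
page_with_fonts = {"/Parent": branch}
empty_page = {"/Parent": root}

print(find_inherited(page_with_fonts, "/Resources"))
print(find_inherited(empty_page, "/Resources"))
```

For the empty page the lookup returns None, so a caller could treat the page as having no fonts rather than failing.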