Closed
Labels
is-robustness-issue: From a users perspective, this is about robustness
workflow-text-extraction: From a users perspective, text extraction is the affected feature/workflow
Description
Text extraction from a specific PDF fails
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-6.8.0-51-generic-x86_64-with-glibc2.39
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('cryptography', '44.0.0'), PIL=none
Code + PDF
This is a minimal, complete example that shows the issue:
from pypdf import PdfReader

reader = PdfReader("test-anon.pdf")
for page in reader.pages:
    print(page.extract_text(extraction_mode="layout") + "\n")
Traceback
This is the complete traceback I see:
Traceback (most recent call last):
  File "/home/mark/dev/pdf-test/test.py", line 10, in <module>
    print(page.extract_text(extraction_mode="layout") + "\n")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mark/dev/pdf-test/.venv/lib/python3.12/site-packages/pypdf/_page.py", line 2361, in extract_text
    return self._layout_mode_text(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mark/dev/pdf-test/.venv/lib/python3.12/site-packages/pypdf/_page.py", line 2266, in _layout_mode_text
    return _layout_mode.fixed_width_page(ty_groups, char_width, space_vertically, font_height_weight)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mark/dev/pdf-test/.venv/lib/python3.12/site-packages/pypdf/_text_extraction/_layout_mode/_fixed_width_page.py", line 367, in fixed_width_page
    int(abs(y_coord - last_y_coord) / (line_data[0]["font_height"] * font_height_weight)) - 1
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
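Possible stopgap
The last frame points at the division inside fixed_width_page, but the exception line itself is cut off above, so the exact error type is not confirmed here. As a rough sketch of a caller-side workaround, and not a fix for pypdf itself, one could try layout mode first and fall back to the default plain mode when it raises. The helper names extract_text_with_fallback and dump_pdf are hypothetical, and catching Exception broadly is an assumption made only because the exception type is not visible in the traceback.

```python
import logging

logger = logging.getLogger(__name__)


def extract_text_with_fallback(page):
    """Try layout-mode extraction; fall back to plain mode if it raises.

    `page` is expected to be a pypdf page object (anything exposing
    extract_text(extraction_mode=...)). This helper is hypothetical and
    not part of pypdf.
    """
    try:
        return page.extract_text(extraction_mode="layout")
    except Exception as exc:
        # Assumption: the exception type is not shown in the traceback,
        # so this catches broadly instead of naming a specific error.
        logger.warning("layout extraction failed (%r); using plain mode", exc)
        # Calling extract_text() with no arguments uses the default plain mode.
        return page.extract_text()


def dump_pdf(path):
    """Print the text of every page in the PDF at `path`, with fallback."""
    from pypdf import PdfReader  # imported lazily; requires pypdf installed

    reader = PdfReader(path)
    for page in reader.pages:
        print(extract_text_with_fallback(page) + "\n")
```

Calling dump_pdf("test-anon.pdf") would mirror the reproduction script above. Pages that hit the error lose the layout formatting but still yield text, and the warning log shows which error was actually raised.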