extract_text fails with ZeroDivisionError: float division by zero #3074

@blushingpenguin

Text extraction in layout mode fails on a specific PDF with ZeroDivisionError: float division by zero.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.8.0-51-generic-x86_64-with-glibc2.39

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('cryptography', '44.0.0'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

reader = PdfReader("test-anon.pdf")
for page in reader.pages:
    print(page.extract_text(extraction_mode="layout") + "\n")
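
As a stopgap while layout mode fails on this file, one option is to catch the error per page and fall back to the default extraction mode. This is only a sketch: `extract_text_with_fallback` is a hypothetical helper name, and it assumes that `extraction_mode="plain"` does not hit the same code path for this PDF, which has not been verified here.

```python
def extract_text_with_fallback(page):
    """Try layout-mode extraction; fall back to plain mode on ZeroDivisionError.

    `page` is expected to behave like a pypdf PageObject, i.e. to expose
    extract_text(extraction_mode=...). The helper name is hypothetical.
    """
    try:
        return page.extract_text(extraction_mode="layout")
    except ZeroDivisionError:
        # Assumption: plain mode avoids the failing layout computation.
        return page.extract_text(extraction_mode="plain")
```

With the reader above, `print(extract_text_with_fallback(page) + "\n")` would replace the direct `extract_text` call inside the loop. Pages that trigger the error lose their layout formatting but still yield text.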

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "/home/mark/dev/pdf-test/test.py", line 10, in <module>
    print(page.extract_text(extraction_mode="layout") + "\n")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mark/dev/pdf-test/.venv/lib/python3.12/site-packages/pypdf/_page.py", line 2361, in extract_text
    return self._layout_mode_text(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mark/dev/pdf-test/.venv/lib/python3.12/site-packages/pypdf/_page.py", line 2266, in _layout_mode_text
    return _layout_mode.fixed_width_page(ty_groups, char_width, space_vertically, font_height_weight)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mark/dev/pdf-test/.venv/lib/python3.12/site-packages/pypdf/_text_extraction/_layout_mode/_fixed_width_page.py", line 367, in fixed_width_page
    int(abs(y_coord - last_y_coord) / (line_data[0]["font_height"] * font_height_weight)) - 1
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ZeroDivisionError: float division by zero
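
The failing expression divides by `line_data[0]["font_height"] * font_height_weight`, so the error implies that this product is 0 for at least one line on the page (presumably a font height of 0, since the example does not override the weight). Below is a minimal sketch of the same arithmetic with a guard. It is not pypdf's actual code or fix; the function name `blank_lines_between` and the choice to return 0 when no line height is available are assumptions made for illustration.

```python
def blank_lines_between(y_coord, last_y_coord, font_height, font_height_weight=1.0):
    """Estimate the number of blank lines between two text lines.

    Mirrors the expression from the traceback, but guards against a
    zero effective line height instead of dividing by it.
    """
    line_height = font_height * font_height_weight
    if line_height == 0:
        # Hypothetical guard: without a usable font height, add no blank lines.
        return 0
    return int(abs(y_coord - last_y_coord) / line_height) - 1
```

Returning 0 is only one possible policy; a fix could also substitute a fallback font height or skip the affected text run.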

Labels

is-robustness-issue: From a users perspective, this is about robustness
workflow-text-extraction: From a users perspective, text extraction is the affected feature/workflow
