Crash during page text extraction  #2975

@neeraj9

Description

Extracting text from the first two pages of the attached PDF raises an error. I have a sample workaround at neeraj9@75b4e42 that gets past the error.

Environment

OS: Windows 11 version 23H2
Python: Python 3.12

$ python -m platform
Windows-11-10.0.22631-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=11.0.0

Code + PDF

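This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

# Assumes the attached PDF is saved in the working directory.
pdf_reader = PdfReader("9E5E080E-C8DB-4A6B-822B-9A67DC04E526-120438.pdf")
first_page, last_page = 1, 2  # 1-based, inclusive: the first two pages
for page_num in range(first_page - 1, last_page):
    page = pdf_reader.pages[page_num]
    page.extract_text()  # the error occurs during this call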

Sample workaround:
neeraj9@75b4e42

PDF causing error:
9E5E080E-C8DB-4A6B-822B-9A67DC04E526-120438.pdf

Traceback

This is the complete traceback I see:

I lost the traceback. I will add it if required.
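
Since the traceback was not kept, here is a minimal sketch of one way to capture it when re-running the extraction. The helper name `extract_with_tracebacks` and its return shape are hypothetical, not part of pypdf. It only assumes that each page object has an `extract_text()` method, as pypdf pages do.

```python
import traceback


def extract_with_tracebacks(pages, first_page, last_page):
    """Extract text from pages first_page..last_page (1-based, inclusive).

    Hypothetical helper, not a pypdf API. Returns (texts, failures):
    texts maps the 0-based page index to the extracted text, and
    failures maps the 0-based page index to the formatted traceback.
    """
    texts, failures = {}, {}
    for page_num in range(first_page - 1, last_page):
        try:
            texts[page_num] = pages[page_num].extract_text()
        except Exception:
            # Keep the complete traceback so it can be pasted into the report.
            failures[page_num] = traceback.format_exc()
    return texts, failures
```

With pypdf this could be called as `extract_with_tracebacks(pdf_reader.pages, 1, 2)`. Printing each value in `failures` would give the complete traceback for the page that failed, while the remaining pages are still extracted.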

Labels

is-robustness-issue: From a user's perspective, this is about robustness
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
