BUG: 'IndexError: index out of range' when using extract_text #1358

@diavral

Description

I started using PyPDF2 less than 24 hours ago, so this may be a mistake on my side.
This report covers three things: an error raised by extract_text, a suggestion for extract_text, and a mistake in the documentation.

Environment

(PDFProcess) E:\pyProject\PDFProcess>python -m platform
Windows-10-10.0.19041-SP0 (Windows Home, Chinese edition)
(PDFProcess) E:\pyProject\PDFProcess>python -c "import PyPDF2,sys;print(PyPDF2.__version__,sys.version,sep='###')"
2.10.9###3.9.1 (tags/v3.9.1:1e5d33e, Dec  7 2020, 17:08:21) [MSC v.1927 64 bit (AMD64)]

PDF-> UnicodeCharts

from PyPDF2 import PdfReader

reader = PdfReader("Unicode/CodeCharts_15.0.0.pdf")
page_0 = reader.pages[0]
page_0.extract_text() 

Bug

Location: PyPDF2/generic/_data_structures.py --> class ContentStream(DecodedStreamObject)::__init__
At approximately line 690, the code if data[-1] != b"\n": raises IndexError when data == b"".
It could be changed to an if-elif statement:

if len(data) == 0:
    pass
elif data[-1] != b"\n":
    data += b"\n"

or just change to:

if len(data) == 0 or data[-1] != b"\n":
    data += b"\n"
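
A minimal standalone sketch of the guard, not PyPDF2's actual code: the helper name ensure_trailing_newline is hypothetical. It uses a slice (data[-1:]) instead of an index, because slicing an empty bytes object returns b"" rather than raising IndexError. Note also that in Python 3 indexing a bytes object yields an int, so data[-1] != b"\n" is always true for non-empty data; the slice comparison avoids that as well.

```python
def ensure_trailing_newline(data: bytes) -> bytes:
    """Return data with exactly one b"\\n" appended if it does not already end with one.

    Hypothetical helper for illustration only; it is not part of PyPDF2.
    """
    # data[-1:] is b"" for empty input, so no IndexError is raised here.
    if data[-1:] != b"\n":
        data += b"\n"
    return data


# Indexing an empty bytes object is what triggers the reported error.
try:
    b""[-1]
except IndexError as exc:
    print("IndexError:", exc)

print(ensure_trailing_newline(b""))
print(ensure_trailing_newline(b"BT ET"))
print(ensure_trailing_newline(b"BT ET\n"))
```

Like the second variant above, this appends a newline to empty input; the if-elif variant leaves empty input unchanged.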

Suggestion

Location: PyPDF2/_page.py --> class PageObject(DictionaryObject)::_extract_text --> function process_operation --> elif operator == b"Tj":
at approximately line 1514; I am not sure yet.
When I call page_num.extract_text() (with the bug above fixed), I get a string with no separator such as '\n' to break or split lines.
I tried adding one line between # fmt: on and else: return None:

    # fmt: on
    text += "*LineBreak*"
else:
    return None

This works on pages of plain text but performs poorly on other layouts such as tables.
I do not know enough to say where the right place to add line breaks is.
So I think it would be useful to add a new argument, e.g. def extract_text(sep: str = ""):, and implement it.
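
To illustrate the idea only: a sketch of how a sep argument could behave, assuming the text shown by each Tj operator is available as a list of strings. The function join_text_fragments and its sep parameter are hypothetical and do not exist in PyPDF2; this does not solve the table layout problem mentioned above.

```python
from typing import Iterable


def join_text_fragments(fragments: Iterable[str], sep: str = "") -> str:
    """Join extracted text fragments, placing sep between consecutive fragments.

    Hypothetical helper: the default sep="" reproduces the current
    behaviour of concatenating everything without a separator.
    """
    return sep.join(fragments)


fragments = ["First line", "Second line"]
print(repr(join_text_fragments(fragments)))
print(repr(join_text_fragments(fragments, sep="\n")))
```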

Documentation

https://pypdf2.readthedocs.io/en/latest/user/reading-pdf-annotations.html#attachments

Location: docs/user/reading-pdf-annotations.md --> Attachments
The example code raises a NameError:

attachments = {}
for page in reader.pages:
    if "/Annots" in page:
        for annotation in page["/Annots"]:
            subtype = annot.get_object()["/Subtype"]
            if subtype == "/FileAttachment":
                fileobj = annotobj["/FS"]
                attachments[fileobj["/F"]] = fileobj["/EF"]["/F"].get_data()

The loop variable is named annotation, but the next lines use annot and annotobj, neither of which is defined.
The variable names in the example should be made consistent.
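
One possible corrected version, as a sketch: the logic is the same as in the documentation example, with a single resolved variable used throughout. The function name collect_attachments and the variable annotation_obj are my own choices, and reader is assumed to be a PyPDF2 PdfReader.

```python
def collect_attachments(reader):
    """Return {filename: file bytes} for every /FileAttachment annotation.

    reader is assumed to be a PyPDF2 PdfReader; the function name is
    hypothetical and only wraps the documentation example.
    """
    attachments = {}
    for page in reader.pages:
        if "/Annots" in page:
            for annotation in page["/Annots"]:
                # Resolve the annotation once and reuse the same name below.
                annotation_obj = annotation.get_object()
                subtype = annotation_obj["/Subtype"]
                if subtype == "/FileAttachment":
                    fileobj = annotation_obj["/FS"]
                    attachments[fileobj["/F"]] = fileobj["/EF"]["/F"].get_data()
    return attachments


# Usage (requires PyPDF2 and a PDF with file attachments; path is a placeholder):
# from PyPDF2 import PdfReader
# attachments = collect_attachments(PdfReader("example.pdf"))
```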

Labels

is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
