Description
I have been using PyPDF2 for less than 24 hours, so this may be a mistake on my side.
I have three things to report: an error raised by extract_text, a suggestion for extract_text, and a mistake in the documentation.
Environment
(PDFProcess) E:\pyProject\PDFProcess>python -m platform
Windows-10-10.0.19041-SP0 (Windows Home, Chinese edition)
(PDFProcess) E:\pyProject\PDFProcess>python -c "import PyPDF2,sys;print(PyPDF2.__version__,sys.version,sep='###')"
2.10.9###3.9.1 (tags/v3.9.1:1e5d33e, Dec 7 2020, 17:08:21) [MSC v.1927 64 bit (AMD64)]
PDF-> UnicodeCharts
from PyPDF2 import PdfReader
reader = PdfReader("Unicode/CodeCharts_15.0.0.pdf")
page_0 = reader.pages[0]
page_0.extract_text()
Bug
Location: PyPDF2/generic/_data_structures.py --> class ContentStream(DecodedStreamObject)::__init__
At approximately line 690, the check if data[-1] != b"\n": raises IndexError when data == b"".
It could be changed to an if-elif statement:
if len(data) == 0:
    pass
elif data[-1] != b"\n":
    data += b"\n"
or simply to:
if len(data) == 0 or data[-1] != b"\n":
    data += b"\n"
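To make the failure and the guard concrete, here is a minimal standalone sketch. It does not touch PyPDF2 itself: ensure_trailing_newline is a hypothetical helper name, and it swaps in bytes.endswith for the index check. That swap is my assumption, chosen because endswith is safe on empty input and because indexing a bytes object in Python 3 returns an int rather than a one-byte bytes value. Like the second variant above, it appends a newline to empty data as well.

```python
def ensure_trailing_newline(data: bytes) -> bytes:
    """Hypothetical helper (not part of PyPDF2): make data end with a newline."""
    # endswith() never raises on empty input, unlike data[-1].
    if not data.endswith(b"\n"):
        data += b"\n"
    return data


# Reproduce the reported failure mode: indexing empty bytes raises IndexError.
try:
    b""[-1]
    raised = False
except IndexError:
    raised = True

print(raised)
print(ensure_trailing_newline(b""))
print(ensure_trailing_newline(b"BT ET"))
print(ensure_trailing_newline(b"BT ET\n"))
```

If the empty stream should stay empty, as in the if-elif variant, the helper would need an early return for len(data) == 0 instead.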
Suggestion
Location: PyPDF2/_page.py --> class PageObject(DictionaryObject)::_extract_text --> function process_operation --> elif operator == b"Tj":
This is at approximately line 1514, though I am not sure yet.
When I call page_num.extract_text() with the fix above applied, I get a string that has no separator such as '\n' to break or split the lines.
I tried adding a line-break marker between # fmt: on and else: return None:
# fmt: on
    text += "*LineBreak*"
else:
    return None
This works on pages of plain text, but gives poor results for other layouts such as tables.
I do not know enough to say where the right place to add line breaks is.
So I think a new argument is needed, something like def extract_text(sep: str = ""):, with the implementation to follow.
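As a rough illustration of what such an argument might do, here is a standalone sketch. It is not PyPDF2 code: join_text_fragments is a hypothetical function, and the fragments list is made-up sample data standing in for the strings that successive Tj operators would produce. The default sep="" reproduces the current behaviour described above.

```python
from typing import Iterable


def join_text_fragments(fragments: Iterable[str], sep: str = "") -> str:
    """Hypothetical sketch: join text fragments with a caller-chosen separator."""
    return sep.join(fragments)


# Made-up sample data, standing in for text shown by successive Tj operators.
fragments = ["first line", "second line", "third line"]

print(join_text_fragments(fragments))            # fragments run together
print(join_text_fragments(fragments, sep="\n"))  # one fragment per line
```

This only covers the simple case. As noted for tables, a single separator between every fragment is unlikely to be enough, so where the separator is applied would still need to be decided inside _extract_text.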
Documentation
https://pypdf2.readthedocs.io/en/latest/user/reading-pdf-annotations.html#attachments
Location:docs/user/reading-pdf-annotations.md --> Attachments
The example code raises a NameError:
attachments = {}
for page in reader.pages:
    if "/Annots" in page:
        for annotation in page["/Annots"]:
            subtype = annot.get_object()["/Subtype"]
            if subtype == "/FileAttachment":
                fileobj = annotobj["/FS"]
                attachments[fileobj["/F"]] = fileobj["/EF"]["/F"].get_data()
The loop variable is annotation, but the body refers to annot (subtype = annot...) and annotobj (fileobj = annotobj...), neither of which is defined.
The variable names in the example above should be made consistent.
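One possible consistent version is sketched below. The function name collect_attachments and the local name annotation_obj are my own choices, not the documentation's. I am also assuming the intent was to resolve each annotation once with get_object() and reuse the result. Wrapping the loop in a function that takes the reader as a parameter is only there to keep the sketch self-contained.

```python
def collect_attachments(reader):
    """Sketch: the docs example with a single, consistent set of variable names.

    `reader` is expected to behave like a PyPDF2 PdfReader (it needs `.pages`).
    """
    attachments = {}
    for page in reader.pages:
        if "/Annots" in page:
            for annotation in page["/Annots"]:
                # Resolve the annotation once and reuse the resolved object.
                annotation_obj = annotation.get_object()
                if annotation_obj["/Subtype"] == "/FileAttachment":
                    fileobj = annotation_obj["/FS"]
                    attachments[fileobj["/F"]] = fileobj["/EF"]["/F"].get_data()
    return attachments
```

With PyPDF2 installed, this would be called as collect_attachments(PdfReader("example.pdf")), where example.pdf is a placeholder file name.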