DictionaryObject.read_from_stream contains this code:
    if length is None:  # if the PDF is damaged
        length = -1
    pstart = stream.tell()
    if length > 0:
        data["__streamdata__"] = stream.read(length)
    else:
        data["__streamdata__"] = read_until_regex(
            stream, re.compile(b"endstream")
        )
Since read_until_regex doesn't strip the trailing newline, this will read almost all length-0 streams as b"\n" or b"\r\n" instead of b"".
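The effect can be reproduced without pypdf. This is only a minimal sketch: `read_until` is a hypothetical, simplified stand-in for `read_until_regex` (it is not pypdf's implementation), and the input bytes are a made-up example of a valid empty stream.

```python
import io
import re


def read_until(stream, pattern):
    """Hypothetical stand-in for read_until_regex: return the bytes before the first match."""
    start = stream.tell()
    rest = stream.read()
    match = pattern.search(rest)
    if match is None:
        return rest
    # Leave the stream positioned at the start of the match.
    stream.seek(start + match.start())
    return rest[: match.start()]


# Positioned just after "stream\n": zero data bytes, an EOL, then "endstream".
stream = io.BytesIO(b"\nendstream\nendobj\n")
length = 0

if length > 0:
    data = stream.read(length)
else:
    # /Length 0 takes the same branch as a missing /Length.
    data = read_until(stream, re.compile(b"endstream"))

print(data)  # b'\n' rather than b''
```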
I have some PDFs with creator PFU ScanSnap Manager 5.1.30 #S1500 that contain JBIG2-encoded pages with /JBIG2Globals pointing to an empty stream object. After loading and saving them with pypdf, the /JBIG2Globals stream is invalid, and some (not all) PDF viewers fail to render the pages.
Suggested fix:
- If there exist broken PDFs in the wild with /Length 0 followed by a stream of nonzero length that pypdf needs to support, check for stream\r?\n\r?\n?endstream as a special case first before falling back to read_until_regex, to ensure that valid PDFs with length-0 streams are always read correctly.
- Or, if there are no such PDFs, and length > 0 was just meant to catch the -1 case, change the test to length >= 0.
- In the read_until_regex case, if endstream is preceded by \r then strip it, or if it's preceded by \r\n then strip the \n, and strip the \r also iff stream was followed by \r. That isn't guaranteed to work, but it's probably the best one can do.
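A rough sketch of how the first and third suggestions could fit together. The names `read_stream_data`, `strip_trailing_eol`, `EMPTY_TAIL` and the `stream_kw_followed_by_cr` flag are hypothetical and not part of pypdf. The sketch assumes the stream is positioned just after the EOL that follows the `stream` keyword, and treating a lone `\n` before `endstream` like a lone `\r` is an assumption of mine.

```python
import io
import re

# What may follow "stream<EOL>" when the stream is genuinely empty.
EMPTY_TAIL = re.compile(rb"\r?\n?endstream")


def strip_trailing_eol(data: bytes, stream_kw_followed_by_cr: bool) -> bytes:
    """Best-effort removal of the EOL that precedes endstream."""
    if data.endswith(b"\r\n"):
        # Always drop the \n; drop the \r only if "stream" was followed by \r.
        return data[:-2] if stream_kw_followed_by_cr else data[:-1]
    if data.endswith(b"\r") or data.endswith(b"\n"):  # lone \n: assumption
        return data[:-1]
    return data


def read_stream_data(stream, length, stream_kw_followed_by_cr=False) -> bytes:
    """Hypothetical helper, not pypdf's API."""
    pos = stream.tell()
    if length is not None and length > 0:
        return stream.read(length)
    if length == 0:
        # Special case: a valid empty stream must always be read as b"".
        peek = stream.read(len(b"\r\nendstream"))
        stream.seek(pos)
        if EMPTY_TAIL.match(peek):
            return b""
    # Missing /Length, or /Length 0 with data present: scan for endstream.
    rest = stream.read()
    idx = rest.find(b"endstream")
    data = rest if idx == -1 else rest[:idx]
    if idx != -1:
        stream.seek(pos + idx)
    return strip_trailing_eol(data, stream_kw_followed_by_cr)


print(read_stream_data(io.BytesIO(b"\nendstream\n"), 0))      # b''
print(read_stream_data(io.BytesIO(b"abc\nendstream\n"), None))  # b'abc'
```

If no PDFs with a wrong /Length 0 need to be supported, the simpler second suggestion (testing `length >= 0`) makes the special case unnecessary.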