Description
I have some existing code that appends pages from a reader into a writer, and scales the new pages.
After 08e951d this code throws an exception for specific problematic PDFs.
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
macOS-14.6.1-arm64-arm-64bit-Mach-O
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==6.5.0, crypt_provider=('cryptography', '44.0.3'), PIL=11.3.0
# NOTE: this actually fails on 6.6.0 and works fine on 6.5.0. I have isolated it to exactly commit 08e951d, which breaks it (before the 6.6.0 version bump).
Code + PDF
Tragically, I cannot upload a PDF that breaks this. I have seen two so far, both of which contain client data, and I have been unsuccessful in generating a shareable PDF that exhibits the same problem.
This is a minimal, complete example that shows the issue:
import os

from pypdf import PdfReader, PdfWriter

input_path = "TODO_SET_INPUT_PATH.pdf"
output_path = "TODO_SET_OUTPUT_PATH.pdf"


def write_to_form(input_name: str, output_name: str):
    reader = PdfReader(input_name, strict=False)
    writer = PdfWriter()
    for i, page in enumerate(reader.pages):
        print(f"DEBUG - Adding page: {i}")
        # on page 2, we get: AttributeError: 'NullObject' object has no attribute 'indirect_reference' on this line
        writer.append(fileobj=reader, pages=[i], import_outline=False)
        # writer.add_page(reader.get_page(i)) # fails with the same end of the stack as the append call
        new_page = writer.pages[i]
        print(f"DEBUG - New page: {new_page}")
        # in the real flow, the point of this is to fix some PDFs before another scaling operation happens that would break them (I think because they don't have a mediabox?)
        # if you remove this line it works fine in 6.6.0
        new_page.scale_by(1)
        print(f"DEBUG - Scaled page: {new_page}")
    os.makedirs(os.path.dirname(output_name), exist_ok=True)
    with open(output_name, "wb") as output_stream:
        writer.write(output_stream)
    print(f"Wrote to {output_path}")


write_to_form(input_path, output_path)
Traceback
This is the complete traceback I see (paths have been slightly modified and line numbers aren't identical for the reference code but the actual line contents are the same):
DEBUG - Adding page: 1
Traceback (most recent call last):
  File "/Users/pfay/pypdf_6_0_0_breakage.py", line 41, in <module>
    write_to_form(input_path, output_path)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pfay/pypdf_6_0_0_breakage.py", line 25, in write_to_form
    writer.append(fileobj=reader, pages=[i], import_outline=False)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pfay/.venv/lib/python3.13/site-packages/pypdf/_writer.py", line 2587, in append
    self.merge(
    ~~~~~~~~~~^
        None,
        ^^^^^
    ...<4 lines>...
        excluded_fields,
        ^^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/pfay/.venv/lib/python3.13/site-packages/pypdf/_writer.py", line 2668, in merge
    srcpages[pg.indirect_reference.idnum] = self.add_page(
                                            ~~~~~~~~~~~~~^
        pg, [*list(excluded_fields), 1, "/B", 1, "/Annots"]  # type: ignore
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/pfay/.venv/lib/python3.13/site-packages/pypdf/_writer.py", line 605, in add_page
    return self._add_page(page, len(self.flattened_pages), excluded_keys)
           ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pfay/.venv/lib/python3.13/site-packages/pypdf/_writer.py", line 508, in _add_page
    "PageObject", page_org.clone(self, False, excluded_keys).get_object()
                  ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pfay/.venv/lib/python3.13/site-packages/pypdf/generic/_data_structures.py", line 301, in clone
    d__._clone(self, pdf_dest, force_duplicate, ignore_fields, visited)
    ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pfay/.venv/lib/python3.13/site-packages/pypdf/generic/_data_structures.py", line 412, in _clone
    v.clone(pdf_dest, force_duplicate, ignore_fields)
    ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pfay/.venv/lib/python3.13/site-packages/pypdf/generic/_data_structures.py", line 140, in clone
    arr.append(data.clone(pdf_dest, force_duplicate, ignore_fields))
               ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pfay/.venv/lib/python3.13/site-packages/pypdf/generic/_base.py", line 370, in clone
    assert dup.indirect_reference is not None, "mypy"
           ^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NullObject' object has no attribute 'indirect_reference'
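Reading the traceback, the failure is reached while cloning an array that sits directly in the page dictionary: one of its elements is an indirect reference whose resolved object is a NullObject. The following is a minimal sketch for finding such entries in a private PDF without sharing it. The helper names `_is_null` and `find_null_array_entries` are hypothetical, and matching by the class name `NullObject` and the `idnum` attribute is an assumption about pypdf's object model rather than a documented interface.

```python
from typing import Any, List, Tuple


def _is_null(obj: Any) -> bool:
    # Assumption: a dangling reference resolves to None or to an object whose
    # class is named "NullObject". Matching by name avoids importing pypdf here.
    return obj is None or type(obj).__name__ == "NullObject"


def find_null_array_entries(page: Any) -> List[Tuple[str, int]]:
    """Return (key, index) for array entries that are references to null."""
    hits: List[Tuple[str, int]] = []
    # dict.items() is called explicitly to read the stored (unresolved) values.
    for key, value in dict.items(page):
        resolve = getattr(value, "get_object", None)
        container = resolve() if callable(resolve) else value
        if not isinstance(container, list):
            continue  # only arrays are inspected, as in the traceback
        for index, item in enumerate(container):
            if not hasattr(item, "idnum"):
                continue  # direct object, not an indirect reference
            if _is_null(item.get_object()):
                hits.append((str(key), index))
    return hits


# Possible usage with pypdf (not executed here):
#   reader = PdfReader("TODO_SET_INPUT_PATH.pdf", strict=False)
#   for i, page in enumerate(reader.pages):
#       print(i, find_null_array_entries(page))
```

If a page reports a hit, the key name (for example the array it belongs to) may be enough to describe the problematic structure without attaching the file.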
To determine the exact breaking change between 6.5.0 and 6.6.0, I went through the commits in 6.5.0...6.6.0 and found that I have no problems on 8e1ccea, but I get the exception as soon as I move to 08e951d.
Specifically, if I undo this exact change from that PR, I no longer get the exception and everything works fine:

Also, I don't see the deprecation warning right above it fire.
For now I am going to downgrade to pypdf 6.5.0 to avoid the problem.
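As a stopgap alongside the downgrade, a small version check can flag environments that have moved past the last release known to work. This is only a sketch: the helper names `parse_version` and `is_after_last_known_good` are hypothetical, and the threshold encodes nothing more than what is reported here (6.5.0 works, 6.6.0 fails). It says nothing about whether later releases are affected.

```python
import re
from typing import Tuple

# Assumption taken from this report: 6.5.0 works, 6.6.0 fails.
LAST_KNOWN_GOOD = (6, 5, 0)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '6.5.0' (or '6.6.0rc1') into a tuple of its three leading integers."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_after_last_known_good(version: str) -> bool:
    """True if the version is newer than the last release known to work."""
    return parse_version(version) > LAST_KNOWN_GOOD


# Possible usage (not executed here):
#   from importlib.metadata import version
#   if is_after_last_known_good(version("pypdf")):
#       print("pypdf is newer than 6.5.0; the append + scale_by flow may fail")
```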
Please let me know if I can provide further information or test fixes. Sorry I'm not able to include PDFs at this time.