2

A Django 4.2 on Python 3.10 application misbehaves sometimes, in a place that should be stateless. If I access the database (postgresql 14 via pcygopg 3.1.18) while working with pypdf 3.17.4, the output document is broken.. in roughly 1 out of 8 attempts. How can I determine the cause of the erratic behavior?

The resulting PDF has missing content, so comparing output size is sufficient to determine the bug was triggered:

# call as: python3 manage.py minrepro input.pdf
import argparse, io
from django.core.management.base import BaseCommand
from pypdf import PdfReader, PdfWriter
from djangoapp.models import DjangoModel


def main(fin):
    pdfout = PdfWriter()
    pageout = pdfout.add_blank_page(width=200, height=200)

    for i in range(8):
        # Note: accessing the database *during* PDF merging is relevant!
        # without the next line, the problem cannot be reproduced
        for c in range(31): a = DjangoModel.objects.first()

        fin.seek(0)
        for pagein in PdfReader(fin, strict=True).pages:
            pageout.merge_page(pagein)

    with io.BytesIO() as fout:
        pdfout.write(fout)
        return fout.tell()

class Command(BaseCommand):
    def add_arguments(self, parser):
        parser.add_argument(dest="pdf", type=argparse.FileType("rb") )
    def handle(self, *args, **options):
        for i in range(30):
            if i == 0: first_size = main(options["pdf"])
            current_size = main(options["pdf"])
            if not first_size == current_size:
                print(f"presumed stateless call was not {i=}, {first_size=} != {current_size=}")

  • My fist theory was pypdf failing to generate truly unique names for (post-merge) duplicate identifiers, and the DB access merely giving the opportunity (delay) for that to become noticeable. I looked for any time or random imports in the relevant libraries, no obvious bugs there. Also, nothing like time.sleep(0.02) could not make my reproducer work, only database access.
  • This is CPython with insert-ordered dicts, most code should behave the same on identical input. Also called python with -W module::ResourceWarning -W module::DeprecationWarning -W module::PendingDeprecationWarning to be more likely notified of any mistakes on my part, nothing.
  • Have reproduced the behavior in a single-threaded management command, it is therefore not related to the any http/web server issue.
7
  • 2
    Have you tried replacing DB calls with smth like time.sleep(0.002)? Commented Mar 19, 2024 at 11:59
  • 1
    You may want to share your input file: for a simple PDF of 10 pages with a couple of images and text this does not reproduce on my local system (django 4.2, py3.11, pg14, psycopg 3.1.18, pypdf 3.17.4). Commented Mar 19, 2024 at 12:28
  • Other I/O and sleep not once managed to trigger it, DB access remains the only way. This file gives me 1/8 reproduction rate. Commented Mar 19, 2024 at 12:51
  • 1
    @AbdulAzizBarkat while not a root cause, the code in question is highly django-specific, and such tag would be helpful for discovery. It's very likely to occur in another django-based environment. I rolled back your edit and added one more tag to satisfy your "database query is enough" argument. Commented Mar 19, 2024 at 13:22
  • 1
    @STerliakov it's not really the database query, but rather enough stuff happening for the garbage collector to clean up the old objects. It should be possible to reproduce this issue without Django or a database query given the nature of the problem. Commented Mar 19, 2024 at 14:01

1 Answer 1

2

Root cause identified: Dropping and recreating PdfReader in the loop objects has a low probability for pypdf.PdfWriter.merge to recycle into already used identifier mappings.

The trigger appears to be garbage collection in my loop, with psycopg3 indeed just providing the opportunity for that to happen, not specifically causing it.

The relevant documentation for resetting the association between source and destination PDF object has lead me to the previously missed detail that pypdf assumes, but never checks that PdfReader objects used during merging will never recycle python id().

Two objects with non-overlapping lifetimes may have the same id() value. -- Python id(object)

That has happened in my code, as easily visible by printing id(page.pdf).

Keeping a reference to the PdfReader object until the last input is merged (or simply until the output is written) also stops my reproducer from working, like this:

store_for_pypdf = []
for loop:
    ...
    reader = PdfReader(fin, strict=True)
    store_for_pypdf.append(reader)
    ... # use reader
...
pdfout.write(fout)
del store_for_pypdf
Sign up to request clarification or add additional context in comments.

1 Comment

That is interesting! Do you want to open a bug ticket for pypdf? github.com/py-pdf/pypdf/issues As you have likely already read parts of the code you might even have a proposal for a fix?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.