Non-Deterministic behavior in PDF library when accessing Django model in between

Question

A Django 4.2 on Python 3.10 application misbehaves sometimes, in a place that should be stateless. If I access the database (postgresql 14 via pcygopg 3.1.18) while working with pypdf 3.17.4, the output document is broken.. in roughly 1 out of 8 attempts. How can I determine the cause of the erratic behavior?

The resulting PDF has missing content, so comparing output size is sufficient to determine the bug was triggered:

# call as: python3 manage.py minrepro input.pdf
import argparse, io
from django.core.management.base import BaseCommand
from pypdf import PdfReader, PdfWriter
from djangoapp.models import DjangoModel


def main(fin):
    pdfout = PdfWriter()
    pageout = pdfout.add_blank_page(width=200, height=200)

    for i in range(8):
        # Note: accessing the database *during* PDF merging is relevant!
        # without the next line, the problem cannot be reproduced
        for c in range(31): a = DjangoModel.objects.first()

        fin.seek(0)
        for pagein in PdfReader(fin, strict=True).pages:
            pageout.merge_page(pagein)

    with io.BytesIO() as fout:
        pdfout.write(fout)
        return fout.tell()

class Command(BaseCommand):
    def add_arguments(self, parser):
        parser.add_argument(dest="pdf", type=argparse.FileType("rb") )
    def handle(self, *args, **options):
        for i in range(30):
            if i == 0: first_size = main(options["pdf"])
            current_size = main(options["pdf"])
            if not first_size == current_size:
                print(f"presumed stateless call was not {i=}, {first_size=} != {current_size=}")

My fist theory was pypdf failing to generate truly unique names for (post-merge) duplicate identifiers, and the DB access merely giving the opportunity (delay) for that to become noticeable. I looked for any time or random imports in the relevant libraries, no obvious bugs there. Also, nothing like time.sleep(0.02) could not make my reproducer work, only database access.
This is CPython with insert-ordered dicts, most code should behave the same on identical input. Also called python with -W module::ResourceWarning -W module::DeprecationWarning -W module::PendingDeprecationWarning to be more likely notified of any mistakes on my part, nothing.
Have reproduced the behavior in a single-threaded management command, it is therefore not related to the any http/web server issue.

Have you tried replacing DB calls with smth like time.sleep(0.002)? — STerliakov
– STerliakov, Commented Mar 19, 2024 at 11:59
You may want to share your input file: for a simple PDF of 10 pages with a couple of images and text this does not reproduce on my local system (django 4.2, py3.11, pg14, psycopg 3.1.18, pypdf 3.17.4). — STerliakov
– STerliakov, Commented Mar 19, 2024 at 12:28
Other I/O and sleep not once managed to trigger it, DB access remains the only way. This file gives me 1/8 reproduction rate. — user
– user, Commented Mar 19, 2024 at 12:51
@AbdulAzizBarkat while not a root cause, the code in question is highly django-specific, and such tag would be helpful for discovery. It's very likely to occur in another django-based environment. I rolled back your edit and added one more tag to satisfy your "database query is enough" argument. — STerliakov
– STerliakov, Commented Mar 19, 2024 at 13:22
@STerliakov it's not really the database query, but rather enough stuff happening for the garbage collector to clean up the old objects. It should be possible to reproduce this issue without Django or a database query given the nature of the problem. — Abdul Aziz Barkat
– Abdul Aziz Barkat, Commented Mar 19, 2024 at 14:01

user · Accepted Answer · 2024-03-19 12:56:09Z

2

Root cause identified: Dropping and recreating PdfReader in the loop objects has a low probability for pypdf.PdfWriter.merge to recycle into already used identifier mappings.

The trigger appears to be garbage collection in my loop, with psycopg3 indeed just providing the opportunity for that to happen, not specifically causing it.

The relevant documentation for resetting the association between source and destination PDF object has lead me to the previously missed detail that pypdf assumes, but never checks that PdfReader objects used during merging will never recycle python id().

Two objects with non-overlapping lifetimes may have the same id() value. -- Python id(object)

That has happened in my code, as easily visible by printing id(page.pdf).

Keeping a reference to the PdfReader object until the last input is merged (or simply until the output is written) also stops my reproducer from working, like this:

store_for_pypdf = []
for loop:
    ...
    reader = PdfReader(fin, strict=True)
    store_for_pypdf.append(reader)
    ... # use reader
...
pdfout.write(fout)
del store_for_pypdf

answered Mar 19, 2024 at 12:56

user

614 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Martin Thoma Over a year ago

That is interesting! Do you want to open a bug ticket for pypdf? github.com/py-pdf/pypdf/issues As you have likely already read parts of the code you might even have a proposal for a fix?

Collectives™ on Stack Overflow

Non-Deterministic behavior in PDF library when accessing Django model in between

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related