BUG: Ambiguous translated references by biredel · Pull Request #2558 · py-pdf/pypdf

biredel · 2024-03-29T20:33:36Z

pypdf uses id(obj) to keep track of objects in _id_translated. This identifier is not unique for different objects, only for objects existing at the same time. e.g. In CPython id(obj), is just a memory address. This cannot be used to tell whether we already have referenced a page or not. It might point to a page of a since-deleted PdfReader, silently corrupting out output.

The bug manifests rarely, and the corrupted output document usually looks OK at first glance (but has missing/duplicated content), the following is my best reproducer so far, adapted from https://stackoverflow.com/questions/78186323/

# python3 repro.py < resources/multilang.pdf

import gc, io, sys
from pypdf import PdfReader, PdfWriter

def pdfsize(fin):
    pdfout = PdfWriter()
    pageout = pdfout.add_blank_page(width=10, height=10)
    reader = None
    for i in range(80):
        fin.seek(0)
        del reader
        gc.collect(0)
        reader = PdfReader(fin, strict=True)
        for pagein in reader.pages:
            pageout.merge_page(pagein)
    with io.BytesIO() as fout:
        pdfout.write(fout)
        return fout.tell()

if __name__ == "__main__":
    with io.BytesIO() as fin:
        fin.write(sys.stdin.buffer.read())
        for i in range(50):
            if i == 0:
                first_size = pdfsize(fin)
                first_size = pdfsize(fin)
            current_size = pdfsize(fin)
            if not first_size == current_size:
                raise RuntimeError(
                    f"generated PDF file size changed after repeating {i} times {first_size=} != {current_size=}"
                )

The attached draft replaces dict[id(obj)] with dict[weakref(obj)] - For the simple insert/lookup/delete cases, that behaves the same. Yet no-longer-existing objects are cleared automatically, removing the possible collision.

There is another instance of id(obj) outside of __repr__() methods in PageObject.hash_value_data - I have not determined whether that can cause related problems.

Essentially resubmitting #1995 as a replacement for #1841 which has memory usage side-effects and does not fully avoid #1788 anyway.

id(obj) can and will hit the same value for different objects, if they do not exist at the same time. In CPython, its just a memory address. When that collission occurs, PdfWriter when asked to insert a page from one PdfReader object may silently confuse it with one from a since-deleted PdfReader object, thus corrupting output in not-easily-noticed ways. Replacing dict[id(obj)] with dict[weakref(obj)] should suffice, as we do not replace entries - we only ever insert or delete.

pubpub-zz · 2024-03-29T21:51:40Z

@biredel
can you have a review at #1788.
I'm expecting PdfReader used in merging into PdfWriters should not be deleted from memory through the PR attached.

biredel · 2024-03-29T23:03:02Z

@pubpub-zz Thanks for pointing me there. I only now noticed the almost identical PR #1995 .. suggested before I even ran into the bug.

What was applied instead, #1841 did not resolve the problem because it only touches one out of two (maybe three?) opportunities to get it wrong: PageObject.clone()->PdfObject._reference_clone() and IndirectObject.clone() duplicate that code.

I expect that (costly) workaround can be removed if the weakref fix works.

pubpub-zz · 2024-03-30T07:53:31Z

Also. Have you try first to clone the pages before calling merge_page ?

This reverts commit 8188553.

codecov · 2024-03-30T11:43:30Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.73%. Comparing base (e35df5a) to head (03e8b81).
⚠️ Report is 660 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2558      +/-   ##
==========================================
+ Coverage   94.71%   94.73%   +0.01%     
==========================================
  Files          50       50              
  Lines        8237     8257      +20     
  Branches     1646     1651       +5     
==========================================
+ Hits         7802     7822      +20     
  Misses        267      267              
  Partials      168      168

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

biredel · 2024-03-30T12:08:32Z

Also. Have you try first to clone the pages before calling merge_page ?

That is how I know both code paths are exposed. Have the page clone() first, then PdfObject._reference_clone with the workaround catches the problem. Call merge_page() right away, then IndirectObject.clone bites you.

pubpub-zz · 2024-03-31T18:37:39Z

@biredel
I've tried your code unsuccessfully : I met no issue.

also there is a trick in the code : _id_translated values are not only Dict[int, int]] but there is also in the dict an element with "PreventGC" that keeps a link to the reader that should prevent the GC to delete it before the writer. This being assigned during the cloning process, it is currently not called when using merge_page

Therefore, looking at your test code I recommend a test calling pageout.merge_page(pagein.clone(pdfout)) to possibly prevent error.

Can you give feedback ?

biredel · 2024-04-01T09:50:58Z

I've tried your code unsuccessfully : I met no issue.

What worked best for me was a low-core-count amd64 virtual machine, on a 3.10 series CPython, high iteration count and reading resources/multilang.pdf.

also there is a trick in the code

I saw that and recommend this is discarded once the reason for keeping that reference is resolved.

Therefore, looking at your test code I recommend a test calling pageout.merge_page(pagein.clone(pdfout)) to possibly prevent error.
Can you give feedback ?

Yes; positive; confirming. I did do that. That is how I know both code paths are exposed. You can trigger the bug via pagein.clone() too, once there is at least one id of since-deleted objects in the dictionary keys. It is much easier to trigger the bug using merge_page only[1] so that is what my sample code does, but one such call suffices to potentially corrupt the output.

[1] I suspect that is because CPython is much more likely to recycle id() if the new object fits into the memory slot of a no longer referenced object.

stefan6419846 · 2025-06-05T11:09:58Z

What is the state of this PR?

biredel · 2026-04-08T12:07:05Z

What is the state of this PR?

After #1841 and #3520 this PR is no longer important.
The workaround applied to all inserts fixes the bug. The remaining difference to the weakref approach is just memory usage and code quality (easier to reintroduce the bug when the data type does not imply storing what the code needs: a unique reference). I have not bothered testing a proper rebase - and will not in the foreseeable future - so closing this.

biredel added 2 commits March 30, 2024 11:34

Python 3.7 (EoL) syntax

1484610

Revert "ROB: Cope with garbage collector during cloning (py-pdf#1841)"

03e8b81

This reverts commit 8188553.

patrick91 mentioned this pull request Nov 13, 2025

BUG: Fix missing "PreventGC" when cloning #3520

Merged

stefan6419846 added the needs-discussion The PR/issue needs more discussion before we can continue label Jan 9, 2026

biredel closed this Apr 8, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Ambiguous translated references#2558

BUG: Ambiguous translated references#2558
biredel wants to merge 3 commits intopy-pdf:mainfrom
biredel:patch-id-is-evil

biredel commented Mar 29, 2024 •

edited

Loading

Uh oh!

pubpub-zz commented Mar 29, 2024

Uh oh!

biredel commented Mar 29, 2024 •

edited

Loading

Uh oh!

pubpub-zz commented Mar 30, 2024

Uh oh!

codecov bot commented Mar 30, 2024 •

edited

Loading

Uh oh!

biredel commented Mar 30, 2024

Uh oh!

pubpub-zz commented Mar 31, 2024

Uh oh!

biredel commented Apr 1, 2024

Uh oh!

stefan6419846 commented Jun 5, 2025

Uh oh!

biredel commented Apr 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

biredel commented Mar 29, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pubpub-zz commented Mar 29, 2024

Uh oh!

biredel commented Mar 29, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pubpub-zz commented Mar 30, 2024

Uh oh!

codecov bot commented Mar 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

biredel commented Mar 30, 2024

Uh oh!

pubpub-zz commented Mar 31, 2024

Uh oh!

biredel commented Apr 1, 2024

Uh oh!

stefan6419846 commented Jun 5, 2025

Uh oh!

biredel commented Apr 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

biredel commented Mar 29, 2024 •

edited

Loading

biredel commented Mar 29, 2024 •

edited

Loading

codecov bot commented Mar 30, 2024 •

edited

Loading