Skip to content

BUG: Ambiguous translated references#2558

Closed
biredel wants to merge 3 commits intopy-pdf:mainfrom
biredel:patch-id-is-evil
Closed

BUG: Ambiguous translated references#2558
biredel wants to merge 3 commits intopy-pdf:mainfrom
biredel:patch-id-is-evil

Conversation

@biredel
Copy link
Copy Markdown

@biredel biredel commented Mar 29, 2024

pypdf uses id(obj) to keep track of objects in _id_translated. This identifier is not unique for different objects, only for objects existing at the same time. e.g. In CPython id(obj), is just a memory address. This cannot be used to tell whether we already have referenced a page or not. It might point to a page of a since-deleted PdfReader, silently corrupting out output.

The bug manifests rarely, and the corrupted output document usually looks OK at first glance (but has missing/duplicated content), the following is my best reproducer so far, adapted from https://stackoverflow.com/questions/78186323/

# python3 repro.py < resources/multilang.pdf

import gc, io, sys
from pypdf import PdfReader, PdfWriter

def pdfsize(fin):
    pdfout = PdfWriter()
    pageout = pdfout.add_blank_page(width=10, height=10)
    reader = None
    for i in range(80):
        fin.seek(0)
        del reader
        gc.collect(0)
        reader = PdfReader(fin, strict=True)
        for pagein in reader.pages:
            pageout.merge_page(pagein)
    with io.BytesIO() as fout:
        pdfout.write(fout)
        return fout.tell()

if __name__ == "__main__":
    with io.BytesIO() as fin:
        fin.write(sys.stdin.buffer.read())
        for i in range(50):
            if i == 0:
                first_size = pdfsize(fin)
                first_size = pdfsize(fin)
            current_size = pdfsize(fin)
            if not first_size == current_size:
                raise RuntimeError(
                    f"generated PDF file size changed after repeating {i} times {first_size=} != {current_size=}"
                )

The attached draft replaces dict[id(obj)] with dict[weakref(obj)] - For the simple insert/lookup/delete cases, that behaves the same. Yet no-longer-existing objects are cleared automatically, removing the possible collision.

There is another instance of id(obj) outside of __repr__() methods in PageObject.hash_value_data - I have not determined whether that can cause related problems.

Essentially resubmitting #1995 as a replacement for #1841 which has memory usage side-effects and does not fully avoid #1788 anyway.

id(obj) can and will hit the same value for different objects, if they do not exist at the same time. In CPython, its just a memory address. When that collission occurs, PdfWriter when asked to insert a page from one PdfReader object may silently confuse it with one from a since-deleted PdfReader object, thus corrupting output in not-easily-noticed ways.

Replacing dict[id(obj)] with dict[weakref(obj)] should suffice, as we do not replace entries - we only ever insert or delete.
@pubpub-zz
Copy link
Copy Markdown
Collaborator

@biredel
can you have a review at #1788.
I'm expecting PdfReader used in merging into PdfWriters should not be deleted from memory through the PR attached.

@biredel
Copy link
Copy Markdown
Author

biredel commented Mar 29, 2024

@pubpub-zz Thanks for pointing me there. I only now noticed the almost identical PR #1995 .. suggested before I even ran into the bug.

What was applied instead, #1841 did not resolve the problem because it only touches one out of two (maybe three?) opportunities to get it wrong: PageObject.clone()->PdfObject._reference_clone() and IndirectObject.clone() duplicate that code.

I expect that (costly) workaround can be removed if the weakref fix works.

@pubpub-zz
Copy link
Copy Markdown
Collaborator

Also. Have you try first to clone the pages before calling merge_page ?

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 30, 2024

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.73%. Comparing base (e35df5a) to head (03e8b81).
⚠️ Report is 660 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2558      +/-   ##
==========================================
+ Coverage   94.71%   94.73%   +0.01%     
==========================================
  Files          50       50              
  Lines        8237     8257      +20     
  Branches     1646     1651       +5     
==========================================
+ Hits         7802     7822      +20     
  Misses        267      267              
  Partials      168      168              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@biredel
Copy link
Copy Markdown
Author

biredel commented Mar 30, 2024

Also. Have you try first to clone the pages before calling merge_page ?

That is how I know both code paths are exposed. Have the page clone() first, then PdfObject._reference_clone with the workaround catches the problem. Call merge_page() right away, then IndirectObject.clone bites you.

@pubpub-zz
Copy link
Copy Markdown
Collaborator

@biredel
I've tried your code unsuccessfully : I met no issue.

also there is a trick in the code : _id_translated values are not only Dict[int, int]] but there is also in the dict an element with "PreventGC" that keeps a link to the reader that should prevent the GC to delete it before the writer. This being assigned during the cloning process, it is currently not called when using merge_page

Therefore, looking at your test code I recommend a test calling pageout.merge_page(pagein.clone(pdfout)) to possibly prevent error.

Can you give feedback ?

@biredel
Copy link
Copy Markdown
Author

biredel commented Apr 1, 2024

I've tried your code unsuccessfully : I met no issue.

What worked best for me was a low-core-count amd64 virtual machine, on a 3.10 series CPython, high iteration count and reading resources/multilang.pdf.

also there is a trick in the code

I saw that and recommend this is discarded once the reason for keeping that reference is resolved.

Therefore, looking at your test code I recommend a test calling pageout.merge_page(pagein.clone(pdfout)) to possibly prevent error.
Can you give feedback ?

Yes; positive; confirming. I did do that. That is how I know both code paths are exposed. You can trigger the bug via pagein.clone() too, once there is at least one id of since-deleted objects in the dictionary keys. It is much easier to trigger the bug using merge_page only[1] so that is what my sample code does, but one such call suffices to potentially corrupt the output.

[1] I suspect that is because CPython is much more likely to recycle id() if the new object fits into the memory slot of a no longer referenced object.

@stefan6419846
Copy link
Copy Markdown
Collaborator

What is the state of this PR?

@stefan6419846 stefan6419846 added the needs-discussion The PR/issue needs more discussion before we can continue label Jan 9, 2026
@biredel
Copy link
Copy Markdown
Author

biredel commented Apr 8, 2026

What is the state of this PR?

After #1841 and #3520 this PR is no longer important.
The workaround applied to all inserts fixes the bug. The remaining difference to the weakref approach is just memory usage and code quality (easier to reintroduce the bug when the data type does not imply storing what the code needs: a unique reference). I have not bothered testing a proper rebase - and will not in the foreseeable future - so closing this.

@biredel biredel closed this Apr 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-discussion The PR/issue needs more discussion before we can continue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants