Skip to content

BUG: Non-deterministic accidental object reuse#1995

Closed
sjoerdjob wants to merge 3 commits intopy-pdf:mainfrom
sjoerdjob:issue-1788
Closed

BUG: Non-deterministic accidental object reuse#1995
sjoerdjob wants to merge 3 commits intopy-pdf:mainfrom
sjoerdjob:issue-1788

Conversation

@sjoerdjob
Copy link
Copy Markdown
Contributor

@sjoerdjob sjoerdjob commented Jul 21, 2023

The following section of code could sometimes have unintended effects where pages of an earlier integrated PDF file were used instead.

writer = PdfWriter()

reader1 = PdfReader(some_file)
id_reader1 = id(reader1)
writer.add_page(reader1.pages[0])
del reader1

reader2 = PdfReader(other_file)
id_reader2 = id(reader2)
writer.add_page(reader2.pages[0])
del reader2

writer.write(third_file)

because the reader1 is no longer in memory when reader2 gets initialized, the area in memory is free, so id_reader1 and id_reader2 might end up having the same value.

Due to PyPDF using id(reader) internally for an object-cache, it sometimes happened that writer.add_page(reader2.pages[0]) would result in duplicating reader1.pages[0] instead.

fixes #1788 .

The following section of code could sometimes have unintended effects
where pages of an earlier integrated PDF file were used instead.

```
writer = PdfWriter()

reader1 = PdfReader(some_file)
id_reader1 = id(reader1)
writer.add_page(reader1.pages[0])
del reader1

reader2 = PdfReader(other_file)
id_reader2 = id(reader2)
writer.add_page(reader2.pages[0])
del reader2

writer.write(third_file)
```

because the `reader1` is no longer in memory when `reader2` gets
initialized, the area in memory is free, so `id_reader1` and
`id_reader2` might end up having the same value.

Due to PyPDF using `id(reader)` internally for an object-cache, it
sometimes happened that `writer.add_page(reader2.pages[0])` would result
in duplicating `reader1.pages[0]` instead.
Using a WeakKeyDictionary was an implementation detail that does not
have to be matched in the type.
@MartinThoma
Copy link
Copy Markdown
Member

@sjoerdjob Your PR cannot be merged like this as the CI fails:

pypdf/_protocols.py:68: error: Name "WeakKeyDictionary" is not defined

There are now also a couple of merge conflicts.

@pubpub-zz
Copy link
Copy Markdown
Collaborator

@MartinThoma this looks similar (different approach) to a PR you've already about random error. not sure this is still required

@MartinThoma
Copy link
Copy Markdown
Member

#1788 was fixed via #1841. test_merging_many_temporary_files succeeds as well.

MartinThoma added a commit that referenced this pull request Oct 8, 2023
Full credit to sjoerdjob for this contribution via #1995

See #1788

Co-authored-by: Sjoerd Job Postmus <sjoerdjob@sjec.nl>
MartinThoma added a commit that referenced this pull request Oct 8, 2023
#2244)

Full credit to sjoerdjob for this contribution via #1995

See #1788

Co-authored-by: Sjoerd Job Postmus <sjoerdjob@sjec.nl>
@MartinThoma
Copy link
Copy Markdown
Member

We didn't have a test that could detect this type of issue, hence I added the test from this PR via #2244

@sjoerdjob Sorry that it took so long. I'm closing this PR now as the problem was fixed via #1841 and your test was added via #2244

@MartinThoma MartinThoma closed this Oct 8, 2023
@MartinThoma
Copy link
Copy Markdown
Member

Thank you for your support 🤗 If you want, I'll add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Random bug with add_page

3 participants