Skip to content

BUG: Use remove_orphans in compress_identical_objects#3310

Merged
stefan6419846 merged 34 commits intopy-pdf:mainfrom
j-t-1:compress
Mar 27, 2026
Merged

BUG: Use remove_orphans in compress_identical_objects#3310
stefan6419846 merged 34 commits intopy-pdf:mainfrom
j-t-1:compress

Conversation

@j-t-1
Copy link
Copy Markdown
Contributor

@j-t-1 j-t-1 commented Jun 11, 2025

Issue #3306: PdfWriter.compress_identical_objects ignored remove_orphans. Correct for this. Also deprecate_with_replacement remove_orphans to remove_unreferenced.

j-t-1 added 4 commits June 10, 2025 18:39
Issue py-pdf#3306: PdfWriter.compress_identical_objects ignored remove_orphans. Correct for this. Also deprecate_with_replacement remove_orphans to remove_unreferenced.
@codecov
Copy link
Copy Markdown

codecov bot commented Jun 11, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.43%. Comparing base (018a52e) to head (e684a13).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3310   +/-   ##
=======================================
  Coverage   97.43%   97.43%           
=======================================
  Files          55       55           
  Lines       10005    10016   +11     
  Branches     1837     1841    +4     
=======================================
+ Hits         9748     9759   +11     
  Misses        149      149           
  Partials      108      108           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread pypdf/_writer.py Outdated
@stefan6419846 stefan6419846 added the needs-change The PR/issue cannot be handled as issue and needs to be improved label Jul 31, 2025
@j-t-1
Copy link
Copy Markdown
Contributor Author

j-t-1 commented Feb 26, 2026

Fixed deprecate_with_replacement. Ready for code review.

Comment thread pypdf/_writer.py Outdated
Comment thread pypdf/_writer.py Outdated
@stefan6419846
Copy link
Copy Markdown
Collaborator

Could you please elaborate if we have any test in place to properly check that the remove_orphans parameter is properly respected and where? #3691 has been opened recently and I want to avoid merging this first when your PR already covers all relevant aspects.

@j-t-1
Copy link
Copy Markdown
Contributor Author

j-t-1 commented Mar 23, 2026

Could you please elaborate if we have any test in place to properly check that the remove_orphans parameter is properly respected and where?

Lines 2578 to 2585 do this check.

@stefan6419846
Copy link
Copy Markdown
Collaborator

Why is this test in a section which checks that warnings are emitted? As far as I can see, this always sets remove_orphans, which actually is the inverse of what I am asking for.

Additionally, from the code it is not directly obvious what is checked here due to some "arbitrary" length checks - could you please try to improve it to make it more obvious what happened? Maybe there are some unreferenced dummy objects we could check for explicitly?

@j-t-1
Copy link
Copy Markdown
Contributor Author

j-t-1 commented Mar 24, 2026

Why is this test in a section which checks that warnings are emitted? As far as I can see, this always sets remove_orphans, which actually is the inverse of what I am asking for.

It does the warning and then removes duplicates. Sorry I do not understand what you mean. Do you mean split it in to two, one just tests a warning and then one tests deduplication?

Additionally, from the code it is not directly obvious what is checked here due to some "arbitrary" length checks - could you please try to improve it to make it more obvious what happened? Maybe there are some unreferenced dummy objects we could check for explicitly?

This is a good idea. How about loop over writer._objects, take the id()of each and then count using a dict? Do this before and after and then ensure any values that were greater than one are now equal to one.

@stefan6419846
Copy link
Copy Markdown
Collaborator

Do you mean split it in to two, one just tests a warning and then one tests deduplication?

Yes, because I usually tend to remove all warning code where possible when preparing a new major release, thus having this separated making it easier to maintain.

Regarding the proposed test, doing something like this should be sufficient in theory:

def test_remove_orphans():
    writer = PdfWriter(clone_from=RESOURCE_ROOT / "crazyones.pdf")
    writer._add_object(DictionaryObject({}))
    dictionary_object = DictionaryObject({NameObject("/Testing"): NameObject("/UniqueNameForTesting")})
    reference = writer._add_object(dictionary_object)

    writer.compress_identical_objects(remove_orphans=False)
    assert writer.get_object(reference) == dictionary_object

    writer.compress_identical_objects(remove_orphans=True)
    with pytest.raises(AssertionError):
        writer.get_object(reference)

The AssertionError might be replaced by a proper exception here - mypy is the wrong justification here.

@j-t-1
Copy link
Copy Markdown
Contributor Author

j-t-1 commented Mar 25, 2026

Is test_remove_orphans the test of the deprecation? If yes, it should have the warnings, if not it should use the renamed values.

Can you clarify the names of the test functions that test deprecation and test compression?

@stefan6419846
Copy link
Copy Markdown
Collaborator

No, this would be the one for the new way. I just used the old names as this is what my local testing environment for quick verification has.

Comment thread pypdf/_writer.py Outdated
Comment thread pypdf/_writer.py Outdated
Comment thread pypdf/_writer.py Outdated
Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
Comment thread tests/test_workflows.py Outdated
Comment thread tests/test_writer.py Outdated
Comment thread tests/test_writer.py Outdated
Comment thread tests/test_writer.py Outdated
Comment thread tests/test_writer.py Outdated
@stefan6419846 stefan6419846 removed the needs-change The PR/issue cannot be handled as issue and needs to be improved label Mar 26, 2026
j-t-1 and others added 4 commits March 26, 2026 13:14
@j-t-1 j-t-1 requested a review from stefan6419846 March 26, 2026 14:01
Comment thread tests/test_writer.py Outdated
@j-t-1 j-t-1 requested a review from stefan6419846 March 26, 2026 15:48
@stefan6419846 stefan6419846 merged commit 6f10e02 into py-pdf:main Mar 27, 2026
24 of 32 checks passed
@j-t-1 j-t-1 deleted the compress branch March 27, 2026 10:52
stefan6419846 added a commit that referenced this pull request Apr 10, 2026
## What's new

### Security (SEC)
- Disallow custom XML entity declarations for XMP metadata (#3724) by @stefan6419846

### New Features (ENH)
- Skip MD5 key derivation for AES-256 encrypted PDFs (#3694) by @Ygnas

### Bug Fixes (BUG)
- Use remove_orphans in compress_identical_objects (#3310) by @j-t-1
- Fix PdfReadError when xref table contains comments before trailer (#3710) by @rassie
- Correctly verify AES padding during decryption (#3699) by @stefan6419846
- Fix stale object cache from non-authoritative object streams (#3698) by @astahlman
- Fix extract_links pairing when annotations include non-links (#3687) by @ReinerBRO

### Documentation (DOC)
- Add AI policy (#3717) by @stefan6419846

[Full Changelog](6.9.2...6.10.0)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants