BUG: Fix extract_links pairing when annotations include non-links#3687
BUG: Fix extract_links pairing when annotations include non-links#3687stefan6419846 merged 6 commits intopy-pdf:mainfrom
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3687 +/- ##
=======================================
Coverage 97.43% 97.43%
=======================================
Files 55 55
Lines 10003 10005 +2
Branches 1837 1838 +1
=======================================
+ Hits 9746 9748 +2
Misses 149 149
Partials 108 108 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
stefan6419846
left a comment
There was a problem hiding this comment.
Thanks for the PR.
Could you please elaborate on how this solves the issue with mismatching list sizes? Additionally, I think this still be can be a comprehension?
new_links = list(filter(None, [_build_link(link, new_page) for link in new_annotations]))Besides this, please revert all unrelated changes which do not belong to this PR.
|
Followed up on the review:
Why this fixes the mismatch issue: the old code zipped the raw annotation-derived lists first and only filtered out non-links afterwards. If one side had an extra non-link annotation, the zip alignment could shift or drop a valid link pair. Filtering each side before zipping preserves the relative order of actual link annotations and avoids that offset. Validation:
|
|
I just tested with #3667 (comment) and it seems like there is no change before and after this PR. In which real-world example would I see a difference now? If I understand the code correctly, the condition/logging could be enabled again if moved after the filtering. Then we would see a real effect for the aforementioned example as the URI-based link annotations would be correctly filtered out now, thus having matching sizes afterwards. By the way: Are there any benefits when using new_links = [
link
for annotation in new_annotations
if (link := _build_link(annotation, new_page)) is not None
]instead of new_links = list(filter(None, (_build_link(link, new_page) for link in new_annotations)))as initially proposed? |
|
Followed up in the latest push. You were right that the previous revision did not visibly change the issue example yet, because I had only changed the pairing logic and had not restored the size check after filtering. I have now moved the size warning to the filtered-link lists instead. That means there is now a real observable difference for a case like the issue discussion: if one side contains an extra URI / non-GoTo annotation, it gets filtered out first, so we no longer warn just because the raw annotation arrays differ while the actual internal links still match. A concrete example where behavior changes now:
Before this change, zipping the raw annotation-derived lists could drop the valid On the comprehension form: there is no strong requirement for the walrus form, but I chose it because Validation:
|
|
The remaining macOS CI failure looks unrelated to the link-pairing changes in this PR. I checked the failing job and it is currently failing in I can follow up further on the PR code if needed, but this specific macOS failure does not appear to come from the changes in |
|
Latest CI rerun is green now, including macOS. Together with the earlier follow-up that moved the size warning after filtering and added the URI/non-GoTo regression case, the current head should now be ready for another look when you have a chance. |
Summary
extract_links()ignore non-link annotations before pairing old/new linksWhy
Issue #3667 points out that
extract_links()can pair viazip()on raw annotation arrays, which can lose links when array positions diverge. Pairing only extracted links keeps behavior stable for mixed annotation arrays.Tests
. .venv/bin/activate && python -m pytest tests/generic/test_link.py -q. .venv/bin/activate && python -m pytest tests/test_merger.py -k add_page -q. .venv/bin/activate && pre-commit run --files pypdf/generic/_link.py tests/generic/test_link.pyCloses #3667