Skip to content

BUG: Fix extract_links pairing when annotations include non-links#3687

Merged
stefan6419846 merged 6 commits intopy-pdf:mainfrom
ReinerBRO:fix-3667-stable-link-pairing
Mar 24, 2026
Merged

BUG: Fix extract_links pairing when annotations include non-links#3687
stefan6419846 merged 6 commits intopy-pdf:mainfrom
ReinerBRO:fix-3667-stable-link-pairing

Conversation

@ReinerBRO
Copy link
Contributor

Summary

  • make extract_links() ignore non-link annotations before pairing old/new links
  • avoid dropping valid links when one side has additional non-link annotation entries
  • add a regression test for offset non-link annotation positions

Why

Issue #3667 points out that extract_links() can pair via zip() on raw annotation arrays, which can lose links when array positions diverge. Pairing only extracted links keeps behavior stable for mixed annotation arrays.

Tests

  • . .venv/bin/activate && python -m pytest tests/generic/test_link.py -q
  • . .venv/bin/activate && python -m pytest tests/test_merger.py -k add_page -q
  • . .venv/bin/activate && pre-commit run --files pypdf/generic/_link.py tests/generic/test_link.py

Closes #3667

@codecov
Copy link

codecov bot commented Mar 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.43%. Comparing base (da867f4) to head (ae6a45d).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3687   +/-   ##
=======================================
  Coverage   97.43%   97.43%           
=======================================
  Files          55       55           
  Lines       10003    10005    +2     
  Branches     1837     1838    +1     
=======================================
+ Hits         9746     9748    +2     
  Misses        149      149           
  Partials      108      108           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ReinerBRO ReinerBRO changed the title Fix extract_links pairing when annotations include non-links BUG: Fix extract_links pairing when annotations include non-links Mar 18, 2026
Copy link
Collaborator

@stefan6419846 stefan6419846 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR.

Could you please elaborate on how this solves the issue with mismatching list sizes? Additionally, I think this still be can be a comprehension?

new_links = list(filter(None, [_build_link(link, new_page) for link in new_annotations]))

Besides this, please revert all unrelated changes which do not belong to this PR.

@ReinerBRO
Copy link
Contributor Author

Followed up on the review:

  • reverted the unrelated tests/__init__.py changes so this PR is back to the link-pairing fix plus its regression test
  • switched the filtering back to comprehensions
  • kept the pairing change as zip(filtered_new_links, filtered_old_links)

Why this fixes the mismatch issue: the old code zipped the raw annotation-derived lists first and only filtered out non-links afterwards. If one side had an extra non-link annotation, the zip alignment could shift or drop a valid link pair. Filtering each side before zipping preserves the relative order of actual link annotations and avoids that offset.

Validation:

  • .venv/bin/python -m pytest tests/generic/test_link.py -q
  • ruff check pypdf/generic/_link.py tests/generic/test_link.py tests/__init__.py

@stefan6419846
Copy link
Collaborator

stefan6419846 commented Mar 20, 2026

I just tested with #3667 (comment) and it seems like there is no change before and after this PR. In which real-world example would I see a difference now?

If I understand the code correctly, the condition/logging could be enabled again if moved after the filtering. Then we would see a real effect for the aforementioned example as the URI-based link annotations would be correctly filtered out now, thus having matching sizes afterwards.

By the way: Are there any benefits when using

    new_links = [
        link
        for annotation in new_annotations
        if (link := _build_link(annotation, new_page)) is not None
    ]

instead of

    new_links = list(filter(None, (_build_link(link, new_page) for link in new_annotations)))

as initially proposed?

@ReinerBRO
Copy link
Contributor Author

Followed up in the latest push.

You were right that the previous revision did not visibly change the issue example yet, because I had only changed the pairing logic and had not restored the size check after filtering. I have now moved the size warning to the filtered-link lists instead.

That means there is now a real observable difference for a case like the issue discussion: if one side contains an extra URI / non-GoTo annotation, it gets filtered out first, so we no longer warn just because the raw annotation arrays differ while the actual internal links still match.

A concrete example where behavior changes now:

  • old page annotations: [GoTo]
  • new page annotations: [URI, GoTo]

Before this change, zipping the raw annotation-derived lists could drop the valid GoTo pair entirely. Now the URI annotation is ignored first, the GoTo links are paired correctly, and no mismatch warning is emitted for that case.

On the comprehension form: there is no strong requirement for the walrus form, but I chose it because is not None is explicit and avoids relying on truthiness. If you prefer the filter(None, ...) variant for style consistency, I can switch that too.

Validation:

  • .venv/bin/python -m pytest tests/generic/test_link.py -q
  • ruff check pypdf/generic/_link.py tests/generic/test_link.py

@ReinerBRO
Copy link
Contributor Author

The remaining macOS CI failure looks unrelated to the link-pairing changes in this PR.

I checked the failing job and it is currently failing in tests/test_cmap.py::test_text_extraction_fast[...] because tests/pdf_cache/ASurveyofImageClassificationBasedTechniques.pdf is missing on the runner (FileNotFoundError). The link-specific tests added/updated here are passing locally, and the other CI jobs for this PR are green.

I can follow up further on the PR code if needed, but this specific macOS failure does not appear to come from the changes in pypdf/generic/_link.py.

@ReinerBRO
Copy link
Contributor Author

Latest CI rerun is green now, including macOS. Together with the earlier follow-up that moved the size warning after filtering and added the URI/non-GoTo regression case, the current head should now be ready for another look when you have a chance.

@stefan6419846 stefan6419846 merged commit 23d6683 into py-pdf:main Mar 24, 2026
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

extract_links regularly zip()s with mismatching sizes (when concatenating files)

2 participants