BUG: Fix extract_links pairing when annotations include non-links by ReinerBRO · Pull Request #3687 · py-pdf/pypdf

ReinerBRO · 2026-03-18T19:08:49Z

Summary

make extract_links() ignore non-link annotations before pairing old/new links
avoid dropping valid links when one side has additional non-link annotation entries
add a regression test for offset non-link annotation positions

Why

Issue #3667 points out that extract_links() can pair via zip() on raw annotation arrays, which can lose links when array positions diverge. Pairing only extracted links keeps behavior stable for mixed annotation arrays.

Tests

. .venv/bin/activate && python -m pytest tests/generic/test_link.py -q
. .venv/bin/activate && python -m pytest tests/test_merger.py -k add_page -q
. .venv/bin/activate && pre-commit run --files pypdf/generic/_link.py tests/generic/test_link.py

Closes #3667

codecov · 2026-03-18T19:17:08Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.43%. Comparing base (da867f4) to head (ae6a45d).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #3687   +/-   ##
=======================================
  Coverage   97.43%   97.43%           
=======================================
  Files          55       55           
  Lines       10003    10005    +2     
  Branches     1837     1838    +1     
=======================================
+ Hits         9746     9748    +2     
  Misses        149      149           
  Partials      108      108

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

stefan6419846

Thanks for the PR.

Could you please elaborate on how this solves the issue with mismatching list sizes? Additionally, I think this still be can be a comprehension?

new_links = list(filter(None, [_build_link(link, new_page) for link in new_annotations]))

Besides this, please revert all unrelated changes which do not belong to this PR.

ReinerBRO · 2026-03-18T20:24:00Z

Followed up on the review:

reverted the unrelated tests/__init__.py changes so this PR is back to the link-pairing fix plus its regression test
switched the filtering back to comprehensions
kept the pairing change as zip(filtered_new_links, filtered_old_links)

Why this fixes the mismatch issue: the old code zipped the raw annotation-derived lists first and only filtered out non-links afterwards. If one side had an extra non-link annotation, the zip alignment could shift or drop a valid link pair. Filtering each side before zipping preserves the relative order of actual link annotations and avoids that offset.

Validation:

.venv/bin/python -m pytest tests/generic/test_link.py -q
ruff check pypdf/generic/_link.py tests/generic/test_link.py tests/__init__.py

stefan6419846 · 2026-03-20T11:30:44Z

I just tested with #3667 (comment) and it seems like there is no change before and after this PR. In which real-world example would I see a difference now?

If I understand the code correctly, the condition/logging could be enabled again if moved after the filtering. Then we would see a real effect for the aforementioned example as the URI-based link annotations would be correctly filtered out now, thus having matching sizes afterwards.

By the way: Are there any benefits when using

    new_links = [
        link
        for annotation in new_annotations
        if (link := _build_link(annotation, new_page)) is not None
    ]

instead of

    new_links = list(filter(None, (_build_link(link, new_page) for link in new_annotations)))

as initially proposed?

ReinerBRO · 2026-03-20T12:28:54Z

Followed up in the latest push.

You were right that the previous revision did not visibly change the issue example yet, because I had only changed the pairing logic and had not restored the size check after filtering. I have now moved the size warning to the filtered-link lists instead.

That means there is now a real observable difference for a case like the issue discussion: if one side contains an extra URI / non-GoTo annotation, it gets filtered out first, so we no longer warn just because the raw annotation arrays differ while the actual internal links still match.

A concrete example where behavior changes now:

old page annotations: [GoTo]
new page annotations: [URI, GoTo]

Before this change, zipping the raw annotation-derived lists could drop the valid GoTo pair entirely. Now the URI annotation is ignored first, the GoTo links are paired correctly, and no mismatch warning is emitted for that case.

On the comprehension form: there is no strong requirement for the walrus form, but I chose it because is not None is explicit and avoids relying on truthiness. If you prefer the filter(None, ...) variant for style consistency, I can switch that too.

Validation:

.venv/bin/python -m pytest tests/generic/test_link.py -q
ruff check pypdf/generic/_link.py tests/generic/test_link.py

ReinerBRO · 2026-03-22T18:28:42Z

The remaining macOS CI failure looks unrelated to the link-pairing changes in this PR.

I checked the failing job and it is currently failing in tests/test_cmap.py::test_text_extraction_fast[...] because tests/pdf_cache/ASurveyofImageClassificationBasedTechniques.pdf is missing on the runner (FileNotFoundError). The link-specific tests added/updated here are passing locally, and the other CI jobs for this PR are green.

I can follow up further on the PR code if needed, but this specific macOS failure does not appear to come from the changes in pypdf/generic/_link.py.

ReinerBRO · 2026-03-23T04:31:31Z

Latest CI rerun is green now, including macOS. Together with the earlier follow-up that moved the size warning after filtering and added the URI/non-GoTo regression case, the current head should now be ready for another look when you have a chance.

Fix extract_links pairing with non-link annotations

7f1077e

ROB: make test fixture downloads deterministic

37e55ea

ReinerBRO changed the title ~~Fix extract_links pairing when annotations include non-links~~ BUG: Fix extract_links pairing when annotations include non-links Mar 18, 2026

ROB: avoid race creating pdf_cache during concurrent prefetch

f33c89b

stefan6419846 requested changes Mar 18, 2026

View reviewed changes

BUG: Narrow link pairing follow-up to requested changes

b6daaa6

BUG: filter annotations before size warning

4d32c3f

Merge branch 'main' into fix-3667-stable-link-pairing

ae6a45d

stefan6419846 approved these changes Mar 24, 2026

View reviewed changes

stefan6419846 merged commit 23d6683 into py-pdf:main Mar 24, 2026
18 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Fix extract_links pairing when annotations include non-links#3687

BUG: Fix extract_links pairing when annotations include non-links#3687
stefan6419846 merged 6 commits intopy-pdf:mainfrom
ReinerBRO:fix-3667-stable-link-pairing

ReinerBRO commented Mar 18, 2026

Uh oh!

codecov bot commented Mar 18, 2026 •

edited

Loading

Uh oh!

stefan6419846 left a comment

Uh oh!

ReinerBRO commented Mar 18, 2026

Uh oh!

stefan6419846 commented Mar 20, 2026 •

edited

Loading

Uh oh!

ReinerBRO commented Mar 20, 2026

Uh oh!

ReinerBRO commented Mar 22, 2026

Uh oh!

ReinerBRO commented Mar 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ReinerBRO commented Mar 18, 2026

Summary

Why

Tests

Uh oh!

codecov bot commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

stefan6419846 left a comment

Choose a reason for hiding this comment

Uh oh!

ReinerBRO commented Mar 18, 2026

Uh oh!

stefan6419846 commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ReinerBRO commented Mar 20, 2026

Uh oh!

ReinerBRO commented Mar 22, 2026

Uh oh!

ReinerBRO commented Mar 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov bot commented Mar 18, 2026 •

edited

Loading

stefan6419846 commented Mar 20, 2026 •

edited

Loading