fix(docx): guard against None hyperlink address in _get_paragraph_elements (#2367) #3022

Merged
PeterStaar-IBM merged 1 commit into docling-project:main from
HemantSudarshan:fix/handle-none-address-hyperlinks-2367
Feb 24, 2026

Conversation

HemantSudarshan (Contributor) commented Feb 22, 2026

Description

Fixes #2367

Background

Issue #2367 reports an IndexError: list index out of range when processing DOCX files with empty paragraph runs in _get_paragraph_elements. The original crash at c.runs[0] (line 395 in v2.36.1) has since been addressed with a bounds check (if c.runs and len(c.runs) > 0).

However, a closely related crash path in the same Hyperlink handling block remains: Path(c.address) on the line immediately above raises a TypeError when c.address is None. This occurs with internal bookmark hyperlinks (e.g., Table of Contents entries, cross-references), where the DOCX XML uses w:anchor instead of a relationship ID (r:id), causing python-docx's Hyperlink.address to return None.

Both bugs share the same root cause — incomplete defensive handling of Hyperlink objects from python-docx — and affect the same class of documents (DOCX files with certain structural patterns that cause complete parsing failure).
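
The failure mode described above can be reproduced without python-docx at all. The snippet below is a minimal sketch using only `pathlib`; the helper name `try_path` is hypothetical and exists only for illustration.

```python
from pathlib import Path


def try_path(address):
    """Return the name of the exception raised by Path(address), or None on success."""
    try:
        Path(address)
    except TypeError as exc:
        return type(exc).__name__
    return None


# Path() rejects None, which is what an unguarded Path(c.address) would hit.
failure = try_path(None)
# A regular string address constructs a Path without error.
success = try_path("https://example.com")
```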

Changes

docling/backend/msword_backend.py (1 line)

  • Added a conditional guard: hyperlink = Path(c.address) if c.address else None
  • When c.address is None, hyperlink is set to None, which downstream logic already handles correctly — the hyperlink text is extracted and grouped with surrounding text instead of being wrapped in a Path object
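
A minimal sketch of the guard in isolation, assuming a stand-in object in place of python-docx's `Hyperlink`. `FakeHyperlink` and `hyperlink_target` are hypothetical names used only here; only the conditional expression itself comes from the change.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class FakeHyperlink:
    """Hypothetical stand-in for a hyperlink object; only `address` is modeled."""

    address: Optional[str]


def hyperlink_target(c: FakeHyperlink) -> Optional[Path]:
    # Same conditional expression as the fix: build a Path only for a truthy address.
    return Path(c.address) if c.address else None


external = hyperlink_target(FakeHyperlink("https://example.com/page"))
internal = hyperlink_target(FakeHyperlink(None))
blank = hyperlink_target(FakeHyperlink(""))
```

Because the check is truthiness-based rather than `is not None`, an empty-string address also maps to `None` instead of becoming `Path(".")`.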

tests/test_backend_msword.py (56 lines)

  • Added regression test test_hyperlink_with_none_address
  • Programmatically creates a DOCX containing an internal bookmark hyperlink (w:hyperlink with w:anchor, no r:id) via raw XML manipulation
  • Converts the document via DocumentConverter and asserts:
    • No exception raised during conversion
    • Surrounding paragraph text is correctly extracted in the markdown export
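
For readers unfamiliar with the XML shape involved, the following sketch builds an internal bookmark hyperlink element with the standard library only. It is not the test from this PR, which manipulates a real DOCX; the helper names and the anchor value `_Toc000000` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# WordprocessingML and relationships namespaces used in DOCX document XML.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def make_internal_hyperlink(anchor: str, text: str) -> ET.Element:
    """Build <w:hyperlink w:anchor="..."><w:r><w:t>text</w:t></w:r></w:hyperlink>."""
    link = ET.Element(f"{{{W}}}hyperlink", {f"{{{W}}}anchor": anchor})
    run = ET.SubElement(link, f"{{{W}}}r")
    t = ET.SubElement(run, f"{{{W}}}t")
    t.text = text
    return link


def has_relationship_id(link: ET.Element) -> bool:
    """True when the hyperlink points at an external target through r:id."""
    return link.get(f"{{{R}}}id") is not None


link = make_internal_hyperlink("_Toc000000", "Chapter 1")
```

An element like this carries a bookmark name but no relationship ID, so there is no external target for a parser to resolve into an address.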

Why this is safe

  • The fix is a single conditional expression — no new branches, no changed return types
  • Downstream code in the same method (lines 611-626) already handles hyperlink is None by grouping text normally rather than emitting a link
  • All 12 existing DOCX backend tests continue to pass unchanged
  • The fix follows the same defensive pattern already used for c.runs on the adjacent line

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have added tests that prove my fix is effective
  • New and existing unit tests pass locally with my changes

github-actions bot (Contributor) commented Feb 22, 2026

DCO Check Passed

Thanks @HemantSudarshan, all your commits are properly signed off. 🎉

mergify bot (Contributor) commented Feb 22, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
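
The title rule above can be checked locally. This is a sketch that assumes the `~=` operator behaves like a regular-expression search over the PR title; `is_conventional` is a hypothetical helper name.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True when the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


ok = is_conventional(
    "fix(docx): guard against None hyperlink address in _get_paragraph_elements (#2367)"
)
bad = is_conventional("Guard against None hyperlink address")
```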

dosubot bot commented Feb 22, 2026

Related Documentation

Checked 17 published document(s) in 1 knowledge base(s). No updates required.

HemantSudarshan force-pushed the fix/handle-none-address-hyperlinks-2367 branch from 6c4abfd to 786177a February 22, 2026 15:59
ceberam added the docx (issue related to docx backend) and bug (Something isn't working) labels Feb 23, 2026
codecov bot commented Feb 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

ceberam (Member) left a comment

Thanks @HemantSudarshan for finding and fixing this bug!

Since the bug and its resolution are so obvious, I would drop the new test that you introduced and thus avoid adding more processing to our heavy Docling test suite (even though the impact is tiny).

ceberam (Member) commented Feb 23, 2026

@HemantSudarshan in addition, could you please fix the linting issue?

Please, check our CONTRIBUTING.md page and, in particular, the Code Style Guidelines section.
We strongly recommend installing pre-commit locally to prevent commits that do not pass the styling checks.

HemantSudarshan force-pushed the fix/handle-none-address-hyperlinks-2367 branch from 786177a to 88d7b75 February 23, 2026 17:25
Guard Path(c.address) against None in _get_paragraph_elements to prevent
TypeError when processing DOCX files with internal bookmark hyperlinks
(e.g. Table of Contents entries). Internal hyperlinks use w:anchor instead
of r:id, causing python-docx's Hyperlink.address to return None.

Add regression test that creates a DOCX with an internal bookmark
hyperlink via raw XML and verifies successful conversion.

Closes docling-project#2367

Signed-off-by: Hemantsudarshan <hemanthsudarshan2002@gmail.com>
HemantSudarshan force-pushed the fix/handle-none-address-hyperlinks-2367 branch from 88d7b75 to 28175c1 February 23, 2026 17:28
PeterStaar-IBM merged commit 236216e into docling-project:main Feb 24, 2026
25 checks passed
Labels

bug (Something isn't working)
docx (issue related to docx backend)

Development

Successfully merging this pull request may close these issues.

IndexError: list index out of range in msword_backend.py when processing DOCX files with empty paragraph runs

4 participants