fix(docx): guard against None hyperlink address in _get_paragraph_elements (#2367) by HemantSudarshan · Pull Request #3022 · docling-project/docling

HemantSudarshan · 2026-02-22T15:58:05Z

Description

Background

Issue #2367 reports an IndexError: list index out of range when processing DOCX files with empty paragraph runs in _get_paragraph_elements. The original crash at c.runs[0] (line 395 in v2.36.1) has since been addressed with a bounds check (if c.runs and len(c.runs) > 0).

However, a closely related crash path in the same Hyperlink handling block remains: Path(c.address) on the line immediately above raises a TypeError when c.address is None. This occurs with internal bookmark hyperlinks (e.g., Table of Contents entries, cross-references), where the DOCX XML uses w:anchor instead of a relationship ID (r:id), causing python-docx's Hyperlink.address to return None.
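The difference between the two hyperlink forms can be sketched with the standard library alone. The helper names `make_hyperlink` and `is_internal_bookmark` are hypothetical and are not part of docling or python-docx; the namespace URIs are the standard WordprocessingML and relationships namespaces.

```python
import xml.etree.ElementTree as ET

# Standard OOXML namespaces for the w: and r: prefixes.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def make_hyperlink(text, anchor=None, rel_id=None):
    """Build a w:hyperlink element (hypothetical helper, for illustration only)."""
    link = ET.Element(f"{{{W}}}hyperlink")
    if anchor is not None:
        link.set(f"{{{W}}}anchor", anchor)  # internal bookmark target
    if rel_id is not None:
        link.set(f"{{{R}}}id", rel_id)  # relationship ID of an external target
    run = ET.SubElement(link, f"{{{W}}}r")
    t = ET.SubElement(run, f"{{{W}}}t")
    t.text = text
    return link


def is_internal_bookmark(link):
    """True when the element has w:anchor but no r:id."""
    has_anchor = link.get(f"{{{W}}}anchor") is not None
    has_rel_id = link.get(f"{{{R}}}id") is not None
    return has_anchor and not has_rel_id


toc_entry = make_hyperlink("1. Introduction", anchor="_Toc000001")
external = make_hyperlink("Project site", rel_id="rId5")
```

Here `toc_entry` models the kind of Table of Contents link described above, which has no relationship to resolve into an address.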

Both bugs share the same root cause — incomplete defensive handling of Hyperlink objects from python-docx — and affect the same class of documents (DOCX files with certain structural patterns that cause complete parsing failure).

Changes

docling/backend/msword_backend.py (1 line)

- Added a conditional guard: hyperlink = Path(c.address) if c.address else None
- When c.address is None, hyperlink is set to None, which downstream logic already handles correctly: the hyperlink text is extracted and grouped with surrounding text instead of being wrapped in a Path object
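A minimal sketch of the guard in isolation, assuming a stand-in object rather than a real python-docx Hyperlink. `FakeHyperlink`, `resolve_hyperlink`, and `unguarded_fails` are hypothetical names used only for illustration; the conditional expression itself is the one quoted above.

```python
from pathlib import Path
from typing import Optional


class FakeHyperlink:
    """Hypothetical stand-in for a hyperlink object; only `address` is modeled."""

    def __init__(self, address: Optional[str]) -> None:
        self.address = address


def resolve_hyperlink(c: FakeHyperlink) -> Optional[Path]:
    # Same conditional expression as the fix: wrap only a truthy address.
    return Path(c.address) if c.address else None


def unguarded_fails(c: FakeHyperlink) -> bool:
    # The pre-fix behaviour: Path(None) raises TypeError.
    try:
        Path(c.address)
    except TypeError:
        return True
    return False
```

Because the guard is a truthiness check, an empty-string address is treated the same way as None.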

tests/test_backend_msword.py (56 lines)

- Added regression test test_hyperlink_with_none_address
- Programmatically creates a DOCX containing an internal bookmark hyperlink (w:hyperlink with w:anchor, no r:id) via raw XML manipulation
- Converts the document via DocumentConverter and asserts:
  - No exception raised during conversion
  - Surrounding paragraph text is correctly extracted in the markdown export

Why this is safe

- The fix is a single conditional expression: no new branches, no changed return types
- Downstream code in the same method (lines 611-626) already handles hyperlink is None by grouping text normally rather than emitting a link
- All 12 existing DOCX backend tests continue to pass unchanged
- The fix follows the same defensive pattern already used for c.runs on the adjacent line

Type of change

Bug fix (non-breaking change which fixes an issue)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
New feature (non-breaking change which adds functionality)
This change requires a documentation update

Checklist

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have added tests that prove my fix is effective
New and existing unit tests pass locally with my changes

github-actions · 2026-02-22T15:58:17Z

✅ DCO Check Passed

Thanks @HemantSudarshan, all your commits are properly signed off. 🎉

mergify · 2026-02-22T15:58:40Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
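The title rule above can be checked locally with Python's `re` module. This is a sketch that assumes the pattern behaves the same under Python's regex engine as under the merge bot's; `is_conventional` is a hypothetical helper name.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True when the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.match(title) is not None


pr_title = (
    "fix(docx): guard against None hyperlink address "
    "in _get_paragraph_elements (#2367)"
)
```

The optional `(scope)` group and the optional `!` marker both precede the mandatory colon, so `fix:`, `fix(docx):`, and `feat!:` are all accepted.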

dosubot · 2026-02-22T15:59:22Z

Related Documentation

Checked 17 published document(s) in 1 knowledge base(s). No updates required.

codecov · 2026-02-23T08:39:11Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

ceberam

Thanks @HemantSudarshan for finding and fixing this bug!

Since the bug and the resolution are so obvious, I would drop the new test that you have introduced and thus avoid more processing in our heavy Docling test suite (even though the impact is tiny).

ceberam · 2026-02-23T16:06:28Z

@HemantSudarshan in addition, could you please fix the linting issue?

Please, check our CONTRIBUTING.md page and, in particular, the Code Style Guidelines section.
We strongly recommend installing pre-commit locally to prevent commits that do not pass the styling checks.

Guard Path(c.address) against None in _get_paragraph_elements to prevent TypeError when processing DOCX files with internal bookmark hyperlinks (e.g. Table of Contents entries). Internal hyperlinks use w:anchor instead of r:id, causing python-docx's Hyperlink.address to return None.

Add regression test that creates a DOCX with an internal bookmark hyperlink via raw XML and verifies successful conversion.

Closes docling-project#2367

Signed-off-by: Hemantsudarshan <hemanthsudarshan2002@gmail.com>

HemantSudarshan force-pushed the fix/handle-none-address-hyperlinks-2367 branch from 6c4abfd to 786177a Compare February 22, 2026 15:59

ceberam added the docx ("issue related to docx backend") and bug ("Something isn't working") labels Feb 23, 2026

ceberam requested changes Feb 23, 2026

HemantSudarshan force-pushed the fix/handle-none-address-hyperlinks-2367 branch from 786177a to 88d7b75 Compare February 23, 2026 17:25

HemantSudarshan force-pushed the fix/handle-none-address-hyperlinks-2367 branch from 88d7b75 to 28175c1 Compare February 23, 2026 17:28

ceberam approved these changes Feb 23, 2026

dolfim-ibm approved these changes Feb 23, 2026

PeterStaar-IBM merged commit 236216e into docling-project:main Feb 24, 2026
25 checks passed
