fix(docx): guard against None hyperlink address in _get_paragraph_elements (#2367) #3022
✅ DCO Check Passed
Thanks @HemantSudarshan, all your commits are properly signed off. 🎉
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
6c4abfd to 786177a
Codecov Report
✅ All modified and coverable lines are covered by tests.
ceberam left a comment
Thanks @HemantSudarshan for finding and fixing this bug!
Since the bug and its resolution are so obvious, I would drop the new test that you have introduced and thus avoid adding more processing to our heavy Docling test suite (even though the impact is tiny).
@HemantSudarshan in addition, could you please fix the linting issue? Please check our CONTRIBUTING.md page and, in particular, the Code Style Guidelines section.
786177a to 88d7b75
Guard Path(c.address) against None in _get_paragraph_elements to prevent TypeError when processing DOCX files with internal bookmark hyperlinks (e.g. Table of Contents entries). Internal hyperlinks use w:anchor instead of r:id, causing python-docx's Hyperlink.address to return None.

Add regression test that creates a DOCX with an internal bookmark hyperlink via raw XML and verifies successful conversion.

Closes docling-project#2367
Signed-off-by: Hemantsudarshan <hemanthsudarshan2002@gmail.com>
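The w:anchor versus r:id distinction mentioned in the commit message can be seen directly in the markup. The following is a rough sketch using only the standard library; the bookmark name `_Toc123`, the relationship ID `rId5`, and the helper `describe_hyperlink` are made up for illustration and are not taken from the PR or from python-docx.

```python
import xml.etree.ElementTree as ET

# Standard WordprocessingML and relationship namespaces.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Hypothetical fragments: the bookmark name and relationship ID are invented.
INTERNAL = (
    f'<w:hyperlink xmlns:w="{W_NS}" xmlns:r="{R_NS}" w:anchor="_Toc123">'
    "<w:r><w:t>Chapter 1</w:t></w:r></w:hyperlink>"
)
EXTERNAL = (
    f'<w:hyperlink xmlns:w="{W_NS}" xmlns:r="{R_NS}" r:id="rId5">'
    "<w:r><w:t>Example</w:t></w:r></w:hyperlink>"
)


def describe_hyperlink(xml: str) -> dict:
    """Return the anchor and relationship ID of a w:hyperlink element (None if absent)."""
    element = ET.fromstring(xml)
    return {
        "anchor": element.get(f"{{{W_NS}}}anchor"),
        "rel_id": element.get(f"{{{R_NS}}}id"),
    }


print(describe_hyperlink(INTERNAL))
print(describe_hyperlink(EXTERNAL))
```

An internal link carries only the anchor, so there is no relationship to resolve into an address, which is the situation the guard is meant to cover.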
88d7b75 to 28175c1
Description
Fixes #2367
Background
Issue #2367 reports an IndexError: list index out of range when processing DOCX files with empty paragraph runs in _get_paragraph_elements. The original crash at c.runs[0] (line 395 in v2.36.1) has since been addressed with a bounds check (if c.runs and len(c.runs) > 0).

However, a closely related crash path in the same Hyperlink handling block remains: Path(c.address) on the line immediately above raises a TypeError when c.address is None. This occurs with internal bookmark hyperlinks (e.g., Table of Contents entries, cross-references), where the DOCX XML uses w:anchor instead of a relationship ID (r:id), causing python-docx's Hyperlink.address to return None.

Both bugs share the same root cause, incomplete defensive handling of Hyperlink objects from python-docx, and affect the same class of documents (DOCX files with certain structural patterns that cause complete parsing failure).

Changes

docling/backend/msword_backend.py (1 line)
hyperlink = Path(c.address) if c.address else None
When c.address is None, hyperlink is set to None, which downstream logic already handles correctly: the hyperlink text is extracted and grouped with surrounding text instead of being wrapped in a Path object.

tests/test_backend_msword.py (56 lines)
Regression test that creates a DOCX with an internal bookmark hyperlink via raw XML, converts it with DocumentConverter, and asserts that conversion succeeds.

Why this is safe

Downstream logic already handles hyperlink is None by grouping text normally rather than emitting a link.
The guard follows the same defensive pattern as the bounds check on c.runs on the adjacent line.
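As a minimal sketch of why the one-line guard is sufficient, the snippet below reproduces the failure and the fix without python-docx. `FakeHyperlink`, `resolve_hyperlink`, and `unguarded` are hypothetical names introduced only for this illustration; the None value follows the PR description rather than a check against any particular python-docx version.

```python
from pathlib import Path
from typing import Optional


class FakeHyperlink:
    """Hypothetical stand-in for a python-docx Hyperlink; only `address` is modeled."""

    def __init__(self, address: Optional[str]) -> None:
        self.address = address


def unguarded(c: FakeHyperlink) -> Path:
    # Pre-fix behavior: Path(None) raises TypeError.
    return Path(c.address)


def resolve_hyperlink(c: FakeHyperlink) -> Optional[Path]:
    # Same expression as the one-line change: build a Path only when an address exists.
    return Path(c.address) if c.address else None


internal = FakeHyperlink(None)                   # e.g. a Table of Contents entry
external = FakeHyperlink("https://example.com")  # made-up external target

try:
    unguarded(internal)
except TypeError as exc:
    print("unguarded call failed:", type(exc).__name__)

print(resolve_hyperlink(internal))
print(resolve_hyperlink(external))
```

Because the guard tests truthiness rather than comparing against None, an empty-string address would also yield None instead of a Path.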