Skip to content

lexers: devicetree: Fix catastrophic backtracking bug#3057

Merged
birkenfeld merged 2 commits intopygments:masterfrom
grahamroff-dev:devicetree-backtracking-fix
Mar 7, 2026
Merged

lexers: devicetree: Fix catastrophic backtracking bug#3057
birkenfeld merged 2 commits intopygments:masterfrom
grahamroff-dev:devicetree-backtracking-fix

Conversation

@grahamroff-dev
Copy link
Copy Markdown
Contributor

The regex for parsing devicetree statements contains a property name lookahead that results in a near infinite loop if there are a lot of whitespace characters in the property value.

Problem can be seen by running this example script:

from pygments import highlight
from pygments.lexers import DevicetreeLexer
from pygments.formatters import NullFormatter

def test():
    dts_contents = '''
/ {
	soc {
		plic: interrupt-controller@c000000 {
			interrupts-extended = < &hlic0
			                        0x9 >;
		};
    }
}
    '''
    print('Starting highlight')
    highlight(dts_contents, DevicetreeLexer(), NullFormatter())

if __name__ == "__main__":
    test()

The script will appear to hang forever (technically it should eventually finish, maybe in a day or so).

Root Cause: Catastrophic Backtracking

The hang is caused by catastrophic backtracking in the property name lookahead regex inside lexers/devicetree.py.

The problematic pattern in the statements token rule is:

(r'[a-zA-Z_][\w-]*(?=(?:\s*,\s*[a-zA-Z_][\w-]*|(?:' + _ws + r'))*\s*[=;])',
 Name),

The second alternative inside the outer * is \s*(?:/[*][^*/]*?[*]/\s*)* (the _ws pattern), which can match an empty string. This creates the classic catastrophic backtracking pattern (A|B)* where B can match empty.

More details: In the example the lookahead is applied to \n\t\t\t 0x9 >; (~28 whitespace characters before 0x9). Since 0x9 > is not = or ;, the lookahead must fail — but the regex engine tries to split those 28 whitespace characters among the outer * iterations in 2²⁸ ≈ 268 million different ways before giving up. When 0x9 is on the same line without the extra indentation, the whitespace is just one space (2¹ = 2 attempts), so it completes instantly.

Fix applied: restructured the lookahead so the outer * can only match when a , is present (it can never match empty string):

(r'[a-zA-Z_][\w-]*(?=(?:\s*,' + _ws + r'[a-zA-Z_][\w-]*)*' + _ws + r'[=;])',
 Name),

This correctly handles all cases (comma-separated property names, comments before =/;) while eliminating the exponential backtracking.

The regex for parsing devicetree statements contains a property name
lookahead that results in a near infinite loop if there are a lot of
whitespace characters in the property value.
Restructure the lookahead regex to avoid this scenario.
@birkenfeld
Copy link
Copy Markdown
Member

Thanks for the PR! Can you add the example to the test cases?

Update the devicetree example file to include the problematic
syntax that causes catastrophic backtracking.
@grahamroff-dev
Copy link
Copy Markdown
Contributor Author

Thanks for the PR! Can you add the example to the test cases?

I updated the test case to include the problematic syntax, thanks!

@birkenfeld birkenfeld merged commit 524b5d3 into pygments:master Mar 7, 2026
15 checks passed
@birkenfeld
Copy link
Copy Markdown
Member

Great! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-lexing area: changes to individual lexers

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants