
Fix CFamilyLexer preprocessor tokenization errors #1830

Merged
Anteru merged 1 commit into pygments:master from henkkuli:1820/henkkuli/c-lexer-preprocessor
Jun 20, 2021

Conversation

@henkkuli
Contributor

@henkkuli henkkuli commented Jun 2, 2021

CFamilyLexer fails to tokenize preprocessor macros when they are preceded by a line break surrounded by spaces. This happens because the preprocessor regex rule expects to start at the beginning of a line, but the space regex rule also matches the whitespace after the line break. The space rule has now been refined so that it does not match the line break. As a result, the preprocessor regex rule correctly matches preprocessor tokens even when they are preceded by spaces, at the cost of adding a few more tokens to the token stream in some cases.
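To make the failure mode concrete, here is a minimal sketch of a toy regex lexer built only on Python's `re` module. It is not the Pygments implementation: the `tokenize` helper, the rule names, and the exact patterns (including the `+` quantifier on the refined whitespace rule and the separate `\n` rule) are assumptions chosen for illustration.

```python
import re


def tokenize(text, whitespace_rules):
    """Toy lexer: at each position, the first rule that matches wins."""
    rules = [
        # Hypothetical preprocessor rule: anchored to the start of a line,
        # optionally preceded by horizontal whitespace.
        (re.compile(r'^[^\S\n]*#[^\n]*', re.MULTILINE), 'Preproc'),
        *whitespace_rules,
        # Anything that is neither whitespace nor '#'.
        (re.compile(r'[^\s#]+'), 'Other'),
    ]
    pos, tokens = 0, []
    while pos < len(text):
        for pattern, kind in rules:
            match = pattern.match(text, pos)
            if match:
                tokens.append((kind, match.group()))
                pos = match.end()
                break
        else:
            # No rule matched: emit a one-character error token.
            tokens.append(('Error', text[pos]))
            pos += 1
    return tokens


# Old behaviour: one greedy rule that also swallows line breaks.
OLD_WHITESPACE = [(re.compile(r'\s+'), 'Text')]
# Refined behaviour (assumed shape): line breaks and horizontal
# whitespace are matched by separate rules.
NEW_WHITESPACE = [
    (re.compile(r'\n'), 'Text'),
    (re.compile(r'[^\S\n]+'), 'Text'),
]

SOURCE = "int x;  \n  #define N 1\n"

old_tokens = tokenize(SOURCE, OLD_WHITESPACE)
new_tokens = tokenize(SOURCE, NEW_WHITESPACE)
print(old_tokens)
print(new_tokens)
```

With the greedy `\s+` rule, the whitespace token runs past the line break into the indentation of the next line, so the lexer reaches `#` in the middle of a line and the anchored rule can no longer match. With the refined rules, the lexer stops right after the line break, and the anchored rule consumes the indentation together with the directive.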

The main change is in pygments/lexers/c_cpp.py. The generic whitespace rule \s+ has been changed to [^\S\n] to avoid matching line breaks. As a consequence of this, many files under tests/examplefiles changed. All of the changes seem to be of the form

'      \n      ' Text

changed to

'      '      Text
'\n'          Text
'      '      Text
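The splitting shown above can be reproduced with plain regular expressions. This is only a sketch: the alternation `\n|[^\S\n]+` is an assumed stand-in for the refined lexer rules, not a pattern taken from `pygments/lexers/c_cpp.py`.

```python
import re

# A whitespace run containing one line break, as in the example files.
run = "      \n      "

# One greedy rule yields a single token spanning the line break.
old_split = [m.group() for m in re.finditer(r'\s+', run)]

# Treating the line break separately yields three tokens.
new_split = [m.group() for m in re.finditer(r'\n|[^\S\n]+', run)]

print(old_split)
print(new_split)
```

The concatenated text is identical in both cases; only the token boundaries differ, which is why the expected output of many example files changed without any visible difference in highlighting.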

In addition to these, the PR adds three new tests under tests/snippets which test the behavior of the preprocessor tokenizer in different situations. The test tests/snippets/c/test_preproc_file5.txt may be controversial, as it tests the behavior in a situation where the code is invalid and hence the output contains an error token. I'll let the maintainers decide whether it should be included or removed.

Fixes #1820.

CFamilyLexer failed to tokenize preprocessor macros when they were
preceded by a line break surrounded by spaces. This was the case because
the preprocessor regex rule expected to start at the beginning of the
line, but the space regex rule also matched the whitespace after the line
break. Now the space rule has been refined not to match the line break.
Because of this, the preprocessor regex rule correctly matches
preprocessor tokens even when they are preceded by whitespace, at the
cost of adding some more tokens in the token stream in some cases. This
change preserves the behavior of invalid preprocessor usage failing to
tokenize.
@Anteru Anteru added the changelog-update Items which need to get mentioned in the changelog label Jun 20, 2021
@Anteru Anteru self-assigned this Jun 20, 2021
@Anteru Anteru merged commit fea1fbc into pygments:master Jun 20, 2021
@Anteru
Collaborator

Anteru commented Jun 20, 2021

Merged, thanks!

@Anteru Anteru added this to the 2.10 milestone Jul 18, 2021
@Anteru Anteru added A-lexing area: changes to individual lexers and removed changelog-update Items which need to get mentioned in the changelog labels Aug 8, 2021

Labels

A-lexing area: changes to individual lexers

Development

Successfully merging this pull request may close these issues.

CFamilyLexer fails to tokenize spaces before preprocessor macro

2 participants