
CFamilyLexer fails to tokenize spaces before preprocessor macro #1820

@henkkuli

Description

The following code fails to tokenize even though it is valid C++:

; 
 #define A 0

Note that there are spaces on both sides of the line break. I would expect #define A 0 to be tokenized as a macro, but the # produces an error token. The tokens produced are:

  • print(list(CFamilyLexer().get_tokens("; \n #define A 0"))):
    [(Token.Punctuation, ';'), (Token.Text, ' \n '), (Token.Error, '#'), (Token.Name, 'define'), (Token.Text, ' '), (Token.Name, 'A'), (Token.Text, ' '), (Token.Literal.Number.Integer, '0'), (Token.Text, '\n')]
    

What I have found

I have done some debugging and found that the spaces around the line break are important. Removing either of the spaces produces correct tokenization.

  • print(list(CFamilyLexer().get_tokens("; \n#define A 0"))):
    [(Token.Punctuation, ';'), (Token.Text, ' \n'), (Token.Comment.Preproc, '#'), (Token.Comment.Preproc, 'define A 0'), (Token.Comment.Preproc, '\n')]
    
  • print(list(CFamilyLexer().get_tokens(";\n #define A 0"))):
    [(Token.Punctuation, ';'), (Token.Text, '\n'), (Token.Text, ' '), (Token.Comment.Preproc, '#'), (Token.Comment.Preproc, 'define A 0'), (Token.Comment.Preproc, '\n')]
    

As far as I understand, the relevant parts of the tokenizer are the following definitions:

'whitespace': [
    # preprocessor directives: without whitespace
    ('^#', Comment.Preproc, 'macro'),
    # or with whitespace
    ('^(' + _ws1 + ')(#)',
     bygroups(using(this), Comment.Preproc), 'macro'),
    (r'\n', Text),
    (r'\s+', Text),
],

Here, in the failing case, ";" is first matched as punctuation. Then '\s+' matches " \n ". Now we would want either '^#' or '^(' + _ws1 + ')(#)' to match the "#", but neither does: ^ matches only at the start of a line, and the space at the start of this line has already been consumed. Hence the tokenizer produces an error token.
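The anchoring behaviour can be checked with plain `re`, without Pygments. This is a minimal sketch, assuming the lexer compiles its rules with `re.MULTILINE` and tries each one with `pattern.match(text, pos)` at the current position; the variable names are made up for illustration.

```python
import re

text = "; \n #define A 0"

# Stand-ins for two of the 'whitespace' rules (names are illustrative only).
space = re.compile(r'\s+', re.MULTILINE)
preproc = re.compile(r'^#', re.MULTILINE)

# After ';' has been consumed, the lexer is at position 1.
ws = space.match(text, 1)
hash_pos = ws.end()              # '\s+' swallowed " \n ", so we now sit on '#'

# '^' only matches at the string start or right after a newline.
# The character before '#' is a space, so the rule cannot match here.
print(preproc.match(text, hash_pos))  # None

# Without the space after the line break, '#' directly follows '\n'.
print(preproc.match("; \n#define A 0", 3) is not None)  # True
```

Even though matching starts exactly at the "#", `^` still looks at the preceding character, which is why consuming the leading space beforehand breaks the rule.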

In the case that we remove the space after line break, '\s+' matches " \n". As there is no space at the start of the line, '^#' matches "#" and the lexer begins to tokenize a macro.

In the case that we remove the space before the line break, '\n' comes before \s+ in the list of regexes and consumes "\n". At this point we still haven't matched the space, but it gets matched by '^(' + _ws1 + ')(#)' as it is at the start of the line. Hence the lexer again begins to tokenize a macro.
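The three cases can be traced with a toy rule-ordered matcher. This is only a sketch of the behaviour described above, not the real lexer: `_ws1` is simplified to `\s+`, a `punct` rule is added so that ";" is consumed, and `RULES` and `trace` are hypothetical names.

```python
import re

# Rules are tried in order at the current position, first match wins.
RULES = [
    ("preproc", re.compile(r'^#', re.MULTILINE)),
    ("ws+preproc", re.compile(r'^(\s+)(#)', re.MULTILINE)),  # _ws1 simplified
    ("newline", re.compile(r'\n', re.MULTILINE)),
    ("space", re.compile(r'\s+', re.MULTILINE)),
    ("punct", re.compile(r';', re.MULTILINE)),
]

def trace(text):
    """Return the names of the rules that fire until '#' is consumed.

    The list ends with "error" if no rule matches at the current position.
    """
    pos, fired = 0, []
    while pos < len(text):
        for name, rx in RULES:
            m = rx.match(text, pos)
            if m:
                fired.append(name)
                if '#' in m.group():
                    return fired      # the directive was recognised
                pos = m.end()
                break
        else:
            fired.append("error")     # nothing matched at this position
            return fired
    return fired

for sample in ("; \n #define A 0", "; \n#define A 0", ";\n #define A 0"):
    print(repr(sample), trace(sample))
```

Only the first sample ends in "error". The second reaches '^#' after '\s+' has consumed " \n", and the third reaches the whitespace-prefixed rule after '\n' has been consumed on its own.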

My analysis

I fail to see why ^ at the start of the Comment.Preproc regexes would be necessary. Maybe it could be removed? I also think that \s+ should not match line breaks, but I'm not sure how easy that would be, as \s is locale-aware.
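To illustrate the second idea, here is a rough sketch that swaps `\s+` for `[^\S\n]+` (whitespace other than a newline) in a toy matcher. It is not a patch for the lexer: `_ws1` is simplified to `\s+`, the `other` rule is a catch-all added so the samples can be consumed, and `rule_for_hash` is a hypothetical helper.

```python
import re

def rule_for_hash(text, space_pattern):
    """Return the name of the rule that consumes the first '#', or "error"."""
    rules = [
        ("preproc", r'^#'),
        ("ws+preproc", r'^(\s+)(#)'),   # _ws1 simplified to \s+
        ("newline", r'\n'),
        ("space", space_pattern),       # the rule under test
        ("other", r'[^\s#]+'),          # catch-all for everything else
    ]
    compiled = [(name, re.compile(p, re.MULTILINE)) for name, p in rules]
    pos = 0
    while pos < len(text):
        for name, rx in compiled:
            m = rx.match(text, pos)
            if m:
                if '#' in m.group():
                    return name
                pos = m.end()
                break
        else:
            return "error"
    return None

minimal = "; \n #define A 0"
pragma = "int main() {\n  prepare(); \n  #pragma omp parallel for\n}\n"

for sample in (minimal, pragma):
    print(rule_for_hash(sample, r'\s+'), rule_for_hash(sample, r'[^\S\n]+'))
```

With `\s+` both samples end in "error". With `[^\S\n]+` the trailing space, the line break, and the indentation are consumed separately, so the position after the newline is still a line start and the whitespace-prefixed rule can match.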

P.S. The example above is the smallest one I have found. I have also encountered this bug in the wild; for example, the following code fails to tokenize as well:

int main() {
  prepare(); 
  #pragma omp parallel for
  for (int i = 0; i < 10; i++) {}
}

Labels

changelog-update: Items which need to get mentioned in the changelog
