CFamilyLexer fails to tokenize spaces before preprocessor macro #1820

The following code fails to tokenize correctly even though it is valid C++:
```cpp
; 
 #define A 0
```
Note that there are spaces around the line break. I would expect `#define A 0` to be tokenized as a macro, but `#` produces an error token. The tokens produced are:
- `print(list(CFamilyLexer().get_tokens("; \n #define A 0")))`:
  ```
  [(Token.Punctuation, ';'), (Token.Text, ' \n '), (Token.Error, '#'), (Token.Name, 'define'), (Token.Text, ' '), (Token.Name, 'A'), (Token.Text, ' '), (Token.Literal.Number.Integer, '0'), (Token.Text, '\n')]
  ```

## What I have found
I have done some debugging and found that both spaces around the line break matter. Removing either of them produces correct tokenization.
- `print(list(CFamilyLexer().get_tokens("; \n#define A 0")))`:
  ```
  [(Token.Punctuation, ';'), (Token.Text, ' \n'), (Token.Comment.Preproc, '#'), (Token.Comment.Preproc, 'define A 0'), (Token.Comment.Preproc, '\n')]
  ```
- `print(list(CFamilyLexer().get_tokens(";\n #define A 0")))`:
  ```
  [(Token.Punctuation, ';'), (Token.Text, '\n'), (Token.Text, ' '), (Token.Comment.Preproc, '#'), (Token.Comment.Preproc, 'define A 0'), (Token.Comment.Preproc, '\n')]
  ```

As far as I understand, the relevant part of the tokenizer is the following set of rules:
```python
'whitespace': [
    # preprocessor directives: without whitespace
    ('^#', Comment.Preproc, 'macro'),
    # or with whitespace
    ('^(' + _ws1 + ')(#)',
     bygroups(using(this), Comment.Preproc), 'macro'),
    (r'\n', Text),
    (r'\s+', Text),
],
```
In the failing case, `";"` first matches as punctuation. Then `'\s+'` matches `" \n "`. At this point we would want either `'^#'` or `'^(' + _ws1 + ')(#)'` to match the `"#"`, but neither does: `^` matches only at the start of a line, and the space at the start of this line has already been consumed. Hence the tokenizer produces an error.

If we remove the space after the line break, `'\s+'` matches `" \n"`. As there is no space at the start of the next line, `'^#'` matches `"#"` and the lexer begins to tokenize a macro.

If we remove the space before the line break, `'\n'` comes before `'\s+'` in the list of regexes and consumes `"\n"`. The space after the line break is still unmatched at that point, so `'^(' + _ws1 + ')(#)'` matches it together with the `"#"`, since the space sits at the start of the line. Hence the lexer again begins to tokenize a macro.
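
The three cases can be replayed with plain `re`, without Pygments. This is only a sketch of the rule ordering described above: `WS1` is a simplified stand-in for `_ws1` (an assumption), the helper names `first_match` and `trace` are hypothetical, and the trace starts at index 1 because the leading `;` is handled by rules outside the `whitespace` state.

```python
import re

# Simplified stand-in for _ws1 (assumption); the real pattern is more permissive.
WS1 = r'\s*'

# Same order as the 'whitespace' rules quoted above.
RULES = [
    ('hash', re.compile(r'^#', re.MULTILINE)),
    ('ws+hash', re.compile('^(' + WS1 + ')(#)', re.MULTILINE)),
    ('newline', re.compile(r'\n')),
    ('whitespace', re.compile(r'\s+')),
]

def first_match(text, pos):
    """Return (rule name, match) for the first rule matching at pos."""
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        if m:
            return name, m
    return 'error', None

def trace(text, pos=1):
    """Apply the rules from pos until '#' is consumed or nothing matches."""
    steps = []
    while pos < len(text):
        name, m = first_match(text, pos)
        if m is None:
            steps.append((name, text[pos]))
            break
        steps.append((name, m.group()))
        pos = m.end()
        if '#' in m.group():
            break
    return steps

print(trace("; \n #define A 0"))  # spaces on both sides of the line break
print(trace("; \n#define A 0"))   # no space after the line break
print(trace(";\n #define A 0"))   # no space before the line break
```

Only the first call ends in an `error` step: once `\s+` has consumed the space after the line break, the position of `#` is no longer a line start, so neither `^` rule can fire.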

## My analysis

I fail to see why `^` at the start of the `Comment.Preproc` regexes is necessary. Maybe it could be removed? I also think that `\s+` should not match line breaks, but I'm not sure how easy that would be, as `\s` is locale-aware.
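
One way to keep `\s+` from crossing line breaks without enumerating whitespace characters might be the character class `[^\S\n]`, which matches anything that is whitespace but not a newline. The sketch below compares the two variants with plain `re`. It is not a patch against Pygments: `WS1` is a simplified stand-in for `_ws1`, and `build_rules` and `reaches_macro` are hypothetical helpers.

```python
import re

WS1 = r'\s*'  # simplified stand-in for _ws1 (assumption)

def build_rules(whitespace_regex):
    """Build the 'whitespace' rule list with a configurable last rule."""
    return [
        ('hash', re.compile(r'^#', re.MULTILINE)),
        ('ws+hash', re.compile('^(' + WS1 + ')(#)', re.MULTILINE)),
        ('newline', re.compile(r'\n')),
        ('whitespace', re.compile(whitespace_regex)),
    ]

def reaches_macro(text, rules, pos=1):
    """Return True if a '#' rule fires before some position fails to match.

    pos defaults to 1 so that the leading ';' (handled by other lexer
    rules) is skipped while text[0] still counts for the '^' anchor.
    """
    while pos < len(text):
        for name, pattern in rules:
            m = pattern.match(text, pos)
            if m:
                break
        else:
            return False  # no rule matched: the lexer would emit an error
        if name in ('hash', 'ws+hash'):
            return True
        pos = m.end()
    return False

current = build_rules(r'\s+')         # rule as quoted above
proposed = build_rules(r'[^\S\n]+')   # whitespace that never spans a newline

sample = "; \n #define A 0"
print(reaches_macro(sample, current))
print(reaches_macro(sample, proposed))
```

With the second variant, the space before the line break and the line break itself are consumed separately, so the lexer is at a true line start when it reaches the indented `#`.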

P.S. The example above is the smallest one I have found. I have also encountered this bug in the wild; for example, the following code fails to tokenize:
```cpp
int main() {
  prepare(); 
  #pragma omp parallel for
  for (int i = 0; i < 10; i++) {}
}
```
