Fix comments within function declarations in C (#1891) by lambda-karlculus · Pull Request #2140 · pygments/pygments

lambda-karlculus · 2022-05-17T00:49:07Z

Fixes #1891 and other combinations of comments within function declarations in C.

Comments were not properly detected between the parts of function declarations. This was fixed by detecting those comments and passing them to a minimal lexer. A new test snippet file was created to test the fix.

The regex to detect comments was tweaked to handle whitespace better. A group near the final punctuation in the function regex had been accidentally left out. It is now handled with using(this).

The existing C parsing code places newlines into their own token instead of combining them with other whitespace. The new comment parser now does the same, so it does not break any existing tests.

jeanas

Sorry for the delay. Sounds good, a few comments and we should be good to go.

pygments/lexers/c_cpp.py

amitkummer · 2022-05-29T15:32:51Z

Also, I think that fixing this minor issue by adding this much logic to an already complex part of the lexer is uncalled for, and can introduce maintenance problems. Please simplify this @lambda-karlculus.

jeanas · 2022-05-29T15:38:00Z

> Also, I think that fixing this minor issue by adding this much logic to an already complex part of the lexer is uncalled for, and can introduce maintenance problems. Please simplify this @lambda-karlculus.

IMHO, with using(this, 'comments') and a variable for the regex, it will be OK.

This now fails the tests with: TypeError: get_tokens_unprocessed() got an unexpected keyword argument 'stack'. Do not merge.
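The error quoted here is an ordinary Python signature mismatch. As a minimal sketch with stand-in classes (this is not Pygments itself; `BaseLexer`, `OldOverride`, `FixedOverride`, `call_with_state`, and `try_call` are hypothetical names chosen for illustration), a caller that forwards a `stack` keyword breaks any override that does not accept it:

```python
class BaseLexer:
    """Stand-in for a lexer base class whose method accepts `stack`."""

    def get_tokens_unprocessed(self, text, stack=('root',)):
        # Report the innermost state so the effect of `stack` is visible.
        yield 0, stack[-1], text


class OldOverride(BaseLexer):
    """Override written before callers started passing `stack`."""

    def get_tokens_unprocessed(self, text):
        yield from super().get_tokens_unprocessed(text)


class FixedOverride(BaseLexer):
    """Override that accepts `stack` and forwards it to the base class."""

    def get_tokens_unprocessed(self, text, stack=('root',)):
        yield from super().get_tokens_unprocessed(text, stack)


def call_with_state(lexer, text, state):
    """Mimic a caller that re-lexes `text` starting in a given state."""
    return list(lexer.get_tokens_unprocessed(text, stack=('root', state)))


def try_call(lexer):
    """Return the tokens, or the TypeError message if the call fails."""
    try:
        return call_with_state(lexer, "/* c */", "whitespace")
    except TypeError as exc:
        return str(exc)


print(try_call(OldOverride()))    # a TypeError message mentioning 'stack'
print(try_call(FixedOverride()))  # tokens produced in the 'whitespace' state
```

In this sketch the fix is only to widen the override's signature with a default of `('root',)`, which matches the adjustment described in the thread.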

jeanas · 2022-05-30T11:09:49Z

OK, I went ahead and changed your PR directly (don't expect me to do that every time 😄).

Fixed the basic failure: using(this, state=...) requires that the get_tokens_unprocessed method accept a stack optional argument defaulting to ('root',). (Adjusted in subclasses too.)
Fixed some test failures in tests/examplefiles/freefem. This was super tricky. The comment regexes used things like [\w\W]*? for the inner content of the comment. For a standalone regex, that is fine: if you run it on a comment, it will stop at the end of the comment because *? is non-greedy. But embedded within another regex, it can cause trouble, because if the part of the larger regex after this one doesn't match with the smallest possible comment, it backtracks and *? starts matching more. It was marking huge swaths of code as comments …
The catastrophic backtracking in _possible_comments was still there! 😄 I've fixed it.
Also simplified: the 'comments' state was actually useless because it was a subset of the 'whitespace' state.
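The non-greedy pitfall described above can be reproduced with the standard `re` module. The following is a minimal sketch under stated assumptions: `LAZY_COMMENT` imitates the `[\w\W]*?` style, `STRICT_COMMENT` is one possible replacement whose body cannot step over a closing `*/`, and neither pattern nor the sample input is taken from the PR.

```python
import re

# Lazy comment body, in the spirit of the [\w\W]*? form discussed above.
LAZY_COMMENT = r"/\*[\w\W]*?\*/"
# Alternative body that can never consume a closing "*/" (an assumption,
# not necessarily the pattern used in the PR).
STRICT_COMMENT = r"/\*(?:[^*]|\*(?!/))*\*/"

SOURCE = "/* a */ x = 1; /* b */ ;"

# Standalone, the lazy pattern stops at the first "*/".
standalone = re.match(LAZY_COMMENT, SOURCE).group()

# Embedded in a larger regex (comment, optional whitespace, semicolon),
# the lazy body keeps growing until the rest of the pattern can match.
lazy_embedded = re.match("(" + LAZY_COMMENT + r")\s*;", SOURCE)

# The strict body cannot grow past "*/", so the larger regex simply fails.
strict_embedded = re.match("(" + STRICT_COMMENT + r")\s*;", SOURCE)

print(repr(standalone))              # '/* a */'
print(repr(lazy_embedded.group(1)))  # '/* a */ x = 1; /* b */'
print(strict_embedded)               # None
```

With the lazy pattern, the code between the two comments ends up inside the "comment" group, which is the kind of mislabelling reported for the freefem example files.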

@amitkummer I think this is good to go. You have more experience than me tinkering with the CFamilyLexer, so I'll leave this for a while in case you want to take another look.

lambda-karlculus · 2022-05-30T12:58:29Z

Thank you! That's amazing!

Thanks for explaining the backtracking and errors in the regex. I tried to reuse the existing regexes, but understand now why that does not work.
I agree that the 'comments' state is just a subset of 'whitespace' and should be removed.
The first point (that using(this, state=...) requires get_tokens_unprocessed accept the stack argument) is interesting. It is not clear from the documentation that the former requires the latter. Unless I'm wrong here I may send a new pull request to add that to the docs.
I noticed you made a change to stop "separating newline tokens". This was the existing behavior. Should this be changed in the 'whitespace' state?

jeanas · 2022-05-30T13:15:35Z

> I noticed you made a change to stop "separating newline tokens". This was the existing behavior. Should this be changed in the 'whitespace' state?

I did that one as a simplification before I realized that the 'comments' state could be dropped entirely. As far as I can see, the 'whitespace' state does this in order to be able to use ^ (it needs to stop at newlines for that). Well, it looks like it could also be refactored in a simpler way not requiring this, but let's not mix too many changes in this PR.
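A small sketch of why stopping at newlines matters for `^`, using only the standard `re` module. The `PREPROC` rule and the sample source are hypothetical and only illustrate the mechanism; they are not the lexer's actual rules.

```python
import re

# Hypothetical rule anchored at the start of a line, e.g. for a
# preprocessor directive with optional indentation.
PREPROC = re.compile(r"^[ \t]*#", re.MULTILINE)

SOURCE = "int x;\n  #define Y 1\n"

# Position right after the newline: where matching resumes if the
# whitespace rule emits "\n" as its own token.
line_start = SOURCE.index("\n") + 1
# Position after the indentation: where matching would resume if one
# whitespace token swallowed the newline and the following spaces.
after_indent = line_start + 2

at_line_start = PREPROC.match(SOURCE, line_start)
past_indent = PREPROC.match(SOURCE, after_indent)

print(at_line_start is not None)  # True: "^" matches just after "\n"
print(past_indent)                # None: "^" cannot match mid-line
```

When a whitespace token absorbs the newline together with the indentation, the next match attempt starts mid-line and a `^`-anchored rule no longer applies there.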

jeanas · 2022-05-30T13:16:30Z

> Unless I'm wrong here I may send a new pull request to add that to the docs.

That would be welcome.

amitkummer · 2022-05-30T16:10:53Z

Looks excellent now @jean-abou-samra, thanks!

jeanas · 2022-05-30T17:47:40Z

Thanks for reviewing, merging then.

lambda-karlculus and others added 7 commits May 9, 2022 14:21

Add comment detection around functions in C family

36663b5

Add docstring to CFamilyComments

fa837f4

Tweak regex handling in c_cpp lexer to handle comments

8ea711d

The regex to detect comments was tweaked to handle whitespace better. A group near the final punctuation in the function regex was accidentally forgotten. The group is now handled using(this).

Update the CFamily Comment Parser to separate out newlines

17c0411

The existing C parsing code places newlines into their own token, instead of combining them with other whitespace. This change results in the new parser not breaking any existing tests.

Add test snippet for c to test comments around function declarations

a6c5899

Merge branch 'pygments:master' into fix-function-comments

52a1b30

Merge branch 'pygments:master' into fix-function-comments

81f1669

jeanas requested changes May 29, 2022

lambda-karlculus and others added 7 commits May 30, 2022 15:34

Merge branch 'pygments:master' into fix-function-comments

480d760

Made requested changes as per the pull request

2d2ccf6

This now fails the tests with: TypeError: get_tokens_unprocessed() got an unexpected keyword argument 'stack'. Do not merge.

Use API compatible with using(this, state=...)

a692482

Fix those comment regexes

62ceab4

Fix catastrophic backtracking in the _possible_comments regex

8cae0ce

Don't try to "separate newline tokens"

a18f430

Simplify: use already existing 'whitespace' state

fbbb222

jeanas approved these changes May 30, 2022

jeanas merged commit a9641d7 into pygments:master May 30, 2022

jeanas added this to the 2.13.0 milestone May 30, 2022

jeanas merged 14 commits into pygments:master from lambda-karlculus:fix-function-comments

3 participants
