Skip to content

Fix comments within function declarations in C (#1891)#2140

Merged
jeanas merged 14 commits intopygments:masterfrom
lambda-karlculus:fix-function-comments
May 30, 2022
Merged

Fix comments within function declarations in C (#1891)#2140
jeanas merged 14 commits intopygments:masterfrom
lambda-karlculus:fix-function-comments

Conversation

@lambda-karlculus
Copy link
Copy Markdown
Contributor

Fixes #1891 and other combinations of comments within function declarations in C.

Comments were not properly detected between the parts of function declarations. This was fixed by detecting those comments and passing them to a minimal lexer. A new test snippet file was created to test the fix.

lambda-karlculus and others added 7 commits May 9, 2022 14:21
The regex to detect comments was tweaked to handle whitespace better.
A group near the final punctuation in the function regex was
accidentally forgotten. The group is now handled using(this).
The existing C parsing code places newlines into their own token,
instead of combining them with other whitespace. This change results
in the new parser not breaking any exisiting tests.
Copy link
Copy Markdown
Contributor

@jeanas jeanas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. Sounds good, a few comments and we should be good to go.

@amitkummer
Copy link
Copy Markdown
Contributor

Also, I think that fixing this minor issue by adding this much logic to an already complex part of the lexer is uncalled for, and can introduce maintenance problems. Please simplify this @lambda-karlculus.

@jeanas
Copy link
Copy Markdown
Contributor

jeanas commented May 29, 2022

Also, I think that fixing this minor issue by adding this much logic to an already complex part of the lexer is uncalled for, and can introduce maintenance problems. Please simplify this @lambda-karlculus.

IMHO, with using(this, 'comments') and a variable for the regex, it will be OK.

@jeanas
Copy link
Copy Markdown
Contributor

jeanas commented May 30, 2022

OK, I went ahead and changed your PR directly (don't expect me to do that every time 😄).

  • Fixed the basic failure: using(this, state=...) requires that the get_tokens_unprocessed method accept a stack optional argument defaulting to ('root',). (Adjusted in subclasses too.)
  • Fixed some test failures in tests/examplefiles/freefem. This was super tricky. The comment regexes used things like [\w\W]*? for the inner content of the comment. For a standalone regex, that is fine: if you run it on a comment, it will stop at the end of the comment because *? is non-greedy. But embedded within another regex, it can cause trouble, because if the part of the larger regex after this one doesn't match with the smallest possible comment, it backtracks and *? starts matching more. It was marking huge swaths of code as comments …
  • The catastrophic backtracking in _possible_comments was still there! 😄 I've fixed it.
  • Also simplified: the 'comments' state was actually useless because it was a superset of the 'whitespace' state.

@amitkummer I think this is good to go. You have more experience than me tinkering with the CFamilyLexer, so I'll leave this for a while in case you want to take another look.

@lambda-karlculus
Copy link
Copy Markdown
Contributor Author

Thank you! That's amazing!

  • Thanks for explaining the backtracking and errors in the regex. I tried to reuse the existing regexes, but understand now why that does not work.
  • I agree that the 'comments' state is just a subset of 'whitespace' and should be removed.
  • The first point (that using(this, state=...) requires get_tokens_unprocessed accept the stack argument) is interesting. It is not clear from the documentation that the former requires the latter. Unless I'm wrong here I may send a new pull request to add that to the docs.
  • I noticed you made a change to stop "separating newline tokens". This was the existing behavior. Should this be changed in the 'whitespace' state?

@jeanas
Copy link
Copy Markdown
Contributor

jeanas commented May 30, 2022

  • I noticed you made a change to stop "separating newline tokens". This was the existing behavior. Should this be changed in the 'whitespace' state?

I did that one as a simplification before I realized that the 'comments' state could be dropped entirely. As far as I can see, the 'whitespace' state does this in order to be able to use ^ (it needs to stop at newlines for that). Well, it looks like it could also be refactored in a simpler way not requiring this, but let's not mix too many changes in this PR.

@jeanas
Copy link
Copy Markdown
Contributor

jeanas commented May 30, 2022

  • Unless I'm wrong here I may send a new pull request to add that to the docs.

That would be welcome.

@amitkummer
Copy link
Copy Markdown
Contributor

amitkummer commented May 30, 2022

Looks excellent now @jean-abou-samra, thanks!

@jeanas
Copy link
Copy Markdown
Contributor

jeanas commented May 30, 2022

Thanks for reviewing, merging then.

@jeanas jeanas merged commit a9641d7 into pygments:master May 30, 2022
@jeanas jeanas added this to the 2.13.0 milestone May 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

C-language code-block lexing error with specific comment syntax "//@"

3 participants