Skip to content

Lex identifiers in comments#1901

Merged
damiendoligez merged 4 commits intoocaml:trunkfrom
314eter:apostrophe-in-comment
Apr 30, 2019
Merged

Lex identifiers in comments#1901
damiendoligez merged 4 commits intoocaml:trunkfrom
314eter:apostrophe-in-comment

Conversation

@314eter
Copy link
Copy Markdown
Contributor

@314eter 314eter commented Jul 12, 2018

I don't know whether it's documented anywhere, but it seems comments, strings and character literals in comments are lexed such that all code can be commented. But there is one exception.

(* f' '"' *)

is lexed as the character literal ' ' followed by the unclosed string "' *).

This can be solved by lexing identifiers in comments too.

Copy link
Copy Markdown
Contributor

@nojb nojb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Copy link
Copy Markdown
Member

@damiendoligez damiendoligez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait a second:

  • I'm pretty sure this is not complete.
  • @xavierleroy will probably have objections too.

@xavierleroy
Copy link
Copy Markdown
Contributor

I guess @damiendoligez summons me so that I can express my usual concern about this topic.

Comments are intended first and foremost for writing prose in natural language. Commenting code out comes second. "Lexing" the contents of comments using the same lexical rules as for OCaml code might improve the "commenting out" behavior, but should not complicate the writing of natural language.

Here, I think @damiendoligez has counterexamples showing that lexing identifiers is not enough, and I would like to be reassured that normal uses of single quotes in text is not impacted.

@314eter
Copy link
Copy Markdown
Contributor Author

314eter commented Jul 12, 2018

I agree that commenting out code is not what comments are intended for, but comments are used for documentation too, and thus can contain code. In any case, the current situation is that the content of comments is checked for strings and character literals, which maybe already complicates things more than it should, especially if it doesn't even work. I think this simple fix makes it work in all cases without complicating things further, but maybe I'm missing something.

The only use of single quotes in natural language that is affected, is when the sequence '"' (single-double-single) occurs, and even then it's only a problem in certain combinations like a'"' or a' '"'.

I stumbled upon it while trying to let Atom highlight comments correctly, which turns out to be rather difficult, and I don't think any editor does it right (but I'm getting there).

@damiendoligez
Copy link
Copy Markdown
Member

I don't think any editor does it right

Emacs (caml-font.el) did, at some point before the lexing of number literals was simplified. It should be easier now.

@damiendoligez
Copy link
Copy Markdown
Member

Give me a few days to think about completeness and check this against OPAM packages.

@pmetzger
Copy link
Copy Markdown
Member

Side note: I almost wish that there was a distinct syntax for commenting out blocks of code, because I sometimes get myself in trouble when I include double quotes and other such things in comments. The fact that comments aren't pure uninterpreted blocks of text is a double edged sword. It makes it much easier to comment out code, but it makes the behavior of comments as documentation a bit more unexpected.

@alainfrisch
Copy link
Copy Markdown
Contributor

I almost wish that there was a distinct syntax

Note especially lightweight, but you can use quoted strings:

(*{foo|
      blabl"abla
   |foo}*)

@pmetzger
Copy link
Copy Markdown
Member

Note especially lightweight, but you can use quoted strings:

I suppose you can, though you need caution if commenting out the last part of a function. More to the point, though, this doesn't fix the fact that a comment like:

(* recognizes: ' " @ + *)

is unexpectedly hard to write.

@damiendoligez
Copy link
Copy Markdown
Member

I think Alain's point is that you just write:

(*{| recognizes: ' " @ + |}*)

I.e. this is a "special syntax" for free-form comments, not for commented code.

@alainfrisch
Copy link
Copy Markdown
Contributor

alainfrisch commented Jul 13, 2018

I suppose you can, though you need caution if commenting out the last part of a function.

What do you mean?

More to the point, though, this doesn't fix the fact that a comment like:

(*{|
recognizes: ' " @ +
|}*)

@314eter
Copy link
Copy Markdown
Contributor Author

314eter commented Jul 13, 2018

Not especially lightweight, but you can use quoted strings

This doesn't work for docstrings. Something like "quoted comments" (e.g. (*foo| recognizes: ' " @ + |foo*)) would work, but it's too late for that.

@pmetzger
Copy link
Copy Markdown
Member

What do you mean?

I mean that if one is naively writing comments and is used to other languages, one believes that the contents of a comment are free text, but they aren't. You can't just write whatever you feel like inside them. If there was a distinct way to comment out code blocks, this would not have to be the case. But that's not going to happen so it's just me musing.

@alainfrisch alainfrisch reopened this Jul 13, 2018
@314eter
Copy link
Copy Markdown
Contributor Author

314eter commented Jul 13, 2018

Emacs (caml-font.el) did, at some point before the lexing of number literals was simplified.

I only tried tuareg mode, which does it wrong. caml-font.el handles comments correctly, but is worse than tuareg for other things.

@pmetzger
Copy link
Copy Markdown
Member

@314eter Possibly a good reason to file a report with the Tuareg team, pointing them at the code that works correctly.

@314eter
Copy link
Copy Markdown
Contributor Author

314eter commented Jul 14, 2018

The same problem occurs in ocamllex, where something like

rule token = parse _ { f' '"' }

gives an error, and it can be solved the same way in lex/lexer.mll.

Can I include that in this PR, or is it better to open a new one?

@damiendoligez
Copy link
Copy Markdown
Member

I think it's better to open a new PR for ocamllex. This is a language change, while for ocamllex it's a bug fix.

Copy link
Copy Markdown
Member

@damiendoligez damiendoligez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do believe this is complete, and a good change. It doesn't break any OPAM package (that can be compiled with trunk).

However, we'll need to update caml-font.el accordingly. I can help with that.

@314eter
Copy link
Copy Markdown
Contributor Author

314eter commented Jul 16, 2018

I have no experience with emacs and/or lisp, so I'll probably mess something up if I try to update caml-font.el.

@314eter
Copy link
Copy Markdown
Contributor Author

314eter commented Jul 18, 2018

Fixing caml-font.el was not as difficult as I thought.

@pmetzger
Copy link
Copy Markdown
Member

Fixing caml-font.el was not as difficult as I thought.

Would similar fixes to the tuareg code be straightforward?

@314eter
Copy link
Copy Markdown
Contributor Author

314eter commented Dec 3, 2018

@damiendoligez Do you have time to take a look at this (and maybe #1932) again?

@314eter
Copy link
Copy Markdown
Contributor Author

314eter commented Feb 26, 2019

I removed caml-font.el and opened ocaml/caml-mode#4 instead.

Copy link
Copy Markdown
Member

@damiendoligez damiendoligez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to merge as soon as Changes is rebased. And apologies for the delay.

@314eter 314eter force-pushed the apostrophe-in-comment branch from d0bca98 to 4a72f59 Compare April 26, 2019 13:23
@314eter
Copy link
Copy Markdown
Contributor Author

314eter commented Apr 29, 2019

I rebased.

@damiendoligez damiendoligez merged commit 9f904f9 into ocaml:trunk Apr 30, 2019
@damiendoligez
Copy link
Copy Markdown
Member

Merged, thanks!

@kit-ty-kate
Copy link
Copy Markdown
Member

In ocaml/opam-repository#16114 (comment) the current maintainer of pfff (cc @aryx) did hit an issue that seems to have been carried by this change in the lexer.

For example, comparing the same code with OCaml 4.09.0 and 4.10.0, we get a different result:

        OCaml version 4.09.0

# type t = Test of unit (* '"' or b'"' *) * unit list * unit (* '"' *);;
type t = Test of unit * unit list * unit

vs.

        OCaml version 4.10.0

# type t = Test of unit (* '"' or b'"' *) * unit list * unit (* '"' *);;
type t = Test of unit

Is this change intended?

@314eter
Copy link
Copy Markdown
Contributor Author

314eter commented Mar 31, 2020

That change is intended, since comments are used to comment out OCaml code or to write OCaml code in documentation. In OCaml, b' would be an identifier, followed by the string "' *)...(* '".

Unfortunately, this breaks in other languages like PHP. I don't know why it didn't fail to build when @damiendoligez tested this change on opam (#1901 (review)).

@kit-ty-kate
Copy link
Copy Markdown
Member

I don't know why it didn't fail to build when @damiendoligez tested this change on opam (#1901 (review)).

The latest version of pfff that was available on opam did not support OCaml >= 4.09, so this particular one wasn't tested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants