Skip to content

Quoted extensions#8820

Merged
gasche merged 15 commits intoocaml:trunkfrom
Drup:stringquot
Nov 25, 2019
Merged

Quoted extensions#8820
gasche merged 15 commits intoocaml:trunkfrom
Drup:stringquot

Conversation

@Drup
Copy link
Copy Markdown
Contributor

@Drup Drup commented Jul 19, 2019

This is the follow up of discussion started in #8802. cc @lpw25 @gasche @bikallem

This introduces a new shortcut for ppx extensions. Quoted extensions of the form {%foo|....|}
are desugared into [%foo{|...|}]. ... can be arbitrary content, similar to quoted strings.

The goal of such quoted extensions is to allow easier insertion of sublanguages. For instance, tyxml's ppx currently uses the form:

let content = [%html{|<div id="content">some content</div>|}]

With quoted extensions, we can get rid of string syntax, which makes it lighter

let content = {%html| <div id="content">some content</div>|}

This would be useful for many things, such as better support for PPX that implement jsx, sql, regex, etc.
Currently this syntax does not implement antiquotations. The plan is to first implement antiquotations outside the compiler (using the ocaml parser directly) as a library, experiment with it, before eventually upstreaming a concrete syntax. @lpw25 also proposed to add string interpolation to the language, which would interact well.

Examples are available in the testfile. Here are some examples.

(* Structure/signature forms *)
{%%M.foo| <hello>{x} |}

(* Structure with explicit terminator *)
{%%M.foo bar| <hello>{|x|} |bar}

(* Expressions/Pattern/Types *)
let {%M.foo| <hello>{x} |} : {%M.foo| <hello>{x} |} = {%M.foo| <hello>{x} |}

(* Multiline *)
{%%M.foo|
 <hello>
   {x}
 </hello>
|}

Implementation Discussion

The implementation is fairly straighforward, but almost completely in the lexer. This means the mechanism to build the extension id is different from other extensions and does not have the same sensitivity to whitespaces ([%M. bar. baz "hello" ] is allowed in extensions, but not in the new quoted extensions). I would tend to say such things should be forbidden.

One more thorny discussion is where to place the locations for the internal string. Currently, the locations are mostly wrong. One proposition would be to change the type of Pconst_string to add a location which only spans the inside of the string. Having to do index computations to figure this out inside a ppx is fairly common, and rather annoying. This would solve the question once and for all.

@gasche
Copy link
Copy Markdown
Member

gasche commented Jul 19, 2019

We need to have some syntax to distinguish the barred delimiters {%foo-bar|...|bar} from the nonbarred delimiters {%foo ...}, but I don't really like the choice of - (despite having proposed it myself). Could we consider, for example, {%foo: ... } (non-significant spaces) in the non-barred case and {%foo bar|...|bar} in the barred case? That is, using a space in the barred case, and some other suggestive token (but not a dot) in the non-barred case?

@lpw25
Copy link
Copy Markdown
Contributor

lpw25 commented Jul 19, 2019

My vote would be for just having the "barred" versions. I think that the simplicity and the consistency with the existing quotation syntax is worth the cost of an extra couple of unobtrusive characters.

@bikallem
Copy link
Copy Markdown

bikallem commented Jul 19, 2019

In the above example,

let content = {%html <div id="content">some content</div>}

Does the quoted string work when we have the balanced use of } delimiter as mentioned in #8802 (comment)?

Is the syntax below supported?
let div1 = {%jsx <div id="id1">{text("hello world")}</div>}

That feature would enable ppxes to support a large number of dsl. I think it best if we leave the anti-quotation mechanism to ppx authors to exploit and implement.

@bikallem
Copy link
Copy Markdown

To provide more context to my example above, here is the jsx syntax I would like to eventually support/implement in ocaml. https://github.com/bikallem/jsx-ssr/blob/master/examples/JsxTester.re

Note that the example is my current experiments with the ReasonML JSX syntax. I would like to implement the same syntax with the addition of {%jsx } bit.

@cfcs
Copy link
Copy Markdown

cfcs commented Jul 19, 2019

  1. OCaml is already full of magical and surprising syntax variations like ; ; in match leafs and whatnot -- in my opinion they make things harder to read/learn, and they make things harder to refactor. I tentatively agree with @lpw25; and if we can avoid introducing syntax that works on single lines but fails across multiple lines, I would be happy. That being said, [%html{|<div id="content">some content</div>|}] is pretty ugly, and I think this PR is a good idea.

  2. I find it confusing that div1 and div2 below are essentially equivalent, yet look pretty different:
    Why does div2 have two percentage signs, but `div1 only has one? (copied from @Drup 's example without knowing what I'm doing)

Would replacing div1 with something like the div3 given be too ugly?

let div1 = {%M.foo <hello> <hi> }
let div2 = {%%M.foo| <hello>
 <hi> |}

let div3 = {%M.foo| <hello> <hi> |}
  1. Agree with @gasche that - is easily overlooked. I think the {%M.foo bar| hello hi |bar} suggestion seems pleasant.

@Drup
Copy link
Copy Markdown
Contributor Author

Drup commented Jul 20, 2019

Some clarifications:

  • All forms accept multiline.
  • The only difference between the two forms are what the end delimiters look like. either }, |} or |xxx}, by increasing verbosity.
  • The difference between % and %% is exactly the same as for regular ppx extensions: one is for structure/signature items, the other is for expression/pattern/types. It's consistent with the rest of the language.
  • The shortest form doesn't care about balanced content. Since the (slightly) longer form exists, I don't think it's necessary.

I strongly think we should optimize for the most common form, i.e, the short one. For quoted strings, the long form {xx|....|xx} is almost never used, since the short form {|...|} is sufficient 99% of the time. Given that, I don't see why using dash is a problem. It'll almost never be used anyway.

@gasche
Copy link
Copy Markdown
Member

gasche commented Jul 20, 2019

Some remarks:

  • The one use-case that started this discussion (the domain-specific JSX syntax) profusely uses braces, which led to my idea of assuming well-balancing the non-barred case. Requiring well-balancing is not a perfect solution either, due to braces occurring within string literals (typically using single- or double-quotes) in the embedded language). It supports more use-cases without any extra syntactic weight, but it makes the specification more complex and less regular.

  • Speaking of regularity, there is no precedent for {...} being used as opaque literals, while there is precedent for the barred forms. We would need a strong justification to add the non-barred form, and I'm not convinced by the gain/risk ratio of the current proposal. (We add one syntax to shorten the short path by one character; but it is non-expected and makes it likely that users run into unexpected-end-of-literal issue, then having to move to a less-discoverable alternative.)

  • It's not very nice that someone that would write {%foo bar|...|bar} would have their content silently accepted by your grammar as a non-barred payload. (They probably would get an error from the extension parser.) One more reason to get rid of the non-barred case at first.

@Drup
Copy link
Copy Markdown
Contributor Author

Drup commented Jul 20, 2019

Alright, I still would prefer the shortest version, but consistency is more important. I'll change the PR to only add barred versions.

Any opinion on the position question?

@gasche
Copy link
Copy Markdown
Member

gasche commented Jul 20, 2019

Your plan wrt. position sounds very reasonable to me.

@Drup
Copy link
Copy Markdown
Contributor Author

Drup commented Jul 22, 2019

Done and done. All the locations should be decent. The patch to add locations to string constants is big, but boring.

@bikallem
Copy link
Copy Markdown

@Drup @gasche Will this be merged for ocaml 4.10.0 perhaps?

@gasche
Copy link
Copy Markdown
Member

gasche commented Sep 12, 2019

The implementation is fine, but there hasn't been enough discussion among language maintainers to create a consensus on the new language feature. We may discuss it on the occasion of an upcoming development meeting, or in an online discussion round.

@bikallem
Copy link
Copy Markdown

@gasche ping. Any movement on this perhaps? I think this new ppx syntax along with the upcoming ppxlib work (https://discuss.ocaml.org/t/the-future-of-ppx/3766) will make the ppx system in ocaml much more ergonomic and useful for use cases other than UI(jsx/html) such as sql, configs, json and so on. Would really love to see this merged soonish. 🥇

@Drup
Copy link
Copy Markdown
Contributor Author

Drup commented Oct 21, 2019

I just rebased. The change entry is still in the 4.10 section, but I suspect I'll have to move it before merging.

@Drup Drup force-pushed the stringquot branch 2 times, most recently from aba02be to 890b644 Compare November 13, 2019 14:27
@Drup
Copy link
Copy Markdown
Contributor Author

Drup commented Nov 13, 2019

Rebased, Short manual added, All tests pass. :)

@gasche
Copy link
Copy Markdown
Member

gasche commented Nov 13, 2019

We had a developer meeting today where the syntax as currently proposed (only the barred form, with a space between the extension and quote name) was approved in principle. We still need someone to do a thorough review of the implementation. I did a review of an earlier version of the patch (without the location changes), this would need to be reviewed again.

let kwdopchar =
['$' '&' '*' '+' '-' '/' '<' '=' '>' '@' '^' '|']

let extattrident = identchar+ ('.' identchar+)*
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would require a comment to explain why the definition is as such. If I understand correctly, you are trying to accept the same as the parser's attr_id category, modulo whitespace differences.

The parser definition requires components (between the dots) to be valid OCaml identifiers or keywords. Your definition accepts components that start with a number. Why not require a (lowercase | uppercase) at the head of components?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

let store_escaped_uchar lexbuf u =
if in_comment () then store_lexeme lexbuf else store_string_utf_8_uchar u

let with_string f lexbuf =
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use a clearer name for f? consume_string?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually even with_string is a confusing name; usually we call with_foo functions of the form init -> (foo -> 'a) -> 'a. What about build_string_token or just string_token?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's called with_string because it's exactly the same as with_comment which is defined just underneath. Should I also change that function ?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have a semi-valid argument for leaving the code exactly as ugly as it currently is. I think you should do as you think is best.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to wrap_(string|comment)_lexer.

%token STAR
%token <string * string option> STRING
%token <string * Location.t * string option> STRING
%token <string * string * Location.t * string option> BRACEPERCENTBRACE
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dislike the token name. To me a token named BRACEPERCENTBRACE looks like {%{ in program text. You should call it QUOTED_STRING_EXTENSION or something.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the other tokens are named after an approximation of the syntax they represent in term of characters, not in term of semantics, but I don't care enough to argue either way.

Copy link
Copy Markdown
Member

@gasche gasche Nov 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly: a token called BRACEPERCENTBRACE should have the text representation {%{, not the whole string. We call the string token STRING, not DOUBLEQUOTE.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

let wrap_mksig_ext ~loc (item, ext) =
wrap_sig_ext ~loc (mksig ~loc item) ext

let mk_quotedext ~loc ~shift (id, content, strloc, delim) =
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's late and my head hurts looking at this code. Why is it so complex if you did all the work in the lexer?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because life is not all pink and sometimes you need to get your hand dirty with locations. :)

The complex part of this code computes the position of foo.bar in {%foo.bar-baz|...baz|}. It's only complicated because the representation of position is mildly akward.

Moving this code to the lexer is feasible, but requires differentiating code path between strings and quoted extensions. I do not see how that improves anything.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that I find suspicious for example, is that the values passed to the ~shift parameter depend on the lexical representation of the token. This sounds like lexer-level logic rather than parser-level logic.

At least it would be nice there were comments on what various parts of the code do. (What is the shift argument for? What is the difference between loc and strloc?)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last commit moves it to the lexer.

Co-Authored-By: Alain Frisch <alain@frisch.fr>
Copy link
Copy Markdown
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the new implementation approach easier to read, thanks. (Having a name for the auxiliary concept of "get the location of the identifier" makes it easier to follow.)

I think (again) that this is generally good to go. Please see the last question about the use of ghost locations in the grammar rule.

@gasche gasche merged commit 05f15cf into ocaml:trunk Nov 25, 2019
@314eter
Copy link
Copy Markdown
Contributor

314eter commented Dec 5, 2019

Quoted extensions should probably be handled in comments too, such that (* {%foo|*)|} *) is a valid comment.

@gasche
Copy link
Copy Markdown
Member

gasche commented Dec 5, 2019

I would expect {%foo|...|} to be equivalent to [%foo {|...|}] in any fragment of OCaml syntax; this indeed supports your suggestion.

@gasche
Copy link
Copy Markdown
Member

gasche commented Dec 5, 2019

We would like the following extension of the testsuite to work properly. It currently does not:

diff --git a/testsuite/tests/parsing/quotedextensions.ml b/testsuite/tests/parsing/quotedextensions.ml
index 6fcb9e2654..3962cef32d 100644
--- a/testsuite/tests/parsing/quotedextensions.ml
+++ b/testsuite/tests/parsing/quotedextensions.ml
@@ -30,3 +30,15 @@ let {%M.foo bar| <hello>{|x|} |bar}
    {x}
  </hello>
 |}
+
+(* Double quotes inside quoted strings inside comments: *)
+(* {|"|}, and *)
+(* [%foo {|"|}], and *)
+(* {%foo|"|}
+  should be valid inside comments *)
+
+(* Double quotes inside quoted strings inside comments: *)
+(* {|*)|}, and *)
+(* [%foo {|*)|}], and *)
+(* {%foo|*)|}
+  should be valid inside comments *)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants