Skip to content

String interpolation#72

Merged
aryx merged 10 commits intotree-sitter:masterfrom
cfroystad:strings
Aug 31, 2021
Merged

String interpolation#72
aryx merged 10 commits intotree-sitter:masterfrom
cfroystad:strings

Conversation

@cfroystad
Copy link
Collaborator

@cfroystad cfroystad commented Apr 27, 2021

See #34

After some blood, sweat and tears, I proudly present interpolated strings for PHP!

If the code can be improved in any way or there is an edge case I've missed, please educate me!

Big thanks to @ahelwer for helping me get unstuck on this issue!

Added nodes:

  • encapsed_string
  • escape_sequence
  • _simple_string_part (for inline $ usage without curly brackets)
  • _complex_string_part (for the complex notation using {$...})
  • _simple_string_member_access_expression, _simple_string_subscript_expression

Examples:

"This is {$great}";

(expression_statement
  (encapsed_string
    (string)
    (variable_name (name))
  )
)

"This works: {$arr['key']}";

(expression_statement
  (encapsed_string
    (string)
    (subscript_expression
      (variable_name (name))
      (string)
    )
  )
)

"Hello $people, you're awesome!";

(expression_statement
  (encapsed_string
    (string)
    (variable_name (name))
    (string)
  )
)

"$people->john drank some $juices[0] juice.".PHP_EOL;

(expression_statement
  (binary_expression
    (encapsed_string
      (member_access_expression
        (variable_name (name))
        (name)
      )
      (string)
      (subscript_expression
        (variable_name (name))
        (integer)
      )
      (string)
    )
    (qualified_name (name))
  )
)

Checklist:

  • All tests pass in CI
  • There are sufficient tests for the new fix/feature (see string.txt)
  • Grammar rules have not been renamed unless absolutely necessary (0 rules renamed)
  • The conflicts section hasn't grown too much (0 new conflicts)
  • The parser size hasn't grown too much (master: 2595, PR: 2626)

@cfroystad cfroystad mentioned this pull request May 1, 2021
32 tasks
@cfroystad

This comment has been minimized.

@ghost
Copy link

ghost commented Jul 16, 2021

I just want to thank you @cfroystad and @maxbrunsfeld for doing all the work here. I have no idea how all this parsing stuff works, but you make my day job A LOT easier ❤️

@cfroystad

This comment has been minimized.

@cfroystad cfroystad changed the title WIP: Initial work on string interpolation Initial work on string interpolation Aug 18, 2021
@cfroystad cfroystad marked this pull request as ready for review August 18, 2021 17:21
@cfroystad cfroystad requested a review from maxbrunsfeld August 18, 2021 17:22
@ahelwer
Copy link

ahelwer commented Aug 18, 2021

I see you took option two, very nice! One trip-up I should have mentioned when branching off values in the valid_symbols array: if tree-sitter encounters a parse error and enters error recovery mode, the first thing it does is call your external scanner with all symbols marked as valid (I first thought this was a bug). So it's probably a good idea to detect this has happened and then decide what to do; see my code here: https://github.com/tlaplus-community/tree-sitter-tlaplus/blob/4c05e1c54e1e26624b5f9cccc840159ab8bb924b/src/scanner.cc#L1626

@cfroystad
Copy link
Collaborator Author

Thanks for yet another helpful insight!

Since the main grammar is in the internal parser, I suppose it's correct to do as you've done in your code: Hand control back from the external scanner and rather have the main parser call it again if it should find it useful.

@ahelwer
Copy link

ahelwer commented Aug 18, 2021

One idea I had was to add a token $.error_sentinel (unused within the grammar) to the very end of my externals list, then branch only on its value within the external scanner to reduce the fragility of this error recovery detection scheme. But I haven't checked whether that works. Perhaps I should do that now.

@ahelwer
Copy link

ahelwer commented Aug 18, 2021

@cfroystad
Copy link
Collaborator Author

Brilliant! That made the detection less brittle to future changes! Thanks 🙂

@cfroystad cfroystad changed the title Initial work on string interpolation String interpolation Aug 18, 2021
src/scanner.cc Outdated
}
}

bool is_escapable_sequence(TSLexer *lexer) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be marked as static/private?

@Sjord
Copy link
Contributor

Sjord commented Aug 19, 2021

  • "hello ${a} world"; becomes (string [1, 1] - [1, 17]), but I would expect a variable_name to be in there.
  • "hello \0 world"; becomes a string, but I would expect an escape_sequence in there.
  • The following gives an error:
<?php
$ab = "ab";
echo "hello ${"a"."b"} world";

@Sjord
Copy link
Contributor

Sjord commented Aug 19, 2021

Also, braces can be escaped with a backslash. The following is valid in PHP but gives an error with tree-sitter in this branch:

<?php "\{$";

Edit: maybe it's not true that braces can be escaped. If you run echo "\{$";, it prints \{$. Not {$, which is what you would expect if the backslash escapes the brace. The docs also don't mention this escape sequence.

@cfroystad
Copy link
Collaborator Author

Thank you for testing and identifying some of the things I've missed. I'll look into fixing them when I get home from work. Most of them look like they'll be easy to fix (although slightly embarrassing 😄)

@cfroystad
Copy link
Collaborator Author

All review comments have been addressed in the last commit.

Thank you both for identifying bugs and suggesting good improvements.

With regard to escaping \{ it appears it's not really escaping but whenever those two letter coincide, they're both printed instead of interpreted. I've changed the implementation accordingly.

@Sjord
Copy link
Contributor

Sjord commented Aug 20, 2021

Nice work. It seems I was mistaken about dynamic variable names. "$$$$$$$$$$$$$a" is interpreted by PHP as many dollars and a single variable, not as a dynamic variable.

I also found some more edge cases that fail to parse:

  • "${a}["
  • "\u{$a}"

@Sjord
Copy link
Contributor

Sjord commented Aug 20, 2021

Nice work. It seems I was mistaken about dynamic variable names. "$$$$$$$$$$$$$a" is interpreted by PHP as many dollars and a single variable, not as a dynamic variable.

I also found some more edge cases that fail to parse:

  • "${a}["
  • "\u{$a}"

@cfroystad
Copy link
Collaborator Author

Thanks for testing!

I've fixed the two first items, but might need some guidance on the last one.

With the current implementation, I'd need to have two candidate mark_end (which as far as I understand is not supported):

" <- mark_end[string end: potential escape sequence]
\
u <- mark_end[string end: potential complex string element]
{ <- we don't know which to use yet
$ <- Ah, it's a complex string element, use "string end: potential complex string element" end
a
}
"

This might all be a consequence of me doing too much in the external scanner due to not having found a way to solve them in the main parser.

Any pointers/ideas on how to handle this would be greatly appreciated! (one option is of course to wait until Hack is done ironing out the same bugs, and try to adopt their approach)

@Sjord
Copy link
Contributor

Sjord commented Aug 26, 2021

I guess the lexer needs to scan both \u{0123} and \u{$a}, and the parser (grammar.js) needs to distinguish between the two.

However, this is really an exceptional edge case, so it would also be fine to merge this PR and leave it for now, in my opinion.

@cfroystad
Copy link
Collaborator Author

Thank you, @Sjord! That was the pointer I needed (in combination with driving along the coast of Norway, having loads of time while on ferries). The last edge case is no fixed - even though slightly hackish.

The edge case "adasdfafd\u{$a} will be translated into:

(program
    (php_tag)
    (expression_statement
        (encapsed_string
            (string)
            (string)
            (variable_name (name)
        )
    )
)

@aryx
Copy link
Contributor

aryx commented Aug 30, 2021

There's a conflict. You probably need to merge master in your branch (or rebase).

@cfroystad
Copy link
Collaborator Author

cfroystad commented Aug 30, 2021

Thanks! I forgot to update the branch after merging another PR earlier. However, the conflict is only in the generated parser and should thus not be a blocker for final review (my merging powers for tree-sitter-php doesn't include changes as complex and wide-reaching as this)

I'll try to find a few minutes to rebase when I get home from work in a few hours 🙂

@cfroystad
Copy link
Collaborator Author

Rebased

@aryx aryx merged commit 50c6951 into tree-sitter:master Aug 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Error on string interpolation with double quotes Variables that will be expanded in double quoted strings are not parsed as variables

4 participants