SI-6476 Accept escaped quotes in interp strings#4308
SI-6476 Accept escaped quotes in interp strings#4308som-snytt wants to merge 5 commits intoscala:2.12.xfrom
Conversation
f838601 to
e45ca2d
Compare
|
Moved the error message improvement to #4316 |
bdf0c95 to
57f82f2
Compare
|
Started looking into this - big topic :) A first question: is the following expected? |
|
Here's another interesting one: EDIT: I get it, that's bc2a1eb |
|
There's another inconsistency mentioned in SI-6476: since the |
|
Here's a potential issue with bc2a1eb (works on 2.11.6): |
|
Question: what would be the impact of this change on a potential regex interpolator? @jrudolph mentions the example |
|
jrudolph asks what it would do without saying what he wants it to do.
The interpolator sees what the coder typed. This PR says: Let's make it easier for the coder to get their string into the interpolator. So To your first question: as shown by normalize, you escape backslash, quote and dollar in the usual way. In The second example is in the test, or maybe not. I mentioned it somewhere, maybe on the ML. The gotcha is: Is there an Abide hotline number? The distinct issue you mention, that triple quoting behaves differently for interpolators, is because interpolators don't know about triple quotes. All they see are parts and args. On your potential issue, that is what triple quotes are for: |
|
@lrytz Thanks for taking a look. My opinion is that this change is intuitive, easy to reason about for both coder and interpolator implementor, and places a minimum of fuss between the parser and the interpolator (namely, that the interpolator sees escaped delimiters, Actually, I was wrong about regex, r-interpolator must process |
I see what happens in the scanner implementation, but I think the result is not intuitive: |
|
Or maybe I have the wrong intuition, I have to think about it more. |
|
Does this capture the intuition? "As in most parsing, we want to consume the longest input that matches. So take to the next unescaped quote, or else take to the next quote." This is reasonable because ordinary strings are not multiline. Also worth noting that when you focus on the edge case, the bar for intuitive is set at "OK, I can reason about that if you'll give me a napkin and a pen." I mean a serviette. |
|
So, if I understand you correctly, you mean that in On the other hand, it would be nice if every target string could be represented by an interpolated one without having to resort to an
|
|
You can never do plain Unless you pick an arbitrary symbol like I guess we won't see guillemets to mean, Suppress interpolation of args, But For To keep the style choice open, dollar aliases for all three delimiters ( |
Absolutely agree, the PR is all about making string interpolations more intuitive to use, and I think it achieves that goal. I imagined somebody using So we have to explain the |
But as this change seems to introduce new edge cases it still makes sense to critical look at them and see how they are resolved. I think if you look at the 4 examples above and try to spell out how the r-interpolator would translate syntax into a regular expression you will see that the interpolator needs to make at least one arbitrary choice: which of, the now two, escape characters will be interpreted literally and which one will need to be translated. Here are the possible choices under this proposal with the translation from a regular expression on the left to its representation with an r-interpolator on the right. Possibility A (backslash will be interpreted literally while $ needs to be escaped):
Possibility B (both backslash and $ need to be escaped):
Possibility C (a variant of A, probably both could be supported at the same time):
In the first two cases, allowing $-escapes is necessary and not only to "keep the style choice open". In an alternative implementation where backslash isn't an escape character (and that's what I argued for in the previous discussion) it would be clear, dollar and doublequotes need to be escaped with dollar:
While possibility C is the shorter than the last one (without backslash escapes), it is less regular ($ has to be escaped in some cases but mustn't be escaped in others). So, in the end this is the compromise of the proposal in this PR: trading in a bit of regularity against a bit of added convenience. |
|
@jrudolph thanks for the elaboration. I agree the last alternative makes sense, it has simple rules:
This could be implemented after this PR. (no?) A corner case is that there are two options for the
What are you referring to, specifically? |
But it only works if backslash has no special meaning for the parser. This PR introduces backslash as an escape character, so no it couldn't be implemented after this PR.
You are right I omitted the choices wrt doublequotes. Note that your second option isn't possible under this PR IIUC (because Let's update the examples: Possibility A (backslash will be interpreted literally while $ needs to be escaped):
Possibility B (both backslash and $ need to be escaped):
Possibility C (a variant of A, probably both could be supported at the same time):
Possibility D (doesn't work with this PR, would work by adding just the
|
|
Actually, D has another problem (in addition to requiring As you point out, I made the same mistake in my example: for the regex |
|
@lrytz @jrudolph I started to attempt that tabulation yesterday; and this morning, still no time to pursue it. But, this PR only changes how the parser uses Semantics are still totally up to the interpolator. A regex"" would be free to ignore backslash, so that the anchor must be An interpolator could throw on embedded quote with "invalid escape", for instance. Require triple quotes in that instance. Another idea: is there any benefit to translating s"$$" the same as s"${ "$" }". For s"$"", that would distinguish between quote embedded in triple quotes and a dollar-interpolated quote. I haven't thought this thru: but possibly, s"$" could be translated as s"${ DOLLAR_TOKEN }", a special arg to indicate an uninterpolated dollar. DOLLAR_TOKEN would have to be translated by a macro or error out (I think there's an annotation for that). So |
|
Maybe to clarify: the question is not so much what the interpolator exactly gets as input but what kinds of input the Scala parser can parse at all. Out of all the parseable strings the question is then how to map inputs to semantics (regardless of the intermediate format the string interpolator has to deal with). Adding |
Everyone wants to say `s"\"hello,\" world"`. This commit considers `\"` when looking for the end of a string part. `\\` is permitted to enable `\\"` for literal `\"`. However, the interpolator still receives the source text. The wrinkly edge case is that trailing backslash must be escaped. For instance, an interpolator that creates `File` objects would require `F"~\Documents\\"`. Improve error message for unterminated string to highlight quoted quote behavior.
Escaped flag is internal.
Everyone wants to say `s"\$s to donuts"`. The double-dollar is especially onerous in format strings: ``` f"$pi %1$$.1f" ``` compared to ``` f"$pi %1\$.1f" ``` and regular expressions ``` scala> s"^hi$$".r res0: scala.util.matching.Regex = ^hi$ scala> s"^hi\$".r res1: scala.util.matching.Regex = ^hi$ ``` This commit makes dollars escapable in the way everyone expects, but retains the double-your-dollars escape for compatibility. However, "\$", escaped dollar in an ordinary string literal, is not tolerated. This is deemed good behavior because everyone should use the f-interpolator all the time, even for constant strings. That way, you'll never see the missing-interpolator lintage.
Everyone wants to say `raw"\hello, world.\"`. Parser accepts arbitrary escaped quotes optionally followed by a quote. That is, an escaped quote is deemed to terminate the interpolated string if it is not followed by another quote on the same line.
An alternative method for processing escapes, which processes only escaped delimiters (\\, \", \$). This is a minimal subset which an interpolator may choose to support. Optionally, it can also silently remove all escaping backslashes. It does not convert standard escapes such as \n for newline.
|
Rebasing. |
57f82f2 to
ffc94e9
Compare
|
Seems like the old off-by-one joke was off by one, there are in fact only three hard things in computer science, cache invalidation, naming things, escaping, and off-by-one errors. I don't think we're close enough to a consensus to leave this PR open. I'm going to close it for now, but please don't take this as a signal that we should give up on this avenue of improvement. |
|
I think this is going to need a SIP. |
|
@som-snytt Are you thinking of reopening and updating this proposal any time soon? If you're still interested, as you seem to point out here, I will schedule a review as soon as I have an open PR 😄. |
|
@som-snytt I encountered this issue myself recently and would love to see this reopened. Had this issue when I wanted to use string interpolation and triple quotes to create a regex string. |
|
@DanielGGordon https://github.com/scala/scala/pull/5771/files suggests |
Everyone wants to say
s"\"hello,\" world".This commit considers
\"when looking for the end ofa string part.
However, the interpolator still receives the source text.
Also adds a commit for escaped dollar.