Skip to content

SI-6476 Accept escaped quotes in interp strings#4308

Closed
som-snytt wants to merge 5 commits intoscala:2.12.xfrom
som-snytt:issue/6476-backslash-b
Closed

SI-6476 Accept escaped quotes in interp strings#4308
som-snytt wants to merge 5 commits intoscala:2.12.xfrom
som-snytt:issue/6476-backslash-b

Conversation

@som-snytt
Copy link
Contributor

Everyone wants to say s"\"hello,\" world".

This commit considers \" when looking for the end of
a string part.

However, the interpolator still receives the source text.

Also adds a commit for escaped dollar.

scala> s"^hi$$".r     // sip style
res0: scala.util.matching.Regex = ^hi$

scala> s"^hi\$".r      // proposed
res1: scala.util.matching.Regex = ^hi$

scala> raw"\text\"    // interpretation of escapes is still up to the interpolator
res2: String = \text\

scala> raw"\"text\""
res3: String = \"text\"

scala> s"\"text\""
res4: String = "text"

@som-snytt
Copy link
Contributor Author

Moved the error message improvement to #4316

@som-snytt som-snytt force-pushed the issue/6476-backslash-b branch 3 times, most recently from bdf0c95 to 57f82f2 Compare February 16, 2015 04:28
@lrytz
Copy link
Member

lrytz commented Mar 25, 2015

Started looking into this - big topic :)

A first question: is the following expected?

scala> raw"\\""
<console>:1: error: unclosed string literal
       raw"\\""
              ^

@lrytz
Copy link
Member

lrytz commented Mar 25, 2015

Here's another interesting one:

scala> raw"\".length
res0: Int = 1

scala> raw"\".length"
res1: String = \".length

EDIT: I get it, that's bc2a1eb

@lrytz
Copy link
Member

lrytz commented Mar 25, 2015

There's another inconsistency mentioned in SI-6476: since the s interpolator is responsible for escaping, it does so even for triple-quote strings. That is not specific to \", it happens for all escapes. This is very different issue than the one fixed by this PR, so I think it's good to defer it to a different PR. Just pointing out that it's also part of SI-6476.

scala> "\\"
res16: String = \

scala> s"\\"
res17: String = \

scala> """\\"""
res18: String = \\

scala> s"""\\"""
res19: String = \             << oops

@lrytz
Copy link
Member

lrytz commented Mar 25, 2015

Here's a potential issue with bc2a1eb (works on 2.11.6):

scala> val s = "hi"
s: String = hi

scala> raw"\" + s
res27: String = \hi

scala> raw"\" + "hi"
<console>:1: error: ';' expected but string literal found.
       raw"\" + "hi"
                 ^
<console>:1: error: unclosed string literal
       raw"\" + "hi"
                    ^

@lrytz
Copy link
Member

lrytz commented Mar 25, 2015

Question: what would be the impact of this change on a potential regex interpolator? @jrudolph mentions the example regexp"[\\"]+" in the 2013 scala-internals thread

@som-snytt
Copy link
Contributor Author

jrudolph asks what it would do without saying what he wants it to do.

StringContext.normalize summarizes the minimum escape support required of interpolators. That is in lieu of "parser hooks" as jrudolph asks for in another comment.

The interpolator sees what the coder typed. This PR says: Let's make it easier for the coder to get their string into the interpolator.

So r"[\\\"]+" (adding a slash to the example) would mean match a string of at least one backslash or quote. The interpolator sees the backslashes, but they are native to regex, so it wouldn't have to do anything special. Today:

scala> val r = raw"[\\\${'"'}]+".r
r: scala.util.matching.Regex = [\\\"]+

scala> "abc" match { case r(_*) => }
scala.MatchError: abc (of class java.lang.String)
  ... 33 elided

scala> "\"" match { case r(_*) => }

scala> "\"\"" match { case r(_*) => }

scala> "\"\\" match { case r(_*) => }

To your first question: as shown by normalize, you escape backslash, quote and dollar in the usual way. In raw"\\"", backslash is obviously escaped and therefore does not escape the second quote.

The second example is in the test, or maybe not. I mentioned it somewhere, maybe on the ML. The gotcha is:

scala> raw"hi\"  // Use a " here
<console>:8: error: value here is not a member of String
              raw"hi\"  // Use a " here
                                   ^

Is there an Abide hotline number?

The distinct issue you mention, that triple quoting behaves differently for interpolators, is because interpolators don't know about triple quotes. All they see are parts and args.

On your potential issue, that is what triple quotes are for: raw"""\""" + "hi". Somewhere I mention the edge case of trailing backslash in the interpolated string. The obvious intuition is that backslash does not escape triple-quote.

@som-snytt
Copy link
Contributor Author

@lrytz Thanks for taking a look. My opinion is that this change is intuitive, easy to reason about for both coder and interpolator implementor, and places a minimum of fuss between the parser and the interpolator (namely, that the interpolator sees escaped delimiters, \\, \", \$).

Actually, I was wrong about regex, r-interpolator must process \$ obviously, for the anchor.

scala> val r = StringContext.normalize(raw"(?m)foo\$.*").r        // something like
r: scala.util.matching.Regex = (?m)foo$.*

scala> r findAllIn "foo\nbar" toList
res27: List[String] = List(foo)

scala> r findAllIn "foobaz\nbar" toList
res28: List[String] = List()

@lrytz
Copy link
Member

lrytz commented Mar 25, 2015

In raw""", backslash is obviously escaped and therefore does not escape the second quote

I see what happens in the scanner implementation, but I think the result is not intuitive:

scala> raw"\\"
res0: String = \\

scala> raw"\""
res1: String = \"

scala> raw"\\"" // combine the two
<console>:1: error: unclosed string literal
       raw"\\"" // combine the two
              ^

@lrytz
Copy link
Member

lrytz commented Mar 25, 2015

Or maybe I have the wrong intuition, I have to think about it more.

@som-snytt
Copy link
Contributor Author

Does this capture the intuition? "As in most parsing, we want to consume the longest input that matches. So take to the next unescaped quote, or else take to the next quote." This is reasonable because ordinary strings are not multiline.

Also worth noting that when you focus on the edge case, the bar for intuitive is set at "OK, I can reason about that if you'll give me a napkin and a pen." I mean a serviette.

@jrudolph
Copy link
Contributor

So, if I understand you correctly, you mean that in r"[\\\"]+" the r interpolator would get passed in the string [\\\"]+, i.e. the verbatim input, "what the coder typed". This is indeed convenient for a syntax which also uses backslash-escaping.

On the other hand, it would be nice if every target string could be represented by an interpolated one without having to resort to an ${}-expression. While the situation is better than before, there are all those cases for regular expressions for which only half of them can be represented depending on the interpretation of the raw string by the r-interpolator:

  • $ to match the end
  • \$ to match the literal $
  • \\$ to match a literal \ at the end
  • \\\$ to match a literal \ followed by a literal \$

@som-snytt
Copy link
Contributor Author

You can never do plain $, oh well.

Unless you pick an arbitrary symbol like ¥, which is bound to supplant the dollar eventually anyway.

I guess we won't see guillemets to mean, Suppress interpolation of args, r«\w$».

But r"hello\$" instead of r"hello$$" for hello$, and r"\w\$" for \w$.

For r"\\\$", maybe prefer r"\\$$" for legibility.

scala> raw"(?m)\\$$".r findAllIn """abc\
     | uvw
     | def\
     | xyz""" toList
res0: List[String] = List(\, \)

To keep the style choice open, dollar aliases for all three delimiters ($", $\, $$).

@lrytz
Copy link
Member

lrytz commented Mar 26, 2015

Also worth noting that when you focus on the edge case

Absolutely agree, the PR is all about making string interpolations more intuitive to use, and I think it achieves that goal.

I imagined somebody using raw"" without knowing how it's implemented: he might think of raw as "no escapes", because all characters end up in the resulting string (raw"\\" == """\\""").

So we have to explain the raw interpolator better (todo: how exactly).

@jrudolph
Copy link
Contributor

Also worth noting that when you focus on the edge case

Absolutely agree, the PR is all about making string interpolations more intuitive to use, and I think it achieves that goal.

But as this change seems to introduce new edge cases it still makes sense to critical look at them and see how they are resolved. I think if you look at the 4 examples above and try to spell out how the r-interpolator would translate syntax into a regular expression you will see that the interpolator needs to make at least one arbitrary choice: which of, the now two, escape characters will be interpreted literally and which one will need to be translated.

Here are the possible choices under this proposal with the translation from a regular expression on the left to its representation with an r-interpolator on the right.

Possibility A (backslash will be interpreted literally while $ needs to be escaped):

  • $ => r"$$"
  • \$ => r"$\$$"
  • \\$ => r"\\$$"
  • \\\$ => r"\\$\$$"

Possibility B (both backslash and $ need to be escaped):

  • $ => r"\$" or r"$$"
  • \$ => r"\\\$" or r"$\$$
  • \\$ => r"\\\\\$" or r"$\$\$$"
  • \\\$ => r"\\\\\\\$" or r"$\$\$\$$"

Possibility C (a variant of A, probably both could be supported at the same time):

  • $ => r"$$"
  • \$ => r"\$"
  • \\$ => r"\\$$"
  • \\\$ => r"\\\$"

In the first two cases, allowing $-escapes is necessary and not only to "keep the style choice open".

In an alternative implementation where backslash isn't an escape character (and that's what I argued for in the previous discussion) it would be clear, dollar and doublequotes need to be escaped with dollar:

  • $ => r"$$"
  • \$ => r"\$$"
  • \\$ => r"\\$$"
  • \\\$ => r"\\\$$"

While possibility C is the shorter than the last one (without backslash escapes), it is less regular ($ has to be escaped in some cases but mustn't be escaped in others). So, in the end this is the compromise of the proposal in this PR: trading in a bit of regularity against a bit of added convenience.

@lrytz
Copy link
Member

lrytz commented Mar 26, 2015

@jrudolph thanks for the elaboration.

I agree the last alternative makes sense, it has simple rules:

  • $ => $$: you need to escape $ because you're in a string interpolation - makes sense.
  • " => $": you need to escape ", otherwise it ends the string - also makes sense.

This could be implemented after this PR. (no?)

A corner case is that there are two options for the \" regex: you an write r"\"" or r"\$"" - the second would be recommended.

this is the compromise of the proposal in this PR: trading in a bit of regularity against a bit of added convenience

What are you referring to, specifically?

@jrudolph
Copy link
Contributor

I agree the last alternative makes sense

But it only works if backslash has no special meaning for the parser. This PR introduces backslash as an escape character, so no it couldn't be implemented after this PR.

A corner case is that there are two options for the \" regex: you an write r"\"" or r"\$"" - the second would be recommended.

You are right I omitted the choices wrt doublequotes. Note that your second option isn't possible under this PR IIUC (because \$ is already an escape sequence which means that the first " would be closing the string and the second one couldn't be parsed). Was that a typo and you meant r"$""?

Let's update the examples:

Possibility A (backslash will be interpreted literally while $ needs to be escaped):

  • $ => r"$$"
  • \$ => r"$\$$"
  • \\$ => r"\\$$"
  • \\\$ => r"\\$\$$"
  • " => r"$""
  • " => r"\"" ? This could be made to work but would introduce another irregularity that, in contrast to the above cases, this backslash wouldn't be treated literally.

Possibility B (both backslash and $ need to be escaped):

  • $ => r"\$" or r"$$"
  • \$ => r"\\\$" or r"$\$$
  • \\$ => r"\\\\\$" or r"$\$\$$"
  • \\\$ => r"\\\\\\\$" or r"$\$\$\$$"
  • " => r"\"" or r"$""

Possibility C (a variant of A, probably both could be supported at the same time):

  • $ => r"$$"
  • \$ => r"\$"
  • \\$ => r"\\$$"
  • \\\$ => r"\\\$"
  • " => see A above

Possibility D (doesn't work with this PR, would work by adding just the $" escape):

  • $ => r"$$"
  • \$ => r"\$$"
  • \\$ => r"\\$$"
  • \\\$ => r"\\\$$"
  • " => r"$""

@lrytz
Copy link
Member

lrytz commented Mar 26, 2015

Actually, D has another problem (in addition to requiring $"): after this PR, for r"\$$", the scanner uses the backslash to escape the $, then the second $ yields an error (invalid string interpolation).

As you point out, I made the same mistake in my example: for the regex \" you cannot write r"\$"", even if $" is supported - the scanner consumes the $ after the backlash and then ends the string.

@som-snytt
Copy link
Contributor Author

@lrytz @jrudolph I started to attempt that tabulation yesterday; and this morning, still no time to pursue it. But, this PR only changes how the parser uses \". Instead of simply closing the string part, it can take it as embedded.

Semantics are still totally up to the interpolator. A regex"" would be free to ignore backslash, so that the anchor must be $$.

An interpolator could throw on embedded quote with "invalid escape", for instance. Require triple quotes in that instance.

Another idea: is there any benefit to translating s"$$" the same as s"${ "$" }". For s"$"", that would distinguish between quote embedded in triple quotes and a dollar-interpolated quote.

I haven't thought this thru: but possibly, s"$" could be translated as s"${ DOLLAR_TOKEN }", a special arg to indicate an uninterpolated dollar. DOLLAR_TOKEN would have to be translated by a macro or error out (I think there's an annotation for that).

So r"$x" as usual, r"x$" could be processed by an r macro. That is either a neat idea to support unadorned dollar in an r macro or a can of confusion. On the weekend I'll try it out. Gotta run now...

@jrudolph
Copy link
Contributor

Maybe to clarify: the question is not so much what the interpolator exactly gets as input but what kinds of input the Scala parser can parse at all. Out of all the parseable strings the question is then how to map inputs to semantics (regardless of the intermediate format the string interpolator has to deal with).

Adding \\, \", and \$ as escape characters will alter the set of possible inputs as shown by the above examples.

@som-snytt som-snytt closed this Apr 17, 2015
Everyone wants to say `s"\"hello,\" world"`.

This commit considers `\"` when looking for the end of
a string part. `\\` is permitted to enable `\\"` for
literal `\"`.

However, the interpolator still receives the source text.

The wrinkly edge case is that trailing backslash must be
escaped. For instance, an interpolator that creates `File`
objects would require `F"~\Documents\\"`.

Improve error message for unterminated string to highlight
quoted quote behavior.
Escaped flag is internal.
Everyone wants to say `s"\$s to donuts"`.

The double-dollar is especially onerous in format strings:

```
f"$pi %1$$.1f"
```
compared to
```
f"$pi %1\$.1f"
```

and regular expressions

```
scala> s"^hi$$".r
res0: scala.util.matching.Regex = ^hi$

scala> s"^hi\$".r
res1: scala.util.matching.Regex = ^hi$
```

This commit makes dollars escapable in the way everyone
expects, but retains the double-your-dollars escape for
compatibility.

However, "\$", escaped dollar in an ordinary string literal,
is not tolerated.

This is deemed good behavior because everyone should use
the f-interpolator all the time, even for constant strings.
That way, you'll never see the missing-interpolator lintage.
Everyone wants to say `raw"\hello, world.\"`.

Parser accepts arbitrary escaped quotes optionally followed by
a quote. That is, an escaped quote is deemed to terminate the
interpolated string if it is not followed by another quote on
the same line.
An alternative method for processing escapes, which
processes only escaped delimiters (\\, \", \$).

This is a minimal subset which an interpolator may
choose to support.

Optionally, it can also silently remove all escaping backslashes.

It does not convert standard escapes such as \n for newline.
@som-snytt
Copy link
Contributor Author

Rebasing.

@som-snytt som-snytt reopened this Jun 26, 2015
@som-snytt som-snytt force-pushed the issue/6476-backslash-b branch from 57f82f2 to ffc94e9 Compare June 26, 2015 04:01
@adriaanm adriaanm modified the milestones: 2.12.0-M1, 2.12.0-M3 Jul 11, 2015
@retronym
Copy link
Member

retronym commented Aug 7, 2015

Seems like the old off-by-one joke was off by one, there are in fact only three hard things in computer science, cache invalidation, naming things, escaping, and off-by-one errors.

I don't think we're close enough to a consensus to leave this PR open. I'm going to close it for now, but please don't take this as a signal that we should give up on this avenue of improvement.

@retronym retronym closed this Aug 7, 2015
@SethTisue
Copy link
Member

I think this is going to need a SIP.

@jvican
Copy link
Member

jvican commented Nov 24, 2016

@som-snytt Are you thinking of reopening and updating this proposal any time soon? If you're still interested, as you seem to point out here, I will schedule a review as soon as I have an open PR 😄.

@DanielGGordon
Copy link

@som-snytt I encountered this issue myself recently and would love to see this reopened. Had this issue when I wanted to use string interpolation and triple quotes to create a regex string.

@som-snytt
Copy link
Contributor Author

@som-snytt som-snytt deleted the issue/6476-backslash-b branch December 19, 2020 17:36
@som-snytt som-snytt restored the issue/6476-backslash-b branch March 4, 2021 16:45
@som-snytt som-snytt deleted the issue/6476-backslash-b branch August 31, 2023 16:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants