Add support for /regex/ by nik9000 · Pull Request #18842 · elastic/elasticsearch

nik9000 · 2016-06-13T15:51:51Z

Adds /regex/ as a regex constructor. A couple of fun points:

This makes generic the idea of arbitrary stuff adding a constant.
Both SFunction and LRegex create a statically initialized constant.
Both go through Locals to do this because they LRegex isn't directly
iterable from SScript.
Differentiating / as-in-division from / as-in-start-of-regex
is hard. See:
http://www-archive.mozilla.org/js/language/js20-2002-04/rationale/syntax.html#regular-expressions
The javascript folks have a way, way tougher time of it then we do
because they have semicolon insertion. We have the much simpler
delimiter rules. Even with our simpler life we still have to add
a hack to get lexing /regex/ to work properly. I chose to add
token-level lookbehind because it seems to be a pretty contained hack.
I considered and rejected lexer modes, a lexer member variable,
having the parser set variables on the lexer (this is a fairly common
solution for js, I believe), and moving regex parsing to the parser
level.
I've only added a very small subset of java.util.regex to the
whitelist because it is the subset I needed to test LRegex sanely.
More deserves to be added, and maybe more regex syntax like =~ and
==~. Those can probably be added without too much pain.

nik9000 · 2016-06-13T15:56:18Z

@jdconrad I wanted to take a crack at regexes "because parsers are fun" (irrelevant ping for @bd808)!

It turned out a bit more complicated than I'd liked because / is hard. I think I added some lexer ambiguity by adding the sempreds where I did. I didn't see any slowdown but I'm not looking very hard. I expect we can work around it by iterating on this though.

jdconrad · 2016-06-13T16:29:42Z

@nik9000 I'll make sure to take a look at this at some point today. Thanks for tackling this!

uschindler · 2016-06-13T16:36:33Z

I like the new constants handling, including the writeConstants method ref :-) I have not closely looked at the antlr code, but the ASM stuff looks fine. I will report once I have done a more close review!

nik9000 · 2016-06-13T16:40:18Z

I like the new constants handling, including the writeConstants method ref :-) I have not closely looked at the antlr code, but the ASM stuff looks fine. I will report once I have done a more close review!

Thanks!

uschindler · 2016-06-13T16:57:46Z

I think all "synthetic constants" should have a "$" in their name, so they are definitely not accessible outside. Javac does the same (e.g. the assertion constant, or lambda implementation methods). Our handles for the DEF case already have this, maybe make it consistent.

A second thing: If the pattern is bullshit, you would only see this in a clinit error. Maybe you should pass the Pattern to Pattern.compile also as a compile-time check (just to try if it compiles).

Another question: How to handle pattern arguments like caseinsensitive? currently you have to pass as "(?i)" as part of the regex. PERL users know this as coming after the second slash, e.g. /regex/i

jdconrad · 2016-06-13T17:06:55Z

@nik9000 The constants stuff is awesome. Very clean. I'm wondering if we can use modes instead of custom code for the regexes in the lexer. I just can't help but think we should be able to resolve that divides conflict some other way.

jdconrad · 2016-06-13T17:08:27Z

modules/lang-painless/src/main/java/org/elasticsearch/painless/Locals.java

Nit: I realize your IDE probably did this, but would you mind switching it back to org.elasticsearch.painless.Definition.Type; and list out the org.objectweb.asm.Type instead? It matches with the other imports already from Definition.

I did that intentionally but I can revert it.

nik9000 · 2016-06-13T17:14:04Z

I think all "synthetic constants" should have a "$" in their name

Sure! I saw that at some point in something else and forgot to adapt. Easy enough to fix.

If the pattern is bullshit, you would only see this in a clinit error.

Yeah. I wonder if I can compile it when I hit is and just pass the compiled Pattern to the initializer.

/regex/i

I was hoping to do that in a followup, but, yes, I think we should hadn't flags after the trailing /.

I'm wondering if we can use modes instead of custom code for the regexes in the lexer.

I investigated that. We can do it but it makes the lexer much more complex to read though probably simpler to run. Because modes don't inherit from each other you'd need to copy every rule and rename it. It pretty much quadruples the size of the lexer because you need to specify every mode transition and most token's types. Twice.

jdconrad · 2016-06-13T17:35:35Z

@nik9000 Yeah, you're right. It would be ridiculous to do this another way. I'm on board with the grammar changes. The only question I have left is should the predicates in the lexer be on the right side of the rules? I think this might have ramifications on how it matches the first and last characters. (In the lexer they can be anywhere.) Edit: The reason I ask this question is right hand predicates seem to be the preferred way in the book.

uschindler · 2016-06-13T17:36:43Z

Yeah. I wonder if I can compile it when I hit is and just pass the compiled Pattern to the initializer.

That does not work with final variables. Just compiling it while parsing once does not matter in reality, Its just to check that syntax is correct and throw some sytax related exception on parse problem. The compiled pattern is not important. Pattern.compile() should be fast, its only slow if you do it millions of times :-)

nik9000 · 2016-06-13T18:32:52Z

The only question I have left is should the predicates in the lexer be on the right side of the rules?

I've moved it to the right side and it seems to be being called far fewer times. I just checked by debugging it, but that seems like a good thing?

nik9000 · 2016-06-13T18:47:36Z

@uschindler and @jdconrad I believe this is ready for another round.

uschindler · 2016-06-13T18:56:39Z

I'd make the type in addConstant also be a Definition.Type for consistency... unless it is too complicated to setup for the MethodHandles special case...

uschindler · 2016-06-13T18:58:07Z

Ignore that it's too complicated! Maybe only add a second overload for the consistency :)

nik9000 · 2016-06-13T18:58:10Z

unless it is too complicated to setup for the MethodHandles special case...

That is why I didn't do it in the first place.

nik9000 · 2016-06-13T18:58:34Z

Maybe only add a second overload for the consistency :)

That I'm happy to do.

nik9000 · 2016-06-13T20:02:38Z

That I'm happy to do.

Done.

jdconrad · 2016-06-13T20:07:54Z

The behavior of the lexer makes sense with the predicate being called fewer times if it's on the right hand side. It will only test the predicate now if it actually matches one of the tokens.

nik9000 · 2016-06-13T20:27:58Z

Morning syntactic predicates....

uschindler · 2016-06-13T21:15:14Z

Ok, from my side all looks fine. I'd just add some tests for more complicated regexes (e.g. those with escaped forward slashes or similar).

Also the renamed constants with a $ similar to the others are still missing. Otherwise LGTM from the ASM side.

uschindler · 2016-06-13T21:22:17Z

modules/lang-painless/src/main/java/org/elasticsearch/painless/node/SFunction.java

I think you can remove that method now and inline the body above. It was just added to make the code in SSource not rely on internals here.

nik9000 · 2016-06-13T21:43:47Z

Morning

mourning.... Spelling is hard.

uschindler · 2016-06-13T21:47:15Z

mourning.... Spelling is hard.

This is why the method had to die! :-)

nik9000 · 2016-06-13T21:49:48Z

OK! Since @uschindler is good with it and we're pretty much done with the review part I've prepared it for merge. @jdconrad, are you good with it?

jdconrad · 2016-06-13T21:51:30Z

This looks good to me. +1 @nik9000 Thanks for adding this!

Adds `/regex/` as a regex constructor. A couple of fun points: 1. This makes generic the idea of arbitrary stuff adding a constant. Both SFunction and LRegex create a statically initialized constant. Both go through Locals to do this because they LRegex isn't directly iterable from SScript. 2. Differentiating `/` as-in-division from `/` as-in-start-of-regex is hard. See: http://www-archive.mozilla.org/js/language/js20-2002-04/rationale/syntax.html#regular-expressions The javascript folks have a way, way tougher time of it then we do because they have semicolon insertion. We have the much simpler delimiter rules. Even with our simpler life we still have to add a hack to get lexing `/regex/` to work properly. I chose to add token-level lookbehind because it seems to be a pretty contained hack. I considered and rejected lexer modes, a lexer member variable, having the parser set variables on the lexer (this is a fairly common solution for js, I believe), and moving regex parsing to the parser level. 3. I've only added a very small subset of java.util.regex to the whitelist because it is the subset I needed to test LRegex sanely. More deserves to be added, and maybe more regex syntax like `=~` and `==~`. Those can probably be added without too much pain.

nik9000 · 2016-06-13T22:13:00Z

Thanks for reviewing @jdconrad and @uschindler !

nik9000 added review WIP v5.0.0-alpha4 labels Jun 13, 2016

jdconrad mentioned this pull request Jun 13, 2016

Ongoing Painless Improvements #17992

Closed

18 tasks

jdconrad reviewed Jun 13, 2016
View reviewed changes

nik9000 removed the WIP label Jun 13, 2016

uschindler reviewed Jun 13, 2016
View reviewed changes

nik9000 force-pushed the painless_regex branch from ffcce5d to 07bfaa3 Compare June 13, 2016 21:45

nik9000 force-pushed the painless_regex branch from 07bfaa3 to 852e527 Compare June 13, 2016 21:48

nik9000 force-pushed the painless_regex branch from 852e527 to 6617b53 Compare June 13, 2016 22:12

nik9000 merged commit 6617b53 into elastic:master Jun 13, 2016

clintongormley added the >enhancement label Jun 28, 2016

clintongormley added :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache and removed :Plugin Lang Painless labels Feb 14, 2018

Conversation

nik9000 commented Jun 13, 2016

Uh oh!

nik9000 commented Jun 13, 2016

Uh oh!

jdconrad commented Jun 13, 2016

Uh oh!

uschindler commented Jun 13, 2016

Uh oh!

nik9000 commented Jun 13, 2016

Uh oh!

uschindler commented Jun 13, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jdconrad commented Jun 13, 2016

Uh oh!

jdconrad Jun 13, 2016

Choose a reason for hiding this comment

Uh oh!

nik9000 Jun 13, 2016

Choose a reason for hiding this comment

Uh oh!

nik9000 commented Jun 13, 2016

Uh oh!

jdconrad commented Jun 13, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

uschindler commented Jun 13, 2016

Uh oh!

nik9000 commented Jun 13, 2016

Uh oh!

nik9000 commented Jun 13, 2016

Uh oh!

uschindler commented Jun 13, 2016

Uh oh!

uschindler commented Jun 13, 2016

Uh oh!

nik9000 commented Jun 13, 2016

Uh oh!

nik9000 commented Jun 13, 2016

Uh oh!

nik9000 commented Jun 13, 2016

Uh oh!

jdconrad commented Jun 13, 2016

Uh oh!

nik9000 commented Jun 13, 2016

Uh oh!

uschindler commented Jun 13, 2016

Uh oh!

uschindler Jun 13, 2016

Choose a reason for hiding this comment

Uh oh!

nik9000 commented Jun 13, 2016

Uh oh!

uschindler commented Jun 13, 2016

Uh oh!

nik9000 commented Jun 13, 2016

Uh oh!

jdconrad commented Jun 13, 2016

Uh oh!

nik9000 commented Jun 13, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

uschindler commented Jun 13, 2016 •

edited

Loading

jdconrad commented Jun 13, 2016 •

edited

Loading