Skip to content

Add support for /regex/#18842

Merged
nik9000 merged 1 commit intoelastic:masterfrom
nik9000:painless_regex
Jun 13, 2016
Merged

Add support for /regex/#18842
nik9000 merged 1 commit intoelastic:masterfrom
nik9000:painless_regex

Conversation

@nik9000
Copy link
Copy Markdown
Member

@nik9000 nik9000 commented Jun 13, 2016

Adds /regex/ as a regex constructor. A couple of fun points:

  1. This makes generic the idea of arbitrary stuff adding a constant.
    Both SFunction and LRegex create a statically initialized constant.
    Both go through Locals to do this because they LRegex isn't directly
    iterable from SScript.
  2. Differentiating / as-in-division from / as-in-start-of-regex
    is hard. See:
    http://www-archive.mozilla.org/js/language/js20-2002-04/rationale/syntax.html#regular-expressions
    The javascript folks have a way, way tougher time of it then we do
    because they have semicolon insertion. We have the much simpler
    delimiter rules. Even with our simpler life we still have to add
    a hack to get lexing /regex/ to work properly. I chose to add
    token-level lookbehind because it seems to be a pretty contained hack.
    I considered and rejected lexer modes, a lexer member variable,
    having the parser set variables on the lexer (this is a fairly common
    solution for js, I believe), and moving regex parsing to the parser
    level.
  3. I've only added a very small subset of java.util.regex to the
    whitelist because it is the subset I needed to test LRegex sanely.
    More deserves to be added, and maybe more regex syntax like =~ and
    ==~. Those can probably be added without too much pain.

@nik9000
Copy link
Copy Markdown
Member Author

nik9000 commented Jun 13, 2016

@jdconrad I wanted to take a crack at regexes "because parsers are fun" (irrelevant ping for @bd808)!

It turned out a bit more complicated than I'd liked because / is hard. I think I added some lexer ambiguity by adding the sempreds where I did. I didn't see any slowdown but I'm not looking very hard. I expect we can work around it by iterating on this though.

@jdconrad
Copy link
Copy Markdown
Contributor

@nik9000 I'll make sure to take a look at this at some point today. Thanks for tackling this!

@uschindler
Copy link
Copy Markdown
Contributor

I like the new constants handling, including the writeConstants method ref :-) I have not closely looked at the antlr code, but the ASM stuff looks fine. I will report once I have done a more close review!

@nik9000
Copy link
Copy Markdown
Member Author

nik9000 commented Jun 13, 2016

I like the new constants handling, including the writeConstants method ref :-) I have not closely looked at the antlr code, but the ASM stuff looks fine. I will report once I have done a more close review!

Thanks!

@uschindler
Copy link
Copy Markdown
Contributor

uschindler commented Jun 13, 2016

I think all "synthetic constants" should have a "$" in their name, so they are definitely not accessible outside. Javac does the same (e.g. the assertion constant, or lambda implementation methods). Our handles for the DEF case already have this, maybe make it consistent.

A second thing: If the pattern is bullshit, you would only see this in a clinit error. Maybe you should pass the Pattern to Pattern.compile also as a compile-time check (just to try if it compiles).

Another question: How to handle pattern arguments like caseinsensitive? currently you have to pass as "(?i)" as part of the regex. PERL users know this as coming after the second slash, e.g. /regex/i

@jdconrad
Copy link
Copy Markdown
Contributor

@nik9000 The constants stuff is awesome. Very clean. I'm wondering if we can use modes instead of custom code for the regexes in the lexer. I just can't help but think we should be able to resolve that divides conflict some other way.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I realize your IDE probably did this, but would you mind switching it back to org.elasticsearch.painless.Definition.Type; and list out the org.objectweb.asm.Type instead? It matches with the other imports already from Definition.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did that intentionally but I can revert it.

@nik9000
Copy link
Copy Markdown
Member Author

nik9000 commented Jun 13, 2016

I think all "synthetic constants" should have a "$" in their name

Sure! I saw that at some point in something else and forgot to adapt. Easy enough to fix.

If the pattern is bullshit, you would only see this in a clinit error.

Yeah. I wonder if I can compile it when I hit is and just pass the compiled Pattern to the initializer.

/regex/i

I was hoping to do that in a followup, but, yes, I think we should hadn't flags after the trailing /.

I'm wondering if we can use modes instead of custom code for the regexes in the lexer.

I investigated that. We can do it but it makes the lexer much more complex to read though probably simpler to run. Because modes don't inherit from each other you'd need to copy every rule and rename it. It pretty much quadruples the size of the lexer because you need to specify every mode transition and most token's types. Twice.

@jdconrad
Copy link
Copy Markdown
Contributor

jdconrad commented Jun 13, 2016

@nik9000 Yeah, you're right. It would be ridiculous to do this another way. I'm on board with the grammar changes. The only question I have left is should the predicates in the lexer be on the right side of the rules? I think this might have ramifications on how it matches the first and last characters. (In the lexer they can be anywhere.) Edit: The reason I ask this question is right hand predicates seem to be the preferred way in the book.

@uschindler
Copy link
Copy Markdown
Contributor

Yeah. I wonder if I can compile it when I hit is and just pass the compiled Pattern to the initializer.

That does not work with final variables. Just compiling it while parsing once does not matter in reality, Its just to check that syntax is correct and throw some sytax related exception on parse problem. The compiled pattern is not important. Pattern.compile() should be fast, its only slow if you do it millions of times :-)

@nik9000
Copy link
Copy Markdown
Member Author

nik9000 commented Jun 13, 2016

The only question I have left is should the predicates in the lexer be on the right side of the rules?

I've moved it to the right side and it seems to be being called far fewer times. I just checked by debugging it, but that seems like a good thing?

@nik9000 nik9000 removed the WIP label Jun 13, 2016
@nik9000
Copy link
Copy Markdown
Member Author

nik9000 commented Jun 13, 2016

@uschindler and @jdconrad I believe this is ready for another round.

@uschindler
Copy link
Copy Markdown
Contributor

I'd make the type in addConstant also be a Definition.Type for consistency... unless it is too complicated to setup for the MethodHandles special case...

@uschindler
Copy link
Copy Markdown
Contributor

Ignore that it's too complicated! Maybe only add a second overload for the consistency :)

@nik9000
Copy link
Copy Markdown
Member Author

nik9000 commented Jun 13, 2016

unless it is too complicated to setup for the MethodHandles special case...

That is why I didn't do it in the first place.

@nik9000
Copy link
Copy Markdown
Member Author

nik9000 commented Jun 13, 2016

Maybe only add a second overload for the consistency :)

That I'm happy to do.

@nik9000
Copy link
Copy Markdown
Member Author

nik9000 commented Jun 13, 2016

That I'm happy to do.

Done.

@jdconrad
Copy link
Copy Markdown
Contributor

The behavior of the lexer makes sense with the predicate being called fewer times if it's on the right hand side. It will only test the predicate now if it actually matches one of the tokens.

@nik9000
Copy link
Copy Markdown
Member Author

nik9000 commented Jun 13, 2016

Morning syntactic predicates....

@uschindler
Copy link
Copy Markdown
Contributor

Ok, from my side all looks fine. I'd just add some tests for more complicated regexes (e.g. those with escaped forward slashes or similar).

Also the renamed constants with a $ similar to the others are still missing. Otherwise LGTM from the ASM side.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can remove that method now and inline the body above. It was just added to make the code in SSource not rely on internals here.

@nik9000
Copy link
Copy Markdown
Member Author

nik9000 commented Jun 13, 2016

Morning

mourning.... Spelling is hard.

@uschindler
Copy link
Copy Markdown
Contributor

mourning.... Spelling is hard.

This is why the method had to die! :-)

@nik9000
Copy link
Copy Markdown
Member Author

nik9000 commented Jun 13, 2016

OK! Since @uschindler is good with it and we're pretty much done with the review part I've prepared it for merge. @jdconrad, are you good with it?

@jdconrad
Copy link
Copy Markdown
Contributor

This looks good to me. +1 @nik9000 Thanks for adding this!

Adds `/regex/` as a regex constructor. A couple of fun points:
1. This makes generic the idea of arbitrary stuff adding a constant.
Both SFunction and LRegex create a statically initialized constant.
Both go through Locals to do this because they LRegex isn't directly
iterable from SScript.
2. Differentiating `/` as-in-division from `/` as-in-start-of-regex
is hard. See:
http://www-archive.mozilla.org/js/language/js20-2002-04/rationale/syntax.html#regular-expressions
The javascript folks have a way, way tougher time of it then we do
because they have semicolon insertion. We have the much simpler
delimiter rules. Even with our simpler life we still have to add
a hack to get lexing `/regex/` to work properly. I chose to add
token-level lookbehind because it seems to be a pretty contained hack.
I considered and rejected lexer modes, a lexer member variable,
having the parser set variables on the lexer (this is a fairly common
solution for js, I believe), and moving regex parsing to the parser
level.
3. I've only added a very small subset of java.util.regex to the
whitelist because it is the subset I needed to test LRegex sanely.
More deserves to be added, and maybe more regex syntax like `=~` and
`==~`. Those can probably be added without too much pain.
@nik9000 nik9000 merged commit 6617b53 into elastic:master Jun 13, 2016
@nik9000
Copy link
Copy Markdown
Member Author

nik9000 commented Jun 13, 2016

Thanks for reviewing @jdconrad and @uschindler !

@clintongormley clintongormley added :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache and removed :Plugin Lang Painless labels Feb 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Core/Infra/Scripting Scripting abstractions, Painless, and Mustache >enhancement v5.0.0-alpha4

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants