Add support for the find operator (=~) and the match operator (==~) by nik9000 · Pull Request #18858 · elastic/elasticsearch

nik9000 · 2016-06-14T13:52:02Z

Adds support for the find operator (=~) and the match operator (==~) to painless's regexes. Also whitelists most of the Pattern class and documents regex support in painless.

The find operator (=~) constructs a matcher. The operator is called "find" because matchers coerce to booleans by calling Matcher.find. So this is valid:

if (ctx._source.last =~ /b/)

As is this:

Matcher m = ctx._source.last =~ /[aeiou]/

The match operator (==~) returns a boolean, true if the lhs Matcher.matches the rhs. So this is valid:

if (ctx._source.last ==~ /^[^aeiou].*[aeiou]$/)

All three of these things (find operator, match operator, Matcher -> boolean coercion) are "borrowed" from groovy. Groovy has lots more useful stuff around regexes but this is a good start.

jdconrad · 2016-06-14T14:59:48Z

modules/lang-painless/src/main/java/org/elasticsearch/painless/AnalyzerCaster.java

I'm no expert on regexes, so my question here is can you have a script where you do something like boolean x = (boolean)someMatcher;? If not, this isn't the correct place for this. My gut feeling is this should probably be part of the EComp node or its own node.

I think this should be valid: return (boolean)("foo" =~ /foo/)

("foo" =~ /foo/) What type does this return to begin with?

Got it. All right, that seems good then. Thanks for the answers.

jdconrad · 2016-06-14T15:09:48Z

@nik9000 This looks pretty good to me. One more serious comment about functionality and one minor follow up question.

jdconrad · 2016-06-14T15:33:37Z

@nik9000 +1 Thanks for the follow up answers.

uschindler · 2016-06-14T15:55:28Z

Syntax looks file to me.

Question about DEFs: If you have assigned a pattern to def, the boolean coercion and operators won't work, or does it?

I am a bit afraid about the new Sort.MATCHER type. Unless this breaks hardcoded checks I am fine with it, but we have a lot of code like type.sort == Sort.DEF or similar. We should check those if statements that they don't forget to also handle MATCHER when they expect an object.

nik9000 · 2016-06-14T16:02:21Z

Question about DEFs: If you have assigned a pattern to def, the boolean coercion and operators won't work, or does it?

Probably not at this point but it should.... I'll have a look.

I am a bit afraid about the new Sort.MATCHER type. Unless this breaks hardcoded checks I am fine with it, but we have a lot of code like type.sort == Sort.DEF or similar. We should check those if statements that they don't forget to also handle MATCHER when they expect an object.

I don't understand what you mean - is the problem with Sort.MATCHER or Sort.DEF missing the support for the matcher case? Or both? I'm kind of fumbling around in the dark on stuff like Sort.MATCHER - I can tell that it is the normal way to add support for the casing but I'm not sure if there are consequences I hadn't intended.

rmuir · 2016-06-14T16:07:01Z

I think the coersion to boolean will be an issue. I know this is a groovy thing, but we have to be careful here. We can't just easily treat Matcher as a "boxed type" for boolean without a lot of (possibly slow) things. Currently on the dynamic side, we let the JVM take care of all these conversions... I do really think this is the way we should do it :)

Is it possible at first to have one operator that returns boolean and the other that returns Matcher just to keep things simple?

nik9000 · 2016-06-14T16:09:09Z

Is it possible at first to have one operator that returns boolean and the other that returns Matcher just to keep things simple?

Yeah, I can back out the coercion stuff.

uschindler · 2016-06-14T16:30:19Z

Yeah, I can back out the coercion stuff.

Then we don't need Sort.MATCHER anymore!

nik9000 · 2016-06-14T16:33:06Z

Then we don't need Sort.MATCHER anymore!

I think the idea is that we can get the operators in easy but will have to think more about the coercion. I still want it, or, if we can't have it then I should rework the operators some. Either way is fine with me and this is a good incremental improvement without the coercion.

uschindler · 2016-06-14T16:40:11Z

I can tell that it is the normal way to add support for the casing but I'm not sure if there are consequences I hadn't intended.

It was just a warning! I don't like how Sort is used throughout the code. At some places it is used to check a type which is wrong. We just have to check all the if statements and verify that we don't miss correct checks. E.g. if we have a check somewhere sort == Sort.Object this won't work anymore for Matchers, although they are Objects. That was my warning! I just hate the usage of Sort. I'd like to have marker interfaces on the types and do instanceof checks instead of a Sort enum.

nik9000 · 2016-06-14T16:42:54Z

That was my warning!

Now I understand! Thanks!

rmuir · 2016-06-14T17:16:52Z

I think the idea is that we can get the operators in easy but will have to think more about the coercion. I still want it, or, if we can't have it then I should rework the operators some. Either way is fine with me and this is a good incremental improvement without the coercion.

If i could see it working in a way right now that didn't involve a runtime check for every conditional, i would try to work something up. But I am currently afraid that is the only solution for the dynamic case, if we want to treat Matcher as boolean (and we are very sparse on type information, so thats the general case right now: lots of def). Right now we avoid this, we only use MethodHandles asType and it seems fast :)

I do really think we should try to avoid this coersion feature in general, i think its an anti-feature of groovy. I know things like java.util.Maps can be treated as a boolean there, but that's very ambiguous.

nik9000 · 2016-06-14T17:22:00Z

I do really think we should try to avoid this coersion feature in general, i think its an anti-feature of groovy. I know things like java.util.Maps can be treated as a boolean there, but that's very ambiguous.

What if we have =~ just return a boolean as well. That way it can still be the "find" operator. Then if we really want to get more groovy we can work up the coercion stuff later. And if we don't we can just use /foo/.matcher("bar") when we want the actual matcher?

rjernst · 2016-06-14T17:23:08Z

I think =~ and ==~ should both just return boolean. If someone wants to get a matcher, they can call Pattern functions. But the simple case of having a condition on "find this pattern in a string" should work:

if (doc["myfield] =~ /hello/) {

nik9000 · 2016-06-14T18:29:31Z

I think =~ and ==~ should both just return boolean.

I did it!

clintongormley · 2016-06-14T18:56:03Z

I haven't quite followed the conversation above so apologies in advance, but I want one of two things from a regex:

does it match (boolean)
give me the captures (array)

@nik9000 tells me that find finds the pattern inside a string and match ensures that the whole string matches. The second one seems redundant to me; if I want the whole string to match I'll just use $ ^ anchors.

I'd much rather have one operator which returns a boolean and another operator which returns an array of captures.

nik9000 · 2016-06-14T19:04:17Z

returns an array of captures

It'd be a java Matcher which has methods like .group for the whole match and .group(2) for second group, etc.

The second one seems redundant to me; if I want the whole string to match I'll just use $ ^ anchors.

They map to two different methods in java and it looks like matches is going to be faster than find with anchors. OTOH we could actually measure it and optimize it on our side if it comes out to be faster. As in we could detect that there are anchors on both sides of the pattern and the pattern isn't in multiline mode then we could strip them and use matches....

Just so we're clear, if we get what we want the only syntax that gets you matches is:

Matcher m = /foo/i.matcher('cat');

or

Pattern p = /foo/i;
Matcher m = p.matcher('cat');

nik9000 · 2016-06-14T19:04:46Z

what we want

I guess I should rephrase that to "this PR as is and an as yet unwritten flags PR".

rmuir · 2016-06-14T19:25:51Z

Just so we're clear, if we get what we want the only syntax that gets you matches is:

This isn't necessarily the only syntax. Its the only thing we have today.

But we could work on things like method augmentation. Imagine giving CharSequence a few of the ones it has in groovy (http://docs.groovy-lang.org/latest/html/groovy-jdk/java/lang/CharSequence.html#findAll%28java.util.regex.Pattern%29)

so e.g. "foobar".findAll(/something/) would return a list. They have others.

I am not saying we should do this, it needs a lot of thought, .e.g i dont want to maintain tons of shitty string functions either, at the sametime distrustful of groovy code that e.g. adds an execute method to List and other horrible things they do. Nor do I really want to do our own thing and document any kind of "api", maybe just a key subset (e.g. regex and similar features on CharSequence)...

Just something to think about, it avoids the downsides of performance tradeoffs and confusion due to operator explosion.

uschindler · 2016-06-14T21:12:22Z

I think =~ and ==~ should both just return boolean.

+1 for that and thanks for cleaning up the PR. Looks much better to me.

I agree with @nik9000 and @rmuir: Users that really want to use the Matcher object should simply call .matcher() on the pattern. If they need a matcher they will also want to get the groups and so on. In that case a simple syntax using ~= is contraprocuctive, because them people tend to do horrible shit like ('a'=~/foo/).group(n) over and over for every group. With having to call matcher they see that there is actually some object returned that can be stored and then group() called on the local stored matcher.

rmuir · 2016-06-14T21:30:41Z

modules/lang-painless/src/main/resources/org/elasticsearch/painless/java.util.regex.txt

we don't allow method signature overriding in this way, intentionally. Otherwise things would be much slower and more ambiguous...

That is probably ok here but seems a bit breaky given that java allows it and we're exposing java APIs. We can just say that we don't support named groups and that'd be OK here but I have a suspicion something else will come up.

its not breaking anything: we are experimental and can do things correctly. You can have one call per arity, and it means there is no ambiguity of what method is being called. We have whitelisted over 10 of the most common java packages and this "java overloading" only impacts a rare few, which typically have another way to do it that is not ambiguous.

For a dynamic language this overloading is a big no-no, you have no idea what method is being called! What if you did new StringBuilder(doc[field].fieldA).append(doc[field].fieldB), and then you have a mappings bug in one index or something, such that fieldA is an int? Say with a large value.

Furthermore, what java does is extremely complicated. Just look at this: https://docs.oracle.com/javase/specs/jls/se8/html/jls-15.html#jls-15.12.2.1 Be sure to reply back before you have torn out your eyes.

For us its just an int

This also means the arity of the callsite is fixed (e.g. String.xxx/3). This is important for performance, it means for method calls we only need one guardWithTest on the receiver, versus a shit ton (and then possibility of inline caches degrading). It also is the only reason practically that method references and lambdas can work for us with a def type, without doing something slow.

The whitelisting is done carefully so it wont be a problem. For example just whitelisting String.valueOf(Object) and knowing that will work fine.

Sorry, i dont see this as something practical to change. I know that groovy does a lot of ambiguous, slow things, you probably have no clue what method is being called in many of these situations or how slow that is. That does not mean we need to make bad tradeoffs.

I'm fine with it because we have the aliases. I've since exposed this as namedGroup/1 and that is fine by me. If we end up keeping this forever (and it sounds like that is a good idea) then it'll always be a funny wart we explain away as being very useful for dynamic performance. Hell, we could just point people to this post to explain it.

uschindler · 2016-06-14T21:46:55Z

If we need named groups we can add an alias with different name. Like namedGroup(String)

nik9000 · 2016-06-14T21:53:24Z

If we need named groups we can add an alias with different name. Like namedGroup(String)

I'll have a look and see if that is already a thing then I'll set it up.

uschindler · 2016-06-14T22:27:41Z

See getLength() as alias for size() on lists. We need this to support length property

nik9000 · 2016-06-14T22:50:24Z

See getLength() as alias for size() on lists. We need this to support length property

Got it!

nik9000 · 2016-06-15T17:35:27Z

Where've we landed on this? Is this the way we want to go or not? Should I change it, merge it, wait for more review?

nik9000 · 2016-06-15T19:18:51Z

Where've we landed on this? Is this the way we want to go or not? Should I change it, merge it, wait for more review?

I only ask because hacking on painless is fun and I'm excited to go and do flags and don't want to deal with the merge conflicts I'd get if I did it now. Basically I'm excited and lazy at the same time.

jdconrad · 2016-06-15T19:20:05Z

modules/lang-painless/src/main/java/org/elasticsearch/painless/node/EBinary.java

Definitely for a different PR, but I think we can fix this by making the String constant in ALink an Object constant instead.

jdconrad · 2016-06-15T19:21:45Z

I'm +1 to these changes. @rmuir or @uschindler did either of you have any other concerns?

jdconrad · 2016-06-16T01:00:43Z

@nik9000 Please go ahead and merge this. I don't want you to have to deal with any serious conflicts, and if something comes up we can make changes moving forward. Thanks again for the work here!

uschindler · 2016-06-16T10:09:58Z

LGTM

Adds support for the find operator (=~) and the match operator (==~) to painless's regexes. Also whitelists most of the Matcher class and documents regex support in painless. The find operator (=~) returns a boolean that is the result of building a matcher on the lhs with the Pattern on the RHS and calling `find` on it. Use it like this: ``` if (ctx._source.last =~ /b/) ``` The match operator (==~) returns boolean like find but instead of calling `find` on the Matcher it calls `matches`. ``` if (ctx._source.last ==~ /[^aeiou].*[aeiou]/) ``` Finally, if you want the actual matcher you do: ``` Matcher m = /[aeiou]/.matcher(ctx._source.last) ```

nik9000 added review v5.0.0-alpha4 labels Jun 14, 2016

jdconrad mentioned this pull request Jun 14, 2016

Ongoing Painless Improvements #17992

Closed

18 tasks

jdconrad reviewed Jun 14, 2016
View reviewed changes

rmuir reviewed Jun 14, 2016
View reviewed changes

jdconrad reviewed Jun 15, 2016
View reviewed changes

nik9000 force-pushed the painless_matcher_builder branch 2 times, most recently from 7d74a4e to fd349ff Compare June 16, 2016 12:42

nik9000 force-pushed the painless_matcher_builder branch from fd349ff to 8d3ef74 Compare June 16, 2016 12:42

nik9000 merged commit 8d3ef74 into elastic:master Jun 16, 2016

clintongormley added the >enhancement label Jun 17, 2016

clintongormley added :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache and removed :Plugin Lang Painless labels Feb 14, 2018

Conversation

nik9000 commented Jun 14, 2016

Uh oh!

jdconrad Jun 14, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jdconrad commented Jun 14, 2016

Uh oh!

jdconrad commented Jun 14, 2016

Uh oh!

uschindler commented Jun 14, 2016

Uh oh!

nik9000 commented Jun 14, 2016

Uh oh!

rmuir commented Jun 14, 2016

Uh oh!

nik9000 commented Jun 14, 2016

Uh oh!

uschindler commented Jun 14, 2016

Uh oh!

nik9000 commented Jun 14, 2016

Uh oh!

uschindler commented Jun 14, 2016

Uh oh!

nik9000 commented Jun 14, 2016

Uh oh!

rmuir commented Jun 14, 2016

Uh oh!

nik9000 commented Jun 14, 2016

Uh oh!

rjernst commented Jun 14, 2016

Uh oh!

nik9000 commented Jun 14, 2016

Uh oh!

clintongormley commented Jun 14, 2016

Uh oh!

nik9000 commented Jun 14, 2016

Uh oh!

nik9000 commented Jun 14, 2016

Uh oh!

rmuir commented Jun 14, 2016

Uh oh!

uschindler commented Jun 14, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

uschindler commented Jun 14, 2016

Uh oh!

nik9000 commented Jun 14, 2016

Uh oh!

uschindler commented Jun 14, 2016

Uh oh!

nik9000 commented Jun 14, 2016

Uh oh!

nik9000 commented Jun 15, 2016

Uh oh!

nik9000 commented Jun 15, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jdconrad commented Jun 15, 2016

Uh oh!

jdconrad commented Jun 16, 2016

Uh oh!

uschindler commented Jun 16, 2016

jdconrad Jun 14, 2016 •

edited

Loading