Skip to content

Add support for the find operator (=~) and the match operator (==~)#18858

Merged
nik9000 merged 1 commit intoelastic:masterfrom
nik9000:painless_matcher_builder
Jun 16, 2016
Merged

Add support for the find operator (=~) and the match operator (==~)#18858
nik9000 merged 1 commit intoelastic:masterfrom
nik9000:painless_matcher_builder

Conversation

@nik9000
Copy link
Copy Markdown
Member

@nik9000 nik9000 commented Jun 14, 2016

Adds support for the find operator (=~) and the match operator (==~) to painless's regexes. Also whitelists most of the Pattern class and documents regex support in painless.

The find operator (=~) constructs a matcher. The operator is called "find" because matchers coerce to booleans by calling Matcher.find. So this is valid:

if (ctx._source.last =~ /b/)

As is this:

Matcher m = ctx._source.last =~ /[aeiou]/

The match operator (==~) returns a boolean, true if the lhs Matcher.matches the rhs. So this is valid:

if (ctx._source.last ==~ /^[^aeiou].*[aeiou]$/)

All three of these things (find operator, match operator, Matcher -> boolean coercion) are "borrowed" from groovy. Groovy has lots more useful stuff around regexes but this is a good start.

Copy link
Copy Markdown
Contributor

@jdconrad jdconrad Jun 14, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm no expert on regexes, so my question here is can you have a script where you do something like boolean x = (boolean)someMatcher;? If not, this isn't the correct place for this. My gut feeling is this should probably be part of the EComp node or its own node.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be valid: return (boolean)("foo" =~ /foo/)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

("foo" =~ /foo/) What type does this return to begin with?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matcher

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. All right, that seems good then. Thanks for the answers.

@jdconrad
Copy link
Copy Markdown
Contributor

@nik9000 This looks pretty good to me. One more serious comment about functionality and one minor follow up question.

@jdconrad
Copy link
Copy Markdown
Contributor

@nik9000 +1 Thanks for the follow up answers.

@uschindler
Copy link
Copy Markdown
Contributor

Syntax looks file to me.

Question about DEFs: If you have assigned a pattern to def, the boolean coercion and operators won't work, or does it?

I am a bit afraid about the new Sort.MATCHER type. Unless this breaks hardcoded checks I am fine with it, but we have a lot of code like type.sort == Sort.DEF or similar. We should check those if statements that they don't forget to also handle MATCHER when they expect an object.

@nik9000
Copy link
Copy Markdown
Member Author

nik9000 commented Jun 14, 2016

Question about DEFs: If you have assigned a pattern to def, the boolean coercion and operators won't work, or does it?

Probably not at this point but it should.... I'll have a look.

I am a bit afraid about the new Sort.MATCHER type. Unless this breaks hardcoded checks I am fine with it, but we have a lot of code like type.sort == Sort.DEF or similar. We should check those if statements that they don't forget to also handle MATCHER when they expect an object.

I don't understand what you mean - is the problem with Sort.MATCHER or Sort.DEF missing the support for the matcher case? Or both? I'm kind of fumbling around in the dark on stuff like Sort.MATCHER - I can tell that it is the normal way to add support for the casing but I'm not sure if there are consequences I hadn't intended.

@rmuir
Copy link
Copy Markdown
Contributor

rmuir commented Jun 14, 2016

I think the coersion to boolean will be an issue. I know this is a groovy thing, but we have to be careful here. We can't just easily treat Matcher as a "boxed type" for boolean without a lot of (possibly slow) things. Currently on the dynamic side, we let the JVM take care of all these conversions... I do really think this is the way we should do it :)

Is it possible at first to have one operator that returns boolean and the other that returns Matcher just to keep things simple?

@nik9000
Copy link
Copy Markdown
Member Author

nik9000 commented Jun 14, 2016

Is it possible at first to have one operator that returns boolean and the other that returns Matcher just to keep things simple?

Yeah, I can back out the coercion stuff.

@uschindler
Copy link
Copy Markdown
Contributor

Yeah, I can back out the coercion stuff.

Then we don't need Sort.MATCHER anymore!

@nik9000
Copy link
Copy Markdown
Member Author

nik9000 commented Jun 14, 2016

Then we don't need Sort.MATCHER anymore!

I think the idea is that we can get the operators in easy but will have to think more about the coercion. I still want it, or, if we can't have it then I should rework the operators some. Either way is fine with me and this is a good incremental improvement without the coercion.

@uschindler
Copy link
Copy Markdown
Contributor

I can tell that it is the normal way to add support for the casing but I'm not sure if there are consequences I hadn't intended.

It was just a warning! I don't like how Sort is used throughout the code. At some places it is used to check a type which is wrong. We just have to check all the if statements and verify that we don't miss correct checks. E.g. if we have a check somewhere sort == Sort.Object this won't work anymore for Matchers, although they are Objects. That was my warning! I just hate the usage of Sort. I'd like to have marker interfaces on the types and do instanceof checks instead of a Sort enum.

@nik9000
Copy link
Copy Markdown
Member Author

nik9000 commented Jun 14, 2016

That was my warning!

Now I understand! Thanks!

@rmuir
Copy link
Copy Markdown
Contributor

rmuir commented Jun 14, 2016

I think the idea is that we can get the operators in easy but will have to think more about the coercion. I still want it, or, if we can't have it then I should rework the operators some. Either way is fine with me and this is a good incremental improvement without the coercion.

If i could see it working in a way right now that didn't involve a runtime check for every conditional, i would try to work something up. But I am currently afraid that is the only solution for the dynamic case, if we want to treat Matcher as boolean (and we are very sparse on type information, so thats the general case right now: lots of def). Right now we avoid this, we only use MethodHandles asType and it seems fast :)

I do really think we should try to avoid this coersion feature in general, i think its an anti-feature of groovy. I know things like java.util.Maps can be treated as a boolean there, but that's very ambiguous.

@nik9000
Copy link
Copy Markdown
Member Author

nik9000 commented Jun 14, 2016

I do really think we should try to avoid this coersion feature in general, i think its an anti-feature of groovy. I know things like java.util.Maps can be treated as a boolean there, but that's very ambiguous.

What if we have =~ just return a boolean as well. That way it can still be the "find" operator. Then if we really want to get more groovy we can work up the coercion stuff later. And if we don't we can just use /foo/.matcher("bar") when we want the actual matcher?

@rjernst
Copy link
Copy Markdown
Member

rjernst commented Jun 14, 2016

I think =~ and ==~ should both just return boolean. If someone wants to get a matcher, they can call Pattern functions. But the simple case of having a condition on "find this pattern in a string" should work:

if (doc["myfield] =~ /hello/) {

@nik9000
Copy link
Copy Markdown
Member Author

nik9000 commented Jun 14, 2016

I think =~ and ==~ should both just return boolean.

I did it!

@clintongormley
Copy link
Copy Markdown
Contributor

I haven't quite followed the conversation above so apologies in advance, but I want one of two things from a regex:

  • does it match (boolean)
  • give me the captures (array)

@nik9000 tells me that find finds the pattern inside a string and match ensures that the whole string matches. The second one seems redundant to me; if I want the whole string to match I'll just use $ ^ anchors.

I'd much rather have one operator which returns a boolean and another operator which returns an array of captures.

@nik9000
Copy link
Copy Markdown
Member Author

nik9000 commented Jun 14, 2016

returns an array of captures

It'd be a java Matcher which has methods like .group for the whole match and .group(2) for second group, etc.

The second one seems redundant to me; if I want the whole string to match I'll just use $ ^ anchors.

They map to two different methods in java and it looks like matches is going to be faster than find with anchors. OTOH we could actually measure it and optimize it on our side if it comes out to be faster. As in we could detect that there are anchors on both sides of the pattern and the pattern isn't in multiline mode then we could strip them and use matches....

Just so we're clear, if we get what we want the only syntax that gets you matches is:

Matcher m = /foo/i.matcher('cat');

or

Pattern p = /foo/i;
Matcher m = p.matcher('cat');

@nik9000
Copy link
Copy Markdown
Member Author

nik9000 commented Jun 14, 2016

what we want

I guess I should rephrase that to "this PR as is and an as yet unwritten flags PR".

@rmuir
Copy link
Copy Markdown
Contributor

rmuir commented Jun 14, 2016

Just so we're clear, if we get what we want the only syntax that gets you matches is:

This isn't necessarily the only syntax. Its the only thing we have today.

But we could work on things like method augmentation. Imagine giving CharSequence a few of the ones it has in groovy (http://docs.groovy-lang.org/latest/html/groovy-jdk/java/lang/CharSequence.html#findAll%28java.util.regex.Pattern%29)

so e.g. "foobar".findAll(/something/) would return a list. They have others.

I am not saying we should do this, it needs a lot of thought, .e.g i dont want to maintain tons of shitty string functions either, at the sametime distrustful of groovy code that e.g. adds an execute method to List and other horrible things they do. Nor do I really want to do our own thing and document any kind of "api", maybe just a key subset (e.g. regex and similar features on CharSequence)...

Just something to think about, it avoids the downsides of performance tradeoffs and confusion due to operator explosion.

@uschindler
Copy link
Copy Markdown
Contributor

I think =~ and ==~ should both just return boolean.

+1 for that and thanks for cleaning up the PR. Looks much better to me.

I agree with @nik9000 and @rmuir: Users that really want to use the Matcher object should simply call .matcher() on the pattern. If they need a matcher they will also want to get the groups and so on. In that case a simple syntax using ~= is contraprocuctive, because them people tend to do horrible shit like ('a'=~/foo/).group(n) over and over for every group. With having to call matcher they see that there is actually some object returned that can be stored and then group() called on the local stored matcher.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't allow method signature overriding in this way, intentionally. Otherwise things would be much slower and more ambiguous...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is probably ok here but seems a bit breaky given that java allows it and we're exposing java APIs. We can just say that we don't support named groups and that'd be OK here but I have a suspicion something else will come up.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its not breaking anything: we are experimental and can do things correctly. You can have one call per arity, and it means there is no ambiguity of what method is being called. We have whitelisted over 10 of the most common java packages and this "java overloading" only impacts a rare few, which typically have another way to do it that is not ambiguous.

For a dynamic language this overloading is a big no-no, you have no idea what method is being called! What if you did new StringBuilder(doc[field].fieldA).append(doc[field].fieldB), and then you have a mappings bug in one index or something, such that fieldA is an int? Say with a large value.

Furthermore, what java does is extremely complicated. Just look at this: https://docs.oracle.com/javase/specs/jls/se8/html/jls-15.html#jls-15.12.2.1 Be sure to reply back before you have torn out your eyes.

For us its just an int

This also means the arity of the callsite is fixed (e.g. String.xxx/3). This is important for performance, it means for method calls we only need one guardWithTest on the receiver, versus a shit ton (and then possibility of inline caches degrading). It also is the only reason practically that method references and lambdas can work for us with a def type, without doing something slow.

The whitelisting is done carefully so it wont be a problem. For example just whitelisting String.valueOf(Object) and knowing that will work fine.

Sorry, i dont see this as something practical to change. I know that groovy does a lot of ambiguous, slow things, you probably have no clue what method is being called in many of these situations or how slow that is. That does not mean we need to make bad tradeoffs.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with it because we have the aliases. I've since exposed this as namedGroup/1 and that is fine by me. If we end up keeping this forever (and it sounds like that is a good idea) then it'll always be a funny wart we explain away as being very useful for dynamic performance. Hell, we could just point people to this post to explain it.

@uschindler
Copy link
Copy Markdown
Contributor

If we need named groups we can add an alias with different name. Like namedGroup(String)

@nik9000
Copy link
Copy Markdown
Member Author

nik9000 commented Jun 14, 2016

If we need named groups we can add an alias with different name. Like namedGroup(String)

I'll have a look and see if that is already a thing then I'll set it up.

@uschindler
Copy link
Copy Markdown
Contributor

See getLength() as alias for size() on lists. We need this to support length property

@nik9000
Copy link
Copy Markdown
Member Author

nik9000 commented Jun 14, 2016

See getLength() as alias for size() on lists. We need this to support length property

Got it!

@nik9000
Copy link
Copy Markdown
Member Author

nik9000 commented Jun 15, 2016

Where've we landed on this? Is this the way we want to go or not? Should I change it, merge it, wait for more review?

@nik9000
Copy link
Copy Markdown
Member Author

nik9000 commented Jun 15, 2016

Where've we landed on this? Is this the way we want to go or not? Should I change it, merge it, wait for more review?

I only ask because hacking on painless is fun and I'm excited to go and do flags and don't want to deal with the merge conflicts I'd get if I did it now. Basically I'm excited and lazy at the same time.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely for a different PR, but I think we can fix this by making the String constant in ALink an Object constant instead.

@jdconrad
Copy link
Copy Markdown
Contributor

I'm +1 to these changes. @rmuir or @uschindler did either of you have any other concerns?

@jdconrad
Copy link
Copy Markdown
Contributor

@nik9000 Please go ahead and merge this. I don't want you to have to deal with any serious conflicts, and if something comes up we can make changes moving forward. Thanks again for the work here!

@uschindler
Copy link
Copy Markdown
Contributor

LGTM

@nik9000 nik9000 force-pushed the painless_matcher_builder branch 2 times, most recently from 7d74a4e to fd349ff Compare June 16, 2016 12:42
Adds support for the find operator (=~) and the match operator (==~)
to painless's regexes. Also whitelists most of the Matcher class and
documents regex support in painless.

The find operator (=~) returns a boolean that is the result of building
a matcher on the lhs with the Pattern on the RHS and calling `find` on
it. Use it like this:

```
if (ctx._source.last =~ /b/)
```

The match operator (==~) returns boolean like find but instead of calling
`find` on the Matcher it calls `matches`.

```
if (ctx._source.last ==~ /[^aeiou].*[aeiou]/)
```

Finally, if you want the actual matcher you do:

```
Matcher m = /[aeiou]/.matcher(ctx._source.last)
```
@nik9000 nik9000 force-pushed the painless_matcher_builder branch from fd349ff to 8d3ef74 Compare June 16, 2016 12:42
@nik9000 nik9000 merged commit 8d3ef74 into elastic:master Jun 16, 2016
@clintongormley clintongormley added :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache and removed :Plugin Lang Painless labels Feb 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Core/Infra/Scripting Scripting abstractions, Painless, and Mustache >enhancement v5.0.0-alpha4

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants