Add support for the find operator (=~) and the match operator (==~)#18858
Add support for the find operator (=~) and the match operator (==~)#18858nik9000 merged 1 commit intoelastic:masterfrom
Conversation
There was a problem hiding this comment.
I'm no expert on regexes, so my question here is can you have a script where you do something like boolean x = (boolean)someMatcher;? If not, this isn't the correct place for this. My gut feeling is this should probably be part of the EComp node or its own node.
There was a problem hiding this comment.
I think this should be valid: return (boolean)("foo" =~ /foo/)
There was a problem hiding this comment.
("foo" =~ /foo/) What type does this return to begin with?
There was a problem hiding this comment.
Got it. All right, that seems good then. Thanks for the answers.
|
@nik9000 This looks pretty good to me. One more serious comment about functionality and one minor follow up question. |
|
@nik9000 +1 Thanks for the follow up answers. |
|
Syntax looks file to me. Question about DEFs: If you have assigned a pattern to def, the boolean coercion and operators won't work, or does it? I am a bit afraid about the new Sort.MATCHER type. Unless this breaks hardcoded checks I am fine with it, but we have a lot of code like |
Probably not at this point but it should.... I'll have a look.
I don't understand what you mean - is the problem with Sort.MATCHER or Sort.DEF missing the support for the matcher case? Or both? I'm kind of fumbling around in the dark on stuff like Sort.MATCHER - I can tell that it is the normal way to add support for the casing but I'm not sure if there are consequences I hadn't intended. |
|
I think the coersion to boolean will be an issue. I know this is a groovy thing, but we have to be careful here. We can't just easily treat Matcher as a "boxed type" for boolean without a lot of (possibly slow) things. Currently on the dynamic side, we let the JVM take care of all these conversions... I do really think this is the way we should do it :) Is it possible at first to have one operator that returns boolean and the other that returns Matcher just to keep things simple? |
Yeah, I can back out the coercion stuff. |
Then we don't need Sort.MATCHER anymore! |
I think the idea is that we can get the operators in easy but will have to think more about the coercion. I still want it, or, if we can't have it then I should rework the operators some. Either way is fine with me and this is a good incremental improvement without the coercion. |
It was just a warning! I don't like how Sort is used throughout the code. At some places it is used to check a type which is wrong. We just have to check all the if statements and verify that we don't miss correct checks. E.g. if we have a check somewhere |
Now I understand! Thanks! |
If i could see it working in a way right now that didn't involve a runtime check for every conditional, i would try to work something up. But I am currently afraid that is the only solution for the dynamic case, if we want to treat Matcher as boolean (and we are very sparse on type information, so thats the general case right now: lots of I do really think we should try to avoid this coersion feature in general, i think its an anti-feature of groovy. I know things like java.util.Maps can be treated as a boolean there, but that's very ambiguous. |
What if we have |
|
I think |
I did it! |
|
I haven't quite followed the conversation above so apologies in advance, but I want one of two things from a regex:
@nik9000 tells me that I'd much rather have one operator which returns a boolean and another operator which returns an array of captures. |
It'd be a java
They map to two different methods in java and it looks like Just so we're clear, if we get what we want the only syntax that gets you matches is: or |
I guess I should rephrase that to "this PR as is and an as yet unwritten flags PR". |
This isn't necessarily the only syntax. Its the only thing we have today. But we could work on things like method augmentation. Imagine giving CharSequence a few of the ones it has in groovy (http://docs.groovy-lang.org/latest/html/groovy-jdk/java/lang/CharSequence.html#findAll%28java.util.regex.Pattern%29) so e.g. I am not saying we should do this, it needs a lot of thought, .e.g i dont want to maintain tons of shitty string functions either, at the sametime distrustful of groovy code that e.g. adds an Just something to think about, it avoids the downsides of performance tradeoffs and confusion due to operator explosion. |
+1 for that and thanks for cleaning up the PR. Looks much better to me. I agree with @nik9000 and @rmuir: Users that really want to use the Matcher object should simply call |
There was a problem hiding this comment.
we don't allow method signature overriding in this way, intentionally. Otherwise things would be much slower and more ambiguous...
There was a problem hiding this comment.
That is probably ok here but seems a bit breaky given that java allows it and we're exposing java APIs. We can just say that we don't support named groups and that'd be OK here but I have a suspicion something else will come up.
There was a problem hiding this comment.
its not breaking anything: we are experimental and can do things correctly. You can have one call per arity, and it means there is no ambiguity of what method is being called. We have whitelisted over 10 of the most common java packages and this "java overloading" only impacts a rare few, which typically have another way to do it that is not ambiguous.
For a dynamic language this overloading is a big no-no, you have no idea what method is being called! What if you did new StringBuilder(doc[field].fieldA).append(doc[field].fieldB), and then you have a mappings bug in one index or something, such that fieldA is an int? Say with a large value.
Furthermore, what java does is extremely complicated. Just look at this: https://docs.oracle.com/javase/specs/jls/se8/html/jls-15.html#jls-15.12.2.1 Be sure to reply back before you have torn out your eyes.
For us its just an int
This also means the arity of the callsite is fixed (e.g. String.xxx/3). This is important for performance, it means for method calls we only need one guardWithTest on the receiver, versus a shit ton (and then possibility of inline caches degrading). It also is the only reason practically that method references and lambdas can work for us with a def type, without doing something slow.
The whitelisting is done carefully so it wont be a problem. For example just whitelisting String.valueOf(Object) and knowing that will work fine.
Sorry, i dont see this as something practical to change. I know that groovy does a lot of ambiguous, slow things, you probably have no clue what method is being called in many of these situations or how slow that is. That does not mean we need to make bad tradeoffs.
There was a problem hiding this comment.
I'm fine with it because we have the aliases. I've since exposed this as namedGroup/1 and that is fine by me. If we end up keeping this forever (and it sounds like that is a good idea) then it'll always be a funny wart we explain away as being very useful for dynamic performance. Hell, we could just point people to this post to explain it.
|
If we need named groups we can add an alias with different name. Like namedGroup(String) |
I'll have a look and see if that is already a thing then I'll set it up. |
|
See getLength() as alias for size() on lists. We need this to support length property |
Got it! |
|
Where've we landed on this? Is this the way we want to go or not? Should I change it, merge it, wait for more review? |
I only ask because hacking on painless is fun and I'm excited to go and do flags and don't want to deal with the merge conflicts I'd get if I did it now. Basically I'm excited and lazy at the same time. |
There was a problem hiding this comment.
Definitely for a different PR, but I think we can fix this by making the String constant in ALink an Object constant instead.
|
I'm +1 to these changes. @rmuir or @uschindler did either of you have any other concerns? |
|
@nik9000 Please go ahead and merge this. I don't want you to have to deal with any serious conflicts, and if something comes up we can make changes moving forward. Thanks again for the work here! |
|
LGTM |
7d74a4e to
fd349ff
Compare
Adds support for the find operator (=~) and the match operator (==~) to painless's regexes. Also whitelists most of the Matcher class and documents regex support in painless. The find operator (=~) returns a boolean that is the result of building a matcher on the lhs with the Pattern on the RHS and calling `find` on it. Use it like this: ``` if (ctx._source.last =~ /b/) ``` The match operator (==~) returns boolean like find but instead of calling `find` on the Matcher it calls `matches`. ``` if (ctx._source.last ==~ /[^aeiou].*[aeiou]/) ``` Finally, if you want the actual matcher you do: ``` Matcher m = /[aeiou]/.matcher(ctx._source.last) ```
fd349ff to
8d3ef74
Compare
Adds support for the find operator (
=~) and the match operator (==~) to painless's regexes. Also whitelists most of the Pattern class and documents regex support in painless.The find operator (
=~) constructs a matcher. The operator is called "find" because matchers coerce to booleans by callingMatcher.find. So this is valid:As is this:
The match operator (
==~) returns a boolean, true if the lhsMatcher.matches the rhs. So this is valid:All three of these things (find operator, match operator, Matcher -> boolean coercion) are "borrowed" from groovy. Groovy has lots more useful stuff around regexes but this is a good start.