ESQL: Rewrite TO_UPPER/TO_LOWER comparisons#118870
ESQL: Rewrite TO_UPPER/TO_LOWER comparisons#118870elasticsearchmachine merged 10 commits intoelastic:mainfrom
Conversation
This adds an optimization rule to rewrite TO_UPPER/TO_LOWER comparisons against a string into an InsensitiveEquals comparison. The rewrite can also result right away into a TRUE/FALSE, in case the string doesn't match the caseness of the function. This also allows later pushing down the predicate to lucene as a case-insensitive term-query.
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
|
Hi @bpintea, I've created a changelog YAML for you. |
|
|
||
| /** Similar to {@link String#strip()}, but removes the WS throughout the entire string. */ | ||
| public static String stripThrough(String input) { | ||
| return WS_PATTERN.matcher(input).replaceAll(StringUtils.EMPTY); |
There was a problem hiding this comment.
As an alternative to avoid the pattern matching, use Strings.deleteAny(input, " ") or Strings.replace(input, " ", "")
There was a problem hiding this comment.
The \s class will match more ([ \t\n\x0B\f\r]). I use it to compare (multiline, "randomly" spaced) JSONs.
| var foldedRight = BytesRefs.toString(bc.right().fold()); | ||
| return changeCase.caseType().matchesCase(foldedRight) | ||
| ? new InsensitiveEquals(bc.source(), changeCase.field(), bc.right()) | ||
| : Literal.FALSE; |
There was a problem hiding this comment.
Use Literal.of(bc, Boolean.FALSE) to preserve the original source.
Additionally, see whether it makes sense to add a warning when dealing with a literal (WHERE TO_UPPER(field) == "lowercase").
There was a problem hiding this comment.
You could add an additional check to remove any nested casing function since only the top level one counts:
to_upper(to_lower(to_upper(field))) is the same as to_upper(field).
There was a problem hiding this comment.
You could add an additional check to remove any nested casing function
Added.
It's basic; ideally, we'd be able to transform TO_UPPER(CONCAT(TO_LOWER(field), "foo")) to just TO_UPPER(CONCAT(field, "foo")), but not TO_UPPER((CONCAT(TO_LOWER(field), "-foo") == "x-foo")::KEYWORD). But maybe another PR.
There was a problem hiding this comment.
see whether it makes sense to add a warning when dealing with a literal (WHERE TO_UPPER(field) == "lowercase")
We definitely can, but do we want then to expand a bit the scope and do this when other optimisations apply (folding to null or booleans), especially in filters?
There was a problem hiding this comment.
Make sure to add sorting or ignore order
Additionally add some tests with non-alphabetical strings such as 🐔✈🔥🎉👓
There was a problem hiding this comment.
I've added, both here and in the unit tests. The ones here are just functional, but we should probably add some unicode data in the CSV tests indices.
| import org.elasticsearch.xpack.esql.expression.predicate.operator.comparison.InsensitiveEquals; | ||
| import org.elasticsearch.xpack.esql.expression.predicate.operator.comparison.NotEquals; | ||
|
|
||
| public class ReplaceCaseWithInsensitiveMatch extends OptimizerRules.OptimizerExpressionRule<BinaryComparison> { |
There was a problem hiding this comment.
ReplaceStringCasingWithInsensitiveEquals (avoid using Match).
💚 Backport successful
|
This adds an optimization rule to rewrite TO_UPPER/TO_LOWER comparisons against a string into an InsensitiveEquals comparison. The rewrite can also result right away into a TRUE/FALSE, in case the string doesn't match the caseness of the function. This also allows later pushing down the predicate to lucene as a case-insensitive term-query. Fixes elastic#118304.
* ESQL: Rewrite TO_UPPER/TO_LOWER comparisons (#118870) This adds an optimization rule to rewrite TO_UPPER/TO_LOWER comparisons against a string into an InsensitiveEquals comparison. The rewrite can also result right away into a TRUE/FALSE, in case the string doesn't match the caseness of the function. This also allows later pushing down the predicate to lucene as a case-insensitive term-query. Fixes #118304. * Disable `TO_UPPER(null)`-tests prior to 8.17 (#119213) TO_UPPER/TO_LOWER resolution incorrectly returned child's type (that could also be `null`, type `NULL`), instead of KEYWORD/TEXT. So a test like `TO_UPPER(null) == "..."` fails on type mismatch. This was fixed collaterally by #114334 (8.17.0) Also, correct some of the tests skipping (that had however no impact, due to testing range). (cherry picked from commit edb3818)
This adds an optimization rule to rewrite TO_UPPER/TO_LOWER comparisons against a string into an InsensitiveEquals comparison. The rewrite can also result right away into a TRUE/FALSE, in case the string doesn't match the caseness of the function.
This also allows later pushing down the predicate to lucene as a case-insensitive term-query.
Fixes #118304.