Search - add case insensitive flag for "term" family of queries#61596
Search - add case insensitive flag for "term" family of queries#61596markharwood merged 2 commits intoelastic:masterfrom
Conversation
|
Pinging @elastic/es-search (:Search/Search) |
5b9a7bf to
cc70b86
Compare
server/src/main/java/org/elasticsearch/common/lucene/search/CaseInsensitiveAutomatonQuery.java
Outdated
Show resolved
Hide resolved
jimczi
left a comment
There was a problem hiding this comment.
I wonder if we should add new functions rather than parameters ?
Something like termQueryCI, prefixQueryCI and wildcardCI with a default implementation that throws a QueryShardException ?
cc70b86 to
c3ea3ff
Compare
I was following the approach of using parameters that we introduced in regexpQuery. |
For regexQuery I think it's different since it's only relevant for text and keyword field. Term queries are implemented by almost all field types so I think we can have at least a differentiation there ? |
ab4bdfe to
053aee1
Compare
|
I'm going to copy the regexp docs for this flag:
I'll not include the flag in the example query though (despite this helping improve test coverage). The reason is that I want to avoid promoting the use of this query-time flag if an index-time choice (eg using a normalizer) is typically the better option. I'm not clear if the above is the best wording or how to present the choices between index-time and query-time case sensitivity options. It seems a pain to dive into that detailed discussion on every type of query that supports the new case insensitive flag. |
828e907 to
0539abc
Compare
|
I made the changes you suggested, @jimczi I feel uneasy about query docs for the flag - it would be nice to point to somewhere central that outlines the trade offs for case insensitivity at query-time Vs index-time |
javanna
left a comment
There was a problem hiding this comment.
I have a naming concern, can we not use the CI acronym in the new method names? I find them hard to read, and makes it hard for readers to figure out what CI means in the context. I personally think that a slightly longer name is not a big problem.
server/src/main/java/org/elasticsearch/index/mapper/ConstantFieldType.java
Outdated
Show resolved
Hide resolved
29d895a to
b82974a
Compare
|
Thanks for the review, @javanna . I changed the name. |
jimczi
left a comment
There was a problem hiding this comment.
I left some comments, thanks for adding the distinction for term queries
server/src/main/java/org/elasticsearch/common/lucene/search/CaseInsensitiveAutomatonQuery.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/common/lucene/search/CaseInsensitivePrefixQuery.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/common/lucene/search/CaseInsensitiveTermQuery.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/common/lucene/search/CaseInsensitiveWildcardQuery.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/mapper/IndexFieldMapper.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/elasticsearch/xpack/runtimefields/mapper/ScriptBooleanMappedFieldType.java
Outdated
Show resolved
Hide resolved
...s/src/main/java/org/elasticsearch/xpack/runtimefields/query/StringScriptFieldTermsQuery.java
Outdated
Show resolved
Hide resolved
...s/src/main/java/org/elasticsearch/xpack/runtimefields/query/StringScriptFieldTermsQuery.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
I wonder what we should do here ? We'd need to add the support for case insensitive matching in TermInSetQuery if we want to avoid building giant boolean query in the keyword field. For now I think it's ok to not provide this option on terms query since we cannot support them efficiently ?
There was a problem hiding this comment.
@jimczi terms query is currently used by eql for more obvious queries like file where file_name in ("wininit.exe", "lsass.exe") and (less obvious) for cidrMatch function for matching an IP address in a list of CIDR blocks. Would be useful to have case insensitive support for it as well.
There was a problem hiding this comment.
(less obvious) for
cidrMatchfunction for matching an IP address in a list of CIDR blocks.
Neat! I just want to make sure you aren't expanding the cidr blocks or anything - we natively support the cidr match in the term and terms query.
There was a problem hiding this comment.
I just want to make sure you aren't expanding the cidr blocks or anything - we natively support the cidr match in the term and terms query.
We're not, we just pass the block to ES.
When dealing with an expression that forces us to do scripting, we use the underlying Lucene class for matching.
There was a problem hiding this comment.
Nice.
Hopefully you'll be able to use runtime field before too long!
There was a problem hiding this comment.
For now I think it's ok to not provide this option on terms query since we cannot support them efficiently ?
If we don't support it won't people just be forced to do what we do internally and create a bool should array of term queries?
Is there a query-complexity circuit-breaker we're somehow missing here?
There was a problem hiding this comment.
Adding a +1 to have this supported in terms query. The alternative in EQL, in case we decide to not support this, would probably be (as @markharwood mentioned) to create a bool query with a bunch of term queries in it. CC @jimczi
There was a problem hiding this comment.
My worry is that we'd internally create an array of should term queries because the TermsInSet query doesn't handle case insensitive queries. It's maybe not a big issue but that wouldn't be consistent since a case insensitive terms query would fail with more than 1024 terms (the max boolean clause limit) while the normal one would use the optimized TermsInSet query. So my take on this is that we should implement a proper support in TermInSet if we really want to handle terms case insensitive query. For EQL and SQL I don't think there's a real need to handle terms though. The in operator doesn't need to handle thousands of terms so imo that would be acceptable to translate it into an array of term query. Same for the cidrMatch function that also work on term queries.
There was a problem hiding this comment.
OK - I'll pull the termsQueryCaseInsensitive method for now. We can always add in another PR later. This PR probably has enough changes to consider already
0c80418 to
698b62a
Compare
|
I ran some benchmarks to see what effects this flag has on response times. Response Time (secs) for 1,000 queries on different fields in 2m doc index
Some of the observations this helped me draw out:
|
jimczi
left a comment
There was a problem hiding this comment.
The change looks good to me. Thanks for iterating on this. I'd prefer that we consider terms queries in a follow up since the current matching on keyword would not be able to leverage the TermsInSetQuery but I don't have strong feelings either. We can also optimize later so I let you decide on that aspect ;).
There was a problem hiding this comment.
simpleMatchCaseInsensitive ?
There was a problem hiding this comment.
That name might suggest the case insensitivity is built into the method (it's not).
I wanted something less explicit ("post-normalization"?) that suggested the strings were assumed to have already been normalised (which for argument's sake could have been uppercasing not lowercasing).
de8c3d2 to
8f9ad73
Compare
…, prefix, wildcard queries Closes elastic#61546
8f9ad73 to
685a419
Compare
|
test this please |
|
Removed |
|
In server/src/main/java/org/elasticsearch/common/lucene/search/AutomatonQueries.java toCaseInsensitiveChar function, for now it only works with ASCII characters. May I know why not support foreign characters like Vietnamese? It is not consist with keyword. |
|
@jypan0115 your reading of the code is correct, as the comment in the code points out, the "insensitive option" currently only works for ASCII characters. Your you mind opening an enhancement issue pointing out where this fails e.g. for Vietnamese? I'm not sure how casing is handled in that language, but having an example will help us determine the implementation effort and prioritize this. |
@cbuescher I opened an issue about this https://github.com/elastic/elasticsearch/issues/95120 And here is a fail example, we have a field called name. If we store it as keyword field and use case insensitive term query to search for Ngô Đức(Uppercase) or ngô đức(lowercase), it works fine. But if it is stored as wildcard field, it failed when I use case insensitive term query. This can be fixed by removing these 3 lines in server/src/main/java/org/elasticsearch/common/lucene/search/AutomatonQueries.java toCaseInsensitiveChar function. And that's why I am asking why we only support ASCII characters in wildcard field. Hope this helps. |
|
Thanks, I'll copy parts of this short note over to the issue. |

Adds case insensitive option to term, terms, terms_in_set, prefix, wildcard queries
First cut.
Closes #61546