Named fuzzy query killing performance of nested top_hits aggregation #80860
Description
Elasticsearch version (bin/elasticsearch --version): 7.15.2
Plugins installed: [analysis-icu]
JVM version (java -version): bundled
OS version (uname -a if on a Unix-like system): macOS Big Sur
Description of the problem including expected versus actual behavior:
I was investigating why some queries in our production environment were slower than expected, and narrowed it down to the combination of a named fuzzy query (a match query with fuzziness set to AUTO) and a top_hits aggregation.
The complete query is quite simple:
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "category.std": {
              "query": "the query",
              "_name": "std"
            }
          }
        },
        {
          "match": {
            "category.prefix": {
              "query": "the query",
              "_name": "language_std"
            }
          }
        },
        {
          "match": {
            "category": {
              "query": "the query",
              "_name": "language_search"
            }
          }
        },
        {
          "match": {
            "category.fuzzy": {
              "query": "the query",
              "operator": "AND",
              "fuzziness": "AUTO",
              "prefix_length": 2,
              "max_expansions": "5",
              "boost": "0.5",
              "_name": "fuzzy"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "category": {
      "terms": {
        "field": "category.base",
        "size": 5
      },
      "aggs": {
        "top_category": {
          "top_hits": {
            "size": 1,
            "_source": {
              "includes": [
                "category"
              ]
            }
          }
        }
      }
    }
  },
  "size": 0
}
We search for the user's query in the "category" field using a bool should query (the relevant sub-fields use different search analyzers), then aggregate the top 5 values for the category field (a terms aggregation on category.base), and finally get the top hit for each of those 5 terms. We do it this way for highlighting, which sits under the top_hits aggregation in the original query; since it adds only a small overhead, I deliberately left it out here for simplicity.
With the query cache disabled for testing, this query takes around 70ms to return results. Since we are aiming for a much shorter response time (in the region of 5-20ms), I started taking parts out of the query in order to figure out where the bottleneck is.
Omitting either the fuzzy query or the top_hits aggregation brings the response time down to about 5ms.
The surprising part is that removing just the _name parameter from the fuzzy query brings responses down to a consistent 7ms, whereas removing the same parameter from the other queries makes no difference.
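For anyone who wants to reproduce the elimination steps above, here is a minimal sketch of how the query variants could be generated and timed. It is not the exact procedure I used. The helper names (`without_name`, `without_clause`, `without_top_hits`, `took_ms`) are made up for illustration, and `BASE_QUERY` is a trimmed two-clause version of the query above; paste in the full query to test the real thing. Cache settings are not handled here and are assumed to be configured separately.

```python
import copy
import json
import urllib.request

# Trimmed version of the query from this issue (assumption: two clauses are
# enough to illustrate the variants; use the full query for real measurements).
BASE_QUERY = {
    "query": {"bool": {"should": [
        {"match": {"category.std": {"query": "the query", "_name": "std"}}},
        {"match": {"category.fuzzy": {
            "query": "the query",
            "operator": "AND",
            "fuzziness": "AUTO",
            "prefix_length": 2,
            "max_expansions": "5",
            "boost": "0.5",
            "_name": "fuzzy",
        }}},
    ]}},
    "aggs": {"category": {
        "terms": {"field": "category.base", "size": 5},
        "aggs": {"top_category": {
            "top_hits": {"size": 1, "_source": {"includes": ["category"]}},
        }},
    }},
    "size": 0,
}


def without_name(body, name):
    """Copy of body where the clause named `name` keeps its settings but loses _name."""
    variant = copy.deepcopy(body)
    for clause in variant["query"]["bool"]["should"]:
        for params in clause["match"].values():
            if params.get("_name") == name:
                del params["_name"]
    return variant


def without_clause(body, name):
    """Copy of body with the clause named `name` removed entirely."""
    variant = copy.deepcopy(body)
    variant["query"]["bool"]["should"] = [
        clause for clause in variant["query"]["bool"]["should"]
        if all(p.get("_name") != name for p in clause["match"].values())
    ]
    return variant


def without_top_hits(body):
    """Copy of body that keeps the terms aggregation but drops the nested top_hits."""
    variant = copy.deepcopy(body)
    del variant["aggs"]["category"]["aggs"]
    return variant


def took_ms(body, search_url):
    """POST body to a _search URL and return the server-reported 'took' in ms."""
    request = urllib.request.Request(
        search_url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["took"]


# The four variants compared in this issue.
VARIANTS = {
    "full": BASE_QUERY,
    "no_fuzzy": without_clause(BASE_QUERY, "fuzzy"),
    "no_top_hits": without_top_hits(BASE_QUERY),
    "no_name": without_name(BASE_QUERY, "fuzzy"),
}
```

Calling `took_ms(body, url)` several times for each entry in `VARIANTS`, with `url` pointing at the index's `_search` endpoint (for example a hypothetical `http://localhost:9200/my-index/_search`), and comparing the reported values should show whether the `no_name` variant behaves differently from `full` on a given cluster.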
This definitely seems to be a bug, since I was under the impression that named queries are just a convenience for results processing and should not produce any performance overhead whatsoever.
Any help would be appreciated.
PS. I don't know if the issue title is correct, but I could not come up with something better.