Named fuzzy query killing performance of nested top_hits aggregation

**Elasticsearch version** (`bin/elasticsearch --version`): 7.15.2

**Plugins installed**: [analysis-icu]

**JVM version** (`java -version`): bundled

**OS version** (`uname -a` if on a Unix-like system): macOS Big Sur

**Description of the problem including expected versus actual behavior**:

I was researching why some queries in our production environment were slow (well, slower than expected anyway), and narrowed it down to the combination of a *named* fuzzy query (match query with fuzziness set to AUTO) and a top_hits aggregation.

The complete query is quite simple:

```json
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "category.std": {
              "query": "the query",
              "_name": "std"
            }
          }
        },
        {
          "match": {
            "category.prefix": {
              "query": "the query",
              "_name": "language_std"
            }
          }
        },
        {
          "match": {
            "category": {
              "query": "the query",
              "_name": "language_search"
            }
          }
        },
        {
          "match": {
            "category.fuzzy": {
              "query": "the query",
              "operator": "AND",
              "fuzziness": "AUTO",
              "prefix_length": 2,
              "max_expansions": "5",
              "boost": "0.5",
              "_name": "fuzzy"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "category": {
      "terms": {
        "field": "category.base",
        "size": 5
      },
      "aggs": {
        "top_category": {
          "top_hits": {
            "size": 1,
            "_source": {
              "includes": [
                "category"
              ]
            }
          }
        }
      }
    }
  },
  "size": 0
}
```

We search for the user's query in the "category" field using a bool should query (since the relevant sub-fields use different search analyzers), then aggregate the top 5 values of the `category.base` field (terms aggregation), and finally fetch the top hit for each of those 5 terms. We do it this way for highlighting, which sits under the top_hits aggregation in the original query; since it adds only a small overhead, I deliberately left it out here for simplicity.

With the query cache disabled for testing, this query takes around 70ms to return results. Since we are aiming for a much shorter response time (in the region of 5-20ms), I started taking parts out of the query in order to figure out where the bottleneck is.

Omitting either the fuzzy query or the top_hits aggregation brings the response time down to 5ms.
The astonishing discovery is that removing *just* the `_name` parameter from the fuzzy query leads to consistent 7ms responses (removing the same parameter from the other queries makes no difference).
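
To make the comparison concrete, the fast variant should differ from the full request above only in the fuzzy clause, which would look roughly like this with `_name` dropped (the other three `match` clauses, the aggregations and `"size": 0` stay exactly as shown earlier):

```json
{
  "match": {
    "category.fuzzy": {
      "query": "the query",
      "operator": "AND",
      "fuzziness": "AUTO",
      "prefix_length": 2,
      "max_expansions": "5",
      "boost": "0.5"
    }
  }
}
```

Swapping this clause in and out while leaving everything else untouched is what moves the response time between roughly 7ms and 70ms.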

This definitely seems to be a bug, since I was under the impression that named queries are just a convenience for results processing and should not produce any performance overhead whatsoever.

Any help would be appreciated.

PS. I don't know if the issue title is correct, but I could not come up with something better.
