Named fuzzy query killing performance of nested top_hits aggregation #80860

@nemphys

Elasticsearch version (bin/elasticsearch --version): 7.15.2

Plugins installed: [analysis-icu]

JVM version (java -version): bundled

OS version (uname -a if on a Unix-like system): macOS Big Sur

Description of the problem including expected versus actual behavior:

I was researching why some queries in our production environment were slow (well, slower than expected anyway), and narrowed it down to the combination of a named fuzzy query (match query with fuzziness set to AUTO) and a top_hits aggregation.

The complete query is quite simple:

{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "category.std": {
              "query": "the query",
              "_name": "std"
            }
          }
        },
        {
          "match": {
            "category.prefix": {
              "query": "the query",
              "_name": "language_std"
            }
          }
        },
        {
          "match": {
            "category": {
              "query": "the query",
              "_name": "language_search"
            }
          }
        },
        {
          "match": {
            "category.fuzzy": {
              "query": "the query",
              "operator": "AND",
              "fuzziness": "AUTO",
              "prefix_length": 2,
              "max_expansions": "5",
              "boost": "0.5",
              "_name": "fuzzy"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "category": {
      "terms": {
        "field": "category.base",
        "size": 5
      },
      "aggs": {
        "top_category": {
          "top_hits": {
            "size": 1,
            "_source": {
              "includes": [
                "category"
              ]
            }
          }
        }
      }
    }
  },
  "size": 0
}

We search for the user's query in the "category" field using a bool should query (since the relevant sub-fields use different search analyzers), then aggregate the top 5 values of the category field (terms aggregation on category.base), and finally get the top hit for each of the 5 terms. We do it this way for highlighting, which sits under the top_hits aggregation in the original query; since it adds a small overhead, I deliberately left it out here for simplicity.

With the query cache disabled for testing, this query takes around 70ms to return results. Since we are aiming for a much shorter response time (in the region of 5-20ms), I started taking parts out of the query in order to figure out where the bottleneck is.

Omitting either the fuzzy query or the top_hits aggregation brings the response time down to 5ms.
The astonishing discovery is that removing just the _name parameter from the fuzzy query leads to consistent 7ms responses (removing the same parameter from the other queries makes no difference).
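
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the request variants described above could be generated and timed. It is only an illustration: the request body is abbreviated to two of the four clauses, the helper names (`without_name`, `without_clause`, `without_top_hits`, `median_ms`) are made up for this example, and `send` is a hypothetical callable that you would supply to submit the body to your own cluster.

```python
import copy
import statistics
import time

# Abbreviated copy of the request body above (two of the four should-clauses).
BODY = {
    "query": {"bool": {"should": [
        {"match": {"category.std": {"query": "the query", "_name": "std"}}},
        {"match": {"category.fuzzy": {
            "query": "the query", "operator": "AND", "fuzziness": "AUTO",
            "prefix_length": 2, "max_expansions": "5", "boost": "0.5",
            "_name": "fuzzy"}}},
    ]}},
    "aggs": {"category": {
        "terms": {"field": "category.base", "size": 5},
        "aggs": {"top_category": {"top_hits": {
            "size": 1, "_source": {"includes": ["category"]}}}},
    }},
    "size": 0,
}


def _params(clause):
    # Each should-clause has the shape {"match": {<field>: {<params>}}}.
    return next(iter(clause["match"].values()))


def without_name(body, name):
    """Copy of body where the clause named `name` loses only its _name key."""
    variant = copy.deepcopy(body)
    for clause in variant["query"]["bool"]["should"]:
        params = _params(clause)
        if params.get("_name") == name:
            del params["_name"]
    return variant


def without_clause(body, name):
    """Copy of body with the whole clause named `name` removed."""
    variant = copy.deepcopy(body)
    should = variant["query"]["bool"]["should"]
    should[:] = [c for c in should if _params(c).get("_name") != name]
    return variant


def without_top_hits(body):
    """Copy of body that keeps the terms aggregation but drops its sub-aggs."""
    variant = copy.deepcopy(body)
    variant["aggs"]["category"].pop("aggs", None)
    return variant


VARIANTS = {
    "full": BODY,
    "no fuzzy clause": without_clause(BODY, "fuzzy"),
    "no top_hits": without_top_hits(BODY),
    "fuzzy without _name": without_name(BODY, "fuzzy"),
}


def median_ms(send, body, repeats=20):
    """Median wall-clock time in ms of send(body).

    `send` is a hypothetical caller-supplied function that runs the search.
    """
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        send(body)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

Calling `median_ms(send, body)` for each entry in `VARIANTS` gives one figure per variant. Note that this measures client-side wall-clock time, which includes network overhead, so the `took` value in each search response may be a better basis for comparison.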

This definitely seems to be a bug, since I was under the impression that named queries are just a convenience for results processing and should not produce any performance overhead whatsoever.

Any help would be appreciated.

PS. I don't know if the issue title is correct, but I could not come up with something better.

Labels: :Search/Search, >docs, Team:Docs, Team:Search
