Refactor field expansion for match, multi_match and query_string query by jimczi · Pull Request #25726 · elastic/elasticsearch

jimczi · 2017-07-14T14:26:20Z

This commit changes the way we handle field expansion in match, multi_match and query_string query.
The main changes are:

For exact field name, the new behavior is to rewrite to a matchnodocs query when the field name is not found in the mapping.
For partial field names (with * suffix), the expansion is done only on keyword, text, date, ip and number field types. Other field types are simply ignored.
For all fields (*), the expansion is done on accepted field types only (see above) and metadata fields are also filtered.
The * notation can also be used to set default_field option onquery_string query. This should replace the needs for the extra option use_all_fields which is deprecated in this change.

This commit also rewrites simple * query to matchalldocs query when all fields are requested (Fixes #25556).

The same change should be done on simple_query_string for completeness.

use_all_fields option in query_string is also deprecated in this change, default_field should be set to * instead.

Relates #25551

dakrone

@jimczi I just started playing with this (hence the short review), but I found something that causes some dissonance. If someone configures "default_field": "*" in the index settings, it does not work the same way as configuring "default_field": "*" in the query does.

Here's something that demonstrates the problem:

DELETE /test?pretty=false
{}

PUT /test?pretty=false
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "query": {
        "default_field": "*"
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "f1": {"type": "text"},
        "f2": {"type": "text"}
      }
    }
  }
}

POST /test/doc/1?refresh=wait_for&filter_path=result&pretty=false
{"f1": "foo", "f2": "bar"}

POST /test/_search
{
  "query": {
    "query_string": {
      "query": "foo AND bar"
    }
  }
}

POST /test/_search
{
  "query": {
    "query_string": {
      "query": "foo AND bar",
      "default_field": "*"
    }
  }
}

And the output:

{"acknowledged":true}
{"acknowledged":true,"shards_acknowledged":true,"index":"test"}
{"result":"created"}
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.5753642,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 0.5753642,
        "_source" : {
          "f1" : "foo",
          "f2" : "bar"
        }
      }
    ]
  }
}

dakrone · 2017-07-14T15:24:31Z

core/src/main/java/org/elasticsearch/index/query/QueryStringQueryBuilder.java

I think this comment is misleading since this is a getter

jimczi · 2017-07-14T19:19:37Z

Thanks @dakrone , good catch.
I pushed some changes to fix the bug.

jimczi · 2017-07-17T20:57:17Z

This PR has been updated to refactor the expansion of field names for the main es queries (the simple query string should be done in a follow up). The new description reflects the latest status of the PR (and explains the main changes). @dakrone I need to document the new behavior and work a bit on tests but the code is ready for a review I think ;).

dakrone · 2017-07-18T16:53:08Z

For partial field names (with * suffix), the expansion is done only on keyword, text, date and number field types. Other field types are simply ignored.

Does this include the ip field? (I can't remember if ip counts as a number)

jimczi · 2017-07-18T16:58:00Z

Does this include the ip field? (I can't remember if ip counts as a number)

It's a separate type but you are right this include the ip type as well. I've updated the description of the PR.

dakrone

LGTM, I left some comments about whether we want to capture exception messages, but it's up to you for addressing them.

Thanks for doing this!

dakrone · 2017-07-18T17:03:38Z

core/src/main/java/org/elasticsearch/index/query/QueryStringQueryBuilder.java

I think "*".equals(defaultField) should use Regex.isMatchAllPattern

dakrone · 2017-07-18T17:09:49Z

core/src/main/java/org/elasticsearch/index/search/MatchQuery.java

Can we include the exception message? It might be good to know that a lenient query failed due to something that is avoidable

dakrone · 2017-07-18T17:10:31Z

core/src/main/java/org/elasticsearch/index/search/MultiMatchQuery.java

I think the indentation is off here still? :)

dakrone · 2017-07-18T17:17:08Z

core/src/main/java/org/elasticsearch/index/search/QueryStringQueryParser.java

should this return a MatchNoDocsQuery like the other ones where an exception is thrown?

dakrone · 2017-07-18T17:17:22Z

core/src/main/java/org/elasticsearch/index/search/QueryStringQueryParser.java

Should we include the message of the exception here?

dakrone · 2017-07-18T17:23:04Z

docs/reference/query-dsl/query-string-query.asciidoc

Should we leave this in here, but mark it as deprecated? It still technically exists and I'm not sure we want people that have all_fields: true set to not be able to look up what the option does (until it's removed)

This commit changes the way we handle field expansion in `match`, `multi_match` and `query_string` query. The main changes are: - For exact field name, the new behavior is to rewrite to a matchnodocs query when the field name is not found in the mapping. - For partial field names (with `*` suffix), the expansion is done only on `keyword`, `text`, `date` and `number` field types. Other field types are simply ignored. - For all fields (`*`), the expansion is done on accepted field types only (see above) and metadata fields are also filtered. - The `*` notation can also be used to set `default_field` option on`query_string` query. This should replace the needs for the extra option `use_all_fields` which is deprecated in this change. This commit also rewrites simple `*` query to matchalldocs query when all fields are requested (Fixes elastic#25556). The same change should be done on `simple_query_string` for completeness. Relates elastic#25551

Related to elastic#25726, this resolves elastic#25556 for the 5.x series by parsing "*" as a `MatchAllDocsQuery` instead of expanding it to a (potentially expensive) query on the `_field_names`.

Related to #25726, this resolves #25556 for the 5.x series by parsing "*" as a `MatchAllDocsQuery` instead of expanding it to a (potentially expensive) query on the `_field_names`.

Bargs · 2017-08-01T17:44:29Z

Hey @jimczi, I'm seeing an odd difference between default_field and fields. fields: ["*"] sometimes throws an error when default_field: "*" does not. For example:

I'm using a weblogs data set with IP fields. If I send the following query, I get an error:

  {
    "query_string": {
      "query": "200",
      "analyze_wildcard": true,
      "fields": [
        "*"
      ]
    }
  }

{
      "error": {
        "root_cause": [
          {
            "type": "query_shard_exception",
            "reason": "failed to create query: {\n  \"bool\" : {\n    \"must\" : [\n      {\n        \"query_string\" : {\n          \"query\" : \"200\",\n          \"fields\" : [\n            \"*^1.0\"\n          ],\n          \"type\" : \"best_fields\",\n          \"default_operator\" : \"or\",\n          \"max_determinized_states\" : 10000,\n          \"enable_position_increments\" : true,\n          \"fuzziness\" : \"AUTO\",\n          \"fuzzy_prefix_length\" : 0,\n          \"fuzzy_max_expansions\" : 50,\n          \"phrase_slop\" : 0,\n          \"analyze_wildcard\" : true,\n          \"escape\" : false,\n          \"boost\" : 1.0\n        }\n      },\n      {\n        \"range\" : {\n          \"@timestamp\" : {\n            \"from\" : 1501607670716,\n            \"to\" : 1501608570716,\n            \"include_lower\" : true,\n            \"include_upper\" : true,\n            \"format\" : \"epoch_millis\",\n            \"boost\" : 1.0\n          }\n        }\n      }\n    ],\n    \"adjust_pure_negative\" : true,\n    \"boost\" : 1.0\n  }\n}",
            "index_uuid": "3FMTvZhWQyimxtxw3pJvvg",
            "index": "logstash-0"
          }
        ],
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
        "phase": "query",
        "grouped": true,
        "failed_shards": [
          {
            "shard": 0,
            "index": "logstash-0",
            "node": "38QhIDc9TTGCRXYNWAmFJA",
            "reason": {
              "type": "query_shard_exception",
              "reason": "failed to create query: {\n  \"bool\" : {\n    \"must\" : [\n      {\n        \"query_string\" : {\n          \"query\" : \"200\",\n          \"fields\" : [\n            \"*^1.0\"\n          ],\n          \"type\" : \"best_fields\",\n          \"default_operator\" : \"or\",\n          \"max_determinized_states\" : 10000,\n          \"enable_position_increments\" : true,\n          \"fuzziness\" : \"AUTO\",\n          \"fuzzy_prefix_length\" : 0,\n          \"fuzzy_max_expansions\" : 50,\n          \"phrase_slop\" : 0,\n          \"analyze_wildcard\" : true,\n          \"escape\" : false,\n          \"boost\" : 1.0\n        }\n      },\n      {\n        \"range\" : {\n          \"@timestamp\" : {\n            \"from\" : 1501607670716,\n            \"to\" : 1501608570716,\n            \"include_lower\" : true,\n            \"include_upper\" : true,\n            \"format\" : \"epoch_millis\",\n            \"boost\" : 1.0\n          }\n        }\n      }\n    ],\n    \"adjust_pure_negative\" : true,\n    \"boost\" : 1.0\n  }\n}",
              "index_uuid": "3FMTvZhWQyimxtxw3pJvvg",
              "index": "logstash-0",
              "caused_by": {
                "type": "illegal_argument_exception",
                "reason": "'200' is not an IP string literal."
              }
            }
          }
        ]
      },
      "status": 400
    }

A multi_match with fields: ["*"] fails with the same error:

{
  "multi_match": {
    "query": 200,
    "fields": [
      "*"
    ],
    "type": "phrase"
  }
}

The following query_string query using default_field, however, works fine:

{
  "query_string": {
    "query": "200",
    "analyze_wildcard": true,
    "default_field": "*"
  }
}

For me, it would be ideal if fields: ["*"] worked like default_field: "*". If fields: ["*"] throws errors depending on the query value, I can't really send user provided values and I'm back to abusing the query_string query in order to do searches across all fields.

jimczi · 2017-08-01T18:49:27Z

This is just a leftover of the all_fields mode. By default the query_string and multi_match query are not lenient, they throw an error when an invalid content is used for a field type. When the all_fields was used (by default or when it was explicitly set) the leniency was forced to true. I kept this behavior for query_string query using default_field:* but it's just to make sure that the default (when default_field is not set) doesn't change. If you use fields:* or the multi_match query you'll have to set the leniency to true manually.

Bargs · 2017-08-01T22:10:39Z

Ah, I had no idea that lenient parameter existed. Thanks @jimczi, that's exactly what I needed!

… query_string This change is a continuation of elastic#25726 that aligns field expansions for the simple_query_string with the query_string and multi_match query. The main changes are: * For exact field name, the new behavior is to rewrite to a matchnodocs query when the field name is not found in the mapping. * For partial field names (with * suffix), the expansion is done only on keyword, text, date, ip and number field types. Other field types are simply ignored. * For all fields (*), the expansion is done on accepted field types only (see above) and metadata fields are also filtered. The use_all_fields option is deprecated in this change and can be replaced by setting `*` in the fields parameter. This commit also changes how text fields are analyzed. Previously the default search analyzer (or the provided analyzer) was used to analyze every text part , ignoring the analyzer set on the field in the mapping. With this change, the field analyzer is used instead unless an analyzer has been forced in the parameter of the query. Finally now that all full text queries can handle the special "*" expansion (`all_fields` mode), the `index.query.default_field` is now set to `*` for indices created in 6.

… query_string (#26145) This change is a continuation of #25726 that aligns field expansions for the simple_query_string with the query_string and multi_match query. The main changes are: * For exact field name, the new behavior is to rewrite to a matchnodocs query when the field name is not found in the mapping. * For partial field names (with * suffix), the expansion is done only on keyword, text, date, ip and number field types. Other field types are simply ignored. * For all fields (*), the expansion is done on accepted field types only (see above) and metadata fields are also filtered. The use_all_fields option is deprecated in this change and can be replaced by setting `*` in the fields parameter. This commit also changes how text fields are analyzed. Previously the default search analyzer (or the provided analyzer) was used to analyze every text part , ignoring the analyzer set on the field in the mapping. With this change, the field analyzer is used instead unless an analyzer has been forced in the parameter of the query. Finally now that all full text queries can handle the special "*" expansion (`all_fields` mode), the `index.query.default_field` is now set to `*` for indices created in 6.

jimczi added :Query DSL >enhancement v6.0.0 labels Jul 14, 2017

jimczi requested a review from dakrone July 14, 2017 14:26

dakrone requested changes Jul 14, 2017

View reviewed changes

jimczi force-pushed the query_string_all_fields branch from d67c61a to 5a30987 Compare July 17, 2017 20:53

jimczi changed the title ~~Remove all_fields:true in favour of default_field:* in the query_string query~~ Refactor field expansion for match, multi_match and query_string query Jul 17, 2017

dakrone approved these changes Jul 18, 2017

View reviewed changes

jimczi force-pushed the query_string_all_fields branch from 5a30987 to 33e79b6 Compare July 21, 2017 11:06

jimczi added >bug and removed >enhancement labels Jul 21, 2017

jimczi added 3 commits July 21, 2017 15:34

docs

a743f0f

cleanup isMatchAllPattern

8fbb2b9

jimczi force-pushed the query_string_all_fields branch from b8b8f80 to 8fbb2b9 Compare July 21, 2017 13:35

jimczi merged commit c378432 into elastic:master Jul 21, 2017

jimczi deleted the query_string_all_fields branch July 21, 2017 14:53

dakrone mentioned this pull request Jul 25, 2017

Parse "*" in query_string_query as MatchAllDocsQuery #25872

Merged

clintongormley added v6.0.0-beta1 and removed v6.0.0 labels Jul 25, 2017

Bargs mentioned this pull request Jul 25, 2017

Use default_field instead of all_fields in highlight query elastic/kibana#13096

Closed

Bargs mentioned this pull request Aug 2, 2017

Remove simple_query_string hack now that multi_match supports * properly elastic/kibana#13285

Merged

jimczi mentioned this pull request Aug 10, 2017

Refactor simple_query_string to handle text part like multi_match and query_string #26145

Merged

clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Query DSL labels Feb 14, 2018

romseygeek mentioned this pull request Jan 18, 2022

Exclude doc-value-only fields from all fields queries #82712

Closed

Conversation

jimczi commented Jul 14, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dakrone left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dakrone Jul 14, 2017

Choose a reason for hiding this comment

Uh oh!

jimczi commented Jul 14, 2017

Uh oh!

jimczi commented Jul 17, 2017

Uh oh!

dakrone commented Jul 18, 2017

Uh oh!

jimczi commented Jul 18, 2017

Uh oh!

dakrone left a comment

Choose a reason for hiding this comment

Uh oh!

dakrone Jul 18, 2017

Choose a reason for hiding this comment

Uh oh!

dakrone Jul 18, 2017

Choose a reason for hiding this comment

Uh oh!

dakrone Jul 18, 2017

Choose a reason for hiding this comment

Uh oh!

dakrone Jul 18, 2017

Choose a reason for hiding this comment

Uh oh!

dakrone Jul 18, 2017

Choose a reason for hiding this comment

Uh oh!

dakrone Jul 18, 2017

Choose a reason for hiding this comment

Uh oh!

Bargs commented Aug 1, 2017

Uh oh!

jimczi commented Aug 1, 2017

Uh oh!

Bargs commented Aug 1, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

jimczi commented Jul 14, 2017 •

edited

Loading

dakrone left a comment •

edited

Loading