Skip to content

Refactor field expansion for match, multi_match and query_string query#25726

Merged
jimczi merged 3 commits intoelastic:masterfrom
jimczi:query_string_all_fields
Jul 21, 2017
Merged

Refactor field expansion for match, multi_match and query_string query#25726
jimczi merged 3 commits intoelastic:masterfrom
jimczi:query_string_all_fields

Conversation

@jimczi
Copy link
Copy Markdown
Contributor

@jimczi jimczi commented Jul 14, 2017

This commit changes the way we handle field expansion in match, multi_match and query_string query.
The main changes are:

  • For exact field name, the new behavior is to rewrite to a matchnodocs query when the field name is not found in the mapping.

  • For partial field names (with * suffix), the expansion is done only on keyword, text, date, ip and number field types. Other field types are simply ignored.

  • For all fields (*), the expansion is done on accepted field types only (see above) and metadata fields are also filtered.

  • The * notation can also be used to set default_field option onquery_string query. This should replace the needs for the extra option use_all_fields which is deprecated in this change.

This commit also rewrites simple * query to matchalldocs query when all fields are requested (Fixes #25556).

The same change should be done on simple_query_string for completeness.

use_all_fields option in query_string is also deprecated in this change, default_field should be set to * instead.

Relates #25551

Copy link
Copy Markdown
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jimczi I just started playing with this (hence the short review), but I found something that causes some dissonance. If someone configures "default_field": "*" in the index settings, it does not work the same way as configuring "default_field": "*" in the query does.

Here's something that demonstrates the problem:

DELETE /test?pretty=false
{}

PUT /test?pretty=false
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "query": {
        "default_field": "*"
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "f1": {"type": "text"},
        "f2": {"type": "text"}
      }
    }
  }
}

POST /test/doc/1?refresh=wait_for&filter_path=result&pretty=false
{"f1": "foo", "f2": "bar"}

POST /test/_search
{
  "query": {
    "query_string": {
      "query": "foo AND bar"
    }
  }
}

POST /test/_search
{
  "query": {
    "query_string": {
      "query": "foo AND bar",
      "default_field": "*"
    }
  }
}

And the output:

{"acknowledged":true}
{"acknowledged":true,"shards_acknowledged":true,"index":"test"}
{"result":"created"}
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.5753642,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 0.5753642,
        "_source" : {
          "f1" : "foo",
          "f2" : "bar"
        }
      }
    ]
  }
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment is misleading since this is a getter

@jimczi
Copy link
Copy Markdown
Contributor Author

jimczi commented Jul 14, 2017

Thanks @dakrone , good catch.
I pushed some changes to fix the bug.

@jimczi jimczi force-pushed the query_string_all_fields branch from d67c61a to 5a30987 Compare July 17, 2017 20:53
@jimczi jimczi changed the title Remove all_fields:true in favour of default_field:* in the query_string query Refactor field expansion for match, multi_match and query_string query Jul 17, 2017
@jimczi
Copy link
Copy Markdown
Contributor Author

jimczi commented Jul 17, 2017

This PR has been updated to refactor the expansion of field names for the main es queries (the simple query string should be done in a follow up). The new description reflects the latest status of the PR (and explains the main changes). @dakrone I need to document the new behavior and work a bit on tests but the code is ready for a review I think ;).

@dakrone
Copy link
Copy Markdown
Member

dakrone commented Jul 18, 2017

For partial field names (with * suffix), the expansion is done only on keyword, text, date and number field types. Other field types are simply ignored.

Does this include the ip field? (I can't remember if ip counts as a number)

@jimczi
Copy link
Copy Markdown
Contributor Author

jimczi commented Jul 18, 2017

Does this include the ip field? (I can't remember if ip counts as a number)

It's a separate type but you are right this include the ip type as well. I've updated the description of the PR.

Copy link
Copy Markdown
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I left some comments about whether we want to capture exception messages, but it's up to you for addressing them.

Thanks for doing this!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "*".equals(defaultField) should use Regex.isMatchAllPattern

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we include the exception message? It might be good to know that a lenient query failed due to something that is avoidable

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the indentation is off here still? :)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this return a MatchNoDocsQuery like the other ones where an exception is thrown?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we include the message of the exception here?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we leave this in here, but mark it as deprecated? It still technically exists and I'm not sure we want people that have all_fields: true set to not be able to look up what the option does (until it's removed)

@jimczi jimczi force-pushed the query_string_all_fields branch from 5a30987 to 33e79b6 Compare July 21, 2017 11:06
@jimczi jimczi added >bug and removed >enhancement labels Jul 21, 2017
jimczi added 3 commits July 21, 2017 15:34
This commit changes the way we handle field expansion in `match`, `multi_match` and `query_string` query.
 The main changes are:

- For exact field name, the new behavior is to rewrite to a matchnodocs query when the field name is not found in the mapping.

- For partial field names (with `*` suffix), the expansion is done only on `keyword`, `text`, `date` and `number` field types. Other field types are simply ignored.

- For all fields (`*`), the expansion is done on accepted field types only (see above) and metadata fields are also filtered.

- The `*` notation can also be used to set `default_field` option on`query_string` query. This should replace the needs for the extra option `use_all_fields` which is deprecated in this change.

This commit also rewrites simple `*` query to matchalldocs query when all fields are requested (Fixes elastic#25556).

The same change should be done on `simple_query_string` for completeness.

Relates elastic#25551
@jimczi jimczi force-pushed the query_string_all_fields branch from b8b8f80 to 8fbb2b9 Compare July 21, 2017 13:35
@jimczi jimczi merged commit c378432 into elastic:master Jul 21, 2017
@jimczi jimczi deleted the query_string_all_fields branch July 21, 2017 14:53
dakrone added a commit to dakrone/elasticsearch that referenced this pull request Jul 25, 2017
Related to elastic#25726, this resolves elastic#25556 for the 5.x series by parsing "*" as a
`MatchAllDocsQuery` instead of expanding it to a (potentially expensive) query
on the `_field_names`.
dakrone added a commit that referenced this pull request Jul 25, 2017
Related to #25726, this resolves #25556 for the 5.x series by parsing "*" as a
`MatchAllDocsQuery` instead of expanding it to a (potentially expensive) query
on the `_field_names`.
dakrone added a commit that referenced this pull request Jul 25, 2017
Related to #25726, this resolves #25556 for the 5.x series by parsing "*" as a
`MatchAllDocsQuery` instead of expanding it to a (potentially expensive) query
on the `_field_names`.
@Bargs
Copy link
Copy Markdown

Bargs commented Aug 1, 2017

Hey @jimczi, I'm seeing an odd difference between default_field and fields. fields: ["*"] sometimes throws an error when default_field: "*" does not. For example:

I'm using a weblogs data set with IP fields. If I send the following query, I get an error:

  {
    "query_string": {
      "query": "200",
      "analyze_wildcard": true,
      "fields": [
        "*"
      ]
    }
  }
{
      "error": {
        "root_cause": [
          {
            "type": "query_shard_exception",
            "reason": "failed to create query: {\n  \"bool\" : {\n    \"must\" : [\n      {\n        \"query_string\" : {\n          \"query\" : \"200\",\n          \"fields\" : [\n            \"*^1.0\"\n          ],\n          \"type\" : \"best_fields\",\n          \"default_operator\" : \"or\",\n          \"max_determinized_states\" : 10000,\n          \"enable_position_increments\" : true,\n          \"fuzziness\" : \"AUTO\",\n          \"fuzzy_prefix_length\" : 0,\n          \"fuzzy_max_expansions\" : 50,\n          \"phrase_slop\" : 0,\n          \"analyze_wildcard\" : true,\n          \"escape\" : false,\n          \"boost\" : 1.0\n        }\n      },\n      {\n        \"range\" : {\n          \"@timestamp\" : {\n            \"from\" : 1501607670716,\n            \"to\" : 1501608570716,\n            \"include_lower\" : true,\n            \"include_upper\" : true,\n            \"format\" : \"epoch_millis\",\n            \"boost\" : 1.0\n          }\n        }\n      }\n    ],\n    \"adjust_pure_negative\" : true,\n    \"boost\" : 1.0\n  }\n}",
            "index_uuid": "3FMTvZhWQyimxtxw3pJvvg",
            "index": "logstash-0"
          }
        ],
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
        "phase": "query",
        "grouped": true,
        "failed_shards": [
          {
            "shard": 0,
            "index": "logstash-0",
            "node": "38QhIDc9TTGCRXYNWAmFJA",
            "reason": {
              "type": "query_shard_exception",
              "reason": "failed to create query: {\n  \"bool\" : {\n    \"must\" : [\n      {\n        \"query_string\" : {\n          \"query\" : \"200\",\n          \"fields\" : [\n            \"*^1.0\"\n          ],\n          \"type\" : \"best_fields\",\n          \"default_operator\" : \"or\",\n          \"max_determinized_states\" : 10000,\n          \"enable_position_increments\" : true,\n          \"fuzziness\" : \"AUTO\",\n          \"fuzzy_prefix_length\" : 0,\n          \"fuzzy_max_expansions\" : 50,\n          \"phrase_slop\" : 0,\n          \"analyze_wildcard\" : true,\n          \"escape\" : false,\n          \"boost\" : 1.0\n        }\n      },\n      {\n        \"range\" : {\n          \"@timestamp\" : {\n            \"from\" : 1501607670716,\n            \"to\" : 1501608570716,\n            \"include_lower\" : true,\n            \"include_upper\" : true,\n            \"format\" : \"epoch_millis\",\n            \"boost\" : 1.0\n          }\n        }\n      }\n    ],\n    \"adjust_pure_negative\" : true,\n    \"boost\" : 1.0\n  }\n}",
              "index_uuid": "3FMTvZhWQyimxtxw3pJvvg",
              "index": "logstash-0",
              "caused_by": {
                "type": "illegal_argument_exception",
                "reason": "'200' is not an IP string literal."
              }
            }
          }
        ]
      },
      "status": 400
    }

A multi_match with fields: ["*"] fails with the same error:

{
  "multi_match": {
    "query": 200,
    "fields": [
      "*"
    ],
    "type": "phrase"
  }
}

The following query_string query using default_field, however, works fine:

{
  "query_string": {
    "query": "200",
    "analyze_wildcard": true,
    "default_field": "*"
  }
}

For me, it would be ideal if fields: ["*"] worked like default_field: "*". If fields: ["*"] throws errors depending on the query value, I can't really send user provided values and I'm back to abusing the query_string query in order to do searches across all fields.

@jimczi
Copy link
Copy Markdown
Contributor Author

jimczi commented Aug 1, 2017

This is just a leftover of the all_fields mode. By default the query_string and multi_match query are not lenient, they throw an error when an invalid content is used for a field type. When the all_fields was used (by default or when it was explicitly set) the leniency was forced to true. I kept this behavior for query_string query using default_field:* but it's just to make sure that the default (when default_field is not set) doesn't change. If you use fields:* or the multi_match query you'll have to set the leniency to true manually.

@Bargs
Copy link
Copy Markdown

Bargs commented Aug 1, 2017

Ah, I had no idea that lenient parameter existed. Thanks @jimczi, that's exactly what I needed!

jimczi added a commit to jimczi/elasticsearch that referenced this pull request Aug 21, 2017
… query_string

This change is a continuation of elastic#25726 that aligns field expansions for the simple_query_string with the query_string and multi_match query.
The main changes are:

 * For exact field name, the new behavior is to rewrite to a matchnodocs query when the field name is not found in the mapping.

 * For partial field names (with * suffix), the expansion is done only on keyword, text, date, ip and number field types. Other field types are simply ignored.

 * For all fields (*), the expansion is done on accepted field types only (see above) and metadata fields are also filtered.

The use_all_fields option is deprecated in this change and can be replaced by setting `*` in the fields parameter.
This commit also changes how text fields are analyzed. Previously the default search analyzer (or the provided analyzer) was used to analyze every text part
, ignoring the analyzer set on the field in the mapping. With this change, the field analyzer is used instead unless an analyzer has been forced in the parameter of the query.

Finally now that all full text queries can handle the special "*" expansion (`all_fields` mode), the `index.query.default_field` is now set to `*` for indices created in 6.
jimczi added a commit that referenced this pull request Aug 21, 2017
… query_string (#26145)

This change is a continuation of #25726 that aligns field expansions for the simple_query_string with the query_string and multi_match query.
The main changes are:

 * For exact field name, the new behavior is to rewrite to a matchnodocs query when the field name is not found in the mapping.

 * For partial field names (with * suffix), the expansion is done only on keyword, text, date, ip and number field types. Other field types are simply ignored.

 * For all fields (*), the expansion is done on accepted field types only (see above) and metadata fields are also filtered.

The use_all_fields option is deprecated in this change and can be replaced by setting `*` in the fields parameter.
This commit also changes how text fields are analyzed. Previously the default search analyzer (or the provided analyzer) was used to analyze every text part
, ignoring the analyzer set on the field in the mapping. With this change, the field analyzer is used instead unless an analyzer has been forced in the parameter of the query.

Finally now that all full text queries can handle the special "*" expansion (`all_fields` mode), the `index.query.default_field` is now set to `*` for indices created in 6.
jimczi added a commit that referenced this pull request Aug 21, 2017
… query_string (#26145)

This change is a continuation of #25726 that aligns field expansions for the simple_query_string with the query_string and multi_match query.
The main changes are:

 * For exact field name, the new behavior is to rewrite to a matchnodocs query when the field name is not found in the mapping.

 * For partial field names (with * suffix), the expansion is done only on keyword, text, date, ip and number field types. Other field types are simply ignored.

 * For all fields (*), the expansion is done on accepted field types only (see above) and metadata fields are also filtered.

The use_all_fields option is deprecated in this change and can be replaced by setting `*` in the fields parameter.
This commit also changes how text fields are analyzed. Previously the default search analyzer (or the provided analyzer) was used to analyze every text part
, ignoring the analyzer set on the field in the mapping. With this change, the field analyzer is used instead unless an analyzer has been forced in the parameter of the query.

Finally now that all full text queries can handle the special "*" expansion (`all_fields` mode), the `index.query.default_field` is now set to `*` for indices created in 6.
jimczi added a commit that referenced this pull request Aug 21, 2017
… query_string (#26145)

This change is a continuation of #25726 that aligns field expansions for the simple_query_string with the query_string and multi_match query.
The main changes are:

 * For exact field name, the new behavior is to rewrite to a matchnodocs query when the field name is not found in the mapping.

 * For partial field names (with * suffix), the expansion is done only on keyword, text, date, ip and number field types. Other field types are simply ignored.

 * For all fields (*), the expansion is done on accepted field types only (see above) and metadata fields are also filtered.

The use_all_fields option is deprecated in this change and can be replaced by setting `*` in the fields parameter.
This commit also changes how text fields are analyzed. Previously the default search analyzer (or the provided analyzer) was used to analyze every text part
, ignoring the analyzer set on the field in the mapping. With this change, the field analyzer is used instead unless an analyzer has been forced in the parameter of the query.

Finally now that all full text queries can handle the special "*" expansion (`all_fields` mode), the `index.query.default_field` is now set to `*` for indices created in 6.
@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Query DSL labels Feb 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>bug :Search/Search Search-related issues that do not fall into other categories v6.0.0-beta1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Performance degradation for '*' query with '_all' field disabled

4 participants