Skip to content

Enhances exists queries to reduce need for _field_names#26930

Merged
colings86 merged 8 commits intoelastic:masterfrom
colings86:enhance/existsQuery
Nov 1, 2017
Merged

Enhances exists queries to reduce need for _field_names#26930
colings86 merged 8 commits intoelastic:masterfrom
colings86:enhance/existsQuery

Conversation

@colings86
Copy link
Copy Markdown
Contributor

Before this change we wrote the name all the fields in a document to a _field_names field and then implemented exists queries as a term query on this field. The problem with this approach is that it bloats the index and also affects indexing performance.

This change adds a new method existsQuery() to MappedFieldType which is implemented by each sub-class. For most field types if doc values are available a DocValuesFieldExistsQuery is used, falling back to using _field_names if doc values are disabled. Note that only fields where no doc values are available are written to _field_names.

Closes #26770

Copy link
Copy Markdown
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on it.

One concern I have is that the impl for existsQuery is shared in a super class while the logic for adding mappers is in concrete impls. I think it makes it very easy to inherit the default impl for the exists query and forget to add a _field_names field at index time if necessary?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha good spot, this was so that the rest client didn't time out when I was debugging rest tests, shouldn't have been committed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expected binary fields to require changes?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then shouldn't it create field names fields when it is stored or has doc values?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I've made changes to support this

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this if statement?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpountz and I discussed this and we agreed that for this PR we will keep the behaviour that exists queries will only work for fields that are indexed, stored, or have doc values (or a combination of those) so this PR can be merged into 7.0 and 6.x. I'll make a followup PR for 7.0 only that changes things to allow exists to work for any field that is mapped.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm but then that if statement is still buggy since it should also check for hasDocValues? Actually I'd be in favour of removing it entirely since you seem to have that logic on the call sites already anyway (only adding this field if there are no doc values)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved this logic outside of this method, the callers now check to ensure if field names field should be created

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you leave a comment that the logic sits in other impls?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could return a match_all if the field is enabled and match_none otherwise?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I threw an exception here to mirror the behaviour of termQuery(). I assumed we did this in term query to discourage people from performing queries on _source and assumed we would want to do this here too for the same reason?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to create a helper method that creates a query for a given field?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I think more about it, I think this impl has potential problems with multiple types, since getObjectMapper will return the mapper of one of the types, meaning it might miss sub fields if they only exist in another type (those mapper classes are per-type on the contrary to the field type classes). So maybe we should resolve fields that match field + ".*" and build a bool query out of it (and no need for recursion since I would expect the glob to match fields of sub objects too)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe use NormFieldExistsQuery since we are on a 7.1 snapshot now?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do this as a followup PR as it will require the Lucene 7.1 snapshot to be updated too

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@colings86
Copy link
Copy Markdown
Contributor Author

@jpountz I pushed a commit that addresses your comments. Could you take another look?

Copy link
Copy Markdown
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some more comments. I think we also need to fix the _field_names docs to remove the example that directly queries _field_names?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then shouldn't it create field names fields when it is stored or has doc values?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm but then that if statement is still buggy since it should also check for hasDocValues? Actually I'd be in favour of removing it entirely since you seem to have that logic on the call sites already anyway (only adding this field if there are no doc values)?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@colings86
Copy link
Copy Markdown
Contributor Author

@jpountz I pushed another commit. This addresses your remaining comments including removing the implementation of existsQuery from MappedFieldType and instead requires (and implements) the method on the subclasses

Copy link
Copy Markdown
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me overall! I have two questions left:

  • should we document the breaking change about term queries on _field_names or maybe even try to maintain backward compatibility? In any cases we should at least remove the term query example in the docs?
  • Doc value iterators are only good for sure if they were written with Lucene 7. So I'm wondering that we might still want to write field names entries if the index was created before 6.0 as well as only use DocValuesFieldExistsQuery if the index was created on or after 6.0?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will always be false?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

always false?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

always false?

@colings86
Copy link
Copy Markdown
Contributor Author

should we document the breaking change about term queries on _field_names or maybe even try to maintain backward compatibility? In any cases we should at least remove the term query example in the docs?

Apart from going back to always writing all field names to the _field_names field the only way I see to maintain backward compatibility for term queries on _field_names is to have FieldNamesFieldMapper.termQuery() retrieve the MappedFieldType for the field being queried and call existsQuery() on it. At the moment the implementations of existsQuerycall FieldNamesFieldMapper.termQuery() so if we want to do this we would need to have the existsQuery() implementations instead create a TermQuery directly otherwise we would end up with a circular call chain. Does this sound ok to you?

Doc value iterators are only good for sure if they were written with Lucene 7. So I'm wondering that we might still want to write field names entries if the index was created before 6.0 as well as only use DocValuesFieldExistsQuery if the index was created on or after 6.0?

Ok so this would mean checking the indexCreatedVersion at both index time and query time to determine how to index/query the data? It would also mean the index-time performance improvements would only be visible for indexes created after 6.0?

@jpountz
Copy link
Copy Markdown
Contributor

jpountz commented Oct 13, 2017

Does this sound ok to you?

I think it is fine. It would also be symmetric since we add StringField instances manually at index time rather than calling FieldNamesFieldMapper.parse?

Ok so this would mean checking the indexCreatedVersion at both index time and query time to determine how to index/query the data? It would also mean the index-time performance improvements would only be visible for indexes created after 6.0?

Correct.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth a short comment about how we only need it in this case.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, will add

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should you override parseCreateField to throw an IllegalStateException or something? I don't think we use it any more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We didn't use it until I pushed the latest commits which put the BWC code in to make sure _field_names is still used for indexes created before 6.1. I will make sure I change parseCreateField to throw an exception when I remove the BWC code though. Thanks for pointing this out.

@colings86
Copy link
Copy Markdown
Contributor Author

@nik9000 @jpountz thanks for the reviews so far. I've pushed some more commits, would you be able to take another look?

Copy link
Copy Markdown
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation

Before this change we wrote the name all the fields in a document to a `_field_names` field and then implemented exists queries as a term query on this field. The problem with this approach is that it bloats the index and also affects indexing performance.

This change adds a new method `existsQuery()` to `MappedFieldType` which is implemented by each sub-class. For most field types if doc values are available a `DocValuesFieldExistsQuery` is used, falling back to using `_field_names` if doc values are disabled. Note that only fields where no doc values are available are written to `_field_names`.

Closes #26770
These values will need to be changed after backporting this PR to 6.x
@colings86 colings86 merged commit 99aca9c into elastic:master Nov 1, 2017
colings86 added a commit that referenced this pull request Nov 1, 2017
* Enhances exists queries to reduce need for `_field_names`

Before this change we wrote the name all the fields in a document to a `_field_names` field and then implemented exists queries as a term query on this field. The problem with this approach is that it bloats the index and also affects indexing performance.

This change adds a new method `existsQuery()` to `MappedFieldType` which is implemented by each sub-class. For most field types if doc values are available a `DocValuesFieldExistsQuery` is used, falling back to using `_field_names` if doc values are disabled. Note that only fields where no doc values are available are written to `_field_names`.

Closes #26770

* Addresses review comments

* Addresses more review comments

* implements existsQuery explicitly on every mapper

* Reinstates ability to perform term query on `_field_names`

* Added bwc depending on index created version

* Review Comments

* Skips tests that are not supported in 6.1.0

These values will need to be changed after backporting this PR to 6.x
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Nov 1, 2017
* master:
  Enhances exists queries to reduce need for `_field_names` (elastic#26930)
  Added new terms_set query
  Set request body to required to reflect the code base (elastic#27188)
  Update Docker docs for 6.0.0-rc2 (elastic#27166)
  Add version 6.0.0
  Docs: restore now fails if it encounters incompatible settings (elastic#26933)
  Convert index blocks to cluster block exceptions (elastic#27050)
  [DOCS] Link remote info API in Cross Cluster Search docs page
  Fix Laplace scorer to multiply by alpha (and not add) (elastic#27125)
  [DOCS] Clarify migrate guide and search request validation
  Raise IllegalArgumentException if query validation failed (elastic#26811)
  prevent duplicate fields when mixing parent and root nested includes (elastic#27072)
  TopHitsAggregator must propagate calls to `setScorer`. (elastic#27138)
@colings86 colings86 deleted the enhance/existsQuery branch November 1, 2017 15:30
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Nov 2, 2017
* master:
  Remove checkpoint tracker bit sets setting
  Fix stable BWC branch detection logic
  Fix logic detecting unreleased versions
  Enhances exists queries to reduce need for `_field_names` (elastic#26930)
  Added new terms_set query
  Set request body to required to reflect the code base (elastic#27188)
  Update Docker docs for 6.0.0-rc2 (elastic#27166)
  Add version 6.0.0
  Docs: restore now fails if it encounters incompatible settings (elastic#26933)
  Convert index blocks to cluster block exceptions (elastic#27050)
  [DOCS] Link remote info API in Cross Cluster Search docs page
  Fix Laplace scorer to multiply by alpha (and not add) (elastic#27125)
  [DOCS] Clarify migrate guide and search request validation
  Raise IllegalArgumentException if query validation failed (elastic#26811)
  prevent duplicate fields when mixing parent and root nested includes (elastic#27072)
  TopHitsAggregator must propagate calls to `setScorer`. (elastic#27138)
jasontedor added a commit that referenced this pull request Nov 2, 2017
* master:
  Lazy initialize checkpoint tracker bit sets
  Remove checkpoint tracker bit sets setting
  Fix stable BWC branch detection logic
  Fix logic detecting unreleased versions
  Enhances exists queries to reduce need for `_field_names` (#26930)
  Added new terms_set query
  Set request body to required to reflect the code base (#27188)
  Update Docker docs for 6.0.0-rc2 (#27166)
  Add version 6.0.0
  Docs: restore now fails if it encounters incompatible settings (#26933)
  Convert index blocks to cluster block exceptions (#27050)
  [DOCS] Link remote info API in Cross Cluster Search docs page
  Fix Laplace scorer to multiply by alpha (and not add) (#27125)
  [DOCS] Clarify migrate guide and search request validation
  Raise IllegalArgumentException if query validation failed (#26811)
  prevent duplicate fields when mixing parent and root nested includes (#27072)
  TopHitsAggregator must propagate calls to `setScorer`. (#27138)
jasontedor added a commit that referenced this pull request Nov 2, 2017
* 6.x:
  Lazy initialize checkpoint tracker bit sets
  Remove checkpoint tracker bit sets setting
  Tests: Fix FullClusterRestartIT.testSnapshotRestore test failing in 6.x (#27218)
  Fix FullClusterRestartIT using lenient booleans with 6.0
  Fix stable BWC branch detection logic
  Add version 6.0.0
  Fix logic detecting unreleased versions
  Skips exists query tests on unsupported versions
  Enhances exists queries to reduce need for `_field_names` (#26930)
  Added new terms_set query
  Set request body to required to reflect the code base (#27188)
  Update Docker docs for 6.0.0-rc2 (#27166)
  Docs: restore now fails if it encounters incompatible settings (#26933)
  Convert index blocks to cluster block exceptions (#27050)
  [DOCS] Link remote info API in Cross Cluster Search docs page
  prevent duplicate fields when mixing parent and root nested includes (#27072)
  TopHitsAggregator must propagate calls to `setScorer`. (#27138)
@colings86 colings86 added :Search/Search Search-related issues that do not fall into other categories and removed WIP labels Nov 2, 2017
@Bek
Copy link
Copy Markdown
Contributor

Bek commented Nov 7, 2020

This is triggering a few thousand DocValuesFieldExistsQuery for us when using exists for a field type object because objects under the field have even more deeply nested documents. Queries taking a second before are taking around 30-40 seconds. Profiling is indicating that "exists": "user" causes the DocValuesFieldExistsQuery to be called on each and all subfields as if "exists": "user.*" is intended. "exists": "user.id" returns without performance degradation. Is this the expected behavior?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>enhancement :Search/Search Search-related issues that do not fall into other categories v6.1.0 v7.0.0-beta1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make the ability to create exists queries work on a per-field basis

5 participants