Delete by query should not silently refresh index #3593

@uschindler

Description

Hi, this issue caused a lot of trouble because it was not clear why it happened. Some of my index updates use a (quite common) approach:

I have to update a batch of documents that share a higher-level group key (not the uid), for example:

doc1: { groupKey: 'foo', _id: 'bar1' }
doc2: { groupKey: 'foo', _id: 'bar2' }
doc3: { groupKey: 'foo', _id: 'bar3' }

The code that updates this group of documents does not know the real _id values of the documents already in the index (it only knows that the whole group changes), so it first deletes all documents in the group with deleteByQuery on the group key. After that it reindexes all documents in the group (possibly with different new _id values).

If you don't disable index refreshing, the whole group can disappear for a short time and then reappear. So to make the reindexing of the whole group "atomic", you would disable index refreshing beforehand and re-enable it afterwards (or refresh only manually, which is what I do for this index anyway).

Unfortunately, deleteByQuery forcefully refreshes the index. This is hard to understand because it's not documented. There is just a comment in the code saying that the refresh is needed, although it's heavy: when executing a Lucene IndexWriter deleteByQuery, ElasticSearch does not know which documents were really deleted, so its internal tracking does not work (it cannot update version consistency, ...).
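
To make the effect concrete, here is a minimal sketch using a toy in-memory model. It is not ElasticSearch code: `ToyIndex`, `replace_group` and `demo` are hypothetical names, and "refresh" is modelled simply as copying the writer's state into the snapshot that searches see.

```python
class ToyIndex:
    """Toy model: searches only see the snapshot taken at the last refresh."""

    def __init__(self, force_refresh_on_delete):
        self.docs = {}      # writer state: _id -> document
        self.visible = {}   # snapshot that searches see
        self.force_refresh_on_delete = force_refresh_on_delete

    def refresh(self):
        # Publish the current writer state to searchers.
        self.visible = dict(self.docs)

    def index(self, doc):
        self.docs[doc["_id"]] = doc

    def delete_by_query(self, group_key):
        self.docs = {
            doc_id: doc
            for doc_id, doc in self.docs.items()
            if doc["groupKey"] != group_key
        }
        if self.force_refresh_on_delete:
            # Models the undocumented refresh described above.
            self.refresh()

    def search(self, group_key):
        return sorted(
            doc_id
            for doc_id, doc in self.visible.items()
            if doc["groupKey"] == group_key
        )


def replace_group(index, group_key, new_ids):
    """Delete the group, reindex it, and report what a search saw in between."""
    index.delete_by_query(group_key)
    seen_mid_update = index.search(group_key)
    for new_id in new_ids:
        index.index({"groupKey": group_key, "_id": new_id})
    index.refresh()  # the single manual refresh the caller intended
    return seen_mid_update


def demo(force_refresh_on_delete):
    index = ToyIndex(force_refresh_on_delete)
    for doc_id in ("bar1", "bar2", "bar3"):
        index.index({"groupKey": "foo", "_id": doc_id})
    index.refresh()
    seen_mid_update = replace_group(index, "foo", ["baz1", "baz2"])
    return seen_mid_update, index.search("foo")
```

With `demo(True)` the search in the middle of the update finds nothing for the group, while with `demo(False)` it still finds the old documents until the final manual refresh.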

I discussed this with Martijn on IRC (not even he was aware that deleteByQuery does not work with refreshing disabled). He suggested that ElasticSearch could execute the query itself and then start a bulk of deletes by _uid (this is also a possible workaround in our case if the number of deletes is small).
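
A rough sketch of the client-side variant of that workaround: assuming the ids matching the group key have already been collected by a normal search, build the newline-delimited body for a bulk request containing one `delete` action per id. The function name, the index name `myindex` and the optional `doc_type` parameter are assumptions for illustration; whether `_type` has to be sent depends on the ElasticSearch version in use.

```python
import json


def build_bulk_delete_body(index_name, doc_ids, doc_type=None):
    """Return a newline-delimited bulk body with one delete action per id."""
    lines = []
    for doc_id in doc_ids:
        meta = {"_index": index_name, "_id": doc_id}
        if doc_type is not None:
            # Assumption: only needed on versions that still use mapping types.
            meta["_type"] = doc_type
        lines.append(json.dumps({"delete": meta}, sort_keys=True))
    if not lines:
        return ""
    # The bulk format expects every line, including the last, to end with "\n".
    return "\n".join(lines) + "\n"


# Hypothetical usage: ids found by searching for groupKey 'foo'.
body = build_bulk_delete_body("myindex", ["bar1", "bar2", "bar3"])
```

Since these are plain deletes by id, the forced refresh described above should not be involved, but the two-step approach is only practical while the number of matching documents stays small.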

In my opinion the better variant would be to do it like Apache Solr, which keeps two different IndexReaders open: one for searching the index (this one is refreshed periodically), and a second one, another NRT reader on the IndexWriter, that is used to update internal data structures after the IndexWriter has written changes. So the ES internal data should be updated using a new NRT reader and not the one used for searching.
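
A toy sketch of that idea, again purely in memory and not based on actual Solr or ElasticSearch code: `TwoReaderIndex` and its methods are hypothetical, and a "reader" is modelled as a plain copy of the writer's state. The point is only that the internal bookkeeping is resolved through a private view, so the view used for searching is left untouched until an explicit refresh.

```python
class TwoReaderIndex:
    """Toy model with a search view and a separate internal view of the writer."""

    def __init__(self):
        self.writer_docs = {}   # stand-in for the IndexWriter's state
        self.search_view = {}   # what searches see; changes only on refresh()
        self.versions = {}      # internal tracking: _id -> version, None if deleted

    def index(self, doc):
        doc_id = doc["_id"]
        self.writer_docs[doc_id] = doc
        self.versions[doc_id] = (self.versions.get(doc_id) or 0) + 1

    def refresh(self):
        self.search_view = dict(self.writer_docs)

    def _open_internal_reader(self):
        # Private snapshot of the writer; never handed out to searches.
        return dict(self.writer_docs)

    def delete_by_query(self, group_key):
        internal = self._open_internal_reader()
        deleted = [
            doc_id
            for doc_id, doc in internal.items()
            if doc["groupKey"] == group_key
        ]
        for doc_id in deleted:
            del self.writer_docs[doc_id]
            self.versions[doc_id] = None  # bookkeeping updated without a refresh
        return sorted(deleted)

    def search(self, group_key):
        return sorted(
            doc_id
            for doc_id, doc in self.search_view.items()
            if doc["groupKey"] == group_key
        )
```

In this model a delete by query updates `versions` immediately, while `search` keeps returning the old group until the caller invokes `refresh()` after reindexing.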
