Engine - do not index operations with seq# lower than the local checkpoint into lucene#25827
Merged
bleskes merged 6 commits intoelastic:masterfrom Jul 21, 2017
Merged
Conversation
Contributor
Author
|
retest this please |
ywelsch
approved these changes
Jul 21, 2017
Contributor
ywelsch
left a comment
There was a problem hiding this comment.
Left one suggestion for an assertion. LGTM
| // part of the lucene commit that was sent over to the replica (due to translog retention) or due to | ||
| // a concurrent indexing / recovery. For the former it is important to skip lucene as the operation in | ||
| // question may have been deleted in an out of order op that is not replayed. See testRecoverFromStoreWithOutOfOrderDelete | ||
| opVsLucene = OpVsLuceneDocStatus.OP_STALE_OR_EQUAL; |
Contributor
There was a problem hiding this comment.
can you add an assertion on the origin here? i.e. PEER_RECOVERY or REPLICA
Contributor
Author
|
thx @jasontedor & @ywelsch |
bleskes
added a commit
that referenced
this pull request
Jul 22, 2017
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
When a replica processes out of order operations, it can drop some due to version comparisons. In the past that would have resulted in a VersionConflictException being thrown and the operation was totally ignored. With the seq# push, we started storing these operations in the translog (but not indexing them into lucene) in order to have complete op histories to facilitate ops based recoveries. This in turn had the undesired effect that deleted docs may be resurrected during recovery in some extreme edge situation (see a complete explanation below). This PR contains a simple fix, which is also an optimization for the recovery process, incoming operation that have a seq# lower than the current local checkpoint (i.e., have already been processed) should not be indexed into lucene. Note that sometimes we can also skip storing them in the translog, but this is not required for the fix and is more complicated.
This is the equivalent of #25592
More details on resurrected ops
Consider two operations:
On a replica they come out of order:
If this replica becomes a primary:
- Local recovery will replay translog gen 2 and up, causing index #1 to be re-index.
- Even if recovery will start at gen 3, the translog retention policy will cause file based recovery to replay the entire translog. If it happens to start at gen 2 (but not 1), we will run into the same problem.
Some context - out of order delivery involving deletes:
On normal operations, this relies on the gc_deletes setting. We assume that the setting represents an upper bound on the time between the index and the delete operation. The index operation will be detected as stale based on the tombstone map in the LiveVersionMap.
Recovery presents a challenge as it can replay an old index operation that was in the translog and override a delete operation that was done when the engine was opened (and is not part of the replayed snapshot). To deal with this situation, we disable GC deletes (i.e. retain all deletes) for the duration of recoveries. This means that the delete operation will be remembered and the index operation ignored.
Both of the above scenarios (local recover + peer recovery) create a situation where the delete operation is never replayed. It this "lost" as lucene doesn't remember it happened and our LiveVersionMap is populated with it.
Solution:
Note that both local and peer recovery represent a scenario where we replay translog ops on top of an existing lucene index, potentially with ongoing indexing. Therefore we can treat them the same.
The local checkpoint in Lucene represent a marker indicating that all operations below it were performed on the index. This is the only form of "memory" that we have that relates to deletes. If we can achieve the following:
It will mean that all variants are covered: (i# == index op seq#, d# == delete op seq#, lc == local checkpoint in commit)
More formally - we want to make sure that for all ops that performed on the primary o1 and o2, if o2 is processed on a shard before o1, o1 will be dropped. We have the following scenarios
Implementation:
For local recovery, we can filter the ops we read of the translog and avoid replaying them. For peer recovery this is tricky as we do want to send the operations in order to have some history on the target shard. Filtering operations on the engine level (i.e., not indexing to lucene if op seq# <= lc) would work for both.