Restore local history from translog on promotion #33616
Merged
dnhatn merged 7 commits into elastic:master on Sep 20, 2018
Conversation
If a shard was serving as a replica when another shard was promoted to primary, then its Lucene index was reset to the global checkpoint. However, if the new primary fails before the primary/replica resync completes and we are now being promoted, we have to restore the reverted operations by replaying the translog to avoid losing acknowledged writes. Relates elastic#32867
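A minimal sketch of that restore path, assuming the method shape suggested by the review excerpts below (getTranslog, localCheckpointTracker, and newSnapshotFromMinSeqNo are assumed helpers; the PR's actual signatures may differ):

// Illustrative sketch only: replay translog operations above the local
// checkpoint so that writes reverted by the earlier Lucene reset are
// applied again on the newly promoted primary.
@Override
public int restoreLocalHistoryFromTranslog(TranslogRecoveryRunner translogRecoveryRunner) throws IOException {
    ensureOpen();
    final long localCheckpoint = localCheckpointTracker.getCheckpoint();
    // Snapshot everything the translog still holds above the local checkpoint
    // and feed it through the recovery runner (see the diff excerpt below).
    try (Translog.Snapshot snapshot = getTranslog().newSnapshotFromMinSeqNo(localCheckpoint + 1)) {
        return translogRecoveryRunner.run(this, snapshot);
    }
}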
Collaborator
Pinging @elastic/es-distributed
dnhatn commented on Sep 12, 2018
            localCheckpointTracker.markSeqNoAsCompleted(operation.seqNo());
        }
    }
    return translogRecoveryRunner.run(this, snapshot);
Member
Author
We could keep track of a max_seqno to recover from the translog when we roll back this engine (i.e., record recover_upto and the translog's max_seqno at that time), then only restore if needed. However, I opted not to, for simplicity.
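For illustration, that alternative might look like the following sketch; maxSeqNoAtRollback, onEngineRollback, and getMaxSeqNo are hypothetical names, not code from this PR:

// Hypothetical alternative (not adopted by this PR): remember the translog's
// max seq# when the engine is rolled back, and skip the replay entirely when
// nothing above the local checkpoint was reverted.
private long maxSeqNoAtRollback = SequenceNumbers.NO_OPS_PERFORMED;

void onEngineRollback() {
    maxSeqNoAtRollback = getTranslog().getMaxSeqNo(); // assumed accessor
}

int maybeRestoreLocalHistory(TranslogRecoveryRunner runner) throws IOException {
    if (localCheckpointTracker.getCheckpoint() >= maxSeqNoAtRollback) {
        return 0; // nothing reverted beyond the local checkpoint; skip the replay
    }
    return restoreLocalHistoryFromTranslog(runner);
}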
s1monw
suggested changes
Sep 12, 2018
  @Override
- public void restoreLocalCheckpointFromTranslog() {
+ public int restoreLocalHistoryFromTranslog(TranslogRecoveryRunner translogRecoveryRunner) {
      assert false : "this should not be called";
Contributor
I don't understand why this throws an exception. If you have an index that is read-only and uses this engine, and a primary gets promoted, this should be a no-op, not a UOE?
Member
Author
Yes, we should make this a no-op (just like fillSeqNoGaps).
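A sketch of that agreed-upon no-op on the read-only engine (illustrative; the actual change may differ):

// A read-only engine has no local history to replay, so promotion restores
// nothing, mirroring how fillSeqNoGaps is a no-op for this engine.
@Override
public int restoreLocalHistoryFromTranslog(TranslogRecoveryRunner translogRecoveryRunner) {
    return 0; // zero operations replayed
}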
ywelsch
approved these changes
Sep 20, 2018
dnhatn added a commit that referenced this pull request on Sep 20, 2018
If a shard was serving as a replica when another shard was promoted to primary, then its Lucene index was reset to the global checkpoint. However, if the new primary fails before the primary/replica resync completes and we are now being promoted, we have to restore the reverted operations by replaying the translog to avoid losing acknowledged writes. Relates #33473 Relates #32867
kcm pushed a commit that referenced this pull request on Oct 30, 2018
If a shard was serving as a replica when another shard was promoted to primary, then its Lucene index was reset to the global checkpoint. However, if the new primary fails before the primary/replica resync completes and we are now being promoted, we have to restore the reverted operations by replaying the translog to avoid losing acknowledged writes. Relates #33473 Relates #32867