Skip to content

Replay history of operations in remote recovery#39153

Closed
dnhatn wants to merge 33 commits intoelastic:masterfrom
dnhatn:remote-recovery-history
Closed

Replay history of operations in remote recovery#39153
dnhatn wants to merge 33 commits intoelastic:masterfrom
dnhatn:remote-recovery-history

Conversation

@dnhatn
Copy link
Copy Markdown
Member

@dnhatn dnhatn commented Feb 19, 2019

The safe commit invariant does not hold for the following indices because we do not replay the history in remote recovery.

Relates #39000
Relates #35975

@dnhatn dnhatn added blocker v7.0.0 :Distributed/CCR Issues around the Cross Cluster State Replication features v6.7.0 v8.0.0 v7.2.0 labels Feb 19, 2019
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed

@dnhatn
Copy link
Copy Markdown
Member Author

dnhatn commented Feb 20, 2019

run elasticsearch-ci/default-distro

Copy link
Copy Markdown
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dnhatn. I've had an initial look and this looks already great. I'm not super happy with restoreShard returning a Translog.Snapshot, as it's now up to the caller to complete building the shard but also see that it's tricky to otherwise keep information about session etc. local to the repository implementation.

* to the max_seq_no. Then we won't have a safe commit for the restoring commit is not safe (missing translog).
* To maintain the safe commit assumption, we have to forcefully flush a new commit here.
*/
indexShard.flush(new FlushRequest().force(true).waitIfOngoing(true));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we make sure that the global checkpoint is up-to-date (i.e. >= max-seq-no in the new commit that we create here) before calling this? Otherwise the shard will be marked as in-sync in the cluster state while the commit here will only become safe commit when the shard is locally started (and the gcp advanced). The main property we're after here is that every in-sync shard copy has a safe commit, which (AFAICS) is not guaranteed by the current recovery logic.

*/
public static void assertSafeCommitExists(IndexShard shard) throws IOException {
try {
if (shard.state != IndexShardState.STARTED) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a shard in POST_RECOVERY should also satisfy this property already, see my comment above

}
}
indexShard.finalizeRecovery();
indexShard.postRecovery("restore done");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we assert in post_recovery that we have a safe commit?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added an assertion here, but this assertion may slow down our tests. I will post the difference.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly, we don't have this invariant with closed indices because the global checkpoint is not persisted to the translog checkpoint during recovery.

@dnhatn
Copy link
Copy Markdown
Member Author

dnhatn commented Mar 11, 2019

Discussed with Yannick on another channel, I marked this as WIP since we need to make broader (maybe unrelated) changes to this PR to achieve the safe commit invariant.

@dnhatn
Copy link
Copy Markdown
Member Author

dnhatn commented Apr 3, 2019

We now can achieve the safe commit invariant. I will open smaller pull requests for unrelated changes.

@dnhatn dnhatn removed the WIP label Apr 5, 2019
@dnhatn dnhatn requested a review from ywelsch April 5, 2019 03:10
@dnhatn
Copy link
Copy Markdown
Member Author

dnhatn commented Apr 5, 2019

@ywelsch This is ready for another round. Can you please give it another go. Thank you!

@dnhatn
Copy link
Copy Markdown
Member Author

dnhatn commented May 16, 2019

I am closing this PR as the safe commit invariant does not hold for follower indices.

@dnhatn dnhatn closed this May 16, 2019
@dnhatn dnhatn deleted the remote-recovery-history branch May 16, 2019 20:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/CCR Issues around the Cross Cluster State Replication features >enhancement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants