Fail replica shards locally upon failures #5847
Closed
bleskes wants to merge 2 commits into elastic:master from
Conversation
When a replication operation (index/delete/update) fails to be executed properly, we fail the replica and allow master to allocate a new copy of it. At the moment, the node hosting the primary shard is responsible for notifying the master of a failed replica. However, if the replica shard is initializing (`POST_RECOVERY` state), we have a race condition between the failed shard message and moving the shard into the `STARTED` state. If the latter happens first, master will fail to resolve the failed shard message. This PR builds on elastic#5800 and fails the engine of the replica shard if a replication operation fails. This protects us against the above, as the shard will reject the `STARTED` command from master. It also makes us more resilient to other race conditions in this area.
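To make the race concrete, here is a minimal, self-contained sketch (hypothetical names, not the actual Elasticsearch classes) of the mechanism this PR relies on: once the replica's engine has been failed locally, the shard rejects a subsequently arriving `STARTED` command from master, so the ordering of the two messages no longer matters.

```java
// Hypothetical illustration of the race fix; not Elasticsearch source code.
enum ShardState { POST_RECOVERY, STARTED, FAILED }

class ReplicaShard {
    private ShardState state = ShardState.POST_RECOVERY;

    // Called locally on the replica when a replication operation
    // (index/delete/update) fails to be executed properly.
    void failEngine(String reason) {
        state = ShardState.FAILED;
    }

    // Models master's STARTED command; rejected once the engine is failed.
    boolean moveToStarted() {
        if (state == ShardState.FAILED) {
            return false; // reject STARTED; master will allocate a new copy
        }
        state = ShardState.STARTED;
        return true;
    }

    ShardState state() { return state; }
}

public class RaceSketch {
    public static void main(String[] args) {
        ReplicaShard shard = new ReplicaShard();
        // A replication op fails while the shard is still in POST_RECOVERY...
        shard.failEngine("replication operation failed");
        // ...so even if master's STARTED command arrives afterwards,
        // the shard rejects it instead of silently becoming STARTED.
        System.out.println(shard.moveToStarted()); // prints false
        System.out.println(shard.state());         // prints FAILED
    }
}
```

Without the local engine failure, a `STARTED` that wins the race would leave the master unable to resolve the failed-shard message; with it, the rejection forces a clean re-allocation.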
Member
we end up double logging warnings, no? The first here, and the second when failing the engine. I think its enough to log a warning when failing the engine later.
Contributor
I tend to agree but I think we should log that we executed this as debug?
Contributor
one small comment but otherwise LGTM
…it contained is passed on
Contributor
Author
I pushed another commit with the log message removed. I adapted the reason (which is logged by the shard failure) to include the information that was missing. In the end I decided not to add debug logging, as there is no logic and hardly any code between here and where we log it. If anyone feels strongly about it, I'll happily add it.
Contributor
LGTM
bleskes
added a commit
that referenced
this pull request
Apr 18, 2014
When a replication operation (index/delete/update) fails to be executed properly, we fail the replica and allow master to allocate a new copy of it. At the moment, the node hosting the primary shard is responsible for notifying the master of a failed replica. However, if the replica shard is initializing (`POST_RECOVERY` state), we have a race condition between the failed shard message and moving the shard into the `STARTED` state. If the latter happens first, master will fail to resolve the failed shard message. This commit builds on #5800 and fails the engine of the replica shard if a replication operation fails. This protects us against the above, as the shard will reject the `STARTED` command from master. It also makes us more resilient to other race conditions in this area. Closes #5847