Shard failure requests for non-existent shards#16089
Shard failure requests for non-existent shards#16089jasontedor wants to merge 7 commits intoelastic:masterfrom jasontedor:validate-shard-failure-requests
Conversation
This commit adds handling on the master side for shard failure requests for shards that do not exist at the time that they are processed on the master node (whether it be from errant requests, duplicate requests, or both the primary and replica notifying the master of a shard failure). This change is made because such shard failure requests should always be considered successful (the failed shard is not there anymore), but could be marked as failed if batched with a shard failure request that does in fact fail.
|
@bleskes I've updated this pull request. |
There was a problem hiding this comment.
can we call these - missingShardTasks or alreadyRemovedTasks?
There was a problem hiding this comment.
sorry - got confused - maybe resolvedShards? (nonTrivial is so vague)
|
Thanks @jasontedor . Left some comments/questions... |
This commit renames a variable in ShardFailedClusterStateTaskExecutor#execute in that it contains the tasks that need the allocation service to be processed.
This commit simplifies the task-splitting logic when processing a batch of shard failure tasks.
There was a problem hiding this comment.
This was a bug, but fortunately the method was unused until now so not impactful.
|
@bleskes I pushed two commits. |
This commit adds tests of the execution logic when failing a batch of shards. Three tests are added: - test that an empty task list produces no change in the cluster state - test that duplicate shard failure requests are processed successfully - test that shard failure requests for non-existent shards are processed successfully - test that failing tasks do not prevent trivially successful tasks from succeeding
There was a problem hiding this comment.
can we move this up next to the stream partitioning so it will be clearer why we do this?
This commit rearranges code to clarify the intent of the task partitioning in the shard failure cluster state task executor.
There was a problem hiding this comment.
can we also add shards with the same index and id as the existing ones but with a different allocation ids?
There was a problem hiding this comment.
can we also add shards with the same index and id as the existing ones but with a different allocation ids?
That's a good suggestion.
|
LGTM. Left some suggestions for the tests, non of it is blocking pushing this. |
This commit adds some shards with the same shard ID as existing shards, but with a non-existent allocation ID. This tests that these shards are correctly handled by the shard failure cluster state task executor.
This commit adds assertions that when a shard failure request is successful, the shard that was requested to be failed is in fact no longer in the cluster state.
|
Thanks for another great review @bleskes. |
This commit adds handling on the master side for shard failure requests
for shards that do not exist at the time that they are processed on the
master node (whether it be from errant requests, duplicate requests, or
both the primary and replica notifying the master of a shard
failure). This change is made because such shard failure requests should
always be considered successful (the failed shard is not there anymore),
but could be marked as failed if batched with a shard failure request
that does in fact fail.
Relates #14252