Fail demoted primary shards and retry request#16415
Fail demoted primary shards and retry request#16415jasontedor wants to merge 11 commits intoelastic:masterfrom jasontedor:waiting-is-the-hardest-part
Conversation
This commit handles the scenario where a replication action fails on a replica shard, the primary shard attempts to fail the replica shard but the primary shard is notified of demotion by the master. In this scenario, the demoted primary shard must be failed, and then the request rerouted again to the new primary shard.
| // TODO: handle catastrophic non-channel failures | ||
| onReplicaFailure(nodeId, exp); | ||
| public void onFailure(Throwable shardFailedError) { | ||
| logger.error("[{}] catastrophic error while failing replica shard [{}] for [{}]", shardFailedError, shardId, shard, exp); |
There was a problem hiding this comment.
this is already logged in the shard state action
|
Thanks Jason. left some comments with suggestions |
This commit removes a logging statement from TransportReplicationAction that occurs when handling a catastrophic non-channel failure while failing a shard. The same information is also logged from ShardStateAction when the master responds to the shard failure request there.
This commit significantly simplifies the retry logic upon a demoted primary shard. In particular, rather failing the demoted primary via shard state action and starting a new reroute phase, this commit simplifies this logic by failing the demoted primary via an index shard reference, and sending a retry on primary exception back to the original reroute phase.
This commit adds a dedicated test for handling the situation of a demoted primary shard that was trying to fail a replica shard during a replication action.
This commit removes a dead request parameter from the constructor of TransportReplicationAction.ReplicationPhase.
* master: (85 commits) Rename variables. BootstrapSettings final with private constructor simplify method signature Register bootstrap settings Fix asciidoc typo Log warning if max file descriptors too low Require that relocation source is marked as relocating before starting recovery to relocation target apply feedback from @colings86 Speedup MessageDigestTests#testToHexString Fix serialization of `search_analyzer`. Cleanup JavaVersion Detach QueryShardContext from IndexShard and remove obsolete threadlocals Move sorting tests w/o scripting back to core Fix recovery translog stats totals when recovering from store [TEST] Don't assert on null value. It's fine to not always see an exception in this part of the test. Objects#requireNonNull guard in MessageDigests Fix trace logging statement in ZenDiscovery Avoid cloning MessageDigest instances Minor clean up. Make IndicesWarmer a private class of IndexService ...
|
@bleskes I've pushed more commits. |
| ShardRouting primaryShard = indexShardReference.routingEntry(); | ||
| String message = String.format(Locale.ROOT, "primary shard [%s] was demoted while failing replica shard [%s] for [%s]", primaryShard, shard, exp); | ||
| // we are no longer the primary, fail ourselves and start over | ||
| indexShardReference.failShard(message, shardFailedError); |
There was a problem hiding this comment.
we can use forceFinishAsFailed here no?
| assertPhase(task, "failed"); | ||
| } | ||
|
|
||
| public void testReroutePhaseRetriedAfterDemotedPrimary() { |
There was a problem hiding this comment.
since we know that we return with retry on primary (and we test for it), do we need a dedicated test here? can't we rely on the generic retry test (which I do home check the RetryOnPrimaryException semantics)?
There was a problem hiding this comment.
@bleskes I think that runReplicateTest is getting very complicated, and I'd prefer to keep this test with much simpler semantics for what is a complicated situation. Do you feel strongly that it should be removed?
There was a problem hiding this comment.
I meant removing this in favor of what I thought would be a centralized reroute test , but I know see there is no suche thing and we do it on a case by case basis. So Ignore me, all ok.
|
I think this turned out well. I left some comments. I'm thinking we should get this in and then open a follow up issue to deal with the annoying request-reset issue. |
|
LGTM. |
This commit handles the scenario where a replication action fails on a
replica shard, the primary shard attempts to fail the replica shard
but the primary shard is notified of demotion by the master. In this
scenario, the demoted primary shard must be failed, and then the
request rerouted again to the new primary shard.
Relates #14252