Deduplicate Shard Started Requests #82089
original-brownbear merged 4 commits into elastic:master from original-brownbear:81628
Conversation
Deduplicate shard started requests the same way we deduplicate shard-failed and shard snapshot state updates already. closes #81628
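The deduplication pattern the PR applies can be sketched roughly as follows. This is a hypothetical minimal re-implementation for illustration, not Elasticsearch's actual `ResultDeduplicator` (the class and method names below are assumptions): the first caller for a given request triggers the real send, and later callers with an equal request just queue their listeners until the in-flight request completes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

// Hypothetical sketch of result deduplication, not the Elasticsearch class:
// only the first caller for a given request performs the real send; callers
// arriving while that request is in flight merely register a listener.
final class DeduplicatorSketch<T, R> {
    private final Map<T, List<Consumer<R>>> waiting = new HashMap<>();

    public void executeOnce(T request, Consumer<R> listener, BiConsumer<T, Consumer<R>> sender) {
        boolean first;
        synchronized (this) {
            List<Consumer<R>> listeners = waiting.computeIfAbsent(request, r -> new ArrayList<>());
            first = listeners.isEmpty();
            listeners.add(listener);
        }
        if (first) {
            sender.accept(request, response -> {
                List<Consumer<R>> done;
                synchronized (this) {
                    done = waiting.remove(request);
                }
                // done may be null if clear() raced with completion; best effort.
                if (done != null) {
                    for (Consumer<R> l : done) {
                        l.accept(response);
                    }
                }
            });
        }
    }

    // Called e.g. on master failover so the next attempt goes to the new master.
    public synchronized void clear() {
        waiting.clear();
    }
}

class DeduplicatorDemo {
    public static void main(String[] args) {
        DeduplicatorSketch<String, String> dedup = new DeduplicatorSketch<>();
        List<Consumer<String>> pendingCompletions = new ArrayList<>();
        int[] sends = {0};
        BiConsumer<String, Consumer<String>> sender = (req, completion) -> {
            sends[0]++;                      // one real transport send per distinct in-flight request
            pendingCompletions.add(completion);
        };

        dedup.executeOnce("[shard-0] started", r -> {}, sender);
        dedup.executeOnce("[shard-0] started", r -> {}, sender); // duplicate: queued, not re-sent
        System.out.println("sends after two attempts: " + sends[0]);        // 1

        pendingCompletions.get(0).accept("ack"); // complete the in-flight request
        dedup.executeOnce("[shard-0] started", r -> {}, sender);            // new attempt: sent again
        System.out.println("sends after completion and retry: " + sends[0]); // 2
    }
}
```

On master failover, `clear()` drops the tracking map so the next attempt is sent to the new master immediately, instead of waiting on a response from the old master that will never arrive.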
Pinging @elastic/es-distributed (Team:Distributed)
henningandersen left a comment:
Can we add a test demonstrating that this works, both when the master has work queued (causing the retries) and when the master fails over?
I wonder if we should change our logic here to always send to a new master, to speed up recovery after a very slow or GC-hung master is taken over by another master?
```diff
  // a list of shards that failed during replication
  // we keep track of these shards in order to avoid sending duplicate failed shard requests for a single failing shard.
- private final ResultDeduplicator<FailedShardEntry, Void> remoteFailedShardsDeduplicator = new ResultDeduplicator<>();
+ private final ResultDeduplicator<TransportRequest, Void> remoteFailedShardsDeduplicator = new ResultDeduplicator<>();
```
I think this field needs a rename now.
++ renamed and fixed comment
I added a rather trivial test in the style of the tests that already exist for this component (the test is good enough to demonstrate proper deduplication of requests IMO). I couldn't find a quick way of testing the failover scenario below.
Yea, we had the same issue in shard snapshot state updates and I implemented the same solution now. Unfortunately I couldn't find a neat way of testing this quickly; in snapshotting this is a lot easier to test with the existing test infrastructure.
henningandersen left a comment:
LGTM, with two small additions.
```java
 * to the new master right away on master failover.
 */
public void clearRemoteShardRequestDeduplicator() {
    remoteShardStateUpdateDeduplicator.clear();
```
Since we call this from multiple threads, there is a bit of best-effort behaviour to this method; I think that is worth documenting.
For instance, this may clear out a remote shard failed request deduplication to the new master in edge cases. This does no real harm, since we still protect the master.
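The best-effort caveat above is easy to see with a stripped-down sketch (class and field names here are hypothetical, not Elasticsearch code): clearing the in-flight tracking on master failover may let one duplicate request through, which does no real harm because the master must tolerate duplicate shard-started messages anyway.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a set of in-flight keys standing in for the deduplicator.
class FailoverClearSketch {
    static final Set<String> inFlight = ConcurrentHashMap.newKeySet();
    static int sends = 0;

    static void sendShardStarted(String shardId) {
        // add() returns false if the key is already present, i.e. a duplicate attempt
        if (inFlight.add(shardId)) {
            sends++; // a real implementation would transport the request to the master here
        }
    }

    public static void main(String[] args) {
        sendShardStarted("shard-0");   // sent
        sendShardStarted("shard-0");   // deduplicated
        // master failover: clear so pending state is retried against the new master
        inFlight.clear();
        sendShardStarted("shard-0");   // sent again: a duplicate the master must tolerate
        System.out.println(sends);     // 2
    }
}
```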
++ added
```java
    assertThat(transport.capturedRequests(), arrayWithSize(0));
}

public void testDeduplicateRemoteShardStarted() throws InterruptedException {
```
Can we either add a test or randomly clear the deduplicator here and then validate we see two requests at the end?
Thanks Henning!
💚 Backport successful
LGTM as a small/interim fix, but fundamentally we should be using edge-triggering for the shard state transitions with appropriate failure handling to organise retries (see also #81626). Today's level-triggered system was necessary when cluster state updates could be lost, I guess, but that's no longer the case. I opened #82185 to track this tech debt.