Limit shard realocation retries by idegtiarenko · Pull Request #90296 · elastic/elasticsearch

idegtiarenko · 2022-09-23T11:21:56Z

This change ensures that Elasticsearch does not retry relocating a shard indefinitely if the operation keeps failing.

Closes: #79445
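
As a rough illustration of what limiting retries means here, the following minimal sketch caps the number of relocation attempts. It is not the code from this PR: `RetryLimitSketch`, `MAX_RETRIES`, and `canRelocate` are hypothetical names, and the limit of 5 is an assumed value chosen only for the example. The commit list mentions updating the max retry allocation decider, which this sketch does not try to reproduce.

```java
public class RetryLimitSketch {
    // Assumed limit, chosen only for this example.
    static final int MAX_RETRIES = 5;

    // A relocation may be attempted only while the number of
    // previously failed attempts is below the limit.
    static boolean canRelocate(int failedRelocations) {
        return failedRelocations < MAX_RETRIES;
    }

    public static void main(String[] args) {
        int failedRelocations = 0;
        int attempts = 0;
        // Simulate a relocation that fails every time it is tried.
        while (canRelocate(failedRelocations)) {
            attempts++;
            failedRelocations++;
        }
        System.out.println("gave up after " + attempts + " attempts");
    }
}
```

Without the check in `canRelocate`, the loop above would never terminate, which is the behavior the change is meant to prevent.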

elasticsearchmachine · 2022-09-23T15:47:27Z

Pinging @elastic/es-distributed (Team:Distributed)

elasticsearchmachine · 2022-09-26T07:55:14Z

Hi @idegtiarenko, I've created a changelog YAML for you.

DaveCTurner

I left some initial comments.

DaveCTurner · 2022-09-26T07:57:50Z

server/src/main/java/org/elasticsearch/cluster/routing/RelocationFailureInfo.java

+/**
+ * Holds additional information as to why the shard failed to relocate.
+ */
+public class RelocationFailureInfo implements ToXContentFragment, Writeable {


I wonder, do we need a whole new object for this or could we just use a plain int? We don't really need to distinguish null from 0 I think.

If we really want an object I'd still rather it was never null, and maybe use a record instead?

Converted to a record. I wanted to keep it as a class/record in case we want to make this behavior more complex in the future (similar to UnassignedInfo, for example recording failed nodes or introducing a delay).
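
The outcome of this thread, a small value type that is never null rather than a plain int, might look roughly like the sketch below. It is illustrative only and not the class from the PR: the component name `failedRelocations`, the method `incFailedRelocations`, and the `Shard` holder are assumptions, and the `ToXContentFragment` and `Writeable` implementations are omitted so the example stays self-contained. The constant name `NO_FAILURES` is taken from the ShardRouting.java diff quoted further down.

```java
import java.util.Objects;

public class RelocationFailureInfoSketch {
    // Hypothetical stand-in for the record discussed in the review.
    record FailureInfo(int failedRelocations) {
        // Shared "no failures" instance, so holders never need a null.
        static final FailureInfo NO_FAILURES = new FailureInfo(0);

        // Compact constructor: reject counts that cannot occur.
        FailureInfo {
            if (failedRelocations < 0) {
                throw new IllegalArgumentException("failedRelocations must be >= 0");
            }
        }

        // Records are immutable, so incrementing returns a new instance.
        FailureInfo incFailedRelocations() {
            return new FailureInfo(failedRelocations + 1);
        }
    }

    // Hypothetical holder that enforces the "never null" rule.
    record Shard(String id, FailureInfo failureInfo) {
        Shard {
            Objects.requireNonNull(failureInfo, "failureInfo must not be null");
        }
    }

    public static void main(String[] args) {
        Shard shard = new Shard("shard-0", FailureInfo.NO_FAILURES);
        Shard failedOnce = new Shard(shard.id(), shard.failureInfo().incFailedRelocations());
        System.out.println(shard);
        System.out.println(failedOnce);
    }
}
```

Keeping a record leaves room to add more components later, which is the extensibility argument made above, while the sentinel keeps callers from having to distinguish null from zero.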

server/src/main/java/org/elasticsearch/cluster/routing/RoutingNodes.java

server/src/test/java/org/elasticsearch/cluster/routing/RelocationFailureInfoTests.java

server/src/main/java/org/elasticsearch/cluster/routing/RelocationFailureInfo.java


idegtiarenko · 2022-09-26T14:56:29Z

org.elasticsearch.cluster.routing.allocation.decider.MockDiskUsagesIT#testOnlyMovesEnoughShardsToDropBelowHighWatermark is failing for this change. I am investigating why.

UPD:

The failure is caused by:

elasticsearch/server/src/main/java/org/elasticsearch/cluster/routing/allocation/decider/DiskThresholdDecider.java

Lines 164 to 168 in 7dc8806

    
    String actualPath = clusterInfo.getDataPath(routing);
    if (actualPath == null) {
        // we might know the path of this shard from before when it was relocating
        actualPath = clusterInfo.getDataPath(routing.cancelRelocation());
    }

With this change, routing.cancelRelocation() is no longer equal to the shard routing as it was before relocation, so the fallback lookup no longer finds the data path. Related to #90109
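
One plausible way such a lookup can start failing: if the data path is stored in a map keyed by the routing object, any extra state that takes part in equals/hashCode changes which keys match. The sketch below assumes that kind of equality-based lookup. `Routing`, its `failedRelocations` component, and the map are hypothetical stand-ins rather than the real `ShardRouting` and `ClusterInfo` types, and the thread does not say which field actually differs.

```java
import java.util.HashMap;
import java.util.Map;

public class RoutingKeySketch {
    // Hypothetical simplified routing: every component, including the
    // failure counter, takes part in the generated equals/hashCode.
    record Routing(String shardId, String nodeId, int failedRelocations) {}

    public static void main(String[] args) {
        Map<Routing, String> dataPaths = new HashMap<>();

        // Path recorded for the shard before any relocation failed.
        Routing before = new Routing("[index][0]", "node-1", 0);
        dataPaths.put(before, "/data/0");

        // Same shard on the same node, rebuilt after one failed relocation.
        Routing afterCancel = new Routing("[index][0]", "node-1", 1);

        System.out.println(dataPaths.get(before));      // prints /data/0
        System.out.println(dataPaths.get(afterCancel)); // prints null
    }
}
```

Keying such a map by a narrower identity, for example shard and node only, would avoid this kind of mismatch.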

idegtiarenko · 2022-09-26T15:22:54Z

server/src/main/java/org/elasticsearch/cluster/routing/ShardRouting.java

            unassignedInfo.toXContent(builder, params);
        }
+        if (relocationFailureInfo != RelocationFailureInfo.NO_FAILURES) {
+            relocationFailureInfo.toXContent(builder, params);


Do we want to show this unconditionally?

I think it's friendlier to clients if we don't change the response shape like this, so let's make it unconditional. NB RelocationFailureInfo#toXContent only emits the field if it's nonzero - it'd be better not to have that conditionality too.

Do we have any clients that fail on unknown fields that we need to notify about the change?
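
The two response shapes discussed in this thread can be compared with a small sketch. It uses a plain `Map` in place of Elasticsearch's `XContentBuilder`, and the key names `state` and `relocation_failure_info` are assumptions made for illustration, not names confirmed in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ResponseShapeSketch {
    // Conditional: the key appears only when there were failures,
    // so clients see two different response shapes.
    static Map<String, Object> conditional(String state, int failedRelocations) {
        Map<String, Object> out = new LinkedHashMap<>();
        out.put("state", state);
        if (failedRelocations != 0) {
            out.put("relocation_failure_info", failedRelocations);
        }
        return out;
    }

    // Unconditional: the key is always present, even when the count is zero.
    static Map<String, Object> unconditional(String state, int failedRelocations) {
        Map<String, Object> out = new LinkedHashMap<>();
        out.put("state", state);
        out.put("relocation_failure_info", failedRelocations);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(conditional("STARTED", 0).keySet());
        System.out.println(unconditional("STARTED", 0).keySet());
    }
}
```

With the unconditional variant a client can always read the field and treat zero as "no failures", instead of having to handle a missing key.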

server/src/main/java/org/elasticsearch/cluster/routing/RelocationFailureInfo.java


DaveCTurner

LGTM

idegtiarenko added 4 commits September 22, 2022 15:22

failure dto

aca9635

introduce a parameter

f13ea0e

add integration test

2c30419

update max retry allocation decider

01c2966

idegtiarenko added the :Distributed/Allocation, Team:Distributed, and v8.6.0 labels Sep 23, 2022

idegtiarenko added 4 commits September 23, 2022 15:52

Merge branch 'main' into limit_shard_realocation_retries

033a744

reset failure count

8c8dfb7

fix assertion

9a5b86a

keep relocation failure info

06e0a55

idegtiarenko marked this pull request as ready for review September 23, 2022 15:47

idegtiarenko requested a review from DaveCTurner September 23, 2022 15:47

DaveCTurner added the >enhancement label Sep 26, 2022

Update docs/changelog/90296.yaml

34a324b

DaveCTurner reviewed Sep 26, 2022

idegtiarenko added 5 commits September 26, 2022 14:03

Merge branch 'main' into limit_shard_realocation_retries

9972638

fix comments

473644d

convert to record

d8d8368

fix assertion failure

d6c1166

fix test instance creation

81caa53

DaveCTurner reviewed Sep 26, 2022

server/src/main/java/org/elasticsearch/cluster/routing/RelocationFailureInfo.java (outdated, resolved)

idegtiarenko and others added 2 commits September 26, 2022 15:44

Update server/src/main/java/org/elasticsearch/cluster/routing/Relocat…

85da3e9

…ionFailureInfo.java Co-authored-by: David Turner <david.turner@elastic.co>

make new field not nullable

06d3c66

idegtiarenko added 2 commits September 26, 2022 17:20

fix tests

7a5ad46

one more nullable annotation

feb297e

idegtiarenko commented Sep 26, 2022

Merge branch 'main' into limit_shard_realocation_retries

c10b864

DaveCTurner reviewed Sep 26, 2022

server/src/main/java/org/elasticsearch/cluster/routing/RelocationFailureInfo.java (outdated, resolved)

idegtiarenko added 3 commits September 26, 2022 18:17

Write relocationFailureInfo unconditionally

6ca8522

fix tests

862a86c

Merge branch 'main' into limit_shard_realocation_retries

ae1a2bc

# Conflicts: # server/src/main/java/org/elasticsearch/cluster/routing/ShardRouting.java

idegtiarenko requested a review from DaveCTurner September 27, 2022 08:25

fix serialization version

77676d8

DaveCTurner approved these changes Sep 27, 2022

idegtiarenko merged commit 24cf871 into elastic:main Sep 27, 2022

idegtiarenko deleted the limit_shard_realocation_retries branch September 27, 2022 12:44

idegtiarenko merged 23 commits into elastic:main from idegtiarenko:limit_shard_realocation_retries


3 participants
