Skip to content

Limit shard realocation retries#90296

Merged
idegtiarenko merged 23 commits intoelastic:mainfrom
idegtiarenko:limit_shard_realocation_retries
Sep 27, 2022
Merged

Limit shard realocation retries#90296
idegtiarenko merged 23 commits intoelastic:mainfrom
idegtiarenko:limit_shard_realocation_retries

Conversation

@idegtiarenko
Copy link
Copy Markdown
Contributor

This change ensures that elasticsearch would not indefinitely retry relocating shard if operation fails.

Closes: #79445

@idegtiarenko idegtiarenko added :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed Meta label for distributed team. v8.6.0 labels Sep 23, 2022
@idegtiarenko idegtiarenko marked this pull request as ready for review September 23, 2022 15:47
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @idegtiarenko, I've created a changelog YAML for you.

Copy link
Copy Markdown
Member

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some initial comments.

/**
* Holds additional information as to why the shard failed to relocate.
*/
public class RelocationFailureInfo implements ToXContentFragment, Writeable {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder, do we need a whole new object for this or could we just use a plain int? We don't really need to distinguish null from 0 I think.

If we really want an object I'd still rather it was never null, and maybe use a record instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converted to a record. I wanted to keep it a class/record in case we want to make this behavior more complex in the future (similar to UnassignedInfo with recording failed nodes or introducing a delay)

idegtiarenko and others added 2 commits September 26, 2022 15:44
…ionFailureInfo.java

Co-authored-by: David Turner <david.turner@elastic.co>
@idegtiarenko
Copy link
Copy Markdown
Contributor Author

idegtiarenko commented Sep 26, 2022

org.elasticsearch.cluster.routing.allocation.decider.MockDiskUsagesIT#testOnlyMovesEnoughShardsToDropBelowHighWatermark is failing for this change. I am investigating why.

UPD:

The failure is caused by:

String actualPath = clusterInfo.getDataPath(routing);
if (actualPath == null) {
// we might know the path of this shard from before when it was relocating
actualPath = clusterInfo.getDataPath(routing.cancelRelocation());
}

routing.cancelRelocation() is no longer equal the shard before relocation. Related to #90109

unassignedInfo.toXContent(builder, params);
}
if (relocationFailureInfo != RelocationFailureInfo.NO_FAILURES) {
relocationFailureInfo.toXContent(builder, params);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to show this unconditionally?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's friendlier to clients if we don't change the response shape like this, so let's make it unconditional. NB RelocationFailureInfo#toXContent only emits the field if it's nonzero - it'd be better not to have that conditionality too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any clients that fail on unknown fields that we need to notify about the change?

# Conflicts:
#	server/src/main/java/org/elasticsearch/cluster/routing/ShardRouting.java
Copy link
Copy Markdown
Member

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@idegtiarenko idegtiarenko merged commit 24cf871 into elastic:main Sep 27, 2022
@idegtiarenko idegtiarenko deleted the limit_shard_realocation_retries branch September 27, 2022 12:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >enhancement Team:Distributed Meta label for distributed team. v8.6.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Shard relocations are retried indefinitely

3 participants