Batch up failure-related ILM master tasks #81880
Closed
Labels: `:Data Management/ILM+SLM`, `>bug`, `Team:Data Management (obsolete)`
In #78547 we introduced batching for the ILM master tasks that occur on the happy path. However, if a high-shard-count cluster encounters problems while doing ILM-related things (perhaps some nodes are temporarily unavailable for taking a snapshot) then we process the resulting `ilm-retry-failed-step` and `ilm-move-to-error-step` tasks one-by-one, which can significantly delay the cluster's recovery from its problems.

We should batch these things together too.
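To illustrate why batching matters here, a toy model in Python (not the real Java code; `FailureTask`, `run_one_by_one`, `run_batched` and the index names are all hypothetical, and a dict stands in for the cluster state). It only assumes that each executed master task costs one cluster-state publication, whereas a batch costs one publication in total:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureTask:
    kind: str   # task source, e.g. "ilm-retry-failed-step"
    index: str  # hypothetical index the task applies to


def run_one_by_one(state, tasks):
    """Unbatched: every task produces and publishes its own new state."""
    publications = 0
    for task in tasks:
        state = {**state, task.index: task.kind}
        publications += 1
    return state, publications


def run_batched(state, tasks):
    """Batched: fold all queued tasks into one new state, publish once."""
    new_state = dict(state)
    for task in tasks:
        new_state[task.index] = task.kind
    return new_state, (1 if tasks else 0)


tasks = [
    FailureTask("ilm-move-to-error-step", "idx-1"),
    FailureTask("ilm-retry-failed-step", "idx-2"),
    FailureTask("ilm-move-to-error-step", "idx-3"),
]
serial_state, serial_publications = run_one_by_one({}, tasks)
batched_state, batched_publications = run_batched({}, tasks)
```

Both paths end in the same state, but the unbatched path publishes once per task, so the cost grows with the number of failing indices.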
It looks like we also enqueue duplicate `ilm-retry-failed-step` tasks on each poll interval, although we do appear to treat the duplicates as no-ops at execution time.

Relates #77466