Batch up failure-related ILM master tasks

In https://github.com/elastic/elasticsearch/pull/78547 we introduced batching for the ILM master tasks that occur on the happy path. However, if a high-shard-count cluster encounters problems while doing ILM-related things (perhaps some nodes are temporarily unavailable for taking a snapshot), then we process the resulting `ilm-retry-failed-step` and `ilm-move-to-error-step` tasks one by one, which can significantly delay the cluster's recovery from its problems.

We should batch these things together too.
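
To make the cost difference concrete, here is a rough sketch of one-by-one versus batched execution. It is not Elasticsearch code: `IlmBatchSketch`, the map-based "cluster state", and the `publications` counter are all hypothetical stand-ins, and the real master service is considerably more involved.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for the master's task execution; not the real Elasticsearch API.
final class IlmBatchSketch {
    // Counts simulated cluster-state publications (the expensive part).
    static int publications = 0;

    // One-by-one: every task produces a new state that is published on its own.
    static Map<String, String> runOneByOne(Map<String, String> state,
                                           List<UnaryOperator<Map<String, String>>> tasks) {
        for (UnaryOperator<Map<String, String>> task : tasks) {
            state = task.apply(state);
            publications++;
        }
        return state;
    }

    // Batched: all tasks are folded into a single new state, which is published once.
    static Map<String, String> runBatched(Map<String, String> state,
                                          List<UnaryOperator<Map<String, String>>> tasks) {
        if (tasks.isEmpty()) {
            return state; // nothing to do, nothing to publish
        }
        for (UnaryOperator<Map<String, String>> task : tasks) {
            state = task.apply(state);
        }
        publications++;
        return state;
    }

    // Simulated move-to-error-step task: marks one index as being in the error step.
    static UnaryOperator<Map<String, String>> moveToErrorStep(String index) {
        return old -> {
            Map<String, String> next = new HashMap<>(old);
            next.put(index, "ERROR");
            return next;
        };
    }
}
```

With N failing indices the first variant publishes N times and the second publishes once, while both end in the same state.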

It looks like we also enqueue a duplicate `ilm-retry-failed-step` task on each poll interval, although we do appear to treat the duplicates as no-ops at execution time.
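
One possible way to avoid queueing the duplicates at all is to key pending retries by index and failed step, so a second poll for the same failure adds nothing. This is only a sketch under that assumption: `RetryQueueSketch` and its methods are hypothetical names, not part of ILM, and the step names used with it are placeholders.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical pending-retry queue that drops duplicates at enqueue time.
final class RetryQueueSketch {
    // Each key is the pair (index, failedStep); a List gives value-based equality.
    private final Set<List<String>> pending = new LinkedHashSet<>();

    // Returns true if the retry was queued, false if an identical one is already pending.
    boolean enqueue(String index, String failedStep) {
        return pending.add(List.of(index, failedStep));
    }

    int size() {
        return pending.size();
    }

    // Hands over everything pending as one batch, in insertion order, and clears the queue.
    List<List<String>> drain() {
        List<List<String>> batch = new ArrayList<>(pending);
        pending.clear();
        return batch;
    }
}
```

Once a batch has been drained, the same retry can be queued again, so a later poll still picks up an index that failed a second time.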

Relates https://github.com/elastic/elasticsearch/issues/77466

Batch up failure-related ILM master tasks #81880
