ILM's org.elasticsearch.xpack.ilm.IndexLifecycleService#triggerPolicies can queue up an unlimited number of cluster state updates on slow master nodes. This method is invoked on every cluster state application.
It submits a task, at priority NORMAL, for every index it decides needs work. So the following can happen easily under load:
- master works through a number of higher-than-NORMAL priority tasks
- each of them triggers an ILM task at priority NORMAL for every index that has outstanding work (without checking for duplicates)
=> as master works through the higher priority tasks, it uses more and more memory for queued ILM tasks as long as there is outstanding higher priority work
=> even when master eventually gets to the NORMAL priority tasks, each of them again triggers all policies, adding more duplicate work and eventually leading to runaway task counts if things slow down enough
ILM needs to limit and deduplicate these tasks to avoid running into this. I will see if I can find a quick fix to unblock benchmarking, but a complete solution looks quite involved.
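One possible shape for the deduplication is to track which indices already have an ILM task queued and skip resubmission until that task completes. The sketch below is illustrative only (the class and method names are hypothetical, not from the Elasticsearch codebase); it stands in for guarding the cluster state update submission in `triggerPolicies` with a per-index in-flight set:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: deduplicate per-index ILM task submissions so that
// repeated triggerPolicies invocations cannot enqueue the same work twice.
public class IlmTaskDeduplicator {
    // Indices that currently have a queued (not yet completed) ILM task.
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();
    private final AtomicInteger submitted = new AtomicInteger();

    // Returns true only if a task for this index was actually enqueued.
    public boolean submit(String index) {
        if (!inFlight.add(index)) {
            return false; // a task for this index is already queued; skip
        }
        submitted.incrementAndGet(); // stand-in for submitting the cluster state update
        return true;
    }

    // Called from the task's completion listener, allowing future resubmission.
    public void onTaskCompleted(String index) {
        inFlight.remove(index);
    }

    public int submittedCount() {
        return submitted.get();
    }

    public static void main(String[] args) {
        IlmTaskDeduplicator dedup = new IlmTaskDeduplicator();
        // Simulate triggerPolicies firing three times before any task runs:
        for (int i = 0; i < 3; i++) {
            dedup.submit("logs-000001");
            dedup.submit("metrics-000001");
        }
        System.out.println(dedup.submittedCount()); // 2, not 6
        dedup.onTaskCompleted("logs-000001");
        System.out.println(dedup.submit("logs-000001")); // true: can re-queue after completion
    }
}
```

With this guard, queue depth is bounded by the number of indices with outstanding work rather than by how many cluster state applications happen while master is busy. Care is needed to clear the in-flight entry on failure paths as well, or an index could get stuck and never be processed again.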