Today stats requests are processed by the `management` threadpool, which is also used for important internal management tasks such as syncing global checkpoints and retention leases, ILM and SLM. The `management` threadpool has an unbounded queue. Some stats take a nontrivial amount of effort to compute and it is certainly possible to request stats more frequently than the cluster can respond. We cannot control the behaviour of clients requesting stats, and I've seen more than a few situations where an errant monitoring system harms the cluster with its request rate (see links in #51915). Since we doggedly enqueue every request it can take a very long time to recover from this situation, and while working through the queue the well-behaved internal management tasks do not run in a timely fashion. The quickest recovery path may be to restart any affected nodes.
I think we should be pushing back against this kind of behaviour to protect the cluster from abusive monitoring clients. We could, for instance, use separate threadpools for the internal (and well-behaved) actions and the external (and possibly-abusive) ones, and give the threadpool handling the external actions a bounded queue. Some users of the `management` threadpool are not clearly one or the other - e.g. license management and security cache management - and we'll need to use some judgement to decide whether they need protecting from abuse. I've yet to see such actions involved in struggling clusters, however, so perhaps either way would be ok.
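As a minimal, hypothetical sketch of the push-back behaviour (plain JDK primitives, not Elasticsearch's actual `ThreadPool` machinery - all names here are illustrative), a bounded queue with an abort policy makes excess external requests fail fast instead of piling up behind the internal work:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedStatsPool {

    // Submit `requests` tasks to a pool with one worker and a queue capped at
    // four, simulating an over-eager monitoring client. Returns how many
    // submissions were rejected rather than queued.
    static int submitWithBackpressure(int requests) throws InterruptedException {
        ThreadPoolExecutor externalPool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(4),                // bounded queue
                new ThreadPoolExecutor.AbortPolicy());      // reject when full

        CountDownLatch gate = new CountDownLatch(1);        // hold the worker so the queue fills
        int rejected = 0;
        for (int i = 0; i < requests; i++) {
            try {
                externalPool.execute(() -> {
                    try {
                        gate.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            } catch (RejectedExecutionException e) {
                rejected++;                                 // client sees an immediate error
            }
        }
        gate.countDown();
        externalPool.shutdown();
        externalPool.awaitTermination(10, TimeUnit.SECONDS);
        return rejected;
    }

    public static void main(String[] args) throws InterruptedException {
        // One running task plus four queued are accepted; the rest fail fast.
        System.out.println("rejected=" + submitWithBackpressure(20));
    }
}
```

The internal (well-behaved) actions would keep an unbounded queue, so only the possibly-abusive external traffic sees rejections.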
Relates #51915