I encountered a Cloud cluster with an overworked master due (partly) to processing multiple calls to `GET /_ml/anomaly_detectors/_all/_stats` originating from an external Metricbeat monitoring process. Metricbeat imposes a 10s timeout after which it closes the HTTP connection and tries again. However, `GetTrainedModelsStatsAction` does not notice if the client connection closes (i.e. the REST handler does not use `RestCancellableNodeClient` and the resulting transport task is not a `CancellableTask`) so it carries on wastefully processing the request even after the client timeout.
Relates #55550
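
To illustrate the behaviour being asked for, here is a minimal, self-contained sketch of cooperative cancellation. It does not use the Elasticsearch classes named above: `CancelFlag`, `collectStats` and the job ids are hypothetical names made up for this example, and the real change would presumably go through `RestCancellableNodeClient` and `CancellableTask` instead. The only point is that the work loop checks a cancellation flag between units of work and stops once the client has gone away.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

public class CancellableStatsSketch {

    /** Hypothetical stand-in for a cancellable task: a flag set when the client disconnects. */
    static final class CancelFlag {
        private final AtomicBoolean cancelled = new AtomicBoolean(false);

        void cancel() {
            cancelled.set(true);
        }

        boolean isCancelled() {
            return cancelled.get();
        }
    }

    /**
     * Runs perJobWork for each job id, checking the flag before each unit of work.
     * Returns the number of jobs that were actually processed.
     */
    static int collectStats(List<String> jobIds, CancelFlag flag, Consumer<String> perJobWork) {
        int processed = 0;
        for (String jobId : jobIds) {
            if (flag.isCancelled()) {
                break; // the client has gone away, so stop doing wasted work
            }
            perJobWork.accept(jobId);
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> jobs = List.of("job-1", "job-2", "job-3", "job-4");
        CancelFlag flag = new CancelFlag();

        // Simulate the client timing out while the second job is being processed.
        int processed = collectStats(jobs, flag, jobId -> {
            if (jobId.equals("job-2")) {
                flag.cancel();
            }
        });

        System.out.println("processed " + processed + " of " + jobs.size());
    }
}
```

Without the `isCancelled()` check the loop would process all four jobs regardless of whether anyone is still waiting for the response, which is the wasted work described above.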