ILM: Add cluster update timeout on step retry #54878
andreidan merged 10 commits into elastic:master from
Conversation
This commit adds a timeout when moving ILM back onto a failed step. In case the master is struggling to process the cluster state update requests, these requests will expire (we'll send them again anyway on the next ILM loop run).
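The core idea can be sketched outside of Elasticsearch with a plain-Java analogue (the `RetryTask` record and `MASTER_TIMEOUT` constant below are hypothetical illustrations, not the Elasticsearch API): each queued retry carries a deadline, and a slow "master" simply drops tasks whose deadline has passed, because the next ILM loop will enqueue a fresh retry anyway.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Queue;

public class ExpiringRetryQueue {
    // Hypothetical analogue of the cluster update timeout discussed in this PR.
    static final Duration MASTER_TIMEOUT = Duration.ofSeconds(30);

    record RetryTask(String policy, String index, Instant enqueuedAt) {
        boolean expired(Instant now) {
            return enqueuedAt.plus(MASTER_TIMEOUT).isBefore(now);
        }
    }

    public static void main(String[] args) {
        Queue<RetryTask> queue = new ArrayDeque<>();
        Instant now = Instant.now();
        // One stale retry (enqueued 60s ago) and one fresh retry.
        queue.add(new RetryTask("hot-warm-policy", "logs-000001", now.minusSeconds(60)));
        queue.add(new RetryTask("hot-warm-policy", "logs-000002", now));

        // A struggling "master" drains the queue, skipping expired tasks:
        // the next ILM loop run will re-enqueue them anyway.
        int executed = 0;
        while (!queue.isEmpty()) {
            RetryTask task = queue.poll();
            if (task.expired(now)) {
                System.out.println("dropped expired retry for " + task.index());
                continue;
            }
            executed++;
        }
        System.out.println("executed=" + executed); // executed=1
    }
}
```

This mirrors the PR's reasoning: dropping an expired retry is safe because the policy stays in the ERROR step until a retry succeeds.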
Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

@elasticmachine update branch

@elasticmachine update branch
```java
@Override
public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
    if (oldState.equals(newState) == false) {
        IndexMetadata newIndexMeta = newState.metadata().index(index);
        Step indexMetaCurrentStep = getCurrentStep(stepRegistry, policy, newIndexMeta);
        StepKey stepKey = indexMetaCurrentStep.getKey();
        if (stepKey != null && stepKey != TerminalPolicyStep.KEY && newIndexMeta != null) {
            logger.trace("policy [{}] for index [{}] was moved back on the failed step as part of an automatic " +
                "retry. Attempting to execute the failed step [{}] if it's an async action", policy, index, stepKey);
            maybeRunAsyncAction(newState, newIndexMeta, policy, stepKey);
        }
    }
}

@Override
public ClusterState execute(ClusterState currentState) {
    return IndexLifecycleTransition.moveClusterStateToPreviouslyFailedStep(currentState, index,
        nowSupplier, stepRegistry, true);
}
```
```java
@Override
public void onFailure(String source, Exception e) {
    logger.error(new ParameterizedMessage("retry execution of step [{}] for index [{}] failed",
        failedStep.getKey().getName(), index), e);
}
```
```java
@Override
public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
    if (oldState.equals(newState) == false) {
        IndexMetadata newIndexMeta = newState.metadata().index(index);
        Step indexMetaCurrentStep = getCurrentStep(stepRegistry, policy, newIndexMeta);
        StepKey stepKey = indexMetaCurrentStep.getKey();
        if (stepKey != null && stepKey != TerminalPolicyStep.KEY && newIndexMeta != null) {
            logger.trace("policy [{}] for index [{}] was moved back on the failed step as part of an automatic " +
                "retry. Attempting to execute the failed step [{}] if it's an async action", policy, index, stepKey);
            maybeRunAsyncAction(newState, newIndexMeta, policy, stepKey);
        }
    }
}
```
This is code formatting
@elasticmachine update branch
DaveCTurner left a comment:
I like the more descriptive task sources.
Can we also/instead have a mechanism that more explicitly prevents two of these retries being enqueued for the same (policy, index) pair at the same time? This timeout sorta does so as long as the poll interval is greater than 30 seconds, but I think it'd be useful to give a hard guarantee of this.
Relatedly, I think the timeout should be (a function of) something like the ILM poll interval rather than hard-coded at 30 seconds. We do sometimes need to deal with clusters that simply cannot process cluster state updates in a reasonable time by extending timeouts.
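One way to provide the hard guarantee DaveCTurner asks for would be an in-flight set keyed by the (policy, index) pair, so a second retry cannot be enqueued while one is pending. This is a standalone sketch of that idea using plain-Java collections; the `RetryDeduplicator` class and its method names are hypothetical, not part of this PR:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class RetryDeduplicator {
    // Tracks (policy, index) pairs that already have a retry in flight.
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

    /** Returns true if the caller should enqueue a retry, false if one is already pending. */
    public boolean tryAcquire(String policy, String index) {
        return inFlight.add(policy + "|" + index);
    }

    /** Called when the cluster state update task completes, fails, or times out. */
    public void release(String policy, String index) {
        inFlight.remove(policy + "|" + index);
    }

    public static void main(String[] args) {
        RetryDeduplicator dedup = new RetryDeduplicator();
        System.out.println(dedup.tryAcquire("hot-warm-policy", "logs-000001")); // true
        System.out.println(dedup.tryAcquire("hot-warm-policy", "logs-000001")); // false: already pending
        dedup.release("hot-warm-policy", "logs-000001");
        System.out.println(dedup.tryAcquire("hot-warm-policy", "logs-000001")); // true again
    }
}
```

Unlike the timeout, this approach does not depend on the relationship between the poll interval and the timeout value, which is the gap the review comment points out.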
@elasticmachine update branch
@DaveCTurner thanks for the suggestions and for bringing this up. We've discussed this issue and we'll go ahead with an internal (undocumented) setting that controls this timeout as a first step, and come back to it after we get a chance to review and figure out better heuristics for what the timeout, or maybe the entire approach, should be. We want to make sure we don't add to the "cluster is overwhelmed" problem a "rollover couldn't be executed so there's now also a full disk" problem because the cluster state updates timed out. Using a setting gives users a chance to address this based on the situation as they observe it.
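As a standalone sketch of what a setting-driven timeout with a default fallback looks like (the helper class, the setting key string, and the parsing below are hypothetical simplifications; the PR itself reads the value via `LifecycleSettings.LIFECYCLE_STEP_MASTER_TIMEOUT_SETTING`):

```java
import java.time.Duration;
import java.util.Map;

public class IlmTimeoutSetting {
    // Hypothetical setting key and default, standing in for the real internal setting.
    static final String KEY = "indices.lifecycle.step.master_timeout";
    static final Duration DEFAULT = Duration.ofSeconds(30);

    /** Reads the timeout from cluster settings, falling back to the default. */
    static Duration masterTimeout(Map<String, String> clusterSettings) {
        String raw = clusterSettings.get(KEY);
        if (raw == null) {
            return DEFAULT;
        }
        // Simplified parsing: accept values like "120s".
        return Duration.ofSeconds(Long.parseLong(raw.replace("s", "")));
    }

    public static void main(String[] args) {
        System.out.println(masterTimeout(Map.of()));            // PT30S (default)
        System.out.println(masterTimeout(Map.of(KEY, "120s"))); // PT2M
    }
}
```

An operator who observes timed-out updates on an overwhelmed cluster can then raise the value without a code change, which is the escape hatch the discussion above is after.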
```java
// we can afford to drop these requests if they timeout as on the next {@link
// IndexLifecycleRunner#runPeriodicStep} run the policy will still be in the ERROR step, as we haven't been able
// to move it back into the failed step, so we'll try again
return LifecycleSettings.LIFECYCLE_STEP_MASTER_TIMEOUT_SETTING.get(clusterService.state().metadata().settings());
```
I think we can use the setting we already created to manipulate the ILM-related master timeouts. What do you think, @dakrone?
@dakrone cool, thanks for confirming, this is ready for review then 🙏🏻
dakrone left a comment:
LGTM, thanks for adding this Andrei!
This commit adds a timeout when moving ILM back onto a failed step. In case the master is struggling to process the cluster update requests, these requests will expire (we'll send them again anyway on the next ILM loop run). ILM: more descriptive source messages for cluster updates. Use the configured ILM step master timeout setting. (cherry picked from commit ff6c5ed) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
This adds a timeout when moving ILM back onto a failed step. In case
the master is struggling to process the cluster state update requests,
these requests will expire (we'll send them again anyway on the next ILM
loop run).
This also adds more descriptive source messages for the cluster state update
tasks to aid debugging.