
ILM: Add cluster update timeout on step retry#54878

Merged
andreidan merged 10 commits into elastic:master from andreidan:ilm-cluster-updates
Apr 8, 2020

Conversation

@andreidan
Contributor

This adds a timeout when moving ILM back onto a failed step. If the
master is struggling to process the cluster update requests, they will
expire (we'll send them again anyway on the next ILM loop run).

This also adds more descriptive source messages for the cluster state update
tasks to aid debugging.
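[Editor's note] The safety argument above — a timed-out update is harmless because the index stays in the ERROR step and the next periodic ILM run re-submits the move — can be modeled outside Elasticsearch's API. This is an illustrative sketch, not code from the PR; the names `RetryQueueSketch`, `RetryTask`, and `masterTimeout` are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Queue;

/** Sketch: a master-style task queue that drops expired ILM retry tasks. */
class RetryQueueSketch {
    record RetryTask(String index, Instant enqueuedAt) {}

    private final Queue<RetryTask> pending = new ArrayDeque<>();
    private final Duration masterTimeout; // the PR makes this configurable

    RetryQueueSketch(Duration masterTimeout) {
        this.masterTimeout = masterTimeout;
    }

    void enqueue(String index, Instant now) {
        pending.add(new RetryTask(index, now));
    }

    /**
     * Process one task. Expired tasks are simply dropped: the index is still
     * in the ERROR step, so the next periodic ILM run will enqueue the move
     * to the failed step again.
     */
    String processNext(Instant now) {
        RetryTask task = pending.poll();
        if (task == null) {
            return "idle";
        }
        if (Duration.between(task.enqueuedAt(), now).compareTo(masterTimeout) > 0) {
            return "expired:" + task.index(); // dropped; retried on the next loop
        }
        return "moved-to-failed-step:" + task.index();
    }
}
```

Dropping rather than retrying in place is what makes the timeout cheap: an overloaded master sheds stale work without losing progress, because the periodic runner is the retry mechanism.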

This commit adds a timeout when moving ILM back onto a failed step. If the
master is struggling to process the cluster update requests, they will
expire (we'll send them again anyway on the next ILM loop run).
@andreidan andreidan added :Data Management/ILM+SLM DO NOT USE. Use ":StorageEngine/ILM" or ":Distributed Coordination/SLM" instead. v8.0.0 v7.6.3 v7.8.0 v7.7.1 labels Apr 7, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

@andreidan
Contributor Author

@elasticmachine update branch

@andreidan
Contributor Author

@elasticmachine update branch

Comment on lines -220 to +244

        @Override
        public ClusterState execute(ClusterState currentState) {
            return IndexLifecycleTransition.moveClusterStateToPreviouslyFailedStep(currentState, index,
                nowSupplier, stepRegistry, true);
        }

        @Override
        public void onFailure(String source, Exception e) {
            logger.error(new ParameterizedMessage("retry execution of step [{}] for index [{}] failed",
                failedStep.getKey().getName(), index), e);
        }

        @Override
        public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
            if (oldState.equals(newState) == false) {
                IndexMetadata newIndexMeta = newState.metadata().index(index);
                Step indexMetaCurrentStep = getCurrentStep(stepRegistry, policy, newIndexMeta);
                StepKey stepKey = indexMetaCurrentStep.getKey();
                if (stepKey != null && stepKey != TerminalPolicyStep.KEY && newIndexMeta != null) {
                    logger.trace("policy [{}] for index [{}] was moved back on the failed step as part of an automatic " +
                        "retry. Attempting to execute the failed step [{}] if it's an async action", policy, index, stepKey);
                    maybeRunAsyncAction(newState, newIndexMeta, policy, stepKey);
                }
            }
        }
Copy link
Copy Markdown
Contributor Author


This is code formatting

@andreidan andreidan requested review from DaveCTurner and dakrone April 7, 2020 13:48
@andreidan
Contributor Author

@elasticmachine update branch

Member

@DaveCTurner DaveCTurner left a comment


I like the more descriptive task sources.

Can we also/instead have a mechanism that more explicitly prevents two of these retries being enqueued for the same (policy, index) pair at the same time? This timeout sorta does so as long as the poll interval is greater than 30 seconds, but I think it'd be useful to give a hard guarantee of this.

Relatedly, I think the timeout should be (a function of) something like the ILM poll interval rather than hard-coded at 30 seconds. We do sometimes need to deal with clusters that simply cannot process cluster state updates in a reasonable time by extending timeouts.
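[Editor's note] The hard guarantee DaveCTurner asks for — at most one retry in flight per (policy, index) pair — could be sketched as below. This is a hypothetical illustration, not code from the PR; `RetryDeduplicator`, `tryAcquire`, and `release` are invented names.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch: allow at most one enqueued retry per (policy, index) pair. */
class RetryDeduplicator {
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

    /** Returns true if the caller won the right to enqueue a retry for this pair. */
    boolean tryAcquire(String policy, String index) {
        return inFlight.add(policy + "|" + index);
    }

    /** Called from the task's success, failure, and timeout callbacks. */
    void release(String policy, String index) {
        inFlight.remove(policy + "|" + index);
    }
}
```

The key design point is that the release must happen on every terminal outcome (processed, failed, or timed out); otherwise a leaked entry would block all future retries for that pair.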

@andreidan
Contributor Author

@elasticmachine update branch

@andreidan
Contributor Author

@DaveCTurner thanks for the suggestions and for bringing this up. We've discussed this issue and, as a first step, we'll go ahead with an internal (undocumented) setting that controls this timeout, and come back to it once we've had a chance to review and figure out better heuristics for what the timeout, or maybe the entire approach, should be. We want to make sure we don't add to the "cluster is overwhelmed" problem a "rollover couldn't be executed, so there's now also a full disk" problem because the cluster state updates timed out. Using a setting gives users a chance to address this based on the situations they observe.

// we can afford to drop these requests if they timeout as on the next {@link
// IndexLifecycleRunner#runPeriodicStep} run the policy will still be in the ERROR step, as we haven't been able
// to move it back into the failed step, so we'll try again
return LifecycleSettings.LIFECYCLE_STEP_MASTER_TIMEOUT_SETTING.get(clusterService.state().metadata().settings());
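[Editor's note] The snippet above reads the timeout from cluster settings rather than hard-coding 30 seconds. A minimal model of that lookup, without Elasticsearch's `Setting` infrastructure, might look like this; the key string and the 30-second default are assumptions modeled on the `LIFECYCLE_STEP_MASTER_TIMEOUT_SETTING` referenced in the snippet, not verified values.

```java
import java.time.Duration;
import java.util.Map;

/** Sketch: resolve the ILM step master timeout from settings, with a default. */
class StepMasterTimeout {
    // Hypothetical key, modeled on the setting discussed in this thread.
    static final String KEY = "indices.lifecycle.step.master_timeout";
    static final Duration DEFAULT = Duration.ofSeconds(30);

    /** Returns the configured timeout in seconds, or the default if unset. */
    static Duration resolve(Map<String, String> clusterSettings) {
        String raw = clusterSettings.get(KEY);
        return raw == null ? DEFAULT : Duration.ofSeconds(Long.parseLong(raw));
    }
}
```

Reusing one setting for all ILM master timeouts keeps the knob count low, which matches the thread's goal of an internal escape hatch rather than a documented tuning surface.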
Contributor Author


I think we can use the setting we already created to manipulate the ILM related master timeouts. What do you think @dakrone ?

Member


That sounds reasonable to me 👍

Contributor Author


@dakrone cool, thanks for confirming, this is ready for review then 🙏🏻

Member

@dakrone dakrone left a comment


LGTM, thanks for adding this Andrei!

@andreidan andreidan merged commit ff6c5ed into elastic:master Apr 8, 2020
andreidan added a commit to andreidan/elasticsearch that referenced this pull request Apr 9, 2020
This commit adds a timeout when moving ILM back onto a failed step. If the
master is struggling to process the cluster update requests, they will
expire (we'll send them again anyway on the next ILM loop run).

ILM more descriptive source messages for cluster updates

Use the configured ILM step master timeout setting

(cherry picked from commit ff6c5ed)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
andreidan added a commit to andreidan/elasticsearch that referenced this pull request Apr 9, 2020
This commit adds a timeout when moving ILM back onto a failed step. If the
master is struggling to process the cluster update requests, they will
expire (we'll send them again anyway on the next ILM loop run).

ILM more descriptive source messages for cluster updates

Use the configured ILM step master timeout setting

(cherry picked from commit ff6c5ed)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
andreidan added a commit to andreidan/elasticsearch that referenced this pull request Apr 9, 2020
This commit adds a timeout when moving ILM back onto a failed step. If the
master is struggling to process the cluster update requests, they will
expire (we'll send them again anyway on the next ILM loop run).

ILM more descriptive source messages for cluster updates

Use the configured ILM step master timeout setting

(cherry picked from commit ff6c5ed)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
@andreidan andreidan removed the v7.6.3 label Apr 9, 2020
andreidan added a commit that referenced this pull request Apr 11, 2020
* ILM add cluster update timeout on step retry (#54878)

This commit adds a timeout when moving ILM back onto a failed step. If the
master is struggling to process the cluster update requests, they will
expire (we'll send them again anyway on the next ILM loop run).

ILM more descriptive source messages for cluster updates

Use the configured ILM step master timeout setting

(cherry picked from commit ff6c5ed)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
andreidan added a commit that referenced this pull request Apr 11, 2020
This commit adds a timeout when moving ILM back onto a failed step. If the
master is struggling to process the cluster update requests, they will
expire (we'll send them again anyway on the next ILM loop run).

ILM more descriptive source messages for cluster updates

Use the configured ILM step master timeout setting

(cherry picked from commit ff6c5ed)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
@williamrandolph williamrandolph changed the title ILM add cluster update timeout on step retry Add cluster update timeout on ILM step retry Jun 1, 2020
@williamrandolph williamrandolph changed the title Add cluster update timeout on ILM step retry ILM: Add cluster update timeout on step retry Jun 1, 2020

Labels

:Data Management/ILM+SLM DO NOT USE. Use ":StorageEngine/ILM" or ":Distributed Coordination/SLM" instead. >enhancement v7.7.1 v7.8.0 v8.0.0-alpha1

6 participants