
ILM: Add cluster update timeout on step retry#54878

Merged
andreidan merged 10 commits into elastic:master from andreidan:ilm-cluster-updates
Apr 8, 2020

Conversation

@andreidan
Contributor

This adds a timeout when moving ILM back onto a failed step. If the
master is struggling to process the cluster update requests, they will
expire (we'll send them again anyway on the next ILM loop run).

This also adds more descriptive source messages for the cluster state update
tasks to aid debugging.
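[Editor's note] The safety argument above — a timed-out update is harmless because the index stays in the ERROR step and the next periodic ILM run re-submits the move — can be modeled outside Elasticsearch's API. This is an illustrative sketch, not code from the PR; the names `RetryQueueSketch`, `RetryTask`, and `masterTimeout` are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Queue;

/** Sketch: a master-style task queue that drops expired ILM retry tasks. */
class RetryQueueSketch {
    record RetryTask(String index, Instant enqueuedAt) {}

    private final Queue<RetryTask> pending = new ArrayDeque<>();
    private final Duration masterTimeout; // the PR makes this configurable

    RetryQueueSketch(Duration masterTimeout) {
        this.masterTimeout = masterTimeout;
    }

    void enqueue(String index, Instant now) {
        pending.add(new RetryTask(index, now));
    }

    /**
     * Process one task. Expired tasks are simply dropped: the index is still
     * in the ERROR step, so the next periodic ILM run will enqueue the move
     * to the failed step again.
     */
    String processNext(Instant now) {
        RetryTask task = pending.poll();
        if (task == null) {
            return "idle";
        }
        if (Duration.between(task.enqueuedAt(), now).compareTo(masterTimeout) > 0) {
            return "expired:" + task.index(); // dropped; retried on the next loop
        }
        return "moved-to-failed-step:" + task.index();
    }
}
```

Dropping rather than retrying in place is what makes the timeout cheap: an overloaded master sheds stale work without losing progress, because the periodic runner is the retry mechanism.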

This commit adds a timeout when moving ILM back onto a failed step. If the
master is struggling to process the cluster update requests, they will
expire (we'll send them again anyway on the next ILM loop run).
@andreidan andreidan added :Data Management/ILM+SLM DO NOT USE. Use ":StorageEngine/ILM" or ":Distributed Coordination/SLM" instead. v8.0.0 v7.6.3 v7.8.0 v7.7.1 labels Apr 7, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

@andreidan
Contributor Author

@elasticmachine update branch

@andreidan
Contributor Author

@elasticmachine update branch

Comment on lines -220 to +244

        @Override
        public ClusterState execute(ClusterState currentState) {
            return IndexLifecycleTransition.moveClusterStateToPreviouslyFailedStep(currentState, index,
                nowSupplier, stepRegistry, true);
        }

        @Override
        public void onFailure(String source, Exception e) {
            logger.error(new ParameterizedMessage("retry execution of step [{}] for index [{}] failed",
                failedStep.getKey().getName(), index), e);
        }

        @Override
        public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
            if (oldState.equals(newState) == false) {
                IndexMetadata newIndexMeta = newState.metadata().index(index);
                Step indexMetaCurrentStep = getCurrentStep(stepRegistry, policy, newIndexMeta);
                StepKey stepKey = indexMetaCurrentStep.getKey();
                if (stepKey != null && stepKey != TerminalPolicyStep.KEY && newIndexMeta != null) {
                    logger.trace("policy [{}] for index [{}] was moved back on the failed step as part of an automatic " +
                        "retry. Attempting to execute the failed step [{}] if it's an async action", policy, index, stepKey);
                    maybeRunAsyncAction(newState, newIndexMeta, policy, stepKey);
                }
            }
        }
Copy link
Copy Markdown
Contributor Author


This is code formatting

@andreidan andreidan requested review from DaveCTurner and dakrone April 7, 2020 13:48
@andreidan
Contributor Author

@elasticmachine update branch

Member

@DaveCTurner DaveCTurner left a comment


I like the more descriptive task sources.

Can we also/instead have a mechanism that more explicitly prevents two of these retries being enqueued for the same (policy, index) pair at the same time? This timeout sorta does so as long as the poll interval is greater than 30 seconds, but I think it'd be useful to give a hard guarantee of this.

Relatedly, I think the timeout should be (a function of) something like the ILM poll interval rather than hard-coded at 30 seconds. We do sometimes need to deal with clusters that simply cannot process cluster state updates in a reasonable time by extending timeouts.
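[Editor's note] The hard guarantee DaveCTurner asks for — at most one retry in flight per (policy, index) pair — could be sketched as below. This is a hypothetical illustration, not code from the PR; `RetryDeduplicator`, `tryAcquire`, and `release` are invented names.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch: allow at most one enqueued retry per (policy, index) pair. */
class RetryDeduplicator {
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

    /** Returns true if the caller won the right to enqueue a retry for this pair. */
    boolean tryAcquire(String policy, String index) {
        return inFlight.add(policy + "|" + index);
    }

    /** Called from the task's success, failure, and timeout callbacks. */
    void release(String policy, String index) {
        inFlight.remove(policy + "|" + index);
    }
}
```

The key design point is that the release must happen on every terminal outcome (processed, failed, or timed out); otherwise a leaked entry would block all future retries for that pair.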

@andreidan
Contributor Author

@elasticmachine update branch

@andreidan
Contributor Author

@DaveCTurner thanks for the suggestions and for bringing this up. We've discussed this issue and, as a first step, we'll go ahead with an internal (undocumented) setting that controls this timeout, and come back to it once we've had a chance to review and figure out better heuristics for what the timeout, or maybe the entire approach, should be. We want to make sure we don't add to the "cluster is overwhelmed" problem a "rollover couldn't be executed, so there's now also a full disk" problem because the cluster state updates timed out. Using a setting gives users a chance to address this based on the situations they observe.

// we can afford to drop these requests if they timeout as on the next {@link
// IndexLifecycleRunner#runPeriodicStep} run the policy will still be in the ERROR step, as we haven't been able
// to move it back into the failed step, so we'll try again
return LifecycleSettings.LIFECYCLE_STEP_MASTER_TIMEOUT_SETTING.get(clusterService.state().metadata().settings());
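[Editor's note] The snippet above reads the timeout from cluster settings rather than hard-coding 30 seconds. A minimal model of that lookup, without Elasticsearch's `Setting` infrastructure, might look like this; the key string and the 30-second default are assumptions modeled on the `LIFECYCLE_STEP_MASTER_TIMEOUT_SETTING` referenced in the snippet, not verified values.

```java
import java.time.Duration;
import java.util.Map;

/** Sketch: resolve the ILM step master timeout from settings, with a default. */
class StepMasterTimeout {
    // Hypothetical key, modeled on the setting discussed in this thread.
    static final String KEY = "indices.lifecycle.step.master_timeout";
    static final Duration DEFAULT = Duration.ofSeconds(30);

    /** Returns the configured timeout in seconds, or the default if unset. */
    static Duration resolve(Map<String, String> clusterSettings) {
        String raw = clusterSettings.get(KEY);
        return raw == null ? DEFAULT : Duration.ofSeconds(Long.parseLong(raw));
    }
}
```

Reusing one setting for all ILM master timeouts keeps the knob count low, which matches the thread's goal of an internal escape hatch rather than a documented tuning surface.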
Contributor Author


I think we can use the setting we already created to manipulate the ILM related master timeouts. What do you think @dakrone ?

Member


That sounds reasonable to me 👍

Contributor Author


@dakrone cool, thanks for confirming, this is ready for review then 🙏🏻

Member

@dakrone dakrone left a comment


LGTM, thanks for adding this Andrei!

@andreidan andreidan merged commit ff6c5ed into elastic:master Apr 8, 2020
andreidan added a commit to andreidan/elasticsearch that referenced this pull request Apr 9, 2020
This commit adds a timeout when moving ILM back onto a failed step. If the
master is struggling to process the cluster update requests, they will
expire (we'll send them again anyway on the next ILM loop run).

ILM more descriptive source messages for cluster updates

Use the configured ILM step master timeout setting

(cherry picked from commit ff6c5ed)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
andreidan added a commit to andreidan/elasticsearch that referenced this pull request Apr 9, 2020
This commit adds a timeout when moving ILM back onto a failed step. If the
master is struggling to process the cluster update requests, they will
expire (we'll send them again anyway on the next ILM loop run).

ILM more descriptive source messages for cluster updates

Use the configured ILM step master timeout setting

(cherry picked from commit ff6c5ed)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
andreidan added a commit to andreidan/elasticsearch that referenced this pull request Apr 9, 2020
This commit adds a timeout when moving ILM back onto a failed step. If the
master is struggling to process the cluster update requests, they will
expire (we'll send them again anyway on the next ILM loop run).

ILM more descriptive source messages for cluster updates

Use the configured ILM step master timeout setting

(cherry picked from commit ff6c5ed)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
@andreidan andreidan removed the v7.6.3 label Apr 9, 2020
andreidan added a commit that referenced this pull request Apr 11, 2020
* ILM add cluster update timeout on step retry (#54878)

This commit adds a timeout when moving ILM back onto a failed step. If the
master is struggling to process the cluster update requests, they will
expire (we'll send them again anyway on the next ILM loop run).

ILM more descriptive source messages for cluster updates

Use the configured ILM step master timeout setting

(cherry picked from commit ff6c5ed)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
andreidan added a commit that referenced this pull request Apr 11, 2020
This commit adds a timeout when moving ILM back onto a failed step. If the
master is struggling to process the cluster update requests, they will
expire (we'll send them again anyway on the next ILM loop run).

ILM more descriptive source messages for cluster updates

Use the configured ILM step master timeout setting

(cherry picked from commit ff6c5ed)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
@williamrandolph williamrandolph changed the title ILM add cluster update timeout on step retry Add cluster update timeout on ILM step retry Jun 1, 2020
@williamrandolph williamrandolph changed the title Add cluster update timeout on ILM step retry ILM: Add cluster update timeout on step retry Jun 1, 2020

Labels

:Data Management/ILM+SLM DO NOT USE. Use ":StorageEngine/ILM" or ":Distributed Coordination/SLM" instead. >enhancement v7.7.1 v7.8.0 v8.0.0-alpha1

6 participants