Retry `downsample` ILM action using a new target index by csoulios · Pull Request #94965 · elastic/elasticsearch

csoulios · 2023-03-31T18:56:29Z

Currently, when the ILM downsample step is retried, the same target index is reused. As a result, the subsequent downsample API invocation can index rolled-up data into shards of a target index that already exists, while the previous downsample API invocation is still partially running (and also rolling up data into the same target shard).

Note that the downsample step may fail when a cluster is restarted in a rolling manner (for example, for an upgrade) or when the elected master node fails (the downsample action is coordinated from the elected master node).

This PR modifies the ILM DownsampleAction so that when DownsampleStep fails, it retries by performing the following steps:

1. Clean up the existing target index.
2. Generate a new index name for the target index.
3. Downsample using the new target index name.
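The retry flow described here can be illustrated with a small, self-contained sketch. It is not the code from this PR: `DownsampleRetrySketch`, its methods, the `downsample-<source>-<n>` naming scheme, and the `maxAttempts` bound are all hypothetical, and the `attempt` predicate merely stands in for the downsample API call. The point is only that each retry cleans up the old target and then works against a freshly named one, so two attempts never write into the same target shards.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names are Elasticsearch APIs.
class DownsampleRetrySketch {
    /** Indices that currently exist in this toy "cluster". */
    final Set<String> indices = new HashSet<>();
    /** Every target name generated so far, in order. */
    final List<String> generatedTargets = new ArrayList<>();
    private int counter = 0;

    /** Step 1: delete the target left behind by a failed attempt, if any. */
    void cleanupTarget(String target) {
        if (target != null) {
            indices.remove(target);
        }
    }

    /** Step 2: generate a fresh target name (assumed naming scheme). */
    String generateTargetName(String source) {
        String name = "downsample-" + source + "-" + (counter++);
        generatedTargets.add(name);
        return name;
    }

    /**
     * Step 3: run the attempt against the new target; on failure, loop back
     * to cleanup. The attempt predicate returns true on success. The
     * maxAttempts bound is an assumption that keeps the sketch finite.
     */
    String run(String source, Predicate<String> attempt, int maxAttempts) {
        String target = null;
        for (int i = 0; i < maxAttempts; i++) {
            cleanupTarget(target);
            target = generateTargetName(source);
            indices.add(target); // the attempt creates the target index
            if (attempt.test(target)) {
                return target;
            }
        }
        cleanupTarget(target);
        throw new IllegalStateException("downsample failed after " + maxAttempts + " attempts");
    }
}
```

In this toy model the cleanup always succeeds, so no stale indices remain; a real cluster has no such guarantee.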

Note 1: This change may leave behind some garbage indices that we must find another way to clean up. However, the downsample process will become more resilient.

Note 2: A similar approach is used by the searchable_snapshot ILM action.

Closes #93580

csoulios · 2023-03-31T18:59:11Z

server/src/main/java/org/elasticsearch/cluster/metadata/LifecycleExecutionState.java

            .setSnapshotName(state.snapshotName)
            .setShrinkIndexName(state.shrinkIndexName)
            .setSnapshotIndexName(state.snapshotIndexName)
-            .setRollupIndexName(state.downsampleIndexName)


I took the opportunity to rename all instances of rollup to downsample.

I know this adds a bit of noise to this PR, but I could not resist doing it while I was in the area.

Thanks for doing this rename Christos (it's been quite confusing before)

csoulios · 2023-03-31T19:10:43Z

...multi-node/src/javaRestTest/java/org/elasticsearch/xpack/ilm/actions/DownsampleActionIT.java

            alias,
            policy
        );
+        updateClusterSettings(client(), Settings.builder().put("indices.lifecycle.poll_interval", "5s").build());


TODO: Remove this

I only had it for my testing

But it can be removed now?

Removed in b669ad9

elasticsearchmachine · 2023-03-31T20:12:37Z

Hi @csoulios, I've created a changelog YAML for you.

martijnvg

I think the direction of this PR looks good.

martijnvg · 2023-04-05T06:07:46Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/DownsampleStep.java

                );
-                // Rollup index has already been created with the generated name but its status is not "success".
-                // So we delete the index and proceed with executing the rollup step.
-                DeleteIndexRequest deleteRequest = new DeleteIndexRequest(downsampleIndexName);


This is removed because cleanupDownsampleIndexKey will be invoked in case of failure?

Yes, the step will fail and ILM will go to the cleanup step. So no need to delete the index here

martijnvg · 2023-04-05T06:08:25Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/DownsampleStep.java

    private final DateHistogramInterval fixedInterval;
+    private final StepKey nextStepOnSuccess;
+    private final StepKey nextStepOnFailure;
+    private boolean downsampleFailed;


Maybe also use SetOnce here like in CreateSnapshotStep?

I wanted to avoid having to set it at every successful outcome. So the boolean value is initially set to false, and on failure I set it to true. This way, ILM will by default proceed to the next step.

But I do think we need to make this field volatile, since it is being accessed from different threads (line 123?).

Done in c097cad
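As a rough illustration of the design discussed in this thread, the sketch below shows a step that defaults to the success key and switches to the failure key once a failure is recorded. The field names mirror the diff hunk above; `DownsampleStepSketch`, the simplified `StepKey`, `onFailure()`, and `getNextStepKey()` are hypothetical stand-ins, not the actual Elasticsearch classes or signatures. The flag is `volatile` because, as noted in the review, it is written on one thread (the one completing the asynchronous call) and read on another.

```java
// Hypothetical sketch: StepKey and the step class are simplified stand-ins.
class DownsampleStepSketch {
    static final class StepKey {
        final String name;

        StepKey(String name) {
            this.name = name;
        }
    }

    private final StepKey nextStepOnSuccess;
    private final StepKey nextStepOnFailure;
    // volatile so that a write from the failure callback's thread is visible
    // to whichever thread later asks for the next step key.
    private volatile boolean downsampleFailed = false;

    DownsampleStepSketch(StepKey nextStepOnSuccess, StepKey nextStepOnFailure) {
        this.nextStepOnSuccess = nextStepOnSuccess;
        this.nextStepOnFailure = nextStepOnFailure;
    }

    /** Called only when the downsample attempt fails; success needs no call. */
    void onFailure() {
        downsampleFailed = true;
    }

    /** Defaults to the success key unless a failure was recorded. */
    StepKey getNextStepKey() {
        return downsampleFailed ? nextStepOnFailure : nextStepOnSuccess;
    }
}
```

A plain boolean that starts as false means no code path has to mark success explicitly, which is the trade-off against `SetOnce` raised above.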

martijnvg

Thanks for working on this @csoulios!
I left a few comments, otherwise looks good.

martijnvg · 2023-05-01T13:22:31Z

...multi-node/src/javaRestTest/java/org/elasticsearch/xpack/ilm/actions/DownsampleActionIT.java

            alias,
            policy
        );
+        updateClusterSettings(client(), Settings.builder().put("indices.lifecycle.poll_interval", "5s").build());


But it can be removed now?

martijnvg · 2023-05-01T13:27:02Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/DownsampleStep.java

    private final DateHistogramInterval fixedInterval;
+    private final StepKey nextStepOnSuccess;
+    private final StepKey nextStepOnFailure;
+    private boolean downsampleFailed;


But I do think we need to make this field volatile, since it is being accessed from different threads (line 123?).

martijnvg · 2023-05-01T13:47:57Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/DownsampleStep.java

 import org.apache.logging.log4j.LogManager;
 import org.apache.logging.log4j.Logger;
 import org.elasticsearch.action.ActionListener;
-import org.elasticsearch.action.admin.indices.delete.DeleteIndexRequest;


I think we need a testNextStepKey() similar to the one in CreateSnapshotStepTests too.

Done in 65b671f

csoulios · 2023-05-12T13:10:41Z

@elasticsearchmachine generate changelog

elasticsearchmachine · 2023-05-12T13:12:32Z

Pinging @elastic/es-analytics-geo (Team:Analytics)

elasticsearchmachine · 2023-05-12T13:12:33Z

Pinging @elastic/es-data-management (Team:Data Management)

martijnvg

LGTM 👍

elasticsearchmachine · 2023-05-15T08:12:29Z

💚 Backport successful

Status	Branch	Result
✅	8.7
✅	8.8

csoulios added 4 commits March 30, 2023 13:01

Rewind Downsample ILM on failure

43c0270

Merge branch 'main' into fix-ds-ilm2

88f5691

Change rollup to downsample

c0d3a79

Rewind ILM steps on downsample failure

528e678

csoulios added the :Data Management/ILM+SLM, :StorageEngine/Rollup, v8.7.1, and v8.8.0 labels Mar 31, 2023

csoulios changed the title ~~Fix ds ilm2~~ Make downsample ILM action retry using a new target index name Mar 31, 2023

Merge branch 'main' into fix-ds-ilm2

971dce1

csoulios added the >bug label Mar 31, 2023

csoulios added 2 commits March 31, 2023 23:12

Update docs/changelog/94965.yaml

e7b6cc2

Added changelog

5b2bd2b

csoulios changed the title ~~Make downsample ILM action retry using a new target index name~~ Retry downsample ILM action using a new target index name Mar 31, 2023

csoulios changed the title ~~Retry downsample ILM action using a new target index name~~ Retry downsample ILM action using a new target index Mar 31, 2023

csoulios added 4 commits April 3, 2023 10:52

Merge branch 'main' into fix-ds-ilm2

b0014a7

Merge branch 'main' into fix-ds-ilm2

4e099de

Fix changelog

5e334d8

Merge branch 'main' into fix-ds-ilm2

318b04f

csoulios added the :StorageEngine/TSDB label Apr 4, 2023

martijnvg reviewed Apr 5, 2023

pquentin added v8.7.2 and removed v8.7.1 labels Apr 12, 2023

csoulios added 2 commits April 12, 2023 15:19

Merge branch 'main' into fix-ds-ilm2

11256e0

Merge branch 'main' into fix-ds-ilm2

82b31a1

gmarouli added v8.9.0 and removed v8.8.0 labels Apr 26, 2023

martijnvg reviewed May 1, 2023

csoulios added 2 commits May 12, 2023 13:15

Merge branch 'main' into fix-ds-ilm2

0c31bee

Make variable volatile

c097cad

csoulios added v8.8.1 auto-backport-and-merge >bug and removed >bug labels May 12, 2023

Added test

65b671f

csoulios marked this pull request as ready for review May 12, 2023 13:12

elasticsearchmachine added the Team:Data Management (obsolete) and Team:Analytics labels May 12, 2023

csoulios requested a review from martijnvg May 12, 2023 13:12

csoulios added 2 commits May 12, 2023 16:16

Remove updateClusterSettings()

b669ad9

remove throws exception

c74b928

martijnvg approved these changes May 15, 2023

csoulios merged commit e9cfd81 into elastic:main May 15, 2023

csoulios deleted the fix-ds-ilm2 branch May 15, 2023 08:10

This was referenced May 15, 2023

[8.7] Retry downsample ILM action using a new target index (#94965) #96093

Merged

[8.8] Retry downsample ILM action using a new target index (#94965) #96094

Merged

gmarouli added v8.8.0 and removed v8.8.1 labels May 17, 2023
