Retry downsample ILM action using a new target index #94965
csoulios merged 18 commits into elastic:main
| .setSnapshotName(state.snapshotName) | ||
| .setShrinkIndexName(state.shrinkIndexName) | ||
| .setSnapshotIndexName(state.snapshotIndexName) | ||
| .setRollupIndexName(state.downsampleIndexName) |
I took the opportunity and renamed all instances of rollup to downsample.
I know this adds a bit of noise to this PR, but I could not resist doing it while I was in the area.
Thanks for doing this rename, Christos (it was quite confusing before).
| alias, | ||
| policy | ||
| ); | ||
| updateClusterSettings(client(), Settings.builder().put("indices.lifecycle.poll_interval", "5s").build()); |
TODO: Remove this
I only had it for my testing
Hi @csoulios, I've created a changelog YAML for you.
martijnvg left a comment
I think the direction in this PR looks good.
| ); | ||
| // Rollup index has already been created with the generated name but its status is not "success". | ||
| // So we delete the index and proceed with executing the rollup step. | ||
| DeleteIndexRequest deleteRequest = new DeleteIndexRequest(downsampleIndexName); |
This is removed because cleanupDownsampleIndexKey will be invoked in case of failure?
Yes, the step will fail and ILM will go to the cleanup step, so there is no need to delete the index here.
| private final DateHistogramInterval fixedInterval; | ||
| private final StepKey nextStepOnSuccess; | ||
| private final StepKey nextStepOnFailure; | ||
| private boolean downsampleFailed; |
Maybe also use SetOnce here like in CreateSnapshotStep?
I wanted to avoid having to set it at every successful outcome. So the boolean is initially false, and I set it to true on failure. This way, ILM proceeds to the next step by default.
But I do think we need to make this field volatile, since it is accessed from different threads (line 123?).
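To illustrate the point about the flag, here is a minimal sketch, not the code from this PR. `DownsampleStepSketch` and its `String` step keys are hypothetical stand-ins for the real step and `StepKey`. It assumes the failure callback and the next-step lookup may run on different threads, which is why the flag is declared `volatile`.

```java
// Hypothetical stand-in for an ILM step; names and types are illustrative only.
final class DownsampleStepSketch {
    private final String nextStepOnSuccess;
    private final String nextStepOnFailure;

    // volatile: the write in onFailure() may happen on a different thread
    // than the read in getNextStepKey(), so the write must be visible.
    private volatile boolean downsampleFailed = false;

    DownsampleStepSketch(String nextStepOnSuccess, String nextStepOnFailure) {
        this.nextStepOnSuccess = nextStepOnSuccess;
        this.nextStepOnFailure = nextStepOnFailure;
    }

    // Called only when the downsample operation fails.
    void onFailure() {
        downsampleFailed = true;
    }

    // Defaults to the success key because the flag starts out false.
    String getNextStepKey() {
        return downsampleFailed ? nextStepOnFailure : nextStepOnSuccess;
    }
}
```

Because the flag is only ever flipped from false to true, no successful code path has to touch it.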
| import org.apache.logging.log4j.LogManager; | ||
| import org.apache.logging.log4j.Logger; | ||
| import org.elasticsearch.action.ActionListener; | ||
| import org.elasticsearch.action.admin.indices.delete.DeleteIndexRequest; |
I think we need a testNextStepKey() similar to the one in CreateSnapshotStepTests too.
@elasticsearchmachine generate changelog
Pinging @elastic/es-analytics-geo (Team:Analytics)
Pinging @elastic/es-data-management (Team:Data Management)
Currently, when the ILM downsample step is retried, the same target index is used. This can cause the subsequent downsample API invocation to index rolled-up data into shards of a target index that already exists, while the previous downsample API invocation is still partially running (and also rolling up data into the same target shards).

Note that the downsample step may fail when a cluster is being restarted in a rolling manner (for example, for an upgrade) or when the elected master node fails (the downsample action is coordinated from the elected master node).

This PR modifies the ILM DownsampleAction so that when DownsampleStep fails, it retries by performing the following steps:
1. Clean up the existing target index.
2. Generate a new index name for the target index.
3. Downsample using the new target index name.

Note 1: This change may leave some garbage indices that we must find another way to clean up. However, the downsample process will become more resilient.

Note 2: A similar approach is used by the searchable_snapshot ILM action.

Closes #93580
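The three retry steps described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the ILM implementation: `DownsampleRetrySketch`, the in-memory `existingIndices` set, the counter-based name format, and the `maxAttempts` loop are all hypothetical. Real ILM moves between steps through its state machine rather than looping, and cleanup there may not always succeed (see Note 1).

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical, in-memory model of the retry flow; not Elasticsearch code.
final class DownsampleRetrySketch {
    // Stand-in for the set of indices that exist in the cluster.
    final Set<String> existingIndices = new HashSet<>();
    private int attempt = 0;
    // Stand-in for the target index name kept in the ILM execution state.
    private String targetIndex;

    // 'downsample' is a hypothetical callback returning true on success.
    String run(String sourceIndex, Predicate<String> downsample, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            // 1. Clean up the target index left behind by a failed attempt.
            if (targetIndex != null) {
                existingIndices.remove(targetIndex);
            }
            // 2. Generate a new target index name (illustrative format only).
            targetIndex = "downsample-" + sourceIndex + "-" + (++attempt);
            // 3. Downsample into the new target; the call creates the index.
            existingIndices.add(targetIndex);
            if (downsample.test(targetIndex)) {
                return targetIndex;
            }
        }
        throw new IllegalStateException("downsample failed after " + maxAttempts + " attempts");
    }
}
```

Because every attempt writes to a fresh name, a previous invocation that is still partially running cannot write into the shards used by the retry.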