WaitForSnapshotStep verifies if the index belongs to the latest snapshot of that SLM policy#100911
Conversation
|
Pinging @elastic/es-data-management (Team:Data Management) |
|
Hi @gmarouli, I've created a changelog YAML for you. |
joegallo
left a comment
There was a problem hiding this comment.
I think the code mostly looks very good. I've added a couple of admittedly trivial comments.
x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/WaitForSnapshotStep.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/WaitForSnapshotStep.java
Outdated
Show resolved
Hide resolved
|
Thank you for review @joegallo, you keep me sharp 🤓 . |
|
Hi @gmarouli, I've updated the changelog YAML for you. |
|
Hi @gmarouli, I've updated the changelog YAML for you. |
|
About the backport labels, I think we should try to backport it, I do realise that backporting to |
|
Regarding backporting, I think it should be pretty straightforward (hopefully!), a lot of the ILM code has been pretty stable for a while. A tricky bit is that there's no generally available edit: or... looking at how it's used in |
|
@elasticmachine update branch |
💔 Backport failed
You can use sqren/backport to manually backport by running |
…pshot of that SLM policy (elastic#100911) The `WaitForSnapshotStep` used to check if the SLM policy has been executed after the index has entered the delete phase, but it did not check if the SLM policy included this index. The result of this is that if the user used an SLM policy that did not include this index, when the index would enter the `WaitForSnapshotStep`, it would wait for a snapshot to be taken, a snapshot that would not include the index, and then ILM would delete the index. See the exact reproduction path: elastic#57809 **Solution** This PR, after it finds a successful SLM run, it verifies if the snapshot taken by SLM contains this index. If not it throws an error, otherwise it proceeds. ILM explain will report: ``` "step_info": { "type": "illegal_state_exception", "reason": "the last successful snapshot of policy 'hourly-snapshots' does not include index '.ds-my-other-stream-2023.10.16-000001'" } ``` **Backwards compatibility concerns** In this PR, the `WaitForSnapshotStep` changed from `ClusterStateWaitStep` to `AsyncWaitStep`. We do not think this is gonna cause an issue. This was tested manually by the following steps: - Run a master node with the old version. - When ILM is executing `wait-for-snapshot`, we shutdown the node - We start the node again with the new version os ES - ES was able to pick up the step and continue with the new code. We believe that this covers bwc concerns. Fixes: elastic#57809
💚 All backports created successfully
Questions ?Please refer to the Backport tool documentation |
…pshot of that SLM policy (elastic#100911) The `WaitForSnapshotStep` used to check if the SLM policy has been executed after the index has entered the delete phase, but it did not check if the SLM policy included this index. The result of this is that if the user used an SLM policy that did not include this index, when the index would enter the `WaitForSnapshotStep`, it would wait for a snapshot to be taken, a snapshot that would not include the index, and then ILM would delete the index. See the exact reproduction path: elastic#57809 **Solution** This PR, after it finds a successful SLM run, it verifies if the snapshot taken by SLM contains this index. If not it throws an error, otherwise it proceeds. ILM explain will report: ``` "step_info": { "type": "illegal_state_exception", "reason": "the last successful snapshot of policy 'hourly-snapshots' does not include index '.ds-my-other-stream-2023.10.16-000001'" } ``` **Backwards compatibility concerns** In this PR, the `WaitForSnapshotStep` changed from `ClusterStateWaitStep` to `AsyncWaitStep`. We do not think this is gonna cause an issue. This was tested manually by the following steps: - Run a master node with the old version. - When ILM is executing `wait-for-snapshot`, we shutdown the node - We start the node again with the new version os ES - ES was able to pick up the step and continue with the new code. We believe that this covers bwc concerns. Fixes: elastic#57809 (cherry picked from commit 5697fcf) # Conflicts: # x-pack/plugin/core/src/test/java/org/elasticsearch/xpack/core/ilm/WaitForSnapshotStepTests.java
…pshot of that SLM policy (#100911) (#101027) The `WaitForSnapshotStep` used to check if the SLM policy has been executed after the index has entered the delete phase, but it did not check if the SLM policy included this index. The result of this is that if the user used an SLM policy that did not include this index, when the index would enter the `WaitForSnapshotStep`, it would wait for a snapshot to be taken, a snapshot that would not include the index, and then ILM would delete the index. See the exact reproduction path: #57809 **Solution** This PR, after it finds a successful SLM run, it verifies if the snapshot taken by SLM contains this index. If not it throws an error, otherwise it proceeds. ILM explain will report: ``` "step_info": { "type": "illegal_state_exception", "reason": "the last successful snapshot of policy 'hourly-snapshots' does not include index '.ds-my-other-stream-2023.10.16-000001'" } ``` **Backwards compatibility concerns** In this PR, the `WaitForSnapshotStep` changed from `ClusterStateWaitStep` to `AsyncWaitStep`. We do not think this is gonna cause an issue. This was tested manually by the following steps: - Run a master node with the old version. - When ILM is executing `wait-for-snapshot`, we shutdown the node - We start the node again with the new version os ES - ES was able to pick up the step and continue with the new code. We believe that this covers bwc concerns. Fixes: #57809
…est snapshot of that SLM policy (#100911) (#101030) * `WaitForSnapshotStep` verifies if the index belongs to the latest snapshot of that SLM policy (#100911) The `WaitForSnapshotStep` used to check if the SLM policy has been executed after the index has entered the delete phase, but it did not check if the SLM policy included this index. The result of this is that if the user used an SLM policy that did not include this index, when the index would enter the `WaitForSnapshotStep`, it would wait for a snapshot to be taken, a snapshot that would not include the index, and then ILM would delete the index. See the exact reproduction path: #57809 **Solution** This PR, after it finds a successful SLM run, it verifies if the snapshot taken by SLM contains this index. If not it throws an error, otherwise it proceeds. ILM explain will report: ``` "step_info": { "type": "illegal_state_exception", "reason": "the last successful snapshot of policy 'hourly-snapshots' does not include index '.ds-my-other-stream-2023.10.16-000001'" } ``` **Backwards compatibility concerns** In this PR, the `WaitForSnapshotStep` changed from `ClusterStateWaitStep` to `AsyncWaitStep`. We do not think this is gonna cause an issue. This was tested manually by the following steps: - Run a master node with the old version. - When ILM is executing `wait-for-snapshot`, we shutdown the node - We start the node again with the new version os ES - ES was able to pick up the step and continue with the new code. We believe that this covers bwc concerns. Fixes: #57809 (cherry picked from commit 5697fcf)
The
WaitForSnapshotStepused to check if the SLM policy has been executed after the index has entered the delete phase, but it did not check if the SLM policy included this index.The result of this is that if the user used an SLM policy that did not include this index, when the index would enter the
WaitForSnapshotStep, it would wait for a snapshot to be taken, a snapshot that would not include the index, and then ILM would delete the index.See the exact reproduction path: #57809
Solution
This PR, after it finds a successful SLM run, it verifies if the snapshot taken by SLM contains this index. If not it throws an error, otherwise it proceeds.
ILM explain will report:
Backwards compatibility concerns
In this PR, the
WaitForSnapshotStepchanged fromClusterStateWaitSteptoAsyncWaitStep. We do not think this is gonna cause an issue. This was tested manually by the following steps:wait-for-snapshot, we shutdown the nodeWe believe that this covers bwc concerns.
Fixes: #57809