`WaitForSnapshotStep` verifies if the index belongs to the latest snapshot of that SLM policy by gmarouli · Pull Request #100911 · elastic/elasticsearch

gmarouli · 2023-10-16T13:44:01Z

The WaitForSnapshotStep used to check only that the SLM policy had been executed after the index entered the delete phase; it did not check whether the SLM policy included this index.

As a result, if the user configured an SLM policy that did not include this index, the index would enter the WaitForSnapshotStep and wait for a snapshot to be taken (a snapshot that would not include the index), and then ILM would delete the index.

See the exact reproduction path: #57809

Solution
After this PR finds a successful SLM run, it verifies that the snapshot taken by SLM contains this index. If it does not, the step throws an error; otherwise it proceeds.

ILM explain will report:

"step_info": {
  "type": "illegal_state_exception",
  "reason": "the last successful snapshot of policy 'hourly-snapshots' does not include index '.ds-my-other-stream-2023.10.16-000001'"
}
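
For readers who want the decision in one place, here is a minimal Java sketch of the check described above. It is not the code from this PR: `SnapshotCheckSketch`, `Outcome`, `evaluate`, and the parameters `lastSuccessMillis`, `phaseTimeMillis`, and `snapshotIndices` are hypothetical stand-ins, and how the real step obtains the snapshot's index list is not shown in this description.

```java
import java.util.Set;

/**
 * Simplified, hypothetical model of the check. None of these names are
 * the real Elasticsearch classes or methods.
 */
class SnapshotCheckSketch {

    /** Result of evaluating the step once. */
    enum Outcome { KEEP_WAITING, PROCEED }

    /**
     * @param policy            name of the SLM policy the step waits for
     * @param index             name of the index ILM is about to delete
     * @param phaseTimeMillis   when the index entered the delete phase (assumed input)
     * @param lastSuccessMillis time of the policy's last successful run, or null if none (assumed input)
     * @param snapshotIndices   indices contained in that last successful snapshot (assumed input)
     */
    static Outcome evaluate(String policy, String index, long phaseTimeMillis,
                            Long lastSuccessMillis, Set<String> snapshotIndices) {
        // Original condition: wait until the policy has run after the phase started.
        if (lastSuccessMillis == null || lastSuccessMillis < phaseTimeMillis) {
            return Outcome.KEEP_WAITING;
        }
        // Added condition: the snapshot must actually contain this index.
        if (!snapshotIndices.contains(index)) {
            throw new IllegalStateException("the last successful snapshot of policy '" + policy
                + "' does not include index '" + index + "'");
        }
        return Outcome.PROCEED;
    }
}
```

In this sketch, a snapshot that ran in time but lacks the index produces an exception whose message has the same shape as the `reason` shown above, instead of letting the delete go ahead.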

Backwards compatibility concerns
In this PR, WaitForSnapshotStep changes from a ClusterStateWaitStep to an AsyncWaitStep. We do not expect this to cause an issue. We tested it manually with the following steps:

Run a master node with the old version.
While ILM is executing wait-for-snapshot, shut down the node.
Start the node again with the new version of ES.
ES was able to pick up the step and continue with the new code.

We believe that this covers bwc concerns.
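
One way to picture why the restart test can succeed is sketched below. It assumes that what survives the restart is the step's name (`wait-for-snapshot`) rather than its implementation class; that is an assumption made for this sketch, not something stated in this PR. `StepLookupSketch`, `registryFor`, and `resume` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model: steps are resolved by name, so the implementation can change between versions. */
class StepLookupSketch {

    interface Step {
        String describe();
    }

    /** Stand-in for the old, cluster-state-based implementation. */
    static final class OldClusterStateStyleStep implements Step {
        public String describe() { return "cluster-state wait"; }
    }

    /** Stand-in for the new, asynchronous implementation. */
    static final class NewAsyncStyleStep implements Step {
        public String describe() { return "async wait"; }
    }

    /** Builds the name-to-implementation table for one (hypothetical) node version. */
    static Map<String, Step> registryFor(boolean newVersion) {
        Map<String, Step> steps = new HashMap<>();
        steps.put("wait-for-snapshot", newVersion ? new NewAsyncStyleStep() : new OldClusterStateStyleStep());
        return steps;
    }

    /** Resolves the step an index was on, using only the persisted step name. */
    static Step resume(String persistedStepName, Map<String, Step> registry) {
        Step step = registry.get(persistedStepName);
        if (step == null) {
            throw new IllegalStateException("unknown step [" + persistedStepName + "]");
        }
        return step;
    }
}
```

Under that assumption, an index that was parked on `wait-for-snapshot` before the shutdown resolves to the new implementation after the upgrade, which matches what the manual test observed.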

Fixes: #57809

elasticsearchmachine · 2023-10-16T13:44:26Z

Pinging @elastic/es-data-management (Team:Data Management)

elasticsearchmachine · 2023-10-16T13:44:49Z

Hi @gmarouli, I've created a changelog YAML for you.

joegallo

I think the code mostly looks very good. I've added a couple of admittedly trivial comments.

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/WaitForSnapshotStep.java

gmarouli · 2023-10-16T18:49:15Z

Thank you for the review @joegallo, you keep me sharp 🤓 .

elasticsearchmachine · 2023-10-16T18:52:43Z

Hi @gmarouli, I've updated the changelog YAML for you.

elasticsearchmachine · 2023-10-17T06:05:19Z

Hi @gmarouli, I've updated the changelog YAML for you.

gmarouli · 2023-10-17T06:05:55Z

About the backport labels: I think we should try to backport it. I do realise that backporting to 7.17.x might be tricky, which is why I would like to timebox it and at least give it a try to see if it's worth it.

joegallo · 2023-10-17T14:29:10Z

Regarding backporting, I think it should be pretty straightforward (hopefully!); a lot of the ILM code has been pretty stable for a while. A tricky bit is that there's no generally available EmptyInfo on 7.17; I think that was introduced pretty recently via #100179. Maybe it's worth pre-gaming a >non-issue PR that only introduces the EmptyInfo like #100179 did?

edit: or... looking at how it's used in AsyncWaitStep, you should probably just use null there instead for the backport...
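
To make the `null` suggestion concrete, here is a small Java sketch of a consumer that treats a missing info object the same as an explicit "empty" one. The real `AsyncWaitStep` listener signature is not shown in this thread, so `Info`, `EmptyInfoLike`, `Listener`, and `stepInfo` are all hypothetical stand-ins.

```java
/** Hypothetical model of a step listener whose info argument may be absent. */
class ListenerInfoSketch {

    /** Extra detail a step may report alongside its result. */
    interface Info {
        String render();
    }

    /** "Nothing to report" singleton, in the spirit of the EmptyInfo mentioned above. */
    enum EmptyInfoLike implements Info {
        INSTANCE;

        public String render() { return ""; }
    }

    /** Callback shape assumed for this sketch only. */
    interface Listener {
        void onResponse(boolean conditionMet, Info info);
    }

    /** Renders the reported info, treating null as "nothing to report". */
    static String stepInfo(Info info) {
        return info == null ? "" : info.render();
    }

    public static void main(String[] args) {
        Listener listener = (met, info) ->
            System.out.println("met=" + met + " info='" + stepInfo(info) + "'");
        listener.onResponse(true, EmptyInfoLike.INSTANCE); // branch that has an empty-info type
        listener.onResponse(true, null);                   // branch that lacks one
    }
}
```

If the consumer is null-tolerant like `stepInfo` here, a backport can pass `null` without first introducing a new type on the older branch.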

joegallo

🚀

gmarouli · 2023-10-18T06:05:05Z

@elasticmachine update branch

elasticsearchmachine · 2023-10-18T07:01:33Z

💔 Backport failed

Status	Branch	Result
✅	8.11
❌	7.17	Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 100911

…pshot of that SLM policy (elastic#100911)

gmarouli · 2023-10-18T07:35:41Z

💚 All backports created successfully

Status	Branch	Result
✅	7.17

…pshot of that SLM policy (elastic#100911) (cherry picked from commit 5697fcf) # Conflicts: # x-pack/plugin/core/src/test/java/org/elasticsearch/xpack/core/ilm/WaitForSnapshotStepTests.java

…pshot of that SLM policy (#100911) (#101027)

…est snapshot of that SLM policy (#100911) (#101030) (cherry picked from commit 5697fcf)

Check if the index is part of the snapshot before deleting

ec55c3e

gmarouli added >enhancement :Data Management/ILM+SLM labels Oct 16, 2023

elasticsearchmachine added v8.12.0 Team:Data Management labels Oct 16, 2023

gmarouli mentioned this pull request Oct 16, 2023

Check if the index belongs to the SLM policy before deleting #100816

Closed

Update docs/changelog/100911.yaml

4f8642a

joegallo reviewed Oct 16, 2023

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/WaitForSnapshotStep.java Outdated

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/WaitForSnapshotStep.java Outdated

gmarouli added 2 commits October 16, 2023 21:30

Merge branch 'main' into fix-ilm-delete-snapshot-wait-v2

ca3482b

Review improvements

2f534c2

gmarouli requested a review from joegallo October 16, 2023 18:49

Update 100911.yaml

36780a7

gmarouli added >bug and removed >enhancement labels Oct 16, 2023

Update docs/changelog/100911.yaml

fc629b4

Refer the issue this is fixing

22ed88e

gmarouli added v7.17.15 auto-backport v8.11.1 labels Oct 17, 2023

Update docs/changelog/100911.yaml

a562900

joegallo approved these changes Oct 17, 2023

Merge branch 'main' into fix-ilm-delete-snapshot-wait-v2

24b01ad

gmarouli added the auto-merge-without-approval label Oct 18, 2023

elasticsearchmachine merged commit 5697fcf into elastic:main Oct 18, 2023

gmarouli deleted the fix-ilm-delete-snapshot-wait-v2 branch October 18, 2023 07:00

gmarouli mentioned this pull request Oct 18, 2023

[8.11] WaitForSnapshotStep verifies if the index belongs to the latest snapshot of that SLM policy (#100911) #101027

Merged

elasticsearchmachine added the backport pending label Oct 18, 2023

gmarouli mentioned this pull request Oct 18, 2023

[7.17] WaitForSnapshotStep verifies if the index belongs to the latest snapshot of that SLM policy (#100911) #101030

Merged

bpintea removed the backport pending label Nov 13, 2023
