
[ILM] Searchable snapshot action tries to snapshot partially mounted index in cold #105647

@andreidan

Description


Elasticsearch Version

7.x, 8.x

Installed Plugins

No response

Java Version

bundled

OS Version

Darwin

Problem Description

When ILM mounts an index, it first takes a snapshot of the index, including its ILM state. For example, the snapshot taken to fully mount an index in the cold phase contains the ILM execution state of the index at the cold/searchable_snapshot/create-snapshot step, with cold/searchable_snapshot/cleanup-snapshot as the next step.
When this index transitions to frozen to be partially mounted, ILM reuses the snapshot it took in the cold phase and simply re-mounts the index as partially mounted.

Now, say you'd like to manually mount the index in a deployment, reusing the snapshot ILM took in cold. A call like:

POST /_snapshot/found-snapshots/2024.02.14-test-000001-test-policy-g52geraxtnuhna780zgnha/_mount?wait_for_completion=true&storage=shared_cache
{
  "index": "test-000001", 
  "renamed_index": "partial-restored-test-000001"
}

will create the partially mounted index partial-restored-test-000001, but with the ILM policy that was configured for index test-000001 and the ILM state the index had when the snapshot was created (i.e. the index was in cold/searchable_snapshot/create-snapshot).
ILM will therefore see partial-restored-test-000001 in cold/searchable_snapshot/create-snapshot and proceed to create a snapshot for this index. That attempt fails because 2024.02.14-test-000001-test-policy-g52geraxtnuhna780zgnha already exists, so ILM deletes this snapshot and attempts to create a new snapshot of the index. The tricky bit here is that the index is now a partially mounted index, as opposed to a regular index (e.g. test-000001), which is what the cold phase usually handles.

This results in ILM creating a snapshot of a frozen index that's mistakenly in the cold phase.

This is a tricky one for ILM to handle. One might say "well, ILM should just skip cold if it sees the index is partially mounted", but what if there is no frozen tier in the cluster? In that case, perhaps one would expect ILM to fully mount the regular index from the snapshot it detects (i.e. do what the cold phase is meant to do).

We should discuss the options we have here (ILM not doing anything in this case but signaling an error is also an option) but it'd be great to avoid this data-loss scenario where ILM snapshots a partially mounted index, losing the snapshot containing the regular index.
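To make the problematic state concrete, here is a minimal Python sketch of a check for it. It is not how ILM decides anything internally. It assumes the flat settings returned by `GET <index>/_settings?flat_settings=true` and the per-index entry from the ILM explain API have already been fetched into dictionaries, and that partially mounted indices carry the `index.store.snapshot.partial` setting. The function name `snapshot_at_risk` and the sample values are hypothetical.

```python
def snapshot_at_risk(flat_settings, ilm_explain):
    """Return True if a partially mounted index sits in the cold
    searchable_snapshot action, i.e. the state described in this issue."""
    # Settings values come back from the settings API as strings, e.g. "true".
    partial = str(
        flat_settings.get("index.store.snapshot.partial", "false")
    ).lower() == "true"
    in_cold_snapshot_action = (
        ilm_explain.get("managed") is True
        and ilm_explain.get("phase") == "cold"
        and ilm_explain.get("action") == "searchable_snapshot"
    )
    return partial and in_cold_snapshot_action


# Hypothetical sample data shaped like the two API responses.
settings = {
    "index.store.snapshot.partial": "true",
    "index.lifecycle.name": "test-policy",
}
explain = {
    "managed": True,
    "phase": "cold",
    "action": "searchable_snapshot",
    "step": "create-snapshot",
}
print(snapshot_at_risk(settings, explain))
```

A check like this could be run against manually mounted indices before ILM gets a chance to re-execute the create-snapshot step.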

In the meantime, when manually mounting an index as partially mounted, ignore the index.lifecycle.name index setting (so the index is not picked up by ILM), e.g.:

POST /_snapshot/found-snapshots/2024.02.14-test-000001-test-policy-g52geraxtnuhna780zgnha/_mount?wait_for_completion=true&storage=shared_cache
{
  "index": "test-000001", 
  "renamed_index": "partial-restored-test-000001",
  "ignore_index_settings": [ "index.lifecycle.name" ]
}

and before attaching a new ILM policy to the index, remove its ILM execution state using the ILM remove API:

POST partial-restored-test-000001/_ilm/remove
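For scripted re-mounts, the workaround above can be baked into whatever builds the request, so the `ignore_index_settings` entry is never forgotten. The following is a minimal Python sketch using only the standard library. The helper name `build_safe_mount_request` is hypothetical; it only constructs the path and body and does not call the cluster.

```python
import json
from urllib.parse import quote


def build_safe_mount_request(repository, snapshot, index, renamed_index, partial=True):
    """Build a mount request that drops the ILM policy from the mounted index."""
    # shared_cache = partially mounted, full_copy = fully mounted.
    storage = "shared_cache" if partial else "full_copy"
    path = (
        f"/_snapshot/{quote(repository, safe='')}/{quote(snapshot, safe='')}/_mount"
        f"?wait_for_completion=true&storage={storage}"
    )
    body = {
        "index": index,
        "renamed_index": renamed_index,
        # Keep the restored index from being picked up by ILM.
        "ignore_index_settings": ["index.lifecycle.name"],
    }
    return "POST", path, json.dumps(body, indent=2)


method, path, body = build_safe_mount_request(
    "found-snapshots",
    "2024.02.14-test-000001-test-policy-g52geraxtnuhna780zgnha",
    "test-000001",
    "partial-restored-test-000001",
)
print(method, path)
print(body)
```

The printed path and body match the manual request shown above and can be sent with any HTTP client.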

Steps to Reproduce

Create an ILM policy with cold and frozen phases.
Wait for ILM to mount the index in frozen.
Take note of the snapshot name for the frozen index, as listed by the ILM explain API:

GET partial-restored-test-000001/_ilm/explain?human

Delete the index so we can re-mount it manually.

DELETE partial-restored-test-000001

Manually mount the index:

POST /_snapshot/found-snapshots/2024.02.14-test-000001-test-policy-g52geraxtnuhna780zgnha/_mount?wait_for_completion=true&storage=shared_cache
{
  "index": "test-000001", 
  "renamed_index": "partial-restored-test-000001"
}

Restart the master node so ILM re-executes the current async step (cold/searchable_snapshot/create-snapshot).

Note how the freshly partially mounted index partial-restored-test-000001 is now in the cold phase: ILM deletes its backing snapshot and attempts to recreate it, except that it now takes a snapshot of the partially mounted index.

Logs (if relevant)

No response
