Retry ILM steps that fail due to SnapshotInProgressException #37624
dakrone merged 22 commits into elastic:master
…y-after-snapshot-fail
Also adds javadocs
Pinging @elastic/es-core-features
talevy left a comment:
LGTM
I discussed this outside of GitHub with @dakrone, and we agreed
that unit tests for AsyncRetryDuringSnapshotActionStep's
SnapshotExceptionListener and NoSnapshotRunningListener
would only cover some non-integral branches of the logic for
retrying actions via the cluster-state observer. Since we are
confident that the existing integration tests in this PR cover
the successful retry, which is the critical path, they are sufficient
for verifying that these changes do what they intend.
}

@Override
protected CloseFollowerIndexStep createRandomInstance() {
        performAction(idxMeta, state, observer, originalListener);
    }, originalListener::onFailure),
    // TODO: what is a good timeout value for no new state received during this time?
    TimeValue.timeValueHours(12));
I think waiting 12 hours for a snapshot to finish is reasonable. If there is no progress on this action in that time interval, a user may want to know, so 👍
backport to 6.x is blocked on #37723 (SnapshotInProgressException)
update: above blocker PR for 6.x was merged
Some steps, such as steps that delete, close, or freeze an index, may fail due to a currently running snapshot of the index. In those cases, rather than move to the ERROR step, we should retry the step when the snapshot has completed.

This change adds an abstract step (`AsyncRetryDuringSnapshotActionStep`) that certain steps (like the ones mentioned above) can extend to automatically handle the case where a snapshot is taking place. When a `SnapshotInProgressException` is received by the listener wrapper, a `ClusterStateObserver` listener is registered to wait until the snapshot has completed, re-running the ILM action when no snapshot is occurring.

This also adds integration tests for these scenarios (thanks to @talevy in #37552).

Resolves #37541
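To make the retry flow described above concrete, here is a minimal, self-contained sketch of the pattern. It is not the code from this PR: `ToyCluster`, `Listener`, `RetryDuringSnapshotStep`, `ToyDeleteStep`, and the local `SnapshotInProgressException` are hypothetical stand-ins for the real Elasticsearch types, and the 12 hour observer timeout discussed in the review is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; every type here is a hypothetical stand-in, not an Elasticsearch class. */
final class RetryDuringSnapshotSketch {

    /** Stand-in for the real SnapshotInProgressException. */
    static class SnapshotInProgressException extends RuntimeException {
        SnapshotInProgressException(String message) { super(message); }
    }

    /** Stand-in for the listener an ILM step reports to. */
    interface Listener {
        void onResponse(boolean complete);
        void onFailure(Exception e);
    }

    /** Toy cluster: knows whether a snapshot is running and who is waiting for it to end. */
    static class ToyCluster {
        private boolean snapshotRunning;
        private final List<Runnable> waiters = new ArrayList<>();

        void startSnapshot() { snapshotRunning = true; }

        boolean isSnapshotRunning() { return snapshotRunning; }

        /** Marks the snapshot as finished and runs every waiting callback once. */
        void finishSnapshot() {
            snapshotRunning = false;
            List<Runnable> ready = new ArrayList<>(waiters);
            waiters.clear();
            ready.forEach(Runnable::run);
        }

        /** Plays roughly the role of the cluster state observer: run now, or once the snapshot ends. */
        void whenNoSnapshotRunning(Runnable action) {
            if (snapshotRunning) {
                waiters.add(action);
            } else {
                action.run();
            }
        }
    }

    /** Abstract step that wraps the caller's listener and retries after a snapshot completes. */
    abstract static class RetryDuringSnapshotStep {
        /** Subclasses perform the actual index operation here. */
        abstract void performDuringNoSnapshot(ToyCluster cluster, Listener listener);

        final void performAction(ToyCluster cluster, Listener original) {
            performDuringNoSnapshot(cluster, new Listener() {
                @Override
                public void onResponse(boolean complete) {
                    original.onResponse(complete);
                }

                @Override
                public void onFailure(Exception e) {
                    if (e instanceof SnapshotInProgressException) {
                        // Do not report an error; re-run the whole action once no snapshot is running.
                        cluster.whenNoSnapshotRunning(() -> performAction(cluster, original));
                    } else {
                        original.onFailure(e);
                    }
                }
            });
        }
    }

    /** Example step: fails while a snapshot is running, succeeds otherwise. */
    static class ToyDeleteStep extends RetryDuringSnapshotStep {
        int attempts = 0;

        @Override
        void performDuringNoSnapshot(ToyCluster cluster, Listener listener) {
            attempts++;
            if (cluster.isSnapshotRunning()) {
                listener.onFailure(new SnapshotInProgressException("snapshot in progress"));
            } else {
                listener.onResponse(true);
            }
        }
    }
}
```

In this sketch the original listener never sees the snapshot failure. It is only notified once the retried action succeeds, or when the action fails for an unrelated reason.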