Introduce step to check rollover happened properly#35391
Introduce step to check rollover happened properly#35391AthenaEryma wants to merge 4 commits intoelastic:masterfrom
Conversation
This prevents a race condition where two RolloverSteps are executed simultaneously, and one creates the new index first, and the second moves on to the next step before the first can attach RolloverInfo to the original index.
|
Pinging @elastic/es-core-infra |
dakrone
left a comment
There was a problem hiding this comment.
Thanks for doing this Gordon, I think that this should be a ClusterStateWaitStep though, since it never invokes the client or does anything asynchronously, what do you think?
...plugin/core/src/main/java/org/elasticsearch/xpack/core/indexlifecycle/CheckRolloverStep.java
Show resolved
Hide resolved
talevy
left a comment
There was a problem hiding this comment.
thanks for this Gordon!
I just noticed mid-review that Lee already reviewed, but I'll leave mine be. I notice some overlap, sorry.
|
|
||
| import java.util.function.LongSupplier; | ||
|
|
||
| public class CheckRolloverStep extends AsyncWaitStep { |
There was a problem hiding this comment.
I understand we are not consistent here with regards to other Steps, but I think it would be great if we had some javadocs here explaining the purpose of this step.
| private static final Logger logger = LogManager.getLogger(CheckRolloverStep.class); | ||
| static final long TIMEOUT_MILLIS = TimeValue.timeValueMinutes(10).millis(); | ||
|
|
||
|
|
| LifecycleExecutionState executionState = LifecycleExecutionState.fromIndexMetadata(indexMetaData); | ||
| Long stepTime = executionState.getStepTime(); | ||
|
|
||
| if (stepTime == null) { |
There was a problem hiding this comment.
should we be verifying that this stepTime is associate with this current step, "check_rollover"?
|
|
||
| private LongSupplier nowSupplier; | ||
|
|
||
| CheckRolloverStep(StepKey key, StepKey nextStepKey, Client client, LongSupplier nowSupplier) { |
There was a problem hiding this comment.
would it be better to minimize the places we introduce a dependency on System::currentTimeMillis and
update evaluateCondition to have an additional parameter, LongSupplier nowSupplier?
I'm not saying this is necessarily better, was just curious about other's thoughts on minimizing external dependency entry points. It doesn't look like it matters for tests
There was a problem hiding this comment.
Given that this is the only step that needs it, I'm inclined to keep it in the constructor for now, and move it to an evaluateCondition parameter if we end up with more AsyncWaitSteps that need to know what time it is. If anyone has strong feelings though, I would be open to promoting it to the AsyncWaitStep interface level.
|
I originally wrote this as a |
| long millisSinceEnteringStep = nowSupplier.getAsLong() - stepTime; | ||
|
|
||
| if (millisSinceEnteringStep > TIMEOUT_MILLIS) { | ||
| listener.onFailure(new IllegalStateException("index [" + indexMetaData.getIndex().getName() + "] was not rolled over "+ |
There was a problem hiding this comment.
perhaps we should make mention of the timeout in this error message (so the user knows we checked for at least 10 minutes), especially since the timeout is now non-configurable
| return; | ||
| } | ||
|
|
||
| listener.onResponse(false, null); |
There was a problem hiding this comment.
I think we can return a ToXContentObject here rather than null, allowing someone explaining the execution to know why it paused on this step for so long (including how long we've been waiting)
| assertThat(getReasonForIndex(originalIndex), equalTo("no rollover info found for [" + originalIndex + "], either the index " + | ||
| "has not yet rolled over or a subsequent index was created outside of Index Lifecycle Management")); | ||
| }); | ||
| } |
There was a problem hiding this comment.
I think having an integration test that does this is important and we should keep it
|
I'm closing this as I'll be opening another PR to handle this another way. |
This prevents a race condition where two RolloverSteps are executed
simultaneously, and one creates the new index first, and the second
moves on to the next step before the first can attach RolloverInfo to
the original index.
Relates to #35313
Of note: I've removed the integration test associated with the case where the next index is created outside of ILM, as the timeout is now too long. I considered making the timeout a setting, but we don't easily have access to the cluster state in an AsyncWaitStep - we can get there, but it's a bit awkward, and it would only really be used for testing. If anyone has any better ideas, I'm open to them.