Skip to content

Introduce step to check rollover happened properly#35391

Closed
AthenaEryma wants to merge 4 commits intoelastic:masterfrom
AthenaEryma:ilm/rollover-verify-step
Closed

Introduce step to check rollover happened properly#35391
AthenaEryma wants to merge 4 commits intoelastic:masterfrom
AthenaEryma:ilm/rollover-verify-step

Conversation

@AthenaEryma
Copy link
Copy Markdown
Contributor

This prevents a race condition where two RolloverSteps are executed
simultaneously, and one creates the new index first, and the second
moves on to the next step before the first can attach RolloverInfo to
the original index.

Relates to #35313

Of note: I've removed the integration test associated with the case where the next index is created outside of ILM, as the timeout is now too long. I considered making the timeout a setting, but we don't easily have access to the cluster state in an AsyncWaitStep - we can get there, but it's a bit awkward, and it would only really be used for testing. If anyone has any better ideas, I'm open to them.

This prevents a race condition where two RolloverSteps are executed
simultaneously, and one creates the new index first, and the second
moves on to the next step before the first can attach RolloverInfo to
the original index.
@AthenaEryma AthenaEryma added >bug blocker :Data Management/ILM+SLM DO NOT USE. Use ":StorageEngine/ILM" or ":Distributed Coordination/SLM" instead. labels Nov 8, 2018
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-core-infra

Copy link
Copy Markdown
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this Gordon, I think that this should be a ClusterStateWaitStep though, since it never invokes the client or does anything asynchronously, what do you think?

Copy link
Copy Markdown
Contributor

@talevy talevy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for this Gordon!

I just noticed mid-review that Lee already reviewed, but I'll leave mine be. I notice some overlap, sorry.


import java.util.function.LongSupplier;

public class CheckRolloverStep extends AsyncWaitStep {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand we are not consistent here with regards to other Steps, but I think it would be great if we had some javadocs here explaining the purpose of this step.

private static final Logger logger = LogManager.getLogger(CheckRolloverStep.class);
static final long TIMEOUT_MILLIS = TimeValue.timeValueMinutes(10).millis();


Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra new lines

LifecycleExecutionState executionState = LifecycleExecutionState.fromIndexMetadata(indexMetaData);
Long stepTime = executionState.getStepTime();

if (stepTime == null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we be verifying that this stepTime is associate with this current step, "check_rollover"?


private LongSupplier nowSupplier;

CheckRolloverStep(StepKey key, StepKey nextStepKey, Client client, LongSupplier nowSupplier) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be better to minimize the places we introduce a dependency on System::currentTimeMillis and
update evaluateCondition to have an additional parameter, LongSupplier nowSupplier?

I'm not saying this is necessarily better, was just curious about other's thoughts on minimizing external dependency entry points. It doesn't look like it matters for tests

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this is the only step that needs it, I'm inclined to keep it in the constructor for now, and move it to an evaluateCondition parameter if we end up with more AsyncWaitSteps that need to know what time it is. If anyone has strong feelings though, I would be open to promoting it to the AsyncWaitStep interface level.

@AthenaEryma
Copy link
Copy Markdown
Contributor Author

I originally wrote this as a ClusterStateWaitStep, but in the case where we need to fail out of this step because someone manually created the subsequent index outside of ILM, we cannot rely on there being additional cluster state updates to trigger the ClusterStateWaitStep evalation. Per discussion on Slack, for now I'm going to leave this as an AsyncWaitStep.

long millisSinceEnteringStep = nowSupplier.getAsLong() - stepTime;

if (millisSinceEnteringStep > TIMEOUT_MILLIS) {
listener.onFailure(new IllegalStateException("index [" + indexMetaData.getIndex().getName() + "] was not rolled over "+
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps we should make mention of the timeout in this error message (so the user knows we checked for at least 10 minutes), especially since the timeout is now non-configurable

return;
}

listener.onResponse(false, null);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can return a ToXContentObject here rather than null, allowing someone explaining the execution to know why it paused on this step for so long (including how long we've been waiting)

assertThat(getReasonForIndex(originalIndex), equalTo("no rollover info found for [" + originalIndex + "], either the index " +
"has not yet rolled over or a subsequent index was created outside of Index Lifecycle Management"));
});
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having an integration test that does this is important and we should keep it

@AthenaEryma
Copy link
Copy Markdown
Contributor Author

I'm closing this as I'll be opening another PR to handle this another way.

@AthenaEryma AthenaEryma deleted the ilm/rollover-verify-step branch December 7, 2018 04:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

blocker >bug :Data Management/ILM+SLM DO NOT USE. Use ":StorageEngine/ILM" or ":Distributed Coordination/SLM" instead.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants