
Make SLM Tasks Use Infinite Timeout for Master Requests#72085

Merged
original-brownbear merged 3 commits into elastic:master from original-brownbear:infinite-timeouts-slm-tasks
Apr 22, 2021

Conversation

original-brownbear (Contributor) commented Apr 22, 2021

No point in failing SLM tasks on slow masters. Using 30s timeouts
likely leads to many needless SLM task failures when the master is
temporarily busy, which is less than ideal, especially when snapshot
or retention task frequencies are low.
@original-brownbear original-brownbear added >non-issue :Data Management/ILM+SLM DO NOT USE. Use ":StorageEngine/ILM" or ":Distributed Coordination/SLM" instead. v8.0.0 v7.14.0 labels Apr 22, 2021
@elasticmachine elasticmachine added the Team:Data Management (obsolete) DO NOT USE. This team no longer exists. label Apr 22, 2021
elasticmachine (Collaborator) commented

Pinging @elastic/es-core-features (Team:Core/Features)

DaveCTurner (Member) left a comment

LGTM

original-brownbear (Contributor, Author) commented

Thanks David!

@original-brownbear original-brownbear merged commit 47fdb46 into elastic:master Apr 22, 2021
@original-brownbear original-brownbear deleted the infinite-timeouts-slm-tasks branch April 22, 2021 15:27
  String snapshotName = maybeMetadata.map(policyMetadata -> {
-     CreateSnapshotRequest request = policyMetadata.getPolicy().toRequest();
+     // don't time out on this request to not produce failed SLM runs in case of a temporarily slow master node
+     CreateSnapshotRequest request = policyMetadata.getPolicy().toRequest().masterNodeTimeout(TimeValue.MAX_VALUE);
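For illustration only, here is a minimal self-contained sketch of the builder-style timeout pattern the diff uses. The `TimeValue` and `CreateSnapshotRequest` classes below are stand-ins written for this sketch, not the real Elasticsearch classes; the 30s default and the `MAX_VALUE` sentinel mirror the values discussed in this PR.

```java
// Stand-in for Elasticsearch's TimeValue (illustration only).
final class TimeValue {
    final long millis;

    TimeValue(long millis) {
        this.millis = millis;
    }

    // An effectively infinite timeout, mirroring TimeValue.MAX_VALUE in the diff.
    static final TimeValue MAX_VALUE = new TimeValue(Long.MAX_VALUE);
}

// Stand-in for a master-node request with a default 30s timeout (illustration only).
final class CreateSnapshotRequest {
    TimeValue masterNodeTimeout = new TimeValue(30_000L);

    // Builder-style setter returning this, so it can be chained off toRequest().
    CreateSnapshotRequest masterNodeTimeout(TimeValue timeout) {
        this.masterNodeTimeout = timeout;
        return this;
    }
}

public class Demo {
    public static void main(String[] args) {
        // Before the change: the implicit 30s default applies.
        CreateSnapshotRequest before = new CreateSnapshotRequest();

        // After the change: the request never times out waiting for the master.
        CreateSnapshotRequest after = new CreateSnapshotRequest()
                .masterNodeTimeout(TimeValue.MAX_VALUE);

        System.out.println(before.masterNodeTimeout.millis); // 30000
        System.out.println(after.masterNodeTimeout.millis == Long.MAX_VALUE); // true
    }
}
```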
A Member commented:

I am wary of infinite timeouts just because we should be able to come up with a reasonable time that a request should either succeed or fail, or else SLM should actually fail rather than waiting forever.

I know this is already merged, but what about just setting a reasonably long timeout (something like 2 hours, or even 24 hours) instead?

A Member replied:

I think a timeout here is worse than patiently letting things complete in their own time. Timing out and retrying the thing that just timed out is particularly bad.

In this case, the timeout only applies to finding the master and processing the cluster state update that starts the snapshot. After that, we could already be waiting arbitrarily long. It's definitely a bug for us to take minutes or hours to process that first cluster state update, but it's almost certainly not a bug in SLM or snapshotting, and it doesn't make much sense to me to give up and retry (from the back of the queue) after a timeout elapses. We may as well stay in line and know that the master will get around to this eventually.
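The "back of the queue" point above can be illustrated with a toy FIFO model (this is not Elasticsearch's actual master task queue, just a sketch of why timing out and resubmitting is worse than waiting in place):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class QueueDemo {
    public static void main(String[] args) {
        // Toy FIFO model of tasks pending on a busy master.
        Deque<String> pending = new ArrayDeque<>();
        pending.addLast("slm-create-snapshot");
        pending.addLast("other-task-1");
        pending.addLast("other-task-2");

        // Timing out and retrying: the task gives up its place at the head
        // and re-enters at the tail, now behind work that arrived after it.
        String timedOut = pending.pollFirst();
        pending.addLast(timedOut + "(retry)");

        // Had it simply waited, it would have run first.
        System.out.println(pending.peekFirst()); // other-task-1
        System.out.println(pending.peekLast());  // slm-create-snapshot(retry)
    }
}
```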

Do we retry regardless of whether the previous task completed or not? If so, do we have a mechanism to prevent too many of these jobs from piling up?

A Member replied:

> Do we retry regardless of whether the previous task completed or not? If so, do we have a mechanism to prevent too many of these jobs from piling up?

We have it on our list to discuss SLM retries and what we want to do in the event that an SLM snapshot fails, so it's still under discussion.

A Contributor commented:

cross linking: #70587

original-brownbear added a commit that referenced this pull request Apr 23, 2021
original-brownbear added a commit that referenced this pull request Jun 17, 2021
Same as #72085 but for ILM. Having a timeout on these internal "requests"
only adds more noise when the master is already slow, because timed-out steps
trigger moves to the error step.
It also seems safe to remove the timeout setting outright, as it was not used
anywhere and, as far as I can tell, was never documented.
original-brownbear added a commit that referenced this pull request Jun 28, 2021
@original-brownbear original-brownbear restored the infinite-timeouts-slm-tasks branch April 18, 2023 20:49