Skip to content

Remove trappy timeouts in snapshot APIs#109828

Merged
elasticsearchmachine merged 12 commits intoelastic:mainfrom
DaveCTurner:2024/06/17/trappy-snapshot-requests
Jun 20, 2024
Merged

Remove trappy timeouts in snapshot APIs#109828
elasticsearchmachine merged 12 commits intoelastic:mainfrom
DaveCTurner:2024/06/17/trappy-snapshot-requests

Conversation

@DaveCTurner
Copy link
Copy Markdown
Member

Wholesale fix of every TRAPPY_IMPLICIT_DEFAULT_MASTER_NODE_TIMEOUT in
o.e.snapshots and o.e.repositories, just pulling them up to the REST
layer (where they become API params), the test suite (where they become
TEST_REQUEST_TIMEOUT), or some other place where an explicit value is
available.

Relates #107984

Wholesale fix of every `TRAPPY_IMPLICIT_DEFAULT_MASTER_NODE_TIMEOUT` in
`o.e.snapshots` and `o.e.repositories`, just pulling them up to the REST
layer (where they become API params), the test suite (where they become
`TEST_REQUEST_TIMEOUT`), or some other place where an explicit value is
available.

Relates elastic#107984
@DaveCTurner DaveCTurner added >non-issue :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.15.0 labels Jun 17, 2024
@DaveCTurner DaveCTurner requested a review from ywangd June 17, 2024 20:14
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team. label Jun 17, 2024
Copy link
Copy Markdown
Member Author

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big change in terms of lines touched but it was all pretty mechanical apart from the few spots I've called out below.


public CleanupRepositoryRequest(StreamInput in) throws IOException {
super(TRAPPY_IMPLICIT_DEFAULT_MASTER_NODE_TIMEOUT, DEFAULT_ACK_TIMEOUT);
public static CleanupRepositoryRequest readFrom(StreamInput in) throws IOException {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This request class was missing the AcknowledgedRequest header on the wire.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@UpdateForV9 seems appropriate here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need I think, we'll be clearing up the transport versions in v9 anyway and that'll expose cases like this.

Comment on lines +97 to +99
clusterService.state().getMinTransportVersion().onOrAfter(TransportVersions.SNAPSHOT_REQUEST_TIMEOUTS)
? TimeValue.MINUS_ONE
: TimeValue.MAX_VALUE,
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an internal request so really should have infinite timeout (as long as the cluster is new enough to understand that - using the current PR's transport version to be sure). Previously it had a 30s timeout which was probably a mistake.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably makes sense to separate this out in its own PR?

protected RestChannelConsumer prepareRequest(RestRequest request, NodeClient client) throws IOException {
final ResetFeatureStateRequest req = new ResetFeatureStateRequest();

final var req = new ResetFeatureStateRequest(RestUtils.getMasterNodeTimeout(request));
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API param was missing - added here and in the JSON spec.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May want to update doc as well.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I didn't even realise this had docs. It's not really intended for end-users:

WARNING: Intended for development and testing use only. Do not reset features on a production cluster.

return "cat_snapshot_action";
}

private static final String[] MATCH_ALL_PATTERNS = { ResolvedRepositories.ALL_PATTERN };
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pulled this up to a static constant.

String snapshotName = snapPolicyMeta.getLastSuccess().getSnapshotName();
String repositoryName = snapPolicyMeta.getPolicy().getRepository();
GetSnapshotsRequest request = new GetSnapshotsRequest().repositories(repositoryName)
GetSnapshotsRequest request = new GetSnapshotsRequest(TimeValue.MAX_VALUE).repositories(repositoryName)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This timeout was missing previously (i.e. defaulted to 30s) but like other ILM actions it should have been MAX_VALUE.

Copy link
Copy Markdown
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM


public CleanupRepositoryRequest(StreamInput in) throws IOException {
super(TRAPPY_IMPLICIT_DEFAULT_MASTER_NODE_TIMEOUT, DEFAULT_ACK_TIMEOUT);
public static CleanupRepositoryRequest readFrom(StreamInput in) throws IOException {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@UpdateForV9 seems appropriate here?

Comment on lines +85 to +86
public CreateSnapshotRequest(TimeValue masterNodeTimeout) {
super(masterNodeTimeout);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to this PR: I wonder why create snapshot request and some other snapshot requests are not AcknowledgedRequest?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They don't (directly) involve waiting for other nodes to apply the cluster state.

Comment on lines +97 to +99
clusterService.state().getMinTransportVersion().onOrAfter(TransportVersions.SNAPSHOT_REQUEST_TIMEOUTS)
? TimeValue.MINUS_ONE
: TimeValue.MAX_VALUE,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably makes sense to separate this out in its own PR?

protected RestChannelConsumer prepareRequest(RestRequest request, NodeClient client) throws IOException {
final ResetFeatureStateRequest req = new ResetFeatureStateRequest();

final var req = new ResetFeatureStateRequest(RestUtils.getMasterNodeTimeout(request));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May want to update doc as well.

Comment on lines +71 to 74
@Deprecated(forRemoval = true) // temporary compatibility shim
public RestoreSnapshotRequest(String repository, String snapshot) {
this(MasterNodeRequest.TRAPPY_IMPLICIT_DEFAULT_MASTER_NODE_TIMEOUT, repository, snapshot);
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not seem to be used here? Is it for stateless?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep

@DaveCTurner DaveCTurner added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jun 19, 2024
@elasticsearchmachine elasticsearchmachine merged commit 5662f98 into elastic:main Jun 20, 2024
@DaveCTurner DaveCTurner deleted the 2024/06/17/trappy-snapshot-requests branch June 20, 2024 21:11
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jul 29, 2024
These methods are no longer used in ES or any of its dependent code.

Relates elastic#109828
Relates elastic#107984
DaveCTurner added a commit that referenced this pull request Jul 30, 2024
These methods are no longer used in ES or any of its dependent code.

Relates #109828
Relates #107984
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Sep 12, 2024
In elastic#109828 we introduced a `DUMMY_TIMEOUT` constant for these handlers
in which timeouts are ignored, but we have not been very consistent with
its usage and it has an overly-generic name making it harder for readers
to understand its meaning.

This commit renames the constant to clarify why it's being used, and
fixes up several spots where we should have been using it already.
elasticsearchmachine pushed a commit that referenced this pull request Sep 12, 2024
In #109828 we introduced a `DUMMY_TIMEOUT` constant for these handlers
in which timeouts are ignored, but we have not been very consistent with
its usage and it has an overly-generic name making it harder for readers
to understand its meaning.

This commit renames the constant to clarify why it's being used, and
fixes up several spots where we should have been using it already.
breskeby pushed a commit to breskeby/elasticsearch that referenced this pull request Feb 11, 2026
In elastic#109828 we deprecated several snapshot-related
methods on `ClusterAdminClient` which imposed a trappy default timeout
on their callers. This commit removes their usages from serverless so we
can remove the trappy methods.
breskeby pushed a commit to breskeby/elasticsearch that referenced this pull request Feb 11, 2026
In elastic#109828 we deprecated several snapshot-related
methods on `ClusterAdminClient` which imposed a trappy default timeout
on their callers. This commit removes their usages from serverless so we
can remove the trappy methods.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >non-issue Team:Distributed Meta label for distributed team. test-update-serverless v8.15.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants