Fix GoogleCloudStorageBlobStoreRepositoryTests#testMultipleSnapshotAndRollback #139578

Merged

nicktindall merged 5 commits into elastic:main from nicktindall:fix_GoogleCloudStorageBlobStoreRepositoryTests on Dec 22, 2025

Conversation

@nicktindall (Contributor) commented Dec 16, 2025

The reason for this test failure is that the GoogleCloudStorageBlobStoreRepositoryTests.GoogleErroneousHttpHandler#requestUniqueId implementation returns IDs that are not unique between nodes. This can cause the fault injection to exceed the maximum retries and fail the test.

The failure only surfaced after the recent change because this test moved from time-limited retries (unlimited in number) to retries limited by number.

Example interleaving (imagine the client's max retries is 3, and the mock server fails each request twice before allowing it through):

| source node | request | result |
| --- | --- | --- |
| node1 | GET /download/storage/v1/b/bucket/o/tests-xxx/master.dat | blocked, first time |
| node1 | GET /download/storage/v1/b/bucket/o/tests-xxx/master.dat | blocked, second time |
| node2 | GET /download/storage/v1/b/bucket/o/tests-xxx/master.dat | allowed through, because it is the third block |
| node1 | GET /download/storage/v1/b/bucket/o/tests-xxx/master.dat | blocked, first time (third for node1) |
| node1 | GET /download/storage/v1/b/bucket/o/tests-xxx/master.dat | blocked, second time (fourth for node1; client fails because retries are exceeded) |
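The interleaving above can be reproduced with a toy simulation. This is not the actual test code: `MockServer`, the fail-twice-then-allow rule, and the counter reset after a success are assumptions based on the description, written only to show how a request ID shared between nodes starves node1 of its "third try".

```java
import java.util.HashMap;
import java.util.Map;

class SharedRequestIdSimulation {

    /** Mock server: fails each request ID twice, then allows it through and resets its counter. */
    static class MockServer {
        private final Map<String, Integer> failures = new HashMap<>();

        boolean handle(String requestUniqueId) {
            int seen = failures.merge(requestUniqueId, 1, Integer::sum);
            if (seen >= 3) {
                failures.remove(requestUniqueId); // allowed through: start counting afresh
                return true;
            }
            return false; // inject a failure
        }
    }

    public static void main(String[] args) {
        MockServer server = new MockServer();
        // Without a client id, both nodes' GETs map to the same "unique" ID.
        String sharedId = "GET /download/storage/v1/b/bucket/o/tests-xxx/master.dat";

        System.out.println("node1 attempt 1: " + (server.handle(sharedId) ? "allowed" : "blocked"));
        System.out.println("node1 attempt 2: " + (server.handle(sharedId) ? "allowed" : "blocked"));
        // node2 consumes the shared ID's third try and resets the counter
        System.out.println("node2 attempt 1: " + (server.handle(sharedId) ? "allowed" : "blocked"));
        System.out.println("node1 attempt 3: " + (server.handle(sharedId) ? "allowed" : "blocked"));
        System.out.println("node1 attempt 4: " + (server.handle(sharedId) ? "allowed" : "blocked"));
        System.out.println("node1 has now failed 4 times with max retries 3: test fails");
    }
}
```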

The PR adds a client-id header which we can include in the request unique ID to disambiguate between clients.
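A minimal sketch of that idea, not the PR's exact code: the real handler reads headers from an `HttpExchange`, and the actual value of `CLIENT_ID_HEADER` may differ. The point is that the per-client id becomes part of the request unique ID, so each node gets its own fault-injection counter.

```java
import java.util.Map;

class ClientIdUniqueIdSketch {
    static final String CLIENT_ID_HEADER = "x-client-id"; // assumed header name

    // Prefix the method/URI with the per-client id so two nodes retrying the
    // same blob no longer share a fault-injection counter.
    static String requestUniqueId(Map<String, String> headers, String method, String uri) {
        return headers.getOrDefault(CLIENT_ID_HEADER, "unknown") + " " + method + " " + uri;
    }

    public static void main(String[] args) {
        String uri = "/download/storage/v1/b/bucket/o/tests-xxx/master.dat";
        // Two nodes requesting the same blob now produce distinct unique IDs.
        System.out.println(requestUniqueId(Map.of(CLIENT_ID_HEADER, "node1"), "GET", uri));
        System.out.println(requestUniqueId(Map.of(CLIENT_ID_HEADER, "node2"), "GET", uri));
    }
}
```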

Closes: #139556
Closes: #139646
Closes: #139665

@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v9.3.0 labels Dec 16, 2025
@nicktindall nicktindall added >test Issues or PRs that are addressing/adding tests :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs and removed v9.3.0 labels Dec 16, 2025
@elasticsearchmachine elasticsearchmachine added Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. and removed needs:triage Requires assignment of a team area label labels Dec 16, 2025
@elasticsearchmachine (Collaborator):

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

```java
    CLIENT_ID_HEADER,
    exchange.getRequestURI()
);
}
```
@nicktindall (Contributor, Author):

/batch/ requests don't include the custom header, but they also aren't failed by this handler, so that's OK.

@ywangd (Member) left a comment:

LGTM

Comment on lines 368 to 378
```java
if (exchange.getRequestHeaders().containsKey(IDEMPOTENCY_TOKEN)) {
    String idempotencyToken = exchange.getRequestHeaders().getFirst(IDEMPOTENCY_TOKEN);
    // In the event of a resumable retry, the GCS client uses the same idempotency token for
    // the retry status check and the subsequent retries.
    // Including the range header allows us to disambiguate between the requests,
    // see https://github.com/googleapis/java-storage/issues/3040
    if (exchange.getRequestHeaders().containsKey("Content-Range")) {
        idempotencyToken += " " + exchange.getRequestHeaders().getFirst("Content-Range");
    }
    return idempotencyToken;
}
```
Contributor:

Hm, I thought IDEMPOTENCY_TOKEN should address this problem, shouldn't it?

@nicktindall (Contributor, Author):

It doesn't appear to be populated for these requests (the debugger indicates it's not there). These are GET requests, so perhaps idempotency is implied?

Contributor:

Right, these GET requests have no idempotency token. In short, it comes from:

```java
final var meteredGet = client.meteredObjectsGet(purpose, blobId.getBucket(), blobId.getName());
// -> storageRpc.objects().get(bucket, blob)
```

This returns a Storage.Objects.Get that does not perform retries, so the GCS client skips all retrying logic, including the idempotency key. I'm not sure why we should abandon the client's default retry logic.

I think we should change

FROM `storage.objects().get(bucket, blob)` -> `Storage.Objects.Get`

TO `storage.reader(blobId)` -> `com.google.cloud.ReadChannel`

A reader does use the default retry logic and populates the idempotency token. It's a bigger change, but I think the right one.

Contributor:

Hm, there is a story about the reader being bad: neither small nor large chunks are acceptable for us due to internal buffering. #55506

Considering all of the above, CLIENT_ID is a good choice :)

@mhl-b (Contributor), Dec 16, 2025:

I'm not 100% positive, but my quick test showed that using a reader with a chunk size of 0 does not do any buffering. The reader would have an inner type of com.google.cloud.storage.ApiaryUnbufferedReadableByteChannel, which does what we do in the retrying stream: keep track of how many bytes were read and retry from there.

I think this is what we want.

@nicktindall (Contributor, Author), Dec 21, 2025:

Azure and GCS both provide this behaviour out of the box, but the RetryingInputStream has lots of logic specific to our use case:

  • Retry indefinitely for indexing ops
  • Retry never for repo analysis ops
  • Don't count failures where we made meaningful progress

That was the reason I factored that logic out for re-use (the original ticket was about Azure not being tenacious enough in its retries). It also gives us a place to put more such common logic that might be ES-specific.

I will merge this as-is but happy to reconsider what we want from each layer going forward. I'll create a ticket to follow up.
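The ES-specific retry rules listed above could be sketched as follows. The names (`Purpose`, `shouldRetry`) and parameters are illustrative assumptions, not the real RetryingInputStream API.

```java
class RetryPolicySketch {
    enum Purpose { INDICES, REPOSITORY_ANALYSIS, SNAPSHOT }

    static boolean shouldRetry(Purpose purpose, boolean madeProgress, int failuresWithoutProgress, int maxRetries) {
        if (purpose == Purpose.REPOSITORY_ANALYSIS) {
            return false; // repo analysis ops never retry
        }
        if (purpose == Purpose.INDICES) {
            return true; // indexing ops retry indefinitely
        }
        if (madeProgress) {
            return true; // a failure after meaningful progress doesn't count against the budget
        }
        return failuresWithoutProgress < maxRetries;
    }

    public static void main(String[] args) {
        System.out.println(shouldRetry(Purpose.REPOSITORY_ANALYSIS, false, 0, 3)); // never retries
        System.out.println(shouldRetry(Purpose.INDICES, false, 100, 3));           // always retries
        System.out.println(shouldRetry(Purpose.SNAPSHOT, true, 5, 3));             // progress keeps it alive
        System.out.println(shouldRetry(Purpose.SNAPSHOT, false, 3, 3));            // budget exhausted
    }
}
```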

@mhl-b (Contributor) left a comment:
LGTM


@nicktindall nicktindall enabled auto-merge (squash) December 21, 2025 23:54
@nicktindall (Contributor, Author):

@elasticmachine run elasticsearch-ci/elasticsearch-serverless-es-pr-check

@nicktindall nicktindall disabled auto-merge December 22, 2025 03:48
@nicktindall (Contributor, Author):

@elasticsearchmachine test this please

@nicktindall nicktindall merged commit b353e57 into elastic:main Dec 22, 2025
35 checks passed
@nicktindall nicktindall deleted the fix_GoogleCloudStorageBlobStoreRepositoryTests branch December 22, 2025 04:56
@mhl-b (Contributor) commented Feb 2, 2026

💚 All backports created successfully

Branch: 9.3

Questions? Please refer to the Backport tool documentation.

mhl-b pushed a commit to mhl-b/elasticsearch that referenced this pull request Feb 2, 2026
mhl-b added a commit to mhl-b/elasticsearch that referenced this pull request Feb 2, 2026
elasticsearchmachine pushed a commit that referenced this pull request Feb 2, 2026
…dRollback (#139578) (#141689)

Co-authored-by: Nick Tindall <nick.tindall@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. >test Issues or PRs that are addressing/adding tests v9.4.0

Projects

None yet

4 participants