Fix GoogleCloudStorageBlobStoreRepositoryTests#testMultipleSnapshotAndRollback #139578

Merged

nicktindall merged 5 commits into elastic:main from nicktindall:fix_GoogleCloudStorageBlobStoreRepositoryTests on Dec 22, 2025

Conversation

@nicktindall (Contributor) commented Dec 16, 2025

The reason for this test failure is that the GoogleCloudStorageBlobStoreRepositoryTests.GoogleErroneousHttpHandler#requestUniqueId implementation returns IDs that are not unique between nodes. This can cause the fault injection to exceed the maximum retries and fail the test.

The failure only surfaced after the recent change because this test moved from time-limited retries (unlimited in number) to retries limited by number.

Example interleaving (imagine the client's max retries is 3, and the mock server fails each request twice before allowing it through):

| source node | request | result |
| --- | --- | --- |
| node1 | GET /download/storage/v1/b/bucket/o/tests-xxx/master.dat | blocked, first time |
| node1 | GET /download/storage/v1/b/bucket/o/tests-xxx/master.dat | blocked, second time |
| node2 | GET /download/storage/v1/b/bucket/o/tests-xxx/master.dat | allowed through, because it is the third block |
| node1 | GET /download/storage/v1/b/bucket/o/tests-xxx/master.dat | blocked, first time (third for node1) |
| node1 | GET /download/storage/v1/b/bucket/o/tests-xxx/master.dat | blocked, second time (fourth for node1; client fails because retries are exceeded) |
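The interleaving above can be reproduced with a toy simulation. This is not the actual test code: `MockServer`, the fail-twice-then-allow rule, and the counter reset after a success are assumptions based on the description, written only to show how a request ID shared between nodes starves node1 of its "third try".

```java
import java.util.HashMap;
import java.util.Map;

class SharedRequestIdSimulation {

    /** Mock server: fails each request ID twice, then allows it through and resets its counter. */
    static class MockServer {
        private final Map<String, Integer> failures = new HashMap<>();

        boolean handle(String requestUniqueId) {
            int seen = failures.merge(requestUniqueId, 1, Integer::sum);
            if (seen >= 3) {
                failures.remove(requestUniqueId); // allowed through: start counting afresh
                return true;
            }
            return false; // inject a failure
        }
    }

    public static void main(String[] args) {
        MockServer server = new MockServer();
        // Without a client id, both nodes' GETs map to the same "unique" ID.
        String sharedId = "GET /download/storage/v1/b/bucket/o/tests-xxx/master.dat";

        System.out.println("node1 attempt 1: " + (server.handle(sharedId) ? "allowed" : "blocked"));
        System.out.println("node1 attempt 2: " + (server.handle(sharedId) ? "allowed" : "blocked"));
        // node2 consumes the shared ID's third try and resets the counter
        System.out.println("node2 attempt 1: " + (server.handle(sharedId) ? "allowed" : "blocked"));
        System.out.println("node1 attempt 3: " + (server.handle(sharedId) ? "allowed" : "blocked"));
        System.out.println("node1 attempt 4: " + (server.handle(sharedId) ? "allowed" : "blocked"));
        System.out.println("node1 has now failed 4 times with max retries 3: test fails");
    }
}
```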

The PR adds a client-id header which we can include in the request unique ID to disambiguate between clients.
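A minimal sketch of that idea, not the PR's exact code: the real handler reads headers from an `HttpExchange`, and the actual value of `CLIENT_ID_HEADER` may differ. The point is that the per-client id becomes part of the request unique ID, so each node gets its own fault-injection counter.

```java
import java.util.Map;

class ClientIdUniqueIdSketch {
    static final String CLIENT_ID_HEADER = "x-client-id"; // assumed header name

    // Prefix the method/URI with the per-client id so two nodes retrying the
    // same blob no longer share a fault-injection counter.
    static String requestUniqueId(Map<String, String> headers, String method, String uri) {
        return headers.getOrDefault(CLIENT_ID_HEADER, "unknown") + " " + method + " " + uri;
    }

    public static void main(String[] args) {
        String uri = "/download/storage/v1/b/bucket/o/tests-xxx/master.dat";
        // Two nodes requesting the same blob now produce distinct unique IDs.
        System.out.println(requestUniqueId(Map.of(CLIENT_ID_HEADER, "node1"), "GET", uri));
        System.out.println(requestUniqueId(Map.of(CLIENT_ID_HEADER, "node2"), "GET", uri));
    }
}
```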

Closes: #139556
Closes: #139646
Closes: #139665

@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v9.3.0 labels Dec 16, 2025
@nicktindall nicktindall added >test Issues or PRs that are addressing/adding tests :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs and removed v9.3.0 labels Dec 16, 2025
@elasticsearchmachine elasticsearchmachine added Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. and removed needs:triage Requires assignment of a team area label labels Dec 16, 2025
@elasticsearchmachine (Collaborator):

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

```java
    CLIENT_ID_HEADER,
    exchange.getRequestURI()
);
}
```
@nicktindall (Contributor, Author):

/batch/ requests don't include the custom header, but they also aren't failed by this handler, so that's OK.

@ywangd (Member) left a comment:

LGTM

Comment on lines 368 to 378
```java
if (exchange.getRequestHeaders().containsKey(IDEMPOTENCY_TOKEN)) {
    String idempotencyToken = exchange.getRequestHeaders().getFirst(IDEMPOTENCY_TOKEN);
    // In the event of a resumable retry, the GCS client uses the same idempotency token for
    // the retry status check and the subsequent retries.
    // Including the range header allows us to disambiguate between the requests,
    // see https://github.com/googleapis/java-storage/issues/3040
    if (exchange.getRequestHeaders().containsKey("Content-Range")) {
        idempotencyToken += " " + exchange.getRequestHeaders().getFirst("Content-Range");
    }
    return idempotencyToken;
}
```
Contributor:

Hm, I thought IDEMPOTENCY_TOKEN should address this problem, shouldn't it?

@nicktindall (Contributor, Author):

It doesn't appear to be populated for these requests (the debugger indicates it's not there). These are GET requests, so perhaps idempotency is implied?

Contributor:

Right, these GET requests have no idempotency token. In short, it comes from:

```java
final var meteredGet = client.meteredObjectsGet(purpose, blobId.getBucket(), blobId.getName());
// -> storageRpc.objects().get(bucket, blob)
```

This returns a Storage.Objects.Get that does not perform retries, so the GCS client skips all retrying logic, including the idempotency key. I'm not sure why we should abandon the client's default retry logic.

I think we should change

FROM `storage.objects().get(bucket, blob)` -> `Storage.Objects.Get`

TO `storage.reader(blobId)` -> `com.google.cloud.ReadChannel`

A reader does use the default retry logic and populates the idempotency token. It's a bigger change, but I think the right one.

Contributor:

Hm, there is a story about the reader being bad: neither small nor large chunks are acceptable for us due to internal buffering. #55506

Considering all of the above, CLIENT_ID is a good choice :)

@mhl-b (Contributor), Dec 16, 2025:

I'm not 100% positive, but my quick test showed that using a reader with a chunk size of 0 does not do any buffering. The reader would have an inner type of com.google.cloud.storage.ApiaryUnbufferedReadableByteChannel, which does what we do in the retrying stream: keep track of how many bytes were read and retry from there.

I think this is what we want.

@nicktindall (Contributor, Author), Dec 21, 2025:

Azure and GCS both provide this behaviour out of the box, but the RetryingInputStream has lots of logic specific to our use case:

  • Retry indefinitely for indexing ops
  • Retry never for repo analysis ops
  • Don't count failures where we made meaningful progress

That was the reason I factored that logic out for re-use (the original ticket was about Azure not being tenacious enough in its retries). It also gives us a place to put more such common logic that might be ES-specific.

I will merge this as-is but happy to reconsider what we want from each layer going forward. I'll create a ticket to follow up.
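The ES-specific retry rules listed above could be sketched as follows. The names (`Purpose`, `shouldRetry`) and parameters are illustrative assumptions, not the real RetryingInputStream API.

```java
class RetryPolicySketch {
    enum Purpose { INDICES, REPOSITORY_ANALYSIS, SNAPSHOT }

    static boolean shouldRetry(Purpose purpose, boolean madeProgress, int failuresWithoutProgress, int maxRetries) {
        if (purpose == Purpose.REPOSITORY_ANALYSIS) {
            return false; // repo analysis ops never retry
        }
        if (purpose == Purpose.INDICES) {
            return true; // indexing ops retry indefinitely
        }
        if (madeProgress) {
            return true; // a failure after meaningful progress doesn't count against the budget
        }
        return failuresWithoutProgress < maxRetries;
    }

    public static void main(String[] args) {
        System.out.println(shouldRetry(Purpose.REPOSITORY_ANALYSIS, false, 0, 3)); // never retries
        System.out.println(shouldRetry(Purpose.INDICES, false, 100, 3));           // always retries
        System.out.println(shouldRetry(Purpose.SNAPSHOT, true, 5, 3));             // progress keeps it alive
        System.out.println(shouldRetry(Purpose.SNAPSHOT, false, 3, 3));            // budget exhausted
    }
}
```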

@mhl-b (Contributor) left a comment:
LGTM


@nicktindall nicktindall enabled auto-merge (squash) December 21, 2025 23:54
@nicktindall (Contributor, Author):

@elasticmachine run elasticsearch-ci/elasticsearch-serverless-es-pr-check

@nicktindall nicktindall disabled auto-merge December 22, 2025 03:48
@nicktindall (Contributor, Author):

@elasticsearchmachine test this please

@nicktindall nicktindall merged commit b353e57 into elastic:main Dec 22, 2025
35 checks passed
@nicktindall nicktindall deleted the fix_GoogleCloudStorageBlobStoreRepositoryTests branch December 22, 2025 04:56
@mhl-b (Contributor) commented Feb 2, 2026

💚 All backports created successfully

Branch: 9.3

Questions? Please refer to the Backport tool documentation.

mhl-b pushed a commit to mhl-b/elasticsearch that referenced this pull request Feb 2, 2026
mhl-b added a commit to mhl-b/elasticsearch that referenced this pull request Feb 2, 2026
elasticsearchmachine pushed a commit that referenced this pull request Feb 2, 2026
…dRollback (#139578) (#141689)

Co-authored-by: Nick Tindall <nick.tindall@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. >test Issues or PRs that are addressing/adding tests v9.4.0

Projects

None yet

4 participants