Fix GoogleCloudStorageBlobStoreRepositoryTests#testMultipleSnapshotAndRollback #139578
Conversation
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
```java
        CLIENT_ID_HEADER,
        exchange.getRequestURI()
    );
}
```
/batch/ requests don't include the custom header, but they also aren't failed by this handler, so that's OK.
```java
if (exchange.getRequestHeaders().containsKey(IDEMPOTENCY_TOKEN)) {
    String idempotencyToken = exchange.getRequestHeaders().getFirst(IDEMPOTENCY_TOKEN);
    // In the event of a resumable retry, the GCS client uses the same idempotency token for
    // the retry status check and the subsequent retries.
    // Including the range header allows us to disambiguate between the requests
    // see https://github.com/googleapis/java-storage/issues/3040
    if (exchange.getRequestHeaders().containsKey("Content-Range")) {
        idempotencyToken += " " + exchange.getRequestHeaders().getFirst("Content-Range");
    }
    return idempotencyToken;
}
```
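For illustration, the disambiguation above can be exercised standalone against the JDK's com.sun.net.httpserver.Headers. This is a sketch, not the actual handler code: the requestId helper and the header-name value here are assumptions.

```java
import com.sun.net.httpserver.Headers;
import java.util.Optional;

public class IdempotencySketch {
    // Assumed header name for illustration; the real IDEMPOTENCY_TOKEN constant may differ.
    static final String IDEMPOTENCY_TOKEN = "x-goog-gcs-idempotency-token";

    // Standalone version of the disambiguation logic quoted above: the idempotency
    // token alone collides between the resumable-retry status check and the retry
    // itself, so the Content-Range header (when present) is appended to the ID.
    static Optional<String> requestId(Headers headers) {
        if (headers.containsKey(IDEMPOTENCY_TOKEN)) {
            String token = headers.getFirst(IDEMPOTENCY_TOKEN);
            if (headers.containsKey("Content-Range")) {
                token += " " + headers.getFirst("Content-Range");
            }
            return Optional.of(token);
        }
        return Optional.empty();
    }
}
```

Two requests with the same token but different Content-Range values now map to different unique IDs, so the fault-injection counter treats them independently.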
Hm, I thought IDEMPOTENCY_TOKEN should address this problem, shouldn't it?
It doesn't appear to be populated for these requests (the debugger indicates it's not there). These are GET requests, so perhaps idempotency is implied?
Right, these GET requests carry no idempotency token. In short, it comes from

```java
final var meteredGet = client.meteredObjectsGet(purpose, blobId.getBucket(), blobId.getName());
// -> storageRpc.objects().get(bucket, blob)
```

This returns a Storage.Objects.Get that does not perform retries, so the GCS client skips all retrying logic, including the idempotency key. I'm not sure why we should abandon the client's default retry logic.
I think we should change

FROM storage.objects().get(bucket, blob) -> Storage.Objects.Get
TO storage.reader(blobId) -> com.google.cloud.ReadChannel

A reader does use the default retry logic and populates the idempotency token. It's a bigger change, but I think the right one.
Hm, there is a story about the reader being bad: neither small nor large chunks are acceptable for us due to internal buffering. #55506
Considering all of the above, CLIENT_ID is a good choice :)
I'm not 100% positive, but my quick test showed that using a reader with a chunk size of 0 does not do any buffering. The reader would have an inner type of com.google.cloud.storage.ApiaryUnbufferedReadableByteChannel, which does what we do in the retrying stream: keep track of how many bytes were read and retry from there.
I think this is what we want.
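Sketched out, the reader-based approach proposed above might look roughly like this (assuming the google-cloud-storage Java client; openBlob is a hypothetical helper, and the chunk-size-0 behaviour should be verified against the client version in use):

```java
import com.google.cloud.ReadChannel;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import java.io.InputStream;
import java.nio.channels.Channels;

public class ReaderSketch {
    // Hypothetical helper: instead of the low-level Apiary call
    // storage.objects().get(bucket, blob), which bypasses the client's retry
    // machinery, open a ReadChannel, which retries with the client's default
    // RetrySettings and populates the idempotency token.
    static InputStream openBlob(Storage storage, String bucket, String name) {
        ReadChannel reader = storage.reader(BlobId.of(bucket, name));
        // A chunk size of 0 is reported (in this thread) to disable internal buffering.
        reader.setChunkSize(0);
        return Channels.newInputStream(reader);
    }
}
```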
Azure and GCS both provide this behaviour out of the box, but the RetryingInputStream has lots of logic specific to our use case:
- Retry indefinitely for indexing ops
- Retry never for repo analysis ops
- Don't count failures where we made meaningful progress
That was the reason I factored that logic out for reuse (the original ticket was about Azure not being tenacious enough in its retries). It also gives us a place to put more such common logic that might be ES-specific.
I will merge this as-is, but I'm happy to reconsider what we want from each layer going forward. I'll create a ticket to follow up.
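As a rough self-contained sketch of that ES-specific layer (illustrative only: the Purpose enum, shouldRetry, and the retry limit here are hypothetical names, not the RetryingInputStream API):

```java
public class RetryPolicySketch {
    enum Purpose { INDEXING, REPOSITORY_ANALYSIS, OTHER }

    static final int DEFAULT_MAX_RETRIES = 3;

    /**
     * Decide whether a failed blob read should be retried.
     *
     * @param purpose                  why we are reading the blob
     * @param failuresWithoutProgress  consecutive failures since the last byte was
     *                                 read; failures where we made meaningful
     *                                 progress reset this to 0 and so don't count
     */
    static boolean shouldRetry(Purpose purpose, int failuresWithoutProgress) {
        return switch (purpose) {
            case INDEXING -> true;              // retry indefinitely for indexing ops
            case REPOSITORY_ANALYSIS -> false;  // never retry for repo analysis ops
            case OTHER -> failuresWithoutProgress < DEFAULT_MAX_RETRIES;
        };
    }
}
```

The point is that none of these three rules come out of the box from the Azure or GCS SDK retry settings, which is why a common wrapper layer is useful.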
# Conflicts:
#   muted-tests.yml
@elasticmachine run elasticsearch-ci/elasticsearch-serverless-es-pr-check

@elasticsearchmachine test this please
💚 All backports created successfully
Questions? Please refer to the Backport tool documentation
The reason for this test failure is that the GoogleCloudStorageBlobStoreRepositoryTests.GoogleErroneousHttpHandler#requestUniqueId implementation returns IDs that are not unique between nodes. This can cause the fault injection to exceed the maximum retries and fail the test. The reason it only showed up since the change is that this test moved from using time-limited retries (unlimited by number) to using limited-by-number retries.
Example interleaving:
(imagine client max retries is 3, mock server fails requests twice before allowing them to proceed)

1. node1 GET /download/storage/v1/b/bucket/o/tests-xxx/master.dat (failure 1)
2. node1 GET /download/storage/v1/b/bucket/o/tests-xxx/master.dat (failure 2)
3. node2 GET /download/storage/v1/b/bucket/o/tests-xxx/master.dat (proceeds, resetting the failure count shared with node1)
4. node1 GET /download/storage/v1/b/bucket/o/tests-xxx/master.dat (failure, count restarted for node1)
5. node1 GET /download/storage/v1/b/bucket/o/tests-xxx/master.dat (failure again for node1, client fails due to retries exceeded)

The PR adds a client-id header which we can include in the request unique ID to disambiguate between clients.
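The effect can be modelled with a small self-contained sketch (illustrative only, not the actual GoogleErroneousHttpHandler: it fails each unique request ID twice, then lets it through and forgets the ID, mirroring the interleaving above):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class FaultInjectionSketch {
    // Model of a fault-injecting handler: fail each unique request ID this many
    // times, then allow it through and reset its counter.
    static final int ERRORS_PER_REQUEST = 2;
    final Map<String, AtomicInteger> counts = new ConcurrentHashMap<>();

    /** @return true if the request is allowed to proceed, false if it is failed. */
    boolean handle(String requestUniqueId) {
        int count = counts.computeIfAbsent(requestUniqueId, id -> new AtomicInteger()).incrementAndGet();
        if (count > ERRORS_PER_REQUEST) {
            counts.remove(requestUniqueId); // allowed through; forget this request
            return true;
        }
        return false;
    }

    // Replay the interleaving above: two node1 attempts, one interleaved node2
    // attempt, then node1 retries until it is allowed through.
    static int failuresSeenByNode1(boolean includeClientId) {
        FaultInjectionSketch handler = new FaultInjectionSketch();
        String uri = "GET /download/storage/v1/b/bucket/o/tests-xxx/master.dat";
        String node1 = includeClientId ? "node1 " + uri : uri;
        String node2 = includeClientId ? "node2 " + uri : uri;
        int failures = 0;
        if (!handler.handle(node1)) failures++;    // node1 attempt 1
        if (!handler.handle(node1)) failures++;    // node1 attempt 2
        handler.handle(node2);                     // node2's attempt interleaves
        while (!handler.handle(node1)) failures++; // node1 retries until allowed
        return failures;
    }
}
```

Without the client id, both nodes share one counter: node2's request consumes node1's "success slot" and resets the counter, so node1 accumulates 4 failures and exhausts its 3 retries. With the client id included in the unique ID, node1 fails exactly the intended 2 times.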
Closes: #139556
Closes: #139646
Closes: #139665