
Fix testReadBlobWithReadTimeouts retries count #139999

Merged
mhl-b merged 12 commits into elastic:main from mhl-b:testReadBlobWithReadTimeouts-fix
Jan 15, 2026


Conversation

@mhl-b
Contributor

@mhl-b mhl-b commented Dec 25, 2025

Fix #139995

RetryingInputStream allows more retries than the maxRetry count configured for the blob store. If the stream manages to read a meaningful amount of bytes (1% of the read buffer size), it is given a free retry that does not count towards maxRetry.

Fix the test by counting these extra retries from RetryingInputStream.
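A minimal sketch of the retry accounting described above, under the assumption stated in the PR description (an attempt that read at least 1% of the buffer gets a "free" retry). Names and the 16 MiB buffer size are illustrative, not the actual RetryingInputStream API:

```java
// Sketch: a failed attempt that made "meaningful progress" (>= 1% of the read
// buffer) does not consume one of the maxRetries attempts. Illustrative only.
public class RetryAccountingSketch {
    static final long BUFFER_SIZE = 16 * 1024 * 1024;           // 16 MiB read buffer
    static final long MEANINGFUL_PROGRESS = BUFFER_SIZE / 100;  // 1% of the buffer

    /** Decides whether another retry is allowed after a failed attempt. */
    static boolean shouldRetry(long bytesReadThisAttempt, int countedFailures, int maxRetries) {
        if (bytesReadThisAttempt >= MEANINGFUL_PROGRESS) {
            return true; // free retry: meaningful progress, not counted
        }
        return countedFailures < maxRetries;
    }

    public static void main(String[] args) {
        // Stalled attempt with the retry budget exhausted: give up.
        System.out.println(shouldRetry(10, 3, 3));
        // Attempt that read >= 1% of the buffer: retry regardless of the counter.
        System.out.println(shouldRetry(MEANINGFUL_PROGRESS, 3, 3));
    }
}
```

This is why the total number of observed retries can exceed maxRetries, which is what the test fix accounts for.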

@mhl-b mhl-b added >test Issues or PRs that are addressing/adding tests :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. labels Dec 25, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@mhl-b mhl-b requested a review from DaveCTurner December 29, 2025 21:46
Comment on lines +269 to +270
final int meaningfulProgressSize = (int) (bufferSize.getBytes() / 100L);
final byte[] bytesPerRetry = randomByteArrayOfLength(meaningfulProgressSize / maxRetries);
Member


I don't understand the computation here. We need each attempt to deliver strictly fewer than bufferSize.getBytes() / 100L bytes to avoid it being considered "meaningful" but why also divide by maxRetries? Does the GCS SDK retry attempt to resume the download internally up to maxRetries times? If so, that's different from the S3 SDK behaviour, and I'd prefer we made that explicit.

Also if maxRetries == 1 then sometimes this'll send exactly bufferSize.getBytes() / 100L which is large enough to be meaningful. It needs to be strictly less than this size.

Also, rather than hard-coding the meaningful size to be bufferSize.getBytes() / 100L let's delegate the calculation to the repository somehow. Note that GCS repositories don't allow any control over the buffer size, it's always GoogleCloudStorageBlobStore.SDK_DEFAULT_CHUNK_SIZE i.e. 16MiB regardless of our choice of bufferSize above.

Contributor Author

@mhl-b mhl-b Jan 2, 2026


> I don't understand the computation here. We need each attempt to deliver strictly fewer than bufferSize.getBytes() / 100L bytes to avoid it being considered "meaningful" but why also divide by maxRetries?

I misunderstood currentStreamProgress, thanks for pointing that out. Now I see that we compare the offset of the RetryingInputStream with the current attempt's starting offset. I missed the part where we update the current offset during stream reads.

    private long currentStreamProgress() {
        if (currentStream == null) {
            return 0L;
        }
        return Math.subtractExact(Math.addExact(start, offset), currentStream.getFirstOffset());
    }
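A worked example of that arithmetic, with illustrative values (the field names mirror the snippet above, but this is a standalone sketch, not the real class):

```java
// Sketch of the currentStreamProgress() arithmetic quoted above: progress of
// the current (possibly resumed) attempt is the reader's absolute position
// (start + offset) minus where the current attempt began (firstOffset).
public class StreamProgressSketch {
    static long currentStreamProgress(long start, long offset, long firstOffset) {
        return Math.subtractExact(Math.addExact(start, offset), firstOffset);
    }

    public static void main(String[] args) {
        // The overall read began at byte 100, the reader has consumed 50 bytes
        // since then, and the current resumed attempt started at byte 120:
        long progress = currentStreamProgress(100, 50, 120);
        System.out.println(progress); // bytes delivered by the current attempt
    }
}
```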

> Does the GCS SDK retry attempt to resume the download internally up to maxRetries times?

It does not. GCS should be the same as S3 right now.

> Also if maxRetries == 1 then sometimes this'll send exactly bufferSize.getBytes() / 100L which is large enough to be meaningful. It needs to be strictly less than this size.

Not really: exchange -> contentPartSizes.add((long) sendIncompleteContent(exchange, bytes)) does not send the whole content, it is at least 1 byte short.

> Also, rather than hard-coding the meaningful size to be bufferSize.getBytes() / 100L let's delegate the calculation to the repository somehow.

Done. Exposed the meaningful progress size and added retry counting in the test: d3857cd

@mhl-b mhl-b requested a review from DaveCTurner January 2, 2026 20:46
int meaningfulProgressRetries = Math.toIntExact(
contentPartSizes.stream().filter(partSize -> partSize >= meaningfulProgressSize.get()).count()
);
assertThat(exception.getSuppressed().length, getMaxRetriesMatcher(maxRetries + meaningfulProgressRetries));
Contributor

@nicktindall nicktindall Jan 5, 2026


I think S3BlobContainerRetriesTest works around this problem by overriding getMaxRetriesMatcher

See

protected Matcher<Integer> getMaxRetriesMatcher(int maxRetries) {
// some attempts make meaningful progress and do not count towards the max retry limit
return allOf(greaterThanOrEqualTo(maxRetries), lessThanOrEqualTo(S3RetryingInputStream.MAX_SUPPRESSED_EXCEPTIONS));
}

What you've done here seems like an improvement because it is more specific, perhaps we can remove that override and in-line getMaxRetriesMatcher as part of (or subsequent to) this change?
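The exact-count approach under discussion could look roughly like this sketch (names like expectedSuppressed are illustrative; only contentPartSizes, meaningfulProgressSize, and maxRetries come from the test above):

```java
import java.util.List;

// Sketch of the exact-count idea: instead of accepting any value in the range
// [maxRetries, MAX_SUPPRESSED_EXCEPTIONS], count the attempts that made
// meaningful progress and expect exactly maxRetries plus that number.
public class ExactRetryCountSketch {
    static int expectedSuppressed(List<Long> contentPartSizes, long meaningfulProgressSize, int maxRetries) {
        long freeRetries = contentPartSizes.stream()
            .filter(partSize -> partSize >= meaningfulProgressSize)
            .count();
        return Math.toIntExact(maxRetries + freeRetries);
    }

    public static void main(String[] args) {
        // Two of four attempts delivered at least the meaningful-progress size,
        // so with maxRetries == 3 the test would expect 3 + 2 suppressed exceptions.
        System.out.println(expectedSuppressed(List.of(200L, 5L, 300L, 1L), 100L, 3));
    }
}
```

The tighter assertion makes the test fail loudly if the retry accounting ever changes, rather than silently passing anywhere inside a wide range.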

Contributor Author


Cool, I missed that override. Will remove this from S3 tests and make it similar to GCP.

Contributor Author

@mhl-b mhl-b Jan 9, 2026


Done: 8304536. getMaxRetriesMatcher is still in use in one other place, otherwise I could have removed it.

@mhl-b mhl-b requested a review from nicktindall January 9, 2026 20:11
@mhl-b
Contributor Author

mhl-b commented Jan 9, 2026

@nicktindall, @DaveCTurner, ready for another review

  httpServer.createContext(
      downloadStorageEndpoint(blobContainer, "read_blob_incomplete"),
-     exchange -> sendIncompleteContent(exchange, bytes)
+     exchange -> retryContentSizes.add((long) sendIncompleteContent(exchange, bytes))
Contributor


I wonder whether this might still flake, for example if the server thinks it sent a response > meaningful progress size, but due to buffering etc in the client infrastructure (they are all different I think), we might end up seeing < meaningful progress size in the client.

Assuming it doesn't, this LGTM

Contributor Author


I ran several successful tests with a response size of exactly the meaningful byte count. I would not expect the client to sit on buffered data without handing it to the application, though it can prefetch more if available. In this case, whatever the fixture flushes to the socket should be immediately visible to the client.

Contributor

@nicktindall nicktindall left a comment


LGTM with comment

Member

@DaveCTurner DaveCTurner left a comment


LGTM2

@mhl-b mhl-b merged commit 9e9d277 into elastic:main Jan 15, 2026
35 checks passed
spinscale pushed a commit to spinscale/elasticsearch that referenced this pull request Jan 21, 2026
@mhl-b
Contributor Author

mhl-b commented Feb 2, 2026

💚 All backports created successfully

Status Branch Result
9.3

Questions?

Please refer to the Backport tool documentation


Labels

:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. >test Issues or PRs that are addressing/adding tests v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] GoogleCloudStorageBlobContainerRetriesTests testReadBlobWithReadTimeouts failing

4 participants