Skip to content

transport: retry HTTP 429 (Too Many Requests)#2301

Merged
Subserial merged 1 commit into
google:mainfrom
iahsanGill:fix/retry-http-429
May 18, 2026
Merged

transport: retry HTTP 429 (Too Many Requests)#2301
Subserial merged 1 commit into
google:mainfrom
iahsanGill:fix/retry-http-429

Conversation

@iahsanGill

Copy link
Copy Markdown
Contributor

Summary

Fixes #2111.

As filed by @knqyf263, the retry transport never retries HTTP 429 responses even though TooManyRequestsErrorCode is already classified temporary in pkg/v1/remote/transport/error.go. The retry transport short-circuits on its codes list before consulting Temporary(), and since 429 isn't in defaultRetryStatusCodes, the conversion to a temporary *transport.Error never happens and retry.IsTemporary never gets called.

The fix is symmetric across the two parallel "is this transient" classifications:

Where Before After
defaultRetryStatusCodes in pkg/v1/remote/options.go 408, 500, 502, 503, 504, 499, 522 adds http.StatusTooManyRequests
temporaryStatusCodes in pkg/v1/remote/transport/error.go 408, 500, 502, 503, 504 adds http.StatusTooManyRequests for parity with temporaryErrorCodes[TooManyRequestsErrorCode]

429 is now retried by the default transport (Duration: 1s, Factor: 3, Steps: 3, ~1.3s total).

Interaction with pkg/gcrane/copy.go

gcrane's copy path has its own outer retry loop that explicitly handles 429:

b := retry.IsTemporary(err) || hasStatusCode(err, http.StatusTooManyRequests) || isServerError(err)

After this change the exhausted *transport.Error still carries StatusCode == 429, so hasStatusCode keeps matching and GCRBackoff continues to drive longer-term retry at the copy-operation layer. The two layers compose cleanly:

  • transport-level: ~1.3s per request, handles transient 429 spikes
  • gcrane-level: ~6s → 1min → 10min per copy, handles longer throttling windows

A test asserts the final response still surfaces with StatusCode == 429, so this composition is verified.

Out of scope

Honoring the Retry-After response header. The retry backoff is currently static — honoring a server-dictated delay would need plumbing through internal/retry/wait. Worth a follow-up; not included here to keep the change focused on the reported bug.

Test plan

  • TestRetryTransport_TooManyRequests in pkg/v1/remote/transport/retry_test.go: three 429s through the transport, asserts all three attempts made and the final response surfaces with StatusCode == 429.
  • TestTemporaryStatusCodes_Includes429 in the same file: pins 429 into the temporaryStatusCodes map so the raw-status fallback stays in sync with temporaryErrorCodes[TooManyRequestsErrorCode].
  • TestDefaultRetryStatusCodes_Includes429 in new pkg/v1/remote/options_test.go: pins 429 into the default list itself.
  • go test ./... -count=1 full suite passes locally.
  • gofmt, goimports, golangci-lint v2.11 clean.
  • woke and US-locale misspell clean on the diff.

Fixes google#2111. As filed by knqyf263, 429 responses are never retried by
the default retry transport even though TooManyRequestsErrorCode is
already classified temporary in transport/error.go.

The retry transport in pkg/v1/remote/transport/retry.go short-circuits
on its codes list before consulting Temporary(); since 429 isn't in
defaultRetryStatusCodes, the conversion to a temporary transport.Error
never happens and retry.IsTemporary never gets called.

The fix is symmetric across the two parallel classifications:

  * defaultRetryStatusCodes in pkg/v1/remote/options.go now includes
    http.StatusTooManyRequests, so the retry transport wraps a 429
    response into a temporary *transport.Error and retries it through
    the existing backoff (Duration 1s, Factor 3, Steps 3).

  * temporaryStatusCodes in pkg/v1/remote/transport/error.go also
    gains http.StatusTooManyRequests, bringing the raw-status fallback
    path into alignment with temporaryErrorCodes which has already
    been classifying TooManyRequestsErrorCode as temporary.

The exhausted *transport.Error still carries StatusCode == 429, so
pkg/gcrane/copy.go's hasStatusCode helper continues to route 429s
through GCRBackoff at the application layer. The two retry layers
compose: ~1.3s of transport-level retry per request, then ~6s+ of
copy-level retry per image. No existing test expectations change.

Out of scope here: honoring the Retry-After response header. The retry
backoff is currently static; honoring server-dictated delays would
need plumbing through the retry/wait package. Worth a follow-up.

Tests:

  * pkg/v1/remote/transport/retry_test.go gains
    TestRetryTransport_TooManyRequests, which exercises three 429s
    through the retry transport and asserts both that all three
    attempts are made and that the final response surfaces with
    StatusCode == 429 (so gcrane's hasStatusCode keeps matching).
    It also pins the new entry in temporaryStatusCodes.

  * pkg/v1/remote/options_test.go (new file) pins 429 in the default
    retry status code list. A future refactor that drops 429 will
    fail loudly before the regression ships.

Verified locally:
  go test ./... (full suite green)
  gofmt + goimports clean
  golangci-lint v2.11: 0 issues
  woke clean on touched files
  reviewdog/action-misspell US: clean on the diff
@codecov-commenter

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.89%. Comparing base (68a569e) to head (94295b7).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2301   +/-   ##
=======================================
  Coverage   56.89%   56.89%           
=======================================
  Files         165      165           
  Lines       11331    11331           
=======================================
  Hits         6447     6447           
  Misses       4120     4120           
  Partials      764      764           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Subserial Subserial merged commit c55facd into google:main May 18, 2026
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HTTP 429 (Too Many Requests) errors are not retried despite being marked as temporary

3 participants