
fix: token cache exits after exhaustive retries #4551

Merged
alvarowolfx merged 23 commits into googleapis:main from alvarowolfx:fix-token-cache-poisoned
Feb 10, 2026

Conversation

@alvarowolfx
Collaborator

Towards #4541

@codecov

codecov bot commented Feb 4, 2026

Codecov Report

❌ Patch coverage is 83.33333% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.02%. Comparing base (2728c51) to head (2766e7a).
⚠️ Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
src/auth/src/token_cache.rs 75.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4551      +/-   ##
==========================================
- Coverage   95.03%   95.02%   -0.02%     
==========================================
  Files         195      195              
  Lines        7473     7475       +2     
==========================================
+ Hits         7102     7103       +1     
- Misses        371      372       +1     


@alvarowolfx alvarowolfx changed the title fix: token cache exists after exaustive retries fix: token cache exits after exaustive retries Feb 4, 2026
@alvarowolfx alvarowolfx changed the title fix: token cache exits after exaustive retries fix: token cache exits after exhaustive retries Feb 4, 2026
@alvarowolfx alvarowolfx marked this pull request as ready for review February 6, 2026 15:56
@alvarowolfx alvarowolfx requested review from a team, coryan and dbolduc February 6, 2026 15:56
Comment on lines +137 to +141
             Err(err) => {
                 // The retry policy has been used already by the inner token provider.
-                // If it ended in an error, just quit the background task.
-                break;
+                // If it ended in an error, the background task will wait for a while
+                // and try again. This allows the task to eventually recover if the
+                // error was transient but exhausted all the retries.
Collaborator

This comment does not read right.

Suggested change
-            Err(err) => {
-                // The retry policy has been used already by the inner token provider.
-                // If it ended in an error, just quit the background task.
-                break;
-                // If it ended in an error, the background task will wait for a while
-                // and try again. This allows the task to eventually recover if the
-                // error was transient but exhausted all the retries.
+            Err(err) => {
+                // On transient errors, even if the retry policy is exhausted,
+                // we want to continue running this retry loop.
+                // This loop cannot stop because that may leave the
+                // credentials in an unrecoverable state (see #...).
+                // We considered using a notification to wake up the next time
+                // a caller wants to retrieve a token, but that seemed prone to
+                // deadlocks. We may implement this as an improvement (#....).
+                // On permanent errors, then there is really no point in trying
+                // again, by definition of "permanent". If the error was misclassified
+                // as permanent, that is a bug in the retry policy and better fixed
+                // there than implemented as a workaround here.

Which prompts the question: are we certain our default policy does not misclassify some errors as permanent? Can you check?
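
The wait-and-retry behavior the suggested comment describes can be sketched as a minimal loop. All names below are hypothetical stand-ins (the real implementation lives in src/auth/src/token_cache.rs and uses the crate's CredentialsError, not this toy FetchError):

```rust
use std::thread::sleep;
use std::time::Duration;

// Hypothetical error type; `transient` mirrors the transient/permanent
// distinction discussed in this PR.
#[derive(Debug)]
struct FetchError {
    transient: bool,
}

// Background refresh loop, sketched: on a transient error the loop waits
// and tries again even though the inner retry policy was exhausted, so the
// credentials never end up in an unrecoverable state. On a permanent error
// there is no point in continuing, so the task quits.
fn refresh_loop<F>(mut fetch: F, backoff: Duration) -> Option<String>
where
    F: FnMut() -> Result<String, FetchError>,
{
    loop {
        match fetch() {
            // Cache the token; the real task would schedule the next refresh.
            Ok(token) => return Some(token),
            Err(err) if err.transient => {
                // Exhausted retries on a transient error: wait and try again.
                sleep(backoff);
            }
            // Permanent error: quit the background task.
            Err(_) => return None,
        }
    }
}
```

The real task is async and uses timers rather than a blocking sleep; the sketch only illustrates the control flow.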

Collaborator Author


The retry loop always maps errors to gax authentication errors, so I think that part is correct; I don't see any other gax error being raised. If that were not the case I would be worried, because https://github.com/googleapis/google-cloud-rust/pull/4551/changes#diff-43612f24f6413f8e5304d6af3e912553d7913ceea55c203c549db570b52d0d30L133 treats all non-auth errors as permanent.

On the auth side, I audited every place that creates CredentialsError::from_msg(false, ...), CredentialsError::new(false, ...), CredentialsError::from_source, or CredentialsError::from_http_response, and they mostly classify as non-transient the transport and decoding errors that occur when fetching data from remote services, so I think that's correct.

I could not yet find any case of misclassification, but you raised a valid point.

Collaborator Author


I updated the comment and created an issue to track future improvement on the retry loop.

Collaborator


I am not sure the parsing errors are permanent. We sent a request, got a bad response (sometimes the load balancers like to return silly things). The next attempt may succeed. OTOH, maybe the endpoint is configured wrong and we got a nice cocktail recipe 🤷

The error classification is too close to the source, we should let the retry policy decide what errors are transient vs. permanent, instead of hard-coding that decision in each function.
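
One way to move the decision out of the error constructors, as suggested, is to let a retry policy classify errors. This is only a sketch with hypothetical names (ErrorKind, RetryPolicy, LenientPolicy are not the crate's types):

```rust
// Hypothetical coarse error categories, tagged where the error arises.
#[derive(Debug)]
enum ErrorKind {
    Transport,
    Decode,
    Auth,
}

// The retry policy, not the error constructor, decides what is transient.
trait RetryPolicy {
    fn is_transient(&self, kind: &ErrorKind) -> bool;
}

// A policy that treats decode errors as retryable, since a load balancer
// may return a malformed body that a later attempt would not see.
struct LenientPolicy;

impl RetryPolicy for LenientPolicy {
    fn is_transient(&self, kind: &ErrorKind) -> bool {
        matches!(kind, ErrorKind::Transport | ErrorKind::Decode)
    }
}
```

With this shape, changing the classification of parsing errors is a one-line policy change instead of an audit of every constructor call site.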

Collaborator


Can you open a bug to track these things? It may be too late to change, but we should think it through.

Collaborator

@coryan coryan left a comment


I think this is better than what we have today, so approved. It could also get some improvements.


Member

@dbolduc dbolduc left a comment


(not important. I just like doing puzzles)

-let expiry = token_result.as_ref().ok().map(|t| t.expires_at);
-let tagged = token_result.map(|token| {
+let expiry = token_result.as_ref().map(|t| t.expires_at);
+let tagged = token_result.clone().map(|token| {
Member


silly nit: you can avoid the clone.

enum Refresh {
  Deadline(Instant),
  Never,
  Soon,
}

let refresh = match token_result.as_ref() {
  Ok(t) => {
    match t.expires_at {
      Some(d) => Refresh::Deadline(d),
      None => Refresh::Never,
    }
  },
  Err(e) if e.is_transient() => Refresh::Soon,
  Err(_) => Refresh::Never,
};

But maybe you think saving one clone is silly for a token that is cloned on every request anyway.
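
For completeness, the Refresh sketch compiles once fleshed out. Here is a runnable version with stand-in Token and Error types (both hypothetical, not the crate's real types):

```rust
use std::time::Instant;

// Hypothetical token: only the expiry matters for scheduling a refresh.
struct Token {
    expires_at: Option<Instant>,
}

// Hypothetical error with the transient/permanent distinction.
struct Error {
    transient: bool,
}

impl Error {
    fn is_transient(&self) -> bool {
        self.transient
    }
}

// When should the background task wake up next?
enum Refresh {
    Deadline(Instant), // refresh around the token's expiry
    Never,             // non-expiring token, or permanent error
    Soon,              // transient error: retry after a short backoff
}

// Classify by borrowing the result: `as_ref()` yields Result<&Token, &Error>,
// so `token_result` stays available afterwards and no clone is needed.
fn classify(token_result: &Result<Token, Error>) -> Refresh {
    match token_result.as_ref() {
        Ok(t) => match t.expires_at {
            Some(d) => Refresh::Deadline(d),
            None => Refresh::Never,
        },
        Err(e) if e.is_transient() => Refresh::Soon,
        Err(_) => Refresh::Never,
    }
}
```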

@alvarowolfx alvarowolfx requested a review from a team as a code owner February 10, 2026 14:13
@alvarowolfx alvarowolfx merged commit d36c42a into googleapis:main Feb 10, 2026
34 checks passed
