fix(auth): preserve valid session on refresh failure and cooldown repeat failures #2436

Merged
mandarini merged 4 commits into master from fix/auth-refresh-preserve-session-2 on Jun 11, 2026

@mandarini

@mandarini mandarini commented Jun 10, 2026

Contributor

Description

Fixes the long-standing _callRefreshToken issue where a transient or non-retryable refresh failure destroyed a session whose access token was still valid, and the related symptom where a sustained outage had the SDK hammer /token with the same dead refresh token until the access token actually expired or the user wiped local storage.

This is the second attempt — see #2430 for the prior approach. Key difference: that PR preserved storage in _callRefreshToken but __loadSession still translated the refresh error into { session: null, error }, so getSession() callers stayed effectively logged out — the same failure mode that got #2146 rejected. This PR closes that gap end-to-end while keeping explicit refresh entry points (refreshSession, setSession) honest about failures.

What changed?

Five complementary changes, all in packages/core/auth-js/:

  • Proactive vs reactive in _callRefreshToken. On non-retryable error, re-read storage and skip _removeSession if the access token is still inside its real expiry window. Return shape unchanged — explicit callers still see the underlying error.
  • Caller-visible preservation in __loadSession only. When _callRefreshToken errors but the in-scope currentSession is still valid, hand the caller the preserved session instead of { session: null, error }. Guards against a concurrent signOut clearing storage during the refresh attempt by re-reading storage before returning. Scoped to __loadSession deliberately so refreshSession() / setSession() keep their honest error semantics.
  • Serial-failure cooldown cache. Any refresh failure is cached on the client for REFRESH_FAILURE_COOLDOWN_MS (60s, two auto-refresh ticks). Subsequent serial callers within that window receive the cached failure synchronously instead of firing another /token call. Cleared on any successful refresh, on _removeSession, and on a TOKEN_REFRESHED / SIGNED_IN broadcast from another tab.
  • Wider transient classification in lib/fetch.ts. NETWORK_ERROR_CODES now includes 500, 501, and the Cloudflare-origin 525-529 codes. Previously these were misclassified as non-retryable, which on the old catch path triggered _removeSession() during outages.
  • Strip the redundant _removeSession in _recoverAndRefresh. _callRefreshToken's catch is now the single source of truth for "session is dead enough to wipe." Stops the double-SIGNED_OUT during init that @nathanschram flagged on #2145, and prevents the new proactive-preserve from being undone at init time.
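
The serial-failure cooldown from the third bullet can be sketched as a small cache with an injectable clock. This is a minimal illustration, not the code merged in this PR: the `RefreshFailureCooldown` class, its method names, and the `RefreshFailure` shape are hypothetical. Only the `REFRESH_FAILURE_COOLDOWN_MS` name and its 60s value come from the description above.

```typescript
// Hypothetical sketch of a refresh-failure cooldown cache (not the auth-js source).
const REFRESH_FAILURE_COOLDOWN_MS = 60 * 1000 // 60s, per the PR description

interface RefreshFailure {
  message: string
  status?: number
}

class RefreshFailureCooldown {
  private cached: { failure: RefreshFailure; at: number } | null = null
  private readonly now: () => number

  // The clock is injectable so the window can be exercised without real timers.
  constructor(now: () => number = () => Date.now()) {
    this.now = now
  }

  // Remember the most recent failure and when it happened.
  record(failure: RefreshFailure): void {
    this.cached = { failure, at: this.now() }
  }

  // Return the cached failure while the window is open, otherwise null.
  peek(): RefreshFailure | null {
    if (this.cached === null) return null
    if (this.now() - this.cached.at >= REFRESH_FAILURE_COOLDOWN_MS) {
      this.cached = null // window elapsed: a real /token attempt is allowed again
      return null
    }
    return this.cached.failure
  }

  // Called on a successful refresh, on session removal, or on a cross-tab
  // TOKEN_REFRESHED / SIGNED_IN broadcast.
  clear(): void {
    this.cached = null
  }
}
```

A caller would check `peek()` before issuing a `/token` request and return the cached failure if one is present, which is how many serial calls inside the window collapse into a single network request.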

Why was this change needed?

Two real-world failure modes converge on the same code path:

  1. Proactive refresh destroying still-valid sessions. __loadSession() triggers a refresh whenever the access token is within EXPIRY_MARGIN_MS (90s) of expiry. If that refresh failed with any non-retryable error (multi-tab rotation race, mobile-browser tab lifecycle, transient 400 from GoTrue), _removeSession() was called unconditionally, destroying a session whose access token still worked for up to 90 more seconds. The user was silently logged out with getSession() returning { session: null, error: null } and no recovery path short of a full reload and re-login.
  2. Refresh storm during outages. When the same /token call kept failing (DNS unreachable, persistent 4xx/5xx), every subsequent getSession() call in the 90s margin re-fired _callRefreshToken against the same broken refresh token. Reporters on #2145 documented hundreds to tens of thousands of /token requests per hour from a single client, all hitting the same failure.
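
The decision behind failure mode (1) can be sketched as a pure function. This is a simplified model under stated assumptions, not the merged implementation: `actionOnRefreshFailure`, `StoredSession`, and the `FailureAction` values are hypothetical names, and treating a missing stored session as "remove" is an assumption. The `expires_at` field is taken to be seconds since the Unix epoch.

```typescript
// Hypothetical decision helper (not the auth-js source).
type FailureAction = 'keep-session' | 'remove-session'

interface StoredSession {
  expires_at?: number // seconds since the Unix epoch
}

function actionOnRefreshFailure(
  stored: StoredSession | null, // storage re-read after the failed refresh
  retryable: boolean, // true for transient/network-class failures
  nowMs: number
): FailureAction {
  // Retryable failures never wipe the session, regardless of expiry.
  if (retryable) return 'keep-session'
  // Assumption: with nothing usable in storage there is nothing to preserve.
  if (stored === null || stored.expires_at === undefined) return 'remove-session'
  // Proactive case: the access token is still inside its real expiry window.
  // Reactive case: it has already expired, so the session is genuinely dead.
  return stored.expires_at * 1000 > nowMs ? 'keep-session' : 'remove-session'
}
```

The comparison uses the token's actual expiry rather than the 90s refresh margin, which is what separates a proactive refresh failure from a reactive one.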

The proactive/reactive distinction in _callRefreshToken plus the __loadSession mirror address (1). The cooldown cache addresses (2) by capping /token calls to one per 60s window during sustained failure. The widened NETWORK_ERROR_CODES ensures common outage status codes are classified as transient instead of dragging the session into the reactive-removal path — the Reddit r/Supabase report of a real outage signing out entire mobile-app user bases was 500 + HTML body responses falling into the non-retryable branch.

Closes #2145

Screenshots/Examples

Before:

// access token still valid for 60s, refresh fails with 400 invalid_grant
await supabase.auth.getSession()
// returns { data: { session: null }, error: null }
// storage cleared, SIGNED_OUT emitted, user logged out

After:

// access token still valid for 60s, refresh fails with 400 invalid_grant
await supabase.auth.getSession()
// returns { data: { session: <existing valid session> }, error: null }
// storage preserved, no SIGNED_OUT emitted, access token still works
// next refresh attempt deferred by REFRESH_FAILURE_COOLDOWN_MS (60s)

refreshSession() and setSession() are unchanged — they still surface the refresh error to their callers so they don't lie about whether the token actually rotated.
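
The difference between the two caller-facing contracts can be modelled roughly as follows. This is a sketch under assumptions, not the library code: the function names and the `SessionLike` / `SessionResult` shapes are hypothetical, errors are reduced to strings, and both the choice of which session object is returned and the contents of the error field in the non-preserved case are assumptions.

```typescript
// Hypothetical model of the two caller-facing contracts (not the auth-js source).
interface SessionLike {
  access_token: string
  expires_at: number // seconds since the Unix epoch
}

interface SessionResult {
  session: SessionLike | null
  error: string | null
}

// getSession()-style path: prefer the still-valid session over the refresh error.
function loadSessionAfterFailedRefresh(
  current: SessionLike,
  storedNow: SessionLike | null, // storage re-read after the refresh attempt
  refreshError: string,
  nowMs: number
): SessionResult {
  const stillValid = current.expires_at * 1000 > nowMs
  // A null re-read means something (e.g. a concurrent signOut) cleared storage.
  if (stillValid && storedNow !== null) {
    return { session: current, error: null }
  }
  return { session: null, error: refreshError }
}

// refreshSession()/setSession()-style path: always report the failure.
function explicitRefreshAfterFailure(refreshError: string): SessionResult {
  return { session: null, error: refreshError }
}
```

Keeping the preservation logic out of the explicit path means a caller who asked for a rotation is never told it succeeded when it did not.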

Breaking changes

  • This PR contains no breaking changes

No public API changes, no exported type changes, no method signatures changed. Three observable behavior changes worth calling out:

  1. Error class for 500/525-529 from auth. Previously AuthApiError, now AuthRetryableFetchError. Both extend AuthError, so catch (e) { if (e instanceof AuthError) ... } is unaffected. Only instanceof AuthApiError for those specific status codes would stop matching.
  2. Fewer spurious SIGNED_OUT events. onAuthStateChange callbacks see strictly fewer SIGNED_OUT events — only when the session is genuinely dead, never extra. Init-time non-retryable refresh now fires SIGNED_OUT exactly once instead of twice.
  3. getSession() returns the preserved session in proactive-preserve scenarios it previously returned null for. This is the headline bug fix.

The auto-refresh ticker cadence (AUTO_REFRESH_TICK_DURATION_MS) is unchanged. The commit-guard logic for mid-flight signOut races is unchanged. refreshSession() / setSession() semantics are unchanged.
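
To illustrate behavior change 1, the sketch below uses local stand-in classes that mirror only the inheritance relationship described above (both error classes extend a common base). The `*StandIn` names and the `handleAuthFailure` helper are hypothetical. In application code the real classes would be imported from the auth library instead.

```typescript
// Stand-in classes mirroring only the inheritance described in the PR text.
class AuthErrorStandIn extends Error {
  status: number | undefined
  constructor(message: string, status?: number) {
    super(message)
    this.status = status
  }
}
class AuthApiErrorStandIn extends AuthErrorStandIn {}
class AuthRetryableFetchErrorStandIn extends AuthErrorStandIn {}

type Handling = 'retry-later' | 'report' | 'rethrow'

// Checks the retryable class first, then falls back to the shared base class,
// so the handler does not depend on which subclass a given status maps to.
function handleAuthFailure(e: unknown): Handling {
  if (e instanceof AuthRetryableFetchErrorStandIn) return 'retry-later'
  if (e instanceof AuthErrorStandIn) return 'report'
  return 'rethrow'
}
```

Code that branches on the base class keeps working across this change. Code that matched the API-error subclass specifically for the reclassified status codes would need a branch like the first one.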

Checklist

  • I have read the Contributing Guidelines
  • My PR title follows the conventional commit format: <type>(<scope>): <description>
  • I have run pnpm nx format to ensure consistent code formatting
  • I have added tests for new functionality (if applicable)
  • I have updated documentation (if applicable)

Additional notes

The cooldown shape follows @thomaslarsson's lastRefreshResult sketch on #2145. Concurrent dedupe via refreshingDeferred is preserved; the cooldown extends the dedupe contract to serial callers spaced across short failure windows.
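
For readers unfamiliar with the concurrent dedupe contract mentioned above, here is a minimal sketch of the general technique: callers that arrive while a refresh is in flight share the same promise. The `RefreshDeduper` class is hypothetical and is not how `refreshingDeferred` is implemented in the library.

```typescript
// Hypothetical in-flight dedupe helper (illustrative only).
class RefreshDeduper<T> {
  private inFlight: Promise<T> | null = null

  run(doRefresh: () => Promise<T>): Promise<T> {
    // A refresh is already running: hand back the same promise.
    if (this.inFlight !== null) return this.inFlight

    const p = doRefresh()
    this.inFlight = p

    // Release the slot once the attempt settles, whether it succeeded or failed.
    const release = (): void => {
      if (this.inFlight === p) this.inFlight = null
    }
    p.then(release, release)

    return p
  }
}
```

Dedupe of this kind only covers overlapping calls. Once the promise settles, the next caller starts a fresh attempt, which is the gap a serial-failure cooldown is meant to cover.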

Tests added under a new describe('Refresh-token lifecycle (proactive/reactive, cooldown)') block in GoTrueClient.test.ts, split into five sub-describes:

  • storage preservation: _callRefreshToken preserves on proactive failure, removes on reactive, preserves on retryable network failure regardless of expiry
  • caller-visible preservation in getSession: returns preserved session on proactive-preserve, null on reactive, null when storage cleared concurrently (race guard)
  • explicit-caller contract: refreshSession() and setSession() still surface the error on proactive-preserve scenarios
  • failure cooldown: 50 serial calls collapse to 1 /token, cleared on success, cleared in _removeSession, expires after REFRESH_FAILURE_COOLDOWN_MS (verified with fake timers)
  • init cleanup: _recoverAndRefresh emits SIGNED_OUT exactly once on non-retryable refresh failure

The BroadcastChannel cache-clear branch is not unit tested — globalThis.BroadcastChannel is undefined in the Jest node env and adding a stub for one small branch isn't worth the surface area. Inline comment in the test file documents this.

@mandarini mandarini requested review from a team as code owners June 10, 2026 12:42
@mandarini mandarini marked this pull request as draft June 10, 2026 12:42
@github-actions github-actions Bot added the auth-js Related to the auth-js library. label Jun 10, 2026
@pkg-pr-new

pkg-pr-new Bot commented Jun 10, 2026

@supabase/auth-js

npm i https://pkg.pr.new/@supabase/auth-js@2436

@supabase/functions-js

npm i https://pkg.pr.new/@supabase/functions-js@2436

@supabase/postgrest-js

npm i https://pkg.pr.new/@supabase/postgrest-js@2436

@supabase/realtime-js

npm i https://pkg.pr.new/@supabase/realtime-js@2436

@supabase/storage-js

npm i https://pkg.pr.new/@supabase/storage-js@2436

@supabase/supabase-js

npm i https://pkg.pr.new/@supabase/supabase-js@2436

commit: 05eccde

@mandarini mandarini self-assigned this Jun 10, 2026
@mandarini mandarini marked this pull request as ready for review June 10, 2026 12:55
Comment thread packages/core/auth-js/src/GoTrueClient.ts Outdated
@mandarini mandarini merged commit ad23adf into master Jun 11, 2026
39 of 40 checks passed
@mandarini mandarini deleted the fix/auth-refresh-preserve-session-2 branch June 11, 2026 06:51
@thomaslarsson

@mandarini Thank you for taking my report seriously and trying to fix the issues. I really appreciate it and want to try to provide some value back to the community. I installed 2.107.0-beta.1 on June 1st around midnight Europe/Oslo time. After that, I added operational email notifications for every time one of our guards tripped. We email on auth guard trips (5 min aggregation window; counting individual emails ≈ counting events).

Some stats:

| Guard | Email threads | Total emails | Share |
| --- | --- | --- | --- |
| circuit_breaker_trip | 11 | 60 | 82% |
| middleware_legacy_cleanup | 8 | 13 | 18% |
| Total | 19 | 73 | |

My analysis of this PR shows:

Expected impact of #2436 on our guards

Likely helps a lot — circuit_breaker_trip (~82% of alerts)

| PR change | Why it maps to our production pain |
| --- | --- |
| Proactive preserve in _callRefreshToken | Transient refresh failures no longer _removeSession while access token still valid → fewer silent logouts and re-auth loops |
| __loadSession returns preserved session | getSession() callers stay logged in instead of { session: null } → less client thrashing |
| 60s refresh-failure cooldown | Caps serial /token hammer, which directly reduces trips on our 3 refreshes/min budget |
| 500 / CF 525–529 → retryable | Outages no longer drag sessions into reactive removal |
| Single _removeSession in _recoverAndRefresh | Fewer spurious SIGNED_OUT → fewer re-armed refresh loops |

Expectation: circuit_breaker_trip emails should drop sharply after upgrade. We'll keep the breaker as a safety net for stale deploy bundles, dead refresh tokens, and cookie-domain edge cases auth-js can't reach.

I will deploy 2.108.2-beta.5 to production now and report back later. Hopefully I don't keep getting those specific operational emails the next couple of weeks. 😅

@mandarini

Contributor Author

Thank you @thomaslarsson for the detailed report!!! :D :D Let me know how the testing goes! If an issue comes up, can you please open a new issue on supabase-js and tag me? It will be easier to track!! Thank you SO much! 💚

mandarini pushed a commit to supabase/supabase that referenced this pull request Jun 16, 2026
This PR updates @supabase/*-js libraries to version 2.108.2.

**Source**: supabase-js-stable-release

**Changes**:
- Updated @supabase/supabase-js to 2.108.2
- Updated @supabase/auth-js to 2.108.2
- Updated @supabase/realtime-js to 2.108.2
- Updated @supabase/postgrest-js to 2.108.2
- Refreshed pnpm-lock.yaml

---

## Release Notes

## v2.108.2

## 2.108.2 (2026-06-15)

### 🩹 Fixes

- **auth:** preserve valid session on refresh failure and cooldown
repeat failures
([#2436](supabase/supabase-js#2436))
- **realtime:** clarify httpSend() 404 error and server migration note
([#2444](supabase/supabase-js#2444))
- **release:** pin Deno and bound JSR publish to survive stranded-task
hangs ([#2439](supabase/supabase-js#2439))
- **release:** restore JSR publish flags and enable for beta
([#2440](supabase/supabase-js#2440))

### ❤️ Thank You

- Katerina Skroumpelou @mandarini

## v2.108.1

## 2.108.1 (2026-06-09)

### 🩹 Fixes

- **ci:** forward DOGFOOD_APP_CLIENT_ID to dogfood workflow
([#2434](supabase/supabase-js#2434))
- **postgrest:** then typing
([#2349](supabase/supabase-js#2349))

### ❤️ Thank You

- Katerina Skroumpelou @mandarini
- Vaibhav @7ttp

This PR was created automatically.

Co-authored-by: supabase-workflow-trigger[bot] <266661614+supabase-workflow-trigger[bot]@users.noreply.github.com>
Linked issue: #2145, "Bug: _callRefreshToken permanently deletes session on non-retryable refresh failure, even when access token is still valid"

3 participants