
JWT-only OS auth: no auth-worker roundtrips on page loads#1408

Merged
jonastemplestein merged 5 commits into main from glib-submarine on Jun 9, 2026

jonastemplestein commented Jun 9, 2026

What

Authenticate normal OS requests purely from signed JWT claims so the OS worker never makes a network roundtrip to the auth worker on page loads — including on cold isolate starts.

  • JWT-only request auth: OS middleware calls auth.authenticate({ includeUserInfo: false }); user/org/project session data comes entirely from access/ID token claims.
  • Static JWKS: the auth SDK accepts a static JWKS (ITERATE_AUTH_JWKS / APP_CONFIG_ITERATE_AUTH__JWKS, typesafe via the AppConfig schema) and uses createLocalJWKSet, eliminating the remote JWKS fetch on the first verification in each isolate. It falls back to createRemoteJWKSet when the value is unset.
  • Org names in access-token claims (apps/auth), so sessions don't need userinfo for display names.
  • Deleted the active/current-organization context: route auth validates org membership synchronously from session claims, and in oRPC the active-organization middleware is replaced by an authenticated-user middleware plus signed principal/project claims. Net −172 lines.
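As a rough sketch of the static-JWKS fallback, assuming the config value arrives as a JSON string and that the issuer serves its key set at `<issuer>/jwks` (the path the later deploy-time fetch uses), the choice between a local and a remote key set might look like this. `selectJwksSource`, `JwksSource`, and the key-set types are hypothetical; in practice each branch would be handed to jose's `createLocalJWKSet` or `createRemoteJWKSet`.

```typescript
// Hypothetical shapes: the real AppConfig schema and SDK options may differ.
interface Jwk {
  kty: string;
  kid?: string;
  [param: string]: unknown;
}

interface JsonWebKeySet {
  keys: Jwk[];
}

type JwksSource =
  | { kind: "local"; jwks: JsonWebKeySet } // would be passed to jose's createLocalJWKSet
  | { kind: "remote"; url: URL }; // would be passed to jose's createRemoteJWKSet

// Prefer a static JWKS baked into config; otherwise fetch keys from the issuer at runtime.
function selectJwksSource(issuer: string, staticJwks: string | undefined): JwksSource {
  if (staticJwks !== undefined && staticJwks.trim() !== "") {
    const parsed: unknown = JSON.parse(staticJwks);
    if (
      typeof parsed !== "object" ||
      parsed === null ||
      !Array.isArray((parsed as { keys?: unknown }).keys)
    ) {
      throw new Error("static JWKS must be a JSON object with a `keys` array");
    }
    return { kind: "local", jwks: parsed as JsonWebKeySet };
  }
  // Assumed path, matching the `<issuer>/jwks` URL the deploy step fetches.
  return { kind: "remote", url: new URL(`${issuer.replace(/\/+$/, "")}/jwks`) };
}
```

Throwing on a malformed value, rather than silently falling back to the remote set, is an illustrative policy choice here: a broken baked-in JWKS is arguably a deploy bug worth surfacing.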

Remaining OS → auth-worker calls (by design): explicit project creation (mutation boundary) and OAuth token refresh (~5 min token expiry).

Verification

  • pnpm typecheck && pnpm lint && pnpm format && pnpm test all green.
  • Static checks: no activeOrganization/current-org references remain; only expected auth-worker usages (JWKS wiring, includeUserInfo: false, project creation).
  • Preview proof (preview_3): completed the OAuth login on os.iterate-preview-3.com, landed on an authenticated /projects page, then reloaded. Network log for the authenticated reload: the document + static assets only — zero requests to auth.iterate.com, zero to /api/iterate-auth/session (server-side JWT verification used the baked-in static JWKS).

Ops notes

  • ITERATE_AUTH_JWKS set in the os Doppler prd + preview root configs (branch configs inherit) and personal dev configs. dev_localhost deliberately skipped — it signs with local auth keys.
  • The preview_3 OAuth client secret in Doppler had drifted from prod auth (invalid_client on code exchange); fixed by re-syncing via sync-auth-clients.ts with rotation.
  • Follow-up tasks added: tasks/os-deploy-time-jwks-fetch.md (fetch JWKS at deploy time to fix the key-rotation story), tasks/os-auth-spurious-logout-refresh.md (suspected concurrent-refresh race causing spurious logouts in dev).

🤖 Generated with Claude Code


Note

High Risk
Changes core authentication, authorization, and project/org scoping across middleware, oRPC, and UI; misconfigured JWKS or stale JWT claims could break access until token refresh.

Overview
OS request and page auth now rely on locally verified JWTs instead of calling the auth worker on every load. The auth SDK gains optional static JWKS (createLocalJWKSet) and authenticate({ includeUserInfo: false }) so cookie sessions are built from access/ID token claims only; auth tokens can carry optional org names so UI does not need userinfo.
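A minimal sketch of building a cookie session from claims alone, assuming the access/ID tokens were already verified against the JWKS and that claims are named `orgs` and `projects` (both hypothetical; the PR does not show the claim layout). Falling back to the org slug when a token carries no name is an illustrative choice for tokens issued before org names were added.

```typescript
// Hypothetical claim shape; the real token claim names are not shown in this PR.
interface OrgClaim {
  id: string;
  slug: string;
  name?: string; // optional: org names may be absent from some tokens
}

interface VerifiedClaims {
  sub: string;
  email?: string;
  orgs?: OrgClaim[];
  projects?: { id: string; slug: string; orgId: string }[];
}

interface Session {
  userId: string;
  email: string | null;
  organizations: { id: string; slug: string; displayName: string }[];
  projects: { id: string; slug: string; orgId: string }[];
}

// Builds the session purely from already-verified claims: no userinfo call.
function sessionFromClaims(claims: VerifiedClaims): Session {
  return {
    userId: claims.sub,
    email: claims.email ?? null,
    organizations: (claims.orgs ?? []).map((org) => ({
      id: org.id,
      slug: org.slug,
      // Fall back to the slug when the token carries no display name.
      displayName: org.name ?? org.slug,
    })),
    projects: claims.projects ?? [],
  };
}
```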

The active/current-organization layer is removed (~170 lines). Authorization uses the user/admin principal and signed org/project claims in the token: project list/read/create gates on JWT project claims (not per-request auth-worker project lists), organizationSlug is required when the user belongs to multiple orgs, and findBySlug is documented as globally unique. oRPC middleware is renamed to authenticated-user; project access helpers drop org-scoped auth-worker lookups.
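The claims-based gate can be sketched as a pure, synchronous check. The `Principal` union, the `projectIds` set derived from signed project claims, and the function names are assumptions for illustration; the point is that no per-request project list is fetched from the auth worker.

```typescript
// Hypothetical shapes; the real principal and claim layout may differ.
type Principal =
  | { kind: "admin"; userId: string }
  | { kind: "user"; userId: string; projectIds: ReadonlySet<string> }; // from signed claims

interface Project {
  id: string;
  slug: string; // documented as globally unique
  organizationId: string;
}

// Synchronous check against signed claims; no auth-worker roundtrip.
function canReadProject(principal: Principal, project: Project): boolean {
  return principal.kind === "admin" || principal.projectIds.has(project.id);
}

// Filters a project list down to what the principal's claims allow.
function visibleProjects(principal: Principal, all: readonly Project[]): Project[] {
  return all.filter((project) => canReadProject(principal, project));
}
```

Because the check reads signed claims, a project created after the token was issued stays invisible to it until the next refresh, which is the claims-staleness follow-up recorded later in this PR.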

The root route hydrates auth from middleware-resolved session (SSR snapshot, no /api/iterate-auth/session on reload). Create project adds an organization selector when needed. Codemode/MCP paths stop threading activeOrganization through RPC props.

Deploy wiring adds APP_CONFIG_ITERATE_AUTH__JWKS; follow-up task files note deploy-time JWKS fetch and dev refresh races.

Reviewed by Cursor Bugbot for commit 9e4ebee.

Environment Config Lease

No active environment config lease.

OS

Status: released
Commit: 9e4ebee
Preview: https://os.iterate-preview-6.com
Summary: Preview app released.
Workflow run
Updated: 2026-06-09T21:38:47.208Z

Authenticate normal OS requests purely from signed JWT claims so the OS
worker never calls the auth worker on page loads, including cold isolate
starts:

- OS middleware authenticates with includeUserInfo: false; session data
  (user, orgs, projects) comes from access/ID token claims only.
- Auth SDK accepts a static JWKS (ITERATE_AUTH_JWKS /
  APP_CONFIG_ITERATE_AUTH__JWKS) and uses createLocalJWKSet when
  configured, eliminating the remote JWKS fetch on first verification.
- Access-token org claims now carry organization names, so sessions no
  longer need userinfo for display names.
- Deleted the active/current-organization context entirely: route auth
  validates org membership synchronously from session claims, and oRPC
  uses an authenticated-user middleware plus signed principal/project
  claims for project authorization.

The only remaining OS -> auth-worker call is explicit project creation
(projects-capability), an intentional mutation boundary. Token refresh
(every ~5 min) still talks to auth by design.

Verified on preview_3: authenticated /projects load and reload with zero
browser or worker requests to auth.iterate.com and zero
/api/iterate-auth/session calls. ITERATE_AUTH_JWKS is set in the os
Doppler prd/preview root configs and personal dev configs (deploy-time
JWKS fetch is a follow-up task; dev_localhost intentionally skipped as
it signs with local keys).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread apps/os/src/lib/auth.ts Outdated
# Conflicts:
#	apps/os/src/components/app-sidebar.tsx
#	apps/os/src/lib/auth.ts
#	apps/os/src/routes/_app/org/$organizationSlug/index.tsx
#	apps/os/src/routes/index.tsx
#	apps/os/src/routes/organization.tsx
Comment thread apps/os/src/capnweb/projects-capability.ts
Comment thread apps/os/src/capnweb/projects-capability.ts
Project creation previously failed with BAD_REQUEST for users whose
signed session lists more than one organization. The create input now
accepts an optional organizationSlug, validated against the signed org
claims; single-org users keep the implicit default. The create-project
form shows an organization picker when the session has multiple orgs.
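The resolution rule described here might be sketched as follows. The claim shape, `resolveCreateOrganization`, and `BadRequestError` are hypothetical stand-ins (`BadRequestError` plays the role of whatever produces the `BAD_REQUEST` response mentioned above); only the rule itself, an explicit slug validated against signed org claims with an implicit default for single-org users, comes from the commit message.

```typescript
// Hypothetical claim shape; only the resolution rule comes from the commit message.
interface OrgClaim {
  id: string;
  slug: string;
}

class BadRequestError extends Error {}

// Picks the organization for a new project from the signed org claims.
function resolveCreateOrganization(
  orgClaims: readonly OrgClaim[],
  requestedSlug?: string,
): OrgClaim {
  if (requestedSlug !== undefined) {
    // An explicit slug must match one of the orgs the token says the user belongs to.
    const match = orgClaims.find((org) => org.slug === requestedSlug);
    if (!match) {
      throw new BadRequestError(`not a member of organization "${requestedSlug}"`);
    }
    return match;
  }
  // Single-org users keep the implicit default.
  const only = orgClaims.length === 1 ? orgClaims[0] : undefined;
  if (only) return only;
  throw new BadRequestError(
    orgClaims.length === 0
      ? "session has no organizations"
      : "organizationSlug is required when the session lists multiple organizations",
  );
}
```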

Also note the claims-staleness-after-mutation follow-up (new projects
absent from claims-based reads until token refresh) in the refresh task.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 5edf1c0.

Comment thread apps/os/src/routes/__root.tsx Outdated
jonastemplestein and others added 2 commits June 9, 2026 22:19
Root beforeLoad read the session from getGlobalStartContext(), which only
exists during SSR. Client-side navigations recomputed the route context
without it, so every guarded route redirected to /sign-in once you
navigated in the SPA. The SSR pass now seeds an auth snapshot (session,
issuer, project-host slug) into the query cache — which TanStack Start
already dehydrates to the client — and client navigations reuse it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Aligns the session propagation with TanStack Start's first-party
authentication pattern (server function awaited in root beforeLoad,
result in router context) while keeping zero-roundtrip navigations:
the snapshot query has infinite staleTime, so SSR seeds it once via the
dehydrated query cache and client-side navigations reuse it. Unlike the
previous hand-rolled cache fallback, a cache miss now fetches the
session from the OS worker (still no auth-worker roundtrip) instead of
silently treating the user as signed out.
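The snapshot behavior described in these two commits can be modeled without TanStack as a small cache: seeded once from SSR, reused indefinitely (the analogue of an infinite `staleTime`), and filled by a fetch to the OS worker on a miss instead of defaulting to signed out. `AuthSnapshotCache` and its methods are hypothetical; the real code uses the dehydrated query cache and a server function awaited in root `beforeLoad`. Sharing one in-flight fetch between concurrent misses is an extra assumption of this sketch.

```typescript
// Hypothetical snapshot shape; in the PR this lives in TanStack Query's cache.
interface AuthSnapshot {
  session: { userId: string } | null; // null means genuinely signed out
  issuer: string;
}

class AuthSnapshotCache {
  private snapshot: AuthSnapshot | undefined;
  private inFlight: Promise<AuthSnapshot> | undefined;
  private readonly fetchFromOsWorker: () => Promise<AuthSnapshot>;

  constructor(fetchFromOsWorker: () => Promise<AuthSnapshot>) {
    this.fetchFromOsWorker = fetchFromOsWorker;
  }

  // SSR seeds the cache once (the dehydrated query cache plays this role in the PR).
  seed(snapshot: AuthSnapshot): void {
    this.snapshot = snapshot;
  }

  // A cached snapshot never goes stale, so client navigations reuse it.
  async get(): Promise<AuthSnapshot> {
    if (this.snapshot) return this.snapshot;
    // A miss asks the OS worker instead of silently assuming "signed out".
    if (!this.inFlight) {
      this.inFlight = this.fetchFromOsWorker()
        .then((snapshot) => {
          this.snapshot = snapshot;
          return snapshot;
        })
        .finally(() => {
          this.inFlight = undefined;
        });
    }
    return this.inFlight;
  }
}
```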

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jonastemplestein merged commit 4bb3e28 into main Jun 9, 2026
10 checks passed
jonastemplestein deleted the glib-submarine branch June 9, 2026 21:37
jonastemplestein added a commit that referenced this pull request Jun 10, 2026
…1410)

Follow-up to #1408. Three reliability fixes on the OS/auth path, one
branch.

## 1. ~5-minute logout (root cause from library source)

`@better-auth/oauth-provider`'s refresh-token grant **rotates** the
refresh token on every use and, on reuse of an already-rotated token,
**revokes the entire token family** (`handleRefreshTokenGrant` →
`createRefreshToken` marks the old token `revoked`, and a later reuse
deletes all of the user+client's refresh tokens). A normal OS page load
fires several concurrent requests; once the 5-minute access token was
within the 30s refresh skew, each request independently hit the token
endpoint with the same cookie token — the first rotated it, the rest
looked like theft and nuked the session → logout, repeating roughly
every 5 minutes.
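The failure mode can be reproduced with a simplified, in-memory model of the rotation and reuse-detection behavior described above. This is not the library's code: `RotatingRefreshStore` is hypothetical, treats the whole store as one user+client token family, and captures only the two rules that matter here.

```typescript
// Simplified model of rotation plus reuse detection; not the library's code.
type RefreshResult =
  | { ok: true; refreshToken: string }
  | { ok: false; reason: string };

class RotatingRefreshStore {
  private counter = 0;
  private readonly active = new Set<string>(); // usable tokens in the family
  private readonly rotated = new Set<string>(); // tokens already exchanged once

  issue(): string {
    this.counter += 1;
    const token = `rt_${this.counter}`;
    this.active.add(token);
    return token;
  }

  refresh(token: string): RefreshResult {
    if (this.active.has(token)) {
      // Rotate: the presented token is spent and a fresh one replaces it.
      this.active.delete(token);
      this.rotated.add(token);
      return { ok: true, refreshToken: this.issue() };
    }
    if (this.rotated.has(token)) {
      // Reuse of a spent token looks like theft: revoke every active token.
      this.active.clear();
      return { ok: false, reason: "reuse detected, family revoked" };
    }
    return { ok: false, reason: "unknown or revoked token" };
  }
}

// Three requests from one page load present the same cookie token.
const store = new RotatingRefreshStore();
const cookieToken = store.issue();
const results = [1, 2, 3].map(() => store.refresh(cookieToken));
// results[0] succeeds, but results[1] revokes the family, so even the
// freshly rotated token from results[0] no longer works.
```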

Fixes (`apps/auth/src/lib/server.ts`, the SDK OS bundles):
- **Single-flight refresh** per refresh token — concurrent refreshes
collapse to one token-endpoint call (`createSingleFlight`, extracted +
unit-tested).
- **Never refresh on WebSocket upgrades** — an upgrade response can't
carry `Set-Cookie`, so refreshing there would rotate the token into a
response the browser can't store and strand the session (this is also
the **REPL websocket failure**: once the access token went stale, the
capnweb upgrade tried to refresh, failed, and 401'd).
- **Tolerate a failed refresh while the access token is still valid** —
serve the request and let a later one retry, instead of dropping the
session on any hiccup.
- **Access-token TTL 5m → 30m** (`auth-plugins.ts`) so refresh is rare.
Tradeoff: org/project claim changes can take up to 30m to propagate
(mitigated for the creator by client cache seeding).
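A sketch of the first three fixes together, under stated assumptions: `createSingleFlight` below is a hypothetical reimplementation, not the repo's exported helper, and `ensureFreshSession`, its outcome labels, and the injected `callTokenEndpoint` are illustrative names. The 30s skew comes from the root-cause description.

```typescript
// Hypothetical reimplementation; the repo's exported createSingleFlight may differ.
function createSingleFlight<K, V>() {
  const inFlight = new Map<K, Promise<V>>();
  return (key: K, task: () => Promise<V>): Promise<V> => {
    const existing = inFlight.get(key);
    if (existing) return existing; // join the call already in progress
    // Clear the entry once settled so a later request can retry after a failure.
    const promise = task().finally(() => inFlight.delete(key));
    inFlight.set(key, promise);
    return promise;
  };
}

interface RefreshContext {
  isWebSocketUpgrade: boolean;
  accessTokenExpiresAt: number; // epoch milliseconds
  now: number; // epoch milliseconds
}

type Outcome = "use-current" | "refreshed" | "unauthenticated";

const REFRESH_SKEW_MS = 30_000;

async function ensureFreshSession(
  ctx: RefreshContext,
  refreshToken: string,
  runSingleFlight: (key: string, task: () => Promise<string>) => Promise<string>,
  callTokenEndpoint: () => Promise<string>,
): Promise<Outcome> {
  const stillValid = ctx.accessTokenExpiresAt > ctx.now;
  // Upgrade responses cannot carry Set-Cookie, so never rotate the token here.
  if (ctx.isWebSocketUpgrade) return stillValid ? "use-current" : "unauthenticated";
  if (ctx.accessTokenExpiresAt - ctx.now > REFRESH_SKEW_MS) return "use-current";
  try {
    // Requests sharing one refresh token collapse to one endpoint call.
    // A real middleware would also set the new tokens via Set-Cookie.
    await runSingleFlight(refreshToken, callTokenEndpoint);
    return "refreshed";
  } catch {
    // Tolerate a failed refresh while the current access token still works.
    return stillValid ? "use-current" : "unauthenticated";
  }
}
```

Keying the in-flight map by refresh token, rather than by user, follows the "per refresh token" wording above.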

## 2. Deploy-time JWKS (`apps/os/alchemy.run.ts`)

#1408 verified JWTs locally from a static JWKS but relied on a hand-set
Doppler secret. Now the alchemy script **fetches `<issuer>/jwks` at
deploy time** into `APP_CONFIG` (typesafe), so key rotation only needs
an OS redeploy. A loopback issuer (local dev auth, own keys) skips the
static JWKS; a failed fetch falls back to runtime JWKS. **Verified**:
preview-5 deploy log shows the JWKS baked into config and the worker
healthy with zero auth-worker roundtrips.
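Under stated assumptions, the deploy step's decision could be sketched like this: skip loopback issuers, fetch `<issuer>/jwks`, and return `undefined` (meaning: fall back to runtime JWKS) on any failure. The loopback host list, `isLoopbackIssuer`, `fetchJwksForDeploy`, and the injectable `fetchImpl` parameter are hypothetical; how the result is written into `APP_CONFIG` is not shown in the PR.

```typescript
// Hypothetical sketch; the real alchemy.run.ts wiring is not shown in the PR.
interface JsonWebKeySet {
  keys: Array<Record<string, unknown>>;
}

// Assumed list of hosts treated as local dev auth (which signs with its own keys).
const LOOPBACK_HOSTS = new Set(["localhost", "127.0.0.1", "[::1]"]);

function isLoopbackIssuer(issuer: string): boolean {
  return LOOPBACK_HOSTS.has(new URL(issuer).hostname);
}

// Returns a JSON string to bake into config, or undefined to use runtime JWKS.
async function fetchJwksForDeploy(
  issuer: string,
  fetchImpl: (url: string) => Promise<{ ok: boolean; json(): Promise<unknown> }> = fetch,
): Promise<string | undefined> {
  if (isLoopbackIssuer(issuer)) return undefined; // local issuer: skip the static JWKS
  try {
    const response = await fetchImpl(`${issuer.replace(/\/+$/, "")}/jwks`);
    if (!response.ok) return undefined;
    const body = await response.json();
    if (typeof body !== "object" || body === null || !Array.isArray((body as JsonWebKeySet).keys)) {
      return undefined; // unexpected shape: safer to verify with runtime JWKS
    }
    return JSON.stringify(body);
  } catch {
    return undefined; // fetch failed: fall back to runtime (remote) JWKS
  }
}
```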

## 3. Stream append skeleton flash
(`apps/os/.../project-stream-view.tsx`)

On append the virtualized window shifted (the list grows and the view
force-scrolled to the bottom), which re-created the visible-range SQL
query. `stream-browser-db.query()` seeds a new range query as `pending`
carrying a *different* range's rows, so `rowsByIndex` missed the visible
indices and every visible row blanked to a grey `bg-slate-100` skeleton
for a frame — the "skeleton flash + all rows redraw". Fixes: retain the
last committed rows across range re-queries (only genuinely-new indices
fall back to a skeleton), and only auto-scroll to the bottom when
already pinned there (don't yank a scrolled-up reader and trigger a
full-window re-query).
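The two fixes can be sketched as pure helpers. `RangeQueryState`, `mergeVisibleRows`, `shouldAutoScroll`, and the 32px threshold are hypothetical; they assume a range query that is either `pending` (possibly still carrying another range's rows) or `ready`.

```typescript
// Hypothetical shapes; the real stream-browser-db query API is not shown here.
interface RangeQueryState<Row> {
  status: "pending" | "ready";
  startIndex: number;
  rows: Row[]; // rows[i] belongs to index startIndex + i
}

// Keeps the last committed row per index so a pending re-query never blanks rows.
function mergeVisibleRows<Row>(
  committed: Map<number, Row>,
  query: RangeQueryState<Row>,
): Map<number, Row> {
  if (query.status !== "ready") return committed; // pending rows may belong to another range
  const next = new Map(committed);
  query.rows.forEach((row, i) => next.set(query.startIndex + i, row));
  return next;
}

// Only follow appends when the reader is already pinned near the bottom.
function shouldAutoScroll(
  el: { scrollTop: number; clientHeight: number; scrollHeight: number },
  thresholdPx = 32,
): boolean {
  return el.scrollHeight - (el.scrollTop + el.clientHeight) <= thresholdPx;
}
```

Ignoring pending results entirely is the conservative reading of "retain the last committed rows". The map grows with every index seen, so a real view would likely prune indices far outside the window.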

## Proof status (being precise)

- **Refresh single-flight**: proven by
`apps/os/src/auth/iterate-auth-single-flight.test.ts` (deterministic) +
root cause read from the oauth-provider source. The end-to-end "wait 5
minutes in a browser" proof was **not** completed: production auth is
Google-only (no headless sign-in) and doesn't honor service-token
impersonation at the public `oauth2/authorize`, and the fixed local
stack now issues 30m tokens. Worth a manual 5-min check after this
deploys to prod auth.
- **Deploy-time JWKS**: verified on a real preview-5 deploy.
- **Stream flash**: fix is code-reasoned and safe (retain-last-rows +
sticky-scroll); not pixel-verified — the local headless harness was too
unstable (Chrome OOM, dev issuer drift) to capture the sub-second flash
reliably. See `apps/os/docs/headless-local-debugging.md`.

## Docs

`apps/os/docs/headless-local-debugging.md` — driving the full local
OS+Auth stack headlessly (test OTP `424242`, signup allowlist,
orgs/projects, OAuth/consent quirks, reading local D1, MutationObserver
over throttled timers).

`pnpm typecheck && pnpm lint && pnpm test` all green.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **High Risk**
> Changes authentication refresh semantics, WebSocket session behavior,
and access-token lifetime—security-critical paths that affect all
signed-in users and long-lived connections.
> 
> **Overview**
> Addresses three reliability issues on the OS/auth path: periodic
session logout, JWT verification at deploy, and stream UI flicker.
> 
> **Auth session (~5‑minute logout):** Adds exported
`createSingleFlight` and wraps refresh-token grants so concurrent
requests sharing one cookie collapse to a single token call—avoiding
rotated-token reuse that revokes the whole family. Cookie middleware
skips refresh on WebSocket upgrades (no `Set-Cookie`), tolerates refresh
failures while the access token is still valid, and extends access-token
TTL from 5m to 30m on the auth provider.
> 
> **OS deploy:** `alchemy.run.ts` fetches issuer JWKS at deploy time
into static config (loopback dev skips production JWKS; fetch failure
falls back to runtime JWKS).
> 
> **Stream UI:** `project-stream-view` keeps last committed SQLite rows
during virtualizer range re-queries and only auto-scrolls when the user
is pinned near the bottom, reducing skeleton flashes on append.
> 
> Adds `headless-local-debugging.md`, README link, and unit tests for
`createSingleFlight`.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
10682c4. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

<!-- CLOUDFLARE_PREVIEW -->
## Environment Config Lease
<!-- CLOUDFLARE_PREVIEW_STATE -->
<!--
{
  "apps": {
    "os": {
      "appDisplayName": "OS",
      "appSlug": "os",
      "status": "deployed",
      "updatedAt": "2026-06-09T23:12:17.956Z",
      "headSha": "10682c43e5022fef9c39f55405bed1f423384950",
      "message": null,
      "publicUrl": "https://os.iterate-preview-2.com",
"runUrl": "https://github.com/iterate/iterate/actions/runs/27241670590",
      "shortSha": "10682c4"
    }
  },
  "environmentConfigLease": {
    "dopplerConfig": "preview_2",
    "leasedUntil": 1781050179583,
    "leaseId": "0aa9f837-5428-4f86-be56-bfc11f0a201d",
    "slug": "preview-2",
    "type": "environment-config-lease"
  }
}
-->
<!-- /CLOUDFLARE_PREVIEW_STATE -->
Lease: `preview-2`
Doppler config: `preview_2`
Type: `environment-config-lease`
Leased until: 2026-06-10T00:09:39.583Z

### OS
Status: deployed
Commit: `10682c4`
Preview: https://os.iterate-preview-2.com
[Workflow
run](https://github.com/iterate/iterate/actions/runs/27241670590)
Updated: 2026-06-09T23:12:17.956Z
<!-- /CLOUDFLARE_PREVIEW -->

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>