P1.7 #377: user-facing surface for OAuth re-auth (stacked on P1.9)#422
Merged
Conversation
Closes #379. Three coordinated levers + a chrome banner give every install the panic button CLAUDE.md launch-criterion #8 requires. Lever 1 — operator env var: - SKYTWIN_AUTO_EXECUTE_DISABLED=true on the API/worker process. - Read ONCE at PolicyEvaluator construction (override via ctor option for tests so the env-var read isn't a hidden dependency). - New early-return at the top of `evaluate()` sits AHEAD of every other check (trust tier, injection guard, autonomy, quiet hours, policy rules) so no downstream allow path can bypass it. - Uses { allowed: true, requiresApproval: true } not deny — actions still land in the Approvals queue. Lever 2 — per-user toggle: - autonomy_settings.paused: true | false on the user row. - New PUT /api/users/:userId/autonomy-pause endpoint sets/clears the flag, writes pausedAt + optional pausedReason on transition. - AutonomySettings type gains paused, pausedAt, pausedReason fields in @skytwin/shared-types/user.ts. Lever 3 — chrome banner: - index.html ships a sticky #autonomy-banner above <main>. - app.js updateAutonomyBanner() fetches GET /autonomy-state, renders operator + user lines independently, and shows a Resume button only for the user-pause line (operator pause needs an env-var change). - Refreshed on every navigate() + every 30s; backed off when API known offline. - New GET /api/users/:userId/autonomy-state returns combined state. Settings page: - New "Pause auto-execution" card with confirmation modal on both transitions and optional reason prompt. - Hydrated from /autonomy-state on render; refreshes banner + on-page state together after a flip. - Coexists with the existing "Pause everything (demote to observer)" button — different lever, clearer label. Operator pause reason wins when both flags are set so the banner copy reflects who set the pause. Tests: - 5 new policy-engine tests cover the operator-paused, user-paused, both-paused (operator wins), neither-paused regression, and isGloballyPaused() reporting matrix. - 177 policy-engine tests, 713 API tests pass. Safety Invariant #1 preserved: the new check strengthens the single PolicyEvaluator.evaluate funnel rather than adding a parallel path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes #377. Pre-fix silent-breakage trap: Google revokes a refresh token, worker correctly trips the per-user circuit breaker, dashboard keeps rendering "Listening" — user only notices days later when "did you get my email?" surfaces it. This PR adds the single piece of state the API can read to render a "Gmail disconnected — Reconnect" banner: DB (new): - migration 060-connector-health.sql adds connector_health(user_id, connector_name, status, error_code, last_success_at, last_failure_at, updated_at). PRIMARY KEY (user_id, connector_name); ON DELETE CASCADE on user. - packages/db/src/repositories/connector-health-repository.ts exposes upsert + findByUser. Re-exported from @skytwin/db. Worker: - apps/worker/src/index.ts:pollUserConnectors upserts status='needs_reauth' on the existing OAuthRefreshError.permanent branch (alongside the circuit-breaker force-trip). - Per-connector success upserts status='connected' so a multi-connector user with one bad connector doesn't have a working one stuck. Keyed on thisConnectorFailed (not loop-wide hadFailure) so a working Calendar isn't blocked by a failing Gmail. - New extractErrorCode helper parses 'invalid_grant' / 'unauthorized_client' / etc out of the OAuthRefreshError message so the banner can render conditional copy. - DB writes wrapped in try/catch — must not break the circuit-breaker logic; the log line is still the operator's audit trail. API: - apps/api/src/routes/connectors.ts mounts /api/connectors with one endpoint: GET /:userId/status → { connectors, anyNeedsReauth }. - Wired into apps/api/src/index.ts with sessionAuth + bindUserIdParamValidator + bindUserIdParamOwnership. Frontend: - index.html adds #connectors-banner sticky under #autonomy-banner. - styles.css: amber (warning color, not red — re-auth is housekeeping not panic); stacks below #autonomy-banner when both fire via body.has-autonomy-banner.has-connectors-banner. - app.js updateConnectorsBanner() polls /connectors/:userId/status on every navigate() + every 60s (matches worker poll cadence). Conditional copy on errorCode ('invalid_grant' → "access was revoked or expired"). - Single Reconnect CTA → #/connect-gmail (only connector this PR covers); future connectors should branch off the broken row. Build + tests: - All builds clean (db, worker, api). - 713 API tests, 301 db tests, full monorepo suite green. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…h-reauth-banner # Conflicts: # CHANGELOG.md # apps/web/public/css/styles.css # apps/web/public/index.html # apps/web/public/js/app.js # apps/web/public/js/pages/settings.js # packages/policy-engine/src/__tests__/policy-evaluator.test.ts # packages/policy-engine/src/policy-evaluator.ts
There was a problem hiding this comment.
Pull request overview
Adds a user-facing “needs re-auth” surface for OAuth connector failures (notably Google invalid_grant refresh-token revocations) by persisting per-(user, connector) health in the DB, exposing it via an API endpoint, and rendering a sticky dashboard banner so silent connector breakage is visible/actionable.
Changes:
- Add
connector_healthtable + DB repository (upsert,findByUser) and export via@skytwin/db. - Worker writes
needs_reauthon permanent OAuth refresh failures andconnectedon per-connector success (best-effort writes). - API exposes
GET /api/connectors/:userId/status; web polls it and shows an amber “Reconnect” banner.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/db/src/migrations/060-connector-health.sql | Creates connector_health table + index to persist per-connector health state. |
| packages/db/src/repositories/connector-health-repository.ts | Adds repository for upserting and reading connector health snapshots. |
| packages/db/src/repositories/index.ts | Re-exports the new connector health repository + row type. |
| packages/db/src/index.ts | Re-exports connector health repository/type at package root. |
| apps/worker/src/index.ts | Writes connector health on permanent OAuth failures and on per-connector success; adds error-code extraction helper. |
| apps/api/src/routes/connectors.ts | Adds /api/connectors/:userId/status endpoint to read connector health. |
| apps/api/src/index.ts | Mounts the new connectors router under /api/connectors. |
| apps/web/public/index.html | Adds the connectors re-auth banner DOM. |
| apps/web/public/css/styles.css | Styles/stacking rules for the amber connectors banner (including stacking with autonomy banner). |
| apps/web/public/js/app.js | Implements polling + rendering for the connectors banner; adds reconnect click handler. |
| apps/web/public/js/pages/settings.js | Contains an (unrelated) merge-conflict artifact in the autonomy pause prompt string. |
| CHANGELOG.md | Contains unresolved merge-conflict markers in Unreleased section. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+17
to
+23
| <<<<<<< HEAD | ||
| ### Added (Epic D — reliability, #361) | ||
|
|
||
| - **OAuth re-auth user-facing surface (#377).** Pre-fix, when Google revoked SkyTwin's refresh token (user clicked "Remove third-party app" in their Google account, or inactivity aged the token out), the worker correctly tripped the per-user circuit breaker and stopped hammering Google's token endpoint — but the dashboard kept rendering "Listening" and the user noticed days later when "did you get my email?" surfaced the silent breakage. This PR adds the single missing piece: a user-facing signal that something needs re-auth. New `connector_health` table (migration `060-connector-health.sql`) with one row per `(user_id, connector_name)` records `status: 'connected' | 'needs_reauth' | 'disabled'`, `error_code`, and `last_success_at` / `last_failure_at`. New `connectorHealthRepository.upsert` writes from the worker on every poll outcome — `needs_reauth` on the existing `OAuthRefreshError.permanent === true` branch (alongside the circuit-breaker force-trip), `connected` per-connector on success so a multi-connector user with one bad connector doesn't have a working one stuck. New `extractErrorCode()` parses the Google `error` field out of the OAuthRefreshError message so the banner can render conditional copy ("invalid_grant" → "access was revoked or expired"). New `GET /api/connectors/:userId/status` (apps/api/src/routes/connectors.ts) returns `{ connectors: {name: {status, errorCode, lastSuccessAt, lastFailureAt}}, anyNeedsReauth }`. New amber chrome banner (sticks below the red kill-switch banner if both fire) reads from the endpoint on every `navigate()` + every 60s and shows a single "Reconnect" CTA that jumps to `#/connect-gmail`. Banner is not dismissible — the worker has stopped doing work for this user and they need to fix it. CLAUDE.md launch criterion: "silent breakage that masquerades as normal operation" is the single worst class of bug for a delegated-action product; this closes it. | ||
|
|
||
| ======= | ||
| >>>>>>> origin/main |
Comment on lines
+1052
to
+1056
| <<<<<<< HEAD | ||
| reason = window.prompt('Optional: why are you pausing? (saved with the audit log; leave blank to skip)', '') || undefined; | ||
| ======= | ||
| reason = window.prompt('Optional: why are you pausing? (stored on your user record; leave blank to skip)', '') || undefined; | ||
| >>>>>>> origin/main |
Comment on lines
+136
to
+147
| * Pluck the Google `error` field out of an OAuthRefreshError message | ||
| * (e.g. `"invalid_grant"`, `"unauthorized_client"`) so the dashboard | ||
| * banner can render conditional copy ("revoked" vs "client misconfig"). | ||
| * Returns null when the message has no recognisable code so the | ||
| * caller can default to a safe fallback. | ||
| * | ||
| * The OAuthRefreshError message format is `"OAuth refresh failed: <code> | ||
| * <description?>"` — see packages/connectors/src/oauth/google-oauth.ts. | ||
| */ | ||
| function extractErrorCode(message: string): string | null { | ||
| const match = message.match(/refresh failed:\s*([a-z_]+)/i); | ||
| return match?.[1] ?? null; |
Comment on lines
+25
to
+33
| CREATE TABLE IF NOT EXISTS connector_health ( | ||
| user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE, | ||
| connector_name STRING NOT NULL, | ||
| status STRING NOT NULL, | ||
| error_code STRING, | ||
| last_success_at TIMESTAMPTZ, | ||
| last_failure_at TIMESTAMPTZ, | ||
| updated_at TIMESTAMPTZ NOT NULL DEFAULT now(), | ||
| PRIMARY KEY (user_id, connector_name) |
Comment on lines
+51
to
+63
| const now = new Date(); | ||
| await query( | ||
| `INSERT INTO connector_health ( | ||
| user_id, connector_name, status, error_code, | ||
| last_success_at, last_failure_at, updated_at | ||
| ) | ||
| VALUES ($1, $2, $3, $4, $5, $6, $7) | ||
| ON CONFLICT (user_id, connector_name) DO UPDATE SET | ||
| status = EXCLUDED.status, | ||
| error_code = EXCLUDED.error_code, | ||
| last_success_at = COALESCE(EXCLUDED.last_success_at, connector_health.last_success_at), | ||
| last_failure_at = COALESCE(EXCLUDED.last_failure_at, connector_health.last_failure_at), | ||
| updated_at = EXCLUDED.updated_at`, |
| * can render a "Gmail disconnected — Reconnect" banner when a user's | ||
| * refresh token has been revoked at the OAuth provider or otherwise | ||
| * permanently invalidated. The worker writes the underlying state on | ||
| * every poll outcome — see apps/worker/src/index.ts:pollUserConnectors. |
Comment on lines
+28
to
+52
| router.get('/:userId/status', async (req, res, next) => { | ||
| try { | ||
| const { userId } = req.params as { userId: string }; | ||
| const rows = await connectorHealthRepository.findByUser(userId); | ||
| const connectors: Record<string, { | ||
| status: 'connected' | 'needs_reauth' | 'disabled'; | ||
| errorCode: string | null; | ||
| lastSuccessAt: string | null; | ||
| lastFailureAt: string | null; | ||
| }> = {}; | ||
| let anyNeedsReauth = false; | ||
| for (const row of rows) { | ||
| connectors[row.connector_name] = { | ||
| status: row.status, | ||
| errorCode: row.error_code, | ||
| lastSuccessAt: row.last_success_at?.toISOString() ?? null, | ||
| lastFailureAt: row.last_failure_at?.toISOString() ?? null, | ||
| }; | ||
| if (row.status === 'needs_reauth') anyNeedsReauth = true; | ||
| } | ||
| res.json({ userId, connectors, anyNeedsReauth }); | ||
| } catch (error) { | ||
| next(error); | ||
| } | ||
| }); |
Comment on lines
+35
to
+87
| export const connectorHealthRepository = { | ||
| /** | ||
| * Upsert the connector's current health snapshot. On 'connected' | ||
| * the caller should pass `lastSuccessAt: new Date()`; on | ||
| * 'needs_reauth' (or 'disabled'), `lastFailureAt`. The opposite | ||
| * timestamp is preserved across the upsert so a flap doesn't | ||
| * destroy the last-known-good marker. | ||
| */ | ||
| async upsert(input: { | ||
| userId: string; | ||
| connectorName: string; | ||
| status: ConnectorHealthRow['status']; | ||
| errorCode?: string | null; | ||
| lastSuccessAt?: Date | null; | ||
| lastFailureAt?: Date | null; | ||
| }): Promise<void> { | ||
| const now = new Date(); | ||
| await query( | ||
| `INSERT INTO connector_health ( | ||
| user_id, connector_name, status, error_code, | ||
| last_success_at, last_failure_at, updated_at | ||
| ) | ||
| VALUES ($1, $2, $3, $4, $5, $6, $7) | ||
| ON CONFLICT (user_id, connector_name) DO UPDATE SET | ||
| status = EXCLUDED.status, | ||
| error_code = EXCLUDED.error_code, | ||
| last_success_at = COALESCE(EXCLUDED.last_success_at, connector_health.last_success_at), | ||
| last_failure_at = COALESCE(EXCLUDED.last_failure_at, connector_health.last_failure_at), | ||
| updated_at = EXCLUDED.updated_at`, | ||
| [ | ||
| input.userId, | ||
| input.connectorName, | ||
| input.status, | ||
| input.errorCode ?? null, | ||
| input.lastSuccessAt ?? null, | ||
| input.lastFailureAt ?? null, | ||
| now, | ||
| ], | ||
| ); | ||
| }, | ||
|
|
||
| /** Return every connector health row for a given user. */ | ||
| async findByUser(userId: string): Promise<ConnectorHealthRow[]> { | ||
| const result = await query<ConnectorHealthRow>( | ||
| `SELECT user_id, connector_name, status, error_code, | ||
| last_success_at, last_failure_at, updated_at | ||
| FROM connector_health | ||
| WHERE user_id = $1 | ||
| ORDER BY connector_name`, | ||
| [userId], | ||
| ); | ||
| return result.rows; | ||
| }, |
8 findings from copilot-pull-request-reviewer (2 critical + 6 substantive):
1. CRITICAL — Resolved unresolved merge-conflict markers in CHANGELOG.md
(line 17-23) and apps/web/public/js/pages/settings.js (line 1052-1056).
The rebase onto main left both files with HEAD/origin markers intact;
I missed them on the previous push.
2. CRITICAL — extractErrorCode() regex didn't match the actual
OAuthRefreshError format. The connector throws "Google OAuth token
refresh failed (permanent|transient): <status> <body>" where <body>
is the raw Google token-endpoint response (JSON like
{"error":"invalid_grant","error_description":"..."}). The old
/refresh failed:\s*([a-z_]+)/i would never have matched because of
the (permanent|transient) tag before the colon. Switched to
/"error"\s*:\s*"([a-z_]+)"/i which parses the JSON field directly.
Extracted to apps/worker/src/oauth-error-code.ts so it can be tested
without booting the worker's runtime side effects.
3. Added CHECK constraint on connector_health.status. The TS row type
pins it to a narrow union but the column was STRING — DB-level
constraint enforces the invariant for direct SQL writes too.
4. Switched connectorHealthRepository.upsert to DB-side now() instead
of new Date(). Multi-node deployments can't get clock-skewed
updated_at values, matches the repo convention.
5. Fixed doc-comment reference: apps/api/src/routes/connectors.ts now
names "pollUser" (real symbol) instead of "pollUserConnectors".
6. Added apps/api/src/__tests__/connectors-routes.test.ts (4 tests):
empty user, mixed connected/needs_reauth, all-connected, malformed
userId rejection by the shared validator.
7. Added packages/db/src/__tests__/connector-health-repository.test.ts
(7 tests): needs_reauth + connected upserts, DB-side now() lock-in
(catches the convention regression Copilot flagged), COALESCE flap
preservation, null-coalescence, findByUser ordering, empty user.
8. Added apps/worker/src/__tests__/oauth-error-code.test.ts (5 tests):
real OAuthRefreshError message, unauthorized_client variant,
whitespace-tolerant JSON, null fallback on garbage / transient 503s.
CHANGELOG.md gains the missing CHECK / now() / message-format details
under the existing #377 entry.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Owner
Author
|
Addressed all 8 Copilot findings in 6bf4dc7.
16 new tests pass locally. Pushed; CI re-running. |
4 tasks
jayzalowitz
added a commit
that referenced
this pull request
May 26, 2026
* fix(p1.13): cascade ON DELETE on every legacy FK to users(id) (#413) Pre-fix, 32 of the 39 user-owned tables had a FK to users(id) without ON DELETE CASCADE. The clause wasn't standardised until the newer tables (ai_provider_settings, lifebooks, recovery_codes, model_downloads, connector_health from #377) landed. The practical effect: any "delete my account" code path either had to enumerate every table (fragile, easy to miss a new one) or got blocked by FK violations on the first row in behavioral_patterns / signals / mempalace / sessions. New migration 061-cascade-cleanup.sql does the one-time DDL backfill: for each of the 32 tables, ALTER TABLE <t> DROP CONSTRAINT IF EXISTS <t>_user_id_fkey; ALTER TABLE <t> ADD CONSTRAINT <t>_user_id_fkey FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE; The DROP uses IF EXISTS so re-runs no-op; the ADD trips 42710 (duplicate_object) on re-run which the migration runner's IDEMPOTENT_DDL_CODES set absorbs. Constraint names follow CockroachDB's default `<table>_<column>_fkey` convention — verified across the codebase as the universal pattern (no FK to users(id) is hand-named). Safety net: new E2E test packages/db/src/__tests__/cascade-cleanup.e2e.test.ts (gated on E2E=true) queries information_schema.referential_constraints after migrations run and fails if any FK to users(id) is still set to NO ACTION — catches drift if a fork ever adds a hand-named FK that the migration's conventional DROP would miss. Also exercises cascade end-to-end on behavioral_patterns (one of the 32) to prove the FK semantics actually delete child rows, not just the metadata flag. Schema-level prerequisite for #376 (delete-my-data endpoint) — without cascade, that endpoint would have either been a 30-statement manual purge or a runtime FK error on first delete. Migration number 061 leaves 060 to PR #422 (connector_health from #377). Both migrations are independent; either ordering is fine. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(p1.13): address Copilot review on PR #425 (2 doc fixes) 1. Migration 061 comment claimed "7 tables already had ON DELETE CASCADE" but the codebase actually has many more cascading FKs (mcp_*, capability_provenance_nodes, etc. from migration 027+). Reworded to enumerate the categories without an inaccurate count and to cite the grep used to find the exhaustive non-cascade target list. 2. Typo in cascade-cleanup.e2e.test.ts comment: "referencee" → "referencing column". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Stacked on #421 (P1.9 kill switch). Will rebase against main once #421 lands.
Summary
Closes #377. Pre-fix: when Google revoked SkyTwin's refresh token, the worker correctly stopped polling — but the dashboard rendered "Listening" and the user noticed days later. This PR adds the single missing user-facing signal.
Changes
DB
060-connector-health.sql:connector_health(user_id, connector_name, status, error_code, last_success_at, last_failure_at, updated_at), PRIMARY KEY (user_id, connector_name), ON DELETE CASCADE.connectorHealthRepository.upsert+findByUser. Re-exported from@skytwin/db.Worker
apps/worker/src/index.ts:pollUserConnectorsupsertsstatus='needs_reauth'on theOAuthRefreshError.permanentbranch (alongside the existing circuit-breaker force-trip).status='connected'keyed on a newthisConnectorFailedflag (not loop-widehadFailure) so a working Calendar isn't blocked by a failing Gmail.extractErrorCode()helper parsesinvalid_grant/unauthorized_client/ etc. from the error message so the banner can render conditional copy.API
apps/api/src/routes/connectors.tsmounts/api/connectorswith one endpoint:GET /:userId/status→{ connectors, anyNeedsReauth }. Session + ownership gated.Frontend
index.htmladds#connectors-bannersticky under the kill-switch banner.styles.css: amber (warning color, not red — re-auth is housekeeping, not panic); stacks below the autonomy banner when both fire.app.jsupdateConnectorsBanner()polls/connectors/:userId/statuson everynavigate()+ every 60s. Conditional copy onerrorCode. Single "Reconnect" CTA →#/connect-gmail.Verified
pnpm build✓pnpm test✓ (713 API + 301 db + full monorepo green)Test plan
needs_reauthtoconnector_health.curl /api/connectors/<uuid>/statusreturnsanyNeedsReauth: true.#/connect-gmail. Complete OAuth → next worker poll →connected→ banner disappears on next refresh.🤖 Generated with Claude Code