Skip to content

P1.7 #377: user-facing surface for OAuth re-auth (stacked on P1.9)#422

Merged
jayzalowitz merged 4 commits into
mainfrom
jayzalowitz/p1.7-oauth-reauth-banner
May 26, 2026
Merged

P1.7 #377: user-facing surface for OAuth re-auth (stacked on P1.9)#422
jayzalowitz merged 4 commits into
mainfrom
jayzalowitz/p1.7-oauth-reauth-banner

Conversation

@jayzalowitz

Copy link
Copy Markdown
Owner

Stacked on #421 (P1.9 kill switch). Will rebase against main once #421 lands.

Summary

Closes #377. Pre-fix: when Google revoked SkyTwin's refresh token, the worker correctly stopped polling — but the dashboard rendered "Listening" and the user noticed days later. This PR adds the single missing user-facing signal.

Changes

DB

  • Migration 060-connector-health.sql: connector_health(user_id, connector_name, status, error_code, last_success_at, last_failure_at, updated_at), PRIMARY KEY (user_id, connector_name), ON DELETE CASCADE.
  • New connectorHealthRepository.upsert + findByUser. Re-exported from @skytwin/db.

Worker

  • apps/worker/src/index.ts:pollUserConnectors upserts status='needs_reauth' on the OAuthRefreshError.permanent branch (alongside the existing circuit-breaker force-trip).
  • Per-connector success upserts status='connected' keyed on a new thisConnectorFailed flag (not loop-wide hadFailure) so a working Calendar isn't blocked by a failing Gmail.
  • DB writes wrapped in try/catch — must not break the circuit-breaker logic.
  • New extractErrorCode() helper parses invalid_grant / unauthorized_client / etc. from the error message so the banner can render conditional copy.

API

  • New apps/api/src/routes/connectors.ts mounts /api/connectors with one endpoint: GET /:userId/status{ connectors, anyNeedsReauth }. Session + ownership gated.

Frontend

  • index.html adds #connectors-banner sticky under the kill-switch banner.
  • styles.css: amber (warning color, not red — re-auth is housekeeping, not panic); stacks below the autonomy banner when both fire.
  • app.js updateConnectorsBanner() polls /connectors/:userId/status on every navigate() + every 60s. Conditional copy on errorCode. Single "Reconnect" CTA → #/connect-gmail.

Verified

  • pnpm build
  • pnpm test ✓ (713 API + 301 db + full monorepo green)

Test plan

  • Revoke SkyTwin's access at Google → next worker poll (~60s) writes needs_reauth to connector_health.
  • curl /api/connectors/<uuid>/status returns anyNeedsReauth: true.
  • Dashboard within 60s shows amber banner: "Gmail disconnected (access was revoked or expired)…".
  • Click Reconnect → lands on #/connect-gmail. Complete OAuth → next worker poll → connected → banner disappears on next refresh.

🤖 Generated with Claude Code

jayzalowitz and others added 2 commits May 25, 2026 20:53
Closes #379. Three coordinated levers + a chrome banner give every
install the panic button CLAUDE.md launch-criterion #8 requires.

Lever 1 — operator env var:
- SKYTWIN_AUTO_EXECUTE_DISABLED=true on the API/worker process.
- Read ONCE at PolicyEvaluator construction (override via ctor option
  for tests so the env-var read isn't a hidden dependency).
- New early-return at the top of `evaluate()` sits AHEAD of every
  other check (trust tier, injection guard, autonomy, quiet hours,
  policy rules) so no downstream allow path can bypass it.
- Uses { allowed: true, requiresApproval: true } not deny — actions
  still land in the Approvals queue.

Lever 2 — per-user toggle:
- autonomy_settings.paused: true | false on the user row.
- New PUT /api/users/:userId/autonomy-pause endpoint sets/clears the
  flag, writes pausedAt + optional pausedReason on transition.
- AutonomySettings type gains paused, pausedAt, pausedReason fields
  in @skytwin/shared-types/user.ts.

Lever 3 — chrome banner:
- index.html ships a sticky #autonomy-banner above <main>.
- app.js updateAutonomyBanner() fetches GET /autonomy-state, renders
  operator + user lines independently, and shows a Resume button only
  for the user-pause line (operator pause needs an env-var change).
- Refreshed on every navigate() + every 30s; backed off when API
  known offline.
- New GET /api/users/:userId/autonomy-state returns combined state.

Settings page:
- New "Pause auto-execution" card with confirmation modal on both
  transitions and optional reason prompt.
- Hydrated from /autonomy-state on render; refreshes banner +
  on-page state together after a flip.
- Coexists with the existing "Pause everything (demote to observer)"
  button — different lever, clearer label.

Operator pause reason wins when both flags are set so the banner copy
reflects who set the pause.

Tests:
- 5 new policy-engine tests cover the operator-paused, user-paused,
  both-paused (operator wins), neither-paused regression, and
  isGloballyPaused() reporting matrix.
- 177 policy-engine tests, 713 API tests pass.

Safety Invariant #1 preserved: the new check strengthens the
single PolicyEvaluator.evaluate funnel rather than adding a parallel
path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes #377. Pre-fix silent-breakage trap: Google revokes a refresh
token, worker correctly trips the per-user circuit breaker, dashboard
keeps rendering "Listening" — user only notices days later when "did
you get my email?" surfaces it.

This PR adds the single piece of state the API can read to render a
"Gmail disconnected — Reconnect" banner:

DB (new):
- migration 060-connector-health.sql adds connector_health(user_id,
  connector_name, status, error_code, last_success_at,
  last_failure_at, updated_at). PRIMARY KEY (user_id, connector_name);
  ON DELETE CASCADE on user.
- packages/db/src/repositories/connector-health-repository.ts exposes
  upsert + findByUser. Re-exported from @skytwin/db.

Worker:
- apps/worker/src/index.ts:pollUserConnectors upserts
  status='needs_reauth' on the existing OAuthRefreshError.permanent
  branch (alongside the circuit-breaker force-trip).
- Per-connector success upserts status='connected' so a
  multi-connector user with one bad connector doesn't have a working
  one stuck. Keyed on thisConnectorFailed (not loop-wide hadFailure)
  so a working Calendar isn't blocked by a failing Gmail.
- New extractErrorCode helper parses 'invalid_grant' /
  'unauthorized_client' / etc out of the OAuthRefreshError message
  so the banner can render conditional copy.
- DB writes wrapped in try/catch — must not break the circuit-breaker
  logic; the log line is still the operator's audit trail.

API:
- apps/api/src/routes/connectors.ts mounts /api/connectors with one
  endpoint: GET /:userId/status → { connectors, anyNeedsReauth }.
- Wired into apps/api/src/index.ts with sessionAuth +
  bindUserIdParamValidator + bindUserIdParamOwnership.

Frontend:
- index.html adds #connectors-banner sticky under #autonomy-banner.
- styles.css: amber (warning color, not red — re-auth is housekeeping
  not panic); stacks below #autonomy-banner when both fire via
  body.has-autonomy-banner.has-connectors-banner.
- app.js updateConnectorsBanner() polls /connectors/:userId/status on
  every navigate() + every 60s (matches worker poll cadence).
  Conditional copy on errorCode ('invalid_grant' → "access was
  revoked or expired").
- Single Reconnect CTA → #/connect-gmail (only connector this PR
  covers); future connectors should branch off the broken row.

Build + tests:
- All builds clean (db, worker, api).
- 713 API tests, 301 db tests, full monorepo suite green.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@jayzalowitz jayzalowitz changed the base branch from jayzalowitz/p1.9-kill-switch to main May 26, 2026 01:31
…h-reauth-banner

# Conflicts:
#	CHANGELOG.md
#	apps/web/public/css/styles.css
#	apps/web/public/index.html
#	apps/web/public/js/app.js
#	apps/web/public/js/pages/settings.js
#	packages/policy-engine/src/__tests__/policy-evaluator.test.ts
#	packages/policy-engine/src/policy-evaluator.ts
Copilot AI review requested due to automatic review settings May 26, 2026 01:35

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a user-facing “needs re-auth” surface for OAuth connector failures (notably Google invalid_grant refresh-token revocations) by persisting per-(user, connector) health in the DB, exposing it via an API endpoint, and rendering a sticky dashboard banner so silent connector breakage is visible/actionable.

Changes:

  • Add connector_health table + DB repository (upsert, findByUser) and export via @skytwin/db.
  • Worker writes needs_reauth on permanent OAuth refresh failures and connected on per-connector success (best-effort writes).
  • API exposes GET /api/connectors/:userId/status; web polls it and shows an amber “Reconnect” banner.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
packages/db/src/migrations/060-connector-health.sql Creates connector_health table + index to persist per-connector health state.
packages/db/src/repositories/connector-health-repository.ts Adds repository for upserting and reading connector health snapshots.
packages/db/src/repositories/index.ts Re-exports the new connector health repository + row type.
packages/db/src/index.ts Re-exports connector health repository/type at package root.
apps/worker/src/index.ts Writes connector health on permanent OAuth failures and on per-connector success; adds error-code extraction helper.
apps/api/src/routes/connectors.ts Adds /api/connectors/:userId/status endpoint to read connector health.
apps/api/src/index.ts Mounts the new connectors router under /api/connectors.
apps/web/public/index.html Adds the connectors re-auth banner DOM.
apps/web/public/css/styles.css Styles/stacking rules for the amber connectors banner (including stacking with autonomy banner).
apps/web/public/js/app.js Implements polling + rendering for the connectors banner; adds reconnect click handler.
apps/web/public/js/pages/settings.js Contains an (unrelated) merge-conflict artifact in the autonomy pause prompt string.
CHANGELOG.md Contains unresolved merge-conflict markers in Unreleased section.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread CHANGELOG.md Outdated
Comment on lines +17 to +23
<<<<<<< HEAD
### Added (Epic D — reliability, #361)

- **OAuth re-auth user-facing surface (#377).** Pre-fix, when Google revoked SkyTwin's refresh token (user clicked "Remove third-party app" in their Google account, or inactivity aged the token out), the worker correctly tripped the per-user circuit breaker and stopped hammering Google's token endpoint — but the dashboard kept rendering "Listening" and the user noticed days later when "did you get my email?" surfaced the silent breakage. This PR adds the single missing piece: a user-facing signal that something needs re-auth. New `connector_health` table (migration `060-connector-health.sql`) with one row per `(user_id, connector_name)` records `status: 'connected' | 'needs_reauth' | 'disabled'`, `error_code`, and `last_success_at` / `last_failure_at`. New `connectorHealthRepository.upsert` writes from the worker on every poll outcome — `needs_reauth` on the existing `OAuthRefreshError.permanent === true` branch (alongside the circuit-breaker force-trip), `connected` per-connector on success so a multi-connector user with one bad connector doesn't have a working one stuck. New `extractErrorCode()` parses the Google `error` field out of the OAuthRefreshError message so the banner can render conditional copy ("invalid_grant" → "access was revoked or expired"). New `GET /api/connectors/:userId/status` (apps/api/src/routes/connectors.ts) returns `{ connectors: {name: {status, errorCode, lastSuccessAt, lastFailureAt}}, anyNeedsReauth }`. New amber chrome banner (sticks below the red kill-switch banner if both fire) reads from the endpoint on every `navigate()` + every 60s and shows a single "Reconnect" CTA that jumps to `#/connect-gmail`. Banner is not dismissible — the worker has stopped doing work for this user and they need to fix it. CLAUDE.md launch criterion: "silent breakage that masquerades as normal operation" is the single worst class of bug for a delegated-action product; this closes it.

=======
>>>>>>> origin/main
Comment thread apps/web/public/js/pages/settings.js Outdated
Comment on lines +1052 to +1056
<<<<<<< HEAD
reason = window.prompt('Optional: why are you pausing? (saved with the audit log; leave blank to skip)', '') || undefined;
=======
reason = window.prompt('Optional: why are you pausing? (stored on your user record; leave blank to skip)', '') || undefined;
>>>>>>> origin/main
Comment thread apps/worker/src/index.ts Outdated
Comment on lines +136 to +147
* Pluck the Google `error` field out of an OAuthRefreshError message
* (e.g. `"invalid_grant"`, `"unauthorized_client"`) so the dashboard
* banner can render conditional copy ("revoked" vs "client misconfig").
* Returns null when the message has no recognisable code so the
* caller can default to a safe fallback.
*
* The OAuthRefreshError message format is `"OAuth refresh failed: <code>
* <description?>"` — see packages/connectors/src/oauth/google-oauth.ts.
*/
function extractErrorCode(message: string): string | null {
const match = message.match(/refresh failed:\s*([a-z_]+)/i);
return match?.[1] ?? null;
Comment on lines +25 to +33
CREATE TABLE IF NOT EXISTS connector_health (
user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
connector_name STRING NOT NULL,
status STRING NOT NULL,
error_code STRING,
last_success_at TIMESTAMPTZ,
last_failure_at TIMESTAMPTZ,
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (user_id, connector_name)
Comment on lines +51 to +63
const now = new Date();
await query(
`INSERT INTO connector_health (
user_id, connector_name, status, error_code,
last_success_at, last_failure_at, updated_at
)
VALUES ($1, $2, $3, $4, $5, $6, $7)
ON CONFLICT (user_id, connector_name) DO UPDATE SET
status = EXCLUDED.status,
error_code = EXCLUDED.error_code,
last_success_at = COALESCE(EXCLUDED.last_success_at, connector_health.last_success_at),
last_failure_at = COALESCE(EXCLUDED.last_failure_at, connector_health.last_failure_at),
updated_at = EXCLUDED.updated_at`,
Comment thread apps/api/src/routes/connectors.ts Outdated
* can render a "Gmail disconnected — Reconnect" banner when a user's
* refresh token has been revoked at the OAuth provider or otherwise
* permanently invalidated. The worker writes the underlying state on
* every poll outcome — see apps/worker/src/index.ts:pollUserConnectors.
Comment on lines +28 to +52
router.get('/:userId/status', async (req, res, next) => {
try {
const { userId } = req.params as { userId: string };
const rows = await connectorHealthRepository.findByUser(userId);
const connectors: Record<string, {
status: 'connected' | 'needs_reauth' | 'disabled';
errorCode: string | null;
lastSuccessAt: string | null;
lastFailureAt: string | null;
}> = {};
let anyNeedsReauth = false;
for (const row of rows) {
connectors[row.connector_name] = {
status: row.status,
errorCode: row.error_code,
lastSuccessAt: row.last_success_at?.toISOString() ?? null,
lastFailureAt: row.last_failure_at?.toISOString() ?? null,
};
if (row.status === 'needs_reauth') anyNeedsReauth = true;
}
res.json({ userId, connectors, anyNeedsReauth });
} catch (error) {
next(error);
}
});
Comment on lines +35 to +87
export const connectorHealthRepository = {
/**
* Upsert the connector's current health snapshot. On 'connected'
* the caller should pass `lastSuccessAt: new Date()`; on
* 'needs_reauth' (or 'disabled'), `lastFailureAt`. The opposite
* timestamp is preserved across the upsert so a flap doesn't
* destroy the last-known-good marker.
*/
async upsert(input: {
userId: string;
connectorName: string;
status: ConnectorHealthRow['status'];
errorCode?: string | null;
lastSuccessAt?: Date | null;
lastFailureAt?: Date | null;
}): Promise<void> {
const now = new Date();
await query(
`INSERT INTO connector_health (
user_id, connector_name, status, error_code,
last_success_at, last_failure_at, updated_at
)
VALUES ($1, $2, $3, $4, $5, $6, $7)
ON CONFLICT (user_id, connector_name) DO UPDATE SET
status = EXCLUDED.status,
error_code = EXCLUDED.error_code,
last_success_at = COALESCE(EXCLUDED.last_success_at, connector_health.last_success_at),
last_failure_at = COALESCE(EXCLUDED.last_failure_at, connector_health.last_failure_at),
updated_at = EXCLUDED.updated_at`,
[
input.userId,
input.connectorName,
input.status,
input.errorCode ?? null,
input.lastSuccessAt ?? null,
input.lastFailureAt ?? null,
now,
],
);
},

/** Return every connector health row for a given user. */
async findByUser(userId: string): Promise<ConnectorHealthRow[]> {
const result = await query<ConnectorHealthRow>(
`SELECT user_id, connector_name, status, error_code,
last_success_at, last_failure_at, updated_at
FROM connector_health
WHERE user_id = $1
ORDER BY connector_name`,
[userId],
);
return result.rows;
},
8 findings from copilot-pull-request-reviewer (2 critical + 6 substantive):

1. CRITICAL — Resolved unresolved merge-conflict markers in CHANGELOG.md
   (line 17-23) and apps/web/public/js/pages/settings.js (line 1052-1056).
   The rebase onto main left both files with HEAD/origin markers intact;
   I missed them on the previous push.

2. CRITICAL — extractErrorCode() regex didn't match the actual
   OAuthRefreshError format. The connector throws "Google OAuth token
   refresh failed (permanent|transient): <status> <body>" where <body>
   is the raw Google token-endpoint response (JSON like
   {"error":"invalid_grant","error_description":"..."}). The old
   /refresh failed:\s*([a-z_]+)/i would never have matched because of
   the (permanent|transient) tag before the colon. Switched to
   /"error"\s*:\s*"([a-z_]+)"/i which parses the JSON field directly.
   Extracted to apps/worker/src/oauth-error-code.ts so it can be tested
   without booting the worker's runtime side effects.

3. Added CHECK constraint on connector_health.status. The TS row type
   pins it to a narrow union but the column was STRING — DB-level
   constraint enforces the invariant for direct SQL writes too.

4. Switched connectorHealthRepository.upsert to DB-side now() instead
   of new Date(). Multi-node deployments can't get clock-skewed
   updated_at values, matches the repo convention.

5. Fixed doc-comment reference: apps/api/src/routes/connectors.ts now
   names "pollUser" (real symbol) instead of "pollUserConnectors".

6. Added apps/api/src/__tests__/connectors-routes.test.ts (4 tests):
   empty user, mixed connected/needs_reauth, all-connected, malformed
   userId rejection by the shared validator.

7. Added packages/db/src/__tests__/connector-health-repository.test.ts
   (7 tests): needs_reauth + connected upserts, DB-side now() lock-in
   (catches the convention regression Copilot flagged), COALESCE flap
   preservation, null-coalescence, findByUser ordering, empty user.

8. Added apps/worker/src/__tests__/oauth-error-code.test.ts (5 tests):
   real OAuthRefreshError message, unauthorized_client variant,
   whitespace-tolerant JSON, null fallback on garbage / transient 503s.

CHANGELOG.md gains the missing CHECK / now() / message-format details
under the existing #377 entry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@jayzalowitz

Copy link
Copy Markdown
Owner Author

Addressed all 8 Copilot findings in 6bf4dc7.

# Finding Resolution
1 CHANGELOG.md merge-conflict markers Removed; kept the #377 entry
2 settings.js merge-conflict markers Removed; kept the "stored on your user record" copy
3 extractErrorCode() regex doesn't match OAuthRefreshError format Rewrote against the real "error":"..." JSON field; extracted to apps/worker/src/oauth-error-code.ts with 5 new tests
4 No CHECK constraint on connector_health.status Added CHECK (status IN ('connected','needs_reauth','disabled')) in migration 060
5 updated_at uses app time via new Date() Switched to DB-side now() in both INSERT and ON CONFLICT branches
6 Doc ref to pollUserConnectors (wrong) Updated to real symbol pollUser in apps/worker/src/index.ts
7 No API tests for the connectors route Added apps/api/src/__tests__/connectors-routes.test.ts — 4 tests covering empty, mixed, all-connected, malformed userId rejection
8 No DB tests for connectorHealthRepository Added packages/db/src/__tests__/connector-health-repository.test.ts — 7 tests covering upsert variants, the now() invariant, COALESCE flap preservation, findByUser ordering

16 new tests pass locally. Pushed; CI re-running.

@jayzalowitz jayzalowitz merged commit 5338f93 into main May 26, 2026
12 checks passed
@jayzalowitz jayzalowitz deleted the jayzalowitz/p1.7-oauth-reauth-banner branch May 26, 2026 02:18
jayzalowitz added a commit that referenced this pull request May 26, 2026
* fix(p1.13): cascade ON DELETE on every legacy FK to users(id) (#413)

Pre-fix, 32 of the 39 user-owned tables had a FK to users(id) without
ON DELETE CASCADE. The clause wasn't standardised until the newer
tables (ai_provider_settings, lifebooks, recovery_codes,
model_downloads, connector_health from #377) landed. The practical
effect: any "delete my account" code path either had to enumerate
every table (fragile, easy to miss a new one) or got blocked by FK
violations on the first row in behavioral_patterns / signals /
mempalace / sessions.

New migration 061-cascade-cleanup.sql does the one-time DDL backfill:
for each of the 32 tables,

  ALTER TABLE <t> DROP CONSTRAINT IF EXISTS <t>_user_id_fkey;
  ALTER TABLE <t> ADD  CONSTRAINT <t>_user_id_fkey
    FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE;

The DROP uses IF EXISTS so re-runs no-op; the ADD trips 42710
(duplicate_object) on re-run which the migration runner's
IDEMPOTENT_DDL_CODES set absorbs.

Constraint names follow CockroachDB's default `<table>_<column>_fkey`
convention — verified across the codebase as the universal pattern
(no FK to users(id) is hand-named).

Safety net: new E2E test packages/db/src/__tests__/cascade-cleanup.e2e.test.ts
(gated on E2E=true) queries information_schema.referential_constraints
after migrations run and fails if any FK to users(id) is still set to
NO ACTION — catches drift if a fork ever adds a hand-named FK that
the migration's conventional DROP would miss. Also exercises cascade
end-to-end on behavioral_patterns (one of the 32) to prove the FK
semantics actually delete child rows, not just the metadata flag.

Schema-level prerequisite for #376 (delete-my-data endpoint) —
without cascade, that endpoint would have either been a 30-statement
manual purge or a runtime FK error on first delete.

Migration number 061 leaves 060 to PR #422 (connector_health from
#377). Both migrations are independent; either ordering is fine.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(p1.13): address Copilot review on PR #425 (2 doc fixes)

1. Migration 061 comment claimed "7 tables already had ON DELETE
   CASCADE" but the codebase actually has many more cascading FKs
   (mcp_*, capability_provenance_nodes, etc. from migration 027+).
   Reworded to enumerate the categories without an inaccurate count
   and to cite the grep used to find the exhaustive non-cascade
   target list.

2. Typo in cascade-cleanup.e2e.test.ts comment: "referencee" →
   "referencing column".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

P1.7 OAuth re-auth has no user-facing surface

2 participants