Local resilience: retry/backoff, process supervision, graceful degradation

## Context

SkyTwin runs locally on the user's desktop. Several components lose state on restart or fail ungracefully when external services (Gmail, Google Calendar) have transient errors. The system works in the happy path but degrades poorly — and for a daily-use product, "works when nothing goes wrong" isn't enough.

**Claude Code estimate: ~2-3h**

## Current State (verified 2026-04-04)

| Component | Implementation | Gap |
|-----------|---------------|-----|
| Rate limiter | `Map<string, RateLimitEntry>` in `ask.ts` | Lost on restart |
| Gmail connector | Single 401 retry for expired token | No retry for 429, 500, network errors |
| Calendar connector | No retry at all | Throws on any error |
| Worker poll loop | `catch` logs error, continues next cycle | No backoff, no circuit breaker per-user |
| Health check | `GET /api/health` returns status + uptime | No liveness vs readiness distinction |
| Process supervision | `bin/skytwin-dev` runs processes with `&` | No restart-on-crash |
| Docker Compose | No `restart:` policy | Containers stay dead |
| Temporal profiles | In-memory `Map` | Lost on restart |
| IronClaw adapter | Has circuit breaker (3 failures → open) | Good, but connectors don't have this |

## Proposed Change

### 1. Retry with exponential backoff for connectors

Shared retry wrapper in `@skytwin/core`:

```typescript
interface RetryConfig {
  maxRetries: number;          // default: 3
  baseDelayMs: number;         // default: 1000
  maxDelayMs: number;          // default: 30000
  retryableStatuses: number[]; // [429, 500, 502, 503]
}
```

Apply to Gmail, Calendar connectors, and worker HTTP POSTs. Respect `Retry-After` header on 429s.
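A minimal sketch of how `withRetry()` could look, under stated assumptions: the error shape (`status`, `retryAfterSeconds`), the helper `backoffDelayMs()`, and the injectable `sleep` parameter are hypothetical and not taken from the existing connectors. Errors without a `status` are treated as network errors and retried, and `Retry-After` is capped at `maxDelayMs`; both are design choices to confirm during implementation.

```typescript
interface RetryConfig {
  maxRetries: number;
  baseDelayMs: number;
  maxDelayMs: number;
  retryableStatuses: number[];
}

const DEFAULT_RETRY: RetryConfig = {
  maxRetries: 3,
  baseDelayMs: 1000,
  maxDelayMs: 30000,
  retryableStatuses: [429, 500, 502, 503],
};

// Delay before retry number `attempt` (0-based): 1s, 2s, 4s, ... capped at maxDelayMs.
// A Retry-After value (in seconds) takes precedence over the computed backoff.
function backoffDelayMs(attempt: number, cfg: RetryConfig, retryAfterSeconds?: number): number {
  if (retryAfterSeconds !== undefined) {
    return Math.min(retryAfterSeconds * 1000, cfg.maxDelayMs);
  }
  return Math.min(cfg.baseDelayMs * Math.pow(2, attempt), cfg.maxDelayMs);
}

// Assumption: connectors throw objects carrying `status` and, for 429s, `retryAfterSeconds`.
async function withRetry<T>(
  fn: () => Promise<T>,
  cfg: RetryConfig = DEFAULT_RETRY,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const info = err as { status?: number; retryAfterSeconds?: number } | null;
      const status = info?.status;
      // No status means the request never got a response (network error): retry.
      const retryable = status === undefined || cfg.retryableStatuses.indexOf(status) !== -1;
      if (!retryable || attempt >= cfg.maxRetries) throw err;
      await sleep(backoffDelayMs(attempt, cfg, info?.retryAfterSeconds));
    }
  }
}
```

Keeping the delay calculation in a pure function makes the backoff schedule unit-testable without real timers, and the injectable `sleep` lets tests run the retry loop instantly.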

### 2. Per-user circuit breaker in worker

3 consecutive failures → stop polling that user for 5 minutes (exponential: 5m → 10m → 20m). Reuse pattern from IronClaw's `CircuitBreaker`.
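One possible shape for the per-user breaker, as a sketch only. IronClaw's `CircuitBreaker` API is not shown in this issue, so the class and method names here (`UserCircuitBreaker`, `canPoll`, `recordFailure`, `recordSuccess`) are hypothetical. The sketch assumes that once the cooldown expires the next poll acts as the half-open probe, that a failed probe doubles the cooldown, and that a success resets everything. No upper cap beyond the doubling is applied here.

```typescript
interface BreakerState {
  consecutiveFailures: number;
  openUntil: number; // epoch ms; polls are skipped while now < openUntil
  trips: number;     // times the breaker has opened since the last success
}

const FAILURE_THRESHOLD = 3;
const BASE_COOLDOWN_MS = 5 * 60 * 1000;

class UserCircuitBreaker {
  private readonly states = new Map<string, BreakerState>();

  // True when the user is closed, or the cooldown has elapsed (half-open probe).
  canPoll(userId: string, now: number): boolean {
    const s = this.states.get(userId);
    return s === undefined || now >= s.openUntil;
  }

  // Any success closes the breaker and clears the backoff history.
  recordSuccess(userId: string): void {
    this.states.delete(userId);
  }

  // Opens after FAILURE_THRESHOLD consecutive failures; cooldown goes 5m, 10m, 20m, ...
  recordFailure(userId: string, now: number): void {
    const s = this.states.get(userId) ?? { consecutiveFailures: 0, openUntil: 0, trips: 0 };
    s.consecutiveFailures += 1;
    if (s.consecutiveFailures >= FAILURE_THRESHOLD) {
      s.openUntil = now + BASE_COOLDOWN_MS * Math.pow(2, s.trips);
      s.trips += 1;
    }
    this.states.set(userId, s);
  }
}
```

In the worker loop this would amount to checking `canPoll(userId, Date.now())` before each poll and calling `recordSuccess` or `recordFailure` afterwards, logging a warning with the user ID when the breaker opens.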

### 3. Persistent rate limiter

Move from in-memory `Map` to CockroachDB. New table or rolling-window query pattern matching `spend_tracking`.
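The rolling-window idea can be sketched against a small store interface. Everything named here (`RateLimitStore`, `countSince`, `record`, `tryConsume`, and the in-memory stand-in) is hypothetical; the real store would be backed by the new CockroachDB table, and the `spend_tracking` query pattern is not reproduced because it is not shown in this issue.

```typescript
// Hypothetical persistence boundary; a DB-backed version would run SQL here.
interface RateLimitStore {
  countSince(key: string, sinceMs: number): Promise<number>;
  record(key: string, atMs: number): Promise<void>;
}

// In-memory stand-in used only to make the sketch runnable.
class InMemoryRateLimitStore implements RateLimitStore {
  private readonly hits = new Map<string, number[]>();

  async countSince(key: string, sinceMs: number): Promise<number> {
    return (this.hits.get(key) ?? []).filter((t) => t > sinceMs).length;
  }

  async record(key: string, atMs: number): Promise<void> {
    const list = this.hits.get(key) ?? [];
    list.push(atMs);
    this.hits.set(key, list);
  }
}

// Allows the request only if fewer than `limit` hits fall inside the trailing window.
async function tryConsume(
  store: RateLimitStore,
  key: string,
  nowMs: number,
  limit: number,
  windowMs: number,
): Promise<boolean> {
  const used = await store.countSince(key, nowMs - windowMs);
  if (used >= limit) return false;
  await store.record(key, nowMs);
  return true;
}
```

Because `tryConsume` holds no state of its own, a restart loses nothing as long as the store is durable. In a DB-backed version the count and the insert would need to run in one transaction to avoid two concurrent requests both passing the check.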

### 4. Process supervision

Update `bin/skytwin-dev` and `docker-compose.yml` with restart logic, health checks, and restart caps.
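The restart cap is the part most likely to be gotten subtly wrong, so here is a sketch of just the decision logic. It is written in TypeScript for illustration; `bin/skytwin-dev` may well implement the same rule in shell. The names (`RestartPolicy`, `onCrash`) and the 500 ms base delay with a 5 s ceiling are assumptions, chosen so that a restart always lands within the 5-second target; the 5-crashes-in-5-minutes cap comes from the acceptance criteria.

```typescript
type RestartDecision =
  | { restart: true; delayMs: number }
  | { restart: false; reason: string };

const MAX_CRASHES = 5;                 // from acceptance criteria
const CRASH_WINDOW_MS = 5 * 60 * 1000; // from acceptance criteria
const BASE_RESTART_DELAY_MS = 500;     // assumption
const MAX_RESTART_DELAY_MS = 5000;     // assumption

class RestartPolicy {
  private crashes: number[] = [];

  // Call once per crash; returns whether to restart and how long to wait first.
  onCrash(now: number): RestartDecision {
    // Keep only crashes inside the sliding window, then count this one.
    this.crashes = this.crashes.filter((t) => now - t < CRASH_WINDOW_MS);
    this.crashes.push(now);
    if (this.crashes.length >= MAX_CRASHES) {
      return { restart: false, reason: "max restarts exceeded" };
    }
    const delayMs = Math.min(
      BASE_RESTART_DELAY_MS * Math.pow(2, this.crashes.length - 1),
      MAX_RESTART_DELAY_MS,
    );
    return { restart: true, delayMs };
  }
}
```

A sliding window means a process that crashes once a day never hits the cap, while a crash loop is stopped on the fifth crash.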

### 5. Atomic approval execution

Wrap approval response + execution in CockroachDB transaction. If execution fails after approval is recorded, user sees "approved but execution failed" rather than silent data loss. Directly addresses safety invariant #6: "Feedback flows back."

### 6. Persist temporal profiles

Reconstruct temporal profiles from recent decisions on startup rather than storing separately. Temporal profiles are derived data.

### 7. Liveness vs readiness health checks

Split `GET /api/health` into `/api/health/live` (process running) and `/api/health/ready` (DB connected, ready to serve).
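A framework-agnostic sketch of the split, assuming the routes end up at `/api/health/live` and `/api/health/ready` as in the acceptance criteria. The function name `healthResponse`, the `HealthDeps` shape, and the response bodies are hypothetical; the existing route wiring in `apps/api/src/index.ts` is not shown here.

```typescript
interface HealthDeps {
  isDbConnected: () => boolean; // assumption: a cheap, synchronous connection flag
  uptimeSeconds: () => number;
}

interface HealthResponse {
  status: number;
  body: { status: string; uptimeSeconds?: number };
}

// Returns the response for a health path, or undefined for any other path.
function healthResponse(path: string, deps: HealthDeps): HealthResponse | undefined {
  if (path === "/api/health/live") {
    // Liveness: the process is up and can answer; no dependencies are checked.
    return { status: 200, body: { status: "alive", uptimeSeconds: deps.uptimeSeconds() } };
  }
  if (path === "/api/health/ready") {
    // Readiness: only report 200 once the DB connection is established.
    return deps.isDbConnected()
      ? { status: 200, body: { status: "ready" } }
      : { status: 503, body: { status: "not_ready" } };
  }
  return undefined;
}
```

Keeping liveness free of dependency checks matters for supervision: a slow or unavailable database should mark the service not ready, not cause the supervisor to kill and restart a healthy process.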

## Acceptance Criteria

1. Gmail connector receives HTTP 429 → retries 3 times with exponential backoff (1s, 2s, 4s) → signal is eventually ingested (verify via decision record in DB)
2. Gmail connector receives HTTP 429 with `Retry-After: 5` header → waits 5 seconds before first retry
3. Calendar connector throws network error → retries 3 times → logs warning if all fail → next poll cycle proceeds normally
4. Worker polls user whose OAuth token is permanently revoked → 3 consecutive 401s → circuit breaker opens → user skipped for 5 minutes → worker logs warning with user ID
5. Circuit breaker resets after backoff period → user polled again on next cycle
6. Server restarts → rate limiter state preserved → user who had 55/60 requests still has 55/60 (not reset to 0)
7. `kill -9` on API process → `bin/skytwin-dev` restarts it within 5 seconds → API responds to `/api/health/ready` within 10 seconds
8. 5 crashes in 5 minutes → process stays dead → script logs "max restarts exceeded"
9. User approves action → server crashes between approval write and execution → on restart, approval record exists in DB with status "approved" (not lost)
10. `GET /api/health/live` returns 200 immediately after process start
11. `GET /api/health/ready` returns 503 until DB connection established, then 200
12. Docker Compose services have `restart: unless-stopped` and health checks
13. Server restarts → temporal profile for active user reconstructed from last 100 decisions within 2 seconds
14. All 432 existing tests pass
15. No safety invariant regressions in eval suite
16. PR passes \`/review\` before merge

## Testing Plan

| Layer | What | Count |
|-------|------|-------|
| Unit | `withRetry()`: success, retry on 429, retry on 500, max retries exceeded, Retry-After header | +5 |
| Unit | Circuit breaker: open after 3 failures, close after backoff, half-open probe | +4 |
| Unit | `isWithinQuietHours()` edge cases (moved to #14) | — |
| Integration | Gmail connector 429 → retry → success | +1 |
| Integration | Worker circuit breaker → skip user → resume | +1 |
| Integration | Approval + execution atomicity — crash simulation | +1 |
| Integration | Health check /live vs /ready during startup | +1 |
| Unit | Temporal profile reconstruction from decisions | +2 |

## Priority Ordering

1. **Retry with backoff** — most impactful, signals stop being lost
2. **Per-user circuit breaker** — prevents cascading failures
3. **Process supervision** — system stays alive
4. **Atomic approval execution** — safety invariant #6
5. **Persistent rate limiter** — correctness
6. **Temporal profile persistence** — nice to have
7. **Health check split** — infrastructure prep

## Files Reference

| File | Change |
|------|--------|
| `packages/core/src/retry.ts` | New: `withRetry()` wrapper with backoff |
| `packages/core/src/circuit-breaker.ts` | New: generalized from IronClaw's circuit breaker |
| `packages/connectors/src/gmail-connector.ts` | Wrap API calls in `withRetry()` |
| `packages/connectors/src/google-calendar-connector.ts` | Wrap API calls in `withRetry()` |
| `apps/worker/src/index.ts` | Per-user circuit breaker, retry on API POST |
| `apps/api/src/routes/ask.ts` | Move rate limiter to DB |
| `apps/api/src/routes/approvals.ts` | Transactional approval + execution |
| `apps/api/src/index.ts` | Split health check endpoints |
| `bin/skytwin-dev` | Process restart loop with backoff + cap |
| `docker-compose.yml` | Restart policies, health checks for all services |
| `packages/db/src/migrations/011-rate-limits.sql` | New: rate_limit_entries table |
| `packages/twin-model/src/temporal-profile-manager.ts` | Reconstruct from decisions on init |

## Out of Scope

- Distributed rate limiting across multiple nodes (local-first, single machine)
- Redis/memcached caching layer
- Job queue replacement for synchronous worker (sync made reliable instead)

## Related

- #8 — Live notifications (needs processes that stay alive)
- #13 — Desktop app (will use process supervision patterns from this issue)
- CLAUDE.md safety invariant #6: "Feedback flows back"
- Part of v0.4 epic (#12)

---

## Working Context Protocol

During implementation, maintain two sources of truth to survive context compaction:

1. **Local context file**: Write progress, decisions, and blockers to `.context/issue-11-resilience.md` (gitignored). Update this file after each meaningful step. On compaction, re-read this file to restore state.
2. **GitHub issue**: Post progress comments on [#11](https://github.com/jayzalowitz/skytwin/issues/11) at key milestones (subtask complete, blocker hit, design decision made). Reference the issue URL in your conversation so it persists across compaction: https://github.com/jayzalowitz/skytwin/issues/11

This ensures no quality loss across compaction events — the local file has granular state, the GitHub issue has durable history.
