Local resilience: retry/backoff, process supervision, graceful degradation #11

@jayzalowitz

Context

SkyTwin runs locally on the user's desktop. Several components lose state on restart or fail ungracefully when external services (Gmail, Google Calendar) have transient errors. The system works in the happy path but degrades poorly — and for a daily-use product, "works when nothing goes wrong" isn't enough.

Claude Code estimate: ~2-3h

Current State (verified 2026-04-04)

| Component | Implementation | Gap |
| --- | --- | --- |
| Rate limiter | `Map<string, RateLimitEntry>` in `ask.ts` | Lost on restart |
| Gmail connector | Single 401 retry for expired token | No retry for 429, 500, network errors |
| Calendar connector | No retry at all | Throws on any error |
| Worker poll loop | `catch` logs error, continues next cycle | No backoff, no circuit breaker per-user |
| Health check | `GET /api/health` returns status + uptime | No liveness vs readiness distinction |
| Process supervision | `bin/skytwin-dev` runs processes with `&` | No restart-on-crash |
| Docker Compose | No `restart:` policy | Containers stay dead |
| Temporal profiles | In-memory `Map` | Lost on restart |
| IronClaw adapter | Has circuit breaker (3 failures → open) | Good, but connectors don't have this |

Proposed Change

1. Retry with exponential backoff for connectors

Shared retry wrapper in `@skytwin/core`:

```typescript
interface RetryConfig {
  maxRetries: number;          // default: 3
  baseDelayMs: number;         // default: 1000
  maxDelayMs: number;          // default: 30000
  retryableStatuses: number[]; // [429, 500, 502, 503]
}
```

Apply to Gmail, Calendar connectors, and worker HTTP POSTs. Respect `Retry-After` header on 429s.
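A minimal sketch of how such a wrapper could look, assuming connectors throw an error object carrying a numeric `status` and, when the server sent one, a `retryAfterSeconds` value. `HttpFailure`, `computeDelayMs`, `isRetryable`, `DEFAULT_RETRY`, and the injectable `sleep` parameter are hypothetical names introduced for illustration; only `RetryConfig` and `withRetry()` come from this issue.

```typescript
interface RetryConfig {
  maxRetries: number;
  baseDelayMs: number;
  maxDelayMs: number;
  retryableStatuses: number[];
}

// Assumed error shape (hypothetical): the HTTP status plus an optional
// Retry-After value already parsed into seconds.
interface HttpFailure {
  status: number;
  retryAfterSeconds?: number;
}

const DEFAULT_RETRY: RetryConfig = {
  maxRetries: 3,
  baseDelayMs: 1000,
  maxDelayMs: 30000,
  retryableStatuses: [429, 500, 502, 503],
};

function asHttpFailure(err: unknown): HttpFailure | undefined {
  if (typeof err === "object" && err !== null && typeof (err as HttpFailure).status === "number") {
    return err as HttpFailure;
  }
  return undefined;
}

// Delay before retry number `attempt` (0-based): baseDelayMs * 2^attempt,
// capped at maxDelayMs. A Retry-After value takes precedence (also capped).
function computeDelayMs(attempt: number, cfg: RetryConfig, retryAfterSeconds?: number): number {
  if (retryAfterSeconds !== undefined) {
    return Math.min(retryAfterSeconds * 1000, cfg.maxDelayMs);
  }
  return Math.min(cfg.baseDelayMs * 2 ** attempt, cfg.maxDelayMs);
}

function isRetryable(err: unknown, cfg: RetryConfig): boolean {
  const http = asHttpFailure(err);
  // Assumption: errors without an HTTP status (network failures) are retryable.
  return http === undefined ? true : cfg.retryableStatuses.includes(http.status);
}

async function withRetry<T>(
  fn: () => Promise<T>,
  cfg: RetryConfig = DEFAULT_RETRY,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((r) => setTimeout(r, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up once the retry budget is spent or the error is not retryable.
      if (attempt >= cfg.maxRetries || !isRetryable(err, cfg)) throw err;
      await sleep(computeDelayMs(attempt, cfg, asHttpFailure(err)?.retryAfterSeconds));
    }
  }
}
```

With the defaults above, a call that keeps failing with 429 is attempted four times in total (one initial call plus three retries), waiting 1s, 2s, and 4s in between. Passing `sleep` as a parameter keeps unit tests fast; adding jitter to the delay is a common refinement that this sketch omits.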

2. Per-user circuit breaker in worker

3 consecutive failures → stop polling that user for 5 minutes (exponential: 5m → 10m → 20m). Reuse pattern from IronClaw's `CircuitBreaker`.
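The policy above could be sketched as follows. This is not IronClaw's `CircuitBreaker` (its API is not shown in this issue); `PerUserCircuitBreaker` and its method names are hypothetical, and the sketch assumes that a failed probe after a cooldown reopens the breaker immediately with a doubled cooldown.

```typescript
interface BreakerState {
  consecutiveFailures: number;
  openUntil: number; // epoch ms until which the user is skipped; 0 when closed
  trips: number;     // how many times in a row the breaker has opened
}

class PerUserCircuitBreaker {
  private readonly states = new Map<string, BreakerState>();
  private readonly failureThreshold: number;
  private readonly baseCooldownMs: number;

  constructor(failureThreshold = 3, baseCooldownMs = 5 * 60 * 1000) {
    this.failureThreshold = failureThreshold;
    this.baseCooldownMs = baseCooldownMs;
  }

  // True when the user may be polled: breaker closed, or cooldown elapsed
  // (in which case the next poll acts as the half-open probe).
  canPoll(userId: string, now: number): boolean {
    const s = this.states.get(userId);
    return s === undefined || now >= s.openUntil;
  }

  // Any success fully resets the breaker for that user.
  recordSuccess(userId: string): void {
    this.states.delete(userId);
  }

  recordFailure(userId: string, now: number): void {
    const s = this.states.get(userId) ?? { consecutiveFailures: 0, openUntil: 0, trips: 0 };
    s.consecutiveFailures++;
    // Open after `failureThreshold` failures, or immediately on a failed probe.
    if (s.consecutiveFailures >= this.failureThreshold || s.trips > 0) {
      s.openUntil = now + this.baseCooldownMs * 2 ** s.trips; // 5m, 10m, 20m, ...
      s.trips++;
      s.consecutiveFailures = 0;
    }
    this.states.set(userId, s);
  }
}
```

In the worker loop, the intended use would be to check `canPoll(userId, Date.now())` before each poll and call `recordSuccess` or `recordFailure` afterwards. The sketch does not cap the cooldown; a real implementation would probably want an upper bound.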

3. Persistent rate limiter

Move from in-memory `Map` to CockroachDB. New table or rolling-window query pattern matching `spend_tracking`.
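One possible shape for the rolling-window variant, sketched against an abstract store so that it does not presume a schema. `RateLimitStore`, `tryConsume`, and the 60-requests-per-minute defaults are assumptions for illustration; the actual limit window and the `spend_tracking` query pattern are not shown in this issue.

```typescript
// Hypothetical storage interface. A real implementation would run SQL
// against CockroachDB instead of keeping events in process memory.
interface RateLimitStore {
  // Number of recorded requests for the user with a timestamp strictly after sinceMs.
  countSince(userId: string, sinceMs: number): Promise<number>;
  // Persist one request event for the user.
  record(userId: string, atMs: number): Promise<void>;
}

// Returns true and records the request if the user is under the limit
// for the trailing window; returns false otherwise.
async function tryConsume(
  store: RateLimitStore,
  userId: string,
  nowMs: number,
  limit = 60,
  windowMs = 60 * 1000,
): Promise<boolean> {
  const used = await store.countSince(userId, nowMs - windowMs);
  if (used >= limit) return false;
  await store.record(userId, nowMs);
  return true;
}
```

Because the count lives in the database, a restart does not reset it. The count and the insert are two separate calls here, so concurrent requests could both pass the check; running them inside a single transaction would close that gap.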

4. Process supervision

Update `bin/skytwin-dev` and `docker-compose.yml` with restart logic, health checks, and restart caps.
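The restart cap is independent of how `bin/skytwin-dev` is implemented, so the decision logic is sketched here in TypeScript for consistency with the rest of the codebase. `RestartBudget` is a hypothetical name, and the defaults (5 crashes within 5 minutes) are taken from the acceptance criteria in this issue.

```typescript
// Tracks crash timestamps in a sliding window and decides whether the
// supervisor may restart the process again.
class RestartBudget {
  private crashTimes: number[] = [];
  private readonly maxCrashes: number;
  private readonly windowMs: number;

  constructor(maxCrashes = 5, windowMs = 5 * 60 * 1000) {
    this.maxCrashes = maxCrashes;
    this.windowMs = windowMs;
  }

  // Records a crash at nowMs. Returns true if a restart is still allowed,
  // false once maxCrashes crashes have occurred within the window.
  recordCrash(nowMs: number): boolean {
    this.crashTimes = this.crashTimes.filter((t) => nowMs - t < this.windowMs);
    this.crashTimes.push(nowMs);
    return this.crashTimes.length < this.maxCrashes;
  }
}
```

A supervisor loop would call `recordCrash(Date.now())` each time the child exits unexpectedly, restart it while the result is `true`, and otherwise log "max restarts exceeded" and leave the process stopped.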

5. Atomic approval execution

Wrap approval response + execution in CockroachDB transaction. If execution fails after approval is recorded, user sees "approved but execution failed" rather than silent data loss. Directly addresses safety invariant #6: "Feedback flows back."

6. Persist temporal profiles

Reconstruct temporal profiles from recent decisions on startup rather than storing separately. Temporal profiles are derived data.

7. Liveness vs readiness health checks

Split `GET /api/health` into `/api/health/live` (process running) and `/api/health/ready` (DB connected, ready to serve).
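A minimal sketch of the two responses, kept free of any HTTP framework so it can be wired into whichever router `apps/api` uses. `HealthResponse`, `liveness`, `readiness`, and the body fields are hypothetical; the status codes follow the acceptance criteria (200 for live, 503 for ready until the DB connection exists).

```typescript
interface HealthResponse {
  status: 200 | 503;
  body: { status: string; uptimeSeconds: number };
}

// Liveness: the process is up and able to answer. No dependency checks.
function liveness(uptimeSeconds: number): HealthResponse {
  return { status: 200, body: { status: "alive", uptimeSeconds } };
}

// Readiness: only report 200 once the database connection is established.
function readiness(dbConnected: boolean, uptimeSeconds: number): HealthResponse {
  return dbConnected
    ? { status: 200, body: { status: "ready", uptimeSeconds } }
    : { status: 503, body: { status: "db_unavailable", uptimeSeconds } };
}
```

Keeping liveness free of dependency checks matters for supervision: a restart policy keyed on liveness should not kill a healthy process just because the database is briefly unreachable.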

Acceptance Criteria

  1. Gmail connector receives HTTP 429 → retries 3 times with exponential backoff (1s, 2s, 4s) → signal is eventually ingested (verify via decision record in DB)
  2. Gmail connector receives HTTP 429 with `Retry-After: 5` header → waits 5 seconds before first retry
  3. Calendar connector throws network error → retries 3 times → logs warning if all fail → next poll cycle proceeds normally
  4. Worker polls user whose OAuth token is permanently revoked → 3 consecutive 401s → circuit breaker opens → user skipped for 5 minutes → worker logs warning with user ID
  5. Circuit breaker resets after backoff period → user polled again on next cycle
  6. Server restarts → rate limiter state preserved → user who had 55/60 requests still has 55/60 (not reset to 0)
  7. `kill -9` on API process → `bin/skytwin-dev` restarts it within 5 seconds → API responds to `/api/health/ready` within 10 seconds
  8. 5 crashes in 5 minutes → process stays dead → script logs "max restarts exceeded"
  9. User approves action → server crashes between approval write and execution → on restart, approval record exists in DB with status "approved" (not lost)
  10. `GET /api/health/live` returns 200 immediately after process start
  11. `GET /api/health/ready` returns 503 until DB connection established, then 200
  12. Docker Compose services have `restart: unless-stopped` and health checks
  13. Server restarts → temporal profile for active user reconstructed from last 100 decisions within 2 seconds
  14. All 432 existing tests pass
  15. No safety invariant regressions in eval suite
  16. PR passes `/review` before merge

Testing Plan

| Layer | What | Count |
| --- | --- | --- |
| Unit | `withRetry()`: success, retry on 429, retry on 500, max retries exceeded, Retry-After header | +5 |
| Unit | Circuit breaker: open after 3 failures, close after backoff, half-open probe | +4 |
| Unit | `isWithinQuietHours()` edge cases | (moved to #14) |
| Integration | Gmail connector 429 → retry → success | +1 |
| Integration | Worker circuit breaker → skip user → resume | +1 |
| Integration | Approval + execution atomicity: crash simulation | +1 |
| Integration | Health check /live vs /ready during startup | +1 |
| Unit | Temporal profile reconstruction from decisions | +2 |

Priority Ordering

  1. Retry with backoff: most impactful, signals stop being lost
  2. Per-user circuit breaker: prevents cascading failures
  3. Process supervision: system stays alive
  4. Atomic approval execution: safety invariant #6
  5. Persistent rate limiter: correctness
  6. Temporal profile persistence: nice to have
  7. Health check split: infrastructure prep

Files Reference

| File | Change |
| --- | --- |
| `packages/core/src/retry.ts` | New: `withRetry()` wrapper with backoff |
| `packages/core/src/circuit-breaker.ts` | New: generalized from IronClaw's circuit breaker |
| `packages/connectors/src/gmail-connector.ts` | Wrap API calls in `withRetry()` |
| `packages/connectors/src/google-calendar-connector.ts` | Wrap API calls in `withRetry()` |
| `apps/worker/src/index.ts` | Per-user circuit breaker, retry on API POST |
| `apps/api/src/routes/ask.ts` | Move rate limiter to DB |
| `apps/api/src/routes/approvals.ts` | Transactional approval + execution |
| `apps/api/src/index.ts` | Split health check endpoints |
| `bin/skytwin-dev` | Process restart loop with backoff + cap |
| `docker-compose.yml` | Restart policies, health checks for all services |
| `packages/db/src/migrations/011-rate-limits.sql` | New: `rate_limit_entries` table |
| `packages/twin-model/src/temporal-profile-manager.ts` | Reconstruct from decisions on init |

Out of Scope

  • Distributed rate limiting across multiple nodes (local-first, single machine)
  • Redis/memcached caching layer
  • Job queue replacement for synchronous worker (sync made reliable instead)

Related


Working Context Protocol

During implementation, maintain two sources of truth to survive context compaction:

  1. Local context file: Write progress, decisions, and blockers to `.context/issue-11-resilience.md` (gitignored). Update this file after each meaningful step. On compaction, re-read this file to restore state.
  2. GitHub issue: Post progress comments on #11 at key milestones (subtask complete, blocker hit, design decision made). Reference the issue URL in your conversation so it persists across compaction.

This ensures no quality loss across compaction events — the local file has granular state, the GitHub issue has durable history.
