Acceptance Criteria
Gmail connector receives HTTP 429 → retries 3 times with exponential backoff (1s, 2s, 4s) → signal is eventually ingested (verify via decision record in DB)
Gmail connector receives HTTP 429 with `Retry-After: 5` header → waits 5 seconds before first retry
Calendar connector throws network error → retries 3 times → logs warning if all fail → next poll cycle proceeds normally
Worker polls user whose OAuth token is permanently revoked → 3 consecutive 401s → circuit breaker opens → user skipped for 5 minutes → worker logs warning with user ID
Circuit breaker resets after backoff period → user polled again on next cycle
Server restarts → rate limiter state preserved → user who had 55/60 requests still has 55/60 (not reset to 0)
`kill -9` on API process → `bin/skytwin-dev` restarts it within 5 seconds → API responds to `/health/ready` within 10 seconds
5 crashes in 5 minutes → process stays dead → script logs "max restarts exceeded"
User approves action → server crashes between approval write and execution → on restart, approval record exists in DB with status "approved" (not lost)
`GET /api/health/live` returns 200 immediately after process start
`GET /api/health/ready` returns 503 until DB connection established, then 200
Docker Compose services have `restart: unless-stopped` and health checks
Server restarts → temporal profile for active user reconstructed from last 100 decisions within 2 seconds
All 432 existing tests pass
No safety invariant regressions in eval suite
PR passes `/review` before merge
Testing Plan
| Layer | What | Count |
| --- | --- | --- |
| Unit | `withRetry()`: success, retry on 429, retry on 500, max retries exceeded, Retry-After header | +5 |
| Unit | Circuit breaker: open after 3 failures, close after backoff, half-open probe | |
Context
SkyTwin runs locally on the user's desktop. Several components lose state on restart or fail ungracefully when external services (Gmail, Google Calendar) have transient errors. The system works in the happy path but degrades poorly — and for a daily-use product, "works when nothing goes wrong" isn't enough.
Claude Code estimate: ~2-3h
Current State (verified 2026-04-04)
Proposed Change
1. Retry with exponential backoff for connectors
Shared retry wrapper in `@skytwin/core`:
```typescript
interface RetryConfig {
  maxRetries: number;          // default: 3
  baseDelayMs: number;         // default: 1000
  maxDelayMs: number;          // default: 30000
  retryableStatuses: number[]; // [429, 500, 502, 503]
}
```
Apply to Gmail, Calendar connectors, and worker HTTP POSTs. Respect `Retry-After` header on 429s.
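A minimal sketch of how such a wrapper could behave, assuming the `RetryConfig` shape above. The `HttpResult` type, the `retryDelayMs` helper, and the injectable `sleep` parameter are hypothetical names introduced for illustration, not existing `@skytwin/core` APIs. Thrown network errors are left out to keep the sketch short; a real wrapper would also need to catch and retry those.

```typescript
interface RetryConfig {
  maxRetries: number;
  baseDelayMs: number;
  maxDelayMs: number;
  retryableStatuses: number[];
}

const DEFAULT_RETRY: RetryConfig = {
  maxRetries: 3,
  baseDelayMs: 1000,
  maxDelayMs: 30000,
  retryableStatuses: [429, 500, 502, 503],
};

// Hypothetical response shape so the sketch does not depend on a specific HTTP client.
interface HttpResult {
  status: number;
  retryAfterSeconds?: number; // parsed from the Retry-After header, if present
}

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxDelayMs.
// A Retry-After value, when supplied, takes precedence over the computed backoff.
function retryDelayMs(attempt: number, cfg: RetryConfig, retryAfterSeconds?: number): number {
  if (retryAfterSeconds !== undefined) return retryAfterSeconds * 1000;
  return Math.min(cfg.baseDelayMs * 2 ** attempt, cfg.maxDelayMs);
}

async function withRetry<T extends HttpResult>(
  call: () => Promise<T>,
  config: Partial<RetryConfig> = {},
  // Injectable so tests can avoid real waiting (an assumption of this sketch).
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  const cfg: RetryConfig = { ...DEFAULT_RETRY, ...config };
  for (let attempt = 0; ; attempt++) {
    const res = await call();
    const retryable = cfg.retryableStatuses.includes(res.status);
    // Return on success, on a non-retryable status, or once retries are exhausted.
    if (!retryable || attempt >= cfg.maxRetries) return res;
    const retryAfter = res.status === 429 ? res.retryAfterSeconds : undefined;
    await sleep(retryDelayMs(attempt, cfg, retryAfter));
  }
}
```

With the defaults, three consecutive 429 responses produce waits of 1s, 2s, and 4s before the fourth attempt, which matches the backoff sequence named in the acceptance criteria.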
2. Per-user circuit breaker in worker
3 consecutive failures → stop polling that user for 5 minutes (exponential: 5m → 10m → 20m). Reuse pattern from IronClaw's `CircuitBreaker`.
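The per-user breaker described above could look roughly like the following. This is an illustrative sketch, not IronClaw's `CircuitBreaker`: the class name, method names, and the 20-minute cap on the cooldown are assumptions. The cooldown doubles each time a probe fails after the breaker has already opened, giving 5m, 10m, 20m.

```typescript
interface BreakerState {
  consecutiveFailures: number;
  openUntil: number; // epoch ms; 0 while the breaker has never opened
  trips: number;     // times the breaker has opened without an intervening success
}

class UserCircuitBreaker {
  private readonly states = new Map<string, BreakerState>();
  private readonly failureThreshold: number;
  private readonly baseCooldownMs: number;
  private readonly maxCooldownMs: number;

  constructor(
    failureThreshold = 3,
    baseCooldownMs = 5 * 60 * 1000,
    maxCooldownMs = 20 * 60 * 1000, // assumed cap; the issue only lists 5m, 10m, 20m
  ) {
    this.failureThreshold = failureThreshold;
    this.baseCooldownMs = baseCooldownMs;
    this.maxCooldownMs = maxCooldownMs;
  }

  // True when the user may be polled: breaker closed, or cooldown elapsed (half-open probe).
  canPoll(userId: string, now: number = Date.now()): boolean {
    const s = this.states.get(userId);
    return s === undefined || now >= s.openUntil;
  }

  // Any success fully resets the user's breaker.
  recordSuccess(userId: string): void {
    this.states.delete(userId);
  }

  recordFailure(userId: string, now: number = Date.now()): void {
    const s = this.states.get(userId) ?? { consecutiveFailures: 0, openUntil: 0, trips: 0 };
    s.consecutiveFailures++;
    // After the first trip, a single failed probe re-opens the breaker immediately.
    const probeFailed = s.trips > 0;
    if (probeFailed || s.consecutiveFailures >= this.failureThreshold) {
      const cooldown = Math.min(this.baseCooldownMs * 2 ** s.trips, this.maxCooldownMs);
      s.openUntil = now + cooldown;
      s.trips++;
      s.consecutiveFailures = 0;
    }
    this.states.set(userId, s);
  }
}
```

In the worker loop, the idea would be to call `canPoll(userId)` before each poll and skip (with a warning that includes the user ID) when it returns `false`.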
3. Persistent rate limiter
Move from in-memory `Map` to CockroachDB. New table or rolling-window query pattern matching `spend_tracking`.
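One way to picture the rolling-window variant is to put the persistence behind a small interface and keep the limit check independent of where the events live. The `RateLimitStore` interface and `tryConsume` function below are hypothetical; a CockroachDB-backed store would implement them with an `INSERT` and a windowed `COUNT(*)` query, and the `spend_tracking` schema is not assumed here.

```typescript
// Hypothetical persistence boundary. Because state lives behind this interface
// rather than in a process-local Map, it survives a server restart.
interface RateLimitStore {
  // Number of recorded requests for the user with timestamp strictly after sinceMs.
  countSince(userId: string, sinceMs: number): Promise<number>;
  record(userId: string, atMs: number): Promise<void>;
}

// Returns true and records the request if the user is under the limit for the
// rolling window, false otherwise.
// Note: count-then-insert is not atomic in this sketch; a database-backed
// version would run both statements inside one transaction.
async function tryConsume(
  store: RateLimitStore,
  userId: string,
  limit: number,
  windowMs: number,
  now: number = Date.now(),
): Promise<boolean> {
  const used = await store.countSince(userId, now - windowMs);
  if (used >= limit) return false;
  await store.record(userId, now);
  return true;
}
```

Under this shape, a user who had made 55 of 60 requests before a restart still has 55 recorded events in the store afterwards, so only 5 more calls succeed within the same window.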
4. Process supervision
Update `bin/skytwin-dev` and `docker-compose.yml` with restart logic, health checks, and restart caps.
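The restart cap can be expressed as a sliding-window budget. The sketch below is written in TypeScript for consistency with the rest of this issue, although `bin/skytwin-dev` may well be a shell script; the `RestartBudget` class is a hypothetical name, and it reads "5 crashes in 5 minutes" as meaning the fifth crash inside the window is the one that is not restarted.

```typescript
class RestartBudget {
  private crashTimes: number[] = [];
  private readonly maxCrashes: number;
  private readonly windowMs: number;

  constructor(maxCrashes = 5, windowMs = 5 * 60 * 1000) {
    this.maxCrashes = maxCrashes;
    this.windowMs = windowMs;
  }

  // Records a crash at `now` and returns true if the supervisor should restart
  // the process, false once the crash count inside the window reaches the cap.
  shouldRestart(now: number = Date.now()): boolean {
    // Forget crashes that have aged out of the window.
    this.crashTimes = this.crashTimes.filter((t) => now - t < this.windowMs);
    this.crashTimes.push(now);
    return this.crashTimes.length < this.maxCrashes;
  }
}
```

A supervisor loop would call `shouldRestart()` each time the child exits unexpectedly and log "max restarts exceeded" when it returns `false`.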
5. Atomic approval execution
Wrap approval response + execution in CockroachDB transaction. If execution fails after approval is recorded, user sees "approved but execution failed" rather than silent data loss. Directly addresses safety invariant #6: "Feedback flows back."
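The ordering that matters here is that the approval is durable before execution starts, and that an execution failure is written back rather than swallowed. A rough sketch, assuming a hypothetical `ApprovalStore` interface; only the "approved" status comes from this issue, while "executed" and "execution_failed" are illustrative names. In a database-backed version each `setStatus` call would be its own committed write.

```typescript
type ApprovalStatus = "approved" | "executed" | "execution_failed";

// Hypothetical persistence boundary for approval records.
interface ApprovalStore {
  setStatus(approvalId: string, status: ApprovalStatus, error?: string): Promise<void>;
}

async function approveAndExecute(
  store: ApprovalStore,
  approvalId: string,
  execute: () => Promise<void>,
): Promise<ApprovalStatus> {
  // Persist the approval first, so a crash during execution cannot lose it.
  await store.setStatus(approvalId, "approved");
  try {
    await execute();
  } catch (err) {
    // Surface "approved but execution failed" instead of failing silently.
    const message = err instanceof Error ? err.message : String(err);
    await store.setStatus(approvalId, "execution_failed", message);
    return "execution_failed";
  }
  await store.setStatus(approvalId, "executed");
  return "executed";
}
```

If the process dies between the first write and `execute()`, the record is left in "approved", which a startup sweep could then pick up and either retry or report.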
6. Persist temporal profiles
Reconstruct temporal profiles from recent decisions on startup rather than storing separately. Temporal profiles are derived data.
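Because the profile is derived data, rebuilding it is a pure function over recent decision records. The sketch below is purely illustrative: this issue does not define what a temporal profile contains, so the `DecisionRecord` and `TemporalProfile` shapes (an hour-of-day histogram in UTC) are assumptions. The default limit of 100 decisions comes from the acceptance criteria.

```typescript
// Hypothetical shapes; the real decision and profile types are not shown in this issue.
interface DecisionRecord {
  userId: string;
  decidedAt: Date;
}

interface TemporalProfile {
  userId: string;
  decisionsByHourUtc: number[]; // 24 buckets, index = hour of day in UTC
  sampleSize: number;
}

function rebuildTemporalProfile(
  userId: string,
  decisions: DecisionRecord[],
  limit = 100,
): TemporalProfile {
  // Keep this user's most recent `limit` decisions; filter() copies, so the input is not mutated.
  const recent = decisions
    .filter((d) => d.userId === userId)
    .sort((a, b) => b.decidedAt.getTime() - a.decidedAt.getTime())
    .slice(0, limit);

  const byHour: number[] = new Array<number>(24).fill(0);
  for (const d of recent) {
    const hour = d.decidedAt.getUTCHours();
    byHour[hour] = (byHour[hour] ?? 0) + 1;
  }
  return { userId, decisionsByHourUtc: byHour, sampleSize: recent.length };
}
```

On startup, the server would load the latest decisions for each active user from the database and pass them through a function like this instead of reading a separately stored profile.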
7. Liveness vs readiness health checks
Split `GET /api/health` into `/health/live` (process running) and `/health/ready` (DB connected, ready to serve).
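The difference between the two endpoints can be captured in a small pure function that a route handler would wrap. This is a sketch under assumptions: `healthResponse` and the response bodies are hypothetical, and the paths use the `/api/health/...` form that appears in the acceptance criteria.

```typescript
interface HealthResult {
  status: number;
  body: { status: string };
}

// Liveness only says the process is up; readiness also requires the DB connection.
function healthResponse(path: string, dbConnected: boolean): HealthResult {
  switch (path) {
    case "/api/health/live":
      return { status: 200, body: { status: "live" } };
    case "/api/health/ready":
      return dbConnected
        ? { status: 200, body: { status: "ready" } }
        : { status: 503, body: { status: "not_ready" } };
    default:
      return { status: 404, body: { status: "not_found" } };
  }
}
```

Keeping liveness free of dependency checks means a supervisor does not restart a healthy process just because the database is briefly unavailable; readiness is what gates traffic.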
Acceptance Criteria
Testing Plan
Priority Ordering
Files Reference
Out of Scope
Related
Working Context Protocol
During implementation, maintain two sources of truth to survive context compaction:
Local context file: Write progress, decisions, and blockers to `.context/issue-11-resilience.md` (gitignored). Update this file after each meaningful step. On compaction, re-read this file to restore state.
This ensures no quality loss across compaction events: the local file has granular state, and the GitHub issue has durable history.