Make SkyTwin usable by a non-technical person end-to-end #2

@jayzalowitz

Problem

The decision engine, inference system, and safety framework are solid — but the product has no human-facing layer. A non-technical person cannot onboard, understand what the twin is doing, approve decisions, or build trust over time. The core intelligence works but it's invisible and inaccessible.

This issue tracks everything needed to go from "engineer's prototype" to "a regular person can sit down and use this."


1. Onboarding Flow

Current state: User lands on dashboard, sees "No decisions yet. Send events to the API to get started." No explanation of what a twin is.

Required:

  • Welcome screen on first visit explaining what SkyTwin does in plain language ("an AI that learns how you handle email and calendar, then starts doing it for you")
  • 3-step onboarding wizard:
    1. Explain — what the twin is, what it will and won't do, privacy guarantees
    2. Connect — Google account OAuth flow with clear explanation of permissions requested and why
    3. Set comfort level — choose initial trust tier using human language ("just watch and suggest" / "handle routine stuff" / "take action on most things") instead of OBSERVER / LOW_AUTONOMY / HIGH_AUTONOMY
  • Progressive disclosure: don't show empty pages, guide user to next step
  • First-run experience after connecting: show the twin processing its first few signals and explain what it's seeing
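
One way to keep users off empty pages is to derive the next wizard step from what they have already completed. The sketch below is only illustrative: `OnboardingState`, its field names, and `nextOnboardingStep` are assumptions, not existing SkyTwin code.

```typescript
// Hypothetical step names mirroring the 3-step wizard described above.
type OnboardingStep = "explain" | "connect" | "comfort" | "done";

// Hypothetical record of what the user has completed so far.
interface OnboardingState {
  sawExplanation: boolean;
  googleConnected: boolean;
  comfortLevelChosen: boolean;
}

// Return the first unfinished step so the UI always has somewhere to send the user.
function nextOnboardingStep(state: OnboardingState): OnboardingStep {
  if (!state.sawExplanation) return "explain";
  if (!state.googleConnected) return "connect";
  if (!state.comfortLevelChosen) return "comfort";
  return "done";
}
```

The dashboard could call this on load and redirect to the returned step until it reports "done".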

2. Fix the Approvals Workflow (Critical)

Current state: The approvals page is always empty. There's no endpoint to list pending approvals. This is the single most important UI: the trust-building loop where users see what the twin wants to do and approve or reject it.

Required:

  • DB schema for pending approval requests (or add status tracking to existing approval_requests table)
  • GET /api/approvals/:userId/pending endpoint that returns approval requests the user has not yet answered
  • GET /api/approvals/:userId/history endpoint for past decisions
  • Real-time or polling-based notification when a new approval is waiting
  • Approval cards that explain decisions in plain English:
    • "I want to archive this newsletter from TechCrunch because you've archived the last 12"
    • "Should I decline this meeting? It conflicts with your focus time and you've declined similar ones before"
  • Approve/reject with optional "tell me why" text input for corrections
  • Batch approval for low-risk routine items ("approve all 5 newsletter archives")
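
The core of the pending endpoint and the batch-approval feature is a pair of filters. The following is a minimal in-memory sketch, assuming a hypothetical `ApprovalRequest` shape (`respondedAt`, `riskLevel`, and the other fields are not taken from the real `approval_requests` schema); a real handler would run the equivalent query against the database.

```typescript
// Hypothetical row shape; the real approval_requests columns may differ.
interface ApprovalRequest {
  id: string;
  userId: string;
  summary: string; // plain-English explanation shown on the approval card
  riskLevel: "low" | "medium" | "high";
  createdAt: number; // epoch milliseconds
  respondedAt: number | null; // null until the user approves or rejects
}

// Requests for one user that have no response yet, oldest first.
function listPending(all: readonly ApprovalRequest[], userId: string): ApprovalRequest[] {
  return all
    .filter((r) => r.userId === userId && r.respondedAt === null)
    .sort((a, b) => a.createdAt - b.createdAt);
}

// Only low-risk items are offered for "approve all" batching.
function batchApprovable(pending: readonly ApprovalRequest[]): ApprovalRequest[] {
  return pending.filter((r) => r.riskLevel === "low");
}
```

Restricting batch approval to low-risk items keeps anything riskier on an individual card where the explanation is visible.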

3. Implement OAuth Token Storage

Current state: Google OAuth flow generates tokens via exchangeCode() but there's no OAuthTokenStore implementation. Tokens aren't persisted. Real connectors (Gmail, Calendar) exist but can't run.

Required:

  • Implement OAuthTokenStore backed by oauth_tokens table (migration 002 already exists)
  • Store encrypted access + refresh tokens per user per provider
  • Token refresh logic: auto-refresh expired tokens before connector polls
  • Wire OAuth callback (/api/oauth/google/callback) to persist tokens via the store
  • Wire OAuthTokenStore into GmailConnector and GoogleCalendarConnector
  • Wire stored tokens into EmailActionHandler and CalendarActionHandler (they currently expect an accessToken in step params, but nothing provides it)
  • Handle token revocation / disconnection cleanup
  • Document required Google Cloud project setup (OAuth consent screen, scopes, redirect URI)
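
As a rough illustration of the store contract and the "refresh before polling" check, here is an in-memory sketch. The `OAuthTokenStore` method names, the `StoredToken` shape, and `needsRefresh` are all assumptions made for this example; the real interface in the codebase may look different, and a production store would encrypt tokens and persist them in the `oauth_tokens` table rather than a `Map`.

```typescript
type Provider = "google";

// Hypothetical token shape; encryption is deliberately omitted in this sketch.
interface StoredToken {
  accessToken: string;
  refreshToken: string;
  expiresAt: number; // epoch milliseconds
}

// Assumed store contract, keyed per user per provider.
interface OAuthTokenStore {
  save(userId: string, provider: Provider, token: StoredToken): void;
  get(userId: string, provider: Provider): StoredToken | undefined;
  delete(userId: string, provider: Provider): void;
}

class InMemoryTokenStore implements OAuthTokenStore {
  private readonly tokens = new Map<string, StoredToken>();

  private key(userId: string, provider: Provider): string {
    return `${provider}:${userId}`;
  }

  save(userId: string, provider: Provider, token: StoredToken): void {
    this.tokens.set(this.key(userId, provider), token);
  }

  get(userId: string, provider: Provider): StoredToken | undefined {
    return this.tokens.get(this.key(userId, provider));
  }

  // Used for disconnection or revocation cleanup.
  delete(userId: string, provider: Provider): void {
    this.tokens.delete(this.key(userId, provider));
  }
}

// True when the token is expired or will expire within skewMs,
// so a connector can refresh before it polls.
function needsRefresh(token: StoredToken, now: number, skewMs: number = 60000): boolean {
  return token.expiresAt - now <= skewMs;
}
```

Passing `now` in explicitly keeps the expiry check easy to test without mocking the clock.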

4. Rewrite Twin Profile UI in Human Language

Current state: Shows database columns: domain: email, key: auto_archive, value: true, confidence: HIGH, source: inferred. Meaningless to a non-technical user.

Required:

  • Translate preferences into natural language: "I've noticed you always archive newsletters — want me to do that automatically?"
  • Translate inferences into learning narratives: "After watching 47 emails, I'm fairly confident you prefer to..."
  • Group by behavior, not by database domain:
    • "Email habits" → what the twin knows about your email preferences
    • "Calendar style" → meeting preferences, focus time, scheduling patterns
    • "Spending" → subscription and purchase tendencies
  • Show confidence as a visual indicator (a progress bar, or descriptive text like "very confident" / "still learning") instead of HIGH / MODERATE / LOW
  • Allow corrections inline: click a preference → "Actually, I only archive newsletters from these senders" → creates CONFIRMED correction
  • Show what the twin is still unsure about and how many more observations it needs
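
The translation layer could start as a lookup from (domain, key) to a phrase, plus a mapping from confidence levels to descriptive text. This is a sketch under assumptions: the `PreferenceRow` fields follow the columns quoted in "Current state", while the phrase table, the wording for MODERATE, and `describePreference` itself are hypothetical.

```typescript
type Confidence = "HIGH" | "MODERATE" | "LOW";

// Field names follow the columns currently shown in the profile UI.
interface PreferenceRow {
  domain: string;
  key: string;
  value: boolean;
  confidence: Confidence;
  source: string;
}

// Descriptive text shown instead of the raw enum values (wording is illustrative).
const CONFIDENCE_TEXT: Record<Confidence, string> = {
  HIGH: "very confident",
  MODERATE: "fairly confident",
  LOW: "still learning",
};

// Hypothetical phrase table; each known preference needs a hand-written entry.
const PHRASES = new Map<string, string>([
  ["email/auto_archive", "you archive newsletters"],
]);

function describePreference(row: PreferenceRow): string {
  const phrase = PHRASES.get(`${row.domain}/${row.key}`);
  if (phrase === undefined || !row.value) {
    // No template yet: fall back to something vague rather than showing raw columns.
    return `I'm still working out your ${row.domain} preferences.`;
  }
  return `I've noticed ${phrase} (${CONFIDENCE_TEXT[row.confidence]}).`;
}
```

For the example row from "Current state", this produces "I've noticed you archive newsletters (very confident)."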

5. Confidence & Learning Dashboard

Current state: No visibility into how well the twin is performing or how much it has learned. Detected patterns (temporal, cross-domain) are stored but invisible.

Required:

  • Overall twin confidence score: "Your twin is 73% confident across email and 41% confident on calendar"
  • Learning progress visualization: "Watched 142 emails, made 28 decisions, you corrected 3"
  • Timeline of twin milestones: "Mar 15: Started auto-archiving newsletters (you approved 5 in a row)"
  • Active hours display: "I've noticed you're most active 9am-11am and 2pm-4pm"
  • Behavioral traits in plain English: "You tend to be cautious with spending" / "You respond quickly to meeting invites"
  • Accuracy trend: "Decision accuracy this week: 91% (up from 84% last week)"
  • Wire AccuracyTracker and ContinuousEvalRunner to real data instead of stub responses in /api/evals
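
The accuracy strings above imply a simple calculation. The sketch below assumes accuracy means "decisions the user did not correct, divided by all decisions"; that definition and both function names are assumptions for illustration, not how `AccuracyTracker` is known to work.

```typescript
// Share of decisions the user did not correct, as a whole percentage.
// Returns null when there are no decisions yet, so the UI can show "still learning".
function accuracyPercent(decisions: number, corrected: number): number | null {
  if (decisions <= 0) return null;
  return Math.round(((decisions - corrected) / decisions) * 100);
}

// Week-over-week summary in the style of the example string in this section.
function formatTrend(thisWeek: number, lastWeek: number): string {
  const direction =
    thisWeek > lastWeek ? "up from" : thisWeek < lastWeek ? "down from" : "unchanged from";
  return `Decision accuracy this week: ${thisWeek}% (${direction} ${lastWeek}% last week)`;
}
```

With the sample numbers from the learning progress line (28 decisions, 3 corrected), `accuracyPercent(28, 3)` gives 89.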

6. Wire Detected Patterns into Decision Scoring

Current state: PatternDetector, TemporalAnalyzer, and CrossDomainAnalyzer detect behavioral patterns but the DecisionMaker never uses them. The twin learns things it doesn't act on.

Required:

  • Feed detected habits into candidate action scoring (if user has a habit of archiving newsletters, boost archive candidate confidence)
  • Use temporal profile in decision timing: don't auto-execute during off-hours, batch non-urgent actions
  • Use cross-domain traits in risk assessment: if user is cautious_spender, increase scrutiny on purchase-related actions
  • Use response time patterns: if user typically responds to meeting invites within 1 hour, wait before auto-declining
  • Log which patterns influenced each decision in the ExplanationRecord
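
A minimal sketch of habit-based boosting, assuming scores in the range 0 to 1 and a capped additive boost. `Candidate`, `Habit`, `ScoredCandidate`, the 0.2 cap, and `scoreWithHabits` are all hypothetical; the actual scoring model in DecisionMaker is not described in this issue.

```typescript
// Hypothetical shapes; scores and strengths are assumed to lie in [0, 1].
interface Candidate {
  action: string;
  baseScore: number;
}

interface Habit {
  name: string; // e.g. "archives_newsletters"
  action: string; // the candidate action this habit supports
  strength: number;
}

interface ScoredCandidate {
  action: string;
  score: number;
  influencedBy: string[]; // habit names, suitable for logging in an explanation
}

// Boost each candidate by the habits that match its action, capped at maxBoost.
function scoreWithHabits(
  candidates: readonly Candidate[],
  habits: readonly Habit[],
  maxBoost: number = 0.2,
): ScoredCandidate[] {
  return candidates.map((c) => {
    const matching = habits.filter((h) => h.action === c.action);
    const rawBoost = matching.reduce((sum, h) => sum + h.strength * maxBoost, 0);
    const boost = Math.min(rawBoost, maxBoost);
    return {
      action: c.action,
      score: Math.min(1, c.baseScore + boost),
      influencedBy: matching.map((h) => h.name),
    };
  });
}
```

Capping the boost means a habit can tip a close decision but cannot override a low base score on its own, and `influencedBy` gives the explanation layer the list of patterns to record.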

7. Multi-User Worker Support

Current state: Worker hardcodes userId: 'default-user' and only uses mock connectors. Can't serve multiple real users.

Required:

  • Worker polls list of active users from DB
  • Per-user connector instantiation with their OAuth tokens
  • Per-user polling intervals (respect rate limits)
  • Graceful handling of token expiry / revocation per user
  • User-scoped signal deduplication (don't re-process same email)
  • Configurable toggle between mock and real connectors per user (for testing)
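
Two pieces of the multi-user worker are easy to sketch in isolation: choosing which users are due for a poll, and deduplicating signals per user. The `ActiveUser` fields, `usersDueForPoll`, and `SignalDeduplicator` below are assumptions for illustration; a real worker would read users from the DB and persist the seen-signal set.

```typescript
// Hypothetical per-user worker config.
interface ActiveUser {
  userId: string;
  pollIntervalMs: number;
  lastPolledAt: number; // epoch milliseconds
  useMockConnectors: boolean; // per-user toggle for testing
}

// Users whose polling interval has elapsed.
function usersDueForPoll(users: readonly ActiveUser[], now: number): ActiveUser[] {
  return users.filter((u) => now - u.lastPolledAt >= u.pollIntervalMs);
}

// Tracks seen signal IDs separately for each user, so the same email ID
// under two different users is never confused.
class SignalDeduplicator {
  private readonly seen = new Map<string, Set<string>>();

  // Returns true the first time a (userId, signalId) pair is seen, false afterwards.
  markIfNew(userId: string, signalId: string): boolean {
    let ids = this.seen.get(userId);
    if (ids === undefined) {
      ids = new Set<string>();
      this.seen.set(userId, ids);
    }
    if (ids.has(signalId)) return false;
    ids.add(signalId);
    return true;
  }
}
```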

8. Error Handling & User-Friendly Messaging

Current state: Failed API calls surface HTTP errors. No human-readable error states in the UI.

Required:

  • Friendly error states: "Couldn't reach Google — I'll retry in a few minutes" instead of 502 Bad Gateway
  • Connection status indicator: show whether Gmail/Calendar connections are healthy
  • Explain when and why the twin can't act: "I need you to approve this because it's above your $50 spend limit"
  • Notification when the twin needs attention (pending approval, connection lost, unusual activity)
  • Graceful degradation: if one connector fails, keep the other running
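
Friendly error states can begin as a small mapping from HTTP status to a message. The sketch below is illustrative only: `friendlyError` and its exact wording are assumptions, and the status groupings are one plausible choice rather than a description of how the API currently fails.

```typescript
// Map an upstream HTTP status to a message a non-technical user can act on.
// `service` is a display name such as "Google".
function friendlyError(status: number, service: string): string {
  if (status === 401 || status === 403) {
    return `I've lost access to ${service}. Please reconnect it in Settings.`;
  }
  if (status === 429) {
    return `${service} asked me to slow down. I'll try again shortly.`;
  }
  if (status >= 500) {
    return `Couldn't reach ${service}. I'll retry in a few minutes.`;
  }
  return `Something went wrong talking to ${service}.`;
}
```

Treating 401 and 403 separately from 5xx matters because the first needs user action (reconnecting) while the second only needs patience.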

9. Settings in Plain English

Current state: Trust tier selector shows OBSERVER, LOW_AUTONOMY, MODERATE_AUTONOMY, HIGH_AUTONOMY, FULL_AUTONOMY. Jargon.

Required:

  • Human-friendly tier names and descriptions:
    • "Watch only" — twin observes but never acts
    • "Suggest" — twin suggests actions, you approve each one
    • "Handle routine" — twin auto-handles low-risk repetitive tasks, asks about everything else
    • "Mostly autonomous" — twin handles most things, only asks about high-risk or unusual situations
    • "Full autopilot" — twin handles everything within your policies
  • Per-domain autonomy: "Auto-handle email but always ask about calendar"
  • Spend limit configuration with clear examples: "Max $25 per action, $100 per day"
  • Data & privacy controls: what data is stored, how to export, how to delete
  • "Pause twin" button — immediately stop all auto-execution without disconnecting
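
The tier labels and the spend-limit example translate directly into data plus one check. In this sketch the pairing of each label with a tier assumes the two lists above are in the same order, and `SpendLimits` and `withinSpendLimits` are hypothetical names rather than existing policy code.

```typescript
type TrustTier =
  | "OBSERVER"
  | "LOW_AUTONOMY"
  | "MODERATE_AUTONOMY"
  | "HIGH_AUTONOMY"
  | "FULL_AUTONOMY";

// Assumes the human-friendly names map onto the tiers in listed order.
const TIER_LABELS: Record<TrustTier, string> = {
  OBSERVER: "Watch only",
  LOW_AUTONOMY: "Suggest",
  MODERATE_AUTONOMY: "Handle routine",
  HIGH_AUTONOMY: "Mostly autonomous",
  FULL_AUTONOMY: "Full autopilot",
};

// Hypothetical limits shape, e.g. "Max $25 per action, $100 per day".
interface SpendLimits {
  perActionUsd: number;
  perDayUsd: number;
}

// True only if the action fits both the per-action and the daily limit.
function withinSpendLimits(amountUsd: number, spentTodayUsd: number, limits: SpendLimits): boolean {
  return amountUsd <= limits.perActionUsd && spentTodayUsd + amountUsd <= limits.perDayUsd;
}
```

With limits of $25 per action and $100 per day, a $20 action is allowed after $50 of spending that day but not after $90.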

10. Mobile-Responsive UI

Current state: Desktop-only sidebar layout. The CSS has basic responsiveness, but it has not been tested or optimized.

Required:

  • Responsive layout that works on phone screens (approval notifications on the go)
  • Approval actions accessible via mobile (approve/reject with one tap)
  • Collapsible sidebar on mobile
  • Touch-friendly interaction targets

Definition of Done

A non-technical person can:

  1. Visit the app and understand what it does within 30 seconds
  2. Connect their Google account through a guided flow
  3. Set their comfort level in plain language
  4. Watch the twin start learning from their real email and calendar
  5. See pending approvals explained in natural language and approve/reject
  6. Visit "Twin Profile" and understand what the twin has learned about them
  7. See accuracy and confidence metrics that make sense
  8. Correct the twin when it's wrong and see it learn from corrections
  9. Adjust settings without needing to know what "trust tier" or "domain" means
  10. Trust the system enough to gradually increase autonomy

Technical Notes

  • The decision pipeline, inference engine, safety invariants, and IronClaw handlers are already implemented. This issue is about the human layer on top of working infrastructure.
  • All changes must maintain the existing safety invariants from CLAUDE.md (policy checks, explanation records, trust tiers, spend limits, risk assessment, feedback loops).
  • The existing test suite (28 tasks) must continue to pass. Add tests for new endpoints and flows.
