feat: real OpenClaw execution, 6 new domains, expanded onboarding #7

Merged
jayzalowitz merged 23 commits into main from jayzalowitz/real-openclaw-grandma-mode
Apr 2, 2026

@jayzalowitz

Owner

Summary

  • Real LLM execution: OpenClaw bridge server connects SkyTwin to Ollama (gemma4 default), with trust-ranked adapter fallback chain (IronClaw → Direct → OpenClaw)
  • 6 new domains: Finance, Smart Home, Task Management, Social Media, Documents, Health — each with situation classifiers, candidate generators, IronClaw handlers, and 8 eval scenarios (48 total)
  • Expanded onboarding: 5-step flow with domain selection (10 toggleable cards), per-domain preference seeding, and grandma-friendly trust tier picker
  • Execution fixes: Direct adapter throws on missing handlers (enables fallback), all routes use proper TwinRepositoryAdapter, policy adapter filters malformed rules, all IDs use crypto.randomUUID()
  • 60+ action types across OPENCLAW_SKILLS (up from ~21), 3 seed users, Desktop app scaffold (Electron)

Test plan

  • All 353 tests pass (pnpm test)
  • All 16 packages build clean (pnpm build)
  • Events processed end-to-end for all 12 situation types via API
  • OpenClaw → Ollama gemma4 execution verified (adapter fallback working)
  • whatWouldIDo predictions return candidates with alternatives for new domains
  • New user onboarding flow tested in browser (5 steps, domain selection, preference seeding)
  • Policy enforcement verified: suggest-tier users get approval gates
  • Database seeded with 3 users and cross-domain decisions

🤖 Generated with Claude Code

jayzalowitz and others added 23 commits April 1, 2026 10:20
Pure logic engine that evaluates approval stats to determine tier
promotion/regression. Promotion thresholds: OBSERVER→SUGGEST (10
consecutive, 80% ratio), SUGGEST→LOW_AUTONOMY (20, 85%),
LOW_AUTONOMY→MODERATE_AUTONOMY (50, 90%). HIGH_AUTONOMY requires
explicit opt-in. Regression on: 3+ recent rejections, critical undo,
or 30%+ rejection ratio. OBSERVER is the floor.

Includes trust_tier_audit table, repository, and 21 unit tests covering
all promotion paths, regression triggers, and combined evaluation.
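
The promotion thresholds above can be sketched as a small pure function. This is only an illustration: the `ApprovalStats` shape, the `evaluatePromotion` name, and the inclusive (`>=`) comparisons are assumptions, and the regression rules are left out.

```typescript
// Hypothetical sketch of the promotion rules described above.
// Type and function names are assumptions, not the project's actual API.
type TrustTier =
  | "OBSERVER"
  | "SUGGEST"
  | "LOW_AUTONOMY"
  | "MODERATE_AUTONOMY"
  | "HIGH_AUTONOMY";

interface ApprovalStats {
  consecutiveApprovals: number;
  approvalRatio: number; // fraction in the range 0..1
}

interface PromotionRule {
  next: TrustTier;
  minConsecutive: number;
  minRatio: number;
}

// Thresholds copied from the commit message.
const PROMOTION_RULES: Partial<Record<TrustTier, PromotionRule>> = {
  OBSERVER: { next: "SUGGEST", minConsecutive: 10, minRatio: 0.8 },
  SUGGEST: { next: "LOW_AUTONOMY", minConsecutive: 20, minRatio: 0.85 },
  LOW_AUTONOMY: { next: "MODERATE_AUTONOMY", minConsecutive: 50, minRatio: 0.9 },
  // No automatic rule for MODERATE_AUTONOMY: HIGH_AUTONOMY requires explicit opt-in.
};

function evaluatePromotion(current: TrustTier, stats: ApprovalStats): TrustTier {
  const rule = PROMOTION_RULES[current];
  if (!rule) return current;
  // Assumption: both thresholds are inclusive.
  const eligible =
    stats.consecutiveApprovals >= rule.minConsecutive &&
    stats.approvalRatio >= rule.minRatio;
  return eligible ? rule.next : current;
}
```

Keeping the rules in a lookup table means a tier with no entry simply never auto-promotes, which matches the opt-in requirement for HIGH_AUTONOMY.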

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds expires_at and batch_id columns to approval_requests. Expiry is
urgency-based: immediate=15min, normal=24h, low=72h. ApprovalRouter
class computes expiry, checks expiration, and supports batch
approve/reject. Repository gains expirePending() for worker cron
and batchRespond() for bulk operations.

Migration uses safe 3-step pattern. 14 new unit tests.
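
A minimal sketch of the urgency-based expiry described above. The durations come from the commit message; the function names and the rule that a request counts as expired at exactly its `expiresAt` instant are assumptions, and the real ApprovalRouter may differ.

```typescript
// Hypothetical sketch: urgency-based expiry for approval requests.
type Urgency = "immediate" | "normal" | "low";

const MINUTE_MS = 60_000;
const HOUR_MS = 60 * MINUTE_MS;

// immediate=15min, normal=24h, low=72h (from the commit message).
const EXPIRY_MS: Record<Urgency, number> = {
  immediate: 15 * MINUTE_MS,
  normal: 24 * HOUR_MS,
  low: 72 * HOUR_MS,
};

function computeExpiry(urgency: Urgency, createdAt: Date): Date {
  return new Date(createdAt.getTime() + EXPIRY_MS[urgency]);
}

// Assumption: a request is expired once `now` reaches expiresAt.
function isExpired(expiresAt: Date, now: Date): boolean {
  return now.getTime() >= expiresAt.getTime();
}
```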

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SpendTracker enforces rolling 24h daily spend limits using
COALESCE(actual, estimated) aggregation. Blocks actions when
currentSpend + proposedCost > maxDailySpendCents. Reconciliation
tracks estimate vs actual variance with percentage calculation.

Includes spend_records table, repository with getDailyTotal() and
reconcile(), and 12 unit tests covering boundary conditions,
zero-cost passthrough, and variance math.
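
The blocking rule can be illustrated in memory. This sketch mirrors the `COALESCE(actual, estimated)` aggregation with the `??` operator; the `SpendRecord` shape and helper names are assumptions, and the real SpendTracker runs the aggregation in SQL.

```typescript
// Hypothetical in-memory sketch of the rolling 24h spend check.
interface SpendRecord {
  estimatedCents: number;
  actualCents: number | null; // null until the record is reconciled
  recordedAt: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Equivalent of SUM(COALESCE(actual, estimated)) over the last 24 hours.
function dailyTotalCents(records: SpendRecord[], now: Date): number {
  const windowStart = now.getTime() - DAY_MS;
  let total = 0;
  for (const r of records) {
    if (r.recordedAt.getTime() > windowStart) {
      total += r.actualCents ?? r.estimatedCents;
    }
  }
  return total;
}

// Blocks when currentSpend + proposedCost > maxDailySpendCents.
function wouldExceedLimit(
  currentSpendCents: number,
  proposedCostCents: number,
  maxDailySpendCents: number,
): boolean {
  return currentSpendCents + proposedCostCents > maxDailySpendCents;
}
```

Because the comparison is strict, an action that lands exactly on the limit is still allowed.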

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rs (M2 Phase 4)

Domain autonomy: per-domain trust tier overrides using the more restrictive
of global and domain tier. Escalation triggers: 5 configurable trigger types
(amount threshold, risk tier, low confidence, novel situation, consecutive
rejections). Includes migration 009, DB repositories, and 19 new tests.
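
"The more restrictive of global and domain tier" can be read as taking the lower of the two on an ordered tier list. A minimal sketch, assuming the tier order below and a hypothetical `effectiveTier` helper:

```typescript
// Hypothetical sketch: a domain override can only tighten, never loosen.
const TIER_ORDER = [
  "OBSERVER",
  "SUGGEST",
  "LOW_AUTONOMY",
  "MODERATE_AUTONOMY",
  "HIGH_AUTONOMY",
] as const;

type TrustTier = (typeof TIER_ORDER)[number];

function effectiveTier(globalTier: TrustTier, domainTier?: TrustTier): TrustTier {
  if (domainTier === undefined) return globalTier;
  // A lower index means less autonomy, i.e. more restrictive.
  return TIER_ORDER.indexOf(domainTier) < TIER_ORDER.indexOf(globalTier)
    ? domainTier
    : globalTier;
}
```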

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…os (M2 Phase 5)

7 test groups covering all CLAUDE.md safety invariants: policy check
enforcement, explanation logging, trust tier gating (all 5 tiers x risk
levels), spend limits (per-action + daily), reversibility, feedback
flow-back, and mandatory risk assessment. Plus domain autonomy and
escalation trigger integration tests. 3 new safety regression scenarios
for daily spend limits, domain autonomy, and new user OBSERVER tier.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
WorkflowHandlerRegistry maps SituationType to handler functions,
generalizing the events route. Four new handlers (calendar-conflict,
subscription-renewal, grocery-reorder, travel-decision) plus matching
E2E tests that exercise the full pipeline for each situation type.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…se 8)

Settings API: GET/PUT autonomy settings, PUT/DELETE domain policies,
POST/PATCH/DELETE escalation triggers. Settings page extended with
spend limit controls, domain override management, and escalation
trigger configuration. DB package now exports M2 repositories.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ncy) and 39 eval scenarios

Phase 9 of the build plan. Three new metric trackers for measuring decision
quality: EscalationCorrectnessTracker (precision/recall/F1), CalibrationErrorTracker
(ECE), DecisionLatencyTracker (P50/P90/P99). Five new scenario files covering
calendar (8), subscription (8), grocery (8), travel (8), and cross-domain (7)
situations. All wired into ContinuousEvalRunner and exported from index.
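
For readers unfamiliar with the metrics named above, here is a small sketch of F1 and expected calibration error (ECE) using their standard definitions. The function names, input shapes, and the choice of 10 equal-width bins are assumptions; the trackers in this PR may compute them differently.

```typescript
// Hypothetical sketch of two of the metrics, using textbook definitions.
interface ConfusionCounts {
  truePositives: number;
  falsePositives: number;
  falseNegatives: number;
}

function escalationF1(c: ConfusionCounts): { precision: number; recall: number; f1: number } {
  const predicted = c.truePositives + c.falsePositives;
  const actual = c.truePositives + c.falseNegatives;
  const precision = predicted === 0 ? 0 : c.truePositives / predicted;
  const recall = actual === 0 ? 0 : c.truePositives / actual;
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

interface Prediction {
  confidence: number; // 0..1
  correct: boolean;
}

// ECE: weighted average over bins of |accuracy - mean confidence|.
function expectedCalibrationError(predictions: Prediction[], numBins = 10): number {
  if (predictions.length === 0) return 0;
  const counts: number[] = [];
  const confidenceSums: number[] = [];
  const correctCounts: number[] = [];
  for (let i = 0; i < numBins; i++) {
    counts.push(0);
    confidenceSums.push(0);
    correctCounts.push(0);
  }
  for (const p of predictions) {
    // Clamp so that confidence 1.0 falls into the last bin.
    const idx = Math.min(numBins - 1, Math.max(0, Math.floor(p.confidence * numBins)));
    counts[idx] = (counts[idx] ?? 0) + 1;
    confidenceSums[idx] = (confidenceSums[idx] ?? 0) + p.confidence;
    correctCounts[idx] = (correctCounts[idx] ?? 0) + (p.correct ? 1 : 0);
  }
  let ece = 0;
  for (let i = 0; i < numBins; i++) {
    const n = counts[i] ?? 0;
    if (n === 0) continue;
    const meanConfidence = (confidenceSums[i] ?? 0) / n;
    const accuracy = (correctCounts[i] ?? 0) / n;
    ece += (n / predictions.length) * Math.abs(accuracy - meanConfidence);
  }
  return ece;
}
```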

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…CI workflow

Phase 10 of the build plan. PreferenceEvolutionTracker records every preference
change with attribution (feedback, evidence, explicit, inference) and supports
point-in-time state reconstruction. TemporalReplayEngine diffs twin state between
two dates using twin_profile_versions + preference_history. TwinService now wires
evolution tracking into updatePreference() and processFeedback(). GitHub Actions
workflow runs eval suite on push/PR to main with safety regression gating.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…se 7)

Rollback E2E tests verify execute→rollback lifecycle, irreversible rejection,
unknown plan handling, failure paths, operation log tracking, and health toggling
against MockIronClawAdapter. Contract tests run identical assertions against both
MockIronClawAdapter and DirectExecutionAdapter to verify behavioral parity on
execute, rollback, healthCheck, and multi-step plans.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- ask.ts: look up user trust tier from DB instead of hardcoding OBSERVER
- briefings.ts: query proactiveScanRepository for real briefings, persist
  preferences to user autonomy_settings
- skill-gaps.ts: query skillGapRepository with limit and actionType filter
- proposals.ts: look up proposal by ID, update status via proposalRepository,
  create preference on twin profile when accepted
- openclaw-adapter.ts: real HTTP client with /execute and /rollback endpoints,
  dry-run fallback when no server configured, proper rollback with plan tracking
- continuous-runner.ts: store per-scenario results in EvalRun.scenarioResults,
  reconstruct previous results for regression comparison
- eval-types.ts: add optional scenarioResults field to EvalRun

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…se 7)

MockIronClawServer: minimal HTTP server mimicking IronClaw's webhook API
with HMAC-SHA256 verification, configurable responses, and message recording.
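
A minimal sketch of HMAC-SHA256 verification as a mock server like this might perform it, using Node's built-in `crypto` module. The hex encoding and the helper names are assumptions; IronClaw's actual signature header and encoding are not described in this commit.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical helpers: sign a raw request body and verify a received signature.
function signBody(secret: string, body: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

function verifySignature(secret: string, body: string, signatureHex: string): boolean {
  const expected = Buffer.from(signBody(secret, body), "hex");
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws when lengths differ, so check lengths first.
  if (expected.length !== received.length) return false;
  return timingSafeEqual(expected, received);
}
```

The signature should be computed over the raw body bytes exactly as received; re-serializing parsed JSON before verifying can change the bytes and cause false mismatches.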

Contract tests: validate that RealIronClawAdapter and MockIronClawAdapter
produce compatible outputs (ExecutionResult, RollbackResult, healthCheck).
15 tests covering buildPlan, execute, rollback, healthCheck, and HMAC auth.

Rollback E2E: 6 tests covering execute-then-rollback, irreversible rejection,
unknown plan, operation logging, and independent multi-plan rollback.

Execution router: added rollback-through-router integration test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a dropdown in the page header that lets users switch between
Mission Control (dense/monospace), Quiet Confidence (minimal/Linear),
and Warm Glass (glass-morphism/gradients) themes, each with dark and
light modes. Persists to localStorage with no flash on reload.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pnpm/action-setup@v4 now reads the version from package.json's
packageManager field. Specifying both causes a conflict error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Read trustTier from DB instead of event payload in workflow handlers
  (fixes a trust-boundary violation; the fallback default is now OBSERVER, not MODERATE_AUTONOMY)
- Critical undo now drops to OBSERVER instead of one tier down
- Reject negative spend amounts to prevent bypass
- Add atomic checkAndRecordSpend to prevent TOCTOU race in spend tracking
- Add ownership checks on escalation trigger PATCH/DELETE routes
- Input validation for spend limit settings
- Fail-closed for unknown escalation trigger types
- Parameterize SQL interval expressions to prevent injection
- Add findById to escalation trigger repository
- Update tests for negative-cost rejection and critical-undo behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…emma4 default

Wire real Ollama LLM execution through OpenClaw bridge with trust-ranked
adapter fallback chain (IronClaw → Direct → OpenClaw). Expand the system
from 6 to 12 situation types with 60+ action types across finance, smart
home, task management, social media, documents, and health domains.

Key changes:
- OpenClaw bridge server (Node.js) bridging SkyTwin to Ollama gemma4
- 6 new IronClaw action handlers with throw-for-fallback pattern
- Decision engine: 6 new candidate generators and situation classifiers
- 5-step onboarding: welcome, identity, domain selection (10 cards),
  preference seeding (2 questions/domain), trust tier choice
- Direct adapter throws on missing handlers to enable fallback chain
- All route files fixed to use proper TwinRepositoryAdapter (not raw repos)
- Policy repository adapter filters malformed string-condition rules
- 48 new eval scenarios (8 per domain)
- 3 seed users with cross-domain decision history
- All IDs use crypto.randomUUID() for CockroachDB UUID columns
- Desktop app scaffold (Electron) for Mac/Windows

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CI failed with ERR_PNPM_OUTDATED_LOCKFILE because the desktop app's
electron dependencies were not in pnpm-lock.yaml.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Direct adapter now throws instead of soft-failing when no handler is
registered, enabling the execution router's fallback chain to continue
to OpenClaw. Updated 2 tests that expected the old soft-failure behavior.
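
The throw-for-fallback pattern can be sketched as follows. The interface and function names are hypothetical and the real execution router carries far more context (plans, risk assessments, rollback); the point is only that a thrown error, unlike a soft failure result, lets the router try the next adapter.

```typescript
// Hypothetical sketch of a trust-ranked fallback chain.
interface ExecutionAdapter {
  name: string;
  execute(actionType: string): Promise<string>;
}

type Handler = () => Promise<string>;

// A direct adapter that throws when no handler is registered.
function makeDirectAdapter(handlers: Map<string, Handler>): ExecutionAdapter {
  return {
    name: "direct",
    async execute(actionType: string): Promise<string> {
      const handler = handlers.get(actionType);
      if (!handler) {
        throw new Error(`No handler registered for ${actionType}`);
      }
      return handler();
    },
  };
}

// Tries adapters in order; any thrown error moves on to the next one.
async function executeWithFallback(
  adapters: ExecutionAdapter[],
  actionType: string,
): Promise<{ adapter: string; result: string }> {
  const errors: string[] = [];
  for (const adapter of adapters) {
    try {
      const result = await adapter.execute(actionType);
      return { adapter: adapter.name, result };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      errors.push(`${adapter.name}: ${message}`);
    }
  }
  throw new Error(`All adapters failed: ${errors.join("; ")}`);
}
```

If the direct adapter instead returned a result object with a failure flag, the loop above would treat it as a success and never reach the later adapters.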

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- bin/skytwin-dev: starts all services (CRDB, API, Web, Worker, OpenClaw)
  with a single command, auto-builds, migrates, seeds. Supports --stop
  and --no-ollama flags. Manages PID files for clean shutdown.
- bin/skytwin-install: detects OS (macOS/Linux/WSL) and installs only
  missing dependencies (Node via nvm, pnpm via corepack, Docker, Ollama).
  Never overwrites existing installs. Pulls gemma4 model for Ollama.
- Root package.json: added "start", "stop", "setup" convenience scripts.
- .env.example: expanded with all service ports and optional config.

Usage:
  pnpm setup    # install everything (safe, idempotent)
  pnpm start    # start all services
  pnpm stop     # tear down

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove trust tier from user creation request body (must be earned, not declared)
- Verify approval ownership before executing (user_id must match)
- Run policy evaluator on approved actions (spend limits still apply)
- Fix LLM bridge parse failure returning success:true → now success:false
- Delete divergent server.ts duplicate (server.mjs is canonical, model already drifted)
- Fix bridge port default 4100→3456 to match .env.example and skytwin-dev
- Add 30s timeout on Ollama fetch to prevent hung requests
- Fix desktop service restart counter resetting before health check
- Cap seed-preferences at 100 and use Promise.all for parallel writes
- Persist execution step info instead of empty arrays for rollback support
- Move TwinService to router-level scope in users route
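
One way to bound an Ollama call at 30 seconds is `AbortSignal.timeout`, sketched below together with the fail-on-bad-shape behavior. The function name, the injectable `fetchImpl` parameter, and the defaults are assumptions for illustration; the bridge's actual request code is not shown in this PR. The `/api/generate` endpoint and `response` field follow Ollama's documented non-streaming API.

```typescript
// Hypothetical sketch: a timeout-bounded, non-streaming Ollama request.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string; signal: AbortSignal },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

async function generateWithTimeout(
  prompt: string,
  fetchImpl: FetchLike,
  timeoutMs = 30_000,
  baseUrl = "http://localhost:11434",
  model = "gemma4",
): Promise<string> {
  const res = await fetchImpl(`${baseUrl}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: false }),
    // Aborts the request if it has not completed within timeoutMs.
    signal: AbortSignal.timeout(timeoutMs),
  });
  if (!res.ok) {
    throw new Error(`Ollama returned HTTP ${res.status}`);
  }
  const data = (await res.json()) as { response?: unknown };
  // Treat an unexpected shape as a failure instead of reporting success.
  if (typeof data.response !== "string") {
    throw new Error("Unexpected Ollama response shape");
  }
  return data.response;
}
```

In a Node 18+ runtime the global `fetch` can be passed as `fetchImpl`; injecting it keeps the function testable without a running Ollama server.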

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jayzalowitz jayzalowitz merged commit a16fc76 into main Apr 2, 2026
1 check passed
jayzalowitz added a commit that referenced this pull request Apr 26, 2026
Closes #75. Three additive guards on the safety kernel:

1. ExecutionRouter throws InvariantViolationError when called without a
   RiskAssessment or with a CandidateAction whose id does not match the
   assessment. Pins Safety Invariants #1 and #7 at the boundary, so a
   future caller that bypasses the decision pipeline cannot silently
   auto-execute. (+4 unit tests)

2. DecisionMaker.whatWouldIDo no longer leaks blocked candidates as
   alternativeActions when policy denies every candidate. Returns an
   empty alternatives array and surfaces the blocking reason via
   policyNotes so the prediction reflects what the user could actually
   take. (+1 unit test pinning the no-leak contract)

3. POST /api/events/ingest emits a decision:blocked-by-policy SSE event
   when no action was selected and no approval was created, so users see
   the policy result instead of silent ingestion. (+1 unit test)

Production call sites (apps/api/src/routes/events.ts:230,
apps/api/src/routes/approvals.ts:264) already build matching
RiskAssessments — guards are inert for them, active against new
orphan callers.
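
The first guard can be pictured as a boundary check like the one below. The `InvariantViolationError` name comes from the commit message; the field names, the helper name, and the error messages are assumptions.

```typescript
// Hypothetical sketch of the boundary guard described in item 1.
class InvariantViolationError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "InvariantViolationError";
  }
}

interface CandidateAction {
  id: string;
  actionType: string;
}

interface RiskAssessment {
  actionId: string;
  overallTier: string;
}

// Returns the assessment only if it exists and belongs to this action.
function requireMatchingRisk(
  action: CandidateAction,
  risk: RiskAssessment | null | undefined,
): RiskAssessment {
  if (!risk) {
    throw new InvariantViolationError(`No RiskAssessment supplied for action ${action.id}`);
  }
  if (risk.actionId !== action.id) {
    throw new InvariantViolationError(
      `RiskAssessment ${risk.actionId} does not match action ${action.id}`,
    );
  }
  return risk;
}
```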

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
jayzalowitz added a commit that referenced this pull request May 5, 2026
…line (#148 v1) (#153)

Closes #148 v1 — final phase-2 piece for the assistant epic. Chat now
routes detected action intents ("archive that email", "schedule a
meeting") through the existing decision pipeline. Saying an action in
chat creates an ApprovalRequest on the existing #/approvals page;
conversational messages still go through the LLM chat path unchanged.

Conservative v1 — chat-driven actions ALWAYS land in approvals, never
auto-execute, even when the engine returns autoExecute=true. Free-text
is too ambiguous to bypass the approval step on the first cut. Phase 2
of #148 lifts this when we have an LLM-confidence score + per-user
opt-in.

Safety invariants — all upheld:
- #1 (no auto-exec without policy): every intent runs through
  DecisionMaker.evaluate() → PolicyEvaluator.evaluate(). No bypass.
- #2 (always log explanations): ExplanationGenerator.generate()
  persists for every chat-driven decision. Persist failure logs but
  doesn't abort.
- #3 (trust tiers): pulled from user record, never from chat input.
- #4-#7 inherited from DecisionMaker.

Pieces:
- @skytwin/assistant: detectIntent (rule-based regex/keyword classifier,
  tolerant to short/ambiguous messages, false-positive guarded) +
  ActionRouter port + AssistantService.routeIntent. Package stays free
  of decision-engine + db deps.
- apps/api/src/routes/assistant.ts: buildActionRouter() factory wires
  TwinService + PolicyEvaluator + DecisionMaker + ExplanationGenerator
  + LabelInferencePort. Synthetic DecisionObject from chat intent.
  Persists ApprovalRequest. Emits approval:new SSE.
- POST /messages branches on intent BEFORE the LLM call. Both sync JSON
  and SSE response paths supported.
- pages/assistant.js: action footer renders approval-link or blocked
  notice based on metadata.intentRoute. CSS styled with theme variables.

Tests: 16 new (12 intent classifier + 4 routeIntent). Full suite green
across 40 packages; lint clean.

Phase 2 epic is now complete: #146 streaming + #147 context + #149
multi-turn + #148 action routing.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
jayzalowitz added a commit that referenced this pull request May 5, 2026
Picks up the P1/P2 findings deferred from #154 (the hardening pass).
6 of the remaining UX review findings closed.

Settings cleanup (P1 #7, #9)
- Theme switcher relocated from page header (where it looked like a
  breadcrumb pill) to a dedicated "Visual theme" card in Settings.
- AI provider section now titled "AI brain — needed for Chat
  (optional otherwise)" instead of just "(optional)" — the Chat
  feature requires it, the old title was misleading.

Chat → Settings deep-link (P1 #9)
- When POST /messages returns 409 "no AI provider configured", the
  chat bubble explains the dependency + footer link to Settings →
  AI brain. Previously the user got "No AI provider configured"
  with no path forward.

Onboarding modal dimmer (P2 #15)
- Bumped overlay rgba(0,0,0,.85) → .92 + backdrop-filter: blur(4px)
  so the sidebar/page behind the modal is properly muted (was
  bleeding through the glass effect at .85).

Console error spam reduction (P2 #20)
- New isApiKnownOffline() in api-client.js. Badge-poll loop in
  app.js backs off 10s → 60s when API is known down. Pre-fix
  produced 110+ console errors/min against a dead server.
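
The back-off amounts to choosing the poll interval from a known-offline flag. A minimal sketch, written in TypeScript for illustration; `isApiKnownOffline` is named in the commit, while `recordApiResult` and `nextPollDelayMs` are hypothetical helpers and the real flag logic in api-client.js is not shown here.

```typescript
// Hypothetical sketch of the badge-poll back-off.
const NORMAL_POLL_MS = 10_000; // 10s while the API is reachable
const OFFLINE_POLL_MS = 60_000; // 60s once the API is known to be down

let apiKnownOffline = false;

// Assumption: the client flips the flag after each request outcome.
function recordApiResult(ok: boolean): void {
  apiKnownOffline = !ok;
}

function isApiKnownOffline(): boolean {
  return apiKnownOffline;
}

function nextPollDelayMs(): number {
  return isApiKnownOffline() ? OFFLINE_POLL_MS : NORMAL_POLL_MS;
}
```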

Date input theming on Audit (P1 #11)
- New .themed-date class so native date inputs match the dark glass
  aesthetic (background, border, color-scheme for picker icon).
- Webkit calendar-picker-indicator filter inverts the icon glyph
  on dark themes so it's actually visible.

Tests: no new unit tests (browser-only). Backend suite still green
across 40 packages.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
jayzalowitz added a commit that referenced this pull request May 25, 2026
… closed) (#417)

* P1.1 #371: stop fabricating synthetic RiskAssessments at exec boundaries

Closes #371.

Safety Invariant #7 requires every CandidateAction carry a RiskAssessment
that the router actually consumes. Pre-fix, BOTH execution boundaries
(events.ts auto-execute, approvals.ts approve-execute) discarded the
decision-maker's per-dimension assessment and constructed a fresh one:

- events.ts re-derived from explanation.riskTier (a flat enum) and
  broadcast that single tier across all six dimensions. A candidate
  the decision-maker assessed HIGH on financial impact was routed LOW.
- approvals.ts hardcoded LOW on every dimension with the comment
  "user-approved = lower risk." A human click does not move the
  underlying risk dimensions — the adapter selection then picked a
  less-guarded adapter than the decision-maker intended.

Fix is on the consumer side; no changes to decision-maker, repository,
or execution-router invariants:

- events.ts: read `outcome.riskAssessment` directly (already attached
  by decision-maker.ts:263). Fall back to
  `decisionRepositoryAdapter.getRiskAssessment(actionId)` if absent.
  If both null, FAIL CLOSED: escalate to manual approval rather than
  fabricate. Drops the now-unused DimensionAssessment / RiskTier /
  RiskDimension imports.
- events.ts approval-create payload now stamps the original
  `outcome.selectedAction.id` into the stored candidate_action JSONB,
  so the approve handler can recover the linkage.
- approvals.ts: preserve the original candidate id from the stored
  action (was generating a fresh UUID, losing the lookup linkage).
  Look up the persisted assessment via decisionRepositoryAdapter. If
  missing, HTTP 409 risk_assessment_missing — same fail-closed
  posture as events.ts.

Legacy approval rows from before this PR don't carry an `id` on the
stored candidate_action and will hit the 409. The recovery is to
re-trigger the decision so a fresh assessment is persisted.

Tests:
- events-routes.test.ts: mock decisionRepositoryAdapter.getRiskAssessment
  to return a baseline LOW assessment so existing autoExecute-path tests
  proceed (mocks didn't include riskAssessment on the outcome).
- feedback-loop.test.ts: same mock + add valid UUID id to the
  candidate_action fixture rows so the approve path's lookup succeeds.
- All 713 tests still pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* P1.1 (post-Copilot): move risk-assessment 409 to preflight; echo id in test mocks

Copilot review on PR #417 caught two concerns:

1. **The 409 risk_assessment_missing branch fired AFTER side effects.**
   approvalRepository.respond() had already marked the row 'approved',
   feedback was recorded, the episode was written, the memory port
   recorded the episode — then the 409 left an "approved" row with no
   execution and no way to retry (status conflict on the next attempt).

   Fix: pre-check the persisted RiskAssessment immediately after the
   dual-confirm gate, before any state mutation. The lookup uses
   `existing.candidate_action.id` (which events.ts now stamps) and
   stores both `preflightCandidateId` + `preflightRiskAssessment` for
   the later execute block to reuse. The execute block's redundant
   lookup is removed; it asserts non-null on the preflight result
   because the early 409 already returned for the missing case.

2. **Test mocks returned a fixed actionId, masking the router's
   actionId-match invariant.**

   Fix: both events-routes.test.ts and feedback-loop.test.ts now use
   `mockImplementation(async (actionId) => ({ actionId, ... }))` so the
   returned assessment echoes the input, and the
   `action.id === riskAssessment.actionId` invariant in
   execution-router can't silently pass on a mismatched fixture.

All 713 API tests still pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
jayzalowitz added a commit that referenced this pull request Jun 6, 2026
Correctness:
- deadline urgency: stale (past-relative-to-now) deadlines no longer read as
  critical; far-out deadlines no longer DOWNGRADE a type's default urgency (#1/#2)
- security markers curated to specific phrases — kill false positives on shipping
  notices / "welcome back" / articles (#3); marker check also applied on the LLM
  path so escalate-only holds regardless of classifier (safety defense-in-depth)
- digest emits signalRefs[] so citation chips actually render (#4)
- scope gate now covers calendar RSVP/invite write actions (#5)
- commitment extractor: clause-level negation (keep real commitments sharing a
  sentence with "if I…") (#6); "by <person>" no longer a deadline hint (#7)
- entity resolver compares full normalized string, not the truncated slug (#10)

Hardening/robustness:
- demo-guard isLocalDbTarget: exact host match, not substring (#8)
- provisionNewUser is genuinely best-effort (try/catch) — never 500s after the
  user row exists
- briefing-generator pinned to prompt v1 until it consumes v2 structured output
  (avoids requesting+discarding todos/topics); v2 deterministic_fallback fixed
- briefing test mock provides userRepository.getLocale so the LLM-prose path is
  actually exercised (#13)
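
The exact-host fix for `isLocalDbTarget` (#8) can be illustrated as below. This is a sketch under assumptions: the list of local hosts and the URL-based parsing are guesses at the approach, not the project's code. A substring test such as `includes("localhost")` would wrongly accept a remote host like `localhost.evil.example.com`.

```typescript
// Hypothetical sketch: exact host match instead of substring match.
const LOCAL_HOSTS = ["localhost", "127.0.0.1", "[::1]"];

function isLocalDbTarget(connectionString: string): boolean {
  let hostname: string;
  try {
    hostname = new URL(connectionString).hostname.toLowerCase();
  } catch {
    // Unparseable target: fail closed and treat it as non-local.
    return false;
  }
  return LOCAL_HOSTS.indexOf(hostname) !== -1;
}
```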

Regression tests added for each. Full suite green (70/70 tasks).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
jayzalowitz added a commit that referenced this pull request Jun 10, 2026
) (#488)

* feat(decision-engine): SignalText multi-source accessor + capability matrix (spec 07, #480)

Normalize any RawSignal (email/calendar/filesystem/voice) into a channel-agnostic
SignalText so commitment/deadline/security/cluster/entity capabilities are
source-agnostic. Extends AuthoringTier with authored_originated/received_shared;
adds a tested capability×source coverage matrix. Foundation for #475/#476/#479.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(api,db): observer default + new-user provisioning + seedUpsert (spec 10, #483)

- LOCKED: new users default to trust_tier 'observer' (users.ts) — matches DB
  default + CLAUDE.md; resolves the 3-way conflict that forced 'suggest'.
- provisionNewUser: eager empty twin profile + conservative autonomy defaults
  (no spend caps, so the built-in NO_SPEND_WITHOUT_LIMIT gate blocks spend
  until the user sets a budget — safe by construction).
- seedUpsert/buildUpsertSql: shared, tested idempotent upsert helper for
  re-runnable seeds (used by spec 09). Existing seed.ts already idempotent.

Part C (promotion soak-floor hoursInCurrentTier + tier-ladder intro) still TODO.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(worker,db): enforce promotion soak-floor via hoursInCurrentTier (spec 10 Part C, #483)

Daily promotion-eligibility job now populates hoursInCurrentTier (from last tier
change or account creation), so the engine actually enforces minDurationInTierHours
(24h observer->suggest, etc). Closes the documented gap where the floor was skipped
in the auto path. Fail-safe 0 keeps a promotion blocked when time can't be derived.
Tier-ladder intro UI folds into spec 08.
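
A sketch of how `hoursInCurrentTier` might be derived, including the fail-safe 0. The parameter names and the `meetsSoakFloor` helper are assumptions; only the field name, the 24h observer-to-suggest floor, and the fail-safe behavior come from the commit message.

```typescript
// Hypothetical sketch: derive time-in-tier, returning 0 when it cannot be computed.
function hoursInCurrentTier(
  now: Date,
  lastTierChangeAt: Date | null,
  accountCreatedAt: Date | null,
): number {
  const since = lastTierChangeAt ?? accountCreatedAt;
  if (!since) return 0; // fail-safe: unknown start keeps promotion blocked
  const elapsedMs = now.getTime() - since.getTime();
  if (!Number.isFinite(elapsedMs) || elapsedMs < 0) return 0; // invalid or future date
  return elapsedMs / 3_600_000;
}

function meetsSoakFloor(hours: number, minDurationInTierHours: number): boolean {
  return hours >= minDurationInTierHours;
}
```

Returning 0 rather than a large default means a missing or corrupt timestamp can only delay a promotion, never grant one.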

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(decision-engine): deadline extraction feeds urgency (spec 03, #476)

extractDeadline parses absolute/relative dates (chrono-node) from any
text-bearing signal (SignalText-compatible) and returns the earliest credible
FUTURE deadline. situation-interpreter.enrichDeadline stamps rawEvent.deadline
when the connector didn't, so the existing assessUrgency consumer finally gets
fed. Rejects past dates + no-match. v1 leaves per-user-timezone resolution to
spec 12.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(decision-engine): commitment extraction from authored content (spec 02, #475)

extractCommitments surfaces the user's own stated obligations ("I'll send the
draft tomorrow" -> "Send the draft tomorrow") from authored SignalText. Gated to
authoredByUser + the commitments source allowlist (safety invariant #8: never
from inbound content). Rule extractor handles modal forms, excludes
questions/past/third-party/hypotheticals, dedups, and emits a deadlineHint for
spec 03. CommitmentStrategy seam left for an LLM path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(decision-engine): inbound security-alert classifier, escalate-only (spec 06, #479)

Adds SituationType.SECURITY_ALERT (enums.ts). classifySituation matches inbound
account-security markers FIRST (precedence over finance/email), urgency=high,
domain=security. The candidate generator emits ONLY a human-review escalation
that says "open the provider directly" with link-free parameters — never an
auto-executable action, never a URL from the untrusted body (safety invariant
#8). Provenance stays untrusted_external regardless of claimed sender.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(decision-engine): signal topic clustering for the digest (spec 04, #477)

clusterSignals groups awareness signals into life-domain topic clusters for the
Topics section. Anchors to known domains (beats the reference product's
mis-filing), guarantees complete + non-overlapping partition, caps cluster count
with overflow merged into "More updates" (logged via onMerge). Deterministic
fallback ships; ClusterStrategy seam for an LLM path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(decision-engine): source-coverage model for graceful degradation (spec 13, #487)

computeCoverage evaluates the capability x source matrix against a user's
connected accounts -> per-capability available/partial/unavailable + the sources
that would unlock each, plus a coldStart flag (zero sources, distinct from
connected-but-quiet). Excludes mock sources. Drives "connect X to unlock Y"
transparency; UI affordances render in spec 08.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(policy,decision-engine): access-faithful gates — scope + hidden (spec 11, #485)

Scope gate (policy-engine): requiredWriteScope/hasWriteScope/applyScopeGate.
Wired into DecisionMaker.generateCandidates — when grantedScopes is supplied,
un-granted write candidates (send/calendar) downgrade to a human-review "grant
access" item. Fail-safe NOT granted (safety invariant #8). Visibility filter
(decision-engine): isHidden/filterVisible — the single hide predicate the digest
routes input through (briefing-generator wiring lands with spec 01).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(db,worker,decision-engine): locale & timezone faithfulness (spec 12, #486)

Migration 063 adds users.language + users.timezone. userRepository.getLocale +
resolveLanguage/resolveTimezone/isNonEnglish helpers (safe fallbacks: en / UTC
with a logged-default flag). Briefing prose locale now reads the user profile
instead of hardcoded 'en'. isNonEnglish is the LLM-vs-rule routing signal for
the extractors (degraded-marker wiring is a follow-up on 02/03/06).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(db): launch demo fixture — opt-in, isolated, guarded (spec 09, #482)

assertDemoSafe (3-gate invariant #0): explicit-only, prod hard-blocked + non-local
needs override, identity isolation via is_demo (migration 064). Never wired into
bin/skytwin-dev/auto-seed — can't run for a real or new user. demo-fixture.ts
guards then upserts the reserved demo user + ingests a synthetic source-varied
corpus (email/calendar/file/voice) through /api/events/ingest; --reset deletes
is_demo rows only. `pnpm demo:fixture`. Guard fully unit-tested.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(decision-engine,policy-prompts): digest to-do/FYI split (spec 01, #474)

buildDigest partitions items into action-required to-dos (urgency-ordered, capped)
vs domain-clustered topics, with no overlap. Composes the epic: filters hidden
content first (spec 11), clusters topics (spec 04), carries sourceType+deadline
for the UI (spec 07/03). New briefing-prose v2 prompt emits the two-section
structured payload (todos + topics). The structured_payload column + repo read +
render land with spec 08 (UI).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(decision-engine): entity extraction + cross-signal resolution (spec 05, #478)

extractEntities pulls people (emails) + orgs (suffix-tagged) from SignalText.
resolveEntities links mentions to stable entityIds — exact email key for people
(never fuzzy), token-overlap floor for orgs, conservative mint-on-doubt so a
false merge can't corrupt the graph. linkEntitiesAcrossSignals aggregates "every
signal touching X". Persistence reuses MemoryPort.recordEntity; the
getSignalsForEntity port method is the remaining integration seam.
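
The resolution rules can be sketched as follows. Everything here is illustrative: the Jaccard-style overlap score, the 0.6 floor, the lowercase email key, and all names are assumptions, since the commit does not specify the exact metric or threshold.

```typescript
// Hypothetical sketch: exact keys for people, overlap floor for orgs, mint on doubt.
interface KnownOrg {
  entityId: string;
  name: string;
}

function tokenSet(name: string): Set<string> {
  const tokens = new Set<string>();
  for (const t of name.toLowerCase().split(/[^a-z0-9]+/)) {
    if (t.length > 0) tokens.add(t);
  }
  return tokens;
}

// Shared tokens divided by total distinct tokens (assumed metric).
function overlap(a: Set<string>, b: Set<string>): number {
  let shared = 0;
  a.forEach((t) => {
    if (b.has(t)) shared += 1;
  });
  const union = a.size + b.size - shared;
  return union === 0 ? 0 : shared / union;
}

// People: exact email key only, never fuzzy.
function resolvePerson(email: string, byEmail: Map<string, string>, mintId: () => string): string {
  const key = email.trim().toLowerCase();
  const existing = byEmail.get(key);
  if (existing !== undefined) return existing;
  const id = mintId();
  byEmail.set(key, id);
  return id;
}

// Orgs: reuse an id only when the best overlap clears the floor; otherwise mint.
function resolveOrg(mention: string, known: KnownOrg[], mintId: () => string, floor = 0.6): string {
  const mentionTokens = tokenSet(mention);
  let bestId: string | null = null;
  let bestScore = 0;
  for (const org of known) {
    const score = overlap(mentionTokens, tokenSet(org.name));
    if (score > bestScore) {
      bestScore = score;
      bestId = org.entityId;
    }
  }
  return bestId !== null && bestScore >= floor ? bestId : mintId();
}
```

Minting a new id on a doubtful match trades some duplicate entities for the guarantee that two distinct organizations are not silently merged.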

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(web,api): digest UI — two-bucket, source-aware, cited (spec 08, #481)

twin-briefing.js renders the structured digest: To-dos above Topics, each row
with a source-type chip (email/calendar/file/voice) + citation chips that open
the in-app signal detail (never an external URL — safety #8). Reuses the existing
singleton-delegator + hash-gate + data-action conventions (new open-signal
action). Falls back to prose when structured is null (back-compat). API /latest
passes through structured (nullable, forward-compatible). CSS reuses card/badge
tokens. Mobile BriefingScreen mirror is the remaining part of this spec.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(post-/review): address review findings across the epic

Correctness:
- deadline urgency: stale (past-relative-to-now) deadlines no longer read as
  critical; far-out deadlines no longer DOWNGRADE a type's default urgency (#1/#2)
- security markers curated to specific phrases — kill false positives on shipping
  notices / "welcome back" / articles (#3); marker check also applied on the LLM
  path so escalate-only holds regardless of classifier (safety defense-in-depth)
- digest emits signalRefs[] so citation chips actually render (#4)
- scope gate now covers calendar RSVP/invite write actions (#5)
- commitment extractor: clause-level negation (keep real commitments sharing a
  sentence with "if I…") (#6); "by <person>" no longer a deadline hint (#7)
- entity resolver compares full normalized string, not the truncated slug (#10)

Hardening/robustness:
- demo-guard isLocalDbTarget: exact host match, not substring (#8)
- provisionNewUser is genuinely best-effort (try/catch) — never 500s after the
  user row exists
- briefing-generator pinned to prompt v1 until it consumes v2 structured output
  (avoids requesting+discarding todos/topics); v2 deterministic_fallback fixed
- briefing test mock provides userRepository.getLocale so the LLM-prose path is
  actually exercised (#13)

Regression tests added for each. Full suite green (70/70 tasks).
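
To illustrate the demo-guard item (#8), here is a minimal sketch of exact-host matching. The real `isLocalDbTarget` is not shown in this PR, so the signature, the `LOCAL_HOSTS` set, and the choice to treat unparseable input as non-local are all assumptions.

```typescript
// Hypothetical sketch: names and signature are illustrative, not the repo's code.
const LOCAL_HOSTS = new Set(["localhost", "127.0.0.1", "::1"]);

function isLocalDbTarget(connectionString: string): boolean {
  let host: string;
  try {
    host = new URL(connectionString).hostname;
  } catch {
    // Assumption: an unparseable target is treated as non-local (fail closed).
    return false;
  }
  // URL keeps brackets around IPv6 literals, so strip them before comparing.
  const normalized = host.replace(/^\[|\]$/g, "").toLowerCase();
  // Exact match on the parsed host, never a substring test on the raw string.
  return LOCAL_HOSTS.has(normalized);
}
```

A substring check would accept `localhost.evil.example` or a database literally named `localhost`; comparing the parsed hostname rejects both.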

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(web): collapse prose under a disclosure when the digest renders (design-review)

Showing the structured two-bucket digest AND the full prose was the same briefing
twice. When structured is present, the prose moves under a "Full briefing"
<details> as the long-form view; falls back to inline prose when there's no
structured payload.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(decision-engine,web): power view — inline technical depth (spec 14)

One digest, two depths. Default stays the clean view (non-technical users
unaffected); a discoverable header "Power view" toggle (persisted) + per-item
"Details" expander reveal the depth SkyTwin already computes — provenance,
confidence %, urgency reason, why-it-didn't-auto-run (scope/tier/policy), real
source refs, and the explanation — plus a coverage panel ("what I can see,
connect X to unlock Y"). Not buried in settings.

buildDigestItemDetail is the pure view-model (raw codes -> human strings), unit
tested. UI follows the singleton-delegator/hash-gate/data-action conventions.
Digest payload carries optional per-item detail + coverage (generator populates).
Verified rendering via a headless-browser screenshot.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(design): lock design system — calm command center, premium iris (DESIGN.md)

Source of truth grounded in a full element-and-state inventory of the digest
surfaces. Cool-neutral base (refines existing #0f1117 tokens; rejected the
warm/brown direction), iris #7C72E8 as the SINGLE accent meaning "needs you /
act", Fraunces voice + Geist + Geist Mono, action-vs-awareness hierarchy.
Catalogs every element + EVERY state including the gaps never rendered before:
cold-start, scope-blocked grant-access, loading, error, prose-fallback, distinct
security treatment, provenance in default view. CLAUDE.md now points UI + /qa +
/design-review at it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(web): implement DESIGN.md in the digest — iris, two-zone, gap states (spec 15)

Wires the locked design system into the real digest UI:
- Load Fraunces (twin voice) + Geist + Geist Mono (index.html)
- Iris #7C72E8 as the single accent = "needs you / act"; killed the CAPS
  source-chip soup -> one neutral source mark + a single "·N sources" citation;
  provenance as a dot (neutral, never accent)
- Action zone (to-dos: checkbox + inline Draft/Snooze/Verify/Grant, hover-reveal,
  always-on for security + touch) vs awareness zone (topics: lighter, no edge)
- Twin voice (Fraunces) + value line ("✓ N handled · M need you · K to catch up")
- Power view detail panel + coverage panel restyled to the system
- GAP STATES now designed: loading skeleton, empty-quiet, cold-start ("connect a
  source"), prose-fallback disclosure, distinct security treatment, scope-blocked
  "Grant access". Verified via headless-browser render of the real CSS.

Row-action wiring (draft/snooze/verify) routes/acknowledges until the act layer
lands. App-wide token adoption (vs digest-scoped iris) is a follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(web): make DOMContentLoaded handler async — SPA-breaking syntax error (pre-existing)

app.js:856 registered a non-async DOMContentLoaded handler, but the pairToken
branch (line ~904) uses `await fetch(...)` → "Unexpected reserved word" at parse
time, which aborts ALL app initialization. Every page rendered as an empty
#page-content shell. Present on origin/main; web JS has no type-check or tests, so
it shipped silently. One-word fix (() => → async () =>); verified by booting the
seeded app and touring dashboard/decisions/approvals/settings.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(api,web): render the to-do/FYI digest live end-to-end (parity)

The digest existed as tested modules but never rendered in the running app:
the briefing generator produces no structured payload, so /latest returned
null and the UI fell back to "No briefing content yet". This closes that
seam so the AI-inbox parity (to-dos vs topics, multi-source) actually shows.

- live-digest.ts: compute the structured digest from a user's recent
  decisions — read each decision's RawSignal through toSignalText (spec 07)
  for real, source-agnostic titles, partition via buildDigest (spec 01/04),
  attach power-view detail (spec 14) and coverage (spec 13).
- twin-briefings /latest: when no structured_payload is stored, compute the
  digest live (best-effort; degrades to prose on error) and synthesize a
  briefing envelope so the page renders parity today. Forward-compatible:
  a stored payload still wins once the worker writes one.
- dashboard: Home leads with a read-only digest hero (action zone first,
  DESIGN.md) linking to the full interactive /briefing; stop showing the
  "connect Google" nag once the twin has produced decisions.
- index.html: first-class "Briefing" nav link.
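
The /latest precedence described above (stored payload wins, otherwise compute live, otherwise prose) can be sketched roughly as follows. The types and the `resolveLatest` name are hypothetical, and the sketch is kept synchronous for brevity.

```typescript
// Hypothetical shapes; the real briefing row and digest types are not shown here.
interface StructuredDigest {
  todos: string[];
  topics: string[];
}

interface LatestResult {
  structured: StructuredDigest | null;
  source: "stored" | "live" | "prose";
}

function resolveLatest(
  stored: StructuredDigest | null,
  computeLive: () => StructuredDigest,
): LatestResult {
  // A stored payload still wins once the worker writes one.
  if (stored !== null) return { structured: stored, source: "stored" };
  try {
    // No stored payload: compute the digest live.
    return { structured: computeLive(), source: "live" };
  } catch {
    // Best-effort: any failure degrades to the prose briefing.
    return { structured: null, source: "prose" };
  }
}
```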

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(api,web,db): show "Needs you" for pending-approval decisions

The decisions log mapped auto_executed=false to "You OK'd", which mislabels a
decision still awaiting approval (notably an escalated security alert) as
already approved. Surface the outcome's requires_approval through the API and
add a distinct "Needs you" state so the log matches the Approvals page.

- decision-repository.getOutcomesForDecisions: also select requires_approval.
- decisions route: return requiresApproval per decision.
- decisions.js: Auto / Needs you / You OK'd / Pending, in that order.
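
A minimal sketch of the label precedence, under assumptions: `approvedByUser` is a hypothetical field standing in for however the log detects a completed approval, and `requiresApproval` is assumed to be true only while the decision is still awaiting approval.

```typescript
type DecisionLabel = "Auto" | "Needs you" | "You OK'd" | "Pending";

interface DecisionView {
  autoExecuted: boolean;
  requiresApproval: boolean; // surfaced from the outcome's requires_approval
  approvedByUser?: boolean; // hypothetical field, for illustration only
}

// Checks run in the order listed above: Auto, Needs you, You OK'd, Pending.
function decisionLabel(d: DecisionView): DecisionLabel {
  if (d.autoExecuted) return "Auto";
  if (d.requiresApproval) return "Needs you";
  if (d.approvedByUser) return "You OK'd";
  return "Pending";
}
```

With this ordering, an escalated security alert (not auto-executed, approval required) reads "Needs you" rather than "You OK'd".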

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(api): describePreference never renders "[object Object]"

A structured preference value (e.g. a brand-preference object) fell through to
String(value) and rendered as "[object Object]" in the dashboard "What I've
learned" summaries. Render arrays/objects readably instead. Adds a regression
test covering objects, nested objects, arrays, booleans, strings, numbers, and
null.
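
As a rough illustration of the fix, the sketch below renders structured values without ever reaching `String(object)`. The `describeValue` helper and its exact output format are assumptions; the real `describePreference` may format values differently.

```typescript
// Hypothetical helper: renders any preference value as readable text.
function describeValue(value: unknown): string {
  if (value === null || value === undefined) return "not set";
  if (typeof value === "boolean") return value ? "yes" : "no";
  if (Array.isArray(value)) return value.map(describeValue).join(", ");
  if (typeof value === "object") {
    // Recurse into objects so nested values never fall through to String().
    return Object.entries(value as Record<string, unknown>)
      .map(([key, inner]) => `${key}: ${describeValue(inner)}`)
      .join("; ");
  }
  // Strings and numbers are safe to stringify directly.
  return String(value);
}
```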

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(web): hide read controls on the live-computed briefing

The live digest (no stored row) carries the sentinel id 'live'; its "Mark as
read" button POSTed to /briefings/live/read and 400'd on the UUID check. Gate
the New badge + Mark-as-read on a persisted briefing so the control only shows
when there's a real row to mark.
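
One way to express the gate, as a sketch: show read controls only when the briefing id looks like a persisted row. The `isPersistedBriefing` name and the UUID pattern are assumptions; only the `'live'` sentinel and the server-side UUID check come from this commit.

```typescript
// Loose UUID shape check; assumed to mirror the API's validation.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Hypothetical helper: true only for briefings backed by a stored row.
function isPersistedBriefing(briefing: { id: string }): boolean {
  return briefing.id !== "live" && UUID_RE.test(briefing.id);
}
```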

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(api,decision-engine): make power-view digest detail meaningful

The power-view detail panel rendered noise: "URGENCY: Default for security",
"REFS: email: 77538186" (an internal id slice), "WHY: Account notice" (just the
title again), and no confidence at all. Feed it real technical depth instead:

- confidence: pull decision_outcomes.confidence -> a real percentage.
- source ref: the actual sender/organizer/file ("email: no-reply@accounts.example"),
  not an opaque decision-id slice.
- urgencyReason: a real driver ("Security alert — always sent to you", "New
  invite — awaiting your RSVP", "Routine — no deadline detected") via a new
  optional urgencyReason override on buildDigestItemDetail, instead of the
  generic "Default for <domain>".
- drop the redundant explanation (it duplicated the title).
- honest whyNotAutoExecuted: use the engine's real escalation_reason, and only
  fall back to the trust-tier gate when the item genuinely required approval —
  no fabricated "trust_tier:observer" on escalate-only items.
- normalizeUrgency: map the DB default 'normal' to 'medium', not 'low'
  (silent demotion).
- name the recent-decisions window; drop the redundant maxTodos override.

Adds a DB-mocked buildLiveDigest suite (cold start, to-do mapping + detail,
malformed raw_event, provenance fail-safe, handledCount) plus normalizeUrgency
and urgencyReasonFor helper tests. Fixes the sections-fold test's @skytwin/db
mock to define query so the live-digest path resolves cleanly.
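
The normalizeUrgency change can be sketched as below. The full set of urgency levels and the handling of unknown values are assumptions (`high` in particular is not named in this commit); the point illustrated is only that the DB default 'normal' maps to 'medium'.

```typescript
// Assumed level set; the real union may differ.
type Urgency = "low" | "medium" | "high" | "critical";

function normalizeUrgency(raw: string | null | undefined): Urgency {
  switch ((raw ?? "").trim().toLowerCase()) {
    case "critical":
      return "critical";
    case "high":
      return "high";
    case "normal": // DB default: previously demoted to 'low' silently
    case "medium":
      return "medium";
    case "low":
      return "low";
    default:
      // Assumption: unknown or missing values are treated like the default.
      return "medium";
  }
}
```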

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(web): don't suppress connect heroes once the twin has data

Gating the Connect-Google/Connect-Gmail heroes on `hasAnyData` hid the
onboarding CTA for users who have decisions but haven't connected Gmail (the
"Calendar connected, Gmail not yet" segment) — the heroes already self-suppress
when actually connected, so the extra gate only hurt real users. Revert to
gating on tourMode only. Also drop a dead `t.kind === 'security'` branch in the
Home digest hero (buildDigest never sets kind).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(digest): show what each item says + the recommended next step

The digest told you a title and a pile of system metadata (origin, confidence,
"why it escalated") but not the two things that actually matter: what the item
says and what to do about it. Surface both, sourced from data we already had:

- body: the real content (email snippet, event description, file excerpt,
  transcript) via toSignalText, rendered as a one-line preview under each title
  — visible by default, not buried in the power view.
- suggestedAction: the twin's recommended next step, taken from the pipeline's
  selected candidate action ("Accept this calendar invitation", "Review this
  security alert in the provider's official app — don't click links in the
  message"), with sensible fallbacks for escalate-only situations.

UI: the to-do/topic rows now lead with title -> what it says; the power-view
detail leads with the actionable "suggested" step, and the trust metadata
(origin/confidence/refs) drops below it. The Home hero shows the content line
plus an iris "→ next step" so it's actionable without opening anything.

Carries body through DigestItem/DigestTodo/DigestTopicItem + buildDigest, and
adds suggestedAction to DigestItemDetail. Tests cover body extraction, the
pipeline-selected action, and the security/RSVP fallbacks.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(digest): clean, user-facing next step on every item

Two gaps from the last pass: some suggestions were the rule-based engine's raw
internal text ("Apply appropriate labels to this email", "Escalate to user:
Decision needed regarding: transcript"), and the suggestion only showed in the
power-view detail — so in the default view most items had no visible next step.

- suggestedActionFor now maps the structured selected action TYPE to plain
  English ("Accept the invite, or decline / propose another time", "Nothing
  needed — I'll file it", "Take a look and tell me what to do"), with a
  security-specific instruction and situation fallbacks. Every item gets a
  clean, user-facing step — no engine internals leak through.
- The "→ next step" now renders in the row itself for every item (to-do and
  topic), visible without the power view. The power-view detail drops back to
  the trust/technical metadata (origin, confidence, refs) it's meant for.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(digest): plain-language detail — drop the system vocabulary

The detail panel was accurate but spoke the way the system names things, not the
way a person asks. A non-technical user can't parse "ORIGIN: Inbound — untrusted",
"REFS", "NOT AUTO-RUN", a bare "CONFIDENCE: 80%", or "From your twin" — and
"untrusted" reads as a threat rather than "you didn't write this".

Rephrase everything user-facing:
- provenance: "Inbound — untrusted" -> "From someone else"; "From your twin" ->
  "From your assistant"; fail-safe stays "someone else".
- block reasons: "trust level (observer) asks me to check" -> "You've asked me
  to check with you before I act"; "From untrusted content" -> "It came from
  someone else, so I want your OK first". No internal codes leak.
- detail labels: origin/confidence/urgency/not-auto-run/refs become "where it's
  from / written by / how sure I am / why now / why I'm asking you".
- source ref: a real sender or a friendly "your calendar"/"a voice note", not an
  id slice or a filename echo.

Default view was already plain; this brings the power view to the same bar so
"advanced" doesn't mean "fluent in our nouns". Tests updated to assert the plain
wording and that no jargon leaks.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(digest): every expand earns its rows; rename to "Your briefing"

Make the detail expansion uniformly useful and cut the filler:
- add "when" (relative time) — was missing entirely.
- "why now" is explanatory for FYI items too ("Not time-sensitive — just so
  you're aware") instead of the meaningless "Normal priority".
- confidence gets a word: "fairly sure (80%)", "very sure (100%)".
- drop the redundant "written by: someone else" (the sender already shows it);
  keep "written by: you" only when you authored it (genuinely notable).
- friendly source when there's no sender ("a voice note", "your files").
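
A small sketch of the confidence wording. Only the two phrases "fairly sure (80%)" and "very sure (100%)" come from this commit; the thresholds, the third phrase, and the `confidenceLabel` name are assumptions.

```typescript
// Hypothetical helper: turns a 0..1 confidence into a word plus a percentage.
function confidenceLabel(confidence: number): string {
  // Clamp so out-of-range inputs still produce a sensible percentage.
  const clamped = Math.min(1, Math.max(0, confidence));
  const percent = Math.round(clamped * 100);
  // Assumed cut-offs: 90% and above is "very sure", 60% and above is "fairly sure".
  const word = percent >= 90 ? "very sure" : percent >= 60 ? "fairly sure" : "not very sure";
  return `${word} (${percent}%)`;
}
```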

Also rename the page "Twin Briefing" → "Your briefing" with a plain subtitle,
matching the Home hero — "twin" is our metaphor, not a word a first-timer maps.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(live-digest): align urgencyReasonFor assertion with new wording

The critical-urgency reason changed to "Urgent — needs your attention now";
update the assertion from /critical/i to /urgent/i. (Caught by the full test
run after the per-file runs passed — the prior commit shipped this red.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(decision-engine): persist candidates before risk assessments

saveRiskAssessment runs `UPDATE candidate_actions ... WHERE id = ?`, but
saveCandidates (the INSERT) ran AFTER it — so the UPDATE hit zero rows, the full
RiskAssessment (overallTier/dimensions) was lost, and only the thin
`{reasoning}` placeholder survived. At approve time the execute-preflight
(getRiskAssessment → parseRiskAssessmentFromRow, which requires overallTier)
then returned null → `risk_assessment_missing`, blocking the ENTIRE
approve→execute path (no action could ever be executed).

Move saveCandidates ahead of the risk-assessment loop so the rows exist when the
UPDATEs land. Adds a regression test asserting saveCandidates is invoked before
the first saveRiskAssessment (via vi.fn invocationCallOrder).

Found via a safe end-to-end execution-stack test (mock adapter + isolated
tokenless user + fake email); verified fixed: fresh fake email → approve →
execution completed via the (mock) adapter, no risk_assessment_missing.
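
The ordering bug can be reproduced with an in-memory stand-in for the table. This is a sketch, not the repository code: `CandidateStore` and its method bodies are hypothetical, and only the method names and the UPDATE-before-INSERT behavior come from the description above.

```typescript
interface RiskAssessment {
  overallTier: string;
  dimensions: Record<string, number>;
}
type StoredRisk = RiskAssessment | { reasoning: string };

class CandidateStore {
  private rows = new Map<string, StoredRisk>();

  // INSERT: creates each row with only the thin placeholder.
  saveCandidates(ids: string[]): void {
    for (const id of ids) this.rows.set(id, { reasoning: "placeholder" });
  }

  // UPDATE ... WHERE id = ?: returns the affected row count (0 if the row is absent).
  saveRiskAssessment(id: string, risk: RiskAssessment): number {
    if (!this.rows.has(id)) return 0;
    this.rows.set(id, risk);
    return 1;
  }

  // Mirrors the preflight: a row without overallTier parses to null.
  getRiskAssessment(id: string): RiskAssessment | null {
    const row = this.rows.get(id);
    return row !== undefined && "overallTier" in row ? row : null;
  }
}

const risk: RiskAssessment = { overallTier: "low", dimensions: { financial: 0 } };

// Old order: the UPDATE runs before the row exists, so the assessment is lost.
const buggy = new CandidateStore();
const buggyUpdated = buggy.saveRiskAssessment("c1", risk);
buggy.saveCandidates(["c1"]);

// Fixed order: INSERT first, then the UPDATE lands on an existing row.
const fixed = new CandidateStore();
fixed.saveCandidates(["c1"]);
const fixedUpdated = fixed.saveRiskAssessment("c1", risk);
```

In the old order the preflight lookup returns null for `c1`, which corresponds to the `risk_assessment_missing` failure; in the fixed order it returns the full assessment.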

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(execution): safe end-to-end execution-stack harness + OpenClaw test docs

bin/skytwin-test-execution-stack: a repeatable, no-real-side-effects test of the
full execution path (ingest → decide → policy/spend/risk gate → approval →
execution router → adapter → result). Two safety layers: an isolated TOKENLESS
test user (Direct handlers throw at resolveAccessToken before any Google fetch)
+ USE_MOCK_IRONCLAW (simulated adapter). Spins up its own mock-mode API on a
test port; re-runnable; asserts the stack executed and recorded a result.

docs/testing-openclaw.md: how to exercise the OpenClaw adapter safely against
local Ollama via the openclaw-bridge (verified working: Ollama installed, bridge
completes a fake action end-to-end, simulated, nothing real touched). Notes the
router trust-ranking caveat (direct outranks openclaw, so isolate it to see
OpenClaw execute) and the OPENCLAW_API_URL config.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(web): setup — don't surface IronClaw credential-sync when it's unreachable

The Connect (#/setup) page showed "Not fully synced to IronClaw" + a "Sync to
IronClaw" button even when no IronClaw is configured/reachable (the common
case), so clicking it failed with a connection error. Gate the sync lookup on
ironclawSync.reachable: when IronClaw isn't reachable (no IronClaw, the local
mock, or a remote that's down) the sync affordance is hidden entirely — it's an
advanced feature that only applies to a real, reachable IronClaw. The execution
adapter row still shows its true state (Running / Registered-but-unreachable /
Not detected) via renderAdapterStatus.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(web,api): Vault page loads under dev-auth bypass (was "API may be offline")

The Credential Vault page (#/credential-vault) showed "Unable to load vault
status. The API may be offline." on every load: the route's getUserId read only
req.user?.id (unset under the localhost dev-auth bypass), with none of the
req.query['userId'] fallback every other route has — so /credential-vault/status
400'd with "userId is required". Add the standard session→query→body userId
fallback (ownership still gated by requireOwnership when a real session exists),
and pass userId on the web's init/rotate/lock/unlock POST bodies so those work
under bypass too.
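
The session→query→body fallback might look like the sketch below. `RequestLike` is a hypothetical stand-in for the framework's request type, and the ownership check (requireOwnership) is deliberately left out.

```typescript
// Minimal request shape for illustration; not the framework's real type.
interface RequestLike {
  user?: { id?: string };
  query: Record<string, unknown>;
  body?: Record<string, unknown>;
}

function getUserId(req: RequestLike): string | undefined {
  // Precedence: authenticated session first, then query string, then POST body.
  const candidates: unknown[] = [req.user?.id, req.query["userId"], req.body?.["userId"]];
  for (const candidate of candidates) {
    if (typeof candidate === "string" && candidate.length > 0) return candidate;
  }
  // The caller decides how to respond (the route answers 400 "userId is required").
  return undefined;
}
```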

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(web): setup — optional execution adapters read as optional, not failed

An optional, unconfigured execution adapter (IronClaw / OpenClaw) rendered as
"Not detected" in the setup page's Live status — which reads like something is
broken. For optional engines, that's not a failure: most users never run them
(the always-available Direct adapter handles actions). renderAdapterStatus now
takes an `optional` flag; an optional adapter that isn't registered shows
"Optional — not connected" (calm, muted) instead of "Not detected". Direct still
shows "Not detected" if it ever goes missing (a real problem). This is the
proper fix — correct whether or not a mock IronClaw is running, so no demo
crutch is needed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs: add Codex agent instructions

* fix: address inbox intelligence review findings

* fix: require approval for missing-scope escalations

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>