feat: real OpenClaw execution, 6 new domains, expanded onboarding #7
Merged
Pure logic engine that evaluates approval stats to determine tier promotion/regression. Promotion thresholds: OBSERVER→SUGGEST (10 consecutive, 80% ratio), SUGGEST→LOW_AUTONOMY (20, 85%), LOW_AUTONOMY→MODERATE_AUTONOMY (50, 90%). HIGH_AUTONOMY requires explicit opt-in. Regression on: 3+ recent rejections, critical undo, or 30%+ rejection ratio. OBSERVER is the floor. Includes trust_tier_audit table, repository, and 21 unit tests covering all promotion paths, regression triggers, and combined evaluation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
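A minimal sketch of the promotion/regression evaluation described above. The thresholds come from the commit message; the names `ApprovalStats` and `evaluateTier`, the `>=` comparisons, and the one-tier step-down for non-critical regressions are assumptions, not the engine's actual code. Critical undo drops to OBSERVER here, following the later fix in this PR. No promotion into HIGH_AUTONOMY is modeled, because that tier requires explicit opt-in.

```typescript
// Hypothetical shapes for illustration; only the numeric thresholds are from the commit.
const LADDER = [
  "OBSERVER",
  "SUGGEST",
  "LOW_AUTONOMY",
  "MODERATE_AUTONOMY",
  "HIGH_AUTONOMY",
] as const;
type TrustTier = (typeof LADDER)[number];

interface ApprovalStats {
  consecutiveApprovals: number;
  approvalRatio: number; // 0..1
  recentRejections: number;
  rejectionRatio: number; // 0..1
  criticalUndo: boolean;
}

// Promotion rules keyed by the tier being promoted FROM.
const PROMOTION: Partial<Record<TrustTier, { consecutive: number; ratio: number }>> = {
  OBSERVER: { consecutive: 10, ratio: 0.8 },
  SUGGEST: { consecutive: 20, ratio: 0.85 },
  LOW_AUTONOMY: { consecutive: 50, ratio: 0.9 },
};

function evaluateTier(current: TrustTier, stats: ApprovalStats): TrustTier {
  // Regression is checked before promotion. OBSERVER is the floor.
  if (stats.criticalUndo) return "OBSERVER";
  const idx = LADDER.indexOf(current);
  if (stats.recentRejections >= 3 || stats.rejectionRatio >= 0.3) {
    return LADDER[Math.max(0, idx - 1)] ?? "OBSERVER"; // assumed: one tier down
  }
  const rule = PROMOTION[current];
  if (
    rule &&
    stats.consecutiveApprovals >= rule.consecutive &&
    stats.approvalRatio >= rule.ratio
  ) {
    return LADDER[idx + 1] ?? current;
  }
  return current; // MODERATE_AUTONOMY never auto-promotes in this sketch
}
```

Checking regression first means a user who meets a promotion threshold but also has three recent rejections is not promoted.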
Adds expires_at and batch_id columns to approval_requests. Expiry is urgency-based: immediate=15min, normal=24h, low=72h. ApprovalRouter class computes expiry, checks expiration, and supports batch approve/reject. Repository gains expirePending() for worker cron and batchRespond() for bulk operations. Migration uses safe 3-step pattern. 14 new unit tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
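The urgency-based expiry can be sketched as a lookup table plus a date offset. The durations are from the commit message; the function names `computeExpiresAt` and `isExpired` are hypothetical and may not match the ApprovalRouter's real methods, and treating the exact expiry instant as expired is an assumption.

```typescript
type Urgency = "immediate" | "normal" | "low";

// Durations from the commit message: immediate=15min, normal=24h, low=72h.
const EXPIRY_MS: Record<Urgency, number> = {
  immediate: 15 * 60 * 1000,
  normal: 24 * 60 * 60 * 1000,
  low: 72 * 60 * 60 * 1000,
};

// Hypothetical helper: expiry is creation time plus the urgency's duration.
function computeExpiresAt(urgency: Urgency, createdAt: Date): Date {
  return new Date(createdAt.getTime() + EXPIRY_MS[urgency]);
}

// Assumption: a request counts as expired at the exact expiry instant.
function isExpired(expiresAt: Date, now: Date): boolean {
  return now.getTime() >= expiresAt.getTime();
}
```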
SpendTracker enforces rolling 24h daily spend limits using COALESCE(actual, estimated) aggregation. Blocks actions when currentSpend + proposedCost > maxDailySpendCents. Reconciliation tracks estimate vs actual variance with percentage calculation. Includes spend_records table, repository with getDailyTotal() and reconcile(), and 12 unit tests covering boundary conditions, zero-cost passthrough, and variance math. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
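As an in-memory illustration of the same arithmetic (the real tracker aggregates in SQL), the sketch below mirrors COALESCE(actual, estimated) with `??`. The record shape, the function names, the zero-cost shortcut, and the variance formula (signed percentage of the estimate) are assumptions.

```typescript
// Hypothetical record shape; the real spend_records columns may differ.
interface SpendRecord {
  estimatedCents: number;
  actualCents: number | null; // null until reconciled
}

// Mirrors COALESCE(actual, estimated): prefer the reconciled amount when present.
function dailyTotalCents(records: SpendRecord[]): number {
  return records.reduce((sum, r) => sum + (r.actualCents ?? r.estimatedCents), 0);
}

// Blocks when currentSpend + proposedCost > maxDailySpendCents.
function isBlocked(currentCents: number, proposedCents: number, maxDailyCents: number): boolean {
  if (proposedCents === 0) return false; // assumed meaning of "zero-cost passthrough"
  return currentCents + proposedCents > maxDailyCents;
}

// Assumed definition: signed variance as a percentage of the estimate.
function variancePercent(estimatedCents: number, actualCents: number): number {
  if (estimatedCents === 0) return 0; // avoid division by zero
  return ((actualCents - estimatedCents) * 100) / estimatedCents;
}
```

Note the strict `>`: an action that lands exactly on the limit is allowed, which is one of the boundary conditions the unit tests would need to pin down.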
…rs (M2 Phase 4) Domain autonomy: per-domain trust tier overrides using the more restrictive of global and domain tier. Escalation triggers: 5 configurable trigger types (amount threshold, risk tier, low confidence, novel situation, consecutive rejections). Includes migration 009, DB repositories, and 19 new tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
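"The more restrictive of global and domain tier" amounts to taking the lower rung on the tier ladder. A minimal sketch, assuming the five tiers are ordered from least to most autonomous; `effectiveTier` is a hypothetical name.

```typescript
// Ordered from most restrictive to least restrictive.
const TIER_ORDER = [
  "OBSERVER",
  "SUGGEST",
  "LOW_AUTONOMY",
  "MODERATE_AUTONOMY",
  "HIGH_AUTONOMY",
] as const;
type TrustTier = (typeof TIER_ORDER)[number];

// Hypothetical helper: a domain override can only tighten, never loosen, the global tier.
function effectiveTier(globalTier: TrustTier, domainTier?: TrustTier): TrustTier {
  if (domainTier === undefined) return globalTier;
  return TIER_ORDER.indexOf(domainTier) < TIER_ORDER.indexOf(globalTier)
    ? domainTier
    : globalTier;
}
```

With this rule a domain override set higher than the global tier has no effect, so a per-domain setting cannot be used to escalate autonomy.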
…os (M2 Phase 5) 7 test groups covering all CLAUDE.md safety invariants: policy check enforcement, explanation logging, trust tier gating (all 5 tiers x risk levels), spend limits (per-action + daily), reversibility, feedback flow-back, and mandatory risk assessment. Plus domain autonomy and escalation trigger integration tests. 3 new safety regression scenarios for daily spend limits, domain autonomy, and new user OBSERVER tier. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
WorkflowHandlerRegistry maps SituationType to handler functions, generalizing the events route. Four new handlers (calendar-conflict, subscription-renewal, grocery-reorder, travel-decision) plus matching E2E tests that exercise the full pipeline for each situation type. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…se 8) Settings API: GET/PUT autonomy settings, PUT/DELETE domain policies, POST/PATCH/DELETE escalation triggers. Settings page extended with spend limit controls, domain override management, and escalation trigger configuration. DB package now exports M2 repositories. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ncy) and 39 eval scenarios Phase 9 of the build plan. Three new metric trackers for measuring decision quality: EscalationCorrectnessTracker (precision/recall/F1), CalibrationErrorTracker (ECE), DecisionLatencyTracker (P50/P90/P99). Five new scenario files covering calendar (8), subscription (8), grocery (8), travel (8), and cross-domain (7) situations. All wired into ContinuousEvalRunner and exported from index. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
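For reference, the three metric families named above can be computed as follows. This is a sketch of the standard definitions, not the trackers' actual code: equal-width binning for ECE, nearest-rank percentiles, and all function names are assumptions.

```typescript
// Precision / recall / F1 from confusion counts (e.g. "should this have escalated?").
function f1Score(tp: number, fp: number, fn: number): number {
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  return precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
}

interface Prediction {
  confidence: number; // 0..1
  correct: boolean;
}

// Expected Calibration Error: weighted mean of |accuracy - mean confidence| per bin.
function expectedCalibrationError(preds: Prediction[], bins = 10): number {
  if (preds.length === 0) return 0;
  const buckets = Array.from({ length: bins }, () => ({ n: 0, conf: 0, hits: 0 }));
  for (const p of preds) {
    const b = buckets[Math.min(bins - 1, Math.floor(p.confidence * bins))];
    if (!b) continue;
    b.n += 1;
    b.conf += p.confidence;
    if (p.correct) b.hits += 1;
  }
  let ece = 0;
  for (const b of buckets) {
    if (b.n === 0) continue;
    ece += (b.n / preds.length) * Math.abs(b.hits / b.n - b.conf / b.n);
  }
  return ece;
}

// Nearest-rank percentile (assumed method) for P50/P90/P99 latency.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] as number;
}
```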
…CI workflow Phase 10 of the build plan. PreferenceEvolutionTracker records every preference change with attribution (feedback, evidence, explicit, inference) and supports point-in-time state reconstruction. TemporalReplayEngine diffs twin state between two dates using twin_profile_versions + preference_history. TwinService now wires evolution tracking into updatePreference() and processFeedback(). GitHub Actions workflow runs eval suite on push/PR to main with safety regression gating. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…se 7) Rollback E2E tests verify execute→rollback lifecycle, irreversible rejection, unknown plan handling, failure paths, operation log tracking, and health toggling against MockIronClawAdapter. Contract tests run identical assertions against both MockIronClawAdapter and DirectExecutionAdapter to verify behavioral parity on execute, rollback, healthCheck, and multi-step plans. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- ask.ts: look up user trust tier from DB instead of hardcoding OBSERVER
- briefings.ts: query proactiveScanRepository for real briefings, persist preferences to user autonomy_settings
- skill-gaps.ts: query skillGapRepository with limit and actionType filter
- proposals.ts: look up proposal by ID, update status via proposalRepository, create preference on twin profile when accepted
- openclaw-adapter.ts: real HTTP client with /execute and /rollback endpoints, dry-run fallback when no server configured, proper rollback with plan tracking
- continuous-runner.ts: store per-scenario results in EvalRun.scenarioResults, reconstruct previous results for regression comparison
- eval-types.ts: add optional scenarioResults field to EvalRun

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…se 7) MockIronClawServer: minimal HTTP server mimicking IronClaw's webhook API with HMAC-SHA256 verification, configurable responses, and message recording. Contract tests: validate that RealIronClawAdapter and MockIronClawAdapter produce compatible outputs (ExecutionResult, RollbackResult, healthCheck). 15 tests covering buildPlan, execute, rollback, healthCheck, and HMAC auth. Rollback E2E: 6 tests covering execute-then-rollback, irreversible rejection, unknown plan, operation logging, and independent multi-plan rollback. Execution router: added rollback-through-router integration test. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
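HMAC-SHA256 webhook verification of the kind the mock server performs can be sketched with Node's built-in `crypto` module. The hex encoding, the function names, and how the signature travels (header name, prefix) are assumptions; the commit does not specify IronClaw's wire format.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Compute the hex HMAC-SHA256 of the raw request body with the shared secret.
function signBody(rawBody: string, secret: string): string {
  return createHmac("sha256", secret).update(rawBody).digest("hex");
}

// Verify a received signature in constant time.
function verifySignature(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = Buffer.from(signBody(rawBody, secret), "hex");
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

Verification should run against the raw body bytes before JSON parsing, since re-serializing a parsed payload can change whitespace or key order and invalidate the signature.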
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a dropdown in the page header that lets users switch between Mission Control (dense/monospace), Quiet Confidence (minimal/Linear), and Warm Glass (glass-morphism/gradients) themes, each with dark and light modes. Persists to localStorage with no flash on reload. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pnpm/action-setup@v4 now reads the version from package.json's packageManager field. Specifying both causes a conflict error. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Read trustTier from DB instead of event payload in workflow handlers (trust-boundary violation, defaulted to OBSERVER not MODERATE_AUTONOMY)
- Critical undo now drops to OBSERVER instead of one tier down
- Reject negative spend amounts to prevent bypass
- Add atomic checkAndRecordSpend to prevent TOCTOU race in spend tracking
- Add ownership checks on escalation trigger PATCH/DELETE routes
- Input validation for spend limit settings
- Fail-closed for unknown escalation trigger types
- Parameterize SQL interval expressions to prevent injection
- Add findById to escalation trigger repository
- Update tests for negative-cost rejection and critical-undo behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
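Two of the fixes above, negative-spend rejection and fail-closed trigger evaluation, can be illustrated as follows. The trigger type strings, field names, and function names are hypothetical; only the behavior (reject negatives, escalate on unknown types) is taken from the commit message.

```typescript
// Hypothetical trigger and context shapes for illustration.
interface EscalationTrigger {
  type: string;
  threshold?: number;
}
interface DecisionContext {
  amountCents: number;
  confidence: number; // 0..1
}

// A negative amount would reduce the running daily total and bypass the limit.
function assertValidSpend(amountCents: number): void {
  if (!Number.isFinite(amountCents) || amountCents < 0) {
    throw new RangeError(`invalid spend amount: ${amountCents}`);
  }
}

function shouldEscalate(trigger: EscalationTrigger, ctx: DecisionContext): boolean {
  switch (trigger.type) {
    case "amount_threshold":
      return ctx.amountCents >= (trigger.threshold ?? 0);
    case "low_confidence":
      return ctx.confidence < (trigger.threshold ?? 1);
    default:
      // Fail closed: an unrecognized trigger type escalates instead of being skipped.
      return true;
  }
}
```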
…emma4 default

Wire real Ollama LLM execution through OpenClaw bridge with trust-ranked adapter fallback chain (IronClaw → Direct → OpenClaw). Expand the system from 6 to 12 situation types with 60+ action types across finance, smart home, task management, social media, documents, and health domains.

Key changes:
- OpenClaw bridge server (Node.js) bridging SkyTwin to Ollama gemma4
- 6 new IronClaw action handlers with throw-for-fallback pattern
- Decision engine: 6 new candidate generators and situation classifiers
- 5-step onboarding: welcome, identity, domain selection (10 cards), preference seeding (2 questions/domain), trust tier choice
- Direct adapter throws on missing handlers to enable fallback chain
- All route files fixed to use proper TwinRepositoryAdapter (not raw repos)
- Policy repository adapter filters malformed string-condition rules
- 48 new eval scenarios (8 per domain)
- 3 seed users with cross-domain decision history
- All IDs use crypto.randomUUID() for CockroachDB UUID columns
- Desktop app scaffold (Electron) for Mac/Windows

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CI failed with ERR_PNPM_OUTDATED_LOCKFILE because the desktop app's electron dependencies were not in pnpm-lock.yaml. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Direct adapter now throws instead of soft-failing when no handler is registered, enabling the execution router's fallback chain to continue to OpenClaw. Updated 2 tests that expected the old soft-failure behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
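The reason a throw matters here: a router that walks an ordered adapter list can only move on when the current adapter signals failure by throwing. A minimal sketch of that pattern; the `Adapter` interface and `executeWithFallback` are hypothetical and simplified compared with the real execution router.

```typescript
// Hypothetical, simplified adapter contract.
interface Adapter {
  name: string;
  execute(actionType: string): Promise<string>;
}

// Try adapters in trust order; a throw means "I cannot handle this, try the next one".
async function executeWithFallback(
  adapters: Adapter[],
  actionType: string,
): Promise<{ adapter: string; result: string }> {
  const errors: string[] = [];
  for (const adapter of adapters) {
    try {
      return { adapter: adapter.name, result: await adapter.execute(actionType) };
    } catch (err) {
      errors.push(`${adapter.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`all adapters failed: ${errors.join("; ")}`);
}

// A direct-style adapter that throws when no handler is registered.
function makeDirectAdapter(handlers: Map<string, () => string>): Adapter {
  return {
    name: "direct",
    async execute(actionType) {
      const handler = handlers.get(actionType);
      if (!handler) throw new Error(`no handler for ${actionType}`);
      return handler();
    },
  };
}
```

Under the old soft-failure behavior the direct adapter would have returned a failed result object, which the loop above would treat as a final answer and never reach the last adapter.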
- bin/skytwin-dev: starts all services (CRDB, API, Web, Worker, OpenClaw) with a single command, auto-builds, migrates, seeds. Supports --stop and --no-ollama flags. Manages PID files for clean shutdown.
- bin/skytwin-install: detects OS (macOS/Linux/WSL) and installs only missing dependencies (Node via nvm, pnpm via corepack, Docker, Ollama). Never overwrites existing installs. Pulls gemma4 model for Ollama.
- Root package.json: added "start", "stop", "setup" convenience scripts.
- .env.example: expanded with all service ports and optional config.

Usage:
  pnpm setup   # install everything (safe, idempotent)
  pnpm start   # start all services
  pnpm stop    # tear down

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove trust tier from user creation request body (must be earned, not declared)
- Verify approval ownership before executing (user_id must match)
- Run policy evaluator on approved actions (spend limits still apply)
- Fix LLM bridge parse failure returning success:true → now success:false
- Delete divergent server.ts duplicate (server.mjs is canonical, model already drifted)
- Fix bridge port default 4100→3456 to match .env.example and skytwin-dev
- Add 30s timeout on Ollama fetch to prevent hung requests
- Fix desktop service restart counter resetting before health check
- Cap seed-preferences at 100 and use Promise.all for parallel writes
- Persist execution step info instead of empty arrays for rollback support
- Move TwinService to router-level scope in users route

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
5 tasks
jayzalowitz added a commit that referenced this pull request on Apr 26, 2026
Closes #75. Three additive guards on the safety kernel:

1. ExecutionRouter throws InvariantViolationError when called without a RiskAssessment or with a CandidateAction whose id does not match the assessment. Pins Safety Invariants #1 and #7 at the boundary, so a future caller that bypasses the decision pipeline cannot silently auto-execute. (+4 unit tests)
2. DecisionMaker.whatWouldIDo no longer leaks blocked candidates as alternativeActions when policy denies every candidate. Returns an empty alternatives array and surfaces the blocking reason via policyNotes so the prediction reflects what the user could actually take. (+1 unit test pinning the no-leak contract)
3. POST /api/events/ingest emits a decision:blocked-by-policy SSE event when no action was selected and no approval was created, so users see the policy result instead of silent ingestion. (+1 unit test)

Production call sites (apps/api/src/routes/events.ts:230, apps/api/src/routes/approvals.ts:264) already build matching RiskAssessments, so the guards are inert for them and active against new orphan callers.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
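The first guard can be sketched as a boundary check that runs before any adapter is selected. The error class name and the id-match rule come from the commit message; the `CandidateAction` and `RiskAssessment` shapes and the `assertRiskAssessment` helper are simplified assumptions.

```typescript
class InvariantViolationError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "InvariantViolationError";
  }
}

// Simplified, hypothetical shapes; the real types carry per-dimension detail.
interface CandidateAction {
  id: string;
  actionType: string;
}
interface RiskAssessment {
  actionId: string;
  overallTier: string;
}

// Refuse to execute unless an assessment exists AND belongs to this exact action.
function assertRiskAssessment(
  action: CandidateAction,
  assessment: RiskAssessment | null | undefined,
): RiskAssessment {
  if (!assessment) {
    throw new InvariantViolationError(`no RiskAssessment for action ${action.id}`);
  }
  if (assessment.actionId !== action.id) {
    throw new InvariantViolationError(
      `RiskAssessment ${assessment.actionId} does not match action ${action.id}`,
    );
  }
  return assessment;
}
```

Returning the validated assessment lets the caller continue with a non-nullable value, so later code cannot forget the check.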
jayzalowitz added a commit that referenced this pull request on May 5, 2026
…line (#148 v1) (#153) Closes #148 v1 — final phase-2 piece for the assistant epic. Chat now routes detected action intents ("archive that email", "schedule a meeting") through the existing decision pipeline. Saying an action in chat creates an ApprovalRequest on the existing #/approvals page; conversational messages still go through the LLM chat path unchanged. Conservative v1 — chat-driven actions ALWAYS land in approvals, never auto-execute, even when the engine returns autoExecute=true. Free-text is too ambiguous to bypass the approval step on the first cut. Phase 2 of #148 lifts this when we have an LLM-confidence score + per-user opt-in. Safety invariants — all upheld: - #1 (no auto-exec without policy): every intent runs through DecisionMaker.evaluate() → PolicyEvaluator.evaluate(). No bypass. - #2 (always log explanations): ExplanationGenerator.generate() persists for every chat-driven decision. Persist failure logs but doesn't abort. - #3 (trust tiers): pulled from user record, never from chat input. - #4-#7 inherited from DecisionMaker. Pieces: - @skytwin/assistant: detectIntent (rule-based regex/keyword classifier, tolerant to short/ambiguous messages, false-positive guarded) + ActionRouter port + AssistantService.routeIntent. Package stays free of decision-engine + db deps. - apps/api/src/routes/assistant.ts: buildActionRouter() factory wires TwinService + PolicyEvaluator + DecisionMaker + ExplanationGenerator + LabelInferencePort. Synthetic DecisionObject from chat intent. Persists ApprovalRequest. Emits approval:new SSE. - POST /messages branches on intent BEFORE the LLM call. Both sync JSON and SSE response paths supported. - pages/assistant.js: action footer renders approval-link or blocked notice based on metadata.intentRoute. CSS styled with theme variables. Tests: 16 new (12 intent classifier + 4 routeIntent). Full suite green across 40 packages; lint clean. 
Phase 2 epic is now complete: #146 streaming + #147 context + #149 multi-turn + #148 action routing. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 5, 2026
jayzalowitz added a commit that referenced this pull request on May 5, 2026
Picks up the P1/P2 findings deferred from #154 (the hardening pass). 6 of the remaining UX review findings closed. Settings cleanup (P1 #7, #9) - Theme switcher relocated from page header (where it looked like a breadcrumb pill) to a dedicated "Visual theme" card in Settings. - AI provider section now titled "AI brain — needed for Chat (optional otherwise)" instead of just "(optional)" — the Chat feature requires it, the old title was misleading. Chat → Settings deep-link (P1 #9) - When POST /messages returns 409 "no AI provider configured", the chat bubble explains the dependency + footer link to Settings → AI brain. Previously the user got "No AI provider configured" with no path forward. Onboarding modal dimmer (P2 #15) - Bumped overlay rgba(0,0,0,.85) → .92 + backdrop-filter: blur(4px) so the sidebar/page behind the modal is properly muted (was bleeding through the glass effect at .85). Console error spam reduction (P2 #20) - New isApiKnownOffline() in api-client.js. Badge-poll loop in app.js backs off 10s → 60s when API is known down. Pre-fix produced 110+ console errors/min against a dead server. Date input theming on Audit (P1 #11) - New .themed-date class so native date inputs match the dark glass aesthetic (background, border, color-scheme for picker icon). - Webkit calendar-picker-indicator filter inverts the icon glyph on dark themes so it's actually visible. Tests: no new unit tests (browser-only). Backend suite still green across 40 packages. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 25, 2026
Closed
jayzalowitz added a commit that referenced this pull request on May 25, 2026
… closed) (#417) * P1.1 #371: stop fabricating synthetic RiskAssessments at exec boundaries Closes #371. Safety Invariant #7 requires every CandidateAction carry a RiskAssessment that the router actually consumes. Pre-fix, BOTH execution boundaries (events.ts auto-execute, approvals.ts approve-execute) discarded the decision-maker's per-dimension assessment and constructed a fresh one: - events.ts re-derived from explanation.riskTier (a flat enum) and broadcast that single tier across all six dimensions. A candidate the decision-maker assessed HIGH on financial impact was routed LOW. - approvals.ts hardcoded LOW on every dimension with the comment "user-approved = lower risk." A human click does not move the underlying risk dimensions — the adapter selection then picked a less-guarded adapter than the decision-maker intended. Fix is on the consumer side; no changes to decision-maker, repository, or execution-router invariants: - events.ts: read `outcome.riskAssessment` directly (already attached by decision-maker.ts:263). Fall back to `decisionRepositoryAdapter.getRiskAssessment(actionId)` if absent. If both null, FAIL CLOSED: escalate to manual approval rather than fabricate. Drops the now-unused DimensionAssessment / RiskTier / RiskDimension imports. - events.ts approval-create payload now stamps the original `outcome.selectedAction.id` into the stored candidate_action JSONB, so the approve handler can recover the linkage. - approvals.ts: preserve the original candidate id from the stored action (was generating a fresh UUID, losing the lookup linkage). Look up the persisted assessment via decisionRepositoryAdapter. If missing, HTTP 409 risk_assessment_missing — same fail-closed posture as events.ts. Legacy approval rows from before this PR don't carry an `id` on the stored candidate_action and will hit the 409. The recovery is to re-trigger the decision so a fresh assessment is persisted. 
Tests: - events-routes.test.ts: mock decisionRepositoryAdapter.getRiskAssessment to return a baseline LOW assessment so existing autoExecute-path tests proceed (mocks didn't include riskAssessment on the outcome). - feedback-loop.test.ts: same mock + add valid UUID id to the candidate_action fixture rows so the approve path's lookup succeeds. - All 713 tests still pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * P1.1 (post-Copilot): move risk-assessment 409 to preflight; echo id in test mocks Copilot review on PR #417 caught two concerns: 1. **The 409 risk_assessment_missing branch fired AFTER side effects.** approvalRepository.respond() had already marked the row 'approved', feedback was recorded, the episode was written, the memory port recorded the episode — then the 409 left an "approved" row with no execution and no way to retry (status conflict on the next attempt). Fix: pre-check the persisted RiskAssessment immediately after the dual-confirm gate, before any state mutation. The lookup uses `existing.candidate_action.id` (which events.ts now stamps) and stores both `preflightCandidateId` + `preflightRiskAssessment` for the later execute block to reuse. The execute block's redundant lookup is removed; it asserts non-null on the preflight result because the early 409 already returned for the missing case. 2. **Test mocks returned a fixed actionId, masking the router's actionId-match invariant.** Fix: both events-routes.test.ts and feedback-loop.test.ts now use `mockImplementation(async (actionId) => ({ actionId, ... }))` so the returned assessment echoes the input, and the `action.id === riskAssessment.actionId` invariant in execution-router can't silently pass on a mismatched fixture. All 713 API tests still pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
jayzalowitz added a commit that referenced this pull request on Jun 6, 2026
Correctness: - deadline urgency: stale (past-relative-to-now) deadlines no longer read as critical; far-out deadlines no longer DOWNGRADE a type's default urgency (#1/#2) - security markers curated to specific phrases — kill false positives on shipping notices / "welcome back" / articles (#3); marker check also applied on the LLM path so escalate-only holds regardless of classifier (safety defense-in-depth) - digest emits signalRefs[] so citation chips actually render (#4) - scope gate now covers calendar RSVP/invite write actions (#5) - commitment extractor: clause-level negation (keep real commitments sharing a sentence with "if I…") (#6); "by <person>" no longer a deadline hint (#7) - entity resolver compares full normalized string, not the truncated slug (#10) Hardening/robustness: - demo-guard isLocalDbTarget: exact host match, not substring (#8) - provisionNewUser is genuinely best-effort (try/catch) — never 500s after the user row exists - briefing-generator pinned to prompt v1 until it consumes v2 structured output (avoids requesting+discarding todos/topics); v2 deterministic_fallback fixed - briefing test mock provides userRepository.getLocale so the LLM-prose path is actually exercised (#13) Regression tests added for each. Full suite green (70/70 tasks). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
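The demo-guard fix (#8) is a general pattern: parse the connection string and compare the hostname exactly, because a substring test accepts hosts such as `localhost.evil.example`. A sketch under assumptions: the accepted host list and this implementation of `isLocalDbTarget` are illustrative, not the repository's code.

```typescript
// Assumed allowlist of local hosts; "[::1]" is how the URL parser reports IPv6 loopback.
const LOCAL_HOSTS = new Set(["localhost", "127.0.0.1", "[::1]"]);

// Illustrative re-implementation: exact hostname match on the parsed URL.
function isLocalDbTarget(connectionString: string): boolean {
  try {
    return LOCAL_HOSTS.has(new URL(connectionString).hostname.toLowerCase());
  } catch {
    return false; // unparseable target: fail closed, treat as non-local
  }
}

// The buggy shape this replaces, kept only for comparison.
function isLocalBySubstring(connectionString: string): boolean {
  return connectionString.includes("localhost");
}
```

The two functions agree on an ordinary local URL and disagree on a remote host whose name or database path merely contains the word "localhost".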
jayzalowitz added a commit that referenced this pull request on Jun 10, 2026
) (#488) * feat(decision-engine): SignalText multi-source accessor + capability matrix (spec 07, #480) Normalize any RawSignal (email/calendar/filesystem/voice) into a channel-agnostic SignalText so commitment/deadline/security/cluster/entity capabilities are source-agnostic. Extends AuthoringTier with authored_originated/received_shared; adds a tested capability×source coverage matrix. Foundation for #475/#476/#479. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(api,db): observer default + new-user provisioning + seedUpsert (spec 10, #483) - LOCKED: new users default to trust_tier 'observer' (users.ts) — matches DB default + CLAUDE.md; resolves the 3-way conflict that forced 'suggest'. - provisionNewUser: eager empty twin profile + conservative autonomy defaults (no spend caps, so the built-in NO_SPEND_WITHOUT_LIMIT gate blocks spend until the user sets a budget — safe by construction). - seedUpsert/buildUpsertSql: shared, tested idempotent upsert helper for re-runnable seeds (used by spec 09). Existing seed.ts already idempotent. Part C (promotion soak-floor hoursInCurrentTier + tier-ladder intro) still TODO. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(worker,db): enforce promotion soak-floor via hoursInCurrentTier (spec 10 Part C, #483) Daily promotion-eligibility job now populates hoursInCurrentTier (from last tier change or account creation), so the engine actually enforces minDurationInTierHours (24h observer->suggest, etc). Closes the documented gap where the floor was skipped in the auto path. Fail-safe 0 keeps a promotion blocked when time can't be derived. Tier-ladder intro UI folds into spec 08. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(decision-engine): deadline extraction feeds urgency (spec 03, #476) extractDeadline parses absolute/relative dates (chrono-node) from any text-bearing signal (SignalText-compatible) and returns the earliest credible FUTURE deadline. 
situation-interpreter.enrichDeadline stamps rawEvent.deadline when the connector didn't, so the existing assessUrgency consumer finally gets fed. Rejects past dates + no-match. v1 leaves per-user-timezone resolution to spec 12. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(decision-engine): commitment extraction from authored content (spec 02, #475) extractCommitments surfaces the user's own stated obligations ("I'll send the draft tomorrow" -> "Send the draft tomorrow") from authored SignalText. Gated to authoredByUser + the commitments source allowlist (safety invariant #8: never from inbound content). Rule extractor handles modal forms, excludes questions/past/third-party/hypotheticals, dedups, and emits a deadlineHint for spec 03. CommitmentStrategy seam left for an LLM path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(decision-engine): inbound security-alert classifier, escalate-only (spec 06, #479) Adds SituationType.SECURITY_ALERT (enums.ts). classifySituation matches inbound account-security markers FIRST (precedence over finance/email), urgency=high, domain=security. The candidate generator emits ONLY a human-review escalation that says "open the provider directly" with link-free parameters — never an auto-executable action, never a URL from the untrusted body (safety invariant #8). Provenance stays untrusted_external regardless of claimed sender. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(decision-engine): signal topic clustering for the digest (spec 04, #477) clusterSignals groups awareness signals into life-domain topic clusters for the Topics section. Anchors to known domains (beats the reference product's mis-filing), guarantees complete + non-overlapping partition, caps cluster count with overflow merged into "More updates" (logged via onMerge). Deterministic fallback ships; ClusterStrategy seam for an LLM path. 
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(decision-engine): source-coverage model for graceful degradation (spec 13, #487) computeCoverage evaluates the capability x source matrix against a user's connected accounts -> per-capability available/partial/unavailable + the sources that would unlock each, plus a coldStart flag (zero sources, distinct from connected-but-quiet). Excludes mock sources. Drives "connect X to unlock Y" transparency; UI affordances render in spec 08. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(policy,decision-engine): access-faithful gates — scope + hidden (spec 11, #485) Scope gate (policy-engine): requiredWriteScope/hasWriteScope/applyScopeGate. Wired into DecisionMaker.generateCandidates — when grantedScopes is supplied, un-granted write candidates (send/calendar) downgrade to a human-review "grant access" item. Fail-safe NOT granted (safety invariant #8). Visibility filter (decision-engine): isHidden/filterVisible — the single hide predicate the digest routes input through (briefing-generator wiring lands with spec 01). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(db,worker,decision-engine): locale & timezone faithfulness (spec 12, #486) Migration 063 adds users.language + users.timezone. userRepository.getLocale + resolveLanguage/resolveTimezone/isNonEnglish helpers (safe fallbacks: en / UTC with a logged-default flag). Briefing prose locale now reads the user profile instead of hardcoded 'en'. isNonEnglish is the LLM-vs-rule routing signal for the extractors (degraded-marker wiring is a follow-up on 02/03/06). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(db): launch demo fixture — opt-in, isolated, guarded (spec 09, #482) assertDemoSafe (3-gate invariant #0): explicit-only, prod hard-blocked + non-local needs override, identity isolation via is_demo (migration 064). Never wired into bin/skytwin-dev/auto-seed — can't run for a real or new user. 
demo-fixture.ts guards then upserts the reserved demo user + ingests a synthetic source-varied corpus (email/calendar/file/voice) through /api/events/ingest; --reset deletes is_demo rows only. `pnpm demo:fixture`. Guard fully unit-tested. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(decision-engine,policy-prompts): digest to-do/FYI split (spec 01, #474) buildDigest partitions items into action-required to-dos (urgency-ordered, capped) vs domain-clustered topics, with no overlap. Composes the epic: filters hidden content first (spec 11), clusters topics (spec 04), carries sourceType+deadline for the UI (spec 07/03). New briefing-prose v2 prompt emits the two-section structured payload (todos + topics). The structured_payload column + repo read + render land with spec 08 (UI). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(decision-engine): entity extraction + cross-signal resolution (spec 05, #478) extractEntities pulls people (emails) + orgs (suffix-tagged) from SignalText. resolveEntities links mentions to stable entityIds — exact email key for people (never fuzzy), token-overlap floor for orgs, conservative mint-on-doubt so a false merge can't corrupt the graph. linkEntitiesAcrossSignals aggregates "every signal touching X". Persistence reuses MemoryPort.recordEntity; the getSignalsForEntity port method is the remaining integration seam. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(web,api): digest UI — two-bucket, source-aware, cited (spec 08, #481) twin-briefing.js renders the structured digest: To-dos above Topics, each row with a source-type chip (email/calendar/file/voice) + citation chips that open the in-app signal detail (never an external URL — safety #8). Reuses the existing singleton-delegator + hash-gate + data-action conventions (new open-signal action). Falls back to prose when structured is null (back-compat). API /latest passes through structured (nullable, forward-compatible). 
CSS reuses card/badge tokens. Mobile BriefingScreen mirror is the remaining part of this spec. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(post-/review): address review findings across the epic Correctness: - deadline urgency: stale (past-relative-to-now) deadlines no longer read as critical; far-out deadlines no longer DOWNGRADE a type's default urgency (#1/#2) - security markers curated to specific phrases — kill false positives on shipping notices / "welcome back" / articles (#3); marker check also applied on the LLM path so escalate-only holds regardless of classifier (safety defense-in-depth) - digest emits signalRefs[] so citation chips actually render (#4) - scope gate now covers calendar RSVP/invite write actions (#5) - commitment extractor: clause-level negation (keep real commitments sharing a sentence with "if I…") (#6); "by <person>" no longer a deadline hint (#7) - entity resolver compares full normalized string, not the truncated slug (#10) Hardening/robustness: - demo-guard isLocalDbTarget: exact host match, not substring (#8) - provisionNewUser is genuinely best-effort (try/catch) — never 500s after the user row exists - briefing-generator pinned to prompt v1 until it consumes v2 structured output (avoids requesting+discarding todos/topics); v2 deterministic_fallback fixed - briefing test mock provides userRepository.getLocale so the LLM-prose path is actually exercised (#13) Regression tests added for each. Full suite green (70/70 tasks). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(web): collapse prose under a disclosure when the digest renders (design-review) Showing the structured two-bucket digest AND the full prose was the same briefing twice. When structured is present, the prose moves under a "Full briefing" <details> as the long-form view; falls back to inline prose when there's no structured payload. 
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(decision-engine,web): power view — inline technical depth (spec 14) One digest, two depths. Default stays the clean view (non-technical users unaffected); a discoverable header "Power view" toggle (persisted) + per-item "Details" expander reveal the depth SkyTwin already computes — provenance, confidence %, urgency reason, why-it-didn't-auto-run (scope/tier/policy), real source refs, and the explanation — plus a coverage panel ("what I can see, connect X to unlock Y"). Not buried in settings. buildDigestItemDetail is the pure view-model (raw codes -> human strings), unit tested. UI follows the singleton-delegator/hash-gate/data-action conventions. Digest payload carries optional per-item detail + coverage (generator populates). Verified rendering via a headless-browser screenshot. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(design): lock design system — calm command center, premium iris (DESIGN.md) Source of truth grounded in a full element-and-state inventory of the digest surfaces. Cool-neutral base (refines existing #0f1117 tokens; rejected the warm/brown direction), iris #7C72E8 as the SINGLE accent meaning "needs you / act", Fraunces voice + Geist + Geist Mono, action-vs-awareness hierarchy. Catalogs every element + EVERY state including the gaps never rendered before: cold-start, scope-blocked grant-access, loading, error, prose-fallback, distinct security treatment, provenance in default view. CLAUDE.md now points UI + /qa + /design-review at it. 
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(web): implement DESIGN.md in the digest — iris, two-zone, gap states (spec 15)

Wires the locked design system into the real digest UI:
- Load Fraunces (twin voice) + Geist + Geist Mono (index.html)
- Iris #7C72E8 as the single accent = "needs you / act"; killed the CAPS source-chip soup -> one neutral source mark + a single "·N sources" citation; provenance as a dot (neutral, never accent)
- Action zone (to-dos: checkbox + inline Draft/Snooze/Verify/Grant, hover-reveal, always-on for security + touch) vs awareness zone (topics: lighter, no edge)
- Twin voice (Fraunces) + value line ("✓ N handled · M need you · K to catch up")
- Power view detail panel + coverage panel restyled to the system
- GAP STATES now designed: loading skeleton, empty-quiet, cold-start ("connect a source"), prose-fallback disclosure, distinct security treatment, scope-blocked "Grant access".

Verified via headless-browser render of the real CSS. Row-action wiring (draft/snooze/verify) routes/acknowledges until the act layer lands. App-wide token adoption (vs digest-scoped iris) is a follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(web): make DOMContentLoaded handler async — SPA-breaking syntax error (pre-existing)

app.js:856 registered a non-async DOMContentLoaded handler, but the pairToken branch (line ~904) uses `await fetch(...)` → "Unexpected reserved word" at parse time, which aborts ALL app initialization. Every page rendered as an empty #page-content shell. Present on origin/main; web JS has no type-check or tests, so it shipped silently. One-word fix (() => → async () =>); verified by booting the seeded app and touring dashboard/decisions/approvals/settings.
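The failure mode above is a parse-time error, so it can be reproduced without a browser. The sketch below is illustrative only: the handler bodies and the `/pair` URL are hypothetical stand-ins for the real `app.js` code, and the check looks only for a `SyntaxError` because the exact message differs between script and module contexts.

```typescript
// Returns true when `source` parses as a function body, false on a SyntaxError.
// `new Function` only compiles the text; it never runs it, so `document`
// and `fetch` do not need to exist here.
function parses(source: string): boolean {
  try {
    new Function(source);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

// Shape of the bug: `await` inside a non-async arrow function.
const broken = `document.addEventListener("DOMContentLoaded", () => {
  const res = await fetch("/pair"); // hypothetical URL
});`;

// The one-word fix: mark the handler async.
const fixed = `document.addEventListener("DOMContentLoaded", async () => {
  const res = await fetch("/pair"); // hypothetical URL
});`;

console.log(parses(broken), parses(fixed)); // false true
```

Because the whole file fails to parse, none of its code runs, which matches the empty-shell symptom the commit describes. A check along these lines is one cheap way to catch this class of error in web JS that has no type-check.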
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(api,web): render the to-do/FYI digest live end-to-end (parity)

The digest existed as tested modules but never rendered in the running app: the briefing generator produces no structured payload, so /latest returned null and the UI fell back to "No briefing content yet". This closes that seam so the AI-inbox parity (to-dos vs topics, multi-source) actually shows.

- live-digest.ts: compute the structured digest from a user's recent decisions — read each decision's RawSignal through toSignalText (spec 07) for real, source-agnostic titles, partition via buildDigest (spec 01/04), attach power-view detail (spec 14) and coverage (spec 13).
- twin-briefings /latest: when no structured_payload is stored, compute the digest live (best-effort; degrades to prose on error) and synthesize a briefing envelope so the page renders parity today. Forward-compatible: a stored payload still wins once the worker writes one.
- dashboard: Home leads with a read-only digest hero (action zone first, DESIGN.md) linking to the full interactive /briefing; stop showing the "connect Google" nag once the twin has produced decisions.
- index.html: first-class "Briefing" nav link.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(api,web,db): show "Needs you" for pending-approval decisions

The decisions log mapped auto_executed=false to "You OK'd", which mislabels a decision still awaiting approval (notably an escalated security alert) as already approved. Surface the outcome's requires_approval through the API and add a distinct "Needs you" state so the log matches the Approvals page.

- decision-repository.getOutcomesForDecisions: also select requires_approval.
- decisions route: return requiresApproval per decision.
- decisions.js: Auto / Needs you / You OK'd / Pending, in that order.
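The label ordering in the last bullet might be expressed as below. This is a sketch under stated assumptions: the `Outcome` shape and the `decisionLabel` name are hypothetical, and the commit does not say how the real `decisions.js` tells "You OK'd" from "Pending", so treating a missing outcome as "Pending" is a guess.

```typescript
// Assumed shape; only auto_executed and requires_approval are named in the commit.
interface Outcome {
  autoExecuted: boolean;
  requiresApproval: boolean;
}

type DecisionLabel = "Auto" | "Needs you" | "You OK'd" | "Pending";

// Checks run in the order the commit lists: Auto, Needs you, You OK'd, Pending.
// Assumption: a decision with no outcome row yet is shown as "Pending".
function decisionLabel(outcome?: Outcome): DecisionLabel {
  if (outcome?.autoExecuted) return "Auto";
  if (outcome?.requiresApproval) return "Needs you";
  if (outcome) return "You OK'd";
  return "Pending";
}
```

Checking `requiresApproval` before falling through to "You OK'd" is the point of the fix: an escalated item that is still waiting can no longer be labeled as approved.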
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(api): describePreference never renders "[object Object]"

A structured preference value (e.g. a brand-preference object) fell through to String(value) and rendered as "[object Object]" in the dashboard "What I've learned" summaries. Render arrays/objects readably instead. Adds a regression test covering objects, nested objects, arrays, booleans, strings, numbers, and null.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(web): hide read controls on the live-computed briefing

The live digest (no stored row) carries the sentinel id 'live'; its "Mark as read" button POSTed to /briefings/live/read and 400'd on the UUID check. Gate the New badge + Mark-as-read on a persisted briefing so the control only shows when there's a real row to mark.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(api,decision-engine): make power-view digest detail meaningful

The power-view detail panel rendered noise: "URGENCY: Default for security", "REFS: email: 77538186" (an internal id slice), "WHY: Account notice" (just the title again), and no confidence at all. Feed it real technical depth instead:

- confidence: pull decision_outcomes.confidence -> a real percentage.
- source ref: the actual sender/organizer/file ("email: no-reply@accounts.example"), not an opaque decision-id slice.
- urgencyReason: a real driver ("Security alert — always sent to you", "New invite — awaiting your RSVP", "Routine — no deadline detected") via a new optional urgencyReason override on buildDigestItemDetail, instead of the generic "Default for <domain>".
- drop the redundant explanation (it duplicated the title).
- honest whyNotAutoExecuted: use the engine's real escalation_reason, and only fall back to the trust-tier gate when the item genuinely required approval — no fabricated "trust_tier:observer" on escalate-only items.
- normalizeUrgency: map the DB default 'normal' to 'medium', not 'low' (silent demotion).
- name the recent-decisions window; drop the redundant maxTodos override.

Adds a DB-mocked buildLiveDigest suite (cold start, to-do mapping + detail, malformed raw_event, provenance fail-safe, handledCount) plus normalizeUrgency and urgencyReasonFor helper tests. Fixes the sections-fold test's @skytwin/db mock to define query so the live-digest path resolves cleanly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(web): don't suppress connect heroes once the twin has data

Gating the Connect-Google/Connect-Gmail heroes on `hasAnyData` hid the onboarding CTA for users who have decisions but haven't connected Gmail (the "Calendar connected, Gmail not yet" segment) — the heroes already self-suppress when actually connected, so the extra gate only hurt real users. Revert to gating on tourMode only. Also drop a dead `t.kind === 'security'` branch in the Home digest hero (buildDigest never sets kind).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(digest): show what each item says + the recommended next step

The digest told you a title and a pile of system metadata (origin, confidence, "why it escalated") but not the two things that actually matter: what the item says and what to do about it. Surface both, sourced from data we already had:

- body: the real content (email snippet, event description, file excerpt, transcript) via toSignalText, rendered as a one-line preview under each title — visible by default, not buried in the power view.
- suggestedAction: the twin's recommended next step, taken from the pipeline's selected candidate action ("Accept this calendar invitation", "Review this security alert in the provider's official app — don't click links in the message"), with sensible fallbacks for escalate-only situations.

UI: the to-do/topic rows now lead with title -> what it says; the power-view detail leads with the actionable "suggested" step, and the trust metadata (origin/confidence/refs) drops below it.
The Home hero shows the content line plus an iris "→ next step" so it's actionable without opening anything.

Carries body through DigestItem/DigestTodo/DigestTopicItem + buildDigest, and adds suggestedAction to DigestItemDetail. Tests cover body extraction, the pipeline-selected action, and the security/RSVP fallbacks.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(digest): clean, user-facing next step on every item

Two gaps from the last pass: some suggestions were the rule-based engine's raw internal text ("Apply appropriate labels to this email", "Escalate to user: Decision needed regarding: transcript"), and the suggestion only showed in the power-view detail — so in the default view most items had no visible next step.

- suggestedActionFor now maps the structured selected action TYPE to plain English ("Accept the invite, or decline / propose another time", "Nothing needed — I'll file it", "Take a look and tell me what to do"), with a security-specific instruction and situation fallbacks. Every item gets a clean, user-facing step — no engine internals leak through.
- The "→ next step" now renders in the row itself for every item (to-do and topic), visible without the power view. The power-view detail drops back to the trust/technical metadata (origin, confidence, refs) it's meant for.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(digest): plain-language detail — drop the system vocabulary

The detail panel was accurate but spoke the way the system names things, not the way a person asks. A non-technical user can't parse "ORIGIN: Inbound — untrusted", "REFS", "NOT AUTO-RUN", a bare "CONFIDENCE: 80%", or "From your twin" — and "untrusted" reads as a threat rather than "you didn't write this". Rephrase everything user-facing:

- provenance: "Inbound — untrusted" -> "From someone else"; "From your twin" -> "From your assistant"; fail-safe stays "someone else".
- block reasons: "trust level (observer) asks me to check" -> "You've asked me to check with you before I act"; "From untrusted content" -> "It came from someone else, so I want your OK first". No internal codes leak.
- detail labels: origin/confidence/urgency/not-auto-run/refs become "where it's from / written by / how sure I am / why now / why I'm asking you".
- source ref: a real sender or a friendly "your calendar"/"a voice note", not an id slice or a filename echo.

Default view was already plain; this brings the power view to the same bar so "advanced" doesn't mean "fluent in our nouns". Tests updated to assert the plain wording and that no jargon leaks.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(digest): every expand earns its rows; rename to "Your briefing"

Make the detail expansion uniformly useful and cut the filler:

- add "when" (relative time) — was missing entirely.
- "why now" is explanatory for FYI items too ("Not time-sensitive — just so you're aware") instead of the meaningless "Normal priority".
- confidence gets a word: "fairly sure (80%)", "very sure (100%)".
- drop the redundant "written by: someone else" (the sender already shows it); keep "written by: you" only when you authored it (genuinely notable).
- friendly source when there's no sender ("a voice note", "your files").

Also rename the page "Twin Briefing" → "Your briefing" with a plain subtitle, matching the Home hero — "twin" is our metaphor, not a word a first-timer maps.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(live-digest): align urgencyReasonFor assertion with new wording

The critical-urgency reason changed to "Urgent — needs your attention now"; update the assertion from /critical/i to /urgent/i. (Caught by the full test run after the per-file runs passed — the prior commit shipped this red.)
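Why the old assertion went red can be checked directly against the string quoted in the commit. The snippet below is only an illustration of the two regexes; it does not call the real `urgencyReasonFor`.

```typescript
// The new critical-urgency reason, as quoted in the commit message.
const reason = "Urgent — needs your attention now";

// The old pattern no longer matches the reworded string; the new one does.
const oldAssertion = /critical/i.test(reason);
const newAssertion = /urgent/i.test(reason);

console.log(oldAssertion, newAssertion); // false true
```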
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(decision-engine): persist candidates before risk assessments

saveRiskAssessment runs `UPDATE candidate_actions ... WHERE id = ?`, but saveCandidates (the INSERT) ran AFTER it — so the UPDATE hit zero rows, the full RiskAssessment (overallTier/dimensions) was lost, and only the thin `{reasoning}` placeholder survived. At approve time the execute-preflight (getRiskAssessment → parseRiskAssessmentFromRow, which requires overallTier) then returned null → `risk_assessment_missing`, blocking the ENTIRE approve→execute path (no action could ever be executed).

Move saveCandidates ahead of the risk-assessment loop so the rows exist when the UPDATEs land. Adds a regression test asserting saveCandidates is invoked before the first saveRiskAssessment (via vi.fn invocationCallOrder).

Found via a safe end-to-end execution-stack test (mock adapter + isolated tokenless user + fake email); verified fixed: fresh fake email → approve → execution completed via the (mock) adapter, no risk_assessment_missing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(execution): safe end-to-end execution-stack harness + OpenClaw test docs

bin/skytwin-test-execution-stack: a repeatable, no-real-side-effects test of the full execution path (ingest → decide → policy/spend/risk gate → approval → execution router → adapter → result). Two safety layers: an isolated TOKENLESS test user (Direct handlers throw at resolveAccessToken before any Google fetch) + USE_MOCK_IRONCLAW (simulated adapter). Spins up its own mock-mode API on a test port; re-runnable; asserts the stack executed and recorded a result.

docs/testing-openclaw.md: how to exercise the OpenClaw adapter safely against local Ollama via the openclaw-bridge (verified working: Ollama installed, bridge completes a fake action end-to-end, simulated, nothing real touched).
Notes the router trust-ranking caveat (direct outranks openclaw, so isolate it to see OpenClaw execute) and the OPENCLAW_API_URL config.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(web): setup — don't surface IronClaw credential-sync when it's unreachable

The Connect (#/setup) page showed "Not fully synced to IronClaw" + a "Sync to IronClaw" button even when no IronClaw is configured/reachable (the common case), so clicking it failed with a connection error. Gate the sync lookup on ironclawSync.reachable: when IronClaw isn't reachable (no IronClaw, the local mock, or a remote that's down) the sync affordance is hidden entirely — it's an advanced feature that only applies to a real, reachable IronClaw. The execution adapter row still shows its true state (Running / Registered-but-unreachable / Not detected) via renderAdapterStatus.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(web,api): Vault page loads under dev-auth bypass (was "API may be offline")

The Credential Vault page (#/credential-vault) showed "Unable to load vault status. The API may be offline." on every load: the route's getUserId read only req.user?.id (unset under the localhost dev-auth bypass), with none of the req.query['userId'] fallback every other route has — so /credential-vault/status 400'd with "userId is required". Add the standard session→query→body userId fallback (ownership still gated by requireOwnership when a real session exists), and pass userId on the web's init/rotate/lock/unlock POST bodies so those work under bypass too.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(web): setup — optional execution adapters read as optional, not failed

An optional, unconfigured execution adapter (IronClaw / OpenClaw) rendered as "Not detected" in the setup page's Live status — which reads like something is broken. For optional engines, that's not a failure: most users never run them (the always-available Direct adapter handles actions).
renderAdapterStatus now takes an `optional` flag; an optional adapter that isn't registered shows "Optional — not connected" (calm, muted) instead of "Not detected". Direct still shows "Not detected" if it ever went missing (a real problem). This is the proper fix — correct whether or not a mock IronClaw is running, so no demo crutch is needed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs: add Codex agent instructions

* fix: address inbox intelligence review findings

* fix: require approval for missing-scope escalations

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
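The `optional` flag from the setup-page commit above might behave roughly like this. It is a sketch, not the real `renderAdapterStatus`: the `AdapterState` shape and the `adapterStatusLabel` name are hypothetical, and the exact wording of the unreachable state is assumed from the "Registered-but-unreachable" shorthand in the log.

```typescript
// Hypothetical adapter state; the real shape is not shown in the commits.
interface AdapterState {
  registered: boolean;
  reachable: boolean;
}

// An optional adapter that is not registered reads as optional, not as a
// failure. A required adapter (for example Direct) still reads "Not detected".
function adapterStatusLabel(state: AdapterState, optional = false): string {
  if (!state.registered) {
    return optional ? "Optional — not connected" : "Not detected";
  }
  // Assumed wording for the registered-but-unreachable state.
  return state.reachable ? "Running" : "Registered but unreachable";
}
```

Because the label depends only on the adapter's own state and the flag, it gives the same answer whether or not a mock IronClaw happens to be running.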
Summary
Test plan
pnpm test)
pnpm build)

🤖 Generated with Claude Code