Umbrella: refactor onboarding into a serializable FSM (#3802)

This issue is the umbrella for the **core onboarding FSM workstream**. It replaces the earlier mixed timeline with one canonical sequence.

## Summary

Refactor NemoClaw onboarding into a **serializable finite state machine** that can be resumed, instrumented, observed by hooks, and eventually extended toward finer-grained recovery.

The core goal is **major step-boundary resume**, not mid-operation resume. Existing onboarding already persists coarse progress in `src/lib/state/onboard-session.ts`; this work formalizes that into an explicit machine and then moves orchestration into state handlers.

## Scope of this issue

This issue tracks two implementation milestones only (Milestone 0 in the roadmap below is review/design alignment):

1. **Milestone 1 — Observable FSM shell**: add machine state, events, runtime, and observe-only hooks while preserving the current imperative flow.
2. **Milestone 2 — FSM-owned orchestration**: extract the current onboarding steps into explicit state handlers owned by the machine runner.

Everything after that — richer diagnostics, fine-grained substates, UI/supervisor integration — should be opened as follow-up issues once Milestones 1 and 2 are reviewed.

## Non-goals for this issue

Do **not** attempt full mid-step resume in this workstream.

Out of scope here:

- resume inside gateway startup subprocesses,
- resume inside Docker image builds,
- resume inside sandbox create streams,
- resume inside provider credential upserts,
- resume inside model validation probes,
- resume inside policy application,
- external executable hooks,
- TUI/web/supervisor control plane.

Those belong to follow-up issues after the coarse FSM is stable.

## Current code shape

Important files:

- `src/lib/onboard.ts`
  - Large imperative onboarding flow.
  - Contains most orchestration and resume/reuse logic.
  - Uses `startRecordedStep(...)`, `skippedStepMessage(...)`, and direct `onboardSession` calls.

- `src/lib/state/onboard-session.ts`
  - Persistent JSON session management.
  - Stores current step status, timestamps, selected provider/model/sandbox, failure metadata, GPU/messaging/policy fields, and lock state.

- `src/lib/agent/onboard.ts`
  - Handles non-OpenClaw agent setup.
  - Currently writes `agent_setup` completion/failure directly through `onboardSession`.

- `src/lib/onboard/local-inference-topology.ts`
  - Runtime-derived local inference topology helper.
  - Decides whether the sandbox needs Ollama auth proxy fronting and whether Ollama systemd override repair is needed.
  - This kind of decision should be recomputed on resume, not persisted as durable FSM context.

Current persisted steps include:

- `preflight`
- `gateway`
- `provider_selection`
- `inference`
- `sandbox`
- `openclaw`
- `agent_setup`
- `policies`

`messaging` is displayed as a step today but is embedded in sandbox creation. Keep that behavior through both milestones in this issue; PR 10 leaves it embedded in the sandbox handler.

## Canonical workstream

This is the single roadmap for this issue.

| Milestone | PR group | Purpose | Output |
| --- | ---: | --- | --- |
| 0 | — | Review/design alignment | Agreement on state names, event semantics, skipped-state semantics, and session compatibility |
| 1 | 1 | FSM vocabulary and transition types | New machine types and transition tests; no behavior change |
| 1 | 2 | Structured events around current session mutations | Redacted state/context events emitted from existing session operations |
| 1 | 3 | Session machine snapshot | Backward-compatible persisted `machine` snapshot with normalization from old sessions |
| 1 | 4 | `OnboardRuntime` | Runtime owns transitions, context updates, failure handling, redaction, and event emission |
| 1 | 5 | Route current step-boundary helpers through runtime | Existing flow still mostly imperative, but step boundaries use the runtime |
| 1 | 6 | Observe-only hook API | Hooks can observe redacted events; hooks cannot mutate/veto state |
| 2 | 7 | Extract preflight state handler | Preflight logic becomes an explicit handler |
| 2 | 8 | Extract gateway state handler | Gateway reuse/recreate/start logic becomes an explicit handler |
| 2 | 9 | Extract provider selection and inference handlers | Provider/model selection and inference setup become explicit handlers; split into 9a/9b if needed |
| 2 | 10 | Extract sandbox handler | Sandbox reuse/recreate/create logic becomes an explicit handler; split if needed |
| 2 | 11 | Extract OpenClaw/agent setup, policies, finalization | Final step groups become explicit handlers; split into 11a/11b/11c if needed |

There are **11 PR groups** in scope. A PR group may split into smaller reviewable PRs, but the milestone sequence should stay the same.

## Target machine states

Initial coarse states:

```ts
export type OnboardMachineState =
  | "init"
  | "preflight"
  | "gateway"
  | "provider_selection"
  | "inference"
  | "sandbox"
  | "agent_setup"
  | "openclaw"
  | "policies"
  | "finalizing"
  | "post_verify"
  | "complete"
  | "failed";
```

Initial transition graph:

```text
init -> preflight
preflight -> gateway
gateway -> provider_selection
provider_selection -> inference
inference -> provider_selection     # retry provider/model selection
inference -> sandbox
sandbox -> openclaw
sandbox -> agent_setup
openclaw -> policies
agent_setup -> policies
policies -> finalizing
finalizing -> post_verify
post_verify -> complete
any non-terminal state -> failed
```

Use `finalizing` and `post_verify` so `complete` can eventually mean “all onboarding work and post-onboard UX completed,” not merely “session was marked complete.”
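
A minimal sketch of how the transition graph above could be encoded and checked. The names `TRANSITIONS`, `isTerminal`, and `canTransition` are illustrative assumptions, not existing code; the state type is repeated so the snippet stands alone.

```ts
type OnboardMachineState =
  | "init"
  | "preflight"
  | "gateway"
  | "provider_selection"
  | "inference"
  | "sandbox"
  | "agent_setup"
  | "openclaw"
  | "policies"
  | "finalizing"
  | "post_verify"
  | "complete"
  | "failed";

// Forward edges copied from the transition graph; terminal states have none.
const TRANSITIONS: Record<OnboardMachineState, readonly OnboardMachineState[]> = {
  init: ["preflight"],
  preflight: ["gateway"],
  gateway: ["provider_selection"],
  provider_selection: ["inference"],
  inference: ["provider_selection", "sandbox"], // includes the retry edge
  sandbox: ["openclaw", "agent_setup"],
  openclaw: ["policies"],
  agent_setup: ["policies"],
  policies: ["finalizing"],
  finalizing: ["post_verify"],
  post_verify: ["complete"],
  complete: [],
  failed: [],
};

// Hypothetical helper: treats "complete" and "failed" as the terminal states.
function isTerminal(state: OnboardMachineState): boolean {
  return state === "complete" || state === "failed";
}

// "any non-terminal state -> failed" is applied as a rule instead of being
// repeated in every row of the table.
function canTransition(from: OnboardMachineState, to: OnboardMachineState): boolean {
  if (to === "failed") return !isTerminal(from);
  return TRANSITIONS[from].includes(to);
}
```

Keeping the `failed` edge as a rule rather than table data means a newly added state cannot forget it, at the cost of one special case in the checker.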

## Initial event vocabulary

```ts
export type OnboardMachineEventType =
  | "onboard.started"
  | "onboard.resumed"
  | "onboard.completed"
  | "onboard.failed"
  | "state.entered"
  | "state.exited"
  | "state.skipped"
  | "state.completed"
  | "state.failed"
  | "state.repair.started"
  | "state.repair.completed"
  | "state.repair.failed"
  | "context.updated"
  | "resume.conflict"
  | "hook.started"
  | "hook.completed"
  | "hook.failed";
```

Later follow-up issues can add command/probe/credential-specific events if needed.

## Required design principles

### 1. Preserve current behavior first

The FSM should describe current onboarding before redesigning it.

Avoid changing:

- prompts,
- defaults,
- gateway reuse behavior,
- sandbox recreation conditions,
- credential migration rules,
- registry write timing,
- policy carry-forward behavior.

### 2. Keep old sessions readable

Existing `~/.nemoclaw/onboard-session.json` files must normalize correctly.

When new `machine` fields are absent, infer state from current fields:

1. `status === "complete"` -> `complete`.
2. `status === "failed"` -> `failed`.
3. in-progress `lastStepStarted` -> that step.
4. completed `lastCompletedStep` -> next logical state.
5. otherwise -> `init`.
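
The inference order above could be sketched roughly as follows. This is not the real session schema: `LegacySessionLike` is a hypothetical subset of the fields in `src/lib/state/onboard-session.ts`, "in-progress" is assumed to mean `lastStepStarted` differs from `lastCompletedStep`, and `usesOpenClaw` is an assumed input for choosing the branch after `sandbox`.

```ts
type OnboardMachineState =
  | "init" | "preflight" | "gateway" | "provider_selection" | "inference"
  | "sandbox" | "agent_setup" | "openclaw" | "policies" | "finalizing"
  | "post_verify" | "complete" | "failed";

// Step names persisted by sessions today (from the list earlier in this issue).
const LEGACY_STEPS = [
  "preflight", "gateway", "provider_selection", "inference",
  "sandbox", "openclaw", "agent_setup", "policies",
] as const;
type LegacyStep = (typeof LEGACY_STEPS)[number];

// Hypothetical subset of the legacy session shape; real fields may differ.
interface LegacySessionLike {
  status?: string;
  lastStepStarted?: string | null;
  lastCompletedStep?: string | null;
}

function isLegacyStep(value: unknown): value is LegacyStep {
  return typeof value === "string" && (LEGACY_STEPS as readonly string[]).includes(value);
}

// "Next logical state" following the coarse transition graph.
function nextStateAfter(step: LegacyStep, usesOpenClaw: boolean): OnboardMachineState {
  switch (step) {
    case "preflight": return "gateway";
    case "gateway": return "provider_selection";
    case "provider_selection": return "inference";
    case "inference": return "sandbox";
    case "sandbox": return usesOpenClaw ? "openclaw" : "agent_setup";
    case "openclaw":
    case "agent_setup": return "policies";
    case "policies": return "finalizing";
  }
}

function inferMachineState(session: LegacySessionLike, usesOpenClaw = true): OnboardMachineState {
  if (session.status === "complete") return "complete"; // rule 1
  if (session.status === "failed") return "failed"; // rule 2
  const started = session.lastStepStarted;
  const completed = session.lastCompletedStep;
  // Rule 3: a step that started but is not the last completed one is in progress.
  if (isLegacyStep(started) && started !== completed) return started;
  // Rule 4: otherwise continue after the last completed step.
  if (isLegacyStep(completed)) return nextStateAfter(completed, usesOpenClaw);
  return "init"; // rule 5
}
```

Unknown step names fall through to `init` here; whether that or an explicit `resume.conflict` is the right response to an unreadable session is a design decision this sketch does not settle.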

### 3. State and events must be redacted

Never persist or emit raw secrets.

Safe to persist/emit:

- provider name,
- model name,
- sandbox name,
- credential env var name,
- redacted endpoint URL,
- selected policy preset names,
- selected messaging channel names.

Unsafe:

- raw API keys,
- bearer tokens,
- QR tokens,
- webhook tokens,
- OAuth tokens,
- unredacted URLs containing query secrets.
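
One way to enforce the safe/unsafe split is an allowlist: only named fields pass through, and URLs are scrubbed on the way. The sketch below assumes hypothetical context field names (`sandboxName`, `credentialEnv`, `endpointUrl`, and so on) and a hypothetical list of secret-bearing query parameter names; neither comes from the existing code.

```ts
// Hypothetical query parameter names treated as secret-bearing.
const SECRET_QUERY_KEYS = ["token", "key", "api_key", "apikey", "secret", "access_token", "signature"];

// Hypothetical context field names that are safe to persist or emit.
const SAFE_CONTEXT_KEYS = [
  "provider", "model", "sandboxName", "credentialEnv",
  "endpointUrl", "policyPresets", "messagingChannels",
] as const;

function redactUrl(raw: string): string {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    // Never echo a string we could not parse; it may still contain a secret.
    return "[unparseable-url]";
  }
  // Drop userinfo such as https://user:password@host/.
  url.username = "";
  url.password = "";
  // Collect keys first so the params are not mutated while being iterated.
  const keys: string[] = [];
  url.searchParams.forEach((_value, key) => { keys.push(key); });
  for (const key of keys) {
    if (SECRET_QUERY_KEYS.includes(key.toLowerCase())) {
      url.searchParams.set(key, "REDACTED");
    }
  }
  return url.toString();
}

// Allowlist filter: anything not explicitly safe is dropped, never copied.
function redactContext(input: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of SAFE_CONTEXT_KEYS) {
    const value = input[key];
    if (value === undefined) continue;
    out[key] = key === "endpointUrl" && typeof value === "string" ? redactUrl(value) : value;
  }
  return out;
}
```

An allowlist fails closed when a new secret-bearing field is added to the session, which a denylist would not. The query-key list is still a heuristic: a secret carried under an unlisted parameter name would pass through `redactUrl`.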

### 4. Skipped state does not mean no-op

A skipped state means the primary work did not need to rerun. It may still perform resume validation or repair.

Current examples:

- resumed `preflight` skips full preflight but re-detects GPU and revalidates CDI/sandbox GPU configuration;
- resumed `provider_selection` skips interactive selection but hydrates credentials and may repair the Ollama systemd loopback override for `ollama-local`;
- resumed `gateway`, `sandbox`, and `policies` inspect live state before deciding whether reuse is safe.

If a skipped state performs repair, emit `state.repair.started` / `state.repair.completed` / `state.repair.failed` so diagnostics and hooks can distinguish fast-path skip from skip-plus-repair.
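
The distinction between a fast-path skip and skip-plus-repair could look roughly like this. `skipState` and its callback shapes are hypothetical, and emitting `state.skipped` before the repair events is an assumption about ordering that Milestone 0 would need to confirm.

```ts
type SkipEventType =
  | "state.skipped"
  | "state.repair.started"
  | "state.repair.completed"
  | "state.repair.failed";

interface SkipEvent {
  type: SkipEventType;
  state: string;
}

// Hypothetical helper: `repair` is omitted when the skipped state has no
// resume-time validation or repair to run.
async function skipState(
  state: string,
  emit: (event: SkipEvent) => void,
  repair?: () => Promise<void>,
): Promise<void> {
  emit({ type: "state.skipped", state });
  if (repair === undefined) return; // fast-path skip: no repair events at all

  emit({ type: "state.repair.started", state });
  try {
    await repair();
  } catch (err) {
    // The error itself is not attached to the event; it may contain secrets.
    emit({ type: "state.repair.failed", state });
    throw err;
  }
  emit({ type: "state.repair.completed", state });
}
```

With this shape, a hook or test can tell the two cases apart by event sequence alone: a lone `state.skipped` is a fast-path skip, while `state.skipped` followed by `state.repair.*` is a skip that still did work.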

### 5. Persist stable intent, recompute runtime topology

Do not persist environment-derived topology decisions as durable FSM context.

Persist stable intent:

- provider,
- model,
- sandbox name,
- selected channels,
- policy presets,
- credential env var names.

Recompute live topology on every fresh/resume run:

- WSL vs non-WSL behavior,
- Docker Desktop vs native/rootless Docker reachability,
- whether the sandbox needs the Ollama auth proxy,
- whether host systemd overrides need repair,
- gateway and sandbox live health.
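
The split between persisted intent and recomputed topology can be made visible in the types. In this sketch every name (`PersistedIntent`, `LiveTopology`, `TopologyProbes`, `resumeFrom`) is an assumption for illustration; the real decisions live in `src/lib/onboard/local-inference-topology.ts` and are represented here only as injected probe functions.

```ts
type DockerFlavor = "desktop" | "native" | "rootless" | "unknown";

// Stable intent: the only part that is serialized into the session.
interface PersistedIntent {
  provider: string;
  model: string;
  sandboxName: string | null;
  channels: string[];
  policyPresets: string[];
  credentialEnv: string | null; // env var NAME, never its value
}

// Environment-derived facts: recomputed on every fresh or resumed run.
interface LiveTopology {
  isWsl: boolean;
  dockerFlavor: DockerFlavor;
  needsOllamaAuthProxy: boolean;
  needsSystemdOverrideRepair: boolean;
}

// Hypothetical probe port; real probing logic is out of scope for this sketch.
interface TopologyProbes {
  isWsl(): boolean;
  dockerFlavor(): DockerFlavor;
  needsOllamaAuthProxy(provider: string): boolean;
  needsSystemdOverrideRepair(provider: string): boolean;
}

function probeTopology(intent: PersistedIntent, probes: TopologyProbes): LiveTopology {
  return {
    isWsl: probes.isWsl(),
    dockerFlavor: probes.dockerFlavor(),
    needsOllamaAuthProxy: probes.needsOllamaAuthProxy(intent.provider),
    needsSystemdOverrideRepair: probes.needsSystemdOverrideRepair(intent.provider),
  };
}

function serializeIntent(intent: PersistedIntent): string {
  return JSON.stringify(intent);
}

// On resume, intent is read back and topology is probed again from scratch.
// A real implementation should validate the parsed JSON instead of casting.
function resumeFrom(serialized: string, probes: TopologyProbes): { intent: PersistedIntent; topology: LiveTopology } {
  const intent = JSON.parse(serialized) as PersistedIntent;
  return { intent, topology: probeTopology(intent, probes) };
}
```

Because `LiveTopology` never passes through `serializeIntent`, a session created on one host layout and resumed on another cannot carry a stale proxy or override decision with it.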

## PR group details

### PR 1 — FSM vocabulary and transition types

Add:

- `src/lib/onboard/machine/types.ts`
- `src/lib/onboard/machine/transitions.ts`
- transition tests

No behavior change.

### PR 2 — Structured event emission around current session mutations

Wrap or augment:

- `markStepStarted(...)`
- `markStepComplete(...)`
- `markStepSkipped(...)`
- `markStepFailed(...)`
- `completeSession(...)`

Emit redacted events. Avoid persistent full event logs by default.
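
A wrapper is one low-risk way to add emission without touching the helpers themselves. The sketch below assumes each helper takes the step name as its first argument and maps `markStepStarted` to `state.entered` and `markStepFailed` to `state.failed`; the real signatures are not shown in this issue, so the two `mark*` functions here are stand-ins.

```ts
type StepEventType = "state.entered" | "state.completed" | "state.skipped" | "state.failed";

interface StepEvent {
  type: StepEventType;
  state: string;
  at: string; // ISO timestamp
}

type Emit = (event: StepEvent) => void;

// Wraps a session mutation so an event is emitted only after it succeeds.
// Extra arguments (for example a failure message) are passed through to the
// mutation but deliberately kept out of the event payload.
function withStepEvent<Args extends [string, ...unknown[]], R>(
  mutate: (...args: Args) => R,
  type: StepEventType,
  emit: Emit,
): (...args: Args) => R {
  return (...args: Args): R => {
    const result = mutate(...args);
    emit({ type, state: args[0], at: new Date().toISOString() });
    return result;
  };
}

// Stand-ins for the real session helpers (assumed signatures).
const sessionLog: string[] = [];
function markStepStarted(step: string): void {
  sessionLog.push(`started:${step}`);
}
function markStepFailed(step: string, message: string): void {
  sessionLog.push(`failed:${step}:${message}`);
}

// In-memory sink only, matching "avoid persistent full event logs by default".
const emitted: StepEvent[] = [];
const emitStep: Emit = (event) => {
  emitted.push(event);
};

const markStepStartedWithEvent = withStepEvent(markStepStarted, "state.entered", emitStep);
const markStepFailedWithEvent = withStepEvent(markStepFailed, "state.failed", emitStep);
```

Call sites would switch from `markStepStarted("gateway")` to `markStepStartedWithEvent("gateway")` with no other change, which keeps this PR reviewable as a mechanical substitution.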

### PR 3 — Session machine snapshot

Add a compact machine snapshot to the session, e.g.:

```ts
export interface OnboardMachineSnapshot {
  version: 1;
  state: OnboardMachineState;
  stateEnteredAt: string | null;
  revision: number;
}
```

Normalize old sessions without requiring users to delete state.
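
Reading the snapshot defensively might look like the sketch below. `parseSnapshot` and `advanceSnapshot` are hypothetical names, and treating `revision` as a counter that increases by one per persisted transition is an assumption; this issue does not define its semantics. A `null` result is the signal to fall back to inferring state from the legacy fields.

```ts
type OnboardMachineState =
  | "init" | "preflight" | "gateway" | "provider_selection" | "inference"
  | "sandbox" | "agent_setup" | "openclaw" | "policies" | "finalizing"
  | "post_verify" | "complete" | "failed";

interface OnboardMachineSnapshot {
  version: 1;
  state: OnboardMachineState;
  stateEnteredAt: string | null;
  revision: number;
}

const KNOWN_STATES: readonly OnboardMachineState[] = [
  "init", "preflight", "gateway", "provider_selection", "inference",
  "sandbox", "agent_setup", "openclaw", "policies", "finalizing",
  "post_verify", "complete", "failed",
];

// Returns null for absent or malformed snapshots instead of throwing, so old
// sessions without a `machine` field keep loading.
function parseSnapshot(raw: unknown): OnboardMachineSnapshot | null {
  if (typeof raw !== "object" || raw === null) return null;
  const candidate = raw as Record<string, unknown>;
  if (candidate["version"] !== 1) return null;

  const state = KNOWN_STATES.find((known) => known === candidate["state"]);
  if (state === undefined) return null;

  const rawEnteredAt = candidate["stateEnteredAt"];
  const stateEnteredAt =
    typeof rawEnteredAt === "string" ? rawEnteredAt : rawEnteredAt === null ? null : undefined;
  if (stateEnteredAt === undefined) return null;

  const revision = candidate["revision"];
  if (typeof revision !== "number" || !Number.isInteger(revision) || revision < 0) return null;

  return { version: 1, state, stateEnteredAt, revision };
}

// Assumed semantics: every persisted transition bumps `revision` by one.
function advanceSnapshot(prev: OnboardMachineSnapshot, next: OnboardMachineState, now: Date): OnboardMachineSnapshot {
  return { version: 1, state: next, stateEnteredAt: now.toISOString(), revision: prev.revision + 1 };
}
```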

### PR 4 — `OnboardRuntime`

Runtime owns:

- transition validation,
- session persistence,
- safe context updates,
- failure handling,
- redaction,
- event emission,
- future hook dispatch.

Prefer async methods from the start so hook support does not require a second API migration.
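
A rough shape for the runtime, covering only transition validation, persistence, and event emission. `SnapshotStore`, `subscribe`, and the event payload are assumptions for illustration; redaction, context updates, and hook dispatch are left out to keep the sketch short.

```ts
type OnboardMachineState =
  | "init" | "preflight" | "gateway" | "provider_selection" | "inference"
  | "sandbox" | "agent_setup" | "openclaw" | "policies" | "finalizing"
  | "post_verify" | "complete" | "failed";

const EDGES: Record<OnboardMachineState, readonly OnboardMachineState[]> = {
  init: ["preflight"],
  preflight: ["gateway"],
  gateway: ["provider_selection"],
  provider_selection: ["inference"],
  inference: ["provider_selection", "sandbox"],
  sandbox: ["openclaw", "agent_setup"],
  openclaw: ["policies"],
  agent_setup: ["policies"],
  policies: ["finalizing"],
  finalizing: ["post_verify"],
  post_verify: ["complete"],
  complete: [],
  failed: [],
};

interface RuntimeEvent {
  type: "state.entered" | "state.exited" | "state.failed";
  state: OnboardMachineState;
  revision: number;
}

// Hypothetical persistence port; the real runtime would wrap onboard-session.ts.
interface SnapshotStore {
  save(snapshot: { state: OnboardMachineState; revision: number }): Promise<void>;
}

type Listener = (event: RuntimeEvent) => Promise<void> | void;

class OnboardRuntime {
  private state: OnboardMachineState;
  private revision = 0;
  private readonly store: SnapshotStore;
  private readonly listeners: Listener[] = [];

  constructor(initial: OnboardMachineState, store: SnapshotStore) {
    this.state = initial;
    this.store = store;
  }

  get current(): OnboardMachineState {
    return this.state;
  }

  subscribe(listener: Listener): void {
    this.listeners.push(listener);
  }

  async transition(to: OnboardMachineState): Promise<void> {
    const from = this.state;
    const terminal = from === "complete" || from === "failed";
    const allowed = to === "failed" ? !terminal : EDGES[from].includes(to);
    if (!allowed) throw new Error(`invalid transition: ${from} -> ${to}`);

    const revision = this.revision + 1;
    // Persist first so a failed write never leaves memory ahead of disk.
    await this.store.save({ state: to, revision });
    this.state = to;
    this.revision = revision;

    if (to === "failed") {
      await this.emit({ type: "state.failed", state: from, revision });
      return;
    }
    await this.emit({ type: "state.exited", state: from, revision });
    await this.emit({ type: "state.entered", state: to, revision });
  }

  private async emit(event: RuntimeEvent): Promise<void> {
    for (const listener of this.listeners) await listener(event);
  }
}
```

Making `transition` async from the start means awaiting hooks later only changes `emit`, not every call site.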

### PR 5 — Route existing step-boundary helpers through runtime

Keep the current flow mostly intact, but make step boundaries go through the runtime.

Existing `startRecordedStep(...)` can remain as compatibility glue during this PR.

### PR 6 — Observe-only hook API

Add hook support such as:

```ts
export interface OnboardHook {
  onEvent?(event: OnboardEvent): Promise<void> | void;
}
```

Hook failures should warn and emit `hook.failed`, but should not fail onboarding by default.

First external sink should be JSONL, not arbitrary executable hooks.
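
A sketch of a dispatcher that isolates hook failures, plus a JSONL sink expressed as an ordinary hook. The `OnboardEvent` fields, `dispatchToHooks`, and `jsonlHook` are hypothetical; the `write` callback is injected so the example does not assume a log path.

```ts
// Assumed event envelope; the real OnboardEvent shape is defined elsewhere.
interface OnboardEvent {
  type: string;
  state: string | null;
  at: string;
}

interface OnboardHook {
  onEvent?(event: OnboardEvent): Promise<void> | void;
}

// Delivers one event to every hook. A throwing hook produces a warning and a
// `hook.failed` event but never rejects, so onboarding continues by default.
async function dispatchToHooks(
  event: OnboardEvent,
  hooks: readonly OnboardHook[],
  warn: (message: string) => void = console.warn,
): Promise<OnboardEvent[]> {
  const failures: OnboardEvent[] = [];
  // Observe-only: hooks receive a frozen copy, not the runtime's own object.
  const frozen = Object.freeze({ ...event });
  for (const hook of hooks) {
    if (!hook.onEvent) continue;
    try {
      await hook.onEvent(frozen);
    } catch {
      warn(`onboard hook failed while handling ${event.type}`);
      failures.push({ type: "hook.failed", state: event.state, at: new Date().toISOString() });
    }
  }
  // Callers should record these without dispatching them to hooks again,
  // otherwise a permanently broken hook would loop.
  return failures;
}

// JSONL sink as a hook: one JSON document per line.
function jsonlHook(write: (line: string) => void): OnboardHook {
  return {
    onEvent(event) {
      write(JSON.stringify(event) + "\n");
    },
  };
}
```

In a Node process, `write` could be backed by `fs.appendFileSync` on a log file; keeping it injectable makes the sink testable without touching disk.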

### PR 7 — Preflight handler

Preserve:

- resume skip behavior,
- GPU re-detection,
- CDI/sandbox GPU validation,
- GPU passthrough persistence.

### PR 8 — Gateway handler

Preserve:

- named gateway reuse,
- Docker-driver gateway state,
- stale metadata cleanup,
- gateway container verification,
- HTTP readiness,
- image drift recreation,
- GPU passthrough compatibility,
- legacy gateway replacement.

### PR 9 — Provider selection and inference handlers

Split into 9a/9b if needed.

Preserve:

- resume skip if provider/model already selected,
- credential hydration,
- `ollama-local` resume-time repair,
- local inference topology decisions through `src/lib/onboard/local-inference-topology.ts`,
- Hermes auth method,
- model-router reconciliation,
- provider upserts,
- local Ollama proxy recovery/fronting via `shouldFrontOllamaWithProxy()`,
- retry from inference back to provider selection.

### PR 10 — Sandbox handler

Preserve:

- web search support/drift handling,
- messaging config hydration/drift handling,
- sandbox reuse state,
- Telegram/WeChat/GPU drift handling,
- sandbox repair/recreate behavior,
- sandbox name prompt/default behavior,
- registry/default sandbox update timing,
- rule that `sandboxName` is not persisted until sandbox creation succeeds.

Keep displayed `messaging` work embedded in sandbox for this milestone.

### PR 11 — OpenClaw/agent setup, policies, finalization

Split into 11a/11b/11c if needed.

Preserve:

- OpenClaw resume skip/setup,
- non-OpenClaw agent health probe setup,
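- no direct session writes from `src/lib/agent/onboard.ts` (it writes `agent_setup` completion/failure through `onboardSession` today; route these through the runtime),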
- policy preset clamp/apply/resume behavior,
- legacy credential cleanup,
- stale host file cleanup,
- sandbox process recovery,
- deployment verification,
- dashboard printing.

## Acceptance criteria for closing this issue

This issue can close when Milestones 1 and 2 are complete:

1. Onboarding persists an explicit machine state.
2. Existing step-boundary resume behavior still works.
3. Old `onboard-session.json` files normalize correctly.
4. Every major state emits redacted structured events.
5. Observe-only hooks can subscribe to events.
6. Invalid transitions are rejected or impossible through the runtime.
7. Existing onboarding tests pass.
8. New transition/session/event tests cover FSM behavior.
9. Skipped states with resume validation/repair are represented accurately in events and tests.
10. Runtime-derived topology decisions are recomputed on resume rather than persisted.
11. Step handlers own the onboarding flow; `src/lib/onboard.ts` is primarily CLI shell/orchestration.
12. `src/lib/agent/onboard.ts` no longer writes session state directly.
13. No secrets appear in event logs, hook payloads, debug summaries, or session machine fields.

## Follow-up issues after this closes

Open separate issues for these after the core FSM lands:

### Follow-up A — Operational diagnostics

Potential work:

- stable JSONL event log,
- `nemoclaw debug onboard-session`,
- resume eligibility explanation,
- event schema docs in the current Fern MDX docs system,
- test harness for driving the FSM from snapshots.

### Follow-up B — Fine-grained substates and mid-step resume

Potential work:

- gateway substates,
- inference/provider substates,
- sandbox build/create substates,
- policy apply/verify substates,
- per-substate recovery classification: replayable, detect-and-skip, repairable, unsafe/unknown.

### Follow-up C — External orchestration / UI integration

Potential work if there is product need:

- TUI/web progress backed by FSM events,
- remote supervisor integration,
- pause/cancel semantics,
- dry-run plan mode,
- machine-readable CI reports.

