This issue is the umbrella for the core onboarding FSM workstream. It replaces the earlier mixed timeline with one canonical sequence.
Summary
Refactor NemoClaw onboarding into a serializable finite state machine that can be resumed, instrumented, observed by hooks, and eventually extended toward finer-grained recovery.
The core goal is major step-boundary resume, not mid-operation resume. Existing onboarding already persists coarse progress in src/lib/state/onboard-session.ts; this work formalizes that into an explicit machine and then moves orchestration into state handlers.
Scope of this issue
This issue tracks two milestones only:
- Milestone 1 — Observable FSM shell: add machine state, events, runtime, and observe-only hooks while preserving the current imperative flow.
- Milestone 2 — FSM-owned orchestration: extract the current onboarding steps into explicit state handlers owned by the machine runner.
Everything after that — richer diagnostics, fine-grained substates, UI/supervisor integration — should be opened as follow-up issues once Milestones 1 and 2 are reviewed.
Non-goals for this issue
Do not attempt full mid-step resume in this workstream.
Out of scope here:
- resume inside gateway startup subprocesses,
- resume inside Docker image builds,
- resume inside sandbox create streams,
- resume inside provider credential upserts,
- resume inside model validation probes,
- resume inside policy application,
- external executable hooks,
- TUI/web/supervisor control plane.
Those belong to follow-up issues after the coarse FSM is stable.
Current code shape
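Important files:
- src/lib/onboard.ts: startRecordedStep(...), skippedStepMessage(...), and direct onboardSession calls.
- src/lib/state/onboard-session.ts
- src/lib/agent/onboard.ts: agent_setup completion/failure directly through onboardSession.
- src/lib/onboard/local-inference-topology.ts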
Current persisted steps include:
- preflight
- gateway
- provider_selection
- inference
- sandbox
- openclaw
- agent_setup
- policies
messaging is displayed as a step today but is embedded in sandbox creation. Keep that behavior for the first FSM milestone.
Canonical workstream
This is the single roadmap for this issue.
| Milestone | PR group | Purpose | Output |
| --- | --- | --- | --- |
| 0 | - | Review/design alignment | Agreement on state names, event semantics, skipped-state semantics, and session compatibility |
| 1 | 1 | FSM vocabulary and transition types | New machine types and transition tests; no behavior change |
| 1 | 2 | Structured events around current session mutations | Redacted state/context events emitted from existing session operations |
| 1 | 3 | Session machine snapshot | Backward-compatible persisted machine snapshot with normalization from old sessions |
| 1 | 4 | OnboardRuntime | Runtime owns transitions, context updates, failure handling, redaction, and event emission |
| 1 | 5 | Route current step-boundary helpers through runtime | Existing flow still mostly imperative, but step boundaries use the runtime |
| 1 | 6 | Observe-only hook API | Hooks can observe redacted events; hooks cannot mutate/veto state |
| 2 | 7 | Extract preflight state handler | Preflight logic becomes an explicit handler |
| 2 | 8 | Extract gateway state handler | Gateway reuse/recreate/start logic becomes an explicit handler |
| 2 | 9 | Extract provider selection and inference handlers | Provider/model selection and inference setup become explicit handlers; split into 9a/9b if needed |
| 2 | 10 | Extract sandbox handler | Sandbox reuse/recreate/create logic becomes an explicit handler; split if needed |
| 2 | 11 | Extract OpenClaw/agent setup, policies, finalization | Final step groups become explicit handlers; split into 11a/11b/11c if needed |
There are 11 PR groups in scope. A PR group may split into smaller reviewable PRs, but the milestone sequence should stay the same.
Target machine states
Initial coarse states:
export type OnboardMachineState =
  | "init"
  | "preflight"
  | "gateway"
  | "provider_selection"
  | "inference"
  | "sandbox"
  | "agent_setup"
  | "openclaw"
  | "policies"
  | "finalizing"
  | "post_verify"
  | "complete"
  | "failed";
Initial transition graph:
init -> preflight
preflight -> gateway
gateway -> provider_selection
provider_selection -> inference
inference -> provider_selection # retry provider/model selection
inference -> sandbox
sandbox -> openclaw
sandbox -> agent_setup
openclaw -> policies
agent_setup -> policies
policies -> finalizing
finalizing -> post_verify
post_verify -> complete
any non-terminal state -> failed
Use finalizing and post_verify so complete can eventually mean “all onboarding work and post-onboard UX completed,” not merely “session was marked complete.”
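The graph above could be encoded as a lookup table plus a single validation helper. This is a minimal sketch, not the planned contents of transitions.ts: the names TERMINAL_STATES, TRANSITIONS, and canTransition are assumptions, and the choice to refuse every transition out of complete or failed is inferred from the "any non-terminal state -> failed" rule.

```typescript
// Sketch only: state names mirror the list above; the table shape and the
// helper names are hypothetical.
type OnboardMachineState =
  | "init" | "preflight" | "gateway" | "provider_selection" | "inference"
  | "sandbox" | "agent_setup" | "openclaw" | "policies" | "finalizing"
  | "post_verify" | "complete" | "failed";

const TERMINAL_STATES: ReadonlySet<OnboardMachineState> =
  new Set<OnboardMachineState>(["complete", "failed"]);

// Forward edges of the graph. The "any non-terminal state -> failed" rule
// is handled in canTransition rather than repeated on every row.
const TRANSITIONS: Record<OnboardMachineState, readonly OnboardMachineState[]> = {
  init: ["preflight"],
  preflight: ["gateway"],
  gateway: ["provider_selection"],
  provider_selection: ["inference"],
  inference: ["provider_selection", "sandbox"], // includes the retry edge
  sandbox: ["openclaw", "agent_setup"],
  openclaw: ["policies"],
  agent_setup: ["policies"],
  policies: ["finalizing"],
  finalizing: ["post_verify"],
  post_verify: ["complete"],
  complete: [],
  failed: [],
};

function canTransition(from: OnboardMachineState, to: OnboardMachineState): boolean {
  if (TERMINAL_STATES.has(from)) return false; // terminal states never move
  if (to === "failed") return true;            // any non-terminal state may fail
  return TRANSITIONS[from].includes(to);
}
```

Keeping the failure rule out of the table means a new state only needs its forward edges listed.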
Initial event vocabulary
export type OnboardMachineEventType =
  | "onboard.started"
  | "onboard.resumed"
  | "onboard.completed"
  | "onboard.failed"
  | "state.entered"
  | "state.exited"
  | "state.skipped"
  | "state.completed"
  | "state.failed"
  | "state.repair.started"
  | "state.repair.completed"
  | "state.repair.failed"
  | "context.updated"
  | "resume.conflict"
  | "hook.started"
  | "hook.completed"
  | "hook.failed";
Later follow-up issues can add command/probe/credential-specific events if needed.
Required design principles
1. Preserve current behavior first
The FSM should describe current onboarding before redesigning it.
Avoid changing:
- prompts,
- defaults,
- gateway reuse behavior,
- sandbox recreation conditions,
- credential migration rules,
- registry write timing,
- policy carry-forward behavior.
2. Keep old sessions readable
Existing ~/.nemoclaw/onboard-session.json files must normalize correctly.
When new machine fields are absent, infer state from current fields:
- status === "complete" -> complete.
- status === "failed" -> failed.
- in-progress lastStepStarted -> that step.
- completed lastCompletedStep -> next logical state.
- otherwise -> init.
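The inference rules above might look roughly like the following. Only status, lastStepStarted, and lastCompletedStep come from the text; everything else is an assumption made for illustration. In particular, the sketch treats a step as in progress when it was started but is not the last completed step, and it takes a hypothetical agentBranch argument to decide whether sandbox is followed by openclaw or agent_setup, because the legacy fields alone do not say which branch applies.

```typescript
// Sketch only: the real session shape lives in src/lib/state/onboard-session.ts
// and may differ from this hypothetical subset.
type LegacyStep =
  | "preflight" | "gateway" | "provider_selection" | "inference"
  | "sandbox" | "openclaw" | "agent_setup" | "policies";
type InferredState = LegacyStep | "init" | "finalizing" | "complete" | "failed";
type AgentBranch = "openclaw" | "agent_setup";

interface LegacySessionFields {
  status?: string;
  lastStepStarted?: LegacyStep | null;
  lastCompletedStep?: LegacyStep | null;
}

// "Next logical state" after a completed step, following the transition graph.
function nextStateAfter(step: LegacyStep, agentBranch: AgentBranch): InferredState {
  switch (step) {
    case "preflight": return "gateway";
    case "gateway": return "provider_selection";
    case "provider_selection": return "inference";
    case "inference": return "sandbox";
    case "sandbox": return agentBranch; // assumption: caller knows the branch
    case "openclaw":
    case "agent_setup": return "policies";
    case "policies": return "finalizing";
  }
}

function inferMachineState(
  session: LegacySessionFields,
  agentBranch: AgentBranch = "openclaw",
): InferredState {
  if (session.status === "complete") return "complete";
  if (session.status === "failed") return "failed";
  const started = session.lastStepStarted ?? null;
  const completed = session.lastCompletedStep ?? null;
  // Assumption: started-but-not-completed means the step is still in progress.
  if (started !== null && started !== completed) return started;
  if (completed !== null) return nextStateAfter(completed, agentBranch);
  return "init";
}
```

For example, a session with lastStepStarted and lastCompletedStep both set to gateway would normalize to provider_selection under these assumptions.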
3. State and events must be redacted
Never persist or emit raw secrets.
Safe to persist/emit:
- provider name,
- model name,
- sandbox name,
- credential env var name,
- redacted endpoint URL,
- selected policy preset names,
- selected messaging channel names.
Unsafe:
- raw API keys,
- bearer tokens,
- QR tokens,
- webhook tokens,
- OAuth tokens,
- unredacted URLs containing query secrets.
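As one illustration of the "redacted endpoint URL" item, a redaction helper could mask credentials and every query value before a URL reaches session state or an event. This is a sketch under stated assumptions: redactEndpointUrl is a hypothetical name, the REDACTED placeholder is arbitrary, and masking all query values (rather than a list of known secret keys) is a deliberately conservative choice, not a documented rule.

```typescript
// Sketch only: uses the standard WHATWG URL API; helper name and placeholder
// are hypothetical.
function redactEndpointUrl(raw: string): string {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    // Never echo input that cannot be parsed; it may contain a secret.
    return "[unparseable-url]";
  }
  if (url.username) url.username = "REDACTED";
  if (url.password) url.password = "REDACTED";
  // Copy the keys first so mutation does not disturb iteration.
  for (const key of Array.from(url.searchParams.keys())) {
    url.searchParams.set(key, "REDACTED");
  }
  url.hash = ""; // fragments can also carry tokens
  return url.toString();
}
```

Keeping the query keys while masking the values preserves enough shape for diagnostics without exposing the secret itself.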
4. Skipped state does not mean no-op
A skipped state means the primary work did not need to rerun. It may still perform resume validation or repair.
Current examples:
- resumed preflight skips full preflight but re-detects GPU and revalidates CDI/sandbox GPU configuration;
- resumed provider_selection skips interactive selection but hydrates credentials and may repair the Ollama systemd loopback override for ollama-local;
- resumed gateway, sandbox, and policies inspect live state before deciding whether reuse is safe.
If a skipped state performs repair, emit state.repair.started / state.repair.completed / state.repair.failed so diagnostics and hooks can distinguish fast-path skip from skip-plus-repair.
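The difference between a fast-path skip and skip-plus-repair can be made concrete with a small sketch. The event type names come from the vocabulary above; the helper skipWithOptionalRepair, the event shape, and the ordering (state.skipped first, then the repair events) are assumptions, since the ordering is not specified here.

```typescript
// Sketch only: a synchronous illustration; a real runtime would likely be async.
type SkipEventType =
  | "state.skipped"
  | "state.repair.started"
  | "state.repair.completed"
  | "state.repair.failed";

interface SkipEvent {
  type: SkipEventType;
  state: string;
}

// Returns true when the skip (and any repair) succeeded, false when repair failed.
function skipWithOptionalRepair(
  state: string,
  repair: (() => void) | null,
  emit: (event: SkipEvent) => void,
): boolean {
  emit({ type: "state.skipped", state });
  if (repair === null) return true; // fast-path skip: no repair events at all
  emit({ type: "state.repair.started", state });
  try {
    repair();
  } catch {
    emit({ type: "state.repair.failed", state });
    return false; // caller decides whether this fails onboarding
  }
  emit({ type: "state.repair.completed", state });
  return true;
}
```

A hook or diagnostic reader can then tell the two cases apart by whether any state.repair.* event follows state.skipped for the same state.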
5. Persist stable intent, recompute runtime topology
Do not persist environment-derived topology decisions as durable FSM context.
Persist stable intent:
- provider,
- model,
- sandbox name,
- selected channels,
- policy presets,
- credential env var names.
Recompute live topology on every fresh/resume run:
- WSL vs non-WSL behavior,
- Docker Desktop vs native/rootless Docker reachability,
- whether the sandbox needs the Ollama auth proxy,
- whether host systemd overrides need repair,
- gateway and sandbox live health.
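One way to enforce the split above is to keep intent and topology in separate types and serialize only the former. All names in this sketch (StableIntent, LiveTopology, RunContext, buildRunContext, serializeForSession, and every field) are hypothetical; only the persist-versus-recompute division comes from the lists above.

```typescript
// Sketch only: field names are illustrative, not the real session schema.
interface StableIntent {
  provider: string;
  model: string;
  sandboxName: string | null;
  channels: string[];
  policyPresets: string[];
  credentialEnvVars: string[];
}

interface LiveTopology {
  isWsl: boolean;
  dockerKind: "desktop" | "native" | "rootless";
  needsOllamaAuthProxy: boolean;
  hostOverridesNeedRepair: boolean;
  gatewayHealthy: boolean;
  sandboxHealthy: boolean;
}

interface RunContext {
  intent: StableIntent;
  topology: LiveTopology;
}

// Topology is probed on every fresh or resumed run and never read from disk.
function buildRunContext(intent: StableIntent, probe: () => LiveTopology): RunContext {
  return { intent, topology: probe() };
}

// Only the intent half is written into the session file.
function serializeForSession(ctx: RunContext): string {
  return JSON.stringify({ intent: ctx.intent });
}
```

Because the topology never enters the serialized form, a resumed run on a different host or Docker setup cannot inherit a stale decision.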
PR group details
PR 1 — FSM vocabulary and transition types
Add:
- src/lib/onboard/machine/types.ts
- src/lib/onboard/machine/transitions.ts
- transition tests
No behavior change.
PR 2 — Structured event emission around current session mutations
Wrap or augment:
- markStepStarted(...)
- markStepComplete(...)
- markStepSkipped(...)
- markStepFailed(...)
- completeSession(...)
Emit redacted events. Avoid persistent full event logs by default.
PR 3 — Session machine snapshot
Add a compact machine snapshot to the session, e.g.:
export interface OnboardMachineSnapshot {
  version: 1;
  state: OnboardMachineState;
  stateEnteredAt: string | null;
  revision: number;
}
Normalize old sessions without requiring users to delete state.
PR 4 — OnboardRuntime
Runtime owns:
- transition validation,
- session persistence,
- safe context updates,
- failure handling,
- redaction,
- event emission,
- future hook dispatch.
Prefer async methods from the start so hook support does not require a second API migration.
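A rough shape for an async runtime, covering transition validation, persistence, and event emission from the list above. This is a sketch, not the proposed OnboardRuntime API: SketchRuntime, RuntimeDeps, and their members are hypothetical, the snapshot mirrors the PR 3 interface with state widened to string for brevity, and persisting before emitting is an assumed ordering.

```typescript
// Sketch only: dependencies are injected so the class stays testable.
interface RuntimeSnapshot {
  version: 1;
  state: string;
  stateEnteredAt: string | null;
  revision: number;
}

interface RuntimeEvent {
  type: "state.exited" | "state.entered";
  state: string;
  at: string;
}

interface RuntimeDeps {
  isAllowed(from: string, to: string): boolean;
  persist(snapshot: RuntimeSnapshot): Promise<void>;
  emit(event: RuntimeEvent): Promise<void>;
}

class SketchRuntime {
  private snapshot: RuntimeSnapshot;
  private readonly deps: RuntimeDeps;

  constructor(snapshot: RuntimeSnapshot, deps: RuntimeDeps) {
    this.snapshot = snapshot;
    this.deps = deps;
  }

  get current(): RuntimeSnapshot {
    return this.snapshot;
  }

  async transition(to: string): Promise<void> {
    const from = this.snapshot.state;
    if (!this.deps.isAllowed(from, to)) {
      throw new Error(`Invalid transition: ${from} -> ${to}`);
    }
    const at = new Date().toISOString();
    const next: RuntimeSnapshot = {
      ...this.snapshot,
      state: to,
      stateEnteredAt: at,
      revision: this.snapshot.revision + 1,
    };
    await this.deps.persist(next); // assumption: persist before announcing
    this.snapshot = next;
    await this.deps.emit({ type: "state.exited", state: from, at });
    await this.deps.emit({ type: "state.entered", state: to, at });
  }
}
```

Because transition is already async, adding hook dispatch later only changes what emit does, not the signature callers depend on.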
PR 5 — Route existing step-boundary helpers through runtime
Keep the current flow mostly intact, but make step boundaries go through the runtime.
Existing startRecordedStep(...) can remain as compatibility glue during this PR.
PR 6 — Observe-only hook API
Add hook support such as:
export interface OnboardHook {
  onEvent?(event: OnboardEvent): Promise<void> | void;
}
Hook failures should warn and emit hook.failed, but should not fail onboarding by default.
First external sink should be JSONL, not arbitrary executable hooks.
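The failure-isolation rule and a JSONL sink could be sketched as follows. The names dispatchToHooks, jsonlHook, HookEvent, and ObserveHook are hypothetical, as is the event shape; the sink takes an appendLine callback instead of touching the filesystem so the sketch stays self-contained. Returning hook.failed events to the caller, rather than re-dispatching them to hooks, is an assumption made to avoid feedback loops.

```typescript
// Sketch only: a stand-in for the real OnboardEvent / OnboardHook types.
interface HookEvent {
  type: string;
  state?: string;
  at: string;
}

interface ObserveHook {
  onEvent?(event: HookEvent): Promise<void> | void;
}

// Delivers one event to every hook. A throwing hook produces a warning and a
// hook.failed event but never interrupts delivery to the remaining hooks.
async function dispatchToHooks(
  event: HookEvent,
  hooks: readonly ObserveHook[],
  warn: (message: string) => void,
): Promise<HookEvent[]> {
  const failures: HookEvent[] = [];
  for (const hook of hooks) {
    if (!hook.onEvent) continue;
    try {
      // Hooks get a frozen copy, so they can observe but not mutate.
      await hook.onEvent(Object.freeze({ ...event }));
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      warn(`onboard hook failed: ${reason}`);
      failures.push({ type: "hook.failed", state: event.state, at: new Date().toISOString() });
    }
  }
  return failures;
}

// JSONL sink: one JSON document per line, written through the supplied callback.
function jsonlHook(appendLine: (line: string) => void): ObserveHook {
  return {
    onEvent: (event) => appendLine(JSON.stringify(event)),
  };
}
```

In practice appendLine might wrap an append-only file write; that wiring is left out here because the log location is not specified.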
PR 7 — Preflight handler
Preserve:
- resume skip behavior,
- GPU re-detection,
- CDI/sandbox GPU validation,
- GPU passthrough persistence.
PR 8 — Gateway handler
Preserve:
- named gateway reuse,
- Docker-driver gateway state,
- stale metadata cleanup,
- gateway container verification,
- HTTP readiness,
- image drift recreation,
- GPU passthrough compatibility,
- legacy gateway replacement.
PR 9 — Provider selection and inference handlers
Split into 9a/9b if needed.
Preserve:
- resume skip if provider/model already selected,
- credential hydration,
- ollama-local resume-time repair,
- local inference topology decisions through src/lib/onboard/local-inference-topology.ts,
- Hermes auth method,
- model-router reconciliation,
- provider upserts,
- local Ollama proxy recovery/fronting via shouldFrontOllamaWithProxy(),
- retry from inference back to provider selection.
PR 10 — Sandbox handler
Preserve:
- web search support/drift handling,
- messaging config hydration/drift handling,
- sandbox reuse state,
- Telegram/WeChat/GPU drift handling,
- sandbox repair/recreate behavior,
- sandbox name prompt/default behavior,
- registry/default sandbox update timing,
- rule that sandboxName is not persisted until sandbox creation succeeds.
Keep displayed messaging work embedded in sandbox for this milestone.
PR 11 — OpenClaw/agent setup, policies, finalization
Split into 11a/11b/11c if needed.
Preserve:
- OpenClaw resume skip/setup,
- non-OpenClaw agent health probe setup,
- no direct session writes from src/lib/agent/onboard.ts,
- policy preset clamp/apply/resume behavior,
- legacy credential cleanup,
- stale host file cleanup,
- sandbox process recovery,
- deployment verification,
- dashboard printing.
Acceptance criteria for closing this issue
This issue can close when Milestones 1 and 2 are complete:
- Onboarding persists an explicit machine state.
- Existing step-boundary resume behavior still works.
- Old onboard-session.json files normalize correctly.
- Every major state emits redacted structured events.
- Observe-only hooks can subscribe to events.
- Invalid transitions are rejected or impossible through the runtime.
- Existing onboarding tests pass.
- New transition/session/event tests cover FSM behavior.
- Skipped states with resume validation/repair are represented accurately in events and tests.
- Runtime-derived topology decisions are recomputed on resume rather than persisted.
- Step handlers own the onboarding flow; src/lib/onboard.ts is primarily CLI shell/orchestration.
- src/lib/agent/onboard.ts no longer writes session state directly.
- No secrets appear in event logs, hook payloads, debug summaries, or session machine fields.
Follow-up issues after this closes
Open separate issues for these after the core FSM lands:
Follow-up A — Operational diagnostics
Potential work:
- stable JSONL event log,
- nemoclaw debug onboard-session,
- resume eligibility explanation,
- event schema docs in the current Fern MDX docs system,
- test harness for driving the FSM from snapshots.
Follow-up B — Fine-grained substates and mid-step resume
Potential work:
- gateway substates,
- inference/provider substates,
- sandbox build/create substates,
- policy apply/verify substates,
- per-substate recovery classification: replayable, detect-and-skip, repairable, unsafe/unknown.
Follow-up C — External orchestration / UI integration
Potential work if there is product need:
- TUI/web progress backed by FSM events,
- remote supervisor integration,
- pause/cancel semantics,
- dry-run plan mode,
- machine-readable CI reports.