Umbrella: refactor onboarding into a serializable FSM #3802

@cv

This issue is the umbrella for the core onboarding FSM workstream. It replaces the earlier mixed timeline with one canonical sequence.

Summary

Refactor NemoClaw onboarding into a serializable finite state machine that can be resumed, instrumented, observed by hooks, and eventually extended toward finer-grained recovery.

The core goal is resume at major step boundaries, not resume in the middle of an operation. Existing onboarding already persists coarse progress in src/lib/state/onboard-session.ts; this work formalizes that progress into an explicit machine and then moves orchestration into state handlers.

Scope of this issue

This issue tracks two milestones only:

  1. Milestone 1 — Observable FSM shell: add machine state, events, runtime, and observe-only hooks while preserving the current imperative flow.
  2. Milestone 2 — FSM-owned orchestration: extract the current onboarding steps into explicit state handlers owned by the machine runner.

Everything after that — richer diagnostics, fine-grained substates, UI/supervisor integration — should be opened as follow-up issues once Milestones 1 and 2 are reviewed.

Non-goals for this issue

Do not attempt full mid-step resume in this workstream.

Out of scope here:

  • resume inside gateway startup subprocesses,
  • resume inside Docker image builds,
  • resume inside sandbox create streams,
  • resume inside provider credential upserts,
  • resume inside model validation probes,
  • resume inside policy application,
  • external executable hooks,
  • TUI/web/supervisor control plane.

Those belong to follow-up issues after the coarse FSM is stable.

Current code shape

Important files:

  • src/lib/onboard.ts

    • Large imperative onboarding flow.
    • Contains most orchestration and resume/reuse logic.
    • Uses startRecordedStep(...), skippedStepMessage(...), and direct onboardSession calls.
  • src/lib/state/onboard-session.ts

    • Persistent JSON session management.
    • Stores current step status, timestamps, selected provider/model/sandbox, failure metadata, GPU/messaging/policy fields, and lock state.
  • src/lib/agent/onboard.ts

    • Handles non-OpenClaw agent setup.
    • Currently writes agent_setup completion/failure directly through onboardSession.
  • src/lib/onboard/local-inference-topology.ts

    • Runtime-derived local inference topology helper.
    • Decides whether the sandbox needs Ollama auth proxy fronting and whether Ollama systemd override repair is needed.
    • This kind of decision should be recomputed on resume, not persisted as durable FSM context.

Current persisted steps include:

  • preflight
  • gateway
  • provider_selection
  • inference
  • sandbox
  • openclaw
  • agent_setup
  • policies

messaging is displayed as a step today but is embedded in sandbox creation. Keep that behavior for the first FSM milestone.

Canonical workstream

This is the single roadmap for this issue.

| Milestone | PR group | Purpose | Output |
|---|---|---|---|
| 0 | | Review/design alignment | Agreement on state names, event semantics, skipped-state semantics, and session compatibility |
| 1 | 1 | FSM vocabulary and transition types | New machine types and transition tests; no behavior change |
| 1 | 2 | Structured events around current session mutations | Redacted state/context events emitted from existing session operations |
| 1 | 3 | Session machine snapshot | Backward-compatible persisted machine snapshot with normalization from old sessions |
| 1 | 4 | OnboardRuntime | Runtime owns transitions, context updates, failure handling, redaction, and event emission |
| 1 | 5 | Route current step-boundary helpers through runtime | Existing flow still mostly imperative, but step boundaries use the runtime |
| 1 | 6 | Observe-only hook API | Hooks can observe redacted events; hooks cannot mutate/veto state |
| 2 | 7 | Extract preflight state handler | Preflight logic becomes an explicit handler |
| 2 | 8 | Extract gateway state handler | Gateway reuse/recreate/start logic becomes an explicit handler |
| 2 | 9 | Extract provider selection and inference handlers | Provider/model selection and inference setup become explicit handlers; split into 9a/9b if needed |
| 2 | 10 | Extract sandbox handler | Sandbox reuse/recreate/create logic becomes an explicit handler; split if needed |
| 2 | 11 | Extract OpenClaw/agent setup, policies, finalization | Final step groups become explicit handlers; split into 11a/11b/11c if needed |

There are 11 PR groups in scope. A PR group may split into smaller reviewable PRs, but the milestone sequence should stay the same.

Target machine states

Initial coarse states:

export type OnboardMachineState =
  | "init"
  | "preflight"
  | "gateway"
  | "provider_selection"
  | "inference"
  | "sandbox"
  | "agent_setup"
  | "openclaw"
  | "policies"
  | "finalizing"
  | "post_verify"
  | "complete"
  | "failed";

Initial transition graph:

init -> preflight
preflight -> gateway
gateway -> provider_selection
provider_selection -> inference
inference -> provider_selection     # retry provider/model selection
inference -> sandbox
sandbox -> openclaw
sandbox -> agent_setup
openclaw -> policies
agent_setup -> policies
policies -> finalizing
finalizing -> post_verify
post_verify -> complete
any non-terminal state -> failed

Use finalizing and post_verify so complete can eventually mean “all onboarding work and post-onboard UX completed,” not merely “session was marked complete.”
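
The graph above could be encoded as a lookup table plus one validation function. This is a minimal sketch, not the planned contents of transitions.ts: the names TRANSITIONS and canTransition are hypothetical, and treating "any non-terminal state -> failed" as a special case outside the table is an assumption about how the rule would be expressed.

```typescript
type OnboardMachineState =
  | "init" | "preflight" | "gateway" | "provider_selection" | "inference"
  | "sandbox" | "agent_setup" | "openclaw" | "policies" | "finalizing"
  | "post_verify" | "complete" | "failed";

// Forward edges copied from the transition graph; the edge into "failed"
// is handled separately in canTransition (hypothetical name).
const TRANSITIONS: Record<OnboardMachineState, readonly OnboardMachineState[]> = {
  init: ["preflight"],
  preflight: ["gateway"],
  gateway: ["provider_selection"],
  provider_selection: ["inference"],
  inference: ["provider_selection", "sandbox"], // retry edge plus forward edge
  sandbox: ["openclaw", "agent_setup"],
  openclaw: ["policies"],
  agent_setup: ["policies"],
  policies: ["finalizing"],
  finalizing: ["post_verify"],
  post_verify: ["complete"],
  complete: [],
  failed: [],
};

function canTransition(from: OnboardMachineState, to: OnboardMachineState): boolean {
  // Any non-terminal state may fail; terminal states never transition.
  if (to === "failed") return from !== "complete" && from !== "failed";
  return TRANSITIONS[from].indexOf(to) !== -1;
}
```

Keeping the table exhaustive over OnboardMachineState means that adding a state without declaring its outgoing edges fails to compile.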

Initial event vocabulary

export type OnboardMachineEventType =
  | "onboard.started"
  | "onboard.resumed"
  | "onboard.completed"
  | "onboard.failed"
  | "state.entered"
  | "state.exited"
  | "state.skipped"
  | "state.completed"
  | "state.failed"
  | "state.repair.started"
  | "state.repair.completed"
  | "state.repair.failed"
  | "context.updated"
  | "resume.conflict"
  | "hook.started"
  | "hook.completed"
  | "hook.failed";

Later follow-up issues can add command/probe/credential-specific events if needed.

Required design principles

1. Preserve current behavior first

The FSM should describe current onboarding before redesigning it.

Avoid changing:

  • prompts,
  • defaults,
  • gateway reuse behavior,
  • sandbox recreation conditions,
  • credential migration rules,
  • registry write timing,
  • policy carry-forward behavior.

2. Keep old sessions readable

Existing ~/.nemoclaw/onboard-session.json files must normalize correctly.

When new machine fields are absent, infer state from current fields:

  1. status === "complete" -> complete.
  2. status === "failed" -> failed.
  3. in-progress lastStepStarted -> that step.
  4. completed lastCompletedStep -> next logical state.
  5. otherwise -> init.
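
The five rules above can be read as a single ordered function. The sketch below assumes a session shape that this issue does not specify: the LegacySession interface, the per-step steps map, the "in_progress" status string, and the usesOpenClaw flag (needed because sandbox has two successors) are all hypothetical.

```typescript
type Step =
  | "preflight" | "gateway" | "provider_selection" | "inference"
  | "sandbox" | "openclaw" | "agent_setup" | "policies";
type OnboardMachineState =
  | "init" | Step | "finalizing" | "post_verify" | "complete" | "failed";

// Hypothetical view of the fields an old onboard-session.json might carry.
interface LegacySession {
  status?: string;
  lastStepStarted?: Step | null;
  lastCompletedStep?: Step | null;
  steps?: Partial<Record<Step, { status?: string }>>;
}

// "Next logical state" after a completed step, following the transition graph.
function nextStateAfter(step: Step, usesOpenClaw: boolean): OnboardMachineState {
  switch (step) {
    case "preflight": return "gateway";
    case "gateway": return "provider_selection";
    case "provider_selection": return "inference";
    case "inference": return "sandbox";
    case "sandbox": return usesOpenClaw ? "openclaw" : "agent_setup";
    case "openclaw":
    case "agent_setup": return "policies";
    case "policies": return "finalizing";
  }
}

function inferMachineState(session: LegacySession, usesOpenClaw = true): OnboardMachineState {
  if (session.status === "complete") return "complete";            // rule 1
  if (session.status === "failed") return "failed";                // rule 2
  const started = session.lastStepStarted;
  if (started && session.steps?.[started]?.status === "in_progress") {
    return started;                                                // rule 3
  }
  const completed = session.lastCompletedStep;
  if (completed) return nextStateAfter(completed, usesOpenClaw);   // rule 4
  return "init";                                                   // rule 5
}
```

Rule order matters: a session whose last started step is still in progress resumes at that step even though an earlier step is recorded as completed.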

3. State and events must be redacted

Never persist or emit raw secrets.

Safe to persist/emit:

  • provider name,
  • model name,
  • sandbox name,
  • credential env var name,
  • redacted endpoint URL,
  • selected policy preset names,
  • selected messaging channel names.

Unsafe:

  • raw API keys,
  • bearer tokens,
  • QR tokens,
  • webhook tokens,
  • OAuth tokens,
  • unredacted URLs containing query secrets.
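
As an illustration of the "redacted endpoint URL" item, the sketch below masks every query value and drops userinfo and fragments before a URL is persisted or emitted. The function name redactUrl and the literal REDACTED placeholder are assumptions; the redaction helper this work actually adopts may keep or drop different parts.

```typescript
// Hypothetical helper: returns a form of the URL that is safe to log.
function redactUrl(raw: string): string {
  // Fragments can carry tokens too, so drop them entirely.
  const hashIndex = raw.indexOf("#");
  const withoutHash = hashIndex === -1 ? raw : raw.slice(0, hashIndex);

  const queryIndex = withoutHash.indexOf("?");
  const base = queryIndex === -1 ? withoutHash : withoutHash.slice(0, queryIndex);
  const query = queryIndex === -1 ? "" : withoutHash.slice(queryIndex + 1);

  // Strip "user:password@" from the authority part, if present.
  const safeBase = base.replace(/^([a-z][a-z0-9+.-]*:\/\/)[^/@]*@/i, "$1");
  if (query === "") return safeBase;

  // Keep parameter names (useful for diagnostics) but never their values.
  const redactedQuery = query
    .split("&")
    .map((pair) => `${pair.split("=")[0]}=REDACTED`)
    .join("&");
  return `${safeBase}?${redactedQuery}`;
}
```

Masking all query values rather than a deny-list of known secret names errs on the safe side: a new provider that passes its key under an unfamiliar parameter name is still covered.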

4. Skipped state does not mean no-op

A skipped state means the primary work did not need to rerun. It may still perform resume validation or repair.

Current examples:

  • resumed preflight skips full preflight but re-detects GPU and revalidates CDI/sandbox GPU configuration;
  • resumed provider_selection skips interactive selection but hydrates credentials and may repair the Ollama systemd loopback override for ollama-local;
  • resumed gateway, sandbox, and policies inspect live state before deciding whether reuse is safe.

If a skipped state performs repair, emit state.repair.started / state.repair.completed / state.repair.failed so diagnostics and hooks can distinguish fast-path skip from skip-plus-repair.
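
One way to make the fast-path skip and the skip-plus-repair path distinguishable is a small wrapper that emits the repair events only when repair actually runs. This is a sketch under assumptions: skipWithRepair and its callback parameters are hypothetical, emitting state.skipped before the repair events is one possible ordering, and the helper is synchronous for brevity although real repairs would likely be async.

```typescript
type SkipEventType =
  | "state.skipped"
  | "state.repair.started"
  | "state.repair.completed"
  | "state.repair.failed";

interface SkipEvent {
  type: SkipEventType;
  state: string;
}

// Hypothetical helper: skip the primary work, then validate and repair if needed.
function skipWithRepair(
  state: string,
  needsRepair: () => boolean,
  repair: () => void,
  emit: (event: SkipEvent) => void,
): void {
  emit({ type: "state.skipped", state });
  if (!needsRepair()) return; // fast-path skip: no repair events at all

  emit({ type: "state.repair.started", state });
  try {
    repair();
    emit({ type: "state.repair.completed", state });
  } catch (err) {
    // The event carries no error text, so it cannot leak secrets from messages.
    emit({ type: "state.repair.failed", state });
    throw err;
  }
}
```

A resumed provider_selection for ollama-local that repairs the systemd override would then produce state.skipped, state.repair.started, state.repair.completed, while a resume that needs no repair produces state.skipped alone.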

5. Persist stable intent, recompute runtime topology

Do not persist environment-derived topology decisions as durable FSM context.

Persist stable intent:

  • provider,
  • model,
  • sandbox name,
  • selected channels,
  • policy presets,
  • credential env var names.

Recompute live topology on every fresh/resume run:

  • WSL vs non-WSL behavior,
  • Docker Desktop vs native/rootless Docker reachability,
  • whether the sandbox needs the Ollama auth proxy,
  • whether host systemd overrides need repair,
  • gateway and sandbox live health.
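
The split between persisted intent and recomputed topology can be enforced by the types themselves. In the sketch below, all interface, field, and function names are hypothetical; the point is only that serialization touches the intent half and that resume always calls a fresh probe.

```typescript
// Stable intent: safe to write to the session file.
interface PersistedIntent {
  provider: string;
  model: string;
  sandboxName: string | null;
  channels: string[];
  policyPresets: string[];
  credentialEnvVars: string[];
}

// Live topology: derived from the environment on every run, never persisted.
interface LiveTopology {
  isWsl: boolean;
  needsOllamaAuthProxy: boolean;
  needsSystemdOverrideRepair: boolean;
}

interface RunContext {
  intent: PersistedIntent;
  topology: LiveTopology;
}

// Only the intent half is serialized.
function serializeForSession(ctx: RunContext): string {
  return JSON.stringify(ctx.intent);
}

// On resume, intent comes from disk and topology comes from a fresh probe.
function rebuildOnResume(savedJson: string, probeTopology: () => LiveTopology): RunContext {
  const intent = JSON.parse(savedJson) as PersistedIntent;
  return { intent, topology: probeTopology() };
}
```

With this shape, a session saved under WSL and resumed on a different host picks up the new host's topology automatically, because no stale answer was stored.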

PR group details

PR 1 — FSM vocabulary and transition types

Add:

  • src/lib/onboard/machine/types.ts
  • src/lib/onboard/machine/transitions.ts
  • transition tests

No behavior change.

PR 2 — Structured event emission around current session mutations

Wrap or augment:

  • markStepStarted(...)
  • markStepComplete(...)
  • markStepSkipped(...)
  • markStepFailed(...)
  • completeSession(...)

Emit redacted events. Avoid persistent full event logs by default.

PR 3 — Session machine snapshot

Add a compact machine snapshot to the session, e.g.:

export interface OnboardMachineSnapshot {
  version: 1;
  state: OnboardMachineState;
  stateEnteredAt: string | null;
  revision: number;
}

Normalize old sessions without requiring users to delete state.
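
Reading the snapshot defensively might look like the sketch below, which returns null for anything that is not a well-formed version 1 snapshot so the caller can fall back to inferring state from the legacy fields. The function name parseSnapshot and the exact validation rules (for example, requiring a non-negative integer revision) are assumptions.

```typescript
const STATES = [
  "init", "preflight", "gateway", "provider_selection", "inference",
  "sandbox", "agent_setup", "openclaw", "policies", "finalizing",
  "post_verify", "complete", "failed",
] as const;
type OnboardMachineState = (typeof STATES)[number];

interface OnboardMachineSnapshot {
  version: 1;
  state: OnboardMachineState;
  stateEnteredAt: string | null;
  revision: number;
}

// Hypothetical reader: null means "absent or unusable, normalize from old fields".
function parseSnapshot(raw: unknown): OnboardMachineSnapshot | null {
  if (typeof raw !== "object" || raw === null) return null;
  const { version, state, stateEnteredAt, revision } = raw as Record<string, unknown>;

  if (version !== 1) return null;
  if (typeof state !== "string") return null;
  if ((STATES as readonly string[]).indexOf(state) === -1) return null;

  let enteredAt: string | null;
  if (stateEnteredAt === null) enteredAt = null;
  else if (typeof stateEnteredAt === "string") enteredAt = stateEnteredAt;
  else return null;

  if (typeof revision !== "number" || !Number.isInteger(revision) || revision < 0) return null;

  return { version: 1, state: state as OnboardMachineState, stateEnteredAt: enteredAt, revision };
}
```

Returning null instead of throwing keeps old or partially written session files readable, which matches the requirement that users never have to delete state.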

PR 4 — OnboardRuntime

Runtime owns:

  • transition validation,
  • session persistence,
  • safe context updates,
  • failure handling,
  • redaction,
  • event emission,
  • future hook dispatch.

Prefer async methods from the start so hook support does not require a second API migration.
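
A minimal async runtime could validate, persist, and then emit, in that order. The sketch below is illustrative only: the constructor parameters, the current getter, and the persist-before-emit ordering are assumptions, and redaction, context updates, and failure handling are omitted to keep it short.

```typescript
type State =
  | "init" | "preflight" | "gateway" | "provider_selection" | "inference"
  | "sandbox" | "agent_setup" | "openclaw" | "policies" | "finalizing"
  | "post_verify" | "complete" | "failed";

const NEXT: Record<State, readonly State[]> = {
  init: ["preflight"], preflight: ["gateway"], gateway: ["provider_selection"],
  provider_selection: ["inference"], inference: ["provider_selection", "sandbox"],
  sandbox: ["openclaw", "agent_setup"], openclaw: ["policies"],
  agent_setup: ["policies"], policies: ["finalizing"],
  finalizing: ["post_verify"], post_verify: ["complete"], complete: [], failed: [],
};

function isAllowed(from: State, to: State): boolean {
  if (to === "failed") return from !== "complete" && from !== "failed";
  return NEXT[from].indexOf(to) !== -1;
}

interface RuntimeEvent { type: "state.exited" | "state.entered"; state: State; }
interface Snapshot { state: State; revision: number; }

// Hypothetical shape; the planned OnboardRuntime owns more than this.
class OnboardRuntime {
  private state: State = "init";
  private revision = 0;
  private readonly persist: (snapshot: Snapshot) => Promise<void>;
  private readonly emit: (event: RuntimeEvent) => Promise<void>;

  constructor(
    persist: (snapshot: Snapshot) => Promise<void>,
    emit: (event: RuntimeEvent) => Promise<void>,
  ) {
    this.persist = persist;
    this.emit = emit;
  }

  get current(): State {
    return this.state;
  }

  async transition(to: State): Promise<void> {
    const from = this.state;
    if (!isAllowed(from, to)) {
      throw new Error(`invalid transition: ${from} -> ${to}`);
    }
    // Persist first so in-memory state never runs ahead of the session file.
    const next: Snapshot = { state: to, revision: this.revision + 1 };
    await this.persist(next);
    this.state = next.state;
    this.revision = next.revision;
    await this.emit({ type: "state.exited", state: from });
    await this.emit({ type: "state.entered", state: to });
  }
}
```

Because transition is already async, adding awaited hook dispatch later changes the body of emit but not the signature callers depend on.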

PR 5 — Route existing step-boundary helpers through runtime

Keep the current flow mostly intact, but make step boundaries go through the runtime.

Existing startRecordedStep(...) can remain as compatibility glue during this PR.

PR 6 — Observe-only hook API

Add hook support such as:

export interface OnboardHook {
  onEvent?(event: OnboardEvent): Promise<void> | void;
}

Hook failures should warn and emit hook.failed, but should not fail onboarding by default.

First external sink should be JSONL, not arbitrary executable hooks.
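
Observe-only dispatch with failure isolation, plus a JSONL sink, might be sketched as follows. The OnboardEvent fields, dispatchToHooks, and jsonlSink are hypothetical; the sink writes through an injected function so the sketch does not commit to a particular file API.

```typescript
// Hypothetical minimal event shape; real events would carry redacted context.
interface OnboardEvent {
  type: string;
  state?: string;
  at: string;
}

interface OnboardHook {
  onEvent?(event: OnboardEvent): Promise<void> | void;
}

// Calls every hook in order. A throwing hook produces a warning and a
// hook.failed event for the caller to emit, but never fails onboarding.
async function dispatchToHooks(
  hooks: readonly OnboardHook[],
  event: OnboardEvent,
  warn: (message: string) => void,
): Promise<OnboardEvent[]> {
  const failures: OnboardEvent[] = [];
  for (const hook of hooks) {
    if (!hook.onEvent) continue;
    try {
      await hook.onEvent(event);
    } catch {
      warn(`onboard hook failed while handling ${event.type}`);
      failures.push({ type: "hook.failed", at: event.at });
    }
  }
  return failures;
}

// JSONL sink: one JSON object per line, newline-terminated.
function jsonlSink(writeLine: (line: string) => void): OnboardHook {
  return {
    onEvent(event: OnboardEvent): void {
      writeLine(JSON.stringify(event) + "\n");
    },
  };
}
```

Returning the hook.failed events instead of dispatching them back through the same hooks avoids a loop in which a permanently broken hook keeps failing on its own failure events.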

PR 7 — Preflight handler

Preserve:

  • resume skip behavior,
  • GPU re-detection,
  • CDI/sandbox GPU validation,
  • GPU passthrough persistence.

PR 8 — Gateway handler

Preserve:

  • named gateway reuse,
  • Docker-driver gateway state,
  • stale metadata cleanup,
  • gateway container verification,
  • HTTP readiness,
  • image drift recreation,
  • GPU passthrough compatibility,
  • legacy gateway replacement.

PR 9 — Provider selection and inference handlers

Split into 9a/9b if needed.

Preserve:

  • resume skip if provider/model already selected,
  • credential hydration,
  • ollama-local resume-time repair,
  • local inference topology decisions through src/lib/onboard/local-inference-topology.ts,
  • Hermes auth method,
  • model-router reconciliation,
  • provider upserts,
  • local Ollama proxy recovery/fronting via shouldFrontOllamaWithProxy(),
  • retry from inference back to provider selection.

PR 10 — Sandbox handler

Preserve:

  • web search support/drift handling,
  • messaging config hydration/drift handling,
  • sandbox reuse state,
  • Telegram/WeChat/GPU drift handling,
  • sandbox repair/recreate behavior,
  • sandbox name prompt/default behavior,
  • registry/default sandbox update timing,
  • rule that sandboxName is not persisted until sandbox creation succeeds.

Keep displayed messaging work embedded in sandbox for this milestone.

PR 11 — OpenClaw/agent setup, policies, finalization

Split into 11a/11b/11c if needed.

Preserve:

  • OpenClaw resume skip/setup,
  • non-OpenClaw agent health probe setup,
  • route session writes from src/lib/agent/onboard.ts through the runtime (no direct session writes; this is a change from today, not preserved behavior),
  • policy preset clamp/apply/resume behavior,
  • legacy credential cleanup,
  • stale host file cleanup,
  • sandbox process recovery,
  • deployment verification,
  • dashboard printing.

Acceptance criteria for closing this issue

This issue can close when Milestones 1 and 2 are complete:

  1. Onboarding persists an explicit machine state.
  2. Existing step-boundary resume behavior still works.
  3. Old onboard-session.json files normalize correctly.
  4. Every major state emits redacted structured events.
  5. Observe-only hooks can subscribe to events.
  6. Invalid transitions are rejected or impossible through the runtime.
  7. Existing onboarding tests pass.
  8. New transition/session/event tests cover FSM behavior.
  9. Skipped states with resume validation/repair are represented accurately in events and tests.
  10. Runtime-derived topology decisions are recomputed on resume rather than persisted.
  11. Step handlers own the onboarding flow; src/lib/onboard.ts is primarily CLI shell/orchestration.
  12. src/lib/agent/onboard.ts no longer writes session state directly.
  13. No secrets appear in event logs, hook payloads, debug summaries, or session machine fields.

Follow-up issues after this closes

Open separate issues for these after the core FSM lands:

Follow-up A — Operational diagnostics

Potential work:

  • stable JSONL event log,
  • nemoclaw debug onboard-session,
  • resume eligibility explanation,
  • event schema docs in the current Fern MDX docs system,
  • test harness for driving the FSM from snapshots.

Follow-up B — Fine-grained substates and mid-step resume

Potential work:

  • gateway substates,
  • inference/provider substates,
  • sandbox build/create substates,
  • policy apply/verify substates,
  • per-substate recovery classification: replayable, detect-and-skip, repairable, unsafe/unknown.

Follow-up C — External orchestration / UI integration

Potential work if there is product need:

  • TUI/web progress backed by FSM events,
  • remote supervisor integration,
  • pause/cancel semantics,
  • dry-run plan mode,
  • machine-readable CI reports.
