Summary
The NemoClaw CLI has five parallel mechanisms for surfacing warnings, advisories, and fatal errors to users, with no shared contract:
1. RemediationAction pipeline in preflight.ts: { id, title, kind, reason, commands[], blocking } → printRemediationActions
2. HostAssessment.notes: string[] ([…] "Running under WSL", etc.) […] .includes()
3. console.warn(...) […] src/lib/**.ts
4. console.error(...) + process.exit(1) […] process.exit(1) sites in onboard.ts alone
5. Typed health modules (dashboard-health.ts, inference-health.ts): { ok, detail }, ChainStatus. Mid-onboard chain health; not wired to user-facing advisory output.
Only mechanism #1 is data-first and structured. The other four grew because #1's scope is implicitly "host-time only": anything mid-flow, post-preflight, or non-fatal had nowhere else to go.
Concrete evidence this hurts
#3159 (fix: warn before gateway start when nvidia.com/gpu CDI spec is missing) correctly extends mechanism #1, but had to hand-roll assertCdiNvidiaGpuSpecPresent in onboard.ts because:
- The cached preflight session doesn't capture CDI state, so the --resume path walks back into the same failure the check was meant to prevent.
- The current pipeline has no notion of resumeSafe: false, so authors must remember to re-wire checks on the resume branch by hand.
- blocking: boolean can't express "blocking-but-resumable" (the exact shape of this check).
Each new advisory-adjacent PR pays this cost, and the default path of least resistance is still console.warn or process.exit(1), which is why we're at 74 + 117 ad-hoc sites.
What's missing from the current architecture
- blocking: boolean can't distinguish fatal / blocking / warning / info / hint.
- […] --resume is not declarable.
- docsUrl is not a first-class field, so error → troubleshooting.md linkage relies on authors remembering.
- […] console.error + process.exit(1) pattern mean 117 places that can drift.
- […] nemoclaw doctor surfacing: no JSON channel; advisories only exist as stderr side effects of running onboard.
Proposed architecture
A single src/lib/advisories/ module with: […]
With three companion pieces:
- Registry (src/lib/advisories/registry.ts): explicit imports of each check module (no auto-discovery; import graph stays traceable).
- Runner (src/lib/advisories/runner.ts): runAdvisories(checks, ctx, { phase, resuming, suppressed }) filters by phase, honours skipIf, enforces resumeSafe on resume paths, and applies suppression.
- Presenter (src/lib/advisories/presenter.ts): formatAdvisories(advisories, "console" | "json") + assertNoBlocking(advisories), the single fatal gate that replaces the 117 process.exit(1) sites.
What this buys
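Most of the benefits in this section depend on the shape of the check type and the runner described under Proposed architecture. The block below is only a minimal sketch of that shape, not the actual implementation. The severity names, resumeSafe, skipIf, docsUrl, and the runAdvisories / assertNoBlocking signatures come from this issue. The run method, the id and message fields, the HostCtx type, the check id, and throwing an Error in place of process.exit(1) are assumptions made so the example stays self-contained.

```typescript
// Sketch only: fields marked "assumed" are not taken from the issue text.
type Severity = "fatal" | "blocking" | "warning" | "info" | "hint";

interface Advisory {
  id: string; // assumed
  severity: Severity;
  message: string; // assumed
  docsUrl?: string;
}

interface AdvisoryCheck<Ctx> {
  id: string; // assumed
  phase: string;
  // false: a result obtained before --resume cannot be trusted, so re-run.
  resumeSafe: boolean;
  skipIf?: (ctx: Ctx) => boolean;
  run: (ctx: Ctx) => Advisory | null; // pure: no console output, no exit
}

interface RunOptions {
  phase: string;
  resuming: boolean;
  suppressed: ReadonlySet<string>;
}

function runAdvisories<Ctx>(
  checks: readonly AdvisoryCheck<Ctx>[],
  ctx: Ctx,
  opts: RunOptions,
): Advisory[] {
  const out: Advisory[] = [];
  for (const check of checks) {
    if (check.phase !== opts.phase) continue; // filter by phase
    if (check.skipIf?.(ctx)) continue; // honour skipIf
    // On resume, only checks that are NOT resume-safe are re-executed.
    if (opts.resuming && check.resumeSafe) continue;
    const advisory = check.run(ctx);
    if (advisory === null) continue;
    if (opts.suppressed.has(advisory.id)) continue; // apply suppression
    out.push(advisory);
  }
  return out;
}

// Single gate; a real CLI might print and exit here instead of throwing.
function assertNoBlocking(advisories: readonly Advisory[]): void {
  const gating = advisories.filter(
    (a) => a.severity === "fatal" || a.severity === "blocking",
  );
  if (gating.length > 0) {
    throw new Error(`blocked by: ${gating.map((a) => a.id).join(", ")}`);
  }
}

// Hypothetical context and check, modelled on the CDI case from #3159.
interface HostCtx {
  cdiSpecPresent: boolean;
}

const cdiCheck: AdvisoryCheck<HostCtx> = {
  id: "cdi_nvidia_gpu_spec_missing",
  phase: "preflight.host",
  resumeSafe: false,
  run: (ctx) =>
    ctx.cdiSpecPresent
      ? null
      : {
          id: "cdi_nvidia_gpu_spec_missing",
          severity: "blocking",
          message: "nvidia.com/gpu CDI spec is missing",
        },
};
```

Under these assumptions, a check marked resumeSafe: false runs again on the --resume path without any extra wiring in onboard.ts, and the check itself can be tested by calling run with a plain object.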
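- Adding a new check is a 30-line module, not a cross-file edit into onboard.ts or preflight.ts. Contributors adding "kernel too old for Landlock," "port conflict with VPN," "stale openshell version" all follow one pattern.
- Resume correctness becomes declarative. resumeSafe: false → runner re-executes automatically. No more assert…Present helpers per check.
- Severity distinctions become expressible. The 74 console.warn sites can migrate to severity: "warning" checks; the 117 process.exit(1) sites to severity: "fatal" / "blocking". Per-site migration, not big-bang.
- Monolith growth slows. Every ad-hoc stderr site grew because there was nowhere else to put it. With a registry, "file a check in src/lib/advisories/checks/" is the cheaper path.
- Advisories become addressable. nemoclaw doctor --json runs the full registry without starting onboard. Dashboard gets "6 advisories (2 blocking)" tile. Support can say "set NEMOCLAW_SUPPRESS=headless_remote_hint."
- Docs linking becomes convention. docsUrl on every advisory closes the loop between error message and troubleshooting.md anchor that currently relies on authors remembering (many don't).
- Testability stays high. Each AdvisoryCheck is a pure (ctx) => Advisory | null. No console-spy, no process.exit mock.
- Monolith shrinks. Every migrated call site moves logic out of onboard.ts (currently ~10,160 lines) into focused check modules.
Migration path (incremental; no big-bang)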
Phase 1: foundation (one PR, zero behavior change)
- […] src/lib/advisories/{types,registry,runner,presenter}.ts.
Phase 2: migrate the existing pipeline (one PR, zero behavior change)
- […] planHostRemediation as 9 AdvisoryCheck<HostAssessment> modules under src/lib/advisories/checks/host/.
- […] RemediationAction[] as a thin adapter on top of Advisory[] so nothing else in the codebase breaks.
- […] HostAssessment.notes into severity: "info" advisories.
- src/lib/preflight.ts shrinks by ~200 lines as logic moves into focused check modules.
Phase 3: wire onboard.ts (one PR)
- […] runAdvisories(HOST_CHECKS, host, { phase: "preflight.host", resuming }) call each.
- […] assertCdiNvidiaGpuSpecPresent (from #3159, fix(preflight): warn before gateway start when nvidia.com/gpu CDI spec is missing): its resume-safety is now declarative on the check module.
- […] onboard.ts: target −50 to −100 lines.
Phase 4: opportunistic migration (many small PRs, each good-first-issue)
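To illustrate what a single Phase 4 migration could look like, here is a hedged before/after sketch for one warning site. The "before" line is a paraphrase, not code from the repository. The check id, phase name, ProbeCtx type, message text, and docsUrl anchor are hypothetical placeholders, and the check shape (an object with a pure run function) is an assumption.

```typescript
// Before (ad-hoc site, paraphrased; not the actual code):
//   if (!dnsOk) console.warn("DNS probe failed");

// After: the same condition expressed as a pure check.
type Severity = "fatal" | "blocking" | "warning" | "info" | "hint";

interface Advisory {
  id: string;
  severity: Severity;
  message: string;
  docsUrl?: string;
}

interface AdvisoryCheck<Ctx> {
  id: string;
  phase: string;
  resumeSafe: boolean;
  run: (ctx: Ctx) => Advisory | null;
}

// Hypothetical context: whatever the probe already computed.
interface ProbeCtx {
  dnsOk: boolean;
}

const dnsProbeCheck: AdvisoryCheck<ProbeCtx> = {
  id: "dns_probe_failed", // hypothetical id
  phase: "onboard.network", // hypothetical phase name
  resumeSafe: false,
  run: (ctx) =>
    ctx.dnsOk
      ? null
      : {
          id: "dns_probe_failed",
          severity: "warning",
          message: "DNS probe failed",
          docsUrl: "troubleshooting.md#dns-probe", // placeholder anchor
        },
};
```

Because run only maps a context object to an advisory or null, a unit test can call it directly with { dnsOk: false } and inspect the result, with no console spy involved.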
- […] console.warn sites to severity: "warning" checks, highest-traffic first (DNS probe, web-search verification, rebuild state preservation, snapshot reconciliation, destroy, connect).
- […] process.exit(1) sites to severity: "fatal" / "blocking" checks, highest-traffic first (credential validation, inference probe, gateway health, DNS probe).
- […] onboard.ts across the phase.
Phase 5: new surface area (one PR)
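The Phase 5 inputs and outputs can be sketched in a few pure functions. This is an illustration under assumptions, not the planned implementation. The NEMOCLAW_SUPPRESS=<id>,<id> format, the "console" | "json" formats, and the "N advisories (M blocking)" wording come from this issue. The helper names parseSuppressed and summarize, the console line layout, and counting both "fatal" and "blocking" in the summary are hypothetical choices. The raw env value is passed in as a string so the example does not depend on any runtime global.

```typescript
type Severity = "fatal" | "blocking" | "warning" | "info" | "hint";

interface Advisory {
  id: string;
  severity: Severity;
  message: string;
}

// Parse a comma-separated suppression list; blanks and stray commas are ignored.
function parseSuppressed(raw: string | undefined): Set<string> {
  if (!raw) return new Set<string>();
  return new Set<string>(
    raw
      .split(",")
      .map((s) => s.trim())
      .filter((s) => s.length > 0),
  );
}

// One-line summary of the kind a dashboard tile might show.
// Assumption: "fatal" counts toward the blocking total.
function summarize(advisories: readonly Advisory[]): string {
  const blocking = advisories.filter(
    (a) => a.severity === "fatal" || a.severity === "blocking",
  ).length;
  return `${advisories.length} advisories (${blocking} blocking)`;
}

// Render advisories as console lines or as a JSON document.
function formatAdvisories(
  advisories: readonly Advisory[],
  format: "console" | "json",
): string {
  if (format === "json") return JSON.stringify(advisories);
  return advisories
    .map((a) => `[${a.severity}] ${a.id}: ${a.message}`)
    .join("\n");
}
```

Keeping these functions free of side effects means a doctor-style command would only need to read the env var, run the registry, and print the returned string.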
- […] nemoclaw doctor command that runs all checks and prints/JSONs the output without starting onboard.
- […] NEMOCLAW_SUPPRESS=<id>,<id> env var and per-user ~/.nemoclaw/advisories.json suppression file.
Out of scope
Related PRs & issues
- #3159, fix(preflight): warn before gateway start when nvidia.com/gpu CDI spec is missing: the PR whose workarounds motivated this epic. The assertCdiNvidiaGpuSpecPresent helper it introduces goes away in Phase 3.
- refactor(onboard): extract modules from onboard.ts (WIP/Draft). Phase 4 migrations are natural extraction targets.
Acceptance criteria
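- src/lib/advisories/ module exists with types, registry, runner, presenter, and tests.
- planHostRemediation is implemented on top of the registry with no behaviour change.
- onboard.ts has one unified runAdvisories call per phase; no bespoke assert…Present helpers remain in the preflight area.
- Contributor docs (CONTRIBUTING.md or equivalent) name the advisory registry as the single place to add new user-facing warnings/errors; the PR review skill enforces this.
- At least 20 of the 74 console.warn sites and 20 of the 117 process.exit(1) sites have migrated.
- nemoclaw doctor command lands and exercises the full registry without running onboard.
Notes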
This epic is architectural, not feature-driven. Value accrues gradually — each phase is a strict improvement and reversible in isolation. The goal is to make "the right thing to do when adding a user-facing warning" the path of least resistance for future contributors, so we stop growing the monolith every time a new host precondition, credential check, or runtime advisory is needed.