Skip to content

Epic: unify warnings, advisories, and fatal exits under a single registry #3213

@jyaunches

Description

@jyaunches

Summary

The NemoClaw CLI has five parallel mechanisms for surfacing warnings, advisories, and fatal errors to users, with no shared contract:

# Mechanism Shape Sites
1 RemediationAction pipeline in preflight.ts Structured data: { id, title, kind, reason, commands[], blocking }printRemediationActions Host preflight only (~10 actions)
2 HostAssessment.notes: string[] Free-text strings ("Running under WSL", etc.) Preflight only; consumed via .includes()
3 Ad-hoc console.warn(...) Unstructured strings 74 callsites across src/lib/**.ts
4 Ad-hoc console.error(...) + process.exit(1) Unstructured strings 117 process.exit(1) sites in onboard.ts alone
5 Typed health modules (dashboard-health.ts, inference-health.ts) { ok, detail }, ChainStatus Mid-onboard chain health; not wired to user-facing advisory output

Only mechanism #1 is data-first and structured. The other four grew because #1's scope is implicitly "host-time only" — anything mid-flow, post-preflight, or non-fatal had nowhere else to go.

Concrete evidence this hurts

#3159 (fix: warn before gateway start when nvidia.com/gpu CDI spec is missing) correctly extends mechanism #1, but had to hand-roll assertCdiNvidiaGpuSpecPresent in onboard.ts because:

  • The cached preflight session doesn't capture CDI state, so the --resume path walks back into the same failure the check was meant to prevent.
  • The current pipeline has no notion of resumeSafe: false — authors must remember to re-wire checks on the resume branch by hand.
  • blocking: boolean can't express "blocking-but-resumable" (the exact shape of this check).

Each new advisory-adjacent PR pays this cost, and the default path of least resistance is still console.warn or process.exit(1) — which is why we're at 74 + 117 ad-hoc sites.

What's missing from the current architecture

  • Severity levelsblocking: boolean can't distinguish fatal / blocking / warning / info / hint.
  • Phase — advisories exist across preflight.host, preflight.network, onboard.gateway, onboard.sandbox, onboard.credentials, onboard.inference, runtime.rebuild, runtime.destroy, runtime.shields — the pipeline knows nothing about phase.
  • Resume-safety contract — whether a cached verdict can be trusted on --resume is not declarable.
  • Stable IDs on non-preflight advisories — no way to suppress, link docs, or track prevalence.
  • Docs linkingdocsUrl is not a first-class field, so error → troubleshooting.md linkage relies on authors remembering.
  • Single fatal gate — 117 copies of the console.error + process.exit(1) pattern mean 117 places that can drift.
  • Dashboard / nemoclaw doctor surfacing — no JSON channel; advisories only exist as stderr side effects of running onboard.

Proposed architecture

A single src/lib/advisories/ module with:

// src/lib/advisories/types.ts
export type Severity = "fatal" | "blocking" | "warning" | "info" | "hint";
export type Phase =
  | "preflight.host" | "preflight.network"
  | "onboard.gateway" | "onboard.sandbox" | "onboard.credentials" | "onboard.inference"
  | "runtime.rebuild" | "runtime.destroy" | "runtime.shields";

export interface Advisory {
  id: string;               // stable — for suppression, telemetry, docs anchors
  severity: Severity;
  phase: Phase;
  title: string;
  reason: string;
  commands?: string[];
  docsUrl?: string;         // e.g. "…/troubleshooting.md#unresolvable-cdi-devices"
  resumeSafe: boolean;      // may a cached session trust a prior pass?
  kind?: "manual" | "sudo" | "info";
}

export interface AdvisoryCheck<Ctx> {
  id: string;
  phase: Phase;
  severity: Severity;
  resumeSafe: boolean;
  skipIf?: (ctx: Ctx) => boolean;   // e.g. optedOutGpuPassthrough
  check: (ctx: Ctx) => Advisory | null;
}

With three companion pieces:

  1. Registry (src/lib/advisories/registry.ts) — explicit imports of each check module (no auto-discovery; import graph stays traceable).
  2. Runner (src/lib/advisories/runner.ts) — runAdvisories(checks, ctx, { phase, resuming, suppressed }) — filters by phase, honours skipIf, enforces resumeSafe on resume paths, applies suppression.
  3. Presenter (src/lib/advisories/presenter.ts) — formatAdvisories(advisories, "console" | "json") + assertNoBlocking(advisories) — the single fatal gate that replaces the 117 process.exit(1) sites.

What this buys

  • Adding a new check is a 30-line module, not a cross-file edit into onboard.ts or preflight.ts. Contributors adding "kernel too old for Landlock," "port conflict with VPN," "stale openshell version" all follow one pattern.
  • Resume correctness becomes declarative. resumeSafe: false → runner re-executes automatically. No more assert…Present helpers per check.
  • Severity distinctions become expressible. The 74 console.warn sites can migrate to severity: "warning" checks; the 117 process.exit(1) sites to severity: "fatal" / "blocking". Per-site migration, not big-bang.
  • Monolith growth slows. Every ad-hoc stderr site grew because there was nowhere else to put it. With a registry, "file a check in src/lib/advisories/checks/" is the cheaper path.
  • Advisories become addressable. nemoclaw doctor --json runs the full registry without starting onboard. Dashboard gets "6 advisories (2 blocking)" tile. Support can say "set NEMOCLAW_SUPPRESS=headless_remote_hint."
  • Docs linking becomes convention. docsUrl on every advisory closes the loop between error message and troubleshooting.md anchor that currently relies on authors remembering (many don't).
  • Testability stays high. Each AdvisoryCheck is a pure (ctx) => Advisory | null. No console-spy, no process.exit mock.
  • Monolith shrinks. Every migrated call site moves logic out of onboard.ts (currently ~10,160 lines) into focused check modules.

Migration path (incremental; no big-bang)

Phase 1 — foundation (one PR, zero behavior change)

  • Create src/lib/advisories/{types,registry,runner,presenter}.ts.
  • Add unit tests for the runner's resume-safety / suppression / phase-filter contracts.

Phase 2 — migrate the existing pipeline (one PR, zero behavior change)

  • Rewrite planHostRemediation as 9 AdvisoryCheck<HostAssessment> modules under src/lib/advisories/checks/host/.
  • Keep RemediationAction[] as a thin adapter on top of Advisory[] so nothing else in the codebase breaks.
  • Collapse HostAssessment.notes into severity: "info" advisories.
  • src/lib/preflight.ts shrinks by ~200 lines as logic moves into focused check modules.

Phase 3 — wire onboard.ts (one PR)

Phase 4 — opportunistic migration (many small PRs, each good-first-issue)

  • Migrate the 74 console.warn sites to severity: "warning" checks, highest-traffic first (DNS probe, web-search verification, rebuild state preservation, snapshot reconciliation, destroy, connect).
  • Migrate the 117 process.exit(1) sites to severity: "fatal" / "blocking" checks, highest-traffic first (credential validation, inference probe, gateway health, DNS probe).
  • Target: net −500+ lines from onboard.ts across the phase.

Phase 5 — new surface area (one PR)

  • Add nemoclaw doctor command that runs all checks and prints/JSONs the output without starting onboard.
  • Wire advisories into the dashboard health tile.
  • Add NEMOCLAW_SUPPRESS=<id>,<id> env var and per-user ~/.nemoclaw/advisories.json suppression file.

Out of scope

  • Telemetry emission — if/when NemoClaw adds telemetry, advisories are a natural event source, but that is its own effort.
  • Internationalisation / translation of advisory strings.
  • GUI / control-UI advisory panel — viable downstream consumer once the JSON channel exists, but not part of this epic.

Related PRs & issues

Acceptance criteria

  • src/lib/advisories/ module exists with types, registry, runner, presenter, and tests.
  • planHostRemediation is implemented on top of the registry with no behaviour change.
  • onboard.ts has one unified runAdvisories call per phase; no bespoke assert…Present helpers remain in the preflight area.
  • Contributor docs (CONTRIBUTING.md or equivalent) name the advisory registry as the single place to add new user-facing warnings/errors; the PR review skill enforces this.
  • At least 20 of the 74 console.warn sites and 20 of the 117 process.exit(1) sites have migrated.
  • nemoclaw doctor command lands and exercises the full registry without running onboard.

Notes

This epic is architectural, not feature-driven. Value accrues gradually — each phase is a strict improvement and reversible in isolation. The goal is to make "the right thing to do when adding a user-facing warning" the path of least resistance for future contributors, so we stop growing the monolith every time a new host precondition, credential check, or runtime advisory is needed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    area: architectureArchitecture, design debt, major refactors, or maintainability

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions