refactor(arch): introduce Dashboard Delivery Contract — single source of truth for dashboard reachability #2390

@jyaunches

Summary

Introduce a Dashboard Delivery Contract — a single source of truth for dashboard reachability configuration, a chain-level health verifier, and layered recovery. This replaces the current scattered approach where CORS origins, port forwarding, health probes, and access URLs are derived independently in 4+ files with no shared model.

Motivation

15 closed bugs, same files, same root cause

Over the past 5 weeks, 15 dashboard-related bugs have been filed, fixed as point patches, and closed — almost all touching onboard.ts, Dockerfile, nemoclaw-start.sh, and dashboard.ts:

| # | Date | Bug | Point fix |
|---|------|-----|-----------|
| #311 | Mar 20 | Gateway not running after reboot | Manual restart docs |
| #328 | Mar 20 | Can't connect Mission Control to gateway | CORS + device auth patch |
| #820 | Mar 24 | Can't connect to localhost | Port forward docs |
| #20 | Apr 3 | Remote dashboard binds 127.0.0.1 only | Bake CHAT_UI_URL into Docker |
| #957 | Mar 31 | Forward binds wrong address, not restored after recreate | Fix forward target |
| #684 | Apr 13 | Default ports conflict with common services | Add configurable ports |
| #739 | Apr 13 | Restrictive allowedOrigins, can't update | Widen CORS in Dockerfile |
| #1376 | Apr 6 | Connect fails after gateway restart | Fix TLS cert handling |
| #1425 | Apr 8 | Gateway restart regenerates TLS, breaks connections | Certificate persistence |
| #716 | Apr 13 | WSL2 restart breaks everything | Fix reconnect logic |
| #1690 | Apr 10 | Destroy kills forward even when other sandbox uses it | Guard shared forward |
| #1925 | Apr 16 | Custom port not forwarded correctly | Inject NEMOCLAW_DASHBOARD_PORT |
| #1950 | Apr 16 | Re-onboard doesn't clean orphaned forward | Auto-cleanup in onboard |
| #2020 | Apr 17 | "Healthy" gateway is actually dead | Probe container, not stale metadata |
| #2167 | Apr 23 | Status doesn't show dashboard URL | Add URL to status output |

12 of these 15 would have been prevented by a contract that derives all dashboard config from a single source and verifies the full delivery chain.

6 more open bugs in the same area

| # | Age | Assignee | Title |
|---|-----|----------|-------|
| #2042 | 6 days | unassigned | Pod restart leaves gateway + forward dead; recovery buried in connect |
| #2342 | today | jyaunches | Dashboard "Version n/a" / "Health Offline" on Brev Launchable |
| #1178 | 23 days | senthilr-nv, cjagwani | Openclaw UI link intermittently unhealthy on Brev |
| #2258 | 1 day | unassigned | Brev onboard UI: 7 failures (CORS, probe, forward, SSE) |
| #2174 | 2 days | cjagwani | Second onboard crashes on port 18789 conflict |
| #2100 | 3 days | cjagwani | No E2E test for dashboard reachability |

The Problem

The dashboard delivery chain has 4 links:

Link 1: Gateway Process      → running inside sandbox on :18789
Link 2: Port Forward          → SSH tunnel: host:18789 → sandbox:18789
Link 3: CORS / Auth Config    → allowedOrigins includes the browser's origin
Link 4: External Routing      → [Brev] nginx + cloudflared → host:18789

Today, each link is configured in a different file, checked in a different way (or not at all), and has no dedicated recovery mechanism:

  • CHAT_UI_URL is derived independently in onboard.ts, Dockerfile, nemoclaw-start.sh, and dashboard.ts
  • CORS origins are baked at Dockerfile build time with no runtime detection of the actual access URL
  • Health probes check / (returns 401 with auth enabled) instead of /health
  • Port forwarding is fire-and-forget with no health monitoring
  • No function exists that answers "is the dashboard reachable end-to-end and if not, which link is broken?"
  • Recovery only happens as a hidden side-effect of nemoclaw connect

Proposed Architecture

1. dashboard-contract.ts — Single Source of Truth (~80 lines)

interface DashboardDeliveryChain {
  accessUrl: string;        // auto-detected: Brev public URL, WSL host IP, or loopback
  corsOrigins: string[];    // always includes accessUrl origin + loopback
  forwardTarget: string;    // loopback → port-only; non-loopback → 0.0.0.0:port
  healthEndpoint: string;   // /health — accepts 200 or 401 as "alive"
  port: number;             // from NEMOCLAW_DASHBOARD_PORT or default
}

function buildChain(options?: { chatUiUrl?: string; platform?: string }): DashboardDeliveryChain;

All consumers (Dockerfile, onboard.ts, nemoclaw-start.sh, status, connect) read from this instead of deriving independently.
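
To make the derivation rules in the interface comments concrete, here is a minimal sketch of how buildChain() might compute each field from one input. It is not the proposed implementation: the `env` option, the `DEFAULT_PORT` constant (taken from the :18789 shown in the chain diagram), and the loopback host list are assumptions for illustration, and platform auto-detection (Brev, WSL) is left out entirely.

```typescript
// Hypothetical sketch of the contract; not the actual dashboard-contract.ts.
interface DashboardDeliveryChain {
  accessUrl: string;
  corsOrigins: string[];
  forwardTarget: string;
  healthEndpoint: string;
  port: number;
}

// Assumption: 18789 is the default, as shown in the delivery-chain diagram.
const DEFAULT_PORT = 18789;
// Assumption: these hostnames count as loopback.
const LOOPBACK_HOSTS = new Set(["localhost", "127.0.0.1", "[::1]"]);

function buildChain(
  options: { chatUiUrl?: string; env?: Record<string, string | undefined> } = {},
): DashboardDeliveryChain {
  // `env` is injected (rather than read from process.env) to keep the sketch testable.
  const env = options.env ?? {};
  const parsed = Number(env.NEMOCLAW_DASHBOARD_PORT);
  const port =
    Number.isInteger(parsed) && parsed > 0 && parsed < 65536 ? parsed : DEFAULT_PORT;

  // Fall back to loopback when no explicit access URL is given.
  const loopbackUrl = `http://127.0.0.1:${port}`;
  const accessUrl = options.chatUiUrl ?? loopbackUrl;
  const access = new URL(accessUrl);
  const isLoopback = LOOPBACK_HOSTS.has(access.hostname);

  // Always include the access origin plus loopback origins, without duplicates.
  const corsOrigins = Array.from(
    new Set([access.origin, loopbackUrl, `http://localhost:${port}`]),
  );

  return {
    accessUrl,
    corsOrigins,
    // loopback -> port-only; non-loopback -> bind on all interfaces
    forwardTarget: isLoopback ? String(port) : `0.0.0.0:${port}`,
    healthEndpoint: "/health",
    port,
  };
}
```

Because every field comes from the same `accessUrl` and `port`, a custom port or a remote URL cannot drift between the CORS list and the forward target, which is the failure mode behind several of the closed bugs above.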

2. dashboard-health.ts — Chain Verification (~120 lines)

interface ChainStatus {
  healthy: boolean;
  links: {
    gateway: { ok: boolean; detail: string };
    forward: { ok: boolean; detail: string };
    cors: { ok: boolean; detail: string };
    external: { ok: boolean; detail: string };
  };
  diagnosis: string; // human-readable: "CORS allowedOrigins missing https://..."
}

function verifyDashboardChain(sandboxName: string): ChainStatus;

Used by nemoclaw status, onboard.ts (verify before printing success), and the dashboard UI.
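
As a rough illustration of the "which link is broken?" question, the sketch below shows one way per-link results could be folded into a ChainStatus. The helper names `isAliveStatus` and `summarizeChain` are hypothetical, the actual probing of each link is omitted, and the choice to report the first broken link in chain order is an assumption, not something the proposal specifies.

```typescript
// Hypothetical sketch; the probes that produce each LinkStatus are not shown.
type LinkName = "gateway" | "forward" | "cors" | "external";

interface LinkStatus {
  ok: boolean;
  detail: string;
}

interface ChainStatus {
  healthy: boolean;
  links: Record<LinkName, LinkStatus>;
  diagnosis: string;
}

// Chain order from the issue: gateway -> forward -> CORS -> external routing.
const LINK_ORDER: LinkName[] = ["gateway", "forward", "cors", "external"];

// /health counts as "alive" on 200, or on 401 when auth is enabled.
function isAliveStatus(httpStatus: number): boolean {
  return httpStatus === 200 || httpStatus === 401;
}

// Assumption: the diagnosis names the first broken link in chain order,
// since a failure upstream makes downstream results unreliable.
function summarizeChain(links: Record<LinkName, LinkStatus>): ChainStatus {
  const firstBroken = LINK_ORDER.find((name) => !links[name].ok);
  return {
    healthy: firstBroken === undefined,
    links,
    diagnosis:
      firstBroken === undefined
        ? "all links healthy"
        : `${firstBroken}: ${links[firstBroken].detail}`,
  };
}
```

Keeping the summary step as a pure function would let nemoclaw status and onboard.ts share the same diagnosis wording without each re-running or re-interpreting the probes.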

3. dashboard-recover.ts — Layered Recovery (~100 lines)

function recoverDashboardChain(sandboxName: string): RecoverResult;

Idempotent, link-aware. Only fixes broken links:

  • Link 1 down → restart gateway inside sandbox
  • Link 2 down → re-establish openshell forward
  • Link 3 wrong → patch CORS with detected accessUrl
  • Link 4 down → diagnose and report (outside our control)

Used by nemoclaw recover (PR #2050), nemoclaw connect, and nemoclaw-start.sh on boot.
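
The link-to-action mapping above could be expressed as a small planning step that runs before any side effects. This is only a sketch: `RecoveryAction`, `ACTION_FOR_LINK`, and `planRecovery` are hypothetical names, and executing the actions (restarting the gateway, re-establishing the forward, patching CORS) is deliberately left out.

```typescript
// Hypothetical sketch of the planning half of recovery; no actions are executed here.
type LinkName = "gateway" | "forward" | "cors" | "external";

type RecoveryAction =
  | "restart-gateway"
  | "reestablish-forward"
  | "patch-cors"
  | "report-external";

// One action per link, mirroring the list in the issue.
const ACTION_FOR_LINK: Record<LinkName, RecoveryAction> = {
  gateway: "restart-gateway",
  forward: "reestablish-forward",
  cors: "patch-cors",
  external: "report-external", // diagnose and report only; outside our control
};

// Upstream links are repaired first.
const LINK_ORDER: LinkName[] = ["gateway", "forward", "cors", "external"];

// Returns actions only for broken links, so a healthy chain yields an empty plan.
function planRecovery(links: Record<LinkName, { ok: boolean }>): RecoveryAction[] {
  return LINK_ORDER.filter((name) => !links[name].ok).map(
    (name) => ACTION_FOR_LINK[name],
  );
}
```

An empty plan for a healthy chain is what makes repeated calls safe: running recovery on boot, from nemoclaw connect, and from nemoclaw recover would be a no-op whenever nothing is broken.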

Files Changed

| File | Change | Lines |
|------|--------|-------|
| src/lib/dashboard-contract.ts | New | ~80 |
| src/lib/dashboard-health.ts | New | ~120 |
| src/lib/dashboard-recover.ts | New | ~100 |
| src/lib/dashboard.ts | Refactor to use contract | net reduction |
| src/lib/onboard.ts | Replace scattered CHAT_UI_URL derivation with buildChain() | net reduction |
| scripts/nemoclaw-start.sh | Call verify/recover on boot | ~10 lines added |

3 new files (~300 lines), 3 modified files (net code reduction in dashboard.ts and onboard.ts, ~10 lines added to nemoclaw-start.sh).

Open Issues This Would Address

Once landed, the following open issues can be resolved or significantly simplified:

Non-Goals

Sequencing

This refactor touches onboard.ts and nemoclaw-start.sh, which currently have 9 open PRs against them. However, none of those PRs touch the dashboard delivery chain code (CORS, ensureDashboardForward, health probes, CHAT_UI_URL derivation). The contract can land independently or after the current wave of PRs clears.

Recommended approach: land #2342 (Brev Launchable point fix) first as a quick win, then follow with this refactor so the pattern doesn't repeat.
