[Ubuntu 24.04][DGX Station][CLI&UX] nemoclaw backup-all on host-process gateway with schema drift aborts via generic migration error instead of structured "schema preflight" hint #4430

@hulynn

Description

nemoclaw backup-all (and nemoclaw upgrade-sandboxes) is supposed to detect an OpenShell schema mismatch between the installed CLI and the running gateway, then abort with a structured, actionable error block (see T6032121 expected result). On the host-process gateway driver, this detection path is not reached, and a real schema drift surfaces as a generic gateway-recovery error: the user sees a migration error followed by "Failed to query running sandboxes from OpenShell." and "Ensure OpenShell is running: openshell status". None of the actionable T6032121 lines (OpenShell gateway schema preflight failed…, Refusing to trust…, Run `nemoclaw onboard --resume`…) appear.

The data-integrity behavior is correct: exit code is non-zero (1) and no partial backup is created. Only the user-facing diagnosis and recovery guidance are missing, so the user has no clear path to fix the drift.

Verified on DGX Station (aarch64 / Ubuntu 24.04) and reproducible on any host that runs NemoClaw with the host-process gateway driver. The legacy k3s-in-Docker (cluster-container) driver path still works as T6032121 specifies because the image-drift detection inspects the openshell-cluster-nemoclaw Docker container, which only exists in that configuration.

Environment

Device:        DGX Station
OS:            Ubuntu 24.04.4 LTS
Architecture:  aarch64
GPUs:          NVIDIA RTX PRO 6000 Blackwell Max-Q + NVIDIA GB300
Node.js:       v22.22.3
npm:           10.9.8
Docker:        Docker Engine 29.x (CDI nvidia runtime)
OpenShell CLI: openshell 0.0.44 (installed by NemoClaw v0.0.53)
NemoClaw:      nemoclaw v0.0.53
OpenClaw:      2026.5.22 (a374c3a) — inside fresh sandbox
Gateway driver: host-process (`openshell-gateway` binary on PATH;
                no `openshell-cluster-nemoclaw` Docker container)

Steps to Reproduce

Pre-condition:

  1. Host running NemoClaw v0.0.53 with at least one sandbox onboarded and reachable: nemoclaw <name> status returns Inference: healthy.
  2. Gateway is the host-process driver — verify:
    docker inspect openshell-cluster-nemoclaw --format '{{.Config.Image}}'
    returns empty / error (cluster container does not exist).

Induce schema drift by swapping in an older openshell-gateway binary so the installed openshell CLI (0.0.44) talks to a 0.0.43 gateway:

  1. Backup current binaries:
    mkdir -p /tmp/openshell-backup
    cp "$(which openshell)" "$(which openshell-gateway)" /tmp/openshell-backup/
  2. Download older openshell-gateway (v0.0.43):
    curl -fsSL https://github.com/NVIDIA/OpenShell/releases/download/v0.0.43/openshell-gateway-aarch64-unknown-linux-gnu.tar.gz -o /tmp/ogw43.tar.gz
    tar -xzf /tmp/ogw43.tar.gz -C /tmp
  3. Stop any running gateway:
    ps -eo pid,comm | awk '$2 ~ /^openshell-gatew/ {print $1}' | xargs -r kill
  4. Swap the binary:
    cp /tmp/openshell-gateway $(which openshell-gateway)
  5. Confirm drift:
    openshell --version       # → openshell 0.0.44
    sha256sum $(which openshell-gateway) /tmp/openshell-gateway
    # both hashes should match: the on-PATH binary is now the extracted v0.0.43 binary
  6. Run the spec-affected command (gateway auto-starts with the swapped binary):
    nemoclaw backup-all 2>&1 | tee /tmp/backup-drift.log
    echo "EXIT=${PIPESTATUS[0]}"   # bash: exit status of nemoclaw, not of tee
  7. Inspect for the six T6032121 spec lines:
    grep -E 'schema preflight failed|Installed OpenShell|Running gateway image|Refusing to trust|No sandbox data was changed|onboard --resume' /tmp/backup-drift.log
  8. Cleanup (restore the working binary and gateway):
    ps -eo pid,comm | awk '$2 ~ /^openshell-gatew/ {print $1}' | xargs -r kill
    cp /tmp/openshell-backup/openshell-gateway $(which openshell-gateway)
    nemoclaw <name> status   # auto-starts gateway with restored binary

Expected Result

Per T6032121:

  • exit code non-zero (1)
  • no partial backup created
  • output contains ALL of:
    • OpenShell gateway schema preflight failed before backing up registered sandboxes.
    • Installed OpenShell: 0.0.44
    • Running gateway image: … (or an equivalent gateway-binary version line)
    • Refusing to trust OpenShell sandbox state while the host CLI and gateway schema may be out of sync.
    • No sandbox data was changed.
    • Run `nemoclaw onboard --resume` to repair or recreate the NemoClaw gateway

Likewise, nemoclaw upgrade-sandboxes is expected to abort with the same schema-preflight block, with the header referring to the sandbox upgrade instead of the backup.
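
The six-line requirement can also be checked programmatically, which reports which lines are missing rather than only a hit count. The following is a minimal sketch, not part of NemoClaw: the helper name `missingPreflightMarkers` is hypothetical, and the marker substrings are copied from the grep pattern in the reproduction steps.

```typescript
// Substrings that identify the six T6032121 lines (same as the grep pattern
// used in the reproduction steps). "Running gateway image" may need relaxing
// if an equivalent gateway-binary version line is emitted instead.
const REQUIRED_PREFLIGHT_MARKERS: readonly string[] = [
  "schema preflight failed",
  "Installed OpenShell",
  "Running gateway image",
  "Refusing to trust",
  "No sandbox data was changed",
  "onboard --resume",
];

// Hypothetical helper: returns the markers that do not occur anywhere in the
// captured command output. An empty array means all six lines were found.
function missingPreflightMarkers(output: string): string[] {
  return REQUIRED_PREFLIGHT_MARKERS.filter(
    (marker) => output.indexOf(marker) === -1,
  );
}
```

Applied to the captured /tmp/backup-drift.log contents, a non-empty result names the exact spec lines that the run failed to print.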

Actual Result

$ nemoclaw backup-all
…
    INFO openshell_server::cli: Starting OpenShell server bind=127.0.0.1:8080
    Error:   × execution error: migration error: migration 5 was previously
      │ applied but is missing in the resolved migrations
✓ Active gateway set to 'nemoclaw'
  Failed to query running sandboxes from OpenShell.
  NemoClaw tried to recover its OpenShell gateway, but recovery did not complete.
  Ensure OpenShell is running: openshell status
$ echo "EXIT=$?"
EXIT=1

$ grep -cE 'schema preflight failed|Installed OpenShell|Running gateway image|Refusing to trust|No sandbox data was changed|onboard --resume' /tmp/backup-drift.log
0   (zero hits — none of the six required spec lines appear)

$ ls ~/.nemoclaw/rebuild-backups/<name>/
2026-05-28T10-09-17-605Z   (only the pre-drift backup — drift run did NOT
                            create a new partial backup; data integrity preserved)

Logs

Key lines from the failing run: CLI stderr/stdout plus the server log of the gateway that was started from the older binary. Timestamps are trimmed for readability:

Gateway server log (older binary trying to start against newer state):

WARN openshell_server::cli: TLS disabled — listening on plaintext HTTP
WARN openshell_server::cli: Neither mTLS (--tls-client-ca) nor OIDC
     (--oidc-issuer) is configured — the gateway has no authentication
     mechanism
INFO openshell_server::cli: Starting OpenShell server bind=127.0.0.1:8080
Error: × execution error: migration error: migration 5 was previously
       applied but is missing in the resolved migrations
INFO  openshell_server: Shutdown signal received; stopping gateway
WARN  openshell_server::supervisor_session: supervisor session: stream
      error … error: status: Unknown, message: "h2 protocol error: error
      reading a body from connection"
INFO  openshell_server::compute: Stopped Docker sandbox containers
      during gateway shutdown stopped_containers=1

NemoClaw CLI output (what the user sees):

✓ Active gateway set to 'nemoclaw'

[2/8] Starting OpenShell gateway
Starting OpenShell Docker-driver gateway...
✓ Docker-driver gateway is healthy   ← misleading "healthy" line; gateway
                                        actually crashed during migration
✓ Active gateway set to 'nemoclaw'
Failed to query running sandboxes from OpenShell.
NemoClaw tried to recover its OpenShell gateway, but recovery did not
complete.
Ensure OpenShell is running: openshell status
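
The "healthy" line above is printed even though the gateway log already contains a fatal migration error. One way a health check could avoid this is to consult the gateway's startup log before reporting success. The sketch below is an illustration under assumptions, not the NemoClaw implementation: `assessGatewayStartup`, `HealthVerdict`, and the `listening` input are hypothetical, and the only grounded detail is the `migration error:` text taken from the log above.

```typescript
// Hypothetical verdict type for a gateway startup health check.
type HealthVerdict = { healthy: boolean; reason?: string };

// Hypothetical check: a gateway that logged a fatal migration error is
// reported as unhealthy, even if its port was briefly reachable.
// `listening` stands in for whatever liveness probe the real check uses.
function assessGatewayStartup(
  listening: boolean,
  gatewayLog: string,
): HealthVerdict {
  // Matches e.g. "migration error: migration 5 was previously" up to the
  // end of that log line.
  const match = /migration error:[^\n]*/.exec(gatewayLog);
  if (match !== null) {
    return { healthy: false, reason: match[0].trim() };
  }
  if (!listening) {
    return { healthy: false, reason: "gateway is not listening" };
  }
  return { healthy: true };
}
```

Checking the log first means a gateway that bound its port and then crashed during migration is never reported as healthy.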

Code Analysis

File: nemoclaw/src/lib/adapters/openshell/gateway-drift.ts

The image-drift detection used by backup-all and upgrade-sandboxes reads the gateway container's image tag via getGatewayClusterImageRef(gatewayName):

type GatewayDriftDeps = {
  getInstalledOpenshellVersion?: () => string | null;
  getGatewayClusterImageRef?: (gatewayName: string) => string | null;
  isGatewayClusterActive?: (gatewayName: string) => boolean;
};

The default implementation calls dockerContainerInspectFormat("{{.Config.Image}}", "openshell-cluster-nemoclaw", ...).

On the host-process gateway driver there is no openshell-cluster-nemoclaw container — the gateway runs as a host process via the openshell-gateway binary. The inspect call returns null and the image-drift comparison short-circuits as "no drift detectable". The downstream protobuf-mismatch path then fires when the CLI tries to talk to the older gateway via gRPC, but the error surfaces through the generic gateway-recovery layer (Failed to query running sandboxes from OpenShell. / recovery did not complete.) instead of the dedicated T6032121 preflight block.

Suggested fix:

  1. Extend gateway-drift.ts with a host-process detection branch: compare openshell --version (CLI) against the running gateway's reported version — either via a /version HTTP endpoint on the gateway, or by reading the openshell-gateway binary's embedded version string at preflight time. If they differ, treat it as image-drift equivalent and emit the same six-line preflight block.
  2. Keep the current data-integrity behavior (exit 1, no partial backup); only add the structured diagnosis on top.
  3. Apply the same handling to nemoclaw upgrade-sandboxes.
  4. Also fix the misleading ✓ Docker-driver gateway is healthy line that the CLI still prints after the gateway crashed during migration — the health check should reject a gateway that just emitted a fatal migration error.
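
The host-process branch from item 1 could look roughly like the sketch below. It is a hedged illustration only: `parseVersion`, `detectHostProcessDrift`, `formatPreflightBlock`, and the `DriftResult` type are hypothetical names, the "Running gateway binary" wording is an assumed stand-in for the "equivalent gateway-binary version line", and how the gateway version string is obtained (a /version endpoint or the binary's embedded version string) is deliberately left to the caller.

```typescript
type DriftResult =
  | { drift: false }
  | { drift: true; installed: string; gateway: string };

// Extracts a dotted version such as "0.0.44" from text like "openshell 0.0.44".
function parseVersion(text: string): string | null {
  const match = /(\d+\.\d+\.\d+)/.exec(text);
  return match?.[1] ?? null;
}

// Compares the CLI version against the gateway's reported version.
// Returns null when either version cannot be determined, so the caller can
// tell "undetectable" apart from "no drift".
function detectHostProcessDrift(
  cliVersionOutput: string,
  gatewayVersionOutput: string,
): DriftResult | null {
  const installed = parseVersion(cliVersionOutput);
  const gateway = parseVersion(gatewayVersionOutput);
  if (installed === null || gateway === null) {
    return null;
  }
  if (installed === gateway) {
    return { drift: false };
  }
  return { drift: true, installed, gateway };
}

// Builds the six-line block from the Expected Result section for backup-all.
function formatPreflightBlock(installed: string, gateway: string): string[] {
  return [
    "OpenShell gateway schema preflight failed before backing up registered sandboxes.",
    `Installed OpenShell: ${installed}`,
    `Running gateway binary: ${gateway}`, // assumed wording for the host-process case
    "Refusing to trust OpenShell sandbox state while the host CLI and gateway schema may be out of sync.",
    "No sandbox data was changed.",
    "Run `nemoclaw onboard --resume` to repair or recreate the NemoClaw gateway",
  ];
}
```

Returning null for an undeterminable version keeps that case distinct from a confirmed match, so a missing version never silently counts as "no drift detectable".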

Once this lands, the host-process variant of T6032121 should produce the same spec-defined output as the cluster-container variant, giving a single source of truth for schema-preflight UX across both gateway drivers.


NVB#6236133

Metadata

Labels

  • NV QA (Bugs found by the NVIDIA QA Team)
  • area: cli (Command line interface, flags, terminal UX, or output)
  • platform: ubuntu (Affects Ubuntu Linux environments)
