Description
nemoclaw backup-all (and nemoclaw upgrade-sandboxes) is supposed to detect an OpenShell schema mismatch between the installed CLI and the running gateway and abort with a structured, actionable error block (see T6032121 expected result). On the host-process gateway driver, this detection path is not reached, and a real schema drift surfaces as a generic gateway-recovery error: the user sees a migration error followed by "Failed to query running sandboxes from OpenShell." and "Ensure OpenShell is running: openshell status". None of the actionable T6032121 lines ("OpenShell gateway schema preflight failed…", "Refusing to trust…", "Run `nemoclaw onboard --resume`…") appear.
The data-integrity behavior is correct: exit code is non-zero (1) and no partial backup is created. Only the user-facing diagnosis and recovery guidance are missing, so the user has no clear path to fix the drift.
Verified on DGX Station (aarch64 / Ubuntu 24.04) and reproducible on any host that runs NemoClaw with the host-process gateway driver. The legacy k3s-in-Docker (cluster-container) driver path still works as T6032121 specifies because the image-drift detection inspects the openshell-cluster-nemoclaw Docker container, which only exists in that configuration.
Environment
Device: DGX Station
OS: Ubuntu 24.04.4 LTS
Architecture: aarch64
GPUs: NVIDIA RTX PRO 6000 Blackwell Max-Q + NVIDIA GB300
Node.js: v22.22.3
npm: 10.9.8
Docker: Docker Engine 29.x (CDI nvidia runtime)
OpenShell CLI: openshell 0.0.44 (installed by NemoClaw v0.0.53)
NemoClaw: nemoclaw v0.0.53
OpenClaw: 2026.5.22 (a374c3a) — inside fresh sandbox
Gateway driver: host-process (`openshell-gateway` binary on PATH; no `openshell-cluster-nemoclaw` Docker container)
Steps to Reproduce
Pre-condition:
- Host running NemoClaw v0.0.53 with at least one sandbox onboarded and reachable:
nemoclaw <name> status returns Inference: healthy.
- Gateway is the host-process driver — verify:
docker inspect openshell-cluster-nemoclaw --format '{{.Config.Image}}'
returns empty / error (cluster container does not exist).
Induce schema drift by swapping in an older openshell-gateway binary so the installed openshell CLI (0.0.44) talks to a 0.0.43 gateway:
- Backup current binaries:
mkdir -p /tmp/openshell-backup && cp "$(which openshell)" "$(which openshell-gateway)" /tmp/openshell-backup/
- Download older openshell-gateway (v0.0.43):
curl -fsSL https://github.com/NVIDIA/OpenShell/releases/download/v0.0.43/openshell-gateway-aarch64-unknown-linux-gnu.tar.gz -o /tmp/ogw43.tar.gz
tar -xzf /tmp/ogw43.tar.gz -C /tmp
- Stop any running gateway:
ps -eo pid,comm | awk '$2 ~ /^openshell-gatew/ {print $1}' | xargs -r kill
- Swap the binary:
cp /tmp/openshell-gateway $(which openshell-gateway)
- Confirm drift:
openshell --version # → openshell 0.0.44
sha256sum $(which openshell-gateway) /tmp/openshell-gateway
# both hashes should match: the on-PATH binary is now the v0.0.43 build extracted from the tarball
- Run the spec-affected command (gateway auto-starts with the swapped binary):
nemoclaw backup-all 2>&1 | tee /tmp/backup-drift.log
echo "EXIT=${PIPESTATUS[0]}" # bash: exit code of nemoclaw, not of tee
- Inspect for the six T6032121 spec lines:
grep -E 'schema preflight failed|Installed OpenShell|Running gateway image|Refusing to trust|No sandbox data was changed|onboard --resume' /tmp/backup-drift.log
- Cleanup (restore the working binary and gateway):
ps -eo pid,comm | awk '$2 ~ /^openshell-gatew/ {print $1}' | xargs -r kill
cp /tmp/openshell-backup/openshell-gateway $(which openshell-gateway)
nemoclaw <name> status # auto-starts gateway with restored binary
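The grep inspection step above only reports matching lines. A minimal TypeScript sketch of the same check, which also reports which of the six T6032121 lines are missing, could look like the following. The patterns are copied from the grep expression; the function name missingSpecLines is hypothetical and not part of NemoClaw.

```typescript
// Patterns copied from the grep step; one per T6032121 spec line.
const SPEC_PATTERNS: RegExp[] = [
  /schema preflight failed/,
  /Installed OpenShell/,
  /Running gateway image/,
  /Refusing to trust/,
  /No sandbox data was changed/,
  /onboard --resume/,
];

// Hypothetical helper: returns the source of every pattern that does not
// match anywhere in the captured log text. An empty array means all six
// spec lines are present.
function missingSpecLines(log: string): string[] {
  return SPEC_PATTERNS.filter((p) => !p.test(log)).map((p) => p.source);
}
```

Feeding it the contents of /tmp/backup-drift.log from the failing run would be expected to return all six patterns, matching the zero-hit grep result.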
Expected Result
Per T6032121:
- exit code non-zero (1)
- no partial backup created
- output contains ALL of:
OpenShell gateway schema preflight failed before backing up registered sandboxes.
Installed OpenShell: 0.0.44
Running gateway image: … (or an equivalent gateway-binary version line)
Refusing to trust OpenShell sandbox state while the host CLI and gateway schema may be out of sync.
No sandbox data was changed.
Run `nemoclaw onboard --resume` to repair or recreate the NemoClaw gateway
Likewise, nemoclaw upgrade-sandboxes is expected to abort with the schema-preflight header, worded for sandbox upgrade instead of backup.
Actual Result
$ nemoclaw backup-all
…
INFO openshell_server::cli: Starting OpenShell server bind=127.0.0.1:8080
Error: × execution error: migration error: migration 5 was previously
│ applied but is missing in the resolved migrations
✓ Active gateway set to 'nemoclaw'
Failed to query running sandboxes from OpenShell.
NemoClaw tried to recover its OpenShell gateway, but recovery did not complete.
Ensure OpenShell is running: openshell status
$ echo "EXIT=$?"
EXIT=1
$ grep -cE 'schema preflight failed|Installed OpenShell|Running gateway image|Refusing to trust|No sandbox data was changed|onboard --resume' /tmp/backup-drift.log
0 (zero hits — none of the six required spec lines appear)
$ ls ~/.nemoclaw/rebuild-backups/<name>/
2026-05-28T10-09-17-605Z (only the pre-drift backup — drift run did NOT
create a new partial backup; data integrity preserved)
Logs
Key lines emitted by the failing run (CLI stderr/stdout + the gateway server log that gets started under the older binary). Timestamps trimmed for readability:
Gateway server log (older binary trying to start against newer state):
WARN openshell_server::cli: TLS disabled — listening on plaintext HTTP
WARN openshell_server::cli: Neither mTLS (--tls-client-ca) nor OIDC
(--oidc-issuer) is configured — the gateway has no authentication
mechanism
INFO openshell_server::cli: Starting OpenShell server bind=127.0.0.1:8080
Error: × execution error: migration error: migration 5 was previously
applied but is missing in the resolved migrations
INFO openshell_server: Shutdown signal received; stopping gateway
WARN openshell_server::supervisor_session: supervisor session: stream
error … error: status: Unknown, message: "h2 protocol error: error
reading a body from connection"
INFO openshell_server::compute: Stopped Docker sandbox containers
during gateway shutdown stopped_containers=1
NemoClaw CLI output (what the user sees):
✓ Active gateway set to 'nemoclaw'
[2/8] Starting OpenShell gateway
Starting OpenShell Docker-driver gateway...
✓ Docker-driver gateway is healthy ← misleading "healthy" line; gateway
actually crashed during migration
✓ Active gateway set to 'nemoclaw'
Failed to query running sandboxes from OpenShell.
NemoClaw tried to recover its OpenShell gateway, but recovery did not
complete.
Ensure OpenShell is running: openshell status
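One way the misleading "healthy" line could be avoided is to scan the gateway's recent log output for the fatal migration error before reporting health. The sketch below is an assumption about how such a check might look, not NemoClaw's actual health check; the names gatewayLogShowsFatalMigration and isGatewayHealthy are hypothetical, and the pattern is derived only from the log line quoted above.

```typescript
// Matches the fatal line from the gateway log, tolerating the line wrap
// (and the optional "│" gutter character) between "previously" and "applied".
const FATAL_MIGRATION =
  /migration error: migration \d+ was previously\s+(?:│\s*)?applied but is missing/;

// Hypothetical helper: true if the captured gateway log contains the
// fatal migration error.
function gatewayLogShowsFatalMigration(logText: string): boolean {
  return FATAL_MIGRATION.test(logText);
}

// Hypothetical health decision: a successful probe is not enough if the
// gateway has just logged a fatal migration error.
function isGatewayHealthy(probeOk: boolean, recentLog: string): boolean {
  return probeOk && !gatewayLogShowsFatalMigration(recentLog);
}
```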
Code Analysis
File: nemoclaw/src/lib/adapters/openshell/gateway-drift.ts
The image-drift detection used by backup-all and upgrade-sandboxes reads the gateway container's image tag via getGatewayClusterImageRef(gatewayName):
type GatewayDriftDeps = {
getInstalledOpenshellVersion?: () => string | null;
getGatewayClusterImageRef?: (gatewayName: string) => string | null;
isGatewayClusterActive?: (gatewayName: string) => boolean;
};
The default implementation calls dockerContainerInspectFormat("{{.Config.Image}}", "openshell-cluster-nemoclaw", ...).
On the host-process gateway driver there is no openshell-cluster-nemoclaw container — the gateway runs as a host process via the openshell-gateway binary. The inspect call returns null and the image-drift comparison short-circuits as "no drift detectable". The downstream protobuf-mismatch path then fires when the CLI tries to talk to the older gateway via gRPC, but the error surfaces through the generic gateway-recovery layer (Failed to query running sandboxes from OpenShell. / recovery did not complete.) instead of the dedicated T6032121 preflight block.
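The short-circuit described above can be illustrated with a simplified model. This is an assumed reconstruction for illustration only, not the real gateway-drift.ts logic: the function detectImageDrift, the DriftResult type, and the assumption that the image reference ends in a ":<version>" tag are all hypothetical. Only the GatewayDriftDeps shape is taken from the file.

```typescript
type GatewayDriftDeps = {
  getInstalledOpenshellVersion?: () => string | null;
  getGatewayClusterImageRef?: (gatewayName: string) => string | null;
  isGatewayClusterActive?: (gatewayName: string) => boolean;
};

// Hypothetical result type for this sketch.
type DriftResult =
  | { kind: "drift"; installed: string; imageRef: string }
  | { kind: "in-sync" }
  | { kind: "undetectable" };

// Hypothetical simplified model of an image-tag comparison.
function detectImageDrift(gatewayName: string, deps: GatewayDriftDeps): DriftResult {
  const installed = deps.getInstalledOpenshellVersion?.() ?? null;
  const imageRef = deps.getGatewayClusterImageRef?.(gatewayName) ?? null;
  // Host-process driver: no cluster container, so the image ref is null
  // and there is nothing to compare against.
  if (installed === null || imageRef === null) return { kind: "undetectable" };
  // Assumption: the image reference ends in a ":<version>" tag.
  const tag = imageRef.slice(imageRef.lastIndexOf(":") + 1);
  return tag === installed
    ? { kind: "in-sync" }
    : { kind: "drift", installed, imageRef };
}
```

With a dependency that returns null for the image reference, as on the host-process driver, the model reports "undetectable" even when the gateway binary is older than the CLI, which is the gap this report describes.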
Suggested fix:
- Extend gateway-drift.ts with a host-process detection branch: compare openshell --version (CLI) against the running gateway's reported version, either via a /version HTTP endpoint on the gateway or by reading the openshell-gateway binary's embedded version string at preflight time. If they differ, treat it as image-drift equivalent and emit the same six-line preflight block.
- Keep the current data-integrity behavior (exit 1, no partial backup); only add the structured diagnosis on top.
- Apply the same handling to nemoclaw upgrade-sandboxes.
- Also fix the misleading "✓ Docker-driver gateway is healthy" line that the CLI still prints after the gateway crashed during migration: the health check should reject a gateway that just emitted a fatal migration error.
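A minimal sketch of the host-process branch suggested above, under stated assumptions: the dependency getGatewayBinaryVersion, the helper names, and the "Running gateway binary:" label are hypothetical, and how the gateway version is actually obtained (a /version endpoint or the binary's embedded string) is left to the injected function. The other message lines are taken from the T6032121 expected output.

```typescript
// Hypothetical dependency shape for the host-process branch.
type HostProcessDriftDeps = {
  getInstalledOpenshellVersion: () => string | null;
  // Assumption: returns whatever version text the gateway reports.
  getGatewayBinaryVersion: () => string | null;
};

// Extracts the first x.y.z version from text such as "openshell 0.0.44".
function parseVersion(text: string | null): string | null {
  const m = text?.match(/\d+\.\d+\.\d+/);
  return m ? m[0] : null;
}

// Returns the six-line preflight block on drift, or null when the versions
// match or either one is unknown (the caller decides how to treat unknowns).
function hostProcessPreflight(
  deps: HostProcessDriftDeps,
  action: string = "backing up registered sandboxes",
): string[] | null {
  const installed = parseVersion(deps.getInstalledOpenshellVersion());
  const gateway = parseVersion(deps.getGatewayBinaryVersion());
  if (installed === null || gateway === null || installed === gateway) return null;
  return [
    `OpenShell gateway schema preflight failed before ${action}.`,
    `Installed OpenShell: ${installed}`,
    `Running gateway binary: ${gateway}`,
    "Refusing to trust OpenShell sandbox state while the host CLI and gateway schema may be out of sync.",
    "No sandbox data was changed.",
    "Run `nemoclaw onboard --resume` to repair or recreate the NemoClaw gateway",
  ];
}
```

Injecting the version readers keeps the comparison testable without a running gateway, in the same style as the existing GatewayDriftDeps hooks.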
Once landed, the T6032121 host-process variant should produce the same spec-defined output as the cluster-container variant, giving a single source of truth for schema-preflight UX across both gateway drivers.
NVB#6236133