Description
Description
After an in-place upgrade from NemoClaw v0.0.38 to v0.0.39 on Ubuntu 24.04 aarch64, the host CLI and the OpenShell binary are bumped (v0.0.39 / openshell 0.0.37) but the openshell-cluster Docker container is left on its old image (`nemoclaw-cluster:0.0.36-fuse-overlayfs-aa8b8487`). The new CLI's protobuf schema is incompatible with the still-running cluster controller's old schema, so every CLI ↔ cluster gRPC call fails with `× status: Internal, message: "failed to decode Protobuf message: Sandbox.metadata: SandboxResponse.sandbox: invalid wire type value: 6"`. `nemoclaw status / rebuild / recover` are all broken; the user cannot reach existing sandboxes via the supported tooling, even though kubectl shows the sandbox pods are still healthy. There is also a data-risk window: the installer's pre-upgrade auto-backup phase prints `Skipping (not running)` for every existing sandbox (false — they ARE Running), so the upgrade proceeds with NO sandbox backup taken.
Not a duplicate of NVBug 6168039 (Merc Lau, DGX Spark) — that one is about a stale openclaw-gateway process keeping port 18789 occupied, causing new-sandbox-creation timeout. Different failure point (process vs cluster-image), different downstream (new-sandbox-create vs CLI-fully-broken-for-existing-sandbox). Both are v0.0.38 → v0.0.39 upgrade regressions on aarch64 and may share an upstream root cause ("upgrade flow doesn't fully restart cluster-side components"); cross-reference recommended.
Environment
Device: DGX Station-class workstation (galaxy-sku2-018, 10.176.192.158)
OS: Ubuntu 24.04.4 LTS (kernel 6.17.0-1014-nvidia-64k)
Architecture: aarch64
Node.js: v22.22.2
npm: 10.9.7
Docker: 29.2.1
OpenShell CLI: 0.0.37 (post-upgrade; was 0.0.36 pre)
NemoClaw: v0.0.39 (post-upgrade; was v0.0.38 pre) — installed via NEMOCLAW_INSTALL_REF=main since GitHub v0.0.39 tag not yet cut at filing time
OpenClaw: 2026.4.24 (cbcfdf6) — inside sandbox, unchanged across upgrade
GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation + NVIDIA GB300, driver 595.58.03
Steps to Reproduce
1. Have a working NemoClaw v0.0.38 install on Ubuntu 24.04 aarch64 with at least one onboarded, healthy sandbox. Confirm:
nemoclaw --version # v0.0.38
openshell --version # 0.0.36
docker ps # nemoclaw-cluster:0.0.36-fuse-overlayfs-aa8b8487 healthy
nemoclaw list # sandbox listed
docker exec openshell-cluster-nemoclaw kubectl -n openshell get pods # sandbox pod 1/1 Running
2. Run the in-place upgrade:
NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash
(or set NEMOCLAW_INSTALL_REF=main if v0.0.39 has not yet been released as a tag.)
3. Observe installer output: pre-upgrade backup prints `Skipping (not running)` for sandboxes that ARE actually Running per kubectl.
4. After install, confirm:
nemoclaw --version # v0.0.39
openshell --version # 0.0.37
docker ps # STILL nemoclaw-cluster:0.0.36-fuse-overlayfs-aa8b8487 (OLD)
5. Run:
nemoclaw status
Observe `OpenShell: unknown (unknown)` and the raw protobuf decode error.
6. With NVIDIA_API_KEY set, run:
nemoclaw rebuild --yes
Observe preflight refuses with "Sandbox is not running. Cannot back up state." even though kubectl confirms it Running.
7. Run:
nemoclaw recover
Observe the same protobuf decode error.
8. Verify the sandbox pod is actually healthy (contradicting what nemoclaw CLI sees):
docker exec openshell-cluster-nemoclaw kubectl -n openshell get pods
docker exec openshell-cluster-nemoclaw kubectl -n openshell exec -- openclaw --version
Expected Result
The upgrade flow should either:
(a) pull the new cluster image, stop the old openshell-cluster container, and start the new one (preserving sandbox state), so that the new CLI and the cluster controller speak the same protobuf schema; OR
(b) refuse to upgrade and instruct the user to run a destroy + re-onboard sequence with explicit warnings about state loss.
After upgrade, `nemoclaw status` / `rebuild` / `recover` / `connect` should all work against existing sandboxes. The CLI's view of sandbox running-state should agree with kubectl. The pre-upgrade auto-backup phase should not falsely classify running sandboxes as "not running".
Actual Result
Host CLI is bumped to v0.0.39 + OpenShell binary bumped 0.0.36 → 0.0.37, but the openshell-cluster Docker container stays on its old 0.0.36 image. Every CLI ↔ cluster gRPC call fails:
× status: Internal, message: "failed to decode Protobuf message:
Sandbox.metadata: SandboxResponse.sandbox: invalid wire type value: 6",
details: [], metadata: MetadataMap { headers: {"content-type":
"application/grpc", "date": "Tue, 12 May 2026 09:55:09 GMT"} }
Downstream impact:
- `nemoclaw status` reports `OpenShell: unknown (unknown)` and prints the raw protobuf decode error; no actionable recovery guidance.
- `nemoclaw rebuild --yes` preflight thinks the sandbox is "not running" and refuses to proceed (false negative — kubectl confirms 1/1 Running).
- `nemoclaw recover` fails with the same protobuf error.
- `openshell sandbox exec` hangs indefinitely (exit 124 under `timeout`).
- `nemoclaw connect --probe-only` hangs indefinitely.
- Installer's own pre-upgrade auto-backup prints `Skipping (not running)` for each Running sandbox — upgrade proceeds with NO sandbox state backed up. This is a real data-risk window for users who trusted the in-place upgrade path.
Sandbox pods inside the cluster are actually healthy (kubectl exec confirms openclaw --version returns), but the user is functionally locked out via the supported CLI surface.
Logs
Full reproduction captured on 2026-05-12 at /home/lab/day0-automation/20260512/SUMMARY.md and per-case logs:
- T5886195.log, T5886196.log, T5948608.log, T5948611.log, T5951688.log, T5951690.log, T6003142.log, T5987912_partial.log
Representative tail (nemoclaw status post-upgrade):
Sandbox: my-assistant
Model: nvidia/nemotron-3-super-120b-a12b
Provider: nvidia-prod
Inference: not verified (gateway/sandbox state not verified)
Host GPU: no
Sandbox GPU: enabled
OpenShell: unknown (unknown)
Policies: none
Connected: no
Permissions: shields down (check `shields status` for details)
Agent: OpenClaw v2026.4.24
Could not verify sandbox 'my-assistant' against the live OpenShell gateway.
Error: × status: Internal, message: "failed to decode Protobuf message:
Sandbox.metadata: SandboxResponse.sandbox: invalid wire type value: 6",
details: [], metadata: MetadataMap { ... }
Discovered during the v0.0.39 upgrade-compatibility test pass on cases T5886196 (in-place upgrade — PASS but exposed this issue), T6003142 (host CLI ↔ sandbox version alignment — FAIL), T5951690 (rebuild full lifecycle — BLOCKED by the preflight false negative), T5987912 (status/connect agree — BLOCKED post-upgrade).
Workaround (untested in this run): likely `openshell gateway destroy` + `nemoclaw onboard` to rebuild the cluster container with the 0.0.37 image; this loses sandbox state (re-onboard required), so there is no preserving-state workaround currently available.
Bug Details
| Field |
Value |
| Priority |
Unprioritized |
| Action |
Dev - Open - To fix |
| Disposition |
Open issue |
| Module |
Machine Learning - NemoClaw |
| Keyword |
NemoClaw, NemoClaw_CLI&UX, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Upgrade |
[NVB#6168121]
Description
Description
After an in-place upgrade from NemoClaw v0.0.38 to v0.0.39 on Ubuntu 24.04 aarch64, the host CLI and the OpenShell binary are bumped (v0.0.39 / openshell 0.0.37) but the openshell-cluster Docker container is left on its old image (`nemoclaw-cluster:0.0.36-fuse-overlayfs-aa8b8487`). The new CLI's protobuf schema is incompatible with the still-running cluster controller's old schema, so every CLI ↔ cluster gRPC call fails with `× status: Internal, message: "failed to decode Protobuf message: Sandbox.metadata: SandboxResponse.sandbox: invalid wire type value: 6"`. `nemoclaw status / rebuild / recover` are all broken; the user cannot reach existing sandboxes via the supported tooling, even though kubectl shows the sandbox pods are still healthy. There is also a data-risk window: the installer's pre-upgrade auto-backup phase prints `Skipping (not running)` for every existing sandbox (false — they ARE Running), so the upgrade proceeds with NO sandbox backup taken. Not a duplicate of NVBug 6168039 (Merc Lau, DGX Spark) — that one is about a stale openclaw-gateway process keeping port 18789 occupied, causing new-sandbox-creation timeout. Different failure point (process vs cluster-image), different downstream (new-sandbox-create vs CLI-fully-broken-for-existing-sandbox). Both are v0.0.38 → v0.0.39 upgrade regressions on aarch64 and may share an upstream root cause ("upgrade flow doesn't fully restart cluster-side components"); cross-reference recommended.Environment Steps to Reproduce1. Have a working NemoClaw v0.0.38 install on Ubuntu 24.04 aarch64 with at least one onboarded, healthy sandbox. Confirm: nemoclaw --version # v0.0.38 openshell --version # 0.0.36 docker ps # nemoclaw-cluster:0.0.36-fuse-overlayfs-aa8b8487 healthy nemoclaw list # sandbox listed docker exec openshell-cluster-nemoclaw kubectl -n openshell get pods # sandbox pod 1/1 Running 2. Run the in-place upgrade: NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash (or set NEMOCLAW_INSTALL_REF=main if v0.0.39 has not yet been released as a tag.) 3. Observe installer output: pre-upgrade backup prints `Skipping (not running)` for sandboxes that ARE actually Running per kubectl. 4. After install, confirm: nemoclaw --version # v0.0.39 openshell --version # 0.0.37 docker ps # STILL nemoclaw-cluster:0.0.36-fuse-overlayfs-aa8b8487 (OLD) 5. Run: nemoclaw status Observe `OpenShell: unknown (unknown)` and the raw protobuf decode error. 6. With NVIDIA_API_KEY set, run: nemoclaw rebuild --yes Observe preflight refuses with "Sandbox is not running. Cannot back up state." even though kubectl confirms it Running. 7. Run: nemoclaw recover Observe the same protobuf decode error. 8. Verify the sandbox pod is actually healthy (contradicting what nemoclaw CLI sees): docker exec openshell-cluster-nemoclaw kubectl -n openshell get pods docker exec openshell-cluster-nemoclaw kubectl -n openshell exec -- openclaw --versionExpected Result Actual ResultHost CLI is bumped to v0.0.39 + OpenShell binary bumped 0.0.36 → 0.0.37, but the openshell-cluster Docker container stays on its old 0.0.36 image. Every CLI ↔ cluster gRPC call fails: × status: Internal, message: "failed to decode Protobuf message: Sandbox.metadata: SandboxResponse.sandbox: invalid wire type value: 6", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Tue, 12 May 2026 09:55:09 GMT"} } Downstream impact: - `nemoclaw status` reports `OpenShell: unknown (unknown)` and prints the raw protobuf decode error; no actionable recovery guidance. - `nemoclaw rebuild --yes` preflight thinks the sandbox is "not running" and refuses to proceed (false negative — kubectl confirms 1/1 Running). - `nemoclaw recover` fails with the same protobuf error. - `openshell sandbox exec` hangs indefinitely (exit 124 under `timeout`). - `nemoclaw connect --probe-only` hangs indefinitely. - Installer's own pre-upgrade auto-backup prints `Skipping (not running)` for each Running sandbox — upgrade proceeds with NO sandbox state backed up. This is a real data-risk window for users who trusted the in-place upgrade path. Sandbox pods inside the cluster are actually healthy (kubectl exec confirms openclaw --version returns), but the user is functionally locked out via the supported CLI surface.LogsFull reproduction captured on 2026-05-12 at /home/lab/day0-automation/20260512/SUMMARY.md and per-case logs: - T5886195.log, T5886196.log, T5948608.log, T5948611.log, T5951688.log, T5951690.log, T6003142.log, T5987912_partial.log Representative tail (nemoclaw status post-upgrade): Sandbox: my-assistant Model: nvidia/nemotron-3-super-120b-a12b Provider: nvidia-prod Inference: not verified (gateway/sandbox state not verified) Host GPU: no Sandbox GPU: enabled OpenShell: unknown (unknown) Policies: none Connected: no Permissions: shields down (check `shields status` for details) Agent: OpenClaw v2026.4.24 Could not verify sandbox 'my-assistant' against the live OpenShell gateway. Error: × status: Internal, message: "failed to decode Protobuf message: Sandbox.metadata: SandboxResponse.sandbox: invalid wire type value: 6", details: [], metadata: MetadataMap { ... } Discovered during the v0.0.39 upgrade-compatibility test pass on cases T5886196 (in-place upgrade — PASS but exposed this issue), T6003142 (host CLI ↔ sandbox version alignment — FAIL), T5951690 (rebuild full lifecycle — BLOCKED by the preflight false negative), T5987912 (status/connect agree — BLOCKED post-upgrade). Workaround (untested in this run): likely `openshell gateway destroy` + `nemoclaw onboard` to rebuild the cluster container with the 0.0.37 image; this loses sandbox state (re-onboard required), so there is no preserving-state workaround currently available.Bug Details
[NVB#6168121]