[Ubuntu 24.04][Upgrade] v0.0.38 → v0.0.39 in-place upgrade breaks CLI ↔ cluster RPC with protobuf "invalid wire type" decode error #3399

@wangericnv

Description

After an in-place upgrade from NemoClaw v0.0.38 to v0.0.39 on Ubuntu 24.04 aarch64, the host CLI and the OpenShell binary are bumped (v0.0.39 / openshell 0.0.37), but the openshell-cluster Docker container is left on its old image (`nemoclaw-cluster:0.0.36-fuse-overlayfs-aa8b8487`). The new CLI's protobuf schema is incompatible with the schema of the still-running cluster controller, so every CLI ↔ cluster gRPC call fails with `× status: Internal, message: "failed to decode Protobuf message: Sandbox.metadata: SandboxResponse.sandbox: invalid wire type value: 6"`.

`nemoclaw status`, `rebuild`, and `recover` are all broken: the user cannot reach existing sandboxes via the supported tooling, even though kubectl shows the sandbox pods are still healthy.

There is also a data-risk window. The installer's pre-upgrade auto-backup phase prints `Skipping  (not running)` for every existing sandbox, which is false because the sandboxes are Running, so the upgrade proceeds with no sandbox backup taken.

Not a duplicate of NVBug 6168039 (Merc Lau, DGX Spark) — that one is about a stale openclaw-gateway process keeping port 18789 occupied, causing new-sandbox-creation timeout. Different failure point (process vs cluster-image), different downstream (new-sandbox-create vs CLI-fully-broken-for-existing-sandbox). Both are v0.0.38 → v0.0.39 upgrade regressions on aarch64 and may share an upstream root cause ("upgrade flow doesn't fully restart cluster-side components"); cross-reference recommended.
Environment
Device:        DGX Station-class workstation (galaxy-sku2-018, 10.176.192.158)
OS:            Ubuntu 24.04.4 LTS (kernel 6.17.0-1014-nvidia-64k)
Architecture:  aarch64
Node.js:       v22.22.2
npm:           10.9.7
Docker:        29.2.1
OpenShell CLI: 0.0.37 (post-upgrade; was 0.0.36 pre)
NemoClaw:      v0.0.39 (post-upgrade; was v0.0.38 pre) — installed via NEMOCLAW_INSTALL_REF=main since GitHub v0.0.39 tag not yet cut at filing time
OpenClaw:      2026.4.24 (cbcfdf6) — inside sandbox, unchanged across upgrade
GPU:           NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation + NVIDIA GB300, driver 595.58.03
Steps to Reproduce
1. Have a working NemoClaw v0.0.38 install on Ubuntu 24.04 aarch64 with at least one onboarded, healthy sandbox. Confirm:
       nemoclaw --version                 # v0.0.38
       openshell --version                # 0.0.36
       docker ps                          # nemoclaw-cluster:0.0.36-fuse-overlayfs-aa8b8487 healthy
       nemoclaw list                      # sandbox listed
       docker exec openshell-cluster-nemoclaw kubectl -n openshell get pods   # sandbox pod 1/1 Running

2. Run the in-place upgrade:
       curl -fsSL https://www.nvidia.com/nemoclaw.sh | NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 bash
   (or set NEMOCLAW_INSTALL_REF=main if v0.0.39 has not yet been released as a tag.)

3. Observe installer output: pre-upgrade backup prints `Skipping  (not running)` for sandboxes that ARE actually Running per kubectl.

4. After install, confirm:
       nemoclaw --version    # v0.0.39
       openshell --version   # 0.0.37
       docker ps             # STILL nemoclaw-cluster:0.0.36-fuse-overlayfs-aa8b8487 (OLD)

5. Run:
       nemoclaw  status
   Observe `OpenShell: unknown (unknown)` and the raw protobuf decode error.

6. With NVIDIA_API_KEY set, run:
       nemoclaw  rebuild --yes
   Observe preflight refuses with "Sandbox is not running. Cannot back up state." even though kubectl confirms it Running.

7. Run:
       nemoclaw  recover
   Observe the same protobuf decode error.

8. Verify the sandbox pod is actually healthy (contradicting what nemoclaw CLI sees):
       docker exec openshell-cluster-nemoclaw kubectl -n openshell get pods
       docker exec openshell-cluster-nemoclaw kubectl -n openshell exec  -- openclaw --version
Expected Result
The upgrade flow should either:
(a) pull the new cluster image, stop the old openshell-cluster container, and start the new one (preserving sandbox state), so that the new CLI and the cluster controller speak the same protobuf schema; OR
(b) refuse to upgrade and instruct the user to run a destroy + re-onboard sequence with explicit warnings about state loss.

After upgrade, `nemoclaw  status` / `rebuild` / `recover` / `connect` should all work against existing sandboxes. The CLI's view of sandbox running-state should agree with kubectl. The pre-upgrade auto-backup phase should not falsely classify running sandboxes as "not running".
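
The first expectation, that the CLI and the cluster controller end up on the same version, can be checked mechanically. The following is a minimal sketch, not NemoClaw's actual logic. `image_version` and `versions_match` are hypothetical helper names. It assumes that the image tag always begins with the OpenShell version followed by `-` (inferred from the single tag seen in this report) and that equal version strings imply a compatible protobuf schema.

```shell
# Extract the version from an image reference, for example
#   nemoclaw-cluster:0.0.36-fuse-overlayfs-aa8b8487  ->  0.0.36
# Assumption: the tag has the form <version>-<suffix>.
image_version() {
  tag=${1##*:}        # keep only the part after the last ':'
  echo "${tag%%-*}"   # drop the first '-' and everything after it
}

# Succeeds only when the CLI version equals the cluster image version.
versions_match() {
  [ "$1" = "$(image_version "$2")" ]
}

cli_version="0.0.37"
cluster_image="nemoclaw-cluster:0.0.36-fuse-overlayfs-aa8b8487"

if versions_match "$cli_version" "$cluster_image"; then
  echo "cluster image matches CLI"
else
  echo "version skew: CLI $cli_version, cluster image $(image_version "$cluster_image")"
fi
```

With the values from this report the sketch prints `version skew: CLI 0.0.37, cluster image 0.0.36`. On a live host the image reference could come from `docker ps --format '{{.Image}}'`; the output format of `openshell --version` is not shown in this report, so parsing it is left out.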
Actual Result
The host CLI is bumped to v0.0.39 and the OpenShell binary from 0.0.36 to 0.0.37, but the openshell-cluster Docker container stays on its old 0.0.36 image. As a result, every CLI ↔ cluster gRPC call fails:

  × status: Internal, message: "failed to decode Protobuf message:
    Sandbox.metadata: SandboxResponse.sandbox: invalid wire type value: 6",
    details: [], metadata: MetadataMap { headers: {"content-type":
    "application/grpc", "date": "Tue, 12 May 2026 09:55:09 GMT"} }

Downstream impact:
- `nemoclaw  status` reports `OpenShell: unknown (unknown)` and prints the raw protobuf decode error; no actionable recovery guidance.
- `nemoclaw  rebuild --yes` preflight thinks the sandbox is "not running" and refuses to proceed (false negative — kubectl confirms 1/1 Running).
- `nemoclaw  recover` fails with the same protobuf error.
- `openshell sandbox exec` hangs indefinitely (exit 124 under `timeout`).
- `nemoclaw  connect --probe-only` hangs indefinitely.
- Installer's own pre-upgrade auto-backup prints `Skipping  (not running)` for each Running sandbox — upgrade proceeds with NO sandbox state backed up. This is a real data-risk window for users who trusted the in-place upgrade path.

Sandbox pods inside the cluster are actually healthy (kubectl exec confirms openclaw --version returns), but the user is functionally locked out via the supported CLI surface.
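
The disagreement between the `not running` verdict and kubectl could be caught by cross-checking the pod list before a backup is skipped. This is a minimal sketch under stated assumptions, not the installer's real preflight. It assumes the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, RESTARTS, AGE). `pod_is_running` is a hypothetical helper, and the pod name `my-assistant` is a stand-in because the report does not show the actual pod names.

```shell
# Succeeds if the named pod appears on stdin with STATUS "Running".
# Expects `kubectl get pods --no-headers` output:
#   NAME  READY  STATUS  RESTARTS  AGE
pod_is_running() {
  awk -v pod="$1" '$1 == pod && $3 == "Running" { found = 1 } END { exit !found }'
}

# Sample line standing in for real kubectl output (hypothetical pod name).
sample='my-assistant   1/1   Running   0   3d'

if printf '%s\n' "$sample" | pod_is_running my-assistant; then
  echo "pod is Running: do not skip the backup"
else
  echo "pod is not Running"
fi
```

On a live host the input could be piped from `docker exec openshell-cluster-nemoclaw kubectl -n openshell get pods --no-headers`, so that a sandbox is only classified as stopped when kubectl agrees.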
Logs
Full reproduction captured on 2026-05-12 at /home/lab/day0-automation/20260512/SUMMARY.md and per-case logs:
- T5886195.log, T5886196.log, T5948608.log, T5948611.log, T5951688.log, T5951690.log, T6003142.log, T5987912_partial.log

Representative tail (nemoclaw status post-upgrade):

  Sandbox: my-assistant
    Model:    nvidia/nemotron-3-super-120b-a12b
    Provider: nvidia-prod
    Inference: not verified (gateway/sandbox state not verified)
    Host GPU: no
    Sandbox GPU: enabled
    OpenShell: unknown (unknown)
    Policies: none
    Connected: no
    Permissions: shields down (check `shields status` for details)
    Agent:    OpenClaw v2026.4.24

  Could not verify sandbox 'my-assistant' against the live OpenShell gateway.
  Error:   × status: Internal, message: "failed to decode Protobuf message:
    Sandbox.metadata: SandboxResponse.sandbox: invalid wire type value: 6",
    details: [], metadata: MetadataMap { ... }
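
For readers unfamiliar with the error text: in the protobuf wire format every field starts with a key whose low three bits are the wire type and whose remaining bits are the field number. Wire types 0 through 5 are defined; 6 and 7 are not. One plausible reading, consistent with the schema mismatch described in this report, is that the decoder is interpreting bytes laid out for a different message definition. The sketch below illustrates only the key arithmetic for a single-byte key. The helper names are hypothetical and the byte values are examples, not bytes captured from this cluster.

```shell
# Split a one-byte protobuf field key into field number and wire type.
# Only valid for keys below 128 (field numbers 1-15); larger keys are varints.
decode_key() {
  echo "field=$(( $1 >> 3 )) wire_type=$(( $1 & 7 ))"
}

# Wire types 0-5 are defined by the protobuf encoding; 6 and 7 are not.
is_valid_wire_type() {
  [ "$1" -ge 0 ] && [ "$1" -le 5 ]
}

decode_key 10   # 0x0A: prints "field=1 wire_type=2" (length-delimited)
decode_key 14   # 0x0E: prints "field=1 wire_type=6" (undefined)

if ! is_valid_wire_type 6; then
  echo "wire type 6 is not defined, so a decoder must reject it"
fi
```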

Discovered during the v0.0.39 upgrade-compatibility test pass on these cases:
- T5886196 (in-place upgrade): PASS, but exposed this issue
- T6003142 (host CLI ↔ sandbox version alignment): FAIL
- T5951690 (rebuild full lifecycle): BLOCKED by the preflight false negative
- T5987912 (status/connect agree): BLOCKED post-upgrade

Workaround (untested in this run): likely `openshell gateway destroy` followed by `nemoclaw onboard`, which rebuilds the cluster container with the 0.0.37 image. This loses sandbox state (re-onboarding is required), so no state-preserving workaround is currently available.

Bug Details

| Field | Value |
| --- | --- |
| Priority | Unprioritized |
| Action | Dev - Open - To fix |
| Disposition | Open issue |
| Module | Machine Learning - NemoClaw |
| Keyword | NemoClaw, NemoClaw_CLI&UX, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Upgrade |

[NVB#6168121]

Labels

- NV QA (Bugs found by the NVIDIA QA Team)
- UAT (Issues flagged for User Acceptance Testing)
- area: cli (Command line interface, flags, terminal UX, or output)
- platform: dgx-spark (Affects DGX Spark hardware or workflows)
- platform: ubuntu (Affects Ubuntu Linux environments)
