[Ubuntu 24.04][Upgrade] v0.0.38 → v0.0.39 in-place upgrade breaks CLI ↔ cluster RPC with protobuf "invalid wire type" decode error

## Description

Description
<pre>After an in-place upgrade from NemoClaw v0.0.38 to v0.0.39 on Ubuntu 24.04 aarch64, the host CLI and the OpenShell binary are bumped (NemoClaw v0.0.39 / openshell 0.0.37), but the openshell-cluster Docker container is left on its old image (`nemoclaw-cluster:0.0.36-fuse-overlayfs-aa8b8487`). The new CLI's protobuf schema is incompatible with the schema of the still-running old cluster controller, so every CLI ↔ cluster gRPC call fails with `× status: Internal, message: "failed to decode Protobuf message: Sandbox.metadata: SandboxResponse.sandbox: invalid wire type value: 6"`.

`nemoclaw status`, `nemoclaw rebuild`, and `nemoclaw recover` are all broken. The user cannot reach existing sandboxes through the supported tooling, even though kubectl shows the sandbox pods are still healthy.

There is also a data-risk window: the installer's pre-upgrade auto-backup phase prints `Skipping (not running)` for every existing sandbox. This is false (kubectl reports them as Running), so the upgrade proceeds with NO sandbox backup taken.

Not a duplicate of NVBug 6168039 (Merc Lau, DGX Spark) — that one is about a stale openclaw-gateway process keeping port 18789 occupied, causing new-sandbox-creation timeout. Different failure point (process vs cluster-image), different downstream (new-sandbox-create vs CLI-fully-broken-for-existing-sandbox). Both are v0.0.38 → v0.0.39 upgrade regressions on aarch64 and may share an upstream root cause ("upgrade flow doesn't fully restart cluster-side components"); cross-reference recommended.
</pre>Environment
<pre>Device: DGX Station-class workstation (galaxy-sku2-018, 10.176.192.158)
OS: Ubuntu 24.04.4 LTS (kernel 6.17.0-1014-nvidia-64k)
Architecture: aarch64
Node.js: v22.22.2
npm: 10.9.7
Docker: 29.2.1
OpenShell CLI: 0.0.37 (post-upgrade; was 0.0.36 pre)
NemoClaw: v0.0.39 (post-upgrade; was v0.0.38 pre); installed via NEMOCLAW_INSTALL_REF=main because the GitHub v0.0.39 tag had not yet been cut at filing time
OpenClaw: 2026.4.24 (cbcfdf6) — inside sandbox, unchanged across upgrade
GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation + NVIDIA GB300, driver 595.58.03
</pre>Steps to Reproduce
<pre>1. Have a working NemoClaw v0.0.38 install on Ubuntu 24.04 aarch64 with at least one onboarded, healthy sandbox. Confirm:
 nemoclaw --version # v0.0.38
 openshell --version # 0.0.36
 docker ps # nemoclaw-cluster:0.0.36-fuse-overlayfs-aa8b8487 healthy
 nemoclaw list # sandbox listed
 docker exec openshell-cluster-nemoclaw kubectl -n openshell get pods # sandbox pod 1/1 Running

2. Run the in-place upgrade:
 NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash
 (or set NEMOCLAW_INSTALL_REF=main if v0.0.39 has not yet been released as a tag.)

3. Observe installer output: pre-upgrade backup prints `Skipping (not running)` for sandboxes that ARE actually Running per kubectl.

4. After install, confirm:
 nemoclaw --version # v0.0.39
 openshell --version # 0.0.37
 docker ps # STILL nemoclaw-cluster:0.0.36-fuse-overlayfs-aa8b8487 (OLD)

5. Run:
 nemoclaw status
 Observe `OpenShell: unknown (unknown)` and the raw protobuf decode error.

6. With NVIDIA_API_KEY set, run:
 nemoclaw rebuild --yes
 Observe preflight refuses with "Sandbox is not running. Cannot back up state." even though kubectl confirms it Running.

7. Run:
 nemoclaw recover
 Observe the same protobuf decode error.

8. Verify the sandbox pod is actually healthy (contradicting what nemoclaw CLI sees):
 docker exec openshell-cluster-nemoclaw kubectl -n openshell get pods
 docker exec openshell-cluster-nemoclaw kubectl -n openshell exec SANDBOX_POD -- openclaw --version # replace SANDBOX_POD with the pod name from the previous command
</pre>Expected Result
<pre>The upgrade flow should either:
(a) pull the new cluster image, stop the old openshell-cluster container, and start the new one (preserving sandbox state), so that the new CLI and the cluster controller speak the same protobuf schema; OR
(b) refuse to upgrade and instruct the user to run a destroy + re-onboard sequence with explicit warnings about state loss.

After upgrade, `nemoclaw status` / `rebuild` / `recover` / `connect` should all work against existing sandboxes. The CLI's view of sandbox running-state should agree with kubectl. The pre-upgrade auto-backup phase should not falsely classify running sandboxes as "not running".
</pre>Actual Result
<pre>Host CLI is bumped to v0.0.39 + OpenShell binary bumped 0.0.36 → 0.0.37, but the openshell-cluster Docker container stays on its old 0.0.36 image. Every CLI ↔ cluster gRPC call fails:

 × status: Internal, message: "failed to decode Protobuf message:
 Sandbox.metadata: SandboxResponse.sandbox: invalid wire type value: 6",
 details: [], metadata: MetadataMap { headers: {"content-type":
 "application/grpc", "date": "Tue, 12 May 2026 09:55:09 GMT"} }

Downstream impact:
- `nemoclaw status` reports `OpenShell: unknown (unknown)` and prints the raw protobuf decode error; no actionable recovery guidance.
- `nemoclaw rebuild --yes` preflight thinks the sandbox is "not running" and refuses to proceed (false negative — kubectl confirms 1/1 Running).
- `nemoclaw recover` fails with the same protobuf error.
- `openshell sandbox exec` hangs indefinitely (exit 124 under `timeout`).
- `nemoclaw connect --probe-only` hangs indefinitely.
- Installer's own pre-upgrade auto-backup prints `Skipping (not running)` for each Running sandbox — upgrade proceeds with NO sandbox state backed up. This is a real data-risk window for users who trusted the in-place upgrade path.

Sandbox pods inside the cluster are actually healthy (kubectl exec confirms openclaw --version returns), but the user is functionally locked out via the supported CLI surface.
</pre>Logs
<pre>Full reproduction captured on 2026-05-12 at /home/lab/day0-automation/20260512/SUMMARY.md and per-case logs:
- T5886195.log, T5886196.log, T5948608.log, T5948611.log, T5951688.log, T5951690.log, T6003142.log, T5987912_partial.log

Representative tail (nemoclaw status post-upgrade):

 Sandbox: my-assistant
 Model: nvidia/nemotron-3-super-120b-a12b
 Provider: nvidia-prod
 Inference: not verified (gateway/sandbox state not verified)
 Host GPU: no
 Sandbox GPU: enabled
 OpenShell: unknown (unknown)
 Policies: none
 Connected: no
 Permissions: shields down (check `shields status` for details)
 Agent: OpenClaw v2026.4.24

 Could not verify sandbox 'my-assistant' against the live OpenShell gateway.
 Error: × status: Internal, message: "failed to decode Protobuf message:
 Sandbox.metadata: SandboxResponse.sandbox: invalid wire type value: 6",
 details: [], metadata: MetadataMap { ... }

Discovered during the v0.0.39 upgrade-compatibility test pass on these cases:
- T5886196 (in-place upgrade): PASS, but exposed this issue
- T6003142 (host CLI ↔ sandbox version alignment): FAIL
- T5951690 (rebuild full lifecycle): BLOCKED by the preflight false negative
- T5987912 (status/connect agree): BLOCKED post-upgrade

Workaround (untested in this run): likely `openshell gateway destroy` followed by `nemoclaw onboard`, which rebuilds the cluster container with the 0.0.37 image. This loses sandbox state (re-onboard required), so no state-preserving workaround is currently available.
</pre>
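
For triage context, the `invalid wire type value: 6` in the error above can be read against the protobuf wire format: each field key is a varint whose low three bits are the wire type and whose remaining bits are the field number, and only wire types 0 through 5 are defined. A decoder that sees 6 is therefore reading bytes that do not line up with the message layout it expects, which is consistent with (though not proof of) a schema mismatch between the 0.0.37 CLI and the 0.0.36 cluster. A minimal sketch of the key split follows; the helper names `wire_type_name` and `decode_key` are hypothetical, and the key is passed as an already-decoded decimal integer:

```shell
#!/usr/bin/env bash
# Illustrative only: split a protobuf field key into field number and wire type.
# For field numbers 1-15 the key fits in a single byte, so the byte value is the key.

# Map a wire type number to its name in the protobuf encoding spec.
wire_type_name() {
  case "$1" in
    0) echo VARINT ;;
    1) echo I64 ;;
    2) echo LEN ;;
    3) echo SGROUP ;;
    4) echo EGROUP ;;
    5) echo I32 ;;
    *) echo INVALID ;;   # 6 and 7 are not defined by the wire format
  esac
}

# Print "field=<n> wire_type=<t> (<name>)" for a decimal key value.
decode_key() {
  local key=$1
  local field=$(( key >> 3 ))   # upper bits: field number
  local wt=$(( key & 7 ))       # low three bits: wire type
  echo "field=${field} wire_type=${wt} ($(wire_type_name "$wt"))"
}
```

For example, `decode_key 10` prints `field=1 wire_type=2 (LEN)`, while `decode_key 14` prints `field=1 wire_type=6 (INVALID)`, the same class of failure the CLI reports.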

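One way to surface this skew before any RPC is attempted is to compare the OpenShell CLI version with the version embedded in the cluster container's image tag. The sketch below is illustrative only and is not how the installer currently works. The helper names are hypothetical, and it assumes that the image tag starts with the OpenShell version it was built for (as `nemoclaw-cluster:0.0.36-fuse-overlayfs-aa8b8487` suggests) and that `openshell --version` output contains an X.Y.Z token.

```shell
#!/usr/bin/env bash
# Sketch: detect CLI vs. cluster-image version skew after an upgrade.
# Assumption: the image tag begins with the OpenShell version (X.Y.Z).

# Print the first X.Y.Z token found in the given string (prints nothing if none).
first_semver() {
  local re='[0-9]+\.[0-9]+\.[0-9]+'
  if [[ $1 =~ $re ]]; then
    printf '%s\n' "${BASH_REMATCH[0]}"
  fi
}

# Print the version embedded in an image reference's tag (text after the last ':').
image_tag_version() {
  first_semver "${1##*:}"
}

# Return 0 when both versions parse and are equal, 1 otherwise.
versions_match() {
  local cli img
  cli=$(first_semver "$1")
  img=$(image_tag_version "$2")
  [ -n "$cli" ] && [ "$cli" = "$img" ]
}

# Example wiring (requires docker and openshell on PATH; not executed here):
#   cli=$(openshell --version)
#   img=$(docker inspect --format '{{.Config.Image}}' openshell-cluster-nemoclaw)
#   versions_match "$cli" "$img" || echo "version skew: CLI=$cli image=$img" >&2
```

With the post-upgrade values from this report (`0.0.37` against `nemoclaw-cluster:0.0.36-fuse-overlayfs-aa8b8487`), `versions_match` returns non-zero, which a status or preflight command could turn into an actionable message instead of the raw decode error.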
## Bug Details

| Field | Value |
|-------|-------|
| Priority | Unprioritized |
| Action | Dev - Open - To fix |
| Disposition | Open issue |
| Module | Machine Learning - NemoClaw |
| Keyword | NemoClaw, NemoClaw_CLI&UX, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Upgrade |

---
[NVB#6168121]

[Ubuntu 24.04][Upgrade] v0.0.38 → v0.0.39 in-place upgrade breaks CLI ↔ cluster RPC with protobuf "invalid wire type" decode error #3399
