[Ubuntu 22.04][CLI&UX][Recovery] nemoclaw status and list return empty output, exit 0 when container is stopped and gateway port is held #2666

@wangericnv

Description

When the openshell sandbox container is stopped AND the host-side gateway-published port (8080) is held by a foreign listener, both `nemoclaw <name> status` and `nemoclaw list` exit 0 with NO STDOUT AT ALL (STDERR is empty too): no sandbox info, no gateway-down block, no diagnostic. The user gets zero signal that anything is wrong. This is the exact "CLI does not clearly explain state transitions" gap called out in VDR feedback for NemoClaw v0.0.28.

For comparison, when JUST the Docker daemon is stopped (no port-conflict), the same commands print a clean, actionable gateway-down block ("the selected NemoClaw gateway is still refusing connections after restart" + the openshell/rebuild hints). So the regression is specifically in the (container stopped + port held) state shape, not in the gateway-down state shape.
Environment
Device:        Ubuntu server (4u2g-0315), local-mercl@10.63.134.125
OS:            Ubuntu 22.04 (Linux 5.15.0-171-generic)
Architecture:  x86_64
Node.js:       Not captured (validation host SSH temporarily locked out)
npm:           Not captured
Docker:        Docker CE 29.4.0
OpenShell CLI: openshell 0.0.32
NemoClaw:      v0.0.28
OpenClaw:      v2026.4.9
Steps to Reproduce
1. Confirm baseline: sandbox 'my-assist' running and healthy.
     nemoclaw my-assist status         # prints sandbox info + Policy block
     docker ps                          # openshell-cluster-nemoclaw "Up ... (healthy)"

2. Stop the openshell container without auto-restart kicking in:
     docker update --restart=no openshell-cluster-nemoclaw
     docker stop openshell-cluster-nemoclaw

3. Hold the host-side gateway-published port (8080) with a foreign listener
   so Docker's eventual restart attempt cannot bind it:
     nc -lk 0.0.0.0 8080 &

4. Verify divergence: container is stopped, port is held by nc:
     docker ps --format 'table {{.Names}}\t{{.Status}}'    # empty
     ss -ltnp '( sport = :8080 )'                            # nc owns the port

5. Run, in this state:
     nemoclaw my-assist status     ; echo "EXIT=$?"
     nemoclaw list                 ; echo "EXIT=$?"
Expected Result
Both commands print an actionable, named-layer message and let the user choose a recovery path:

`nemoclaw my-assist status` should:
- Print the sandbox header (name / model / provider / agent version) first, as it always does, AND
- Print a clearly delimited block naming the failing layer
  (Docker daemon up but container exited; or container exit + foreign-port-conflict; or gateway not reachable), AND
- Include the standard recovery hints already used elsewhere
  ("Retry `openshell gateway start --name nemoclaw` and verify
   `openshell status` is healthy before reconnecting." /
   "If the gateway never becomes healthy, rebuild the gateway and
   then recreate the affected sandbox.").
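
The layer naming requested above can be sketched as a small decision function. This is only an illustration of the expected classification, not NemoClaw's actual code: `classify_layer` and its three string arguments are hypothetical, and the caller is assumed to have already gathered the container state, the port ownership, and the gateway reachability.

```shell
# Hypothetical helper: map three observations to the failing layer.
#   $1  container state:                     "running" or "exited"
#   $2  gateway port held by foreign owner:  "yes" or "no"
#   $3  gateway reachable:                   "yes" or "no"
classify_layer() {
  container_state=$1
  port_held=$2
  gateway_ok=$3
  if [ "$container_state" = "exited" ] && [ "$port_held" = "yes" ]; then
    echo "container exited + foreign port conflict"
  elif [ "$container_state" = "exited" ]; then
    echo "Docker daemon up but container exited"
  elif [ "$gateway_ok" = "no" ]; then
    echo "gateway not reachable"
  else
    echo "healthy"
  fi
}

# The state reproduced in this report:
classify_layer exited yes no
```

The most specific condition is tested first so that the combined (container exited + port held) state is never reported as a plain container exit.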

`nemoclaw list` should ALWAYS list registered sandboxes from
~/.nemoclaw/sandboxes.json regardless of runtime state. The registry
is host-side and container-independent; runtime issues must not
suppress the listing.

For reference, the SAME commands behave correctly when ONLY Docker
is stopped (no foreign port-holder):
- `nemoclaw <name> status` prints sandbox info, then:
    "Sandbox 'X' may still exist, but the selected NemoClaw gateway
     is still refusing connections after restart."
    "Server Status … Error: × client error (Connect) … Connection refused"
    "Retry `openshell gateway start --name nemoclaw` … "
- `nemoclaw list` prints the registered sandbox line.
Actual Result
Step 5, in the (container stopped + port 8080 held by nc) state:

  $ nemoclaw my-assist status
  EXIT=0
  (no other output — STDOUT and STDERR both empty)

  $ nemoclaw list
  EXIT=0
  (no other output — STDOUT and STDERR both empty)

The exit code is 0, so a script or watchdog wrapping these commands
sees them as "success" while the user sees nothing.
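
Until the CLI is fixed, a wrapper can guard against this case. The sketch below is a workaround under stated assumptions, not part of NemoClaw: `run_checked` is a hypothetical name, and the return code 3 for the silent-empty case is an arbitrary choice.

```shell
# Hypothetical wrapper: run a command and treat "exit 0 with no output"
# as a failure instead of a success.
run_checked() {
  output=$("$@" 2>&1)   # capture STDOUT and STDERR together
  status=$?
  if [ "$status" -eq 0 ] && [ -z "$output" ]; then
    echo "silent-empty: '$*' exited 0 with no output" >&2
    return 3            # arbitrary code chosen for the silent-empty case
  fi
  if [ -n "$output" ]; then
    printf '%s\n' "$output"
  fi
  return "$status"
}

# Usage (commented out so the sketch has no side effects):
# run_checked nemoclaw my-assist status || echo "status check failed"
# run_checked nemoclaw list             || echo "list check failed"
```

A watchdog that calls `run_checked` sees a non-zero status in the state described here, while normal output and genuine non-zero exit codes pass through unchanged.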

Recovery (kill nc, `docker update --restart=unless-stopped`, `docker
start openshell-cluster-nemoclaw`, wait ~35 s) restores both commands
to normal output. So the silent-empty state is real but recoverable,
and it only appears in this specific (no-container + port-held)
combination. This is an important detail for whoever investigates.
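
The recovery sequence can be written out as a script. This is a sketch of the steps listed above, with two assumptions: the `pkill -f` pattern matches the listener exactly as it was started in step 3, and a fixed 35 s sleep stands in for the observed wait. `run`, `recover`, and `DRY_RUN` are hypothetical names.

```shell
# Sketch of the recovery steps. Set DRY_RUN=1 to print the commands
# instead of executing them.
CONTAINER=openshell-cluster-nemoclaw

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

recover() {
  # Assumption: the foreign listener is the nc process from step 3.
  run pkill -f 'nc -lk 0.0.0.0 8080'
  run docker update --restart=unless-stopped "$CONTAINER"
  run docker start "$CONTAINER"
  # The report observed roughly 35 s before normal output returned.
  run sleep 35
}

# Usage (commented out so the sketch has no side effects):
# DRY_RUN=1; recover
```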
Logs
Captured during validation of test 9999102 in
5.3.8-upgrade-compatibility.md (MR
https://gitlab-master.nvidia.com/cloud-service-qa/nemoclaw/nemoclaw-test-cases/-/merge_requests/100).

Full session output saved at
/home/lab/.claude/projects/-home-lab/c13cf6a2-16aa-476f-a31f-215c775c0950/tool-results/
on local-mercl@10.63.134.125 — available on request.

For comparison, contrast with test 9999101 step 4b (Docker-only-down)
recorded in the same MR, where status produces a full actionable
block on the same machine in the same session.

SECONDARY OBSERVATION (may share root cause; flagging here, can split
to a separate bug if triage prefers): on the same machine, holding
port 18789 with `nc -lk 0.0.0.0 18789` and running
  nemoclaw onboard --non-interactive --control-ui-port 18789
caused the preflight stage to report:
  "✓ Port 18789 already owned by healthy NemoClaw runtime
   (NemoClaw dashboard)"
and proceed to attempt to reuse the foreign listener as if it were
the NemoClaw control UI. Preflight appears to check only port
bindability, not the listener's protocol identity. This may indicate
a shared "port-state classification" code path with the silent-empty
status issue, or it may be independent. Flagging both for whoever
picks this up.
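
One way to tell a bare `nc` listener from a real service is to check whether the port answers HTTP at all. The sketch below is a hypothetical illustration, not NemoClaw's preflight: `probe_http_code` and `classify_listener` are invented names, and it assumes the control UI speaks HTTP on `/`, which this report does not confirm. An HTTP answer would still not prove the listener is NemoClaw; it only rules out listeners that never respond.

```shell
# Hypothetical probe: ask curl for the HTTP status code on a local port.
# curl reports 000 when no HTTP response arrives (refused or timed out).
probe_http_code() {
  curl -s -o /dev/null -m 2 -w '%{http_code}' "http://127.0.0.1:$1/"
}

# Classify the probe result. An empty value or 000 means nothing answered HTTP.
classify_listener() {
  case "$1" in
    ""|000) echo "not-http" ;;
    *)      echo "http" ;;
  esac
}

# Usage (commented out; needs curl and a listener on the port):
# classify_listener "$(probe_http_code 18789)"
```

Against the `nc -lk` listener used here, curl would connect but receive no response within the 2 s limit, so the result would be `not-http`.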

Bug Details

Priority:     Unprioritized
Action:       Dev - Open - To fix
Disposition:  Open issue
Module:       Machine Learning - NemoClaw
Keyword:      NemoClaw, NemoClaw_CLI&UX, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw-SWQA-RelBlckr-Recommended

[NVB#6125389]

Labels

NV QA (Bugs found by the NVIDIA QA Team), UAT (Issues flagged for User Acceptance Testing), VDR (Linked to VDR finding)