
feat(status): classify failing layer when gateway probe fails (#2666 follow-up) #3271

@cjagwani

Background

#2666 (closed by #3270) fixed the user-visible silent-exit-0 regression: nemoclaw <name> status and nemoclaw list now always produce output and never return exit 0 with empty stdout/stderr. That closes the bug as filed.

The original acceptance criteria (AC) for status also asked for one piece of UX polish that was deliberately split out of the bug fix to keep its scope tight:

Print a clearly delimited block naming the failing layer (Docker daemon up but container exited; or container exit + foreign-port-conflict; or gateway not reachable)

What status prints today in the gateway-failure path is the existing generic message plus the printGatewayLifecycleHint text. That's actionable, but it doesn't distinguish the three named layers the reporter called out.

Proposal

Add a small failing-layer classifier called from src/lib/actions/sandbox/status.ts and src/lib/actions/sandbox/gateway-state.ts:printGatewayLifecycleHint that prints a layer-named header before the existing actionable hints.

Detect, in order:

  1. docker_unreachable: docker info fails or times out. Daemon down or socket inaccessible.
  2. container_exited_port_conflict: docker ps --filter name=openshell-cluster-nemoclaw shows no running container, docker ps -a shows it exited, AND something is listening on the gateway port (host port held by a foreign process).
  3. container_exited: same as 2, but nothing is listening on the gateway port.
  4. gateway_unreachable: container is running but the gateway API does not respond (current generic case).

For each layer, print a one-line "what's wrong" header followed by the existing recovery hints from printGatewayLifecycleHint.
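
The detection order above could be sketched as a pure function over injected probes. This is a minimal illustration, not the proposed implementation: the names FailingLayer, GatewayProbes, classifyGatewayFailure and LAYER_HEADERS, and the header wording, are hypothetical. The `unknown` result for a container that is neither running nor exited is an assumption; the issue does not say what to print in that case.

```typescript
// Hypothetical sketch: none of these names exist in NemoClaw today.
type FailingLayer =
  | "docker_unreachable"
  | "container_exited_port_conflict"
  | "container_exited"
  | "gateway_unreachable"
  | "unknown"; // assumption: container is neither running nor exited

// Injected probes, so unit tests can simulate each layer without Docker.
interface GatewayProbes {
  dockerInfoOk(): Promise<boolean>;     // `docker info` succeeded in time
  containerRunning(): Promise<boolean>; // `docker ps` lists the container
  containerExited(): Promise<boolean>;  // `docker ps -a` shows it exited
  gatewayPortInUse(): Promise<boolean>; // something accepts on the gateway port
}

// Placeholder header wording; the real text is up to the implementer.
const LAYER_HEADERS: Record<FailingLayer, string> = {
  docker_unreachable: "Docker daemon is not reachable",
  container_exited_port_conflict:
    "Gateway container exited and another process holds the gateway port",
  container_exited: "Gateway container exited",
  gateway_unreachable: "Gateway container is running but the API does not respond",
  unknown: "Could not determine the failing layer",
};

async function classifyGatewayFailure(probes: GatewayProbes): Promise<FailingLayer> {
  // Layer 1: without a working daemon, the container probes mean nothing.
  if (!(await probes.dockerInfoOk())) return "docker_unreachable";
  // Layer 4: a running container leaves only the gateway API itself.
  if (await probes.containerRunning()) return "gateway_unreachable";
  // Not running and not exited (for example never created): assumed "unknown".
  if (!(await probes.containerExited())) return "unknown";
  // Layers 2 and 3 differ only in whether the gateway port is held.
  return (await probes.gatewayPortInUse())
    ? "container_exited_port_conflict"
    : "container_exited";
}

// Usage with fake probes simulating an exited container plus a foreign listener.
const exitedWithConflict: GatewayProbes = {
  dockerInfoOk: async () => true,
  containerRunning: async () => false,
  containerExited: async () => true,
  gatewayPortInUse: async () => true,
};

classifyGatewayFailure(exitedWithConflict).then((layer) => {
  // The header would be printed before the existing recovery hints.
  console.log(`[${layer}] ${LAYER_HEADERS[layer]}`);
});
```

Checking "running" before "exited" gives the same result as the numbered order, because layers 2 and 3 both require the container not to be running.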

Implementation notes

  • New helper src/lib/actions/sandbox/gateway-failure-classifier.ts (or extend gateway-state.ts).
  • Port probe via Node's net.connect() with a short timeout — works cross-platform (Linux/macOS/WSL) without depending on ss / lsof / netstat.
  • Container probe via docker ps + docker ps -a with short timeouts; gracefully degrade to unknown if Docker is itself unreachable (already covered by step 1).
  • Unit-testable in isolation: classifier takes injected runners for docker info, docker ps, port-probe so tests can simulate each layer.
  • Subprocess test (extending test/repro-2666-silent-list-status.test.ts) per layer.
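
The port probe from the notes above might look like the sketch below. It is an illustration under stated assumptions: isPortInUse, its default host and its 500 ms default timeout are hypothetical, and treating a timeout as "not in use" is a design choice the issue does not specify.

```typescript
import * as net from "node:net";

// Hypothetical helper: resolves true if something accepts a TCP connection
// on host:port, false on connection error or timeout.
function isPortInUse(
  port: number,
  host = "127.0.0.1", // assumption: the gateway port is probed on loopback
  timeoutMs = 500,    // assumption: "short timeout" value is illustrative
): Promise<boolean> {
  return new Promise((resolve) => {
    const socket = net.connect({ port, host });
    let settled = false;

    const finish = (inUse: boolean): void => {
      if (settled) return; // only the first outcome counts
      settled = true;
      clearTimeout(timer);
      socket.destroy();    // never leave the probe connection open
      resolve(inUse);
    };

    // A timeout is treated as "no listener" here; a caller could choose
    // to map it to an "unknown" state instead.
    const timer = setTimeout(() => finish(false), timeoutMs);

    socket.once("connect", () => finish(true)); // a listener accepted
    socket.once("error", () => finish(false));  // e.g. ECONNREFUSED
  });
}
```

A successful connect only shows that some process is listening. Concluding that the listener is foreign still depends on the container probe having reported the gateway container as exited.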

Out of scope

  • Parameterizing the container name. openshell-cluster-nemoclaw is hard-coded in NemoClaw's gateway start path, so the classifier treats the same string as its fixed probe target. If we ever parameterize the gateway name, the classifier can read it from the same source.
  • nemoclaw list does not need layer classification. Its contract is "always show the registry," which #3270 (fix(cli): keep status and list output visible when gateway probe fails) already delivers.

Definition of done

  • status prints a clearly-named layer header in each of the four classified states.
  • Classifier has unit tests per layer.
  • Repro subprocess test extended to assert the named layer appears for the (container-stopped + foreign-port-holder) scenario.

Surfaced from #2666 / #3270.

Labels

VDR (Linked to VDR finding)