Skip to content

fix(status): surface gateway-down state with non-zero exit code#3402

Merged
ericksoa merged 4 commits into
mainfrom
fix/3386-status-exit-code
May 13, 2026
Merged

fix(status): surface gateway-down state with non-zero exit code#3402
ericksoa merged 4 commits into
mainfrom
fix/3386-status-exit-code

Conversation

@laitingsheng

@laitingsheng laitingsheng commented May 12, 2026

Copy link
Copy Markdown
Contributor

Summary

nemoclaw status previously exited 0 even when the named gateway was unreachable or attached to a different sandbox. Shell scripts and CI couldn't detect that degraded state from $?. This PR extends ShowStatusCommandDeps with a getGatewayHealth probe, prints a gateway: down [state] (reason) diagnostic between the sandbox list and the host-service list, and sets process.exitCode = 1 when the gateway is unhealthy. The exit code is set rather than thrown so JSON callers (status --json) keep working and the rest of the report still renders. The check is suppressed when no sandboxes are registered so a clean machine keeps its 0-exit.

Related Issue

Fixes #3386

Changes

  • Add a GatewayHealth type and extend ShowStatusCommandDeps with an optional getGatewayHealth dep in src/lib/inventory/index.ts.
  • showStatusCommand emits a gateway: down [state] (reason) diagnostic and sets process.exitCode = 1 when the gateway is unhealthy, suggesting openshell gateway start --name nemoclaw or nemoclaw onboard --resume to recover. Gated on sandboxes.length > 0 so an empty machine stays at 0-exit.
  • buildStatusCommandDeps in src/lib/status-command-deps.ts wires the production probe via getNamedGatewayLifecycleState, mapping each unhealthy lifecycle state to a human-readable reason and falling back to a soft probe_error state if the OpenShell CLI is unreachable.
  • Four new showStatusCommand cases in src/lib/inventory/index.test.ts: unhealthy gateway → diagnostic line + process.exitCode = 1; healthy gateway → no diagnostic, exit code 0; legacy callers with no getGatewayHealth dep keep exit 0; no registered sandboxes → probe never called and exit stays 0.
  • The status fixture in test/cli.test.ts now mocks a healthy gateway probe so its exit-code assertion stays at 0 with the new gate in place.
  • Documented the new exit-code behaviour in the nemoclaw status section of docs/reference/commands.md.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • `npx prek run --all-files` passes
  • `npm test` passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • `make docs` builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai tinsonl@nvidia.com

Summary by CodeRabbit

  • New Features

    • Status command now probes gateway health, shows a "gateway: down [state] (reason)" line with recovery steps when unreachable, and includes a gatewayHealth field in JSON output.
    • When unhealthy, the command sets a non-zero exit code; the check is skipped if no sandboxes are registered.
  • Tests

    • Added coverage for healthy/unhealthy/missing-probe and no-sandboxes scenarios.
  • Documentation

    • Updated status command docs to describe gateway health reporting and recovery steps.

Review Change Stack

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@coderabbitai

coderabbitai Bot commented May 12, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 31c92e15-b457-4527-a181-f19167db6426

📥 Commits

Reviewing files that changed from the base of the PR and between 43896e6 and 59d5b45.

📒 Files selected for processing (5)
  • docs/reference/commands.md
  • src/lib/commands/status.ts
  • src/lib/inventory/index.ts
  • src/lib/status-command-deps.ts
  • test/cli.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • docs/reference/commands.md
  • src/lib/status-command-deps.ts

📝 Walkthrough

Walkthrough

Adds gateway health probing to the nemoclaw status command. When sandboxes exist, it probes gateway health, logs "gateway: down [state] (reason)" and recovery hints if unhealthy, sets process.exitCode = 1, and includes a normalized gatewayHealth field in JSON status output.

Changes

Gateway Health Checking

Layer / File(s) Summary
Gateway health contract and status report wiring
src/lib/inventory/index.ts
Adds exported GatewayHealth, extends ShowStatusCommandDeps with optional getGatewayHealth, adds `gatewayHealth: GatewayHealth
showStatusCommand integration and CLI JSON exit logic
src/lib/inventory/index.ts, src/lib/commands/status.ts
Inserts optional gateway-health probe into showStatusCommand between sandbox listing and host-service status; logs gateway: down and sets process.exitCode = 1 when unhealthy and sandboxes exist. status --json captures report and sets exit code based on report.gatewayHealth.
Gateway probe implementation and wiring
src/lib/status-command-deps.ts
Implements probeGatewayHealth() mapping getNamedGatewayLifecycleState() to GatewayHealth (healthy vs. named/unreachable states, human-readable reasons), returns probe_error on probe exceptions, and exposes getGatewayHealth from buildStatusCommandDeps().
Tests, CLI stubs, and docs
src/lib/inventory/index.test.ts, test/cli.test.ts, docs/reference/commands.md
Unit tests cover unhealthy/healthy/omitted/no-sandbox probe behaviors and exit-code preservation; CLI tests stub gateway probe commands and assert gatewayHealth JSON + exit code; docs updated to document gateway: down [state] (reason) output and exit semantics.

Sequence Diagram(s)

sequenceDiagram
  participant CLI as CLI
  participant ShowStatus as showStatusCommand
  participant Probe as probeGatewayHealth / getGatewayHealth
  participant GatewayState as getNamedGatewayLifecycleState
  CLI->>ShowStatus: run `nemoclaw status`
  ShowStatus->>Probe: call getGatewayHealth() if sandboxes exist
  Probe->>GatewayState: invoke getNamedGatewayLifecycleState()
  GatewayState-->>Probe: lifecycle state (e.g. healthy_named / named_unreachable / ...)
  Probe-->>ShowStatus: GatewayHealth { healthy, state, reason? }
  ShowStatus->>CLI: print sandbox list, diagnostics (gateway: down...), set process.exitCode if unhealthy
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I sniffed the gateway, heard a worried sound,
A downed route where packets once were bound.
I thumped my foot, the exit code did change,
Now status speaks up when things go strange.
Hop, hop—diagnostics fixed, hooray!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(status): surface gateway-down state with non-zero exit code' accurately and concisely describes the main change—adding gateway health detection to the status command with appropriate exit code handling.
Linked Issues check ✅ Passed The PR implementation fully addresses issue #3386: it adds explicit 'gateway: down' diagnostics, sets non-zero exit code when gateway is unhealthy, suppresses checks on clean systems, and extends both CLI and JSON output paths.
Out of Scope Changes check ✅ Passed All changes are scoped to the linked issue #3386: gateway health detection and exit code handling in status/inventory modules, test coverage, CLI fixtures, and documentation updates. No unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/3386-status-exit-code

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented May 12, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: deployment-services-e2e, gateway-health-honest-e2e
Optional E2E: cloud-e2e, sandbox-operations-e2e, diagnostics-e2e

Dispatch hint: deployment-services-e2e,gateway-health-honest-e2e

Workflow run

Full advisor summary

Pi Semantic E2E Advisor

Base: origin/main
Head: HEAD
Confidence: medium

Required E2E

  • deployment-services-e2e: This job is the only E2E that calls top-level nemoclaw status while intentionally transitioning gateway-adjacent state (nemoclaw tunnel start/stop). It explicitly checks the status exit code (status_rc) in a polling loop after tunnel stop. With this PR, an unhealthy named gateway now forces exit 1, which can flip the TC-DEPLOY-01c post-stop path from pass to failure. Must validate that nemoclaw status still exits 0 under normal deployment-services flow.
  • gateway-health-honest-e2e: Directly exercises the named-gateway lifecycle state (openshell status / openshell gateway info -g nemoclaw) that the new getGatewayHealth probe depends on. Regression coverage for [Linux][Install] PR #3001 Docker-driver gateway requires Ubuntu 24.04+ (GLIBC 2.39) — not documented, fails silently on Ubuntu 22.04 with false "healthy" status #3111 is in the same code neighborhood (getNamedGatewayLifecycleState). Ensures probe semantics still hold under a crashed/zombie gateway scenario and the new status exit-code path behaves honestly.

Optional E2E

  • cloud-e2e: Broad smoke that includes onboard → status plumbing. Uses nemoclaw <name> status (per-sandbox, unaffected), but is the lowest-risk confidence check that host-level nemoclaw status still renders expected output after a real onboard wiring up the named gateway.
  • sandbox-operations-e2e: Includes TC-SBX-06 gateway auto-recovery via nemoclaw <name> status under a killed gateway container. Per-sandbox status is technically unaffected, but this scenario exercises the same underlying getNamedGatewayLifecycleState() probe indirectly; useful confidence check.
  • diagnostics-e2e: commands.md was touched and nemoclaw debug/status diagnostics overlap with status surface. Optional parity check; no required failure mode expected from this PR.

New E2E recommendations

  • cli-status-exit-code (medium): No existing E2E asserts the new contract: nemoclaw status must exit 1 (and emit the gateway: down diagnostic) when the named gateway is unhealthy but at least one sandbox is registered, while --json callers and the no-sandbox case continue to exit 0. Unit tests cover the in-process function, but the real OpenShell probe path through buildStatusCommandDepsgetNamedGatewayLifecycleState is not covered end-to-end.
    • Suggested test: Add a lightweight E2E (e.g. test/e2e/test-status-exit-code.sh) that: (1) onboards a minimal sandbox, (2) kills/hides the named gateway container or sabotages openshell status to simulate named_unreachable, (3) runs nemoclaw status and asserts non-zero $? plus presence of the gateway: down [named_unreachable] line, (4) reruns with --json and asserts exit 0 with the sandbox list intact, (5) verifies a fresh machine with no registered sandboxes exits 0.

Dispatch hint

  • Workflow: .github/workflows/nightly-e2e.yaml
  • jobs input: deployment-services-e2e,gateway-health-honest-e2e

@laitingsheng laitingsheng marked this pull request as draft May 12, 2026 13:26
@copy-pr-bot

copy-pr-bot Bot commented May 12, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@laitingsheng laitingsheng marked this pull request as ready for review May 12, 2026 13:55

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
docs/reference/commands.md (1)

909-909: ⚡ Quick win

Break this long sentence into multiple shorter sentences.

Line 909 is a single 120+ word sentence with multiple clauses.
Breaking it into 3–4 shorter sentences improves readability and makes future diffs cleaner.

Suggested structure:

  • Sentence 1: State the trigger conditions (when gateway is unreachable).
  • Sentence 2: Describe the diagnostic output format.
  • Sentence 3: Explain the recovery suggestions.
  • Sentence 4: Document the exit code behavior and rationale.

As per coding guidelines: "One sentence per line in source (makes diffs readable)."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/reference/commands.md` at line 909, Split the long single sentence into
3–4 shorter sentences: first state the trigger conditions (when at least one
sandbox is registered and the named NemoClaw gateway is unreachable, unhealthy,
or attached to a different sandbox), second describe the diagnostic output
format (`gateway: down [state] (reason)` printed between the sandbox list and
the host-service list), third list the recovery suggestions (`openshell gateway
start --name nemoclaw` and `nemoclaw onboard --resume`), and finally document
the exit code behavior (exits with code `1` so shell scripts and CI can detect
the degraded state from `$?`).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@docs/reference/commands.md`:
- Line 909: Split the long single sentence into 3–4 shorter sentences: first
state the trigger conditions (when at least one sandbox is registered and the
named NemoClaw gateway is unreachable, unhealthy, or attached to a different
sandbox), second describe the diagnostic output format (`gateway: down [state]
(reason)` printed between the sandbox list and the host-service list), third
list the recovery suggestions (`openshell gateway start --name nemoclaw` and
`nemoclaw onboard --resume`), and finally document the exit code behavior (exits
with code `1` so shell scripts and CI can detect the degraded state from `$?`).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f66ad7f1-0b1f-43c4-bbb1-853a6de292c2

📥 Commits

Reviewing files that changed from the base of the PR and between 345e9b7 and 43896e6.

📒 Files selected for processing (1)
  • docs/reference/commands.md

laitingsheng added a commit that referenced this pull request May 12, 2026
…te PRs

Status exit-code work (#3386) is now tracked by #3402; stale-gateway cleanup
(#3397, #3398) by #3405. This commit reverts those changes here so that
this PR retains only the dashboard port + rollback work for #3260.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

@ericksoa ericksoa left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed current head 59d5b45. Required checks are passing, the status JSON gateway-health gap is covered, unresolved review threads are clear, and remaining in-progress jobs are non-required. Approving for squash merge.

@ericksoa ericksoa merged commit 8166365 into main May 13, 2026
30 of 31 checks passed
@ericksoa ericksoa deleted the fix/3386-status-exit-code branch May 13, 2026 00:46
@wscurran wscurran added area: cli Command line interface, flags, terminal UX, or output area: observability Logging, metrics, tracing, diagnostics, or debug output bug-fix PR fixes a bug or regression and removed NemoClaw CLI labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: cli Command line interface, flags, terminal UX, or output area: observability Logging, metrics, tracing, diagnostics, or debug output bug-fix PR fixes a bug or regression

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Ubuntu 24.04][CLI&UX] nemoclaw status/connect EXIT=0 when gateway port held; status omits down state

3 participants