fix(status): surface gateway-down state with non-zero exit code #3402
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
No actionable comments were generated in the recent review.
Walkthrough
Adds gateway health probing to the `nemoclaw status` command. When sandboxes exist, it probes gateway health, logs `gateway: down [state] (reason)` and recovery hints if unhealthy, sets `process.exitCode = 1`, and includes a normalized `gatewayHealth` field in JSON status output.
Sequence Diagram(s)
sequenceDiagram
participant CLI as CLI
participant ShowStatus as showStatusCommand
participant Probe as probeGatewayHealth / getGatewayHealth
participant GatewayState as getNamedGatewayLifecycleState
CLI->>ShowStatus: run `nemoclaw status`
ShowStatus->>Probe: call getGatewayHealth() if sandboxes exist
Probe->>GatewayState: invoke getNamedGatewayLifecycleState()
GatewayState-->>Probe: lifecycle state (e.g. healthy_named / named_unreachable / ...)
Probe-->>ShowStatus: GatewayHealth { healthy, state, reason? }
ShowStatus->>CLI: print sandbox list, diagnostics (gateway: down...), set process.exitCode if unhealthy
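The flow in the diagram can be sketched in TypeScript as follows. This is a minimal illustration, not the code from this PR: `StatusDeps`, `showStatusSketch`, `listSandboxes`, and the wording of the hint lines are hypothetical, and the sketch returns the exit code instead of setting `process.exitCode` so that it is easy to test.

```typescript
// Hypothetical shapes modeled on the names in the walkthrough; the real
// definitions live in src/lib/inventory/index.ts and may differ.
interface GatewayHealth {
  healthy: boolean;
  state: string;
  reason?: string;
}

interface StatusDeps {
  listSandboxes: () => string[];
  // Optional, so callers that do not wire a probe keep their old behavior.
  getGatewayHealth?: () => GatewayHealth;
  log: (line: string) => void;
}

// Returns the exit code the command would request. The PR description says
// the real command sets process.exitCode = 1 rather than throwing.
function showStatusSketch(deps: StatusDeps): number {
  const sandboxes = deps.listSandboxes();
  for (const name of sandboxes) deps.log(`sandbox: ${name}`);

  // Gate: with no sandboxes (or no probe wired in) the gateway is not checked.
  if (sandboxes.length === 0 || !deps.getGatewayHealth) return 0;

  const health = deps.getGatewayHealth();
  if (health.healthy) return 0;

  const reason = health.reason ? ` (${health.reason})` : "";
  deps.log(`gateway: down [${health.state}]${reason}`);
  // Hint wording is illustrative; only the two commands come from the PR.
  deps.log("  try: openshell gateway start --name nemoclaw");
  deps.log("  or:  nemoclaw onboard --resume");
  return 1;
}
```

Requesting a non-zero exit code after the report has been printed, rather than throwing, lets the remaining sections still render and lets callers of `status --json` still receive output.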
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
🧹 Nitpick comments (1)
docs/reference/commands.md (1)
909-909: ⚡ Quick win: Break this long sentence into multiple shorter sentences.
Line 909 is a single 120+ word sentence with multiple clauses.
Breaking it into 3–4 shorter sentences improves readability and makes future diffs cleaner.
Suggested structure:
- Sentence 1: State the trigger conditions (when gateway is unreachable).
- Sentence 2: Describe the diagnostic output format.
- Sentence 3: Explain the recovery suggestions.
- Sentence 4: Document the exit code behavior and rationale.
As per coding guidelines: "One sentence per line in source (makes diffs readable)."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/reference/commands.md` at line 909, Split the long single sentence into 3–4 shorter sentences: first state the trigger conditions (when at least one sandbox is registered and the named NemoClaw gateway is unreachable, unhealthy, or attached to a different sandbox), second describe the diagnostic output format (`gateway: down [state] (reason)` printed between the sandbox list and the host-service list), third list the recovery suggestions (`openshell gateway start --name nemoclaw` and `nemoclaw onboard --resume`), and finally document the exit code behavior (exits with code `1` so shell scripts and CI can detect the degraded state from `$?`).
Summary
`nemoclaw status` previously exited 0 even when the named gateway was unreachable or attached to a different sandbox. Shell scripts and CI couldn't detect that degraded state from `$?`. This PR extends `ShowStatusCommandDeps` with a `getGatewayHealth` probe, prints a `gateway: down [state] (reason)` diagnostic between the sandbox list and the host-service list, and sets `process.exitCode = 1` when the gateway is unhealthy. The exit code is set rather than thrown so JSON callers (`status --json`) keep working and the rest of the report still renders. The check is suppressed when no sandboxes are registered so a clean machine keeps its 0-exit.
Related Issue
Fixes #3386
Changes
- Add a `GatewayHealth` type and extend `ShowStatusCommandDeps` with an optional `getGatewayHealth` dep in `src/lib/inventory/index.ts`.
- `showStatusCommand` emits a `gateway: down [state] (reason)` diagnostic and sets `process.exitCode = 1` when the gateway is unhealthy, suggesting `openshell gateway start --name nemoclaw` or `nemoclaw onboard --resume` to recover. Gated on `sandboxes.length > 0` so an empty machine stays at 0-exit.
- `buildStatusCommandDeps` in `src/lib/status-command-deps.ts` wires the production probe via `getNamedGatewayLifecycleState`, mapping each unhealthy lifecycle state to a human-readable reason and falling back to a soft `probe_error` state if the OpenShell CLI is unreachable.
- Add `showStatusCommand` cases in `src/lib/inventory/index.test.ts`: unhealthy gateway → diagnostic line + `process.exitCode = 1`; healthy gateway → no diagnostic, exit code 0; legacy callers with no `getGatewayHealth` dep keep exit 0; no registered sandboxes → probe never called and exit stays 0.
- `test/cli.test.ts` now mocks a healthy gateway probe so its exit-code assertion stays at 0 with the new gate in place.
- Update the `nemoclaw status` section of `docs/reference/commands.md`.
Type of Change
Verification
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
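The lifecycle-state mapping listed under Changes might look roughly like the sketch below. It is an assumption-laden illustration rather than the code in `src/lib/status-command-deps.ts`: `probeGatewayHealthSketch`, the injected `getLifecycleState` callback, and all reason strings are hypothetical. Only the state names `healthy_named`, `named_unreachable`, and `probe_error` appear in this PR, and treating `probe_error` as unhealthy is a guess.

```typescript
interface GatewayHealth {
  healthy: boolean;
  state: string;
  reason?: string;
}

// Illustrative reason text. The real mapping covers more lifecycle states
// than the single unhealthy one named in the walkthrough.
const UNHEALTHY_REASONS: Record<string, string> = {
  named_unreachable: "named gateway is not reachable",
};

// getLifecycleState stands in for getNamedGatewayLifecycleState, injected
// here so the sketch runs without the OpenShell CLI.
function probeGatewayHealthSketch(getLifecycleState: () => string): GatewayHealth {
  let state: string;
  try {
    state = getLifecycleState();
  } catch (err) {
    // Soft failure: report probe_error instead of crashing the status command.
    // Assumption: this sketch reports it as unhealthy.
    const detail = err instanceof Error ? err.message : String(err);
    return { healthy: false, state: "probe_error", reason: detail };
  }

  if (state === "healthy_named") return { healthy: true, state };

  // Unknown states still get a generic reason so the diagnostic stays readable.
  const reason = UNHEALTHY_REASONS[state] ?? "gateway is not healthy";
  return { healthy: false, state, reason };
}
```

Catching the probe failure inside the dependency keeps `showStatusCommand` free of CLI-specific error handling, which matches the dependency-injection style the PR describes.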
Summary by CodeRabbit
New Features
Tests
Documentation