Skip to content

fix(status): exit 1 and suppress healthy when gateway is down#2626

Closed
rluo8 wants to merge 2 commits into
NVIDIA:mainfrom
rluo8:fix/2595-status-degraded-exit
Closed

fix(status): exit 1 and suppress healthy when gateway is down#2626
rluo8 wants to merge 2 commits into
NVIDIA:mainfrom
rluo8:fix/2595-status-degraded-exit

Conversation

@rluo8

@rluo8 rluo8 commented Apr 28, 2026

Copy link
Copy Markdown
Contributor

Summary

nemoclaw <sandbox> status could print Inference: healthy before verifying that the selected NemoClaw gateway and sandbox were actually reachable. When the OpenShell gateway was down/refusing connections, the command later printed recovery/error guidance but still exited 0, misleading users and scripts.

This PR checks the reconciled sandbox/gateway state before printing inference health. If the live sandbox/gateway state cannot be verified, status no longer prints Inference: healthy and exits non-zero.

Related Issue

Fixes #2595

Changes

  • src/nemoclaw.ts — Moved getReconciledSandboxGatewayState() ahead of inference/provider health rendering in sandboxStatus().
  • src/nemoclaw.ts — Only prints Inference: healthy when lookup.state === "present".
  • src/nemoclaw.ts — Makes degraded/unverified status paths exit 1 instead of silently returning success.
  • test/cli.test.ts — Updated degraded gateway/status tests to expect non-zero exit codes.
  • test/cli.test.ts — Added regression assertion that gateway-refused status output does not contain Inference: healthy.

Type of Change

  • [√ ] Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • [√ ] npm test passes
  • [√ ] Tests added or updated for new or changed behavior
  • [√ ] No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

AI Disclosure

  • AI-assisted — tool: Claude Code

Signed-off-by: rluo8 ruluo@nvidia.com

Summary by CodeRabbit

  • Bug Fixes

    • Ensure non-present gateway/sandbox states return a failing exit code and stop further inference checks.
    • Prevent inference health probing and parsing when gateway state can't be verified.
    • Immediately handle various reconciled failure states and surface guidance before exiting.
    • Automatically remove orphaned sandbox registry entries and clear default when appropriate.
  • Tests

    • Updated CLI tests to assert failing exit codes and to validate self-healing cleanup and output behavior.

@coderabbitai

coderabbitai Bot commented Apr 28, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

sandboxStatus now reconciles sandbox vs gateway presence up front, skips inference probing unless the reconciler reports the gateway present, and exits with code 1 immediately for all non-present reconciliation states; redundant reconciler invocation was removed.

Changes

Cohort / File(s) Summary
Core Status Logic
src/nemoclaw.ts
Call getReconciledSandboxGatewayState up front; only parse inference get and run probeProviderHealth when lookup.state === "present"; remove redundant reconciler call; early-exit (code 1) for non-present states.
Status Command Tests
test/cli.test.ts
Update tests to expect exit code 1 for gateway failure/non-present scenarios; ensure "Inference: healthy" is not shown when gateway unreachable; add test for self-healing that removes stale registry entries and sets defaultSandbox to null when gateway reports sandbox absent.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant CLI as "CLI\n(nemoclaw status)"
  participant Reconciler as "Reconciler\n(getReconciledSandboxGatewayState)"
  participant Gateway as "Live Gateway\n(OpenClaw/OpenShell)"
  participant Provider as "Inference Provider\n(external API)"
  participant Registry as "Local Registry\n(sandboxes.json)"

  CLI->>Reconciler: request sandbox vs gateway state
  Reconciler-->>CLI: lookup.state (present / missing / wrong_gateway_active / ...)
  alt lookup.state == "present"
    CLI->>Gateway: verify gateway reachable
    Gateway-->>CLI: reachable
    CLI->>Provider: probeProviderHealth / parse inference get
    Provider-->>CLI: probe result (healthy / unhealthy)
    CLI->>CLI: print healthy/inference info\nexit 0 (if healthy)
  else lookup.state != "present"
    CLI->>CLI: print guidance/error for specific state
    alt state indicates stale registry entry
      CLI->>Registry: remove orphaned sandbox\nset defaultSandbox=null
      Registry-->>CLI: write ack
    end
    CLI-->>CLI: exit 1
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I checked the gate before I cheered,

No "healthy" shout when the path is cleared.
Orphans pruned, the registry neat,
Exit one sings truth, not deceit.
Hoppity hop—status now complete! 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely describes the main fix: preventing 'healthy' output and ensuring non-zero exit when the gateway is down.
Linked Issues check ✅ Passed The PR implementation addresses all primary coding objectives from issue #2595: verifying gateway state before reporting health, suppressing 'Inference: healthy' when state is not present, and exiting with code 1 on degraded states.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing issue #2595: early gateway verification in sandboxStatus(), conditional health reporting, early exit on non-present states, and corresponding test updates.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Review rate limit: 9/10 reviews remaining, refill in 6 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.

@rluo8 rluo8 requested a review from cv April 29, 2026 01:04

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/nemoclaw.ts (1)

1695-1875: Recommended validation run: execute the targeted E2E recovery suites for this status behavior change.

Given this touches sandbox lifecycle/status correctness around gateway recovery, running the recommended E2E jobs would reduce regression risk.

As per coding guidelines, "E2E test recommendation: sandbox-survival-e2e — gateway restart recovery; sandbox-operations-e2e — process recovery after gateway kill; skip-permissions-e2e — permissive policy activation."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/nemoclaw.ts` around lines 1695 - 1875, This change affects sandbox
lifecycle logic in sandboxStatus (and related calls like
getReconciledSandboxGatewayState and parseSandboxPhase), so run the recommended
E2E suites locally/CI: sandbox-survival-e2e (gateway restart recovery),
sandbox-operations-e2e (process recovery after gateway kill), and
skip-permissions-e2e (permissive policy activation); capture failures, fix any
regressions in the sandboxStatus flows (recovery handling, lookup.state
branches, and registry/session updates), and attach the E2E results to the PR
(or add/enable these suites in CI) before merging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/nemoclaw.ts`:
- Around line 1695-1875: This change affects sandbox lifecycle logic in
sandboxStatus (and related calls like getReconciledSandboxGatewayState and
parseSandboxPhase), so run the recommended E2E suites locally/CI:
sandbox-survival-e2e (gateway restart recovery), sandbox-operations-e2e (process
recovery after gateway kill), and skip-permissions-e2e (permissive policy
activation); capture failures, fix any regressions in the sandboxStatus flows
(recovery handling, lookup.state branches, and registry/session updates), and
attach the E2E results to the PR (or add/enable these suites in CI) before
merging.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ef0efda6-7be3-4cda-be9c-392fdc9b9292

📥 Commits

Reviewing files that changed from the base of the PR and between 9a5f196 and 2b60791.

📒 Files selected for processing (2)
  • src/nemoclaw.ts
  • test/cli.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/cli.test.ts

@wscurran wscurran added bug Something fails against expected or documented behavior NemoClaw CLI VDR Linked to VDR finding labels Apr 29, 2026
@cv cv added v0.0.32 and removed v0.0.31 labels Apr 30, 2026
ericksoa added a commit that referenced this pull request May 3, 2026
## Summary

Ports the useful behavior from #2626 onto current main without depending
on #2881 or #2822.

- reconcile sandbox/gateway state before probing or printing inference
provider health
- suppress `Inference: healthy` unless the selected sandbox/gateway is
verified present
- exit non-zero for degraded or unverified sandbox/gateway status states
- remove only the selected stale registry entry when status verifies
that named sandbox is absent from a healthy NemoClaw gateway; status
does not sweep unrelated registry-only entries
- keep `nemoclaw list` rendering registered sandboxes when runtime
inference probing is degraded

Fixes #2595
Refs #2666
Refs #2604

## Validation

- `npm run build:cli`
- `npx vitest run test/cli.test.ts -t "status|runtime inference
probing"`
- `npx vitest run test/cli.test.ts -t
"gateway|inference|orphan|verified"`
- `npx vitest run test/cli.test.ts`
- `npm run typecheck:cli`
- `npm run typecheck`
- `npx biome lint src/nemoclaw.ts test/cli.test.ts`
- `git diff --check`

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Automatic generation and application of an initial sandbox create
policy based on enabled messaging channels; merged policy presets are
applied at sandbox boot.

* **Bug Fixes**
* Status now only reports inference when gateway/sandbox state is
verified; otherwise shows "Inference: not verified..." and exits
non-zero for unverified states.
* Status removes the selected stale/orphaned registry entry only when
that named sandbox is verified absent from a healthy NemoClaw gateway;
it does not sweep unrelated registry-only entries. List/status
gracefully fall back when inference probing is degraded.

* **Tests**
* CLI, e2e, and unit tests updated to expect non-zero exits and
verification-gated inference reporting; policy merge behavior covered.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa

ericksoa commented May 3, 2026

Copy link
Copy Markdown
Contributor

Thank you for the fix here. This PR correctly identified that nemoclaw status should not report healthy inference or exit successfully when the selected sandbox/gateway cannot be verified.

The maintained version of this behavior has now landed in #2884, which closed #2595. That merged path ports the useful pieces from this PR onto current main: it verifies the selected sandbox/gateway before inference probing, suppresses Inference: healthy for unverified/degraded states, exits non-zero for those status paths, and includes current tests for the stale registry/orphan cleanup behavior.

I am closing this as superseded by #2884 so we do not keep a conflicting older branch open, but the issue and behavior were materially advanced by this PR. Thanks again for the contribution.

@ericksoa ericksoa closed this May 3, 2026
@wscurran wscurran added area: cli Command line interface, flags, terminal UX, or output area: observability Logging, metrics, tracing, diagnostics, or debug output bug-fix PR fixes a bug or regression and removed NemoClaw CLI bug Something fails against expected or documented behavior labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: cli Command line interface, flags, terminal UX, or output area: observability Logging, metrics, tracing, diagnostics, or debug output bug-fix PR fixes a bug or regression VDR Linked to VDR finding

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[NemoClaw][macOS][CLI&UX] nemoclaw status reports "Inference: healthy" while gateway is down, exits 0

5 participants