Skip to content

fix(status): require verified gateway before healthy inference#2884

Merged
ericksoa merged 8 commits into
mainfrom
codex/2595-status-current-main-20260502
May 3, 2026
Merged

fix(status): require verified gateway before healthy inference#2884
ericksoa merged 8 commits into
mainfrom
codex/2595-status-current-main-20260502

Conversation

@ericksoa

@ericksoa ericksoa commented May 2, 2026

Copy link
Copy Markdown
Contributor

Summary

Ports the useful behavior from #2626 onto current main without depending on #2881 or #2822.

  • reconcile sandbox/gateway state before probing or printing inference provider health
  • suppress Inference: healthy unless the selected sandbox/gateway is verified present
  • exit non-zero for degraded or unverified sandbox/gateway status states
  • remove only the selected stale registry entry when status verifies that named sandbox is absent from a healthy NemoClaw gateway; status does not sweep unrelated registry-only entries
  • keep nemoclaw list rendering registered sandboxes when runtime inference probing is degraded

Fixes #2595
Refs #2666
Refs #2604

Validation

  • npm run build:cli
  • npx vitest run test/cli.test.ts -t "status|runtime inference probing"
  • npx vitest run test/cli.test.ts -t "gateway|inference|orphan|verified"
  • npx vitest run test/cli.test.ts
  • npm run typecheck:cli
  • npm run typecheck
  • npx biome lint src/nemoclaw.ts test/cli.test.ts
  • git diff --check

Summary by CodeRabbit

  • New Features

    • Automatic generation and application of an initial sandbox create policy based on enabled messaging channels; merged policy presets are applied at sandbox boot.
  • Bug Fixes

    • Status now only reports inference when gateway/sandbox state is verified; otherwise shows "Inference: not verified..." and exits non-zero for unverified states.
    • Status removes the selected stale/orphaned registry entry only when that named sandbox is verified absent from a healthy NemoClaw gateway; it does not sweep unrelated registry-only entries. List/status gracefully fall back when inference probing is degraded.
  • Tests

    • CLI, e2e, and unit tests updated to expect non-zero exits and verification-gated inference reporting; policy merge behavior covered.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@coderabbitai

coderabbitai Bot commented May 2, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

sandboxStatus() now reconciles sandbox/gateway state before probing inference. If the reconciled lookup is not present, live inference probing and provider health checks are skipped, inferenceHealth is set to null, output shows Inference: not verified (gateway/sandbox state not verified), and the command exits with code 1 for degraded cases; tests/e2e updated accordingly.

Changes

Status: Verify Gateway/Sandbox Before Inference Probe

Layer / File(s) Summary
State Reconciliation / Probe Gate
src/nemoclaw.ts
Calls getReconciledSandboxGatewayState(..., { getState: getSandboxGatewayStateForStatus }) and uses lookup.state to decide probe eligibility.
Conditional Inference Probing
src/nemoclaw.ts
captureOpenshellForStatus(["inference","get"]) and parseGatewayInference(...) are executed only when lookup.state === "present".
Inference Health Assignment
src/nemoclaw.ts
inferenceHealth computed via probeProviderHealth only for lookup.state === "present" and provider is a string; otherwise set to null.
CLI Output Messaging
src/nemoclaw.ts
When lookup is not present, prints Inference: not verified (gateway/sandbox state not verified) instead of provider health details.
Exit Behavior
src/nemoclaw.ts
Non-present reconciliation branches (wrong_gateway_active, missing, identity_drift, gateway_unreachable_after_restart, gateway_missing_after_restart, fallback) now call process.exit(1) after guidance/output.
Tests — Status Command
test/cli.test.ts
Updated many status tests to expect exit code 1 for gateway-degraded/transport/auth/timeouts; added test asserting Inference: healthy only after sandbox+gateway verification and that inference get is not called when not verified.
Tests — List Command
test/cli.test.ts
Added test: when inference probing fails (connection error), list falls back to registered sandboxes from local registry and still returns success.
E2E Adjustments
test/e2e/test-double-onboard.sh
Phase 5 & 6 assertions changed to accept exit code 1 as the expected result when sandbox deleted or gateway stopped.

Sequence Diagram(s)

sequenceDiagram
  participant User as User
  participant CLI as CLI
  participant Gateway as Gateway
  participant Provider as Provider

  User->>CLI: run "nemoclaw <sandbox> status"
  CLI->>Gateway: getReconciledSandboxGatewayState(sandbox, getState=...)
  alt lookup.state == "present"
    CLI->>Gateway: captureOpenshellForStatus(["inference","get"])
    Gateway->>Provider: forward inference probe
    Provider-->>Gateway: inference response
    Gateway-->>CLI: inference response
    CLI-->>User: print Inference: healthy (model/provider)
    CLI-->>CLI: exit 0
  else lookup.state != "present"
    CLI-->>User: print Inference: not verified (gateway/sandbox state not verified)
    CLI--xGateway: skip inference probe
    CLI-->>User: print gateway reconciliation/guidance
    CLI-->>CLI: exit 1
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰
I hopped to check the gateway's door,
No false "healthy" claims anymore.
Verify first, then sing in tune,
Status waits for sun and moon.
Tests clap softly — tidy tune.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: requiring gateway verification before reporting healthy inference status.
Linked Issues check ✅ Passed Changes directly address all objectives from issue #2595: verify gateway state first, suppress healthy inference until verified, and exit non-zero on degradation.
Out of Scope Changes check ✅ Passed All changes align with issue #2595 requirements: gateway reconciliation, inference probing gating, exit code handling, and list fallback behavior are all in-scope.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/2595-status-current-main-20260502

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
test/cli.test.ts (2)

4464-4497: ⚡ Quick win

Make the degraded-inference test more specific to inference failure.

The fallback script currently returns exit 1 for all non-inference commands, so this can pass under broader runtime failure modes. Keep non-inference commands successful and assert inference get was attempted.

Proposed tightening for path specificity
   it("lists registered sandboxes when runtime inference probing is degraded", () => {
     const home = fs.mkdtempSync(path.join(os.tmpdir(), "nemoclaw-cli-list-runtime-degraded-"));
     const localBin = path.join(home, "bin");
+    const markerFile = path.join(home, "openshell-calls");
@@
     fs.writeFileSync(
       path.join(localBin, "openshell"),
       [
         "#!/usr/bin/env bash",
+        `marker_file=${JSON.stringify(markerFile)}`,
+        'printf \'%s\\n\' "$*" >> "$marker_file"',
         'if [ "$1" = "inference" ] && [ "$2" = "get" ]; then',
         "  echo 'Error: client error (Connect)' >&2",
         "  exit 1",
         "fi",
-        "exit 1",
+        "exit 0",
       ].join("\n"),
       { mode: 0o755 },
     );
@@
     expect(r.out).toContain("model: configured-model");
     expect(r.out).toContain("provider: nvidia-prod");
+    expect(fs.readFileSync(markerFile, "utf8")).toContain("inference get");
   });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/cli.test.ts` around lines 4464 - 4497, The test's fallback openshell
stub currently fails all commands, making the scenario ambiguous; update the
stub used in the "lists registered sandboxes when runtime inference probing is
degraded" test so only the inference probe fails and all other commands succeed.
Specifically, change the openshell script built in the test (the block written
to path.join(localBin, "openshell")) so the branch that matches '"$1" =
"inference" && "$2" = "get"' prints the error and exits non‑zero, while any
other invocation returns success (exit 0). Then tighten assertions in the test
(the it block using runWithEnv) to assert that the inference probe was actually
attempted by checking that the error output from that command is present (e.g.,
expect the command's stderr/out to contain the 'Error: client error (Connect)'
marker) while preserving the existing checks (Sandboxes, alpha *, model,
provider) so the test specifically verifies degraded inference probing rather
than a generic runtime failure.

4177-4208: ⚡ Quick win

Strengthen this case by asserting inference probe is skipped when state is unverified.

Current assertions validate output, but they don’t guarantee openshell inference get was not called. Adding a call marker closes that regression gap.

Proposed test hardening
       const registryDir = path.join(home, ".nemoclaw");
+      const markerFile = path.join(home, "openshell-calls");
@@
       fs.writeFileSync(
         path.join(localBin, "openshell"),
         [
           "#!/usr/bin/env bash",
+          `marker_file=${JSON.stringify(markerFile)}`,
+          'printf \'%s\\n\' "$*" >> "$marker_file"',
           'if [ "$1" = "sandbox" ] && [ "$2" = "get" ] && [ "$3" = "alpha" ]; then',
@@
       expect(statusResult.out).toContain(
         "Inference: not verified (gateway/sandbox state not verified)",
       );
+      expect(fs.readFileSync(markerFile, "utf8")).not.toContain("inference get");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/cli.test.ts` around lines 4177 - 4208, The test should assert that the
inference probe command was not invoked by instrumenting the fake curl used in
the test; modify the curl stub written in the test (the script created under
localBin named "curl") to detect requests that correspond to the CLI's inference
probe (the HTTP path or arguments used by the CLI when calling openshell
inference get) and write a call-marker (e.g., touch a marker file in HOME or
print a unique token) when such a request occurs; then after invoking
runWithEnv("alpha status", ...) assert that the marker was NOT created/printed
(using statusResult and filesystem checks) to ensure openshell inference get was
not called. Ensure you update the assertions around statusResult (the existing
expects) to include the new negative assertion for the marker.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@test/cli.test.ts`:
- Around line 4464-4497: The test's fallback openshell stub currently fails all
commands, making the scenario ambiguous; update the stub used in the "lists
registered sandboxes when runtime inference probing is degraded" test so only
the inference probe fails and all other commands succeed. Specifically, change
the openshell script built in the test (the block written to path.join(localBin,
"openshell")) so the branch that matches '"$1" = "inference" && "$2" = "get"'
prints the error and exits non‑zero, while any other invocation returns success
(exit 0). Then tighten assertions in the test (the it block using runWithEnv) to
assert that the inference probe was actually attempted by checking that the
error output from that command is present (e.g., expect the command's stderr/out
to contain the 'Error: client error (Connect)' marker) while preserving the
existing checks (Sandboxes, alpha *, model, provider) so the test specifically
verifies degraded inference probing rather than a generic runtime failure.
- Around line 4177-4208: The test should assert that the inference probe command
was not invoked by instrumenting the fake curl used in the test; modify the curl
stub written in the test (the script created under localBin named "curl") to
detect requests that correspond to the CLI's inference probe (the HTTP path or
arguments used by the CLI when calling openshell inference get) and write a
call-marker (e.g., touch a marker file in HOME or print a unique token) when
such a request occurs; then after invoking runWithEnv("alpha status", ...)
assert that the marker was NOT created/printed (using statusResult and
filesystem checks) to ensure openshell inference get was not called. Ensure you
update the assertions around statusResult (the existing expects) to include the
new negative assertion for the marker.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fb9cb7e4-90a1-4b9e-8bdf-32c2a9ecf151

📥 Commits

Reviewing files that changed from the base of the PR and between d361c4b and 9725a3e.

📒 Files selected for processing (2)
  • src/nemoclaw.ts
  • test/cli.test.ts

@github-actions

github-actions Bot commented May 2, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25256767831
Branch: codex/2595-status-current-main-20260502
Requested jobs: all (no filter)
Summary: 19 passed, 0 failed, 1 skipped

Job Result
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
deployment-services-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
gpu-e2e ⏭️ skipped
hermes-e2e ✅ success
inference-routing-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-openclaw-e2e ✅ success
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
token-rotation-e2e ⚠️ cancelled
upgrade-stale-sandbox-e2e ⚠️ cancelled

@ericksoa ericksoa self-assigned this May 2, 2026
@github-actions

github-actions Bot commented May 2, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25257342813
Branch: codex/2595-status-current-main-20260502
Requested jobs: all (no filter)
Summary: 12 passed, 0 failed, 1 skipped

Job Result
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
deployment-services-e2e ⚠️ cancelled
diagnostics-e2e ⚠️ cancelled
docs-validation-e2e ✅ success
gpu-e2e ⏭️ skipped
hermes-e2e ✅ success
inference-routing-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ⚠️ cancelled
overlayfs-autofix-e2e ⚠️ cancelled
rebuild-hermes-e2e ✅ success
rebuild-openclaw-e2e ⚠️ cancelled
sandbox-operations-e2e ⚠️ cancelled
sandbox-survival-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ⚠️ cancelled
snapshot-commands-e2e ✅ success
token-rotation-e2e ⚠️ cancelled
upgrade-stale-sandbox-e2e ⚠️ cancelled

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
test/cli.test.ts (1)

3767-3837: ⚡ Quick win

Strengthen the healthy-path test with probe-order assertions.

Line 3835 currently validates output only. It would be stronger to also assert inference get is invoked after sandbox verification, so this test guards the gating behavior directly.

Proposed test hardening
   it("prints healthy inference only after the sandbox and gateway are verified", () => {
     const home = fs.mkdtempSync(path.join(os.tmpdir(), "nemoclaw-cli-status-healthy-"));
     const localBin = path.join(home, "bin");
+    const markerFile = path.join(home, "openshell-calls");
     fs.mkdirSync(localBin, { recursive: true });
@@
       [
         "#!/usr/bin/env bash",
+        `printf '%s\\n' "$*" >> ${JSON.stringify(markerFile)}`,
         'if [ "$1" = "sandbox" ] && [ "$2" = "get" ] && [ "$3" = "alpha" ]; then',
@@
     expect(r.out).toContain("Inference:");
     expect(r.out).toContain("healthy");
     expect(r.out).not.toContain("not verified");
+    const calls = fs.readFileSync(markerFile, "utf8").trim().split("\n").filter(Boolean);
+    const sandboxGetIdx = calls.indexOf("sandbox get alpha");
+    const inferenceGetIdx = calls.indexOf("inference get");
+    expect(sandboxGetIdx).toBeGreaterThanOrEqual(0);
+    expect(inferenceGetIdx).toBeGreaterThan(sandboxGetIdx);
   });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/cli.test.ts` around lines 3767 - 3837, Update the test to record the
order of openshell invocations and assert that "sandbox get" happens before
"inference get": modify the localBin/openshell stub (used by runWithEnv in this
test) to append distinct markers (e.g., "SANDBOX_GET" and "INFERENCE_GET") to a
temp file in HOME each time those branches run, then after runWithEnv("alpha
status") read that file and assert the marker order (indexOf SANDBOX_GET <
indexOf INFERENCE_GET). Keep references to runWithEnv and the localBin/openshell
stub so the gating behavior is directly verified.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@test/cli.test.ts`:
- Around line 3767-3837: Update the test to record the order of openshell
invocations and assert that "sandbox get" happens before "inference get": modify
the localBin/openshell stub (used by runWithEnv in this test) to append distinct
markers (e.g., "SANDBOX_GET" and "INFERENCE_GET") to a temp file in HOME each
time those branches run, then after runWithEnv("alpha status") read that file
and assert the marker order (indexOf SANDBOX_GET < indexOf INFERENCE_GET). Keep
references to runWithEnv and the localBin/openshell stub so the gating behavior
is directly verified.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ed01d313-bd91-4e84-85f3-feed209bfdde

📥 Commits

Reviewing files that changed from the base of the PR and between 890e847 and e3f3c02.

📒 Files selected for processing (2)
  • src/nemoclaw.ts
  • test/cli.test.ts

@github-actions

github-actions Bot commented May 2, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 25257807467
Branch: codex/2595-status-current-main-20260502
Requested jobs: all (no filter)
Summary: 20 passed, 1 failed, 1 skipped

Job Result
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
deployment-services-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
gpu-e2e ⏭️ skipped
hermes-e2e ✅ success
inference-routing-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-openclaw-e2e ✅ success
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
token-rotation-e2e ❌ failure
upgrade-stale-sandbox-e2e ✅ success

Failed jobs: token-rotation-e2e. Check run artifacts for logs.

@github-actions

github-actions Bot commented May 2, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25257807467
Branch: codex/2595-status-current-main-20260502
Requested jobs: all (no filter)
Summary: 21 passed, 0 failed, 1 skipped

Job Result
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
deployment-services-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
gpu-e2e ⏭️ skipped
hermes-e2e ✅ success
inference-routing-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-openclaw-e2e ✅ success
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
token-rotation-e2e ✅ success
upgrade-stale-sandbox-e2e ✅ success

@github-actions

github-actions Bot commented May 2, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25259840292
Branch: codex/2595-status-current-main-20260502
Requested jobs: token-rotation-e2e
Summary: 1 passed, 0 failed, 21 skipped

Job Result
cloud-e2e ⏭️ skipped
cloud-inference-e2e ⏭️ skipped
cloud-onboard-e2e ⏭️ skipped
deployment-services-e2e ⏭️ skipped
diagnostics-e2e ⏭️ skipped
docs-validation-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-e2e ⏭️ skipped
inference-routing-e2e ⏭️ skipped
messaging-compatible-endpoint-e2e ⏭️ skipped
messaging-providers-e2e ⏭️ skipped
network-policy-e2e ⏭️ skipped
overlayfs-autofix-e2e ⏭️ skipped
rebuild-hermes-e2e ⏭️ skipped
rebuild-openclaw-e2e ⏭️ skipped
sandbox-operations-e2e ⏭️ skipped
sandbox-survival-e2e ⏭️ skipped
shields-config-e2e ⏭️ skipped
skill-agent-e2e ⏭️ skipped
snapshot-commands-e2e ⏭️ skipped
token-rotation-e2e ✅ success
upgrade-stale-sandbox-e2e ⏭️ skipped

@ericksoa ericksoa added the fix label May 2, 2026
@ericksoa ericksoa requested a review from cv May 2, 2026 20:02

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
test/policies.test.ts (1)

617-638: ⚡ Quick win

Add one edge-case test for duplicate + missing preset names.

Please add a case like ["slack", "missing", "slack"] and assert de-dup + missing reporting (appliedPresets: ["slack"], missingPresets: ["missing"]) to lock the helper contract.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/policies.test.ts` around lines 617 - 638, Add a new test in the
"mergePresetNamesIntoPolicy" describe block that calls
policies.mergePresetNamesIntoPolicy with an array containing duplicates and a
non-existent preset (e.g. ["slack","missing","slack"]); assert that
result.appliedPresets de-duplicates to ["slack"], result.missingPresets lists
["missing"], and result.policy still contains the expected policy content for
the applied preset. Locate the test near the existing spec that uses
mergePresetNamesIntoPolicy and mirror its style/expectations for appliedPresets,
missingPresets, and policy string checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@test/policies.test.ts`:
- Around line 617-638: Add a new test in the "mergePresetNamesIntoPolicy"
describe block that calls policies.mergePresetNamesIntoPolicy with an array
containing duplicates and a non-existent preset (e.g.
["slack","missing","slack"]); assert that result.appliedPresets de-duplicates to
["slack"], result.missingPresets lists ["missing"], and result.policy still
contains the expected policy content for the applied preset. Locate the test
near the existing spec that uses mergePresetNamesIntoPolicy and mirror its
style/expectations for appliedPresets, missingPresets, and policy string checks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fc6a6e90-383b-475f-9b55-a0648bdcfc8f

📥 Commits

Reviewing files that changed from the base of the PR and between 5238242 and dfe15f0.

📒 Files selected for processing (5)
  • src/lib/onboard.ts
  • src/lib/policies.ts
  • src/nemoclaw.ts
  • test/onboard.test.ts
  • test/policies.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/nemoclaw.ts

@ericksoa ericksoa merged commit 57abf88 into main May 3, 2026
20 checks passed
miyoungc added a commit that referenced this pull request May 5, 2026
## Summary
Catch up the docs for user-facing changes that landed over the weekend
and today, so the published guidance matches current installer,
onboarding, status, logs, local inference, rebuild backup behavior, the
next docs version selector, and refreshed generated user skills.

## Related Issue
None.

## Changes
- Document WSL Windows-host Ollama onboarding actions, including use,
start, restart, install, and `host.docker.internal` model pulls from
#2800; clarify that onboard owns Ollama install and model pulls from
#2952.
- Document installer fail-fast behavior for non-TTY third-party software
acceptance from #2706.
- Update sandbox-name guidance and Brev deploy validation behavior from
#2948.
- Add `nemoclaw <name> logs --tail/--since` coverage from #2825.
- Add global `nemoclaw status --json` coverage from #2822.
- Document verified-gateway status behavior and non-zero degraded exits
from #2884.
- Clarify that rebuild stops before deleting the original sandbox when
backup fails, including unreadable or root-owned state paths.
- Bump docs switcher metadata from 0.0.33 to 0.0.34 without changing
package versions or creating release tags.
- Regenerate `.agents/skills/nemoclaw-user-*` from docs, including the
new `nemoclaw-user-manage-sandboxes` generated skill and removal of the
stale `nemoclaw-user-workspace` output.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification
- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---
<!-- DCO sign-off required by CI. Run: git config user.name && git
config user.email -->
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

Made with [Cursor](https://cursor.com)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
* Expanded Ollama/local inference guidance with detailed WSL and
Windows-host workflows, proxy/token behavior, and onboarding options
* Standardized sandbox name validation and updated troubleshooting, CLI,
and deploy docs to surface the rules and validation timing
* Added/rewrote Manage Sandboxes, policy management, backup/restore,
messaging channels, workspace persistence, and CLI selection guides
* Refreshed quickstart/Hermes guidance, skill mappings, and bumped docs
version to 0.0.34
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
cv pushed a commit that referenced this pull request May 14, 2026
…/Inference fields (#3537)

> Re-signed replacement for #3534. Same code, single squashed commit,
this time SSH-signed (the original branch had `commit.gpgsign=false` set
as a local override on my clone — the prior PR's commits ended up
unverified). Force-push is blocked on the original branch, so this is a
fresh branch + new PR; #3534 will be closed in favour of this one.

## Summary

Fixes #2604. Both @wangericnv (2026-05-11) and @cv (2026-05-14)
re-reported the same symptom on v0.0.38 / v0.0.41 — `nemoclaw status`
prints `● cloudflared (stopped)` in all three failure modes (no PID
file, garbage PID, dead/wrong-process PID) with no cause and no
remediation. The bare command also omits the `Connected:` / `Inference:`
labeled fields the original bug requested. Both symptoms are addressed
here.

## Root cause

- **Cloudflared diagnostic.** `src/lib/actions/sandbox/doctor.ts`
already distinguished three states and emitted matching hints.
`showStatus()` in `src/lib/tunnel/services.ts` was written before that
and only had `isRunning() ? ok : "(stopped)"` — every failure mode
collapsed into the same un-actionable line.
- **Bare-status fields.** PR #2884 added `Inference:` / `Connected:`
labels to the per-sandbox `nemoclaw <name> status` but left bare
`nemoclaw status` showing only the model in parens.

## What this PR does

1. **Share the cloudflared state machine.** New
`readCloudflaredState(pidDir)` in `src/lib/tunnel/services.ts` returns a
discriminated union `{ kind: "running" | "stopped" | "stale-pid-file" |
"stale-pid-process" }`. `showStatus()` switches on it and emits a
coloured marker + one-line remediation. `cloudflaredDoctorCheck()`
consumes the same function and translates each state into its
`DoctorCheck`, removing duplicated PID-file / `/proc/<pid>/cmdline`
logic.

2. **Remediation wording — `no cloudflared process; restart with ...`.**
All three failure modes lead with the cause and point at `nemoclaw
tunnel start`. Both reporters asked for that exact shape. `nemoclaw
tunnel start` already handles all three states because `isRunning()`
returns false and `startService` proceeds and overwrites any stale PID
file — so one command recovers in every case. The same wording is used
in `doctor.ts` for consistency.

3. **Bare-status `Inference:` and `Connected:` lines.**
`showStatusCommand` in `src/lib/inventory/index.ts` now prints labeled
`Inference: <provider> / <model>` and `Connected: yes (N session) / no`
under each sandbox row. Provider/model prefer live gateway values for
the default sandbox (consistent with the existing `(model)` rendering,
#2369). `getActiveSessionCount` is wired through
`buildStatusCommandDeps`, mirroring the cached SSH-process probe already
used by `buildListCommandDeps`.

4. **Tests** (15 new). Three cases each for `showStatus` failure-mode
rendering, five cases for `readCloudflaredState`, seven for bare-status
`Inference:` / `Connected:` rendering across live/stored/missing dep
variants. All existing cli-doctor tests still pass against the
refactored shared function.

## Behavioural diff

Before (v0.0.41 baseline):

```
● cloudflared  (stopped)        # identical output for all three failure modes
```

After:

```
● cloudflared  (stopped)
    no cloudflared process; run `nemoclaw tunnel start` to start it

● cloudflared  (stale PID file)
    no cloudflared process (stored PID is invalid); run `nemoclaw tunnel start` to restart it

● cloudflared  (stale PID 999999999)
    no cloudflared process (PID 999999999 is dead or not cloudflared); run `nemoclaw tunnel start` to restart it
```

Bare status sandbox row:

```
# Before
test-sandbox * (qwen2.5:7b) :18789

# After
test-sandbox * (qwen2.5:7b) :18789
  Inference: ollama-local / qwen2.5:7b
  Connected: no
```

## Brev reproduction — 3× per case, baseline and fix

(Run on the prior PR #3534 branch, same code as here.) Fresh
`n2d-standard-2` (Ubuntu 22.04 / Linux 6.8 GCP), v0.0.41 from tag, faked
registry, three PID-dir states 3× each. Re-installed from this branch.

<details>
<summary><b>Baseline v0.0.41 — 9/9 runs reproduce the bug</b></summary>

```
### Case: stopped — no PID file
--- Run 1 ---
  ● cloudflared  (stopped)
--- Run 2 ---
  ● cloudflared  (stopped)
--- Run 3 ---
  ● cloudflared  (stopped)

### Case: stale-pid-file — garbage PID contents
--- Run 1 ---
  ● cloudflared  (stopped)
--- Run 2 ---
  ● cloudflared  (stopped)
--- Run 3 ---
  ● cloudflared  (stopped)

### Case: stale-pid-process — PID is dead
--- Run 1 ---
  ● cloudflared  (stopped)
--- Run 2 ---
  ● cloudflared  (stopped)
--- Run 3 ---
  ● cloudflared  (stopped)
```
</details>

<details>
<summary><b>Fix branch — 9/9 runs emit a state-specific
remediation</b></summary>

Same three-case 3× harness, with the new wording. Identical output
across reruns within each case — no flake.
</details>

## Out of scope (filed/to-file as follow-ups)

- **Auto-restart on crash.** Needs a watchdog daemon / systemd unit /
`while true; cloudflared || sleep 5` shim. Real design work, separate
PR. Manual `pkill cloudflared` is not preventable from inside the CLI
either way; the right answer there is actionable status output, which
this PR provides.
- **Intermittent cloudflared exits in nightly E2E (#3494).** Different
scope — CI-infra fault attribution. #3517 covers it; touches disjoint
files.

## Test plan

- [x] `npm run build:cli` clean
- [x] `npx tsc -p tsconfig.cli.json` clean
- [x] `vitest run src/lib/tunnel/services.test.ts` — 23 pass
- [x] `vitest run src/lib/inventory/index.test.ts` — 31 pass
- [x] `vitest run test/cli.test.ts -t doctor` — 4/4 pass (covers
refactored `cloudflaredDoctorCheck`)
- [x] Brev `n2d-standard-2` repro: 9/9 baseline reproduces; 9/9 fix
shows the new diagnostic + remediation lines, no flake
- [x] Commit SSH-signed and verified by GitHub (`reason: valid`)

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **New Features**
* Status command now displays inference provider and model per sandbox.
  * Status command includes active SSH session count when available.

* **Bug Fixes**
* Enhanced cloudflared health checks with refined state detection
(running, stopped, stale PID).
  * Improved remediation hints for cloudflared diagnostic messages.

* **Tests**
* Added tests for expanded status output with provider, model, and
session information.
* Added tests for cloudflared state detection across various scenarios.

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3537)

<!-- review_stack_entry_end -->

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
@wscurran wscurran added bug-fix PR fixes a bug or regression and removed fix labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[NemoClaw][macOS][CLI&UX] nemoclaw status reports "Inference: healthy" while gateway is down, exits 0

3 participants