fix(status): require verified gateway before healthy inference by ericksoa · Pull Request #2884 · NVIDIA/NemoClaw

ericksoa · 2026-05-02T16:42:11Z

Summary

Ports the useful behavior from #2626 onto current main without depending on #2881 or #2822.

reconcile sandbox/gateway state before probing or printing inference provider health
suppress Inference: healthy unless the selected sandbox/gateway is verified present
exit non-zero for degraded or unverified sandbox/gateway status states
remove only the selected stale registry entry when status verifies that named sandbox is absent from a healthy NemoClaw gateway; status does not sweep unrelated registry-only entries
keep nemoclaw list rendering registered sandboxes when runtime inference probing is degraded

Fixes #2595
Refs #2666
Refs #2604

Validation

npm run build:cli
npx vitest run test/cli.test.ts -t "status|runtime inference probing"
npx vitest run test/cli.test.ts -t "gateway|inference|orphan|verified"
npx vitest run test/cli.test.ts
npm run typecheck:cli
npm run typecheck
npx biome lint src/nemoclaw.ts test/cli.test.ts
git diff --check

Summary by CodeRabbit

New Features
- Automatic generation and application of an initial sandbox create policy based on enabled messaging channels; merged policy presets are applied at sandbox boot.
Bug Fixes
- Status now only reports inference when gateway/sandbox state is verified; otherwise shows "Inference: not verified..." and exits non-zero for unverified states.
- Status removes the selected stale/orphaned registry entry only when that named sandbox is verified absent from a healthy NemoClaw gateway; it does not sweep unrelated registry-only entries. List/status gracefully fall back when inference probing is degraded.
Tests
- CLI, e2e, and unit tests updated to expect non-zero exits and verification-gated inference reporting; policy merge behavior covered.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

coderabbitai · 2026-05-02T16:42:23Z

📝 Walkthrough

Walkthrough

sandboxStatus() now reconciles sandbox/gateway state before probing inference. If the reconciled lookup is not present, live inference probing and provider health checks are skipped, inferenceHealth is set to null, output shows Inference: not verified (gateway/sandbox state not verified), and the command exits with code 1 for degraded cases; tests/e2e updated accordingly.

Changes

Status: Verify Gateway/Sandbox Before Inference Probe

Layer / File(s)	Summary
State Reconciliation / Probe Gate `src/nemoclaw.ts`	Calls `getReconciledSandboxGatewayState(..., { getState: getSandboxGatewayStateForStatus })` and uses `lookup.state` to decide probe eligibility.
Conditional Inference Probing `src/nemoclaw.ts`	`captureOpenshellForStatus(["inference","get"])` and `parseGatewayInference(...)` are executed only when `lookup.state === "present"`.
Inference Health Assignment `src/nemoclaw.ts`	`inferenceHealth` computed via `probeProviderHealth` only for `lookup.state === "present"` and provider is a string; otherwise set to `null`.
CLI Output Messaging `src/nemoclaw.ts`	When lookup is not `present`, prints `Inference: not verified (gateway/sandbox state not verified)` instead of provider health details.
Exit Behavior `src/nemoclaw.ts`	Non-`present` reconciliation branches (`wrong_gateway_active`, `missing`, `identity_drift`, `gateway_unreachable_after_restart`, `gateway_missing_after_restart`, fallback) now call `process.exit(1)` after guidance/output.
Tests — Status Command `test/cli.test.ts`	Updated many status tests to expect exit code `1` for gateway-degraded/transport/auth/timeouts; added test asserting `Inference: healthy` only after sandbox+gateway verification and that `inference get` is not called when not verified.
Tests — List Command `test/cli.test.ts`	Added test: when inference probing fails (connection error), `list` falls back to registered sandboxes from local registry and still returns success.
E2E Adjustments `test/e2e/test-double-onboard.sh`	Phase 5 & 6 assertions changed to accept exit code `1` as the expected result when sandbox deleted or gateway stopped.

Sequence Diagram(s)

sequenceDiagram
  participant User as User
  participant CLI as CLI
  participant Gateway as Gateway
  participant Provider as Provider

  User->>CLI: run "nemoclaw <sandbox> status"
  CLI->>Gateway: getReconciledSandboxGatewayState(sandbox, getState=...)
  alt lookup.state == "present"
    CLI->>Gateway: captureOpenshellForStatus(["inference","get"])
    Gateway->>Provider: forward inference probe
    Provider-->>Gateway: inference response
    Gateway-->>CLI: inference response
    CLI-->>User: print Inference: healthy (model/provider)
    CLI-->>CLI: exit 0
  else lookup.state != "present"
    CLI-->>User: print Inference: not verified (gateway/sandbox state not verified)
    CLI--xGateway: skip inference probe
    CLI-->>User: print gateway reconciliation/guidance
    CLI-->>CLI: exit 1
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰
I hopped to check the gateway's door,
No false "healthy" claims anymore.
Verify first, then sing in tune,
Status waits for sun and moon.
Tests clap softly — tidy tune.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: requiring gateway verification before reporting healthy inference status.
Linked Issues check	✅ Passed	Changes directly address all objectives from issue `#2595`: verify gateway state first, suppress healthy inference until verified, and exit non-zero on degradation.
Out of Scope Changes check	✅ Passed	All changes align with issue `#2595` requirements: gateway reconciliation, inference probing gating, exit code handling, and list fallback behavior are all in-scope.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch codex/2595-status-current-main-20260502

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (2)

test/cli.test.ts (2)

4464-4497: ⚡ Quick win

Make the degraded-inference test more specific to inference failure.

The fallback script currently returns exit 1 for all non-inference commands, so this can pass under broader runtime failure modes. Keep non-inference commands successful and assert inference get was attempted.

Proposed tightening for path specificity

   it("lists registered sandboxes when runtime inference probing is degraded", () => {
     const home = fs.mkdtempSync(path.join(os.tmpdir(), "nemoclaw-cli-list-runtime-degraded-"));
     const localBin = path.join(home, "bin");
+    const markerFile = path.join(home, "openshell-calls");
@@
     fs.writeFileSync(
       path.join(localBin, "openshell"),
       [
         "#!/usr/bin/env bash",
+        `marker_file=${JSON.stringify(markerFile)}`,
+        'printf \'%s\\n\' "$*" >> "$marker_file"',
         'if [ "$1" = "inference" ] && [ "$2" = "get" ]; then',
         "  echo 'Error: client error (Connect)' >&2",
         "  exit 1",
         "fi",
-        "exit 1",
+        "exit 0",
       ].join("\n"),
       { mode: 0o755 },
     );
@@
     expect(r.out).toContain("model: configured-model");
     expect(r.out).toContain("provider: nvidia-prod");
+    expect(fs.readFileSync(markerFile, "utf8")).toContain("inference get");
   });

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/cli.test.ts` around lines 4464 - 4497, The test's fallback openshell
stub currently fails all commands, making the scenario ambiguous; update the
stub used in the "lists registered sandboxes when runtime inference probing is
degraded" test so only the inference probe fails and all other commands succeed.
Specifically, change the openshell script built in the test (the block written
to path.join(localBin, "openshell")) so the branch that matches '"$1" =
"inference" && "$2" = "get"' prints the error and exits non‑zero, while any
other invocation returns success (exit 0). Then tighten assertions in the test
(the it block using runWithEnv) to assert that the inference probe was actually
attempted by checking that the error output from that command is present (e.g.,
expect the command's stderr/out to contain the 'Error: client error (Connect)'
marker) while preserving the existing checks (Sandboxes, alpha *, model,
provider) so the test specifically verifies degraded inference probing rather
than a generic runtime failure.

4177-4208: ⚡ Quick win

Strengthen this case by asserting inference probe is skipped when state is unverified.

Current assertions validate output, but they don’t guarantee openshell inference get was not called. Adding a call marker closes that regression gap.

Proposed test hardening

       const registryDir = path.join(home, ".nemoclaw");
+      const markerFile = path.join(home, "openshell-calls");
@@
       fs.writeFileSync(
         path.join(localBin, "openshell"),
         [
           "#!/usr/bin/env bash",
+          `marker_file=${JSON.stringify(markerFile)}`,
+          'printf \'%s\\n\' "$*" >> "$marker_file"',
           'if [ "$1" = "sandbox" ] && [ "$2" = "get" ] && [ "$3" = "alpha" ]; then',
@@
       expect(statusResult.out).toContain(
         "Inference: not verified (gateway/sandbox state not verified)",
       );
+      expect(fs.readFileSync(markerFile, "utf8")).not.toContain("inference get");

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/cli.test.ts` around lines 4177 - 4208, The test should assert that the
inference probe command was not invoked by instrumenting the fake curl used in
the test; modify the curl stub written in the test (the script created under
localBin named "curl") to detect requests that correspond to the CLI's inference
probe (the HTTP path or arguments used by the CLI when calling openshell
inference get) and write a call-marker (e.g., touch a marker file in HOME or
print a unique token) when such a request occurs; then after invoking
runWithEnv("alpha status", ...) assert that the marker was NOT created/printed
(using statusResult and filesystem checks) to ensure openshell inference get was
not called. Ensure you update the assertions around statusResult (the existing
expects) to include the new negative assertion for the marker.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@test/cli.test.ts`:
- Around line 4464-4497: The test's fallback openshell stub currently fails all
commands, making the scenario ambiguous; update the stub used in the "lists
registered sandboxes when runtime inference probing is degraded" test so only
the inference probe fails and all other commands succeed. Specifically, change
the openshell script built in the test (the block written to path.join(localBin,
"openshell")) so the branch that matches '"$1" = "inference" && "$2" = "get"'
prints the error and exits non‑zero, while any other invocation returns success
(exit 0). Then tighten assertions in the test (the it block using runWithEnv) to
assert that the inference probe was actually attempted by checking that the
error output from that command is present (e.g., expect the command's stderr/out
to contain the 'Error: client error (Connect)' marker) while preserving the
existing checks (Sandboxes, alpha *, model, provider) so the test specifically
verifies degraded inference probing rather than a generic runtime failure.
- Around line 4177-4208: The test should assert that the inference probe command
was not invoked by instrumenting the fake curl used in the test; modify the curl
stub written in the test (the script created under localBin named "curl") to
detect requests that correspond to the CLI's inference probe (the HTTP path or
arguments used by the CLI when calling openshell inference get) and write a
call-marker (e.g., touch a marker file in HOME or print a unique token) when
such a request occurs; then after invoking runWithEnv("alpha status", ...)
assert that the marker was NOT created/printed (using statusResult and
filesystem checks) to ensure openshell inference get was not called. Ensure you
update the assertions around statusResult (the existing expects) to include the
new negative assertion for the marker.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fb9cb7e4-90a1-4b9e-8bdf-32c2a9ecf151

📥 Commits

Reviewing files that changed from the base of the PR and between d361c4b and 9725a3e.

📒 Files selected for processing (2)

src/nemoclaw.ts
test/cli.test.ts

github-actions · 2026-05-02T17:18:56Z

Selective E2E Results — ✅ All requested jobs passed

Run: 25256767831
Branch: codex/2595-status-current-main-20260502
Requested jobs: all (no filter)
Summary: 19 passed, 0 failed, 1 skipped

Job	Result
cloud-e2e	✅ success
cloud-inference-e2e	✅ success
cloud-onboard-e2e	✅ success
deployment-services-e2e	✅ success
diagnostics-e2e	✅ success
docs-validation-e2e	✅ success
gpu-e2e	⏭️ skipped
hermes-e2e	✅ success
inference-routing-e2e	✅ success
messaging-compatible-endpoint-e2e	✅ success
messaging-providers-e2e	✅ success
network-policy-e2e	✅ success
overlayfs-autofix-e2e	✅ success
rebuild-hermes-e2e	✅ success
rebuild-openclaw-e2e	✅ success
sandbox-operations-e2e	✅ success
sandbox-survival-e2e	✅ success
shields-config-e2e	✅ success
skill-agent-e2e	✅ success
snapshot-commands-e2e	✅ success
token-rotation-e2e	⚠️ cancelled
upgrade-stale-sandbox-e2e	⚠️ cancelled

…lishable-update

github-actions · 2026-05-02T17:33:11Z

Selective E2E Results — ✅ All requested jobs passed

Run: 25257342813
Branch: codex/2595-status-current-main-20260502
Requested jobs: all (no filter)
Summary: 12 passed, 0 failed, 1 skipped

Job	Result
cloud-e2e	✅ success
cloud-inference-e2e	✅ success
cloud-onboard-e2e	✅ success
deployment-services-e2e	⚠️ cancelled
diagnostics-e2e	⚠️ cancelled
docs-validation-e2e	✅ success
gpu-e2e	⏭️ skipped
hermes-e2e	✅ success
inference-routing-e2e	✅ success
messaging-compatible-endpoint-e2e	✅ success
messaging-providers-e2e	✅ success
network-policy-e2e	⚠️ cancelled
overlayfs-autofix-e2e	⚠️ cancelled
rebuild-hermes-e2e	✅ success
rebuild-openclaw-e2e	⚠️ cancelled
sandbox-operations-e2e	⚠️ cancelled
sandbox-survival-e2e	✅ success
shields-config-e2e	✅ success
skill-agent-e2e	⚠️ cancelled
snapshot-commands-e2e	✅ success
token-rotation-e2e	⚠️ cancelled
upgrade-stale-sandbox-e2e	⚠️ cancelled

coderabbitai

🧹 Nitpick comments (1)

test/cli.test.ts (1)

3767-3837: ⚡ Quick win

Strengthen the healthy-path test with probe-order assertions.

Line 3835 currently validates output only. It would be stronger to also assert inference get is invoked after sandbox verification, so this test guards the gating behavior directly.

Proposed test hardening

   it("prints healthy inference only after the sandbox and gateway are verified", () => {
     const home = fs.mkdtempSync(path.join(os.tmpdir(), "nemoclaw-cli-status-healthy-"));
     const localBin = path.join(home, "bin");
+    const markerFile = path.join(home, "openshell-calls");
     fs.mkdirSync(localBin, { recursive: true });
@@
       [
         "#!/usr/bin/env bash",
+        `printf '%s\\n' "$*" >> ${JSON.stringify(markerFile)}`,
         'if [ "$1" = "sandbox" ] && [ "$2" = "get" ] && [ "$3" = "alpha" ]; then',
@@
     expect(r.out).toContain("Inference:");
     expect(r.out).toContain("healthy");
     expect(r.out).not.toContain("not verified");
+    const calls = fs.readFileSync(markerFile, "utf8").trim().split("\n").filter(Boolean);
+    const sandboxGetIdx = calls.indexOf("sandbox get alpha");
+    const inferenceGetIdx = calls.indexOf("inference get");
+    expect(sandboxGetIdx).toBeGreaterThanOrEqual(0);
+    expect(inferenceGetIdx).toBeGreaterThan(sandboxGetIdx);
   });

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/cli.test.ts` around lines 3767 - 3837, Update the test to record the
order of openshell invocations and assert that "sandbox get" happens before
"inference get": modify the localBin/openshell stub (used by runWithEnv in this
test) to append distinct markers (e.g., "SANDBOX_GET" and "INFERENCE_GET") to a
temp file in HOME each time those branches run, then after runWithEnv("alpha
status") read that file and assert the marker order (indexOf SANDBOX_GET <
indexOf INFERENCE_GET). Keep references to runWithEnv and the localBin/openshell
stub so the gating behavior is directly verified.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@test/cli.test.ts`:
- Around line 3767-3837: Update the test to record the order of openshell
invocations and assert that "sandbox get" happens before "inference get": modify
the localBin/openshell stub (used by runWithEnv in this test) to append distinct
markers (e.g., "SANDBOX_GET" and "INFERENCE_GET") to a temp file in HOME each
time those branches run, then after runWithEnv("alpha status") read that file
and assert the marker order (indexOf SANDBOX_GET < indexOf INFERENCE_GET). Keep
references to runWithEnv and the localBin/openshell stub so the gating behavior
is directly verified.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ed01d313-bd91-4e84-85f3-feed209bfdde

📥 Commits

Reviewing files that changed from the base of the PR and between 890e847 and e3f3c02.

📒 Files selected for processing (2)

src/nemoclaw.ts
test/cli.test.ts

github-actions · 2026-05-02T18:04:18Z

Selective E2E Results — ❌ Some jobs failed

Run: 25257807467
Branch: codex/2595-status-current-main-20260502
Requested jobs: all (no filter)
Summary: 20 passed, 1 failed, 1 skipped

Job	Result
cloud-e2e	✅ success
cloud-inference-e2e	✅ success
cloud-onboard-e2e	✅ success
deployment-services-e2e	✅ success
diagnostics-e2e	✅ success
docs-validation-e2e	✅ success
gpu-e2e	⏭️ skipped
hermes-e2e	✅ success
inference-routing-e2e	✅ success
messaging-compatible-endpoint-e2e	✅ success
messaging-providers-e2e	✅ success
network-policy-e2e	✅ success
overlayfs-autofix-e2e	✅ success
rebuild-hermes-e2e	✅ success
rebuild-openclaw-e2e	✅ success
sandbox-operations-e2e	✅ success
sandbox-survival-e2e	✅ success
shields-config-e2e	✅ success
skill-agent-e2e	✅ success
snapshot-commands-e2e	✅ success
token-rotation-e2e	❌ failure
upgrade-stale-sandbox-e2e	✅ success

Failed jobs: token-rotation-e2e. Check run artifacts for logs.

github-actions · 2026-05-02T18:42:59Z

Selective E2E Results — ✅ All requested jobs passed

Run: 25257807467
Branch: codex/2595-status-current-main-20260502
Requested jobs: all (no filter)
Summary: 21 passed, 0 failed, 1 skipped

Job	Result
cloud-e2e	✅ success
cloud-inference-e2e	✅ success
cloud-onboard-e2e	✅ success
deployment-services-e2e	✅ success
diagnostics-e2e	✅ success
docs-validation-e2e	✅ success
gpu-e2e	⏭️ skipped
hermes-e2e	✅ success
inference-routing-e2e	✅ success
messaging-compatible-endpoint-e2e	✅ success
messaging-providers-e2e	✅ success
network-policy-e2e	✅ success
overlayfs-autofix-e2e	✅ success
rebuild-hermes-e2e	✅ success
rebuild-openclaw-e2e	✅ success
sandbox-operations-e2e	✅ success
sandbox-survival-e2e	✅ success
shields-config-e2e	✅ success
skill-agent-e2e	✅ success
snapshot-commands-e2e	✅ success
token-rotation-e2e	✅ success
upgrade-stale-sandbox-e2e	✅ success

github-actions · 2026-05-02T19:57:56Z

Selective E2E Results — ✅ All requested jobs passed

Run: 25259840292
Branch: codex/2595-status-current-main-20260502
Requested jobs: token-rotation-e2e
Summary: 1 passed, 0 failed, 21 skipped

Job	Result
cloud-e2e	⏭️ skipped
cloud-inference-e2e	⏭️ skipped
cloud-onboard-e2e	⏭️ skipped
deployment-services-e2e	⏭️ skipped
diagnostics-e2e	⏭️ skipped
docs-validation-e2e	⏭️ skipped
gpu-e2e	⏭️ skipped
hermes-e2e	⏭️ skipped
inference-routing-e2e	⏭️ skipped
messaging-compatible-endpoint-e2e	⏭️ skipped
messaging-providers-e2e	⏭️ skipped
network-policy-e2e	⏭️ skipped
overlayfs-autofix-e2e	⏭️ skipped
rebuild-hermes-e2e	⏭️ skipped
rebuild-openclaw-e2e	⏭️ skipped
sandbox-operations-e2e	⏭️ skipped
sandbox-survival-e2e	⏭️ skipped
shields-config-e2e	⏭️ skipped
skill-agent-e2e	⏭️ skipped
snapshot-commands-e2e	⏭️ skipped
token-rotation-e2e	✅ success
upgrade-stale-sandbox-e2e	⏭️ skipped

coderabbitai

🧹 Nitpick comments (1)

test/policies.test.ts (1)
617-638: ⚡ Quick win

Add one edge-case test for duplicate + missing preset names.

Please add a case like ["slack", "missing", "slack"] and assert de-dup + missing reporting (appliedPresets: ["slack"], missingPresets: ["missing"]) to lock the helper contract.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/policies.test.ts` around lines 617 - 638, Add a new test in the
"mergePresetNamesIntoPolicy" describe block that calls
policies.mergePresetNamesIntoPolicy with an array containing duplicates and a
non-existent preset (e.g. ["slack","missing","slack"]); assert that
result.appliedPresets de-duplicates to ["slack"], result.missingPresets lists
["missing"], and result.policy still contains the expected policy content for
the applied preset. Locate the test near the existing spec that uses
mergePresetNamesIntoPolicy and mirror its style/expectations for appliedPresets,
missingPresets, and policy string checks.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@test/policies.test.ts`:
- Around line 617-638: Add a new test in the "mergePresetNamesIntoPolicy"
describe block that calls policies.mergePresetNamesIntoPolicy with an array
containing duplicates and a non-existent preset (e.g.
["slack","missing","slack"]); assert that result.appliedPresets de-duplicates to
["slack"], result.missingPresets lists ["missing"], and result.policy still
contains the expected policy content for the applied preset. Locate the test
near the existing spec that uses mergePresetNamesIntoPolicy and mirror its
style/expectations for appliedPresets, missingPresets, and policy string checks.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fc6a6e90-383b-475f-9b55-a0648bdcfc8f

📥 Commits

Reviewing files that changed from the base of the PR and between 5238242 and dfe15f0.

📒 Files selected for processing (5)

src/lib/onboard.ts
src/lib/policies.ts
src/nemoclaw.ts
test/onboard.test.ts
test/policies.test.ts

🚧 Files skipped from review as they are similar to previous changes (1)

src/nemoclaw.ts

## Summary Catch up the docs for user-facing changes that landed over the weekend and today, so the published guidance matches current installer, onboarding, status, logs, local inference, rebuild backup behavior, the next docs version selector, and refreshed generated user skills. ## Related Issue None. ## Changes - Document WSL Windows-host Ollama onboarding actions, including use, start, restart, install, and `host.docker.internal` model pulls from #2800; clarify that onboard owns Ollama install and model pulls from #2952. - Document installer fail-fast behavior for non-TTY third-party software acceptance from #2706. - Update sandbox-name guidance and Brev deploy validation behavior from #2948. - Add `nemoclaw <name> logs --tail/--since` coverage from #2825. - Add global `nemoclaw status --json` coverage from #2822. - Document verified-gateway status behavior and non-zero degraded exits from #2884. - Clarify that rebuild stops before deleting the original sandbox when backup fails, including unreadable or root-owned state paths. - Bump docs switcher metadata from 0.0.33 to 0.0.34 without changing package versions or creating release tags. - Regenerate `.agents/skills/nemoclaw-user-*` from docs, including the new `nemoclaw-user-manage-sandboxes` generated skill and removal of the stale `nemoclaw-user-workspace` output. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [x] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) ---  Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> Made with [Cursor](https://cursor.com)  ## Summary by CodeRabbit * **Documentation** * Expanded Ollama/local inference guidance with detailed WSL and Windows-host workflows, proxy/token behavior, and onboarding options * Standardized sandbox name validation and updated troubleshooting, CLI, and deploy docs to surface the rules and validation timing * Added/rewrote Manage Sandboxes, policy management, backup/restore, messaging channels, workspace persistence, and CLI selection guides * Refreshed quickstart/Hermes guidance, skill mappings, and bumped docs version to 0.0.34  --------- Co-authored-by: Cursor <cursoragent@cursor.com>

@wangericnv

…/Inference fields (#3537) > Re-signed replacement for #3534. Same code, single squashed commit, this time SSH-signed (the original branch had `commit.gpgsign=false` set as a local override on my clone — the prior PR's commits ended up unverified). Force-push is blocked on the original branch, so this is a fresh branch + new PR; #3534 will be closed in favour of this one. ## Summary Fixes #2604. Both @wangericnv (2026-05-11) and @cv (2026-05-14) re-reported the same symptom on v0.0.38 / v0.0.41 — `nemoclaw status` prints `● cloudflared (stopped)` in all three failure modes (no PID file, garbage PID, dead/wrong-process PID) with no cause and no remediation. The bare command also omits the `Connected:` / `Inference:` labeled fields the original bug requested. Both symptoms are addressed here. ## Root cause - **Cloudflared diagnostic.** `src/lib/actions/sandbox/doctor.ts` already distinguished three states and emitted matching hints. `showStatus()` in `src/lib/tunnel/services.ts` was written before that and only had `isRunning() ? ok : "(stopped)"` — every failure mode collapsed into the same un-actionable line. - **Bare-status fields.** PR #2884 added `Inference:` / `Connected:` labels to the per-sandbox `nemoclaw <name> status` but left bare `nemoclaw status` showing only the model in parens. ## What this PR does 1. **Share the cloudflared state machine.** New `readCloudflaredState(pidDir)` in `src/lib/tunnel/services.ts` returns a discriminated union `{ kind: "running" | "stopped" | "stale-pid-file" | "stale-pid-process" }`. `showStatus()` switches on it and emits a coloured marker + one-line remediation. `cloudflaredDoctorCheck()` consumes the same function and translates each state into its `DoctorCheck`, removing duplicated PID-file / `/proc/<pid>/cmdline` logic. 2. **Remediation wording — `no cloudflared process; restart with ...`.** All three failure modes lead with the cause and point at `nemoclaw tunnel start`. Both reporters asked for that exact shape. `nemoclaw tunnel start` already handles all three states because `isRunning()` returns false and `startService` proceeds and overwrites any stale PID file — so one command recovers in every case. The same wording is used in `doctor.ts` for consistency. 3. **Bare-status `Inference:` and `Connected:` lines.** `showStatusCommand` in `src/lib/inventory/index.ts` now prints labeled `Inference: <provider> / <model>` and `Connected: yes (N session) / no` under each sandbox row. Provider/model prefer live gateway values for the default sandbox (consistent with the existing `(model)` rendering, #2369). `getActiveSessionCount` is wired through `buildStatusCommandDeps`, mirroring the cached SSH-process probe already used by `buildListCommandDeps`. 4. **Tests** (15 new). Three cases each for `showStatus` failure-mode rendering, five cases for `readCloudflaredState`, seven for bare-status `Inference:` / `Connected:` rendering across live/stored/missing dep variants. All existing cli-doctor tests still pass against the refactored shared function. ## Behavioural diff Before (v0.0.41 baseline): ``` ● cloudflared (stopped) # identical output for all three failure modes ``` After: ``` ● cloudflared (stopped) no cloudflared process; run `nemoclaw tunnel start` to start it ● cloudflared (stale PID file) no cloudflared process (stored PID is invalid); run `nemoclaw tunnel start` to restart it ● cloudflared (stale PID 999999999) no cloudflared process (PID 999999999 is dead or not cloudflared); run `nemoclaw tunnel start` to restart it ``` Bare status sandbox row: ``` # Before test-sandbox * (qwen2.5:7b) :18789 # After test-sandbox * (qwen2.5:7b) :18789 Inference: ollama-local / qwen2.5:7b Connected: no ``` ## Brev reproduction — 3× per case, baseline and fix (Run on the prior PR #3534 branch, same code as here.) Fresh `n2d-standard-2` (Ubuntu 22.04 / Linux 6.8 GCP), v0.0.41 from tag, faked registry, three PID-dir states 3× each. Re-installed from this branch. <details> <summary><b>Baseline v0.0.41 — 9/9 runs reproduce the bug</b></summary> ``` ### Case: stopped — no PID file --- Run 1 --- ● cloudflared (stopped) --- Run 2 --- ● cloudflared (stopped) --- Run 3 --- ● cloudflared (stopped) ### Case: stale-pid-file — garbage PID contents --- Run 1 --- ● cloudflared (stopped) --- Run 2 --- ● cloudflared (stopped) --- Run 3 --- ● cloudflared (stopped) ### Case: stale-pid-process — PID is dead --- Run 1 --- ● cloudflared (stopped) --- Run 2 --- ● cloudflared (stopped) --- Run 3 --- ● cloudflared (stopped) ``` </details> <details> <summary><b>Fix branch — 9/9 runs emit a state-specific remediation</b></summary> Same three-case 3× harness, with the new wording. Identical output across reruns within each case — no flake. </details> ## Out of scope (filed/to-file as follow-ups) - **Auto-restart on crash.** Needs a watchdog daemon / systemd unit / `while true; cloudflared || sleep 5` shim. Real design work, separate PR. Manual `pkill cloudflared` is not preventable from inside the CLI either way; the right answer there is actionable status output, which this PR provides. - **Intermittent cloudflared exits in nightly E2E (#3494).** Different scope — CI-infra fault attribution. #3517 covers it; touches disjoint files. ## Test plan - [x] `npm run build:cli` clean - [x] `npx tsc -p tsconfig.cli.json` clean - [x] `vitest run src/lib/tunnel/services.test.ts` — 23 pass - [x] `vitest run src/lib/inventory/index.test.ts` — 31 pass - [x] `vitest run test/cli.test.ts -t doctor` — 4/4 pass (covers refactored `cloudflaredDoctorCheck`) - [x] Brev `n2d-standard-2` repro: 9/9 baseline reproduces; 9/9 fix shows the new diagnostic + remediation lines, no flake - [x] Commit SSH-signed and verified by GitHub (`reason: valid`) Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>  ## Summary by CodeRabbit ## Release Notes * **New Features** * Status command now displays inference provider and model per sandbox. * Status command includes active SSH session count when available. * **Bug Fixes** * Enhanced cloudflared health checks with refined state detection (running, stopped, stale PID). * Improved remediation hints for cloudflared diagnostic messages. * **Tests** * Added tests for expanded status output with provider, model, and session information. * Added tests for cloudflared state detection across various scenarios.  [![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3537)   Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

fix(status): require verified gateway before healthy inference

9725a3e

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

coderabbitai Bot reviewed May 2, 2026

View reviewed changes

test(e2e): expect degraded status exit in double onboard

890e847

ericksoa added the v0.0.34 label May 2, 2026

ericksoa self-assigned this May 2, 2026

ericksoa added 2 commits May 2, 2026 10:30

test(cli): tighten status degraded probes

1fb75a5

Merge remote-tracking branch 'origin/main' into codex/2595-status-pub…

e3f3c02

…lishable-update

coderabbitai Bot reviewed May 2, 2026

View reviewed changes

test(cli): assert healthy status probe order

5238242

Merge branch 'main' into codex/2595-status-current-main-20260502

87cc03d

fix(onboard): include slack policy at sandbox boot

877c769

ericksoa added the fix label May 2, 2026

ericksoa requested a review from cv May 2, 2026 20:02

Merge branch 'main' into codex/2595-status-current-main-20260502

dfe15f0

coderabbitai Bot reviewed May 2, 2026

View reviewed changes

cv approved these changes May 2, 2026

View reviewed changes

ericksoa merged commit 57abf88 into main May 3, 2026
20 checks passed

ericksoa mentioned this pull request May 3, 2026

fix(status): exit 1 and suppress healthy when gateway is down #2626

Closed

9 tasks

miyoungc mentioned this pull request May 4, 2026

docs: refresh documentation and user-skills for 0.0.34 #3003

Merged

12 tasks

wscurran added bug-fix PR fixes a bug or regression and removed fix labels Jun 3, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(status): require verified gateway before healthy inference#2884

fix(status): require verified gateway before healthy inference#2884
ericksoa merged 8 commits into
mainfrom
codex/2595-status-current-main-20260502

ericksoa commented May 2, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 2, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

github-actions Bot commented May 2, 2026

Uh oh!

github-actions Bot commented May 2, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

github-actions Bot commented May 2, 2026

Uh oh!

github-actions Bot commented May 2, 2026

Uh oh!

github-actions Bot commented May 2, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ericksoa commented May 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 2, 2026

Selective E2E Results — ✅ All requested jobs passed

Uh oh!

github-actions Bot commented May 2, 2026

Selective E2E Results — ✅ All requested jobs passed

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 2, 2026

Selective E2E Results — ❌ Some jobs failed

Uh oh!

github-actions Bot commented May 2, 2026

Selective E2E Results — ✅ All requested jobs passed

Uh oh!

github-actions Bot commented May 2, 2026

Selective E2E Results — ✅ All requested jobs passed

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ericksoa commented May 2, 2026 •

edited

Loading

coderabbitai Bot commented May 2, 2026 •

edited

Loading