Skip to content

fix(cli): keep sandbox status on direct runtime path#2881

Merged
cv merged 4 commits into
mainfrom
fix/diagnostics-status-hang
May 2, 2026
Merged

fix(cli): keep sandbox status on direct runtime path#2881
cv merged 4 commits into
mainfrom
fix/diagnostics-status-hang

Conversation

@ericksoa

@ericksoa ericksoa commented May 2, 2026

Copy link
Copy Markdown
Contributor

Summary

  • route nemoclaw <sandbox> status through the Oclif sandbox:status command path
  • remove sandbox status from the legacy dispatch target set
  • harden nemoclaw <sandbox> status so live OpenShell probes cannot hang the command when descendant openshell/ssh processes keep captured pipes open
  • keep status useful from local registry data when a live sandbox status probe times out, and avoid live SSH agent-version probing from the informational status path
  • keep status --help on the public sandbox-scoped usage path
  • add regression coverage for the child-pipe hang shape that reproduced in diagnostics-e2e

Evidence

  • diagnostics-e2e timed out on main run 25244372371 at a8dfa272 after TC-DIAG-05: openclaw.json readable inside sandbox, hanging at Checking nemoclaw status from host... until the 45-minute job timeout
  • the same timeout repeated on main run 25245181101 at 32d3ab3e, again hanging at Checking nemoclaw status from host... with orphan openshell/ssh children killed by job cleanup
  • focused diagnostics-only rerun 25252679721 at d73371da reproduced the same failure: TC-DIAG-05 passed the in-sandbox config read at 13:27:44, then hung at Checking nemoclaw status from host... until GitHub cancelled at 13:56:55; cleanup killed the outer timeout/bash plus two openshell/ssh pairs
  • first PR rerun 25253547462 on 84b4955c confirmed that routing alone was insufficient: it hit the same Checking nemoclaw status from host... hang and the same orphan openshell/ssh cleanup
  • fixed PR rerun 25254068156 on 2956cd0e passed diagnostics-e2e; TC-DIAG-05 moved from Checking nemoclaw status from host... at 14:41:31Z to PASS TC-DIAG-05: nemoclaw status shows Model field at 14:41:32Z, and the job ended PASS: 8 / FAIL: 0
  • Oclif-routing PR rerun 25254765479 on 3224a5c5 passed diagnostics-e2e; TC-DIAG-05 moved from Checking nemoclaw status from host... at 15:16:11Z to PASS TC-DIAG-05: nemoclaw status shows Model field at 15:16:11Z, and the job ended PASS: 8 / FAIL: 0
  • last observed passing diagnostics job before the regression was run 25204743855 at b83ffe20, where the same status check completed in about 6 seconds
  • the failure is a NemoClaw status-path regression: our status command allowed synchronous child-process probes to leave descendant processes holding captured pipes open past the nominal timeout

Oclif conformance

  • resolveSandboxOclifDispatch("<name>", "status", []) now returns { kind: "oclif", commandId: "sandbox:status", args: ["<name>"] }
  • LegacyDispatch no longer includes status
  • runDispatchResult() no longer has a legacy status branch
  • SandboxStatusCommand remains the Oclif command surface and invokes the current status runtime implementation through the existing runtime bridge pattern used by migrated commands in this repo
  • public help stays stable: nemoclaw <name> status --help still renders <name> status rather than exposing the internal sandbox:status id

Validation

  • npm test -- src/lib/openshell.test.ts src/lib/sandbox-version.test.ts src/lib/legacy-oclif-dispatch.test.ts
  • npm run build:cli
  • npm test -- test/cli.test.ts -t "keeps status bounded|keeps registry entries when status hits|recovers status after gateway runtime|status --help|status shows|sandbox inspection help"
  • npm run typecheck:cli
  • npm run lint -- src/lib/legacy-oclif-dispatch.ts src/lib/legacy-oclif-dispatch.test.ts src/nemoclaw.ts
  • npm run format:check -- src/lib/legacy-oclif-dispatch.ts src/lib/legacy-oclif-dispatch.test.ts src/nemoclaw.ts
  • git diff --check
  • focused diagnostics-e2e run 25254068156 on 2956cd0e: pass
  • focused diagnostics-e2e run 25254765479 on Oclif-routing SHA 3224a5c5: pass

Summary by CodeRabbit

Release Notes

  • New Features

    • Added async command execution with timeout support for sandboxes.
  • Improvements

    • Sandbox status checks now use a dedicated probe path with configurable timeouts to prevent hanging.
    • Status operations gracefully handle long-running background processes.
  • Tests

    • Added test coverage for async operations and status probe timeout behavior.

@ericksoa ericksoa added fix ci-failure Auto-created by nemoclaw-diagnosis skill labels May 2, 2026
@ericksoa ericksoa self-assigned this May 2, 2026
@coderabbitai

coderabbitai Bot commented May 2, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3c412920-3ffd-45ad-9007-873e0ed06273

📥 Commits

Reviewing files that changed from the base of the PR and between 3224a5c and 79dcda1.

📒 Files selected for processing (1)
  • src/nemoclaw.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/nemoclaw.ts

📝 Walkthrough

Walkthrough

Adds async, timeout-aware OpenShell capture and probes; refactors sandbox status to use these status-probe helpers and configurable timeouts; introduces a skip-probe option for agent-version checks; and adds tests validating timeout behavior and related CLI flows.

Changes

Sandbox status probing & async OpenShell capture

Layer / File(s) Summary
Async Capture Foundation
src/lib/openshell.ts
New captureOpenshellCommandAsync (async spawn), OpenshellSpawn type, and CaptureOpenshellAsyncOptions with killGraceMs/spawnImpl. Adds timeoutError and signalProcessTree helpers and timeout-driven SIGTERM→SIGKILL termination with output/error/signal metadata.
Async Capture Tests
src/lib/openshell.test.ts
Adds Vitest case that runs a long-lived child, invokes captureOpenshellCommandAsync with a short timeout and killGraceMs, and asserts quick completion with timeout metadata (ETIMEDOUT, signal).
Status-Probe Helpers
src/nemoclaw.ts (lines ~202–401)
Adds getStatusProbeTimeoutMs, captureOpenshellForStatus, isCommandTimeout, executeSandboxExecCommandForStatus, and isSandboxGatewayRunningForStatus to run async probes and detect timeouts/errors.
Live Policy Extraction
src/nemoclaw.ts (lines ~839–892)
Extracts live policy merge into mergeLivePolicyIntoSandboxOutput and updates getSandboxGatewayState to use it.
Timeout-aware Gateway State Lookup
src/nemoclaw.ts (lines ~906–946)
Adds async getSandboxGatewayStateForStatus returning status_probe_timeout on probe timeout or the usual present/missing/gateway_error/unknown_error classifications. Defines SandboxGatewayStateLookup type for pluggable lookups.
Reconciler Pluggability
src/nemoclaw.ts (lines ~1063–1082)
Refactors getReconciledSandboxGatewayState to accept an optional { getState } lookup so callers can inject status-probe or normal lookups.
Sandbox Status Integration
src/nemoclaw.ts (lines ~2207–2392)
sandboxStatus now uses captureOpenshellForStatus for inference probes, calls getReconciledSandboxGatewayState(..., { getState: getSandboxGatewayStateForStatus }), skips agent probe via sandboxVersion.checkAgentVersion(..., { skipProbe: true }), and replaces recovery-oriented OpenClaw probing with isSandboxGatewayRunningForStatus.
Agent Version Skip Option
src/lib/sandbox-version.ts
checkAgentVersion signature extended with opts.skipProbe?: boolean; when set (and not forced) it short-circuits and returns { detectionMethod: "unavailable", sandboxVersion: null, isStale: false }.
CLI / Integration Tests
test/cli.test.ts, src/lib/sandbox-version.test.ts, src/lib/legacy-oclif-dispatch.test.ts
Adds CLI test ensuring status completes quickly when a live probe leaves child pipes open; adds checkAgentVersion skip-probe test; adds resolveSandboxOclifDispatch test.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant SandboxStatus
    participant CaptureAsync as captureOpenshellCommandAsync
    participant ProcessTree as signalProcessTree
    participant Gateway as getSandboxGatewayStateForStatus

    Caller->>SandboxStatus: sandboxStatus(sandboxName)
    SandboxStatus->>CaptureAsync: captureOpenshellForStatus(["inference","get"], opts)
    CaptureAsync->>CaptureAsync: spawn(binary, args) and buffer stdout/stderr
    alt Timeout triggered
        CaptureAsync->>ProcessTree: signalProcessTree(SIGTERM)
        Note over ProcessTree: Terminate process group (non-Windows) or child (Windows)
        CaptureAsync->>ProcessTree: signalProcessTree(SIGKILL) (after killGraceMs)
        CaptureAsync-->>SandboxStatus: result with ETIMEDOUT error and signal metadata
    else Completes
        CaptureAsync-->>SandboxStatus: result with stdout/stderr
    end

    SandboxStatus->>Gateway: getSandboxGatewayStateForStatus(sandboxName)
    Gateway->>CaptureAsync: run status probe with timeout
    alt Probe times out
        Gateway-->>SandboxStatus: { state: "status_probe_timeout", ... }
    else Probe succeeds
        Gateway->>Gateway: mergeLivePolicyIntoSandboxOutput()
        Gateway-->>SandboxStatus: { state: "present"/"missing"/... }
    end

    SandboxStatus-->>Caller: aggregated status output (includes timeout notice if applicable)
Loading

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Poem

🐰 A rabbit taps keys while the probes run and spin,

Timeouts keep children from tangling the bin.
SIGTERM, then SIGKILL — a tidy farewell,
Status returns swift with a timeout-tale to tell.
Hooray for probes that finish — hop, hop, and grin!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 23.08% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: fixing the sandbox status command to work with direct runtime paths without hanging.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/diagnostics-status-hang

Review rate limit: 9/10 reviews remaining, refill in 6 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/legacy-oclif-dispatch.ts`:
- Around line 108-110: In the "status" branch of legacy-oclif-dispatch.ts,
reject unexpected args by checking if actionArgs is non-empty and return a usage
error (instead of always routing to legacy); keep the existing help check via
hasHelpFlag(actionArgs) and, if actionArgs contains anything, return the same
kind of usage/usage-error object used elsewhere (e.g., the pattern used for
other commands) so callers get a usage error rather than silently falling
through to { kind: "legacy", target: "status" }.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 66608be5-38dd-4059-ac2e-feab262a92c1

📥 Commits

Reviewing files that changed from the base of the PR and between d73371d and 84b4955.

📒 Files selected for processing (3)
  • src/lib/legacy-oclif-dispatch.test.ts
  • src/lib/legacy-oclif-dispatch.ts
  • src/nemoclaw.ts

Comment thread src/lib/legacy-oclif-dispatch.ts Outdated
@github-actions

github-actions Bot commented May 2, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25253547462
Branch: fix/diagnostics-status-hang
Requested jobs: diagnostics-e2e
Summary: 0 passed, 0 failed, 21 skipped

Job Result
cloud-e2e ⏭️ skipped
cloud-inference-e2e ⏭️ skipped
cloud-onboard-e2e ⏭️ skipped
deployment-services-e2e ⏭️ skipped
diagnostics-e2e ⚠️ cancelled
docs-validation-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-e2e ⏭️ skipped
inference-routing-e2e ⏭️ skipped
messaging-compatible-endpoint-e2e ⏭️ skipped
messaging-providers-e2e ⏭️ skipped
network-policy-e2e ⏭️ skipped
overlayfs-autofix-e2e ⏭️ skipped
rebuild-hermes-e2e ⏭️ skipped
rebuild-openclaw-e2e ⏭️ skipped
sandbox-operations-e2e ⏭️ skipped
sandbox-survival-e2e ⏭️ skipped
shields-config-e2e ⏭️ skipped
skill-agent-e2e ⏭️ skipped
snapshot-commands-e2e ⏭️ skipped
token-rotation-e2e ⏭️ skipped
upgrade-stale-sandbox-e2e ⏭️ skipped

@github-actions

github-actions Bot commented May 2, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25254068156
Branch: fix/diagnostics-status-hang
Requested jobs: diagnostics-e2e
Summary: 1 passed, 0 failed, 21 skipped

Job Result
cloud-e2e ⏭️ skipped
cloud-inference-e2e ⏭️ skipped
cloud-onboard-e2e ⏭️ skipped
deployment-services-e2e ⏭️ skipped
diagnostics-e2e ✅ success
docs-validation-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-e2e ⏭️ skipped
inference-routing-e2e ⏭️ skipped
messaging-compatible-endpoint-e2e ⏭️ skipped
messaging-providers-e2e ⏭️ skipped
network-policy-e2e ⏭️ skipped
overlayfs-autofix-e2e ⏭️ skipped
rebuild-hermes-e2e ⏭️ skipped
rebuild-openclaw-e2e ⏭️ skipped
sandbox-operations-e2e ⏭️ skipped
sandbox-survival-e2e ⏭️ skipped
shields-config-e2e ⏭️ skipped
skill-agent-e2e ⏭️ skipped
snapshot-commands-e2e ⏭️ skipped
token-rotation-e2e ⏭️ skipped
upgrade-stale-sandbox-e2e ⏭️ skipped

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/nemoclaw.ts (1)

1061-1118: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Keep the status path read-only here.

Passing getSandboxGatewayStateForStatus into this helper does not make status safe yet: the missing branch still goes through reconcileMissingAgainstNamedGateway(...), and the gateway_error branch still calls recoverNamedGatewayRuntime(). Those paths use the normal sync OpenShell probes and recovery actions, so sandboxStatus() can still select/start gateways and eventually hit the registry-removal path downstream on a missing result.

Suggested direction
 async function getReconciledSandboxGatewayState(
   sandboxName: string,
-  opts: { getState?: SandboxGatewayStateLookup } = {},
+  opts: {
+    getState?: SandboxGatewayStateLookup;
+    allowRecovery?: boolean;
+  } = {},
 ) {
   const getState = opts.getState ?? getSandboxGatewayState;
+  const allowRecovery = opts.allowRecovery ?? true;
   let lookup = await getState(sandboxName);

   if (lookup.state === "present") {
     return lookup;
   }
   if (lookup.state === "missing") {
+    if (!allowRecovery) return lookup;
     return reconcileMissingAgainstNamedGateway(sandboxName, lookup);
   }

   if (lookup.state === "gateway_error") {
+    if (!allowRecovery) return lookup;
     const recovery = await recoverNamedGatewayRuntime();
     if (recovery.recovered) {
       const retried = await getState(sandboxName);

Then have sandboxStatus() call this with recovery disabled and render missing/gateway_error as informational output instead of mutating local state.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/nemoclaw.ts` around lines 1061 - 1118, getReconciledSandboxGatewayState
currently performs reconciliation and recovery (calls
reconcileMissingAgainstNamedGateway and recoverNamedGatewayRuntime) which
mutates system state even when invoked from the read-only status path; add an
explicit option like opts.readOnly (default false) or opts.allowRecovery and,
when readOnly === true, skip calling reconcileMissingAgainstNamedGateway and
recoverNamedGatewayRuntime and simply return the original lookup for "missing"
and "gateway_error" cases (or map them to informational outputs) so no
probes/restarts are triggered. Update callers (notably sandboxStatus) to call
getReconciledSandboxGatewayState(sandboxName, { getState:
getSandboxGatewayStateForStatus, readOnly: true }) and render the returned
"missing"/"gateway_error" as informational rather than mutating state. Ensure
symbols referenced: getReconciledSandboxGatewayState,
reconcileMissingAgainstNamedGateway, recoverNamedGatewayRuntime, sandboxStatus,
getSandboxGatewayStateForStatus.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@src/nemoclaw.ts`:
- Around line 1061-1118: getReconciledSandboxGatewayState currently performs
reconciliation and recovery (calls reconcileMissingAgainstNamedGateway and
recoverNamedGatewayRuntime) which mutates system state even when invoked from
the read-only status path; add an explicit option like opts.readOnly (default
false) or opts.allowRecovery and, when readOnly === true, skip calling
reconcileMissingAgainstNamedGateway and recoverNamedGatewayRuntime and simply
return the original lookup for "missing" and "gateway_error" cases (or map them
to informational outputs) so no probes/restarts are triggered. Update callers
(notably sandboxStatus) to call getReconciledSandboxGatewayState(sandboxName, {
getState: getSandboxGatewayStateForStatus, readOnly: true }) and render the
returned "missing"/"gateway_error" as informational rather than mutating state.
Ensure symbols referenced: getReconciledSandboxGatewayState,
reconcileMissingAgainstNamedGateway, recoverNamedGatewayRuntime, sandboxStatus,
getSandboxGatewayStateForStatus.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a195a385-c0db-4e80-bc83-20f9ee215a8b

📥 Commits

Reviewing files that changed from the base of the PR and between 2956cd0 and 3224a5c.

📒 Files selected for processing (2)
  • src/lib/legacy-oclif-dispatch.test.ts
  • src/nemoclaw.ts
✅ Files skipped from review due to trivial changes (1)
  • src/lib/legacy-oclif-dispatch.test.ts

@github-actions

github-actions Bot commented May 2, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25254765479
Branch: fix/diagnostics-status-hang
Requested jobs: diagnostics-e2e
Summary: 1 passed, 0 failed, 21 skipped

Job Result
cloud-e2e ⏭️ skipped
cloud-inference-e2e ⏭️ skipped
cloud-onboard-e2e ⏭️ skipped
deployment-services-e2e ⏭️ skipped
diagnostics-e2e ✅ success
docs-validation-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-e2e ⏭️ skipped
inference-routing-e2e ⏭️ skipped
messaging-compatible-endpoint-e2e ⏭️ skipped
messaging-providers-e2e ⏭️ skipped
network-policy-e2e ⏭️ skipped
overlayfs-autofix-e2e ⏭️ skipped
rebuild-hermes-e2e ⏭️ skipped
rebuild-openclaw-e2e ⏭️ skipped
sandbox-operations-e2e ⏭️ skipped
sandbox-survival-e2e ⏭️ skipped
shields-config-e2e ⏭️ skipped
skill-agent-e2e ⏭️ skipped
snapshot-commands-e2e ⏭️ skipped
token-rotation-e2e ⏭️ skipped
upgrade-stale-sandbox-e2e ⏭️ skipped

@ericksoa ericksoa removed the v0.0.33 label May 2, 2026
@ericksoa ericksoa requested a review from cv May 2, 2026 15:34
@cv cv merged commit 596e18e into main May 2, 2026
18 checks passed
ericksoa added a commit that referenced this pull request May 3, 2026
## Summary

Ports the useful behavior from #2626 onto current main without depending
on #2881 or #2822.

- reconcile sandbox/gateway state before probing or printing inference
provider health
- suppress `Inference: healthy` unless the selected sandbox/gateway is
verified present
- exit non-zero for degraded or unverified sandbox/gateway status states
- remove only the selected stale registry entry when status verifies
that named sandbox is absent from a healthy NemoClaw gateway; status
does not sweep unrelated registry-only entries
- keep `nemoclaw list` rendering registered sandboxes when runtime
inference probing is degraded

Fixes #2595
Refs #2666
Refs #2604

## Validation

- `npm run build:cli`
- `npx vitest run test/cli.test.ts -t "status|runtime inference
probing"`
- `npx vitest run test/cli.test.ts -t
"gateway|inference|orphan|verified"`
- `npx vitest run test/cli.test.ts`
- `npm run typecheck:cli`
- `npm run typecheck`
- `npx biome lint src/nemoclaw.ts test/cli.test.ts`
- `git diff --check`

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Automatic generation and application of an initial sandbox create
policy based on enabled messaging channels; merged policy presets are
applied at sandbox boot.

* **Bug Fixes**
* Status now only reports inference when gateway/sandbox state is
verified; otherwise shows "Inference: not verified..." and exits
non-zero for unverified states.
* Status removes the selected stale/orphaned registry entry only when
that named sandbox is verified absent from a healthy NemoClaw gateway;
it does not sweep unrelated registry-only entries. List/status
gracefully fall back when inference probing is degraded.

* **Tests**
* CLI, e2e, and unit tests updated to expect non-zero exits and
verification-gated inference reporting; policy merge behavior covered.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@wscurran wscurran added area: ci CI workflows, checks, release automation, or GitHub Actions bug-fix PR fixes a bug or regression and removed fix labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: ci CI workflows, checks, release automation, or GitHub Actions bug-fix PR fixes a bug or regression ci-failure Auto-created by nemoclaw-diagnosis skill

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants