fix(cli): keep sandbox status on direct runtime path#2881
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds async, timeout-aware OpenShell capture and probes; refactors sandbox status to use these status-probe helpers and configurable timeouts; introduces a skip-probe option for agent-version checks; and adds tests validating timeout behavior and related CLI flows. ChangesSandbox status probing & async OpenShell capture
Sequence DiagramsequenceDiagram
participant Caller
participant SandboxStatus
participant CaptureAsync as captureOpenshellCommandAsync
participant ProcessTree as signalProcessTree
participant Gateway as getSandboxGatewayStateForStatus
Caller->>SandboxStatus: sandboxStatus(sandboxName)
SandboxStatus->>CaptureAsync: captureOpenshellForStatus(["inference","get"], opts)
CaptureAsync->>CaptureAsync: spawn(binary, args) and buffer stdout/stderr
alt Timeout triggered
CaptureAsync->>ProcessTree: signalProcessTree(SIGTERM)
Note over ProcessTree: Terminate process group (non-Windows) or child (Windows)
CaptureAsync->>ProcessTree: signalProcessTree(SIGKILL) (after killGraceMs)
CaptureAsync-->>SandboxStatus: result with ETIMEDOUT error and signal metadata
else Completes
CaptureAsync-->>SandboxStatus: result with stdout/stderr
end
SandboxStatus->>Gateway: getSandboxGatewayStateForStatus(sandboxName)
Gateway->>CaptureAsync: run status probe with timeout
alt Probe times out
Gateway-->>SandboxStatus: { state: "status_probe_timeout", ... }
else Probe succeeds
Gateway->>Gateway: mergeLivePolicyIntoSandboxOutput()
Gateway-->>SandboxStatus: { state: "present"/"missing"/... }
end
SandboxStatus-->>Caller: aggregated status output (includes timeout notice if applicable)
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~40 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Review rate limit: 9/10 reviews remaining, refill in 6 minutes. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/legacy-oclif-dispatch.ts`:
- Around line 108-110: In the "status" branch of legacy-oclif-dispatch.ts,
reject unexpected args by checking if actionArgs is non-empty and return a usage
error (instead of always routing to legacy); keep the existing help check via
hasHelpFlag(actionArgs) and, if actionArgs contains anything, return the same
kind of usage/usage-error object used elsewhere (e.g., the pattern used for
other commands) so callers get a usage error rather than silently falling
through to { kind: "legacy", target: "status" }.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 66608be5-38dd-4059-ac2e-feab262a92c1
📒 Files selected for processing (3)
src/lib/legacy-oclif-dispatch.test.tssrc/lib/legacy-oclif-dispatch.tssrc/nemoclaw.ts
Selective E2E Results — ✅ All requested jobs passedRun: 25253547462
|
Selective E2E Results — ✅ All requested jobs passedRun: 25254068156
|
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/nemoclaw.ts (1)
1061-1118:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftKeep the status path read-only here.
Passing
getSandboxGatewayStateForStatusinto this helper does not makestatussafe yet: themissingbranch still goes throughreconcileMissingAgainstNamedGateway(...), and thegateway_errorbranch still callsrecoverNamedGatewayRuntime(). Those paths use the normal sync OpenShell probes and recovery actions, sosandboxStatus()can still select/start gateways and eventually hit the registry-removal path downstream on amissingresult.Suggested direction
async function getReconciledSandboxGatewayState( sandboxName: string, - opts: { getState?: SandboxGatewayStateLookup } = {}, + opts: { + getState?: SandboxGatewayStateLookup; + allowRecovery?: boolean; + } = {}, ) { const getState = opts.getState ?? getSandboxGatewayState; + const allowRecovery = opts.allowRecovery ?? true; let lookup = await getState(sandboxName); if (lookup.state === "present") { return lookup; } if (lookup.state === "missing") { + if (!allowRecovery) return lookup; return reconcileMissingAgainstNamedGateway(sandboxName, lookup); } if (lookup.state === "gateway_error") { + if (!allowRecovery) return lookup; const recovery = await recoverNamedGatewayRuntime(); if (recovery.recovered) { const retried = await getState(sandboxName);Then have
sandboxStatus()call this with recovery disabled and rendermissing/gateway_erroras informational output instead of mutating local state.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/nemoclaw.ts` around lines 1061 - 1118, getReconciledSandboxGatewayState currently performs reconciliation and recovery (calls reconcileMissingAgainstNamedGateway and recoverNamedGatewayRuntime) which mutates system state even when invoked from the read-only status path; add an explicit option like opts.readOnly (default false) or opts.allowRecovery and, when readOnly === true, skip calling reconcileMissingAgainstNamedGateway and recoverNamedGatewayRuntime and simply return the original lookup for "missing" and "gateway_error" cases (or map them to informational outputs) so no probes/restarts are triggered. Update callers (notably sandboxStatus) to call getReconciledSandboxGatewayState(sandboxName, { getState: getSandboxGatewayStateForStatus, readOnly: true }) and render the returned "missing"/"gateway_error" as informational rather than mutating state. Ensure symbols referenced: getReconciledSandboxGatewayState, reconcileMissingAgainstNamedGateway, recoverNamedGatewayRuntime, sandboxStatus, getSandboxGatewayStateForStatus.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/nemoclaw.ts`:
- Around line 1061-1118: getReconciledSandboxGatewayState currently performs
reconciliation and recovery (calls reconcileMissingAgainstNamedGateway and
recoverNamedGatewayRuntime) which mutates system state even when invoked from
the read-only status path; add an explicit option like opts.readOnly (default
false) or opts.allowRecovery and, when readOnly === true, skip calling
reconcileMissingAgainstNamedGateway and recoverNamedGatewayRuntime and simply
return the original lookup for "missing" and "gateway_error" cases (or map them
to informational outputs) so no probes/restarts are triggered. Update callers
(notably sandboxStatus) to call getReconciledSandboxGatewayState(sandboxName, {
getState: getSandboxGatewayStateForStatus, readOnly: true }) and render the
returned "missing"/"gateway_error" as informational rather than mutating state.
Ensure symbols referenced: getReconciledSandboxGatewayState,
reconcileMissingAgainstNamedGateway, recoverNamedGatewayRuntime, sandboxStatus,
getSandboxGatewayStateForStatus.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a195a385-c0db-4e80-bc83-20f9ee215a8b
📒 Files selected for processing (2)
src/lib/legacy-oclif-dispatch.test.tssrc/nemoclaw.ts
✅ Files skipped from review due to trivial changes (1)
- src/lib/legacy-oclif-dispatch.test.ts
Selective E2E Results — ✅ All requested jobs passedRun: 25254765479
|
## Summary Ports the useful behavior from #2626 onto current main without depending on #2881 or #2822. - reconcile sandbox/gateway state before probing or printing inference provider health - suppress `Inference: healthy` unless the selected sandbox/gateway is verified present - exit non-zero for degraded or unverified sandbox/gateway status states - remove only the selected stale registry entry when status verifies that named sandbox is absent from a healthy NemoClaw gateway; status does not sweep unrelated registry-only entries - keep `nemoclaw list` rendering registered sandboxes when runtime inference probing is degraded Fixes #2595 Refs #2666 Refs #2604 ## Validation - `npm run build:cli` - `npx vitest run test/cli.test.ts -t "status|runtime inference probing"` - `npx vitest run test/cli.test.ts -t "gateway|inference|orphan|verified"` - `npx vitest run test/cli.test.ts` - `npm run typecheck:cli` - `npm run typecheck` - `npx biome lint src/nemoclaw.ts test/cli.test.ts` - `git diff --check` <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Automatic generation and application of an initial sandbox create policy based on enabled messaging channels; merged policy presets are applied at sandbox boot. * **Bug Fixes** * Status now only reports inference when gateway/sandbox state is verified; otherwise shows "Inference: not verified..." and exits non-zero for unverified states. * Status removes the selected stale/orphaned registry entry only when that named sandbox is verified absent from a healthy NemoClaw gateway; it does not sweep unrelated registry-only entries. List/status gracefully fall back when inference probing is degraded. * **Tests** * CLI, e2e, and unit tests updated to expect non-zero exits and verification-gated inference reporting; policy merge behavior covered. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Summary
nemoclaw <sandbox> statusthrough the Oclifsandbox:statuscommand pathnemoclaw <sandbox> statusso live OpenShell probes cannot hang the command when descendantopenshell/sshprocesses keep captured pipes openstatus --helpon the public sandbox-scoped usage pathdiagnostics-e2eEvidence
diagnostics-e2etimed out on main run 25244372371 ata8dfa272afterTC-DIAG-05: openclaw.json readable inside sandbox, hanging atChecking nemoclaw status from host...until the 45-minute job timeout32d3ab3e, again hanging atChecking nemoclaw status from host...with orphanopenshell/sshchildren killed by job cleanupd73371dareproduced the same failure:TC-DIAG-05passed the in-sandbox config read at 13:27:44, then hung atChecking nemoclaw status from host...until GitHub cancelled at 13:56:55; cleanup killed the outertimeout/bashplus twoopenshell/sshpairs84b4955cconfirmed that routing alone was insufficient: it hit the sameChecking nemoclaw status from host...hang and the same orphanopenshell/sshcleanup2956cd0epasseddiagnostics-e2e; TC-DIAG-05 moved fromChecking nemoclaw status from host...at 14:41:31Z toPASS TC-DIAG-05: nemoclaw status shows Model fieldat 14:41:32Z, and the job endedPASS: 8/FAIL: 03224a5c5passeddiagnostics-e2e; TC-DIAG-05 moved fromChecking nemoclaw status from host...at 15:16:11Z toPASS TC-DIAG-05: nemoclaw status shows Model fieldat 15:16:11Z, and the job endedPASS: 8/FAIL: 0b83ffe20, where the same status check completed in about 6 secondsOclif conformance
resolveSandboxOclifDispatch("<name>", "status", [])now returns{ kind: "oclif", commandId: "sandbox:status", args: ["<name>"] }LegacyDispatchno longer includesstatusrunDispatchResult()no longer has a legacystatusbranchSandboxStatusCommandremains the Oclif command surface and invokes the current status runtime implementation through the existing runtime bridge pattern used by migrated commands in this reponemoclaw <name> status --helpstill renders<name> statusrather than exposing the internalsandbox:statusidValidation
npm test -- src/lib/openshell.test.ts src/lib/sandbox-version.test.ts src/lib/legacy-oclif-dispatch.test.tsnpm run build:clinpm test -- test/cli.test.ts -t "keeps status bounded|keeps registry entries when status hits|recovers status after gateway runtime|status --help|status shows|sandbox inspection help"npm run typecheck:clinpm run lint -- src/lib/legacy-oclif-dispatch.ts src/lib/legacy-oclif-dispatch.test.ts src/nemoclaw.tsnpm run format:check -- src/lib/legacy-oclif-dispatch.ts src/lib/legacy-oclif-dispatch.test.ts src/nemoclaw.tsgit diff --checkdiagnostics-e2erun 25254068156 on2956cd0e: passdiagnostics-e2erun 25254765479 on Oclif-routing SHA3224a5c5: passSummary by CodeRabbit
Release Notes
New Features
Improvements
Tests