fix(cli): preserve sandbox registry when active gateway drifts (#2276)#2330
Conversation
…A#2276) When a user runs `openshell gateway stop && openshell gateway start` without `--name nemoclaw`, the default OpenShell gateway becomes active and `nemoclaw <name> connect` silently wiped the local registry entry because `openshell sandbox get <name>` returned NotFound against the wrong active gateway. Route `"missing"` through `reconcileMissingAgainstNamedGateway` before touching registry state: - If the nemoclaw gateway is healthy but not active (`connected_other`), try a silent `openshell gateway select nemoclaw` and re-query. On success, proceed normally. On failure, surface `wrong_gateway_active` with actionable guidance; do not remove the registry entry. - If the nemoclaw gateway is missing/unreachable, route to the existing `gateway_missing_after_restart` / `gateway_unreachable_after_restart` branches (which already print tailored guidance and skip removal). - Only when the nemoclaw gateway is demonstrably healthy-active and the sandbox is still NotFound do we proceed to the existing destructive path. Belt-and-suspenders: both `ensureLiveSandboxOrExit` and `sandboxStatus` re-check `getNamedGatewayLifecycleState()` immediately before calling `registry.removeSandbox`, delegating to the proper per-state message helper (`printWrongGatewayActiveGuidance` for drift, `printGatewayLifecycleHint` for unreachable/missing gateway) so users always see the correct recovery instructions. No public API surface changes. `getNamedGatewayLifecycleState` now returns an additional `activeGateway` field (additive). Tests in test/gateway-state-reconcile-2276.test.ts cover all 12 scenarios from the architect's design: regression guards for truly missing sandboxes, the NVIDIA#2276 drift repro, cross-command parity (connect/status/skill install), gateway-unhealthy fallbacks, and non-interactive mode. Signed-off-by: Tony Luo <xialuo@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughDetects when OpenShell’s active gateway differs from the configured Changes
Sequence DiagramsequenceDiagram
participant User
participant Nemoclaw
participant Registry as Local Registry
participant OpenShell as OpenShell Gateway
User->>Nemoclaw: connect / status / install
Nemoclaw->>Registry: read sandbox entry
Nemoclaw->>OpenShell: sandbox get <name>
alt Found
OpenShell-->>Nemoclaw: sandbox data
Nemoclaw->>User: proceed
else NotFound
OpenShell-->>Nemoclaw: NotFound
Nemoclaw->>OpenShell: gateway select nemoclaw
OpenShell-->>Nemoclaw: select result
Nemoclaw->>OpenShell: sandbox get <name> (retry)
alt Still NotFound
OpenShell-->>Nemoclaw: NotFound
Nemoclaw->>OpenShell: gateway info
OpenShell-->>Nemoclaw: activeGateway info
alt activeGateway == nemoclaw
Nemoclaw->>Registry: remove stale entry
Nemoclaw->>User: "Removed stale local registry entry"
else activeGateway != nemoclaw
Nemoclaw->>User: "wrong_gateway_active" + guidance (preserve registry)
end
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
test/gateway-state-reconcile-2276.test.ts (2)
357-395: Test description mentions "connect" but runs "status" command.Scenario 3's
describeblock says "connect — select succeeds..." but the test callsrunCli("status")on line 379. This appears intentional (testing that the reconciliation works for both commands), but the description could be clarified to avoid confusion.📝 Suggested description clarification
-describe("Scenario 3: connect — select succeeds, sandbox reappears, registry intact", () => { +describe("Scenario 3: status — select succeeds, sandbox reappears, registry intact", () => {Or if testing both commands is intended, consider splitting into two tests or updating the description to reflect that status is being tested here.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/gateway-state-reconcile-2276.test.ts` around lines 357 - 395, Update the test description to accurately reflect the command under test: change the describe title "Scenario 3: connect — select succeeds..." (and/or the it title "attempts `gateway select nemoclaw`, re-queries, proceeds; registry preserved") to mention "status" (or split into two tests) so it matches the actual invocation runCli("status"); ensure references to gateway select behavior remain if the test is intended to validate reconciliation for the status command (or create a separate test that calls runCli("connect") with the same stub setup).
260-261: Unused fieldstatusCallsAfterSelectis always zero.The
statusCallsAfterSelectfield inHarnessResultis always set to0and never computed. If this metric isn't needed for assertions, consider removing it from the interface and return object to avoid confusion.🔧 Proposed cleanup
interface HarnessResult { status: number | null; stdout: string; stderr: string; registryExists: boolean; registry: any; sessionSandboxName: string | null | undefined; callLog: Array<string[]>; - statusCallsAfterSelect: number; selectCalls: number; }And in
runCli:return { status: result.status, stdout: result.stdout || "", stderr: result.stderr || "", registryExists, registry, sessionSandboxName, callLog, - statusCallsAfterSelect: 0, selectCalls, };🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/gateway-state-reconcile-2276.test.ts` around lines 260 - 261, The HarnessResult field statusCallsAfterSelect is unused and always zero; remove the statusCallsAfterSelect property from the HarnessResult type and from the object returned by runCli (and any callers/tests that expect it), and update any code or assertions referencing statusCallsAfterSelect to use the existing selectCalls or another real metric instead so no dead/always-zero field remains.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@test/gateway-state-reconcile-2276.test.ts`:
- Around line 357-395: Update the test description to accurately reflect the
command under test: change the describe title "Scenario 3: connect — select
succeeds..." (and/or the it title "attempts `gateway select nemoclaw`,
re-queries, proceeds; registry preserved") to mention "status" (or split into
two tests) so it matches the actual invocation runCli("status"); ensure
references to gateway select behavior remain if the test is intended to validate
reconciliation for the status command (or create a separate test that calls
runCli("connect") with the same stub setup).
- Around line 260-261: The HarnessResult field statusCallsAfterSelect is unused
and always zero; remove the statusCallsAfterSelect property from the
HarnessResult type and from the object returned by runCli (and any callers/tests
that expect it), and update any code or assertions referencing
statusCallsAfterSelect to use the existing selectCalls or another real metric
instead so no dead/always-zero field remains.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 4745dc05-fca9-41bb-902e-4fc4b71bc239
📒 Files selected for processing (2)
src/nemoclaw.tstest/gateway-state-reconcile-2276.test.ts
…VIDIA#2276) The belt-and-suspenders guard introduced by NVIDIA#2276 preserves the local registry entry whenever `getNamedGatewayLifecycleState()` does not report `healthy_named`. The existing "removes stale registry entries when connect targets a missing live sandbox" fixture only stubbed `openshell sandbox get` and left every other openshell command returning exit 0 with empty output, which the new guard classifies as `missing_named` — so the registry stayed intact and the test regressed. Extend the fake `openshell` to simulate a healthy, active `nemoclaw` named gateway (`status` reports `Gateway: nemoclaw` + `Status: Connected`, and `gateway info` reports `Gateway: nemoclaw`). That is the scenario the test is meant to exercise: gateway is fine but the sandbox is truly gone, so removal is expected and the "Removed stale local registry entry." message should appear. Signed-off-by: Tony Luo <xialuo@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Scenario 3 describe said "connect" but the test invokes `runCli("status")`.
Rename to "status" to match.
- Remove the unused `statusCallsAfterSelect` field from HarnessResult and
its always-zero initialization in runCli. Nothing reads it.
Signed-off-by: Tony Luo <xialuo@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
✨ Thanks for submitting this issue that identifies a bug with the sandbox registry and proposes a fix. Related open issues: |
ericksoa
left a comment
There was a problem hiding this comment.
Well-designed fix. The core insight — NotFound is ambiguous when the active gateway isn't nemoclaw — is handled correctly at every layer.
Looks good
reconcileMissingAgainstNamedGateway— clean state machine: tries self-heal viagateway select, falls through to actionable guidance on failure, only allows the destructive path whenhealthy_namedis confirmed. The post-select re-check of lifecycle state (not just sandbox presence) catches the edge case where select appears to succeed but the gateway isn't actually healthy.- Belt-and-suspenders guards in both
ensureLiveSandboxOrExitandsandboxStatus— re-checkinggetNamedGatewayLifecycleStateimmediately beforeregistry.removeSandboxis the right call given the TOCTOU window between the reconciler and the actual removal. activeGatewayaddition togetNamedGatewayLifecycleState— additive, backward-compatible, and needed for the guidance message.- Test coverage is excellent — 12 scenarios covering the exact repro, self-heal success/failure, cross-command parity (connect/status/skill install), gateway-missing/unreachable fallbacks, empty/malformed output, and non-interactive mode. The stub openshell with cycling arrays is a solid test harness.
- Existing
cli.test.tsupdate — correctly adds gateway simulation so the lifecycle guard seeshealthy_namedand the pre-existing "removes stale registry" test doesn't regress.
Nit (non-blocking)
The selectFlipsActive field in the test stub's ScenarioScript interface is written to state.postSelect but never read back — the test correctness depends on the array ordering in sandboxGet/status, not on selectFlipsActive. Dead code in the stub, but no functional impact.
LGTM.
…A#2276) (NVIDIA#2330) ## Summary Fixes NVIDIA#2276 — on Ubuntu (and any platform), running `openshell gateway stop && openshell gateway start` (without `--name nemoclaw`) silently switches the active OpenShell gateway away from `nemoclaw`. The subsequent `nemoclaw <name> connect` was destroying the local registry entry because \`openshell sandbox get <name>\` returned NotFound against the wrong gateway. Users had to recreate the sandbox from scratch, losing telegram user IDs and other config. ## Root cause `getSandboxGatewayState` queries the currently active gateway. Its \`NotFound\` is ambiguous — the sandbox may be perfectly fine on the `nemoclaw` gateway but invisible because a different gateway is active. The previous code short-circuited on \`"missing"\` and immediately called \`registry.removeSandbox(sandboxName)\`. ## Fix Route \`"missing"\` through a new \`reconcileMissingAgainstNamedGateway\` helper that checks \`getNamedGatewayLifecycleState()\` before touching the registry: 1. **`connected_other`** (nemoclaw gateway healthy, different gateway active) — silently attempt `openshell gateway select nemoclaw` and re-query. On success, proceed as if the sandbox was present. On persistent NotFound, surface \`wrong_gateway_active\` with actionable guidance; preserve the registry entry. 2. **`missing_named` / `named_unreachable` / `named_unhealthy`** — route to the existing \`gateway_missing_after_restart\` / \`gateway_unreachable_after_restart\` branches which already skip removal and print tailored hints. 3. **`healthy_named` + still NotFound** — truly gone; proceed with the existing destructive path. Belt-and-suspenders guards in both \`ensureLiveSandboxOrExit\` and \`sandboxStatus\` re-check lifecycle state immediately before \`registry.removeSandbox\`, delegating to the correct per-state message helper (`printWrongGatewayActiveGuidance` for drift, `printGatewayLifecycleHint` for unreachable/missing gateway) so users always see accurate recovery instructions even under TOCTOU races. ## Non-goals - OpenShell's own behavior (plain `openshell gateway start` changing the active pointer) is unchanged — that is an upstream concern. This PR makes NemoClaw resilient to it. ## API impact - `getNamedGatewayLifecycleState` additively returns an \`activeGateway\` field on every branch. Existing consumers destructure only \`state\` / \`status\` / \`gatewayInfo\`, so this is backward-compatible. - No env vars, no config surface, no new CLI flags. ## Test plan - [x] `npx vitest run test/gateway-state-reconcile-2276.test.ts` — 13/13 passing. Covers the 12 scenarios from the architect's design: regression guards, the NVIDIA#2276 exact repro, \`wrong_gateway_active\` self-heal success/failure paths, gateway-unhealthy fallbacks, cross-command parity (connect / status / skill install), and non-interactive mode. - [x] `npm run typecheck:cli` passes. - [x] Full 3-file suite (gateway-state + nemoclaw-cli-recovery + new): 51/51 passing. - [ ] CI green. - [ ] Manual verification by the original reporter on Ubuntu 24.04. Signed-off-by: Tony Luo <xialuo@nvidia.com> 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Bug Fixes** - Detect when the active gateway has drifted from NemoClaw and report a clear "wrong gateway active" status. - Show explicit, targeted guidance for switching gateways when mismatches occur. - Prevent accidental removal of sandbox registry entries unless the named NemoClaw gateway is confirmed active. - Ensure non-interactive runs exit deterministically without prompt-like output. - **Tests** - Added comprehensive regression tests covering gateway drift, status parity, metadata edge cases, and registry behavior. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Tony Luo <xialuo@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Fixes #2276 — on Ubuntu (and any platform), running
openshell gateway stop && openshell gateway start(without--name nemoclaw) silently switches the active OpenShell gateway away fromnemoclaw. The subsequentnemoclaw <name> connectwas destroying the local registry entry because `openshell sandbox get ` returned NotFound against the wrong gateway. Users had to recreate the sandbox from scratch, losing telegram user IDs and other config.Root cause
getSandboxGatewayStatequeries the currently active gateway. Its `NotFound` is ambiguous — the sandbox may be perfectly fine on thenemoclawgateway but invisible because a different gateway is active. The previous code short-circuited on `"missing"` and immediately called `registry.removeSandbox(sandboxName)`.Fix
Route `"missing"` through a new `reconcileMissingAgainstNamedGateway` helper that checks `getNamedGatewayLifecycleState()` before touching the registry:
connected_other(nemoclaw gateway healthy, different gateway active) — silently attemptopenshell gateway select nemoclawand re-query. On success, proceed as if the sandbox was present. On persistent NotFound, surface `wrong_gateway_active` with actionable guidance; preserve the registry entry.missing_named/named_unreachable/named_unhealthy— route to the existing `gateway_missing_after_restart` / `gateway_unreachable_after_restart` branches which already skip removal and print tailored hints.healthy_named+ still NotFound — truly gone; proceed with the existing destructive path.Belt-and-suspenders guards in both `ensureLiveSandboxOrExit` and `sandboxStatus` re-check lifecycle state immediately before `registry.removeSandbox`, delegating to the correct per-state message helper (
printWrongGatewayActiveGuidancefor drift,printGatewayLifecycleHintfor unreachable/missing gateway) so users always see accurate recovery instructions even under TOCTOU races.Non-goals
openshell gateway startchanging the active pointer) is unchanged — that is an upstream concern. This PR makes NemoClaw resilient to it.API impact
getNamedGatewayLifecycleStateadditively returns an `activeGateway` field on every branch. Existing consumers destructure only `state` / `status` / `gatewayInfo`, so this is backward-compatible.Test plan
npx vitest run test/gateway-state-reconcile-2276.test.ts— 13/13 passing. Covers the 12 scenarios from the architect's design: regression guards, the openshell stop destroys sandbox #2276 exact repro, `wrong_gateway_active` self-heal success/failure paths, gateway-unhealthy fallbacks, cross-command parity (connect / status / skill install), and non-interactive mode.npm run typecheck:clipasses.Signed-off-by: Tony Luo xialuo@nvidia.com
🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests