Skip to content

fix(cli): preserve sandbox registry when active gateway drifts (#2276)#2330

Merged
ericksoa merged 3 commits into
NVIDIA:mainfrom
TonyLuo-NV:fix/2276-openshell-stop-destroys-sandbox
Apr 23, 2026
Merged

fix(cli): preserve sandbox registry when active gateway drifts (#2276)#2330
ericksoa merged 3 commits into
NVIDIA:mainfrom
TonyLuo-NV:fix/2276-openshell-stop-destroys-sandbox

Conversation

@TonyLuo-NV

@TonyLuo-NV TonyLuo-NV commented Apr 23, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #2276 — on Ubuntu (and any platform), running openshell gateway stop && openshell gateway start (without --name nemoclaw) silently switches the active OpenShell gateway away from nemoclaw. The subsequent nemoclaw <name> connect was destroying the local registry entry because `openshell sandbox get ` returned NotFound against the wrong gateway. Users had to recreate the sandbox from scratch, losing telegram user IDs and other config.

Root cause

getSandboxGatewayState queries the currently active gateway. Its `NotFound` is ambiguous — the sandbox may be perfectly fine on the nemoclaw gateway but invisible because a different gateway is active. The previous code short-circuited on `"missing"` and immediately called `registry.removeSandbox(sandboxName)`.

Fix

Route `"missing"` through a new `reconcileMissingAgainstNamedGateway` helper that checks `getNamedGatewayLifecycleState()` before touching the registry:

  1. connected_other (nemoclaw gateway healthy, different gateway active) — silently attempt openshell gateway select nemoclaw and re-query. On success, proceed as if the sandbox was present. On persistent NotFound, surface `wrong_gateway_active` with actionable guidance; preserve the registry entry.
  2. missing_named / named_unreachable / named_unhealthy — route to the existing `gateway_missing_after_restart` / `gateway_unreachable_after_restart` branches which already skip removal and print tailored hints.
  3. healthy_named + still NotFound — truly gone; proceed with the existing destructive path.

Belt-and-suspenders guards in both `ensureLiveSandboxOrExit` and `sandboxStatus` re-check lifecycle state immediately before `registry.removeSandbox`, delegating to the correct per-state message helper (printWrongGatewayActiveGuidance for drift, printGatewayLifecycleHint for unreachable/missing gateway) so users always see accurate recovery instructions even under TOCTOU races.

Non-goals

  • OpenShell's own behavior (plain openshell gateway start changing the active pointer) is unchanged — that is an upstream concern. This PR makes NemoClaw resilient to it.

API impact

  • getNamedGatewayLifecycleState additively returns an `activeGateway` field on every branch. Existing consumers destructure only `state` / `status` / `gatewayInfo`, so this is backward-compatible.
  • No env vars, no config surface, no new CLI flags.

Test plan

  • npx vitest run test/gateway-state-reconcile-2276.test.ts — 13/13 passing. Covers the 12 scenarios from the architect's design: regression guards, the openshell stop destroys sandbox #2276 exact repro, `wrong_gateway_active` self-heal success/failure paths, gateway-unhealthy fallbacks, cross-command parity (connect / status / skill install), and non-interactive mode.
  • npm run typecheck:cli passes.
  • Full 3-file suite (gateway-state + nemoclaw-cli-recovery + new): 51/51 passing.
  • CI green.
  • Manual verification by the original reporter on Ubuntu 24.04.

Signed-off-by: Tony Luo xialuo@nvidia.com

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Detect when the active gateway has drifted from NemoClaw and report a clear "wrong gateway active" status.
    • Show explicit, targeted guidance for switching gateways when mismatches occur.
    • Prevent accidental removal of sandbox registry entries unless the named NemoClaw gateway is confirmed active.
    • Ensure non-interactive runs exit deterministically without prompt-like output.
  • Tests

    • Added comprehensive regression tests covering gateway drift, status parity, metadata edge cases, and registry behavior.

…A#2276)

When a user runs `openshell gateway stop && openshell gateway start`
without `--name nemoclaw`, the default OpenShell gateway becomes active
and `nemoclaw <name> connect` silently wiped the local registry entry
because `openshell sandbox get <name>` returned NotFound against the
wrong active gateway.

Route `"missing"` through `reconcileMissingAgainstNamedGateway` before
touching registry state:
- If the nemoclaw gateway is healthy but not active (`connected_other`),
  try a silent `openshell gateway select nemoclaw` and re-query. On
  success, proceed normally. On failure, surface `wrong_gateway_active`
  with actionable guidance; do not remove the registry entry.
- If the nemoclaw gateway is missing/unreachable, route to the existing
  `gateway_missing_after_restart` / `gateway_unreachable_after_restart`
  branches (which already print tailored guidance and skip removal).
- Only when the nemoclaw gateway is demonstrably healthy-active and the
  sandbox is still NotFound do we proceed to the existing destructive
  path.

Belt-and-suspenders: both `ensureLiveSandboxOrExit` and `sandboxStatus`
re-check `getNamedGatewayLifecycleState()` immediately before calling
`registry.removeSandbox`, delegating to the proper per-state message
helper (`printWrongGatewayActiveGuidance` for drift,
`printGatewayLifecycleHint` for unreachable/missing gateway) so users
always see the correct recovery instructions.

No public API surface changes. `getNamedGatewayLifecycleState` now
returns an additional `activeGateway` field (additive).

Tests in test/gateway-state-reconcile-2276.test.ts cover all 12
scenarios from the architect's design: regression guards for truly
missing sandboxes, the NVIDIA#2276 drift repro, cross-command parity
(connect/status/skill install), gateway-unhealthy fallbacks, and
non-interactive mode.

Signed-off-by: Tony Luo <xialuo@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@copy-pr-bot

copy-pr-bot Bot commented Apr 23, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 1d37345c-9b9f-4d37-ba35-e211aee26592

📥 Commits

Reviewing files that changed from the base of the PR and between f7a8616 and cae53d8.

📒 Files selected for processing (1)
  • test/gateway-state-reconcile-2276.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/gateway-state-reconcile-2276.test.ts

📝 Walkthrough

Walkthrough

Detects when OpenShell’s active gateway differs from the configured nemoclaw gateway during sandbox reconciliation. On sandbox get NotFound Nemoclaw attempts openshell gateway select nemoclaw, retries the lookup, reports wrong_gateway_active when gateways differ, and avoids removing local registry entries, emitting guidance instead.

Changes

Cohort / File(s) Summary
Gateway Reconciliation Logic
src/nemoclaw.ts
Propagates activeGateway in lifecycle reporting. On sandbox get NotFound, runs openshell gateway select nemoclaw, retries sandbox get, queries gateway info, and returns wrong_gateway_active if the active gateway differs. Guards removal of registry entries and emits targeted gateway-switch guidance.
Regression & CLI Tests
test/gateway-state-reconcile-2276.test.ts, test/cli.test.ts
Adds a comprehensive regression test suite using an openshell stub to simulate sandbox get, status, gateway info, and gateway select behaviors across many scenarios (gateway drift, missing/malformed metadata, non-interactive). Updates cli.test.ts mocks to exercise the removal path when nemoclaw is active. Asserts registry and onboard-session state and stdout/stderr messages.

Sequence Diagram

sequenceDiagram
    participant User
    participant Nemoclaw
    participant Registry as Local Registry
    participant OpenShell as OpenShell Gateway

    User->>Nemoclaw: connect / status / install
    Nemoclaw->>Registry: read sandbox entry
    Nemoclaw->>OpenShell: sandbox get <name>

    alt Found
        OpenShell-->>Nemoclaw: sandbox data
        Nemoclaw->>User: proceed
    else NotFound
        OpenShell-->>Nemoclaw: NotFound
        Nemoclaw->>OpenShell: gateway select nemoclaw
        OpenShell-->>Nemoclaw: select result
        Nemoclaw->>OpenShell: sandbox get <name> (retry)

        alt Still NotFound
            OpenShell-->>Nemoclaw: NotFound
            Nemoclaw->>OpenShell: gateway info
            OpenShell-->>Nemoclaw: activeGateway info

            alt activeGateway == nemoclaw
                Nemoclaw->>Registry: remove stale entry
                Nemoclaw->>User: "Removed stale local registry entry"
            else activeGateway != nemoclaw
                Nemoclaw->>User: "wrong_gateway_active" + guidance (preserve registry)
            end
        end
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I sniffed the gateways, twitching my nose,
Found one astray where the wild wind blows,
I nudged a select, then checked once more,
Kept the registry safe, left no sore,
Hopping off — "Guide, don't remove!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 46.15% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: preventing sandbox registry removal when the active OpenShell gateway drifts from nemoclaw, addressing the core issue #2276.
Linked Issues check ✅ Passed The pull request directly addresses issue #2276 by detecting gateway drift, providing recovery guidance, and preventing destructive registry removal when the active gateway differs from nemoclaw.
Out of Scope Changes check ✅ Passed All changes are scoped to fixing issue #2276: the reconciliation logic in nemoclaw.ts, comprehensive regression tests in gateway-state-reconcile-2276.test.ts, and a supporting fix in cli.test.ts.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
test/gateway-state-reconcile-2276.test.ts (2)

357-395: Test description mentions "connect" but runs "status" command.

Scenario 3's describe block says "connect — select succeeds..." but the test calls runCli("status") on line 379. This appears intentional (testing that the reconciliation works for both commands), but the description could be clarified to avoid confusion.

📝 Suggested description clarification
-describe("Scenario 3: connect — select succeeds, sandbox reappears, registry intact", () => {
+describe("Scenario 3: status — select succeeds, sandbox reappears, registry intact", () => {

Or if testing both commands is intended, consider splitting into two tests or updating the description to reflect that status is being tested here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/gateway-state-reconcile-2276.test.ts` around lines 357 - 395, Update the
test description to accurately reflect the command under test: change the
describe title "Scenario 3: connect — select succeeds..." (and/or the it title
"attempts `gateway select nemoclaw`, re-queries, proceeds; registry preserved")
to mention "status" (or split into two tests) so it matches the actual
invocation runCli("status"); ensure references to gateway select behavior remain
if the test is intended to validate reconciliation for the status command (or
create a separate test that calls runCli("connect") with the same stub setup).

260-261: Unused field statusCallsAfterSelect is always zero.

The statusCallsAfterSelect field in HarnessResult is always set to 0 and never computed. If this metric isn't needed for assertions, consider removing it from the interface and return object to avoid confusion.

🔧 Proposed cleanup
 interface HarnessResult {
   status: number | null;
   stdout: string;
   stderr: string;
   registryExists: boolean;
   registry: any;
   sessionSandboxName: string | null | undefined;
   callLog: Array<string[]>;
-  statusCallsAfterSelect: number;
   selectCalls: number;
 }

And in runCli:

   return {
     status: result.status,
     stdout: result.stdout || "",
     stderr: result.stderr || "",
     registryExists,
     registry,
     sessionSandboxName,
     callLog,
-    statusCallsAfterSelect: 0,
     selectCalls,
   };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/gateway-state-reconcile-2276.test.ts` around lines 260 - 261, The
HarnessResult field statusCallsAfterSelect is unused and always zero; remove the
statusCallsAfterSelect property from the HarnessResult type and from the object
returned by runCli (and any callers/tests that expect it), and update any code
or assertions referencing statusCallsAfterSelect to use the existing selectCalls
or another real metric instead so no dead/always-zero field remains.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@test/gateway-state-reconcile-2276.test.ts`:
- Around line 357-395: Update the test description to accurately reflect the
command under test: change the describe title "Scenario 3: connect — select
succeeds..." (and/or the it title "attempts `gateway select nemoclaw`,
re-queries, proceeds; registry preserved") to mention "status" (or split into
two tests) so it matches the actual invocation runCli("status"); ensure
references to gateway select behavior remain if the test is intended to validate
reconciliation for the status command (or create a separate test that calls
runCli("connect") with the same stub setup).
- Around line 260-261: The HarnessResult field statusCallsAfterSelect is unused
and always zero; remove the statusCallsAfterSelect property from the
HarnessResult type and from the object returned by runCli (and any callers/tests
that expect it), and update any code or assertions referencing
statusCallsAfterSelect to use the existing selectCalls or another real metric
instead so no dead/always-zero field remains.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 4745dc05-fca9-41bb-902e-4fc4b71bc239

📥 Commits

Reviewing files that changed from the base of the PR and between d9aced4 and b053430.

📒 Files selected for processing (2)
  • src/nemoclaw.ts
  • test/gateway-state-reconcile-2276.test.ts

TonyLuo-NV and others added 2 commits April 23, 2026 17:08
…VIDIA#2276)

The belt-and-suspenders guard introduced by NVIDIA#2276 preserves the local
registry entry whenever `getNamedGatewayLifecycleState()` does not
report `healthy_named`. The existing "removes stale registry entries
when connect targets a missing live sandbox" fixture only stubbed
`openshell sandbox get` and left every other openshell command returning
exit 0 with empty output, which the new guard classifies as
`missing_named` — so the registry stayed intact and the test regressed.

Extend the fake `openshell` to simulate a healthy, active `nemoclaw`
named gateway (`status` reports `Gateway: nemoclaw` + `Status: Connected`,
and `gateway info` reports `Gateway: nemoclaw`). That is the scenario
the test is meant to exercise: gateway is fine but the sandbox is truly
gone, so removal is expected and the "Removed stale local registry
entry." message should appear.

Signed-off-by: Tony Luo <xialuo@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Scenario 3 describe said "connect" but the test invokes `runCli("status")`.
  Rename to "status" to match.
- Remove the unused `statusCallsAfterSelect` field from HarnessResult and
  its always-zero initialization in runCli. Nothing reads it.

Signed-off-by: Tony Luo <xialuo@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wscurran wscurran added bug Something fails against expected or documented behavior NemoClaw CLI labels Apr 23, 2026
@wscurran

Copy link
Copy Markdown
Contributor

✨ Thanks for submitting this issue that identifies a bug with the sandbox registry and proposes a fix.


Related open issues:

@wscurran wscurran added the platform: ubuntu Affects Ubuntu Linux environments label Apr 23, 2026

@ericksoa ericksoa left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-designed fix. The core insight — NotFound is ambiguous when the active gateway isn't nemoclaw — is handled correctly at every layer.

Looks good

  • reconcileMissingAgainstNamedGateway — clean state machine: tries self-heal via gateway select, falls through to actionable guidance on failure, only allows the destructive path when healthy_named is confirmed. The post-select re-check of lifecycle state (not just sandbox presence) catches the edge case where select appears to succeed but the gateway isn't actually healthy.
  • Belt-and-suspenders guards in both ensureLiveSandboxOrExit and sandboxStatus — re-checking getNamedGatewayLifecycleState immediately before registry.removeSandbox is the right call given the TOCTOU window between the reconciler and the actual removal.
  • activeGateway addition to getNamedGatewayLifecycleState — additive, backward-compatible, and needed for the guidance message.
  • Test coverage is excellent — 12 scenarios covering the exact repro, self-heal success/failure, cross-command parity (connect/status/skill install), gateway-missing/unreachable fallbacks, empty/malformed output, and non-interactive mode. The stub openshell with cycling arrays is a solid test harness.
  • Existing cli.test.ts update — correctly adds gateway simulation so the lifecycle guard sees healthy_named and the pre-existing "removes stale registry" test doesn't regress.

Nit (non-blocking)

The selectFlipsActive field in the test stub's ScenarioScript interface is written to state.postSelect but never read back — the test correctness depends on the array ordering in sandboxGet/status, not on selectFlipsActive. Dead code in the stub, but no functional impact.

LGTM.

@ericksoa ericksoa merged commit 3c5add7 into NVIDIA:main Apr 23, 2026
15 checks passed
DemianHeyGen pushed a commit to DemianHeyGen/NemoClaw that referenced this pull request Apr 30, 2026
…A#2276) (NVIDIA#2330)

## Summary

Fixes NVIDIA#2276 — on Ubuntu (and any platform), running `openshell gateway
stop && openshell gateway start` (without `--name nemoclaw`) silently
switches the active OpenShell gateway away from `nemoclaw`. The
subsequent `nemoclaw <name> connect` was destroying the local registry
entry because \`openshell sandbox get <name>\` returned NotFound against
the wrong gateway. Users had to recreate the sandbox from scratch,
losing telegram user IDs and other config.

## Root cause

`getSandboxGatewayState` queries the currently active gateway. Its
\`NotFound\` is ambiguous — the sandbox may be perfectly fine on the
`nemoclaw` gateway but invisible because a different gateway is active.
The previous code short-circuited on \`"missing"\` and immediately
called \`registry.removeSandbox(sandboxName)\`.

## Fix

Route \`"missing"\` through a new
\`reconcileMissingAgainstNamedGateway\` helper that checks
\`getNamedGatewayLifecycleState()\` before touching the registry:

1. **`connected_other`** (nemoclaw gateway healthy, different gateway
active) — silently attempt `openshell gateway select nemoclaw` and
re-query. On success, proceed as if the sandbox was present. On
persistent NotFound, surface \`wrong_gateway_active\` with actionable
guidance; preserve the registry entry.
2. **`missing_named` / `named_unreachable` / `named_unhealthy`** — route
to the existing \`gateway_missing_after_restart\` /
\`gateway_unreachable_after_restart\` branches which already skip
removal and print tailored hints.
3. **`healthy_named` + still NotFound** — truly gone; proceed with the
existing destructive path.

Belt-and-suspenders guards in both \`ensureLiveSandboxOrExit\` and
\`sandboxStatus\` re-check lifecycle state immediately before
\`registry.removeSandbox\`, delegating to the correct per-state message
helper (`printWrongGatewayActiveGuidance` for drift,
`printGatewayLifecycleHint` for unreachable/missing gateway) so users
always see accurate recovery instructions even under TOCTOU races.

## Non-goals

- OpenShell's own behavior (plain `openshell gateway start` changing the
active pointer) is unchanged — that is an upstream concern. This PR
makes NemoClaw resilient to it.

## API impact

- `getNamedGatewayLifecycleState` additively returns an
\`activeGateway\` field on every branch. Existing consumers destructure
only \`state\` / \`status\` / \`gatewayInfo\`, so this is
backward-compatible.
- No env vars, no config surface, no new CLI flags.

## Test plan

- [x] `npx vitest run test/gateway-state-reconcile-2276.test.ts` — 13/13
passing. Covers the 12 scenarios from the architect's design: regression
guards, the NVIDIA#2276 exact repro, \`wrong_gateway_active\` self-heal
success/failure paths, gateway-unhealthy fallbacks, cross-command parity
(connect / status / skill install), and non-interactive mode.
- [x] `npm run typecheck:cli` passes.
- [x] Full 3-file suite (gateway-state + nemoclaw-cli-recovery + new):
51/51 passing.
- [ ] CI green.
- [ ] Manual verification by the original reporter on Ubuntu 24.04.

Signed-off-by: Tony Luo <xialuo@nvidia.com>

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Bug Fixes**
- Detect when the active gateway has drifted from NemoClaw and report a
clear "wrong gateway active" status.
- Show explicit, targeted guidance for switching gateways when
mismatches occur.
- Prevent accidental removal of sandbox registry entries unless the
named NemoClaw gateway is confirmed active.
- Ensure non-interactive runs exit deterministically without prompt-like
output.

- **Tests**
- Added comprehensive regression tests covering gateway drift, status
parity, metadata edge cases, and registry behavior.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Tony Luo <xialuo@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@TonyLuo-NV TonyLuo-NV deleted the fix/2276-openshell-stop-destroys-sandbox branch May 6, 2026 08:14
@wscurran wscurran added area: cli Command line interface, flags, terminal UX, or output bug-fix PR fixes a bug or regression and removed NemoClaw CLI bug Something fails against expected or documented behavior labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: cli Command line interface, flags, terminal UX, or output bug-fix PR fixes a bug or regression platform: ubuntu Affects Ubuntu Linux environments

Projects

None yet

Development

Successfully merging this pull request may close these issues.

openshell stop destroys sandbox

3 participants