Skip to content

fix(onboard,uninstall): clearer diagnostics when no sandbox is registered#3464

Closed
laitingsheng wants to merge 4 commits into
mainfrom
fix/3456-onboard-uninstall-clearer-diagnostics
Closed

fix(onboard,uninstall): clearer diagnostics when no sandbox is registered#3464
laitingsheng wants to merge 4 commits into
mainfrom
fix/3456-onboard-uninstall-clearer-diagnostics

Conversation

@laitingsheng

@laitingsheng laitingsheng commented May 13, 2026

Copy link
Copy Markdown
Contributor

Summary

Resolve the two misleading-output threads in #3456: the GPU-passthrough recovery suggestion that prints a literal <name> placeholder and assumes a sandbox is registered, and the nemoclaw uninstall log line that reads Destroyed gateway 'nemoclaw' skipped (a self-contradiction).

Related Issue

Refs #3456 — partially addresses. This PR fixes only the State B "destroy --yes" suggestion and the uninstall "skipped" wording. The State A bridge-reachability / "host firewall is blocking" hint and the underlying GLIBC mismatch are out of scope; the bridge-route work in #3459 covers the install loop, and the GLIBC binary baseline sits on the OpenShell side. Do not auto-close #3456 with this PR alone.

Changes

  • src/lib/onboard/gpu-recovery.ts: new helper module exposing gpuPassthroughRecoveryLines(names: readonly string[] | null) (pure line-builder) and reportGpuPassthroughRecovery(emit, listRegisteredSandboxes) (registry-lookup + print wrapper with injectable deps for tests). Three branches:
    • One or more sandboxes registered: prints each as nemoclaw <name> destroy --yes with --cleanup-gateway appended to the last one (since destroy --yes otherwise preserves the gateway and would re-loop on the next onboard).
    • No sandboxes registered: prints openshell gateway destroy -g nemoclaw followed by nemoclaw onboard --gpu. Uses the bare gateway-removal command rather than nemoclaw uninstall because the latter npm uninstall -g nemoclaws the CLI, after which nemoclaw onboard --gpu would no longer exist.
    • Registry unreadable (lookup threw): prints Could not read the NemoClaw sandbox registry and the same direct gateway-removal recipe, so a registry failure is not silently rebranded as "no sandboxes registered".
  • src/lib/onboard.ts: when the reusable gateway is missing GPU passthrough, replace the literal nemoclaw <name> destroy --yes && nemoclaw onboard --gpu line with a call to reportGpuPassthroughRecovery(). Diff is net-neutral against the entrypoint-budget check.
  • src/lib/actions/uninstall/run-plan.ts: when an openshell cleanup step returns non-zero, drop the past-tense verb so the warning reads Skipped <target> (already absent or unavailable) instead of Destroyed gateway 'nemoclaw' skipped.
  • Unit tests: src/lib/onboard.test.ts covers the four input branches (null / [] / one / many), asserts --cleanup-gateway lands only on the last destroy, asserts the joined output never contains <name> or the broken nemoclaw uninstall && nemoclaw onboard chain, and exercises reportGpuPassthroughRecovery for both the happy and registry-throws paths. src/lib/actions/uninstall/run-plan.test.ts adds a case that forces every openshell call non-zero and asserts the warnings use the new wording (and no warning matches /^Destroyed .+ skipped$/).

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai tinsonl@nvidia.com

Summary by CodeRabbit

  • New Features

    • Enhanced GPU passthrough recovery guidance with step-by-step cleanup and re-onboard instructions that adapt to registry state.
  • Bug Fixes

    • More consistent uninstall warnings: standardized "Skipped ..." messages and clearer reporting when cleanup steps are omitted.
  • Tests

    • Added tests covering uninstall skip reporting and expanded GPU recovery guidance scenarios.

Review Change Stack

The "Existing gateway was started without GPU passthrough" suggestion
contained the literal `<name>` placeholder and assumed a sandbox was
registered; when none was, the suggested `nemoclaw <name> destroy --yes`
had nothing to act on and looped the user through `nemoclaw uninstall`
without saying so.

Also fix the uninstall log line: when an `openshell` cleanup step
returned non-zero (typically because the resource was already absent),
the warning read `Destroyed gateway 'nemoclaw' skipped`, which contradicts
itself.  Use a state-form wording so the message reads truthfully.

Refs #3456

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@coderabbitai

coderabbitai Bot commented May 13, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5a9e0110-8c5b-4b54-ac0c-899f134a3224

📥 Commits

Reviewing files that changed from the base of the PR and between 6cc0b14 and d67ba17.

📒 Files selected for processing (3)
  • src/lib/onboard.test.ts
  • src/lib/onboard.ts
  • src/lib/onboard/gpu-recovery.ts
✅ Files skipped from review due to trivial changes (1)
  • src/lib/onboard.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/onboard.ts

📝 Walkthrough

Walkthrough

Adds a GPU passthrough recovery helper integrated into onboarding and standardizes uninstall "skipped" warnings by normalizing action prefixes; tests cover recovery guidance variants and uninstall skip reporting edge cases.

Changes

GPU Passthrough Recovery Guidance

Layer / File(s) Summary
GPU recovery instruction helper
src/lib/onboard/gpu-recovery.ts
`gpuPassthroughRecoveryLines(registeredNames: readonly string[]
Onboarding integration and tests
src/lib/onboard.ts, src/lib/onboard.test.ts
Onboarding imports and calls reportGpuPassthroughRecovery() when GPU passthrough is missing on a reusable Docker-driver gateway. Tests validate gpuPassthroughRecoveryLines for null/empty/single/multiple inputs, --cleanup-gateway placement, absence of <name> placeholders, and reportGpuPassthroughRecovery printer and error-fallback behavior.

Uninstall Cleanup Message Standardization

Layer / File(s) Summary
Standardized skipped-resource warnings
src/lib/actions/uninstall/run-plan.ts, src/lib/actions/uninstall/run-plan.test.ts
runOptional() strips leading action verbs (Destroyed, Deleted, Stopped, Removed) from descriptions and emits Skipped <target> (already absent or unavailable) when a command exits non‑zero; on success it still logs the original description. Tests assert gateway and OpenShell sandbox skip warnings and forbid contradictory past‑tense phrasing like Destroyed ... skipped or Deleted ... skipped.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

OpenShell, Sandbox, v0.0.40

Suggested reviewers

  • jyaunches
  • ericksoa
  • prekshivyas

Poem

🐰 I hop through logs with careful cheer,
If GPUs hide or sandboxes fear,
I list the steps to nudge them right,
And tidy skips so warnings light—
A rabbit's nibble makes fixes clear.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the primary purpose of the PR: improving diagnostic messages when no sandbox is registered, which is the core issue addressed across the onboard and uninstall modules.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/3456-onboard-uninstall-clearer-diagnostics

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

@laitingsheng laitingsheng marked this pull request as draft May 13, 2026 16:10
@copy-pr-bot

copy-pr-bot Bot commented May 13, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions

github-actions Bot commented May 13, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: None
Optional E2E: deployment-services-e2e, gpu-double-onboard-e2e, double-onboard-e2e

Dispatch hint: deployment-services-e2e,double-onboard-e2e,gpu-double-onboard-e2e

Workflow run

Full advisor summary

Pi Semantic E2E Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • None. Both code paths are message-only refinements with comprehensive new vitest coverage (run-plan.test.ts + onboard.test.ts) over every observable branch (no sandboxes / unreadable registry / one sandbox / many sandboxes / placeholder absence; uninstall skip-wording, lsof-missing, swap-cleanup, openshell-failure). Runtime semantics (exit codes, gateway/sandbox lifecycle, credential handling, networking) are unchanged, so no required E2E gating is warranted.

Optional E2E

  • deployment-services-e2e (medium): Exercises 'uninstall --keep-openshell' (TC-DEPLOY-03), which is the real-world driver for runOptional() in run-plan.ts. Confirms the reworded skip-warnings still occur during a non-zero step exit without breaking the overall uninstall sequence.
  • gpu-double-onboard-e2e (high): Closest E2E to the onboard.ts touch point: re-onboards with the Ollama/GPU path on a real GPU runner. It does not deterministically hit the 'gateway lacks GPU passthrough' branch the PR rewrites, but it is the only existing job that re-runs onboard with --gpu against a pre-existing gateway and would catch any unintended import/wiring regression in onboard.ts from the new require('./onboard/gpu-recovery').
  • double-onboard-e2e (high): Lifecycle-recovery double-onboard flow on CPU; smoke check that the onboard.ts change does not regress reuse-of-healthy-gateway logic on the more common non-GPU path.

New E2E recommendations

  • onboard-gpu-recovery-messaging (low): The exact branch reportGpuPassthroughRecovery() guards (gateway healthy + reused + DeviceRequests is null/[] + opts.gpu set) is not deterministically exercised by any existing E2E job. A direct E2E that pre-creates a CPU-only NemoClaw gateway, then runs nemoclaw onboard --gpu against it, asserts process exits non-zero, asserts stderr lists each registered sandbox by name with destroy --yes and exactly one --cleanup-gateway on the last entry, and never contains the literal '' or nemoclaw uninstall && nemoclaw onboard strings would lock in the user-facing fix.
    • Suggested test: test/e2e/test-onboard-gpu-passthrough-recovery.sh — bring up nemoclaw gateway without GPU, register one or two sandboxes, run nemoclaw onboard --gpu, assert exit=1 and that emitted recovery lines match the schema produced by gpuPassthroughRecoveryLines (single-sandbox, multi-sandbox, and registry-unreadable variants).

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: deployment-services-e2e,double-onboard-e2e,gpu-double-onboard-e2e

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.test.ts`:
- Around line 11-16: Update the tests to forbid any use of the literal
placeholder "<name>" rather than allowing one exact variant; replace the
specific expectation that contains "No sandboxes... `nemoclaw <name> destroy`"
in the expect(lines).toEqual(...) array with a version that omits the `<name>`
placeholder, and add negative assertions that assert the rendered output does
not contain "<name>" (e.g., expect(lines.join('\n')).not.toMatch(/<name>/) or
similar). Apply the same change to the other occurrence around the 40-43 block
so both the positive expectation arrays and the negative checks uniformly reject
the "<name>" placeholder.

In `@src/lib/onboard.ts`:
- Around line 10417-10426: Extract the inline sandbox-name lookup into a new
helper getRegisteredSandboxNamesForGpuRecovery() in
src/lib/onboard/gpu-recovery.ts that performs the same try/catch behavior around
registry.listSandboxes().sandboxes.map(s => s.name).filter(Boolean) and returns
[] on error; then replace the block in onboard.ts (the current IIFE that assigns
registeredNames) with a simple call to
getRegisteredSandboxNamesForGpuRecovery(), and keep the subsequent loop that
calls gpuPassthroughRecoveryLines(registeredNames) unchanged so only the lookup
logic moves into the new function.

In `@src/lib/onboard/gpu-recovery.ts`:
- Line 8: Update the user-facing message string in
src/lib/onboard/gpu-recovery.ts that currently contains the literal backtick
token `nemoclaw <name> destroy`; remove the placeholder so it reads `nemoclaw
destroy` (i.e., change the string "No sandboxes are registered, so there is
nothing for `nemoclaw <name> destroy` to act on." to "No sandboxes are
registered, so there is nothing for `nemoclaw destroy` to act on.").
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e4b21aec-b0da-470c-89aa-ca45b67d0dda

📥 Commits

Reviewing files that changed from the base of the PR and between 3cb6c8e and d6b011d.

📒 Files selected for processing (5)
  • src/lib/actions/uninstall/run-plan.test.ts
  • src/lib/actions/uninstall/run-plan.ts
  • src/lib/onboard.test.ts
  • src/lib/onboard.ts
  • src/lib/onboard/gpu-recovery.ts

Comment thread src/lib/onboard.test.ts
Comment thread src/lib/onboard.ts Outdated
Comment thread src/lib/onboard/gpu-recovery.ts Outdated
…s neutral

CodeRabbit and Codex feedback on #3464:
- gpu-recovery still emitted "nemoclaw <name> destroy" in the no-sandbox
  branch.  The whole point of the bug is that `<name>` is not actionable
  when no sandbox is registered; rewrite the line so the placeholder is
  gone entirely and broaden the test to forbid any "<name>" occurrence
  in the joined output.
- Hoist the registry lookup + print loop into reportGpuPassthroughRecovery
  inside gpu-recovery.ts so the entrypoint diff in src/lib/onboard.ts is
  net-neutral against the budget check.  Inject the lookup as a callback
  for testability and add a fallback case that hides registry-throw paths.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@laitingsheng

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 13, 2026

Copy link
Copy Markdown
Contributor
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Codex review on #3464 caught three issues:

1) `nemoclaw <name> destroy --yes` explicitly preserves the shared gateway
   (destroy.ts L207).  The previous recovery snippet therefore destroyed the
   sandbox and then re-onboarded against the same gateway, which still lacks
   GPU passthrough — same wall, one round later.  Append --cleanup-gateway
   to the last destroy command so the gateway actually goes away.

2) `nemoclaw uninstall` removes the global CLI via `npm uninstall -g nemoclaw`
   (run-plan.ts L451), so `nemoclaw uninstall && nemoclaw onboard --gpu` is
   not even executable — the second half runs against a CLI that no longer
   exists.  Use the bare `openshell gateway destroy -g nemoclaw` command
   (the same one destroy.ts already prints as gatewayRemovalHint) so the
   CLI survives.

3) The registry catch collapsed read errors to `[]`, which then printed
   "No sandboxes are registered" — a false diagnosis.  Switch the return
   type to `readonly string[] | null` and add a dedicated branch with a
   "Could not read the NemoClaw sandbox registry" message and a direct
   gateway-removal recipe.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@laitingsheng laitingsheng marked this pull request as ready for review May 13, 2026 17:34
cjagwani pushed a commit that referenced this pull request May 14, 2026
Resolve the two output threads in #3456 left after the core dead-loop fix
landed via #3459 + #3434:

Sub-bug #3 — `src/lib/onboard.ts` printed
  `nemoclaw <name> destroy --yes && nemoclaw onboard --gpu`
with a literal `<name>` placeholder, and assumed at least one sandbox
was registered. When the GPU-passthrough mismatch hit on the State B
re-run path with an empty registry (the dead-loop case), the hint was
not actionable. Replace with a registry-aware helper at
`src/lib/onboard/gpu-recovery.ts` that renders the right shape:
  - empty registry → suggest `nemoclaw uninstall && nemoclaw onboard --gpu`
  - one sandbox → suggest destroy --yes --cleanup-gateway for that name
  - multiple sandboxes → list each, only the last gets --cleanup-gateway

Sub-bug #4 — `src/lib/actions/uninstall/run-plan.ts` printed
  `Destroyed gateway 'nemoclaw' skipped`
when the openshell destroy no-op'd (gateway already gone) — the
"Destroyed … skipped" wording was self-contradictory. Extend
`runOptional` with an `onSkip` option; route the gateway destroy to
emit `Gateway 'nemoclaw' already removed or unreachable` on no-op.

Tests:
- `src/lib/onboard/gpu-recovery.test.ts` (6 tests): forbid literal
  `<name>` placeholder anywhere in the output; cover empty / single /
  multi-sandbox cases; defensive filter on whitespace names so a
  `nemoclaw  destroy` rendering can never happen.
- `src/lib/actions/uninstall/run-plan.test.ts`: assert the new
  "already removed or unreachable" wording and the absence of the
  "Destroyed gateway 'nemoclaw' skipped" string.

The core dead loop itself (sub-bugs #1, #2 and State B GPU mismatch)
is already addressed by #3459 + #3434 + #3483; #3456 will close once
this lands. See the #3456 status comment for the full mapping.

Refs #3456. Mirrors (and tightens) the approach in the closed PR #3464,
which left the literal `<name>` placeholder in tests per CodeRabbit
feedback that was never addressed.

Signed-off-by: Charan Jagwani <charjags100@gmail.com>
cv added a commit that referenced this pull request May 14, 2026
…3520)

> **Draft for visibility.** Issue-autopilot Stages 4-5 of #3456. Will
mark ready once batch self-review + CI complete.

## Summary

Closes the two remaining output threads in #3456 after the core
dead-loop fix already landed on `main` (via #3459, #3434, #3483). Full
sub-bug mapping in the [#3456 status
comment](#3456 (comment)).

- **Sub-bug #3** — `nemoclaw <name> destroy --yes` recovery hint
replaced with a registry-aware helper.
- **Sub-bug #4** — `Destroyed gateway 'nemoclaw' skipped`
self-contradictory wording replaced with `Gateway 'nemoclaw' already
removed or unreachable`.

## Acceptance criteria mapping

| Sub-bug | Resolution | Evidence |
|---|---|---|
| #1 dead loop | Already fixed on main (#3459) | out of scope |
| #2 firewall diagnostic | Already fixed on main (#3459) | out of scope
|
| **#3** literal `<name>` placeholder | **This PR** |
`src/lib/onboard/gpu-recovery.ts` + `onboard.ts:10387-10405` |
| **#4** misleading "skipped" wording | **This PR** |
`src/lib/actions/uninstall/run-plan.ts:210-228, 407-414` |
| #5 uninstall residuals | Already fixed on main (#3483) | out of scope
|

## Behavior matrix

`gpuPassthroughRecoveryLines(names)`:

| Input | Suggestion |
|---|---|
| `null` / `[]` | `nemoclaw uninstall && nemoclaw onboard --gpu` |
| one sandbox | `nemoclaw <name> destroy --yes --cleanup-gateway &&
nemoclaw onboard --gpu` |
| many sandboxes | each `destroy --yes`, only the last gets
`--cleanup-gateway` |

## Test plan

```
npm run typecheck:cli
npx vitest run src/lib/onboard/gpu-recovery.test.ts src/lib/actions/uninstall/run-plan.test.ts
```

22 tests pass (6 new + 16 existing).

## Notes for reviewers

- This is the work [#3464
attempted](#3464); that PR was
closed without merging after CodeRabbit asked for the `<name>`
placeholder to be forbidden in tests via negative assertion. This PR
adopts that refinement.
- `runOptional` extension is backwards-compatible — existing callers
without `onSkip` get the original wording.

Closes #3456 once merged.

---------

Signed-off-by: Charan Jagwani <charjags100@gmail.com>
Co-authored-by: Charan Jagwani <charjags100@gmail.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
@wscurran wscurran added area: cli Command line interface, flags, terminal UX, or output bug-fix PR fixes a bug or regression and removed NemoClaw CLI labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: cli Command line interface, flags, terminal UX, or output bug-fix PR fixes a bug or regression

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[All Platforms][Onboard] ./install.sh and nemoclaw onboard enter dead loop — sandbox can't reach gateway, "firewall" hint is misleading

2 participants