fix(sandbox): propagate no-gpu intent into rebuild recreate#4399
Conversation
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughAdds exported rebuildShouldOptOutGpu and buildRebuildRecreateOnboardOpts, tests for explicit modes, aliases, and legacy fallbacks, and uses the builder in rebuild to pass a conditional ChangesGPU Opt-Out Detection for Sandbox Rebuild
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
E2E Scenario Advisor RecommendationRequired scenario E2E: None Full scenario advisor summaryE2E Scenario AdvisorBase: Required scenario E2E
Optional scenario E2E
Relevant changed files
|
PR Review AdvisorFindings: 0 needs attention, 3 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26554752730
|
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/actions/sandbox/rebuild.ts (1)
65-67:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winCorrect the JSDoc comment.
The comment describes "Emit timestamped rebuild diagnostics" but the function actually returns a boolean indicating whether the sandbox rebuild should opt out of GPU based on registry metadata.
📝 Proposed fix
/** - * Emit timestamped rebuild diagnostics when verbose rebuild logging is enabled. + * Decide whether a sandbox rebuild should opt out of GPU based on its registry metadata. + * + * Returns `true` when the sandbox was created with explicit GPU opt-out (`sandboxGpuMode === "0"`), + * or when legacy entries have `gpuEnabled === false` without a counter-signal from `sandboxGpuEnabled`. */ export function rebuildShouldOptOutGpu(🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/actions/sandbox/rebuild.ts` around lines 65 - 67, Update the JSDoc above the function in this file that returns a boolean for opting out of GPU during sandbox rebuild to accurately describe its behavior: state that the function reads registry metadata and returns true when the sandbox rebuild should opt out of GPU (false otherwise), document any parameters it inspects and the boolean return value, and remove the incorrect "Emit timestamped rebuild diagnostics" wording; ensure the summary, `@returns`, and any `@param` tags reflect the registry-check semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/lib/actions/sandbox/rebuild.ts`:
- Around line 65-67: Update the JSDoc above the function in this file that
returns a boolean for opting out of GPU during sandbox rebuild to accurately
describe its behavior: state that the function reads registry metadata and
returns true when the sandbox rebuild should opt out of GPU (false otherwise),
document any parameters it inspects and the boolean return value, and remove the
incorrect "Emit timestamped rebuild diagnostics" wording; ensure the summary,
`@returns`, and any `@param` tags reflect the registry-check semantics.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d7b09208-3040-4408-98fa-9f5ab1ebb43a
📒 Files selected for processing (2)
src/lib/actions/sandbox/rebuild-gpu-opt-out.test.tssrc/lib/actions/sandbox/rebuild.ts
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26555209513
|
Selective E2E Results — ✅ All requested jobs passedRun: 26555653066
|
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26556801195
|
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26557508443
|
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26559049011
|
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
🌿 Preview your docs: https://nvidia-preview-pr-4399.docs.buildwithfern.com/nemoclaw |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
docs/reference/commands.mdx (1)
789-789: ⚡ Quick winReplace the clause colon with a period (or split into two sentences).
The colon in this sentence is used as general punctuation between clauses, not to introduce a list.
As per coding guidelines, “Colons should only introduce a list. Flag colons used as general punctuation between clauses.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/reference/commands.mdx` at line 789, The sentence beginning "The recorded sandbox GPU mode is preserved across rebuild:" uses a colon incorrectly; edit the sentence to replace that colon with a period (or otherwise split into two sentences) so it no longer uses a colon as general punctuation—ensure the rest of the text (including references to sandboxGpuMode: "0", gpuEnabled: false, and the onboard --resume mention) remains unchanged and grammatically connected.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@docs/reference/commands.mdx`:
- Line 789: The sentence beginning "The recorded sandbox GPU mode is preserved
across rebuild:" uses a colon incorrectly; edit the sentence to replace that
colon with a period (or otherwise split into two sentences) so it no longer uses
a colon as general punctuation—ensure the rest of the text (including references
to sandboxGpuMode: "0", gpuEnabled: false, and the onboard --resume mention)
remains unchanged and grammatically connected.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a35459c5-046d-4179-9e69-081578e67f16
📒 Files selected for processing (1)
docs/reference/commands.mdx
Selective E2E Results — ✅ All requested jobs passedRun: 26559810854
|
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26560526659
|
## Summary - Adds the v0.0.56 release notes section with links to the deeper docs pages for installer, status, inference, messaging, policy, and lifecycle changes. - Updates source docs for the remaining release-prep gaps around `uv` in the PyPI preset, compact WhatsApp pairing guidance, and `nemoclaw inference set` command boundaries. - Refreshes generated `nemoclaw-user-*` skills and removes skipped experimental command terms from generated skill surfaces. ## Source summary - #4613 -> `docs/manage-sandboxes/lifecycle.mdx`, `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Documents that public installs and `nemoclaw update` follow the maintained `lkg` tag by default. - #4419 -> `docs/about/release-notes.mdx`: Notes that non-interactive Linux installs can reactivate Docker group membership and continue in one installer run when `sg docker` is available. - #4550 -> `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Captures live sandbox agent-version probing for status, connect, and upgrade checks. - #4609 -> `docs/inference/use-local-inference.mdx`, `docs/about/release-notes.mdx`: Captures the GPU Docker-driver host-network local-inference reachability gate. - #4607 -> `docs/manage-sandboxes/messaging-channels.mdx`, `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Documents compact WhatsApp QR pairing guidance and gateway/session diagnostics. - #4582 -> `docs/manage-sandboxes/messaging-channels.mdx`, `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Reflects Slack credential validation before enabling the channel. - #4554 -> `docs/manage-sandboxes/messaging-channels.mdx`, `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Keeps Telegram allowlist alias guidance in the generated user skills and release notes. - #4563 -> `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Includes the new `nemoclaw <name> skill remove <skill>` command in command docs and release notes. - #4566 -> `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Documents the `nemoclaw inference set` redirect boundary when `--provider` or `--model` is missing. - #4323 -> `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Captures per-sandbox status JSON support. - #4506 -> `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Captures debug command sandbox-name validation and safer tarball writing. - #4569 -> `docs/network-policy/integration-policy-examples.mdx`, `docs/about/release-notes.mdx`: Documents that the `pypi` preset allows `/usr/local/bin/uv`. - #4579 -> `docs/network-policy/integration-policy-examples.mdx`, `docs/about/release-notes.mdx`: Captures observable Jira preset validation guidance. - #4229 -> `docs/manage-sandboxes/lifecycle.mdx`, `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Documents user-data preservation defaults for uninstall. - #4399 -> `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Captures CPU-only sandbox intent preservation across rebuilds. - #4058 -> `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Captures safer snapshot restore behavior around existing destinations. - #4155 and #4460 -> skipped by `docs/.docs-skip`: Removed skipped experimental command terms from source docs and generated skill evals instead of documenting those features. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `npm run docs` (passes; Fern reports the pre-existing light-mode accent contrast warning) - `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" .agents/skills` (no matches) - `npm run build:cli` (run to refresh local CLI artifacts for the pre-push TypeScript hook) - Commit hooks passed, including `NEMOCLAW_* env-var documentation gate`, `Verify docs-to-skills output`, `markdownlint-cli2`, `gitleaks`, and `Test (skills YAML)`. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Expanded Model Router setup with YAML examples, flow diagrams, and credential handling; strengthened agent-config immutability and integrity guidance; messaging channels updated (Telegram aliases, WhatsApp pairing/diagnostics); CLI docs revised (GPU detection, inference set behavior, uninstall/rebuild preservation); overview rebranded to NemoClaw and added v0.0.56 release notes. * **New Features** * Added `nemoclaw <name> channels status` (messaging diagnostics, JSON); added `nemoclaw <name> skill remove`; Hermes no longer marked experimental; DGX Spark quickstart sandbox-name note. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
nemoclaw <name> rebuild --yeson a sandbox onboarded with an explicit--no-gpu(recorded assandboxGpuMode: "0") now propagates that recorded intent into the inneronboard --resumeso the Docker CDI GPU preflight is skipped on hosts without an NVIDIA GPU (e.g. Snapdragon ARM WSL2). Without this, rebuild deletes the old sandbox, then the recreate's preflight fails on CDI and leaves the user with only arebuild-backups/archive and a stub registry entry. Auto-mode sandboxes (sandboxGpuMode: "auto") remain auto across rebuild, including the CPU-fallback case wheresandboxGpuEnabledends upfalsebecause the host has no NVIDIA GPU.Related Issue
Fixes #3985.
Changes
src/lib/actions/sandbox/rebuild-gpu-opt-out.ts. ExportsrebuildShouldOptOutGpu(sb): the source of truth issandboxGpuMode(routed throughnormalizeSandboxGpuModeso"AUTO"/"off"/"false"map consistently)."0"→ opt out,"1"/"auto"→ keep current behaviour. Legacy entries withoutsandboxGpuModefall back togpuEnabled === falseonly when nosandboxGpuEnabled: truecounter-signal is set, so a modern auto-mode CPU-fallback record (sandboxGpuMode: "auto",sandboxGpuEnabled: false) is not silently converted into a permanent explicit opt-out on rebuild. Also exportsbuildRebuildRecreateOnboardOpts({ sb, rebuildAgent, storedFromDockerfile, autoYes }), which returns the options object passed to the inneronboard()and conditionally spreads{ noGpu: true }only whenrebuildShouldOptOutGpu(sb)is true.src/lib/actions/sandbox/rebuild.tsreplaces the inline option literal with a call tobuildRebuildRecreateOnboardOpts(...)so the wiring is testable from the helper module without driving the full rebuild flow.src/lib/actions/sandbox/rebuild-gpu-opt-out.test.tscovers both helpers. ForrebuildShouldOptOutGpu: null/undefined entry,sandboxGpuMode: "0"(with and withoutsandboxGpuEnabled),sandboxGpuMode: "auto"(CPU-fallbacksandboxGpuEnabled: falsereturns false, matching the establishedgetResumeSandboxGpuOverridessemantic),sandboxGpuMode: "1", the legacygpuEnabled: falsefallback when nosandboxGpuModeis recorded, the counter-signal case wheresandboxGpuEnabled: trueoverrides legacygpuEnabled: false, no-intent ({}andgpuEnabled: true), unrecognisedsandboxGpuModevalues that are NOT routed through the legacygpuEnabledfallback (corrupted state cannot flip a sandbox into a permanent opt-out), empty-stringsandboxGpuModetreated as no recorded mode and so still falling back to legacygpuEnabled: false, and case-insensitive aliases ("AUTO","off","false","TRUE"). ForbuildRebuildRecreateOnboardOpts:noGpu: trueis forwarded forsandboxGpuMode: "0"and legacygpuEnabled: false; omitted forsandboxGpuMode: "auto"/"1"and for the no-sandbox-entry case;agent/storedFromDockerfile/autoYesare preserved regardless of GPU opt-out.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesnpm run docsbuilds without warnings (doc changes only)Signed-off-by: Tinson Lai tinsonl@nvidia.com
Summary by CodeRabbit
New Features
Tests
Documentation