fix(snapshot): use registered imageTag for VM-driver auto-create#4058
Conversation
Restoring a snapshot from a VM-driver sandbox into a new destination
(`snapshot restore --to <new>`) aborted with:
Cannot auto-create '<new>': could not resolve '<src>' pod image.
PR NVIDIA#3784 taught `probeGatewayRunning()` to treat `openshellDriver: "vm"`
like `"docker"`, but `resolveSrcPodImage()`'s fast path was still
hard-coded to `=== "docker"`. On macOS Apple Silicon (vm driver) the
fast path was skipped and we fell into the legacy
`docker exec openshell-cluster-* kubectl ...` probe — a container that
does not exist on VM-driver hosts — so the resolver always returned
null.
Route both Docker and VM driver sandboxes through the same
`usesGatewayMetadataProbe()` helper so the registered imageTag is
trusted. While here, gate the auto-create DNS-proxy step on the
`"kubernetes"` driver, matching `onboard.ts` — the VM driver does not
use the kubectl-based DNS proxy.
Signed-off-by: sevenc <sevenc@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughPrefer the sandbox registry's registered imageTag for drivers using gateway metadata probes (docker/vm), restrict the kubectl-in-gateway fallback to the kubernetes driver, gate DNS proxy setup to srcDriver === "kubernetes", and add a VM-driver regression test verifying the registered imageTag is used during restore. ChangesSnapshot image resolution and proxy setup
Sequence Diagram(s)sequenceDiagram
participant CLI
participant resolveSrcPodImage
participant SandboxRegistry
participant GatewayMetadataProbe
participant kubectl
CLI->>resolveSrcPodImage: request source image for auto-create
resolveSrcPodImage->>SandboxRegistry: read registered imageTag
resolveSrcPodImage->>GatewayMetadataProbe: probe gateway metadata (docker/vm)
resolveSrcPodImage->>kubectl: kubectl fallback (only if srcDriver == "kubernetes")
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/actions/sandbox/snapshot.ts`:
- Around line 94-96: The current branch returns registeredImage for gateway
metadata probes even when the openshell driver is "vm" or "docker" and the image
lacks an imageTag, causing it to fall through to the kubectl probe; modify the
logic in the block that checks usesGatewayMetadataProbe(registeredDriver) &&
registeredImage so that if registeredDriver is "vm" or "docker" and
registeredImage.imageTag is missing, you return null (fail fast) instead of
registeredImage, otherwise keep the existing return; reference the symbols
usesGatewayMetadataProbe, registeredDriver, registeredImage and the openshell
driver values "vm"/"docker" when updating the condition.
In `@test/snapshot-gateway-guard.test.ts`:
- Around line 220-224: The test currently inspects runCli output but doesn't
assert process exit status; update the assertion after calling runCli (the const
r = runCli(...) invocation) to explicitly verify the command succeeded by
checking the exit code (e.g., assert/expect r.code is 0 or truthy success value
used in your runner). Add this single assertion (using the same test framework
style as other tests, e.g., expect(r.code).toBe(0) or equivalent) immediately
after creating r and before or after the existing output assertions so failures
with partial output fail the test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1fbad4c5-6d0b-450c-b57c-e2cf0bb00a2f
📒 Files selected for processing (2)
src/lib/actions/sandbox/snapshot.tstest/snapshot-gateway-guard.test.ts
…without imageTag Per PR review: Docker- and VM-driver sandboxes never have the legacy `openshell-cluster-nemoclaw` container, so falling through to the `docker exec ... kubectl ...` probe when their imageTag happens to be missing only wastes a guaranteed-failing docker call (and pollutes output with a misleading docker error) before returning the same null. Return imageTag (or null) directly for both drivers; keep the kubectl probe path strictly for the legacy "kubernetes" driver. Signed-off-by: sevenc <sevenc@nvidia.com>
|
✨ |
## Summary - Adds the v0.0.56 release notes section with links to the deeper docs pages for installer, status, inference, messaging, policy, and lifecycle changes. - Updates source docs for the remaining release-prep gaps around `uv` in the PyPI preset, compact WhatsApp pairing guidance, and `nemoclaw inference set` command boundaries. - Refreshes generated `nemoclaw-user-*` skills and removes skipped experimental command terms from generated skill surfaces. ## Source summary - #4613 -> `docs/manage-sandboxes/lifecycle.mdx`, `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Documents that public installs and `nemoclaw update` follow the maintained `lkg` tag by default. - #4419 -> `docs/about/release-notes.mdx`: Notes that non-interactive Linux installs can reactivate Docker group membership and continue in one installer run when `sg docker` is available. - #4550 -> `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Captures live sandbox agent-version probing for status, connect, and upgrade checks. - #4609 -> `docs/inference/use-local-inference.mdx`, `docs/about/release-notes.mdx`: Captures the GPU Docker-driver host-network local-inference reachability gate. - #4607 -> `docs/manage-sandboxes/messaging-channels.mdx`, `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Documents compact WhatsApp QR pairing guidance and gateway/session diagnostics. - #4582 -> `docs/manage-sandboxes/messaging-channels.mdx`, `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Reflects Slack credential validation before enabling the channel. - #4554 -> `docs/manage-sandboxes/messaging-channels.mdx`, `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Keeps Telegram allowlist alias guidance in the generated user skills and release notes. - #4563 -> `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Includes the new `nemoclaw <name> skill remove <skill>` command in command docs and release notes. - #4566 -> `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Documents the `nemoclaw inference set` redirect boundary when `--provider` or `--model` is missing. - #4323 -> `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Captures per-sandbox status JSON support. - #4506 -> `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Captures debug command sandbox-name validation and safer tarball writing. - #4569 -> `docs/network-policy/integration-policy-examples.mdx`, `docs/about/release-notes.mdx`: Documents that the `pypi` preset allows `/usr/local/bin/uv`. - #4579 -> `docs/network-policy/integration-policy-examples.mdx`, `docs/about/release-notes.mdx`: Captures observable Jira preset validation guidance. - #4229 -> `docs/manage-sandboxes/lifecycle.mdx`, `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Documents user-data preservation defaults for uninstall. - #4399 -> `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Captures CPU-only sandbox intent preservation across rebuilds. - #4058 -> `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Captures safer snapshot restore behavior around existing destinations. - #4155 and #4460 -> skipped by `docs/.docs-skip`: Removed skipped experimental command terms from source docs and generated skill evals instead of documenting those features. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `npm run docs` (passes; Fern reports the pre-existing light-mode accent contrast warning) - `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" .agents/skills` (no matches) - `npm run build:cli` (run to refresh local CLI artifacts for the pre-push TypeScript hook) - Commit hooks passed, including `NEMOCLAW_* env-var documentation gate`, `Verify docs-to-skills output`, `markdownlint-cli2`, `gitleaks`, and `Test (skills YAML)`. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Expanded Model Router setup with YAML examples, flow diagrams, and credential handling; strengthened agent-config immutability and integrity guidance; messaging channels updated (Telegram aliases, WhatsApp pairing/diagnostics); CLI docs revised (GPU detection, inference set behavior, uninstall/rebuild preservation); overview rebranded to NemoClaw and added v0.0.56 release notes. * **New Features** * Added `nemoclaw <name> channels status` (messaging diagnostics, JSON); added `nemoclaw <name> skill remove`; Hermes no longer marked experimental; DGX Spark quickstart sandbox-name note. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
This PR fixes
nemoclaw <src> snapshot restore --to <new>aborting on macOS Apple Silicon (VM driver) withCannot auto-create '<new>': could not resolve '<src>' pod image., becauseresolveSrcPodImage()'s fast path was still hard-coded toopenshellDriver === "docker"— a symmetric gap left by #3784 which only fixedprobeGatewayRunning(). Route both Docker and VM driver sandboxes through the sameusesGatewayMetadataProbe()helper and gate the auto-create DNS-proxy step on the"kubernetes"driver to matchonboard.ts.Related Issue
Fixes #4071
Changes
resolveSrcPodImage()fast path now trusts the registeredimageTagfor bothdockerandvmdrivers viausesGatewayMetadataProbe(), so VM-driver auto-create no longer falls into the kubectl-via-docker probe against a container that does not exist.autoCreateSandboxFromSource()DNS-proxy step gated onopenshellDriver === "kubernetes", matching the onboard logic (src/lib/onboard.ts). Previously vm-driver sandboxes invokedsetup-dns-proxy.sheven though they don't use the kubectl-based DNS proxy.test/snapshot-gateway-guard.test.tscoveringsnapshot restore --to <new>on a VM-driver sandbox: dockerexecemits a sentinel that the test asserts never appears, and the registeredimageTagis asserted to show up in the auto-create output.Type of Change
Test the PR locally on a MacOS Apple Silicon device:
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)Signed-off-by: sevenc sevenc@nvidia.com
Summary by CodeRabbit
Bug Fixes
Tests