Skip to content

fix(snapshot): use registered imageTag for VM-driver auto-create#4058

Merged
cv merged 5 commits into
NVIDIA:mainfrom
cr7258:snapshot-restore
May 29, 2026
Merged

fix(snapshot): use registered imageTag for VM-driver auto-create#4058
cv merged 5 commits into
NVIDIA:mainfrom
cr7258:snapshot-restore

Conversation

@cr7258

@cr7258 cr7258 commented May 22, 2026

Copy link
Copy Markdown
Contributor

Summary

This PR fixes nemoclaw <src> snapshot restore --to <new> aborting on macOS Apple Silicon (VM driver) with Cannot auto-create '<new>': could not resolve '<src>' pod image., because resolveSrcPodImage()'s fast path was still hard-coded to openshellDriver === "docker" — a symmetric gap left by #3784 which only fixed probeGatewayRunning(). Route both Docker and VM driver sandboxes through the same usesGatewayMetadataProbe() helper and gate the auto-create DNS-proxy step on the "kubernetes" driver to match onboard.ts.

Related Issue

Fixes #4071

Changes

  • resolveSrcPodImage() fast path now trusts the registered imageTag for both docker and vm drivers via usesGatewayMetadataProbe(), so VM-driver auto-create no longer falls into the kubectl-via-docker probe against a container that does not exist.
  • autoCreateSandboxFromSource() DNS-proxy step gated on openshellDriver === "kubernetes", matching the onboard logic (src/lib/onboard.ts). Previously vm-driver sandboxes invoked setup-dns-proxy.sh even though they don't use the kubectl-based DNS proxy.
  • Added a regression test in test/snapshot-gateway-guard.test.ts covering snapshot restore --to <new> on a VM-driver sandbox: docker exec emits a sentinel that the test asserts never appears, and the registered imageTag is asserted to show up in the auto-create output.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Test the PR locally on a MacOS Apple Silicon device:

nemoclaw my-assistant snapshot create --name verify-fix
  Creating snapshot of 'my-assistant' (--name verify-fix)...
  ✓ Snapshot v3 name=verify-fix created (13 directories, 0 files)
    /Users/sevenc/.nemoclaw/rebuild-backups/my-assistant/2026-05-22T07-28-20-816Z

nemoclaw my-assistant snapshot restore verify-fix --to clone-1
  'clone-1' does not exist. Creating from 'my-assistant' image (openshell/sandbox-from:1779432198)...
  Creating sandbox in gateway...
  Waiting for sandbox to become ready...
  Sandbox reported Ready before create stream exited; continuing.
  ✓ Sandbox 'clone-1' created

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: sevenc sevenc@nvidia.com

Summary by CodeRabbit

  • Bug Fixes

    • Fixed snapshot restore image resolution so sandbox restores use the registered image tag for gateway-backed drivers and avoid inappropriate fallback probes.
    • DNS proxy setup during sandbox auto-creation now runs only for Kubernetes-backed source sandboxes when setup scripts are present.
  • Tests

    • Added regression tests covering snapshot restore workflows to verify correct image selection and absence of fallback probe behavior.

Review Change Stack

Restoring a snapshot from a VM-driver sandbox into a new destination
(`snapshot restore --to <new>`) aborted with:

    Cannot auto-create '<new>': could not resolve '<src>' pod image.

PR NVIDIA#3784 taught `probeGatewayRunning()` to treat `openshellDriver: "vm"`
like `"docker"`, but `resolveSrcPodImage()`'s fast path was still
hard-coded to `=== "docker"`. On macOS Apple Silicon (vm driver) the
fast path was skipped and we fell into the legacy
`docker exec openshell-cluster-* kubectl ...` probe — a container that
does not exist on VM-driver hosts — so the resolver always returned
null.

Route both Docker and VM driver sandboxes through the same
`usesGatewayMetadataProbe()` helper so the registered imageTag is
trusted. While here, gate the auto-create DNS-proxy step on the
`"kubernetes"` driver, matching `onboard.ts` — the VM driver does not
use the kubectl-based DNS proxy.

Signed-off-by: sevenc <sevenc@nvidia.com>
@cr7258 cr7258 self-assigned this May 22, 2026
@coderabbitai

coderabbitai Bot commented May 22, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0e3359bd-c7a3-4bdf-9c54-7b59a6f1ca9a

📥 Commits

Reviewing files that changed from the base of the PR and between 1cf20fd and 0e8d4de.

📒 Files selected for processing (1)
  • src/lib/actions/sandbox/snapshot.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/actions/sandbox/snapshot.ts

📝 Walkthrough

Walkthrough

Prefer the sandbox registry's registered imageTag for drivers using gateway metadata probes (docker/vm), restrict the kubectl-in-gateway fallback to the kubernetes driver, gate DNS proxy setup to srcDriver === "kubernetes", and add a VM-driver regression test verifying the registered imageTag is used during restore.

Changes

Snapshot image resolution and proxy setup

Layer / File(s) Summary
Image resolution and DNS proxy gating
src/lib/actions/sandbox/snapshot.ts
resolveSrcPodImage now returns the sandbox registry imageTag for drivers that use the gateway metadata probe and only falls back to kubectl when srcDriver === "kubernetes". DNS proxy setup in autoCreateSandboxFromSource runs only when srcDriver === "kubernetes" and the script exists.
VM-driver restore test with imageTag verification
test/snapshot-gateway-guard.test.ts
Adds makeVmRestoreToEnv test helper (sets sandbox registry imageTag, stubs openshell/ssh/docker) and a regression test that restores a snapshot to --to clone-1, asserting the registered imageTag appears in output and that the forbidden docker exec sentinel is absent.

Sequence Diagram(s)

sequenceDiagram
  participant CLI
  participant resolveSrcPodImage
  participant SandboxRegistry
  participant GatewayMetadataProbe
  participant kubectl
  CLI->>resolveSrcPodImage: request source image for auto-create
  resolveSrcPodImage->>SandboxRegistry: read registered imageTag
  resolveSrcPodImage->>GatewayMetadataProbe: probe gateway metadata (docker/vm)
  resolveSrcPodImage->>kubectl: kubectl fallback (only if srcDriver == "kubernetes")
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

NemoClaw CLI, OpenShell, v0.0.50

Suggested reviewers

  • ericksoa

Poem

🐰 A tiny hop through gateway mist,
ImageTag gleams where probes might twist,
Kubernetes hums when DNS must play,
VM restores glide the registry way,
No docker shout to steal the day.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and accurately summarizes the main change: fixing snapshot restore to use the registered imageTag for VM-driver auto-create operations.
Linked Issues check ✅ Passed The PR fully addresses all coding objectives from issue #4071: uses registered imageTag for VM/docker drivers, prevents kubectl-via-docker probe failures, updates DNS proxy logic for kubernetes-only, and includes regression tests.
Out of Scope Changes check ✅ Passed All changes directly address the linked issue objectives; modifications to snapshot.ts and additions to test file are scoped to the VM-driver auto-create fix and regression testing.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/actions/sandbox/snapshot.ts`:
- Around line 94-96: The current branch returns registeredImage for gateway
metadata probes even when the openshell driver is "vm" or "docker" and the image
lacks an imageTag, causing it to fall through to the kubectl probe; modify the
logic in the block that checks usesGatewayMetadataProbe(registeredDriver) &&
registeredImage so that if registeredDriver is "vm" or "docker" and
registeredImage.imageTag is missing, you return null (fail fast) instead of
registeredImage, otherwise keep the existing return; reference the symbols
usesGatewayMetadataProbe, registeredDriver, registeredImage and the openshell
driver values "vm"/"docker" when updating the condition.

In `@test/snapshot-gateway-guard.test.ts`:
- Around line 220-224: The test currently inspects runCli output but doesn't
assert process exit status; update the assertion after calling runCli (the const
r = runCli(...) invocation) to explicitly verify the command succeeded by
checking the exit code (e.g., assert/expect r.code is 0 or truthy success value
used in your runner). Add this single assertion (using the same test framework
style as other tests, e.g., expect(r.code).toBe(0) or equivalent) immediately
after creating r and before or after the existing output assertions so failures
with partial output fail the test.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1fbad4c5-6d0b-450c-b57c-e2cf0bb00a2f

📥 Commits

Reviewing files that changed from the base of the PR and between ef84117 and 0c70665.

📒 Files selected for processing (2)
  • src/lib/actions/sandbox/snapshot.ts
  • test/snapshot-gateway-guard.test.ts

Comment thread src/lib/actions/sandbox/snapshot.ts Outdated
Comment thread test/snapshot-gateway-guard.test.ts
…without imageTag

Per PR review: Docker- and VM-driver sandboxes never have the legacy
`openshell-cluster-nemoclaw` container, so falling through to the
`docker exec ... kubectl ...` probe when their imageTag happens to be
missing only wastes a guaranteed-failing docker call (and pollutes
output with a misleading docker error) before returning the same null.
Return imageTag (or null) directly for both drivers; keep the kubectl
probe path strictly for the legacy "kubernetes" driver.

Signed-off-by: sevenc <sevenc@nvidia.com>
@wscurran

Copy link
Copy Markdown
Contributor

@cr7258 cr7258 requested a review from cv May 26, 2026 05:38
@cr7258 cr7258 requested a review from ericksoa May 27, 2026 13:53
@cv cv added the v0.0.55 label May 28, 2026
@jyaunches jyaunches added R3 v0.0.56 Release target and removed v0.0.55 labels May 29, 2026
@cv cv merged commit 5f65107 into NVIDIA:main May 29, 2026
27 checks passed
miyoungc added a commit that referenced this pull request Jun 1, 2026
## Summary

- Adds the v0.0.56 release notes section with links to the deeper docs
pages for installer, status, inference, messaging, policy, and lifecycle
changes.
- Updates source docs for the remaining release-prep gaps around `uv` in
the PyPI preset, compact WhatsApp pairing guidance, and `nemoclaw
inference set` command boundaries.
- Refreshes generated `nemoclaw-user-*` skills and removes skipped
experimental command terms from generated skill surfaces.

## Source summary

- #4613 -> `docs/manage-sandboxes/lifecycle.mdx`,
`docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Documents
that public installs and `nemoclaw update` follow the maintained `lkg`
tag by default.
- #4419 -> `docs/about/release-notes.mdx`: Notes that non-interactive
Linux installs can reactivate Docker group membership and continue in
one installer run when `sg docker` is available.
- #4550 -> `docs/reference/commands.mdx`,
`docs/about/release-notes.mdx`: Captures live sandbox agent-version
probing for status, connect, and upgrade checks.
- #4609 -> `docs/inference/use-local-inference.mdx`,
`docs/about/release-notes.mdx`: Captures the GPU Docker-driver
host-network local-inference reachability gate.
- #4607 -> `docs/manage-sandboxes/messaging-channels.mdx`,
`docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Documents
compact WhatsApp QR pairing guidance and gateway/session diagnostics.
- #4582 -> `docs/manage-sandboxes/messaging-channels.mdx`,
`docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Reflects
Slack credential validation before enabling the channel.
- #4554 -> `docs/manage-sandboxes/messaging-channels.mdx`,
`docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`:
Keeps Telegram allowlist alias guidance in the generated user skills and
release notes.
- #4563 -> `docs/reference/commands.mdx`,
`docs/about/release-notes.mdx`: Includes the new `nemoclaw <name> skill
remove <skill>` command in command docs and release notes.
- #4566 -> `docs/reference/commands.mdx`,
`docs/about/release-notes.mdx`: Documents the `nemoclaw inference set`
redirect boundary when `--provider` or `--model` is missing.
- #4323 -> `docs/reference/commands.mdx`,
`docs/about/release-notes.mdx`: Captures per-sandbox status JSON
support.
- #4506 -> `docs/reference/commands.mdx`,
`docs/about/release-notes.mdx`: Captures debug command sandbox-name
validation and safer tarball writing.
- #4569 -> `docs/network-policy/integration-policy-examples.mdx`,
`docs/about/release-notes.mdx`: Documents that the `pypi` preset allows
`/usr/local/bin/uv`.
- #4579 -> `docs/network-policy/integration-policy-examples.mdx`,
`docs/about/release-notes.mdx`: Captures observable Jira preset
validation guidance.
- #4229 -> `docs/manage-sandboxes/lifecycle.mdx`,
`docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Documents
user-data preservation defaults for uninstall.
- #4399 -> `docs/reference/commands.mdx`,
`docs/about/release-notes.mdx`: Captures CPU-only sandbox intent
preservation across rebuilds.
- #4058 -> `docs/reference/commands.mdx`,
`docs/about/release-notes.mdx`: Captures safer snapshot restore behavior
around existing destinations.
- #4155 and #4460 -> skipped by `docs/.docs-skip`: Removed skipped
experimental command terms from source docs and generated skill evals
instead of documenting those features.

## Verification

- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `npm run docs` (passes; Fern reports the pre-existing light-mode
accent contrast warning)
- `rg "permissive mode|shields down|shields up|shields status|config
rotate-token|rotate-token" .agents/skills` (no matches)
- `npm run build:cli` (run to refresh local CLI artifacts for the
pre-push TypeScript hook)
- Commit hooks passed, including `NEMOCLAW_* env-var documentation
gate`, `Verify docs-to-skills output`, `markdownlint-cli2`, `gitleaks`,
and `Test (skills YAML)`.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
* Expanded Model Router setup with YAML examples, flow diagrams, and
credential handling; strengthened agent-config immutability and
integrity guidance; messaging channels updated (Telegram aliases,
WhatsApp pairing/diagnostics); CLI docs revised (GPU detection,
inference set behavior, uninstall/rebuild preservation); overview
rebranded to NemoClaw and added v0.0.56 release notes.

* **New Features**
* Added `nemoclaw <name> channels status` (messaging diagnostics, JSON);
added `nemoclaw <name> skill remove`; Hermes no longer marked
experimental; DGX Spark quickstart sandbox-name note.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@wscurran wscurran added area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression and removed fix labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression platform: macos Affects macOS, including Apple Silicon v0.0.56 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[macOS][Sandbox] nemoclaw snapshot restore fails on Apple Silicon

4 participants