fix(upgrade): detect NemoClaw image drift in upgrade-sandboxes (#5026) #5102
📝 Walkthrough
This PR adds NemoClaw build fingerprint tracking and image drift detection to the sandbox upgrade system. Existing sandboxes now record their NemoClaw build version and are correctly flagged for upgrade when the CLI's NemoClaw version differs, independent of agent version changes.
Changes
NemoClaw Image Drift Detection in Sandbox Upgrades
Sequence Diagram
sequenceDiagram
participant CLI as upgrade-sandboxes
participant GetVersion as getVersion()
participant Classify as classifyUpgradeableSandboxes
participant Drift as isNemoclawImageStale
participant Output as describeStaleUpgrade
CLI->>GetVersion: resolveCurrentNemoclawVersion()
GetVersion-->>CLI: currentNemoclawVersion (string | null)
CLI->>Classify: classifyUpgradeableSandboxes(sandboxes, { currentNemoclawVersion })
Classify->>Drift: isNemoclawImageStale(recorded, current)
Drift-->>Classify: driftDetected (boolean)
Classify-->>CLI: candidates[] with reasons and optional imageCurrent/imageExpected
CLI->>Output: describeStaleUpgrade(candidate)
Output-->>CLI: human-readable reason
CLI->>CLI: print stale sandbox listing
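The flow in the diagram can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `SandboxEntry` and `UpgradeCandidate` shapes, the `agentVersion` field, the `expectedAgentVersion` option, and the `"agent"`/`"image"` reason labels are all assumptions made for the example. Only the function names, the `nemoclawVersion` field, and the `imageCurrent`/`imageExpected` outputs come from the PR text. Agent staleness is reduced to a plain inequality here, which may differ from the real comparison.

```typescript
// Hypothetical registry entry; only `nemoclawVersion` is named in the PR.
interface SandboxEntry {
  name: string;
  agentVersion: string; // assumed field name
  nemoclawVersion?: string; // build fingerprint; absent on legacy or custom images
}

// Hypothetical candidate shape, loosely following the diagram.
interface UpgradeCandidate {
  name: string;
  reasons: string[];
  imageCurrent?: string;
  imageExpected?: string;
}

// Drift only on positive evidence: both values known and different.
function isNemoclawImageStale(
  recorded: string | null | undefined,
  current: string | null,
): boolean {
  if (!recorded || !current) return false;
  return recorded !== current;
}

function classifyUpgradeableSandboxes(
  sandboxes: SandboxEntry[],
  opts: { expectedAgentVersion: string; currentNemoclawVersion: string | null },
): UpgradeCandidate[] {
  const candidates: UpgradeCandidate[] = [];
  for (const sb of sandboxes) {
    const reasons: string[] = [];
    // Assumption: inequality stands in for the real agent staleness check.
    if (sb.agentVersion !== opts.expectedAgentVersion) reasons.push("agent");
    const drift = isNemoclawImageStale(sb.nemoclawVersion, opts.currentNemoclawVersion);
    if (drift) reasons.push("image");
    if (reasons.length === 0) continue; // up to date, not a candidate

    const candidate: UpgradeCandidate = { name: sb.name, reasons };
    if (drift) {
      candidate.imageCurrent = sb.nemoclawVersion;
      candidate.imageExpected = opts.currentNemoclawVersion ?? undefined;
    }
    candidates.push(candidate);
  }
  return candidates;
}
```

In this sketch a sandbox with no recorded fingerprint, or a CLI that cannot resolve its own version (`null`), never produces an `"image"` reason.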
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
`upgrade-sandboxes` only compared the agent version inside a sandbox, so after a NemoClaw upgrade that left the bundled OpenClaw `expected_version` unchanged it printed "All sandboxes are up to date" even though the NemoClaw image payload (scripts, patches, policies, Dockerfile, generated config) had changed and the sandbox needed a rebuild.

Persist a NemoClaw build fingerprint (`nemoclawVersion`) on managed sandbox images at create/rebuild time and compare it against the running NemoClaw build during classification. A sandbox is now flagged for rebuild when its agent version is stale OR its recorded NemoClaw build differs from the running one, with a clear reason, e.g. `OpenClaw v2026.5.27 unchanged; NemoClaw image v0.0.60 -> v0.0.61`.

Drift is asserted only on positive evidence: a recorded fingerprint that differs. A missing fingerprint is never flagged: it is ambiguous between a legacy managed image and a custom `--from` image, and auto-rebuilding a custom image whose Dockerfile path is unavailable would silently recreate it with the default image. Custom images are therefore left without a fingerprint, and legacy sandboxes opt into drift detection on their next rebuild.

The sandbox-reuse path no longer re-stamps the fingerprint, so reusing a sandbox after a NemoClaw upgrade cannot mask drift. The fingerprint is stamped via `getSandboxAgentRegistryFields`, keeping `onboard.ts` net-neutral.

Fixes NVIDIA#5026

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
ed8ebe7 to 215c4bf
prekshivyas
left a comment
Reviewed (code + security checklist). Persists a NemoClaw build fingerprint (nemoclawVersion) on managed sandbox entries at create/rebuild, and extends upgrade-sandboxes to flag sandboxes whose recorded build differs from the running build even when the agent version is unchanged (#5026).
✅ Approve. Defensively designed: drift is positive-evidence-only (missing fingerprint ⇒ not drifted), which correctly avoids the dangerous failure mode of auto-rebuilding a custom --from image onto the default. Verified the two safety claims against source — updateReusedSandboxMetadata never touches nemoclawVersion, and the supplementary update strips it so a reused image can't be re-stamped. getVersion() runs git describe via execFileSync (no shell). Security: all categories pass.
Non-blocking nits:
- `isNemoclawImageStale` uses `recorded !== current`, so a downgrade also flags drift. Behavior is defensible; just tweak the JSDoc (says "older than running") to "differs from" to match.
- `describeStaleUpgrade` returns `""` on empty `reasons` (currently unreachable; robustness only).
Tests adequate: unit coverage for the stale predicate + classifier, plus a real CLI regression (--check with current agent + stale fingerprint).
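To make the two nits concrete, here is a hedged sketch of a reason formatter in the spirit of `describeStaleUpgrade`. The `StaleCandidate` shape and its field names are assumptions for illustration; only the function name, the empty-string fallback, and the output format of the example string come from the PR discussion.

```typescript
// Hypothetical candidate shape; field names are illustrative only.
interface StaleCandidate {
  agentName: string; // e.g. "OpenClaw"
  agentCurrent: string;
  agentExpected: string;
  imageCurrent?: string;
  imageExpected?: string;
  reasons: Array<"agent" | "image">;
}

function describeStaleUpgrade(c: StaleCandidate): string {
  // Robustness fallback noted in the review: nothing to describe.
  if (c.reasons.length === 0) return "";

  const parts: string[] = [];
  if (c.reasons.indexOf("agent") !== -1) {
    parts.push(`${c.agentName} ${c.agentCurrent} -> ${c.agentExpected}`);
  } else {
    parts.push(`${c.agentName} ${c.agentCurrent} unchanged`);
  }
  if (c.reasons.indexOf("image") !== -1) {
    // The arrow shows recorded -> running, whichever direction that is,
    // so a downgrade is reported the same way as an upgrade.
    parts.push(`NemoClaw image ${c.imageCurrent} -> ${c.imageExpected}`);
  }
  return parts.join("; ");
}

const upgrade = describeStaleUpgrade({
  agentName: "OpenClaw",
  agentCurrent: "v2026.5.27",
  agentExpected: "v2026.5.27",
  imageCurrent: "v0.0.60",
  imageExpected: "v0.0.61",
  reasons: ["image"],
});
// upgrade === "OpenClaw v2026.5.27 unchanged; NemoClaw image v0.0.60 -> v0.0.61"
```

Because the predicate is an inequality, a sandbox stamped by a newer build than the running CLI would yield `NemoClaw image v0.0.61 -> v0.0.60` in this sketch, which is why "differs from" describes the behavior better than "older than running".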
## Summary

- Add the v0.0.63 release-note section using the published development note as source context.
- Update source docs for sandbox recovery, OpenClaw config restore safety, managed vLLM selection, Slack Socket Mode conflict handling, and host diagnostics.
- Refresh generated `nemoclaw-user-*` skills from the updated Fern MDX docs.
- Update the release-doc refresh skill so post-release docs for version `n` look up the matching announcement discussion and use the `n+1` patch release label.
- Fix CLI/docs parity by avoiding a `--from <Dockerfile>` flag mention inside the `upgrade-sandboxes` command section.

## Source summary

- #5034 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document safer stale-sandbox recovery through `rebuild --yes` before recreating from scratch.
- #5091 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document Docker-driver post-reboot recovery from OpenShell container labels.
- #5101, #5174, #5177 -> `docs/manage-sandboxes/backup-restore.mdx`, `docs/about/release-notes.mdx`: Document OpenClaw `openclaw.json` preservation, merge behavior, and fail-safe restore handling.
- #5102 -> `docs/reference/commands.mdx`, `docs/reference/commands-nemohermes.mdx`, `docs/manage-sandboxes/lifecycle.mdx`, `docs/about/release-notes.mdx`: Document `upgrade-sandboxes` image-fingerprint drift detection.
- #4201 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document the installer diagnostic for unexpected Docker daemon access outside the `docker` group.
- #5038 -> `docs/inference/inference-options.mdx`, `docs/inference/use-local-inference.mdx`, `docs/about/release-notes.mdx`: Document the interactive managed-vLLM model picker and non-interactive override behavior.
- #5040, #5041 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document Ollama auth-proxy recovery and host DNS preflight diagnostics.
- #4986, #5039 -> `docs/manage-sandboxes/messaging-channels.mdx`, `docs/about/release-notes.mdx`: Document Slack validation and duplicate Slack Socket Mode sandbox handling.
- #4981, #5168 -> `docs/about/release-notes.mdx`: Capture Hermes gateway secret-guard and wrapped-argv startup hardening in the release surface.
- Follow-up -> `.agents/skills/nemoclaw-contributor-update-docs/SKILL.md`: Record the post-release docs workflow, discussion-announcement lookup, and next-patch release label rule.
- Follow-up -> `docs/reference/commands.mdx`, `docs/reference/commands-nemohermes.mdx`: Reword custom Dockerfile sandbox text so CLI parity does not treat `--from` as an `upgrade-sandboxes` flag.

## Verification

- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx`
- `npm run docs`
- `npm run build:cli`
- `bash test/e2e/e2e-cloud-experimental/check-docs.sh --only-cli`
- Skip-term scan for `docs/.docs-skip` blocked terms across generated user skills

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Documentation**
  * Enhanced local inference setup with interactive model selection prompts and environment variable overrides
  * Improved sandbox upgrade detection using build fingerprints and version checks
  * Clarified configuration restore behavior preserving user settings during rebuild/restore
  * Added gateway authentication as fifth security layer
  * Expanded Slack messaging validation with live credential checking
  * Enhanced troubleshooting guidance for Docker access, DNS issues, and sandbox recovery
  * Updated release notes for v0.0.63 featuring sandbox recovery and inference improvements

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
`nemoclaw upgrade-sandboxes` only compared the agent version inside a sandbox, so after a NemoClaw upgrade that left the bundled OpenClaw `expected_version` unchanged it printed `All sandboxes are up to date.` even though the NemoClaw image payload (scripts, patches, policies, Dockerfile, generated config) had changed. This adds a persisted NemoClaw build fingerprint so image/build drift is detected even when the agent version is unchanged.
Related Issue
Fixes #5026
Changes
- Persist a NemoClaw build fingerprint (`nemoclawVersion`, from `getVersion()`) on managed sandbox images at create/rebuild time, stamped via `getSandboxAgentRegistryFields` so `onboard.ts` stays net-neutral.
- `classifyUpgradeableSandboxes` now flags a sandbox when its agent version is stale or its recorded NemoClaw build differs from the running build, and reports the reason, e.g. `OpenClaw v2026.5.27 unchanged; NemoClaw image v0.0.60 → v0.0.61`.
- Unit tests for `isNemoclawImageStale` / classification, and a CLI regression test that a current-agent-version sandbox with a stale recorded fingerprint is detected (`upgrade-sandboxes --check`).
Design note: why missing fingerprints are not flagged (safe, forward-looking)
A missing fingerprint is intentionally treated as not drifted. It is ambiguous between a legacy NemoClaw-managed image (safe to rebuild) and a custom `--from` image, and auto-rebuilding a custom image whose Dockerfile path is unavailable would silently recreate it with the default image (data loss). Custom images are therefore left without a fingerprint, and pre-existing sandboxes opt into NemoClaw image-drift detection on their next rebuild. Every sandbox created or rebuilt on this release onward gets full drift detection.
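The stamping rules in the design note can be pictured with a small sketch. Everything below is hypothetical: `SandboxSource`, `fingerprintFieldsForBuild`, and `stripFingerprintForReuse` are illustrative names, not the PR's `getSandboxAgentRegistryFields` or `updateReusedSandboxMetadata`, and the real registry fields are not shown on this page.

```typescript
// Hypothetical description of how a sandbox image was produced.
type SandboxSource =
  | { kind: "managed" }
  | { kind: "custom"; fromDockerfile: string };

interface RegistryFields {
  nemoclawVersion?: string;
  [key: string]: unknown;
}

// Create/rebuild path: stamp only managed images, and only when the
// running build version is known. Custom `--from` images stay unstamped,
// so they can never be classified as drifted.
function fingerprintFieldsForBuild(
  source: SandboxSource,
  runningVersion: string | null,
): RegistryFields {
  if (source.kind !== "managed" || !runningVersion) return {};
  return { nemoclawVersion: runningVersion };
}

// Reuse path: drop any fingerprint from the metadata update so a reused
// sandbox keeps whatever was recorded when its image was actually built.
function stripFingerprintForReuse(update: RegistryFields): RegistryFields {
  const { nemoclawVersion: _dropped, ...rest } = update;
  return rest;
}
```

Under these assumptions, reusing a sandbox after a NemoClaw upgrade leaves the older recorded fingerprint in place, so the next `upgrade-sandboxes` run still sees the difference.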
Verification
- `nemoclaw upgrade-sandboxes --check` CLI: a stale recorded fingerprint and a drifting build are flagged; matching/missing fingerprints are not; a custom `--from` image is never auto-rebuilt.
- `npm run typecheck:cli`, Biome, and the affected Vitest suites pass (`upgrade`, `gateway-drift-preflight`, `list-share-live-inference`, onboard sandbox state handler, registry). The full `cli` suite could not be run to completion in the dev sandbox due to a host resource cap (OOM/SIGKILL); remaining failures observed there were confirmed environmental (missing plugin deps / live host gateway / single-fork cross-file state).

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>