fix: revert base image digest pinning that broke all e2e tests by ericksoa · Pull Request #1938 · NVIDIA/NemoClaw

ericksoa · 2026-04-16T02:20:31Z

Summary

Reverts cf4941a ("fix(upgrade): pin base image digest and rebuild stale sandboxes")
The digest in blueprint.yaml belongs to ghcr.io/nvidia/openshell-community/sandboxes/openclaw but the code applied it to ghcr.io/nvidia/nemoclaw/sandbox-base — a different registry repo
Docker digests are repo-specific, so every sandbox creation fails with "manifest unknown"
All 6 nightly e2e jobs fail identically: sandbox-survival, skip-permissions, hermes, rebuild-openclaw, cloud, messaging-providers

What's removed

getBlueprintBaseImageDigest() function and Dockerfile BASE_IMAGE patching in onboard.ts
Pre-pull of digest-pinned base image before sandbox creation
nemoclaw upgrade-sandboxes command in nemoclaw.ts
Post-upgrade sandbox check in install.sh
7 related tests in onboard.test.ts

Test plan

CI passes on this branch
Re-trigger nightly e2e after merge to confirm sandbox creation works again

Closes #1904 (will need to be re-opened with the correct registry approach)

Signed-off-by: Aaron Erickson aerickson@nvidia.com

Summary by CodeRabbit

Refactor
- Removed automatic sandbox upgrade step from the installer; onboarding now completes without attempting conditional sandbox upgrades.
- Removed the standalone sandbox upgrade CLI command.
- Stopped pinning sandbox base images to digest values during sandbox creation.
Tests
- Updated tests to reflect removal of digest-pinning and sandbox upgrade behavior.

) The digest from blueprint.yaml (openshell-community registry) was applied to the nemoclaw/sandbox-base registry, causing "manifest unknown" on every sandbox creation. Reverts cf4941a until digest pinning can use the correct registry/digest pair. Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

coderabbitai · 2026-04-16T02:20:45Z

📝 Walkthrough

Walkthrough

The PR removes sandbox upgrade behaviors: the post-install "upgrade-sandboxes" invocation, blueprint base-image digest extraction and digest-pinned pre-pull, and the global CLI upgrade-sandboxes command and its implementation. Tests and installer script references to these flows were removed or simplified.

Changes

Cohort / File(s)	Summary
Post-Install Hook `scripts/install.sh`	Deleted the conditional post-onboarding step that invoked `nemoclaw upgrade-sandboxes` when stale sandboxes were detected; installer now proceeds directly to completion messaging.
Blueprint / Sandbox Pinning `src/lib/onboard.ts`	Removed `getBlueprintBaseImageDigest` and all logic that parsed blueprint YAML, validated `sha256:` digests, pinned `ARG BASE_IMAGE`, and pre-pulled digest-pinned sandbox-base images; retained non-functional formatting changes.
CLI Command Removal `src/nemoclaw.ts`	Removed the global CLI command `upgrade-sandboxes`, its dispatcher case, and the `upgradeSandboxes(...)` implementation; updated help text to remove the command and relabeled help section.
Tests Updated `test/onboard.test.ts`	Removed tests covering digest extraction, Dockerfile patching, and pre-pull behavior; minor assertion loosenings/formatting changes elsewhere in the test file.
Script Formatting Only `scripts/brev-launchable-ci-cpu.sh`	Reformatted shell control-flow with explicit line continuations and re-indented case branches; behavior preserved (no functional changes).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nibbled at code and found the trace,
I hopped the upgrade step right out of place.
Fewer hops, fewer logs to chase,
A quieter burrow, a tidier space. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	Title accurately describes the primary change: reverting base image digest pinning that caused e2e test failures.
Linked Issues check	✅ Passed	PR reverts changes related to issue `#1904`, but this was an attempted fix that broke e2e tests due to using incorrect image digest from wrong repository.
Out of Scope Changes check	✅ Passed	All changes are scoped to reverting the base image digest feature: removed getBlueprintBaseImageDigest, upgrade-sandboxes command, and related tests.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch revert/base-image-digest-pin

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Run shfmt and Prettier on files that diverged from formatter output. Update test regex to handle multi-line runOpenshell call after Prettier. Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/lib/onboard.ts (1)
983-1073: ⚠️ Potential issue | 🟠 Major

Keep a base-image upgrade signal in the sandbox path.

This revert fixes the bad repo-specific digest, but it also removes the only mechanism in this file that forced sandboxes onto the current sandbox-base. The reuse path here still keeps a ready sandbox indefinitely, and the rebuild path no longer pins or refreshes the base image before build. After a NemoClaw upgrade, users can therefore remain on an older OpenClaw either by reuse or by rebuilding from a locally cached base tag, which is the regression issue #1904 was trying to solve.

Please keep a repo-correct freshness check here before release — e.g. resolve the digest from ghcr.io/nvidia/nemoclaw/sandbox-base, or force a pull / --pull-equivalent before the sandbox build.

Also applies to: 2506-2556, 2574-2796
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 983 - 1073, The revert removed the mechanism
that enforces sandbox base-image freshness, causing sandboxes to stick to stale
sandbox-base digests; in patchStagedDockerfile restore a repo-correct upgrade
signal by resolving the current digest for ghcr.io/nvidia/nemoclaw/sandbox-base
(or otherwise fetching the manifest) and inject it into the staged Dockerfile
(e.g. as ARG NEMOCLAW_SANDBOX_BASE_DIGEST or NEMOCLAW_BASE_DIGEST) or ensure the
sandbox build uses a forced pull (equivalent to docker build --pull) so rebuilt
sandboxes get the latest sandbox-base; update patchStagedDockerfile (and any
callers that rely on encodeDockerJsonArg / getSandboxInferenceConfig) to write
that digest/flag into the dockerfile content or into the sandbox path so reuse
and rebuild paths both honor base-image refresh.

🧹 Nitpick comments (1)

src/nemoclaw.ts (1)
1605-1608: Split sandboxRebuild into smaller steps to reduce complexity.

This function is still too branch-heavy for safe iteration/debugging, and this PR touches it. Extracting step helpers (precheck, backup, delete, recreate, restore, post-migrate) would materially improve maintainability and reduce regression risk.

As per coding guidelines, "Limit cyclomatic complexity to 20 in JavaScript/TypeScript files, with target of 15".

Also applies to: 1655-1657, 1670-1672, 1692-1694, 1711-1713, 1723-1725, 1736-1741, 1758-1760, 1770-1772, 1779-1781, 1785-1787, 1809-1811
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/nemoclaw.ts` around lines 1605 - 1608, The sandboxRebuild function is too
branch-heavy; refactor it into clear step helpers to reduce cyclomatic
complexity and improve maintainability—extract and call small functions such as
sandboxPrecheck, sandboxBackup, sandboxDelete, sandboxRecreate, sandboxRestore,
and sandboxPostMigrate from within sandboxRebuild, move the corresponding
branching and try/catch logic into these helpers, preserve existing error
handling and return behavior (including verbose flag handling via the verbose
const), and update any callers/tests to use the refactored flow so behavior
remains identical.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/nemoclaw.ts`:
- Around line 2033-2035: The CLI no longer recognizes the removed command
"upgrade-sandboxes", which causes generic unknown-command errors; update the CLI
command dispatch (where commands are registered/handled—the code that prints the
Backup help block and resolves input commands) to detect the legacy literal
"upgrade-sandboxes" and provide a deterministic migration path: either print a
clear deprecation/migration message directing users to "nemoclaw backup-all" and
exit with a non-zero status, or implement a compatibility shim that logs a
deprecation warning and forwards the invocation to the existing "backup-all"
handler; make sure to reference "upgrade-sandboxes" and the existing
"backup-all" handler when adding this logic.

---

Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 983-1073: The revert removed the mechanism that enforces sandbox
base-image freshness, causing sandboxes to stick to stale sandbox-base digests;
in patchStagedDockerfile restore a repo-correct upgrade signal by resolving the
current digest for ghcr.io/nvidia/nemoclaw/sandbox-base (or otherwise fetching
the manifest) and inject it into the staged Dockerfile (e.g. as ARG
NEMOCLAW_SANDBOX_BASE_DIGEST or NEMOCLAW_BASE_DIGEST) or ensure the sandbox
build uses a forced pull (equivalent to docker build --pull) so rebuilt
sandboxes get the latest sandbox-base; update patchStagedDockerfile (and any
callers that rely on encodeDockerJsonArg / getSandboxInferenceConfig) to write
that digest/flag into the dockerfile content or into the sandbox path so reuse
and rebuild paths both honor base-image refresh.

---

Nitpick comments:
In `@src/nemoclaw.ts`:
- Around line 1605-1608: The sandboxRebuild function is too branch-heavy;
refactor it into clear step helpers to reduce cyclomatic complexity and improve
maintainability—extract and call small functions such as sandboxPrecheck,
sandboxBackup, sandboxDelete, sandboxRecreate, sandboxRestore, and
sandboxPostMigrate from within sandboxRebuild, move the corresponding branching
and try/catch logic into these helpers, preserve existing error handling and
return behavior (including verbose flag handling via the verbose const), and
update any callers/tests to use the refactored flow so behavior remains
identical.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: b3d7c9f0-65a7-4e75-970a-73124f72d34d

📥 Commits

Reviewing files that changed from the base of the PR and between f85eeb2 and 970515c.

📒 Files selected for processing (4)

scripts/brev-launchable-ci-cpu.sh
src/lib/onboard.ts
src/nemoclaw.ts
test/onboard.test.ts

✅ Files skipped from review due to trivial changes (1)

scripts/brev-launchable-ci-cpu.sh

🚧 Files skipped from review as they are similar to previous changes (1)

test/onboard.test.ts

coderabbitai · 2026-04-16T02:42:00Z

+  ${G}Backup:${R}
    nemoclaw backup-all              Back up all sandbox state before upgrade
-    nemoclaw upgrade-sandboxes       Detect and rebuild stale sandboxes ${D}(--check, --auto)${R}



⚠️ Potential issue | 🟡 Minor

Add an explicit migration note for removed upgrade-sandboxes command.

Given the command surface changed, consider a dedicated deprecation/error path so old scripts/users get a deterministic migration hint instead of generic unknown-command behavior.

Suggested compatibility shim

+ if (cmd === "upgrade-sandboxes") { + console.error(" `nemoclaw upgrade-sandboxes` has been removed."); + console.error(" Use `nemoclaw backup-all` before upgrade, then `nemoclaw <name> rebuild` per sandbox."); + process.exit(1); + }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/nemoclaw.ts` around lines 2033 - 2035, The CLI no longer recognizes the removed command "upgrade-sandboxes", which causes generic unknown-command errors; update the CLI command dispatch (where commands are registered/handled—the code that prints the Backup help block and resolves input commands) to detect the legacy literal "upgrade-sandboxes" and provide a deterministic migration path: either print a clear deprecation/migration message directing users to "nemoclaw backup-all" and exit with a non-zero status, or implement a compatibility shim that logs a deprecation warning and forwards the invocation to the existing "backup-all" handler; make sure to reference "upgrade-sandboxes" and the existing "backup-all" handler when adding this logic.

…sandboxes (#1904) Re-implements the digest pinning and upgrade-sandboxes features reverted in #1938. The original PR (#1937) broke all e2e tests because it read a digest from blueprint.yaml (belongs to ghcr.io/nvidia/openshell-community) and applied it to ghcr.io/nvidia/nemoclaw/sandbox-base — a different registry. Docker digests are repo-specific, so every pull returned "manifest unknown". This fix never reads blueprint.yaml for the base image digest. Instead: - pullAndResolveBaseImageDigest() pulls sandbox-base:latest from GHCR - docker inspect extracts the actual repo digest - patchStagedDockerfile() pins ARG BASE_IMAGE to the inspected digest The digest is self-consistent by construction — it always comes from the same registry we pin to. Falls back to unpinned :latest when GHCR is unreachable (offline/firewall users). Also re-adds the upgrade-sandboxes command with fixes from code review: - --auto flag for non-interactive installer contexts - sandbox list failure handling before classifying sandboxes - throwOnError option for sandboxRebuild to prevent process.exit in loops Closes #1904 Signed-off-by: Test User <test@example.com>

…sandboxes (#1904) Re-implements the digest pinning and upgrade-sandboxes features reverted in #1938. The original PR (#1937) broke all e2e tests because it read a digest from blueprint.yaml (belongs to ghcr.io/nvidia/openshell-community) and applied it to ghcr.io/nvidia/nemoclaw/sandbox-base — a different registry. Docker digests are repo-specific, so every pull returned "manifest unknown". This fix never reads blueprint.yaml for the base image digest. Instead: - pullAndResolveBaseImageDigest() pulls sandbox-base:latest from GHCR - docker inspect extracts the actual repo digest - patchStagedDockerfile() pins ARG BASE_IMAGE to the inspected digest The digest is self-consistent by construction — it always comes from the same registry we pin to. Falls back to unpinned :latest when GHCR is unreachable (offline/firewall users). Also re-adds the upgrade-sandboxes command with fixes from code review: - --auto flag for non-interactive installer contexts - sandbox list failure handling before classifying sandboxes - throwOnError option for sandboxRebuild to prevent process.exit in loops Closes #1904 Signed-off-by: Test User <test@example.com> Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

…sandboxes (#1904) Re-implements the digest pinning and upgrade-sandboxes features reverted in #1938. The original PR (#1937) broke all e2e tests because it read a digest from blueprint.yaml (belongs to ghcr.io/nvidia/openshell-community) and applied it to ghcr.io/nvidia/nemoclaw/sandbox-base — a different registry. Docker digests are repo-specific, so every pull returned "manifest unknown". This fix never reads blueprint.yaml for the base image digest. Instead: - pullAndResolveBaseImageDigest() pulls sandbox-base:latest from GHCR - docker inspect extracts the actual repo digest - patchStagedDockerfile() pins ARG BASE_IMAGE to the inspected digest The digest is self-consistent by construction — it always comes from the same registry we pin to. Falls back to unpinned :latest when GHCR is unreachable (offline/firewall users). Also re-adds the upgrade-sandboxes command with fixes from code review: - --auto flag for non-interactive installer contexts - sandbox list failure handling before classifying sandboxes - throwOnError option for sandboxRebuild to prevent process.exit in loops - RepoDigests filtered by SANDBOX_BASE_IMAGE prefix (ordering not guaranteed) - Exit non-zero from upgrade-sandboxes when sandbox list or rebuilds fail - Report sandboxes with unavailable version detection Closes #1904 Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

…sandboxes (#1943) ## Summary - Re-implements digest pinning and `upgrade-sandboxes` reverted in #1938 - Fixes the root cause: original PR read a digest from `blueprint.yaml` (belongs to `ghcr.io/nvidia/openshell-community`) and applied it to `ghcr.io/nvidia/nemoclaw/sandbox-base` — a different registry. Docker digests are repo-specific, so every sandbox creation failed with "manifest unknown" - New approach: `pullAndResolveBaseImageDigest()` pulls `sandbox-base:latest` from GHCR, then `docker inspect` extracts the actual repo digest. The digest is self-consistent by construction — it always comes from the same registry we pin to - Falls back to unpinned `:latest` when GHCR is unreachable (offline/firewall users) - Re-adds `nemoclaw upgrade-sandboxes` command with `--check`/`--auto`/`--yes` flags and fixes from CodeRabbit review (sandbox list error handling, `throwOnError` for batch rebuilds, `--auto` in install.sh) Closes #1904 ## Test plan - [x] 8 new unit tests (6 in onboard.test.ts, 2 CLI integration tests in cli.test.ts) - [x] Regression guard: verifies `ARG BASE_IMAGE` references `nemoclaw/sandbox-base`, NOT `openshell-community` - [x] Structural test: `pullAndResolveBaseImageDigest()` called before `patchStagedDockerfile()` - [x] CLI test: `upgrade-sandboxes --check` detects stale sandbox (original reporter scenario from #1904) - [x] CLI test: `upgrade-sandboxes --check` reports all-current when no sandboxes are stale - [x] Nightly cloud e2e: verify sandbox creation succeeds with digest pinning (the exact scenario that broke in #1937) - [ ] Manual: verify `nemoclaw upgrade-sandboxes --check` on a real environment with a stale sandbox  ## Summary by CodeRabbit * **New Features** * Added a CLI command to detect and batch-upgrade stale sandboxes (report-only and auto/yes modes). * Installer now performs a post-onboarding scan and can trigger non-blocking sandbox upgrades. * **Improvements** * Sandbox base images are pinned to resolved digests when available for more reproducible builds. * Rebuild flow updated to continue processing multiple sandboxes and surface per-sandbox failures without aborting the whole run. * **Tests** * New unit/integration and end-to-end tests covering detection, reporting, upgrade, and rebuild workflows. * **Chores** * Nightly workflow extended with new E2E jobs; CI failure artifacts and notifications updated.  --------- Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

ericksoa changed the title ~~revert: remove base image digest pinning that broke all e2e tests~~ fix: revert base image digest pinning that broke all e2e tests Apr 16, 2026

chore: format files and fix source-reading test regex

970515c

Run shfmt and Prettier on files that diverged from formatter output. Update test regex to handle multi-line runOpenshell call after Prettier. Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

cv approved these changes Apr 16, 2026

View reviewed changes

coderabbitai Bot reviewed Apr 16, 2026

View reviewed changes

ericksoa merged commit bb5fdb4 into main Apr 16, 2026
18 of 21 checks passed

ericksoa mentioned this pull request Apr 16, 2026

[All Platform][Upgrade][Github Issue #1904] The sandbox OpenClaw version is not upgraded after NemoClaw upgrade #1904

Closed

ericksoa mentioned this pull request Apr 16, 2026

fix(upgrade): pin base image via pull-then-inspect and rebuild stale sandboxes #1941

Closed

7 tasks

ericksoa mentioned this pull request Apr 16, 2026

fix(upgrade): pin base image via pull-then-inspect and rebuild stale sandboxes #1943

Merged

7 tasks

Dongni-Yang mentioned this pull request Apr 16, 2026

nemoclaw onboard fails with "manifest unknown" after pulling — stale dist/ not detected #1958

Closed

2 tasks

wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: revert base image digest pinning that broke all e2e tests#1938

fix: revert base image digest pinning that broke all e2e tests#1938
ericksoa merged 2 commits into
mainfrom
revert/base-image-digest-pin

ericksoa commented Apr 16, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 16, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Apr 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ericksoa commented Apr 16, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's removed

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ericksoa commented Apr 16, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 16, 2026 •

edited

Loading