fix(onboard): classify ARM64 image-tar upload 404 with image-ref workaround (#3266) #4934

Merged
cv merged 1 commit into NVIDIA:main from yimoj:fix/3266-arm64-image-upload-fallback
Jun 8, 2026
Conversation

@yimoj yimoj commented Jun 8, 2026

Contributor

Summary

On Linux ARM64 (aarch64), OpenShell's sandbox create --from <Dockerfile> path can fail while streaming the built image tar into the gateway container with a misleading Docker 404 (failed to upload image tar into container / the container does not exist for openshell-cluster-nemoclaw). The gateway is actually healthy — a same-size archive PUT succeeds directly — so the 404 is a symptom of the large-tar upload path, not a missing gateway. NemoClaw previously classified this as unknown and printed the generic onboard --resume hint, which just re-runs the same deterministically failing path. This PR classifies the failure and emits a precise, leak-safe local-registry / image-ref workaround that preserves the built image tag so the operator can recreate the sandbox without rebuilding.

Related Issue

Relates to #3266 (mitigation, not a closing fix). The root cause is OpenShell's image-tar upload path on ARM64; NemoClaw cannot make that upload succeed from its side. This change turns the misleading, unactionable failure into a classified error with an accurate workaround (the gateway is healthy; push the built image to a local registry and recreate from the image ref). Onboarding still does not complete automatically on the affected ARM64 path, so no closing keyword is used.
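
The local-registry workaround mentioned above can be sketched roughly as follows. This is an illustration only, not the hint text NemoClaw prints: the function name `pushHints`, the `localhost:5000` registry address, and the use of the stock `registry:2` image are all assumptions made for the example. It shows why the push command has to match the builder that produced the image.

```typescript
// Hypothetical sketch: builds the shell commands an operator might run.
// Assumes a throwaway local registry on localhost:5000 (not confirmed by the PR).
type Builder = "docker" | "buildah";

const REGISTRY = "localhost:5000";

function pushHints(builder: Builder, builtImage: string): string[] {
  const target = `${REGISTRY}/${builtImage}`;
  // Step 1: start a local registry container.
  const start = "docker run -d -p 5000:5000 --name registry registry:2";
  // Step 2: push with the same tool that built the image. An image built by
  // buildah is not in the Docker image store, so `docker push` cannot find it.
  const push =
    builder === "docker"
      ? [`docker tag ${builtImage} ${target}`, `docker push ${target}`]
      : [`buildah push --tls-verify=false ${builtImage} docker://${target}`];
  return [start, ...push];
}
```

The returned registry reference (`localhost:5000/<image>` in this sketch) is what the recreate step would then pass to `--from`.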

Changes

  • classifySandboxCreateFailure: new image_upload_container_missing kind detecting the upload-tar phrase, or the combined status code 404 + container-missing + openshell-cluster-nemoclaw shape.
  • New pure planSandboxCreateRecovery(failure, {platform, arch}) that decides the Linux-ARM64-targeted workaround — unit-testable without console spying. Other platforms get the generic note.
  • printSandboxCreateRecoveryHints now emits the workaround: start a local registry, push the built image with the matching builder (docker push or buildah push — the ARM64 build path uses buildah, where a docker-only push fails with No such image), then recreate from the registry ref.
  • The recreate step reconstructs NemoClaw's own create command (via the existing createArgs) with only --from swapped to the registry ref, so the configured provider/GPU/resource flags are preserved and the recreated sandbox is not misconfigured. The temporary --policy path and the -- env … nemoclaw-start runtime wrapper are shown as placeholders rather than dumped verbatim (the temp policy is cleaned up; the env wrapper carries host-specific values).
  • onboard.ts: net-zero single-line edit passing the existing createArgs to the hint.
  • Unit tests for the classifier, the recovery decision, the builder-agnostic hints, the built-image-ref extraction, and the create-command reconstruction. Manual aarch64 E2E note added alongside the tests.

Normal x86_64 happy paths are untouched: this path runs only after the OpenShell upload failure is classified.
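
A minimal sketch of the classification rule from the first bullet, assuming plain substring matching on the captured OpenShell output. The names `classifyCreateOutput` and `FailureKind` are hypothetical; the real `classifySandboxCreateFailure` in src/lib/validation.ts may be structured differently. The matched phrases are taken from the error text quoted in this PR.

```typescript
// Hypothetical sketch, not the actual NemoClaw implementation.
type FailureKind = "image_upload_container_missing" | "unknown";

const GATEWAY_CONTAINER = "openshell-cluster-nemoclaw";

function classifyCreateOutput(output: string): FailureKind {
  const text = output.toLowerCase();

  // Signal 1: the upload-tar phrase is specific enough on its own.
  const uploadPhrase = text.includes("failed to upload image tar into container");

  // Signal 2: three weaker signals that only count when they appear together.
  const has404 = text.includes("status code 404");
  const containerMissing =
    text.includes("the container does not exist") ||
    text.includes("no container with name or id");
  const namesGateway = text.includes(GATEWAY_CONTAINER);

  if (uploadPhrase || (has404 && containerMissing && namesGateway)) {
    return "image_upload_container_missing";
  }
  return "unknown";
}
```

Requiring all three weaker signals together keeps an unrelated 404 (for example, one that does not name the gateway container) from being classified as this failure.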

Type of Change

  • Code change (feature, bug fix, or refactor)

Verification

  • Targeted tests pass: vitest run --project cli src/lib/validation.test.ts src/lib/build-context.test.ts (76 passed)
  • npm run typecheck:cli passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • codex review --uncommitted clean (no actionable findings)

Reproduction: this host is x86_64 (uname -m reports x86_64), so the ARM64 hardware bug cannot be reproduced directly. The classifier/decision path was reproduced through the built CLI (dist/) with a controlled OpenShell output fixture: before the fix it classified as unknown and printed only onboard --resume; after the fix it classifies as image_upload_container_missing and prints the workaround. npm test was also run; its 39 failing tests are pre-existing, environment-only subprocess-spawn timeouts (confirmed identical on a stashed baseline) and are unrelated to this change.


Signed-off-by: Yimo Jiang yimoj@nvidia.com

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced error recovery guidance when sandbox creation fails due to image upload issues.
    • Added platform-specific recovery recommendations for ARM64 Linux systems.
    • Improved recovery instructions with detailed step-by-step guidance and alternative deployment methods.

…around (NVIDIA#3266)

On Linux ARM64 (aarch64), OpenShell's `sandbox create --from <Dockerfile>`
path can fail while streaming the built image tar into the gateway
container with a misleading Docker 404:

  failed to upload image tar into container
  Docker responded with status code 404: the container does not exist
  no container with name or ID "openshell-cluster-nemoclaw" found

The gateway container is healthy — a same-size archive PUT succeeds
directly — so the 404 is a symptom of the large-tar upload path, not a
missing gateway. Previously NemoClaw classified this as `unknown` and
printed the generic `onboard --resume` hint, which just re-runs the same
deterministically failing path.

Classify the failure shape (`image_upload_container_missing`) and emit a
precise local-registry / image-ref workaround that preserves the built
image tag so the operator can recreate without rebuilding:

- `classifySandboxCreateFailure` detects the upload-tar phrase or the
  combined 404 + container-missing + gateway-container-name shape.
- New pure `planSandboxCreateRecovery` decides the Linux-ARM64-targeted
  workaround (unit-testable without console spying); other platforms get
  the generic note.
- The recovery hint pushes the built image (Docker or buildah, matching
  the build log) to a local registry, then reconstructs NemoClaw's own
  create command with only `--from` swapped to the registry ref — keeping
  the provider/GPU/resource flags so the recreated sandbox is not
  misconfigured. The temporary `--policy` path and the `-- env …
  nemoclaw-start` runtime wrapper are shown as placeholders rather than
  dumped verbatim (the temp policy file is cleaned up, and the env
  wrapper carries host-specific values).
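
The pure decision step in the list above could look roughly like this. It is a sketch under assumptions: the `Failure`, `Host`, and `RecoveryPlan` shapes and the `planRecovery` name are hypothetical stand-ins for `planSandboxCreateRecovery` and `SandboxCreateRecoveryPlan`, and `platform`/`arch` are assumed to follow Node's `process.platform` and `process.arch` values (`"linux"`, `"arm64"`). Returning the generic note for every other failure kind is also an assumption.

```typescript
// Hypothetical shapes; the real interfaces in validation.ts may differ.
interface Failure {
  kind: "image_upload_container_missing" | "unknown";
}
interface Host {
  platform: string; // assumed to match process.platform, e.g. "linux"
  arch: string; // assumed to match process.arch, e.g. "arm64"
}
interface RecoveryPlan {
  action: "image-ref-workaround" | "generic-note";
}

function planRecovery(failure: Failure, host: Host): RecoveryPlan {
  // Only the classified upload failure is eligible for the workaround.
  if (failure.kind !== "image_upload_container_missing") {
    return { action: "generic-note" };
  }
  // The workaround targets Linux ARM64; other hosts get the generic note.
  const isLinuxArm64 = host.platform === "linux" && host.arch === "arm64";
  return { action: isLinuxArm64 ? "image-ref-workaround" : "generic-note" };
}
```

Because the function takes the host as an argument instead of reading it from the process, a test on an x86_64 machine can exercise the ARM64 branch by passing `{ platform: "linux", arch: "arm64" }` directly.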

Normal x86_64 happy paths are untouched: this path runs only after the
OpenShell upload failure is classified. The onboard.ts change is a
net-zero single-line edit passing the existing `createArgs` through.

Host here is x86_64, so the ARM64 hardware bug cannot be reproduced
directly; reproduced the classifier/decision path through the built CLI
with a controlled OpenShell output fixture and added a manual aarch64
E2E note alongside the tests.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

coderabbitai Bot commented Jun 8, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5d47cad4-31e7-4df0-bd9c-d924808f6576

📥 Commits

Reviewing files that changed from the base of the PR and between c8be25d and a8aeec9.

📒 Files selected for processing (5)
  • src/lib/build-context.test.ts
  • src/lib/build-context.ts
  • src/lib/onboard.ts
  • src/lib/validation.test.ts
  • src/lib/validation.ts

Walkthrough

This PR detects a specific Docker 404 failure during large image tar uploads to the sandbox gateway, classifies it as image_upload_container_missing, and generates platform-aware recovery guidance that reconstructs the original sandbox create command with the pushed image reference, including ARM64-specific workarounds for Linux.

Changes

Image Upload Failure Recovery

| Layer / File(s) | Summary |
| --- | --- |
| Failure classification and recovery planning (src/lib/validation.ts, src/lib/validation.test.ts) | SandboxCreateFailure gains image_upload_container_missing kind; new SandboxCreateRecoveryPlan interface and planSandboxCreateRecovery() function enable ARM64 workaround recommendations. classifySandboxCreateFailure() detects the specific "failed to upload image tar" 404 pattern tied to the nemoclaw gateway container. |
| Recovery hint generation and command helpers (src/lib/build-context.ts, src/lib/build-context.test.ts) | extractBuiltImageRef() parses built image refs from output; reconstructImageRefCreateCommand() rebuilds the original sandbox create command by swapping --from to a registry reference and masking temporary arguments. printSandboxCreateRecoveryHints() now generates step-by-step recovery steps including reconstructed commands and ARM64-specific guidance based on platform and provided createArgs. |
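
The command reconstruction described above can be sketched as follows, assuming `createArgs` is a flat string array with flags in `--flag value` form. The function name `reconstructCreateCommand` and the placeholder strings are hypothetical, and the real `reconstructImageRefCreateCommand()` may handle more cases (for example `--flag=value`).

```typescript
// Hypothetical sketch: swap --from, mask --policy and the runtime wrapper,
// and pass every other argument through unchanged.
function reconstructCreateCommand(
  createArgs: readonly string[],
  registryRef: string,
): string[] {
  const out: string[] = [];
  let skipNext = false;
  for (const arg of createArgs) {
    if (skipNext) {
      // Drop the original value of a flag that was rewritten below.
      skipNext = false;
      continue;
    }
    if (arg === "--") {
      // Everything after "--" is the host-specific runtime wrapper.
      out.push("--", "<runtime-command>");
      break;
    }
    if (arg === "--from") {
      out.push(arg, registryRef);
      skipNext = true;
      continue;
    }
    if (arg === "--policy") {
      // The temporary policy file no longer exists, so show a placeholder.
      out.push(arg, "<policy-file>");
      skipNext = true;
      continue;
    }
    out.push(arg);
  }
  return out;
}
```

For example, with a made-up flag `--example-flag` standing in for the provider/GPU/resource flags, the input `["--from", "./Dockerfile", "--example-flag", "value", "--policy", "/tmp/policy.yaml", "--", "env", "KEY=value", "nemoclaw-start"]` and the ref `localhost:5000/sandbox:example` yield `["--from", "localhost:5000/sandbox:example", "--example-flag", "value", "--policy", "<policy-file>", "--", "<runtime-command>"]`.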

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

bug-fix, area: onboarding, Sandbox, Docker

Suggested reviewers

  • cv

Poem

🐰 A tar upload stumbles, a gateway 404 looms,
But NemoClaw's wisdom clears the rooms!
ARM64 gets special care today,
With reconstructed commands lighting the way. 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: classifying an ARM64 image-tar upload 404 failure and providing an image-ref workaround, which is the core focus of the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@cv cv added the v0.0.61 (Release target) label Jun 8, 2026
@wscurran wscurran added the area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery), bug-fix (PR fixes a bug or regression), and platform: arm64 (Affects ARM64 or aarch64 architecture) labels Jun 8, 2026
wscurran commented Jun 8, 2026

Contributor

@cv cv merged commit e2db5ed into NVIDIA:main Jun 8, 2026
38 checks passed