fix(onboard): classify ARM64 image-tar upload 404 with image-ref workaround (#3266) #4934
…around (NVIDIA#3266)

On Linux ARM64 (aarch64), OpenShell's `sandbox create --from <Dockerfile>` path can fail while streaming the built image tar into the gateway container with a misleading Docker 404:

    failed to upload image tar into container
    Docker responded with status code 404: the container does not exist
    no container with name or ID "openshell-cluster-nemoclaw" found

The gateway container is healthy (a same-size archive PUT succeeds directly), so the 404 is a symptom of the large-tar upload path, not a missing gateway. Previously NemoClaw classified this as `unknown` and printed the generic `onboard --resume` hint, which just re-runs the same deterministically failing path.

Classify the failure shape (`image_upload_container_missing`) and emit a precise local-registry / image-ref workaround that preserves the built image tag so the operator can recreate without rebuilding:

- `classifySandboxCreateFailure` detects the upload-tar phrase or the combined 404 + container-missing + gateway-container-name shape.
- New pure `planSandboxCreateRecovery` decides the Linux-ARM64-targeted workaround (unit-testable without console spying); other platforms get the generic note.
- The recovery hint pushes the built image (Docker or buildah, matching the build log) to a local registry, then reconstructs NemoClaw's own create command with only `--from` swapped to the registry ref, keeping the provider/GPU/resource flags so the recreated sandbox is not misconfigured. The temporary `--policy` path and the `-- env … nemoclaw-start` runtime wrapper are shown as placeholders rather than dumped verbatim (the temp policy file is cleaned up, and the env wrapper carries host-specific values).

Normal x86_64 happy paths are untouched: this path runs only after the OpenShell upload failure is classified. The onboard.ts change is a net-zero single-line edit passing the existing `createArgs` through.

Host here is x86_64, so the ARM64 hardware bug cannot be reproduced directly; reproduced the classifier/decision path through the built CLI with a controlled OpenShell output fixture and added a manual aarch64 E2E note alongside the tests.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
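The detection rule in this commit message can be illustrated with a minimal sketch. It is not NemoClaw's actual `classifySandboxCreateFailure`: the function name `classifySketch`, the return type, and the exact substring matching are assumptions made for illustration. Only the error phrases and the `image_upload_container_missing` / `unknown` kinds come from the text above.

```typescript
// Hypothetical sketch: names and matching details are assumptions, not NemoClaw's code.
type FailureKind = "image_upload_container_missing" | "unknown";

// Gateway container name as quoted in the error output above.
const GATEWAY_CONTAINER = "openshell-cluster-nemoclaw";

function classifySketch(output: string): FailureKind {
  const text = output.toLowerCase();

  // Shape 1: the explicit upload-tar phrase.
  const hasUploadPhrase = text.includes("failed to upload image tar into container");

  // Shape 2: 404 + container-missing wording + the gateway container name, all together.
  const has404 = text.includes("status code 404");
  const containerMissing =
    text.includes("the container does not exist") ||
    text.includes("no container with name or id");
  const namesGateway = text.includes(GATEWAY_CONTAINER);

  if (hasUploadPhrase || (has404 && containerMissing && namesGateway)) {
    return "image_upload_container_missing";
  }
  return "unknown";
}

// Controlled output fixture built from the error text quoted in the commit message.
const fixture = [
  "failed to upload image tar into container",
  "Docker responded with status code 404: the container does not exist",
  'no container with name or ID "openshell-cluster-nemoclaw" found',
].join("\n");

console.log(classifySketch(fixture)); // prints "image_upload_container_missing"
```

Requiring all three signals for the second shape keeps an unrelated 404 (for example, one that names a different container) classified as `unknown`.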
Summary
On Linux ARM64 (aarch64), OpenShell's `sandbox create --from <Dockerfile>` path can fail while streaming the built image tar into the gateway container with a misleading Docker 404 (`failed to upload image tar into container` / `the container does not exist` for `openshell-cluster-nemoclaw`). The gateway is actually healthy (a same-size archive PUT succeeds directly), so the 404 is a symptom of the large-tar upload path, not a missing gateway. NemoClaw previously classified this as `unknown` and printed the generic `onboard --resume` hint, which just re-runs the same deterministically failing path. This PR classifies the failure and emits a precise, leak-safe local-registry / image-ref workaround that preserves the built image tag so the operator can recreate the sandbox without rebuilding.
Related Issue
Relates to #3266 (mitigation, not a closing fix). The root cause is OpenShell's image-tar upload path on ARM64; NemoClaw cannot make that upload succeed from its side. This change turns the misleading, unactionable failure into a classified error with an accurate workaround (the gateway is healthy; push the built image to a local registry and recreate from the image ref). Onboarding still does not complete automatically on the affected ARM64 path, so no closing keyword is used.
Changes
- `classifySandboxCreateFailure`: new `image_upload_container_missing` kind detecting the upload-tar phrase, or the combined `status code 404` + container-missing + `openshell-cluster-nemoclaw` shape.
- New pure `planSandboxCreateRecovery(failure, {platform, arch})` that decides the Linux-ARM64-targeted workaround and is unit-testable without console spying. Other platforms get the generic note.
- `printSandboxCreateRecoveryHints` now emits the workaround: start a local registry, push the built image with the matching builder (`docker push` or `buildah push`; the ARM64 build path uses buildah, where a docker-only push fails with `No such image`), then recreate from the registry ref.
- The recreate command is reconstructed from NemoClaw's own create command (`createArgs`) with only `--from` swapped to the registry ref, so the configured provider/GPU/resource flags are preserved and the recreated sandbox is not misconfigured. The temporary `--policy` path and the `-- env … nemoclaw-start` runtime wrapper are shown as placeholders rather than dumped verbatim (the temp policy is cleaned up; the env wrapper carries host-specific values).
- `onboard.ts`: net-zero single-line edit passing the existing `createArgs` to the hint.

Normal x86_64 happy paths are untouched: this path runs only after the OpenShell upload failure is classified.
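The decision and command-reconstruction steps listed above could look roughly like the sketch below. Everything here is illustrative: `planRecoverySketch`, `rewriteCreateArgsSketch`, the plan kinds, the placeholder strings, the `--example-flag` argument, and the `localhost:5000/...` registry ref are hypothetical names chosen for this example, and the sketch assumes flags arrive as separate `--flag value` array entries. The platform and arch strings follow Node's `process.platform` and `process.arch` values (`"linux"`, `"arm64"`).

```typescript
// Hypothetical sketch: shapes and names are assumptions, not NemoClaw's actual code.
type FailureKind = "image_upload_container_missing" | "unknown";
type Host = { platform: string; arch: string };
type RecoveryPlan = { kind: "image_ref_workaround" } | { kind: "generic_note" };

// Pure decision with no console output, so a test can assert on the return value.
function planRecoverySketch(failure: FailureKind, host: Host): RecoveryPlan {
  const isLinuxArm64 = host.platform === "linux" && host.arch === "arm64";
  if (failure === "image_upload_container_missing" && isLinuxArm64) {
    return { kind: "image_ref_workaround" };
  }
  return { kind: "generic_note" };
}

// Rebuild the create command: swap only the --from value, mask the temp policy
// path, and replace everything after "--" with a placeholder.
function rewriteCreateArgsSketch(createArgs: string[], registryRef: string): string[] {
  const out: string[] = [];
  let replaceNext: string | null = null; // value to emit instead of the next argument
  for (const arg of createArgs) {
    if (replaceNext !== null) {
      out.push(replaceNext);
      replaceNext = null;
      continue;
    }
    if (arg === "--") {
      // The runtime wrapper carries host-specific env values, so it is not echoed.
      out.push("--", "<runtime command>");
      break;
    }
    out.push(arg);
    if (arg === "--from") replaceNext = registryRef;
    else if (arg === "--policy") replaceNext = "<policy file>";
  }
  return out;
}

const original = [
  "sandbox", "create",
  "--from", "/tmp/build/Dockerfile",   // hypothetical Dockerfile path
  "--example-flag", "value",           // stands in for provider/GPU/resource flags
  "--policy", "/tmp/policy-123.yaml",  // hypothetical temp policy path
  "--", "env", "EXAMPLE=1", "nemoclaw-start",
];

console.log(planRecoverySketch("image_upload_container_missing", { platform: "linux", arch: "arm64" }));
console.log(rewriteCreateArgsSketch(original, "localhost:5000/example-image:tag").join(" "));
```

Keeping the decision in a function that returns data, separate from the function that prints, is what makes it testable without console spying: a test can compare the returned plan directly.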
Type of Change
Verification
- `vitest run --project cli src/lib/validation.test.ts src/lib/build-context.test.ts` (76 passed)
- `npm run typecheck:cli` passes
- `codex review --uncommitted` clean (no actionable findings)

Reproduction: this host is x86_64 (`uname -m` → `x86_64`), so the ARM64 hardware bug cannot be reproduced directly. The classifier/decision path was reproduced through the built CLI (`dist/`) with a controlled OpenShell output fixture: before the fix it classified as `unknown` and printed only `onboard --resume`; after the fix it classifies as `image_upload_container_missing` and prints the workaround.

`npm test` was run; the 39 failing tests are pre-existing environment-only subprocess-spawn timeouts (confirmed identical on a stashed baseline) and are unrelated to this change.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>