Onboard fails uploading built sandbox image to gateway (404 "container does not exist") on aarch64 #3266

@OriPekelman

Description

nemoclaw onboard builds the sandbox image successfully (55/55 buildah
steps), then fails at the post-build "Pushing image into gateway" step
with a misleading 404 from Docker claiming the gateway container
doesn't exist:

Pushing image openshell/sandbox-from:1778244557 into gateway "nemoclaw"
  [progress] Exported 100 MiB
  ...
  [progress] Exported 1700 MiB
  [progress] Exported 1777 MiB
Error:   × failed to upload image tar into container
  ╰─▶ Docker responded with status code 404: the container does not exist:
      no container with name or ID "openshell-cluster-nemoclaw" found:
      no such container
  Try:  openshell sandbox list        # check gateway state
  Recovery: nemoclaw onboard --resume
  Or:      nemoclaw onboard

The gateway container is healthy throughout. docker exec,
openshell status, and a direct curl PUT /containers/<gw>/archive
of a 2 GiB tar all succeed against the same Docker socket while the
onboard is sitting at this error.

The same openshell sandbox create --from <DIR> code path with a
minimal Dockerfile (only FROM ghcr.io/.../openclaw:latest,
producing a ~1.4 GiB image) succeeds reliably. Only the larger
nemoclaw-generated image (~1.86 GiB) fails. This strongly suggests a
size-dependent bug in the upload path of openshell sandbox create,
possibly in bollard 0.20.2 (per binary strings) when handling a
Content-Length above ~1.5 GiB.
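
One way to test the size hypothesis would be to pad the known-good minimal image in steps and note where the upload starts failing. The following is a minimal sketch, not something that was run for this report. The helper name `make_padded_dockerfile`, the `/tmp/probe-pad-*` directories, and the padding sizes are hypothetical, and it assumes the base image has `dd` and a writable `/tmp` at build time.

```shell
#!/bin/sh
# Base image taken from the minimal-Dockerfile reproduction step.
BASE_IMAGE="ghcr.io/nvidia/openshell-community/sandboxes/openclaw:latest"

# Hypothetical helper: write a Dockerfile that adds PAD_MIB MiB of
# padding on top of the known-good base image.
make_padded_dockerfile() {
    dir=$1
    pad_mib=$2
    mkdir -p "$dir"
    {
        printf 'FROM %s\n' "$BASE_IMAGE"
        # /dev/urandom keeps the layer from compressing away
        # (assumption: dd exists and /tmp is writable in the image).
        printf 'RUN dd if=/dev/urandom of=/tmp/pad.bin bs=1M count=%s\n' "$pad_mib"
    } > "$dir/Dockerfile"
}

# Example sweep (not executed here); sizes are arbitrary starting points:
#   for pad in 100 200 300 400; do
#       make_padded_dockerfile "/tmp/probe-pad-$pad" "$pad"
#       openshell sandbox create --name "probe-pad-$pad" \
#           --from "/tmp/probe-pad-$pad" --no-keep -- echo hi
#   done
```

If the failure tracks total image size, the first failing pad value should bracket the threshold between the ~1.4 GiB and ~1.86 GiB images.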

This blocks every nemoclaw onboard run on this platform because the
generated image is always ~1.86 GiB.

Workaround that unblocks us: stand up a local Docker registry,
push the buildah-built image to it via buildah push, then run
openshell sandbox create --from <registry-image-ref> directly. The
--from <image-ref> path doesn't go through the buggy upload code.
The sandbox boots normally and nemoclaw <name> doctor reports the
agent healthy.
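
For anyone hitting the same failure, the workaround can be sketched roughly as follows. This is an illustration under assumptions, not the exact commands used: the registry container name `probe-registry`, the `REGISTRY_HOST`/`REGISTRY_PORT` defaults, and the `run` dry-run wrapper are hypothetical, and the registry address must be one the gateway can actually pull from (plain HTTP may need extra configuration). By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Assumptions (not from the report): registry host/port and container name.
REGISTRY_HOST="${REGISTRY_HOST:-localhost}"
REGISTRY_PORT="${REGISTRY_PORT:-5000}"
DRY_RUN="${DRY_RUN:-1}"

# Print each command; execute it only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# $1: local image ref from the build output, e.g. openshell/sandbox-from:<id>
# $2: sandbox name
push_via_registry() {
    local_ref=$1
    sandbox_name=$2
    remote_ref="$REGISTRY_HOST:$REGISTRY_PORT/$local_ref"
    # Start a throwaway registry, push the buildah image to it, then
    # create the sandbox from the registry ref instead of a build dir.
    run docker run -d --name probe-registry \
        -p "$REGISTRY_PORT:5000" registry:2 &&
    run buildah push --tls-verify=false "$local_ref" "docker://$remote_ref" &&
    run openshell sandbox create --name "$sandbox_name" --from "$remote_ref"
}

# Example (prints the three commands without running them):
#   push_via_registry openshell/sandbox-from:1778244557 probe-bug
```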

Expected: nemoclaw onboard to complete the upload step without
failure, regardless of generated image size.

Reproduction Steps

  1. On a DGX Spark / aarch64 host with all prereqs satisfied
    (scripts/setup-spark.sh ran, default-cgroupns-mode: host,
    br_netfilter loaded), pre-start the gateway:

    openshell gateway start --name nemoclaw --port 8080 --gpu
    
  2. Run a clean non-interactive onboard pointed at any reachable
    OpenAI-compatible endpoint (we use a local ollama):

    env NEMOCLAW_NON_INTERACTIVE=1 \
        NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 \
        NEMOCLAW_PROVIDER=custom \
        NEMOCLAW_ENDPOINT_URL=http://<host>:11434/v1 \
        NEMOCLAW_PROVIDER_KEY=dummy \
        COMPATIBLE_API_KEY=dummy \
        NEMOCLAW_MODEL=qwen2.5:7b-instruct \
        NEMOCLAW_GATEWAY_NAME=nemoclaw \
      nemoclaw onboard \
        --fresh --non-interactive --yes \
        --yes-i-accept-third-party-software \
        --no-gpu --name probe-bug
    
  3. Observe: preflight, gateway reuse, inference config, and build
    all succeed. The 55/55 buildah build commits an
    openshell/sandbox-from:<id> image. The Exported 100..1777 MiB
    progress lines stream successfully. The final line is the 404
    error above.

  4. Verify the gateway is fine while the error is still on screen:

    docker inspect openshell-cluster-nemoclaw --format '{{.State.Status}}'
    # → running
    openshell doctor exec -- ls /
    # → works
    
  5. Verify Docker accepts an even larger PUT (2000 MiB) against the same endpoint:

    dd if=/dev/zero of=/tmp/big.bin bs=1M count=2000 && \
        tar -cf /tmp/big.tar -C /tmp big.bin
    curl -s --unix-socket /var/run/docker.sock \
        -X PUT --upload-file /tmp/big.tar \
        -H "Content-Type: application/x-tar" \
        "http://localhost/containers/openshell-cluster-nemoclaw/archive?path=/tmp"
    # → HTTP 200 in ~3.5s
    
  6. Reproduce the success case with a minimal Dockerfile (~1.4 GiB
    image):

    mkdir -p /tmp/probe-tiny
    echo "FROM ghcr.io/nvidia/openshell-community/sandboxes/openclaw:latest" \
        > /tmp/probe-tiny/Dockerfile
    openshell sandbox create --name probe-tiny --from /tmp/probe-tiny --no-keep -- echo hi
    # → succeeds; sandbox creates, runs, deletes
    

Environment

  • OS: Ubuntu 24.04 (DGX OS) on NVIDIA DGX Spark (GB10), aarch64
  • Kernel: 6.17.0-1014-nvidia
  • Node.js: v22.22.2 (via nvm, per nemoclaw bootstrap)
  • Docker: server 29.2.1, API version 1.53
  • NemoClaw: v0.0.36
  • OpenShell: 0.0.36 (latest)
  • bollard (per strings ~/.local/bin/openshell): 0.20.2
  • default-cgroupns-mode: host set in /etc/docker/daemon.json
  • br_netfilter loaded
  • Gateway container openshell-cluster-nemoclaw running, healthy,
    all k3s pods Running, mTLS established.

Debug Output

Attached: `nemoclaw-debug.tar.gz` (collected via
`nemoclaw debug --output /tmp/nemoclaw-debug.tar.gz --sandbox ori-supervisor`).

Note: `ori-supervisor` is the sandbox installed via the registry
workaround described under "Description" — it boots fine, agent
OpenClaw v2026.4.24 healthy. The debug tarball reflects the
post-workaround state.

Logs

Failure tail (the full output is a repetitive 55-step buildah build, omitted):


  [2/2] STEP 55/55: CMD ["/bin/bash"]
  [2/2] COMMIT docker.io/openshell/sandbox-from:1778244557
  --> c7455792ca94
  Successfully tagged docker.io/openshell/sandbox-from:1778244557
  Built image openshell/sandbox-from:1778244557
  Pushing image openshell/sandbox-from:1778244557 into gateway "nemoclaw"
  [progress] Exported 100 MiB
  [progress] Exported 200 MiB
  [progress] Exported 300 MiB
  ...
  [progress] Exported 1700 MiB
  [progress] Exported 1777 MiB
Error:   × failed to upload image tar into container
  ╰─▶ Docker responded with status code 404: the container does not exist:
      no container with name or ID "openshell-cluster-nemoclaw" found:
      no such container


Docker daemon logs during the failure show no incoming PUT request for
the archive endpoint (confirmed with
`docker events --filter container=openshell-cluster-nemoclaw`),
strongly suggesting the request is being short-circuited client-side
in bollard before it hits the daemon. Capturing wire-level traffic
on `/var/run/docker.sock` would confirm.
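
A possible way to capture that traffic is to put a logging proxy socket in front of the daemon and point the client at it. This is a rough sketch only: the proxy path `/tmp/docker-proxy.sock` and the `run` wrapper are hypothetical, and it assumes `openshell` honors `DOCKER_HOST`, which has not been verified. Note that `socat -v` dumps the full transferred data to stderr, so the capture will be about as large as the image tar.

```shell
#!/bin/sh
# Hypothetical proxy socket; the daemon socket path is from the report.
PROXY_SOCK="${PROXY_SOCK:-/tmp/docker-proxy.sock}"
DOCKER_SOCK="${DOCKER_SOCK:-/var/run/docker.sock}"
DRY_RUN="${DRY_RUN:-1}"

# Print each command; execute it only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Forward every connection on PROXY_SOCK to the real daemon socket,
# logging the traffic to stderr (-v). Runs in the foreground.
start_proxy() {
    run socat -v "UNIX-LISTEN:$PROXY_SOCK,fork" "UNIX-CONNECT:$DOCKER_SOCK"
}

# Terminal 1 (not executed here):
#   DRY_RUN=0 start_proxy 2> /tmp/docker-wire.log
# Terminal 2, assuming openshell reads DOCKER_HOST:
#   DOCKER_HOST=unix:///tmp/docker-proxy.sock nemoclaw onboard --resume
# Afterwards, look for the archive request in the capture:
#   grep -a 'PUT /' /tmp/docker-wire.log
```

If no `PUT` for the archive endpoint shows up in the capture, that would support the client-side theory; if one does, the 404 is coming from whatever is answering on that socket.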

Checklist

  • I confirmed this bug is reproducible
  • I searched existing issues and this is not a duplicate

Metadata

Labels

  • platform: arm64 (Affects ARM64 or aarch64 architecture)
  • platform: container (Affects Docker, containerd, Podman, or images)
