Description
nemoclaw onboard builds the sandbox image successfully (55/55 buildah
steps), then fails at the post-build "Pushing image into gateway" step
with a misleading 404 from Docker claiming the gateway container
doesn't exist:
Pushing image openshell/sandbox-from:1778244557 into gateway "nemoclaw"
[progress] Exported 100 MiB
... 1700 ...
[progress] Exported 1777 MiB
Error: × failed to upload image tar into container
╰─▶ Docker responded with status code 404: the container does not exist:
no container with name or ID "openshell-cluster-nemoclaw" found:
no such container
Try: openshell sandbox list # check gateway state
Recovery: nemoclaw onboard --resume
Or: nemoclaw onboard
The gateway container is healthy throughout. docker exec,
openshell status, and a direct curl PUT /containers/<gw>/archive
of a 2 GiB tar all succeed against the same Docker socket while the
onboard is sitting at this error.
The same openshell sandbox create --from <DIR> code path with a
minimal Dockerfile (only FROM ghcr.io/.../openclaw:latest,
producing a ~1.4 GiB image) succeeds reliably. Only the larger
nemoclaw-generated image (~1.86 GiB) fails. This strongly suggests a
size-dependent bug in the upload path of openshell sandbox create,
possibly in bollard 0.20.2 (version per strings on the binary) when
handling a Content-Length above ~1.5 GiB.
This blocks every nemoclaw onboard run on this platform because the
generated image is always ~1.86 GiB.
Workaround that unblocks us: stand up a local docker registry,
push the buildah-built image to it via buildah push, then
openshell sandbox create --from <registry-image-ref> directly. The
--from <image-ref> path doesn't go through the buggy upload code.
Sandbox boots normally and nemoclaw <name> doctor reports the agent
healthy.
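The workaround above can be sketched roughly as follows. This is illustrative rather than a transcript: the registry address `localhost:5000`, the container name `local-registry`, and the sandbox name `probe-registry` are placeholders, and `--tls-verify=false` assumes the registry serves plain HTTP. The registry address must also be reachable from inside the gateway, so `localhost` may need to be replaced with a host IP. Commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder values (assumptions, not taken from the failing run except the tag).
REGISTRY="${REGISTRY:-localhost:5000}"
IMAGE="${IMAGE:-openshell/sandbox-from:1778244557}"
SANDBOX="${SANDBOX:-probe-registry}"

# Print each command by default; set DRY_RUN=0 to execute it instead.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

registry_workaround() {
  # 1. Start a throwaway local registry (official registry:2 image).
  run docker run -d --name local-registry -p 5000:5000 registry:2
  # 2. Push the buildah-built image to it over plain HTTP.
  run buildah push --tls-verify=false "$IMAGE" "docker://$REGISTRY/$IMAGE"
  # 3. Create the sandbox from the registry reference, not from a directory.
  run openshell sandbox create --name "$SANDBOX" --from "$REGISTRY/$IMAGE"
}

registry_workaround
```

Run it once as-is to review the printed commands, then again with `DRY_RUN=0` after adjusting the placeholders.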
Expected: nemoclaw onboard to complete the upload step without
failure, regardless of generated image size.
Reproduction Steps
- On a DGX Spark / aarch64 host with all prereqs satisfied
(scripts/setup-spark.sh ran, default-cgroupns-mode: host,
br_netfilter loaded), pre-start the gateway:
openshell gateway start --name nemoclaw --port 8080 --gpu
- Run a clean non-interactive onboard pointed at any reachable
OpenAI-compatible endpoint (we use a local ollama):
env NEMOCLAW_NON_INTERACTIVE=1 \
NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 \
NEMOCLAW_PROVIDER=custom \
NEMOCLAW_ENDPOINT_URL=http://<host>:11434/v1 \
NEMOCLAW_PROVIDER_KEY=dummy \
COMPATIBLE_API_KEY=dummy \
NEMOCLAW_MODEL=qwen2.5:7b-instruct \
NEMOCLAW_GATEWAY_NAME=nemoclaw \
nemoclaw onboard \
--fresh --non-interactive --yes \
--yes-i-accept-third-party-software \
--no-gpu --name probe-bug
- Observe: preflight, gateway reuse, inference config, build all
succeed. The 55/55 buildah build commits an
openshell/sandbox-from:<id> image. Exported 100..1777 MiB
progress lines stream successfully. Final line: the 404 error
above.
- Verify the gateway is fine while the error is still on screen:
docker inspect openshell-cluster-nemoclaw --format '{{.State.Status}}'
# → running
openshell doctor exec -- ls /
# → works
- Verify Docker accepts a same-size PUT against the same endpoint:
dd if=/dev/zero of=/tmp/big.bin bs=1M count=2000 && \
tar -cf /tmp/big.tar -C /tmp big.bin
curl -s --unix-socket /var/run/docker.sock \
-X PUT --upload-file /tmp/big.tar \
-H "Content-Type: application/x-tar" \
"http://localhost/containers/openshell-cluster-nemoclaw/archive?path=/tmp"
# → HTTP 200 in ~3.5s
- Reproduce the success case with a minimal Dockerfile (~1.4 GiB
image):
mkdir -p /tmp/probe-tiny
echo "FROM ghcr.io/nvidia/openshell-community/sandboxes/openclaw:latest" \
> /tmp/probe-tiny/Dockerfile
openshell sandbox create --name probe-tiny --from /tmp/probe-tiny --no-keep -- echo hi
# → succeeds; sandbox creates, runs, deletes
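To narrow down the suspected size threshold between the ~1.4 GiB success case and the ~1.86 GiB failure, the minimal Dockerfile can be padded in steps. The sketch below has not been run against the failing setup. It assumes `dd` and `/dev/urandom` are available in the base image and that `/tmp` is writable during the build; the helper names (`make_probe`, `probe_sizes`) and the `probe-pad-<N>` sandbox names are hypothetical. By default it only writes the Dockerfiles and prints the `openshell` commands.

```shell
#!/usr/bin/env bash
set -euo pipefail

BASE_IMAGE="ghcr.io/nvidia/openshell-community/sandboxes/openclaw:latest"
PROBE_ROOT="${PROBE_ROOT:-$(mktemp -d)}"

# Print each command by default; set DRY_RUN=0 to execute it instead.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Write a Dockerfile that adds <pad_mib> MiB of incompressible data
# on top of the base image.
make_probe() {
  local pad_mib="$1" dir="$2"
  mkdir -p "$dir"
  cat > "$dir/Dockerfile" <<EOF
FROM $BASE_IMAGE
RUN dd if=/dev/urandom of=/tmp/pad.bin bs=1M count=$pad_mib
EOF
}

# Build and upload one padded image per size and report the outcome.
# In dry-run mode every probe reports "ok" because nothing is executed.
probe_sizes() {
  local pad dir
  for pad in "$@"; do
    dir="$PROBE_ROOT/probe-pad-$pad"
    make_probe "$pad" "$dir"
    if run openshell sandbox create --name "probe-pad-$pad" \
         --from "$dir" --no-keep -- echo hi; then
      echo "pad=${pad}MiB: ok"
    else
      echo "pad=${pad}MiB: FAILED"
    fi
  done
}

probe_sizes 100 200 300 400 500
```

The first padding size that fails brackets the threshold. Repeating with a finer step between the last passing and first failing size should show whether the cutoff sits near ~1.5 GiB or elsewhere.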
Environment
- OS: Ubuntu 24.04 (DGX OS) on NVIDIA DGX Spark (GB10), aarch64
- Kernel: 6.17.0-1014-nvidia
- Node.js: v22.22.2 (via nvm, per nemoclaw bootstrap)
- Docker: server 29.2.1, API version 1.53
- NemoClaw: v0.0.36
- OpenShell: 0.0.36 (latest)
- bollard (per strings ~/.local/bin/openshell): 0.20.2
- default-cgroupns-mode: host set in /etc/docker/daemon.json
- br_netfilter loaded
- Gateway container openshell-cluster-nemoclaw running, healthy, all k3s pods Running, mTLS established.
Debug Output
Attached: `nemoclaw-debug.tar.gz` (collected via
`nemoclaw debug --output /tmp/nemoclaw-debug.tar.gz --sandbox ori-supervisor`).
Note: `ori-supervisor` is the sandbox installed via the registry
workaround described under "Description". It boots fine and its agent
(OpenClaw v2026.4.24) is healthy. The debug tarball reflects the
post-workaround state.
Logs
Failure tail (the full output is a repetitive 55-step buildah build and is omitted):
[2/2] STEP 55/55: CMD ["/bin/bash"]
[2/2] COMMIT docker.io/openshell/sandbox-from:1778244557
--> c7455792ca94
Successfully tagged docker.io/openshell/sandbox-from:1778244557
Built image openshell/sandbox-from:1778244557
Pushing image openshell/sandbox-from:1778244557 into gateway "nemoclaw"
[progress] Exported 100 MiB
[progress] Exported 200 MiB
[progress] Exported 300 MiB
...
[progress] Exported 1700 MiB
[progress] Exported 1777 MiB
Error: × failed to upload image tar into container
╰─▶ Docker responded with status code 404: the container does not exist:
no container with name or ID "openshell-cluster-nemoclaw" found:
no such container
Docker daemon logs during the failure show no incoming PUT request for
the archive endpoint (confirmed with
`docker events --filter container=openshell-cluster-nemoclaw`),
strongly suggesting the request is being short-circuited client-side
in bollard before it hits the daemon. Capturing wire-level traffic
on `/var/run/docker.sock` would confirm.
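One way to capture that traffic is a `socat` proxy in front of the Docker socket, sketched below. Two assumptions are not confirmed here: that `socat` is installed, and that `openshell` honors `DOCKER_HOST` when picking its socket. The proxy path `/tmp/docker-proxy.sock` and the function names are hypothetical. Because the upload body is ~1.8 GiB, the dump is filtered down to HTTP request and status lines only.

```shell
#!/usr/bin/env bash
set -euo pipefail

REAL_SOCK="${REAL_SOCK:-/var/run/docker.sock}"
PROXY_SOCK="${PROXY_SOCK:-/tmp/docker-proxy.sock}"  # hypothetical path

# Keep only HTTP request lines and response status lines from a traffic dump.
http_lines() {
  grep -a -o -E --line-buffered \
    '(GET|HEAD|POST|PUT|DELETE) /[^ ]* HTTP/1\.[01]|HTTP/1\.[01] [0-9]{3}'
}

# socat -v copies all proxied traffic to stderr; pipe that through the filter.
start_proxy() {
  socat -v "UNIX-LISTEN:$PROXY_SOCK,fork" "UNIX-CONNECT:$REAL_SOCK" 2>&1 \
    | http_lines
}

if [ "${START_PROXY:-0}" = "1" ]; then
  start_proxy
else
  echo "Set START_PROXY=1 to listen on $PROXY_SOCK and forward to $REAL_SOCK"
fi
```

With the proxy running, the failing step would be retried from a second terminal as `DOCKER_HOST=unix:///tmp/docker-proxy.sock nemoclaw onboard --resume`. If no `PUT ... /archive` line appears before the 404 is reported, the request never left the client. If it does appear, the 404 is coming from whatever is answering on that socket.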
Checklist