Problem Statement
Summary
When a user runs nemoclaw onboard, the sandbox image is built using Docker's legacy (V1) builder instead of BuildKit. This happens because OpenShell's Rust code calls the Docker Engine API directly via the bollard library, which defaults to the legacy builder. The docker CLI has defaulted to BuildKit since Docker 23.0 (February 2023), so this behavior is surprising: running docker build on a machine uses BuildKit, while nemoclaw onboard on that same machine uses the legacy builder.
This is not caught by NemoClaw CI because no CI job exercises the openshell sandbox create --from code path. All CI Docker builds use the docker build CLI directly.
Affected versions
- OpenShell: 0.0.29 (current pin in NemoClaw)
- bollard: 0.20.2 (OpenShell's Docker client library)
- NemoClaw: all versions
How sandbox images are built
There are two distinct paths:
Path A: CI and direct docker build (uses BuildKit ✓)
NemoClaw CI workflows (.github/workflows/sandbox-images-and-e2e.yaml, pr-self-hosted.yaml) run docker build directly:
- name: Build production image
  run: docker build --build-arg BASE_IMAGE=${{ env.BASE_IMAGE }} -t nemoclaw-production .
The Docker CLI (v23.0+) defaults to BuildKit. CI output confirms this:
#0 building with "default" instance using docker driver
#1 [internal] load build definition from Dockerfile
#2 [internal] load metadata for ghcr.io/nvidia/nemoclaw/sandbox-base:latest
Path B: nemoclaw onboard → OpenShell → bollard API (uses legacy builder ✗)
When a user runs nemoclaw onboard:
- NemoClaw stages a build context and calls openshell sandbox create --from <path>/Dockerfile
- OpenShell's crates/openshell-bootstrap/src/build.rs calls bollard's Docker::build_image()
- bollard sends POST /build?version=1 to the Docker Engine socket
- The Docker daemon sees version=1 and uses the legacy builder
The relevant OpenShell code (build.rs:89-99):
let mut builder = BuildImageOptionsBuilder::default()
    .dockerfile(dockerfile_str)
    .t(tag)
    .rm(true);
if !build_args.is_empty() {
    builder = builder.buildargs(build_args);
}
let options = builder.build();
The chain never calls .version(), so bollard falls back to its default, BuilderV1.
From bollard-stubs (query_parameters.rs:63-73):
pub enum BuilderVersion {
    #[default]
    BuilderV1 = 1, // legacy builder
    BuilderBuildKit = 2, // BuildKit
}
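To make the effect of that default concrete, here is a minimal TypeScript model of it. This is a sketch, not bollard's implementation: the `BuildOptions` interface, the `buildRequestPath` helper, and the `sandbox-example` tag are hypothetical names introduced only for illustration. It shows that omitting the version yields `version=1`, while setting it explicitly yields `version=2`.

```typescript
// Hypothetical model of the default described above; this is not bollard's code.
enum BuilderVersion {
  BuilderV1 = 1, // legacy builder
  BuilderBuildKit = 2, // BuildKit
}

// Illustrative subset of build options; `version` is optional, as in the builder.
interface BuildOptions {
  dockerfile: string;
  t: string;
  rm: boolean;
  version?: BuilderVersion;
}

// Serialize the options into a /build request path, applying the V1 default.
function buildRequestPath(options: BuildOptions): string {
  const version = options.version ?? BuilderVersion.BuilderV1;
  const pairs: Array<[string, string]> = [
    ["dockerfile", options.dockerfile],
    ["t", options.t],
    ["rm", String(options.rm)],
    ["version", String(version)],
  ];
  const query = pairs
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join("&");
  return `/build?${query}`;
}

// "sandbox-example" is a placeholder tag, not a real NemoClaw image name.
const base = { dockerfile: "Dockerfile", t: "sandbox-example", rm: true };
const legacyRequest = buildRequestPath(base);
const buildKitRequest = buildRequestPath({
  ...base,
  version: BuilderVersion.BuilderBuildKit,
});

console.log(legacyRequest); // ends with version=1
console.log(buildKitRequest); // ends with version=2
```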
Why CI doesn't catch this
None of the CI jobs exercise the openshell sandbox create --from path:
| CI job | What it does | Builder |
| --- | --- | --- |
| build-sandbox-images | docker build CLI | BuildKit (CLI default) |
| build-sandbox-images-arm64 | docker build CLI | BuildKit (CLI default) |
| test-e2e-sandbox | Tests CLI error paths only (openshell not installed) | N/A |
| test-e2e-gateway-isolation | Loads pre-built image via docker load | N/A |
| wsl-e2e | Unit tests only | N/A |
| macos-e2e | Skips Docker (unavailable on runner) | N/A |
The real openshell sandbox create --from flow is only exercised during manual nemoclaw onboard on a user's machine.
Evidence in NemoClaw
NemoClaw's sandbox-create-stream.ts already has legacy-builder-specific output parsing, suggesting this has been the behavior since the beginning:
// Legacy builder output format: "Step 1/5 : FROM ..."
if (/^ {2}Step \d+\/\d+ : /.test(line)) { ... }
// Legacy builder success markers
/^Successfully built /.test(line)
/^Successfully tagged /.test(line)
BuildKit uses a completely different output format (#1 [internal] load build definition...), which is not matched by these patterns.
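The mismatch can be checked directly. The sketch below reuses the three legacy patterns quoted above and runs them against the BuildKit lines from the CI output in Path A. The `matchesLegacy` helper is a hypothetical name for illustration, and the legacy sample line is taken from the code comment, with the two leading spaces the pattern expects.

```typescript
// Patterns copied from the sandbox-create-stream.ts excerpt above.
const legacyPatterns: RegExp[] = [
  /^ {2}Step \d+\/\d+ : /,
  /^Successfully built /,
  /^Successfully tagged /,
];

// Hypothetical helper: true if any legacy-builder pattern matches the line.
function matchesLegacy(line: string): boolean {
  return legacyPatterns.some((pattern) => pattern.test(line));
}

// Legacy sample, indented by two spaces as the first pattern requires.
const legacyLine = "  Step 1/5 : FROM ...";

// BuildKit samples taken from the CI output quoted in Path A.
const buildKitLines = [
  '#0 building with "default" instance using docker driver',
  "#1 [internal] load build definition from Dockerfile",
  "#2 [internal] load metadata for ghcr.io/nvidia/nemoclaw/sandbox-base:latest",
];

console.log(matchesLegacy(legacyLine)); // true
console.log(buildKitLines.some((line) => matchesLegacy(line))); // false
```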
Impact
- Performance: BuildKit parallelizes independent build stages; the legacy builder runs them sequentially. NemoClaw's main Dockerfile is a two-stage build (builder + runtime), and while the stages are dependent, BuildKit still has advantages in layer caching and transfer efficiency.
- Caching: BuildKit has more sophisticated layer caching. Users rebuilding sandboxes miss out on this.
- Future breakage: Docker has deprecated the legacy builder. Dockerfile.base already uses RUN --mount=type=bind (BuildKit-only syntax). If Dockerfile adopts similar syntax, the openshell build path will break entirely.
- Consistency: Users see different build behavior between docker build and nemoclaw onboard on the same machine.
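As an illustration of the future-breakage risk, a check along the following lines could flag Dockerfile instructions that the legacy builder cannot handle. This is only a sketch: `findBuildKitOnlyLines` is a hypothetical helper, the sample Dockerfile content is invented for the example, and the scan covers only the `RUN --mount` case mentioned above, not every BuildKit-only feature.

```typescript
// Hypothetical helper: returns 1-based line numbers of RUN instructions
// that use --mount, which the legacy builder does not support.
function findBuildKitOnlyLines(dockerfile: string): number[] {
  const hits: number[] = [];
  dockerfile.split(/\r?\n/).forEach((line, index) => {
    // Allow other RUN flags (e.g. --network=...) before --mount=.
    if (/^\s*RUN\s+(--\S+\s+)*--mount=/i.test(line)) {
      hits.push(index + 1);
    }
  });
  return hits;
}

// Invented sample content, not taken from NemoClaw's Dockerfiles.
const sampleDockerfile = [
  "FROM ubuntu:24.04",
  "RUN --mount=type=bind,source=.,target=/src make -C /src",
  "RUN echo done",
].join("\n");

console.log(findBuildKitOnlyLines(sampleDockerfile)); // only line 2 is flagged
```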
Proposed Design
Recommended fix
OpenShell (one-line fix)
In crates/openshell-bootstrap/src/build.rs, add .version(BuilderVersion::BuilderBuildKit):
use bollard::query_parameters::BuilderVersion;

let mut builder = BuildImageOptionsBuilder::default()
    .dockerfile(dockerfile_str)
    .t(tag)
    .rm(true)
    .version(BuilderVersion::BuilderBuildKit);
This sends version=2 in the Docker Engine API request, which tells the daemon to use BuildKit, matching what the docker CLI has done by default since v23.0.
Note: bollard's BuildKit support via the Engine API (version=2) works without enabling bollard's buildkit or buildkit_providerless cargo features. Those features are for bollard's gRPC-based BuildKit session protocol, which is a separate (more advanced) integration. The version=2 query parameter is sufficient to get BuildKit builds through the standard /build HTTP endpoint.
NemoClaw (follow-up)
After the OpenShell fix ships, update sandbox-create-stream.ts to also recognize BuildKit output format for progress reporting:
// Add BuildKit output patterns alongside existing legacy ones
if (/^ {2}Building image /.test(line) ||
    /^ {2}Step \d+\/\d+ : /.test(line) ||
    /^#\d+ \[/.test(line)) { // BuildKit format
  setPhase("build");
}
And in shouldShowLine:
/^#\d+ \[.*\]/.test(line) || // BuildKit step lines
/^#\d+ (DONE|CACHED)/.test(line) || // BuildKit completion
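The two fragments above could be combined into a single classifier, sketched below. `classifyBuildLine` and the `BuildLineKind` type are hypothetical names, not existing NemoClaw code. One assumption worth verifying: the legacy patterns expect a two-space indent on step lines, so this sketch tolerates optional leading whitespace before BuildKit's `#N` prefix in case OpenShell relays those lines with the same indent.

```typescript
// Hypothetical line categories for progress reporting.
type BuildLineKind =
  | "legacy-step"
  | "legacy-done"
  | "buildkit-step"
  | "buildkit-done"
  | "other";

// Classify one line of build output from either builder.
function classifyBuildLine(line: string): BuildLineKind {
  // Legacy builder: "  Step 1/5 : FROM ..."
  if (/^ {2}Step \d+\/\d+ : /.test(line)) return "legacy-step";
  // Legacy builder success markers.
  if (/^Successfully (built|tagged) /.test(line)) return "legacy-done";
  // BuildKit completion: "#3 DONE 0.4s" or "#3 CACHED".
  // Leading whitespace is tolerated as an assumption about OpenShell's relay.
  if (/^\s*#\d+ (DONE|CACHED)\b/.test(line)) return "buildkit-done";
  // BuildKit step: "#1 [internal] load build definition from Dockerfile".
  if (/^\s*#\d+ \[/.test(line)) return "buildkit-step";
  return "other";
}

console.log(classifyBuildLine("  Step 1/5 : FROM ...")); // legacy-step
console.log(classifyBuildLine("#1 [internal] load build definition from Dockerfile")); // buildkit-step
console.log(classifyBuildLine("#1 DONE 0.0s")); // buildkit-done
```

Keeping the legacy patterns in place means output from older OpenShell versions would still be parsed after the switch.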
Alternatives Considered
No response
Category
enhancement: platform