OpenShell uses legacy Docker builder instead of BuildKit #2311

@iamsh4

Description

Problem Statement

Summary

When a user runs nemoclaw onboard, the sandbox image is built with Docker's legacy (V1) builder instead of BuildKit. OpenShell's Rust code calls the Docker Engine API directly through the bollard library, which defaults to the legacy builder. The docker CLI has defaulted to BuildKit since Docker 23.0 (February 2023), so the behavior is surprising: on the same machine, docker build uses BuildKit while nemoclaw onboard uses the legacy builder.

This is not caught by NemoClaw CI because no CI job exercises the openshell sandbox create --from code path. All CI Docker builds use the docker build CLI directly.

Affected versions

  • OpenShell: 0.0.29 (current pin in NemoClaw)
  • bollard: 0.20.2 (OpenShell's Docker client library)
  • NemoClaw: all versions

How sandbox images are built

There are two distinct paths:

Path A: CI and direct docker build (uses BuildKit ✓)

NemoClaw CI workflows (.github/workflows/sandbox-images-and-e2e.yaml, pr-self-hosted.yaml) run docker build directly:

- name: Build production image
  run: docker build --build-arg BASE_IMAGE=${{ env.BASE_IMAGE }} -t nemoclaw-production .

The Docker CLI (v23.0+) defaults to BuildKit. CI output confirms this:

#0 building with "default" instance using docker driver
#1 [internal] load build definition from Dockerfile
#2 [internal] load metadata for ghcr.io/nvidia/nemoclaw/sandbox-base:latest

Path B: nemoclaw onboard → OpenShell → bollard API (uses legacy builder ✗)

When a user runs nemoclaw onboard:

  1. NemoClaw stages a build context and calls openshell sandbox create --from <path>/Dockerfile
  2. OpenShell's crates/openshell-bootstrap/src/build.rs calls bollard's Docker::build_image()
  3. bollard sends POST /build?version=1 to the Docker Engine socket
  4. The Docker daemon sees version=1 and uses the legacy builder
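
The difference between the two paths comes down to a single query parameter on the Engine API's /build endpoint. The sketch below only models the request path as a string so the two cases can be compared side by side. The names buildRequestPath and BuildQuery and the tag sandbox-dev are hypothetical, and the fallback to version 1 is an assumption that mirrors the bollard default described in this issue. It is not OpenShell or bollard code.

```typescript
// Illustrative model of the POST /build query string; not real OpenShell code.
type BuilderVersion = 1 | 2; // 1 = legacy builder, 2 = BuildKit

interface BuildQuery {
  dockerfile: string;
  tag: string;
  // When omitted, this sketch falls back to 1, mirroring the default
  // that this issue attributes to bollard.
  version?: BuilderVersion;
}

function buildRequestPath(q: BuildQuery): string {
  const pairs: [string, string][] = [
    ["dockerfile", q.dockerfile],
    ["t", q.tag],
    ["rm", "true"],
    ["version", String(q.version ?? 1)],
  ];
  const query = pairs
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join("&");
  return `/build?${query}`;
}

// Path B today: no version given, so the legacy builder is requested.
const legacyPath = buildRequestPath({ dockerfile: "Dockerfile", tag: "sandbox-dev" });
// Path B with an explicit BuildKit request.
const buildkitPath = buildRequestPath({ dockerfile: "Dockerfile", tag: "sandbox-dev", version: 2 });

console.log(legacyPath);   // /build?dockerfile=Dockerfile&t=sandbox-dev&rm=true&version=1
console.log(buildkitPath); // /build?dockerfile=Dockerfile&t=sandbox-dev&rm=true&version=2
```

Everything else in the request is identical, so the builder choice depends only on what the client puts in version.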

The relevant OpenShell code (build.rs:89-99):

let mut builder = BuildImageOptionsBuilder::default()
    .dockerfile(dockerfile_str)
    .t(tag)
    .rm(true);

if !build_args.is_empty() {
    builder = builder.buildargs(build_args);
}

let options = builder.build();

The builder chain never calls .version(), so bollard falls back to its default, BuilderV1.

From bollard-stubs (query_parameters.rs:63-73):

pub enum BuilderVersion {
    #[default]
    BuilderV1 = 1,       // legacy builder
    BuilderBuildKit = 2,  // BuildKit
}

Why CI doesn't catch this

None of the CI jobs exercise the openshell sandbox create --from path:

CI job                     | What it does                                          | Builder
build-sandbox-images       | docker build CLI                                      | BuildKit (CLI default)
build-sandbox-images-arm64 | docker build CLI                                      | BuildKit (CLI default)
test-e2e-sandbox           | Tests CLI error paths only (openshell not installed)  | N/A
test-e2e-gateway-isolation | Loads pre-built image via docker load                 | N/A
wsl-e2e                    | Unit tests only                                       | N/A
macos-e2e                  | Skips Docker (unavailable on runner)                  | N/A

The real openshell sandbox create --from flow is only exercised during manual nemoclaw onboard on a user's machine.

Evidence in NemoClaw

NemoClaw's sandbox-create-stream.ts already has legacy-builder-specific output parsing, suggesting this has been the behavior since the beginning:

// Legacy builder output format: "Step 1/5 : FROM ..."
if (/^ {2}Step \d+\/\d+ : /.test(line)) { ... }

// Legacy builder success markers
/^Successfully built /.test(line)
/^Successfully tagged /.test(line)

BuildKit uses a completely different output format (#1 [internal] load build definition...), which is not matched by these patterns.
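
The mismatch can be checked directly. The sketch below runs the legacy patterns quoted above against the BuildKit lines from the CI output earlier in this issue. The helper name matchesLegacy and the sample legacy line are made up for illustration. Only the regular expressions and the BuildKit lines come from the excerpts.

```typescript
// Patterns copied from the sandbox-create-stream.ts excerpts quoted above.
const legacyPatterns: RegExp[] = [
  /^ {2}Step \d+\/\d+ : /,
  /^Successfully built /,
  /^Successfully tagged /,
];

// Hypothetical helper: true if any legacy pattern recognizes the line.
function matchesLegacy(line: string): boolean {
  return legacyPatterns.some((re) => re.test(line));
}

// Made-up legacy-style line, following the "Step 1/5 : FROM ..." shape.
const legacyLine = "  Step 1/5 : FROM ghcr.io/nvidia/nemoclaw/sandbox-base:latest";

// BuildKit lines taken from the CI output shown under Path A.
const buildkitLines: string[] = [
  '#0 building with "default" instance using docker driver',
  "#1 [internal] load build definition from Dockerfile",
  "#2 [internal] load metadata for ghcr.io/nvidia/nemoclaw/sandbox-base:latest",
];

const legacyResult = matchesLegacy(legacyLine);
const buildkitResults = buildkitLines.map((line) => matchesLegacy(line));

console.log(legacyResult);    // true
console.log(buildkitResults); // [ false, false, false ]
```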

Impact

  • Performance: BuildKit parallelizes independent build stages; the legacy builder runs them sequentially. NemoClaw's main Dockerfile is a two-stage build (builder + runtime), and while the stages are dependent, BuildKit still has advantages in layer caching and transfer efficiency.
  • Caching: BuildKit has more sophisticated layer caching. Users rebuilding sandboxes miss out on this.
  • Future breakage: Docker has deprecated the legacy builder. Dockerfile.base already uses RUN --mount=type=bind (BuildKit-only syntax). If Dockerfile adopts similar syntax, the openshell build path will break entirely.
  • Consistency: Users see different build behavior between docker build and nemoclaw onboard on the same machine.

Proposed Design

Recommended fix

OpenShell (one-line fix)

In crates/openshell-bootstrap/src/build.rs, add .version(BuilderVersion::BuilderBuildKit):

use bollard::query_parameters::BuilderVersion;

let mut builder = BuildImageOptionsBuilder::default()
    .dockerfile(dockerfile_str)
    .t(tag)
    .rm(true)
    .version(BuilderVersion::BuilderBuildKit);

This sends version=2 in the Docker Engine API request, which tells the daemon to use BuildKit. It mirrors what the docker CLI has done by default since v23.0.

Note: bollard's BuildKit support via the Engine API (version=2) works without enabling bollard's buildkit or buildkit_providerless cargo features. Those features are for bollard's gRPC-based BuildKit session protocol, which is a separate (more advanced) integration. The version=2 query parameter is sufficient to get BuildKit builds through the standard /build HTTP endpoint.

NemoClaw (follow-up)

After the OpenShell fix ships, update sandbox-create-stream.ts to also recognize the BuildKit output format for progress reporting:

// Add BuildKit output patterns alongside existing legacy ones
if (/^ {2}Building image /.test(line) ||
    /^ {2}Step \d+\/\d+ : /.test(line) ||
    /^#\d+ \[/.test(line)) {                    // BuildKit format
  setPhase("build");
}

And in shouldShowLine:

/^#\d+ \[.*\]/.test(line) ||                    // BuildKit step lines
/^#\d+ (DONE|CACHED)/.test(line) ||             // BuildKit completion
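
One way to sanity-check the proposed patterns before wiring them into the parser is a small standalone classifier. This is a sketch only: classifyBuildLine and BuildLineKind are hypothetical names, not part of sandbox-create-stream.ts, and the sample lines are illustrative. It also assumes that OpenShell forwards BuildKit lines without the two-space indent that the legacy Step pattern expects, which would need to be confirmed against real output.

```typescript
type BuildLineKind = "legacy" | "buildkit" | "other";

// Legacy patterns: from the existing parser excerpts in this issue.
const legacyFormat: RegExp[] = [
  /^ {2}Step \d+\/\d+ : /,
  /^Successfully built /,
  /^Successfully tagged /,
];

// BuildKit patterns: from the follow-up proposed above.
const buildkitFormat: RegExp[] = [
  /^#\d+ \[/,
  /^#\d+ (DONE|CACHED)/,
];

// Hypothetical helper that labels a single line of build output.
function classifyBuildLine(line: string): BuildLineKind {
  if (legacyFormat.some((re) => re.test(line))) return "legacy";
  if (buildkitFormat.some((re) => re.test(line))) return "buildkit";
  return "other";
}

// Illustrative sample lines in both formats.
const samples: string[] = [
  "  Step 2/5 : RUN npm ci",
  "#1 [internal] load build definition from Dockerfile",
  "#1 DONE 0.0s",
  "#5 CACHED",
  '#0 building with "default" instance using docker driver',
];

const kinds = samples.map((line) => classifyBuildLine(line));
console.log(kinds); // [ 'legacy', 'buildkit', 'buildkit', 'buildkit', 'other' ]
```

The last sample shows that the "#0 building with ..." header line is not covered by either proposed pattern, which may or may not matter for progress reporting.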

Alternatives Considered

No response

Category

enhancement: platform

Checklist

  • I searched existing issues and this is not a duplicate
  • This is a design proposal, not a "please build this" request

Metadata

Labels

  • area: packaging (Packages, images, registries, installers, or distribution)
  • area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery)
  • platform: container (Affects Docker, containerd, Podman, or images)
