feat: opt-in --support-docker for running dockerd inside a box #567
Closed
G4614 wants to merge 7 commits into
Adds a `--support-docker` CLI flag (and `BoxOptions::support_docker` /
proto `ContainerInitRequest.support_docker` field) that opts a single
box into the relaxed in-container security envelope a docker daemon
needs to run. Default behaviour is unchanged byte-for-byte — every
existing box keeps its lean cgroup + 14-cap Docker-default profile.
When set, the guest:
- mounts /sys/fs/cgroup inside the container as a writable cgroup2
filesystem (previously omitted for performance: a lean single-tenant
VM doesn't pay the ~105ms hierarchy-init cost), so dockerd can
create cgroup limits for its child containers.
- grants the full Linux capability set (CAP_SYS_ADMIN,
CAP_NET_ADMIN, CAP_SYS_MODULE, ...) to the container init AND
every exec'd process — `docker run --privileged` semantics. The
exec-path side is handled by threading `support_docker` from
`Container` into `ContainerCommand` and from there into
`BuildSpec`, so the zygote picks `docker_capability_names()`
instead of `capability_names()` when building tenants.
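The flag threading described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the struct names come from the commit message, but the method names (`command`, `build_spec`, `capability_mask`) and the choice to model capability sets as bitmasks are assumptions made for brevity. The two mask values are the CapBnd readings quoted in the verification notes.

```rust
/// The lean profile: Docker's 14 default capabilities (bit N = capability number N).
const DEFAULT_CAPS: u64 = 0xa804_25fb;
/// The full set reported under --support-docker: capabilities 0..=40.
const FULL_CAPS: u64 = 0x1ff_ffff_ffff;

// Hypothetical, stripped-down versions of the three types named in the commit.
struct Container {
    support_docker: bool,
}

struct ContainerCommand {
    support_docker: bool,
}

struct BuildSpec {
    support_docker: bool,
}

impl Container {
    /// Every exec'd command inherits the flag from its container.
    fn command(&self) -> ContainerCommand {
        ContainerCommand { support_docker: self.support_docker }
    }
}

impl ContainerCommand {
    /// The spec handed to the zygote carries the same flag.
    fn build_spec(&self) -> BuildSpec {
        BuildSpec { support_docker: self.support_docker }
    }
}

impl BuildSpec {
    /// The only place a decision is made: full caps if opted in, lean otherwise.
    fn capability_mask(&self) -> u64 {
        if self.support_docker { FULL_CAPS } else { DEFAULT_CAPS }
    }
}
```

Keeping the decision in one place means the init path and the exec path cannot disagree about which profile a box runs under.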
SECURITY: this widens the in-VM attack surface significantly — any
process inside a `--support-docker` box runs effectively as root with
full caps and can mount filesystems, manipulate network/iptables, and
use every cap-gated kernel API. The microVM boundary still contains
the process (host stays protected by KVM/libkrun isolation), but the
in-box blast radius is effectively root. The CLI help text spells
this out so the choice is informed at opt-in time. Default-off keeps
every existing user on the lean profile.
What's NOT in this commit (follow-up):
- Kernel-level support for docker bridge networking (CONFIG_BRIDGE,
CONFIG_VETH, CONFIG_NETFILTER, CONFIG_NF_NAT, CONFIG_IPTABLE_NAT,
CONFIG_NF_TABLES). The shipped libkrunfw lacks these, so a
`--support-docker` box can run dockerd with --bridge=none
--iptables=false (host networking only). Bridge networks need a
"fat" libkrunfw build — issue boxlite-ai#276.
- SDK surface (Python / Node / Go / C). The Rust BoxOptions field is
plumbed end-to-end via proto, but the per-SDK constructors still
default it to false. Adding it to JsBoxOptions / PyBoxOptions /
boxConfig / OptionsHandle is a mechanical follow-up.
Verified end-to-end on this host:
- `boxlite run alpine sh -c "mount | grep cgroup; grep ^CapBnd ..."`
(default profile) → cgroup `ro`, CapBnd `0xa80425fb` — unchanged.
- same command with `--support-docker` → cgroup additionally mounted
`rw`, CapBnd `0x1ffffffffff` (full set).
- `boxlite run --support-docker docker:dind sh -c "dockerd
--bridge=none --iptables=false ..."` → dockerd boots, containerd
loads its plugin set; previously failed at cgroup write.
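The two CapBnd values above can be checked mechanically. The sketch below parses a `CapBnd:` line as it appears in `/proc/<pid>/status` and tests individual bits; the helper names are illustrative and not part of boxlite. The capability numbers (12 for CAP_NET_ADMIN, 21 for CAP_SYS_ADMIN) are the standard Linux values.

```rust
const CAP_NET_ADMIN: u32 = 12;
const CAP_SYS_ADMIN: u32 = 21;

/// Extract the bounding-set mask from /proc/<pid>/status text.
/// Accepts the kernel's bare hex form and a "0x"-prefixed form.
fn parse_cap_bnd(status: &str) -> Option<u64> {
    let line = status.lines().find(|l| l.starts_with("CapBnd:"))?;
    let hex = line["CapBnd:".len()..].trim();
    let hex = hex.strip_prefix("0x").unwrap_or(hex);
    u64::from_str_radix(hex, 16).ok()
}

/// True if capability number `cap` is present in `mask`.
fn has_cap(mask: u64, cap: u32) -> bool {
    mask & (1u64 << cap) != 0
}
```

Applied to the values in this commit, `0xa80425fb` has 14 bits set and contains neither CAP_SYS_ADMIN nor CAP_NET_ADMIN, while `0x1ffffffffff` has 41 bits set (capabilities 0 through 40) and contains both.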
Touches:
- cli/src/cli.rs: --support-docker flag + ManagementFlags::apply_to
- boxlite/runtime/options.rs: BoxOptions field
- shared/proto/.../service.proto: ContainerInitRequest field
- boxlite/portal/interfaces/container.rs: plumb on host
- boxlite/litebox/init/tasks/guest_init.rs: read from BoxOptions
- guest/service/container.rs: read from proto
- guest/container/lifecycle.rs: Container struct field
- guest/container/start.rs: create_oci_bundle param
- guest/container/spec.rs: conditional caps + cgroup mount
- guest/container/capabilities.rs: docker_capabilities() +
docker_capability_names()
- guest/container/command.rs: ContainerCommand carries flag
- guest/container/zygote.rs: BuildSpec field + cap selection
- sdks/node/src/options.rs: default false on TryFrom
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `--entrypoint <PROGRAM>` to the boxlite run command, mirroring the
Docker CLI flag of the same name. When set, the positional command tail
is re-routed into BoxOptions::cmd so the image's init runs the user's
command, matching `docker run --entrypoint X image arg1 arg2` semantics.

Motivating use case is dind: the `docker:dind` image's ENTRYPOINT
(dockerd-entrypoint.sh) generates TLS certs and then `exec "$@"` with
the args. With no args, exec is a no-op and the script exits, which
tears down the container's PID namespace and SIGKILLs every process
inside. `--entrypoint sh` lets the caller run a probe script directly as
PID 1, bypassing the image entrypoint entirely.

The foreground exec for `boxlite run`'s stdio attach is replaced with
`sleep infinity` when --entrypoint is set, so when the init process
exits the container tears down the foreground sleep and `boxlite run`
returns cleanly.

BoxOptions::entrypoint and ::cmd already existed on the Rust API surface
(used by SDKs); this commit only adds the CLI flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
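A rough sketch of the `--entrypoint` re-routing described in this commit, under stated assumptions: the `BoxOptions` shown here is a two-field stand-in (the real field types are not given in the PR), and `RunPlan` / `plan_run` are hypothetical names used only to make the two outcomes visible.

```rust
// Stand-in for the real BoxOptions; field types are an assumption.
#[derive(Debug, Default, PartialEq)]
struct BoxOptions {
    entrypoint: Option<Vec<String>>,
    cmd: Option<Vec<String>>,
}

// Hypothetical result type: what PID 1 runs vs. what the CLI attaches to.
#[derive(Debug, PartialEq)]
struct RunPlan {
    options: BoxOptions,
    foreground: Vec<String>,
}

fn plan_run(entrypoint: Option<&str>, tail: &[&str]) -> RunPlan {
    let tail: Vec<String> = tail.iter().map(|s| s.to_string()).collect();
    match entrypoint {
        // With --entrypoint: the tail becomes the init's arguments and the
        // foreground exec is only a placeholder that blocks until teardown.
        Some(program) => RunPlan {
            options: BoxOptions {
                entrypoint: Some(vec![program.to_string()]),
                cmd: Some(tail),
            },
            foreground: vec!["sleep".to_string(), "infinity".to_string()],
        },
        // Without it: image defaults stay untouched and the tail is the
        // foreground exec, as before.
        None => RunPlan {
            options: BoxOptions::default(),
            foreground: tail,
        },
    }
}
```

For example, `plan_run(Some("sh"), &["/probe/probe.sh"])` yields an entrypoint of `sh`, a cmd of `/probe/probe.sh`, and a foreground of `sleep infinity`.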
Boxlite's lean libkrunfw kernel lacks BRIDGE/VETH/NETFILTER/NF_NAT/
IPTABLE_*/NF_TABLES/POSIX_MQUEUE, exactly the subsystems docker /
docker-compose lean on. Phase A (the previous commit set) put the
userspace plumbing (full caps + cgroup rw under `--support-docker`) in
place; this commit adds the kernel half so `docker build` / `docker run`
work end-to-end inside the box. Issue boxlite-ai#276.

What this commit adds:
- dind-configs/overlay-dind_{x86_64,aarch64}: a Kconfig overlay applied
on top of the lean config when building the dind blob. Enables every
CONFIG_* docker / docker-compose require, all =y (the guest has no
/lib/modules and can't load modules at runtime).
- scripts/build/build-libkrunfw-dind.sh + `make libkrunfw-dind`: merges
the lean config + dind overlay, drives the upstream libkrunfw Makefile
to produce a 21 MB libkrunfw-dind.so.5, SONAME-stamped distinct from
the lean blob so the two can coexist. ~10–15 min, ~3 GB scratch.
- libkrun-sys/build.rs: optional `BOXLITE_LIBKRUNFW_DIND_PATH` env var
picks up the dind blob at boxlite build time and stages it next to the
lean one in the install dir. Default builds (env unset) are
byte-identical to before: no embedded change.
- util::configure_library_env_with_prepend: new variant that accepts
caller-supplied LD_LIBRARY_PATH prepends. Lets per-box setup inject a
private libs dir without rewriting the existing embedded-runtime /
dladdr path order.
- vmm/controller/spawn.rs::stage_dind_libkrunfw: when a box's
BoxOptions::support_docker is true AND a dind blob was bundled at build
time, creates `<box-dir>/libs/libkrunfw.so.5` as a symlink to the dind
blob and prepends that dir to the shim's LD_LIBRARY_PATH. libkrun's
dlopen("libkrunfw.so.5") then resolves to the fat kernel for THIS box
only. Without a dind blob, silent fallback to the lean kernel (the
Phase A caps + cgroup still apply, just no bridge / netfilter). Per-box
symlink is torn down together with the box on `boxlite rm`.
- guest/container/spec.rs (Phase A finalisation): the support_docker
path now also clears readonly_paths + masked_paths to vec![],
overriding oci-spec's defaults. KNOWN LIMITATION: libcontainer-0.5.7
still applies the default /proc hardening despite our empty list. Root
cause documented in the inline comment; workaround is the dockerd-side
bypass flags listed below.

Verified end-to-end (docker:dind image):
- `boxlite run --support-docker --entrypoint sh -v ctx:/probe
docker:dind /probe/probe.sh` where probe.sh starts dockerd with
`--bridge=none --iptables=false --storage-driver=vfs` and runs
`docker build --network=host -t boxlite-test:1 /probe/ctx` against
`FROM alpine:3.19 / RUN echo …` → build exit 0, image written, runs to
completion.

Default behaviour byte-identical:
- `boxlite run alpine sh -c "grep ^CapBnd /proc/self/status"` returns
0xa80425fb (14 Docker defaults, lean profile), unchanged.
- `mount | grep cgroup` shows only the ro mount, no rw overlay.
- `/proc/sys/net/bridge/` does not exist (lean kernel has no
CONFIG_BRIDGE).

What's still bypass-flag territory (follow-up):
- dockerd's default bridge / iptables setup needs `/proc/sys` writable;
libcontainer-0.5.7 ignores our empty `readonlyPaths` and bind-remounts
/proc/sys ro. Users must pass dockerd `--bridge=none --iptables=false`
and use `--network=host` for build / run.
- storage-driver auto-detection probes mount paths that may trip the
same hardening; users pin `--storage-driver=vfs` to skip detection.
- Together: full `docker compose up` with default bridge networks
(issue boxlite-ai#276 acceptance) still needs the libcontainer fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
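The per-box kernel selection described for `stage_dind_libkrunfw` can be sketched with the standard library alone. This is an approximation under assumptions: the signature, the `Option` return used to signal the lean fallback, and the `prepend_library_path` helper are all hypothetical; only the symlink name and the "prepend to LD_LIBRARY_PATH" behaviour come from the commit message. Unix-only, since it relies on symlinks.

```rust
use std::env;
use std::ffi::OsString;
use std::fs;
use std::io;
use std::os::unix::fs::symlink;
use std::path::{Path, PathBuf};

/// Stage `<box_dir>/libs/libkrunfw.so.5` as a symlink to the dind blob.
/// Returns the libs dir to prepend, or None to fall back to the lean kernel.
fn stage_dind_libkrunfw(
    box_dir: &Path,
    dind_blob: Option<&Path>,
    support_docker: bool,
) -> io::Result<Option<PathBuf>> {
    let blob = match (support_docker, dind_blob) {
        (true, Some(blob)) if blob.exists() => blob,
        // Flag off, or no blob bundled: silently keep the lean kernel.
        _ => return Ok(None),
    };
    let libs = box_dir.join("libs");
    fs::create_dir_all(&libs)?;
    let link = libs.join("libkrunfw.so.5");
    // Drop a stale link from an earlier start; a missing link is fine.
    match fs::remove_file(&link) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }
    symlink(blob, &link)?;
    Ok(Some(libs))
}

/// Put `prepend` in front of an existing LD_LIBRARY_PATH value, if any.
fn prepend_library_path(
    prepend: &Path,
    existing: Option<OsString>,
) -> Result<OsString, env::JoinPathsError> {
    let mut paths = vec![prepend.to_path_buf()];
    if let Some(current) = existing {
        paths.extend(env::split_paths(&current));
    }
    env::join_paths(paths)
}
```

Because the link lives inside the box directory, removing the box directory removes the link with it, which matches the teardown behaviour the commit describes.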
`make test:integration:cli` uses `cargo test -p boxlite-cli --tests`,
which only discovers integration tests under `src/cli/tests/`. The
94 binary-inline `#[cfg(test)]` unit tests in `src/cli/src/` (CLI
arg plumbing, parser helpers, ManagementFlags::apply_to, etc.) need
`--bins` to be picked up and were silently orphaned from the make
target matrix until now.
Adds:
- `test:unit:cli` — `cargo test -p boxlite-cli --bins` (or nextest
equivalent). Fast, no VM required. Pulls in all 94 inline tests.
- `test:unit:core` now invokes it alongside rust + ffi so the
full unit matrix actually covers the CLI plumbing.
Surfaced by adding the four `management_flags_*` /
`box_options_default_*` tests for the `--support-docker` flag plumbing
in the previous commits — they were passing under direct `cargo test`
but not via any `make` target.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a real-VM integration test that spawns `boxlite run
--support-docker --entrypoint sh docker:dind ...` against an alpine
Dockerfile and verifies `docker build` produces a tagged image. This is
the canary for issue boxlite-ai#276's acceptance criterion (every kernel
config + userspace plumbing + per-box symlink in Phase A/B is on the hot
path).

Gated on `BOXLITE_DIND_TEST=1` because it needs a libkrunfw-dind blob
bundled at build time (`make libkrunfw-dind` + a rebuild with
`BOXLITE_LIBKRUNFW_DIND_PATH` set). Without the env var the test prints
a SKIP notice and returns, which keeps the default `test:integration`
matrix runnable on hosts that haven't built the dind kernel.

New make target `test:integration:dind` sets the env and runs only this
test file. Not wired into any aggregator: explicit invocation when
validating Phase B changes or before a release that claims dind support.

Doesn't go through `common::boxlite()` / `new_cmd()` because those force
the Chinese-mirror test registries via `apply_registries()`, which don't
reliably serve `docker:dind`. Uses the default Docker Hub registry
instead, the right config for a test that exercises Docker itself.

Skipped assertion on `boxlite run`'s own exit code: with `--entrypoint`
the foreground exec is `sleep infinity` (artifact of attaching stdio
while the real workload runs as PID 1). When the init process exits
cleanly, the container's PID namespace tears down and the sleep gets
SIGKILL'd; boxlite then reports 137 even though the actual workload
succeeded. The probe script's `[exit=$?]` marker in result.log is the
authoritative signal, and that's what we assert on. (Follow-up:
foreground exit code should follow the init process, not the foreground
sleep.)

Verified locally:
- With env var + dind blob bundled: passes in ~21s, result.log shows
full BuildKit pipeline + `naming to docker.io/library/...:1 done`.
- Without env var: skips with a clear message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
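Since the test asserts on the probe script's `[exit=$?]` marker rather than on `boxlite run`'s own exit code, the check reduces to finding that marker in result.log. A minimal sketch of such a parser follows; the function name and the choice to take the last marker in the log are assumptions, not details from the PR.

```rust
/// Return the exit status recorded by the last `[exit=N]` marker in the log,
/// or None if no well-formed marker is present.
fn parse_exit_marker(log: &str) -> Option<i32> {
    const PREFIX: &str = "[exit=";
    log.lines().rev().find_map(|line| {
        let start = line.find(PREFIX)? + PREFIX.len();
        let end = start + line[start..].find(']')?;
        line[start..end].parse().ok()
    })
}
```

A test would then require `parse_exit_marker(&log) == Some(0)` and ignore the 137 that the killed foreground sleep produces.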
Add `--cmd ARG` (repeatable) under ManagementFlags so callers can pass
arguments to the image's existing ENTRYPOINT without replacing it
(Docker CMD semantics). Gated on `--support-docker`: the lean flow's
positional `[COMMAND...]` is still a secondary-exec attach, so the gate
keeps default behaviour byte-for-byte identical.

run.rs: reject `--cmd` without `--support-docker` at validate-time; when
`--cmd` is set without a positional foreground command, fall back to
`sleep infinity` so boxlite blocks on PID 1's lifetime.

Rewrite the dind integration test to pass dockerd flags via `--cmd` and
run the image's own `dockerd-entrypoint.sh` as PID 1, removing the
previous `--entrypoint sh` bypass. Probe script now waits for the
dockerd socket the entrypoint brings up instead of starting dockerd
itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
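The two run.rs rules in this commit (reject `--cmd` without `--support-docker`; fall back to `sleep infinity` when `--cmd` is given with no positional command) can be sketched as a single pure function. The names `RunError` and `resolve_foreground` are hypothetical and the real validation code may be structured differently.

```rust
#[derive(Debug, PartialEq)]
enum RunError {
    /// `--cmd` was passed on a box that did not opt in with `--support-docker`.
    CmdRequiresSupportDocker,
}

/// Decide what the foreground exec should be for `boxlite run`.
fn resolve_foreground(
    support_docker: bool,
    cmd: &[String],
    positional: &[String],
) -> Result<Vec<String>, RunError> {
    if !cmd.is_empty() && !support_docker {
        return Err(RunError::CmdRequiresSupportDocker);
    }
    if !cmd.is_empty() && positional.is_empty() {
        // Nothing to attach to: block until PID 1 (the image ENTRYPOINT) exits.
        return Ok(vec!["sleep".to_string(), "infinity".to_string()]);
    }
    // Lean flow, or an explicit foreground command: unchanged behaviour.
    Ok(positional.to_vec())
}
```

Rejecting at validate-time means the lean path never sees a `--cmd` value, which is what keeps its `[COMMAND...]` semantics unchanged.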
Previously the dind regression test was opt-in (BOXLITE_DIND_TEST=1 +
manual rebuild with BOXLITE_LIBKRUNFW_DIND_PATH), so by default no run
of `make test:integration:cli` exercised dockerd-in-box and a Phase B
regression could land green.

Now the make target:
* Hard-checks for target/dind-kernel/lib64/libkrunfw-dind.so.5 and
refuses to proceed without it (one-time `make libkrunfw-dind`,
~10–20 min, cached after).
* Auto-exports BOXLITE_LIBKRUNFW_DIND_PATH before cargo builds so the
embedded runtime carries both the lean and the dind blobs.

`test:integration:dind` is now a thin alias that narrows nextest's
filter to the single dind test for fast iteration.

The test itself flips from opt-in (BOXLITE_DIND_TEST=1) to opt-out
(BOXLITE_SKIP_DIND_TEST=1). Hosts that genuinely cannot run dind (no
nested virt, blocked cgroup2 mounts) still have an escape hatch, but the
default behavior is RUN so a real regression fails the suite.

Verified: `make test:integration:dind` PASS in 37.66s on a host with the
dind kernel blob pre-built.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
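The opt-in to opt-out flip comes down to which way an unset variable resolves. A small sketch, assuming that only the literal value `1` counts as "set" (the PR does not say how other values are treated) and using hypothetical function names:

```rust
/// Old gate: the dind test runs only when BOXLITE_DIND_TEST=1.
fn dind_enabled_opt_in(dind_test: Option<&str>) -> bool {
    dind_test == Some("1")
}

/// New gate: the dind test runs unless BOXLITE_SKIP_DIND_TEST=1.
fn dind_enabled_opt_out(skip_dind_test: Option<&str>) -> bool {
    skip_dind_test != Some("1")
}
```

In a test this could be called as `dind_enabled_opt_out(std::env::var("BOXLITE_SKIP_DIND_TEST").ok().as_deref())`. With nothing exported, the old gate skips and the new gate runs, so a regression now fails the default suite.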
Superseded by #568. Same diff, but the flag is renamed from |
Summary
Adds opt-in dind support to `boxlite run` via `--support-docker`, so you can run `dockerd` (and `docker build` / `docker run`) inside a box. Default behaviour is byte-for-byte unchanged: every existing user keeps the lean, non-docker flow.

Stack:
- `feat(box): opt-in --support-docker`: wires the flag through CLI → proto → guest; gates extra caps + cgroup rw + the dind kernel behind the explicit opt-in.
- `feat(cli): --entrypoint`: override an image's ENTRYPOINT (needed for early dind bring-up; kept as a general-purpose flag).
- `feat(box): Phase B - dind-capable libkrunfw blob + runtime selection`: adds `scripts/build/build-libkrunfw-dind.sh` + `make libkrunfw-dind`. The dind kernel (mqueue, netfilter, cgroup2, overlay) is built once (~10–20 min, cached) and selected per-box only when `--support-docker` is set; lean boxes still boot on the stock libkrunfw.
- `feat(cli): --cmd`: pass args to the image's real ENTRYPOINT without replacing it. Gated on `--support-docker` so the lean flow's `[COMMAND...]` semantics (secondary-exec attach) are unchanged. This drops the previous `--entrypoint sh` dind bypass; the image's own `dockerd-entrypoint.sh` now runs as PID 1.
- `test(cli): dind end-to-end`: `src/cli/tests/dind_build.rs` boots `docker:dind` under `--support-docker` and asserts `docker build` succeeds end-to-end against the real VM.

Current limitation: network=host only
`dockerd` inside the box is launched with `--bridge=none --iptables=false --storage-driver=vfs`. That means:
- `docker build` must use `--network=host` (the e2e test does exactly this).
- `docker run` inside the box also effectively runs in the box's host network: there is no docker0 bridge and no NAT/iptables.

Why: libcontainer 0.5.7 still bind-remounts `/proc/sys` read-only inside the box, which breaks dockerd's default bridge + iptables setup. Disabling them is what lets dind boot cleanly today; restoring full bridged networking is follow-up work (needs either a libcontainer bump or a guest-side `/proc/sys` shim).

One-line test
- `make libkrunfw-dind` is the one-time ~10–20 min kernel build (cached at `target/dind-kernel/lib64/libkrunfw-dind.so.5` afterwards).
- `make test:integration:dind` runs just the dind end-to-end (`dind_supports_docker_build`). Under the hood it's a thin alias for `make test:integration:cli FILTER=dind_supports_docker_build`, which exports `BOXLITE_LIBKRUNFW_DIND_PATH` so the CLI build staples the dind blob into the embedded runtime.
- `BOXLITE_SKIP_DIND_TEST=1 make test:integration:cli` skips just that test.

Test plan
- `make test:unit:core` (rust + ffi + cli unit suites) green
- `make libkrunfw-dind && make test:integration:dind` green on a host with nested virt
- `make test:integration:cli` green (full CLI matrix incl. dind)
- `BOXLITE_SKIP_DIND_TEST=1 make test:integration:cli` skips dind, rest passes
- `boxlite run` without `--support-docker` is byte-for-byte unchanged on a representative image