feat(daemon): containerize netclawd and run evals in ephemeral Docker #603
Merged
Issue #569: the behavioral eval suite (`./evals/run-evals.sh`) runs against the operator's live `netclawd` and mutates its state by seeding test documents into the production SQLite DB, accumulating LLM-formed memories across runs, and destroying real user memories when eval iterations reset. The inverse contamination is just as bad: operator memories crowd out seeded eval docs for recall slots, dropping memory eval scores from 100% → 30% by the third sequential run.

This change runs `netclawd` inside an ephemeral Docker container per eval suite invocation. Two-way contamination disappears, and the same image becomes the supported Docker-deployment artifact for netclawd itself.

## Scope

- `docker/Dockerfile`: release-grade ubuntu:24.04 image with pre-installed autonomous-agent tools (git, jq, sqlite3, python3, gh) and both `netclaw` + `netclawd` binaries on PATH. ENTRYPOINT is the daemon; operators supply config via `NETCLAW_*` env vars and optional volume mounts on `/root/.netclaw`.
- `scripts/docker/build-image.sh`: single build entrypoint reused by contributors, PR validation, and the release publish job. Supports `IMAGE_REPO` override, `NO_BUILD=1` escape hatch, and fails loudly if the published binaries are missing.
- `.github/workflows/pr_validation.yml`: new `validate-docker-build` job that runs the shared build script on every PR and verifies the image reaches `/api/health/ready=healthy` within 60s using an ollama stub provider (no secrets required).
- `.github/workflows/publish_release_binaries.yml`: new `publish-docker` job pushes `ghcr.io/aaronontheweb/netclawd:{latest,v${version},v${major}.${minor}}` on every release tag, using the same build script.
- `evals/run-evals.sh`: rewrite of bootstrap, daemon lifecycle, and log capture. The script spawns `docker run -d --rm --network host` with a throwaway `$EVAL_HOME` identity copy and forwards provider/model config via env vars. Required eval-target credentials (provider type, endpoint, model id) are always explicit: env vars in non-interactive contexts, stdin prompts on terminals, and a hard-fail otherwise. Assertion helpers and the 22 case bodies are untouched.
- `src/Netclaw.Configuration/NetclawPaths.cs`: honour `NETCLAW_HOME` env var as a fallback source for `BasePath` (precedence: explicit arg → env var → default). Enables CLI-side path isolation during eval runs so `netclaw -p` can't leak state into the host's real `~/.netclaw/`.
- `evals/README.md`: document the new env-var surface and remove the "local instance only — no isolation" limitation.

## Known gap / deferred

The daemon does NOT fail fast on empty config today: missing provider config falls back to a default `local-ollama` provider, and missing identity files are silently tolerated by `FileSystemPromptProvider`. The design doc and the daemon-container spec deliberately reflect reality rather than aspiration; making the daemon fail loudly on empty config is a separate follow-up.

Also deferred (follow-up issues):

- Docker Hub publishing alongside GHCR (#602)
- Compaction eval cases using `NETCLAW_EVAL_CONTEXT_WINDOW`
- CI eval execution against a remote LLM
- Committed identity fixture under `evals/fixtures/` for headless CI
- Drain-wait between memory-formation and recall eval phases (#437, already open)

## OpenSpec change

Drives through `openspec/changes/containerize-daemon-and-evals/` (proposal, design, specs/daemon-container, specs/netclaw-cli delta, tasks). New capability `daemon-container`; modified capability `netclaw-cli` adds one requirement for `NETCLAW_HOME` env var precedence. Sync + archive will happen in a follow-up commit after merge, matching the pattern established by 325f856.

## Verification

- `dotnet test src/Netclaw.Configuration.Tests` → 152 pass (7 new NetclawPaths env var precedence tests)
- `dotnet test src/Netclaw.Cli.Tests` → 413 pass
- `dotnet test src/Netclaw.Daemon.Tests` → 428 pass
- `dotnet slopwatch analyze --hook` → no new violations vs baseline
- `scripts/docker/build-image.sh dev` → produces `ghcr.io/aaronontheweb/netclawd:dev` successfully
- Manual bootstrap smoke test: the eval script's `check_prerequisites` + `start_eval_daemon` + `cleanup_eval_env` path runs cleanly against a fixture identity directory, with the daemon reaching healthy state and `force_rmrf` handling the root-owned files the container writes into the bind-mounted logs directory.

Full-suite verification against a real LLM endpoint is deferred to PR review; the bootstrap smoke test validates everything that doesn't require LLM round-trips.

## PR follow-up actions

- First GHCR push will create a private package (`ghcr.io/aaronontheweb/netclawd`). Flip visibility to public under GitHub → Packages → Settings before external users can pull without `docker login`.
- After merge, commit `chore(openspec): sync and archive containerize-daemon-and-evals` to move the delta specs into the canonical specs tree and archive the change directory.

Closes #569.
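The `BasePath` precedence described for `NetclawPaths` (explicit arg → env var → default) can be illustrated with a small shell sketch. This is not the C# implementation: `resolve_base_path` is a hypothetical helper, and treating a whitespace-only value as unset is an assumption based on the "empty/whitespace handling" tests mentioned elsewhere in this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the documented precedence:
#   explicit argument -> NETCLAW_HOME env var -> default (~/.netclaw).
# Not part of the netclaw codebase; the real logic lives in NetclawPaths.cs.

# Succeeds when the argument contains at least one non-whitespace character.
is_non_blank() {
  [ -n "$(printf '%s' "$1" | tr -d '[:space:]')" ]
}

resolve_base_path() {
  local explicit="${1:-}"
  local from_env="${NETCLAW_HOME:-}"

  # 1. An explicit argument always wins.
  if is_non_blank "$explicit"; then
    printf '%s\n' "$explicit"
    return 0
  fi

  # 2. Fall back to NETCLAW_HOME (assumption: blank values count as unset).
  if is_non_blank "$from_env"; then
    printf '%s\n' "$from_env"
    return 0
  fi

  # 3. Otherwise use the default location under the user's home directory.
  printf '%s\n' "$HOME/.netclaw"
}
```

Calling `NETCLAW_HOME="$EVAL_HOME" resolve_base_path` would yield the throwaway eval directory, while passing an explicit path overrides the variable.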
Image construction is an orthogonal concern from .NET test execution and slopwatch, so `validate-docker-build` shouldn't live in `pr_validation.yml`. Extract it into a standalone `.github/workflows/validate_docker_image.yml` that triggers only on changes to files that actually affect the image (Dockerfile, build script, src/**, global.json, Directory.Build.props, the workflow itself). Saves runner minutes on unrelated PRs and keeps each workflow focused on one failure mode.
Review pass via /simplify surfaced a handful of issues across the daemon containerization PR:

- `evals/run-evals.sh` invoked `docker run` twice on the failure path, which would conflict on the `--name` and swallow the real error. Now captures stderr from a single invocation and prints it alongside the error message.
- `cleanup_eval_env` called `docker rm` after `docker stop` on a container launched with `--rm`, where the stop already removes it. Dropped the redundant step.
- `TMPDIR_EVAL` only holds host-owned stdout captures, so `force_rmrf`'s alpine-root fallback is dead code for that path. Replaced with plain `rm -rf` and narrowed the comment on `force_rmrf` to clarify it only matters for `EVAL_HOME` where the container writes as root.
- `docker/Dockerfile` installed five speculative apt packages (`wget`, `jq`, `sqlite3`, `python3-pip`, `python3-venv`) that no eval case or daemon code path exercises. `python3-pip` alone pulls ~50 MB of transitive deps. Trimmed to the minimum the current suite needs: `ca-certificates`, `curl`, `procps`, `git`, `python3`, `gh`. Operators who need more can `apt-get install` at runtime.
- `.github/workflows/validate_docker_image.yml` `paths:` filter was missing `Directory.Packages.props`, meaning a CPM version bump that breaks the build would not trigger the job. Added it.
- `.github/workflows/publish_release_binaries.yml` `publish-docker` re-ran `dotnet publish` for CLI and Daemon on top of the exact same publishes done by `publish-binaries`. Added an `upload-artifact`/`download-artifact` hop so `publish-docker` reuses the output and runs the shared build script with `NO_BUILD=1`.

Also verified empirically that `docker logs "$name" >&2 2>&1` correctly merges both container streams to the script's stderr (reviewer flagged this as a potential swap; confirmed it's correct as written).
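The redirection-order point is easy to check without Docker. In the sketch below, `emit_both` is a hypothetical stand-in for `docker logs` (which, as far as I know, replays container stdout and stderr on the matching host streams); the two wrappers show why `>&2 2>&1` lands everything on stderr while the swapped order lands everything on stdout.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `docker logs "$name"`: one line per stream.
emit_both() {
  echo "from-stdout"
  echo "from-stderr" >&2
}

# Order used in the script. Redirections apply left to right:
#   >&2   points fd 1 at whatever fd 2 currently is (the caller's stderr)
#   2>&1  points fd 2 at fd 1, which is already the caller's stderr (no-op)
# Result: both lines arrive on the caller's stderr.
merge_to_stderr() {
  emit_both >&2 2>&1
}

# Swapped order, for contrast:
#   2>&1  points fd 2 at the caller's stdout
#   >&2   points fd 1 at fd 2, which is now the caller's stdout (no-op)
# Result: both lines arrive on the caller's stdout.
merge_to_stdout() {
  emit_both 2>&1 >&2
}
```

Running `merge_to_stderr 2>/dev/null` prints nothing, which is a quick way to confirm that neither stream leaks to stdout.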
Verified after the fixes:

- `bash -n evals/run-evals.sh` → syntax OK
- `dotnet slopwatch analyze --hook` → no new violations
- `dotnet test src/Netclaw.Configuration.Tests` → 152 pass
- Trimmed image still builds via `scripts/docker/build-image.sh dev`
- Trimmed image reaches `/api/health/ready=healthy` in 4s
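A readiness wait like the one behind the "healthy in 4s" check might look like the following sketch. `wait_until` and `probe_ready` are hypothetical names, the URL (host and port in particular) is a placeholder, and the probe assumes the endpoint answers with a non-2xx status until the daemon is ready; none of this is taken from the actual script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a probe command once per second until it
# succeeds or the timeout (in seconds) has elapsed.
wait_until() {
  local timeout_s="$1"
  shift
  local waited=0
  until "$@"; do
    if [ "$waited" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Hypothetical probe. READY_URL is a placeholder; the real host and port are
# not stated in this PR. `curl -f` fails on HTTP errors, so this assumes the
# endpoint returns a non-2xx status while the daemon is still starting.
probe_ready() {
  curl -fsS -o /dev/null "${READY_URL:-http://127.0.0.1:8080/api/health/ready}" 2>/dev/null
}

# Example usage (commented out so sourcing this file has no side effects):
#   wait_until 60 probe_ready || { echo "daemon never became ready" >&2; exit 1; }
```

Keeping the probe separate from the retry loop means the same loop can wait on any condition, and the timeout matches the 60s budget the validation job uses.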
The previous simplification pass trimmed the Dockerfile's apt packages based on what the current eval suite exercises, but the image's primary purpose is general-purpose netclawd deployment, not evals. An autonomous agent should have the common shell toolkit pre-installed so it can reach for things without needing to `apt-get install` first on every new action. Restore: wget, jq, sqlite3, python3-pip, python3-venv. The `apt-get install` escape hatch the Ubuntu base provides is for tools the agent discovers it needs on demand, not for things it should always have. Verified: image still builds, starts healthy in 4s, and all 8 pre-installed tools resolve on PATH inside the container.
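The "tools resolve on PATH" verification can be sketched as a small loop over `command -v`. `missing_tools` is a hypothetical helper and the tool names in the usage comment are illustrative; the exact list of eight tools checked is not spelled out here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each named tool that does not resolve on PATH.
# Returns 0 when every tool is found, 1 when at least one is missing.
missing_tools() {
  local tool
  local status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
      status=1
    fi
  done
  return "$status"
}

# Example usage inside the container (tool names are illustrative):
#   missing_tools git jq sqlite3 python3 gh curl wget || echo "image is missing tools" >&2
```

Note that apt package names and executable names can differ (for example, `python3-venv` ships no command of that name), so a check like this has to list executables, not packages.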
This was referenced Apr 12, 2026
## Summary

Closes #569. Containerizes `netclawd` as a publishable Docker image and rewrites the behavioral eval suite to run against an ephemeral container per invocation, so there is no more contamination between the eval suite and the operator's real `~/.netclaw` state.

## Why

The eval suite runs against the operator's live `netclawd`, mutating production memories and seeding test docs into the real SQLite DB. Observed in #569: memory-score eval scores drop from 100% → 30% hit rate by the third sequential run as LLM-formed memories from earlier runs crowd out seeded eval documents. Running `netclawd` inside a Docker container sidesteps the whole problem: each invocation gets a fresh DB, fresh identity, fresh sessions, and fresh logs.

The same image also becomes the supported Docker-deployment artifact for `netclawd` itself, closing a gap flagged in the `exposure-modes` change.
## What's in the PR

- `docker/Dockerfile`: release-grade `ubuntu:24.04` image (pre-installed `git`, `jq`, `sqlite3`, `python3`, `gh`, plus the `netclaw` + `netclawd` binaries). ENTRYPOINT is the daemon; operators supply config via `NETCLAW_*` env vars and `/root/.netclaw` mounts.
- `scripts/docker/build-image.sh`: single build entrypoint used by contributors, the PR validation job, and the release publish job. Supports `IMAGE_REPO` override and `NO_BUILD=1` escape hatch.
- `validate-docker-build`: new PR job that runs the shared build script and verifies `/api/health/ready` returns healthy within 60s using an ollama stub provider (no secrets required).
- `publish-docker`: new release job that pushes `ghcr.io/aaronontheweb/netclawd:{latest,v${version},v${major}.${minor}}` on every release tag.
- `evals/run-evals.sh`: rewrite of bootstrap, daemon lifecycle, and log capture. Spawns `docker run -d --rm --network host` with a throwaway `$EVAL_HOME` identity copy, forwards provider/model config via env vars, and captures the daemon's file log via a writable bind-mount on `/root/.netclaw/logs`. Required eval-target credentials (`NETCLAW_EVAL_PROVIDER_TYPE` / `_ENDPOINT` / `_MODEL_ID`) are always explicit: env vars in non-interactive contexts, stdin prompts on terminals, and a hard-fail otherwise. All 22 assertion helpers and case bodies are untouched.
- `NetclawPaths`: honours `NETCLAW_HOME` env var as a fallback source for `BasePath`. Backward-compatible single-line change (explicit arg → env var → default).
- `evals/README.md`: documents the new env-var surface, the container lifecycle, and `--network host` for Tailscale MagicDNS, and removes the "local instance only — no isolation" limitation.
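The "always explicit" credential rule for the eval script (env var, else a prompt on a terminal, else a hard fail) can be sketched as follows. `require_setting` is a hypothetical helper, and using `[ -t 0 ]` to detect an interactive terminal is an assumption about how the script distinguishes the two contexts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: resolve a required setting by variable name.
#   1. Use the environment variable if it is set and non-empty.
#   2. Otherwise prompt on stdin, but only when stdin is a terminal.
#   3. Otherwise fail loudly instead of guessing a default.
# The name is passed through eval, so only call this with trusted,
# hard-coded variable names.
require_setting() {
  local name="$1"
  local value=""
  eval "value=\${$name:-}"

  if [ -n "$value" ]; then
    printf '%s\n' "$value"
    return 0
  fi

  if [ -t 0 ]; then
    # Prompt on stderr so stdout carries only the resolved value.
    printf '%s: ' "$name" >&2
    IFS= read -r value || value=""
    if [ -n "$value" ]; then
      printf '%s\n' "$value"
      return 0
    fi
  fi

  echo "error: $name is required (set it in the environment)" >&2
  return 1
}

# Example usage (commented out so sourcing this file has no side effects):
#   provider_type="$(require_setting NETCLAW_EVAL_PROVIDER_TYPE)" || exit 1
```

Failing in the non-interactive case keeps CI runs from silently evaluating against whatever default provider the daemon would otherwise fall back to.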
## OpenSpec

Drives through `openspec/changes/containerize-daemon-and-evals/`: proposal, design (9 decisions), new capability spec `daemon-container` (12 requirements), delta spec for `netclaw-cli` (1 new requirement), and 10 task groups. Schema validation passes (`openspec validate containerize-daemon-and-evals` → clean). `sync` + `archive` will happen in a follow-up `chore(openspec): sync and archive containerize-daemon-and-evals` commit after merge, matching the pattern in 325f856.
## Known gap (intentional)

The daemon does NOT fail fast on empty config today: `FileSystemPromptProvider` silently tolerates missing identity layers and `Program.cs:427-430` falls back to a default `local-ollama` provider when none is configured. The design doc and the new `daemon-container` spec deliberately reflect this reality, and the PR validation job only tests the happy-path contract. Making the daemon fail loudly on empty config is a follow-up, tracked in the design doc as a known risk.
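## Deferred (follow-up issues)

- Compaction eval cases using `NETCLAW_EVAL_CONTEXT_WINDOW`
- Committed identity fixture under `evals/fixtures/identity/` for headless CI
- Drain-wait between memory-formation and recall eval phases
- feat(daemon): fail loudly on empty identity and missing provider config at startup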
## Test plan

- `dotnet test src/Netclaw.Configuration.Tests` → 152 pass (7 new `NetclawPathsTests` covering env var precedence, empty/whitespace handling, and explicit-override-wins)
- `dotnet test src/Netclaw.Cli.Tests` → 413 pass (no regressions)
- `dotnet test src/Netclaw.Daemon.Tests` → 428 pass (no regressions)
- `dotnet slopwatch analyze --hook` → no new violations vs baseline
- `scripts/docker/build-image.sh dev` → produces `ghcr.io/aaronontheweb/netclawd:dev` successfully end-to-end
- Manual bootstrap smoke test: `check_prerequisites` + `start_eval_daemon` + `cleanup_eval_env` runs cleanly against a fixture identity directory, the daemon reaches healthy state, and `force_rmrf` handles the root-owned files the container writes into the bind-mounted logs directory
- […] reaches `/api/health/ready=healthy` within 4s
- […] (deferred to reviewer; needs a working LLM, takes ~20 min)
- […] degradation regression (deferred to reviewer)
- `validate-docker-build` job runs green on this PR
- `publish-docker` runs end-to-end and lands the image at `ghcr.io/aaronontheweb/netclawd`

## PR follow-up actions after merge

- Flip the `ghcr.io/aaronontheweb/netclawd` package visibility to public under GitHub → Packages → Settings (first push creates a private package by default).
- Commit `chore(openspec): sync and archive containerize-daemon-and-evals` to move the delta specs into `openspec/specs/daemon-container/spec.md`, apply the `netclaw-cli` delta, and archive the change directory.