feat(daemon): containerize netclawd and run evals in ephemeral Docker by Aaronontheweb · Pull Request #603 · netclaw-dev/netclaw

Aaronontheweb · 2026-04-11T21:17:45Z

Summary

Closes #569. Containerizes netclawd as a publishable Docker image
and rewrites the behavioral eval suite to run against an ephemeral
container per invocation — no more contamination between the eval
suite and the operator's real ~/.netclaw state.

Why

The eval suite runs against the operator's live netclawd, mutating
production memories and seeding test docs into the real SQLite DB.
Observed in #569: memory-score eval scores drop from 100% → 30% hit
rate by the third sequential run as LLM-formed memories from earlier
runs crowd out seeded eval documents. Running netclawd inside a
Docker container sidesteps the whole problem — each invocation gets a
fresh DB, fresh identity, fresh sessions, and fresh logs.

The same image also becomes the supported Docker-deployment artifact
for netclawd itself, closing a gap flagged in the exposure-modes
change.

What's in the PR

- docker/Dockerfile: release-grade ubuntu:24.04 image
  (pre-installed git, jq, sqlite3, python3, gh, plus the
  netclaw + netclawd binaries). ENTRYPOINT is the daemon; operators
  supply config via NETCLAW_* env vars and /root/.netclaw mounts.
- scripts/docker/build-image.sh: single build entrypoint used
  by contributors, the PR validation job, and the release publish job.
  Supports IMAGE_REPO override and NO_BUILD=1 escape hatch.
- validate-docker-build: new PR job that runs the shared build
  script and verifies /api/health/ready returns healthy within 60s
  using an ollama stub provider (no secrets required).
- publish-docker: new release job that pushes
  ghcr.io/aaronontheweb/netclawd:{latest,v${version},v${major}.${minor}}
  on every release tag.
- evals/run-evals.sh: rewrite of bootstrap, daemon lifecycle,
  and log capture. Spawns docker run -d --rm --network host with a
  throwaway $EVAL_HOME identity copy, forwards provider/model
  config via env vars, and captures the daemon's file log via a writable
  bind-mount on /root/.netclaw/logs. Required eval-target
  credentials (NETCLAW_EVAL_PROVIDER_TYPE / _ENDPOINT / _MODEL_ID)
  are always explicit: env vars in non-interactive contexts, stdin
  prompts on terminals, and a hard-fail otherwise. The assertion
  helpers and all 22 case bodies are untouched.
- NetclawPaths: honours NETCLAW_HOME env var as a fallback
  source for BasePath. Backward-compatible single-line change
  (explicit arg → env var → default).
- evals/README.md: documents the new env-var surface, the
  container lifecycle, and --network host for Tailscale MagicDNS, and
  removes the "local instance only — no isolation" limitation.
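
The credential rule described for evals/run-evals.sh (explicit env var, else a stdin prompt on a terminal, else hard-fail) could look roughly like the sketch below. This is not the script's actual code: `require_setting` is a hypothetical helper name, and treating an empty value as missing is an assumption.

```shell
# Resolve one required setting by name.
# Order: value already in the environment -> prompt if stdin is a terminal -> fail.
# NOTE: require_setting is a hypothetical helper, not taken from run-evals.sh.
require_setting() {
  name="$1"
  # Indirect lookup of the variable called $name (name must be a trusted identifier).
  eval "value=\${$name:-}"
  if [ -n "$value" ]; then
    printf '%s\n' "$value"
    return 0
  fi
  # Only prompt when a human is attached to stdin.
  if [ -t 0 ]; then
    printf '%s: ' "$name" >&2
    IFS= read -r value
    if [ -n "$value" ]; then
      printf '%s\n' "$value"
      return 0
    fi
  fi
  echo "error: $name is required (set it in the environment)" >&2
  return 1
}

# Example:
#   provider_type="$(require_setting NETCLAW_EVAL_PROVIDER_TYPE)" || exit 1
```

Writing the prompt to stderr keeps stdout clean, so the resolved value can be captured with command substitution.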

OpenSpec

Drives through openspec/changes/containerize-daemon-and-evals/:
proposal, design (9 decisions), new capability spec
daemon-container (12 requirements), delta spec for netclaw-cli
(1 new requirement), and 10 task groups. Schema validation passes
(openspec validate containerize-daemon-and-evals → clean).
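
The one new netclaw-cli requirement covers the NETCLAW_HOME precedence (explicit arg → env var → default). The real logic lives in the C# NetclawPaths class; the shell sketch below only illustrates the ordering. `resolve_base_path` is a hypothetical name, `~/.netclaw` is taken as the default from the paths mentioned in this PR, and treating an empty NETCLAW_HOME as unset is an assumption.

```shell
# Illustrates the precedence only: explicit argument, then NETCLAW_HOME, then default.
# NOTE: resolve_base_path is hypothetical; the actual implementation is in NetclawPaths (C#).
resolve_base_path() {
  explicit="${1:-}"
  if [ -n "$explicit" ]; then
    printf '%s\n' "$explicit"          # 1. explicit argument wins
  elif [ -n "${NETCLAW_HOME:-}" ]; then
    printf '%s\n' "$NETCLAW_HOME"      # 2. env var fallback
  else
    printf '%s\n' "$HOME/.netclaw"     # 3. default location (assumed)
  fi
}
```

With this ordering, exporting NETCLAW_HOME to a throwaway directory during an eval run redirects CLI state away from the operator's real home without touching callers that already pass an explicit path.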

Sync + archive will happen in a follow-up `chore(openspec): sync and
archive containerize-daemon-and-evals` commit after merge, matching
the pattern in 325f856.

Known gap (intentional)

The daemon does NOT fail fast on empty config today —
FileSystemPromptProvider silently tolerates missing identity layers
and Program.cs:427-430 falls back to a default `local-ollama`
provider when none is configured. The design doc and the new
`daemon-container` spec deliberately reflect this reality, and the
PR validation job only tests the happy-path contract. Making the
daemon fail loudly on empty config is a follow-up — tracked in the
design doc as a known risk.
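
For reference, the happy-path contract the validation job checks (ready within 60s) amounts to a bounded polling loop. The sketch below is an illustration, not the workflow's actual step: `wait_until_ready` and `probe_url` are hypothetical helpers, and the URL, port, and success criterion (HTTP success from curl) are assumptions.

```shell
# Poll a probe command until it succeeds or roughly <timeout> seconds have passed.
# Usage: wait_until_ready <timeout_seconds> <probe command...>
# Counts one-second sleeps, so time spent inside the probe itself is not included.
wait_until_ready() {
  timeout_s="$1"; shift
  elapsed=0
  while [ "$elapsed" -lt "$timeout_s" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}

# Hypothetical probe: succeeds when the URL answers with a non-error HTTP status.
probe_url() {
  curl --fail --silent --max-time 2 "$1"
}

# Example (replace <port> with the daemon's actual port):
#   wait_until_ready 60 probe_url "http://127.0.0.1:<port>/api/health/ready" \
#     || { echo "daemon did not become ready" >&2; exit 1; }
```

A check like this only proves the daemon starts when configured. It says nothing about the empty-config case described above.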

Deferred (follow-up issues)

- Docker Hub publishing alongside GHCR, filed as #602
  (feat(release): publish netclawd Docker image to Docker Hub alongside GHCR)
- Compaction eval cases using NETCLAW_EVAL_CONTEXT_WINDOW
- CI eval execution against a remote LLM (needs secret wiring)
- Committed identity fixture under evals/fixtures/identity/ for
  headless CI
- Drain-wait between memory-formation and recall eval phases, already
  open as #437 (fix(evals): add checkpoint drain wait between memory
  formation and recall eval phases)
- feat(daemon): fail loudly on empty identity and missing provider
  config at startup

Test plan

PR follow-up actions after merge

- Flip GHCR package visibility to public under GitHub → Packages
  → Settings (first push creates a private package by default).
- Commit `chore(openspec): sync and archive
  containerize-daemon-and-evals` to move the delta specs into
  `openspec/specs/daemon-container/spec.md`, apply the
  `netclaw-cli` delta, and archive the change directory.

Issue #569: the behavioral eval suite (`./evals/run-evals.sh`) runs against the operator's live `netclawd` and mutates its state: it seeds test documents into the production SQLite DB, accumulates LLM-formed memories across runs, and destroys real user memories when eval iterations reset. The inverse contamination is just as bad: operator memories crowd out seeded eval docs for recall slots, dropping memory eval scores from 100% → 30% by the third sequential run.

This change runs `netclawd` inside an ephemeral Docker container per eval suite invocation. Two-way contamination disappears, and the same image becomes the supported Docker-deployment artifact for netclawd itself.

## Scope

- `docker/Dockerfile`: release-grade ubuntu:24.04 image with pre-installed autonomous-agent tools (git, jq, sqlite3, python3, gh) and both `netclaw` + `netclawd` binaries on PATH. ENTRYPOINT is the daemon; operators supply config via `NETCLAW_*` env vars and optional volume mounts on `/root/.netclaw`.
- `scripts/docker/build-image.sh`: single build entrypoint reused by contributors, PR validation, and the release publish job. Supports `IMAGE_REPO` override, `NO_BUILD=1` escape hatch, and fails loudly if the published binaries are missing.
- `.github/workflows/pr_validation.yml`: new `validate-docker-build` job that runs the shared build script on every PR and verifies the image reaches `/api/health/ready=healthy` within 60s using an ollama stub provider (no secrets required).
- `.github/workflows/publish_release_binaries.yml`: new `publish-docker` job pushes `ghcr.io/aaronontheweb/netclawd:{latest,v${version},v${major}.${minor}}` on every release tag, using the same build script.
- `evals/run-evals.sh`: rewrite of bootstrap, daemon lifecycle, and log capture. The script spawns `docker run -d --rm --network host` with a throwaway `$EVAL_HOME` identity copy and forwards provider/model config via env vars. Required eval-target credentials (provider type, endpoint, model id) are always explicit: env vars in non-interactive contexts, stdin prompts on terminals, and a hard-fail otherwise. Assertion helpers and the 22 case bodies are untouched.
- `src/Netclaw.Configuration/NetclawPaths.cs`: honour `NETCLAW_HOME` env var as a fallback source for `BasePath` (precedence: explicit arg → env var → default). Enables CLI-side path isolation during eval runs so `netclaw -p` can't leak state into the host's real `~/.netclaw/`.
- `evals/README.md`: document the new env-var surface and remove the "local instance only — no isolation" limitation.

## Known gap / deferred

The daemon does NOT fail fast on empty config today: missing provider config falls back to a default `local-ollama` provider, and missing identity files are silently tolerated by `FileSystemPromptProvider`. The design doc and the daemon-container spec deliberately reflect reality rather than aspiration; making the daemon fail loudly on empty config is a separate follow-up.

Also deferred (follow-up issues): Docker Hub publishing alongside GHCR (#602), compaction eval cases using `NETCLAW_EVAL_CONTEXT_WINDOW`, CI eval execution against a remote LLM, committed identity fixture under `evals/fixtures/` for headless CI, drain-wait between memory-formation and recall eval phases (#437, already open).

## OpenSpec change

Drives through `openspec/changes/containerize-daemon-and-evals/` (proposal, design, specs/daemon-container, specs/netclaw-cli delta, tasks). New capability `daemon-container`; modified capability `netclaw-cli` adds one requirement for `NETCLAW_HOME` env var precedence. Sync + archive will happen in a follow-up commit after merge, matching the pattern established by 325f856.

## Verification

- `dotnet test src/Netclaw.Configuration.Tests` → 152 pass (7 new NetclawPaths env var precedence tests)
- `dotnet test src/Netclaw.Cli.Tests` → 413 pass
- `dotnet test src/Netclaw.Daemon.Tests` → 428 pass
- `dotnet slopwatch analyze --hook` → no new violations vs baseline
- `scripts/docker/build-image.sh dev` → produces `ghcr.io/aaronontheweb/netclawd:dev` successfully
- Manual bootstrap smoke test: the eval script's `check_prerequisites` + `start_eval_daemon` + `cleanup_eval_env` path runs cleanly against a fixture identity directory, with the daemon reaching healthy state and `force_rmrf` handling the root-owned files the container writes into the bind-mounted logs directory.

Full-suite verification against a real LLM endpoint is deferred to PR review; the bootstrap smoke test validates everything that doesn't require LLM round-trips.

## PR follow-up actions

- First GHCR push will create a private package (`ghcr.io/aaronontheweb/netclawd`). Flip visibility to public under GitHub → Packages → Settings before external users can pull without `docker login`.
- After merge, commit `chore(openspec): sync and archive containerize-daemon-and-evals` to move the delta specs into the canonical specs tree and archive the change directory.

Closes #569.
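
To make the `publish-docker` tag set concrete, the sketch below expands `{latest,v${version},v${major}.${minor}}` for a given release version. It is illustrative only: `image_tags` is a hypothetical helper, and how the release workflow actually derives `version`, `major`, and `minor` is not shown in this PR.

```shell
# Print the three image references for a release version such as "0.12.0" or "v0.12.0".
# NOTE: image_tags is a hypothetical helper, not part of build-image.sh.
image_tags() {
  repo="$1"
  version="${2#v}"          # drop an optional leading "v"
  major="${version%%.*}"    # text before the first dot
  rest="${version#*.}"      # text after the first dot
  minor="${rest%%.*}"       # text before the next dot
  printf '%s:latest\n' "$repo"
  printf '%s:v%s\n' "$repo" "$version"
  printf '%s:v%s.%s\n' "$repo" "$major" "$minor"
}

# Example:
#   image_tags ghcr.io/aaronontheweb/netclawd 0.12.0
```

The `v${major}.${minor}` tag moves with each patch release, so operators who pin to it pick up patch updates without tracking `latest`.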

Image construction is an orthogonal concern from .NET test execution and slopwatch, so `validate-docker-build` shouldn't live in `pr_validation.yml`. Extract it into a standalone `.github/workflows/validate_docker_image.yml` that triggers only on changes to files that actually affect the image (Dockerfile, build script, src/**, global.json, Directory.Build.props, the workflow itself). Saves runner minutes on unrelated PRs and keeps each workflow focused on one failure mode.
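
The trigger rule can be read as a predicate over changed file paths. The sketch below expresses it in shell purely for illustration; the real filter is the workflow's `paths:` list, and `affects_image` is a hypothetical name. It also includes `Directory.Packages.props`, which the later review pass added to the filter.

```shell
# Return 0 when a changed path should trigger the image validation workflow.
# NOTE: illustrative only; the real rule is the paths: filter in the workflow file.
affects_image() {
  case "$1" in
    docker/Dockerfile|scripts/docker/build-image.sh) return 0 ;;
    src/*) return 0 ;;   # in a case pattern, * also matches "/", so this covers src/**
    global.json|Directory.Build.props|Directory.Packages.props) return 0 ;;
    .github/workflows/validate_docker_image.yml) return 0 ;;
    *) return 1 ;;
  esac
}

# Example: list the changed files that would trigger the job.
#   git diff --name-only origin/dev...HEAD | while IFS= read -r f; do
#     affects_image "$f" && printf '%s\n' "$f"
#   done
```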

Review pass via /simplify surfaced a handful of issues across the daemon containerization PR:

- `evals/run-evals.sh` invoked `docker run` twice on the failure path, which would conflict on the `--name` and swallow the real error. Now captures stderr from a single invocation and prints it alongside the error message.
- `cleanup_eval_env` called `docker rm` after `docker stop` on a container launched with `--rm`, where the stop already removes it. Dropped the redundant step.
- `TMPDIR_EVAL` only holds host-owned stdout captures, so `force_rmrf`'s alpine-root fallback is dead code for that path. Replaced with plain `rm -rf` and narrowed the comment on `force_rmrf` to clarify it only matters for `EVAL_HOME` where the container writes as root.
- `docker/Dockerfile` installed five speculative apt packages (`wget`, `jq`, `sqlite3`, `python3-pip`, `python3-venv`) that no eval case or daemon code path exercises. `python3-pip` alone pulls ~50 MB of transitive deps. Trimmed to the minimum the current suite needs: `ca-certificates`, `curl`, `procps`, `git`, `python3`, `gh`. Operators who need more can `apt-get install` at runtime.
- `.github/workflows/validate_docker_image.yml` `paths:` filter was missing `Directory.Packages.props`, meaning a CPM version bump that breaks the build would not trigger the job. Added it.
- `.github/workflows/publish_release_binaries.yml` `publish-docker` re-ran `dotnet publish` for CLI and Daemon on top of the exact same publishes done by `publish-binaries`. Added an `upload-artifact`/`download-artifact` hop so `publish-docker` reuses the output and runs the shared build script with `NO_BUILD=1`.

Also verified empirically that `docker logs "$name" >&2 2>&1` correctly merges both container streams to the script's stderr (reviewer flagged this as a potential swap; confirmed it's correct as written).

Verified after the fixes:

- `bash -n evals/run-evals.sh` → syntax OK
- `dotnet slopwatch analyze --hook` → no new violations
- `dotnet test src/Netclaw.Configuration.Tests` → 152 pass
- Trimmed image still builds via `scripts/docker/build-image.sh dev`
- Trimmed image reaches `/api/health/ready=healthy` in 4s
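
The `>&2 2>&1` ordering is easy to misread, so here is a small self-contained demonstration that uses a stand-in function instead of `docker logs`. Redirections apply left to right: stdout is first pointed at stderr, then stderr is pointed at wherever stdout now goes, which is still stderr. `emit_both` and `merged_to_stderr` are hypothetical names used only for this demo.

```shell
# Stand-in for a command that writes to both streams (like docker logs).
emit_both() {
  echo "to stdout"
  echo "to stderr" >&2
}

# Same redirection order as in the commit message: both streams end up on stderr.
merged_to_stderr() {
  emit_both >&2 2>&1
}

# Example: nothing reaches stdout, both lines reach stderr.
#   merged_to_stderr 2>/dev/null     # prints nothing
#   merged_to_stderr 2>&1 >/dev/null # prints both lines
```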

The previous simplification pass trimmed the Dockerfile's apt packages based on what the current eval suite exercises, but the image's primary purpose is general-purpose netclawd deployment, not evals. An autonomous agent should have the common shell toolkit pre-installed so it can reach for things without needing to `apt-get install` first on every new action. Restore: wget, jq, sqlite3, python3-pip, python3-venv. The `apt-get install` escape hatch the Ubuntu base provides is for tools the agent discovers it needs on demand, not for things it should always have. Verified: image still builds, starts healthy in 4s, and all 8 pre-installed tools resolve on PATH inside the container.
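
A check like "all pre-installed tools resolve on PATH" can be done with `command -v`. The sketch below is one way to do it, not the procedure used for this commit: `missing_tools` is a hypothetical helper, and the commit message does not spell out exactly which eight tools were checked.

```shell
# Print the name of every tool that does not resolve on PATH; print nothing if all resolve.
# NOTE: missing_tools is a hypothetical helper.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example, run inside the container (the tool list here is illustrative):
#   missing="$(missing_tools git jq sqlite3 python3 gh wget curl)"
#   [ -z "$missing" ] || { echo "missing: $missing" >&2; exit 1; }
```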

Aaronontheweb added 3 commits April 11, 2026 16:16

Aaronontheweb added the enhancement, github_actions, .NET, memory, and docker labels Apr 11, 2026

Aaronontheweb enabled auto-merge (squash) April 11, 2026 21:31

Aaronontheweb merged commit d577bf6 into dev Apr 11, 2026
4 checks passed

Aaronontheweb deleted the feat/containerize-daemon-and-evals branch April 11, 2026 21:35

This was referenced Apr 12, 2026

System prompt cache busting: reorder dynamic context layers + move memory recall out of system role #608

Closed

release: prepare 0.12.0 release notes and version bump #624

Merged

Aaronontheweb merged 4 commits into dev from feat/containerize-daemon-and-evals
