feat(daemon): containerize netclawd and run evals in ephemeral Docker#603

Merged
Aaronontheweb merged 4 commits into dev from feat/containerize-daemon-and-evals on Apr 11, 2026

@Aaronontheweb

Collaborator

Summary

Closes #569. Containerizes netclawd as a publishable Docker image
and rewrites the behavioral eval suite to run against an ephemeral
container per invocation — no more contamination between the eval
suite and the operator's real ~/.netclaw state.

Why

The eval suite runs against the operator's live netclawd, mutating
production memories and seeding test docs into the real SQLite DB.
Observed in #569: memory-score eval scores drop from 100% → 30% hit
rate by the third sequential run as LLM-formed memories from earlier
runs crowd out seeded eval documents. Running netclawd inside a
Docker container sidesteps the whole problem — each invocation gets a
fresh DB, fresh identity, fresh sessions, and fresh logs.

The same image also becomes the supported Docker-deployment artifact
for netclawd itself, closing a gap flagged in the exposure-modes
change.

What's in the PR

  • docker/Dockerfile — release-grade ubuntu:24.04 image
    (pre-installed git, jq, sqlite3, python3, gh, plus the
    netclaw + netclawd binaries). ENTRYPOINT is the daemon; operators
    supply config via NETCLAW_* env vars and /root/.netclaw mounts.
  • scripts/docker/build-image.sh — single build entrypoint used
    by contributors, the PR validation job, and the release publish job.
    Supports IMAGE_REPO override and NO_BUILD=1 escape hatch.
  • validate-docker-build — new PR job that runs the shared build
    script and verifies /api/health/ready returns healthy within 60s
    using an ollama stub provider (no secrets required).
  • publish-docker — new release job that pushes
    ghcr.io/aaronontheweb/netclawd:{latest,v${version},v${major}.${minor}}
    on every release tag.
  • evals/run-evals.sh — rewrite of bootstrap, daemon lifecycle,
    and log capture. Spawns docker run -d --rm --network host with a
    throwaway $EVAL_HOME identity copy, forwards provider/model
    config via env vars, captures the daemon's file log via a writable
    bind-mount on /root/.netclaw/logs. Required eval-target
    credentials (NETCLAW_EVAL_PROVIDER_TYPE / _ENDPOINT / _MODEL_ID)
    are always explicit: env vars in non-interactive contexts, stdin
    prompts on terminals, and a hard-fail otherwise. The assertion
    helpers and all 22 case bodies are untouched.
  • NetclawPaths — honours NETCLAW_HOME env var as a fallback
    source for BasePath. Backward-compatible single-line change
    (explicit arg → env var → default).
  • evals/README.md — documents the new env-var surface, the
    container lifecycle, --network host for Tailscale MagicDNS, and
    removes the "local instance only — no isolation" limitation.
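The tag set pushed by publish-docker (`latest`, `v${version}`, `v${major}.${minor}`) can be derived from a single version string. The following is a minimal sketch, not the workflow's actual code: `image_tags` is a hypothetical helper, and it assumes the version is a plain `major.minor.patch` string with an optional leading `v`.

```shell
# Hypothetical helper: print the three image references described in this PR
# for a given repository and semantic version (for example "v1.4.2").
image_tags() {
  repo="$1"
  version="${2#v}"          # drop an optional leading "v"
  major="${version%%.*}"    # text before the first dot
  rest="${version#*.}"      # text after the first dot
  minor="${rest%%.*}"       # text before the next dot
  printf '%s:latest\n' "$repo"
  printf '%s:v%s\n' "$repo" "$version"
  printf '%s:v%s.%s\n' "$repo" "$major" "$minor"
}

image_tags ghcr.io/aaronontheweb/netclawd v1.4.2
```

For the sample input `v1.4.2` this prints the `:latest`, `:v1.4.2`, and `:v1.4` references, one per line.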

OpenSpec

Drives through openspec/changes/containerize-daemon-and-evals/:
proposal, design (9 decisions), new capability spec
daemon-container (12 requirements), delta spec for netclaw-cli
(1 new requirement), and 10 task groups. Schema validation passes
(openspec validate containerize-daemon-and-evals → clean).

Sync + archive will happen in a follow-up `chore(openspec): sync and archive containerize-daemon-and-evals` commit after merge, matching
the pattern in 325f856.

Known gap (intentional)

The daemon does NOT fail fast on empty config today —
FileSystemPromptProvider silently tolerates missing identity layers
and Program.cs:427-430 falls back to a default `local-ollama`
provider when none is configured. The design doc and the new
`daemon-container` spec deliberately reflect this reality, and the
PR validation job only tests the happy-path contract. Making the
daemon fail loudly on empty config is a follow-up — tracked in the
design doc as a known risk.

Deferred (follow-up issues)

Test plan

  • dotnet test src/Netclaw.Configuration.Tests → 152 pass (7 new
    NetclawPathsTests covering env var precedence, empty/whitespace
    handling, and explicit-override-wins)
  • dotnet test src/Netclaw.Cli.Tests → 413 pass (no regressions)
  • dotnet test src/Netclaw.Daemon.Tests → 428 pass (no regressions)
  • dotnet slopwatch analyze --hook → no new violations vs baseline
  • scripts/docker/build-image.sh dev → produces
    `ghcr.io/aaronontheweb/netclawd:dev` successfully end-to-end
  • Manual bootstrap smoke test: the eval script's
    `check_prerequisites` + `start_eval_daemon` + `cleanup_eval_env`
    runs cleanly against a fixture identity directory, the daemon
    reaches healthy state, and `force_rmrf` handles the root-owned
    files the container writes into the bind-mounted logs directory
  • Manual daemon health probe: `docker run` with minimal config
    reaches `/api/health/ready=healthy` within 4s
  • Full end-to-end eval suite run against a real LLM endpoint
    (deferred to reviewer — needs a working LLM, takes ~20 min)
  • Three-consecutive-run stability check for the #569 memory-recall
    degradation regression (deferred to reviewer)
  • validate-docker-build job runs green on this PR
  • After merge + first tag push: publish-docker runs end-to-end
    and lands the image at ghcr.io/aaronontheweb/netclawd
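The "healthy within 60s" and "healthy within 4s" checks above amount to polling a probe until a deadline. A minimal sketch of that loop follows. `wait_until_ready` is a hypothetical helper rather than code from this PR, and the probe command is passed in as arguments so the loop stays independent of the endpoint.

```shell
# Hypothetical helper: run the probe command once per second until it
# succeeds (return 0) or the timeout in seconds is used up (return 1).
wait_until_ready() {
  timeout_s="$1"
  shift
  elapsed=0
  while [ "$elapsed" -lt "$timeout_s" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}

# Assumed usage. NETCLAW_PORT is a placeholder, since the PR does not state
# the port, and `curl -f` only checks the HTTP status, not the response body:
# wait_until_ready 60 curl -fsS "http://localhost:${NETCLAW_PORT}/api/health/ready"
```

A real check would probably also inspect the response body for the healthy state, which this sketch leaves out.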

PR follow-up actions after merge

  1. Flip GHCR package visibility to public under GitHub → Packages
    → Settings (first push creates a private package by default).
  2. Commit `chore(openspec): sync and archive
    containerize-daemon-and-evals` to move the delta specs into
    `openspec/specs/daemon-container/spec.md`, apply the
    `netclaw-cli` delta, and archive the change directory.

Issue #569: the behavioral eval suite (`./evals/run-evals.sh`) runs
against the operator's live `netclawd` and mutates its state — seeding
test documents into the production SQLite DB, accumulating LLM-formed
memories across runs, and destroying real user memories when eval
iterations reset. The inverse contamination is just as bad: operator
memories crowd out seeded eval docs for recall slots, dropping memory
eval scores from 100% → 30% by the third sequential run.

This change runs `netclawd` inside an ephemeral Docker container per
eval suite invocation. Two-way contamination disappears, and the same
image becomes the supported Docker-deployment artifact for netclawd
itself.

## Scope

- `docker/Dockerfile` — release-grade ubuntu:24.04 image with
  pre-installed autonomous-agent tools (git, jq, sqlite3, python3, gh)
  and both `netclaw` + `netclawd` binaries on PATH. ENTRYPOINT is the
  daemon; operators supply config via `NETCLAW_*` env vars and optional
  volume mounts on `/root/.netclaw`.
- `scripts/docker/build-image.sh` — single build entrypoint reused by
  contributors, PR validation, and the release publish job. Supports
  `IMAGE_REPO` override, `NO_BUILD=1` escape hatch, and fails loudly if
  the published binaries are missing.
- `.github/workflows/pr_validation.yml` — new `validate-docker-build`
  job that runs the shared build script on every PR and verifies the
  image reaches `/api/health/ready=healthy` within 60s using an ollama
  stub provider (no secrets required).
- `.github/workflows/publish_release_binaries.yml` — new `publish-docker`
  job pushes `ghcr.io/aaronontheweb/netclawd:{latest,v${version},v${major}.${minor}}`
  on every release tag, using the same build script.
- `evals/run-evals.sh` — rewrite of bootstrap, daemon lifecycle, and log
  capture. The script spawns `docker run -d --rm --network host` with a
  throwaway `$EVAL_HOME` identity copy and forwards provider/model
  config via env vars. Required eval-target credentials (provider type,
  endpoint, model id) are always explicit: env vars in non-interactive
  contexts, stdin prompts on terminals, and a hard-fail otherwise.
  Assertion helpers and the 22 case bodies are untouched.
- `src/Netclaw.Configuration/NetclawPaths.cs` — honour `NETCLAW_HOME`
  env var as a fallback source for `BasePath` (precedence: explicit arg
  → env var → default). Enables CLI-side path isolation during eval
  runs so `netclaw -p` can't leak state into the host's real
  `~/.netclaw/`.
- `evals/README.md` — document the new env-var surface and remove the
  "local instance only — no isolation" limitation.
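The credential rule described for `evals/run-evals.sh` (env var first, prompt only when stdin is a terminal, hard-fail otherwise) could look roughly like the sketch below. `require_setting` is a hypothetical name, not the function the script actually uses, and the prompt wording is invented for illustration.

```shell
# Hypothetical helper: resolve one required setting by name.
#   1. use the environment variable if it is non-empty
#   2. otherwise prompt, but only when stdin is a terminal
#   3. otherwise fail with an error on stderr
require_setting() {
  name="$1"
  eval "value=\${$name:-}"      # indirect lookup; name must be a trusted identifier
  if [ -n "$value" ]; then
    printf '%s\n' "$value"
    return 0
  fi
  if [ -t 0 ]; then
    printf '%s: ' "$name" >&2   # prompt on stderr so stdout carries only the value
    IFS= read -r value
    [ -n "$value" ] && { printf '%s\n' "$value"; return 0; }
  fi
  echo "error: $name is required" >&2
  return 1
}

# Assumed usage:
# provider_type="$(require_setting NETCLAW_EVAL_PROVIDER_TYPE)" || exit 1
```

Writing the prompt to stderr keeps command substitution safe: the caller captures only the resolved value.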

## Known gap / deferred

The daemon does NOT fail fast on empty config today — missing provider
config falls back to a default `local-ollama` provider, and missing
identity files are silently tolerated by `FileSystemPromptProvider`.
The design doc and the daemon-container spec deliberately reflect
reality rather than aspiration; making the daemon fail loudly on empty
config is a separate follow-up.

Also deferred (follow-up issues): Docker Hub publishing alongside GHCR
(#602), compaction eval cases using `NETCLAW_EVAL_CONTEXT_WINDOW`, CI
eval execution against a remote LLM, committed identity fixture under
`evals/fixtures/` for headless CI, drain-wait between memory-formation
and recall eval phases (#437, already open).

## OpenSpec change

Drives through `openspec/changes/containerize-daemon-and-evals/`
(proposal, design, specs/daemon-container, specs/netclaw-cli delta,
tasks). New capability `daemon-container`; modified capability
`netclaw-cli` adds one requirement for `NETCLAW_HOME` env var
precedence. Sync + archive will happen in a follow-up commit after
merge, matching the pattern established by 325f856.
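The `NETCLAW_HOME` precedence (explicit arg → env var → default) is implemented in C# in `NetclawPaths`. The shell sketch below only illustrates the rule. `resolve_netclaw_home` is a hypothetical function, and treating a whitespace-only `NETCLAW_HOME` as unset is an assumption inferred from the "empty/whitespace handling" tests, not confirmed behaviour.

```shell
# Hypothetical illustration of the precedence rule:
#   explicit argument -> NETCLAW_HOME env var -> default ~/.netclaw
resolve_netclaw_home() {
  explicit="${1:-}"
  # Assumption: a blank or whitespace-only NETCLAW_HOME counts as unset.
  env_home="$(printf '%s' "${NETCLAW_HOME:-}" | tr -d '[:space:]')"
  if [ -n "$explicit" ]; then
    printf '%s\n' "$explicit"
  elif [ -n "$env_home" ]; then
    printf '%s\n' "$NETCLAW_HOME"
  else
    printf '%s\n' "$HOME/.netclaw"
  fi
}

# Assumed usage:
# NETCLAW_HOME=/tmp/eval-home resolve_netclaw_home            # env var wins over default
# NETCLAW_HOME=/tmp/eval-home resolve_netclaw_home /explicit  # explicit arg wins
```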

## Verification

- `dotnet test src/Netclaw.Configuration.Tests` → 152 pass (7 new
  NetclawPaths env var precedence tests)
- `dotnet test src/Netclaw.Cli.Tests` → 413 pass
- `dotnet test src/Netclaw.Daemon.Tests` → 428 pass
- `dotnet slopwatch analyze --hook` → no new violations vs baseline
- `scripts/docker/build-image.sh dev` → produces
  `ghcr.io/aaronontheweb/netclawd:dev` successfully
- Manual bootstrap smoke test: the eval script's
  `check_prerequisites` + `start_eval_daemon` + `cleanup_eval_env`
  path runs cleanly against a fixture identity directory, with the
  daemon reaching healthy state and `force_rmrf` handling the
  root-owned files the container writes into the bind-mounted logs
  directory.

Full-suite verification against a real LLM endpoint is deferred to PR
review; the bootstrap smoke test validates everything that doesn't
require LLM round-trips.

## PR follow-up actions

- First GHCR push will create a private package
  (`ghcr.io/aaronontheweb/netclawd`). Flip visibility to public under
  GitHub → Packages → Settings before external users can pull
  without `docker login`.
- After merge, commit `chore(openspec): sync and archive
  containerize-daemon-and-evals` to move the delta specs into the
  canonical specs tree and archive the change directory.

Closes #569.

Image construction is orthogonal to .NET test execution and
slopwatch, so `validate-docker-build` shouldn't live in
`pr_validation.yml`. Extract it into a standalone
`.github/workflows/validate_docker_image.yml` that triggers only on
changes to files that actually affect the image (Dockerfile, build
script, src/**, global.json, Directory.Build.props, the workflow
itself). Saves runner minutes on unrelated PRs and keeps each workflow
focused on one failure mode.

Review pass via /simplify surfaced a handful of issues across the
daemon containerization PR:

- `evals/run-evals.sh` invoked `docker run` twice on the failure path,
  which would conflict on the `--name` and swallow the real error. Now
  captures stderr from a single invocation and prints it alongside the
  error message.
- `cleanup_eval_env` called `docker rm` after `docker stop` on a
  container launched with `--rm`, where the stop already removes it.
  Dropped the redundant step.
- `TMPDIR_EVAL` only holds host-owned stdout captures, so `force_rmrf`'s
  alpine-root fallback is dead code for that path. Replaced with plain
  `rm -rf` and narrowed the comment on `force_rmrf` to clarify it only
  matters for `EVAL_HOME` where the container writes as root.
- `docker/Dockerfile` installed five speculative apt packages
  (`wget`, `jq`, `sqlite3`, `python3-pip`, `python3-venv`) that no eval
  case or daemon code path exercises. `python3-pip` alone pulls ~50 MB
  of transitive deps. Trimmed to the minimum the current suite needs:
  `ca-certificates`, `curl`, `procps`, `git`, `python3`, `gh`.
  Operators who need more can `apt-get install` at runtime.
- `.github/workflows/validate_docker_image.yml` `paths:` filter was
  missing `Directory.Packages.props`, meaning a CPM version bump that
  breaks the build would not trigger the job. Added it.
- `.github/workflows/publish_release_binaries.yml` `publish-docker`
  re-ran `dotnet publish` for CLI and Daemon on top of the exact same
  publishes done by `publish-binaries`. Added an
  `upload-artifact`/`download-artifact` hop so `publish-docker` reuses
  the output and runs the shared build script with `NO_BUILD=1`.

Also verified empirically that `docker logs "$name" >&2 2>&1` correctly
merges both container streams to the script's stderr (reviewer flagged
this as a potential swap; confirmed it's correct as written).
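The redirection order can be checked without Docker. In the sketch below, `emit_both` is a hypothetical stand-in for `docker logs` that writes one line to each stream. Redirections apply left to right, so `>&2` first points stdout at stderr, and `2>&1` then points stderr at stdout, which already is stderr.

```shell
# Stand-in for `docker logs`: one line on stdout, one on stderr.
emit_both() {
  echo "to-stdout"
  echo "to-stderr" >&2
}

# Same redirection as in the PR: both streams end up on the caller's stderr.
merged_to_stderr() {
  emit_both >&2 2>&1
}

# Capture each side of the caller separately to see where the lines went.
stdout_only="$(merged_to_stderr 2>/dev/null)"      # expected to be empty
stderr_only="$(merged_to_stderr 2>&1 >/dev/null)"  # expected to hold both lines

printf 'stdout captured: [%s]\n' "$stdout_only"
printf 'stderr captured: [%s]\n' "$stderr_only"
```

The reverse order, `2>&1 >&2`, leaves the same result here only because both targets are terminals or the same file in typical use; with distinct targets it would swap the streams rather than merge them.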

Verified after the fixes:
- `bash -n evals/run-evals.sh` → syntax OK
- `dotnet slopwatch analyze --hook` → no new violations
- `dotnet test src/Netclaw.Configuration.Tests` → 152 pass
- Trimmed image still builds via `scripts/docker/build-image.sh dev`
- Trimmed image reaches `/api/health/ready=healthy` in 4s
@Aaronontheweb added the enhancement, github_actions, .NET, memory, and docker labels on Apr 11, 2026
The previous simplification pass trimmed the Dockerfile's apt packages
based on what the current eval suite exercises, but the image's primary
purpose is general-purpose netclawd deployment, not evals. An
autonomous agent should have the common shell toolkit pre-installed so
it can reach for things without needing to `apt-get install` first on
every new action.

Restore: wget, jq, sqlite3, python3-pip, python3-venv.

The `apt-get install` escape hatch the Ubuntu base provides is for
tools the agent discovers it needs on demand, not for things it should
always have.

Verified: image still builds, starts healthy in 4s, and all 8
pre-installed tools resolve on PATH inside the container.
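A check that tools resolve on PATH can be as small as the sketch below. `missing_tools` is a hypothetical helper. The commit message does not enumerate the eight tools, so the list in the usage comment is only an assumption drawn from packages named elsewhere in this PR.

```shell
# Hypothetical helper: print every named tool that does NOT resolve on PATH.
# No output means all of them were found.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Assumed usage inside the container (tool list is a guess, see above):
# missing="$(missing_tools git jq sqlite3 python3 gh wget curl pip3)"
# [ -z "$missing" ] || { echo "missing tools: $missing" >&2; exit 1; }
```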
@Aaronontheweb enabled auto-merge (squash) on April 11, 2026 21:31
@Aaronontheweb merged commit d577bf6 into dev on Apr 11, 2026
4 checks passed
@Aaronontheweb deleted the feat/containerize-daemon-and-evals branch on April 11, 2026 21:35

Development

Successfully merging this pull request may close these issues.

feat(evals): isolated eval environment with ephemeral daemon

1 participant