feat(evals): behavioral eval suite for session pipeline regressions by Aaronontheweb · Pull Request #347 · netclaw-dev/netclaw

Aaronontheweb · 2026-03-21T03:42:24Z

Summary

Adds evals/run-evals.sh, a bash eval suite that runs prompts against a live Netclaw daemon and verifies that the session pipeline worked correctly by checking both stdout and structured daemon log patterns
22 eval cases across 7 categories: identity, skill auto-loading, memory pipeline, tool discovery, grounding/alignment, autonomy, and complex multi-step tasks
Each case runs N times (default 5) with random prompt variant selection and a configurable pass threshold (default 80%) to account for LLM non-determinism
Results are persisted to SQLite at ~/.netclaw/evals/results.db for trend analysis across versions
Updates AGENTS.md with eval trigger rules and adds an eval pass to the Definition of Done
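
The repeated-run scoring described above can be sketched roughly as follows. This is an illustration only, not the contents of evals/run-evals.sh: the helper names `pick_variant`, `meets_threshold`, and `score_case` are hypothetical, and the real script may score cases differently.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, NOT taken from evals/run-evals.sh.

# Print one of the given prompt variants, chosen at random.
pick_variant() {
  local variants=("$@")
  printf '%s\n' "${variants[RANDOM % ${#variants[@]}]}"
}

# Succeed when passes/runs reaches the threshold percentage.
# Integer math avoids floating point: passes*100 >= threshold*runs.
meets_threshold() {
  local passes=$1 runs=$2 threshold_pct=$3
  (( runs > 0 && passes * 100 >= threshold_pct * runs ))
}

# Run a check command N times, print "passes/runs", and succeed
# only if the pass rate meets the threshold.
# $1 = runs, $2 = threshold percent, remaining args = check command.
score_case() {
  local runs=$1 threshold_pct=$2 passes=0 i
  shift 2
  for (( i = 0; i < runs; i++ )); do
    if "$@" >/dev/null 2>&1; then
      passes=$(( passes + 1 ))
    fi
  done
  printf '%d/%d\n' "$passes" "$runs"
  meets_threshold "$passes" "$runs" "$threshold_pct"
}
```

With the stated defaults (5 runs, 80% threshold), a case would need at least 4 passing runs under this scheme: `meets_threshold 4 5 80` succeeds, while `meets_threshold 3 5 80` fails.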

Closes #319

Test plan

Start daemon with netclaw daemon start
Run ./evals/run-evals.sh and verify output format matches expected scoring display
Run with NETCLAW_EVAL_RUNS=1 for a quick smoke test
Verify SQLite DB created at ~/.netclaw/evals/results.db with correct schema
Verify exit code is 1 when cases fail, 0 when all pass
Run trend queries from README against results DB
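
The smoke-test and exit-code steps above could be scripted along these lines. This is a minimal sketch under assumptions: `smoke_test` and `results_db_exists` are hypothetical helpers, and only the script path, the `NETCLAW_EVAL_RUNS` variable, the database path, and the 0/1 exit-code convention come from this PR.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical wrappers around the test-plan steps.

# Run the eval suite once per case and report what the exit code means.
# With no arguments it runs ./evals/run-evals.sh; a different command
# can be passed in, which is convenient for trying the wrapper out.
smoke_test() {
  if [ "$#" -eq 0 ]; then
    set -- ./evals/run-evals.sh
  fi
  local status=0
  NETCLAW_EVAL_RUNS=1 "$@" || status=$?
  case "$status" in
    0) echo "all cases passed" ;;
    1) echo "one or more cases failed" ;;
    *) echo "unexpected exit code: $status" ;;
  esac
  return "$status"
}

# Succeed if the results database exists and is non-empty.
# Defaults to the path named in this PR.
results_db_exists() {
  local db=${1:-"$HOME/.netclaw/evals/results.db"}
  [ -s "$db" ]
}
```

Running `smoke_test && results_db_exists` after starting the daemon would cover the exit-code and database-creation checks in one step; verifying the schema and running the README trend queries would still be done by hand.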

…324) Adds the same grounding rules from the production AGENTS.md to the init wizard template so all new Netclaw installs get them by default:
- Don't state runtime facts without tool verification
- Don't claim actions not in tool call history
- Don't claim tools don't exist without search_tools
- Don't silently substitute a different answer when the primary task fails
- "I don't know" beats a confident wrong answer

…n detection (#319) Shell script that runs prompts against a live Netclaw daemon and verifies identity, skill auto-loading, memory pipeline, tool use, grounding, and autonomy by checking both stdout output and structured daemon log patterns.
- 22 eval cases across 7 categories with random prompt variant selection
- N runs per case (default 5) with configurable pass threshold (default 80%)
- SQLite results database at ~/.netclaw/evals/results.db for trend analysis
- Daemon log tailing to isolate structured metrics per prompt
- AGENTS.md updated with eval trigger rules and Definition of Done bullet
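
One common way to isolate log output per prompt, as the log-tailing bullet describes, is to record the log's size before sending the prompt and read only the bytes appended afterwards. The sketch below assumes that approach; the helper names are hypothetical and the PR does not say how run-evals.sh actually tails the daemon log.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers; the real script may tail differently.

# Print the current size of the log in bytes (0 if it does not exist yet).
log_offset() {
  local log=$1
  if [ -f "$log" ]; then
    wc -c < "$log" | tr -d ' '
  else
    echo 0
  fi
}

# Print only the bytes appended to the log after the given offset.
# tail -c +N starts at byte N (1-indexed), hence offset + 1.
log_since() {
  local log=$1 offset=$2
  tail -c "+$(( offset + 1 ))" "$log"
}

# Succeed if the newly appended log text matches an extended regex.
log_matches_since() {
  local log=$1 offset=$2 pattern=$3
  log_since "$log" "$offset" | grep -Eq -- "$pattern"
}
```

A case would call `log_offset` before sending its prompt, then `log_matches_since` with that offset and the expected pattern afterwards, so entries from earlier prompts cannot satisfy the check.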

Aaronontheweb added 5 commits March 21, 2026 01:38

Merge branch 'dev' into feat/eval-suite

c0a2f91

Merge branch 'dev' into feat/eval-suite

39a1e2d

Merge branch 'dev' into feat/eval-suite

a4e9bf2

Aaronontheweb enabled auto-merge (squash) March 21, 2026 14:25

Merge branch 'dev' into feat/eval-suite

93919e5

Aaronontheweb mentioned this pull request Mar 21, 2026

feat(security): command-level approval gates within tool grants #352

Closed

Aaronontheweb merged commit fbd6c9e into dev Mar 21, 2026
3 checks passed

Aaronontheweb deleted the feat/eval-suite branch March 21, 2026 14:32

Aaronontheweb mentioned this pull request Mar 21, 2026

design: semantic skill discovery via description embedding instead of keyword extraction #355

Closed

Aaronontheweb merged 6 commits into dev from feat/eval-suite
