feat(evals): behavioral eval suite for session pipeline regressions#347

Merged
Aaronontheweb merged 6 commits into dev from feat/eval-suite
Mar 21, 2026
@Aaronontheweb

Collaborator

Summary

  • Adds evals/run-evals.sh, a bash eval suite that runs prompts against a live Netclaw daemon and verifies that the session pipeline worked correctly by checking both stdout and structured daemon log patterns
  • 22 eval cases across 7 categories: identity, skill auto-loading, memory pipeline, tool discovery, grounding/alignment, autonomy, and complex multi-step tasks
  • Each case runs N times (default 5) with random prompt variant selection and configurable pass threshold (default 80%) to account for LLM non-determinism
  • Results persisted to SQLite at ~/.netclaw/evals/results.db for trend analysis across versions
  • Updates AGENTS.md with eval trigger rules and adds eval pass to Definition of Done

Closes #319
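
The scoring rule from the Summary (N runs per case, a random prompt variant per run, a pass threshold) can be sketched roughly as follows. This is an illustration in Python, not the code in evals/run-evals.sh. The names `score_case`, `run_once`, and `suite_exit_code` are hypothetical, and treating the threshold as inclusive (4 of 5 passes at 80%) is an assumption. The exit-code rule mirrors the one listed in the test plan.

```python
import random
from typing import Callable, Optional, Sequence, Tuple

DEFAULT_RUNS = 5          # default N from the PR description
DEFAULT_THRESHOLD = 0.80  # default pass threshold from the PR description


def score_case(
    variants: Sequence[str],
    run_once: Callable[[str], bool],
    runs: int = DEFAULT_RUNS,
    threshold: float = DEFAULT_THRESHOLD,
    rng: Optional[random.Random] = None,
) -> Tuple[int, bool]:
    """Run one eval case `runs` times and return (passes, case_passed).

    `run_once` is a hypothetical callback that sends one prompt to the
    daemon and reports whether all of that run's checks succeeded.
    """
    rng = rng or random.Random()
    passes = 0
    for _ in range(runs):
        prompt = rng.choice(variants)  # pick a random prompt variant per run
        if run_once(prompt):
            passes += 1
    # Assumption: the threshold is inclusive, so 4/5 passes at 80%.
    return passes, (passes / runs) >= threshold


def suite_exit_code(case_results: Sequence[bool]) -> int:
    """Return 0 when every case passed, 1 when any case failed."""
    return 0 if all(case_results) else 1
```

Scoring a case as a pass rate rather than a single run is what lets one flaky LLM response out of five still count as a pass.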

Test plan

  • Start daemon with netclaw daemon start
  • Run ./evals/run-evals.sh and verify output format matches expected scoring display
  • Run with NETCLAW_EVAL_RUNS=1 for quick smoke test
  • Verify SQLite DB created at ~/.netclaw/evals/results.db with correct schema
  • Verify exit code is 1 when cases fail, 0 when all pass
  • Run trend queries from README against results DB
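
For the schema and trend-query steps above, the following is a minimal sketch of what a version-over-version query could look like. The PR does not show the real schema, so the `eval_results` table and all of its columns are hypothetical, and the sketch uses an in-memory database rather than ~/.netclaw/evals/results.db. The actual trend queries live in the README.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical schema: the real results.db layout is not shown in this PR.
SCHEMA = """
CREATE TABLE IF NOT EXISTS eval_results (
    version   TEXT    NOT NULL,
    category  TEXT    NOT NULL,
    case_name TEXT    NOT NULL,
    passes    INTEGER NOT NULL,
    runs      INTEGER NOT NULL
);
"""


def record(conn: sqlite3.Connection, version: str, category: str,
           case_name: str, passes: int, runs: int) -> None:
    """Store the outcome of one eval case for one version."""
    conn.execute(
        "INSERT INTO eval_results VALUES (?, ?, ?, ?, ?)",
        (version, category, case_name, passes, runs),
    )


def pass_rate_by_version(conn: sqlite3.Connection) -> List[Tuple[str, float]]:
    """Aggregate pass rate per version, ordered by version string."""
    return conn.execute(
        "SELECT version, SUM(passes) * 1.0 / SUM(runs) "
        "FROM eval_results GROUP BY version ORDER BY version"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for the on-disk results DB
    conn.executescript(SCHEMA)
    record(conn, "v1", "identity", "who-are-you", 5, 5)
    record(conn, "v1", "memory", "recall-fact", 3, 5)
    record(conn, "v2", "identity", "who-are-you", 5, 5)
    for version, rate in pass_rate_by_version(conn):
        print(version, round(rate, 2))
```

Storing raw pass and run counts rather than a precomputed percentage keeps the aggregation correct when cases use different run counts.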

…324)

Adds the same grounding rules from the production AGENTS.md to the init
wizard template so all new Netclaw installs get them by default:

- Don't state runtime facts without tool verification
- Don't claim actions not in tool call history
- Don't claim tools don't exist without search_tools
- Don't silently substitute a different answer when the primary task fails
- "I don't know" beats a confident wrong answer
…n detection (#319)

Shell script that runs prompts against a live Netclaw daemon and verifies
identity, skill auto-loading, memory pipeline, tool use, grounding, and
autonomy by checking both stdout output and structured daemon log patterns.

- 22 eval cases across 7 categories with random prompt variant selection
- N runs per case (default 5) with configurable pass threshold (default 80%)
- SQLite results database at ~/.netclaw/evals/results.db for trend analysis
- Daemon log tailing to isolate structured metrics per prompt
- AGENTS.md updated with eval trigger rules and Definition of Done bullet
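
One way to isolate the daemon log lines that belong to a single prompt, as the log-tailing bullet describes, is to note the log size before sending the prompt and read only what was appended afterwards. The sketch below is in Python and assumes that approach; the script's actual mechanism is not shown here. The function names and the log patterns in the usage note are hypothetical.

```python
import os
import re
from typing import List, Sequence


def log_offset(path: str) -> int:
    """Return the current log size in bytes (0 if the log does not exist yet)."""
    return os.path.getsize(path) if os.path.exists(path) else 0


def lines_since(path: str, offset: int) -> List[str]:
    """Return the log lines appended after byte `offset`."""
    # Binary mode so that seeking to an arbitrary byte offset is well defined.
    with open(path, "rb") as fh:
        fh.seek(offset)
        return fh.read().decode("utf-8", errors="replace").splitlines()


def matches_all(lines: Sequence[str], patterns: Sequence[str]) -> bool:
    """True when every regex pattern matches at least one of the lines."""
    return all(any(re.search(p, line) for line in lines) for p in patterns)
```

A run would call `log_offset` before sending the prompt, then pass the result of `lines_since` to `matches_all` with that case's expected patterns, for example a hypothetical `skill_loaded name=\w+`.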
@Aaronontheweb Aaronontheweb enabled auto-merge (squash) March 21, 2026 14:25
@Aaronontheweb Aaronontheweb merged commit fbd6c9e into dev Mar 21, 2026
3 checks passed
@Aaronontheweb Aaronontheweb deleted the feat/eval-suite branch March 21, 2026 14:32
Development

Successfully merging this pull request may close these issues.

feature: formal eval suite for identity, skills, memory, and compaction quality