feat(evals): behavioral eval suite for session pipeline regressions #347
Merged
…324)
Adds the same grounding rules from the production AGENTS.md to the init wizard template so all new Netclaw installs get them by default:
- Don't state runtime facts without tool verification
- Don't claim actions not in tool call history
- Don't claim tools don't exist without search_tools
- Don't silently substitute a different answer when the primary task fails
- "I don't know" beats a confident wrong answer
…n detection (#319)
Shell script that runs prompts against a live Netclaw daemon and verifies identity, skill auto-loading, memory pipeline, tool use, grounding, and autonomy by checking both stdout output and structured daemon log patterns.
- 22 eval cases across 7 categories with random prompt variant selection
- N runs per case (default 5) with configurable pass threshold (default 80%)
- SQLite results database at ~/.netclaw/evals/results.db for trend analysis
- Daemon log tailing to isolate structured metrics per prompt
- AGENTS.md updated with eval trigger rules and Definition of Done bullet
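The "N runs per case with a pass threshold" rule can be illustrated with a small sketch. This is not the code in evals/run-evals.sh; the function name `case_verdict` and its positional arguments are hypothetical, and only the defaults (5 runs, 80%) come from the commit message.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decide whether one eval case passes.
# Arguments: number of passing runs, total runs (default 5), threshold percent (default 80).
case_verdict() {
  local passes=$1 runs=${2:-5} threshold=${3:-80}
  # Integer arithmetic only: compare passes*100 against threshold*runs
  # so that no floating-point division is needed.
  if (( runs > 0 && passes * 100 >= threshold * runs )); then
    echo PASS
  else
    echo FAIL
  fi
}

case_verdict 4 5 80   # 4 of 5 runs is 80%, which meets the threshold
case_verdict 3 5 80   # 3 of 5 runs is 60%, which falls short
```

Scoring over several runs instead of one is what lets a case tolerate an occasional nondeterministic miss while still flagging a real regression.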
Summary
- evals/run-evals.sh: a bash eval suite that runs prompts against a live Netclaw daemon and verifies the session pipeline worked correctly by checking both stdout output and structured daemon log patterns
- SQLite results database at ~/.netclaw/evals/results.db for trend analysis across versions

Closes #319
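One way to check daemon log patterns per prompt is to record the log size before sending the prompt and then inspect only the bytes appended afterwards. The sketch below assumes that approach; the PR does not say how its log tailing is implemented, and the helper names `log_offset` and `log_since` as well as the sample log line are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: isolate the log lines written while one prompt runs.

log_offset() {
  # Current size of the log in bytes (0 if the file does not exist yet).
  if [ -f "$1" ]; then wc -c < "$1" | tr -d ' '; else echo 0; fi
}

log_since() {
  # Print everything appended to the log after the given byte offset.
  # tail -c +N starts output at byte N (1-indexed), hence offset + 1.
  local file=$1 offset=$2
  tail -c "+$((offset + 1))" "$file"
}

# Demo with a temporary file standing in for the daemon log.
log=$(mktemp)
echo 'old entry from an earlier prompt' >> "$log"
start=$(log_offset "$log")
echo 'skill_loaded name=example' >> "$log"   # made-up log line
if log_since "$log" "$start" | grep -q 'skill_loaded'; then
  echo "pattern found in this prompt's log slice"
fi
rm -f "$log"
```

Slicing by offset keeps entries from earlier prompts out of the match, so a pattern left over from a previous case cannot produce a false pass.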
Test plan
- Start the daemon with netclaw daemon start
- Run ./evals/run-evals.sh and verify output format matches expected scoring display
- Run with NETCLAW_EVAL_RUNS=1 for quick smoke test
- Verify results are written to ~/.netclaw/evals/results.db with correct schema