feat(skills): add evalview-agent-testing skill and MCP server #828
Add EvalView as a regression testing skill for AI agents. EvalView snapshots agent behavior (tool calls, parameters, output), then diffs against baselines after every change, catching regressions before they ship.
Skill covers:
- CLI workflow (init → snapshot → check → monitor)
- Python API (gate() / gate_async() for autonomous loops)
- Quick mode (no LLM judge, $0, sub-second)
- CI/CD integration (GitHub Actions with PR comments)
- MCP integration (8 tools for Claude Code)
- Multi-turn test cases
- OpenClaw integration for autonomous agents
Also adds evalview MCP server to mcp-servers.json.
📝 Walkthrough
Adds a new EvalView MCP server entry and a comprehensive EvalView skill document describing end-to-end regression testing workflows, CLI and Python usage, MCP registration, CI integration, and YAML test formats.
Changes
Sequence Diagram(s)
sequenceDiagram
participant Dev as Dev/CI
participant EvalView as EvalView Service
participant MCP as MCP Server
participant Agent as AI Agent
participant Ops as Ops (Webhook)
Dev->>EvalView: trigger test run (CI/manual)
EvalView->>MCP: start/register tools (`mcp serve`)
EvalView->>Agent: execute test cases (single/multi-turn)
Agent-->>EvalView: outputs and tool calls
EvalView->>EvalView: compare outputs to baseline snapshot
EvalView-->>Dev: report result (PASSED/TOOLS_CHANGED/OUTPUT_CHANGED/REGRESSION)
EvalView-->>Ops: send notification (optional webhook)
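The four result labels in the diagram suggest a comparison step roughly like the one below. This is a minimal sketch, not EvalView's implementation: the `Run` shape, the `score` field, the `max_drop` threshold, and the order in which the checks are applied are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    """Local stand-in for the four result labels named in the diagram."""
    PASSED = "PASSED"
    TOOLS_CHANGED = "TOOLS_CHANGED"
    OUTPUT_CHANGED = "OUTPUT_CHANGED"
    REGRESSION = "REGRESSION"


@dataclass
class Run:
    """One recorded agent run (hypothetical shape)."""
    tool_calls: list = field(default_factory=list)  # ordered tool names
    output: str = ""
    score: float = 1.0  # assumed quality score in [0, 1]


def classify(baseline: Run, current: Run, max_drop: float = 0.1) -> Status:
    """Compare a current run against its baseline snapshot."""
    # Assumption: a large score drop outranks every other difference.
    if baseline.score - current.score > max_drop:
        return Status.REGRESSION
    if baseline.tool_calls != current.tool_calls:
        return Status.TOOLS_CHANGED
    if baseline.output != current.output:
        return Status.OUTPUT_CHANGED
    return Status.PASSED
```

Checking the score first means a run that changed tools and also lost quality is reported as the more severe of the two; a real tool may order these differently.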
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Greptile Summary
This PR introduces
A prior round of review raised a number of concerns; this iteration addresses the majority of them:
A few items from the previous review cycle remain unresolved in this diff: the
Confidence Score: 5/5
Safe to merge: all P0/P1 concerns from the previous review round have been resolved; remaining open items are P2 style issues already flagged in earlier threads.
All findings from this review pass are P2 or lower. The major risks (mutable GH Action, TypeError on score_delta, unpinned pip dep, missing auto-revert warning, missing API key guidance) were addressed in this iteration. The skill's documentation is clear, the MCP config entry is robust, and no new critical issues were found.
skills/evalview-agent-testing/SKILL.md: DiffStatus unused import, absent manifest entry, and MCP CLI command inconsistency remain from the prior review round but do not block merge.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
subgraph Setup
H[pip install evalview] --> I[evalview init]
I --> J[evalview snapshot - golden baseline committed to git]
end
J --> B
A[Agent Code Change] --> B[evalview check]
B --> C{DiffStatus}
C -->|PASSED| D[Ship with confidence]
C -->|TOOLS_CHANGED| E[Review tool diff]
C -->|OUTPUT_CHANGED| F[Review output diff]
C -->|REGRESSION| G[Fix before shipping]
G --> A
subgraph AutonomousLoop
K[make_code_change] --> L[gate_or_revert]
L -->|passed=True| M[Continue loop]
L -->|passed=False| N[git checkout - auto-reverted]
N --> O[try_alternative_approach]
O --> K
end
subgraph CIPipeline
P[PR opened] --> Q[checkout repo]
Q --> R[pip install evalview]
R --> S[evalview check --fail-on REGRESSION]
S -->|exit 0| T[PR passes]
S -->|exit 1| U[PR blocked]
end
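The CI branch of the flowchart turns a set of per-test statuses into a process exit code. A rough sketch of that mapping follows. The function name and the `STRICT` set are hypothetical, and treating strict mode as "fail on any non-PASSED status" is an assumption about what the stricter option means, not documented behavior.

```python
# Statuses that block a PR when only regressions gate the build.
FAIL_ON_REGRESSION = frozenset({"REGRESSION"})

# Assumption: a strict mode blocks on every non-PASSED status.
STRICT = frozenset({"REGRESSION", "TOOLS_CHANGED", "OUTPUT_CHANGED"})


def exit_code(statuses, fail_on=FAIL_ON_REGRESSION):
    """Return 1 (PR blocked) if any status is in fail_on, else 0 (PR passes)."""
    return 1 if any(status in fail_on for status in statuses) else 0
```

With the default set, a run containing only `TOOLS_CHANGED` still exits 0, which is the tradeoff the later commit message asks the docs to spell out.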
Use `gate()` as a programmatic regression gate inside agent frameworks, autonomous coding loops, or CI scripts:

```python
from evalview import gate, DiffStatus
```
DiffStatus imported but never used in example
DiffStatus is imported on this line but not referenced anywhere in the surrounding code block. This will cause a linter warning if users copy the snippet directly, and the purpose of the import is unclear without a usage example.
Either remove the unused import or add a concrete example showing when to use DiffStatus (e.g. comparing against DiffStatus.REGRESSION):
- from evalview import gate, DiffStatus
+ from evalview import gate
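If the import is kept instead, the snippet needs a line that actually compares against it. The sketch below shows the idea with local stand-ins only: `DiffStatus` here is a small enum defined in the block, and `GateResult` with its `status` field is a hypothetical result shape, since the return type of `gate()` is not shown in this thread.

```python
from dataclasses import dataclass
from enum import Enum


class DiffStatus(Enum):
    """Local stand-in; not imported from evalview."""
    PASSED = "passed"
    REGRESSION = "regression"


@dataclass
class GateResult:
    """Hypothetical result object with a single status field."""
    status: DiffStatus


def should_block(result: GateResult) -> bool:
    # The enum import is justified once a status is compared against it.
    return result.status is DiffStatus.REGRESSION
```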
---
name: evalview-agent-testing
description: Regression testing for AI agents using EvalView. Snapshot agent behavior, detect regressions in tool calls and output quality, and block broken agents before production.
origin: ECC
tools: Bash, Read, Write
---
Skill not registered in `manifests/install-modules.json`
Per the Skill Placement Policy, all curated skills in skills/ must be listed in manifests/install-modules.json:
"Included in `manifests/install-modules.json` paths."
This skill is not added to any module in that file. Without a manifest entry, users who install ECC via the official installer will not receive this skill. It would fit naturally as an entry in the workflow-quality module alongside the related skills/ai-regression-testing and skills/eval-harness paths.
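A check like this is easy to automate. The sketch below assumes a manifest shaped as `{"modules": [{"id": ..., "paths": [...]}]}`; that layout, the `id` and `paths` keys, and the function name are guesses for illustration, since the actual structure of `manifests/install-modules.json` is not shown here.

```python
def modules_listing(manifest: dict, skill_path: str) -> list:
    """Return the ids of all modules whose paths include skill_path.

    Assumes the hypothetical shape {"modules": [{"id": str, "paths": [str]}]}.
    """
    return [
        module.get("id")
        for module in manifest.get("modules", [])
        if skill_path in module.get("paths", [])
    ]


# Illustrative data only; not the real manifest contents.
example_manifest = {
    "modules": [
        {
            "id": "workflow-quality",
            "paths": ["skills/ai-regression-testing", "skills/eval-harness"],
        }
    ]
}
```

An empty list for `skills/evalview-agent-testing` would correspond to the situation the reviewer describes.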
🧹 Nitpick comments (2)
skills/evalview-agent-testing/SKILL.md (2)
143-152: Consider adding .gitignore recommendation.
Line 149 correctly advises against committing `state.json`, but users would benefit from a concrete .gitignore pattern to prevent accidental commits.
📝 Optional: Add .gitignore guidance
  - **Commit `.evalview/golden/` to git.** Baselines should be versioned. Don't commit `state.json`.
+ - **Add `.evalview/state.json` to .gitignore** to prevent accidental commits of transient state.
  - **Use variants for non-deterministic agents.** `evalview snapshot --variant v2` stores alternate valid behaviors (up to 5).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/evalview-agent-testing/SKILL.md` around lines 143 - 152, Add a concrete .gitignore recommendation to the Best Practices section so users don't accidentally commit runtime state: update the text near the references to ".evalview/golden/" and "state.json" to include a suggested .gitignore entry (e.g., ignore .evalview/state.json and other transient files) and a short note to commit ".evalview/golden/" but exclude "state.json"; reference the exact names "state.json", ".evalview/golden/" and ".gitignore" so reviewers can locate and apply the change in SKILL.md.
92-112: Pin GitHub Action to a specific version for stability.
The GitHub Action reference at line 106 uses `@main`, which points to a moving target. Action updates could introduce breaking changes without warning, causing CI failures.
📌 Recommended: Pin to a specific version
First, verify the action exists and check available versions:
#!/bin/bash
# Description: Verify GitHub Action existence and list available tags
echo "=== Checking if hidai25/eval-view action exists ==="
gh api repos/hidai25/eval-view --jq '.full_name' 2>/dev/null || echo "Action repository not found"
echo -e "\n=== Listing available tags/versions ==="
gh api repos/hidai25/eval-view/tags --jq '.[].name' 2>/dev/null || echo "No tags found or repo not accessible"
echo -e "\n=== Checking latest release ==="
gh api repos/hidai25/eval-view/releases/latest --jq '.tag_name' 2>/dev/null || echo "No releases found"
Once verified, update the documentation to recommend version pinning:
  - name: Check for regressions
-   uses: hidai25/eval-view@main
+   uses: hidai25/eval-view@v1.0.0 # Pin to specific version
    with:
      openai-api-key: ${{ secrets.OPENAI_API_KEY }}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/evalview-agent-testing/SKILL.md` around lines 92 - 112, The workflow example pins the GitHub Action to a moving ref "uses: hidai25/eval-view@main", which is brittle; update the example to recommend and show pinning to a specific release tag or commit SHA (e.g., replace "hidai25/eval-view@main" with "hidai25/eval-view@vX.Y.Z" or a specific commit SHA) and add a short note to verify available tags/releases (use the repo tags or releases to pick the stable version); reference the workflow name "Agent Regression Check" and the uses line "uses: hidai25/eval-view@main" so reviewers can locate and change the example accordingly.
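The pinning concern can also be checked mechanically. The following is a small sketch, assuming that "pinned" means a full 40-character commit SHA and that a line-based regular expression is good enough for simple workflow files; it is not a YAML parser, and the function name is made up for this example.

```python
import re

# Matches "uses: owner/repo@ref", optionally preceded by a list dash.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*(\S+)@(\S+)", re.MULTILINE)
# A full commit SHA: exactly 40 lowercase hex characters.
SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list:
    """List every 'action@ref' whose ref is not a full commit SHA."""
    return [
        f"{action}@{ref}"
        for action, ref in USES_RE.findall(workflow_text)
        if not SHA_RE.match(ref)
    ]
```

Under this definition a release tag such as `v1.0.0` is also reported, because tags can be moved; whether that is too strict is a policy choice.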
📒 Files selected for processing (2)
mcp-configs/mcp-servers.json
skills/evalview-agent-testing/SKILL.md
2 issues found across 2 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="skills/evalview-agent-testing/SKILL.md">
<violation number="1" location="skills/evalview-agent-testing/SKILL.md:106">
P2: CI example uses a third-party action pinned to a mutable branch (`@main`) while requesting `pull-requests: write`, which is a supply-chain risk and leads to non-reproducible runs. Pin to a specific commit SHA or immutable release tag instead.</violation>
<violation number="2" location="skills/evalview-agent-testing/SKILL.md:155">
P2: User-facing docs link to external GitHub repositories without demonstrated org vetting, conflicting with the project’s supply-chain guidance.</violation>
</file>
- Pin hidai25/eval-view action to commit SHA instead of @main
- Replace external GitHub links with PyPI package link (vetted registry)
Addresses cubic-dev-ai review feedback.
1 issue found across 1 file (changes from recent commits).
<file name="skills/evalview-agent-testing/SKILL.md">
<violation number="1" location="skills/evalview-agent-testing/SKILL.md:106">
P2: User-facing skill docs recommend a non-org third-party GitHub Action in CI and pass secrets to it, conflicting with repo supply-chain guidance to avoid unvetted external repositories.</violation>
</file>
Use plain pip install + evalview CLI instead of a third-party GitHub Action. No external actions, no secrets passed to unvetted code. Addresses cubic-dev-ai supply-chain review feedback.
Add prominent warning that gate_or_revert runs git checkout, discarding uncommitted changes. Documents the revert_cmd override for safer alternatives like git stash. Addresses cubic-dev-ai review feedback.
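To make the risk concrete, here is a sketch of a gate-then-revert helper whose revert step is a parameter. It is not the library's `gate_or_revert`: the function name, the `check` callable, and the injectable `run` argument are hypothetical, and defaulting to `git stash push --include-untracked` is one possible safer choice, since stashed work can be restored while changes discarded by `git checkout` cannot.

```python
import subprocess


def gate_or_revert_sketch(
    check,
    revert_cmd=("git", "stash", "push", "--include-untracked"),
    run=subprocess.run,
):
    """Run check(); if it fails, run revert_cmd and return False.

    check:      zero-argument callable returning True when the gate passes.
    revert_cmd: command executed on failure (hypothetical default: stash).
    run:        injectable runner, so the logic can be tested without git.
    """
    if check():
        return True
    # check=True raises CalledProcessError if the revert itself fails.
    run(list(revert_cmd), check=True)
    return False
```

Passing the runner in as an argument keeps the destructive step visible at the call site and lets tests confirm which command would have run.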
- Pin evalview to >=0.5,<1 to prevent breaking CI on major upgrades
- Document --fail-on REGRESSION vs --strict tradeoff so users understand what gates and what passes through
Addresses greptile-apps review feedback.
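The range `>=0.5,<1` accepts any 0.x release from 0.5 onward and rejects 1.0 and later. A minimal sketch of that rule, assuming plain dotted numeric versions (no pre-release or local suffixes); the helper names are illustrative.

```python
def parse_version(version: str) -> tuple:
    """Turn '0.5.2' into (0, 5, 2). Assumes purely numeric dotted parts."""
    return tuple(int(part) for part in version.split("."))


def in_pinned_range(version: str, lower: str = "0.5", upper: str = "1") -> bool:
    """True when lower <= version < upper, compared as integer tuples."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)
```

For real dependency checks, the `packaging` library's `SpecifierSet` handles pre-releases and other edge cases that this tuple comparison ignores.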
1 issue found across 1 file (changes from recent commits).
<file name="skills/evalview-agent-testing/SKILL.md">
<violation number="1" location="skills/evalview-agent-testing/SKILL.md:107">
P2: CI pins EvalView to a version range while the rest of the doc still instructs unpinned installs, which can cause local baselines to be generated with a different version than CI and lead to unexpected diffs or failures. Align the version guidance across sections.</violation>
</file>
Follows the same pattern as insaits entry. Resolves correctly even when evalview is installed in a virtual environment that isn't on the system PATH.
Use python3 -m evalview mcp serve consistently across both the skill docs and the MCP config catalog.
1 issue found across 1 file (changes from recent commits).
<file name="skills/evalview-agent-testing/SKILL.md">
<violation number="1" location="skills/evalview-agent-testing/SKILL.md:84">
P2: MCP launch command hardcodes `python3` while install docs use plain `pip`, which can break setup in multi-Python environments.</violation>
</file>
pip install evalview installs the evalview binary to PATH, so using it directly is consistent with the install docs and avoids python3 version mismatch issues.
1 issue found across 1 file (changes from recent commits).
<file name="skills/evalview-agent-testing/SKILL.md">
<violation number="1" location="skills/evalview-agent-testing/SKILL.md:156">
P2: Installation instructions are inconsistent: this PR pins EvalView in Installation/CI but Core Workflow still uses unpinned `pip install evalview`, which can produce non-reproducible setups.</violation>
</file>
Add OPENAI_API_KEY env placeholder matching other entries. Note that the key is optional — deterministic checks work without it. Pin install version to match skill docs.
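For orientation, an entry of the kind described might look like the structure below, built in Python and printed as JSON. The top-level `mcpServers` key, the placeholder string, and the choice of the `python3 -m evalview mcp serve` form are assumptions; the actual contents of `mcp-configs/mcp-servers.json` are not reproduced in this thread.

```python
import json

# Hypothetical entry; key names follow the common MCP client config layout.
entry = {
    "command": "python3",
    "args": ["-m", "evalview", "mcp", "serve"],
    # Placeholder only. The commit message says the key is optional because
    # deterministic checks work without it.
    "env": {"OPENAI_API_KEY": "YOUR_OPENAI_API_KEY_HERE"},
}

# Assumed catalog shape: server name mapped to its launch settings.
config = {"mcpServers": {"evalview": entry}}

print(json.dumps(config, indent=2))
```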
thanks for the pr, queued for review.
Thanks Affaan! Happy to make any changes if needed.
…-m#828)
* feat(skills): add evalview-agent-testing skill and MCP server
  Add EvalView as a regression testing skill for AI agents. EvalView snapshots agent behavior (tool calls, parameters, output), then diffs against baselines after every change, catching regressions before they ship.
  Skill covers:
  - CLI workflow (init → snapshot → check → monitor)
  - Python API (gate() / gate_async() for autonomous loops)
  - Quick mode (no LLM judge, $0, sub-second)
  - CI/CD integration (GitHub Actions with PR comments)
  - MCP integration (8 tools for Claude Code)
  - Multi-turn test cases
  - OpenClaw integration for autonomous agents
  Also adds evalview MCP server to mcp-servers.json.
* fix(skills): pin action SHA and remove unvetted external links
  - Pin hidai25/eval-view action to commit SHA instead of @main
  - Replace external GitHub links with PyPI package link (vetted registry)
  Addresses cubic-dev-ai review feedback.
* fix(skills): replace third-party action with pip install + CLI
  Use plain pip install + evalview CLI instead of a third-party GitHub Action. No external actions, no secrets passed to unvetted code. Addresses cubic-dev-ai supply-chain review feedback.
* fix(skills): add destructive revert warning for gate_or_revert
  Add prominent warning that gate_or_revert runs git checkout, discarding uncommitted changes. Documents the revert_cmd override for safer alternatives like git stash. Addresses cubic-dev-ai review feedback.
* fix(skills): pin pip version range and document fail-on tradeoffs
  - Pin evalview to >=0.5,<1 to prevent breaking CI on major upgrades
  - Document --fail-on REGRESSION vs --strict tradeoff so users understand what gates and what passes through
  Addresses greptile-apps review feedback.
* fix: use python3 -m evalview for venv compatibility in MCP config
  Follows the same pattern as insaits entry. Resolves correctly even when evalview is installed in a virtual environment that isn't on the system PATH.
* fix: align MCP install command with mcp-servers.json pattern
  Use python3 -m evalview mcp serve consistently across both the skill docs and the MCP config catalog.
* fix: use evalview CLI entry point for MCP command
  pip install evalview installs the evalview binary to PATH, so using it directly is consistent with the install docs and avoids python3 version mismatch issues.
* fix: pin install version to match CI section
* fix: pin all pip install references consistently
* fix: add API key placeholder and pin install version in MCP config
  Add OPENAI_API_KEY env placeholder matching other entries. Note that the key is optional: deterministic checks work without it. Pin install version to match skill docs.
* fix: guard score_delta format for non-scored statuses
---------
Co-authored-by: Affaan Mustafa <me@affaanmustafa.com>
Summary
Adds EvalView as a regression testing skill for AI agents, plus its MCP server config.
EvalView snapshots agent behavior (tool calls, parameters, sequence, output), then diffs against baselines after every change. It's the implementation layer for the patterns described in `ai-regression-testing` and the eval-driven workflow in `eval-harness`, but as an actual executable tool with CLI, Python API, and MCP integration.
What it adds:
- `skills/evalview-agent-testing/SKILL.md`: teaches Claude Code to use EvalView for agent regression testing
- `mcp-configs/mcp-servers.json`: adds the evalview MCP server (8 tools: create_test, run_snapshot, run_check, etc.)
Key capabilities surfaced in the skill:
- `evalview init → snapshot → check → monitor`
- `gate()` / `gate_async()` for programmatic checks in autonomous loops
Type
Testing
- `claude mcp add --transport stdio evalview -- evalview mcp serve`
- `pip install evalview`
Checklist
Summary by cubic
Adds `evalview` as a regression-testing skill and configures its MCP server. Snapshots tool calls and outputs, diffs against baselines, and gates changes in dev and CI.
New Features
- `skills/evalview-agent-testing/SKILL.md` covering CLI workflow, Python `gate()`/`gate_async()`, quick mode, CI usage, multi-turn tests, and OpenClaw auto-revert.
- `python3 -m evalview mcp serve` added to `mcp-configs/mcp-servers.json` with 8 tools; includes `OPENAI_API_KEY` placeholder (optional: deterministic checks work without it).
Bug Fixes
- `evalview` to `>=0.5,<1`; replace third-party action with `pip install` + CLI.
- `gate_or_revert` runs `git checkout -- .`; document `revert_cmd` alternatives.
- `python3 -m evalview mcp serve` in MCP config for venv compatibility; docs show the `evalview` CLI.
- `score_delta` formatting in sample code for non-scored statuses.
Written for commit 66ae934. Summary will update on new commits.
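The `score_delta` item refers to a common pitfall: applying a numeric format spec such as `+.2f` to `None` raises `TypeError`. A guard along these lines avoids it. The function name and the `"n/a"` fallback text are illustrative choices, and the assumption that non-scored statuses carry `None` is inferred from the commit title rather than confirmed by the source.

```python
def format_score_delta(score_delta):
    """Format a score change with an explicit sign, tolerating a missing score.

    Assumes statuses without a score report None for score_delta.
    """
    if score_delta is None:
        return "n/a"  # fallback text is an arbitrary choice
    return f"{score_delta:+.2f}"
```

For example, `format_score_delta(0.1)` yields `+0.10`, while `format_score_delta(None)` returns the fallback instead of raising.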
Summary by CodeRabbit
New Features
Documentation