feat(skills): add evalview-agent-testing skill and MCP server #828

Merged
affaan-m merged 13 commits into affaan-m:main from hidai25:feat/evalview-agent-testing on Mar 31, 2026
Conversation

hidai25 commented Mar 23, 2026

Contributor

Summary

Adds EvalView as a regression testing skill for AI agents, plus its MCP server config.

EvalView snapshots agent behavior (tool calls, parameters, sequence, output), then diffs each new run against those baselines after every change. It implements the patterns described in ai-regression-testing and the eval-driven workflow in eval-harness as an executable tool with a CLI, a Python API, and MCP integration.
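
To make the snapshot-and-diff idea concrete, here is a minimal sketch of a deterministic comparison. It is not EvalView's implementation: the `Snapshot` class and `diff_status` function are hypothetical names invented for illustration, and only the status names are taken from this PR. `REGRESSION` is left out because it presumably depends on scoring, which is not modelled here.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Hypothetical record of one agent run (not EvalView's real schema)."""
    # Ordered list of (tool_name, params_dict) pairs, in call order.
    tool_calls: list = field(default_factory=list)
    output: str = ""


def diff_status(baseline: Snapshot, current: Snapshot) -> str:
    """Classify a run against its baseline using deterministic checks only."""
    # Tool names, parameters, and call order must all match the baseline.
    if baseline.tool_calls != current.tool_calls:
        return "TOOLS_CHANGED"
    # Same tool calls, but a different final answer.
    if baseline.output != current.output:
        return "OUTPUT_CHANGED"
    return "PASSED"


baseline = Snapshot([("search", {"q": "refund policy"})], "Refunds take 5 days.")
same = Snapshot([("search", {"q": "refund policy"})], "Refunds take 5 days.")
extra_call = Snapshot(
    [("lookup", {"id": 7}), ("search", {"q": "refund policy"})],
    "Refunds take 5 days.",
)
reworded = Snapshot([("search", {"q": "refund policy"})], "Refunds take a week.")

for run in (same, extra_call, reworded):
    print(diff_status(baseline, run))
```

Because the comparison is plain equality on recorded data, it needs no LLM judge, which is roughly the trade-off the quick mode below describes.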

What it adds:

  • skills/evalview-agent-testing/SKILL.md — teaches Claude Code to use EvalView for agent regression testing
  • mcp-configs/mcp-servers.json — adds the evalview MCP server (8 tools: create_test, run_snapshot, run_check, etc.)

Key capabilities surfaced in the skill:

  • CLI workflow: evalview init → snapshot → check → monitor
  • Python API: gate() / gate_async() for programmatic checks in autonomous loops
  • Quick mode: no LLM judge, $0, sub-second — ideal for tight agent loops
  • CI/CD: GitHub Action with automatic PR comments, cost/latency alerts
  • Multi-turn test cases with per-turn evaluation
  • OpenClaw integration for autonomous agents

Type

  • Skill
  • MCP Config

Testing

  • Skill follows the standard SKILL.md format with YAML frontmatter
  • EvalView MCP server tested with Claude Code (claude mcp add --transport stdio evalview -- evalview mcp serve)
  • EvalView has 1126 passing tests and is published on PyPI (pip install evalview)

Checklist

  • Follows format guidelines
  • Tested with Claude Code
  • No sensitive info (API keys, paths)
  • Clear descriptions

Summary by cubic

Adds evalview as a regression-testing skill and configures its MCP server. Snapshots tool calls and outputs, diffs against baselines, and gates changes in dev and CI.

  • New Features

    • New skill: skills/evalview-agent-testing/SKILL.md covering CLI workflow, Python gate()/gate_async(), quick mode, CI usage, multi-turn tests, and OpenClaw auto-revert.
    • MCP server: python3 -m evalview mcp serve added to mcp-configs/mcp-servers.json with 8 tools; includes OPENAI_API_KEY placeholder (optional — deterministic checks work without it).
  • Bug Fixes

    • Pin evalview to >=0.5,<1; replace third-party action with pip install + CLI.
    • Add a warning that gate_or_revert runs git checkout -- .; document revert_cmd alternatives.
    • Use python3 -m evalview mcp serve in MCP config for venv compatibility; docs show the evalview CLI.
    • Guard score_delta formatting in sample code for non-scored statuses.
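
The revert warning above can be illustrated with a small sketch of a gate that takes its revert command from the caller. This is a hypothetical stand-in, not EvalView's `gate_or_revert`: the function name `gate_or_revert_sketch`, its parameters, and the demo command are all assumptions made for the example.

```python
import subprocess
import sys
from typing import Callable, Sequence


def gate_or_revert_sketch(check: Callable[[], bool], revert_cmd: Sequence[str]) -> bool:
    """Run `check`; if it fails, run `revert_cmd` and report failure.

    Hypothetical stand-in for illustration only.
    """
    if check():
        return True
    # The revert command is supplied by the caller, so nothing destructive
    # runs unless the caller explicitly asks for it.
    subprocess.run(list(revert_cmd), check=True)
    return False


# Harmless revert command for the demo. In a real repository a caller might
# pass ["git", "stash", "--include-untracked"] rather than the destructive
# ["git", "checkout", "--", "."].
harmless_revert = [sys.executable, "-c", "print('reverted')"]

print(gate_or_revert_sketch(lambda: True, harmless_revert))
print(gate_or_revert_sketch(lambda: False, harmless_revert))
```

Passing the command in explicitly keeps the destructive default visible at the call site, which is the point of documenting `revert_cmd` alternatives.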

Written for commit 66ae934. Summary will update on new commits.

Summary by CodeRabbit

  • New Features

    • Added EvalView MCP server integration to enable end-to-end AI agent regression testing, baseline creation, diff-based change gating, and optional automated revert behavior.
  • Documentation

    • Added a comprehensive EvalView guide covering setup, snapshot/check/monitor workflows, status meanings and responses, programmatic gating and revert patterns, CI/CD examples, supported single- and multi-turn test formats, and best practices.

Add EvalView as a regression testing skill for AI agents. EvalView
snapshots agent behavior (tool calls, parameters, output), then diffs
against baselines after every change — catching regressions before they
ship.

Skill covers:
- CLI workflow (init → snapshot → check → monitor)
- Python API (gate() / gate_async() for autonomous loops)
- Quick mode (no LLM judge, $0, sub-second)
- CI/CD integration (GitHub Actions with PR comments)
- MCP integration (8 tools for Claude Code)
- Multi-turn test cases
- OpenClaw integration for autonomous agents

Also adds evalview MCP server to mcp-servers.json.

coderabbitai Bot commented Mar 23, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds a new EvalView MCP server entry and a comprehensive EvalView skill document describing end-to-end regression testing workflows, CLI and Python usage, MCP registration, CI integration, and YAML test formats.

Changes

| Cohort / File(s) | Summary |
|---|---|
| MCP Server Configuration: mcp-configs/mcp-servers.json | Adds new mcpServers.evalview entry with command: "evalview", args: ["mcp", "serve"], and a descriptive label for EvalView. |
| EvalView Skill Documentation: skills/evalview-agent-testing/SKILL.md | New documentation describing CLI workflow (evalview init, snapshot, check, monitor), check statuses, Python API (gate(), gate_or_revert()), MCP registration/examples, CI/CD examples, YAML test schemas, and best practices. |

Sequence Diagram(s)

sequenceDiagram
  participant Dev as Dev/CI
  participant EvalView as EvalView Service
  participant MCP as MCP Server
  participant Agent as AI Agent
  participant Ops as Ops (Webhook)

  Dev->>EvalView: trigger test run (CI/manual)
  EvalView->>MCP: start/register tools (`mcp serve`)
  EvalView->>Agent: execute test cases (single/multi-turn)
  Agent-->>EvalView: outputs and tool calls
  EvalView->>EvalView: compare outputs to baseline snapshot
  EvalView-->>Dev: report result (PASSED/TOOLS_CHANGED/OUTPUT_CHANGED/REGRESSION)
  EvalView-->>Ops: send notification (optional webhook)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • affaan-m

Poem

🐰✨ I hopped to test with eager cheer,
Snapshots kept the baselines near.
Gates that check, and loops that mend,
Regressions caught before they bend.
Hop on—EvalView guards each gear!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding a new evalview-agent-testing skill and its MCP server configuration, which aligns with the PR objectives and file changes. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


greptile-apps Bot commented Mar 23, 2026

Contributor

Greptile Summary

This PR introduces skills/evalview-agent-testing/SKILL.md and a matching mcp-configs/mcp-servers.json entry for EvalView, an AI agent regression-testing tool. EvalView snapshots agent behavior (tool calls, parameters, sequence, output quality) and diffs against baselines to gate changes in dev and CI. The skill sits logically alongside the existing skills/ai-regression-testing and skills/eval-harness content and provides a concrete, executable implementation layer for those patterns.

A prior round of review raised a number of concerns; this iteration addresses the majority of them:

  • The risky third-party GitHub Action (hidai25/eval-view@main) has been removed entirely — CI now uses pip install followed by the CLI, removing the supply-chain risk.
  • The score_delta format string is now guarded with `if d.score_delta is not None else ""`, preventing a TypeError on non-scored statuses.
  • pip install evalview is now consistently pinned to "evalview>=0.5,<1" in both the Installation section and the MCP config description.
  • The auto-revert warning block has been added to clearly communicate the destructive nature of gate_or_revert.
  • The --fail-on REGRESSION trade-off is now documented with alternatives (--fail-on REGRESSION,TOOLS_CHANGED, --strict).
  • The MCP mcp-servers.json entry now uses the robust python3 -m evalview mcp serve invocation and includes the OPENAI_API_KEY placeholder with an explanation that it is optional.

A few items from the previous review cycle remain unresolved in this diff: the DiffStatus symbol is imported but not referenced in the Python API snippet; the skill is still absent from manifests/install-modules.json (meaning it won't be installed via the official installer); and the in-skill MCP install command (evalview mcp serve) still uses the binary form inconsistently with the config entry (python3 -m evalview mcp serve).

Confidence Score: 5/5

Safe to merge — all P0/P1 concerns from the previous review round have been resolved; remaining open items are P2 style issues already flagged in earlier threads.

All findings from this review pass are P2 or lower. The major risks (mutable GH Action, TypeError on score_delta, unpinned pip dep, missing auto-revert warning, missing API key guidance) were addressed in this iteration. The skill's documentation is clear, the MCP config entry is robust, and no new critical issues were found.

skills/evalview-agent-testing/SKILL.md — DiffStatus unused import, absent manifest entry, and MCP CLI command inconsistency remain from the prior review round but do not block merge.

Important Files Changed

| Filename | Overview |
|---|---|
| skills/evalview-agent-testing/SKILL.md | New skill file documenting EvalView regression testing; several prior review concerns addressed (auto-revert warning added, score_delta guard fixed, pip version pinned, CI action removed). DiffStatus import still unused; skill still absent from manifests/install-modules.json; MCP CLI install command still uses binary form inconsistently with mcp-servers.json. |
| mcp-configs/mcp-servers.json | New evalview MCP server entry added; uses robust python3 -m invocation, OPENAI_API_KEY placeholder, and versioned install instruction; all previously flagged issues addressed. |
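
For orientation, the MCP entry described in this review might look roughly like the structure below, built in Python and printed as JSON. This is a reconstruction from the review text, not a copy of the merged file: the command and args come from the PR, while the placeholder key value and the description wording are assumptions.

```python
import json

# Hypothetical reconstruction of the evalview entry. Only "command" and
# "args" are taken from the PR text; the other values are placeholders.
evalview_entry = {
    "command": "python3",
    "args": ["-m", "evalview", "mcp", "serve"],
    # Optional according to the PR: deterministic checks work without it.
    "env": {"OPENAI_API_KEY": "YOUR_OPENAI_API_KEY_HERE"},
    "description": 'EvalView agent regression testing (pip install "evalview>=0.5,<1")',
}

config = {"mcpServers": {"evalview": evalview_entry}}
print(json.dumps(config, indent=2))
```

Invoking the module through `python3 -m` rather than the `evalview` binary is what the review calls the venv-compatible form.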

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    subgraph Setup
        H[pip install evalview] --> I[evalview init]
        I --> J[evalview snapshot - golden baseline committed to git]
    end

    J --> B

    A[Agent Code Change] --> B[evalview check]
    B --> C{DiffStatus}
    C -->|PASSED| D[Ship with confidence]
    C -->|TOOLS_CHANGED| E[Review tool diff]
    C -->|OUTPUT_CHANGED| F[Review output diff]
    C -->|REGRESSION| G[Fix before shipping]
    G --> A

    subgraph AutonomousLoop
        K[make_code_change] --> L[gate_or_revert]
        L -->|passed=True| M[Continue loop]
        L -->|passed=False| N[git checkout - auto-reverted]
        N --> O[try_alternative_approach]
        O --> K
    end

    subgraph CIPipeline
        P[PR opened] --> Q[checkout repo]
        Q --> R[pip install evalview]
        R --> S[evalview check --fail-on REGRESSION]
        S -->|exit 0| T[PR passes]
        S -->|exit 1| U[PR blocked]
    end
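
The CI branch of the flowchart turns a set of per-test statuses into an exit code. A minimal sketch of that mapping follows, assuming `--fail-on` takes a set of statuses that block the PR and that `--strict` blocks on anything other than PASSED. Both readings are assumptions, and the `DiffStatus` enum here is a stand-in that only borrows the status names from the diagram.

```python
from enum import Enum


class DiffStatus(Enum):
    """Status names taken from the flowchart; the enum itself is a stand-in."""
    PASSED = "PASSED"
    TOOLS_CHANGED = "TOOLS_CHANGED"
    OUTPUT_CHANGED = "OUTPUT_CHANGED"
    REGRESSION = "REGRESSION"


def exit_code(statuses, fail_on) -> int:
    """Return 1 if any test status is in the fail_on set, else 0."""
    return 1 if any(s in fail_on for s in statuses) else 0


# Assumed meaning of --strict: every status except PASSED blocks.
STRICT = set(DiffStatus) - {DiffStatus.PASSED}

run = [DiffStatus.PASSED, DiffStatus.TOOLS_CHANGED]

# Roughly "--fail-on REGRESSION": a tool change passes through.
print(exit_code(run, {DiffStatus.REGRESSION}))
# Roughly "--fail-on REGRESSION,TOOLS_CHANGED": the same run now blocks.
print(exit_code(run, {DiffStatus.REGRESSION, DiffStatus.TOOLS_CHANGED}))
# Roughly "--strict".
print(exit_code(run, STRICT))
```

The first call shows the trade-off the review asks to be documented: gating on REGRESSION alone lets tool-call changes reach the main branch unreviewed.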

Reviews (13): Last reviewed commit: "Merge branch 'main' into feat/evalview-a..."

Comment thread skills/evalview-agent-testing/SKILL.md Outdated
Use `gate()` as a programmatic regression gate inside agent frameworks, autonomous coding loops, or CI scripts:

```python
from evalview import gate, DiffStatus
```

Contributor

P2 DiffStatus imported but never used in example

DiffStatus is imported on this line but not referenced anywhere in the surrounding code block. This will cause a linter warning if users copy the snippet directly, and the purpose of the import is unclear without a usage example.

Either remove the unused import or add a concrete example showing when to use DiffStatus (e.g. comparing against DiffStatus.REGRESSION):

Suggested change
- from evalview import gate, DiffStatus
+ from evalview import gate

Comment on lines +1 to +6
---
name: evalview-agent-testing
description: Regression testing for AI agents using EvalView. Snapshot agent behavior, detect regressions in tool calls and output quality, and block broken agents before production.
origin: ECC
tools: Bash, Read, Write
---

Contributor

P2 Skill not registered in manifests/install-modules.json

Per the Skill Placement Policy, all curated skills in skills/ must be listed in manifests/install-modules.json:

"Included in manifests/install-modules.json paths."

This skill is not added to any module in that file. Without a manifest entry, users who install ECC via the official installer will not receive this skill. It would fit naturally as an entry in the workflow-quality module alongside the related skills/ai-regression-testing and skills/eval-harness paths.

coderabbitai Bot left a comment

Contributor

🧹 Nitpick comments (2)
skills/evalview-agent-testing/SKILL.md (2)

143-152: Consider adding .gitignore recommendation.

Line 149 correctly advises against committing state.json, but users would benefit from a concrete .gitignore pattern to prevent accidental commits.

📝 Optional: Add .gitignore guidance
  - **Commit `.evalview/golden/` to git.** Baselines should be versioned. Don't commit `state.json`.
+ - **Add `.evalview/state.json` to .gitignore** to prevent accidental commits of transient state.
  - **Use variants for non-deterministic agents.** `evalview snapshot --variant v2` stores alternate valid behaviors (up to 5).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/evalview-agent-testing/SKILL.md` around lines 143 - 152, Add a
concrete .gitignore recommendation to the Best Practices section so users don't
accidentally commit runtime state: update the text near the references to
".evalview/golden/" and "state.json" to include a suggested .gitignore entry
(e.g., ignore .evalview/state.json and other transient files) and a short note
to commit ".evalview/golden/" but exclude "state.json"; reference the exact
names "state.json", ".evalview/golden/" and ".gitignore" so reviewers can locate
and apply the change in SKILL.md.

92-112: Pin GitHub Action to a specific version for stability.

The GitHub Action reference at line 106 uses @main, which points to a moving target. Action updates could introduce breaking changes without warning, causing CI failures.

📌 Recommended: Pin to a specific version

First, verify the action exists and check available versions:

#!/bin/bash
# Description: Verify GitHub Action existence and list available tags

echo "=== Checking if hidai25/eval-view action exists ==="
gh api repos/hidai25/eval-view --jq '.full_name' 2>/dev/null || echo "Action repository not found"

echo -e "\n=== Listing available tags/versions ==="
gh api repos/hidai25/eval-view/tags --jq '.[].name' 2>/dev/null || echo "No tags found or repo not accessible"

echo -e "\n=== Checking latest release ==="
gh api repos/hidai25/eval-view/releases/latest --jq '.tag_name' 2>/dev/null || echo "No releases found"

Once verified, update the documentation to recommend version pinning:

      - name: Check for regressions
-        uses: hidai25/eval-view@main
+        uses: hidai25/eval-view@v1.0.0  # Pin to specific version
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/evalview-agent-testing/SKILL.md` around lines 92 - 112, The workflow
example pins the GitHub Action to a moving ref "uses: hidai25/eval-view@main",
which is brittle; update the example to recommend and show pinning to a specific
release tag or commit SHA (e.g., replace "hidai25/eval-view@main" with
"hidai25/eval-view@vX.Y.Z" or a specific commit SHA) and add a short note to
verify available tags/releases (use the repo tags or releases to pick the stable
version); reference the workflow name "Agent Regression Check" and the uses line
"uses: hidai25/eval-view@main" so reviewers can locate and change the example
accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d061b61b-b2bf-4514-8cc3-b7a7c7bd1891

📥 Commits

Reviewing files that changed from the base of the PR and between bacc585 and 592cd12.

📒 Files selected for processing (2)
  • mcp-configs/mcp-servers.json
  • skills/evalview-agent-testing/SKILL.md

cubic-dev-ai Bot left a comment

Contributor

2 issues found across 2 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="skills/evalview-agent-testing/SKILL.md">

<violation number="1" location="skills/evalview-agent-testing/SKILL.md:106">
P2: CI example uses a third-party action pinned to a mutable branch (`@main`) while requesting `pull-requests: write`, which is a supply-chain risk and leads to non-reproducible runs. Pin to a specific commit SHA or immutable release tag instead.</violation>

<violation number="2" location="skills/evalview-agent-testing/SKILL.md:155">
P2: User-facing docs link to external GitHub repositories without demonstrated org vetting, conflicting with the project’s supply-chain guidance.</violation>
</file>

Comment thread skills/evalview-agent-testing/SKILL.md Outdated
Comment thread skills/evalview-agent-testing/SKILL.md Outdated
- Pin hidai25/eval-view action to commit SHA instead of @main
- Replace external GitHub links with PyPI package link (vetted registry)

Addresses cubic-dev-ai review feedback.

cubic-dev-ai Bot left a comment

Contributor

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="skills/evalview-agent-testing/SKILL.md">

<violation number="1" location="skills/evalview-agent-testing/SKILL.md:106">
P2: User-facing skill docs recommend a non-org third-party GitHub Action in CI and pass secrets to it, conflicting with repo supply-chain guidance to avoid unvetted external repositories.</violation>
</file>

Comment thread skills/evalview-agent-testing/SKILL.md Outdated
Use plain pip install + evalview CLI instead of a third-party GitHub
Action. No external actions, no secrets passed to unvetted code.

Addresses cubic-dev-ai supply-chain review feedback.
Comment thread skills/evalview-agent-testing/SKILL.md
Add prominent warning that gate_or_revert runs git checkout,
discarding uncommitted changes. Documents the revert_cmd override
for safer alternatives like git stash.

Addresses cubic-dev-ai review feedback.
Comment thread skills/evalview-agent-testing/SKILL.md
Comment thread skills/evalview-agent-testing/SKILL.md
- Pin evalview to >=0.5,<1 to prevent breaking CI on major upgrades
- Document --fail-on REGRESSION vs --strict tradeoff so users
  understand what gates and what passes through

Addresses greptile-apps review feedback.
Comment thread mcp-configs/mcp-servers.json

cubic-dev-ai Bot left a comment

Contributor

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="skills/evalview-agent-testing/SKILL.md">

<violation number="1" location="skills/evalview-agent-testing/SKILL.md:107">
P2: CI pins EvalView to a version range while the rest of the doc still instructs unpinned installs, which can cause local baselines to be generated with a different version than CI and lead to unexpected diffs or failures. Align the version guidance across sections.</violation>
</file>

Comment thread skills/evalview-agent-testing/SKILL.md
Follows the same pattern as insaits entry. Resolves correctly even
when evalview is installed in a virtual environment that isn't on
the system PATH.
Comment thread skills/evalview-agent-testing/SKILL.md
Use python3 -m evalview mcp serve consistently across both the
skill docs and the MCP config catalog.

cubic-dev-ai Bot left a comment

Contributor

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="skills/evalview-agent-testing/SKILL.md">

<violation number="1" location="skills/evalview-agent-testing/SKILL.md:84">
P2: MCP launch command hardcodes `python3` while install docs use plain `pip`, which can break setup in multi-Python environments.</violation>
</file>

Comment thread skills/evalview-agent-testing/SKILL.md Outdated
pip install evalview installs the evalview binary to PATH, so using
it directly is consistent with the install docs and avoids python3
version mismatch issues.
Comment thread skills/evalview-agent-testing/SKILL.md

cubic-dev-ai Bot left a comment

Contributor

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="skills/evalview-agent-testing/SKILL.md">

<violation number="1" location="skills/evalview-agent-testing/SKILL.md:156">
P2: Installation instructions are inconsistent: this PR pins EvalView in Installation/CI but Core Workflow still uses unpinned `pip install evalview`, which can produce non-reproducible setups.</violation>
</file>

Comment thread skills/evalview-agent-testing/SKILL.md
Comment thread mcp-configs/mcp-servers.json
Comment thread mcp-configs/mcp-servers.json Outdated
Add OPENAI_API_KEY env placeholder matching other entries. Note that
the key is optional — deterministic checks work without it. Pin
install version to match skill docs.
Comment thread skills/evalview-agent-testing/SKILL.md Outdated
@affaan-m

Owner

thanks for the pr, queued for review.

@hidai25

hidai25 commented Mar 23, 2026 via email

Contributor Author

@affaan-m affaan-m merged commit 0f40fd0 into affaan-m:main Mar 31, 2026
4 checks passed
peiking88 pushed a commit to peiking88/everything-claude-code that referenced this pull request Apr 4, 2026
…-m#828)

* feat(skills): add evalview-agent-testing skill and MCP server

Add EvalView as a regression testing skill for AI agents. EvalView
snapshots agent behavior (tool calls, parameters, output), then diffs
against baselines after every change — catching regressions before they
ship.

Skill covers:
- CLI workflow (init → snapshot → check → monitor)
- Python API (gate() / gate_async() for autonomous loops)
- Quick mode (no LLM judge, $0, sub-second)
- CI/CD integration (GitHub Actions with PR comments)
- MCP integration (8 tools for Claude Code)
- Multi-turn test cases
- OpenClaw integration for autonomous agents

Also adds evalview MCP server to mcp-servers.json.

* fix(skills): pin action SHA and remove unvetted external links

- Pin hidai25/eval-view action to commit SHA instead of @main
- Replace external GitHub links with PyPI package link (vetted registry)

Addresses cubic-dev-ai review feedback.

* fix(skills): replace third-party action with pip install + CLI

Use plain pip install + evalview CLI instead of a third-party GitHub
Action. No external actions, no secrets passed to unvetted code.

Addresses cubic-dev-ai supply-chain review feedback.

* fix(skills): add destructive revert warning for gate_or_revert

Add prominent warning that gate_or_revert runs git checkout,
discarding uncommitted changes. Documents the revert_cmd override
for safer alternatives like git stash.

Addresses cubic-dev-ai review feedback.

* fix(skills): pin pip version range and document fail-on tradeoffs

- Pin evalview to >=0.5,<1 to prevent breaking CI on major upgrades
- Document --fail-on REGRESSION vs --strict tradeoff so users
  understand what gates and what passes through

Addresses greptile-apps review feedback.

* fix: use python3 -m evalview for venv compatibility in MCP config

Follows the same pattern as insaits entry. Resolves correctly even
when evalview is installed in a virtual environment that isn't on
the system PATH.

* fix: align MCP install command with mcp-servers.json pattern

Use python3 -m evalview mcp serve consistently across both the
skill docs and the MCP config catalog.

* fix: use evalview CLI entry point for MCP command

pip install evalview installs the evalview binary to PATH, so using
it directly is consistent with the install docs and avoids python3
version mismatch issues.

* fix: pin install version to match CI section

* fix: pin all pip install references consistently

* fix: add API key placeholder and pin install version in MCP config

Add OPENAI_API_KEY env placeholder matching other entries. Note that
the key is optional — deterministic checks work without it. Pin
install version to match skill docs.

* fix: guard score_delta format for non-scored statuses

---------

Co-authored-by: Affaan Mustafa <me@affaanmustafa.com>
FrancescoRosciano pushed a commit to FRosciano-Mambo/everything-claude-code that referenced this pull request Jun 1, 2026
…-m#828)
