feat(skills): add evalview-agent-testing skill and MCP server by hidai25 · Pull Request #828 · affaan-m/ECC

hidai25 · 2026-03-23T11:30:15Z

Summary

Adds EvalView as a regression testing skill for AI agents, plus its MCP server config.

EvalView snapshots agent behavior (tool calls, parameters, sequence, output), then diffs against baselines after every change. It's the implementation layer for the patterns described in ai-regression-testing and the eval-driven workflow in eval-harness — but as an actual executable tool with CLI, Python API, and MCP integration.
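To make the snapshot-then-diff idea concrete, here is a minimal sketch of exact-match comparison between a baseline run and a current run. It is not EvalView's implementation: the `Snapshot` class and `diff_status` function are hypothetical names introduced for illustration, and only the status labels (PASSED, TOOLS_CHANGED, OUTPUT_CHANGED) are taken from this PR. REGRESSION is left out because it would presumably need a quality score, which this sketch does not model.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Recorded behavior of one agent run (hypothetical structure)."""
    # [(tool_name, params_dict), ...] in call order
    tool_calls: list = field(default_factory=list)
    output: str = ""


def diff_status(baseline: Snapshot, current: Snapshot) -> str:
    """Classify a run against its baseline using exact comparison only."""
    if current.tool_calls != baseline.tool_calls:
        return "TOOLS_CHANGED"   # different tools, parameters, or order
    if current.output != baseline.output:
        return "OUTPUT_CHANGED"  # same tool usage, different final text
    return "PASSED"


baseline = Snapshot([("search", {"q": "refund policy"})], "Refunds take 5 days.")
same = Snapshot([("search", {"q": "refund policy"})], "Refunds take 5 days.")
extra_call = Snapshot(
    [("lookup", {"id": 7}), ("search", {"q": "refund policy"})],
    "Refunds take 5 days.",
)

print(diff_status(baseline, same))        # PASSED
print(diff_status(baseline, extra_call))  # TOOLS_CHANGED
```

Tool calls are compared before output so that a changed call sequence is reported as such even when the final text happens to match.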

What it adds:

skills/evalview-agent-testing/SKILL.md — teaches Claude Code to use EvalView for agent regression testing
mcp-configs/mcp-servers.json — adds the evalview MCP server (8 tools: create_test, run_snapshot, run_check, etc.)

Key capabilities surfaced in the skill:

CLI workflow: evalview init → snapshot → check → monitor
Python API: gate() / gate_async() for programmatic checks in autonomous loops
Quick mode: no LLM judge, $0, sub-second — ideal for tight agent loops
CI/CD: GitHub Action with automatic PR comments, cost/latency alerts
Multi-turn test cases with per-turn evaluation
OpenClaw integration for autonomous agents

Type

Skill
MCP Config

Testing

Skill follows the standard SKILL.md format with YAML frontmatter
EvalView MCP server tested with Claude Code (claude mcp add --transport stdio evalview -- evalview mcp serve)
EvalView has 1126 passing tests and is published on PyPI (pip install evalview)

Checklist

Follows format guidelines
Tested with Claude Code
No sensitive info (API keys, paths)
Clear descriptions

Summary by cubic

Adds evalview as a regression-testing skill and configures its MCP server. Snapshots tool calls and outputs, diffs against baselines, and gates changes in dev and CI.

New Features
- New skill: skills/evalview-agent-testing/SKILL.md covering CLI workflow, Python gate()/gate_async(), quick mode, CI usage, multi-turn tests, and OpenClaw auto-revert.
- MCP server: python3 -m evalview mcp serve added to mcp-configs/mcp-servers.json with 8 tools; includes OPENAI_API_KEY placeholder (optional — deterministic checks work without it).
Bug Fixes
- Pin evalview to >=0.5,<1; replace third-party action with pip install + CLI.
- Add a warning that gate_or_revert runs git checkout -- .; document revert_cmd alternatives.
- Use python3 -m evalview mcp serve in MCP config for venv compatibility; docs show the evalview CLI.
- Guard score_delta formatting in sample code for non-scored statuses.

Written for commit 66ae934. Summary will update on new commits.

Summary by CodeRabbit

New Features
- Added EvalView MCP server integration to enable end-to-end AI agent regression testing, baseline creation, diff-based change gating, and optional automated revert behavior.
Documentation
- Added a comprehensive EvalView guide covering setup, snapshot/check/monitor workflows, status meanings and responses, programmatic gating and revert patterns, CI/CD examples, supported single- and multi-turn test formats, and best practices.

Add EvalView as a regression testing skill for AI agents. EvalView snapshots agent behavior (tool calls, parameters, output), then diffs against baselines after every change, catching regressions before they ship.

Skill covers:
- CLI workflow (init → snapshot → check → monitor)
- Python API (gate() / gate_async() for autonomous loops)
- Quick mode (no LLM judge, $0, sub-second)
- CI/CD integration (GitHub Actions with PR comments)
- MCP integration (8 tools for Claude Code)
- Multi-turn test cases
- OpenClaw integration for autonomous agents

Also adds evalview MCP server to mcp-servers.json.

coderabbitai · 2026-03-23T11:30:39Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

Adds a new EvalView MCP server entry and a comprehensive EvalView skill document describing end-to-end regression testing workflows, CLI and Python usage, MCP registration, CI integration, and YAML test formats.

Changes

| Cohort / File(s) | Summary |
|---|---|
| MCP Server Configuration `mcp-configs/mcp-servers.json` | Adds new `mcpServers.evalview` entry with `command: "evalview"`, `args: ["mcp", "serve"]`, and a descriptive label for EvalView. |
| EvalView Skill Documentation `skills/evalview-agent-testing/SKILL.md` | New documentation describing CLI workflow (`evalview init`, `snapshot`, `check`, `monitor`), check statuses, Python API (`gate()`, `gate_or_revert()`), MCP registration/examples, CI/CD examples, YAML test schemas, and best-practices. |

Sequence Diagram(s)

sequenceDiagram
  participant Dev as Dev/CI
  participant EvalView as EvalView Service
  participant MCP as MCP Server
  participant Agent as AI Agent
  participant Ops as Ops (Webhook)

  Dev->>EvalView: trigger test run (CI/manual)
  EvalView->>MCP: start/register tools (`mcp serve`)
  EvalView->>Agent: execute test cases (single/multi-turn)
  Agent-->>EvalView: outputs and tool calls
  EvalView->>EvalView: compare outputs to baseline snapshot
  EvalView-->>Dev: report result (PASSED/TOOLS_CHANGED/OUTPUT_CHANGED/REGRESSION)
  EvalView-->>Ops: send notification (optional webhook)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

affaan-m

Poem

🐰✨ I hopped to test with eager cheer,
Snapshots kept the baselines near.
Gates that check, and loops that mend,
Regressions caught before they bend.
Hop on—EvalView guards each gear!

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding a new evalview-agent-testing skill and its MCP server configuration, which aligns with the PR objectives and file changes. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


greptile-apps · 2026-03-23T11:33:10Z

Greptile Summary

This PR introduces skills/evalview-agent-testing/SKILL.md and a matching mcp-configs/mcp-servers.json entry for EvalView, an AI agent regression-testing tool. EvalView snapshots agent behavior (tool calls, parameters, sequence, output quality) and diffs against baselines to gate changes in dev and CI. The skill sits logically alongside the existing skills/ai-regression-testing and skills/eval-harness content and provides a concrete, executable implementation layer for those patterns.

A prior round of review raised a number of concerns; this iteration addresses the majority of them:

The risky third-party GitHub Action (hidai25/eval-view@main) has been removed entirely — CI now uses pip install followed by the CLI, removing the supply-chain risk.
score_delta format string is now guarded with if d.score_delta is not None else \"\", preventing TypeError on non-scored statuses.
pip install evalview is now consistently pinned to \"evalview>=0.5,<1\" in both the Installation section and the MCP config description.
The auto-revert warning block has been added to clearly communicate the destructive nature of gate_or_revert.
The --fail-on REGRESSION trade-off is now documented with alternatives (--fail-on REGRESSION,TOOLS_CHANGED, --strict).
The MCP mcp-servers.json entry now uses the robust python3 -m evalview mcp serve invocation and includes the OPENAI_API_KEY placeholder with an explanation that it is optional.
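The `score_delta` guard mentioned in the list above can be illustrated with a small sketch. The `Diff` dataclass and `describe` function here are hypothetical stand-ins, not EvalView types; only the attribute name `score_delta` and the `is not None` guard come from this review. The point is that applying a float format spec to `None` raises `TypeError`, so the value has to be checked first.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Diff:
    """Hypothetical per-test result; score_delta is None when no score was computed."""
    test_name: str
    status: str
    score_delta: Optional[float] = None


def describe(d: Diff) -> str:
    # Formatting None with ":+.2f" raises TypeError, so guard before formatting.
    delta = f" ({d.score_delta:+.2f})" if d.score_delta is not None else ""
    return f"{d.test_name}: {d.status}{delta}"


print(describe(Diff("refund-flow", "REGRESSION", -0.25)))  # refund-flow: REGRESSION (-0.25)
print(describe(Diff("refund-flow", "TOOLS_CHANGED")))      # refund-flow: TOOLS_CHANGED
```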

A few items from the previous review cycle remain unresolved in this diff: the DiffStatus symbol is imported but not referenced in the Python API snippet; the skill is still absent from manifests/install-modules.json (meaning it won't be installed via the official installer); and the in-skill MCP install command (evalview mcp serve) still uses the binary form inconsistently with the config entry (python3 -m evalview mcp serve).

Confidence Score: 5/5

Safe to merge — all P0/P1 concerns from the previous review round have been resolved; remaining open items are P2 style issues already flagged in earlier threads.

All findings from this review pass are P2 or lower. The major risks (mutable GH Action, TypeError on score_delta, unpinned pip dep, missing auto-revert warning, missing API key guidance) were addressed in this iteration. The skill's documentation is clear, the MCP config entry is robust, and no new critical issues were found.

skills/evalview-agent-testing/SKILL.md — DiffStatus unused import, absent manifest entry, and MCP CLI command inconsistency remain from the prior review round but do not block merge.

Important Files Changed

Filename	Overview
skills/evalview-agent-testing/SKILL.md	New skill file documenting EvalView regression testing; several prior review concerns addressed (auto-revert warning added, score_delta guard fixed, pip version pinned, CI action removed). DiffStatus import still unused; skill still absent from manifests/install-modules.json; MCP CLI install command still uses binary form inconsistently with mcp-servers.json.
mcp-configs/mcp-servers.json	New evalview MCP server entry added; uses robust python3 -m invocation, OPENAI_API_KEY placeholder, and versioned install instruction — all previously flagged issues addressed.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    subgraph Setup
        H[pip install evalview] --> I[evalview init]
        I --> J[evalview snapshot - golden baseline committed to git]
    end

    J --> B

    A[Agent Code Change] --> B[evalview check]
    B --> C{DiffStatus}
    C -->|PASSED| D[Ship with confidence]
    C -->|TOOLS_CHANGED| E[Review tool diff]
    C -->|OUTPUT_CHANGED| F[Review output diff]
    C -->|REGRESSION| G[Fix before shipping]
    G --> A

    subgraph AutonomousLoop
        K[make_code_change] --> L[gate_or_revert]
        L -->|passed=True| M[Continue loop]
        L -->|passed=False| N[git checkout - auto-reverted]
        N --> O[try_alternative_approach]
        O --> K
    end

    subgraph CIPipeline
        P[PR opened] --> Q[checkout repo]
        Q --> R[pip install evalview]
        R --> S[evalview check --fail-on REGRESSION]
        S -->|exit 0| T[PR passes]
        S -->|exit 1| U[PR blocked]
    end
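The CI branch of the flowchart depends on which statuses are treated as blocking. A rough sketch of that mapping follows. The function `exit_code` is hypothetical, and the meaning given to `strict` (anything other than PASSED blocks) is an assumption about `--strict`, not something this PR spells out; the `fail_on` values mirror the `--fail-on REGRESSION` and `--fail-on REGRESSION,TOOLS_CHANGED` options mentioned in the review.

```python
def exit_code(statuses, fail_on=("REGRESSION",), strict=False):
    """Return 1 if any status should block the build, else 0."""
    if strict:
        # Assumed meaning of --strict: anything other than PASSED blocks.
        blocking = [s for s in statuses if s != "PASSED"]
    else:
        blocking = [s for s in statuses if s in fail_on]
    return 1 if blocking else 0


run = ["PASSED", "TOOLS_CHANGED", "PASSED"]
print(exit_code(run))                                           # 0
print(exit_code(run, fail_on=("REGRESSION", "TOOLS_CHANGED")))  # 1
print(exit_code(run, strict=True))                              # 1
```

With the default setting, a run whose tool calls changed still exits 0, which is the trade-off the review asked to have documented.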

Reviews (13): Last reviewed commit: "Merge branch 'main' into feat/evalview-a..."

greptile-apps · 2026-03-23T11:33:15Z

+Use `gate()` as a programmatic regression gate inside agent frameworks, autonomous coding loops, or CI scripts:
+
+```python
+from evalview import gate, DiffStatus


DiffStatus imported but never used in example

DiffStatus is imported on this line but not referenced anywhere in the surrounding code block. This will cause a linter warning if users copy the snippet directly, and the purpose of the import is unclear without a usage example.

Either remove the unused import or add a concrete example showing when to use DiffStatus (e.g. comparing against DiffStatus.REGRESSION):

Suggested change

- from evalview import gate, DiffStatus
+ from evalview import gate

greptile-apps · 2026-03-23T11:33:16Z

+---
+name: evalview-agent-testing
+description: Regression testing for AI agents using EvalView. Snapshot agent behavior, detect regressions in tool calls and output quality, and block broken agents before production.
+origin: ECC
+tools: Bash, Read, Write
+---


Skill not registered in manifests/install-modules.json

Per the Skill Placement Policy, all curated skills in skills/ must be listed in manifests/install-modules.json:

"Included in manifests/install-modules.json paths."

This skill is not added to any module in that file. Without a manifest entry, users who install ECC via the official installer will not receive this skill. It would fit naturally as an entry in the workflow-quality module alongside the related skills/ai-regression-testing and skills/eval-harness paths.
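A check like this could be automated. The sketch below searches a parsed manifest for a skill path without assuming any particular schema, since the real layout of `manifests/install-modules.json` is not shown in this PR. The sample manifest content and the `contains_path` helper are hypothetical.

```python
import json


def contains_path(node, target: str) -> bool:
    """Recursively search parsed JSON for a string value equal to target."""
    if isinstance(node, str):
        return node == target
    if isinstance(node, dict):
        return any(contains_path(v, target) for v in node.values())
    if isinstance(node, list):
        return any(contains_path(v, target) for v in node)
    return False


# Hypothetical manifest content; the real schema of install-modules.json may differ.
manifest = json.loads("""
{"modules": [{"id": "workflow-quality",
              "paths": ["skills/ai-regression-testing", "skills/eval-harness"]}]}
""")

print(contains_path(manifest, "skills/eval-harness"))            # True
print(contains_path(manifest, "skills/evalview-agent-testing"))  # False
```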

coderabbitai

🧹 Nitpick comments (2)

skills/evalview-agent-testing/SKILL.md (2)

143-152: Consider adding .gitignore recommendation.

Line 149 correctly advises against committing state.json, but users would benefit from a concrete .gitignore pattern to prevent accidental commits.

📝 Optional: Add .gitignore guidance

  - **Commit `.evalview/golden/` to git.** Baselines should be versioned. Don't commit `state.json`.
+ - **Add `.evalview/state.json` to .gitignore** to prevent accidental commits of transient state.
  - **Use variants for non-deterministic agents.** `evalview snapshot --variant v2` stores alternate valid behaviors (up to 5).

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@skills/evalview-agent-testing/SKILL.md` around lines 143 - 152, Add a
concrete .gitignore recommendation to the Best Practices section so users don't
accidentally commit runtime state: update the text near the references to
".evalview/golden/" and "state.json" to include a suggested .gitignore entry
(e.g., ignore .evalview/state.json and other transient files) and a short note
to commit ".evalview/golden/" but exclude "state.json"; reference the exact
names "state.json", ".evalview/golden/" and ".gitignore" so reviewers can locate
and apply the change in SKILL.md.
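As a rough illustration of the recommendation, the following sketch reports which required ignore entries are missing from a `.gitignore` file's text. The helper name `missing_ignores` is hypothetical, and the check is exact-match only; it does not evaluate glob patterns the way git does.

```python
def missing_ignores(gitignore_text: str, required=(".evalview/state.json",)):
    """Return the required entries that do not appear as exact lines."""
    lines = {
        line.strip()
        for line in gitignore_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    }
    return [entry for entry in required if entry not in lines]


print(missing_ignores("__pycache__/\n.venv/\n"))                  # ['.evalview/state.json']
print(missing_ignores("# evalview\n.evalview/state.json\n"))      # []
```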

92-112: Pin GitHub Action to a specific version for stability.

The GitHub Action reference at line 106 uses @main, which points to a moving target. Action updates could introduce breaking changes without warning, causing CI failures.

📌 Recommended: Pin to a specific version

First, verify the action exists and check available versions:

#!/bin/bash
# Description: Verify GitHub Action existence and list available tags

echo "=== Checking if hidai25/eval-view action exists ==="
gh api repos/hidai25/eval-view --jq '.full_name' 2>/dev/null || echo "Action repository not found"

echo -e "\n=== Listing available tags/versions ==="
gh api repos/hidai25/eval-view/tags --jq '.[].name' 2>/dev/null || echo "No tags found or repo not accessible"

echo -e "\n=== Checking latest release ==="
gh api repos/hidai25/eval-view/releases/latest --jq '.tag_name' 2>/dev/null || echo "No releases found"

Once verified, update the documentation to recommend version pinning:

      - name: Check for regressions
-        uses: hidai25/eval-view@main
+        uses: hidai25/eval-view@v1.0.0  # Pin to specific version
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@skills/evalview-agent-testing/SKILL.md` around lines 92 - 112, The workflow
example pins the GitHub Action to a moving ref "uses: hidai25/eval-view@main",
which is brittle; update the example to recommend and show pinning to a specific
release tag or commit SHA (e.g., replace "hidai25/eval-view@main" with
"hidai25/eval-view@vX.Y.Z" or a specific commit SHA) and add a short note to
verify available tags/releases (use the repo tags or releases to pick the stable
version); reference the workflow name "Agent Regression Check" and the uses line
"uses: hidai25/eval-view@main" so reviewers can locate and change the example
accordingly.
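One way to catch a mutable ref in review is a small lint over `uses:` values. The sketch below treats only a full 40-character commit SHA as pinned; this is a deliberately strict assumption, since release tags can also be moved. The function name `is_pinned_to_sha` is hypothetical, and the SHA in the example is a dummy value.

```python
import re

# A full Git commit SHA-1 is 40 lowercase hex characters.
SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def is_pinned_to_sha(uses: str) -> bool:
    """True when a `uses:` value like owner/repo@ref pins ref to a full commit SHA."""
    _, _, ref = uses.partition("@")
    return bool(SHA_RE.match(ref))


print(is_pinned_to_sha("hidai25/eval-view@main"))         # False
print(is_pinned_to_sha("hidai25/eval-view@" + "a" * 40))  # True (dummy SHA)
```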

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@skills/evalview-agent-testing/SKILL.md`:
- Around line 143-152: Add a concrete .gitignore recommendation to the Best
Practices section so users don't accidentally commit runtime state: update the
text near the references to ".evalview/golden/" and "state.json" to include a
suggested .gitignore entry (e.g., ignore .evalview/state.json and other
transient files) and a short note to commit ".evalview/golden/" but exclude
"state.json"; reference the exact names "state.json", ".evalview/golden/" and
".gitignore" so reviewers can locate and apply the change in SKILL.md.
- Around line 92-112: The workflow example pins the GitHub Action to a moving
ref "uses: hidai25/eval-view@main", which is brittle; update the example to
recommend and show pinning to a specific release tag or commit SHA (e.g.,
replace "hidai25/eval-view@main" with "hidai25/eval-view@vX.Y.Z" or a specific
commit SHA) and add a short note to verify available tags/releases (use the repo
tags or releases to pick the stable version); reference the workflow name "Agent
Regression Check" and the uses line "uses: hidai25/eval-view@main" so reviewers
can locate and change the example accordingly.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d061b61b-b2bf-4514-8cc3-b7a7c7bd1891

📥 Commits

Reviewing files that changed from the base of the PR and between bacc585 and 592cd12.

📒 Files selected for processing (2)

mcp-configs/mcp-servers.json
skills/evalview-agent-testing/SKILL.md

cubic-dev-ai

2 issues found across 2 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="skills/evalview-agent-testing/SKILL.md">

<violation number="1" location="skills/evalview-agent-testing/SKILL.md:106">
P2: CI example uses a third-party action pinned to a mutable branch (`@main`) while requesting `pull-requests: write`, which is a supply-chain risk and leads to non-reproducible runs. Pin to a specific commit SHA or immutable release tag instead.</violation>

<violation number="2" location="skills/evalview-agent-testing/SKILL.md:155">
P2: User-facing docs link to external GitHub repositories without demonstrated org vetting, conflicting with the project’s supply-chain guidance.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@main

- Pin hidai25/eval-view action to commit SHA instead of @main
- Replace external GitHub links with PyPI package link (vetted registry)

Addresses cubic-dev-ai review feedback.

cubic-dev-ai

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="skills/evalview-agent-testing/SKILL.md">

<violation number="1" location="skills/evalview-agent-testing/SKILL.md:106">
P2: User-facing skill docs recommend a non-org third-party GitHub Action in CI and pass secrets to it, conflicting with repo supply-chain guidance to avoid unvetted external repositories.</violation>
</file>


Use plain pip install + evalview CLI instead of a third-party GitHub Action. No external actions, no secrets passed to unvetted code. Addresses cubic-dev-ai supply-chain review feedback.

Add prominent warning that gate_or_revert runs git checkout, discarding uncommitted changes. Documents the revert_cmd override for safer alternatives like git stash. Addresses cubic-dev-ai review feedback.
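The gate-then-revert pattern described here can be sketched with the revert step injected as a callable, so the destructive default is a visible choice rather than a hidden side effect. This is not EvalView's `gate_or_revert`; `gate_then_revert` and the two command constants are hypothetical names. `git checkout -- .` is the command named in this PR, and `git stash push --include-untracked` is one possible gentler alternative.

```python
from typing import Callable


def gate_then_revert(check: Callable[[], bool], revert: Callable[[], None]) -> bool:
    """Run the check; call revert only when the check fails. Returns the check result."""
    passed = check()
    if not passed:
        revert()
    return passed


# Hypothetical revert strategies, expressed as argument lists for subprocess.run.
DESTRUCTIVE_REVERT = ["git", "checkout", "--", "."]              # discards uncommitted changes
STASH_REVERT = ["git", "stash", "push", "--include-untracked"]   # keeps them recoverable

# Demo with a recording stub instead of actually running git.
calls = []
ok = gate_then_revert(check=lambda: False, revert=lambda: calls.append(STASH_REVERT))
print(ok, calls)
```

In a real loop the `revert` callable would presumably wrap something like `subprocess.run(STASH_REVERT, check=True)`; the stub above only records which command would have run.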

- Pin evalview to >=0.5,<1 to prevent breaking CI on major upgrades
- Document --fail-on REGRESSION vs --strict tradeoff so users understand what gates and what passes through

Addresses greptile-apps review feedback.
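To show what the `>=0.5,<1` range admits, here is a simplified sketch that compares only the numeric major and minor components. The helper `in_supported_range` is hypothetical and assumes a plain `major.minor[.patch]` version string; pre-release or otherwise unusual versions would need a real specifier parser such as `packaging.specifiers.SpecifierSet`.

```python
def in_supported_range(version: str) -> bool:
    """Check the >=0.5,<1 range using only the numeric major.minor prefix."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (0, 5) <= (major, minor) < (1, 0)


for v in ["0.4.9", "0.5.0", "0.12.1", "1.0.0"]:
    print(v, in_supported_range(v))
```

Tuple comparison is used so that a minor version such as 12 sorts after 5, which a string comparison would get wrong.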

cubic-dev-ai

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="skills/evalview-agent-testing/SKILL.md">

<violation number="1" location="skills/evalview-agent-testing/SKILL.md:107">
P2: CI pins EvalView to a version range while the rest of the doc still instructs unpinned installs, which can cause local baselines to be generated with a different version than CI and lead to unexpected diffs or failures. Align the version guidance across sections.</violation>
</file>


Follows the same pattern as insaits entry. Resolves correctly even when evalview is installed in a virtual environment that isn't on the system PATH.

Use python3 -m evalview mcp serve consistently across both the skill docs and the MCP config catalog.

cubic-dev-ai

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="skills/evalview-agent-testing/SKILL.md">

<violation number="1" location="skills/evalview-agent-testing/SKILL.md:84">
P2: MCP launch command hardcodes `python3` while install docs use plain `pip`, which can break setup in multi-Python environments.</violation>
</file>


pip install evalview installs the evalview binary to PATH, so using it directly is consistent with the install docs and avoids python3 version mismatch issues.

cubic-dev-ai

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="skills/evalview-agent-testing/SKILL.md">

<violation number="1" location="skills/evalview-agent-testing/SKILL.md:156">
P2: Installation instructions are inconsistent: this PR pins EvalView in Installation/CI but Core Workflow still uses unpinned `pip install evalview`, which can produce non-reproducible setups.</violation>
</file>


Add OPENAI_API_KEY env placeholder matching other entries. Note that the key is optional — deterministic checks work without it. Pin install version to match skill docs.
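For reference, the shape of the resulting config entry can be sketched as below. The `python3 -m evalview mcp serve` invocation and the `mcpServers.evalview` key come from this PR; the helper `evalview_server_entry` and the placeholder string are assumptions, and the actual entry in `mcp-configs/mcp-servers.json` may carry additional fields such as a description.

```python
import json
import sys


def evalview_server_entry(python: str = "python3") -> dict:
    """Build an mcpServers entry that launches the server via `-m evalview`."""
    return {
        "command": python,
        "args": ["-m", "evalview", "mcp", "serve"],
        # Placeholder value is an assumption; the PR says the key is optional.
        "env": {"OPENAI_API_KEY": "YOUR_OPENAI_API_KEY_HERE"},
    }


# Passing sys.executable targets the interpreter of the active environment,
# which is one way to sidestep PATH differences between venvs.
config = {"mcpServers": {"evalview": evalview_server_entry(sys.executable)}}
print(json.dumps(config, indent=2))
```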

affaan-m · 2026-03-23T13:23:21Z

thanks for the pr, queued for review.

hidai25 · 2026-03-23T13:29:16Z

Thanks Affaan! Happy to make any changes if needed.


@main

…-m#828)

* feat(skills): add evalview-agent-testing skill and MCP server

  Add EvalView as a regression testing skill for AI agents. EvalView snapshots agent behavior (tool calls, parameters, output), then diffs against baselines after every change, catching regressions before they ship.

  Skill covers:
  - CLI workflow (init → snapshot → check → monitor)
  - Python API (gate() / gate_async() for autonomous loops)
  - Quick mode (no LLM judge, $0, sub-second)
  - CI/CD integration (GitHub Actions with PR comments)
  - MCP integration (8 tools for Claude Code)
  - Multi-turn test cases
  - OpenClaw integration for autonomous agents

  Also adds evalview MCP server to mcp-servers.json.

* fix(skills): pin action SHA and remove unvetted external links

  - Pin hidai25/eval-view action to commit SHA instead of @main
  - Replace external GitHub links with PyPI package link (vetted registry)

  Addresses cubic-dev-ai review feedback.

* fix(skills): replace third-party action with pip install + CLI

  Use plain pip install + evalview CLI instead of a third-party GitHub Action. No external actions, no secrets passed to unvetted code. Addresses cubic-dev-ai supply-chain review feedback.

* fix(skills): add destructive revert warning for gate_or_revert

  Add prominent warning that gate_or_revert runs git checkout, discarding uncommitted changes. Documents the revert_cmd override for safer alternatives like git stash. Addresses cubic-dev-ai review feedback.

* fix(skills): pin pip version range and document fail-on tradeoffs

  - Pin evalview to >=0.5,<1 to prevent breaking CI on major upgrades
  - Document --fail-on REGRESSION vs --strict tradeoff so users understand what gates and what passes through

  Addresses greptile-apps review feedback.

* fix: use python3 -m evalview for venv compatibility in MCP config

  Follows the same pattern as insaits entry. Resolves correctly even when evalview is installed in a virtual environment that isn't on the system PATH.

* fix: align MCP install command with mcp-servers.json pattern

  Use python3 -m evalview mcp serve consistently across both the skill docs and the MCP config catalog.

* fix: use evalview CLI entry point for MCP command

  pip install evalview installs the evalview binary to PATH, so using it directly is consistent with the install docs and avoids python3 version mismatch issues.

* fix: pin install version to match CI section

* fix: pin all pip install references consistently

* fix: add API key placeholder and pin install version in MCP config

  Add OPENAI_API_KEY env placeholder matching other entries. Note that the key is optional; deterministic checks work without it. Pin install version to match skill docs.

* fix: guard score_delta format for non-scored statuses

---------

Co-authored-by: Affaan Mustafa <me@affaanmustafa.com>


greptile-apps Bot reviewed Mar 23, 2026

View reviewed changes

coderabbitai Bot reviewed Mar 23, 2026

View reviewed changes

cubic-dev-ai Bot reviewed Mar 23, 2026

View reviewed changes

Comment thread skills/evalview-agent-testing/SKILL.md Outdated

Comment thread skills/evalview-agent-testing/SKILL.md Outdated

fix(skills): pin action SHA and remove unvetted external links

992f6b1

- Pin hidai25/eval-view action to commit SHA instead of @main
- Replace external GitHub links with PyPI package link (vetted registry)

Addresses cubic-dev-ai review feedback.

cubic-dev-ai Bot reviewed Mar 23, 2026

View reviewed changes

Comment thread skills/evalview-agent-testing/SKILL.md Outdated

fix(skills): replace third-party action with pip install + CLI

71b8d65

Use plain pip install + evalview CLI instead of a third-party GitHub Action. No external actions, no secrets passed to unvetted code. Addresses cubic-dev-ai supply-chain review feedback.

greptile-apps Bot reviewed Mar 23, 2026

View reviewed changes

Comment thread skills/evalview-agent-testing/SKILL.md

fix(skills): add destructive revert warning for gate_or_revert

cfbe39a

Add prominent warning that gate_or_revert runs git checkout, discarding uncommitted changes. Documents the revert_cmd override for safer alternatives like git stash. Addresses cubic-dev-ai review feedback.

greptile-apps Bot reviewed Mar 23, 2026

View reviewed changes

Comment thread skills/evalview-agent-testing/SKILL.md

Comment thread skills/evalview-agent-testing/SKILL.md

fix(skills): pin pip version range and document fail-on tradeoffs

ae3676c

- Pin evalview to >=0.5,<1 to prevent breaking CI on major upgrades
- Document --fail-on REGRESSION vs --strict tradeoff so users understand what gates and what passes through

Addresses greptile-apps review feedback.

greptile-apps Bot reviewed Mar 23, 2026

View reviewed changes

Comment thread mcp-configs/mcp-servers.json

cubic-dev-ai Bot reviewed Mar 23, 2026

View reviewed changes

Comment thread skills/evalview-agent-testing/SKILL.md

fix: use python3 -m evalview for venv compatibility in MCP config

a7f16ed

Follows the same pattern as insaits entry. Resolves correctly even when evalview is installed in a virtual environment that isn't on the system PATH.

greptile-apps Bot reviewed Mar 23, 2026

View reviewed changes

Comment thread skills/evalview-agent-testing/SKILL.md

fix: align MCP install command with mcp-servers.json pattern

92baf0c

Use python3 -m evalview mcp serve consistently across both the skill docs and the MCP config catalog.

cubic-dev-ai Bot reviewed Mar 23, 2026

View reviewed changes

Comment thread skills/evalview-agent-testing/SKILL.md Outdated

fix: use evalview CLI entry point for MCP command

1d24068

pip install evalview installs the evalview binary to PATH, so using it directly is consistent with the install docs and avoids python3 version mismatch issues.

greptile-apps Bot reviewed Mar 23, 2026

View reviewed changes

Comment thread skills/evalview-agent-testing/SKILL.md

fix: pin install version to match CI section

f931b80

cubic-dev-ai Bot reviewed Mar 23, 2026

View reviewed changes

Comment thread skills/evalview-agent-testing/SKILL.md

fix: pin all pip install references consistently

8f5ded2

greptile-apps Bot reviewed Mar 23, 2026

View reviewed changes

Comment thread mcp-configs/mcp-servers.json

Comment thread mcp-configs/mcp-servers.json Outdated

fix: add API key placeholder and pin install version in MCP config

d0e4643

Add OPENAI_API_KEY env placeholder matching other entries. Note that the key is optional — deterministic checks work without it. Pin install version to match skill docs.

greptile-apps Bot reviewed Mar 23, 2026

View reviewed changes

Comment thread skills/evalview-agent-testing/SKILL.md Outdated

fix: guard score_delta format for non-scored statuses

9340829

Merge branch 'main' into feat/evalview-agent-testing

66ae934

affaan-m merged commit 0f40fd0 into affaan-m:main Mar 31, 2026
4 checks passed
