feat: MCP tool-use evaluation primitives

## Problem

Skills increasingly call MCP tools. Today, evaluating an MCP-using skill in CI requires one of:

- A real MCP server (network access, auth, side effects, flakiness).
- Hand-rolled stubs per skill (no shared shape).

There's no waza-native way to stand up a **deterministic MCP mock** for an eval run. Tool-call assertions and record/replay of MCP traffic are handled by sibling issues (see "Non-goals" below).

## Proposal

Ship a built-in MCP mock server that waza launches alongside the eval:

```yaml
mcp_mocks:
  - name: github
    fixtures: ./fixtures/github-mcp/
    tools:
      list_issues:
        responses:
          - match: { owner: "octocat" }
            return: { issues: [...] }
```

- Mock server speaks MCP over stdio/HTTP (matches existing `config.mcp_servers` shape).
- Fixtures are JSON files in a directory; matching is exact-args by default, with optional JSON-schema or per-field regex matching.
- Mock runs in-process during the eval; no network, no port collisions in CI.
- Unknown tool calls fail loudly (not silently return null) so missing fixtures are visible.
- Recorded fixtures redact secrets via the same policy used by snapshots (#367).
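
The exact-args default and the fail-loudly rule from the list above could look roughly like the sketch below. It is illustrative only and not waza's implementation: `MockServer`, `MockTool`, `UnmatchedCallError`, and `call_tool` are hypothetical names, and the `match`/`return` keys simply mirror the YAML example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


class UnmatchedCallError(Exception):
    """Raised instead of returning null, so a missing fixture is visible."""


@dataclass
class MockTool:
    name: str
    # Each response is {"match": {...args...}, "return": <payload>}, as in the YAML.
    responses: List[Dict[str, Any]] = field(default_factory=list)

    def call(self, args: Dict[str, Any]) -> Any:
        for response in self.responses:
            # Exact-args matching: the call's arguments must equal `match`.
            if response["match"] == args:
                return response["return"]
        raise UnmatchedCallError(
            f"no fixture for tool {self.name!r} with args {args!r}"
        )


@dataclass
class MockServer:
    name: str
    tools: Dict[str, MockTool] = field(default_factory=dict)

    def call_tool(self, tool: str, args: Dict[str, Any]) -> Any:
        # Unknown tools fail loudly too, naming the mock that was asked.
        if tool not in self.tools:
            raise UnmatchedCallError(
                f"unknown tool {tool!r} on mock {self.name!r}"
            )
        return self.tools[tool].call(args)
```

In this sketch, a call with `{"owner": "octocat"}` returns the configured payload, while any other argument set or tool name raises `UnmatchedCallError` with the tool and arguments in the message.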

## Why this matters for agentic-first

MCP is the primary tool surface for agents. Without deterministic MCP mocking, every MCP-using skill is untestable in CI — or worse, "tested" against a live service that mutates state. This is the single biggest blocker to CI-first agentic skill development.

## Acceptance criteria

- [ ] `mcp_mocks:` field in eval schema; backward compatible.
- [ ] Mock server runs in-process and registers as an MCP server the skill can call.
- [ ] Fixture matching: exact-args (default), JSON-schema, regex on individual fields.
- [ ] Unknown/unmatched calls fail the task with a clear error pointing at the missing fixture.
- [ ] No live network required in CI; tests in `internal/` verify this.
- [ ] Docs in `site/` with a worked MCP eval example.
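
As a rough illustration of the "regex on individual fields" criterion, the sketch below combines exact matching with an opt-in regex per field. The `{"regex": "..."}` marker and the names `field_matches` and `args_match` are assumptions made for this example; the issue does not define the fixture syntax. JSON-schema matching is left out because it would need a third-party validator.

```python
import re
from typing import Any, Dict


def field_matches(expected: Any, actual: Any) -> bool:
    # Hypothetical convention: {"regex": "<pattern>"} marks a regex matcher.
    if isinstance(expected, dict) and set(expected) == {"regex"}:
        return (
            isinstance(actual, str)
            and re.fullmatch(expected["regex"], actual) is not None
        )
    # Anything else falls back to exact equality (the default mode).
    return expected == actual


def args_match(match: Dict[str, Any], args: Dict[str, Any]) -> bool:
    # Require the same key set so extra or missing arguments never match.
    if match.keys() != args.keys():
        return False
    return all(field_matches(match[key], args[key]) for key in match)
```

Using `re.fullmatch` rather than `re.search` keeps a pattern such as `octo.*` from matching in the middle of an unrelated value, which seems closer to the determinism this issue asks for.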

## Non-goals (filed separately)

- MCP-specific grader assertions (tool name/args/sequence) — handled by enhanced tool graders in #366.
- Record-and-replay of real MCP traffic — handled by snapshot/replay in #367.
- Adversarial MCP test packs (poisoned responses, scope-bypass) — see #365.

## Related

- Existing: `config.mcp_servers`
- Tool graders: #366
- Snapshot/replay: #367
- Safety packs: #365
- Roadmap: #66

