
feat: MCP tool-use evaluation primitives #363

Description

@spboyer

Problem

Skills increasingly call MCP tools. Today, evaluating an MCP-using skill in CI requires one of the following:

  • A real MCP server, which needs network access and auth, causes side effects, and makes runs flaky.
  • Hand-rolled stubs written per skill, with no shared shape.

There's no waza-native way to stand up a deterministic MCP mock for an eval run. Tool-call assertions and record/replay of MCP traffic are handled by sibling issues (see "Non-goals" below).

Proposal

Ship a built-in MCP mock server that waza launches alongside the eval:

mcp_mocks:
  - name: github
    fixtures: ./fixtures/github-mcp/
    tools:
      list_issues:
        responses:
          - match: { owner: "octocat" }
            return: { issues: [...] }

  • Mock server speaks MCP over stdio/HTTP (matches existing config.mcp_servers shape).
  • Fixtures are JSON files in a directory; matching is exact-args by default, with optional JSON-schema match.
  • Mock runs in-process during the eval; no network, no port collisions in CI.
  • Unknown tool calls fail loudly (not silently return null) so missing fixtures are visible.
  • Recorded fixtures redact secrets using the same policy as snapshots (see #367, "feat: Trace replay and deterministic snapshot for agent runs").
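
To make the exact-args default and the "fail loudly" rule concrete, here is a minimal sketch in Python. It is not waza's implementation: the names `MockServer`, `MockTool`, and `UnmatchedToolCall` are hypothetical, the empty `issues` list is a placeholder payload, and the sketch skips the MCP transport entirely to show only the lookup behavior.

```python
from dataclasses import dataclass, field
from typing import Any


class UnmatchedToolCall(Exception):
    """Hypothetical error type: raised instead of returning null."""


@dataclass
class MockTool:
    # Each response mirrors the YAML shape: {"match": {...}, "return": ...}.
    responses: list[dict[str, Any]] = field(default_factory=list)


@dataclass
class MockServer:
    name: str
    tools: dict[str, MockTool] = field(default_factory=dict)

    def call(self, tool: str, args: dict[str, Any]) -> Any:
        # An unknown tool is an error, so a missing fixture is visible.
        if tool not in self.tools:
            raise UnmatchedToolCall(f"{self.name}: unknown tool {tool!r}")
        # Exact-args matching: the call's arguments must equal `match`.
        for response in self.tools[tool].responses:
            if response["match"] == args:
                return response["return"]
        raise UnmatchedToolCall(
            f"{self.name}.{tool}: no fixture matches args {args!r}"
        )


# Mirrors the `github` mock from the YAML example (placeholder payload).
server = MockServer(
    name="github",
    tools={
        "list_issues": MockTool(
            responses=[{"match": {"owner": "octocat"}, "return": {"issues": []}}]
        )
    },
)
```

In this sketch, `server.call("list_issues", {"owner": "octocat"})` returns the fixture payload, while a call with different arguments or an unregistered tool name raises `UnmatchedToolCall` rather than returning an empty result.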

Why this matters for agentic-first

MCP is the primary tool surface for agents. Without deterministic MCP mocking, every MCP-using skill is untestable in CI — or worse, "tested" against a live service that mutates state. This is the single biggest blocker to CI-first agentic skill development.

Acceptance criteria

  • mcp_mocks: field in eval schema; backward compatible.
  • Mock server runs in-process and registers as an MCP server the skill can call.
  • Fixture matching: exact-args (default), JSON-schema, regex on individual fields.
  • Unknown/unmatched calls fail the task with a clear error pointing at the missing fixture.
  • No live network required in CI; tests in internal/ verify this.
  • Docs in site/ with a worked MCP eval example.
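
The per-field regex criterion and the "clear error pointing at the missing fixture" criterion could look roughly like the Python sketch below. The `{"regex": ...}` marker syntax, the function names, and the error wording are all assumptions made for illustration; the issue does not specify them. JSON-schema matching is left out because it would need a third-party validator.

```python
import json
import re
from typing import Any


def field_matches(expected: Any, actual: Any) -> bool:
    # Hypothetical syntax: {"regex": "<pattern>"} marks a regex field.
    if isinstance(expected, dict) and set(expected) == {"regex"}:
        return (
            isinstance(actual, str)
            and re.fullmatch(expected["regex"], actual) is not None
        )
    # Any other value falls back to exact equality.
    return expected == actual


def args_match(match: dict[str, Any], args: dict[str, Any]) -> bool:
    # Same argument names on both sides, then compare field by field.
    if set(match) != set(args):
        return False
    return all(field_matches(match[key], args[key]) for key in match)


def missing_fixture_error(
    mock: str, tool: str, args: dict[str, Any], fixtures_dir: str
) -> str:
    # Name the tool, the arguments, and where a fixture should be added.
    rendered = json.dumps(args, sort_keys=True)
    return (
        f"no fixture for {mock}.{tool} with args {rendered}; "
        f"add a matching response under {fixtures_dir}"
    )
```

For example, `args_match({"owner": {"regex": "octo.*"}, "state": "open"}, {"owner": "octocat", "state": "open"})` is true under these assumptions, while the same pattern rejects `"owner": "hubot"`. Requiring identical argument names keeps the regex mode as strict as exact-args matching for every field that is not explicitly a pattern.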

Non-goals (filed separately)

Related

Metadata

Labels

  • agentic-first
  • coding-agent (Good candidate for coding-agent implementation)
  • enhancement (New feature or request)
  • go:yes (Ready to implement)
  • mcp
  • release:backlog (Not yet targeted)
  • squad:copilot (Assigned to @copilot (Coding Agent) for autonomous work)
  • type:feature (New capability)
