
feat: add MCP server mocks (closes #363) #387

Merged
spboyer merged 2 commits into main from spboyer-squad-363-mcp-server-mocks
Jun 28, 2026
Conversation

@spboyer commented Jun 28, 2026

Member

Summary

  • Add top-level mcp_mocks eval schema support gated to schemaVersion: "1.1".
  • Launch deterministic waza-managed stdio MCP mock servers for Copilot SDK evals, with inline/fixture-backed tools and exact, JSON Schema, and per-field regex response matching.
  • Surface clear MCP tool errors for unknown tools and unmatched calls, and document hermetic MCP eval usage across README, site docs, CLI reference, and integration testing docs.
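The ordered matching described in the summary can be illustrated with a minimal sketch. It covers only the exact and per-field regex modes (JSON Schema matching is omitted), and every name in it (`mockResponse`, `matchResponse`, `matchesAllFields`) is hypothetical rather than taken from the waza codebase. It assumes responses are tried in declaration order and that an unmatched call surfaces an error.

```go
package main

import (
	"fmt"
	"reflect"
	"regexp"
)

// mockResponse is a hypothetical shape for one canned response.
// The real waza types may differ.
type mockResponse struct {
	Exact  map[string]any    // if non-nil, args must deep-equal this map
	Regex  map[string]string // if non-nil, every listed field must match its pattern
	Result string
}

// matchResponse returns the result of the first response whose matcher
// accepts args, or an error when nothing matches.
func matchResponse(responses []mockResponse, args map[string]any) (string, error) {
	for _, r := range responses {
		switch {
		case r.Exact != nil:
			if reflect.DeepEqual(r.Exact, args) {
				return r.Result, nil
			}
		case r.Regex != nil:
			if matchesAllFields(r.Regex, args) {
				return r.Result, nil
			}
		}
	}
	return "", fmt.Errorf("no mock response matched arguments %v", args)
}

// matchesAllFields reports whether every pattern matches the string value
// of the corresponding argument field.
func matchesAllFields(patterns map[string]string, args map[string]any) bool {
	for field, pattern := range patterns {
		value, ok := args[field].(string)
		if !ok {
			return false // missing field or non-string value
		}
		matched, err := regexp.MatchString(pattern, value)
		if err != nil || !matched {
			return false
		}
	}
	return true
}

func main() {
	responses := []mockResponse{
		{Exact: map[string]any{"city": "Seattle"}, Result: "rainy"},
		{Regex: map[string]string{"city": "^San "}, Result: "sunny"},
	}
	for _, city := range []string{"Seattle", "San Diego", "Oslo"} {
		result, err := matchResponse(responses, map[string]any{"city": city})
		fmt.Println(city, result, err)
	}
}
```

In this sketch "Seattle" hits the exact matcher, "San Diego" falls through to the regex matcher, and "Oslo" matches nothing and yields an error, which is the kind of clear failure the summary describes for unmatched calls.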

Test plan

  • go build ./... && go test ./... && go vet ./... && golangci-lint run
  • cd site && npm run build

⚠️ This task was flagged as "needs review" — please have a squad member review before merging.

Closes #363

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 28, 2026 12:13

Copilot AI left a comment

Contributor


Pull request overview

Adds first-class, hermetic MCP mocking to waza evals (gated behind schemaVersion: "1.1"), enabling deterministic Copilot SDK runs without relying on live MCP services (network/auth/state).

Changes:

  • Extend eval spec + JSON schema with top-level mcp_mocks and validate it requires schema v1.1.
  • Implement a stdio MCP mock server (spawned via the waza binary) with exact / JSON Schema / per-field regex argument matching.
  • Document MCP mocks across README, site docs, CLI reference, and integration testing docs; add targeted tests for config wiring.

| File | Description |
| --- | --- |
| `site/src/content/docs/reference/schema.mdx` | Documents mcp_mocks in the schema reference and points readers away from live config.mcp_servers for hermetic evals. |
| `site/src/content/docs/reference/cli.mdx` | Adds an example waza run invocation for hermetic MCP mock usage. |
| `site/src/content/docs/guides/eval-yaml.mdx` | Adds a worked guide section for mcp_mocks and response matching semantics. |
| `schemas/eval.schema.json` | Adds JSON Schema definitions for mcp_mocks / tools / response matchers. |
| `README.md` | Documents mcp_mocks and the supported matching modes at the repo level. |
| `docs/INTEGRATION-TESTING.md` | Documents hermetic MCP mock servers for CI-safe Copilot SDK evals. |
| `internal/models/spec.go` | Adds MCPMocks to EvalSpec and validates schema gating + uniqueness. |
| `internal/models/spec_test.go` | Adds tests for schemaVersion gating and duplicate mock name rejection. |
| `internal/copilotconfig/mcp.go` | Adds ConvertMCPServersWithMocks and wires mocks into Copilot SDK MCP server configs. |
| `internal/copilotconfig/mcp_test.go` | Verifies mock conversion produces a hermetic stdio server config and preserves regular servers. |
| `internal/mcpmock/config.go` | Resolves mcp_mocks entries from inline definitions and/or JSON fixture directories. |
| `internal/mcpmock/server.go` | Implements MCP JSON-RPC handling for initialize, tools/list, and tools/call with ordered response matching. |
| `internal/mcpmock/server_test.go` | Tests fixture directory loading, matching modes, and error surfacing for unknown/unmatched tool calls. |
| `internal/orchestration/runner.go` | Wires mcp_mocks into execution requests (Copilot SDK MCP servers). |
| `internal/orchestration/runner_test.go` | Adds a test ensuring MCP mocks are included in the built execution request. |
| `internal/trigger/runner.go` | Threads mcp_mocks through MCP server conversion in the trigger runner. |
| `internal/trigger/runner_test.go` | Updates conversion tests for the new convertMCPServers signature. |
| `cmd/waza/root.go` | Registers hidden __mcp-mock command used to run the mock server over stdio. |
| `cmd/waza/cmd_mcp_mock.go` | Implements the hidden __mcp-mock command to decode config and serve stdio MCP. |

Review details

  • Files reviewed: 19/19 changed files
  • Comments generated: 4
  • Review effort level: Low

Comment thread internal/models/spec.go
Comment thread internal/copilotconfig/mcp.go
Comment thread internal/copilotconfig/mcp.go
Comment thread internal/mcpmock/server.go
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@spboyer merged commit ec8cb62 into main Jun 28, 2026
10 checks passed
@spboyer deleted the spboyer-squad-363-mcp-server-mocks branch June 28, 2026 12:25
spboyer pushed a commit that referenced this pull request Jun 28, 2026
The MCP mocks test in #387 used an empty schemaVersion and expected
the 1.1 error path. Because LoadEvalSpec normalizes empty schemaVersion
to the current version (1.1), the test passed validation instead of
failing. Make the test explicit by setting schemaVersion: '1.0' to
actually trigger the gate, then bump to '1.1' in the second half.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
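
The interaction this commit message describes can be sketched as follows. This is a simplified illustration, not the actual `LoadEvalSpec` or validation code: `evalSpec`, `normalize`, and `validate` are hypothetical stand-ins, and the version check is reduced to a string comparison.

```go
package main

import (
	"errors"
	"fmt"
)

// currentSchemaVersion is the version an empty schemaVersion is normalized to.
const currentSchemaVersion = "1.1"

// evalSpec is a hypothetical, trimmed-down stand-in for the real EvalSpec.
type evalSpec struct {
	SchemaVersion string
	MCPMocks      []string
}

// normalize mirrors the described loader behavior: an empty schemaVersion
// is treated as the current version.
func normalize(spec evalSpec) evalSpec {
	if spec.SchemaVersion == "" {
		spec.SchemaVersion = currentSchemaVersion
	}
	return spec
}

// validate applies the gate: mcp_mocks is only allowed on schema 1.1.
func validate(spec evalSpec) error {
	if len(spec.MCPMocks) > 0 && spec.SchemaVersion != "1.1" {
		return errors.New(`mcp_mocks requires schemaVersion: "1.1"`)
	}
	return nil
}

func main() {
	for _, version := range []string{"", "1.0", "1.1"} {
		spec := normalize(evalSpec{SchemaVersion: version, MCPMocks: []string{"weather"}})
		fmt.Printf("%q -> %v\n", version, validate(spec))
	}
}
```

Under these assumptions, only the explicit `"1.0"` spec trips the gate. The empty version is normalized to 1.1 before validation runs, which is why the original test could not reach the error path.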
spboyer added a commit that referenced this pull request Jun 28, 2026
#388)

* feat: per-task tool metrics with structured arg matchers (closes #366)

- Add argmatcher package (equals/regex/contains/range/json_schema)
- Extend tool_calls grader with expect: [{tool, args}] block
- Extend tool_constraint grader with args: matchers on expect_tools
- Add normalized tool_events[] to RunResult (turn, sequence, tool_call_id,
  tool_name, args, result, success, duration_ms, error) populated from
  session events — replay-friendly for Wave 3 (#367), OTel-aligned
- Bump results.json schemaVersion to 1.1 (MINOR additive per #368/#382)
- waza compare prints aggregate TOOL USE section (total calls, success
  rate, avg/task, histogram, selection accuracy) when tool data present
- Unit tests for matchers, builder, both graders, schema round-trip,
  compare metrics + histogram
- Docs: graders.mdx (expect/args), schema-changes.md (1.1 entry),
  cli.mdx (compare TOOL USE), README.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: address Copilot review feedback on #388

- graders.mdx: matchers are single-key mappings (no kind field);
  graders evaluate session_digest.tool_calls (not tool_events[]);
  range matcher uses gte/lte/gt/lt (not [min, max]).
- tool_events.go: stringifyResult comment matches JSON-only behavior.
- cmd_compare.go: histogram bucketed per-run (not truncated per-task avg);
  added 'Tasks w/ tools' row; renamed 'Tasks w/' to 'Runs w/' labels;
  use tagged switch on runCalls.
- schema-changes.md / README.md / schema.mdx: missing schemaVersion is
  interpreted as the current schema version (1.1), not 1.0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
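
To illustrate the corrected description of the range matcher (bounds given as gte/lte/gt/lt rather than a `[min, max]` pair), here is a minimal sketch. The `rangeMatcher` type and its fields are assumptions made for illustration, and the real argmatcher package may be structured differently.

```go
package main

import "fmt"

// rangeMatcher is a hypothetical range matcher. Each bound is optional;
// a nil pointer means the bound is not set.
type rangeMatcher struct {
	GTE, LTE, GT, LT *float64
}

// Match reports whether v satisfies every bound that is set.
func (m rangeMatcher) Match(v float64) bool {
	if m.GTE != nil && v < *m.GTE {
		return false
	}
	if m.LTE != nil && v > *m.LTE {
		return false
	}
	if m.GT != nil && v <= *m.GT {
		return false
	}
	if m.LT != nil && v >= *m.LT {
		return false
	}
	return true
}

// bound is a small helper that returns a pointer to a float64 literal.
func bound(v float64) *float64 { return &v }

func main() {
	// Accepts values in the half-open interval [1, 10).
	m := rangeMatcher{GTE: bound(1), LT: bound(10)}
	for _, v := range []float64{0, 1, 9.5, 10} {
		fmt.Println(v, m.Match(v))
	}
}
```

Separate inclusive and exclusive bounds let a spec express half-open intervals, which a two-element `[min, max]` form cannot do without an extra convention.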

* test: fix MCP mocks schemaVersion test after rebase

The MCP mocks test in #387 used an empty schemaVersion and expected
the 1.1 error path. Because LoadEvalSpec normalizes empty schemaVersion
to the current version (1.1), the test passed validation instead of
failing. Make the test explicit by setting schemaVersion: '1.0' to
actually trigger the gate, then bump to '1.1' in the second half.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: address round-2 Copilot review on #388

- persist compiled matcher in validateToolSpecs (map value semantics)
- capture engine-specific tool args via ToolCallArgs.Extra (mapstructure ',remain')
- bucket call_count_histogram per-task across trials (not per-run)
- rename TOOL USE table label 'Runs w/' -> 'Tasks w/' to match metric
- sync README tool_events[] field list with ToolEvent struct
- add IsCompiled() accessor + tests for persisted compile, extra args, and
  per-task histogram with trials_per_task > 1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
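
The "map value semantics" note in the first bullet refers to a general Go pitfall: ranging over a map of structs yields copies, so a field set on the loop variable is lost unless the copy is written back. The sketch below shows that general pattern with hypothetical names (`toolSpec`, `compileLost`, `compilePersisted`). It is not the actual `validateToolSpecs` code, which this page does not show.

```go
package main

import (
	"fmt"
	"regexp"
)

// toolSpec is a hypothetical spec holding a pattern and its compiled form.
type toolSpec struct {
	Pattern  string
	compiled *regexp.Regexp
}

// IsCompiled reports whether the compiled matcher has been stored.
func (s toolSpec) IsCompiled() bool { return s.compiled != nil }

// compileLost sets the field on the loop-local copy only, so the map
// entries are left unchanged.
func compileLost(specs map[string]toolSpec) error {
	for _, spec := range specs {
		re, err := regexp.Compile(spec.Pattern)
		if err != nil {
			return err
		}
		spec.compiled = re // modifies the copy, not the map entry
	}
	return nil
}

// compilePersisted writes the updated copy back into the map.
func compilePersisted(specs map[string]toolSpec) error {
	for name, spec := range specs {
		re, err := regexp.Compile(spec.Pattern)
		if err != nil {
			return err
		}
		spec.compiled = re
		specs[name] = spec // persist the compiled matcher
	}
	return nil
}

func main() {
	lost := map[string]toolSpec{"search": {Pattern: "^query"}}
	kept := map[string]toolSpec{"search": {Pattern: "^query"}}
	if err := compileLost(lost); err != nil {
		fmt.Println("compile error:", err)
	}
	if err := compilePersisted(kept); err != nil {
		fmt.Println("compile error:", err)
	}
	fmt.Println(lost["search"].IsCompiled(), kept["search"].IsCompiled())
}
```

An accessor such as `IsCompiled()` makes this kind of regression easy to assert in a test, which is presumably why the commit adds one alongside the fix.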

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

feat: MCP tool-use evaluation primitives

3 participants