feat: add MCP server mocks (closes #363) #387
Merged
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Adds first-class, hermetic MCP mocking to waza evals (gated behind schemaVersion: "1.1"), enabling deterministic Copilot SDK runs without relying on live MCP services (network/auth/state).
Changes:
- Extend eval spec + JSON schema with top-level mcp_mocks and validate it requires schema v1.1.
- Implement a stdio MCP mock server (spawned via the waza binary) with exact / JSON Schema / per-field regex argument matching.
- Document MCP mocks across README, site docs, CLI reference, and integration testing docs; add targeted tests for config wiring.
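The ordered response matching described above can be sketched roughly as follows. This is a minimal illustration, not the code in internal/mcpmock: the Response type, its field names, and the catch-all behavior for a matcher-less response are all assumptions, and the JSON Schema mode is left out to keep the sketch free of third-party dependencies.

```go
package main

import (
	"fmt"
	"reflect"
	"regexp"
)

// Response is a hypothetical mock response entry (not the real waza type).
// At most one of Exact or Regex is expected to be set.
type Response struct {
	Exact  map[string]any    // arguments must equal this map exactly
	Regex  map[string]string // per-field regex applied to string arguments
	Result string            // canned tool result returned on a match
}

// matches reports whether args satisfy the response's matcher.
// Assumption: a response with no matcher acts as a catch-all.
func (r Response) matches(args map[string]any) (bool, error) {
	switch {
	case r.Exact != nil:
		return reflect.DeepEqual(r.Exact, args), nil
	case r.Regex != nil:
		for field, pattern := range r.Regex {
			s, isString := args[field].(string)
			if !isString {
				return false, nil // missing or non-string field never matches
			}
			matched, err := regexp.MatchString(pattern, s)
			if err != nil {
				return false, fmt.Errorf("field %q: %w", field, err)
			}
			if !matched {
				return false, nil
			}
		}
		return true, nil
	default:
		return true, nil
	}
}

// firstMatch walks responses in declaration order and returns the result of
// the first one whose matcher accepts args. An unmatched call is an error.
func firstMatch(responses []Response, args map[string]any) (string, error) {
	for _, r := range responses {
		ok, err := r.matches(args)
		if err != nil {
			return "", err
		}
		if ok {
			return r.Result, nil
		}
	}
	return "", fmt.Errorf("no mock response matched arguments %v", args)
}

func main() {
	responses := []Response{
		{Exact: map[string]any{"id": "42"}, Result: "exact hit"},
		{Regex: map[string]string{"id": `^[0-9]+$`}, Result: "numeric id"},
	}
	for _, id := range []string{"42", "7", "abc"} {
		result, err := firstMatch(responses, map[string]any{"id": id})
		fmt.Println(id, result, err)
	}
}
```

Because the list is walked in order, more specific matchers would need to come before broader ones under this design.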
Show a summary per file
| File | Description |
|---|---|
| site/src/content/docs/reference/schema.mdx | Documents mcp_mocks in the schema reference and points readers away from live config.mcp_servers for hermetic evals. |
| site/src/content/docs/reference/cli.mdx | Adds an example waza run invocation for hermetic MCP mock usage. |
| site/src/content/docs/guides/eval-yaml.mdx | Adds a worked guide section for mcp_mocks and response matching semantics. |
| schemas/eval.schema.json | Adds JSON Schema definitions for mcp_mocks / tools / response matchers. |
| README.md | Documents mcp_mocks and the supported matching modes at the repo level. |
| docs/INTEGRATION-TESTING.md | Documents hermetic MCP mock servers for CI-safe Copilot SDK evals. |
| internal/models/spec.go | Adds MCPMocks to EvalSpec and validates schema gating + uniqueness. |
| internal/models/spec_test.go | Adds tests for schemaVersion gating and duplicate mock name rejection. |
| internal/copilotconfig/mcp.go | Adds ConvertMCPServersWithMocks and wires mocks into Copilot SDK MCP server configs. |
| internal/copilotconfig/mcp_test.go | Verifies mock conversion produces a hermetic stdio server config and preserves regular servers. |
| internal/mcpmock/config.go | Resolves mcp_mocks entries from inline definitions and/or JSON fixture directories. |
| internal/mcpmock/server.go | Implements MCP JSON-RPC handling for initialize, tools/list, and tools/call with ordered response matching. |
| internal/mcpmock/server_test.go | Tests fixture directory loading, matching modes, and error surfacing for unknown/unmatched tool calls. |
| internal/orchestration/runner.go | Wires mcp_mocks into execution requests (Copilot SDK MCP servers). |
| internal/orchestration/runner_test.go | Adds a test ensuring MCP mocks are included in the built execution request. |
| internal/trigger/runner.go | Threads mcp_mocks through MCP server conversion in the trigger runner. |
| internal/trigger/runner_test.go | Updates conversion tests for the new convertMCPServers signature. |
| cmd/waza/root.go | Registers hidden __mcp-mock command used to run the mock server over stdio. |
| cmd/waza/cmd_mcp_mock.go | Implements the hidden __mcp-mock command to decode config and serve stdio MCP. |
Review details
- Files reviewed: 19/19 changed files
- Comments generated: 4
- Review effort level: Low
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
spboyer pushed a commit that referenced this pull request Jun 28, 2026
The MCP mocks test in #387 used an empty schemaVersion and expected the 1.1 error path. Because LoadEvalSpec normalizes empty schemaVersion to the current version (1.1), the test passed validation instead of failing. Make the test explicit by setting schemaVersion: '1.0' to actually trigger the gate, then bump to '1.1' in the second half.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
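The interaction between normalization and the gate can be illustrated with a small sketch. The evalSpec type, normalize, and validate below are hypothetical stand-ins rather than the real LoadEvalSpec code path, and the equality check against "1.1" is an assumption about how the gate compares versions.

```go
package main

import (
	"errors"
	"fmt"
)

// currentSchemaVersion mirrors the "current version (1.1)" from the commit message.
const currentSchemaVersion = "1.1"

// evalSpec is a hypothetical, trimmed-down stand-in for the real EvalSpec.
type evalSpec struct {
	SchemaVersion string
	MCPMocks      []string // mock names only, for illustration
}

var errMocksNeedV11 = errors.New(`mcp_mocks requires schemaVersion "1.1"`)

// normalize fills in an empty schemaVersion with the current version,
// which is why an empty value can never reach the gate below.
func normalize(s evalSpec) evalSpec {
	if s.SchemaVersion == "" {
		s.SchemaVersion = currentSchemaVersion
	}
	return s
}

// validate rejects mcp_mocks on any schema version other than 1.1 (assumed rule).
func validate(s evalSpec) error {
	if len(s.MCPMocks) > 0 && s.SchemaVersion != "1.1" {
		return errMocksNeedV11
	}
	return nil
}

func main() {
	mocks := []string{"example-mock"}
	for _, v := range []string{"", "1.0", "1.1"} {
		spec := normalize(evalSpec{SchemaVersion: v, MCPMocks: mocks})
		fmt.Printf("schemaVersion=%q -> error: %v\n", v, validate(spec))
	}
}
```

Under these assumptions only the explicit '1.0' case trips the gate, which matches the reasoning for making the test set the version explicitly.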
spboyer added a commit that referenced this pull request Jun 28, 2026
#388)

* feat: per-task tool metrics with structured arg matchers (closes #366)

- Add argmatcher package (equals/regex/contains/range/json_schema)
- Extend tool_calls grader with expect: [{tool, args}] block
- Extend tool_constraint grader with args: matchers on expect_tools
- Add normalized tool_events[] to RunResult (turn, sequence, tool_call_id, tool_name, args, result, success, duration_ms, error) populated from session events; replay-friendly for Wave 3 (#367), OTel-aligned
- Bump results.json schemaVersion to 1.1 (MINOR additive per #368/#382)
- waza compare prints aggregate TOOL USE section (total calls, success rate, avg/task, histogram, selection accuracy) when tool data present
- Unit tests for matchers, builder, both graders, schema round-trip, compare metrics + histogram
- Docs: graders.mdx (expect/args), schema-changes.md (1.1 entry), cli.mdx (compare TOOL USE), README.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: address Copilot review feedback on #388

- graders.mdx: matchers are single-key mappings (no kind field); graders evaluate session_digest.tool_calls (not tool_events[]); range matcher uses gte/lte/gt/lt (not [min, max]).
- tool_events.go: stringifyResult comment matches JSON-only behavior.
- cmd_compare.go: histogram bucketed per-run (not truncated per-task avg); added 'Tasks w/ tools' row; renamed 'Tasks w/' to 'Runs w/' labels; use tagged switch on runCalls.
- schema-changes.md / README.md / schema.mdx: missing schemaVersion is interpreted as the current schema version (1.1), not 1.0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test: fix MCP mocks schemaVersion test after rebase

The MCP mocks test in #387 used an empty schemaVersion and expected the 1.1 error path. Because LoadEvalSpec normalizes empty schemaVersion to the current version (1.1), the test passed validation instead of failing. Make the test explicit by setting schemaVersion: '1.0' to actually trigger the gate, then bump to '1.1' in the second half.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: address round-2 Copilot review on #388

- persist compiled matcher in validateToolSpecs (map value semantics)
- capture engine-specific tool args via ToolCallArgs.Extra (mapstructure ',remain')
- bucket call_count_histogram per-task across trials (not per-run)
- rename TOOL USE table label 'Runs w/' -> 'Tasks w/' to match metric
- sync README tool_events[] field list with ToolEvent struct
- add IsCompiled() accessor + tests for persisted compile, extra args, and per-task histogram with trials_per_task > 1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
- mcp_mocks eval schema support gated to schemaVersion: "1.1".

Test plan
- go build ./... && go test ./... && go vet ./... && golangci-lint run
- cd site && npm run build

Closes #363