Plan: MCP Server Provisioning for Waza Evaluations
Problem Statement
Waza can evaluate AI agents on tasks, but cannot currently test agents that use MCP tools.
The mcp_servers field exists in the eval YAML schema and is parsed into Config.ServerConfigs,
but is never consumed — no servers are started, no config is written to workspaces, and no
MCP-aware grading exists.
Goal: Allow eval authors to declare MCP servers in eval YAML, have waza provision them
during task execution, and grade how effectively the agent uses those MCP tools.
Approach
Rather than waiting for Copilot SDK changes, use workspace-based MCP discovery:
the agent (Copilot CLI) already discovers MCP servers via .copilot/mcp.json in the
working directory. Waza can write this config file into each task's temp workspace and
manage the MCP server process lifecycle.
  eval.yaml                    Task Workspace (/tmp/waza-xxx/)
  ┌──────────────┐             ┌────────────────────────────┐
  │ mcp_servers: │  ──copy──▶  │ .copilot/mcp.json          │
  │   github:    │             │ fixtures/...               │
  │     command: │             │ (agent works here)         │
  │     args: ...│             └────────────────────────────┘
  └──────────────┘                          │
                                waza starts MCP server process
                                agent discovers via mcp.json
                                waza tracks MCP tool calls
                                waza grades MCP tool usage
                                waza stops MCP server process
Phases & Todos
Phase 1: Formalize MCP Server Config Schema
Define a typed Go struct for MCP server configuration (replacing map[string]any),
and update the JSON schema with proper validation.
- 1a. Define MCPServerConfig struct in internal/models/spec.go
  - Fields: Command, Args, Env, WorkingDir, Url (for SSE/streamable-http transports)
  - Replace ServerConfigs map[string]any with MCPServers map[string]MCPServerConfig
  - Maintain backward compatibility with existing YAML parsing
- 1b. Update JSON schema in schemas/eval.schema.json
  - Replace additionalProperties: true with proper object schema for each server
  - Add command (string, required for stdio servers), args (array), env (object), url (string, for remote servers)
- 1c. Add config validation in spec loading
  - Validate that each server has either command or url (stdio vs remote)
  - Validate that command is a string, args is a string array, etc.
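A minimal sketch of the typed config and the 1c validation rule, under a few assumptions: the struct tags are guesses at the YAML keys, rejecting a server that sets both command and url is one reading of "either command or url", and ValidateAll is a hypothetical helper name.

```go
package main

import (
	"errors"
	"fmt"
)

// MCPServerConfig is the typed replacement for map[string]any proposed in 1a.
// The plan names the last field Url; URL follows the Go initialism convention.
type MCPServerConfig struct {
	Command    string            `yaml:"command" json:"command,omitempty"`
	Args       []string          `yaml:"args" json:"args,omitempty"`
	Env        map[string]string `yaml:"env" json:"env,omitempty"`
	WorkingDir string            `yaml:"working_dir" json:"working_dir,omitempty"`
	URL        string            `yaml:"url" json:"url,omitempty"`
}

// Validate enforces "either command or url".
// ASSUMPTION: setting both is treated as an error rather than letting one win.
func (c MCPServerConfig) Validate() error {
	hasCmd, hasURL := c.Command != "", c.URL != ""
	switch {
	case hasCmd && hasURL:
		return errors.New("set either command (stdio) or url (remote), not both")
	case !hasCmd && !hasURL:
		return errors.New("one of command or url is required")
	case hasURL && len(c.Args) > 0:
		return errors.New("args only apply to stdio servers")
	}
	return nil
}

// ValidateAll checks every server and reports all failures at once,
// each prefixed with its key under mcp_servers.
func ValidateAll(servers map[string]MCPServerConfig) error {
	var errs []error
	for name, cfg := range servers {
		if err := cfg.Validate(); err != nil {
			errs = append(errs, fmt.Errorf("mcp_servers.%s: %w", name, err))
		}
	}
	return errors.Join(errs...) // nil when errs is empty
}

func main() {
	err := ValidateAll(map[string]MCPServerConfig{
		"github": {Command: "github-mcp-server", Args: []string{"stdio"}},
		"broken": {},
	})
	fmt.Println(err)
}
```

Type checks such as "args is a string array" fall out of decoding into the struct, so only the cross-field rule needs hand-written validation.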
Phase 2: MCP Server Lifecycle Manager
Create a component that starts/stops MCP server processes and writes workspace config.
- 2a. Create internal/mcp/lifecycle.go
  - MCPServerManager struct with Start(ctx, configs) and Stop() methods
  - Starts each configured server as a subprocess
  - Tracks PIDs for cleanup
  - Health-check: wait for server readiness (configurable timeout)
  - Graceful shutdown with SIGTERM → SIGKILL fallback
- 2b. Create internal/mcp/workspace.go
  - WriteMCPConfig(workspaceDir, configs) function
  - Writes .copilot/mcp.json into the task workspace directory
  - Transforms MCPServerConfig → Copilot MCP JSON format
  - Creates .copilot/ directory if needed
- 2c. Add tests for lifecycle manager
  - Test start/stop with a simple echo MCP server
  - Test config file generation
  - Test cleanup on context cancellation
  - Test error handling (server fails to start)
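One possible shape for the 2a manager, sketched with the standard library only. It leans on exec.Cmd's Cancel and WaitDelay fields (Go 1.20+) so that cancelling the context sends SIGTERM and the runtime kills the process if it has not exited after StopTimeout. Assumptions: a Unix-like host (SIGTERM delivery), no readiness health check yet, and PIDs and runOnce are hypothetical helpers; the demo uses sleep as a stand-in for a real MCP server.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"sync"
	"syscall"
	"time"
)

type MCPServerConfig struct {
	Command string
	Args    []string
}

type MCPServerManager struct {
	StopTimeout time.Duration // grace period between SIGTERM and kill; 0 waits forever

	mu     sync.Mutex
	cancel context.CancelFunc
	exited []<-chan struct{} // one channel per server, closed once it is reaped
	pids   map[string]int
}

// Start launches every configured server; on the first failure it stops the
// servers already started and returns the error.
func (m *MCPServerManager) Start(ctx context.Context, configs map[string]MCPServerConfig) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.cancel != nil {
		return errors.New("mcp servers already started")
	}
	runCtx, cancel := context.WithCancel(ctx)
	m.cancel, m.pids = cancel, make(map[string]int)
	for name, cfg := range configs {
		cmd := exec.CommandContext(runCtx, cfg.Command, cfg.Args...)
		// On cancellation ask politely first; WaitDelay bounds how long we wait.
		cmd.Cancel = func() error { return cmd.Process.Signal(syscall.SIGTERM) }
		cmd.WaitDelay = m.StopTimeout
		if err := cmd.Start(); err != nil {
			m.stopLocked()
			return fmt.Errorf("start mcp server %q: %w", name, err)
		}
		done := make(chan struct{})
		go func() {
			_ = cmd.Wait() // reap the process; exit status is ignored in this sketch
			close(done)
		}()
		m.exited = append(m.exited, done)
		m.pids[name] = cmd.Process.Pid
	}
	return nil
}

// PIDs returns a copy of the tracked server PIDs.
func (m *MCPServerManager) PIDs() map[string]int {
	m.mu.Lock()
	defer m.mu.Unlock()
	out := make(map[string]int, len(m.pids))
	for k, v := range m.pids {
		out[k] = v
	}
	return out
}

// Stop cancels the run context and blocks until every server has exited.
func (m *MCPServerManager) Stop() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.stopLocked()
}

func (m *MCPServerManager) stopLocked() {
	if m.cancel == nil {
		return
	}
	m.cancel()
	for _, done := range m.exited {
		<-done
	}
	m.cancel, m.exited, m.pids = nil, nil, nil
}

// runOnce starts the servers, reports how many are tracked, then stops them.
func runOnce(configs map[string]MCPServerConfig) (int, error) {
	m := &MCPServerManager{StopTimeout: 2 * time.Second}
	if err := m.Start(context.Background(), configs); err != nil {
		return 0, err
	}
	defer m.Stop()
	return len(m.PIDs()), nil
}

func main() {
	n, err := runOnce(map[string]MCPServerConfig{
		"demo": {Command: "sleep", Args: []string{"30"}},
	})
	fmt.Println(n, err)
}
```

Because shutdown is driven by the context, cancelling the parent ctx passed to Start triggers the same SIGTERM path, which covers the "cleanup on context cancellation" test in 2c.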
Phase 3: Wire Into Execution Pipeline
Connect the lifecycle manager to the orchestration runner and execution engine.
- 3a. Update ExecutionRequest in internal/execution/engine.go
  - Add MCPServers map[string]MCPServerConfig field
  - Allows per-request MCP server config to flow through
- 3b. Update CopilotEngine.Execute() in internal/execution/copilot.go
  - Before creating session: call WriteMCPConfig() to write config to workspace
  - After task completion: clean up the MCP config file
- 3c. Update TestRunner.buildExecutionRequest() in internal/orchestration/runner.go
  - Pass spec.Config.MCPServers into ExecutionRequest.MCPServers
- 3d. Update TestRunner lifecycle hooks
  - Before benchmark: start MCP servers via MCPServerManager.Start()
  - After benchmark: stop MCP servers via MCPServerManager.Stop()
  - Consider: per-task vs per-benchmark server lifetime (config option)
- 3e. Add integration tests
  - Mock MCP server that exposes a simple tool
  - Eval YAML that configures the mock server
  - Verify agent can discover and call the MCP tool
  - Verify cleanup after task
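The 3d hooks and the per-task vs per-benchmark question could look roughly like this. Everything here is a sketch: ServerManager is a hypothetical narrow interface over MCPServerManager, RunBenchmark stands in for the real TestRunner loop, and countingManager exists only to make the start/stop counts observable.

```go
package main

import "fmt"

// Lifetime mirrors the proposed mcp_server_lifetime option.
type Lifetime string

const (
	PerBenchmark Lifetime = "per_benchmark" // default in the plan
	PerTask      Lifetime = "per_task"
)

// ServerManager is the part of the lifecycle manager the runner needs (hypothetical).
type ServerManager interface {
	Start() error
	Stop()
}

// RunBenchmark starts servers once around the whole run, or once around each
// task, depending on lifetime. Any value other than PerTask is treated as
// PerBenchmark.
func RunBenchmark(mgr ServerManager, lifetime Lifetime, tasks []string, runTask func(task string)) error {
	if lifetime != PerTask {
		if err := mgr.Start(); err != nil {
			return fmt.Errorf("start mcp servers: %w", err)
		}
		defer mgr.Stop()
	}
	for _, task := range tasks {
		if err := runOne(mgr, lifetime, task, runTask); err != nil {
			return err
		}
	}
	return nil
}

// runOne is a separate function so the deferred Stop fires after each task.
func runOne(mgr ServerManager, lifetime Lifetime, task string, runTask func(string)) error {
	if lifetime == PerTask {
		if err := mgr.Start(); err != nil {
			return fmt.Errorf("start mcp servers for task %q: %w", task, err)
		}
		defer mgr.Stop()
	}
	runTask(task)
	return nil
}

// countingManager is a test double that records how often it is started and stopped.
type countingManager struct{ starts, stops int }

func (c *countingManager) Start() error { c.starts++; return nil }
func (c *countingManager) Stop()        { c.stops++ }

func main() {
	tasks := []string{"task-1", "task-2", "task-3"}
	for _, lt := range []Lifetime{PerBenchmark, PerTask} {
		m := &countingManager{}
		if err := RunBenchmark(m, lt, tasks, func(string) {}); err != nil {
			fmt.Println("run failed:", err)
			return
		}
		fmt.Printf("%s: starts=%d stops=%d\n", lt, m.starts, m.stops)
	}
}
```

Using defer for Stop in both scopes means servers are torn down even if a task panics or a later Start fails.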
Phase 4: MCP Tool Call Tracking
Enrich tool call data with MCP server origin information.
- 4a. Extend ToolCall model in internal/models/events.go
  - Add Source string field: "builtin", "mcp:{server_name}", "skill"
  - Add MCPServer string field (empty for non-MCP tools)
- 4b. Update FilterToolCalls() to classify tool origins
  - Match tool names against configured MCP server tool lists
  - Or use naming convention (MCP tools often have server-prefixed names)
  - Falls back to "builtin" if no MCP match
- 4c. Update SessionDigest to include MCP summary
  - Add MCPToolCalls int: count of MCP-originated tool calls
  - Add MCPServersUsed []string: which MCP servers were actually called
- 4d. Update dashboard in web/
  - Show MCP tool calls distinguished from built-in tools in TrajectoryViewer
  - Add MCP server badge/icon to tool call entries
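A sketch of the first classification option in 4b (matching against per-server tool lists) together with the 4c summary fields. Assumptions: the tool lists are already known per server (for example from each server's tool listing), ClassifyToolCalls and Summarize are hypothetical names, and "skill" classification is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// ToolCall shows only the fields relevant to origin tracking.
type ToolCall struct {
	Name      string
	Source    string // "builtin", "mcp:{server_name}", or "skill"
	MCPServer string // empty for non-MCP tools
}

// SessionDigest shows only the MCP summary fields proposed in 4c.
type SessionDigest struct {
	MCPToolCalls   int
	MCPServersUsed []string
}

// ClassifyToolCalls tags each call by looking its name up in the per-server
// tool lists, falling back to "builtin". If two servers expose the same tool
// name the winner is arbitrary; a real implementation needs a tie-break rule.
func ClassifyToolCalls(calls []ToolCall, toolsByServer map[string][]string) []ToolCall {
	owner := make(map[string]string)
	for server, tools := range toolsByServer {
		for _, tool := range tools {
			owner[tool] = server
		}
	}
	out := make([]ToolCall, len(calls))
	for i, c := range calls {
		if server, ok := owner[c.Name]; ok {
			c.Source, c.MCPServer = "mcp:"+server, server
		} else {
			c.Source, c.MCPServer = "builtin", ""
		}
		out[i] = c
	}
	return out
}

// Summarize counts MCP-originated calls and lists the distinct servers used,
// sorted for stable output.
func Summarize(calls []ToolCall) SessionDigest {
	var d SessionDigest
	seen := make(map[string]bool)
	for _, c := range calls {
		if c.MCPServer == "" {
			continue
		}
		d.MCPToolCalls++
		if !seen[c.MCPServer] {
			seen[c.MCPServer] = true
			d.MCPServersUsed = append(d.MCPServersUsed, c.MCPServer)
		}
	}
	sort.Strings(d.MCPServersUsed)
	return d
}

func main() {
	calls := []ToolCall{{Name: "get_issue"}, {Name: "bash"}, {Name: "search_code"}}
	tagged := ClassifyToolCalls(calls, map[string][]string{
		"github": {"get_issue", "search_code"},
	})
	for _, c := range tagged {
		fmt.Printf("%-12s %s\n", c.Name, c.Source)
	}
	fmt.Printf("%+v\n", Summarize(tagged))
}
```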
Phase 5: MCP-Aware Graders
Add grading capabilities specific to MCP tool usage.
- 5a. Extend tool_constraint grader
  - Support source: mcp:{server_name} filter in expect_tools / reject_tools
  - Example: expect_tools: [{tool: "get_issue", source: "mcp:github"}]
- 5b. Extend behavior grader
  - Add max_mcp_calls constraint
  - Add required_mcp_servers: ensure agent used specific MCP servers
  - Add forbidden_mcp_servers: ensure agent didn't use certain servers
- 5c. Add mcp_compliance grader (new grader type)
  - Validates that the agent correctly discovered and used MCP tools
  - Checks: tool selection accuracy, parameter correctness, error handling
  - Configurable rubric for MCP-specific evaluation criteria
  - Example config:

        graders:
          - type: mcp_compliance
            name: github_tool_usage
            config:
              server: github
              expect_tools_used: [get_issue, search_code]
              max_tool_errors: 1

- 5d. Add tests for MCP graders
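The two checks visible in the example config (expect_tools_used and max_tool_errors) could be evaluated along these lines. This is a sketch, not the grader interface waza actually uses: GradeResult, GradeMCPCompliance, and the Failed flag on ToolCall are assumptions, and parameter-correctness or rubric checks are not covered.

```go
package main

import "fmt"

// ToolCall carries only what this check needs. Failed is an ASSUMED field
// marking calls whose tool returned an error.
type ToolCall struct {
	Name      string
	MCPServer string
	Failed    bool
}

// MCPComplianceConfig mirrors the keys in the example grader config.
type MCPComplianceConfig struct {
	Server          string
	ExpectToolsUsed []string
	MaxToolErrors   int
}

// GradeResult is a hypothetical result shape: pass/fail plus reasons.
type GradeResult struct {
	Passed  bool
	Reasons []string
}

// GradeMCPCompliance looks only at calls routed to cfg.Server, then checks
// that every expected tool was called at least once and that the number of
// failed calls does not exceed MaxToolErrors.
func GradeMCPCompliance(cfg MCPComplianceConfig, calls []ToolCall) GradeResult {
	used := make(map[string]bool)
	toolErrors := 0
	for _, c := range calls {
		if c.MCPServer != cfg.Server {
			continue
		}
		used[c.Name] = true
		if c.Failed {
			toolErrors++
		}
	}
	var reasons []string
	for _, tool := range cfg.ExpectToolsUsed {
		if !used[tool] {
			reasons = append(reasons,
				fmt.Sprintf("expected tool %q on server %q was never called", tool, cfg.Server))
		}
	}
	if toolErrors > cfg.MaxToolErrors {
		reasons = append(reasons,
			fmt.Sprintf("%d tool errors exceed max_tool_errors=%d", toolErrors, cfg.MaxToolErrors))
	}
	return GradeResult{Passed: len(reasons) == 0, Reasons: reasons}
}

func main() {
	cfg := MCPComplianceConfig{
		Server:          "github",
		ExpectToolsUsed: []string{"get_issue", "search_code"},
		MaxToolErrors:   1,
	}
	res := GradeMCPCompliance(cfg, []ToolCall{
		{Name: "get_issue", MCPServer: "github"},
		{Name: "bash"},
	})
	fmt.Println(res.Passed, res.Reasons)
}
```

The same per-server filtering loop would serve the 5a source filter and the 5b required_mcp_servers / forbidden_mcp_servers constraints.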
Phase 6: Task-Level MCP Overrides
Allow individual tasks to specify additional or different MCP servers.
- 6a. Add mcp_servers to TestCase model
  - Tasks can add MCP servers beyond those in the eval config
  - Tasks can override eval-level MCP server configs
  - Merge strategy: task configs overlay eval configs
- 6b. Update task YAML schema in schemas/
  - Add mcp_servers field to task schema
- 6c. Update runner to merge configs
  - Eval-level mcp_servers as base
  - Task-level mcp_servers merged on top
  - Pass merged config into ExecutionRequest
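The overlay in 6c can be as small as the sketch below. One assumption worth settling early: a task-level entry here replaces the whole eval-level entry with the same name rather than merging field by field. MergeMCPServers is a hypothetical name.

```go
package main

import "fmt"

type MCPServerConfig struct {
	Command string
	Args    []string
	URL     string
}

// MergeMCPServers returns eval-level servers overlaid with task-level ones.
// A task entry REPLACES an eval entry of the same name (no per-field merge),
// and neither input map is modified.
func MergeMCPServers(evalLevel, taskLevel map[string]MCPServerConfig) map[string]MCPServerConfig {
	merged := make(map[string]MCPServerConfig, len(evalLevel)+len(taskLevel))
	for name, cfg := range evalLevel {
		merged[name] = cfg
	}
	for name, cfg := range taskLevel {
		merged[name] = cfg
	}
	return merged
}

func main() {
	evalLevel := map[string]MCPServerConfig{
		"github": {Command: "github-mcp-server", Args: []string{"stdio"}},
		"files":  {Command: "files-mcp"},
	}
	taskLevel := map[string]MCPServerConfig{
		"github": {URL: "http://localhost:8080/mcp"}, // override
		"db":     {Command: "db-mcp"},               // addition
	}
	for name, cfg := range MergeMCPServers(evalLevel, taskLevel) {
		fmt.Printf("%s: %+v\n", name, cfg)
	}
}
```

Whole-entry replacement avoids ambiguous results such as an eval-level command combined with a task-level url, which the Phase 1 validation would reject anyway.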
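Phase 7: Documentation & Examples
Update docs and provide example evals.
- 7a. Create example eval in examples/mcp-eval/
- 7b. Update documentation
  - site/src/content/docs/guides/eval-yaml.mdx: mcp_servers section
  - site/src/content/docs/guides/graders.mdx: mcp_compliance grader
  - site/src/content/docs/reference/cli.mdx: any new flags
  - README.md: MCP eval support overview
- 7c. Update AGENTS.md with MCP eval patterns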
Key Design Decisions
- Workspace-based discovery (not SDK-level): Write .copilot/mcp.json to workspace
  so agent discovers MCP servers naturally. No SDK changes needed.
- Per-benchmark server lifetime (default): Start MCP servers once per eval run,
  not per-task. Add mcp_server_lifetime: per_task | per_benchmark config option.
- Typed config over map[string]any: Replace generic map with MCPServerConfig
  struct for type safety and validation.
- Additive grading: Extend existing graders (tool_constraint, behavior) rather
  than replacing them. Add new mcp_compliance grader for MCP-specific checks.
- Tool origin tracking: Tag each tool call with its source (builtin/mcp/skill)
  to enable source-aware grading.
Open Questions
- Should waza validate that MCP servers are healthy before starting tasks?
(Recommend: yes, with configurable timeout)
- Should MCP server stdout/stderr be captured in eval results for debugging?
(Recommend: yes, as optional verbose output)
- Should waza support remote MCP servers (SSE/streamable-http) in addition to stdio?
(Recommend: yes via url field, Phase 1)
- How to handle MCP servers that need auth tokens — env var passthrough?
(Recommend: env map in config, supports ${VAR} expansion)
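For the auth-token question, ${VAR} expansion over the env map could be sketched with os.Expand from the standard library. Assumptions: referencing an unset variable is treated as an error (the plan does not say), ExpandEnv is a hypothetical name, and the lookup function is injected so tests do not depend on the real process environment. Note that os.Expand also expands the bare $VAR form.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"strings"
)

// ExpandEnv returns a copy of env with ${VAR} (and $VAR) references in the
// values replaced via lookup. ASSUMPTION: any reference to an unset variable
// fails the whole expansion so a missing token is caught before a run starts.
func ExpandEnv(env map[string]string, lookup func(string) (string, bool)) (map[string]string, error) {
	out := make(map[string]string, len(env))
	var missing []string
	for key, raw := range env {
		out[key] = os.Expand(raw, func(name string) string {
			val, ok := lookup(name)
			if !ok {
				missing = append(missing, name)
			}
			return val
		})
	}
	if len(missing) > 0 {
		sort.Strings(missing)
		return nil, fmt.Errorf("unset environment variables: %s", strings.Join(missing, ", "))
	}
	return out, nil
}

func main() {
	env := map[string]string{
		"GITHUB_PERSONAL_ACCESS_TOKEN": "${GITHUB_TOKEN}",
		"LOG_LEVEL":                    "debug",
	}
	// os.LookupEnv has exactly the signature ExpandEnv expects.
	expanded, err := ExpandEnv(env, os.LookupEnv)
	if err != nil {
		fmt.Println("cannot provision MCP servers:", err)
		return
	}
	fmt.Println("expanded", len(expanded), "variables")
}
```

Expanded values should stay out of eval results and logs, since they may contain secrets.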
Dependencies
- No Copilot SDK changes required (workspace-based discovery)
- mcp-go v0.45.0 already in go.mod (indirect); may be used for server health checks
- Go 1.26+ (already required)