Improve waza run concurrency: shared Copilot client + auto-sized workers (#135)#221
Conversation
Agent-Logs-Url: https://github.com/microsoft/waza/sessions/55c73a11-8dec-401a-925f-3f2a0536fc78 Co-authored-by: spboyer <7681382+spboyer@users.noreply.github.com>
waza run concurrency
Rubber-duck critique (Opus 4.6)Diagnosis verification — ✅ all three claims verified against code
Blocking issuesB1 — R3 without R4 is dangerous. B2 — R2 design is incomplete: dependency injection for graders is unspecified. B3 — "Process-wide singleton" is wrong for multi-model runs. Non-blocking
Suggestions
Final verdict
Must address before implementation: B1, B2, B3. Can defer: N2, N4, S1. |
|
@copilot implement the plan |
Resolutions captured in a new §8 of the design doc and reflected in the implementation that just landed:
S1 (wall-clock claim) and S2 (judge rate-limits) are noted but not measured in this PR — they fit naturally into the R1 validation plan.
Phase 1 of the plan is implemented in this PR (R3 + R2). R1 (unified worker pool) and R4 (split Highlights:
Race-enabled tests pass for |
waza run concurrencywaza run concurrency: shared Copilot client + auto-sized workers (#135)
spboyer
left a comment
There was a problem hiding this comment.
This improves waza run concurrency by sharing the Copilot SDK client and auto-sizing workers; I found one session cleanup issue and one doc mismatch to address.
Issues to address:
- internal/graders/prompt_grader.go:70 - shared-client prompt grader sessions are disconnected but not deleted
- docs/design/135-improve-concurrency.md:230 - lifecycle docs reference a stale SharedClient signature and shutdown hook
Resolve PR #221 conflicts with current main and keep prompt graders on the shared Copilot execution path while preserving ephemeral session cleanup. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR broadens Waza’s evaluation/runtime surface: it introduces shared Copilot SDK client lifecycle and new execution request controls, adjusts concurrency defaults via worker auto-sizing, and extends project configuration + documentation (including .agent.md support and instruction_files) alongside assorted web/dashboard tooling updates.
Changes:
- Add process-wide shared Copilot SDK client plumbing and extend execution/session event handling to support grader-style ephemeral sessions and richer request options.
- Auto-size/clamp worker concurrency for task/trigger phases and introduce worker resolution helpers.
- Expand project config/schema/docs/examples (custom eval/task naming,
instruction_files,.agent.mddiscovery) and update web tooling (ESLint, Playwright port config).
Show a summary per file
| File | Description |
|---|---|
| web/src/hooks/useSSE.ts | Fix SSE run state reset logic using a ref-backed current run. |
| web/src/components/TrajectoryViewer.tsx | Refactor event description derivation to a single expression. |
| web/src/components/TrajectoryDiff.tsx | Make tool alignment helper internal (non-exported). |
| web/src/components/RunsTable.tsx | Add ESLint disable directive for TanStack Table usage. |
| web/playwright.config.ts | Allow configuring Playwright preview port via env var. |
| web/package.json | Add ESLint + related tooling dependencies. |
| web/eslint.config.js | Introduce flat ESLint config for TS/React hooks/refresh rules. |
| web/e2e/run-detail.spec.ts | Remove unused fixture import. |
| web/e2e/fixtures/mock-data.ts | Remove unused Playwright Page type import. |
| web/e2e/dashboard.spec.ts | Remove unused fixture imports. |
| web/dist/index.html | Update built dashboard entrypoint asset hash reference. |
| version.txt | Bump repo version. |
| site/src/content/docs/reference/waza-yaml.mdx | Document files config section in YAML reference. |
| site/src/content/docs/reference/statistical-fields.mdx | Clarify per-run flat usage block vs aggregate usage. |
| site/src/content/docs/reference/schema.mdx | Document instruction_files and files sections in schema reference. |
| site/src/content/docs/quick-start.mdx | Update onboarding for skills vs custom agents and install flows. |
| site/src/content/docs/index.mdx | Clarify install command context (Bash environments). |
| site/src/content/docs/guides/graders.mdx | Add note about prompt grader mapping + tool_constraint auto-injection. |
| site/src/content/docs/getting-started.mdx | Update installation guidance and remove SKILL type field mention. |
| site/package.json | Bump Astro version. |
| site/astro.config.mjs | Add “Evaluating Custom Agents” to Guides nav. |
| schemas/task.schema.json | Add instruction_files support at task level. |
| schemas/eval.schema.json | Add instruction_files support at eval config level. |
| schemas/config.schema.json | Add files naming conventions to project config schema. |
| internal/workspace/workspace_test.go | Add tests for nested skill discovery + custom eval filename resolution. |
| internal/wizard/wizard.go | Remove skill “type” selection; validate name/description in wizard. |
| internal/webapi/store.go | Map transcript events via typed Copilot SDK event data helpers. |
| internal/webapi/additional_test.go | Update web API tests to new Copilot SDK typed event data. |
| internal/validation/schema.go | Fix comment to reflect LoadEvalSpec naming. |
| internal/validation/schema_test.go | Add schema tests for instruction_files and custom-agent example. |
| internal/utils/testdata/sample_events_slog.jsonl | Update sample tool completion event shape to include success. |
| internal/utils/logging_test.go | Update slog mapping test to typed assistant.message event. |
| internal/utils/copilot_log_iterator_test.go | Update iterator tests to typed session event constants/data. |
| internal/trigger/runner.go | Auto-size workers for trigger tests; centralize MCP server conversion. |
| internal/transcript/transcript_test.go | Update transcript tests for typed event data + renamed stimulus type. |
| internal/suggest/suggest.go | Switch eval YAML validation struct to EvalSpec. |
| internal/suggest/suggest_test.go | Update tests for EvalSpec rename. |
| internal/suggest/prompt.go | Update prompt text to reference EvalSpec not BenchmarkSpec. |
| internal/skill/agent.go | Add parsing/loading helpers for .agent.md frontmatter. |
| internal/skill/agent_test.go | Add tests for .agent.md parsing and filename detection. |
| internal/scoring/scoring.go | Include skill body in heuristic scorer trigger/clarity detection. |
| internal/scoring/scoring_test.go | Add coverage for trigger detection in SKILL body content. |
| internal/scaffold/scaffold.go | Read project file naming config; parameterize eval template/task filenames. |
| internal/scaffold/scaffold_test.go | Add tests for custom task glob/suffix and project file naming config. |
| internal/projectconfig/schema_parity_test.go | Assert schema defaults match new files Go defaults. |
| internal/projectconfig/config.go | Add files config, defaults, and validation. |
| internal/projectconfig/config_test.go | Add coverage for files defaults/merge and invalid config errors. |
| internal/orchestration/workers.go | Add worker auto-sizing/clamping utility. |
| internal/orchestration/workers_test.go | Test worker resolution behavior and cap logging. |
| internal/orchestration/skill_discovery.go | Extend discovery to support .agent.md in skill directories. |
| internal/orchestration/skill_discovery_test.go | Add discovery tests for .agent.md and SKILL priority. |
| internal/orchestration/csv_integration_test.go | Update orchestration CSV tests to EvalSpec/EvalRunner. |
| internal/orchestration/agent_graders.go | Inject implicit tool_constraint grader for agents declaring tools. |
| internal/models/testcase.go | Add instruction_files; rename stimulus/expectation types with aliases. |
| internal/models/testcase_test.go | Add test for loading instruction_files. |
| internal/models/outcome.go | Add per-run Usage pointer on RunResult. |
| internal/models/grader_validation_test.go | Update to LoadEvalSpec. |
| internal/models/grader_params_test.go | Update to LoadEvalSpec. |
| internal/models/events_test.go | Update transcript event JSON round-trip to typed data structures. |
| internal/models/baseline_test.go | Update baseline YAML serialization test to EvalSpec. |
| internal/jsonrpc/handlers.go | Update JSON-RPC handlers to load EvalSpec. |
| internal/graders/trigger_grader_test.go | Update tests to renamed stimulus type. |
| internal/graders/run_test.go | Update tests to renamed expectation type. |
| internal/graders/inline_script_grader_test.go | Update to typed assistant message event data. |
| internal/graders/grader.go | Add Executor interface to grader context. |
| internal/execution/usage.go | Populate per-run Usage and re-aggregate from RunResult.Usage. |
| internal/execution/skill_injection_test.go | Add tests for instruction system message formatting. |
| internal/execution/session_usage_collector.go | Update usage collection to typed event data helpers. |
| internal/execution/session_events_collector.go | Update event parsing to typed Copilot SDK event data helpers. |
| internal/execution/session_events_collector_test.go | Update tests to typed SDK data + add helper for error messages. |
| internal/execution/sdkclient.go | Add shared singleton Copilot SDK client API. |
| internal/execution/sdkclient_test.go | Add tests for shared client singleton + shutdown idempotency. |
| internal/execution/mock.go | Expand mock output to echo task/context/resources; support skipping workspace capture. |
| internal/execution/mock_engine_test.go | Add tests for mock output including resource content/truncation. |
| internal/execution/engine.go | Expand execution request options (instructions/tools/streaming/ephemeral/etc). |
| internal/execution/engine_response_test.go | Update message extraction tests for typed assistant message events. |
| internal/execution/copilot_engine_test.go | Add tests for grader-style ephemeral session behaviors and options. |
| internal/embedded/zcopilot_windows_arm64.go | Bump embedded Copilot CLI bundle version/hash (Windows ARM64). |
| internal/embedded/zcopilot_windows_amd64.go | Bump embedded Copilot CLI bundle version/hash (Windows AMD64). |
| internal/embedded/zcopilot_linux_arm64.go | Bump embedded Copilot CLI bundle version/hash (Linux ARM64). |
| internal/embedded/zcopilot_linux_amd64.go | Bump embedded Copilot CLI bundle version/hash (Linux AMD64). |
| internal/embedded/zcopilot_darwin_arm64.go | Bump embedded Copilot CLI bundle version/hash (macOS ARM64). |
| internal/embedded/zcopilot_darwin_amd64.go | Bump embedded Copilot CLI bundle version/hash (macOS AMD64). |
| internal/embedded/zcopilot_1.0.49_windows_arm64.exe.zst | Add new embedded Copilot CLI binary blob (LFS pointer). |
| internal/embedded/zcopilot_1.0.49_windows_arm64.exe.license | Add license for embedded Copilot CLI (Windows ARM64). |
| internal/embedded/zcopilot_1.0.49_windows_amd64.exe.zst | Add new embedded Copilot CLI binary blob (LFS pointer). |
| internal/embedded/zcopilot_1.0.49_windows_amd64.exe.license | Add license for embedded Copilot CLI (Windows AMD64). |
| internal/embedded/zcopilot_1.0.49_linux_arm64.zst | Add new embedded Copilot CLI binary blob (LFS pointer). |
| internal/embedded/zcopilot_1.0.49_linux_arm64.license | Add license for embedded Copilot CLI (Linux ARM64). |
| internal/embedded/zcopilot_1.0.49_linux_amd64.zst | Add new embedded Copilot CLI binary blob (LFS pointer). |
| internal/embedded/zcopilot_1.0.49_linux_amd64.license | Add license for embedded Copilot CLI (Linux AMD64). |
| internal/embedded/zcopilot_1.0.49_darwin_arm64.zst | Add new embedded Copilot CLI binary blob (LFS pointer). |
| internal/embedded/zcopilot_1.0.49_darwin_arm64.license | Add license for embedded Copilot CLI (macOS ARM64). |
| internal/embedded/zcopilot_1.0.49_darwin_amd64.zst | Add new embedded Copilot CLI binary blob (LFS pointer). |
| internal/embedded/zcopilot_1.0.49_darwin_amd64.license | Add license for embedded Copilot CLI (macOS AMD64). |
| internal/embedded/zcopilot_1.0.2_windows_arm64.exe.zst | Remove old embedded Copilot CLI binary blob (LFS pointer). |
| internal/embedded/zcopilot_1.0.2_windows_amd64.exe.zst | Remove old embedded Copilot CLI binary blob (LFS pointer). |
| internal/embedded/zcopilot_1.0.2_linux_arm64.zst | Remove old embedded Copilot CLI binary blob (LFS pointer). |
| internal/embedded/zcopilot_1.0.2_linux_amd64.zst | Remove old embedded Copilot CLI binary blob (LFS pointer). |
| internal/embedded/zcopilot_1.0.2_darwin_arm64.zst | Remove old embedded Copilot CLI binary blob (LFS pointer). |
| internal/embedded/zcopilot_1.0.2_darwin_amd64.zst | Remove old embedded Copilot CLI binary blob (LFS pointer). |
| internal/embedded/generate/generate.go | Allow pinning Copilot CLI bundler version via env var. |
| internal/discovery/discovery.go | Respect configured eval filename during skill eval discovery. |
| internal/discovery/discovery_test.go | Add tests for discovery using custom eval filename. |
| internal/copilotconfig/mcp.go | Centralize robust MCP server config conversion/validation. |
| internal/config/config.go | Rename BenchmarkConfig → EvalConfig with backward-compatible alias. |
| internal/config/config_test.go | Update config tests to EvalConfig naming. |
| internal/cache/cache.go | Update cache key + grader checks to accept EvalSpec. |
| install.sh | Detect WSL and print guidance for native Windows installs. |
| go.sum | Bump copilot-sdk module version checksums. |
| go.mod | Bump copilot-sdk module version and adjust semver dependency placement. |
| extension.yaml | Bump azd extension version. |
| examples/README.md | Add custom-agent/required-skills-demo/rubrics sections to examples index. |
| examples/custom-agent/trigger_tests.yaml | Add trigger tests for the custom-agent example. |
| examples/custom-agent/tasks/review-clean-code.yaml | Add example task fixture for “clean code” review. |
| examples/custom-agent/tasks/find-xss.yaml | Add example task fixture for XSS detection. |
| examples/custom-agent/tasks/find-sql-injection.yaml | Add example task fixture for SQL injection detection. |
| examples/custom-agent/security-reviewer.agent.md | Add example .agent.md custom agent definition. |
| examples/custom-agent/README.md | Add documentation for the custom-agent example. |
| examples/custom-agent/fixtures/xss.html | Add XSS fixture HTML file. |
| examples/custom-agent/fixtures/vulnerable.py | Add SQL injection fixture Python file. |
| examples/custom-agent/fixtures/clean.go | Add “clean” fixture Go file (excluded from builds). |
| examples/custom-agent/eval.yaml | Add example eval spec targeting a custom agent. |
| examples/ci/README.md | Update CI docs to reference binary install method. |
| docs/TUTORIAL.md | Update tutorial to mention custom agents and refine install steps. |
| docs/SKILLS_CI_INTEGRATION.md | Update CI integration to prefer binary install and revise troubleshooting. |
| docs/RELEASE.md | Update release instructions to tag-push-first workflow narrative. |
| docs/GUIDE.md | Update guide for .agent.md support and new install/source build guidance. |
| docs/GETTING-STARTED.md | Update getting started to mention custom agents and remove SKILL type. |
| docs/DEMO-GUIDE.md | Update demo guide install instructions wording and minor cleanup. |
| DEMO-SCRIPT.md | Update demo script install comment. |
| cmd/waza/workspace_resolve.go | Pass configured eval filename into workspace detection; improve error text. |
| cmd/waza/workspace_resolve_test.go | Add test for configured eval filename resolution. |
| cmd/waza/tokens/suggest.go | Ensure shared Copilot client is shutdown on suggest completion. |
| cmd/waza/tokens/helpers.go | Pass configured eval filename into token command detection. |
| cmd/waza/newtask/converters.go | Update copilot log parsing to typed event data helpers. |
| cmd/waza/newtask/converters_test.go | Update test case stimulus naming. |
| cmd/waza/dev/loop.go | Pass configured eval filename into dev workspace detection. |
| cmd/waza/dev/copilot.go | Shutdown shared Copilot client; use project config for workspace detection. |
| cmd/waza/cmd_suggest.go | Shutdown shared Copilot client after suggest command. |
| cmd/waza/cmd_serve.go | Only start MCP stdio server when stdin is a terminal to avoid EOF crash. |
| cmd/waza/cmd_run_test.go | Update runner injection to EvalRunner and config types. |
| cmd/waza/cmd_run_signal_test.go | Add SIGTERM cancellation integration test for waza run. |
| cmd/waza/cmd_quality.go | Shutdown shared Copilot client after quality command. |
| cmd/waza/cmd_new.go | Respect project file naming config in scaffolding; remove SKILL type metadata. |
| cmd/waza/cmd_new_test.go | Add tests for custom file naming; assert SKILL frontmatter no longer includes type. |
| cmd/waza/cmd_new_task.go | Ensure shared Copilot client shutdown in new-task command teardown. |
| cmd/waza/cmd_new_task_test.go | Update test case stimulus naming. |
| cmd/waza/cmd_models.go | Ensure shared Copilot client is shutdown in models command teardown. |
| cmd/waza/cmd_grade.go | Update to LoadEvalSpec and config types. |
| cmd/waza/cmd_eval_test.go | Add test verifying eval scaffolding respects custom file naming config. |
| cmd/waza/cmd_coverage.go | Extend coverage discovery to .agent.md while preserving SKILL priority. |
| cmd/waza/cmd_coverage_test.go | Add coverage tests for agent discovery and SKILL priority. |
| cmd/waza/cmd_check.go | Use configured detect options when attempting workspace auto-detection. |
| cmd/waza/cmd_check_workspace_test.go | Add tests for nested skill path detection with separated evals. |
| .squad/decisions/inbox/saul-doc-audit-2025-07-15.md | Add doc audit decision record for onboarding entry points. |
| .squad/decisions/inbox/livingston-custom-agent-docs.md | Add decision record for custom agent docs structure. |
| .squad/decisions/inbox/linus-mock-includes-resources.md | Add decision record for mock engine echo behavior. |
| .squad/decisions/inbox/linus-custom-agent-support.md | Add decision record for representing agents as SkillInfo. |
| .squad/decisions/inbox/linus-agent-tool-injection.md | Add decision record for implicit tool_constraint injection. |
| .squad/agents/saul/history.md | Update documentation audit history notes. |
| .squad/agents/livingston/history.md | Update Livingston’s history with custom agent docs completion. |
| .github/workflows/waza-eval.yml | Expand workflow path filters to include execution/orchestration/grader changes. |
| .github/workflows/pages.yml | Trigger Pages build after successful Release; bump Node version. |
Copilot's findings
Files not reviewed (2)
- site/package-lock.json: Language not supported
- web/package-lock.json: Language not supported
- Files reviewed: 192/202 changed files
- Comments generated: 8
Resolve latest main conflict by preserving BYOK provider wiring alongside the shared Copilot client lifecycle. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Check the capping notice write result to satisfy errcheck. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use utils.Ptr for shared client bool options and make the worker auto-size test safe on single-core runners. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
spboyer
left a comment
There was a problem hiding this comment.
The requested changes have been addressed and the review threads are resolved.
waza run --workers 6on a suite of 3 tasks + 3 trigger tests only ever runs 3 concurrent sessions because tasks and triggers are executed in serial phases, each with its own worker semaphore. The issue asked for research and recommendations; this PR adds the design doc and ships Phase 1 of the plan (R2 + R3).Adds
docs/design/135-improve-concurrency.md(following the existingdocs/design/convention) capturing the diagnosis and a phased proposal, including a §8 "Critique resolutions" section that addresses the B1/B2/B3 review feedback.Diagnosis
cmd/waza/cmd_run.go:runner.RunBenchmarkfully drains beforetrigger.Runner.RunDetailedstarts. Each phase has its own semaphore sized atWorkers, soWorkers > items_per_phaseis unused capacity.internal/graders/prompt_grader.go(gradeIndependent,runPairwiseOnce) callscopilot.NewClient(...)+client.Stop()every invocation, spawning/tearing down the embedded Copilot CLI subprocess each time. The engine itself already shares one client; graders don't.Recommendations (prioritized)
Jobinterface, build the pool once incmd_run.go, haveTestRunnerandtrigger.RunnerexposePlan() []Job, then submit both sets and demux results byJob.Kind(). Fixes the reported 3/6 utilization. Follow-up PR.execution.SharedClient(...)behind async.Once, switchprompt_graderto it, and drop the per-callStop(). Removes the dominant local-run latency. Implemented in this PR.--workersdefault tomin(NumCPU, totalJobs, 8)(currently hard-coded4) and log a one-line "capped to N (phase)" notice. Implemented in this PR. The conservativeDefaultWorkerCap = 8ceiling addresses B1 from the design review until R4 lands.--workers(job concurrency) from--sessions(Copilot session concurrency) as a second semaphore inside the pool. Defer until the single knob proves insufficient.Implementation in this PR
R2 — shared Copilot SDK client
internal/execution/sdkclient.goexposingSharedClient(opts) CopilotClientandShutdownSharedClient(ctx) error, both guarded bysync.Once.CopilotEnginenow defaults to the shared client and tracks anownsClientflag;Shutdownonly callsclient.Stop()for engines that own their client (test path). The shared SDK process is stopped exactly once via a top-leveldefer execution.ShutdownSharedClient(...)incmd_run.go::runCommandE, so it survives the per---modelengine loop.CopilotEngineexposesCopilotClient()so the runner can plumb the shared client to graders.graders.Contextgains aCopilotClient execution.CopilotClientfield.TestRunner.buildGraderContextpopulates it via an interface assertion against the engine.prompt_grader.gradeIndependentandrunPairwiseOnceuse a new sharedopenGradingSessionhelper that prefersContext.CopilotClientwhen set (no Start/Stop overhead, session isDisconnect'd on cleanup) and falls back to the legacy per-callcopilot.NewClient(...)/Stop()path otherwise so direct grader tests (prompt_grader_test.go) keep working.R3 — auto-sized workers
internal/orchestration/workers.gowithResolveWorkers(requested, jobs, phase, out) intandDefaultWorkerCap = 8.--workersis unset: defaults tomin(runtime.NumCPU(), jobs, 8).--workersexceeds the job count: clamps and emitsworkers=N capped to M (phase).TestRunner.runConcurrentandtrigger.Runner.RunDetailed. The CLI's "Parallel:" line now printsworkers=autowhen the auto-default applies.Design doc updates
Context.CopilotClient), B3 (process-wide singleton lifecycle decoupled from per-engineShutdown), N4 (use symbol names), and S3 (revised order to R3 → R2 → R1 → R4).Tests
internal/execution/sdkclient_test.go—SharedClientreturns the same instance,ShutdownSharedClientis idempotent and safe when no client was constructed.internal/orchestration/workers_test.go— auto-sizing respects the cap, explicit values are clamped to job count and emit a notice, no notice when within budget.internal/execution,internal/orchestration,internal/trigger,internal/graders, andcmd/waza.Open questions for reviewers
Jobinterface ininternal/orchestration, or keep packages self-contained and havePlan()return work descriptors that the CLI hands to a thin pool? (R1 follow-up)Clientacross N concurrentCreateSessioncalls without per-callStart/Stop. The implementation now relies on this.