Summary
The Waza Evaluation CI workflow (waza-eval.yml) fails on main for the default examples/code-explainer/eval.yaml run. All 4 tasks fail with _output_contains: matched 0/N required strings and avg_dur=0ms.
Reproduction
git clone --depth 1 https://github.com/microsoft/waza.git /tmp/waza-main
cd /tmp/waza-main
go build -o waza ./cmd/waza
./waza run examples/code-explainer/eval.yaml --context-dir examples/code-explainer/fixtures
Result:
Total Tests: 4
Succeeded: 0
Failed: 4
Success Rate: 0.0%
Duration: 666ms
✗ Explain JavaScript Async/Await — _output_contains missing: [async fetch]
✗ Explain List Comprehension — _output_contains missing: [list]
✗ Explain Python Recursion — _output_contains missing: [recursive factorial]
✗ Explain SQL JOIN Query — _output_contains missing: [join]
Trigger accuracy: 100% — only the task execution / output_contains check is broken.
Suspected cause
Likely fallout from the recent vocabulary rename refactor (BenchmarkSpec→EvalSpec, TestRunner→EvalRunner — #222) merged on 2026-04-28. The previous successful eval CI run was on 2026-04-21 (PR #137).
The mock engine returns "Mock response for: <user prompt>" — that string never contained the expected substrings (e.g. 'async', 'fetch'). So either:
- The expectations should be looking at
req.Resources (file content) instead of req.Message
- The mock engine previously echoed file contents and a recent refactor dropped that behavior
- The expectations themselves were misaligned with the mock and only worked accidentally before
Impact
Repro on PR #226
PR #226 surfaced this — the CI failure is identical to the one reproducible on a fresh main clone.
Summary
The
Waza EvaluationCI workflow (waza-eval.yml) fails onmainfor the defaultexamples/code-explainer/eval.yamlrun. All 4 tasks fail with_output_contains: matched 0/N required stringsandavg_dur=0ms.Reproduction
git clone --depth 1 https://github.com/microsoft/waza.git /tmp/waza-main cd /tmp/waza-main go build -o waza ./cmd/waza ./waza run examples/code-explainer/eval.yaml --context-dir examples/code-explainer/fixturesResult:
Trigger accuracy: 100% — only the task execution / output_contains check is broken.
Suspected cause
Likely fallout from the recent vocabulary rename refactor (
BenchmarkSpec→EvalSpec,TestRunner→EvalRunner— #222) merged on 2026-04-28. The previous successful eval CI run was on 2026-04-21 (PR #137).The mock engine returns
"Mock response for: <user prompt>"— that string never contained the expected substrings (e.g. 'async', 'fetch'). So either:req.Resources(file content) instead ofreq.MessageImpact
examples/**orskills/**(paths that triggerwaza-eval.yml)Repro on PR #226
PR #226 surfaced this — the CI failure is identical to the one reproducible on a fresh
mainclone.