bug: Waza Evaluation CI fails on main — code-explainer mock eval returns 0% pass rate

## Summary

The `Waza Evaluation` CI workflow (`waza-eval.yml`) fails on `main` for the default `examples/code-explainer/eval.yaml` run. All 4 tasks fail with `_output_contains: matched 0/N required strings` and `avg_dur=0ms`.

## Reproduction

```bash
git clone --depth 1 https://github.com/microsoft/waza.git /tmp/waza-main
cd /tmp/waza-main
go build -o waza ./cmd/waza
./waza run examples/code-explainer/eval.yaml --context-dir examples/code-explainer/fixtures
```

Result:
```
Total Tests:    4
Succeeded:      0
Failed:         4
Success Rate:   0.0%
Duration:       666ms

  ✗ Explain JavaScript Async/Await — _output_contains missing: [async fetch]
  ✗ Explain List Comprehension     — _output_contains missing: [list]
  ✗ Explain Python Recursion       — _output_contains missing: [recursive factorial]
  ✗ Explain SQL JOIN Query         — _output_contains missing: [join]
```

Trigger accuracy is 100%, so only the task execution / `output_contains` check is broken.

## Suspected cause

Likely fallout from the vocabulary-rename refactor (#222: `BenchmarkSpec`→`EvalSpec`, `TestRunner`→`EvalRunner`), merged on 2026-04-28. The last successful eval CI run was on 2026-04-21 (PR #137).

The mock engine returns `"Mock response for: <user prompt>"`. Unless the prompt itself includes them, that string cannot contain the expected substrings (e.g. `async`, `fetch`), so one of the following is likely true:
1. The expectations should be looking at `req.Resources` (file content) instead of `req.Message`
2. The mock engine previously echoed file contents and a recent refactor dropped that behavior
3. The expectations themselves were misaligned with the mock and only worked accidentally before
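
To make the hypotheses above concrete, here is a minimal, self-contained Go sketch. It does not use Waza's real types: `Request`, `mockResponse`, `mockResponseWithResources`, and `countMatches` are hypothetical stand-ins, the shape of `Resources` is assumed, and the case-insensitive substring match is a guess at how `output_contains` behaves. The prompt and fixture text are also made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Request is a hypothetical stand-in for the engine request. The field names
// Message and Resources come from this issue; their types are assumed.
type Request struct {
	Message   string
	Resources []string // assumed: raw contents of the fixture files
}

// mockResponse mirrors the behavior described above: it echoes only the prompt.
func mockResponse(req Request) string {
	return "Mock response for: " + req.Message
}

// mockResponseWithResources illustrates hypothesis 2: a mock that also echoes
// the file contents attached to the request.
func mockResponseWithResources(req Request) string {
	return mockResponse(req) + "\n" + strings.Join(req.Resources, "\n")
}

// countMatches returns how many required strings occur in output.
// Case-insensitive matching is an assumption, not confirmed Waza behavior.
func countMatches(output string, required []string) int {
	lower := strings.ToLower(output)
	matched := 0
	for _, s := range required {
		if strings.Contains(lower, strings.ToLower(s)) {
			matched++
		}
	}
	return matched
}

func main() {
	req := Request{
		Message:   "Explain this code",
		Resources: []string{"async function load(url) { return await fetch(url); }"},
	}
	required := []string{"async", "fetch"}

	// Prints "prompt only:    matched 0/2"
	fmt.Printf("prompt only:    matched %d/%d\n",
		countMatches(mockResponse(req), required), len(required))
	// Prints "with resources: matched 2/2"
	fmt.Printf("with resources: matched %d/%d\n",
		countMatches(mockResponseWithResources(req), required), len(required))
}
```

If the real mock engine behaves like `mockResponse`, the `matched 0/N` failures follow directly, and the remaining question is whether the mock or the expectations changed between 2026-04-21 and 2026-04-28.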

## Impact

- Blocks any PR that touches `examples/**` or `skills/**` (paths that trigger `waza-eval.yml`)
- E.g., PR #226 (custom agent support) shows this as a CI failure even though the change is unrelated

## Repro on PR #226

PR #226 surfaced this — the CI failure is identical to the one reproducible on a fresh `main` clone.
