bug: Waza Evaluation CI fails on main — code-explainer mock eval returns 0% pass rate #227

Description

@spboyer

Summary

The Waza Evaluation CI workflow (waza-eval.yml) fails on main for the default examples/code-explainer/eval.yaml run. All 4 tasks fail with _output_contains: matched 0/N required strings and avg_dur=0ms.

Reproduction

git clone --depth 1 https://github.com/microsoft/waza.git /tmp/waza-main
cd /tmp/waza-main
go build -o waza ./cmd/waza
./waza run examples/code-explainer/eval.yaml --context-dir examples/code-explainer/fixtures

Result:

Total Tests:    4
Succeeded:      0
Failed:         4
Success Rate:   0.0%
Duration:       666ms

  ✗ Explain JavaScript Async/Await — _output_contains missing: [async fetch]
  ✗ Explain List Comprehension     — _output_contains missing: [list]
  ✗ Explain Python Recursion       — _output_contains missing: [recursive factorial]
  ✗ Explain SQL JOIN Query         — _output_contains missing: [join]

Trigger accuracy: 100% — only the task execution / output_contains check is broken.

Suspected cause

Likely fallout from the recent vocabulary rename refactor (BenchmarkSpec → EvalSpec, TestRunner → EvalRunner, #222) merged on 2026-04-28. The previous successful eval CI run was on 2026-04-21 (PR #137).

The mock engine returns "Mock response for: <user prompt>", a string that does not contain the expected substrings (e.g. 'async', 'fetch'). So either:

  1. The expectations should be looking at req.Resources (file content) instead of req.Message
  2. The mock engine previously echoed file contents and a recent refactor dropped that behavior
  3. The expectations themselves were misaligned with the mock and only worked accidentally before

Impact

Repro on PR #226

PR #226 surfaced this — the CI failure is identical to the one reproducible on a fresh main clone.