feat: Add multi-trial flakiness detection for evals by spboyer · Pull Request #103 · microsoft/waza

spboyer · 2026-03-10T13:29:43Z

Reopened from #91 (approved, but GitHub's merge cache was stuck showing CONFLICTING despite no actual conflicts).

Original PR: #91

Closes #84

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Addresses wbreza review feedback on PR #91. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ide test
- Updated computeTestStats to separate StatusError from StatusFailed buckets
- Updated unit tests to verify error/fail separation
- Added cmd/waza test for --trials flag override logic
- Fixes double-counting of error runs as failures

Copilot

Pull request overview

Adds multi-trial (“flakiness”) support to eval runs by introducing a --trials CLI override and expanding per-task statistics so results can report pass rate, run counts, and a flakiness percentage.

Changes:

Add --trials flag to waza run, with validation and spec override behavior.
Extend TestStats + computeTestStats() to track passed/failed/error/total runs and compute flakiness_percent.
Update CLI output + docs and add tests covering the new stats/flag behavior.
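The bucket separation described above can be sketched as follows. This is a minimal illustration, not the PR's `computeTestStats()`: the `TrialStats` struct, the `summarizeTrials` function, the `StatusPassed` name, and especially the flakiness formula (the share of runs that disagree with the majority pass/non-pass outcome) are assumptions, since the page does not show the actual definition.

```go
package main

import "fmt"

// Status is a hypothetical stand-in for the per-run outcome type.
// Only StatusError and StatusFailed are named in the PR text.
type Status string

const (
	StatusPassed Status = "passed" // assumed name
	StatusFailed Status = "failed"
	StatusError  Status = "error"
)

// TrialStats is an illustrative struct, not the PR's TestStats.
type TrialStats struct {
	Passed, Failed, Errored, Total int
	PassRate                       float64
	FlakinessPercent               float64
}

// summarizeTrials counts each run in exactly one bucket, so an errored
// run is never also counted as a failure.
func summarizeTrials(runs []Status) TrialStats {
	s := TrialStats{Total: len(runs)}
	for _, r := range runs {
		switch r {
		case StatusPassed:
			s.Passed++
		case StatusFailed:
			s.Failed++
		case StatusError:
			s.Errored++
		}
	}
	if s.Total == 0 {
		return s
	}
	s.PassRate = float64(s.Passed) / float64(s.Total)

	// Assumed definition: percentage of runs in the minority outcome
	// (pass vs. non-pass). 0 means every trial agreed.
	minority := s.Passed
	if other := s.Total - s.Passed; other < minority {
		minority = other
	}
	s.FlakinessPercent = 100 * float64(minority) / float64(s.Total)
	return s
}

func main() {
	runs := []Status{StatusPassed, StatusFailed, StatusError, StatusPassed, StatusPassed}
	fmt.Printf("%+v\n", summarizeTrials(runs))
}
```

Under this assumed definition a task that always passes or always fails reports 0% flakiness, and a task that splits evenly reports 50%.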

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file

| File | Description |
| --- | --- |
| site/src/content/docs/reference/cli.mdx | Documents the new `--trials` flag for `waza run`. |
| README.md | Adds `--trials` to the CLI reference table. |
| cmd/waza/cmd_run.go | Introduces `--trials` flag, validates it, and conditionally overrides `config.trials_per_task`; prints flakiness percent in summaries. |
| cmd/waza/cmd_run_test.go | Adds tests for `--trials` parsing/validation/override and a nil-cmd override path. |
| internal/orchestration/runner.go | Updates `computeTestStats()` to separate passed/failed/error runs and compute `FlakinessPercent`. |
| internal/orchestration/runner_test.go | Adds regression test ensuring `StatusError` runs don’t count as failed. |
| internal/orchestration/runner_orchestration_test.go | Adds a mixed-outcome flakiness percent test for `computeTestStats()`. |
| internal/models/outcome.go | Extends `TestStats` JSON shape with run counts and `flakiness_percent`. |
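To make the extended JSON shape concrete, here is a small sketch of how such a struct might serialize. Only the `flakiness_percent` key is confirmed by the summary above; the `statsJSON` type and every other field name and tag are hypothetical placeholders for the "run counts" it mentions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// statsJSON is a hypothetical shape. Only the flakiness_percent key
// comes from the PR summary; the remaining keys are assumed.
type statsJSON struct {
	PassRate         float64 `json:"pass_rate"`
	PassedRuns       int     `json:"passed_runs"`
	FailedRuns       int     `json:"failed_runs"`
	ErrorRuns        int     `json:"error_runs"`
	TotalRuns        int     `json:"total_runs"`
	FlakinessPercent float64 `json:"flakiness_percent"`
}

// encodeStats marshals the stats to a compact JSON string.
func encodeStats(s statsJSON) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := encodeStats(statsJSON{
		PassRate: 0.6, PassedRuns: 3, FailedRuns: 1,
		ErrorRuns: 1, TotalRuns: 5, FlakinessPercent: 40,
	})
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	fmt.Println(out)
}
```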

- Fix task YAML schema: use inputs.prompt instead of prompts[]
- Fix gofmt indentation on test function
- Clarify --trials docs: omit to use config, explicit values must be >= 1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

spboyer

All 4 review comments addressed in 810c412:

✅ Fixed gofmt indentation — entire test function properly tabbed
✅ Fixed task YAML schema — uses inputs: { prompt: test } matching TestStimulus struct
✅ Updated CLI docs — clarified omit=use config, explicit values must be >= 1
✅ Updated README — same wording fix for --trials description
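The documented `--trials` behavior (omit the flag to use the config value, and require explicit values to be >= 1) can be sketched as below. This is an assumption-laden illustration using Go's standard `flag` package; the `resolveTrials` helper is hypothetical, and the PR's own code in `cmd/waza/cmd_run.go` is not shown on this page and may be built on a different CLI library.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
)

// resolveTrials is a hypothetical helper. args are the CLI arguments and
// configTrials is the trial count taken from the eval spec.
func resolveTrials(args []string, configTrials int) (int, error) {
	fs := flag.NewFlagSet("run", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep parse errors out of stderr in this sketch
	trials := fs.Int("trials", 0, "trials per task (overrides the config value)")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}

	// Visit only walks flags that were explicitly set, which lets us tell
	// "flag omitted" apart from "flag set to its zero default".
	explicit := false
	fs.Visit(func(f *flag.Flag) {
		if f.Name == "trials" {
			explicit = true
		}
	})

	if !explicit {
		return configTrials, nil // omitted: fall back to the spec
	}
	if *trials < 1 {
		return 0, errors.New("--trials must be >= 1")
	}
	return *trials, nil
}

func main() {
	n, err := resolveTrials([]string{"--trials", "5"}, 3)
	fmt.Println(n, err)
}
```

Tracking whether the flag was explicitly set, rather than testing for a zero value, is what allows an explicit `--trials 0` to be rejected while an omitted flag silently defers to the config.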

Packages waza as an azd extension, allowing users to run waza commands via azd waza <command>. Contributes to microsoft#62

## What's New

- extension.yaml: Extension manifest defining microsoft.azd.waza with the commands init, generate, run, compare, tokens
- build.ps1 / build.sh: Cross-platform build scripts for creating extension binaries (Windows, macOS, Linux across amd64/arm64)
- registry.json: Extension registry metadata for distribution (for testing only)
- version.txt: Version tracking file

## Usage

### Install the custom extension source used while testing

```
azd ext source add -n waza -t url -l "https://raw.githubusercontent.com/wbreza/waza/refs/heads/azd-extension/registry.json"
```

### Install the extension

```
azd extension install microsoft.azd.waza
```

### Run waza commands through azd

```
azd waza run examples/code-explainer/eval.yaml -v
azd waza init my-eval --interactive
```

spboyer and others added 5 commits March 10, 2026 08:58

feat: add multi-trial flakiness detection #84

ad248a0

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: address review feedback on PR #91

72bfc14

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: address PR #91 flakiness review comments

94b2513

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: add clarifying comment on dual-path trials override logic

8db32f5

Addresses wbreza review feedback on PR #91. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

spboyer requested review from chlowell and richardpark-msft as code owners March 10, 2026 13:29

Copilot AI review requested due to automatic review settings March 10, 2026 13:29

github-actions Bot enabled auto-merge (squash) March 10, 2026 13:30

Copilot started reviewing on behalf of spboyer March 10, 2026 13:30

Copilot AI reviewed Mar 10, 2026


Comment thread cmd/waza/cmd_run_test.go Outdated

Comment thread cmd/waza/cmd_run_test.go Outdated

Comment thread site/src/content/docs/reference/cli.mdx Outdated

Comment thread README.md Outdated

fix: address PR #103 review feedback

810c412


spboyer commented Mar 10, 2026


chlowell approved these changes Mar 10, 2026


github-actions Bot merged commit bac0893 into main Mar 10, 2026
6 checks passed

spboyer mentioned this pull request Mar 12, 2026

Release v0.21.0 #122

Merged

4 tasks

spboyer mentioned this pull request Apr 22, 2026

feat: Multi-trial flakiness detection for evals #84

Closed

6 tasks

github-actions[bot] merged 6 commits into main from squad/84-flakiness-v2
