feat: Add multi-trial flakiness detection for evals #103

Merged
github-actions[bot] merged 6 commits into main from squad/84-flakiness-v2 on Mar 10, 2026

@spboyer (Member) commented Mar 10, 2026

Reopened from #91 (approved, but GitHub's merge cache was stuck showing CONFLICTING despite no actual conflicts).

Original PR: #91

Closes #84

spboyer and others added 5 commits March 10, 2026 08:58
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addresses wbreza review feedback on PR #91.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ide test

- Updated computeTestStats to separate StatusError from StatusFailed buckets
- Updated unit tests to verify error/fail separation
- Added cmd/waza test for --trials flag override logic
- Fixes double-counting of error runs as failures
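
The error/fail separation described in this commit message can be sketched as follows. This is a minimal illustration, not the PR's code: `StatusFailed` and `StatusError` are named in the commit, while `Status`, `StatusPassed`, `RunCounts`, and `CountRuns` are hypothetical names introduced here.

```go
package main

import "fmt"

// Status is a hypothetical stand-in for the run status type used in the PR.
type Status string

const (
	StatusPassed Status = "passed" // assumed name and value
	StatusFailed Status = "failed" // name from the PR, value assumed
	StatusError  Status = "error"  // name from the PR, value assumed
)

// RunCounts holds one bucket per outcome (hypothetical type).
type RunCounts struct {
	Passed, Failed, Errors, Total int
}

// CountRuns puts each run into exactly one bucket, so an errored run is
// never also counted as a failure. Unknown statuses only increase Total.
func CountRuns(statuses []Status) RunCounts {
	var c RunCounts
	for _, s := range statuses {
		c.Total++
		switch s {
		case StatusPassed:
			c.Passed++
		case StatusFailed:
			c.Failed++
		case StatusError:
			c.Errors++
		}
	}
	return c
}

func main() {
	runs := []Status{StatusPassed, StatusError, StatusFailed, StatusPassed}
	fmt.Printf("%+v\n", CountRuns(runs))
}
```

Using a `switch` with one case per status makes the buckets mutually exclusive by construction, which is the property the double-counting fix needs.
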
Copilot AI review requested due to automatic review settings March 10, 2026 13:29
@github-actions github-actions Bot enabled auto-merge (squash) March 10, 2026 13:30

Copilot AI (Contributor) left a comment

Pull request overview

Adds multi-trial (“flakiness”) support to eval runs by introducing a --trials CLI override and expanding per-task statistics so results can report pass rate, run counts, and a flakiness percentage.

Changes:

  • Add --trials flag to waza run, with validation and spec override behavior.
  • Extend TestStats + computeTestStats() to track passed/failed/error/total runs and compute flakiness_percent.
  • Update CLI output + docs and add tests covering the new stats/flag behavior.
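
To make the stats change concrete, here is a rough sketch of how run counts could feed a pass rate and a flakiness percentage. The PR text does not state the formula, so the definition below (share of runs that disagree with the majority pass/not-pass outcome) is an assumption, as are `NewTestStats` and every field name and JSON tag except `flakiness_percent`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TestStats is a hypothetical, trimmed-down version of the struct in the PR.
// Only the flakiness_percent JSON key appears in the PR text; the other
// field names and tags are assumptions made for this sketch.
type TestStats struct {
	PassedRuns       int     `json:"passed_runs"`
	FailedRuns       int     `json:"failed_runs"`
	ErrorRuns        int     `json:"error_runs"`
	TotalRuns        int     `json:"total_runs"`
	PassRate         float64 `json:"pass_rate"`
	FlakinessPercent float64 `json:"flakiness_percent"`
}

// NewTestStats derives the rates from the raw counts.
// Assumed definition: flakiness is the percentage of runs in the minority
// outcome (pass vs. not pass), so 0 means fully consistent and 50 is the
// maximum.
func NewTestStats(passed, failed, errored int) TestStats {
	total := passed + failed + errored
	s := TestStats{PassedRuns: passed, FailedRuns: failed, ErrorRuns: errored, TotalRuns: total}
	if total == 0 {
		return s // avoid dividing by zero when nothing ran
	}
	minority := passed
	if notPassed := total - passed; notPassed < minority {
		minority = notPassed
	}
	s.PassRate = float64(passed) / float64(total)
	s.FlakinessPercent = 100 * float64(minority) / float64(total)
	return s
}

func main() {
	out, _ := json.Marshal(NewTestStats(3, 1, 0))
	fmt.Println(string(out))
}
```

Under this assumed definition, a task that passes every trial and a task that fails every trial both report 0, since neither is inconsistent; only mixed outcomes raise the value.
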

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| site/src/content/docs/reference/cli.mdx | Documents the new --trials flag for waza run. |
| README.md | Adds --trials to the CLI reference table. |
| cmd/waza/cmd_run.go | Introduces --trials flag, validates it, and conditionally overrides config.trials_per_task; prints flakiness percent in summaries. |
| cmd/waza/cmd_run_test.go | Adds tests for --trials parsing/validation/override and a nil-cmd override path. |
| internal/orchestration/runner.go | Updates computeTestStats() to separate passed/failed/error runs and compute FlakinessPercent. |
| internal/orchestration/runner_test.go | Adds regression test ensuring StatusError runs don’t count as failed. |
| internal/orchestration/runner_orchestration_test.go | Adds a mixed-outcome flakiness percent test for computeTestStats(). |
| internal/models/outcome.go | Extends TestStats JSON shape with run counts and flakiness_percent. |

Comment thread cmd/waza/cmd_run_test.go Outdated
Comment thread cmd/waza/cmd_run_test.go Outdated
Comment thread site/src/content/docs/reference/cli.mdx Outdated
Comment thread README.md Outdated
- Fix task YAML schema: use inputs.prompt instead of prompts[]
- Fix gofmt indentation on test function
- Clarify --trials docs: omit to use config, explicit values must be >= 1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@spboyer (Member, Author) left a comment

All 4 review comments addressed in 810c412:

  1. ✅ Fixed gofmt indentation — entire test function properly tabbed
  2. ✅ Fixed task YAML schema — uses inputs: { prompt: test } matching TestStimulus struct
  3. ✅ Updated CLI docs — clarified omit=use config, explicit values must be >= 1
  4. ✅ Updated README — same wording fix for --trials description
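
The documented --trials rule (omit the flag to use the config value; an explicit value must be >= 1) could look roughly like this. `ResolveTrials` is a hypothetical helper, not the function in cmd/waza/cmd_run.go, and it models "flag omitted" with a nil pointer rather than whatever mechanism the CLI actually uses.

```go
package main

import (
	"errors"
	"fmt"
)

// ResolveTrials picks the trial count for a run (hypothetical helper).
// flagValue is nil when --trials was omitted; configTrials stands in for
// the spec's trials_per_task value.
func ResolveTrials(flagValue *int, configTrials int) (int, error) {
	if flagValue == nil {
		return configTrials, nil // flag omitted: keep the configured value
	}
	if *flagValue < 1 {
		return 0, errors.New("--trials must be >= 1")
	}
	return *flagValue, nil // explicit flag overrides the config
}

func main() {
	five := 5
	n, err := ResolveTrials(&five, 3)
	fmt.Println(n, err)
}
```

Distinguishing "omitted" from "set to zero" is the point of the pointer: a plain int default of 0 would make it impossible to tell an unset flag from an invalid explicit value.
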

@github-actions github-actions Bot merged commit bac0893 into main Mar 10, 2026
6 checks passed
richardpark-msft pushed a commit to richardpark-msft/waza that referenced this pull request Mar 10, 2026
Packages waza as an azd extension, allowing users to run waza commands
via azd waza <command>.
Contributes to microsoft#62

## What's New

- extension.yaml: Extension manifest defining microsoft.azd.waza with commands: init, generate, run, compare, tokens
- build.ps1 / build.sh: Cross-platform build scripts for creating extension binaries (Windows, macOS, Linux across amd64/arm64)
- registry.json: Extension registry metadata for distribution (for testing only)
- version.txt: Version tracking file

## Usage

### Install the custom extension source used while testing
```
azd ext source add -n waza -t url -l "https://raw.githubusercontent.com/wbreza/waza/refs/heads/azd-extension/registry.json"
```
### Install the extension
```
azd extension install microsoft.azd.waza
```

### Run waza commands through azd
```
azd waza run examples/code-explainer/eval.yaml -v
azd waza init my-eval --interactive
```
@spboyer spboyer mentioned this pull request Mar 12, 2026
4 tasks

Development

Successfully merging this pull request may close these issues.

feat: Multi-trial flakiness detection for evals

4 participants