feat: Add multi-trial flakiness detection for evals #103
Merged
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addresses wbreza review feedback on PR #91. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ide test
- Updated computeTestStats to separate StatusError from StatusFailed buckets
- Updated unit tests to verify error/fail separation
- Added cmd/waza test for --trials flag override logic
- Fixes double-counting of error runs as failures
Pull request overview
Adds multi-trial (“flakiness”) support to eval runs by introducing a --trials CLI override and expanding per-task statistics so results can report pass rate, run counts, and a flakiness percentage.
Changes:
- Add `--trials` flag to `waza run`, with validation and spec override behavior.
- Extend `TestStats` + `computeTestStats()` to track passed/failed/error/total runs and compute `flakiness_percent`.
- Update CLI output + docs and add tests covering the new stats/flag behavior.
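The override rule described in this PR (omit `--trials` to keep the configured value; an explicit value must be >= 1) can be sketched roughly as below. This is an illustration only: `resolveTrials` and its parameters are hypothetical names, not the actual code in cmd/waza/cmd_run.go.

```go
package main

import (
	"errors"
	"fmt"
)

// resolveTrials is a hypothetical helper (not waza's real function).
// flagSet reports whether --trials was passed explicitly on the command line;
// configTrials stands in for the spec's config.trials_per_task value.
func resolveTrials(flagSet bool, flagValue, configTrials int) (int, error) {
	if !flagSet {
		// Flag omitted: keep whatever the eval spec configured.
		return configTrials, nil
	}
	if flagValue < 1 {
		// Explicit values must be >= 1.
		return 0, errors.New("--trials must be >= 1")
	}
	// Explicit, valid flag value overrides the spec.
	return flagValue, nil
}

func main() {
	for _, tc := range []struct {
		set   bool
		value int
	}{{false, 0}, {true, 5}, {true, 0}} {
		n, err := resolveTrials(tc.set, tc.value, 3)
		fmt.Println(n, err)
	}
}
```

Tracking "was the flag set" separately from its value is what lets an omitted flag fall through to the spec instead of being mistaken for an invalid 0.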
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| site/src/content/docs/reference/cli.mdx | Documents the new --trials flag for waza run. |
| README.md | Adds --trials to the CLI reference table. |
| cmd/waza/cmd_run.go | Introduces --trials flag, validates it, and conditionally overrides config.trials_per_task; prints flakiness percent in summaries. |
| cmd/waza/cmd_run_test.go | Adds tests for --trials parsing/validation/override and a nil-cmd override path. |
| internal/orchestration/runner.go | Updates computeTestStats() to separate passed/failed/error runs and compute FlakinessPercent. |
| internal/orchestration/runner_test.go | Adds regression test ensuring StatusError runs don’t count as failed. |
| internal/orchestration/runner_orchestration_test.go | Adds a mixed-outcome flakiness percent test for computeTestStats(). |
| internal/models/outcome.go | Extends TestStats JSON shape with run counts and flakiness_percent. |
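To make the error/fail separation concrete, here is a minimal sketch of per-task trial bucketing. Everything in it is an assumption for illustration: the `TrialStats` type, the `summarize` function, and the `StatusPassed` constant are hypothetical, and the flakiness formula (share of runs that disagree with the majority pass/non-pass outcome) is only one plausible definition. The PR page does not state how `flakiness_percent` is actually computed.

```go
package main

import "fmt"

// Status mirrors the idea of the run statuses named in the PR.
// StatusFailed and StatusError appear in the PR text; StatusPassed is assumed.
type Status string

const (
	StatusPassed Status = "passed"
	StatusFailed Status = "failed"
	StatusError  Status = "error"
)

// TrialStats is a hypothetical stand-in for the extended TestStats shape.
type TrialStats struct {
	PassedRuns, FailedRuns, ErrorRuns, TotalRuns int
	PassRate                                     float64 // fraction in [0, 1]
	FlakinessPercent                             float64 // ASSUMED formula, see below
}

// summarize buckets each run exactly once, so an errored run is never
// also counted as a failure.
func summarize(runs []Status) TrialStats {
	var s TrialStats
	for _, r := range runs {
		switch r {
		case StatusPassed:
			s.PassedRuns++
		case StatusFailed:
			s.FailedRuns++
		case StatusError:
			s.ErrorRuns++
		}
	}
	s.TotalRuns = s.PassedRuns + s.FailedRuns + s.ErrorRuns
	if s.TotalRuns == 0 {
		return s
	}
	s.PassRate = float64(s.PassedRuns) / float64(s.TotalRuns)

	// ASSUMPTION: flakiness is the share of runs in the minority outcome,
	// so 0% means every trial agreed and 50% is the maximum.
	minority := s.PassedRuns
	if other := s.TotalRuns - s.PassedRuns; other < minority {
		minority = other
	}
	s.FlakinessPercent = 100 * float64(minority) / float64(s.TotalRuns)
	return s
}

func main() {
	stats := summarize([]Status{StatusPassed, StatusPassed, StatusPassed, StatusError})
	fmt.Printf("%+v\n", stats)
}
```

With three passing trials and one errored trial, this sketch reports one error run, zero failed runs, and (under the assumed formula) 25% flakiness.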
- Fix task YAML schema: use inputs.prompt instead of prompts[]
- Fix gofmt indentation on test function
- Clarify --trials docs: omit to use config, explicit values must be >= 1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
spboyer (Member, Author) commented on Mar 10, 2026
All 4 review comments addressed in 810c412:
- ✅ Fixed gofmt indentation: entire test function properly tabbed
- ✅ Fixed task YAML schema: uses `inputs: { prompt: test }`, matching the TestStimulus struct
- ✅ Updated CLI docs: clarified omit=use config, explicit values must be >= 1
- ✅ Updated README: same wording fix for --trials description
chlowell approved these changes on Mar 10, 2026
richardpark-msft pushed a commit to richardpark-msft/waza that referenced this pull request on Mar 10, 2026
Packages waza as an azd extension, allowing users to run waza commands via azd waza <command>. Contributes to microsoft#62

## What's New
- extension.yaml: Extension manifest defining microsoft.azd.waza with commands: init, generate, run, compare, tokens
- build.ps1 / build.sh: Cross-platform build scripts for creating extension binaries (Windows, macOS, Linux across amd64/arm64)
- registry.json: Extension registry metadata for distribution (for testing only)
- version.txt: Version tracking file

## Usage
### Install the custom extension source used while testing
```
azd ext source add -n waza -t url -l "https://raw.githubusercontent.com/wbreza/waza/refs/heads/azd-extension/registry.json"
```
### Install the extension
```
azd extension install microsoft.azd.waza
```
### Run waza commands through azd
```
azd waza run examples/code-explainer/eval.yaml -v
azd waza init my-eval --interactive
```
Reopened from #91 (approved, but GitHub's merge cache was stuck showing CONFLICTING despite no actual conflicts).
Original PR: #91
Closes #84