feat: Add multi-trial flakiness detection for evals by spboyer · Pull Request #91 · microsoft/waza

spboyer · 2026-03-05T01:54:14Z

Partially addresses #84 (Adds CLI flag, but full flakiness config pending)

codecov-commenter · 2026-03-05T01:57:42Z

Codecov Report

❌ Patch coverage is 77.41935% with 7 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@091a1f9). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
internal/orchestration/runner.go	76.19%	5 Missing ⚠️
cmd/waza/cmd_run.go	80.00%	2 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main      #91   +/-   ##
=======================================
  Coverage        ?   72.32%           
=======================================
  Files           ?      129           
  Lines           ?    14459           
  Branches        ?        0           
=======================================
  Hits            ?    10458           
  Misses          ?     3228           
  Partials        ?      773

Flag	Coverage Δ
go-implementation	`72.32% <77.41%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copilot

Pull request overview

Adds CLI and result/statistics enhancements to support running eval tasks multiple times and surfacing flakiness/pass-rate information in outputs.

Changes:

Document a new --trials flag for waza run and add CLI parsing/validation + spec override behavior.
Extend per-task TestStats with run counts and flakiness percentage, and print flakiness in the CLI summary.
Add tests around --trials handling and basic flakiness percent computation.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
site/src/content/docs/reference/cli.mdx	Documents `--trials` flag for multi-trial runs.
internal/orchestration/runner_orchestration_test.go	Adds unit test for flakiness percent/run counts in `computeTestStats`.
internal/orchestration/runner.go	Computes additional per-task stats (`PassedRuns`, `FailedRuns`, `ErrorRuns`, `TotalRuns`, `FlakinessPercent`).
internal/models/outcome.go	Extends `TestStats` JSON model with run counts + `flakiness_percent`.
cmd/waza/cmd_run_test.go	Adds tests for `--trials` parsing, validation, and spec override reflected in output JSON.
cmd/waza/cmd_run.go	Introduces `--trials` flag, validates it, applies it to `spec.Config.RunsPerTest`, and prints flakiness percent in summary.
README.md	Documents `--trials` flag in CLI options table.
.squad/log/2026-03-05T00-36-issue-assignment-pipeline.md	Adds squad session log (process/workflow).
.squad/log/2026-03-05T00-26-rusty-token-diff-design.md	Adds squad session log (token diff strategy).
.squad/decisions.md	Records decisions (includes token-diff strategy + workflow directive).

Comments suppressed due to low confidence (3)

internal/orchestration/runner_orchestration_test.go:350

This test only covers pass/fail runs where Validations is populated, so it won’t catch the common error-path cases where StatusError runs have Validations == nil (e.g., engine.Execute error / grader execution error). Adding an assertion that an error run reduces PassRate and increments ErrorRuns (and does not increment PassedRuns) would prevent regressions in the stats bucketing logic.

func TestComputeTestStats_FlakinessPercent(t *testing.T) {
	runner := NewTestRunner(config.NewBenchmarkConfig(&models.BenchmarkSpec{}), nil)
	runs := []models.RunResult{
		{
			Status:     models.StatusPassed,
			DurationMs: 10,
			Validations: map[string]models.GraderResults{
				"check": {Passed: true, Score: 1},
			},
		},
		{
			Status:     models.StatusPassed,
			DurationMs: 20,
			Validations: map[string]models.GraderResults{
				"check": {Passed: true, Score: 1},
			},
		},
		{
			Status:     models.StatusFailed,
			DurationMs: 30,
			Validations: map[string]models.GraderResults{
				"check": {Passed: false, Score: 0},
			},
		},
	}

	stats := runner.computeTestStats(runs)
	require.NotNil(t, stats)
	assert.Equal(t, 3, stats.TotalRuns)
	assert.Equal(t, 2, stats.PassedRuns)
	assert.Equal(t, 1, stats.FailedRuns)
	assert.Equal(t, 0, stats.ErrorRuns)
	assert.InDelta(t, 66.6667, stats.PassRate*100, 0.1)
	assert.True(t, stats.Flaky)
	assert.InDelta(t, 33.3333, stats.FlakinessPercent, 0.1)
}

.squad/decisions.md:258

This PR is scoped/titled as implementing multi-trial flakiness detection for evals (issue #84), but this change adds a decision entry for an unrelated topic (token diff distribution strategy for issue #81). Consider moving the #81 decision/log updates to a separate PR (or to the PR for #81) so this PR remains focused and easier to review/revert.

## 2026-03-05: Token Diff Distribution Strategy (Issue #81)

**By:** Rusty (Lead / Architect)  
**Issue:** #81  
**Status:** APPROVED

### What

For the GitHub Action token budget PR comment feature (#81), **implement `waza tokens diff` CLI command + lightweight wrapper action**, not action-only or CLI-only.

### Implementation

internal/orchestration/runner.go:1302

The current counters can double-count error runs: a StatusError run that also has failing validations increments both failed (via the else) and errored, and a StatusError run with nil validations increments passed and errored. This can make PassedRuns+FailedRuns+ErrorRuns differ from TotalRuns and makes the individual run counts misleading. Suggest counting each run into exactly one bucket (passed/failed/error) based on run.Status, then computing PassRate from those buckets.


		if run.Status == models.StatusError {
			errored++
		}

spboyer

Verified by Rusty (Opus 4.6) — LGTM ✅

Solid implementation of multi-trial flakiness detection:

--trials\ flag with proper validation (>=1), clean override of \spec.Config.RunsPerTest\
\FlakinessPercent\ metric uses correct minority-outcomes formula
New stats fields (PassedRuns, FailedRuns, ErrorRuns, TotalRuns) provide full trial breakdown
Summary display updated with flakiness percentage
Tests cover flag parsing, invalid values, override behavior, and stats computation
Docs updated: README, CLI reference
CI green on ubuntu + windows + lint

Note: Can't self-approve via API (same account). Setting auto-merge.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ft#91)

wbreza

Code Review: PR #91 - feat: Add multi-trial flakiness detection

What Looks Good

Smart flag behavior - --trials only overrides when explicitly provided
Clean validation - rejects --trials 0 with clear error
Correct flakiness formula - minority outcomes / total gives 0-50% range
Backward compatible switch on run.Status with fallthrough
Comprehensive tests and docs updated

Low (1 finding)

ErrorRuns is a subset of FailedRuns (PassedRuns + FailedRuns = TotalRuns). Consider documenting this in outcome.go for JSON consumers.

Overall Assessment: Approve

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

wbreza

Code Review: PR #91 — feat: Add multi-trial flakiness detection

✅ What Looks Good

cmd.Flags().Changed("trials") correctly prevents accidental override of spec config
Flakiness formula — minority-outcome metric is symmetric and intuitive
Error runs counted as failures — ErrorRuns ⊂ FailedRuns, PassedRuns + FailedRuns == TotalRuns
Comprehensive tests — flag parsing, invalid value, spec override, flakiness %, error counting
Docs updated — README and site CLI reference both cover the new flag

Findings Summary

Priority	Count
Critical	0
High	0
Medium	0
Low	1
Total	1

Overall Assessment: Approve — clean, well-tested feature addition. One low readability note.

Addresses wbreza review feedback on PR #91. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

spboyer · 2026-03-09T22:07:06Z

Addressed wbreza review feedback in 25ff560:

Added clarifying comment explaining dual-path logic for shouldOverrideTrials (CLI uses Changed(), tests fall back to trials > 0)

Addresses wbreza review feedback on PR microsoft#91. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Addresses wbreza review feedback on PR #91. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Addresses wbreza review feedback on PR #91. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

spboyer · 2026-03-10T13:29:35Z

Reopening with fresh branch to resolve GitHub merge status cache issue.

* feat: add multi-trial flakiness detection #84 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: address review feedback on PR #91 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: address PR #91 flakiness review comments Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: add clarifying comment on dual-path trials override logic Addresses wbreza review feedback on PR #91. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(runner): StatusError counts as error not fail, add --trials override test - Updated computeTestStats to separate StatusError from StatusFailed buckets - Updated unit tests to verify error/fail separation - Added cmd/waza test for --trials flag override logic - Fixes double-counting of error runs as failures * fix: address PR #103 review feedback - Fix task YAML schema: use inputs.prompt instead of prompts[] - Fix gofmt indentation on test function - Clarify --trials docs: omit to use config, explicit values must be >= 1 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings March 5, 2026 01:54

spboyer requested review from chlowell and richardpark-msft as code owners March 5, 2026 01:54

spboyer self-assigned this Mar 5, 2026

github-actions Bot enabled auto-merge (squash) March 5, 2026 01:54

Copilot started reviewing on behalf of spboyer March 5, 2026 01:55 View session

Copilot AI reviewed Mar 5, 2026

View reviewed changes

Comment thread internal/orchestration/runner.go Outdated

spboyer commented Mar 5, 2026

View reviewed changes

spboyer mentioned this pull request Mar 5, 2026

feat: Add trigger heuristic grader #90

Merged

spboyer force-pushed the squad/84-flakiness-detection branch from fdcf8f8 to 4801230 Compare March 5, 2026 17:12

spboyer added a commit that referenced this pull request Mar 5, 2026

fix: address review feedback on PR #91

bc242b4

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings March 5, 2026 17:44

spboyer added a commit that referenced this pull request Mar 5, 2026

fix: address review feedback on PR #91

0e90d56

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

spboyer force-pushed the squad/84-flakiness-detection branch from bc242b4 to 0e90d56 Compare March 5, 2026 17:46

Copilot AI reviewed Mar 5, 2026

View reviewed changes

Comment thread site/src/content/docs/reference/cli.mdx Outdated

Comment thread cmd/waza/cmd_run.go Outdated

Comment thread cmd/waza/cmd_run.go

Comment thread cmd/waza/cmd_run.go Outdated

spboyer added a commit that referenced this pull request Mar 5, 2026

fix: address PR #91 flakiness review comments

16eac6a

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

chlowell pushed a commit to chlowell/waza that referenced this pull request Mar 5, 2026

feat: implement guided requirements gathering (microsoft#53) (microso…

3059b0f

…ft#91)

wbreza approved these changes Mar 5, 2026

View reviewed changes

spboyer and others added 3 commits March 5, 2026 19:04

feat: add multi-trial flakiness detection microsoft#84

84c124a

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: address review feedback on PR microsoft#91

ae12099

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: address PR microsoft#91 flakiness review comments

ab61526

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings March 6, 2026 00:04

spboyer force-pushed the squad/84-flakiness-detection branch from 16eac6a to ab61526 Compare March 6, 2026 00:04

Copilot started reviewing on behalf of spboyer March 6, 2026 00:06 View session

Copilot AI reviewed Mar 6, 2026

View reviewed changes

Comment thread cmd/waza/cmd_run.go

wbreza previously approved these changes Mar 6, 2026

View reviewed changes

Comment thread cmd/waza/cmd_run.go

spboyer added a commit that referenced this pull request Mar 9, 2026

fix: add clarifying comment on dual-path trials override logic

25ff560

Addresses wbreza review feedback on PR #91. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: add clarifying comment on dual-path trials override logic

caad166

Addresses wbreza review feedback on PR microsoft#91. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

spboyer dismissed wbreza’s stale review via caad166 March 9, 2026 22:24

chlowell approved these changes Mar 10, 2026

View reviewed changes

spboyer added a commit that referenced this pull request Mar 10, 2026

fix: address review feedback on PR #91

35bfe97

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

spboyer added a commit that referenced this pull request Mar 10, 2026

fix: address PR #91 flakiness review comments

36257fc

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

spboyer added a commit that referenced this pull request Mar 10, 2026

fix: add clarifying comment on dual-path trials override logic

f65eebe

Addresses wbreza review feedback on PR #91. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

spboyer closed this Mar 10, 2026

auto-merge was automatically disabled March 10, 2026 12:51
Pull request was closed

spboyer reopened this Mar 10, 2026

github-actions Bot enabled auto-merge (squash) March 10, 2026 12:51

spboyer added a commit that referenced this pull request Mar 10, 2026

fix: address review feedback on PR #91

72bfc14

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

spboyer added a commit that referenced this pull request Mar 10, 2026

fix: address PR #91 flakiness review comments

94b2513

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

spboyer added a commit that referenced this pull request Mar 10, 2026

fix: add clarifying comment on dual-path trials override logic

8db32f5

Addresses wbreza review feedback on PR #91. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

spboyer closed this Mar 10, 2026

auto-merge was automatically disabled March 10, 2026 13:29
Pull request was closed

spboyer mentioned this pull request Mar 10, 2026

feat: Add multi-trial flakiness detection for evals #103

Merged

Uh oh!

Conversation

spboyer commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

spboyer left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wbreza left a comment

Choose a reason for hiding this comment

Code Review: PR #91 - feat: Add multi-trial flakiness detection

What Looks Good

Low (1 finding)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

wbreza left a comment

Choose a reason for hiding this comment

Code Review: PR #91 — feat: Add multi-trial flakiness detection

✅ What Looks Good

Findings Summary

Uh oh!

Uh oh!

spboyer commented Mar 9, 2026

Uh oh!

spboyer commented Mar 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

spboyer commented Mar 5, 2026 •

edited

Loading

codecov-commenter commented Mar 5, 2026 •

edited

Loading