Summary
Add support for running eval tasks multiple times (--trials N) and computing pass rate statistics to detect flaky results.
Motivation
LLM-based evals are inherently non-deterministic. A task might pass 4 out of 5 times — is that a real failure or just noise? Running multiple trials and reporting pass rates helps distinguish genuine issues from flakiness, enabling confident quality gates in CI.
Proposed Implementation
CLI flag: waza run eval.yaml --trials 5
- Run each task N times (default: 1, configurable per-task or globally)
- Aggregate results per task:
  - Pass rate: passes / trials (e.g., 4/5 = 80%)
  - Confidence: flag tasks whose pass rate is below a threshold (e.g., <100% = flaky)
- Report in results JSON:
{
  "task": "deploy-prompt",
  "trials": 5,
  "passes": 4,
  "pass_rate": 0.8,
  "flaky": true,
  "individual_results": [...]
}
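The aggregation step could look roughly like the sketch below. It is only an illustration, not waza's actual code: the function name `aggregate_trials` is hypothetical, each trial outcome is assumed to be a plain boolean (the real shape of `individual_results` is not specified here), and the default threshold of 1.0 mirrors the "<100% = flaky" example.

```python
def aggregate_trials(task: str, results: list, flaky_threshold: float = 1.0) -> dict:
    """Collapse per-trial pass/fail outcomes into one result record.

    Hypothetical helper: `results` is assumed to hold one boolean per trial.
    """
    if not results:
        raise ValueError("at least one trial result is required")

    trials = len(results)
    passes = sum(1 for passed in results if passed)
    pass_rate = passes / trials

    return {
        "task": task,
        "trials": trials,
        "passes": passes,
        "pass_rate": pass_rate,
        # Assumption: "flaky" means the pass rate is strictly below the threshold.
        "flaky": pass_rate < flaky_threshold,
        "individual_results": list(results),
    }


record = aggregate_trials("deploy-prompt", [True, True, False, True, True])
```

Note that under this rule a task that fails every trial is also reported as flaky. Whether consistent failures should be separated from intermittent ones is a design choice left open here.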
Dashboard integration
The web dashboard should show pass rate and flakiness badges alongside task results.
Eval YAML support
settings:
  trials: 3
  flaky_threshold: 0.8 # tasks below this are marked flaky
tasks:
  - name: stable-task
    trials: 1 # override per-task
Acceptance Criteria
- The --trials N CLI flag runs each task N times