feat: Multi-trial flakiness detection for evals #84

Description

@spboyer

Summary

Add support for running eval tasks multiple times (--trials N) and computing pass rate statistics to detect flaky results.

Motivation

LLM-based evals are inherently non-deterministic. A task might pass 4 out of 5 times — is that a real failure or just noise? Running multiple trials and reporting pass rates helps distinguish genuine issues from flakiness, enabling confident quality gates in CI.

Proposed Implementation

CLI flag: waza run eval.yaml --trials 5

  1. Run each task N times (default: 1, configurable per-task or globally)
  2. Aggregate results per task:
    • Pass rate: passes / trials (e.g., 4/5 = 80%)
    • Flakiness: flag tasks whose pass rate falls below a configurable threshold (e.g., with a threshold of 100%, any failed trial marks the task flaky)
  3. Report in results JSON:
{
  "task": "deploy-prompt",
  "trials": 5,
  "passes": 4,
  "pass_rate": 0.8,
  "flaky": true,
  "individual_results": [...]
}
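
As a rough illustration of steps 1 and 2, here is a minimal Go sketch of running a task N times and aggregating the outcomes. The names `RunTrials`, `Summarize`, and `TaskSummary` are hypothetical and not part of waza's existing code; a trial outcome is assumed to be a plain pass/fail boolean, and `individual_results` is left out because its shape is not specified above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TaskSummary mirrors the fields of the proposed results JSON.
// (Hypothetical type; individual_results is omitted on purpose.)
type TaskSummary struct {
	Task     string  `json:"task"`
	Trials   int     `json:"trials"`
	Passes   int     `json:"passes"`
	PassRate float64 `json:"pass_rate"`
	Flaky    bool    `json:"flaky"`
}

// RunTrials calls run n times and records whether each trial passed.
// Values of n below 1 are treated as a single trial.
func RunTrials(n int, run func() bool) []bool {
	if n < 1 {
		n = 1
	}
	outcomes := make([]bool, 0, n)
	for i := 0; i < n; i++ {
		outcomes = append(outcomes, run())
	}
	return outcomes
}

// Summarize computes passes / trials and flags the task as flaky when
// the pass rate is strictly below flakyThreshold (an assumed reading of
// "tasks below this are marked flaky").
func Summarize(task string, outcomes []bool, flakyThreshold float64) TaskSummary {
	s := TaskSummary{Task: task, Trials: len(outcomes)}
	for _, passed := range outcomes {
		if passed {
			s.Passes++
		}
	}
	if s.Trials > 0 {
		s.PassRate = float64(s.Passes) / float64(s.Trials)
		s.Flaky = s.PassRate < flakyThreshold
	}
	return s
}

func main() {
	// Stub task that fails only on its third invocation.
	calls := 0
	stub := func() bool {
		calls++
		return calls != 3
	}

	summary := Summarize("deploy-prompt", RunTrials(5, stub), 1.0)
	out, err := json.MarshalIndent(summary, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Note that the comparison is strict in this sketch: with a threshold of 0.8, a task passing 4 of 5 trials would not be flagged, while a threshold of 1.0 flags it. Whether the boundary should be inclusive is a design choice the implementation would need to settle.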

Dashboard integration

The web dashboard should show pass rate and flakiness badges alongside task results.

Eval YAML support

settings:
  trials: 3
  flaky_threshold: 0.8  # tasks with a pass rate below this are marked flaky

tasks:
  - name: stable-task
    trials: 1  # override per-task
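
One way the global setting, the per-task override, and the CLI flag could be combined is sketched below in Go. The precedence shown (per-task override, then `--trials`, then `settings.trials`, then the default of 1) is an assumption; the issue does not define it. `Settings`, `Task`, `EffectiveTrials`, and the task name `other-task` are hypothetical names used for illustration only.

```go
package main

import "fmt"

// Settings holds the global values from the eval YAML (hypothetical type).
type Settings struct {
	Trials         int
	FlakyThreshold float64
}

// Task holds one task entry; Trials is nil when the task has no override.
type Task struct {
	Name   string
	Trials *int
}

// EffectiveTrials resolves how many times a task should run.
// Assumed precedence: per-task override > CLI flag > settings > default 1.
// A cliTrials value of 0 means the flag was not provided.
func EffectiveTrials(task Task, settings Settings, cliTrials int) int {
	switch {
	case task.Trials != nil && *task.Trials > 0:
		return *task.Trials
	case cliTrials > 0:
		return cliTrials
	case settings.Trials > 0:
		return settings.Trials
	default:
		return 1
	}
}

func main() {
	one := 1
	settings := Settings{Trials: 3, FlakyThreshold: 0.8}
	tasks := []Task{
		{Name: "stable-task", Trials: &one}, // overrides the global value
		{Name: "other-task"},                // falls back to settings.trials
	}
	for _, t := range tasks {
		fmt.Printf("%s: %d trial(s)\n", t.Name, EffectiveTrials(t, settings, 0))
	}
}
```

Using a pointer for the per-task value keeps "not set" distinct from an explicit number, so a task can pin itself to a single trial even when the global value is higher.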

Acceptance Criteria

  • --trials N CLI flag runs each task N times
  • Pass rate computed and included in results JSON
  • Flaky tasks flagged based on configurable threshold
  • Per-task trial override in eval.yaml
  • Dashboard shows pass rates and flakiness indicators
  • Tests covering: single trial (default), multi-trial aggregation, flaky detection

Metadata

Labels

  • enhancement (New feature or request)
  • go (Pull requests that update go code)
  • priority:p1 (This sprint)
  • squad:linus (Assigned to Linus (Backend Developer))
