
feat: add output_contains_any expectation field #203

Merged: spboyer merged 2 commits into main from squad/137-output-contains-any on Apr 21, 2026

Conversation

@spboyer

@spboyer spboyer commented Apr 21, 2026

Summary

Adds MayInclude (output_contains_any) to TestExpectation, which passes when any of the listed strings appear in the agent output. This completes the expectation-level text check trio:

| YAML field | Go field | Semantics |
| --- | --- | --- |
| output_contains | MustInclude | ALL strings must appear (score = matched/total) |
| output_not_contains | MustExclude | NONE may appear (score = absent/total) |
| output_contains_any | MayInclude | ANY one must appear (binary 1.0/0.0) |
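
To make the three scoring rules concrete, here is a minimal Go sketch of the semantics in the table. It is not the code from internal/graders/run.go: the function names are hypothetical, the treatment of empty lists is an assumption, and only the case-insensitive matching and the score formulas come from this PR description.

```go
package main

import (
	"fmt"
	"strings"
)

// containsFold reports whether output contains needle, ignoring case.
func containsFold(output, needle string) bool {
	return strings.Contains(strings.ToLower(output), strings.ToLower(needle))
}

// scoreContainsAll mirrors output_contains: score = matched/total.
func scoreContainsAll(output string, want []string) float64 {
	if len(want) == 0 {
		return 1.0 // assumption: an empty list is vacuously satisfied
	}
	matched := 0
	for _, s := range want {
		if containsFold(output, s) {
			matched++
		}
	}
	return float64(matched) / float64(len(want))
}

// scoreContainsNone mirrors output_not_contains: score = absent/total.
func scoreContainsNone(output string, banned []string) float64 {
	if len(banned) == 0 {
		return 1.0 // assumption: nothing banned means nothing violated
	}
	absent := 0
	for _, s := range banned {
		if !containsFold(output, s) {
			absent++
		}
	}
	return float64(absent) / float64(len(banned))
}

// scoreContainsAny mirrors output_contains_any: 1.0 if any string appears,
// otherwise 0.0. Callers are assumed to skip the check for an empty list.
func scoreContainsAny(output string, options []string) float64 {
	for _, s := range options {
		if containsFold(output, s) {
			return 1.0
		}
	}
	return 0.0
}

func main() {
	out := "The agent chose Option_B after some thought."
	fmt.Println(scoreContainsAll(out, []string{"option_b", "option_c"})) // 0.5
	fmt.Println(scoreContainsNone(out, []string{"error", "panic"}))      // 1
	fmt.Println(scoreContainsAny(out, []string{"option_a", "option_b"})) // 1
}
```

The any-of check stops at the first match, which is why its score is binary rather than a ratio.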

What changed

  • internal/models/testcase.go — Added MayInclude []string field with yaml/json tags
  • internal/graders/run.go — Added evaluateExpectations() that evaluates all three expectation fields and synthesizes GraderResults. Wired into RunAll() after spec/task graders. All checks are case-insensitive.
  • internal/graders/run_test.go — 4 new test functions covering each field individually and combined
  • internal/models/testcase_test.go — YAML parsing test for the new field
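
For readers who want to picture the model change, the sketch below shows roughly what a MayInclude field with yaml/json tags could look like. The struct is a trimmed, hypothetical stand-in for the one in internal/models/testcase.go: the YAML keys come from the table above, while the JSON tag names and the use of omitempty are assumptions. It parses JSON with the standard library only to stay self-contained; the PR's own test parses YAML.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TestExpectation is a hypothetical, trimmed stand-in for the real model.
// The json tag names and omitempty options are assumptions.
type TestExpectation struct {
	MustInclude []string `yaml:"output_contains,omitempty" json:"output_contains,omitempty"`
	MustExclude []string `yaml:"output_not_contains,omitempty" json:"output_not_contains,omitempty"`
	MayInclude  []string `yaml:"output_contains_any,omitempty" json:"output_contains_any,omitempty"`
}

// parseExpectation decodes a JSON document into a TestExpectation.
func parseExpectation(raw string) (TestExpectation, error) {
	var exp TestExpectation
	err := json.Unmarshal([]byte(raw), &exp)
	return exp, err
}

func main() {
	exp, err := parseExpectation(`{"output_contains_any": ["option_a", "option_b", "option_c"]}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(exp.MayInclude) // [option_a option_b option_c]
}
```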

Example YAML

expected:
  output_contains_any:
    - "option_a"
    - "option_b"
    - "option_c"

Note

MustInclude and MustExclude were previously defined in the struct but never evaluated. This PR wires up all three fields.

Working as Linus (Backend Developer)

Closes #137

Copilot AI review requested due to automatic review settings April 21, 2026 17:19
@spboyer spboyer added the squad:linus Assigned to Linus (Backend Developer) label Apr 21, 2026
@spboyer spboyer requested a review from wbreza as a code owner April 21, 2026 17:19
@github-actions github-actions Bot enabled auto-merge (squash) April 21, 2026 17:19
@spboyer spboyer force-pushed the squad/137-output-contains-any branch from 9341a64 to 7b861f7 Compare April 21, 2026 17:21

Copilot AI left a comment

Pull request overview

Adds first-class support for an expectation-level “any-of” output text check (output_contains_any, Go field MayInclude) and wires expectation-based text validations into the grading pipeline alongside existing spec/task graders.

Changes:

  • Add MayInclude []string to models.TestExpectation with YAML/JSON tags.
  • Evaluate MustInclude, MustExclude, and MayInclude in graders.RunAll() via a new evaluateExpectations() helper.
  • Add unit tests for YAML parsing and expectation evaluation behavior.

| File | Description |
| --- | --- |
| internal/models/testcase.go | Extends TestExpectation with MayInclude (output_contains_any). |
| internal/graders/run.go | Adds expectation evaluation and merges synthesized results into RunAll() output. |
| internal/graders/run_test.go | Adds tests covering each expectation field and combined behavior. |
| internal/models/testcase_test.go | Adds YAML parsing test for output_contains_any. |

Copilot's findings

Comments suppressed due to low confidence (2)

internal/graders/run.go:114

  • These synthetic expectation results don’t set GraderResults.Type. type is required in JSON output (no omitempty) and is used in reporting (e.g., JUnit/web API). Please populate Type (likely models.GraderKindText) for this result.
		results["_output_not_contains"] = models.GraderResults{
			Name:     "_output_not_contains",
			Score:    score,
			Passed:   score == 1.0,
			Feedback: feedback,
			Weight:   1.0,
		}

internal/graders/run.go:139

  • These synthetic expectation results don’t set GraderResults.Type. type is required in JSON output (no omitempty) and is used in reporting (e.g., JUnit/web API). Please populate Type (likely models.GraderKindText) for this result.
		results["_output_contains_any"] = models.GraderResults{
			Name:     "_output_contains_any",
			Score:    score,
			Passed:   foundAny,
			Feedback: feedback,
			Weight:   1.0,
		}
  • Files reviewed: 6/6 changed files
  • Comments generated: 4
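
The suppressed findings above hinge on how a JSON field without omitempty behaves. The sketch below illustrates that point with hypothetical local types: the GraderResults layout, its JSON tag names, and the "text" value of GraderKindText are all assumptions standing in for the real models package. An unset Type is still serialized as an empty string, and a constructor is one way to make sure it gets populated.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GraderKind and GraderKindText are hypothetical stand-ins; the real
// constant's value is not shown in this PR.
type GraderKind string

const GraderKindText GraderKind = "text"

// GraderResults is a hypothetical local copy with assumed JSON tags.
type GraderResults struct {
	Name     string     `json:"name"`
	Type     GraderKind `json:"type"` // no omitempty, so "" is still emitted
	Score    float64    `json:"score"`
	Passed   bool       `json:"passed"`
	Feedback string     `json:"feedback"`
	Weight   float64    `json:"weight"`
}

// newAnyResult builds the synthetic any-of result with Type populated.
func newAnyResult(foundAny bool, feedback string) GraderResults {
	score := 0.0
	if foundAny {
		score = 1.0
	}
	return GraderResults{
		Name:     "_output_contains_any",
		Type:     GraderKindText,
		Score:    score,
		Passed:   foundAny,
		Feedback: feedback,
		Weight:   1.0,
	}
}

// encode returns the JSON form of a result, or the error text on failure.
func encode(r GraderResults) string {
	b, err := json.Marshal(r)
	if err != nil {
		return err.Error()
	}
	return string(b)
}

func main() {
	// Type left unset: the output still carries "type":"".
	fmt.Println(encode(GraderResults{Name: "_output_contains_any", Score: 1, Passed: true, Weight: 1}))
	// Type set through the constructor.
	fmt.Println(encode(newAnyResult(true, "matched: beta")))
}
```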

Copilot AI review requested due to automatic review settings April 21, 2026 17:42

Copilot AI left a comment

Copilot's findings

Comments suppressed due to low confidence (2)

internal/graders/run.go:112

  • The synthetic expectation-derived GraderResults for output_not_contains don’t set Type. Since Type is required and surfaced in reports, populate it (likely models.GraderKindText) here too.
		results["_output_not_contains"] = models.GraderResults{
			Name:     "_output_not_contains",
			Score:    score,
			Passed:   score == 1.0,
			Feedback: feedback,

internal/graders/run.go:137

  • The synthetic expectation-derived GraderResults for output_contains_any don’t set Type. Please set Type (likely models.GraderKindText) so downstream output/reporting includes a correct grader kind.
		results["_output_contains_any"] = models.GraderResults{
			Name:     "_output_contains_any",
			Score:    score,
			Passed:   foundAny,
			Feedback: feedback,
  • Files reviewed: 13/13 changed files
  • Comments generated: 2

Comment on lines +468 to +472
### output_contains_any

**Type:** array of strings

At least one of these strings must appear in the output (OR logic). Useful when an agent may express a concept in different ways. All checks are case-insensitive.

Copilot AI Apr 21, 2026

This section says “All checks are case-insensitive” under output_contains_any, but the implementation applies case-insensitive matching to output_contains and output_not_contains as well. Consider moving this note to a shared place (or repeating it under each field) to avoid implying only output_contains_any is case-insensitive.

Comment on lines +73 to +79
results := evaluateExpectations(tc, gCtx)
r, ok := results["_output_contains_any"]
assert.True(t, ok)
assert.Equal(t, 1.0, r.Score)
assert.True(t, r.Passed)
assert.Contains(t, r.Feedback, "beta")
})

Copilot AI Apr 21, 2026

These expectation-evaluation tests assert Score/Passed/Feedback, but don’t assert that the synthesized results populate GraderResults.Type. Adding an assertion here (and in the other expectation tests) would catch regressions since Type is required in output/reporting.
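
One possible shape for that regression guard, sketched with the standard library only: a helper that lists every synthesized result whose Type is empty, which a test could then require to be empty. The GraderResults struct here is a hypothetical, trimmed stand-in, and untypedResults is not a function from this repository.

```go
package main

import (
	"fmt"
	"sort"
)

// GraderResults is a hypothetical, trimmed stand-in for models.GraderResults.
type GraderResults struct {
	Name   string
	Type   string
	Score  float64
	Passed bool
}

// untypedResults returns the sorted keys of results whose Type is empty.
func untypedResults(results map[string]GraderResults) []string {
	var missing []string
	for key, r := range results {
		if r.Type == "" {
			missing = append(missing, key)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	results := map[string]GraderResults{
		"_output_contains":     {Name: "_output_contains", Type: "text", Score: 1, Passed: true},
		"_output_contains_any": {Name: "_output_contains_any", Score: 1, Passed: true},
	}
	// A test would fail if this slice is non-empty.
	fmt.Println(untypedResults(results)) // [_output_contains_any]
}
```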

@spboyer spboyer force-pushed the squad/137-output-contains-any branch from a2dcb4a to 545b44a Compare April 21, 2026 18:22
Copilot AI added 2 commits April 21, 2026 14:27
Add MayInclude (output_contains_any) to TestExpectation, which passes
when ANY of the listed strings appear in the agent output. This
completes the expectation-level text check trio alongside the existing
MustInclude (output_contains) and MustExclude (output_not_contains).

Also wires up all three expectation fields in RunAll via the new
evaluateExpectations helper — these fields were previously defined but
never evaluated.

Closes #137

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@spboyer spboyer force-pushed the squad/137-output-contains-any branch from 545b44a to 16fa7e7 Compare April 21, 2026 18:28
@spboyer spboyer merged commit 7fc7f07 into main Apr 21, 2026
4 of 5 checks passed
@spboyer spboyer deleted the squad/137-output-contains-any branch April 21, 2026 18:28
spboyer added a commit that referenced this pull request Apr 21, 2026
The integration test step runs `waza run` with the mock executor,
which produces generic output that won't match output_contains
expectations. This is expected — the test validates that waza
completes without crashing, not that mock evals pass.

Root cause: PR #203 (v0.27.0) wired up evaluateExpectations() which
made output_contains checks actually execute. Before that, these
fields were defined but never evaluated, so the integration test
passed silently.

Exit code 1 (eval failures) is now allowed. Exit codes >1 (crashes,
panics) still fail CI.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Labels

squad:linus Assigned to Linus (Backend Developer)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add support for the TestExpectation model field MayInclude (which maps to the YAML output_contains_any field).

3 participants