feat: add output_contains_any expectation field#203
Conversation
9341a64 to
7b861f7
Compare
There was a problem hiding this comment.
Pull request overview
Adds first-class support for an expectation-level “any-of” output text check (output_contains_any → MayInclude) and wires expectation-based text validations into the grading pipeline alongside existing spec/task graders.
Changes:
- Add
MayInclude []stringtomodels.TestExpectationwith YAML/JSON tags. - Evaluate
MustInclude,MustExclude, andMayIncludeingraders.RunAll()via a newevaluateExpectations()helper. - Add unit tests for YAML parsing and expectation evaluation behavior.
Show a summary per file
| File | Description |
|---|---|
| internal/models/testcase.go | Extends TestExpectation with MayInclude (output_contains_any). |
| internal/graders/run.go | Adds expectation evaluation and merges synthesized results into RunAll() output. |
| internal/graders/run_test.go | Adds tests covering each expectation field and combined behavior. |
| internal/models/testcase_test.go | Adds YAML parsing test for output_contains_any. |
Copilot's findings
Comments suppressed due to low confidence (2)
internal/graders/run.go:114
- These synthetic expectation results don’t set
GraderResults.Type.typeis required in JSON output (noomitempty) and is used in reporting (e.g., JUnit/web API). Please populateType(likelymodels.GraderKindText) for this result.
results["_output_not_contains"] = models.GraderResults{
Name: "_output_not_contains",
Score: score,
Passed: score == 1.0,
Feedback: feedback,
Weight: 1.0,
}
internal/graders/run.go:139
- These synthetic expectation results don’t set
GraderResults.Type.typeis required in JSON output (noomitempty) and is used in reporting (e.g., JUnit/web API). Please populateType(likelymodels.GraderKindText) for this result.
results["_output_contains_any"] = models.GraderResults{
Name: "_output_contains_any",
Score: score,
Passed: foundAny,
Feedback: feedback,
Weight: 1.0,
}
- Files reviewed: 6/6 changed files
- Comments generated: 4
There was a problem hiding this comment.
Copilot's findings
Comments suppressed due to low confidence (2)
internal/graders/run.go:112
- The synthetic expectation-derived GraderResults for
output_not_containsdon’t setType. SinceTypeis required and surfaced in reports, populate it (likelymodels.GraderKindText) here too.
results["_output_not_contains"] = models.GraderResults{
Name: "_output_not_contains",
Score: score,
Passed: score == 1.0,
Feedback: feedback,
internal/graders/run.go:137
- The synthetic expectation-derived GraderResults for
output_contains_anydon’t setType. Please setType(likelymodels.GraderKindText) so downstream output/reporting includes a correct grader kind.
results["_output_contains_any"] = models.GraderResults{
Name: "_output_contains_any",
Score: score,
Passed: foundAny,
Feedback: feedback,
- Files reviewed: 13/13 changed files
- Comments generated: 2
| ### output_contains_any | ||
|
|
||
| **Type:** array of strings | ||
|
|
||
| At least one of these strings must appear in the output (OR logic). Useful when an agent may express a concept in different ways. All checks are case-insensitive. |
There was a problem hiding this comment.
This section says “All checks are case-insensitive” under output_contains_any, but the implementation applies case-insensitive matching to output_contains and output_not_contains as well. Consider moving this note to a shared place (or repeating it under each field) to avoid implying only output_contains_any is case-insensitive.
| results := evaluateExpectations(tc, gCtx) | ||
| r, ok := results["_output_contains_any"] | ||
| assert.True(t, ok) | ||
| assert.Equal(t, 1.0, r.Score) | ||
| assert.True(t, r.Passed) | ||
| assert.Contains(t, r.Feedback, "beta") | ||
| }) |
There was a problem hiding this comment.
These expectation-evaluation tests assert Score/Passed/Feedback, but don’t assert that the synthesized results populate GraderResults.Type. Adding an assertion here (and in the other expectation tests) would catch regressions since Type is required in output/reporting.
a2dcb4a to
545b44a
Compare
Add MayInclude (output_contains_any) to TestExpectation, which passes when ANY of the listed strings appear in the agent output. This completes the expectation-level text check trio alongside the existing MustInclude (output_contains) and MustExclude (output_not_contains). Also wires up all three expectation fields in RunAll via the new evaluateExpectations helper — these fields were previously defined but never evaluated. Closes #137 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
545b44a to
16fa7e7
Compare
The integration test step runs `waza run` with the mock executor, which produces generic output that won't match output_contains expectations. This is expected — the test validates that waza completes without crashing, not that mock evals pass. Root cause: PR #203 (v0.27.0) wired up evaluateExpectations() which made output_contains checks actually execute. Before that, these fields were defined but never evaluated, so the integration test passed silently. Exit code 1 (eval failures) is now allowed. Exit codes >1 (crashes, panics) still fail CI. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Adds
MayInclude(output_contains_any) toTestExpectation, which passes when any of the listed strings appear in the agent output. This completes the expectation-level text check trio:output_containsMustIncludeoutput_not_containsMustExcludeoutput_contains_anyMayIncludeWhat changed
internal/models/testcase.go— AddedMayInclude []stringfield with yaml/json tagsinternal/graders/run.go— AddedevaluateExpectations()that evaluates all three expectation fields and synthesizesGraderResults. Wired intoRunAll()after spec/task graders. All checks are case-insensitive.internal/graders/run_test.go— 4 new test functions covering each field individually and combinedinternal/models/testcase_test.go— YAML parsing test for the new fieldExample YAML
Note
MustIncludeandMustExcludewere previously defined in the struct but never evaluated. This PR wires up all three fields.Working as Linus (Backend Developer)
Closes #137