Comparing changes

* fix: webserver test skips when frontend assets not built

  The TestIndexHTMLReferencesExistingAssets test only checked whether the assets/ directory existed, but not whether it contained actual JS/CSS bundles. On Windows CI (and any env where npm run build hasn't run), the directory could exist empty or with non-bundle files, causing the test to proceed and fail when the SPA fallback served index.html instead of the expected asset content types.

  Now the test also verifies that assets/ contains at least one .js or .css file before proceeding, and skips with a clear message otherwise.

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: spelling typo + Windows-compatible absolute path test

  - Fix 'artefact' → 'artifact' misspelling in webserver test
  - Use runtime.GOOS to pick platform-correct absolute path in suggest test

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
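The asset check described in the first commit above could look roughly like the following. This is a minimal sketch, not the code from the commit: `containsBundle` and `hasBuiltAssets` are hypothetical helper names, and treating an unreadable directory as "not built" is an assumption.

```go
package main

import (
	"os"
	"path/filepath"
	"strings"
)

// containsBundle reports whether any of the given file names ends in
// .js or .css. (Hypothetical helper, not taken from the commit.)
func containsBundle(names []string) bool {
	for _, name := range names {
		switch strings.ToLower(filepath.Ext(name)) {
		case ".js", ".css":
			return true
		}
	}
	return false
}

// hasBuiltAssets lists dir and applies containsBundle to its
// non-directory entries. A missing or unreadable directory is treated
// as "not built" (an assumption of this sketch).
func hasBuiltAssets(dir string) bool {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return false
	}
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		if !e.IsDir() {
			names = append(names, e.Name())
		}
	}
	return containsBundle(names)
}
```

A test would presumably call `t.Skip` with an explanatory message when `hasBuiltAssets` returns false, which matches the "skips with a clear message" behavior the commit describes.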

* feat: add tool_calls grader (#187)

  Adds a new tool_calls grader that validates which tools an agent called during execution. Supports four constraint types:

  - required_tools: tools that must appear in the session
  - forbidden_tools: tools that must not appear in the session
  - min_calls: minimum total tool call count
  - max_calls: maximum total tool call count

  Partial scoring: score = passed_checks / total_checks. Each constraint counts as one check. Constructor validates parameters at creation time.

  Includes 25 tests covering constructor validation, required/forbidden tools, call count bounds, combined checks, partial scoring, details output, edge cases, and factory integration.

  Closes #187

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* docs: add tool_calls grader to site docs and schema reference (#187)

  - Add tool_calls to at-a-glance table in graders.mdx
  - Add full Tool Calls section with config options, examples, and comparison tip
  - Add tool_calls to grader type enum in schema.mdx
  - Site builds clean

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: errcheck lint violation in tool_calls grader test

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
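To make the partial-scoring rule above concrete, here is a small sketch of `score = passed_checks / total_checks`. The struct, its field names, and `scoreToolCalls` are hypothetical, not the grader's actual API. It also assumes that each configured constraint type counts as one check (the commit says "each constraint counts as one check" but does not say whether individual tools are counted separately) and that an empty configuration scores 0.

```go
package main

// ToolCallsConfig mirrors the four constraint types named in the commit.
// Field names and the pointer-typed optional bounds are assumptions.
type ToolCallsConfig struct {
	RequiredTools  []string
	ForbiddenTools []string
	MinCalls       *int
	MaxCalls       *int
}

// scoreToolCalls returns passed_checks / total_checks for a session's
// tool calls, counting each configured constraint type as one check.
func scoreToolCalls(cfg ToolCallsConfig, calls []string) float64 {
	seen := make(map[string]bool, len(calls))
	for _, name := range calls {
		seen[name] = true
	}

	passed, total := 0, 0
	check := func(ok bool) {
		total++
		if ok {
			passed++
		}
	}

	if len(cfg.RequiredTools) > 0 {
		ok := true
		for _, name := range cfg.RequiredTools {
			if !seen[name] {
				ok = false
			}
		}
		check(ok)
	}
	if len(cfg.ForbiddenTools) > 0 {
		ok := true
		for _, name := range cfg.ForbiddenTools {
			if seen[name] {
				ok = false
			}
		}
		check(ok)
	}
	if cfg.MinCalls != nil {
		check(len(calls) >= *cfg.MinCalls)
	}
	if cfg.MaxCalls != nil {
		check(len(calls) <= *cfg.MaxCalls)
	}

	if total == 0 {
		// No constraints configured; how the real grader handles this
		// is not stated in the commit. This sketch returns 0.
		return 0
	}
	return float64(passed) / float64(total)
}
```

With required tools `read` and `edit`, forbidden tool `bash`, and `max_calls: 3`, a session that called only `read` and `bash` fails two of three checks under these assumptions.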

* feat: allow task prompt to be loaded from file (#157)

  Add prompt_file field to TestStimulus as an alternative to inline prompt. When prompt_file is set, the file content is read and used as the prompt message. The path is resolved relative to the task YAML file's directory.

  Validation:

  - Error if both prompt and prompt_file are set
  - Error if prompt_file doesn't exist

  Includes 7 test cases covering file load, subdirectory paths, mutual exclusivity, missing file, inline fallback, empty inputs, and multiline.

  Updates task.schema.json with prompt_file property and oneOf constraint.

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: reject unix-style absolute paths on Windows in normalizeGeneratedPath

  filepath.IsAbs does not recognize /etc/evil.yaml as absolute on Windows since it lacks a drive letter. Add explicit check for leading / to catch cross-platform absolute path injection.

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: address review feedback on prompt_file (#157)

  - Reject absolute paths and path traversal (security, per runner pattern)
  - Clear MessageFile after resolve to avoid leaking paths in serialized output
  - Add minLength: 1 to prompt_file in JSON schema
  - Add 3 new tests: absolute path, path traversal, MessageFile clearing

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* docs: add prompt_file to eval-yaml guide and schema reference (#157)

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
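The validation rules above (mutual exclusivity, no absolute paths, no traversal, and the explicit leading-`/` check needed because `filepath.IsAbs` misses Unix-style paths on Windows) can be sketched as follows. The function names and error messages are hypothetical; this is not the body of `normalizeGeneratedPath` or the actual loader, and it only checks the path, it does not read the file.

```go
package main

import (
	"errors"
	"path/filepath"
	"strings"
)

// validatePromptSource enforces that prompt and prompt_file are not
// both set. (Hypothetical helper name.)
func validatePromptSource(prompt, promptFile string) error {
	if prompt != "" && promptFile != "" {
		return errors.New("prompt and prompt_file are mutually exclusive")
	}
	return nil
}

// resolvePromptFile resolves promptFile relative to taskDir, rejecting
// absolute paths and paths that climb out of taskDir.
func resolvePromptFile(taskDir, promptFile string) (string, error) {
	// filepath.IsAbs is platform-specific: on Windows it returns false
	// for "/etc/evil.yaml", so a leading slash is checked explicitly.
	if filepath.IsAbs(promptFile) || strings.HasPrefix(promptFile, "/") {
		return "", errors.New("prompt_file must be a relative path")
	}

	// Clean collapses "a/../.." style segments so that traversal shows
	// up as a leading ".." element.
	clean := filepath.Clean(promptFile)
	if clean == ".." || strings.HasPrefix(clean, ".."+string(filepath.Separator)) {
		return "", errors.New("prompt_file must not escape the task directory")
	}

	return filepath.Join(taskDir, clean), nil
}
```

A caller would then read the returned path and, per the review-feedback commit, clear the stored file path afterwards so it does not appear in serialized output.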

* feat: implement max_response_time_ms behavior rule (#136)

  Add MaxResponseTimeMs field to BehaviorRules and implement timing compliance check in ComputeBehaviorMetrics.

  Changes:

  - Add MaxResponseTimeMs int64 to BehaviorRules (testcase.go)
  - Add MaxResponseTimeMs, ActualResponseTimeMs, MaxResponseTimeMsPassed to BehaviorMetrics (behavior.go)
  - Check run.DurationMs <= rules.MaxResponseTimeMs when set
  - Include MaxResponseTimeMsPassed in AllConstraintsPassed()
  - Update computeEfficiency from 4×0.25 to 5×0.20 categories
  - Add max_response_time_ms to JSON schema (task.schema.json)
  - Add 4 new test cases: under/at/over limit, combined failure
  - Update existing test expected efficiency scores

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* docs: add max_response_time_ms to eval-yaml guide and schema reference

  Document the new behavior rule field in both the Writing Eval Specs guide and the YAML Schema reference page. Includes field table, usage examples, and description of efficiency scoring.

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: gofmt formatting for behavior metrics files

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
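The timing check and the move from four to five equally weighted efficiency categories can be illustrated with the sketch below. Only `MaxResponseTimeMs` and the `DurationMs <= MaxResponseTimeMs` comparison come from the commit. The trimmed-down struct, the helper names, treating a zero limit as "unset", and scoring efficiency as the fraction of passing categories are all assumptions.

```go
package main

// BehaviorRules is reduced here to the one field the commit adds.
type BehaviorRules struct {
	MaxResponseTimeMs int64
}

// responseTimePassed applies the inclusive limit from the commit:
// durationMs <= MaxResponseTimeMs. A limit of zero or less is treated
// as "not set" and always passes (an assumption of this sketch).
func responseTimePassed(rules BehaviorRules, durationMs int64) bool {
	if rules.MaxResponseTimeMs <= 0 {
		return true
	}
	return durationMs <= rules.MaxResponseTimeMs
}

// equalWeightEfficiency gives every category the same weight, so five
// categories contribute 0.20 each and four would contribute 0.25 each.
func equalWeightEfficiency(categoryPassed []bool) float64 {
	if len(categoryPassed) == 0 {
		return 0
	}
	passed := 0
	for _, ok := range categoryPassed {
		if ok {
			passed++
		}
	}
	return float64(passed) / float64(len(categoryPassed))
}
```

Under these assumptions, a run that exactly hits the limit passes, and failing only the timing category out of five lowers the efficiency score by 0.20.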

* feat: add output_contains_any expectation field (#137)

  Add MayInclude (output_contains_any) to TestExpectation, which passes when ANY of the listed strings appear in the agent output. This completes the expectation-level text check trio alongside the existing MustInclude (output_contains) and MustExclude (output_not_contains).

  Also wires up all three expectation fields in RunAll via the new evaluateExpectations helper; these fields were previously defined but never evaluated.

  Closes #137

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: gofmt formatting and misspelling in run.go

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
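The three expectation checks can be sketched as below. The field names `MustInclude`, `MustExclude`, and `MayInclude` and the helper name `evaluateExpectations` come from the commit message. The helper's signature, its return type, the failure messages, and the use of case-sensitive substring matching are assumptions, since the commit does not show the real implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// TestExpectation carries the three text checks named in the commit.
type TestExpectation struct {
	MustInclude []string // output_contains: every string must appear
	MustExclude []string // output_not_contains: no string may appear
	MayInclude  []string // output_contains_any: at least one must appear
}

// evaluateExpectations returns one message per failed check. An empty
// result means every configured expectation passed. The signature is
// hypothetical; matching is a plain case-sensitive substring test.
func evaluateExpectations(exp TestExpectation, output string) []string {
	var failures []string

	for _, s := range exp.MustInclude {
		if !strings.Contains(output, s) {
			failures = append(failures, fmt.Sprintf("output_contains: missing %q", s))
		}
	}
	for _, s := range exp.MustExclude {
		if strings.Contains(output, s) {
			failures = append(failures, fmt.Sprintf("output_not_contains: found %q", s))
		}
	}
	// output_contains_any is only checked when at least one candidate
	// is listed; a single match is enough to pass.
	if len(exp.MayInclude) > 0 {
		matched := false
		for _, s := range exp.MayInclude {
			if strings.Contains(output, s) {
				matched = true
				break
			}
		}
		if !matched {
			failures = append(failures, fmt.Sprintf("output_contains_any: none of %q found", exp.MayInclude))
		}
	}
	return failures
}
```

Returning a list of failures rather than a single boolean is a design choice of this sketch; it lets a caller report every unmet expectation at once.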

Commits on Apr 21, 2026
