Comparing changes

base repository: microsoft/waza
base: 8a6129a
...
head repository: microsoft/waza
compare: 7fc7f07
  • 5 commits
  • 18 files changed
  • 2 contributors

Commits on Apr 21, 2026

  1. fix: webserver test skips when frontend assets not built (#204)

    * fix: webserver test skips when frontend assets not built
    
    The TestIndexHTMLReferencesExistingAssets test only checked whether the
    assets/ directory existed, but not whether it contained actual JS/CSS
    bundles. On Windows CI (and any env where npm run build hasn't run),
    the directory could exist empty or with non-bundle files, causing the
    test to proceed and fail when the SPA fallback served index.html
    instead of the expected asset content types.
    
    Now the test also verifies that assets/ contains at least one .js or
    .css file before proceeding, and skips with a clear message otherwise.
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * fix: spelling typo + Windows-compatible absolute path test
    
    - Fix 'artefact' → 'artifact' misspelling in webserver test
    - Use runtime.GOOS to pick platform-correct absolute path in suggest test
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    ---------
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    spboyer and Copilot authored Apr 21, 2026
    0cfa604
  2. feat: add tool_calls grader (#187) (#202)

    * feat: add tool_calls grader (#187)
    
    Adds a new tool_calls grader that validates which tools an agent called
    during execution. Supports four constraint types:
    
    - required_tools: tools that must appear in the session
    - forbidden_tools: tools that must not appear in the session
    - min_calls: minimum total tool call count
    - max_calls: maximum total tool call count
    
    Partial scoring: score = passed_checks / total_checks. Each constraint
    counts as one check. Constructor validates parameters at creation time.
    
    Includes 25 tests covering constructor validation, required/forbidden
    tools, call count bounds, combined checks, partial scoring, details
    output, edge cases, and factory integration.
    
    Closes #187
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * docs: add tool_calls grader to site docs and schema reference (#187)
    
    - Add tool_calls to at-a-glance table in graders.mdx
    - Add full Tool Calls section with config options, examples, and comparison tip
    - Add tool_calls to grader type enum in schema.mdx
    - Site builds cleanly
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * fix: errcheck lint violation in tool_calls grader test
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    ---------
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    spboyer and Copilot authored Apr 21, 2026
    7bd731f
  3. feat: allow task prompt to be loaded from file (#157) (#200)

    * feat: allow task prompt to be loaded from file (#157)
    
    Add prompt_file field to TestStimulus as an alternative to inline prompt.
    When prompt_file is set, the file content is read and used as the prompt
    message. The path is resolved relative to the task YAML file's directory.
    
    Validation:
    - Error if both prompt and prompt_file are set
    - Error if prompt_file doesn't exist
    
    Includes 7 test cases covering file load, subdirectory paths, mutual
    exclusivity, missing file, inline fallback, empty inputs, and multiline.
    Updates task.schema.json with prompt_file property and oneOf constraint.
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * fix: reject unix-style absolute paths on Windows in normalizeGeneratedPath
    
    filepath.IsAbs does not recognize /etc/evil.yaml as absolute on Windows
    since it lacks a drive letter. Add explicit check for leading / to catch
    cross-platform absolute path injection.
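
The check described above could be sketched like this. The function name is hypothetical and the body of normalizeGeneratedPath is not reproduced; the point is only that a leading slash is tested in addition to filepath.IsAbs.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isAbsoluteAnyPlatform is a hypothetical helper. filepath.IsAbs alone
// returns false for "/etc/evil.yaml" on Windows, so a leading "/" is
// rejected explicitly as well.
func isAbsoluteAnyPlatform(p string) bool {
	return filepath.IsAbs(p) || strings.HasPrefix(p, "/")
}

func main() {
	fmt.Println(isAbsoluteAnyPlatform("/etc/evil.yaml")) // true on every platform
	fmt.Println(isAbsoluteAnyPlatform("evals/task.yaml")) // false
}
```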
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * fix: address review feedback on prompt_file (#157)
    
    - Reject absolute paths and path traversal (security, per runner pattern)
    - Clear MessageFile after resolve to avoid leaking paths in serialized output
    - Add minLength: 1 to prompt_file in JSON schema
    - Add 3 new tests: absolute path, path traversal, MessageFile clearing
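
For the path traversal rejection, one common approach is to clean the path and then look for a leading ".." component. The sketch below assumes that approach and a hypothetical function name; the runner pattern the commit refers to may differ.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// escapesBase is a hypothetical helper: it reports whether a relative
// path would resolve outside the directory it is joined to.
func escapesBase(rel string) bool {
	clean := filepath.Clean(filepath.FromSlash(rel))
	return clean == ".." || strings.HasPrefix(clean, ".."+string(filepath.Separator))
}

func main() {
	fmt.Println(escapesBase("../secret.txt"))     // true
	fmt.Println(escapesBase("prompts/a/../b.md")) // false: stays inside the base
	fmt.Println(escapesBase("a/../../b.md"))      // true: cleans to ../b.md
}
```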
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * docs: add prompt_file to eval-yaml guide and schema reference (#157)
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    ---------
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    spboyer and Copilot authored Apr 21, 2026
    0540774
  4. feat: implement max_response_time_ms behavior rule (#201)

    * feat: implement max_response_time_ms behavior rule (#136)
    
    Add MaxResponseTimeMs field to BehaviorRules and implement timing
    compliance check in ComputeBehaviorMetrics.
    
    Changes:
    - Add MaxResponseTimeMs int64 to BehaviorRules (testcase.go)
    - Add MaxResponseTimeMs, ActualResponseTimeMs, MaxResponseTimeMsPassed
      to BehaviorMetrics (behavior.go)
    - Check run.DurationMs <= rules.MaxResponseTimeMs when set
    - Include MaxResponseTimeMsPassed in AllConstraintsPassed()
    - Update computeEfficiency from four categories weighted 0.25 each to
      five categories weighted 0.20 each
    - Add max_response_time_ms to JSON schema (task.schema.json)
    - Add 4 new test cases: under/at/over limit, combined failure
    - Update existing test expected efficiency scores
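
A minimal sketch of the timing check and the equal-weight efficiency score. The type and function names are hypothetical, a zero limit is assumed to mean "not set", and each of the five categories is assumed to be pass/fail; the real BehaviorMetrics fields and category definitions are not shown here.

```go
package main

import "fmt"

// rules is a hypothetical stand-in for the timing part of BehaviorRules.
type rules struct {
	MaxResponseTimeMs int64 // assumption: 0 means no limit configured
}

// responseTimePassed mirrors the documented comparison
// run.DurationMs <= rules.MaxResponseTimeMs when a limit is set.
func responseTimePassed(r rules, durationMs int64) bool {
	if r.MaxResponseTimeMs <= 0 {
		return true
	}
	return durationMs <= r.MaxResponseTimeMs
}

// efficiency averages five pass/fail categories with equal weight,
// so each category contributes 0.20.
func efficiency(passed [5]bool) float64 {
	n := 0
	for _, ok := range passed {
		if ok {
			n++
		}
	}
	return float64(n) / float64(len(passed))
}

func main() {
	r := rules{MaxResponseTimeMs: 5000}
	fmt.Println(responseTimePassed(r, 4999)) // true: under the limit
	fmt.Println(responseTimePassed(r, 5000)) // true: exactly at the limit
	fmt.Println(responseTimePassed(r, 5001)) // false: over the limit

	timing := responseTimePassed(r, 5001)
	fmt.Println(efficiency([5]bool{true, true, true, true, timing})) // 0.8
}
```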
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * docs: add max_response_time_ms to eval-yaml guide and schema reference
    
    Document the new behavior rule field in both the Writing Eval Specs
    guide and the YAML Schema reference page. Includes field table,
    usage examples, and description of efficiency scoring.
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * fix: gofmt formatting for behavior metrics files
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    ---------
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    spboyer and Copilot authored Apr 21, 2026
    6fb2f65
  5. feat: add output_contains_any expectation field (#203)

    * feat: add output_contains_any expectation field (#137)
    
    Add MayInclude (output_contains_any) to TestExpectation, which passes
    when ANY of the listed strings appear in the agent output. This
    completes the expectation-level text check trio alongside the existing
    MustInclude (output_contains) and MustExclude (output_not_contains).
    
    Also wires up all three expectation fields in RunAll via the new
    evaluateExpectations helper. These fields were previously defined but
    never evaluated.
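
The three text checks could be evaluated roughly as below. This is a sketch under assumptions: the type and function names are hypothetical (it is not the evaluateExpectations helper itself), matching is case-sensitive substring matching, and an empty output_contains_any list is treated as unset.

```go
package main

import (
	"fmt"
	"strings"
)

// expectation is a hypothetical stand-in for the three text-check fields.
type expectation struct {
	MustInclude []string // output_contains: every string must appear
	MayInclude  []string // output_contains_any: at least one must appear
	MustExclude []string // output_not_contains: none may appear
}

// checkExpectations returns one message per failed check; an empty
// result means the output satisfies all configured expectations.
func checkExpectations(e expectation, output string) []string {
	var failures []string
	for _, s := range e.MustInclude {
		if !strings.Contains(output, s) {
			failures = append(failures, "missing required text: "+s)
		}
	}
	for _, s := range e.MustExclude {
		if strings.Contains(output, s) {
			failures = append(failures, "found forbidden text: "+s)
		}
	}
	if len(e.MayInclude) > 0 {
		matched := false
		for _, s := range e.MayInclude {
			if strings.Contains(output, s) {
				matched = true
				break
			}
		}
		if !matched {
			failures = append(failures, "none of output_contains_any matched")
		}
	}
	return failures
}

func main() {
	e := expectation{
		MustInclude: []string{"deployed"},
		MayInclude:  []string{"succeeded", "completed"},
		MustExclude: []string{"panic"},
	}
	fmt.Println(len(checkExpectations(e, "App deployed; rollout completed."))) // 0
	fmt.Println(len(checkExpectations(e, "App deployed.")))                    // 1
}
```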
    
    Closes #137
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * fix: gofmt formatting and misspelling in run.go
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    ---------
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    spboyer and Copilot authored Apr 21, 2026
    7fc7f07