Comparing changes

base repository: microsoft/waza
base: v0.25.0
head repository: microsoft/waza
compare: v0.26.0
  • 18 commits
  • 47 files changed
  • 9 contributors

Commits on Apr 21, 2026

  1. fix: macOS install + trigger test off-by-1 count (#164, #184) (#193)

    * fix: install.sh uses shasum on macOS when sha256sum unavailable (#164)
    
    The install script was failing on macOS because it prioritized sha256sum
    over shasum. While sha256sum exists on some macOS systems (via Homebrew),
    the BSD version doesn't support the -c flag needed for checksum verification.
    
    This fix:
    - Prioritizes shasum (native on macOS, supports -c flag)
    - Falls back to sha256sum only if it supports the -c flag
    - Exits with an error if no compatible utility is found (rather than
      skipping verification with a warning)
    
    Fixes #164
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * fix: trigger test result count off by 1 (#184)
    
    ComputeTriggerMetrics used weighted sums (confidence-adjusted) for the
    integer TP/FP/TN/FN counts. Medium-confidence prompts contributed 0.5
    instead of 1.0, so groups with 6 high + 2 medium prompts reported 7
    instead of 8. Fix: track actual result counts for the integer fields
    while keeping weighted values for precision/recall/F1/accuracy.
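The counting fix for #184 can be illustrated with a minimal Go sketch. The names `result`, `metrics`, and `compute` are hypothetical and not taken from the waza source; only the 0.5 weight for medium-confidence prompts comes from the commit message. Integer fields count actual results, while precision and recall keep using the confidence weights.

```go
package main

import "fmt"

// result is a hypothetical per-prompt outcome used only for this sketch.
type result struct {
	expected bool    // the prompt should trigger the skill
	actual   bool    // the skill actually triggered
	weight   float64 // 1.0 for high confidence, 0.5 for medium (assumed scale)
}

// metrics keeps raw integer counts separate from weighted values.
type metrics struct {
	TP, FP, TN, FN    int     // actual result counts
	WeightedTP        float64 // confidence-adjusted true positives
	Precision, Recall float64 // computed from weighted sums
}

func compute(results []result) metrics {
	var m metrics
	var wFP, wFN float64
	for _, r := range results {
		switch {
		case r.expected && r.actual:
			m.TP++
			m.WeightedTP += r.weight
		case !r.expected && r.actual:
			m.FP++
			wFP += r.weight
		case r.expected && !r.actual:
			m.FN++
			wFN += r.weight
		default:
			m.TN++
		}
	}
	if d := m.WeightedTP + wFP; d > 0 {
		m.Precision = m.WeightedTP / d
	}
	if d := m.WeightedTP + wFN; d > 0 {
		m.Recall = m.WeightedTP / d
	}
	return m
}

func main() {
	var rs []result
	for i := 0; i < 6; i++ {
		rs = append(rs, result{expected: true, actual: true, weight: 1.0})
	}
	for i := 0; i < 2; i++ {
		rs = append(rs, result{expected: true, actual: true, weight: 0.5})
	}
	m := compute(rs)
	// prints "TP=8 weightedTP=7.0 precision=1.00"
	fmt.Printf("TP=%d weightedTP=%.1f precision=%.2f\n", m.TP, m.WeightedTP, m.Precision)
}
```

Reporting the weighted sum as the count would give 7 for this group, which is the off-by-1 described above.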
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    ---------
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    spboyer and Copilot authored Apr 21, 2026
    9ff4bb3
  2. docs: update demo guide and add CI/CD integration guide (#112, #89) (#194)

    - Fix DEMO-SCRIPT.md to match current CLI commands
      - Remove references to 'waza generate' command (doesn't exist)
      - Replace with 'waza new skill' and 'waza new eval'
      - Remove outdated flags: --log, --suggestions, --trials, --fail-threshold
      - Replace with current flags: --session-log, --session-dir, --task, --parallel
      - Update Part 5+ sections to reflect current CLI behavior
    
    - Add comprehensive CI/CD integration guide (docs/CI-CD-GUIDE.md)
      - GitHub Actions examples (basic, multi-model, baseline comparison)
      - Azure DevOps pipeline examples
      - Secrets management for both platforms
      - Best practices: caching, quality gates, parallel execution, logging
      - Troubleshooting guide with common issues
      - Advanced workflows: approval gates, trend tracking
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    spboyer and Copilot authored Apr 21, 2026
    cd914c6
  3. fix: validate grader config required fields (#195)

    * fix: validate grader config required fields (#113)
    
    Grader configurations now validate type-specific required fields at parse time:
    - code graders require at least one assertion in config.assertions
    - diff graders require at least one file in config.expected_files
    - json_schema graders require config.schema or config.schema_file
    - program graders require config.command
    - trigger graders require config.skill_path
    - action_sequence graders require config.expected_actions
    - skill_invocation graders require config.required_skills
    - tool_constraint graders require config.expect_tools or config.reject_tools
    - file graders require at least one of must_exist, must_not_exist, or content_patterns
    
    The strict YAML parser (KnownFields) already catches fields at the wrong nesting level.
    This change adds semantic validation to catch graders with empty/missing required fields.
    
    Validation is enforced in both GraderConfig (spec-level) and ValidatorInline (task-level)
    graders via their UnmarshalYAML methods.
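The required-field rules listed above could be expressed as a lookup table, sketched here in Go. This is an illustration only: `graderConfig`, `requiredAny`, and `validate` are hypothetical names, the config is modeled as a plain map rather than the project's real structs, and the hook into `UnmarshalYAML` is omitted so the example needs no YAML library.

```go
package main

import (
	"fmt"
	"strings"
)

// graderConfig is a simplified stand-in for a parsed grader entry (assumption).
type graderConfig struct {
	Type   string
	Config map[string]any
}

// requiredAny maps a grader type to config keys, at least one of which
// must be present and non-empty. The keys mirror the commit message.
var requiredAny = map[string][]string{
	"code":             {"assertions"},
	"diff":             {"expected_files"},
	"json_schema":      {"schema", "schema_file"},
	"program":          {"command"},
	"trigger":          {"skill_path"},
	"action_sequence":  {"expected_actions"},
	"skill_invocation": {"required_skills"},
	"tool_constraint":  {"expect_tools", "reject_tools"},
	"file":             {"must_exist", "must_not_exist", "content_patterns"},
}

// isEmpty reports whether a decoded value carries no content.
func isEmpty(v any) bool {
	switch x := v.(type) {
	case nil:
		return true
	case string:
		return x == ""
	case []any:
		return len(x) == 0
	case map[string]any:
		return len(x) == 0
	}
	return false
}

// validate returns an error when none of the required keys is populated.
func (g graderConfig) validate() error {
	keys, ok := requiredAny[g.Type]
	if !ok {
		return nil // grader types outside the table are not checked in this sketch
	}
	for _, k := range keys {
		if v, present := g.Config[k]; present && !isEmpty(v) {
			return nil
		}
	}
	return fmt.Errorf("%s grader requires config.%s", g.Type, strings.Join(keys, " or config."))
}

func main() {
	bad := graderConfig{Type: "code", Config: map[string]any{"assertions": []any{}}}
	good := graderConfig{Type: "file", Config: map[string]any{"must_exist": []any{"README.md"}}}
	fmt.Println(bad.validate())  // code grader requires config.assertions
	fmt.Println(good.validate()) // <nil>
}
```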
    
    Fixes #113
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * fix: run gofmt on spec.go and testcase.go
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * fix: update test fixtures for file/diff grader required config validation
    
    Test fixtures for coverage report tests used file and diff graders
    without required config fields, causing parse failures after the
    grader config validation added in #113. Updated fixtures to include
    valid config (must_exist and expected_files respectively).
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    ---------
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    Co-authored-by: Shayne Boyer <spboyer@users.noreply.github.com>
    3 people authored Apr 21, 2026
    900c10f
  4. fix: diff grader reads post-execution workspace files (#165) (#196)

    The diff grader was reading workspace files from the filesystem after the
    Copilot SDK session had disconnected. Since session.Disconnect() may
    restore workspace files to their pre-execution state, the grader would
    see the original file content instead of the agent's modifications.
    
    Fix: capture all workspace file contents into memory before the session
    disconnects, and have the diff grader prefer these captured files over
    the on-disk workspace. This guarantees graders always see the true
    post-execution state regardless of SDK disconnect behavior.
    
    Changes:
    - Add WorkspaceFiles field to ExecutionResponse and graders.Context
    - Add captureWorkspaceFiles() that snapshots workspace before disconnect
    - Add readWorkspaceFile() to diff grader that prefers captured files
      over filesystem reads, with forward-slash normalization for
      cross-platform consistency
    - Add tests for the capture function and grader behavior
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    spboyer and Copilot authored Apr 21, 2026
    f1a0fe6
  5. chore(deps): Bump smol-toml from 1.6.0 to 1.6.1 in /site (#158)

    Bumps [smol-toml](https://github.com/squirrelchat/smol-toml) from 1.6.0 to 1.6.1.
    - [Release notes](https://github.com/squirrelchat/smol-toml/releases)
    - [Commits](squirrelchat/smol-toml@v1.6.0...v1.6.1)
    
    ---
    updated-dependencies:
    - dependency-name: smol-toml
      dependency-version: 1.6.1
      dependency-type: indirect
    ...
    
    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    dependabot[bot] authored Apr 21, 2026
    2d03487
  6. chore(deps): Bump picomatch in /site (#159)

    Bumps two versions of [picomatch](https://github.com/micromatch/picomatch). These dependencies needed to be updated together.
    
    Updates `picomatch` from 4.0.3 to 4.0.4
    - [Release notes](https://github.com/micromatch/picomatch/releases)
    - [Changelog](https://github.com/micromatch/picomatch/blob/master/CHANGELOG.md)
    - [Commits](micromatch/picomatch@4.0.3...4.0.4)
    
    Updates `picomatch` from 2.3.1 to 2.3.2
    - [Release notes](https://github.com/micromatch/picomatch/releases)
    - [Changelog](https://github.com/micromatch/picomatch/blob/master/CHANGELOG.md)
    - [Commits](micromatch/picomatch@2.3.1...2.3.2)
    
    ---
    updated-dependencies:
    - dependency-name: picomatch
      dependency-version: 4.0.4
      dependency-type: indirect
    - dependency-name: picomatch
      dependency-version: 2.3.2
      dependency-type: indirect
    ...
    
    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    dependabot[bot] authored Apr 21, 2026
    08b024f
  7. chore(deps): Bump picomatch from 4.0.3 to 4.0.4 in /web (#160)

    Bumps [picomatch](https://github.com/micromatch/picomatch) from 4.0.3 to 4.0.4.
    - [Release notes](https://github.com/micromatch/picomatch/releases)
    - [Changelog](https://github.com/micromatch/picomatch/blob/master/CHANGELOG.md)
    - [Commits](micromatch/picomatch@4.0.3...4.0.4)
    
    ---
    updated-dependencies:
    - dependency-name: picomatch
      dependency-version: 4.0.4
      dependency-type: indirect
    ...
    
    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    dependabot[bot] authored Apr 21, 2026
    d2cc5b0
  8. chore(deps): Bump astro from 5.17.3 to 5.18.1 in /site (#163)

    Bumps [astro](https://github.com/withastro/astro/tree/HEAD/packages/astro) from 5.17.3 to 5.18.1.
    - [Release notes](https://github.com/withastro/astro/releases)
    - [Changelog](https://github.com/withastro/astro/blob/astro@5.18.1/packages/astro/CHANGELOG.md)
    - [Commits](https://github.com/withastro/astro/commits/astro@5.18.1/packages/astro)
    
    ---
    updated-dependencies:
    - dependency-name: astro
      dependency-version: 5.18.1
      dependency-type: direct:production
    ...
    
    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    dependabot[bot] authored Apr 21, 2026
    2dc6f07
  9. chore(deps): Bump vite from 6.4.1 to 6.4.2 in /site (#182)

    Bumps [vite](https://github.com/vitejs/vite/tree/HEAD/packages/vite) from 6.4.1 to 6.4.2.
    - [Release notes](https://github.com/vitejs/vite/releases)
    - [Changelog](https://github.com/vitejs/vite/blob/v6.4.2/packages/vite/CHANGELOG.md)
    - [Commits](https://github.com/vitejs/vite/commits/v6.4.2/packages/vite)
    
    ---
    updated-dependencies:
    - dependency-name: vite
      dependency-version: 6.4.2
      dependency-type: indirect
    ...
    
    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    dependabot[bot] authored Apr 21, 2026
    78902cd
  10. chore(deps): Bump go.opentelemetry.io/otel/sdk from 1.42.0 to 1.43.0 (#185)

    Bumps [go.opentelemetry.io/otel/sdk](https://github.com/open-telemetry/opentelemetry-go) from 1.42.0 to 1.43.0.
    - [Release notes](https://github.com/open-telemetry/opentelemetry-go/releases)
    - [Changelog](https://github.com/open-telemetry/opentelemetry-go/blob/main/CHANGELOG.md)
    - [Commits](open-telemetry/opentelemetry-go@v1.42.0...v1.43.0)
    
    ---
    updated-dependencies:
    - dependency-name: go.opentelemetry.io/otel/sdk
      dependency-version: 1.43.0
      dependency-type: indirect
    ...
    
    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    dependabot[bot] authored Apr 21, 2026
    13f7624
  11. chore(deps-dev): Bump vite from 6.4.1 to 6.4.2 in /web (#192)

    Bumps [vite](https://github.com/vitejs/vite/tree/HEAD/packages/vite) from 6.4.1 to 6.4.2.
    - [Release notes](https://github.com/vitejs/vite/releases)
    - [Changelog](https://github.com/vitejs/vite/blob/v6.4.2/packages/vite/CHANGELOG.md)
    - [Commits](https://github.com/vitejs/vite/commits/v6.4.2/packages/vite)
    
    ---
    updated-dependencies:
    - dependency-name: vite
      dependency-version: 6.4.2
      dependency-type: direct:development
    ...
    
    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    dependabot[bot] authored Apr 21, 2026
    abad478
  12. chore(deps): Bump defu from 6.1.4 to 6.1.6 in /site (#181)

    Bumps [defu](https://github.com/unjs/defu) from 6.1.4 to 6.1.6.
    - [Release notes](https://github.com/unjs/defu/releases)
    - [Changelog](https://github.com/unjs/defu/blob/main/CHANGELOG.md)
    - [Commits](unjs/defu@v6.1.4...v6.1.6)
    
    ---
    updated-dependencies:
    - dependency-name: defu
      dependency-version: 6.1.6
      dependency-type: indirect
    ...
    
    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    dependabot[bot] authored Apr 21, 2026
    2ce74fe
  13. Make it so the debug logging is more useful (#152)

    * Make it so the debug logging is more useful.
    
    - Omits some entries, like report_intent, that just add noise to the debugging process. This required tracking some state, since only the tool.execution_start event indicates that it's a report_intent event.
    - Dives into the arguments, input, and context parameters, which usually contain important information.
    - Outputs the selected model and the producer (i.e., the agent).
    
    * Whoops, they're debug now
    
    * Addressing all the copilot feedback, and restructuring the code that does the event testing a bit to make it more readable.
    
    ---------
    
    Co-authored-by: Richard Park <ripark@microsoft.com>
    richardpark-msft and Richard Park authored Apr 21, 2026
    9309566
  14. run --output-dir groups files by timestamp (#153)

    * `run --output-dir` groups files by timestamp
    
    * tweak the docs
    chlowell authored Apr 21, 2026
    5890653
  15. fix: --discover finds eval.yaml in project-root evals/{name}/ layout (#44)

    * Initial plan
    
    * fix: --discover finds eval.yaml in project-layout evals/{name}/ directory
    
    Co-authored-by: spboyer <7681382+spboyer@users.noreply.github.com>
    
    * fix: avoid exact path comparison in TestDiscoverProjectLayout (Windows symlink short paths)
    
    Co-authored-by: spboyer <7681382+spboyer@users.noreply.github.com>
    
    ---------
    
    Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
    Co-authored-by: spboyer <7681382+spboyer@users.noreply.github.com>
    Copilot and spboyer authored Apr 21, 2026
    c99a318
  16. fix: update jsonrpc test fixture for grader validation (#113)

    The grader config validation from PR #195 correctly rejects code
    graders without assertions. Updated the test eval YAML to include
    a config.assertions field.
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    Copilot committed Apr 21, 2026
    b2fe957
  17. docs: add cache command, prompt mode, and complete schema reference (#198)

    - Add waza cache clear command to CLI reference with flags and examples
    - Add mode field to Prompt grader in graders guide (independent/pairwise)
    - Add missing config fields to schema reference: max_attempts, group_by,
      fail_fast, skill_directories, required_skills, mcp_servers
    - All 16 documentation pages build successfully
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    spboyer and Copilot authored Apr 21, 2026
    015392f
  18. test: add coverage for suggest and jsonrpc packages (#199)

    * test: add coverage for suggest and jsonrpc packages
    
    Boost suggest from 68% to 83% and jsonrpc from 69% to 90%.
    
    suggest package:
    - grader_docs_test.go: GraderSummaries format, LoadGraderDocs with
      fstest.MapFS (nil, empty, mixed valid/invalid, whitespace trimming)
    - prompt_test.go: renderSelectionPrompt, renderImplementationPrompt,
      renderPrompt with various data shapes and empty fields
    - helpers_test.go: orDefault, phrasesToText, summarizeBody,
      extractYAML, normalizeGeneratedPath, filterValidGraderTypes,
      parseGraderSelection edge cases
    - resolve_test.go: resolveSkillFile, loadSkill, buildPromptData,
      WriteToDir edge cases (empty paths, traversal, invalid YAML)
    
    jsonrpc package:
    - methods_test.go: MethodRegistry CRUD, overwrite, empty name,
      handler error/params, RegisterHandlers verification
    - handlers_extra_test.go: task.list/task.get success paths,
      run.cancel success/already-completed, eval.* edge cases,
      TCP listener start/close/serve, malformed YAML validation
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * test: add coverage for storage and mcp packages
    
    - storage: 39.1% → 54.6% (new tests for azure_blob pure functions,
      store helpers, local.go edge cases)
    - mcp: 58.3% → 89.4% (new tests for task_list, quickLinkCheck,
      resolveDir, ServeStdio, hasIDField, dispatchTool, skill check)
    
    Replaced all skipped mock tests in azure_blob_test.go with real tests
    for sanitizePathSegment, stringPtr, getMetadata, isCI, blobToResultSummary,
    outcomeToResultSummary, and NewAzureBlobStore validation.
    
    Note: storage cannot reach 70% without extracting a blob client interface
    from AzureBlobStore. The remaining 0% functions (Upload, Download, List,
    Compare, findBlobBySuffix, findBlobByMetadata) all call *azblob.Client
    directly. A follow-up PR can introduce the interface for full testability.
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * fix: address errcheck lint violations in storage tests
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * fix: address Copilot review feedback — remove sleep, use bufio scanner, fix paths
    
    - Replace time.Sleep with immediate connection (listener already active)
    - Use bufio.Scanner for newline-delimited JSON instead of raw conn.Read
    - Replace hard-coded absolute path with t.TempDir()-based path
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * fix: cross-platform path tests and errcheck lint issues
    
    - Use filepath.FromSlash for path assertions in normalizeGeneratedPath tests
    - Use runtime.GOOS to pick platform-appropriate absolute path in rejection test
    - Satisfy errcheck linter by explicitly discarding os.Setenv/Unsetenv/WriteFile errors in tests
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * fix: remaining errcheck violations in mcp coverage tests
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    ---------
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    spboyer and Copilot authored Apr 21, 2026
    8a6129a