Add in custom YAML deserializers for our config #106

Merged
github-actions[bot] merged 15 commits into microsoft:main from richardpark-msft:wz-yaml-fixes on Mar 10, 2026
Conversation

@richardpark-msft

Member

Makes our YAML serialization/deserialization one-way. I need this change elsewhere, where I auto-generate YAML files and want all the schemas to be consistent.

Copilot AI review requested due to automatic review settings March 10, 2026 20:19
@github-actions github-actions Bot enabled auto-merge (squash) March 10, 2026 20:20

Copilot AI left a comment

Contributor

Pull request overview

This PR refactors grader configuration loading to use typed, polymorphic YAML decoding (models.GraderParameters) instead of map[string]any + mapstructure, and updates grader constructors/call sites accordingly to support one-way, schema-consistent YAML generation.

Changes:

  • Add typed grader parameter structs and polymorphic YAML decoding for eval and task grader configs.
  • Simplify graders.Create and grader constructors to accept typed parameter structs.
  • Update tests and orchestration runner logic to use the new typed config flow.

Reviewed changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 4 comments.

| File | Description |
|------|-------------|
| `internal/models/grader_params.go` | Introduces typed GraderParameters and decoding helpers for per-grader YAML config payloads. |
| `internal/models/grader_params_test.go` | Adds coverage for polymorphic decode behavior for spec graders and task validators. |
| `internal/models/spec.go` | Switches spec grader config from map[string]any to typed GraderParameters via custom YAML unmarshal. |
| `internal/models/testcase.go` | Switches task inline validator config from map[string]any to typed GraderParameters via custom YAML unmarshal. |
| `internal/graders/grader.go` | Replaces mapstructure-based factory with type-switch on models.GraderParameters. |
| `internal/orchestration/runner.go` | Updates grader creation and applies defaults via typed parameters. |
| `internal/graders/*` + `*_test.go` | Updates individual graders/tests to accept typed parameter structs. |
| `internal/orchestration/runner_orchestration_test.go` | Updates orchestration tests to use typed parameter structs. |
| `cmd/waza/cmd_run_suggest_test.go` | Updates test grader params to match typed schema (contains as []string). |
| `go.mod` | Moves lipgloss to indirect dependency. |
| `internal/orchestration/judge_model_test.go` | Removes injectJudgeModel unit tests (function deleted). |
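The type-switch factory mentioned for `internal/graders/grader.go` might look something like the sketch below. Everything here except the idea of switching on a `GraderParameters` value is an assumption for illustration: the `Grader` interface, the `ContainsParams` and `MaxLengthParams` structs, and the scoring logic are hypothetical and not taken from the repository.

```go
package main

import (
	"fmt"
	"strings"
)

// GraderParameters marks typed parameter structs (method name is hypothetical).
type GraderParameters interface{ isGraderParameters() }

// Hypothetical typed parameters for two grader kinds.
type ContainsParams struct{ Contains []string }
type MaxLengthParams struct{ Limit int }

func (ContainsParams) isGraderParameters()  {}
func (MaxLengthParams) isGraderParameters() {}

// Grader is a hypothetical minimal grader interface returning a 0.0-1.0 score.
type Grader interface {
	Grade(output string) float64
}

type containsGrader struct{ needles []string }

// Grade returns the fraction of required substrings found in the output.
func (g containsGrader) Grade(output string) float64 {
	if len(g.needles) == 0 {
		return 1
	}
	hits := 0
	for _, n := range g.needles {
		if strings.Contains(output, n) {
			hits++
		}
	}
	return float64(hits) / float64(len(g.needles))
}

type maxLengthGrader struct{ limit int }

// Grade passes (1.0) when the output is within the byte limit.
func (g maxLengthGrader) Grade(output string) float64 {
	if len(output) <= g.limit {
		return 1
	}
	return 0
}

// Create dispatches on the concrete parameter type. Because the parameters
// are already typed, no map lookups or mapstructure decoding are needed here.
func Create(p GraderParameters) (Grader, error) {
	switch v := p.(type) {
	case ContainsParams:
		return containsGrader{needles: v.Contains}, nil
	case MaxLengthParams:
		return maxLengthGrader{limit: v.Limit}, nil
	default:
		return nil, fmt.Errorf("unsupported grader parameters %T", p)
	}
}

func main() {
	g, err := Create(ContainsParams{Contains: []string{"hello", "world"}})
	if err != nil {
		fmt.Println("create error:", err)
		return
	}
	fmt.Println(g.Grade("hello there"))
}
```

One consequence of this design is that a parameter type without a matching case fails at construction time with a clear error, rather than at grading time with a missing map key.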

Comment thread internal/models/grader_params.go Outdated
Comment thread internal/models/grader_params.go Outdated
Comment thread internal/graders/file_grader.go Outdated
Comment thread internal/orchestration/runner.go
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 10, 2026 20:31

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 32 out of 32 changed files in this pull request and generated 4 comments.

Comment thread internal/models/spec.go
Comment thread internal/models/testcase.go
Comment thread internal/orchestration/runner.go
Comment thread internal/models/grader_params.go
Copilot AI review requested due to automatic review settings March 10, 2026 20:43

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 32 out of 32 changed files in this pull request and generated 3 comments.

Comment thread internal/models/grader_params_test.go
Comment thread internal/graders/file_grader.go Outdated
Comment thread internal/models/grader_params.go
Copilot AI review requested due to automatic review settings March 10, 2026 20:59

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 35 out of 36 changed files in this pull request and generated 4 comments.

Comment thread internal/webapi/additional_test.go
Comment thread internal/models/events_test.go
Comment thread internal/execution/session_events_collector_test.go
Comment thread internal/orchestration/runner.go
@github-actions github-actions Bot merged commit d3e8714 into microsoft:main Mar 10, 2026
6 checks passed
@richardpark-msft richardpark-msft deleted the wz-yaml-fixes branch March 10, 2026 21:12
@codecov-commenter

Codecov Report

❌ Patch coverage is 67.35751% with 63 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@f3371ce). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| internal/models/grader_params.go | 34.78% | 28 Missing and 2 partials ⚠️ |
| internal/graders/grader.go | 56.00% | 11 Missing ⚠️ |
| internal/graders/diff_grader.go | 50.00% | 4 Missing ⚠️ |
| internal/models/spec.go | 82.60% | 2 Missing and 2 partials ⚠️ |
| internal/models/testcase.go | 80.95% | 2 Missing and 2 partials ⚠️ |
| internal/orchestration/runner.go | 75.00% | 4 Missing ⚠️ |
| internal/execution/copilot_client_wrappers.go | 0.00% | 2 Missing ⚠️ |
| internal/graders/inline_script_grader.go | 75.00% | 1 Missing and 1 partial ⚠️ |
| internal/execution/copilot.go | 0.00% | 0 Missing and 1 partial ⚠️ |
| internal/graders/prompt_grader.go | 50.00% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #106   +/-   ##
=======================================
  Coverage        ?   72.97%           
=======================================
  Files           ?      131           
  Lines           ?    14817           
  Branches        ?        0           
=======================================
  Hits            ?    10812           
  Misses          ?     3204           
  Partials        ?      801           

| Flag | Coverage Δ |
|------|------------|
| go-implementation | 72.97% <67.35%> (?) |

richardpark-msft pushed a commit to richardpark-msft/waza that referenced this pull request Mar 11, 2026
…rosoft#160)

Closes microsoft#106

Adapts Azure ML's tool_call evaluator rubrics (tool_call_accuracy,
tool_selection, tool_input_accuracy, tool_output_utilization) as
waza-compatible YAML configs for the prompt grader.

## What's included

| File | Evaluates | Scale |
|------|-----------|-------|
| `tool_call_accuracy.yaml` | Overall tool call effectiveness | 1–5 ordinal → 0.0–1.0 |
| `tool_selection.yaml` | Right tools chosen, none missed | Binary → 0.0/1.0 |
| `tool_input_accuracy.yaml` | Parameter correctness | Binary → 0.0/1.0 |
| `tool_output_utilization.yaml` | Correct use of tool results | Binary → 0.0/1.0 |
| `README.md` | Usage guide and rubric structure docs | — |

## Dependencies

These are config artifacts (YAML + docs). They become usable once the
`prompt` grader (microsoft#104) merges.

## Source

Adapted from [Azure ML built-in
evaluators](https://github.com/Azure/azureml-assets/tree/main/assets/evaluators/builtin)
(MIT License).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
richardpark-msft pushed a commit to richardpark-msft/waza that referenced this pull request Mar 11, 2026
…t#161)

Closes microsoft#107

Adapts Azure ML's task evaluation rubrics as waza-compatible YAML
configs for the prompt grader.

## Rubrics added

| Rubric | Score type | Scale | Description |
|--------|-----------|-------|-------------|
| `task_adherence` | binary flag | 0.0 / 1.0 | 3-dimension eval (goal/rule/procedure); flagged=true on any material failure |
| `task_completion` | binary | 0.0 / 1.0 | Was the task fully completed? Outcome-focused |
| `intent_resolution` | ordinal 1-5 | 0.0–1.0 | How well did the agent resolve the user's intent? |
| `response_completeness` | ordinal 1-5 | 0.0–1.0 | How thoroughly does the response cover ground truth? |

## Structure

Each rubric YAML includes:
- `evaluation_criteria` — detailed rubric text adapted from Azure ML
`.prompty` files
- `rating_levels` — scoring scale with descriptions
- `score_normalization` — raw score → 0.0-1.0 mapping
- `input_mapping` — waza graders.Context → rubric input mapping
- `chain_of_thought` — step-by-step LLM judge instructions
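
The `score_normalization` step (raw score → 0.0-1.0) could be implemented along these lines. This is a sketch under a stated assumption: the commit message does not give the mapping, so a linear min-max rescale of the 1-5 ordinal scale is assumed here, and the function names `normalizeOrdinal` and `normalizeBinary` are hypothetical.

```go
package main

import "fmt"

// normalizeOrdinal linearly rescales a raw ordinal score on [lo, hi] to
// the range 0.0-1.0. ASSUMPTION: the rubric's actual mapping may differ.
func normalizeOrdinal(raw, lo, hi int) (float64, error) {
	if hi <= lo {
		return 0, fmt.Errorf("invalid scale %d-%d", lo, hi)
	}
	if raw < lo || raw > hi {
		return 0, fmt.Errorf("score %d outside scale %d-%d", raw, lo, hi)
	}
	return float64(raw-lo) / float64(hi-lo), nil
}

// normalizeBinary maps a pass/fail judgment to 1.0 or 0.0.
func normalizeBinary(pass bool) float64 {
	if pass {
		return 1
	}
	return 0
}

func main() {
	for raw := 1; raw <= 5; raw++ {
		score, err := normalizeOrdinal(raw, 1, 5)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("raw %d -> %.2f\n", raw, score)
	}
	fmt.Println("binary pass ->", normalizeBinary(true))
}
```

Rejecting out-of-range scores, rather than clamping them, is a design choice in this sketch: it surfaces a judge response that ignored the rating scale instead of hiding it.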

## Source

Adapted from
[Azure/azure-sdk-for-python](https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators)
evaluators:
- `TaskAdherenceEvaluator`
- `TaskCompletionEvaluator`
- `IntentResolutionEvaluator`
- `ResponseCompletenessEvaluator`

> Note: The `examples/rubrics/README.md` is being created separately in
microsoft#106.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
4 participants