Doing bulk mechanical renames using Go refactor/rename #128

Closed
richardpark-msft wants to merge 14 commits into microsoft:main from richardpark-msft:wz-bug-cant-find-skills

Conversation

@richardpark-msft (Member) commented Mar 13, 2026

  • BenchmarkSpec to EvalSpec
  • Config to EvalConfig
  • TaskSpec renames
    • TestCase to TaskSpec
    • TestStimulus -> TaskInputs
    • ValidatorInline -> Grader
  • Grader.Kind -> Grader.Type
  • Some tests were also renamed, as well as local variables
  • Some comments (those were search and replace)

Richard Park added 3 commits March 13, 2026 02:05
- TestCase to TaskSpec
- TestStimulus -> TaskInputs
- ValidatorInline -> Grader
Copilot AI review requested due to automatic review settings March 13, 2026 02:12
github-actions bot enabled auto-merge (squash) March 13, 2026 02:12

Copilot AI (Contributor) left a comment

Pull request overview

This PR applies a large set of mechanical renames across the evaluation (“benchmark”) pipeline, updating core model types and propagating those changes through orchestration, graders, caching, config, CLI, JSON-RPC handlers, and tests.

Changes:

  • Renames models.BenchmarkSpec → models.EvalSpec and models.Config → models.EvalConfig across runtime code and tests.
  • Renames task-level structures (TestCase → TaskSpec, TestStimulus → TaskInputs, ValidatorInline → Grader) and updates call sites.
  • Renames grader identity/type plumbing (GraderKind → GraderType, Grader.Kind() → Grader.Type()), updating all grader implementations and tests.
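
As a rough illustration of the grader rename described above, here is a minimal Go sketch. Only the names Grader, GraderType, Kind() and Type() come from this PR; the constant value, the textGrader type, and the one-method interface are assumptions made for the example, and the real interface likely has more methods.

```go
package main

import "fmt"

// GraderType names the category of a grader (called GraderKind before this PR).
type GraderType string

// GraderTypeText is a hypothetical constant value, chosen only for illustration.
const GraderTypeText GraderType = "text"

// Grader is a pared-down stand-in for the grader interface.
type Grader interface {
	Type() GraderType // this method was Kind() before the rename
}

// textGrader is a hypothetical implementation.
type textGrader struct{}

// Type reports which category this grader belongs to.
func (textGrader) Type() GraderType { return GraderTypeText }

// Compile-time check: the build fails here if an implementation
// still exposes Kind() instead of Type().
var _ Grader = textGrader{}

func main() {
	var g Grader = textGrader{}
	fmt.Println(g.Type()) // prints "text"
}
```

An interface assertion like `var _ Grader = textGrader{}` is one way to make a missed implementation surface at compile time rather than in a test run.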

Reviewed changes

Copilot reviewed 53 out of 54 changed files in this pull request and generated 4 comments.

File Description
internal/trigger/runner_test.go Updates tests to use EvalSpec instead of BenchmarkSpec.
internal/transcript/transcript_test.go Updates transcript test fixtures to TaskSpec / TaskInputs.
internal/transcript/transcript.go Updates BuildTaskTranscript to accept TaskSpec and read from Inputs.
internal/suggest/suggest.go Updates YAML validation to unmarshal into EvalSpec.
internal/orchestration/runner_test.go Updates runner tests for EvalSpec / EvalConfig and task structs.
internal/orchestration/runner_orchestration_test.go Updates orchestration tests for renamed spec/task/grader fields and Type().
internal/orchestration/runner.go Converts runner task loading/execution/grading flow to TaskSpec / TaskInputs.
internal/orchestration/filter_test.go Updates filter tests to operate on TaskSpec.
internal/orchestration/filter.go Updates filter APIs to accept/return TaskSpec.
internal/orchestration/csv_integration_test.go Updates CSV task generation tests to EvalSpec and TaskSpec.Inputs.
internal/orchestration/baseline_test.go Updates baseline tests to use EvalSpec / EvalConfig.
internal/models/taskspec_test.go Adds coverage for should_trigger YAML → ExpectedTrigger pointer behavior.
internal/models/taskspec.go Renames task model types; updates inline grader struct to Grader with Type.
internal/models/spec.go Renames spec/config types to EvalSpec / EvalConfig and updates grader kind type.
internal/models/outcome.go Renames GraderKind → GraderType and updates results typing.
internal/models/grader_params_test.go Updates polymorphic parameter tests to read task-level graders via tc.Graders.
internal/models/grader_params.go Updates parameter decoding to accept GraderType.
internal/models/baseline_test.go Updates baseline YAML test to unmarshal into EvalSpec.
internal/jsonrpc/handlers.go Updates JSON-RPC eval handling to use EvalSpec / EvalConfig.
internal/graders/trigger_grader_test.go Updates trigger grader tests for Type() and TaskSpec.Inputs.
internal/graders/trigger_grader.go Updates trigger grader to implement Type() and read prompt from Inputs.
internal/graders/tool_constraint_grader_test.go Updates tests from Kind() to Type().
internal/graders/tool_constraint_grader.go Updates grader interface implementation to Type().
internal/graders/text_grader_test.go Updates tests from Kind() to Type().
internal/graders/text_grader.go Updates grader interface implementation to Type().
internal/graders/skill_invocation_grader_test.go Updates tests from Kind() to Type().
internal/graders/skill_invocation_grader.go Updates grader interface implementation to Type().
internal/graders/run.go Updates runner entrypoint to accept TaskSpec and task-level Graders.
internal/graders/prompt_grader.go Updates prompt grader interface to Type() and result typing to GraderType.
internal/graders/program_grader_test.go Updates tests from Kind() to Type().
internal/graders/program_grader.go Updates grader interface implementation to Type().
internal/graders/json_schema_grader_test.go Updates tests from Kind() to Type().
internal/graders/json_schema_grader.go Updates grader interface implementation to Type().
internal/graders/inline_script_grader_test.go Updates tests from Kind() to Type().
internal/graders/inline_script_grader.go Updates grader interface implementation to Type().
internal/graders/grader.go Updates the grader interface (Type()) and context to reference TaskSpec.
internal/graders/file_grader_test.go Updates tests from Kind() to Type().
internal/graders/file_grader.go Updates grader interface implementation to Type().
internal/graders/diff_grader.go Updates grader interface implementation to Type().
internal/graders/behavior_grader_test.go Updates tests from Kind() to Type().
internal/graders/behavior_grader.go Updates grader interface implementation to Type().
internal/graders/action_sequence_grader_test.go Updates tests from Kind() to Type().
internal/graders/action_sequence_grader.go Updates grader interface implementation to Type().
internal/config/config_test.go Updates config tests to pass EvalSpec.
internal/config/config.go Updates BenchmarkConfig to store/return *EvalSpec.
internal/cache/cache_test.go Updates cache tests to use EvalSpec and TaskSpec.
internal/cache/cache.go Updates cache key computation to accept EvalSpec + TaskSpec and read resources from Inputs.
cmd/waza/newtask/converters_test.go Updates new-task converter tests for TaskSpec and task-level Graders.
cmd/waza/newtask/converters.go Updates Copilot log → task converter to build TaskSpec with Inputs + Graders.
cmd/waza/cmd_run_suggest_test.go Updates suggest-related tests to use EvalSpec / EvalConfig.
cmd/waza/cmd_run_suggest.go Propagates EvalSpec through suggestion/report generation helpers and task loading.
cmd/waza/cmd_run.go Updates single-model execution path to accept *EvalSpec.
cmd/waza/cmd_new_task_test.go Updates end-to-end new-task test expectations to TaskSpec/Graders.
cmd/waza/cmd_grade.go Updates grading helpers to accept EvalSpec / TaskSpec.


Comment thread internal/jsonrpc/handlers.go
Comment thread internal/graders/run.go Outdated
Comment thread internal/models/baseline_test.go
Comment thread internal/models/grader_params_test.go
codecov-commenter commented Mar 13, 2026

Codecov Report

❌ Patch coverage is 78.44311% with 36 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@3068653). Learn more about missing BASE report.

Files with missing lines Patch % Lines
cmd/waza/cmd_run_suggest.go 66.66% 7 Missing and 2 partials ⚠️
internal/orchestration/runner.go 82.69% 8 Missing and 1 partial ⚠️
internal/jsonrpc/handlers.go 37.50% 4 Missing and 1 partial ⚠️
internal/graders/run.go 0.00% 4 Missing ⚠️
internal/graders/prompt_grader.go 0.00% 3 Missing ⚠️
cmd/waza/cmd_grade.go 88.88% 0 Missing and 1 partial ⚠️
cmd/waza/cmd_new_task.go 50.00% 0 Missing and 1 partial ⚠️
cmd/waza/newtask/converters.go 90.00% 1 Missing ⚠️
internal/graders/diff_grader.go 0.00% 1 Missing ⚠️
internal/models/spec.go 80.00% 1 Missing ⚠️
... and 1 more
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #128   +/-   ##
=======================================
  Coverage        ?   73.51%           
=======================================
  Files           ?      138           
  Lines           ?    15785           
  Branches        ?        0           
=======================================
  Hits            ?    11605           
  Misses          ?     3338           
  Partials        ?      842           
Flag Coverage Δ
go-implementation 73.51% <78.44%> (?)

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 13, 2026 02:22

Copilot AI (Contributor) left a comment

Pull request overview

This PR performs a broad mechanical refactor/rename across the evaluation pipeline, updating core model types (specs, task/test definitions, and graders) and propagating those renames through orchestration, graders, config, cache, CLI, and related tests.

Changes:

  • Rename core models: BenchmarkSpec → EvalSpec, Config → EvalConfig, TestCase → TaskSpec, TestStimulus → TaskInputs, inline task validators → Grader with Kind → Type.
  • Update orchestration/runner, graders, caching, JSON-RPC handlers, and CLI code to use the new types/fields.
  • Add/update tests to cover renamed structures (including a new test for should_trigger decoding).
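
The should_trigger test mentioned above is about decoding into a pointer field. A minimal sketch of why a pointer matters: it lets the decoder distinguish "key absent" from "explicitly false". This example uses encoding/json to stay within the standard library, whereas the PR decodes YAML; the struct name triggerFields and the helper functions are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// triggerFields is a hypothetical stand-in; only the pointer field matters here.
type triggerFields struct {
	// nil means the key was absent; non-nil carries the explicit value.
	ExpectedTrigger *bool `json:"should_trigger"`
}

// decode parses raw into a triggerFields value.
func decode(raw string) (triggerFields, error) {
	var f triggerFields
	err := json.Unmarshal([]byte(raw), &f)
	return f, err
}

// describe renders the three possible states of a *bool.
func describe(p *bool) string {
	if p == nil {
		return "unset"
	}
	if *p {
		return "true"
	}
	return "false"
}

func main() {
	inputs := []string{`{}`, `{"should_trigger": false}`, `{"should_trigger": true}`}
	for _, raw := range inputs {
		f, err := decode(raw)
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		fmt.Println(raw, "->", describe(f.ExpectedTrigger))
	}
}
```

With a plain bool field, the first two inputs would be indistinguishable after decoding.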

Reviewed changes

Copilot reviewed 53 out of 54 changed files in this pull request and generated 1 comment.

File Description
internal/trigger/runner_test.go Update trigger runner tests to use EvalSpec.
internal/transcript/transcript_test.go Update transcript tests to use TaskSpec/TaskInputs.
internal/transcript/transcript.go Accept *TaskSpec and read Inputs.Message for transcripts.
internal/suggest/suggest.go Unmarshal eval YAML into EvalSpec instead of BenchmarkSpec.
internal/orchestration/runner_test.go Update orchestration runner tests for EvalSpec/TaskSpec.
internal/orchestration/runner_orchestration_test.go Update orchestration integration tests for renamed grader/task structures.
internal/orchestration/runner.go Migrate task loading/execution path to TaskSpec and Inputs.
internal/orchestration/filter_test.go Update filtering tests to operate on []*TaskSpec.
internal/orchestration/filter.go Update filter API to operate on []*TaskSpec.
internal/orchestration/csv_integration_test.go Update CSV task generation tests to validate Inputs.Message.
internal/orchestration/baseline_test.go Update baseline orchestration tests to use EvalSpec/EvalConfig.
internal/models/taskspec_test.go Add coverage for should_trigger decoding into ExpectedTrigger.
internal/models/taskspec.go Rename task model types and inline graders; update YAML unmarshalling accordingly.
internal/models/spec.go Rename spec/config types to EvalSpec/EvalConfig and update validation.
internal/models/outcome.go Rename GraderKind type to GraderType (constants retained).
internal/models/grader_params_test.go Update grader parameter decoding test to use Graders.
internal/models/grader_params.go Update parameter decoding entrypoint to accept GraderType.
internal/models/baseline_test.go Update baseline YAML parsing test to use EvalSpec.
internal/jsonrpc/handlers.go Update eval get/validate handlers to use EvalSpec and EvalConfig.
internal/graders/trigger_grader_test.go Update trigger grader tests for Type() and TaskSpec.Inputs.
internal/graders/trigger_grader.go Rename grader interface method to Type() and update prompt access.
internal/graders/tool_constraint_grader_test.go Update tool-constraint grader tests for Type().
internal/graders/tool_constraint_grader.go Implement Type() instead of Kind().
internal/graders/text_grader_test.go Update text grader tests for Type().
internal/graders/text_grader.go Implement Type() instead of Kind().
internal/graders/skill_invocation_grader_test.go Update skill invocation grader tests for Type().
internal/graders/skill_invocation_grader.go Implement Type() instead of Kind().
internal/graders/run.go Run task-level graders from TaskSpec.Graders and validate Type.
internal/graders/prompt_grader.go Rename grader interface method to Type() and propagate into results.
internal/graders/program_grader_test.go Update program grader tests for Type().
internal/graders/program_grader.go Implement Type() instead of Kind().
internal/graders/json_schema_grader_test.go Update JSON schema grader tests for Type().
internal/graders/json_schema_grader.go Implement Type() instead of Kind().
internal/graders/inline_script_grader_test.go Update inline-script grader tests for Type().
internal/graders/inline_script_grader.go Implement Type() instead of Kind().
internal/graders/grader.go Update grader interface to Type() and context to reference *TaskSpec.
internal/graders/file_grader_test.go Update file grader tests for Type().
internal/graders/file_grader.go Implement Type() instead of Kind().
internal/graders/diff_grader.go Implement Type() instead of Kind().
internal/graders/behavior_grader_test.go Update behavior grader tests for Type().
internal/graders/behavior_grader.go Implement Type() instead of Kind().
internal/graders/action_sequence_grader_test.go Update action-sequence grader tests for Type().
internal/graders/action_sequence_grader.go Implement Type() instead of Kind().
internal/config/config_test.go Update config tests to use EvalSpec.
internal/config/config.go Store *EvalSpec in BenchmarkConfig and update getter signature.
internal/cache/cache_test.go Update cache tests for EvalSpec/TaskSpec/Inputs.
internal/cache/cache.go Update cache key inputs to EvalSpec/TaskSpec and use Inputs.Resources.
cmd/waza/newtask/converters_test.go Update task generation tests for TaskSpec and inline Graders.
cmd/waza/newtask/converters.go Emit *TaskSpec, populate Inputs.Message, and append Graders.
cmd/waza/cmd_run_suggest_test.go Update suggest tests to use EvalSpec/EvalConfig.
cmd/waza/cmd_run_suggest.go Update suggest pipeline to accept *EvalSpec and load []*TaskSpec.
cmd/waza/cmd_run.go Update run path to accept *EvalSpec in runSingleModel.
cmd/waza/cmd_new_task_test.go Update end-to-end task generation test expected TaskSpec shape.
cmd/waza/cmd_grade.go Update grading path to operate on *EvalSpec and *TaskSpec.
Comments suppressed due to low confidence (1)

internal/models/taskspec.go:17

  • TaskSpec.Inputs was renamed from Stimulus, but its JSON tag is still json:"stimulus". This makes the JSON representation inconsistent with the field name and the task schema (which uses inputs), and is likely an accidental leftover from the mechanical rename. Consider changing the JSON tag to json:"inputs" (or removing the JSON tag if TaskSpec is not meant to be JSON-serialized).
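
To make the tag mismatch concrete, here is a small Go sketch of what encoding/json emits in each case. The struct shapes are assumptions: staleTaskSpec and fixedTaskSpec are hypothetical reductions of TaskSpec, and the json:"message" tag on TaskInputs.Message is invented for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TaskInputs is reduced to one field; its JSON tag is an assumption.
type TaskInputs struct {
	Message string `json:"message"`
}

// staleTaskSpec mirrors the reported state: field renamed, tag left behind.
type staleTaskSpec struct {
	Inputs TaskInputs `json:"stimulus"`
}

// fixedTaskSpec applies the suggested tag.
type fixedTaskSpec struct {
	Inputs TaskInputs `json:"inputs"`
}

// mustJSON marshals v and panics on error, which is acceptable in a demo.
func mustJSON(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	in := TaskInputs{Message: "hi"}
	fmt.Println(mustJSON(staleTaskSpec{Inputs: in})) // {"stimulus":{"message":"hi"}}
	fmt.Println(mustJSON(fixedTaskSpec{Inputs: in})) // {"inputs":{"message":"hi"}}
}
```

The Go field name plays no part in the output once a tag is present, which is why a rename tool that only touches identifiers leaves the wire format unchanged.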


Comment thread internal/orchestration/runner_orchestration_test.go Outdated
Copilot AI review requested due to automatic review settings March 13, 2026 02:34

Copilot AI (Contributor) left a comment

Pull request overview

Bulk mechanical rename sweep aligning core evaluation/task model naming across the Go codebase (EvalSpec/EvalConfig, TaskSpec/TaskInputs, Grader.Type) and updating all call sites accordingly.

Changes:

  • Renamed BenchmarkSpec → EvalSpec and Config → EvalConfig across config loading/validation, orchestration, CLI, and JSON-RPC handlers.
  • Renamed task model TestCase → TaskSpec (with TestStimulus → TaskInputs) and updated YAML/loader functions (LoadTaskSpec, task filtering, CSV task generation).
  • Renamed grader APIs (Grader.Kind() → Grader.Type(), GraderKind → GraderType) and updated built-in graders + tests.

Reviewed changes

Copilot reviewed 62 out of 62 changed files in this pull request and generated 3 comments.

File Description
internal/trigger/runner_test.go Updates runner tests to construct configs from models.EvalSpec.
internal/transcript/transcript_test.go Updates transcript test to use TaskSpec/TaskInputs.
internal/transcript/transcript.go Updates BuildTaskTranscript to accept *models.TaskSpec and read from Inputs.
internal/suggest/suggest_test.go Updates test comment to reference EvalSpec.
internal/suggest/suggest.go Validates suggested YAML by unmarshalling into models.EvalSpec.
internal/suggest/prompt.go Updates prompt wording to require EvalSpec YAML.
internal/orchestration/runner_test.go Updates orchestration runner tests to use EvalSpec, EvalConfig, TaskSpec.
internal/orchestration/runner_orchestration_test.go Updates orchestration integration-style tests for renamed models and grader fields.
internal/orchestration/runner.go Renames task loading/execution pipeline to TaskSpec terminology and updates request building.
internal/orchestration/filter_test.go Updates filter tests for FilterTaskSpecs and TaskSpec helpers.
internal/orchestration/filter.go Renames filter entrypoint to FilterTaskSpecs and updates argument types.
internal/orchestration/csv_integration_test.go Updates CSV task-loading tests to loadTaskSpecsFromCSV and EvalSpec.
internal/orchestration/baseline_test.go Updates baseline tests to construct EvalSpec/EvalConfig.
internal/models/taskspec_test.go Updates loader test names and calls to LoadTaskSpec.
internal/models/taskspec.go Renames task structs to TaskSpec/TaskInputs and inline validators to Grader.
internal/models/spec_test.go Renames spec/task loader tests to new loader functions/types.
internal/models/spec.go Renames spec structs/functions to EvalSpec/EvalConfig/LoadEvalSpec.
internal/models/outcome.go Renames GraderKind type to GraderType and updates result fields accordingly.
internal/models/grader_params_test.go Updates parameter decoding tests for LoadEvalSpec/LoadTaskSpec and Graders.
internal/models/grader_params.go Updates grader-parameter decoding signature to accept GraderType.
internal/models/baseline_test.go Updates baseline YAML serialization tests to use EvalSpec.
internal/jsonrpc/handlers.go Updates JSON-RPC handlers to load EvalSpec and return updated config types.
internal/graders/trigger_grader_test.go Updates trigger grader tests for Type() and TaskSpec context.
internal/graders/trigger_grader.go Updates trigger grader interface to Type() and reads prompt from TaskSpec.Inputs.
internal/graders/tool_constraint_grader_test.go Updates tool-constraint grader tests for Type().
internal/graders/tool_constraint_grader.go Updates tool-constraint grader interface to Type().
internal/graders/text_grader_test.go Updates text grader tests for Type().
internal/graders/text_grader.go Updates text grader interface to Type().
internal/graders/skill_invocation_grader_test.go Updates skill-invocation grader tests for Type().
internal/graders/skill_invocation_grader.go Updates skill-invocation grader interface to Type().
internal/graders/run.go Updates RunAll signature to accept *models.TaskSpec and uses tc.Graders.
internal/graders/prompt_grader_test.go Updates prompt grader test to load EvalSpec.
internal/graders/prompt_grader.go Updates prompt grader interface to Type() and result typing.
internal/graders/program_grader_test.go Updates program grader tests for Type().
internal/graders/program_grader.go Updates program grader interface to Type().
internal/graders/json_schema_grader_test.go Updates JSON-schema grader tests for Type().
internal/graders/json_schema_grader.go Updates JSON-schema grader interface to Type().
internal/graders/inline_script_grader_test.go Updates inline-script grader tests for Type().
internal/graders/inline_script_grader.go Updates inline-script grader interface to Type().
internal/graders/grader.go Updates grader interface (Type) and grading context (TaskSpec).
internal/graders/file_grader_test.go Updates file grader tests for Type().
internal/graders/file_grader.go Updates file grader interface to Type().
internal/graders/diff_grader.go Updates diff grader interface to Type().
internal/graders/behavior_grader_test.go Updates behavior grader tests for Type().
internal/graders/behavior_grader.go Updates behavior grader interface to Type().
internal/graders/action_sequence_grader_test.go Updates action-sequence grader tests for Type().
internal/graders/action_sequence_grader.go Updates action-sequence grader interface to Type().
internal/execution/copilot_test.go Renames local test table variable from testCases to taskSpecs.
internal/config/config_test.go Updates config tests to pass *models.EvalSpec.
internal/config/config.go Updates BenchmarkConfig to hold *models.EvalSpec and adjusts getter types.
internal/cache/cache_test.go Updates cache tests to use EvalSpec/TaskSpec/TaskInputs.
internal/cache/cache.go Updates cache key inputs and fixture enumeration to use TaskSpec.Inputs.Resources.
cmd/waza/newtask/converters_test.go Updates converter test to expect TaskSpec and inline Graders using Type.
cmd/waza/newtask/converters.go Renames converter API to CreateTaskSpecFromCopilotLog and updates produced model fields.
cmd/waza/cmd_run_suggest_test.go Updates suggest tests to construct EvalSpec/EvalConfig.
cmd/waza/cmd_run_suggest.go Updates suggest pipeline to accept *models.EvalSpec and load TaskSpecs.
cmd/waza/cmd_run.go Updates run command to load EvalSpec and pass it through execution.
cmd/waza/cmd_new_task_test.go Updates new-task e2e test to load/compare TaskSpec with Graders.
cmd/waza/cmd_new_task.go Updates new-task generation pipeline to use CreateTaskSpecFromCopilotLog.
cmd/waza/cmd_grade.go Updates grade command to load EvalSpec and grade TaskSpec runs.
README.md Updates internal/models documentation line to refer to EvalSpec/TaskSpec.
AGENTS.md Updates architecture notes and naming table to reflect new model names.
Comments suppressed due to low confidence (3)

internal/models/spec_test.go:83

  • Several test function names look mangled by the mechanical rename (e.g., TestBenchmarkEvaltsDeserialization). Consider renaming these to clear TestEvalSpec_...-style names so test intent is obvious and consistent.

internal/models/taskspec.go:17

  • TaskSpec.Inputs still has the JSON tag json:"stimulus". Since task.get returns TaskSpec as JSON (see JSON-RPC handler), this will emit stimulus instead of inputs and is inconsistent with the YAML/schema. Update the JSON tag to json:"inputs" (and consider omitempty if appropriate).

internal/orchestration/runner.go:631

  • The comment and error message here still refer to "test cases" even though this function loads TaskSpecs. Update wording (and the error message) to "tasks"/"task specs" to match the new naming.


Comment thread internal/models/spec_test.go Outdated
Comment thread internal/orchestration/runner.go Outdated
Comment thread cmd/waza/cmd_new_task.go Outdated
Copilot AI review requested due to automatic review settings March 13, 2026 02:50

Copilot AI (Contributor) left a comment

Pull request overview

This PR performs a broad mechanical rename across the Go codebase to align terminology around “evals”, “tasks”, and “graders”, including updating loaders, runners, JSON-RPC handlers, and tests.

Changes:

  • Rename core models/types: BenchmarkSpec → EvalSpec, Config → EvalConfig, TestCase → TaskSpec, TestStimulus → TaskInputs, ValidatorInline → Grader, Grader.Kind → Grader.Type.
  • Update orchestration, graders, cache, transcript, JSON-RPC handlers, and CLI paths to use the renamed types/APIs.
  • Rename and adjust tests/docs to match the new naming.

Reviewed changes

Copilot reviewed 62 out of 62 changed files in this pull request and generated 4 comments.

File Description
internal/trigger/runner_test.go Updates runner tests to construct EvalSpec.
internal/transcript/transcript_test.go Updates transcript tests to use TaskSpec/TaskInputs.
internal/transcript/transcript.go Updates transcript builder signature/field access for TaskSpec.
internal/suggest/suggest_test.go Updates comment/expectations to refer to EvalSpec.
internal/suggest/suggest.go Updates YAML validation to unmarshal into EvalSpec.
internal/suggest/prompt.go Updates generated prompt text to reference EvalSpec.
internal/orchestration/runner_test.go Updates orchestration runner tests for EvalSpec/TaskSpec.
internal/orchestration/runner_orchestration_test.go Updates orchestration tests for EvalSpec/task graders naming.
internal/orchestration/runner.go Renames task loading/filtering/execution plumbing to TaskSpec.
internal/orchestration/filter_test.go Renames filter tests + helpers to FilterTaskSpecs.
internal/orchestration/filter.go Renames filter API to operate on TaskSpec.
internal/orchestration/csv_integration_test.go Renames CSV task generation tests to TaskSpec.
internal/orchestration/baseline_test.go Updates baseline tests to use EvalSpec/EvalConfig.
internal/models/taskspec_test.go Renames loader tests to LoadTaskSpec.
internal/models/taskspec.go Introduces TaskSpec/TaskInputs/Grader rename and loader rename.
internal/models/spec_test.go Renames spec loader tests to LoadEvalSpec and task loader tests to LoadTaskSpec.
internal/models/spec.go Renames spec model/loader to EvalSpec/LoadEvalSpec and config type to EvalConfig.
internal/models/outcome.go Renames GraderKind type to GraderType (constants retained).
internal/models/grader_params_test.go Updates polymorphic grading parameter tests for EvalSpec/TaskSpec.
internal/models/grader_params.go Updates parameter decoding to accept GraderType.
internal/models/baseline_test.go Updates baseline serialization tests to EvalSpec.
internal/jsonrpc/handlers.go Updates eval/task JSON-RPC handlers to load EvalSpec/TaskSpec.
internal/graders/trigger_grader_test.go Updates trigger grader tests to Type() + TaskSpec context.
internal/graders/trigger_grader.go Renames grader method to Type() and switches to TaskSpec in context.
internal/graders/tool_constraint_grader_test.go Updates tests to assert Type() instead of Kind().
internal/graders/tool_constraint_grader.go Renames grader method to Type().
internal/graders/text_grader_test.go Updates tests to assert Type().
internal/graders/text_grader.go Renames grader method to Type().
internal/graders/skill_invocation_grader_test.go Updates tests to assert Type().
internal/graders/skill_invocation_grader.go Renames grader method to Type().
internal/graders/run.go Updates runner to accept TaskSpec and iterate TaskSpec.Graders.
internal/graders/prompt_grader_test.go Updates spec loader to LoadEvalSpec.
internal/graders/prompt_grader.go Renames grader method to Type() and updates result construction.
internal/graders/program_grader_test.go Updates tests to assert Type().
internal/graders/program_grader.go Renames grader method to Type().
internal/graders/json_schema_grader_test.go Updates tests to assert Type().
internal/graders/json_schema_grader.go Renames grader method to Type().
internal/graders/inline_script_grader_test.go Updates tests to assert Type().
internal/graders/inline_script_grader.go Renames grader method to Type().
internal/graders/grader.go Renames interface method to Type() and context field to TaskSpec.
internal/graders/file_grader_test.go Updates tests to assert Type().
internal/graders/file_grader.go Renames grader method to Type().
internal/graders/diff_grader.go Renames grader method to Type().
internal/graders/behavior_grader_test.go Updates tests to assert Type().
internal/graders/behavior_grader.go Renames grader method to Type().
internal/graders/action_sequence_grader_test.go Updates tests to assert Type().
internal/graders/action_sequence_grader.go Renames grader method to Type().
internal/execution/copilot_test.go Renames local vars in tests from test-case terminology to task terminology.
internal/config/config_test.go Updates config tests to build configs with EvalSpec.
internal/config/config.go Updates BenchmarkConfig to store an *EvalSpec.
internal/cache/cache_test.go Updates cache tests to use EvalSpec/TaskSpec.
internal/cache/cache.go Updates cache key computation to accept EvalSpec/TaskSpec.
cmd/waza/newtask/converters_test.go Renames converter tests to CreateTaskSpecFromCopilotLog.
cmd/waza/newtask/converters.go Renames converter API to produce TaskSpec with task-level graders.
cmd/waza/cmd_run_suggest_test.go Updates suggest tests to use EvalSpec/EvalConfig.
cmd/waza/cmd_run_suggest.go Updates suggest plumbing to accept *EvalSpec and load TaskSpecs.
cmd/waza/cmd_run.go Updates eval runner to load EvalSpec and pass it through.
cmd/waza/cmd_new_task_test.go Updates end-to-end new-task tests to load TaskSpec.
cmd/waza/cmd_new_task.go Updates command implementation to use the renamed newtask converter API.
cmd/waza/cmd_grade.go Updates grading command to load EvalSpec and grade TaskSpec runs.
README.md Updates repository structure docs to reflect new type names.
AGENTS.md Updates architecture docs to reflect new filenames/type names.
Comments suppressed due to low confidence (3)

internal/models/taskspec.go:16

  • TaskSpec.Inputs is now the canonical field name, but the JSON tag is still json:"stimulus". Since task.get returns *models.TaskSpec directly, this makes the JSON-RPC payload inconsistent with the rename (and likely with the YAML/schema key inputs). Consider updating the JSON tag to inputs (or, if backward compatibility is required, return a separate DTO from the handler to keep the old field name).

This issue also appears on line 62 of the same file.

internal/models/taskspec.go:66

  • Grader.Identifier is still the Go field name used throughout the codebase, but its JSON tag was changed to json:"name". Because task.get returns the TaskSpec directly, this is an API breaking change and also inconsistent with models.GraderConfig.Identifier (which serializes as identifier) and models.GraderResults.Name (which serializes as identifier). Consider keeping the JSON tag as identifier (or rename the field to Name everywhere) to avoid surprising RPC consumers.

cmd/waza/cmd_run_suggest.go:518

  • This error message still says "failed to load test case" even though loadTaskSpecsFromFiles loads TaskSpecs. Renaming it to "failed to load task spec" (or "task") would keep terminology consistent with the rest of the renamed code.
	for _, path := range testFiles {
		tc, err := models.LoadTaskSpec(path)
		if err != nil {
			return nil, fmt.Errorf("failed to load test case %s: %w", path, err)
		}
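
The taskspec.go:16 comment above mentions returning a separate DTO from the handler if backward compatibility is required. A minimal sketch of that option follows, under stated assumptions: taskSpecDTO and toDTO are hypothetical names, the structs are reduced to one field, and the json:"message" tag is invented for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TaskInputs is reduced to one field; its JSON tag is an assumption.
type TaskInputs struct {
	Message string `json:"message"`
}

// TaskSpec is the internal model, using the new key.
type TaskSpec struct {
	Inputs TaskInputs `json:"inputs"`
}

// taskSpecDTO is a hypothetical wire type that keeps the legacy key
// for existing JSON-RPC consumers.
type taskSpecDTO struct {
	Stimulus TaskInputs `json:"stimulus"`
}

// toDTO copies the internal model into the legacy wire shape.
func toDTO(ts *TaskSpec) taskSpecDTO {
	return taskSpecDTO{Stimulus: ts.Inputs}
}

// encode marshals v and returns the JSON text.
func encode(v any) (string, error) {
	b, err := json.Marshal(v)
	return string(b), err
}

func main() {
	ts := &TaskSpec{Inputs: TaskInputs{Message: "hi"}}
	out, err := encode(toDTO(ts))
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	fmt.Println(out) // {"stimulus":{"message":"hi"}}
}
```

The trade-off is an extra type to keep in sync; in exchange, the internal model can be renamed freely without changing what RPC clients receive.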


Comment on lines +628 to 631
+	tc, err := models.LoadTaskSpec(path)
 	if err != nil {
 		return nil, fmt.Errorf("failed to load test case %s: %w", path, err)
 	}
Comment on lines +481 to 484
+	yml, err := yaml.Marshal(taskSpec)
 	if err != nil {
-		return nil, fmt.Errorf("marshaling test case %s: %w", tc.TestID, err)
+		return nil, fmt.Errorf("marshaling test case %s: %w", taskSpec.TestID, err)
 	}
Comment thread internal/cache/cache.go
Comment on lines 28 to +34
 // CacheKey generates a unique cache key for a test case run
 // The key is based on:
 // - spec content (name, config, graders)
 // - task content (test case definition)
 // - model ID
 // - fixture file hashes
-func CacheKey(spec *models.BenchmarkSpec, task *models.TestCase, fixtureDir string) (string, error) {
+func CacheKey(spec *models.EvalSpec, task *models.TaskSpec, fixtureDir string) (string, error) {
Comment on lines 14 to +45
 // If taskPatterns and tagPatterns are specified the result is the intersection of the matches between them.
 // If both taskPatterns and tagPatterns are empty, all test cases are returned.
-func FilterTestCases(testCases []*models.TestCase, taskPatterns []string, tagPatterns []string) ([]*models.TestCase, error) {
+func FilterTaskSpecs(taskSpecs []*models.TaskSpec, taskPatterns []string, tagPatterns []string) ([]*models.TaskSpec, error) {
 	if len(taskPatterns) == 0 && len(tagPatterns) == 0 {
-		return testCases, nil
+		return taskSpecs, nil
 	}

-	var matched []*models.TestCase
+	var matched []*models.TaskSpec

-	for _, tc := range testCases {
-		taskNameMatch, err := matchesTaskOrDisplayName(tc, taskPatterns)
+	for _, taskSpec := range taskSpecs {
+		taskNameMatch, err := matchesTaskOrDisplayName(taskSpec, taskPatterns)

 		if err != nil {
 			return nil, err
 		}

-		tagNameMatch, err := matchesTags(tc, tagPatterns)
+		tagNameMatch, err := matchesTags(taskSpec, tagPatterns)

 		if err != nil {
 			return nil, err
 		}

 		if taskNameMatch && tagNameMatch {
-			matched = append(matched, tc)
+			matched = append(matched, taskSpec)
 		}
 	}

 	return matched, nil
 }

 // matchesTaskOrDisplayName reports whether a test case's DisplayName or TestID matches any pattern.
-func matchesTaskOrDisplayName(tc *models.TestCase, patterns []string) (bool, error) {
+func matchesTaskOrDisplayName(tc *models.TaskSpec, patterns []string) (bool, error) {
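
The doc comment in the hunk above says the result is the intersection of the task-pattern and tag-pattern matches. A self-contained sketch of that behavior follows. It is not the repository's implementation: the helper matchesAny, the Tags field, the use of path.Match globs, and the rule that an empty pattern list matches everything are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"path"
)

// TaskSpec is a hypothetical reduction holding only what filtering needs.
type TaskSpec struct {
	TestID      string
	DisplayName string
	Tags        []string // assumed field
}

// matchesAny reports whether any glob pattern matches any candidate.
// Assumption: an empty pattern list places no constraint and matches everything.
func matchesAny(candidates, patterns []string) (bool, error) {
	if len(patterns) == 0 {
		return true, nil
	}
	for _, p := range patterns {
		for _, c := range candidates {
			ok, err := path.Match(p, c)
			if err != nil {
				return false, err
			}
			if ok {
				return true, nil
			}
		}
	}
	return false, nil
}

// FilterTaskSpecs keeps specs that satisfy both the task and the tag patterns.
func FilterTaskSpecs(specs []*TaskSpec, taskPatterns, tagPatterns []string) ([]*TaskSpec, error) {
	var matched []*TaskSpec
	for _, ts := range specs {
		nameOK, err := matchesAny([]string{ts.TestID, ts.DisplayName}, taskPatterns)
		if err != nil {
			return nil, err
		}
		tagOK, err := matchesAny(ts.Tags, tagPatterns)
		if err != nil {
			return nil, err
		}
		if nameOK && tagOK {
			matched = append(matched, ts)
		}
	}
	return matched, nil
}

func main() {
	specs := []*TaskSpec{
		{TestID: "auth-login", Tags: []string{"smoke"}},
		{TestID: "auth-logout", Tags: []string{"slow"}},
		{TestID: "billing", Tags: []string{"smoke"}},
	}
	got, err := FilterTaskSpecs(specs, []string{"auth-*"}, []string{"smoke"})
	if err != nil {
		fmt.Println("filter error:", err)
		return
	}
	for _, ts := range got {
		fmt.Println(ts.TestID) // prints "auth-login"
	}
}
```

Only auth-login survives because it is the one spec that matches the task pattern and carries the requested tag.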
@richardpark-msft (Member, Author)

Going to take a run at this with copilot and just not bother trying to rebase/merge :)

auto-merge was automatically disabled March 17, 2026 16:45

Pull request was closed

@richardpark-msft richardpark-msft deleted the wz-bug-cant-find-skills branch March 17, 2026 16:45

3 participants