Adding in `wz new task from-prompt` by richardpark-msft · Pull Request #110 · microsoft/waza

richardpark-msft · 2026-03-11T02:26:48Z

Adding in a new command to let users fire off a prompt, and use the transcript to generate a task.yaml, with graders and prompt filled in.

Copilot

Pull request overview

This PR adds a new CLI workflow to generate task.yaml files by recording a Copilot prompt run (waza new task record), and refactors waza new into subcommands (notably waza new skill). It also updates docs/examples to reflect the new CLI shape and improves grader-type error reporting.

Changes:

Introduce waza new task record to execute a prompt via Copilot, read the session log, and emit a generated task YAML with inferred validators.
Restructure waza new to use waza new skill (and update docs/tests accordingly).
Add a helper to list valid grader kinds and improve unsupported-grader error messages.

Reviewed changes

Copilot reviewed 22 out of 23 changed files in this pull request and generated 15 comments.

Show a summary per file

File	Description
site/src/content/docs/reference/cli.mdx	Updates `waza new` docs and adds reference docs for `waza new task record`.
site/src/content/docs/index.mdx	Updates quickstart snippet to `waza new skill`.
site/src/content/docs/getting-started.mdx	Updates getting started command example to `waza new skill`.
site/src/content/docs/about.mdx	Updates feature list to `waza new skill`.
internal/utils/copilot_log_iterator.go	Adds an iterator for Copilot `events.jsonl` logs.
internal/scaffold/scaffold.go	Updates package comment to reflect `waza new skill`.
internal/models/outcome.go	Adds `AllGraderKinds()` helper.
internal/graders/grader.go	Improves error message for unsupported grader configurations.
cmd/waza/newtask/converters.go	Adds log→task conversion logic to infer validators from a recorded session.
cmd/waza/cmd_new_task.go	Implements `waza new task record` command.
cmd/waza/cmd_new.go	Refactors `waza new` into subcommands (skill/task).
cmd/waza/cmd_new_test.go	Updates tests to call the new skill subcommand constructor and adjusts alias expectations.
cmd/waza/cmd_init.go	Updates messaging/docs strings to `waza new skill`.
cmd/waza/cmd_init_test.go	Updates init tests’ expected strings / command constructor usage.
cmd/waza/cmd_check_test.go	Updates comment to reflect `waza new skill`.
README.md	Updates CLI snippets and documents the new task-record command.
docs/TUTORIAL.md	Updates `waza new` usage in tutorial.
docs/SKILLS_CI_INTEGRATION.md	Updates CI guide to `waza new skill`.
docs/SKILL-BEST-PRACTICES.md	Updates best practices doc to `waza new skill`.
docs/GUIDE.md	Updates guide snippets and adds docs for the new task-record command.
docs/GETTING-STARTED.md	Updates getting started doc to `waza new skill`.
go.mod	Promotes `github.com/braydonk/yaml` to a direct dependency (used by new task command).
go.sum	Dependency checksum updates aligned with `go.mod` changes.

Comments suppressed due to low confidence (1)

cmd/waza/newtask/converters.go:114

tools[*e.Data.ToolCallID] can be nil if the log is missing a corresponding ToolExecutionStart (or the ID is missing), and the subsequent t.End = ... / t.Success = ... will panic. Add a presence check (and nil-check ToolCallID/Success) before updating the tool entry, or reuse the existing tool-call correlation logic in internal/models.FilterToolCalls.

		case copilot.ToolExecutionComplete:
			t := tools[*e.Data.ToolCallID]
			t.End = e.Timestamp
			t.Success = *e.Data.Success

codecov-commenter · 2026-03-11T02:41:06Z

Codecov Report

❌ Patch coverage is 70.81967% with 89 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@d1b8c25). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
cmd/waza/cmd_new_task.go	78.41%	22 Missing and 8 partials ⚠️
internal/utils/copilot_log_iterator.go	0.00%	23 Missing ⚠️
cmd/waza/newtask/converters.go	84.90%	9 Missing and 7 partials ⚠️
internal/models/outcome.go	0.00%	16 Missing ⚠️
internal/execution/copilot_client_wrappers.go	33.33%	2 Missing ⚠️
cmd/waza/cmd_init.go	66.66%	1 Missing ⚠️
internal/graders/grader.go	0.00%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #110   +/-   ##
=======================================
  Coverage        ?   73.32%           
=======================================
  Files           ?      135           
  Lines           ?    15521           
  Branches        ?        0           
=======================================
  Hits            ?    11381           
  Misses          ?     3308           
  Partials        ?      832

Flag	Coverage Δ
go-implementation	`73.32% <70.81%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copilot

Pull request overview

Copilot reviewed 22 out of 23 changed files in this pull request and generated 10 comments.

Comments suppressed due to low confidence (5)

cmd/waza/newtask/converters.go:96

ToolExecutionStart dereferences e.Data.ToolCallID without checking for nil. Other event-processing code treats these as optional pointers; add a nil guard before using it (and skip the event if missing).

		case copilot.ToolExecutionStart:
			toolsInOrder = append(toolsInOrder, *e.Data.ToolCallID)

cmd/waza/newtask/converters.go:128

This won’t compile: new("<no skill name>") is invalid Go (new expects a type). If you need a fallback value, assign through a local string variable (e.g., s := ""; name = &s) or handle nil without allocating.

			if name == nil {
				name = new("<no skill name>")
			}

cmd/waza/newtask/converters.go:114

ToolExecutionComplete can panic: it dereferences ToolCallID/Success and assumes the corresponding start event was seen (tools[*id]). Add nil checks and handle missing entries (skip or return a descriptive error).

		case copilot.ToolExecutionComplete:
			t := tools[*e.Data.ToolCallID]
			t.End = e.Timestamp
			t.Success = *e.Data.Success

cmd/waza/newtask/converters.go:107

This block assumes ToolName (and the decoded args pointer) are non-nil; if the log omits them, *e.Data.ToolName / *ta will panic. Add nil checks and either skip or surface a descriptive error for incomplete events.

			tools[*e.Data.ToolCallID] = &tool{
				Start:     e.Timestamp,
				Name:      *e.Data.ToolName,
				Arguments: *ta,
			}

site/src/content/docs/reference/cli.mdx:182

The heredoc example implies waza new skill reads metadata from stdin in non-interactive mode, but the implementation currently ignores stdin and uses defaults when not in a TTY. Update/remove this example to match actual behavior.

waza new skill code-analyzer << EOF
Code Analyzer
Analyzes code for patterns and issues
code, analysis
EOF

Copilot

Pull request overview

Copilot reviewed 33 out of 34 changed files in this pull request and generated 4 comments.

You can also share your feedback on Copilot code review. Take the survey.

…rb now)

…up 'eval', like the rest of them.

…ommands so it's clear why two commands with the same syntax produce different results

…ere completely untested.

…ead of what it is now (implicit, and always on)

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 33 out of 34 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

cmd/waza/newtask/converters.go:156

Same compilation issue as above: new("<no path>") is invalid Go. Replace with a string variable and take its address (or keep path nil and skip adding an entry when required fields are missing).

			path := e.Data.Path
			if path == nil {
				path = new("<no path>")
			}

You can also share your feedback on Copilot code review. Take the survey.

wbreza

Code Review: PR #110 — Adding in `wz new task record`

✅ What Looks Good

Clean interface export — copilotClient → CopilotClient properly exports interfaces for external mock generation
Good use of Go iterators — NewCopilotLogIterator uses iter.Seq2 idiomatically with proper error handling
Improved grader error messages — AllGraderKinds() provides helpful valid-types list on invalid grader config
Comprehensive mock infrastructure — go:generate setup with proper DI via options struct
Solid E2E test — cmd_new_task_test.go covers the full happy path with mocked dependencies

🟠 High (4 findings)

1. Consider from-prompt as default behavior or flag instead of subcommand — Currently waza new task from-prompt <prompt> <path> uses from-prompt as a subcommand. If there's only one way to create tasks right now, consider making it the default behavior of waza new task (with --from-prompt as a flag if more sources are added later). Running waza new task with no subcommand currently shows help — would a default action be more ergonomic?

2. Brittle default text grader — entire response as ContainsCS — converters.go:215-223 — The generated task YAML uses the full concatenated assistant response as a case-sensitive contains check. LLM responses are non-deterministic, so this grader will fail on re-runs. Consider extracting key phrases instead, or adding a YAML comment noting this should be customized.

3. converters.go — 1 test for 218 lines of code — CreateTestCaseFromCopilotLog handles 6 event types (UserMessage, ToolExecutionStart/Complete, AssistantMessage/Delta, SkillInvoked) but has only 1 happy-path test. Missing coverage for: nil ToolCallID, mapstructure decode failures, tool-not-found in map, report_intent filter, nil skill name/path defaults.

4. Hardcoded default model claude-sonnet-4.5 — cmd_new_task.go:291 — Model name hardcoded as flag default. Consider reading from config/env (WAZA_DEFAULT_MODEL) or deferring to the Copilot SDK's own default.

🟡 Medium (4 findings)

5. --testname should be --test-name — All other multi-word flags use kebab-case (--disable-repo-skills, --judge-model). This one doesn't.

6. context.Background() in shutdown defer — cmd_new_task.go:165 — Creates a new background context instead of inheriting cmd.Context(). If user presses Ctrl+C, shutdown still waits up to 1 minute.

7. RunE closure is ~160 lines — cmd_new_task.go:129-288 — Single function handles 6 different concerns. Consider extracting task definitions into helper functions for readability and testability.

8. new("<no skill name>") shadows Go builtin — converters.go:158,163 — Confusing for readers who expect new(T) to allocate a zero-value pointer. Use explicit s := "<no skill name>"; name = &s instead.

🟢 Low (2 findings)

9. Missing error context in skill discovery — cmd_new_task.go:182-186 — Wrap error with: fmt.Errorf("failed to discover skills in %q: %%w", rootDir, err)

10. Task list wrapper swallows AddTask error — cmd_new_task.go:324 — _ = tlw.inner.AddTask(options) silently ignores errors.

📌 Documentation Pass Recommended

After this PR lands, recommend a focused round of doc updates to ensure all docs match the new waza new command structure (waza new skill, waza new task, waza new eval). Some references to the old waza new [skill-name] and waza generate patterns may still exist across README, guides, and site content.

Summary

Priority	Count
Critical	0
High	4
Medium	4
Low	2

Overall Assessment: Comment — solid new feature with good DI/mock patterns. Key items to address: the brittle text grader default, test coverage for converters, and flag naming consistency.

…ally.

…into wz-record

…perly.

Copilot

Pull request overview

Copilot reviewed 33 out of 34 changed files in this pull request and generated 2 comments.

You can also share your feedback on Copilot code review. Take the survey.

Copilot

Pull request overview

Copilot reviewed 33 out of 34 changed files in this pull request and generated 5 comments.

You can also share your feedback on Copilot code review. Take the survey.

Related to microsoft#105 Proactive test cases for the action_sequence grader. Written from the spec while Linus implements. Covers: exact_match, in_order_match, any_order_match modes, score calculations, error cases, and factory integration. **Test breakdown (31 test cases):** - Constructor validation: 6 tests (missing actions, empty actions, invalid mode, valid modes) - exact_match mode: 6 tests (pass, extra step, missing step, wrong order, empty actual, single tool) - in_order_match mode: 5 tests (exact seq, extras interspersed, wrong order, missing step, subset order) - any_order_match mode: 5 tests (reordered, extras, missing, duplicate expected, insufficient duplicates) - Score calculations: 4 tests (perfect, partial, zero, details verification + f1 check) - Edge cases: 3 tests (nil session, nil tools_used, duration recording) - Factory integration: 2 tests (Create success, Create invalid params error) **Note:** These tests depend on the implementation in the companion PR. Merge the implementation PR first.

Copilot AI review requested due to automatic review settings March 11, 2026 02:26

richardpark-msft force-pushed the wz-record branch from 154b5f0 to 5ffa03a Compare March 11, 2026 02:27

Copilot started reviewing on behalf of richardpark-msft March 11, 2026 02:28 View session

Copilot AI reviewed Mar 11, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings March 11, 2026 02:44

Copilot started reviewing on behalf of richardpark-msft March 11, 2026 02:44 View session

Copilot AI reviewed Mar 11, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings March 11, 2026 20:55

Copilot started reviewing on behalf of richardpark-msft March 11, 2026 20:55 View session

Copilot AI reviewed Mar 11, 2026

View reviewed changes

Comment thread cmd/waza/cmd_new_task.go

Comment thread cmd/waza/cmd_new_task.go

Comment thread cmd/waza/newtask/converters.go

Comment thread cmd/waza/newtask/converters.go

Richard Park added 18 commits March 11, 2026 21:19

Adding in a "new task record" command.

d3f342a

Updating any spots that used 'new' before (skill is under it's own ve…

80ee69b

…rb now)

the new command had another command as a subcommand. Needed to hoist …

f2da836

…up 'eval', like the rest of them.

Remove generater alias for now.

8549269

For simplicity, use the same YAML library.

0d0bdb1

Remove references to the alias.

516308d

fix spelling mistake.

9084e98

Use command line param

2b2ba14

Create the directory, just in case

b9cc549

Pass in the perm bits.

8e72cc1

Adding in some basic tests.

30e29d6

Removing the interactive bit, also updating descriptions of the two c…

d4510a3

…ommands so it's clear why two commands with the same syntax produce different results

Fixing a few review issues, and adding in some tests for areas that w…

c009dd2

…ere completely untested.

Add some more nil pointer checks.

76548fa

Cleaning up more spots where 'interactive' was an explicit flag, inst…

de08d60

…ead of what it is now (implicit, and always on)

missed another pointer deref

df8cc25

Use cobra.Command's output stream for consistency.

db2b5b2

more PR feedback.

d6dbad8

Copilot AI review requested due to automatic review settings March 11, 2026 21:19

Update docs/SKILL-BEST-PRACTICES.md

6c00094

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot started reviewing on behalf of richardpark-msft March 11, 2026 23:15 View session

richardpark-msft closed this Mar 11, 2026

Copilot AI reviewed Mar 11, 2026

View reviewed changes

Comment thread cmd/waza/newtask/converters.go

richardpark-msft reopened this Mar 11, 2026

richardpark-msft marked this pull request as draft March 11, 2026 23:20

github-actions Bot enabled auto-merge (squash) March 11, 2026 23:20

wbreza reviewed Mar 11, 2026

View reviewed changes

Comment thread cmd/waza/cmd_new_task.go Outdated

Comment thread cmd/waza/newtask/converters.go

Comment thread cmd/waza/newtask/converters_test.go

Richard Park added 2 commits March 12, 2026 01:37

Use a workspace and detect context instead of discovering skills manu…

ced9189

…ally.

Merge branch 'wz-record' of https://github.com/richardpark-msft/waza …

5430f58

…into wz-record

richardpark-msft mentioned this pull request Mar 12, 2026

Add in some tests for edge cases when parsing copilot logs #115

Closed

Removing disable repo skills - no need for it, we discover things pro…

04921a6

…perly.

Copilot AI review requested due to automatic review settings March 12, 2026 01:50

Copilot started reviewing on behalf of richardpark-msft March 12, 2026 01:51 View session

Copilot AI reviewed Mar 12, 2026

View reviewed changes

Comment thread cmd/waza/cmd_new_task.go

Comment thread cmd/waza/cmd_new_task.go

Oops, gotta remove that last test... :)

9261ba1

richardpark-msft marked this pull request as ready for review March 12, 2026 02:01

Copilot AI review requested due to automatic review settings March 12, 2026 02:01

Copilot started reviewing on behalf of richardpark-msft March 12, 2026 02:02 View session

Copilot AI reviewed Mar 12, 2026

View reviewed changes

Comment thread README.md Outdated

Comment thread docs/GUIDE.md Outdated

Comment thread cmd/waza/cmd_new_task.go

Comment thread internal/utils/copilot_log_iterator.go

Comment thread site/src/content/docs/reference/cli.mdx

richardpark-msft changed the title ~~Adding in wz new task record~~ Adding in wz new task from-prompt Mar 12, 2026

Docs still mentioned --disable-repo-skills

f3d8bfc

richardpark-msft disabled auto-merge March 12, 2026 02:12

richardpark-msft enabled auto-merge (squash) March 12, 2026 02:13

spboyer approved these changes Mar 12, 2026

View reviewed changes

richardpark-msft merged commit 3f73d31 into microsoft:main Mar 12, 2026
6 checks passed

richardpark-msft deleted the wz-record branch March 12, 2026 17:04

spboyer mentioned this pull request Mar 12, 2026

Release v0.21.0 #122

Merged

4 tasks

spboyer mentioned this pull request May 23, 2026

Generate a task.yaml from a copilot session log #7

Closed

Uh oh!

Conversation

richardpark-msft commented Mar 11, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov-commenter commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

wbreza left a comment

Choose a reason for hiding this comment

Code Review: PR #110 — Adding in wz new task record

✅ What Looks Good

🟠 High (4 findings)

🟡 Medium (4 findings)

🟢 Low (2 findings)

📌 Documentation Pass Recommended

Summary

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

codecov-commenter commented Mar 11, 2026 •

edited

Loading

Code Review: PR #110 — Adding in `wz new task record`