Skip to content

Adding in wz new task from-prompt#110

Merged
richardpark-msft merged 33 commits into
microsoft:mainfrom
richardpark-msft:wz-record
Mar 12, 2026
Merged

Adding in wz new task from-prompt#110
richardpark-msft merged 33 commits into
microsoft:mainfrom
richardpark-msft:wz-record

Conversation

@richardpark-msft

Copy link
Copy Markdown
Member

Adding in a new command to let users fire off a prompt, and use the transcript to generate a task.yaml, with graders and prompt filled in.

Copilot AI review requested due to automatic review settings March 11, 2026 02:26

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a new CLI workflow to generate task.yaml files by recording a Copilot prompt run (waza new task record), and refactors waza new into subcommands (notably waza new skill). It also updates docs/examples to reflect the new CLI shape and improves grader-type error reporting.

Changes:

  • Introduce waza new task record to execute a prompt via Copilot, read the session log, and emit a generated task YAML with inferred validators.
  • Restructure waza new to use waza new skill (and update docs/tests accordingly).
  • Add a helper to list valid grader kinds and improve unsupported-grader error messages.

Reviewed changes

Copilot reviewed 22 out of 23 changed files in this pull request and generated 15 comments.

Show a summary per file
File Description
site/src/content/docs/reference/cli.mdx Updates waza new docs and adds reference docs for waza new task record.
site/src/content/docs/index.mdx Updates quickstart snippet to waza new skill.
site/src/content/docs/getting-started.mdx Updates getting started command example to waza new skill.
site/src/content/docs/about.mdx Updates feature list to waza new skill.
internal/utils/copilot_log_iterator.go Adds an iterator for Copilot events.jsonl logs.
internal/scaffold/scaffold.go Updates package comment to reflect waza new skill.
internal/models/outcome.go Adds AllGraderKinds() helper.
internal/graders/grader.go Improves error message for unsupported grader configurations.
cmd/waza/newtask/converters.go Adds log→task conversion logic to infer validators from a recorded session.
cmd/waza/cmd_new_task.go Implements waza new task record command.
cmd/waza/cmd_new.go Refactors waza new into subcommands (skill/task).
cmd/waza/cmd_new_test.go Updates tests to call the new skill subcommand constructor and adjusts alias expectations.
cmd/waza/cmd_init.go Updates messaging/docs strings to waza new skill.
cmd/waza/cmd_init_test.go Updates init tests’ expected strings / command constructor usage.
cmd/waza/cmd_check_test.go Updates comment to reflect waza new skill.
README.md Updates CLI snippets and documents the new task-record command.
docs/TUTORIAL.md Updates waza new usage in tutorial.
docs/SKILLS_CI_INTEGRATION.md Updates CI guide to waza new skill.
docs/SKILL-BEST-PRACTICES.md Updates best practices doc to waza new skill.
docs/GUIDE.md Updates guide snippets and adds docs for the new task-record command.
docs/GETTING-STARTED.md Updates getting started doc to waza new skill.
go.mod Promotes github.com/braydonk/yaml to a direct dependency (used by new task command).
go.sum Dependency checksum updates aligned with go.mod changes.
Comments suppressed due to low confidence (1)

cmd/waza/newtask/converters.go:114

  • tools[*e.Data.ToolCallID] can be nil if the log is missing a corresponding ToolExecutionStart (or the ID is missing), and the subsequent t.End = ... / t.Success = ... will panic. Add a presence check (and nil-check ToolCallID/Success) before updating the tool entry, or reuse the existing tool-call correlation logic in internal/models.FilterToolCalls.
		case copilot.ToolExecutionComplete:
			t := tools[*e.Data.ToolCallID]
			t.End = e.Timestamp
			t.Success = *e.Data.Success

Comment thread cmd/waza/newtask/converters.go
Comment thread cmd/waza/newtask/converters.go
Comment thread cmd/waza/cmd_new_task.go Outdated
Comment thread cmd/waza/cmd_new_task.go
Comment thread docs/GETTING-STARTED.md Outdated
Comment thread cmd/waza/cmd_new_task.go Outdated
Comment thread README.md Outdated
Comment thread site/src/content/docs/reference/cli.mdx
Comment thread cmd/waza/cmd_new.go Outdated
Comment thread docs/TUTORIAL.md Outdated
@codecov-commenter

codecov-commenter commented Mar 11, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 70.81967% with 89 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@d1b8c25). Learn more about missing BASE report.

Files with missing lines Patch % Lines
cmd/waza/cmd_new_task.go 78.41% 22 Missing and 8 partials ⚠️
internal/utils/copilot_log_iterator.go 0.00% 23 Missing ⚠️
cmd/waza/newtask/converters.go 84.90% 9 Missing and 7 partials ⚠️
internal/models/outcome.go 0.00% 16 Missing ⚠️
internal/execution/copilot_client_wrappers.go 33.33% 2 Missing ⚠️
cmd/waza/cmd_init.go 66.66% 1 Missing ⚠️
internal/graders/grader.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #110   +/-   ##
=======================================
  Coverage        ?   73.32%           
=======================================
  Files           ?      135           
  Lines           ?    15521           
  Branches        ?        0           
=======================================
  Hits            ?    11381           
  Misses          ?     3308           
  Partials        ?      832           
Flag Coverage Δ
go-implementation 73.32% <70.81%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copilot AI review requested due to automatic review settings March 11, 2026 02:44

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 22 out of 23 changed files in this pull request and generated 10 comments.

Comments suppressed due to low confidence (5)

cmd/waza/newtask/converters.go:96

  • ToolExecutionStart dereferences e.Data.ToolCallID without checking for nil. Other event-processing code treats these as optional pointers; add a nil guard before using it (and skip the event if missing).
		case copilot.ToolExecutionStart:
			toolsInOrder = append(toolsInOrder, *e.Data.ToolCallID)

cmd/waza/newtask/converters.go:128

  • This won’t compile: new("<no skill name>") is invalid Go (new expects a type). If you need a fallback value, assign through a local string variable (e.g., s := ""; name = &s) or handle nil without allocating.
			if name == nil {
				name = new("<no skill name>")
			}

cmd/waza/newtask/converters.go:114

  • ToolExecutionComplete can panic: it dereferences ToolCallID/Success and assumes the corresponding start event was seen (tools[*id]). Add nil checks and handle missing entries (skip or return a descriptive error).
		case copilot.ToolExecutionComplete:
			t := tools[*e.Data.ToolCallID]
			t.End = e.Timestamp
			t.Success = *e.Data.Success

cmd/waza/newtask/converters.go:107

  • This block assumes ToolName (and the decoded args pointer) are non-nil; if the log omits them, *e.Data.ToolName / *ta will panic. Add nil checks and either skip or surface a descriptive error for incomplete events.
			tools[*e.Data.ToolCallID] = &tool{
				Start:     e.Timestamp,
				Name:      *e.Data.ToolName,
				Arguments: *ta,
			}

site/src/content/docs/reference/cli.mdx:182

  • The heredoc example implies waza new skill reads metadata from stdin in non-interactive mode, but the implementation currently ignores stdin and uses defaults when not in a TTY. Update/remove this example to match actual behavior.
waza new skill code-analyzer << EOF
Code Analyzer
Analyzes code for patterns and issues
code, analysis
EOF

Comment thread cmd/waza/newtask/converters.go Outdated
Comment thread cmd/waza/cmd_new_task.go Outdated
Comment thread cmd/waza/newtask/converters.go
Comment thread cmd/waza/cmd_new_task.go
Comment thread cmd/waza/newtask/converters.go
Comment thread cmd/waza/cmd_new_task.go Outdated
Comment thread cmd/waza/cmd_new_task.go
Comment thread internal/graders/grader.go Outdated
Comment thread cmd/waza/newtask/converters.go
Comment thread README.md Outdated
Copilot AI review requested due to automatic review settings March 11, 2026 20:55

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 33 out of 34 changed files in this pull request and generated 4 comments.


You can also share your feedback on Copilot code review. Take the survey.

Comment thread cmd/waza/cmd_new_task.go
Comment thread cmd/waza/cmd_new_task.go
Comment thread cmd/waza/newtask/converters.go
Comment thread cmd/waza/newtask/converters.go
Copilot AI review requested due to automatic review settings March 11, 2026 21:19
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 33 out of 34 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

cmd/waza/newtask/converters.go:156

  • Same compilation issue as above: new("<no path>") is invalid Go. Replace with a string variable and take its address (or keep path nil and skip adding an entry when required fields are missing).
			path := e.Data.Path
			if path == nil {
				path = new("<no path>")
			}

You can also share your feedback on Copilot code review. Take the survey.

Comment thread cmd/waza/newtask/converters.go
@richardpark-msft richardpark-msft marked this pull request as draft March 11, 2026 23:20
@github-actions github-actions Bot enabled auto-merge (squash) March 11, 2026 23:20

@wbreza wbreza left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: PR #110 — Adding in wz new task record

✅ What Looks Good

  • Clean interface exportcopilotClientCopilotClient properly exports interfaces for external mock generation
  • Good use of Go iteratorsNewCopilotLogIterator uses iter.Seq2 idiomatically with proper error handling
  • Improved grader error messagesAllGraderKinds() provides helpful valid-types list on invalid grader config
  • Comprehensive mock infrastructurego:generate setup with proper DI via options struct
  • Solid E2E testcmd_new_task_test.go covers the full happy path with mocked dependencies

🟠 High (4 findings)

1. Consider from-prompt as default behavior or flag instead of subcommand — Currently waza new task from-prompt <prompt> <path> uses from-prompt as a subcommand. If there's only one way to create tasks right now, consider making it the default behavior of waza new task (with --from-prompt as a flag if more sources are added later). Running waza new task with no subcommand currently shows help — would a default action be more ergonomic?

2. Brittle default text grader — entire response as ContainsCSconverters.go:215-223 — The generated task YAML uses the full concatenated assistant response as a case-sensitive contains check. LLM responses are non-deterministic, so this grader will fail on re-runs. Consider extracting key phrases instead, or adding a YAML comment noting this should be customized.

3. converters.go — 1 test for 218 lines of codeCreateTestCaseFromCopilotLog handles 6 event types (UserMessage, ToolExecutionStart/Complete, AssistantMessage/Delta, SkillInvoked) but has only 1 happy-path test. Missing coverage for: nil ToolCallID, mapstructure decode failures, tool-not-found in map, report_intent filter, nil skill name/path defaults.

4. Hardcoded default model claude-sonnet-4.5cmd_new_task.go:291 — Model name hardcoded as flag default. Consider reading from config/env (WAZA_DEFAULT_MODEL) or deferring to the Copilot SDK's own default.

🟡 Medium (4 findings)

5. --testname should be --test-name — All other multi-word flags use kebab-case (--disable-repo-skills, --judge-model). This one doesn't.

6. context.Background() in shutdown defercmd_new_task.go:165 — Creates a new background context instead of inheriting cmd.Context(). If user presses Ctrl+C, shutdown still waits up to 1 minute.

7. RunE closure is ~160 linescmd_new_task.go:129-288 — Single function handles 6 different concerns. Consider extracting task definitions into helper functions for readability and testability.

8. new("<no skill name>") shadows Go builtinconverters.go:158,163 — Confusing for readers who expect new(T) to allocate a zero-value pointer. Use explicit s := "<no skill name>"; name = &s instead.

🟢 Low (2 findings)

9. Missing error context in skill discoverycmd_new_task.go:182-186 — Wrap error with: fmt.Errorf("failed to discover skills in %q: %%w", rootDir, err)

10. Task list wrapper swallows AddTask errorcmd_new_task.go:324_ = tlw.inner.AddTask(options) silently ignores errors.

📌 Documentation Pass Recommended

After this PR lands, recommend a focused round of doc updates to ensure all docs match the new waza new command structure (waza new skill, waza new task, waza new eval). Some references to the old waza new [skill-name] and waza generate patterns may still exist across README, guides, and site content.

Summary

Priority Count
Critical 0
High 4
Medium 4
Low 2

Overall Assessment: Comment — solid new feature with good DI/mock patterns. Key items to address: the brittle text grader default, test coverage for converters, and flag naming consistency.

Comment thread cmd/waza/cmd_new_task.go Outdated
Comment thread cmd/waza/newtask/converters.go
Comment thread cmd/waza/newtask/converters_test.go
Copilot AI review requested due to automatic review settings March 12, 2026 01:50

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 33 out of 34 changed files in this pull request and generated 2 comments.


You can also share your feedback on Copilot code review. Take the survey.

Comment thread cmd/waza/cmd_new_task.go
Comment thread cmd/waza/cmd_new_task.go
@richardpark-msft richardpark-msft marked this pull request as ready for review March 12, 2026 02:01
Copilot AI review requested due to automatic review settings March 12, 2026 02:01

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 33 out of 34 changed files in this pull request and generated 5 comments.


You can also share your feedback on Copilot code review. Take the survey.

Comment thread README.md Outdated
Comment thread docs/GUIDE.md Outdated
Comment thread cmd/waza/cmd_new_task.go
Comment thread internal/utils/copilot_log_iterator.go
Comment thread site/src/content/docs/reference/cli.mdx
@richardpark-msft richardpark-msft changed the title Adding in wz new task record Adding in wz new task from-prompt Mar 12, 2026
@richardpark-msft richardpark-msft enabled auto-merge (squash) March 12, 2026 02:13
@richardpark-msft richardpark-msft merged commit 3f73d31 into microsoft:main Mar 12, 2026
6 checks passed
@richardpark-msft richardpark-msft deleted the wz-record branch March 12, 2026 17:04
@spboyer spboyer mentioned this pull request Mar 12, 2026
4 tasks
richardpark-msft pushed a commit to richardpark-msft/waza that referenced this pull request Mar 13, 2026
Related to microsoft#105

Proactive test cases for the action_sequence grader. Written from the
spec while Linus implements.

Covers: exact_match, in_order_match, any_order_match modes, score
calculations, error cases, and factory integration.

**Test breakdown (31 test cases):**
- Constructor validation: 6 tests (missing actions, empty actions,
invalid mode, valid modes)
- exact_match mode: 6 tests (pass, extra step, missing step, wrong
order, empty actual, single tool)
- in_order_match mode: 5 tests (exact seq, extras interspersed, wrong
order, missing step, subset order)
- any_order_match mode: 5 tests (reordered, extras, missing, duplicate
expected, insufficient duplicates)
- Score calculations: 4 tests (perfect, partial, zero, details
verification + f1 check)
- Edge cases: 3 tests (nil session, nil tools_used, duration recording)
- Factory integration: 2 tests (Create success, Create invalid params
error)

**Note:** These tests depend on the implementation in the companion PR.
Merge the implementation PR first.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants