Adding in wz new task from-prompt#110
Conversation
154b5f0 to
5ffa03a
Compare
There was a problem hiding this comment.
Pull request overview
This PR adds a new CLI workflow to generate task.yaml files by recording a Copilot prompt run (waza new task record), and refactors waza new into subcommands (notably waza new skill). It also updates docs/examples to reflect the new CLI shape and improves grader-type error reporting.
Changes:
- Introduce
waza new task recordto execute a prompt via Copilot, read the session log, and emit a generated task YAML with inferred validators. - Restructure
waza newto usewaza new skill(and update docs/tests accordingly). - Add a helper to list valid grader kinds and improve unsupported-grader error messages.
Reviewed changes
Copilot reviewed 22 out of 23 changed files in this pull request and generated 15 comments.
Show a summary per file
| File | Description |
|---|---|
| site/src/content/docs/reference/cli.mdx | Updates waza new docs and adds reference docs for waza new task record. |
| site/src/content/docs/index.mdx | Updates quickstart snippet to waza new skill. |
| site/src/content/docs/getting-started.mdx | Updates getting started command example to waza new skill. |
| site/src/content/docs/about.mdx | Updates feature list to waza new skill. |
| internal/utils/copilot_log_iterator.go | Adds an iterator for Copilot events.jsonl logs. |
| internal/scaffold/scaffold.go | Updates package comment to reflect waza new skill. |
| internal/models/outcome.go | Adds AllGraderKinds() helper. |
| internal/graders/grader.go | Improves error message for unsupported grader configurations. |
| cmd/waza/newtask/converters.go | Adds log→task conversion logic to infer validators from a recorded session. |
| cmd/waza/cmd_new_task.go | Implements waza new task record command. |
| cmd/waza/cmd_new.go | Refactors waza new into subcommands (skill/task). |
| cmd/waza/cmd_new_test.go | Updates tests to call the new skill subcommand constructor and adjusts alias expectations. |
| cmd/waza/cmd_init.go | Updates messaging/docs strings to waza new skill. |
| cmd/waza/cmd_init_test.go | Updates init tests’ expected strings / command constructor usage. |
| cmd/waza/cmd_check_test.go | Updates comment to reflect waza new skill. |
| README.md | Updates CLI snippets and documents the new task-record command. |
| docs/TUTORIAL.md | Updates waza new usage in tutorial. |
| docs/SKILLS_CI_INTEGRATION.md | Updates CI guide to waza new skill. |
| docs/SKILL-BEST-PRACTICES.md | Updates best practices doc to waza new skill. |
| docs/GUIDE.md | Updates guide snippets and adds docs for the new task-record command. |
| docs/GETTING-STARTED.md | Updates getting started doc to waza new skill. |
| go.mod | Promotes github.com/braydonk/yaml to a direct dependency (used by new task command). |
| go.sum | Dependency checksum updates aligned with go.mod changes. |
Comments suppressed due to low confidence (1)
cmd/waza/newtask/converters.go:114
tools[*e.Data.ToolCallID]can be nil if the log is missing a corresponding ToolExecutionStart (or the ID is missing), and the subsequentt.End = .../t.Success = ...will panic. Add a presence check (and nil-checkToolCallID/Success) before updating the tool entry, or reuse the existing tool-call correlation logic ininternal/models.FilterToolCalls.
case copilot.ToolExecutionComplete:
t := tools[*e.Data.ToolCallID]
t.End = e.Timestamp
t.Success = *e.Data.Success
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #110 +/- ##
=======================================
Coverage ? 73.32%
=======================================
Files ? 135
Lines ? 15521
Branches ? 0
=======================================
Hits ? 11381
Misses ? 3308
Partials ? 832
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 22 out of 23 changed files in this pull request and generated 10 comments.
Comments suppressed due to low confidence (5)
cmd/waza/newtask/converters.go:96
ToolExecutionStartdereferencese.Data.ToolCallIDwithout checking for nil. Other event-processing code treats these as optional pointers; add a nil guard before using it (and skip the event if missing).
case copilot.ToolExecutionStart:
toolsInOrder = append(toolsInOrder, *e.Data.ToolCallID)
cmd/waza/newtask/converters.go:128
- This won’t compile:
new("<no skill name>")is invalid Go (new expects a type). If you need a fallback value, assign through a local string variable (e.g., s := ""; name = &s) or handle nil without allocating.
if name == nil {
name = new("<no skill name>")
}
cmd/waza/newtask/converters.go:114
ToolExecutionCompletecan panic: it dereferencesToolCallID/Successand assumes the corresponding start event was seen (tools[*id]). Add nil checks and handle missing entries (skip or return a descriptive error).
case copilot.ToolExecutionComplete:
t := tools[*e.Data.ToolCallID]
t.End = e.Timestamp
t.Success = *e.Data.Success
cmd/waza/newtask/converters.go:107
- This block assumes
ToolName(and the decoded args pointer) are non-nil; if the log omits them,*e.Data.ToolName/*tawill panic. Add nil checks and either skip or surface a descriptive error for incomplete events.
tools[*e.Data.ToolCallID] = &tool{
Start: e.Timestamp,
Name: *e.Data.ToolName,
Arguments: *ta,
}
site/src/content/docs/reference/cli.mdx:182
- The heredoc example implies
waza new skillreads metadata from stdin in non-interactive mode, but the implementation currently ignores stdin and uses defaults when not in a TTY. Update/remove this example to match actual behavior.
waza new skill code-analyzer << EOF
Code Analyzer
Analyzes code for patterns and issues
code, analysis
EOF
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 33 out of 34 changed files in this pull request and generated 4 comments.
You can also share your feedback on Copilot code review. Take the survey.
…up 'eval', like the rest of them.
…ommands so it's clear why two commands with the same syntax produce different results
…ere completely untested.
…ead of what it is now (implicit, and always on)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 33 out of 34 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
cmd/waza/newtask/converters.go:156
- Same compilation issue as above:
new("<no path>")is invalid Go. Replace with astringvariable and take its address (or keeppathnil and skip adding an entry when required fields are missing).
path := e.Data.Path
if path == nil {
path = new("<no path>")
}
You can also share your feedback on Copilot code review. Take the survey.
wbreza
left a comment
There was a problem hiding this comment.
Code Review: PR #110 — Adding in wz new task record
✅ What Looks Good
- Clean interface export —
copilotClient→CopilotClientproperly exports interfaces for external mock generation - Good use of Go iterators —
NewCopilotLogIteratorusesiter.Seq2idiomatically with proper error handling - Improved grader error messages —
AllGraderKinds()provides helpful valid-types list on invalid grader config - Comprehensive mock infrastructure —
go:generatesetup with proper DI via options struct - Solid E2E test —
cmd_new_task_test.gocovers the full happy path with mocked dependencies
🟠 High (4 findings)
1. Consider from-prompt as default behavior or flag instead of subcommand — Currently waza new task from-prompt <prompt> <path> uses from-prompt as a subcommand. If there's only one way to create tasks right now, consider making it the default behavior of waza new task (with --from-prompt as a flag if more sources are added later). Running waza new task with no subcommand currently shows help — would a default action be more ergonomic?
2. Brittle default text grader — entire response as ContainsCS — converters.go:215-223 — The generated task YAML uses the full concatenated assistant response as a case-sensitive contains check. LLM responses are non-deterministic, so this grader will fail on re-runs. Consider extracting key phrases instead, or adding a YAML comment noting this should be customized.
3. converters.go — 1 test for 218 lines of code — CreateTestCaseFromCopilotLog handles 6 event types (UserMessage, ToolExecutionStart/Complete, AssistantMessage/Delta, SkillInvoked) but has only 1 happy-path test. Missing coverage for: nil ToolCallID, mapstructure decode failures, tool-not-found in map, report_intent filter, nil skill name/path defaults.
4. Hardcoded default model claude-sonnet-4.5 — cmd_new_task.go:291 — Model name hardcoded as flag default. Consider reading from config/env (WAZA_DEFAULT_MODEL) or deferring to the Copilot SDK's own default.
🟡 Medium (4 findings)
5. --testname should be --test-name — All other multi-word flags use kebab-case (--disable-repo-skills, --judge-model). This one doesn't.
6. context.Background() in shutdown defer — cmd_new_task.go:165 — Creates a new background context instead of inheriting cmd.Context(). If user presses Ctrl+C, shutdown still waits up to 1 minute.
7. RunE closure is ~160 lines — cmd_new_task.go:129-288 — Single function handles 6 different concerns. Consider extracting task definitions into helper functions for readability and testability.
8. new("<no skill name>") shadows Go builtin — converters.go:158,163 — Confusing for readers who expect new(T) to allocate a zero-value pointer. Use explicit s := "<no skill name>"; name = &s instead.
🟢 Low (2 findings)
9. Missing error context in skill discovery — cmd_new_task.go:182-186 — Wrap error with: fmt.Errorf("failed to discover skills in %q: %%w", rootDir, err)
10. Task list wrapper swallows AddTask error — cmd_new_task.go:324 — _ = tlw.inner.AddTask(options) silently ignores errors.
📌 Documentation Pass Recommended
After this PR lands, recommend a focused round of doc updates to ensure all docs match the new waza new command structure (waza new skill, waza new task, waza new eval). Some references to the old waza new [skill-name] and waza generate patterns may still exist across README, guides, and site content.
Summary
| Priority | Count |
|---|---|
| Critical | 0 |
| High | 4 |
| Medium | 4 |
| Low | 2 |
Overall Assessment: Comment — solid new feature with good DI/mock patterns. Key items to address: the brittle text grader default, test coverage for converters, and flag naming consistency.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 33 out of 34 changed files in this pull request and generated 2 comments.
You can also share your feedback on Copilot code review. Take the survey.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 33 out of 34 changed files in this pull request and generated 5 comments.
You can also share your feedback on Copilot code review. Take the survey.
wz new task recordwz new task from-prompt
Related to microsoft#105 Proactive test cases for the action_sequence grader. Written from the spec while Linus implements. Covers: exact_match, in_order_match, any_order_match modes, score calculations, error cases, and factory integration. **Test breakdown (31 test cases):** - Constructor validation: 6 tests (missing actions, empty actions, invalid mode, valid modes) - exact_match mode: 6 tests (pass, extra step, missing step, wrong order, empty actual, single tool) - in_order_match mode: 5 tests (exact seq, extras interspersed, wrong order, missing step, subset order) - any_order_match mode: 5 tests (reordered, extras, missing, duplicate expected, insufficient duplicates) - Score calculations: 4 tests (perfect, partial, zero, details verification + f1 check) - Edge cases: 3 tests (nil session, nil tools_used, duration recording) - Factory integration: 2 tests (Create success, Create invalid params error) **Note:** These tests depend on the implementation in the companion PR. Merge the implementation PR first.
Adding in a new command to let users fire off a prompt, and use the transcript to generate a task.yaml, with graders and prompt filled in.