Summary
When the pass-2 LLM request in waza suggest fails at the engine level, the command always exits with:
parsing suggest response: response is not valid suggestion YAML
regardless of what actually went wrong. The real error is captured in ExecutionResponse.ErrorMsg but never surfaced, making the command appear flaky/broken when the underlying cause is a transient request failure.
Environment
- waza v0.33.0 (binary install via install.sh)
- macOS (arm64)
- Default copilot-sdk executor / bundled Copilot CLI
Repro
waza suggest .github/skills/<any-skill> --debug --format json
Fails most of the time in our repo (23 skills; tried multiple skills, multiple days). It occasionally succeeds, which is consistent with a transient/server-side rejection rather than a YAML formatting problem.
Debug evidence
With --debug, the event stream shows pass 1 (grader selection) completing normally, then pass 2 (eval generation, ~47 KB prompt) dying ~130 ms after the turn starts, with no assistant.message and no assistant.turn_end. That is far too fast for any generation to have happened:
15:14:27.223 type=user.message (pass-2 implementation prompt, ~47KB)
15:14:27.225 type=session.title_changed
15:14:27.424 type=assistant.turn_start turnID=0
15:14:27.424 type=session.usage_info
15:14:27.556 type=session.info
parsing suggest response: response is not valid suggestion YAML
For comparison, pass 1 in the same run produced assistant.message (graders: [trigger, file, text, prompt]) and assistant.turn_end before session.idle.
Root cause
CopilotEngine.Execute intentionally reports inline conversation errors via the response rather than a Go error (internal/execution/copilot.go#L478-L509):
// errors that are returned inline, as part of the conversation, also come back
// in the returned error. Rather than having one of those fun functions that returns
// both an error and a result, I'll just put the error message in the ExecutionResponse.
errMsg = err.Error()
...
ErrorMsg: errMsg,
Success: err == nil,
But suggest.Generate only checks the Go error and resp == nil (internal/suggest/suggest.go#L93-L107):
resp, err := engine.Execute(implCtx, &execution.ExecutionRequest{Message: implPrompt})
...
if resp == nil {
return nil, errors.New("empty engine response")
}
suggestion, err := ParseResponse(resp.FinalOutput)
When the engine fails, resp.FinalOutput is empty, ParseResponse("") falls through to the catch-all at internal/suggest/suggest.go#L217, and the actual resp.ErrorMsg is discarded. The same gap applies to the pass-1 call at internal/suggest/suggest.go#L73 (a pass-1 failure silently degrades to "no grader docs").
Suggested fix
In suggest.Generate, after each engine.Execute call (and after the existing resp == nil check, so resp is safe to dereference):
if !resp.Success {
return nil, fmt.Errorf("engine execution failed: %s", resp.ErrorMsg)
}
(or include resp.ErrorMsg in the parse-failure message when FinalOutput is empty). Bonus: distinguishing "model returned nothing" from "model returned unparseable YAML" in ParseResponse would make the remaining genuine parse failures easier to debug too.
Related