azd waza run examples/code-explainer/eval.yaml fails with an error

Steps to repro:

1. Clone this repo
2. Swap the executor of examples/code-explainer/eval.yaml from "mock" to "copilot-sdk"
3. With the root of this repo as the current working directory, run `azd waza run examples/code-explainer/eval.yaml`
4. Observe that every task fails with "failed to create session" errors.
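Steps 2 and 3 can be scripted roughly as follows. This is only a sketch: `swap_executor` is a hypothetical helper, and it assumes the eval file contains a line of the form `executor: mock` (possibly indented or quoted), which may not match the actual layout of `eval.yaml`.

```shell
# Hypothetical helper: replace the executor value in an eval file in place.
# Assumes a line such as `executor: mock`; leading indentation is preserved.
# A backup of the original file is written next to it with a .bak suffix.
swap_executor() {
  file="$1"   # path to the eval file
  from="$2"   # current executor name, e.g. mock
  to="$3"     # new executor name, e.g. copilot-sdk
  sed -i.bak -E "s/^([[:space:]]*executor:[[:space:]]*)[\"']?${from}[\"']?[[:space:]]*\$/\\1${to}/" "$file"
}
```

From the repository root, the repro would then be `swap_executor examples/code-explainer/eval.yaml mock copilot-sdk` followed by `azd waza run examples/code-explainer/eval.yaml`.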

Full output:
```
Running benchmark: code-explainer-eval
Skill: code-explainer
Engine: copilot-sdk
Model: claude-sonnet-4-20250514

[ERROR] copilot failed to start: CLI process exited: exit status 1

[ERROR] failed to create session: failed to create session: CLI process exited: exit status 1

[ERROR] failed to create session: failed to create session: CLI process exited: exit status 1

✗ [1/4] Explain JavaScript Async/Await
[ERROR] failed to create session: failed to create session: CLI process exited: exit status 1

[ERROR] failed to create session: failed to create session: CLI process exited: exit status 1

[ERROR] failed to create session: failed to create session: CLI process exited: exit status 1

✗ [2/4] Explain List Comprehension
[ERROR] failed to create session: failed to create session: CLI process exited: exit status 1

[ERROR] failed to create session: failed to create session: CLI process exited: exit status 1

[ERROR] failed to create session: failed to create session: CLI process exited: exit status 1

✗ [3/4] Explain Python Recursion
[ERROR] failed to create session: failed to create session: CLI process exited: exit status 1

[ERROR] failed to create session: failed to create session: CLI process exited: exit status 1

[ERROR] failed to create session: failed to create session: CLI process exited: exit status 1

✗ [4/4] Explain SQL JOIN Query
===================================================
 BENCHMARK RESULTS
===================================================

Total Tests:    4
Succeeded:      0
Failed:         4
Errors:         0
Success Rate:   0.0%
Aggregate Score: 0.00
Min Score:      0.00
Max Score:      0.00
Std Dev:        0.0000
Duration:       842ms

---------------------------------------------------
 PER-TASK BREAKDOWN
---------------------------------------------------
  ✗ Explain JavaScript Async/Await [failed]
      pass_rate=100.0%  avg=0.00  min=0.00  max=0.00  stddev=0.0000  avg_dur=273ms
  ✗ Explain List Comprehension [failed]
      pass_rate=100.0%  avg=0.00  min=0.00  max=0.00  stddev=0.0000  avg_dur=2ms
  ✗ Explain Python Recursion [failed]
      pass_rate=100.0%  avg=0.00  min=0.00  max=0.00  stddev=0.0000  avg_dur=0ms
  ✗ Explain SQL JOIN Query [failed]
      pass_rate=100.0%  avg=0.00  min=0.00  max=0.00  stddev=0.0000  avg_dur=1ms

Failed Tests:
  - Explain JavaScript Async/Await (failed)
  - Explain List Comprehension (failed)
  - Explain Python Recursion (failed)
  - Explain SQL JOIN Query (failed)

---------------------------------------------------
 TRIGGER ACCURACY
---------------------------------------------------
  Accuracy:  0.0%
  Errors:    16 prompt(s) returned errors
  Precision: 0.0%  Recall: 0.0%  F1: 0.0%
  TP: 0  FP: 7  FN: 7  TN: 0

time=2026-03-02T15:52:21.379-08:00 level=INFO msg="failed to stop client" error="failed to kill CLI process: os: process already finished"
Error: benchmark completed: 4 failed and 0 error(s); trigger accuracy 0.0% below threshold 90.0%
benchmark completed: 4 failed and 0 error(s); trigger accuracy 0.0% below threshold 90.0%
```

@richardpark-msft 

azd waza run examples/code-explainer/eval.yaml fail with error #42