Running benchmark: code-explainer-eval
Skill: code-explainer
Engine: copilot-sdk
Model: claude-sonnet-4-20250514
[ERROR] copilot failed to start: CLI process exited: exit status 1
[ERROR] failed to create session: failed to create session: CLI process exited: exit status 1
[ERROR] failed to create session: failed to create session: CLI process exited: exit status 1
✗ [1/4] Explain JavaScript Async/Await
[ERROR] failed to create session: failed to create session: CLI process exited: exit status 1
[ERROR] failed to create session: failed to create session: CLI process exited: exit status 1
[ERROR] failed to create session: failed to create session: CLI process exited: exit status 1
✗ [2/4] Explain List Comprehension
[ERROR] failed to create session: failed to create session: CLI process exited: exit status 1
[ERROR] failed to create session: failed to create session: CLI process exited: exit status 1
[ERROR] failed to create session: failed to create session: CLI process exited: exit status 1
✗ [3/4] Explain Python Recursion
[ERROR] failed to create session: failed to create session: CLI process exited: exit status 1
[ERROR] failed to create session: failed to create session: CLI process exited: exit status 1
[ERROR] failed to create session: failed to create session: CLI process exited: exit status 1
✗ [4/4] Explain SQL JOIN Query
===================================================
BENCHMARK RESULTS
===================================================
Total Tests: 4
Succeeded: 0
Failed: 4
Errors: 0
Success Rate: 0.0%
Aggregate Score: 0.00
Min Score: 0.00
Max Score: 0.00
Std Dev: 0.0000
Duration: 842ms
---------------------------------------------------
PER-TASK BREAKDOWN
---------------------------------------------------
✗ Explain JavaScript Async/Await [failed]
pass_rate=100.0% avg=0.00 min=0.00 max=0.00 stddev=0.0000 avg_dur=273ms
✗ Explain List Comprehension [failed]
pass_rate=100.0% avg=0.00 min=0.00 max=0.00 stddev=0.0000 avg_dur=2ms
✗ Explain Python Recursion [failed]
pass_rate=100.0% avg=0.00 min=0.00 max=0.00 stddev=0.0000 avg_dur=0ms
✗ Explain SQL JOIN Query [failed]
pass_rate=100.0% avg=0.00 min=0.00 max=0.00 stddev=0.0000 avg_dur=1ms
Failed Tests:
- Explain JavaScript Async/Await (failed)
- Explain List Comprehension (failed)
- Explain Python Recursion (failed)
- Explain SQL JOIN Query (failed)
---------------------------------------------------
TRIGGER ACCURACY
---------------------------------------------------
Accuracy: 0.0%
Errors: 16 prompt(s) returned errors
Precision: 0.0% Recall: 0.0% F1: 0.0%
TP: 0 FP: 7 FN: 7 TN: 0
time=2026-03-02T15:52:21.379-08:00 level=INFO msg="failed to stop client" error="failed to kill CLI process: os: process already finished"
Error: benchmark completed: 4 failed and 0 error(s); trigger accuracy 0.0% below threshold 90.0%
benchmark completed: 4 failed and 0 error(s); trigger accuracy 0.0% below threshold 90.0%
Steps to repro:
azd waza run examples/code-explainer/eval.yaml
The full output is shown above.
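For reference, the zeros in the TRIGGER ACCURACY block are consistent with the confusion-matrix counts it reports (TP: 0 FP: 7 FN: 7 TN: 0). A minimal sketch, assuming the standard definitions of accuracy, precision, recall and F1. The function name `trigger_metrics` is hypothetical, and the convention of returning 0.0 on a zero denominator is an assumption; this is not taken from the waza source.

```python
def trigger_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard classification metrics from confusion-matrix counts.

    Assumption: any ratio with a zero denominator is reported as 0.0.
    The tool under test may use a different convention.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts copied from the TRIGGER ACCURACY block in the log.
metrics = trigger_metrics(tp=0, fp=7, fn=7, tn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With no true positives and no true negatives, every metric evaluates to 0.0%, which matches the reported values.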
@richardpark-msft