Rework analyze-ci-failures skill by vicroms · Pull Request #51513 · microsoft/vcpkg

vicroms · 2026-05-03T06:25:36Z

I created an evaluation project for the AI agent skills in vcpkg using Microsoft/waza.
The analyze-ci-failures skill is evaluated using five different models, and the output each one produces is graded. The skill is asked to analyze the output of a real CI run and generate a report of the regressions found.

The changes to the skill were motivated by the output of the waza check and waza run commands, which in combination evaluate the quality of the skill and of the output it produces. Taking an iterative approach, the skill was reworked to greatly reduce over-specificity and to produce output that passes the evaluation metrics.

I plan to make the evaluations public, but they will probably be kept in a separate repository (or maybe another branch?). The evaluation graders are:

The correct CI build is referenced.
All triplets with regressions are identified.
All ports with regressions are identified and root caused.
The skill produces a report and downloads the ADO failure logs for review.
The skill passes a quality judgement by an LLM.

I also ran a test comparing the best performing model with and without the skill.

Comparison (no skill vs skill improvements)

Average score (3 runs). Deltas in parentheses are percentage-point changes from the column to the left:

Model	No Skill	master	de73198	3218d5e
claude-opus-4.7	72.56%	52.79% (-19.77)	92.67% (+39.70)	97.77% (+5.10)
claude-opus-4.5	55.95%	41.31% (-14.64)	90.92% (+49.62)	95.31% (+4.38)
gpt-5.3-codex	55.13%	37.54% (-17.59)	88.18% (+50.64)	98.64% (+10.46)
claude-haiku-4.5	50.57%	36.90% (-13.64)	82.38% (+45.49)	96.95% (+14.56)
gpt-5.4-mini	73.85%	61.41% (-12.44)	92.79% (+31.38)	96.21% (+3.41)
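
The deltas in parentheses appear to be percentage-point differences from the column immediately to the left (No Skill, then master, then de73198, then 3218d5e). A minimal Python sketch of that arithmetic, using the gpt-5.3-codex row as input; the variable and function names are illustrative and are not part of waza:

```python
# Scores for gpt-5.3-codex, copied from the table above (percent).
COLUMNS = ["No Skill", "master", "de73198", "3218d5e"]
SCORES = [55.13, 37.54, 88.18, 98.64]


def deltas_vs_previous(scores):
    """Return the percentage-point change of each score from the one before it."""
    return [round(cur - prev, 2) for prev, cur in zip(scores, scores[1:])]


codex_deltas = deltas_vs_previous(SCORES)
for name, delta in zip(COLUMNS[1:], codex_deltas):
    print(f"{name}: {delta:+.2f}")
```

Other rows may not reproduce exactly from the rounded values shown, presumably because the published deltas were computed from unrounded averages.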

Cost report

Token Usage Per Trial

Model	Trials	Input Tokens	Cached Tokens	Output Tokens	Total Tokens
claude-opus-4.7	3	187.6K	1.87M	34.5K	2.09M
claude-haiku-4.5	3	110.9K	1.11M	18.0K	1.24M
gpt-5.4-mini	3	190.4K	2.31M	37.4K	2.54M
claude-opus-4.5	3	164.8K	1.60M	22.8K	1.78M
gpt-5.3-codex	3	100.3K	965.9K	31.8K	1.10M

Cost Breakdown Per Trial

Model	Input Cost	Cached Cost	Output Cost	Cost/Trial
claude-opus-4.7	N/A	N/A	N/A	N/A
claude-haiku-4.5	$0.1109	$0.1109	$0.0902	$0.3120
gpt-5.4-mini	$0.1428	$0.1734	$0.1681	$0.4843
claude-opus-4.5	$0.8238	$0.7983	$0.5700	$2.1922
gpt-5.3-codex	$0.1755	$0.1690	$0.4449	$0.7895
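
The Cost/Trial column is the sum of the three per-category costs, and dividing a category's cost by its token count gives an implied price per million tokens. A small Python check using the claude-haiku-4.5 row; the token counts are the rounded values from the usage table, so the implied rates are approximate, and the names used here are illustrative:

```python
# claude-haiku-4.5 figures copied from the two tables above.
tokens = {"input": 110_900, "cached": 1_110_000, "output": 18_000}
costs = {"input": 0.1109, "cached": 0.1109, "output": 0.0902}  # USD

# Cost/Trial should equal the sum of the three category costs.
cost_per_trial = round(sum(costs.values()), 4)

# Implied USD price per million tokens for each category (approximate,
# because the token counts in the table are rounded).
implied_rates = {
    kind: round(costs[kind] / tokens[kind] * 1_000_000, 2) for kind in tokens
}

print(f"cost per trial: ${cost_per_trial:.4f}")
print(implied_rates)
```

The same check can be repeated for the other rows; sums may differ from the printed Cost/Trial in the last digit because each column is rounded independently.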

BillyONeal · 2026-05-03T06:57:11Z

Neat report! I'm not that surprised to see GPT win at this.

BillyONeal · 2026-05-04T20:57:53Z

Drafted this because you still seem to be making changes.

BillyONeal

I know I asked for curl commands in that one spot, but this iteration has bigger pwsh blocks, so I guess it doesn't matter. We might want to consider bash translations, because when copilot-cli runs in there that's what it'll get, but it's not worth it for now.

Rework analyze-ci-failures script - pass evals

159cb19

vicroms changed the title ~~Rework analyze-ci-failures script - pass evals~~ Rework analyze-ci-failures script - add evals May 3, 2026

vicroms mentioned this pull request May 3, 2026

[skills] Add skills to create ports and patches #51187

Draft

BillyONeal previously approved these changes May 3, 2026

Comment thread .github/skills/analyze-ci-failures/SKILL.md Outdated

Enhance analyze-ci-failures documentation with detailed workflow steps and minimum viable report format

de73198

vicroms dismissed BillyONeal’s stale review via de73198 May 4, 2026 19:31

BillyONeal marked this pull request as draft May 4, 2026 20:57

vicroms and others added 3 commits May 5, 2026 17:17

Improve triplet recognition and failure type diagnostic

3218d5e

Update .github/skills/analyze-ci-failures/SKILL.md

ebd2049

Co-authored-by: Billy O'Neal <bion@microsoft.com>

Merge branch 'master' into skill/analyze-ci-failures-improvements

d80f00c

vicroms marked this pull request as ready for review May 7, 2026 10:28

Add version metadata

017f440

BillyONeal previously approved these changes May 7, 2026

Comment thread .github/skills/analyze-ci-failures/SKILL.md Outdated

vicroms changed the title ~~Rework analyze-ci-failures script - add evals~~ Rework analyze-ci-failures script May 7, 2026

vicroms changed the title ~~Rework analyze-ci-failures script~~ Rework analyze-ci-failures skill May 7, 2026

vicroms marked this pull request as draft May 8, 2026 07:37

Remove azcopy reference

ba37db8

vicroms dismissed BillyONeal’s stale review via ba37db8 May 11, 2026 16:44

BillyONeal approved these changes May 13, 2026

Merge branch 'master' into skill/analyze-ci-failures-improvements

240d663

vicroms marked this pull request as ready for review May 13, 2026 19:19

vicroms merged commit ba27a18 into microsoft:master May 13, 2026
15 of 16 checks passed

vicroms merged 8 commits into microsoft:master from vicroms:skill/analyze-ci-failures-improvements
