Rework analyze-ci-failures skill #51513

Merged: vicroms merged 8 commits into microsoft:master from vicroms:skill/analyze-ci-failures-improvements on May 13, 2026

@vicroms commented May 3, 2026

Member

I created an evaluation project for the AI agent skills in vcpkg using Microsoft/waza.
The analyze-ci-failures skill is evaluated with five different models, and the output each one produces is graded. The skill is asked to analyze the output of a real CI run and generate a report of the regressions found.

The changes to the skill were motivated by the output of the waza check and waza run commands, which in combination evaluate the quality of the skill and of the output it produces. Working iteratively, I reworked the skill to greatly reduce over-specificity and to produce output that passes the evaluation metrics.

I plan to make the evaluations public, but they will probably be kept in a separate repository (or maybe another branch?). The evaluation graders are:

  • The correct CI build is referenced.
  • All triplets with regressions are identified.
  • All ports with regressions are identified and root caused.
  • The skill produces a report and downloads the ADO failure logs for review.
  • The skill passes a quality judgement by an LLM.
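As a rough illustration of what the triplet and port graders check, the sketch below tests whether a report mentions every expected name. It is not the waza grader implementation: the function `missing_items`, the sample report text, and the triplet and port names used here are all hypothetical.

```python
def missing_items(report_text, expected):
    """Return the expected names that never appear in the report, sorted."""
    lowered = report_text.lower()
    return sorted(name for name in expected if name.lower() not in lowered)


# Hypothetical report excerpt and expectations, for illustration only.
report = "Regressions on x64-windows: port-a (build error), port-b (missing header)."
expected_triplets = {"x64-windows", "arm64-linux"}
expected_ports = {"port-a", "port-b"}

missing_triplets = missing_items(report, expected_triplets)
missing_ports = missing_items(report, expected_ports)
print(missing_triplets)  # ['arm64-linux']
print(missing_ports)     # []
```

A grader built this way would pass only when the corresponding list is empty; a substring match is a deliberately simple stand-in for whatever matching the real graders use.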

I also ran a test comparing the best performing model with and without the skill.


Comparison (no skill vs skill improvements)

Average score (3 runs):

| Model | No Skill | master | de73198 | 3218d5e |
| --- | --- | --- | --- | --- |
| claude-opus-4.7 | 72.56% | 52.79% (-19.77) | 92.67% (+39.70) | 97.77% (+5.10) |
| claude-opus-4.5 | 55.95% | 41.31% (-14.64) | 90.92% (+49.62) | 95.31% (+4.38) |
| gpt-5.3-codex | 55.13% | 37.54% (-17.59) | 88.18% (+50.64) | 98.64% (+10.46) |
| claude-haiku-4.5 | 50.57% | 36.90% (-13.64) | 82.38% (+45.49) | 96.95% (+14.56) |
| gpt-5.4-mini | 73.85% | 61.41% (-12.44) | 92.79% (+31.38) | 96.21% (+3.41) |
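The deltas in parentheses appear to be measured against the column immediately to the left (master against No Skill, de73198 against master, 3218d5e against de73198). A minimal sketch that checks this reading against the gpt-5.3-codex row; the function name `column_deltas` is illustrative and not part of the evaluation project.

```python
def column_deltas(scores):
    """Return the change from each score to the next, in percentage points."""
    return [round(curr - prev, 2) for prev, curr in zip(scores, scores[1:])]


# gpt-5.3-codex row: No Skill, master, de73198, 3218d5e (average score, %).
codex = [55.13, 37.54, 88.18, 98.64]
codex_deltas = column_deltas(codex)
print(codex_deltas)  # [-17.59, 50.64, 10.46]
```

Not every row reproduces to the last digit from the rounded averages shown in the table.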

Cost report

Token Usage Per Trial

| Model | Trials | Input Tokens | Cached Tokens | Output Tokens | Total Tokens |
| --- | --- | --- | --- | --- | --- |
| claude-opus-4.7 | 3 | 187.6K | 1.87M | 34.5K | 2.09M |
| claude-haiku-4.5 | 3 | 110.9K | 1.11M | 18.0K | 1.24M |
| gpt-5.4-mini | 3 | 190.4K | 2.31M | 37.4K | 2.54M |
| claude-opus-4.5 | 3 | 164.8K | 1.60M | 22.8K | 1.78M |
| gpt-5.3-codex | 3 | 100.3K | 965.9K | 31.8K | 1.10M |

Cost Breakdown Per Trial

| Model | Input Cost | Cached Cost | Output Cost | Cost/Trial |
| --- | --- | --- | --- | --- |
| claude-opus-4.7 | N/A | N/A | N/A | N/A |
| claude-haiku-4.5 | $0.1109 | $0.1109 | $0.0902 | $0.3120 |
| gpt-5.4-mini | $0.1428 | $0.1734 | $0.1681 | $0.4843 |
| claude-opus-4.5 | $0.8238 | $0.7983 | $0.5700 | $2.1922 |
| gpt-5.3-codex | $0.1755 | $0.1690 | $0.4449 | $0.7895 |
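The Cost/Trial column reads as the sum of the input, cached, and output costs in the same row. A quick check of that reading on two rows, using only the figures listed above; the helper name `cost_per_trial` is made up for this sketch.

```python
def cost_per_trial(input_cost, cached_cost, output_cost):
    """Sum the three cost components for one trial, in USD."""
    return round(input_cost + cached_cost + output_cost, 4)


# Values copied from the cost breakdown rows (USD per trial).
haiku_total = cost_per_trial(0.1109, 0.1109, 0.0902)
mini_total = cost_per_trial(0.1428, 0.1734, 0.1681)
print(haiku_total, mini_total)  # 0.312 0.4843
```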

@vicroms changed the title from "Rework analyze-ci-failures script - pass evals" to "Rework analyze-ci-failures script - add evals" May 3, 2026
@BillyONeal

Member

Neat report! I'm not that surprised to see GPT win at this.

BillyONeal previously approved these changes May 3, 2026
Comment thread .github/skills/analyze-ci-failures/SKILL.md Outdated
@BillyONeal marked this pull request as draft May 4, 2026 20:57
@BillyONeal

Member

Drafted this because you still seem to be making changes.

@vicroms marked this pull request as ready for review May 7, 2026 10:28
BillyONeal previously approved these changes May 7, 2026

@BillyONeal left a comment

Member

I know I asked for curl commands in that one spot, but this iteration has bigger pwsh blocks, so I guess it doesn't matter. We might want to consider bash translations, because that's what copilot-cli will get when it runs in there, but it's not worth it for now.

Comment thread .github/skills/analyze-ci-failures/SKILL.md Outdated
@vicroms changed the title from "Rework analyze-ci-failures script - add evals" to "Rework analyze-ci-failures script" May 7, 2026
@vicroms changed the title from "Rework analyze-ci-failures script" to "Rework analyze-ci-failures skill" May 7, 2026
@vicroms marked this pull request as draft May 8, 2026 07:37
@vicroms marked this pull request as ready for review May 13, 2026 19:19
@vicroms merged commit ba27a18 into microsoft:master May 13, 2026
15 of 16 checks passed

2 participants