fix(ops-ci): emit only currently-failing workflows; add pre-dispatch staleness check #196
…staleness check

`bin/ops-ci` previously pulled "last 3 failures" per repo regardless of whether those workflows are still red on branch HEAD. Downstream consumers (`ops-fires`, `ops-go`) treated any 24h failure as actionable and dispatched fix agents to already-self-resolved fires, burning ~50–150k Sonnet tokens per agent before it concluded "nothing to fix".

Smoke test on a real portfolio: legacy mode emits 14 repos with failures, new mode emits 6 (8 stale entries filtered, 57% noise gone).

Changes:
- `bin/ops-ci`: per repo, survey workflows from last 30 runs, then for each workflow × tracked branch (main/dev/master) fetch latest run and emit only when `conclusion=="failure"`. Behind `OPS_CI_MODE=current` default; `OPS_CI_MODE=legacy` reverts to old behaviour.
- `skills/ops-fires/SKILL.md`: add MANDATORY pre-dispatch staleness check: `gh run list --workflow X --branch Y --limit 1` before spawning any fix agent. Defense in depth for cache races (fix landed in seconds since cache write).
- Bump plugin version 2.0.8 → 2.0.9.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Caution: Review failed. Pull request was closed or merged during review.

Walkthrough
This PR releases plugin version 2.0.9 with two behavioral improvements: refactoring the CI-failure detection script to support dual modes (current state vs. legacy 24h-based detection) and adding a pre-dispatch staleness validation check to the ops-fires skill that re-verifies CI failures before dispatching fix agents.

Changes
Plugin 2.0.9 Release: CI Failure Detection & Validation

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~28 minutes

Pre-merge checks: ✅ 5 passed
Pins marketplace ops plugin to v2.0.9 (released via PR #196 — fix ops-ci stale data). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is ON, but it could not run because the branch was deleted or merged before autofix could start.
Reviewed by Cursor Bugbot for commit 09d8e62.
- If `conclusion == "success"` → SKIP. Mark task completed with metadata `{resolution: "self-resolved-pre-dispatch"}`. Do NOT spawn agent.
- If `conclusion == "failure"` → proceed to dispatch.
- If `conclusion == null` (in_progress) → wait 30s, recheck once, then proceed if still null.
Staleness check dispatches during in-progress runs defeating PR goal
Medium Severity
The `conclusion == null` branch waits only 30 seconds and then proceeds to dispatch a fix agent. Since most CI runs take 2–15 minutes, an in-progress run (likely triggered by a human pushing a fix) will almost always still be null after the recheck, causing dispatch anyway. This directly undermines the PR's goal of avoiding wasted token spend on self-resolving failures. Additionally, `conclusion == null` also matches the case where no runs exist at all (an empty array from `gh run list`, with `--jq '.[0]'` yielding null); that case is conflated with "in-progress" but really means the failure is stale or gone.
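One way to keep the three situations in this finding apart (no runs at all, a run still in progress, a finished run) is to classify on the run count and `status` before looking at `conclusion`. The following is only a sketch under assumptions, not the skill's actual logic: `classify_latest_run` is a hypothetical helper, its three arguments are assumed to have been extracted beforehand (for example from `gh run list --workflow X --branch Y --limit 1 --json status,conclusion`), and the `wait` and `unknown` outcomes are illustrative labels whose handling is left to the caller.

```shell
# Hypothetical helper (not part of bin/ops-ci or SKILL.md).
# Arguments: $1 = number of runs returned, $2 = status, $3 = conclusion.
# Prints one of: skip | dispatch | wait | unknown
classify_latest_run() {
  local count="$1" status="$2" conclusion="$3"

  # No runs exist for this workflow/branch: the failure is stale or gone,
  # which is different from a run that is still in progress.
  if [ "$count" -eq 0 ]; then
    echo "skip"
    return
  fi

  # A run exists but has not finished: a fix may already be in flight,
  # so report "wait" instead of falling through to dispatch.
  if [ "$status" != "completed" ]; then
    echo "wait"
    return
  fi

  case "$conclusion" in
    failure) echo "dispatch" ;;  # still red: proceed to dispatch
    success) echo "skip" ;;      # self-resolved: do not spawn an agent
    *)       echo "unknown" ;;   # assumption: other conclusions need a caller decision
  esac
}
```

For example, `classify_latest_run 0 "" ""` prints `skip`, while `classify_latest_run 1 in_progress ""` prints `wait`, so an empty result is never mistaken for a running build.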
…release (#199)

The v2.0.5 → v2.0.9 patch series shipped meaningful features (multi-workspace Slack #195, /ops:credentials audit #184, ops-ci current-state filter #196, telegram preflight #185, userConfig schema upgrades #182). Per semver these should have been a minor bump. This release retroactively rolls them up into v2.1.0 with a single coherent CHANGELOG entry.

No code changes, only:
- plugin.json: 2.0.9 → 2.1.0
- CHANGELOG.md: new [2.1.0] entry consolidating Added/Fixed/Notes for the patch series
- README header + What's-new section: refer to v2.1.0
- 11 docs/*.md badges + agents-reference subtitle + migration latest-stable note: v2.0.9 → v2.1.0

Marketplace pin (.claude-plugin/marketplace.json) bumped in follow-up PR.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Summary
Stop wasting Sonnet quota dispatching `/ops-fires` agents to already-self-resolved CI failures.

`bin/ops-ci` previously pulled "last 3 failures" per repo regardless of whether those workflows are still red on branch HEAD. `/ops-fires` and `/ops-go` treated any 24h-old failure as actionable and dispatched fix agents, burning ~50–150k Sonnet tokens per agent before concluding "nothing to fix".

Real-world signal: during a `/ops-fires` session tonight, 4 of 7 dispatched fixers reported "self-resolved before I started" (PRs already merged: healify-agentcore #520, fiberwifi #471/#473, stagery-contracts #30, stagery-api Inngest Vercel redeploy). That's ~400k tokens spent confirming green builds.

Smoke test
Real portfolio (38 repos, ops-ci registry):
- `OPS_CI_MODE=legacy bash bin/ops-ci`: 14 repos with failures
- `bash bin/ops-ci` (new default): 6 repos with failures

Changes
- `bin/ops-ci`: survey workflows from last 30 runs, then for each workflow × tracked branch (main/dev/master) fetch latest run, emit only when `conclusion=="failure"`. Behind `OPS_CI_MODE=current` default; `OPS_CI_MODE=legacy` reverts.
- `skills/ops-fires/SKILL.md`: MANDATORY pre-dispatch staleness check: `gh run list --workflow X --branch Y --limit 1` before spawning any fix agent. Defense in depth for cache races.

Test plan
- `bash -n bin/ops-ci` syntax check
- `/ops-fires` skips self-resolved failures on next invocation
- `/ops-go` morning briefing surface count drops to actionable-only

🤖 Generated with Claude Code
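The difference between "recent failures" and "current state" described under Changes can be illustrated with a small filter. This is a sketch under stated assumptions and not the code in `bin/ops-ci`: `current_failures` is a hypothetical function, and instead of calling `gh` it reads pre-fetched run records from stdin, one per line, newest first, in an assumed tab-separated `workflow`, `branch`, `conclusion` layout.

```shell
# Hypothetical filter (not the actual bin/ops-ci implementation).
# Input: one run per line, newest first, tab-separated:
#   workflow<TAB>branch<TAB>conclusion
# Output: workflow<TAB>branch for every pair whose most recent run
# concluded "failure"; older runs for the same pair are ignored.
current_failures() {
  awk -F '\t' '
    {
      key = $1 FS $2
      if (key in seen) next        # an older run for this pair: skip it
      seen[key] = 1                # first line seen for a pair is its latest run
      if ($3 == "failure") print $1 FS $2
    }
  '
}
```

Fed a history in which a workflow failed and then passed again on the same branch, the filter emits nothing for that pair, whereas a scan for recent failures would still report it. That gap is the stale noise the smoke test measures.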
Note
Medium Risk
Changes CI failure detection logic used by downstream automation, so false negatives/positives could alter what incidents get surfaced or fixed; also increases `gh` API calls per repo, which may hit rate limits or slow runs.

Overview
CI failure reporting is switched from "recent failures" to "current state." `bin/ops-ci` now surveys recent workflow names, then for each workflow and tracked branch (main/dev/master) fetches the latest run and emits only those whose latest `conclusion` is `failure`, filtering out stale failures; `OPS_CI_MODE=legacy` restores the prior behavior.

Ops-fires adds a defense-in-depth staleness check. `skills/ops-fires/SKILL.md` now requires re-checking the latest `gh run` result (and handling `in_progress`) before dispatching any fix agent, skipping self-resolved failures. Plugin version bumps to `2.0.9`.

Reviewed by Cursor Bugbot for commit 09d8e62. Bugbot is set up for automated code reviews on this repo.