
fix(ops-ci): emit only currently-failing workflows; add pre-dispatch staleness check #196

Merged
auroracapital merged 1 commit into main from fix/ops-ci-stale-data
May 2, 2026


@auroracapital auroracapital commented May 2, 2026

Summary

Stop wasting Sonnet quota dispatching /ops-fires agents to already-self-resolved CI failures.

bin/ops-ci previously pulled the "last 3 failures" per repo regardless of whether those workflows were still red on branch HEAD. /ops-fires and /ops-go treated any failure from the last 24 hours as actionable and dispatched fix agents, burning ~50–150k Sonnet tokens per agent before each one concluded "nothing to fix".

Real-world signal: during a /ops-fires session tonight, 4 of 7 dispatched fixers reported "self-resolved before I started" (PRs already merged: healify-agentcore #520, fiberwifi #471/#473, stagery-contracts #30, stagery-api Inngest Vercel redeploy). That's ~400k tokens spent confirming green builds.

Smoke test

Real portfolio (38 repos, ops-ci registry):

  • OPS_CI_MODE=legacy bash bin/ops-ci: 14 repos with failures
  • bash bin/ops-ci (new default): 6 repos with failures
  • 8 stale entries filtered → 57% noise reduction
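
The noise-reduction figure follows directly from the two smoke-test counts. A minimal sketch of the arithmetic, where the variable names are illustrative only:

```shell
legacy=14    # repos flagged by OPS_CI_MODE=legacy
current=6    # repos flagged by the new default mode

stale=$(( legacy - current ))        # entries that were already green
pct=$(( stale * 100 / legacy ))      # integer percentage of legacy output

echo "${stale} stale entries, ${pct}% of legacy output"
```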

Changes

  • bin/ops-ci: survey workflows from the last 30 runs, then for each workflow × tracked branch (main/dev/master) fetch the latest run and emit it only when conclusion=="failure". OPS_CI_MODE=current is the default; OPS_CI_MODE=legacy restores the old behaviour.
  • skills/ops-fires/SKILL.md: add a MANDATORY pre-dispatch staleness check: run gh run list --workflow X --branch Y --limit 1 before spawning any fix agent. This is defense in depth against cache races, where a fix lands within seconds of the cache write.
  • Bump plugin version 2.0.8 → 2.0.9.
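
The current-state filter in the first bullet could look roughly like the sketch below. It is not the code from bin/ops-ci: the function names `list_workflows`, `latest_conclusion`, and `emit_current_failures` are hypothetical, the tab-separated output is a stand-in for the script's real JSON schema, and an authenticated `gh` CLI is assumed.

```shell
# Sketch only: function names and output format are hypothetical.

# Distinct workflow names seen in the repo's last 30 runs.
list_workflows() {
  local repo="$1"
  gh run list --repo "$repo" --limit 30 \
    --json workflowName --jq '[.[].workflowName] | unique | .[]'
}

# Conclusion of the newest run of one workflow on one branch.
# Prints an empty string when the workflow has no run on that branch.
latest_conclusion() {
  local repo="$1" workflow="$2" branch="$3"
  gh run list --repo "$repo" --workflow "$workflow" --branch "$branch" \
    --limit 1 --json conclusion --jq '.[0].conclusion // ""' </dev/null
}

# Print "workflow<TAB>branch" for every pair whose newest run failed.
emit_current_failures() {
  local repo="$1" workflow branch
  list_workflows "$repo" | while IFS= read -r workflow; do
    [ -n "$workflow" ] || continue
    for branch in main dev master; do
      if [ "$(latest_conclusion "$repo" "$workflow" "$branch")" = "failure" ]; then
        printf '%s\t%s\n' "$workflow" "$branch"
      fi
    done
  done
}
```

Calling `emit_current_failures owner/repo` would list only the workflow and branch pairs that are red right now. The cost is one `gh` call per workflow × branch on top of the survey call, which is the extra API load flagged in the risk note below.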

Test plan

  • bash -n bin/ops-ci syntax check
  • Real-portfolio smoke (legacy 14 → current 6)
  • JSON output schema unchanged (consumers don't break)
  • Verify /ops-fires skips self-resolved failures on next invocation
  • Confirm /ops-go morning briefing surface count drops to actionable-only

🤖 Generated with Claude Code


Note

Medium Risk
Changes CI failure detection logic used by downstream automation, so false negatives/positives could alter what incidents get surfaced or fixed; also increases gh API calls per repo which may hit rate limits or slow runs.

Overview
CI failure reporting is switched from “recent failures” to “current state.” bin/ops-ci now surveys recent workflow names, then for each workflow and tracked branch (main/dev/master) fetches the latest run and emits only those whose latest conclusion is failure, filtering out stale failures; OPS_CI_MODE=legacy restores the prior behavior.

Ops-fires adds a defense-in-depth staleness check. skills/ops-fires/SKILL.md now requires re-checking the latest gh run result (and handling in_progress) before dispatching any fix agent, skipping self-resolved failures. Plugin version bumps to 2.0.9.

Reviewed by Cursor Bugbot for commit 09d8e62. Bugbot is set up for automated code reviews on this repo.

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced CI failure detection with a new default mode that surveys latest workflow runs per repository and branch for improved accuracy
    • Added pre-dispatch staleness verification to ensure CI fires are still active before attempting fixes, with special handling for PR-scoped workflows
  • Chores

    • Updated plugin version to 2.0.9

…staleness check

`bin/ops-ci` previously pulled "last 3 failures" per repo regardless of
whether those workflows are still red on branch HEAD. Downstream
consumers (`ops-fires`, `ops-go`) treated any 24h failure as actionable
and dispatched fix agents to already-self-resolved fires — burning
~50–150k Sonnet tokens per agent before it concluded "nothing to fix".

Smoke test on a real portfolio: legacy mode emits 14 repos with
failures, new mode emits 6 (8 stale entries filtered, 57% noise gone).

Changes:
- `bin/ops-ci`: per repo, survey workflows from last 30 runs, then for
  each workflow × tracked branch (main/dev/master) fetch latest run and
  emit only when conclusion=="failure". Behind `OPS_CI_MODE=current`
  default; `OPS_CI_MODE=legacy` reverts to old behaviour.
- `skills/ops-fires/SKILL.md`: add MANDATORY pre-dispatch staleness
  check — `gh run list --workflow X --branch Y --limit 1` before
  spawning any fix agent. Defense in depth for cache races (a fix that
  landed within seconds of the cache write).
- Bump plugin version 2.0.8 → 2.0.9.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
coderabbitai Bot commented May 2, 2026

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

This PR releases plugin version 2.0.9 with two behavioral improvements: refactoring the CI-failure detection script to support dual modes (current state vs. legacy 24h-based detection) and adding a pre-dispatch staleness validation check to the ops-fires skill that re-verifies CI failures before dispatching fix agents.

Changes

Plugin 2.0.9 Release: CI Failure Detection & Validation

| Layer / File(s) | Summary |
|---|---|
| Version Manifest<br>claude-ops/.claude-plugin/plugin.json | Version bumped from 2.0.8 to 2.0.9. |
| CI Failure Detection Refactor<br>claude-ops/bin/ops-ci | Script refactored to support OPS_CI_MODE: new default "current" mode scans latest 30 workflow runs per repo, extracts distinct workflows, and emits only failures (conclusion=failure) for each workflow against tracked branches (main, dev, master); "legacy" mode retains prior status-filtered query (up to 3 results). Logic moved from inline computation into scan_repo_current() and scan_repo_legacy() functions; output format unchanged. |
| Pre-Dispatch Staleness Check<br>claude-ops/skills/ops-fires/SKILL.md | Added mandatory validation before spawning fix agents: re-checks selected fire results at branch HEAD via gh run list; skips if latest conclusion is success, proceeds if failure, waits/rechecks if in_progress. Special handling for PR-scoped workflows using gh pr checks. |
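
The OPS_CI_MODE switch summarized above might be wired up along these lines. Only the names `scan_repo_current`, `scan_repo_legacy`, and `OPS_CI_MODE` come from the PR; the wrapper `scan_repo`, the placeholder bodies, and the handling of unknown values are assumptions made for illustration.

```shell
# Sketch only: placeholder bodies stand in for the real scanners in bin/ops-ci.
scan_repo_current() { echo "current:$1"; }
scan_repo_legacy()  { echo "legacy:$1"; }

# Hypothetical wrapper: pick a scanner from OPS_CI_MODE, defaulting to "current".
scan_repo() {
  local repo="$1"
  case "${OPS_CI_MODE:-current}" in
    current) scan_repo_current "$repo" ;;
    legacy)  scan_repo_legacy "$repo" ;;
    *)       # Assumption: reject unknown modes instead of guessing.
             echo "unknown OPS_CI_MODE: ${OPS_CI_MODE}" >&2
             return 2 ;;
  esac
}
```

Keeping both scanners behind one entry point is what lets `OPS_CI_MODE=legacy bash bin/ops-ci` act as a rollback without touching callers, provided both emit the same output format.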

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

Poem

🐰 A rabbit hops with glee,
Failures caught in "current" mode,
Staleness checked before we flee,
CI workflows now bestowed,
Version bumped to 2.0.9!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the two main changes: fixing ops-ci to emit only currently-failing workflows and adding a pre-dispatch staleness check. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@auroracapital auroracapital merged commit 10c60c2 into main May 2, 2026
10 of 11 checks passed
@auroracapital auroracapital deleted the fix/ops-ci-stale-data branch May 2, 2026 02:47
auroracapital added a commit that referenced this pull request May 2, 2026
Pins marketplace ops plugin to v2.0.9 (released via PR #196 — fix ops-ci stale data).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is ON, but it could not run because the branch was deleted or merged before autofix could start.


- If `conclusion == "success"` → SKIP. Mark task completed with metadata `{resolution: "self-resolved-pre-dispatch"}`. Do NOT spawn agent.
- If `conclusion == "failure"` → proceed to dispatch.
- If `conclusion == null` (in_progress) → wait 30s, recheck once, then proceed if still null.

Staleness check dispatches during in-progress runs defeating PR goal

Medium Severity

The conclusion == null branch waits only 30 seconds then proceeds to dispatch a fix agent. Since most CI runs take 2–15 minutes, an in-progress run (likely triggered by a human pushing a fix) will almost always still be null after the recheck, causing dispatch anyway. This directly undermines the PR's goal of avoiding wasted token spend on self-resolving failures. Additionally, conclusion == null also matches the case where no runs exist at all (empty array from gh run list, with --jq '.[0]' yielding null), which is conflated with "in-progress" but really means the failure is stale/gone.
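
One way to address both points, offered as a sketch and not as the fix that shipped: request `status` and `conclusion` together, so that "no runs exist" and "a run is still in progress" no longer collapse into the same null. The names `classify_latest_run` and `check_before_dispatch`, the verdict strings, and the treatment of other completed conclusions (cancelled, skipped, and so on) are all assumptions.

```shell
# Sketch only: function names and verdict strings are hypothetical.

# Input: "<status> <conclusion>" for the newest run, or "" when no run exists.
classify_latest_run() {
  case "$1" in
    ""|"null null")      echo "skip-no-runs" ;;        # empty result: nothing to fix
    "completed success") echo "skip-self-resolved" ;;  # already green
    "completed failure") echo "dispatch" ;;            # still red at HEAD
    completed\ *)        echo "skip-other" ;;          # assumption: cancelled etc. are not fires
    *)                   echo "defer" ;;               # queued or in_progress: re-check later
  esac
}

# Fetch the newest run for a workflow on a branch and classify it.
check_before_dispatch() {
  local workflow="$1" branch="$2" latest
  latest="$(gh run list --workflow "$workflow" --branch "$branch" --limit 1 \
    --json status,conclusion --jq '.[] | "\(.status) \(.conclusion)"')"
  classify_latest_run "$latest"
}
```

Under this scheme an in-progress run yields `defer`, so the caller would re-poll until the run completes instead of dispatching after a fixed 30-second wait.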


auroracapital added a commit that referenced this pull request May 2, 2026
…release (#199)

The v2.0.5 → v2.0.9 patch series shipped meaningful features (multi-workspace
Slack #195, /ops:credentials audit #184, ops-ci current-state filter #196,
telegram preflight #185, userConfig schema upgrades #182). Per semver these
should have been a minor bump. This release retroactively rolls them up
into v2.1.0 with a single coherent CHANGELOG entry.

No code changes — only:
- plugin.json: 2.0.9 → 2.1.0
- CHANGELOG.md: new [2.1.0] entry consolidating Added/Fixed/Notes for the patch series
- README header + What's-new section: refer to v2.1.0
- 11 docs/*.md badges + agents-reference subtitle + migration latest-stable note: v2.0.9 → v2.1.0

Marketplace pin (.claude-plugin/marketplace.json) bumped in follow-up PR.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>