Skip to content

feat(agentic-ci): decision-ready triage and daily PR fixes#600

Merged
andreatgretel merged 14 commits into
mainfrom
andreatgretel/feat/agentic-ci-improvements
May 12, 2026
Merged

feat(agentic-ci): decision-ready triage and daily PR fixes#600
andreatgretel merged 14 commits into
mainfrom
andreatgretel/feat/agentic-ci-improvements

Conversation

@andreatgretel

Copy link
Copy Markdown
Contributor

📋 Summary

Reorganize the weekly issue-triage report around recommended actions so each flagged item is decision-ready, and flip four of the five daily audit suites from read-only reports to opening one PR per run for the most important localized fix. The shared procedure (selection, ranking, allowlists, attempted-fixes lifecycle, two-strike escalation, branch/PR conventions) lives in a single _fix-policy.md; suite recipes declare only their eligible categories.

🔗 Related Issue

N/A — extends the agentic-CI work tracked in plan 472. We can link a follow-up tracking issue once one is opened.

🔄 Changes

  • New .agents/recipes/_fix-policy.md — universal localized-fix bar (≤3 files, ≤50 LOC, reversible, self-evident, test-safe, single-concern), per-suite path/command allowlists, finding-hash spec for stable cross-run identification, fix_backlog / attempted_fixes schema, ranking criteria (confidence > severity > impact > recency), draft-PR mode, two-strike escalation, standard fix procedure, direct gh pr create --body-file PR-creation pattern.
  • New .agents/recipes/_phase-audit.md and .agents/recipes/_phase-fix.md — phase directives prepended to each claude invocation so each call knows which phase it executes.
  • Rewrite .agents/recipes/issue-triage/recipe.md: action-organized buckets (Close as resolved, Close as duplicate, Needs maintainer decision, Ready for assignment, Stuck PR, Duplicate PRs, Stale, consider closing), per-row Action / Evidence / Rationale columns, healthy items collapsed to count + <details>, multi-comment split with :i/N markers and orphan reconciliation, plus a Repeatedly-failed fix attempts section that surfaces two-strike findings from the daily suites.
  • Add a fix phase to four suite recipes (eligibility tables only — procedure inherited from _fix-policy.md):
    • docs-and-references: broken-link, docstring-drift (signature-driven), arch-ref-rename. Non-draft.
    • structure: missing-future, lazy-import. Non-draft. Dead exports stay report-only.
    • dependencies: transitive-gap, unused. Non-draft.
    • code-quality: bare-except narrowing only. Draft PRs until draft_until_proven is flipped after two non-draft PRs land clean. TODO-line deletion explicitly forbidden.
  • test-health: explicit "no fix phase, all categories report-only" with a future-candidate note for test-isolation violations.
  • Update .github/workflows/agentic-ci-daily.yml: split the recipe step into two claude invocations (audit, then conditionally fix), each with the existing --max-turns 50. Fix step gated on audit success AND matrix.suite != 'test-health' AND non-empty fix_backlog. Adds a git-identity step, expands artifact upload to include the fix log + PR body, and reports both phase outcomes in the job summary. No branch auto-deletion.
  • Minor _runner.md updates: generalized branch prefix beyond chore, documented why CI uses gh pr create --body-file instead of /create-pr.

🔍 Attention Areas

⚠️ Reviewers: Please pay special attention to the following:

  • .agents/recipes/_fix-policy.md — load-bearing contract. Allowlists, ranking, two-strike escalation, and the standard fix procedure all live here. Worth reviewing in full.
  • .github/workflows/agentic-ci-daily.yml — the audit/fix split, the backlog gate (fromJSON(steps.backlog.outputs.size || '0') > 0), and the matrix.suite != 'test-health' guard.
  • One-time manual setup before merge: create labels agentic-ci, agentic-ci/docs-and-references, agentic-ci/structure, agentic-ci/dependencies, agentic-ci/code-quality. The recipes apply these via gh pr edit --add-label; missing labels won't block the PR opening but will surface a gh warning.

🧪 Testing

  • make test — N/A. No Python or other source code changed; the diff is recipes (markdown), workflow YAML, and one new policy doc.
  • Workflow YAML validated (python3 -c "import yaml; yaml.safe_load(...)")
  • Pre-commit hooks pass (trim whitespace, EOF, yaml, large files, merge conflicts, line endings)
  • Production validation plan — enable each suite via workflow_dispatch one at a time and read the first 1–2 actual runs before promoting to the next, per the validation section of the plan. Two-strike escalation, allowlist enforcement, and re-attempt blocking are all observable from the runner state and PR list.

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated — N/A; this extends agentic-CI infrastructure (originally in plan 472), and _fix-policy.md itself is the architecture doc for the fix phase.

@andreatgretel andreatgretel requested a review from a team as a code owner May 4, 2026 16:55
@github-actions

github-actions Bot commented May 4, 2026

Copy link
Copy Markdown
Contributor

PR #600 Review — feat(agentic-ci): decision-ready triage and daily PR fixes

Summary

Reorganizes the weekly issue-triage report around recommended actions (seven exclusive buckets, per-row Action/Evidence/Rationale) and extends four of five daily audit suites (docs-and-references, structure, dependencies, code-quality) to open one PR per run for eligible, narrowly-scoped mechanical fixes. Shared rules live in a new .agents/recipes/_fix-policy.md; suite recipes declare only their eligible categories. test-health stays explicitly report-only. The workflow .github/workflows/agentic-ci-daily.yml splits the recipe step into audit + fix invocations, gated on audit success, non-test-health suite, and fix_backlog non-empty.

Diff is 607 adds / 118 dels, entirely in recipe markdown + one workflow YAML — no Python changes. Scope is consistent with the PR description.

Findings

Correctness / potential issues

  • make test-<package> name mapping is ambiguous. _fix-policy.md:26 and the fix procedure at _fix-policy.md:171-173 tell the fix phase to run make test-<package> for the affected package. The Makefile only defines test-config, test-engine, test-interface — but a recipe selecting e.g. packages/data-designer-config/pyproject.toml under the dependencies allowlist may plausibly infer make test-data-designer-config (which doesn't exist). Recommend adding an explicit mapping table in _fix-policy.md (path prefix → make target) so the recipe can't synthesize a non-existent target. This is the single most likely silent-abandon source on day one.

  • Check fix backlog step has no id: audit dependency, just if: success(). That's fine when Run audit recipe fails (whole job flips to failure), but if: success() does not distinguish between "audit ran and passed" and "audit was skipped". There is no current path to skip it, so not a bug today, but worth tightening to if: steps.audit.outcome == 'success' && matrix.suite != 'test-health' for robustness if a pre-audit gate is ever added.

  • fromJSON(steps.backlog.outputs.size || '0') when the step is skipped. If backlog is skipped (test-health), steps.backlog.outputs.size is empty string and the || '0' is required to keep fromJSON from erroring. Good. Consider applying the same pattern to AUDIT_OUTCOME / FIX_OUTCOME in the summary step — the default skipped/unknown render is currently fine, but a skipped backlog will show Fix backlog size: \`rather thann/a`. Minor.

  • Single-part vs numbered triage filename inconsistency. issue-triage/recipe.md:204-206 says each part is /tmp/issue-triage-report-1.md, ...-2.md, etc. The fallback note at recipe.md:265 says "the workflow's fallback step will post /tmp/issue-triage-report.md" (no suffix). In the single-part case, the file must exist under both names or the fallback loses the report. Either (a) instruct the recipe to write both, or (b) change the workflow fallback to prefer -1.md.

  • attempted_fixes reconciliation is underspecified. _fix-policy.md:158-159 says reconcile against open PRs at the start of each fix run; step 6 adds a hidden <!-- agentic-ci finding=<id> suite=<suite> --> marker in the PR body. But the reconciliation algorithm (how to match PRs to attempted_fixes entries by id) isn't described. On a crash after gh pr create but before the state write, the next run has no way to recover the entry without parsing that marker. Recommend one explicit sentence: "reconcile by scanning open PR bodies for the agentic-ci finding= marker and back-filling missing attempted_fixes entries".

  • fix_backlog.data schema is freeform. Each category describes its data shape in prose (e.g. docs-drift "signature-vs-Args delta", bare-except "proposed replacement type and grep evidence"). If two runs use slightly different key names, downstream reconciliation is brittle. A small schema table per category would harden this.

  • set -o pipefail is set in both the audit and fix shell steps — good. claude ... | tee will propagate non-zero correctly. No issue.

Style / conventions

  • Markdown tables in _fix-policy.md are dense but readable; good use of section headers. No issues.
  • Branch-naming pattern update in _runner.md is consistent with _fix-policy.md:167.
  • The explicit "TODO line deletion is forbidden" in code-quality (both in fix phase and Constraints) is a welcome belt-and-suspenders after the prior incident pattern.
  • Issue-triage multi-comment split logic (PATCH existing, post surplus, delete extras) is clean and idempotent.

Risk / operational

  • Cost. Cache is disabled (DISABLE_PROMPT_CACHING: "1") and the fix phase re-loads _phase-fix.md + _runner.md + _fix-policy.md + the recipe body per run. This roughly doubles per-suite prompt bytes on fix-eligible days. Acceptable, but worth a note in the rollout plan that cost per daily run approximately doubles when a backlog exists.
  • Draft-PR gate is governed by a runner-state boolean (draft_until_proven). Flipping it requires manually editing the cached state. That's fine as designed, but please document the flip procedure in the rollout note (the PR body already says "a maintainer flips the flag after two non-draft PRs land clean" — just spell out the exact file/key path).
  • Label pre-creation is a documented manual step. Missing labels only surface a gh warning; the PR still opens. Acceptable.
  • Two-strike surfacing in the triage report depends on access to the daily-suites' runner-state, which lives in per-suite caches on a different workflow. The recipe acknowledges this with a "may be empty" caveat, but that effectively neuters the escalation visibility until a shared-state mechanism is added. Worth tracking as a follow-up rather than blocking this PR.

Tests

  • N/A. No Python source changed. YAML is syntactically valid per the PR checklist; the changes are behavioral and only exercisable in production runs.

Verdict

Approve-with-comments. The design is coherent, the audit/fix split is well-isolated, and the escape hatches (localized-fix bar, allowlists, two-strike escalation, draft mode for the highest-risk suite) are appropriate. The one thing I'd resolve before merging is the make test-<package> mapping ambiguity — leaving it implicit invites day-one silent abandons. The other findings are nice-to-have tightening that can land as follow-ups.

Suggested pre-merge:

  1. Add an explicit path-prefix → make-target table to _fix-policy.md.
  2. Clarify the attempted_fixes PR-marker reconciliation algorithm in one sentence.
  3. Resolve the single-part triage filename (/tmp/issue-triage-report.md vs -1.md) inconsistency.

@greptile-apps

greptile-apps Bot commented May 4, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR reorganizes the weekly issue-triage report around action buckets and upgrades four of the five daily audit suites from read-only mode to opening one automated fix PR per run. A new _fix-policy.md centralizes the localized-fix bar (≤3 files, ≤50 LOC net, reversible, self-evident, test-safe), per-suite path allowlists, finding-hash spec, two-strike escalation, and the standard fix procedure. The daily workflow gains an audit/fix split, a snapshot-based scope gate, a dependencies-specific lockfile gate, and an expanded artifact upload.

  • New _fix-policy.md + phase directives (_phase-audit.md, _phase-fix.md) — shared policy contract prepended to every CI invocation, keeping suite recipes to eligibility declarations only.
  • Four suites gain a fix phase (docs-and-references, structure, dependencies, code-quality) with eligibility tables and appropriate test_required / draft_until_proven flags; test-health explicitly opts out.
  • agentic-ci-daily.yml — adds Run fix recipe, Validate fix scope, and Verify dependencies lockfile steps; snapshot-based selector ensures every new open attempt is gated regardless of reconciled orphans; cancel-in-progress flipped to false to prevent orphaned branches.
  • agentic-ci-issue-triage.yml — fallback posting rewritten to numbered parts with identity-based deduplication and pre-capture() jq guard to prevent stream truncation from sibling-workflow comments.

Confidence Score: 5/5

Safe to merge. All previously-flagged issues have been resolved; the one remaining finding is a minor spec/implementation gap in the LOC-delta calculation that only affects larger refactor-style fixes, which this system does not currently generate.

The scope gate, lockfile gate, snapshot-based selector, and two-strike escalation all look logically correct. The LOC-delta calculation computes a+d where the policy says "net", which would over-count for symmetric refactors — but every eligible fix category in this PR generates 1–10 changed lines, so the mismatch has no practical impact on the current release.

.github/workflows/agentic-ci-daily.yml — the scope gate's LOC-delta awk expression and the AST docstring checker are the most logic-dense additions and worth a second read.

Important Files Changed

Filename Overview
.github/workflows/agentic-ci-daily.yml Splits the Claude invocation into audit+fix phases, adds scope gate and lockfile gate. LOC-delta calc uses additions+deletions while policy says "≤50 LOC net", creating a mismatch for refactor-style fixes.
.agents/recipes/_fix-policy.md New policy document defining the localized-fix bar, allowlists, finding-hash spec, ranking, two-strike escalation, and standard fix procedure — well-structured and internally consistent.
.github/workflows/agentic-ci-issue-triage.yml Fallback posting rewritten to use numbered part files, identity-based deduplication, and MISSING-array iteration; capture() jq guard added. Looks correct.
.agents/recipes/code-quality/recipe.md Adds fix phase for bare-except narrowing with appropriate draft-PR default, test_required flag, and ineligibility guards for ambiguous exception types and TODO deletion.
.agents/recipes/dependencies/recipe.md Adds fix phase for transitive-gap and unused deps with make install-dev pre-validation; ineligibility guard for transitive-gap when no sibling specifier exists is a good safeguard.
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
.github/workflows/agentic-ci-daily.yml:355-356
`_fix-policy.md` specifies "≤50 LOC **net**", but the gate computes `a+d` (insertions + deletions combined). A fix that rewrites 30 lines with 30 different lines would show `a+d = 60` and be rejected even though the net LOC change is 0. For the tiny one-liner fixes this system typically generates the gap is harmless, but the docstring-drift category (which can legitimately replace several `Args:` lines) could exceed 50 `a+d` while staying well under 50 net — causing the gate to close a valid PR. The awk expression should subtract deletions from insertions instead.

```suggestion
              LOC_DELTA=$(git diff --shortstat "origin/main...$DIFF_REF" \
                | awk '{ a=0; d=0; for (i=1;i<=NF;i++) { if ($i ~ /insertion/) a=$(i-1); if ($i ~ /deletion/) d=$(i-1) } print (a > d ? a - d : d - a) }')
```

Reviews (13): Last reviewed commit: "Merge branch 'main' into andreatgretel/f..." | Re-trigger Greptile

Comment thread .github/workflows/agentic-ci-daily.yml Outdated
Comment thread .agents/recipes/_fix-policy.md
Comment thread .agents/recipes/issue-triage/recipe.md Outdated
Comment thread .github/workflows/agentic-ci-issue-triage.yml Outdated
Comment thread .github/workflows/agentic-ci-issue-triage.yml
Comment thread .github/workflows/agentic-ci-issue-triage.yml Outdated
@johnnygreco

Copy link
Copy Markdown
Contributor

Thanks for putting this together, @andreatgretel — the iteration is visible, and the factoring of _fix-policy.md as the shared contract is the right call.

Summary

This reorganizes the weekly triage report around action buckets and flips four daily suites from report-only to opening one PR per run, with a shared policy doc covering selection, ranking, allowlists, two-strike escalation, and atomicity. The implementation matches the stated intent and reflects multiple rounds of review feedback (the partial-post detection and capture() guard were tightened well across commits 19db106591e87496).

I read this through the lens you asked for — humans-out-of-the-loop autonomous PR generation — and the bulk of my comments are about the trust boundary between policy and enforcement. Most of the safety bar lives in _fix-policy.md as instructions to the agent, but the workflow doesn't independently verify that the agent followed them before the PR is opened. That's a reasonable v1 stance, but it's worth being explicit about which guarantees are invariants and which are agent-best-effort.

Findings

Warnings — Worth addressing

_fix-policy.md:38-49 — Path allowlists and forbidden commands are policy, not gates

  • What: The path allowlists, ≤3 files / ≤50 LOC bound, and the shared forbidden-command list (git push --force, gh pr merge, gh pr close, gh pr review, pip install) are spelled out as constraints the agent must follow. Nothing in agentic-ci-daily.yml validates the diff before the agent pushes/opens the PR — the agent runs with GH_TOKEN: ${{ github.token }} and pull-requests: write, so a misbehaving or confused agent could in principle push to a forbidden path or self-merge.
  • Why: For "humans out of the loop", this is the trust boundary. The agent's compliance with _fix-policy.md is load-bearing. A single bad run that touches .github/workflows/** (allowed by the token) or merges its own PR (allowed if branch protection isn't strict enough) is hard to undo. The two-strike escalation handles "agent picked the wrong fix"; it doesn't handle "agent escaped the allowlist."
  • Suggestion: Add a workflow-level gate between the agent's commit and the PR open. After the fix phase finishes, run git diff --name-only HEAD~1 HEAD against the suite's allowlist and wc -l against the LOC cap; abort + delete the local branch (without force-push) if either fails. Same idea for the docstring-only constraint on docs-and-references (the path allowlist permits packages/*/src/**/*.py but the policy says "docstring-only" — a git diff filtered to non-docstring AST changes would catch escapes). Even a coarse check ("no lines outside docstrings, comments, or whitespace") is much stronger than relying on the prompt.

agentic-ci-daily.yml:71-73cancel-in-progress: true on the matrix job can interrupt mid-fix

  • What: The matrix-level concurrency group cancels in-progress runs when a new run for the same suite arrives (e.g., a workflow_dispatch while the cron run is still going). If cancellation lands between the local commit and the PR-marker comment that drives reconciliation, the orphaned branch exists but no attempted_fixes record does.
  • Why: The _fix-policy.md "Atomicity" section promises "no half-states." Reconciliation via the <!-- agentic-ci finding=<id> --> marker recovers state for opened PRs, but a cancellation between git push and gh pr create leaves a pushed branch with no PR — the next run won't see it via gh pr list and may attempt the same finding again. With humans out of the loop, this drifts silently.
  • Suggestion: Either flip the matrix job to cancel-in-progress: false (the worst case is a queued duplicate run, not an interrupted one), or extend the reconciliation pass to also enumerate agentic-ci/* branches without an open PR and either delete them or open the deferred PR. Branch deletion is forbidden by policy elsewhere — leaving them for pr-stale.yml to reap is fine, but document the failure mode.

_phase-audit.md:12 — "Do NOT modify any files outside {{memory_path}}/" contradicts the recipes

  • What: The phase directive forbids writing outside {{memory_path}}/, but every audit recipe instructs the agent to write the report to /tmp/audit-{{suite}}.md. The workflow then reads that exact path for the job summary and artifact upload (agentic-ci-daily.yml:268, 288).
  • Why: A literal-minded agent could refuse to write the report file, which would break the entire daily pipeline silently (job summary empty, no artifact, fix phase has no audit context).
  • Suggestion: Change the bullet to something like "Do NOT modify any files outside {{memory_path}}/ and /tmp/audit-{{suite}}.md." Same nit applies to _phase-fix.md if it grows similar wording — the fix phase legitimately writes /tmp/pr-body-{{suite}}.md.

agentic-ci-daily.yml:260-271 — Agent log only uploaded on failure

  • What: The upload-artifact step is gated on if: failure(). Successful runs leave no trace of the agent's full reasoning stream beyond what made it into the job summary.
  • Why: For autonomous PR generation, the most interesting failure is "the workflow succeeded but the PR was wrong" — a bad fix that lands, an over-eager allowlist interpretation, the agent doing something subtly off. Without the stream-json log on success, there's nothing to look back at when a maintainer notices the issue days later.
  • Suggestion: Drop the if: failure() (or change to if: always()). Disk + retention cost is small; debuggability for autonomous mode is the bigger win. Bonus: include runner-state.json in the artifact list so the post-run state is recoverable if the cache gets evicted.

dependencies/recipe.md:178-181make install-dev after pyproject.toml edits is policy, not enforced

  • What: The recipe says "Before running the per-package test target [...], run make install-dev to confirm the lockfile resolves cleanly." Like the allowlist, this is an instruction to the agent. The workflow doesn't validate that make install-dev re-ran after the pyproject change before the test target.
  • Why: A transitive-gap fix that adds an unresolvable specifier (e.g., a typo in a version) would pass the per-package test target if the venv is still using the pre-change lockfile. The PR opens; CI on the PR would catch it eventually, but for a fully autonomous flow the in-recipe verification is the first line of defense.
  • Suggestion: For the dependencies suite specifically, the workflow could re-run make install-dev and a quick python -c "import <new-dep>" in a post-fix step before letting gh pr create run. Cheap, deterministic, and exactly what "test-safe" was meant to guarantee.

Suggestions — Take it or leave it

agentic-ci-daily.yml:91-99 — Unused cache step id

  • What: id: cache is set but steps.cache.outputs.cache-hit is never referenced.
  • Suggestion: Drop the id for cleanliness, or use it to skip the "Initialize runner memory" step when there's a cache hit (saves a couple of lines of redundant work).

_fix-policy.md:206 — Draft mode toggle is manual-only

  • What: draft_until_proven flips from truefalse only by maintainer edit of runner-state.json after "at least two non-draft PRs have landed clean." For a humans-out-of-the-loop system, the flip still requires a human.
  • Why: Not a bug — this is intentional and probably right for v1. But worth an explicit note in the policy that this is a deliberate human-gated promotion step, so future-you (or the next agent reading the policy) doesn't decide to automate it without thinking through the implications.
  • Suggestion: Add a sentence to the "Draft PRs" bullet: "This flip is intentionally manual; it is the sole human-gated promotion step in the fix policy."

code-quality/recipe.md:260 — Bare-except eligibility leans on agent judgment of "exactly one plausible exception type"

  • What: Eligibility requires "grep across the try-block confirms exactly one exception type is plausibly raised, verified by inspecting the called functions or imported library docs." That's the most interpretive of all the eligibility rules — and you've already correctly gated it behind draft mode.
  • Suggestion: Consider tightening further: only mark eligible when the try-block calls a single function (not a chain), and that function's signature/docstring lists Raises: in Google style. That makes the rule mechanical (grep for Raises: in the called function's docstring) rather than inferential. Could narrow the eligible population, but every fix that lands becomes evidence the rule works.

agentic-ci-issue-triage.yml:117mapfile -t PARTS < <(...) overwrites the array we just collected

  • What: PARTS=(...) is built first via the glob, then mapfile -t PARTS < <(printf '%s\n' "${PARTS[@]}" | sort -V) rebuilds it sorted. This works but reads a little awkwardly because PARTS is being reassigned from itself through a subshell.
  • Suggestion: A direct sort would be clearer: IFS=$'\n' read -r -d '' -a PARTS < <(printf '%s\n' /tmp/issue-triage-report-*.md | sort -V && printf '\0'). Take it or leave it — current form is correct.

Minor — JSON formatting consistency

  • What: agentic-ci-daily.yml:257 writes runner-state.json with indent=2. The agent likely writes it compact (no instruction either way). Each handover flips the format, which produces noisy cache diffs.
  • Suggestion: Either standardize on indent=2 (add a one-line note in _fix-policy.md's "Runner-state schema" section) or leave compact. Pick one.

What Looks Good

  • _fix-policy.md as the shared contract. Pulling allowlists, ranking, the standard fix procedure, and the attempted_fixes lifecycle into one document is exactly the right factoring — each suite recipe shrinks to the parts that actually vary (eligible categories, branch type, test_required). It's a high-signal doc to read end-to-end; reviewers can verify the contract once.
  • Phase split via separate claude invocations. Two distinct invocations with _phase-audit.md / _phase-fix.md directives is structurally cleaner than a single agent that has to choose its own phase. It's easier to reason about what each invocation is allowed to do, easier to gate (the workflow's if: steps.audit.outcome == 'success' && ... && fix_backlog > 0 is dead simple), and matches the "no half-states" atomicity goal.
  • Identity-based partial-post detection in triage fallback. The evolution of this through the review (count → identity-based with test() guard before capture(), then iterating only MISSING[] instead of PARTS[]) is a small piece of bash but a meaningful one — it's exactly the kind of "duplicate post of one part shouldn't mask a missing other" failure mode that's hard to test for and easy to get wrong. Worth keeping the explanatory comment block at lines 120-126.
  • Two-strike escalation surfaced in weekly triage. Routing repeatedly-failed fix attempts into the human-facing weekly report (rather than just dropping them) is exactly the right escape valve for autonomous mode — the system gracefully tells maintainers "I gave up here, you decide."
  • test-health opting out of the fix phase. Acknowledging that hollow-test rewrites and missing-test authoring require inferring intent, and explicitly leaving them report-only with a documented future-candidate note for test-isolation, is good restraint.

Verdict

Needs changes — Five Warnings worth addressing before this runs autonomously:

  1. Add a workflow-level diff gate against the suite's path allowlist + LOC cap before PR open (_fix-policy.md enforcement).
  2. Reconsider cancel-in-progress: true on the matrix job, or extend reconciliation to cover orphaned branches.
  3. Reword the _phase-audit.md "do not modify outside {{memory_path}}/" line to allow /tmp/audit-{{suite}}.md (and the equivalent for fix phase).
  4. Always upload the agent log artifact, not only on failure.
  5. Either gate the dependencies suite on a post-fix make install-dev + import smoke check, or document explicitly that the per-package test target is the only verification.

The Suggestions are honestly nits — none of them block.


This review was generated by an AI assistant.

Reorganize the weekly issue-triage report around recommended actions
(close as resolved, close as duplicate, needs maintainer decision,
ready for assignment, stuck PR, duplicate PRs, stale) so each flagged
item carries action + evidence + rationale and can be resolved without
opening it. Multi-comment split with i/N markers and orphan
reconciliation when the report grows or shrinks.

Flip the four daily audit suites with mechanical fix categories from
read-only reports to opening one PR per run:

- docs-and-references: broken-link, docstring-drift, arch-ref-rename
- structure: missing-future, lazy-import
- dependencies: transitive-gap, unused
- code-quality: bare-except (draft until landing rate proven)

test-health stays report-only (all candidates require inferring intent).

The shared procedure - fix_backlog selection, finding-hash spec for
stable cross-run identification, attempted_fixes lifecycle with
two-strike escalation, allowlists, ranking, branch/PR conventions -
lives in .agents/recipes/_fix-policy.md. Each suite recipe declares
only its eligible categories, branch types, and test requirements.

Workflow runs claude twice per suite (audit, then conditionally fix),
each capped at the existing --max-turns 50. Fix call is gated on
non-empty fix_backlog and skipped entirely for test-health.
- Map per-package test targets explicitly in _fix-policy.md (Makefile
  exposes test-config/test-engine/test-interface, not test-<package>).
- Use github-actions[bot] noreply identity for commits the recipes
  produce.
- Refresh fix_backlog.data when an id already exists so the fix phase
  cannot drive a PR from stale data after the underlying file changed.
- Stop time-pruning closed/abandoned attempted_fixes entries — pruning
  before the two-strike threshold erases the history needed to
  escalate. Single-strike entries now age out only via the 200-entry
  cap.
- Disambiguate bare-except findings within the same function by
  including a try-body hash in the finding id.
- Audit grep for code-quality now matches both `except:` and
  `except BaseException:`, in parity with the fix eligibility.
- Restrict transitive-gap fix eligibility to cases where a sibling
  package already declares the dep (avoids inventing version
  specifiers from scratch).
- Issue-triage workflow handles multi-part reports in both the fallback
  post step and the job summary; recipe always writes numbered parts.
- Replace remaining `make test-<package>` references with pointers to
  the mapping table; only the table itself uses that placeholder now.
- Fix `gh api --paginate | jq | length` returning per-page counts: slurp
  with `jq -s 'add // 0'` to get a single total.
- Compare posted-comment count to expected part count so a partial post
  (agent posted part 1 but not 2/3) triggers the fallback instead of
  being silently treated as success.
- Add `shell: bash` to triage steps using `shopt`/`mapfile` so they're
  not at the mercy of the runner's default shell.
- Disambiguate bare-except findings whose try-body hashes collide by
  adding a per-function ordinal to the canonical_key.
- Tie the 200-entry attempted_fixes cap eviction to `attempts[0].at`
  (the schema has no `first_seen` field).
…back

Replace the count-only POSTED_COUNT >= EXPECTED_PARTS check with an
identity-based check that extracts every i/N marker seen in
today-dated bot comments and verifies each expected i is present.
A duplicate post of one part can no longer mask a missing other.
- Exempt two-strike attempted_fixes entries from the 200-entry cap
  eviction. Cap now evicts non-two-strike oldest-first by
  attempts[0].at; two-strike entries are silently-forgotten only in
  the pathological all-200-are-two-strike case (itself a signal).
- Specify the attempted_fixes PR-marker reconciliation algorithm:
  scan open PR bodies for the `<!-- agentic-ci finding=<id> -->`
  marker and back-fill missing entries.
- Tighten the daily workflow conditionals to gate on explicit step
  outcomes (steps.audit.outcome == 'success' rather than success())
  so a future pre-audit gate cannot accidentally trip the fix step.
…ording)

- Bump daily-suite job timeout from 20 to 40 minutes. The split into
  two sequential `claude --max-turns 50` invocations can saturate a
  20-minute budget; a mid-fix SIGTERM would leave an orphaned branch
  and inconsistent runner-state.
- Disambiguate the `_phase-fix.md` "do NOT re-scan" rule. It forbids
  rebuilding fix_backlog from scratch but does NOT override the
  per-candidate re-verification step required by _fix-policy.md
  step 4.1 (re-grep / re-read the specific file the candidate points
  at). Single-candidate re-verification is required; whole-codebase
  re-scanning is forbidden.
- Guard `jq capture()` with a `test()` select. `capture()` errors on
  non-match instead of returning empty, which would truncate
  SEEN_PARTS if any unrelated today-dated bot comment lacks the
  triage marker (e.g. from a sibling workflow). Adding the test()
  guard ensures capture() only runs on bodies that already match.
- Iterate the MISSING[] array when posting fallback parts, not the
  full PARTS[] array. Posting all parts when only some were missing
  was creating duplicate comments for the parts the agent already
  successfully posted.
Address the five Warnings from the 2026-05-07 review focused on the
trust boundary for autonomous PR generation. Five workflow/policy
adjustments shrink the surface where agent compliance is load-bearing:

- Workflow-level scope gate. After the fix step, re-derive the diff
  against `origin/main` and validate against the per-suite path
  allowlist (regex mirrored from `_fix-policy.md`), the 50-LOC cap, and
  the 3-file cap. On violation, close the PR with `--delete-branch`
  and flip the `attempted_fixes` entry from `open` to `abandoned` so
  two-strike logic still sees the failure. The recipe alone could not
  bind the agent's path choices; the workflow now does.
- Dependencies install-dev verification. For the dependencies suite
  only, re-run `make install-dev` after the scope gate so the agent's
  pyproject edit is exercised against the lockfile resolver. Closes
  the PR if `install-dev` fails — catches the failure mode where the
  per-package test target passed against the old cached lockfile.
- Flip matrix-job `cancel-in-progress` from true to false. A
  cancellation between the agent's git push and `gh pr create` would
  leave an orphaned branch with no `attempted_fixes` record;
  reconciliation only covers PRs that were opened. Queueing a
  duplicate run is the lesser evil. `_fix-policy.md` Atomicity
  section now documents the trade-off.
- Allow `/tmp/audit-{{suite}}.md` in `_phase-audit.md`'s "do not
  modify outside `{{memory_path}}/`" directive. A literal-minded
  agent could refuse to write the report file, which would break the
  job summary, artifact upload, and the fix phase's audit context.
- Always upload the agent log artifact (was `if: failure()` only) and
  include `runner-state.json`. For autonomous mode, the most
  interesting failure is "the workflow succeeded but the PR was
  wrong"; the stream-json log is the only way to look back days
  later.

Also takes johnnygreco's Suggestion 2: spell out in the policy doc
that the `draft_until_proven` flip is the sole human-gated
promotion step in the fix policy and must not be automated.

Greptile and the github-actions auto-reviewer's findings were
already closed in the prior pass-2/pass-3 commits; no action needed
on those.
@andreatgretel andreatgretel force-pushed the andreatgretel/feat/agentic-ci-improvements branch from 91e8749 to 23829fb Compare May 7, 2026 23:40
Codex flagged five issues in the prior commit's scope/lockfile gates.
This commit closes all five:

- HIGH: Wrong-PR targeting. Both gates selected the last globally-open
  attempted_fixes entry, which could match a stale orphan from a
  prior crashed run rather than the PR opened by *this* run. Adds a
  pre-fix snapshot step that captures `(id, attempts-length)` pairs
  before the fix runs, and changes the post-fix selectors to require
  that the entry's attempts count grew during this run.
- HIGH: Docstring-only enforcement gap on the docs-and-references
  suite. The .py path allowlist was at workflow level but the
  docstring-only caveat was still policy-only. Adds an AST-based
  check: for each .py file changed, parse the post-change tree,
  collect docstring line ranges (module/class/function), then verify
  every added line in the diff is either inside a docstring, a
  comment, or whitespace. Verified locally with both pass and fail
  fixtures.
- MEDIUM: Diff-ref mismatch. Gates diffed `origin/main...HEAD` rather
  than `origin/main...origin/$BRANCH`, so a misbehaving agent that
  left HEAD pointing elsewhere would have validated the wrong tree.
  Now fetches `origin/$BRANCH` first and prefers that ref. Falls
  back to HEAD only if fetch fails (with a warning).
- MEDIUM: FILE_COUNT bug. `grep -c '.' || echo 0` produced "0\n0" on
  empty diff, breaking the downstream integer comparison. Replaces
  with `mapfile -t FILE_ARR` + `${#FILE_ARR[@]}`, which is correct
  for any input including empty.
- LOW: Non-atomic JSON writes. The runner-state mutations could leave
  the file half-written if the workflow was cancelled mid-write.
  Switches both gates to the temp-file + os.replace pattern.

Also: dependencies-lockfile gate now does an explicit
`git checkout --detach origin/$BRANCH` before re-running install-dev,
so verification runs against what was actually pushed rather than
relying on local working-tree state.
Comment thread .github/workflows/agentic-ci-daily.yml Outdated
Greptile review on 872d561 flagged that the fix step's custom `if:`
expression bypasses GitHub Actions' implicit success() check. Without
explicitly referencing steps.snapshot.outcome, a snapshot failure
(corrupt runner-state, disk error) would let the fix step run anyway.
The scope gate's `jq --slurpfile prior /tmp/prior-attempted-fixes.json`
would then exit non-zero on the missing file, leave OPEN empty, and
hit the "nothing to validate" early-exit — silently approving whatever
the agent pushed.

Adds steps.snapshot.outcome == 'success' to both the fix step's
condition (the actual fix) and the scope_gate step's condition
(belt-and-suspenders against future refactors).
Signed-off-by: Andre Manoel <amanoel@nvidia.com>
Comment thread .github/workflows/agentic-ci-daily.yml Outdated
@johnnygreco

Copy link
Copy Markdown
Contributor

Review findings from a pass over the final PR state:

  1. High: Post-fix gates are skipped when the fix step fails after opening a PR.

    .github/workflows/agentic-ci-daily.yml gates scope validation on steps.fix.outcome == 'success' and does the same for lockfile validation. If claude creates a PR and records an open attempt, then exits nonzero later, the PR is never allowlist/LOC/lockfile checked. Future runs will also skip it because the attempt is already present in the snapshot, so it no longer “grew.” Run these gates with always() whenever the snapshot exists and inspect grown/open attempts regardless of the fix step outcome.

  2. Medium: attempted_fixes never refreshes open PRs to merged or closed.

    _fix-policy.md only reconciles against currently open PRs, then filters out latest open/merged attempts. Once a maintainer merges or closes an agentic PR, state remains open forever, so the finding is permanently excluded and two-strike escalation cannot work. The reconciliation step should also query existing pr_numbers and update open attempts to merged or closed before filtering.

  3. Medium: Multi-part triage fallback can reuse stale /tmp report parts on the self-hosted runner.

    .github/workflows/agentic-ci-issue-triage.yml globs /tmp/issue-triage-report-*.md without clearing old files. If a previous run left report-2.md and today’s agent writes only report-1.md, fallback and the job summary will treat the stale second part as current and may post old triage data. Clean the report/log files before invoking the agent, or write under a run-specific $RUNNER_TEMP directory.

  4. Medium: The fallback comments lookup can abort the fallback path.

    .github/workflows/agentic-ci-issue-triage.yml assigns SEEN_PARTS from a gh api | sort pipeline. With the workflow’s shell: bash, GitHub runs bash with -e -o pipefail; if the comments API call fails, the whole fallback step exits before posting the generated report. The old code tolerated lookup failure. Wrap that lookup so it falls back to an empty seen set.

Validation I ran: PR diff/file inspection, existing review-comment scan, git diff --check, YAML parsing for both changed workflows, and PR status check review. CI is green, though the PR is currently behind main.

@andreatgretel

Copy link
Copy Markdown
Contributor Author

Thanks, this was a good catch. Addressed in 61bc60e.

Changes:

  • post-fix scope/lockfile gates now run from the pre-fix snapshot even if the fix step exits nonzero after opening a PR
  • dependency lockfile verification now fails closed on branch fetch/checkout failure instead of validating the previous working tree
  • attempted_fixes reconciliation now refreshes recorded open PRs to merged/closed before open-PR crash recovery
  • issue-triage clears stale /tmp report files before each run
  • fallback comment lookup now degrades to an empty seen set if the API lookup fails

Validation:

  • workflow YAML parsed
  • embedded workflow bash checked with bash -n
  • git diff --check passed
  • .venv/bin/ruff check --fix . passed
  • GitHub Actions checks are passing

@andreatgretel andreatgretel merged commit 1d203b1 into main May 12, 2026
49 checks passed
@andreatgretel andreatgretel deleted the andreatgretel/feat/agentic-ci-improvements branch May 14, 2026 01:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants