Executive Summary
Today's 24-hour audit covered 45 agentic workflow runs across 40 distinct workflows. Overall safe output health is good: 25 safe output job executions with only 1 failure (96% success rate). The single failure was a transient GitHub API rate limit during a concurrent burst window, consistent with a pattern first observed on 2026-04-02. Three additional runs had safe outputs skipped due to upstream agent failures (bad credentials and an invalid Gemini API key), both of which are known recurring issues.
Period: 2026-04-07 (last 24 hours)
Runs Analyzed: 45 (with full run data)
Safe Output Jobs Executed: 25
Safe Output Jobs Failed: 1
Hard Failures: 1 (API rate limit)
Error Clusters: 2 (recurring patterns)
Safe Output Job Statistics
| Job Type | Executions | Failures | Notes |
| --- | --- | --- | --- |
| create_discussion | 12 | 0 | 100% ✅ |
| create_issue | 10 | 1 | 90% ⚠️ |
| add_comment | 8 | 0 | 100% ✅ |
| noop | 7 | 0 | 100% ✅ |
| create_pull_request | 5 | 0 | 100% ✅ |
| missing_data | 3 | 0 | 100% ✅ |
| assign_to_agent | 3 | 0 | 100% ✅ |
| create_pull_request_review_comment | 2 | 0 | 100% ✅ |
| update_issue | 2 | 0 | 100% ✅ |
| push_to_pull_request_branch | 1 | 0 | 100% ✅ |
| dispatch_workflow | 1 | 0 | 100% ✅ |
| submit_pull_request_review | 1 | 0 | 100% ✅ |
| set_issue_type | 1 | 0 | 100% ✅ |
| create_code_scanning_alert | 1 | 0 | 100% ✅ |
| update_pull_request | 1 | 0 | 100% ✅ |
| add_reviewer | 1 | 0 | 100% ✅ |
| missing_tool | 1 | 0 | 100% ✅ |
| report_incomplete | 1 | 0 | 100% ✅ |
| Custom (post_slack_message, send_slack_message) | 2 | 0 | Logged/delegated ✅ |
Total safe output items submitted: 75 across 36 runs with non-empty agent output.
Error Clusters
Cluster 1: API Rate Limit on Concurrent Burst (1 occurrence)
Affected job type: create_issue
Cluster 2: Safe Outputs Skipped Due to Agent Failures (3 runs, no safe outputs lost)
These are agent-level failures, not safe output failures. Safe output jobs were never scheduled because agent jobs failed to produce output. Included for completeness.
##[error]Bad credentials checking out githubnext/gh-aw-side-repo (2 runs)
API key not valid (INVALID_ARGUMENT), exit code 144 (1 run)
The bad credentials issue for the side-repo checkout is a recurring pattern: it has appeared on multiple consecutive days. The Gemini API key invalidity is also recurring.
Root Cause Analysis
API Rate Limit Issue
The GitHub App installation rate limit is triggered when multiple concurrent safe output jobs attempt GitHub API calls within the same time window. Today's burst window was 12:06–12:08 UTC. The create_issue handler does not implement retry-with-backoff for rate limit responses (HTTP 429 / 403 with rate-limit body), so the first affected call fails immediately.
Cross-Repo Credential Issue (Agent-Level, Not Safe Output)
The githubnext/gh-aw-side-repo checkout step in Smoke Create/Update Cross-Repo PR workflows uses a PAT that appears expired or revoked. This prevents the agent from running entirely, so no safe output items are ever produced. Detection and safe_outputs jobs are consequently skipped (conditional: needs.detection.result == 'success' is never met).
Gemini API Key (Agent-Level, Not Safe Output)
The Gemini API key configured for Smoke Gemini is invalid (API_KEY_INVALID from generativelanguage.googleapis.com). The Gemini CLI exits with code 144 after the first API call attempt.
Recurring Missing Data — Auto-Triage Issues (Non-Error)
Two Auto-Triage Issues runs (§24082713086, §24082753334) emitted missing_data for issue #25092. The reason: the DIFC integrity filter blocks reading the issue content (the issue's integrity level is below the workflow's required threshold). This is expected behavior — the safe output handler correctly recorded the missing data signal. Not a failure.
Recommendations
Critical Issues (Immediate Action Required)
Rotate PAT for githubnext/gh-aw-side-repo
Priority: High
Root Cause: Expired/revoked credentials for cross-repo smoke test workflows
Action: Renew or replace the Personal Access Token used to check out githubnext/gh-aw-side-repo in the Smoke Create/Update Cross-Repo PR workflows
Renew Gemini API Key
Root Cause: GEMINI_API_KEY secret is invalid (API_KEY_INVALID)
Action: Rotate the Gemini API key in repository secrets
Affected: Smoke Gemini workflow
Bug Fixes Required
Add Retry Logic for Rate-Limited Safe Output API Calls
Priority: Medium
Problem: safe_output_handler_manager.cjs fails immediately on HTTP 429/403-rate-limit responses with no retry
Fix: Implement exponential backoff (e.g., 3 retries: 5s → 15s → 45s) for GitHub API calls that return rate limit errors
Expected: Transient concurrent-burst rate limits would self-heal without user intervention
Affected: All create_issue, create_discussion, add_comment handlers
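The 5s → 15s → 45s example in the fix above corresponds to a base delay of 5 seconds that triples on each retry. A minimal sketch of that schedule follows; the helper name `backoffDelaysMs` and its parameters are hypothetical and are not taken from `safe_output_handler_manager.cjs`.

```javascript
// Hypothetical helper (not from the actual handler code): build an
// exponential backoff schedule in milliseconds.
// delay for retry n (0-based) = baseMs * factor^n
function backoffDelaysMs(retries = 3, baseMs = 5000, factor = 3) {
  return Array.from({ length: retries }, (_, attempt) => baseMs * factor ** attempt);
}

console.log(backoffDelaysMs()); // [ 5000, 15000, 45000 ]
```

With these defaults the worst case adds 65 seconds of waiting before the call is reported as failed, which may or may not fit the job's timeout budget.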
Configuration Changes
Stagger Concurrent Workflow Schedules
Priority: Low
Current: Multiple daily workflows (Delight, Multi-Device Docs Tester, Workflow Health Manager, GitHub MCP Structural Analysis) are scheduled with the same or adjacent cron times (~12:06–12:08 UTC)
Recommended: Spread schedules by 5–10 minutes to reduce concurrent safe output API bursts
Benefits: Reduces rate limit pressure without code changes
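As an illustration of the recommended spacing, the sketch below assigns each daily workflow a cron expression offset by a fixed number of minutes. The helper name, the 12:00 UTC start, and the 10-minute step are assumptions chosen for the example; the actual cron values of these workflows are not given in this report.

```javascript
// Hypothetical helper: give each workflow a daily cron expression
// ("minute hour * * *", UTC), spaced stepMinutes apart from the start time.
function staggerDailyCrons(names, startHour, startMinute, stepMinutes) {
  return names.map((name, index) => {
    // Minutes since midnight, wrapped so schedules past 23:59 roll over.
    const total = (startHour * 60 + startMinute + index * stepMinutes) % (24 * 60);
    const hour = Math.floor(total / 60);
    const minute = total % 60;
    return { name, cron: `${minute} ${hour} * * *` };
  });
}

// Workflow names come from the report; start time and step are assumed.
const schedule = staggerDailyCrons(
  ['Delight', 'Multi-Device Docs Tester', 'Workflow Health Manager', 'GitHub MCP Structural Analysis'],
  12, 0, 10
);
for (const { name, cron } of schedule) {
  console.log(`${name}: ${cron}`);
}
```

Scheduled runs can start later than their cron time, so spacing reduces the chance of overlapping bursts without guaranteeing it.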
Work Item Plans
Work Item 1: Add Retry-With-Backoff to Safe Output API Calls
Type: Bug Fix
Priority: Medium
Description: The safe output handler manager makes GitHub API calls without retry logic. When the GitHub App installation rate limit is exceeded during concurrent workflow bursts, the affected safe output call fails permanently. This is a recurring issue (2026-04-02, 2026-04-07).
Acceptance Criteria:
create_issue, create_discussion, add_comment handlers retry on HTTP 429 and rate-limit HTTP 403 responses
Technical Approach: Wrap @octokit/rest API calls in a retry helper that checks error.status === 429 || (error.status === 403 && /rate limit/i.test(error.message)) and sleeps before retrying
Estimated Effort: Small
Dependencies: None
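The technical approach for Work Item 1 could look roughly like the sketch below. The predicate is the condition quoted in the work item; the wrapper name `withRateLimitRetry`, the default delays, and the injectable `sleep` parameter are assumptions made for illustration and do not describe the existing handler code.

```javascript
// Default sleep based on setTimeout; tests can inject a faster one.
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Condition from the work item: HTTP 429, or HTTP 403 whose message
// mentions a rate limit.
function isRateLimitError(error) {
  return error.status === 429 ||
    (error.status === 403 && /rate limit/i.test(error.message));
}

// Hypothetical wrapper: runs `call`, and on a rate-limit error waits for the
// next delay in `delaysMs` before trying again. Any other error, or a
// rate-limit error after the delays are exhausted, is rethrown unchanged.
async function withRateLimitRetry(call, delaysMs = [5000, 15000, 45000], sleep = defaultSleep) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (error) {
      if (!isRateLimitError(error) || attempt >= delaysMs.length) {
        throw error;
      }
      await sleep(delaysMs[attempt]);
    }
  }
}
```

Assuming an authenticated `@octokit/rest` client named `octokit`, a handler might wrap a call as `await withRateLimitRetry(() => octokit.rest.issues.create({ owner, repo, title, body }))`. Non-rate-limit errors still fail on the first attempt, so genuine permission problems (plain 403s) are not masked by retries.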
Work Item 2: Investigate and Rotate Cross-Repo Smoke Test Credentials
Type: Operations / Investigation
Priority: High
Description: Smoke Create/Update Cross-Repo PR workflows fail daily at the Checkout githubnext/gh-aw-side-repo step with "Bad credentials". This blocks smoke test coverage for cross-repository PR operations.
Acceptance Criteria:
Root cause of credential failure identified (PAT expiry, revocation, or scope change)
Valid credentials restored in repository secrets
Smoke Create Cross-Repo PR and Smoke Update Cross-Repo PR pass their next scheduled runs
Technical Approach: Check the PAT referenced by the side-repo checkout step; verify expiry date; generate new PAT with repo scope; update the GITHUBNEXT_TOKEN (or equivalent) secret
Estimated Effort: Small
Dependencies: Requires access to the githubnext GitHub account or the secret-owner
Historical Context
7-Day Trend
| Date | Runs | Safe Output Failures | Hard Failures | Error Clusters |
| --- | --- | --- | --- | --- |
| 2026-03-29 | — | — | — | 1 (invalid_event_context) |
| 2026-03-30 | — | — | — | 2 |
| 2026-03-31 | — | — | — | 1 (push_to_PR_branch missing remote) |
| 2026-04-01 | — | — | — | 1 (push_to_PR_branch disallowed files) |
| 2026-04-02 | — | 7 | 7 | 2 (rate limit burst, dispatch_workflow) |
| 2026-04-05 | — | — | — | 0 |
| 2026-04-06 | 20 | 0 | 0 | 0 ✅ |
| 2026-04-07 | 45 | 1 | 1 | 2 |
Trends:
Rate limit burst pattern is recurring (2026-04-02, 2026-04-07) — no retry logic has been added yet
Cross-repo bad credentials have been failing for multiple consecutive days without resolution
push_to_pull_request_branch disallowed files issue resolved after 2026-04-02 (Smoke Claude now uses the allowed filename)
Today's run count (45) is significantly higher than yesterday (20), reflecting broader workflow coverage in the download period
Metrics and KPIs
create_discussion, add_comment, noop, push_to_pull_request_branch: all 100%
create_issue: 1 failure due to rate limit (90% today)
Next Steps
githubnext/gh-aw-side-repo PAT in repository secrets
update-discussion warning #25092: DIFC integrity filter is expected to clear once the issue passes integrity checks