[plan] Implement CI Failure Doctor workflow #333
Closed
Description
Objective
Create an automated workflow that investigates CI/CD failures, analyzes logs, identifies root causes, and creates detailed investigation reports with remediation steps.
Context
This repository has complex Docker/networking tests that frequently fail with opaque errors such as "subnet pool overlap" and container cleanup race conditions. Manual log analysis wastes developer time, and failure patterns seen in earlier runs are not reused.
Approach
- Create workflow file: .github/workflows/ci-doctor.md
- Configure triggers:
  - workflow_run on completion of: test-integration, test-coverage, test-action
  - Only trigger when conclusion == 'failure'
- Implement investigation protocol:
  - Fetch workflow run details and job logs via GitHub API
  - Analyze for Docker network issues (subnet pool exhaustion, overlaps)
  - Check for container cleanup race conditions
  - Detect iptables rule conflicts
  - Identify Squid proxy startup failures
  - Search for similar past failures using cache-memory
- Create investigation report:
  - Detailed root cause analysis
  - Specific remediation steps
  - Link to similar past failures
  - Label with bug, ci, needs-investigation
- Store failure patterns: Update cache-memory with new patterns
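The trigger filter and the log-analysis step above could be sketched roughly as follows. This is a minimal illustration, not the workflow's actual implementation: the function names, the category names, and the regular expressions are all assumptions, and the log wording they match should be checked against real failed runs. The only fields taken from GitHub are `workflow_run.name` and `workflow_run.conclusion` in the `workflow_run` event payload.

```python
import re

# Workflow names listed in the plan above.
WATCHED_WORKFLOWS = {"test-integration", "test-coverage", "test-action"}

# Hypothetical failure signatures, one per focus area. The exact log
# wording is an assumption and should be tuned against real job logs.
PATTERNS = {
    "docker-network-pool": re.compile(
        r"pool overlaps|non-overlapping IPv4 address pool|subnet pool overlap",
        re.IGNORECASE,
    ),
    "container-cleanup-race": re.compile(
        r"container .* is already in use|removal of container .* is already in progress",
        re.IGNORECASE,
    ),
    "iptables-conflict": re.compile(
        r"iptables.*(permission denied|failed|no chain)", re.IGNORECASE
    ),
    "squid-startup": re.compile(
        r"squid.*(unhealthy|failed to start|fatal)", re.IGNORECASE
    ),
}


def should_investigate(event: dict) -> bool:
    """Return True only for failed runs of the watched workflows."""
    run = event.get("workflow_run", {})
    return (
        run.get("name") in WATCHED_WORKFLOWS
        and run.get("conclusion") == "failure"
    )


def classify_log(log_text: str) -> dict[str, list[str]]:
    """Group matching log lines by failure category."""
    hits: dict[str, list[str]] = {}
    for line in log_text.splitlines():
        for category, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.setdefault(category, []).append(line.strip())
    return hits
```

Keeping the signatures in one table means a newly observed failure pattern can be added as a single entry, which fits the goal of building up a pattern knowledge base over time.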
Files to Create
- .github/workflows/ci-doctor.md - Main workflow
  - Reference: agentics/ci-doctor.md template
Domain-Specific Focus Areas
- Docker network pool exhaustion (172.30.0.0/24 subnet conflicts)
- Container cleanup race conditions (timeout kills leave orphaned resources)
- iptables rule conflicts (NET_ADMIN capability issues)
- Squid proxy healthcheck failures
- GitHub Actions runner Docker version incompatibilities
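As an illustration of the first focus area, a subnet conflict can be checked with Python's standard `ipaddress` module. This is a sketch under assumptions: the `find_overlaps` helper is hypothetical, and the list of in-use subnets is made-up sample data that a real investigation would have to collect from the runner's Docker networks.

```python
import ipaddress


def find_overlaps(candidate: str, existing: list[str]) -> list[str]:
    """Return the subnets in `existing` that overlap `candidate`."""
    wanted = ipaddress.ip_network(candidate)
    return [
        subnet
        for subnet in existing
        if wanted.overlaps(ipaddress.ip_network(subnet))
    ]


# Sample data only; not taken from a real runner.
in_use = ["172.17.0.0/16", "172.30.0.0/16", "10.0.0.0/8"]
print(find_overlaps("172.30.0.0/24", in_use))  # ['172.30.0.0/16']
```

A non-empty result would point the report toward a subnet conflict rather than, say, a proxy startup failure.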
Acceptance Criteria
- Workflow triggers automatically on CI failures for specified test workflows
- Creates detailed investigation issues with root cause analysis
- Identifies Docker/networking specific failure patterns
- Searches cache-memory for similar historical failures
- Provides actionable remediation steps
- Completes within the 10-minute timeout
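The historical-failure search in the criteria above could work along these lines. This is a hedged sketch, not how cache-memory is actually structured: it assumes the stored patterns live in a single JSON file, and the record layout (`run_id`, `categories`), the helper names, and the Jaccard similarity threshold are all hypothetical choices made for illustration.

```python
import json
from pathlib import Path


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of categories two failures have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_similar(store: Path, categories: set[str], threshold: float = 0.5) -> list[dict]:
    """Return stored failures whose categories resemble the current ones."""
    if not store.exists():
        return []
    records = json.loads(store.read_text())
    scored = [(jaccard(categories, set(r["categories"])), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for score, record in scored if score >= threshold]


def remember(store: Path, run_id: int, categories: set[str]) -> None:
    """Append the current failure to the (assumed) JSON pattern store."""
    records = json.loads(store.read_text()) if store.exists() else []
    records.append({"run_id": run_id, "categories": sorted(categories)})
    store.write_text(json.dumps(records, indent=2))
```

Matching on category sets rather than raw log text keeps the comparison cheap, which helps stay inside the timeout; a real implementation might also want to compare error messages directly.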
Success Metrics
- Reduce failure diagnosis time from manual (hours) to <30 minutes
- Build knowledge base of 10+ common failure patterns within first month
Related to [plan] Enhance agentic workflow maturity to Level 4 (Optimized) #332
AI generated by Plan Command for discussion #328