[plan] Implement CI Failure Doctor workflow #333

@github-actions

Objective

Create an automated workflow that investigates CI/CD failures, analyzes logs, identifies root causes, and creates detailed investigation reports with remediation steps.

Context

This repository's Docker and networking tests frequently fail with opaque errors such as "subnet pool overlap" and with container cleanup race conditions. Analyzing logs by hand wastes developer time, and failure patterns seen in earlier runs are not reused.

Approach

  1. Create workflow file: .github/workflows/ci-doctor.md
  2. Configure triggers:
    • workflow_run on completion of: test-integration, test-coverage, test-action
    • Only trigger when conclusion == 'failure'
  3. Implement investigation protocol:
    • Fetch workflow run details and job logs via GitHub API
    • Analyze for Docker network issues (subnet pool exhaustion, overlaps)
    • Check for container cleanup race conditions
    • Detect iptables rule conflicts
    • Identify Squid proxy startup failures
    • Search for similar past failures using cache-memory
  4. Create investigation report:
    • Detailed root cause analysis
    • Specific remediation steps
    • Link to similar past failures
    • Label with bug, ci, needs-investigation
  5. Store failure patterns: Update cache-memory with new patterns
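
The trigger filter and the log analysis in steps 2 and 3 could be sketched roughly as follows. This is a minimal illustration, not the planned implementation: the function names, the pattern labels, and every regular expression are assumptions and should be checked against real failure logs. The event shape (`workflow_run.conclusion`, `workflow_run.name`) follows the GitHub `workflow_run` webhook payload.

```python
import re

# Workflow names taken from the plan; adjust if the real `name:` fields differ.
WATCHED_WORKFLOWS = {"test-integration", "test-coverage", "test-action"}

# Illustrative signatures only (assumptions); confirm against actual CI logs.
FAILURE_PATTERNS = {
    "docker-subnet-overlap": re.compile(
        r"pool overlaps|non-overlapping IPv4 address pool", re.I
    ),
    "container-cleanup-race": re.compile(
        r"removal of container .* is already in progress"
        r"|is already in use by container",
        re.I,
    ),
    "iptables-conflict": re.compile(
        r"iptables.*(permission denied|resource temporarily unavailable)", re.I
    ),
    "squid-startup": re.compile(r"squid.*(unhealthy|failed to start|fatal)", re.I),
}


def should_investigate(event: dict) -> bool:
    """Return True only for failed runs of the watched test workflows."""
    run = event.get("workflow_run", {})
    return run.get("conclusion") == "failure" and run.get("name") in WATCHED_WORKFLOWS


def classify_log(log_text: str) -> dict[str, list[str]]:
    """Map each matching pattern label to the log lines that triggered it."""
    matches: dict[str, list[str]] = {}
    for line in log_text.splitlines():
        for label, pattern in FAILURE_PATTERNS.items():
            if pattern.search(line):
                matches.setdefault(label, []).append(line.strip())
    return matches
```

The labels returned by `classify_log` could then drive both the report sections and the lookup of similar past failures.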

Files to Create

  • .github/workflows/ci-doctor.md

Domain-Specific Focus Areas

  • Docker network pool exhaustion (172.30.0.0/24 subnet conflicts)
  • Container cleanup race conditions (timeout kills leave orphaned resources)
  • iptables rule conflicts (NET_ADMIN capability issues)
  • Squid proxy healthcheck failures
  • GitHub Actions runner Docker version incompatibilities
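
For the subnet conflict case, one possible check is to compare the test subnet against the subnets already in use on the runner. The sketch below uses Python's standard `ipaddress` module; the function name `find_overlaps` is hypothetical, and how the list of existing subnets is collected (for example from `docker network inspect` output) is left as an assumption.

```python
import ipaddress


def find_overlaps(candidate: str, existing: list[str]) -> list[str]:
    """Return the existing CIDR blocks that overlap the candidate subnet."""
    # strict=False tolerates inputs with host bits set, such as "172.30.0.1/24".
    wanted = ipaddress.ip_network(candidate, strict=False)
    return [
        cidr
        for cidr in existing
        if wanted.overlaps(ipaddress.ip_network(cidr, strict=False))
    ]


# Example: a leftover /16 network would conflict with the 172.30.0.0/24 test subnet.
conflicts = find_overlaps("172.30.0.0/24", ["172.17.0.0/16", "172.30.0.0/16"])
```

A non-empty result would suggest an orphaned network from an earlier run rather than a defect in the test itself.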

Acceptance Criteria

  • Workflow triggers automatically on CI failures for specified test workflows
  • Creates detailed investigation issues with root cause analysis
  • Identifies Docker/networking specific failure patterns
  • Searches cache-memory for similar historical failures
  • Provides actionable remediation steps
  • Completes within the 10-minute timeout
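
The historical-failure criterion could be met with a small pattern store. The sketch below assumes that cache-memory is available to the workflow as a writable directory and that patterns are kept in a single JSON file; the file layout, the record fields, and all function names are hypothetical.

```python
import json
from pathlib import Path


def load_patterns(store: Path) -> list[dict]:
    """Read all recorded failures, or an empty list if nothing is stored yet."""
    if not store.exists():
        return []
    return json.loads(store.read_text())


def find_similar(store: Path, labels: set[str]) -> list[dict]:
    """Return past failures that share at least one pattern label."""
    return [rec for rec in load_patterns(store) if labels & set(rec["labels"])]


def record_failure(store: Path, run_id: int, labels: set[str], summary: str) -> None:
    """Append one investigated failure to the store."""
    records = load_patterns(store)
    records.append({"run_id": run_id, "labels": sorted(labels), "summary": summary})
    store.parent.mkdir(parents=True, exist_ok=True)
    store.write_text(json.dumps(records, indent=2))
```

Matching on shared labels is deliberately coarse. It keeps the lookup cheap enough for the time budget, at the cost of occasionally linking failures that are only loosely related.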

Success Metrics

AI generated by Plan Command for discussion #328
