[plan] Implement CI Failure Doctor workflow #333

## Objective

Create an automated workflow that investigates CI/CD failures, analyzes logs, identifies root causes, and creates detailed investigation reports with remediation steps.

## Context

This repository has complex Docker/networking tests that frequently fail with opaque errors such as "subnet pool overlap" or because of container cleanup race conditions. Manual log analysis wastes developer time, and the failure patterns accumulated along the way go unused.

## Approach

1. **Create workflow file**: `.github/workflows/ci-doctor.md`
2. **Configure triggers**: 
   - `workflow_run` on completion of: `test-integration`, `test-coverage`, `test-action`
   - Only trigger when `conclusion == 'failure'`
3. **Implement investigation protocol**:
   - Fetch workflow run details and job logs via GitHub API
   - Analyze for Docker network issues (subnet pool exhaustion, overlaps)
   - Check for container cleanup race conditions
   - Detect iptables rule conflicts
   - Identify Squid proxy startup failures
   - Search for similar past failures using cache-memory
4. **Create investigation report** (filed as a GitHub issue):
   - Detailed root cause analysis
   - Specific remediation steps
   - Link to similar past failures
   - Label with `bug`, `ci`, `needs-investigation`
5. **Store failure patterns**: Update cache-memory with new patterns
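
The trigger filter and the log analysis in steps 2 and 3 can be sketched in a few lines of Python. This is only an illustration of the intended logic, not the workflow itself, which is authored in `.github/workflows/ci-doctor.md`. The event shape follows the `workflow_run` webhook payload; the category names and regular expressions are assumptions and would need tuning against real job logs.

```python
import re

# Workflow names taken from the trigger list above.
WATCHED_WORKFLOWS = {"test-integration", "test-coverage", "test-action"}

# ASSUMPTION: category names and regexes are illustrative guesses,
# not patterns confirmed against this repository's logs.
PATTERNS = {
    "docker-subnet-overlap": re.compile(r"subnet pool overlap|pool overlaps", re.I),
    "container-cleanup-race": re.compile(r"container.*(already in use|in progress)", re.I),
    "iptables-conflict": re.compile(r"iptables.*(permission denied|conflict)", re.I),
    "squid-startup": re.compile(r"squid.*(unhealthy|failed to start)", re.I),
}


def should_investigate(event: dict) -> bool:
    """True only for completed, failed runs of a watched workflow."""
    run = event.get("workflow_run", {})
    return (
        event.get("action") == "completed"
        and run.get("name") in WATCHED_WORKFLOWS
        and run.get("conclusion") == "failure"
    )


def classify_log(log_text: str) -> list[str]:
    """Return the sorted names of every pattern that matches the log."""
    return sorted(name for name, rx in PATTERNS.items() if rx.search(log_text))
```

A run that passes `should_investigate` would have its job logs fetched and handed to `classify_log`; an empty result means none of the known categories matched and the failure needs a manual look.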

## Files to Create

- `.github/workflows/ci-doctor.md` - Main workflow
- Reference template (not a new file): [agentics/ci-doctor.md](https://github.com/githubnext/agentics/blob/main/workflows/ci-doctor.md)

## Domain-Specific Focus Areas

- Docker network pool exhaustion (172.30.0.0/24 subnet conflicts)
- Container cleanup race conditions (`timeout` kills leave orphaned resources)
- iptables rule conflicts (NET_ADMIN capability issues)
- Squid proxy healthcheck failures
- GitHub Actions runner Docker version incompatibilities
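
For the subnet conflicts in the first bullet, a check along the following lines could confirm whether two networks really collide. It is a minimal sketch using Python's standard `ipaddress` module. The network names in the usage example are hypothetical, and the input mapping is assumed to have been collected elsewhere (for instance from `docker network inspect` output).

```python
import ipaddress


def find_overlaps(subnets: dict[str, str]) -> list[tuple[str, str]]:
    """Return pairs of network names whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    names = sorted(nets)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if nets[a].overlaps(nets[b])
    ]


# Hypothetical network names; only 172.30.0.0/24 comes from the list above.
example = {
    "test-net": "172.30.0.0/24",
    "leftover-net": "172.30.0.0/16",
    "bridge": "172.17.0.0/16",
}
conflicts = find_overlaps(example)
```

Here `conflicts` contains the single pair `("leftover-net", "test-net")`, which is the kind of evidence an investigation report could cite when a stale network from an earlier run blocks a new one.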

## Acceptance Criteria

- [ ] Workflow triggers automatically on CI failures for specified test workflows
- [ ] Creates detailed investigation issues with root cause analysis
- [ ] Identifies Docker/networking specific failure patterns
- [ ] Searches cache-memory for similar historical failures
- [ ] Provides actionable remediation steps
- [ ] Completes within the 10-minute timeout
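
The historical-failure criterion depends on recognizing that two runs failed the same way even when container IDs and timings differ. One possible approach is to normalize the error line into a fingerprint before looking it up. The sketch below assumes a plain JSON file as the store; the actual cache-memory layout, the file path, and the function names are all hypothetical.

```python
import hashlib
import json
import re
from pathlib import Path


def fingerprint(error_line: str) -> str:
    """Hash an error line after masking run-specific IDs and numbers."""
    normalized = re.sub(r"\b[0-9a-f]{12,64}\b", "<id>", error_line.lower())
    normalized = re.sub(r"\d+", "<n>", normalized)
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]


def record_failure(store: Path, error_line: str, run_url: str) -> list[str]:
    """Return earlier runs with the same fingerprint, then record this run.

    ASSUMPTION: `store` is a JSON file mapping fingerprint -> list of run URLs.
    """
    data = json.loads(store.read_text()) if store.exists() else {}
    key = fingerprint(error_line)
    earlier = list(data.get(key, []))
    data.setdefault(key, []).append(run_url)
    store.write_text(json.dumps(data, indent=2))
    return earlier
```

An empty return value means the pattern is new; a non-empty one gives the report its links to similar past failures.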

## Success Metrics

- Reduce failure diagnosis time from hours of manual work to under 30 minutes
- Build a knowledge base of 10+ common failure patterns within the first month

Related to #332

> AI generated by [Plan Command](https://github.com/githubnext/gh-aw-firewall/actions/runs/21100875337) for discussion #328
