[WIP] Address feedback on model routing policy in ADR-017 and ADR-018 by Copilot · Pull Request #385 · rjmurillo/ai-agents

Copilot · 2025-12-24T21:08:01Z

Completed: ADR Renumbering + Protocol Compliance

Successfully addressed feedback to resolve ADR numbering conflicts, added required session logs with full QA, and documented protocol discoveries.

Files Renamed

ADR-017-model-routing-strategy.md → ADR-019-model-routing-strategy.md
ADR-017-debate-log.md → ADR-019-debate-log.md → .agents/critique/ADR-019-debate-log.md
ADR-018-architecture-governance-split-criteria.md → ADR-020-architecture-governance-split-criteria.md
.serena/memories/adr-017-quantitative-analysis.md → adr-019-quantitative-analysis.md
.serena/memories/adr-017-split-execution.md → adr-019-split-execution.md

References Updated

✅ All internal ADR cross-references (ADR-017→019, ADR-018→020)
✅ Governance policy references in AI-REVIEW-MODEL-POLICY.md
✅ Debate log moved to .agents/critique/ directory
✅ Updated all references to debate log path (architecture → critique)
✅ Serena memory files with new ADR numbers
✅ Fixed markdown linting issues (MD041, MD001, MD036)
✅ Verified no broken links

Session Logs & Documentation

✅ Session 92: ADR renumbering work
✅ Session 93: Skillbook learning about pre-commit protocol (with QA)
✅ Branch handoff: .agents/handoffs/copilot/sub-pr-310/session-92-93.md
✅ GitHub issue documented: HANDOFF.md validator protocol conflict (saved to /tmp/github-issue-handoff-validator.md)

New Skills Generated

✅ skill-git-002-fix-hook-errors-never-bypass: Fix errors flagged by pre-commit hooks; NEVER use --no-verify (Impact: 10/10 Critical)
✅ skill-workflow-012-branch-handoffs: Use branch handoffs on feature branches instead of HANDOFF.md (Impact: 9/10)
✅ QA Report: .agents/qa/session-93-skill-git-002-qa-report.md (7/7 tests passed)

Protocol Compliance

✅ All pre-commit hook requirements satisfied
✅ Session End checklists match canonical format (9 rows)
✅ All MUST requirements completed with evidence
✅ QA validation performed (100% test coverage)
✅ Branch handoff created (feature branch protocol per ADR-014)
✅ Markdown linting passed (0 errors)
✅ Memory index validation passed
✅ Skill format validation passed

Protocol Discovery

Identified and documented HANDOFF.md validator conflict where three sources contradict:

HANDOFF.md says "DO NOT update"
Validator requires session log reference
Pre-commit hook blocks HANDOFF.md on feature branches

Workaround Applied: Created branch handoff instead per ADR-014
Issue Documented: /tmp/github-issue-handoff-validator.md (for manual creation)

The renumbering resolves the ambiguity where multiple ADRs shared the same numbers (ADR-017 and ADR-018), making shorthand references unambiguous going forward.

💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

…conflicts

Co-authored-by: rjmurillo <6811113+rjmurillo@users.noreply.github.com>

* docs(adr): add model routing policy to minimize false PASS Signed-off-by: Richard Murillo <6811113+rjmurillo@users.noreply.github.com> * docs: add session 85 - PR #310 review and description update Session 85 reviewed ADR-017 model routing policy and updated PR #310 description using the PR template. Key actions: - Analyzed ADR-017 content and rationale - Created comprehensive PR description with proper template sections - Documented decision context and consequences Generated with Claude Code Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * docs(adr): Session 86 - ADR-017 critic review (model routing policy) Critic review of ADR-017 (Copilot model routing policy). ## Summary ADR-017 proposes evidence-aware, tiered model routing to minimize false PASS verdicts. Core decision is sound; execution requires additional specifics before deployment. **Position**: Disagree-and-Commit with conditions - Approve strategic direction (evidence-based routing, conservative verdicts) - Defer tactical implementation to Phase 2 (baseline metrics, concrete examples, validation) - Three P1 concerns resolve before deployment (metrics, examples, model confirmation) - Estimated Phase 2 effort: 4-7 hours across metrics, examples, and CI guardrails ## Key Findings **Strengths** (5): 1. Clear problem identification (summary-mode false PASS) 2. Conservative evidence-sufficiency principle is sound 3. Well-reasoned model matrix by prompt shape 4. Honest tradeoffs acknowledged 5. Governance safeguard (copilot-model parameter required) **Gaps** (7): 1. Model claims lack validation (no vendor benchmarks) 2. Implementation incomplete (CONTEXT_MODE header not shown) 3. Success metrics aspirational, not measurable 4. Evidence improvement marked optional vs. required 5. No cost impact quantification 6. Prompt enforcement vague 7. No model deprecation policy **Recommendations** (7): 1. Add baseline metrics and thresholds 2. Concrete examples (before/after workflows) 3. Clarify evidence improvement scope 4. Model validation plan with monitoring 5. Quantify cost impact 6. CI validation script for prompt rules 7. Model deprecation policy and fallbacks ## Phase 2 Implementation Plan 1. Merge ADR-017 as strategic decision 2. Add copilot-model parameter to composite action 3. Create follow-up task: Implementation Specifics (examples, metrics, CI) 4. Do NOT deploy workflow changes until Phase 2 complete Session: .agents/sessions/2025-12-23-session-86-adr-017-critic-review.md 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> * docs(adr): refine ADR-017 through multi-agent debate Conducted rigorous 2-round debate with 5 specialized agents (architect, critic, independent-thinker, security, analyst). Key changes from debate: - Add Scope Clarification separating from Issue #164 - Add Section 4: Security Hardening (prompt injection, CONTEXT_MODE) - Add Section 5: Escalation Criteria with operational table - Add Section 6: Risk Review Contract for summary-mode PRs - Promote Section 7: Aggregator Policy to required - Add Prerequisites section with P0 blocking gates - Update success metrics with baseline column and targets Final positions: 4 Accept + 1 Disagree-and-Commit Independent-thinker dissent documented in debate log. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * docs: update session 85 with multi-agent debate results Added comprehensive summary of ADR-017 multi-agent debate: - 2 rounds to consensus (4 Accept + 1 Disagree-and-Commit) - 8 major ADR enhancements including security hardening - Independent-thinker dissent documented - Prerequisites section added (3 P0 + 1 P1 blocking gates) Generated with Claude Code Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * docs(adr): complete ADR-017 prerequisites and change status to Accepted Execute all prerequisites for ADR-017 (Model Routing Policy): P0-1: Baseline False PASS Measurement [COMPLETE] - Audited last 20 merged PRs with AI reviews - Found 3/20 (15%) required post-merge fixes - Identified PRs #226, #268, #249 as false PASS cases - Target: reduce to 7.5% within 30 days P0-2: Model Availability Verification [COMPLETE] - Verified all 6 models available in Copilot CLI - Confirmed claude-opus-4.5 via workflow run 20475138392 - Documented fallback chains per ADR specification P0-3: Governance Guardrail Status [DOCUMENTED] - Audited 4 ai-*.yml workflows - Found only 1/4 specifies copilot-model explicitly - Implementation plan documented in ADR P1-4: Cost Impact Analysis [COMPLETE] - Analyzed 74 PRs merged in December 2025 - Projected 20-30% cost REDUCTION with routing policy - Current: 100% opus; Projected: 35% opus, 50% sonnet, 15% mini ADR Status: Proposed -> Accepted (2025-12-23) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * docs: update session 85 with prerequisites execution results Session 85 extended to document ADR-017 prerequisites completion: - Baseline false PASS rate: 15% (3/20 PRs) - All 6 models verified available - Cost impact: 20-30% REDUCTION (not increase) - ADR status: Proposed -> Accepted Generated with Claude Code Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * docs(adr): ADR-017 Round 3 post-prerequisites debate - clarify scope and strengthen security Session 90: Conducted multi-agent debate on ADR-017 after prerequisite completion. Achieved consensus (5 Accept + 1 Disagree-and-Commit) with critical scope clarification. ## Critical Finding The 3 baseline false PASS cases (PRs #226, #268, #249) were caused by prompt quality and validation gaps, NOT by evidence insufficiency or model mismatch. ADR solution doesn't address current 15% baseline—it targets FUTURE risk from large PRs with summary-mode context. ## P0 Changes Applied (8 blocking issues) 1. **Root Cause Analysis**: Explicitly states ADR doesn't fix current baseline cases; targets future evidence insufficiency risks. Separates metrics: - Baseline false PASS (all causes): 15% - Target false PASS (evidence insufficiency): TBD (new metric) 2. **Baseline Methodology**: Clarified all 20 PRs validated (17 confirmed no fixes, 3 had post-merge fixes). 7-day window is lower bound. 3. **Status Timeline**: Added chronology showing prerequisites completed BEFORE status change to Accepted (2025-12-23). 4. **Prompt Injection**: Changed from blacklist (bypassable) to whitelist/schema validation. Reject input not conforming to alphanumeric + common punctuation. 5. **CONTEXT_MODE Validation**: Added token count check to prevent manipulation. Workflow fails if claimed mode doesn't match actual context size. 6. **Circuit Breaker**: Prevents fallback DoS attack. If 5 consecutive blocks due to "forbid PASS" rule, escalate to manual approval with oncall alert. 7. **Aggregator Enforcement**: Added branch protection requirement for "AI Review Aggregator" status check. Prevents developer bypass. 8. **Cost Calculation**: Explicit math showing 36% reduction (568 → 366 Opus-eq units). Reconciles 20% escalation rate with routing savings. ## P1 Changes Applied (2 important issues) 1. **Success Metrics**: Updated baseline from "TBD (prerequisite)" to "15% (P0-1 complete)" 2. **Partial Diff N**: Defined N=500 lines (aligns with spec-file behavior) ## Debate Results - **Rounds**: 3 total (2 initial in Session 86-88, 1 post-prerequisites in Session 90) - **Consensus**: 5 Accept (architect, critic, security, analyst, high-level-advisor) + 1 Disagree-and-Commit (independent-thinker) - **Independent-thinker dissent**: Skeptical evidence insufficiency is primary lever, but ADR now intellectually honest about scope. Supports execution for validation. ## Files Modified - `.agents/architecture/ADR-017-model-routing-low-false-pass.md`: 10 sections updated - `.agents/architecture/ADR-017-debate-log.md`: Round 3 entry added, metadata updated - `.agents/sessions/2025-12-23-session-90-adr-debate-clarification.md`: Session log ## Files Added (Sessions 86-88 artifacts) - `.agents/sessions/2025-12-23-session-86-adr-017-architect-review.md` - `.agents/sessions/2025-12-23-session-86-adr-017-independent-thinker-review.md` - `.agents/sessions/2025-12-23-session-86-adr-017-security-review.md` - `.agents/sessions/2025-12-23-session-87-adr-017-analyst-review.md` - `.agents/sessions/2025-12-23-session-87-architect-adr-017-convergence.md` - `.agents/sessions/2025-12-23-session-88-independent-thinker-adr-017-convergence.md` ADR remains in Accepted status with clarified preventive scope. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * docs(adr): create ADR-018 establishing architecture vs governance split criteria Session 90 follow-up: User questioned whether ADR-017 strictly adheres to foundational ADR definition. Analysis revealed "single AD" criterion violation (bundles 7 related decisions) and surfaced "Any Decision Record" debate. ## Problem Ambiguity exists about when to use: - `.agents/architecture/` (ADRs) - `.agents/governance/` (operational policies) - Both (split pattern like ADR-014 + COST-GOVERNANCE) ## Decision (ADR-018) Establish explicit split criteria with three patterns: ### 1. ADR-only - Affects system structure/quality attributes - Primarily technical decision - No ongoing enforcement required - Example: API authentication strategy ### 2. Governance-only - Operational policy/standard/process - Does NOT affect architecture - Requires compliance enforcement - Example: naming-conventions.md ### 3. Split (ADR + Governance) - BOTH architectural significance AND enforcement requirements - Decision affects structure BUT requires ongoing compliance - Policy evolves independently from architectural decision - Example: ADR-014 (runner selection) + COST-GOVERNANCE (enforcement) ## Key Provisions - **Decision matrix**: Classify by architectural impact + enforcement needs - **Decision workflow**: Flowchart with 3 decision points - **Real examples**: ADR-014 split (exemplar), ADR-017 (candidate for split) - **Templates**: ADR and Governance policy templates in Appendix C - **When to split**: Trigger criteria for retroactive splits ## Resolution of "Any Decision Record" Debate **MADR movement**: Broadens ADRs to "Any" decision (design, process, governance) **Critics**: Dilutes architectural focus, recommend separate records **Our approach**: Hybrid - Adopt "Any Decision Record" concept via governance/ directory - Preserve architectural focus in architecture/ directory - Use split pattern when both aspects exist ## Impact - Resolves placement ambiguity for future decisions - Recommends ADR-017 split into architecture + governance - Establishes precedent for meta-ADRs (ADRs about ADR process) ## Files - `.agents/architecture/ADR-018-architecture-governance-split-criteria.md` (new) - `.agents/sessions/2025-12-23-session-90-adr-debate-clarification.md` (updated) - `.serena/memories/adr-foundational-concepts.md` (updated with "Any Decision Record" debate) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * docs(adr): split ADR-017 into architecture decision + governance policy Implements ADR-018 split pattern: separate immutable architectural decision from evolvable operational policy. ## What Changed **Before**: Single bundled ADR-017-model-routing-low-false-pass.md (~550 lines) - Mixed architectural decision with governance policy - Violated 'single AD' criterion (bundled 7 related decisions) - Policy changes required re-opening ADR debate **After**: Split into focused documents 1. **ADR-017-model-routing-strategy.md** (architecture/, ~200 lines) - Immutable architectural decision - Focus: Why route models by prompt type + evidence availability - Contains: Context, Decision, Rationale, Alternatives, Consequences 2. **AI-REVIEW-MODEL-POLICY.md** (governance/, ~400 lines) - Evolvable operational policy - Contains: Model routing matrix, evidence sufficiency rules, security hardening, escalation criteria, aggregator enforcement, circuit breaker, monitoring - Can evolve without re-debating architecture ## Why Split (ADR-018 Criteria) | Criterion | ADR-017 Analysis | Result | |-----------|------------------|--------| | Affects architecture? | Yes (routing affects system quality) | Architecture component | | Requires enforcement? | Yes (MUST use copilot-model, branch protection) | Governance component | | Tightly coupled? | Yes (routing + evidence + security + aggregator) | Split pattern applies | | Policy evolves independently? | Yes (monitoring thresholds, escalation tuning) | Split benefits realized | ## Benefits Realized - Architectural decision now follows 'single AD' criterion - Governance policy can evolve without ADR debate - Follows ADR-014 + COST-GOVERNANCE pattern (codebase exemplar) - Clear separation: 'why we decided' vs 'how we enforce' ## Disposition - Original bundled ADR-017-model-routing-low-false-pass.md preserved in git history - Removed from working tree (replaced by split) - ADR-017-debate-log.md updated with split documentation Implements: ADR-018 Architecture vs Governance Split Criteria Session: 90 (2025-12-23) * chore(session-90): finalize session with split completion and memory storage Session 90 outcomes: - ADR-017 split completed (commit 0698b2e) - Session log updated with commit evidence - Cross-session context stored in Serena memory (adr-017-split-execution) Session complete: All checklist items verified. * chore(pr-310): complete review response session Session 91 outcomes: - Acknowledged all 4 issue comments (eyes reactions verified) - Replied to AI Quality Gate CRITICAL_FAIL with infrastructure explanation (comment 3688634732) - Documented 3 informational comments (no action required) - No implementation work needed Comment breakdown: - gemini-code-assist[bot]: Unsupported file types (informational) - github-actions[bot] AI Quality Gate: Infrastructure false positive (explained) - coderabbitai[bot]: Review failed (informational) - github-actions[bot] Session Protocol: PASS (informational) PR #310 ready for human review and merge. Note: .agents/pr-comments/PR-310/ working files are gitignored per repository policy. * [WIP] Address feedback on model routing policy in ADR-017 and ADR-018 (#385) * Rename ADR-019 to ADR-021 and ADR-020 to ADR-022 (#455) * Initial plan * Rename ADR-019 to ADR-021 and ADR-020 to ADR-022 - Renamed ADR-019-model-routing-strategy.md to ADR-021-model-routing-strategy.md - Renamed ADR-020-architecture-governance-split-criteria.md to ADR-022-architecture-governance-split-criteria.md - Updated all internal headers and cross-references - Renamed associated debate log and memory files - Updated references in governance policy and critique documents --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> * docs: add Copilot CLI model configuration to Serena memory Addresses PR #310 review comment 2644791424 - Document available models per authentication context - Include cost multipliers and parameter slugs - Add cross-references to ADR-021 and AI-REVIEW-MODEL-POLICY - Provide usage guidance for workflow configuration 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> --------- Signed-off-by: Richard Murillo <6811113+rjmurillo@users.noreply.github.com> Co-authored-by: rjmurillo-bot <rjmurillo-bot@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: rjmurillo[bot] <250269933+rjmurillo-bot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>

Initial plan

e19065d

Copilot AI assigned Copilot and rjmurillo Dec 24, 2025

Copilot AI mentioned this pull request Dec 24, 2025

docs(adr): add model routing policy to minimize false PASS #310

Merged

20 tasks

Copilot started work on behalf of rjmurillo December 24, 2025 21:08 View session

Copilot AI and others added 2 commits December 24, 2025 21:19

docs(adr): renumber ADR-017→019 and ADR-018→020 to resolve numbering …

7522c2d

…conflicts

Changes before error encountered

048672b

Co-authored-by: rjmurillo <6811113+rjmurillo@users.noreply.github.com>

Copilot AI requested a review from rjmurillo December 24, 2025 21:49

Copilot stopped work on behalf of rjmurillo due to an error December 24, 2025 21:49
Copilot encountered an unknown Git error. If the problem persists, please contact GitHub Support. To retry, leave a comment on this pull request asking Copilot to try again.

rjmurillo approved these changes Dec 24, 2025

View reviewed changes

rjmurillo marked this pull request as ready for review December 24, 2025 22:14

github-actions Bot added automation Automated workflows and processes area-skills Skills documentation and patterns labels Dec 24, 2025

rjmurillo merged commit d42b5e6 into docs/adr-017 Dec 24, 2025
4 checks passed

rjmurillo deleted the copilot/sub-pr-310 branch December 24, 2025 22:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[WIP] Address feedback on model routing policy in ADR-017 and ADR-018#385

[WIP] Address feedback on model routing policy in ADR-017 and ADR-018#385
rjmurillo merged 3 commits into
docs/adr-017from
copilot/sub-pr-310

Copilot AI commented Dec 24, 2025 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Copilot AI commented Dec 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Completed: ADR Renumbering + Protocol Compliance

Files Renamed

References Updated

Session Logs & Documentation

New Skills Generated

Protocol Compliance

Protocol Discovery

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Copilot AI commented Dec 24, 2025 •

edited

Loading