[Spike] Measure local-model routing for bounded sub-tasks (issue classify, commit-type suggest, inbox summary)

## Driver

Many framework tasks today route through Claude that don't need a strong model — issue triage, conventional-commit type suggestion, AgDR slug generation, branch-name suggestion from issue title, inbox/PR-list summary paragraphs, glossary-entry stubs, language detection on ambiguous filenames. A small local LLM (Llama 3 8B / Mistral / Phi via Ollama) or a purpose-built classifier could handle these in <1s with zero token cost.

This ticket is a research spike — measurement first, recommendation second, implementation in follow-ups. Same shape as the LSP spike (#178 → PR #184).

## Hypothesis

Routing specific bounded tasks (issue classify, commit-type suggest, inbox summary) through a local model instead of Claude reduces token cost ≥10× on those tasks, with quality "good enough" for the framework's purpose, at acceptable latency.

## Phases

### Phase 1 — measurement

Three representative tasks, three variants each:

1. **Issue classification** — given an issue title + first paragraph, classify as bug / feature / spike / chore. 20 examples from the apexyard portfolio.
2. **Commit-type suggestion** — given a diff, suggest a conventional-commit type (feat / fix / refactor / docs / chore / test). 20 diffs from recent merges.
3. **Inbox summary** — given a list of 10 PR titles, produce a 3-line "what's pending" paragraph. 10 inbox snapshots.

Variants:
- **A. Pure tool**: regex / template / heuristic. No model.
- **B. Local LLM**: Ollama running Llama-3-8B-Instruct (or comparable). Same prompts as Variant C.
- **C. Claude (today's path)**: same prompts. Use Claude Haiku for the cheapest comparison; Sonnet for quality reference.

For each (task × variant), measure:
- Token cost (input + output)
- Wall-clock latency (cold + warm)
- Output quality — score against a hand-graded reference. Inter-rater spread acceptable; aim for ≥85% match on the dominant class.
- Setup cost (one-time install + maintenance)

Document in `docs/spikes/local-model-routing.md`.

### Phase 2 — integration mechanism

Investigate how a local model would plug into apexyard skills:

- **Tool surface**: a new `OllamaCall` tool? An MCP server wrapping Ollama? A bash helper invoked by skills?
- **Skill opt-in**: skills declare which sub-tasks they'd route to local. e.g. `/feature` opts into "local model for issue-class suggestion before showing the user".
- **Fallback**: if local model is unavailable (Ollama not installed, model not pulled), skill falls back to Claude — no new failure mode.
- **Privacy story**: local stays on-machine. Claim the privacy benefit clearly in adopter docs.

### Phase 3 — recommendation

Rank the three candidate sub-tasks (issue classify, commit-type suggest, inbox summary) by token-savings × adopter-frequency. Recommend which to migrate first. Sketch the migration shape (skill-side opt-in flag, fallback behavior, install steps).

### Phase 4 — AgDR sketch

Y-statement + option matrix. The AgDR proper lands as a follow-up if the spike says "go".

## Acceptance Criteria

- [ ] `docs/spikes/local-model-routing.md` exists with methodology + raw numbers for all 3 task variants × 3 model variants.
- [ ] Phase 2 findings documented — recommended integration surface (tool vs MCP vs bash helper).
- [ ] Phase 3 recommendation ranks the 3 sub-tasks by ROI; picks one to migrate first.
- [ ] Phase 4 AgDR sketch.
- [ ] GO / NO-GO recommendation, with reasoning. If GO, follow-up ticket per task to migrate.

## Risks / Dependencies

- **Local install adoption** — adopters who don't install Ollama see no benefit. Acceptable: opt-in, documented, fallback to Claude.
- **Quality drop** — Llama-3-8B is "good enough" for short bounded tasks but not for synthesis. Phase 1 measurement validates per-task before recommending migration.
- **Prompt tuning** — local models often need different prompt shapes. Spike includes one round of prompt tuning per task; if the bar isn't met after that, the recommendation is "pure tool" or "stay on Claude" for that task.
- **Cost-of-install ≥ cost-of-use** — for low-frequency tasks (e.g. AgDR slug generation, twice a week), local install overhead may not pay off. Phase 3 should weight by frequency.

## Out of scope

- Implementing the migration. The spike outputs measurement + recommendation; implementation lands in follow-up tickets.
- Custom training / fine-tuning. Off-the-shelf Llama-3 / Mistral / Phi only.
- Replacing Claude framework-wide. The intent is to **route** specific sub-tasks; Claude stays the dispatcher and synthesis layer.

## Refs

- Sibling spike: #178 (LSP) → PR #184. Same shape: measure first, recommend second, implement in follow-ups.
- Related: `parallel-work.md` rule (we sanction fan-out; routing to local is similar — explicit framework support for cheaper alternatives).
- Adjacent: hooks already use no-model regex (`_lib-detect-bash-write.sh`, `validate-pr-create.sh`) — that's the "pure tool" baseline this spike compares against.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Spike] Measure local-model routing for bounded sub-tasks (issue classify, commit-type suggest, inbox summary) #195

Driver

Hypothesis

Phases

Phase 1 — measurement

Phase 2 — integration mechanism

Phase 3 — recommendation

Phase 4 — AgDR sketch

Acceptance Criteria

Risks / Dependencies

Out of scope

Refs

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

[Spike] Measure local-model routing for bounded sub-tasks (issue classify, commit-type suggest, inbox summary) #195

Description

Driver

Hypothesis

Phases

Phase 1 — measurement

Phase 2 — integration mechanism

Phase 3 — recommendation

Phase 4 — AgDR sketch

Acceptance Criteria

Risks / Dependencies

Out of scope

Refs

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions