Skip to content

[Spike] Measure local-model routing for bounded sub-tasks (issue classify, commit-type suggest, inbox summary) #195

@atlas-apex

Description

@atlas-apex

Driver

Many framework tasks today route through Claude that don't need a strong model — issue triage, conventional-commit type suggestion, AgDR slug generation, branch-name suggestion from issue title, inbox/PR-list summary paragraphs, glossary-entry stubs, language detection on ambiguous filenames. A small local LLM (Llama 3 8B / Mistral / Phi via Ollama) or a purpose-built classifier could handle these in <1s with zero token cost.

This ticket is a research spike — measurement first, recommendation second, implementation in follow-ups. Same shape as the LSP spike (#178 → PR #184).

Hypothesis

Routing specific bounded tasks (issue classify, commit-type suggest, inbox summary) through a local model instead of Claude reduces token cost ≥10× on those tasks, with quality "good enough" for the framework's purpose, at acceptable latency.

Phases

Phase 1 — measurement

Three representative tasks, three variants each:

  1. Issue classification — given an issue title + first paragraph, classify as bug / feature / spike / chore. 20 examples from the apexyard portfolio.
  2. Commit-type suggestion — given a diff, suggest a conventional-commit type (feat / fix / refactor / docs / chore / test). 20 diffs from recent merges.
  3. Inbox summary — given a list of 10 PR titles, produce a 3-line "what's pending" paragraph. 10 inbox snapshots.

Variants:

  • A. Pure tool: regex / template / heuristic. No model.
  • B. Local LLM: Ollama running Llama-3-8B-Instruct (or comparable). Same prompts as Variant C.
  • C. Claude (today's path): same prompts. Use Claude Haiku for the cheapest comparison; Sonnet for quality reference.

For each (task × variant), measure:

  • Token cost (input + output)
  • Wall-clock latency (cold + warm)
  • Output quality — score against a hand-graded reference. Inter-rater spread acceptable; aim for ≥85% match on the dominant class.
  • Setup cost (one-time install + maintenance)

Document in docs/spikes/local-model-routing.md.

Phase 2 — integration mechanism

Investigate how a local model would plug into apexyard skills:

  • Tool surface: a new OllamaCall tool? An MCP server wrapping Ollama? A bash helper invoked by skills?
  • Skill opt-in: skills declare which sub-tasks they'd route to local. e.g. /feature opts into "local model for issue-class suggestion before showing the user".
  • Fallback: if local model is unavailable (Ollama not installed, model not pulled), skill falls back to Claude — no new failure mode.
  • Privacy story: local stays on-machine. Claim the privacy benefit clearly in adopter docs.

Phase 3 — recommendation

Rank the three candidate sub-tasks (issue classify, commit-type suggest, inbox summary) by token-savings × adopter-frequency. Recommend which to migrate first. Sketch the migration shape (skill-side opt-in flag, fallback behavior, install steps).

Phase 4 — AgDR sketch

Y-statement + option matrix. The AgDR proper lands as a follow-up if the spike says "go".

Acceptance Criteria

  • docs/spikes/local-model-routing.md exists with methodology + raw numbers for all 3 task variants × 3 model variants.
  • Phase 2 findings documented — recommended integration surface (tool vs MCP vs bash helper).
  • Phase 3 recommendation ranks the 3 sub-tasks by ROI; picks one to migrate first.
  • Phase 4 AgDR sketch.
  • GO / NO-GO recommendation, with reasoning. If GO, follow-up ticket per task to migrate.

Risks / Dependencies

  • Local install adoption — adopters who don't install Ollama see no benefit. Acceptable: opt-in, documented, fallback to Claude.
  • Quality drop — Llama-3-8B is "good enough" for short bounded tasks but not for synthesis. Phase 1 measurement validates per-task before recommending migration.
  • Prompt tuning — local models often need different prompt shapes. Spike includes one round of prompt tuning per task; if the bar isn't met after that, the recommendation is "pure tool" or "stay on Claude" for that task.
  • Cost-of-install ≥ cost-of-use — for low-frequency tasks (e.g. AgDR slug generation, twice a week), local install overhead may not pay off. Phase 3 should weight by frequency.

Out of scope

  • Implementing the migration. The spike outputs measurement + recommendation; implementation lands in follow-up tickets.
  • Custom training / fine-tuning. Off-the-shelf Llama-3 / Mistral / Phi only.
  • Replacing Claude framework-wide. The intent is to route specific sub-tasks; Claude stays the dispatcher and synthesis layer.

Refs

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2Medium — plan-worthy, not urgentenhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions