Feature: Local Model Setup Skill — Ollama, llama.cpp & vLLM Configuration Guide with Model Recommendations (inspired by Liquid AI LocalCowork)

## Overview

Research into Liquid AI's [LocalCowork](https://github.com/Liquid4All/cookbook/tree/main/examples/localcowork) and their [LFM2-24B-A2B model](http://liquid.ai/blog/no-cloud-tool-calling-agents-consumer-hardware-lfm2-24b-a2b) revealed that while Hermes Agent technically supports local models via custom OpenAI-compatible endpoints (`OPENAI_BASE_URL`), there is **zero documentation, no setup guidance, and no model recommendations** for users who want to run locally. The current "Custom endpoint" flow is a manual, unguided process that assumes users already know how to set up inference servers and which models support tool calling.

LocalCowork's value proposition — running agents entirely on-device with sub-second tool calling — highlights a growing user demand for local/private agent setups. Hermes should make this easy rather than leaving users to figure it out themselves.

This issue proposes a **skill** that guides users through setting up local models with Hermes Agent, recommends models for different use cases, and handles the configuration nuances that trip up new users.

---

## Research Findings

### Current State of Local Model Support in Hermes

**What works today:**
- Custom endpoint configuration via `OPENAI_BASE_URL` + optional `OPENAI_API_KEY`
- Setup wizard has a "Custom endpoint" option (enter URL + key + model name)
- Context probing handles unknown models by stepping down from 2M→32K tokens on error
- Error message parsing specifically handles Ollama, LM Studio, and llama.cpp formats

**What's missing:**
- **No model recommendations**: Users must know which models support tool calling. There's no guidance on model selection
- **No setup instructions**: No docs or skills for installing/configuring Ollama, llama.cpp, or vLLM
- **No auto-detection**: No code to detect running local inference servers (e.g., check if Ollama is running on :11434)
- **No model capability metadata**: No way to declare whether a local model supports tool calling, vision, structured output, etc.
- **No context length presets**: Known local models (Qwen3, Llama, Mistral) could have hardcoded context lengths instead of wasteful error-based probing
- **No tool-calling format guidance**: Some local models need specific chat templates or prompt formats for tool calling to work

### LFM2-24B-A2B Assessment

Liquid AI's model was the impetus for this research. Honest assessment:

| Attribute | Details |
|-----------|---------|
| Architecture | Hybrid MoE (gated convolutions + GQA attention), 24B total, ~2.3B active |
| Memory (Q4_K_M) | 14.4 GB — fits easily in 32GB RAM |
| Latency | ~390ms per tool call on M4 Max — best-in-class |
| Tool Accuracy | 80% single-step (67 tools, 100 prompts) |
| Context Length | 32K tokens |
| Coding | **NOT recommended** by Liquid AI themselves |
| Intelligence | Below-average for its class (10/50 on Artificial Analysis) |
| Multi-step | Only 26% success rate on 3-6 step chains |
| License | LFM Open License v1.0 — free under $10M revenue, paid above |
| Inference | Ollama, llama.cpp, vLLM, LM Studio, MLX, ExecuTorch |

**Verdict**: Good for speed-optimized simple tool dispatching but NOT a general-purpose agent model. For Hermes users who need quality tool calling + coding + reasoning, **Qwen3 models (especially Qwen3-30B-A3B or Qwen3-8B) are stronger choices**. LFM2-24B-A2B should be listed as an option with appropriate caveats.

### Recommended Local Models for Hermes Agent

Based on research across tool-calling benchmarks, context lengths, and practical requirements:

| Use Case | Model | Active/Total Params | RAM (Q4) | Context | Notes |
|----------|-------|---------------------|----------|---------|-------|
| **Best overall** | Qwen3-30B-A3B | 3.3B/30B MoE | ~17 GB | 128K | Best accuracy/efficiency balance |
| **Budget/small GPU** | Qwen3-8B | 8B dense | ~5 GB | 128K | Solid tool calling, fits anywhere |
| **Speed-optimized** | LFM2-24B-A2B | 2.3B/24B MoE | ~14 GB | 32K | Fastest latency, limited reasoning |
| **Maximum quality** | Qwen3-32B | 32B dense | ~20 GB | 128K | Dense model, highest accuracy |
| **Coding-focused** | Qwen2.5-Coder-32B | 32B dense | ~20 GB | 128K | Code generation + tool calling |
| **Lightweight** | Llama-3.1-8B | 8B dense | ~5 GB | 128K | Good ecosystem, proven |

---

## Implementation Plan

### Skill vs. Tool Classification

This should be a **skill** because:
- It's primarily instructions + terminal commands (install Ollama, download model, configure Hermes)
- It wraps external CLIs (ollama, llama-server, vllm) that the agent can call via terminal
- It doesn't need custom Python integration or API key management in the harness
- Per CONTRIBUTING.md criteria: "The capability can be expressed as instructions + shell commands + existing tools"

**Should it be bundled?** Yes — `optional-skills/` is the right home. Local model setup is broadly useful but not universally needed (many users will stay on cloud providers). It should ship with the repo as an official optional skill, discoverable via `hermes skills browse`.

### What We'd Need

1. **Skill SKILL.md**: Step-by-step instructions for:
   - Installing Ollama / llama.cpp / vLLM
   - Downloading recommended models
   - Starting inference servers
   - Configuring Hermes to point at the local endpoint
   - Verifying tool calling works
   - Troubleshooting common issues

2. **Model recommendation table**: In the skill's references, a maintained list of recommended local models with RAM requirements, context lengths, and use case notes

3. **Config templates**: Example config.yaml snippets for each setup (Ollama, llama-server, vLLM)

4. **Context length presets** (optional codebase change): Add known local models to `DEFAULT_CONTEXT_LENGTHS` in `agent/model_metadata.py` to avoid probing

### Phased Rollout

**Phase 1: Ollama Setup Skill**
- Skill covering Ollama installation and configuration
- Model recommendation table
- Config.yaml templates
- Verification steps (test tool calling)
- Common troubleshooting (tool calling format issues, context length errors)

**Phase 2: Advanced Backends**
- llama.cpp server setup (for GGUF enthusiasts)
- vLLM setup (for multi-GPU / production users)
- MLX setup (Apple Silicon native)
- LM Studio setup (GUI option)

**Phase 3: Auto-Detection & Intelligence**
- Detect running Ollama/llama-server on common ports during `hermes setup`
- Auto-populate model name from detected server
- Add local model entries to `DEFAULT_CONTEXT_LENGTHS`
- Model capability tags (supports_tool_calling, supports_vision, etc.)

---

## Pros & Cons

### Pros
- **Removes a major friction point**: Going from "Custom endpoint" to "follow these steps" dramatically lowers the barrier for local model users
- **Privacy-conscious users**: Growing demand for fully local, no-cloud agent setups — this serves that audience
- **Cost savings**: Local inference is free after hardware investment — important for power users
- **Complements MCP**: Local models + local MCP servers = fully private agent stack
- **Low implementation cost**: Phase 1 is just a well-written skill — no codebase changes needed

### Cons / Risks
- **Model churn**: Local model recommendations change rapidly — the skill needs maintenance
- **Quality expectations**: Users may expect cloud-quality performance from local models and be disappointed. Clear expectation-setting is critical
- **Support burden**: Local setups have many failure modes (wrong CUDA version, insufficient RAM, model format mismatch) — this could increase support questions
- **Tool calling compatibility**: Not all local models support OpenAI-compatible function calling. The skill must be explicit about which models work and which don't

---

## Open Questions

- Should the skill include a "quick test" command that verifies tool calling works with the configured local model?
- Should we maintain model recommendations in the skill itself or in a separate frequently-updated reference file?
- Is there appetite for auto-detecting local inference servers during `hermes setup` (Phase 3)?
- Should the skill cover multi-model setups (e.g., local model for fast tasks, cloud model for complex reasoning)?

---

## References

- [Liquid AI Blog: No Cloud, No Waiting](http://liquid.ai/blog/no-cloud-tool-calling-agents-consumer-hardware-lfm2-24b-a2b)
- [LocalCowork Source Code](https://github.com/Liquid4All/cookbook/tree/main/examples/localcowork)
- [LFM2-24B-A2B on HuggingFace](https://huggingface.co/LiquidAI/LFM2-24B-A2B-Preview)
- [LFM2-24B-A2B on Ollama](https://ollama.com/library/lfm2) (918K+ downloads)
- [LFM2 Technical Report](https://arxiv.org/abs/2511.23404)
- [Issue #467 — Fork with local MLX inference](https://github.com/NousResearch/hermes-agent/issues/467) (related: m0at's local provider work)
- [Issue #127 — Model config is inflexible](https://github.com/NousResearch/hermes-agent/issues/127) (related: config limitations)
- [Issue #157 — Multi-model routing](https://github.com/NousResearch/hermes-agent/issues/157) (related: capability-based model selection)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature: Local Model Setup Skill — Ollama, llama.cpp & vLLM Configuration Guide with Model Recommendations (inspired by Liquid AI LocalCowork) #523

Overview

Research Findings

Current State of Local Model Support in Hermes

LFM2-24B-A2B Assessment

Recommended Local Models for Hermes Agent

Implementation Plan

Skill vs. Tool Classification

What We'd Need

Phased Rollout

Pros & Cons

Pros

Cons / Risks

Open Questions

References

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Attribute	Details
Architecture	Hybrid MoE (gated convolutions + GQA attention), 24B total, ~2.3B active
Memory (Q4_K_M)	14.4 GB — fits easily in 32GB RAM
Latency	~390ms per tool call on M4 Max — best-in-class
Tool Accuracy	80% single-step (67 tools, 100 prompts)
Context Length	32K tokens
Coding	NOT recommended by Liquid AI themselves
Intelligence	Below-average for its class (10/50 on Artificial Analysis)
Multi-step	Only 26% success rate on 3-6 step chains
License	LFM Open License v1.0 — free under $10M revenue, paid above
Inference	Ollama, llama.cpp, vLLM, LM Studio, MLX, ExecuTorch

Use Case	Model	Active/Total Params	RAM (Q4)	Context	Notes
Best overall	Qwen3-30B-A3B	3.3B/30B MoE	~17 GB	128K	Best accuracy/efficiency balance
Budget/small GPU	Qwen3-8B	8B dense	~5 GB	128K	Solid tool calling, fits anywhere
Speed-optimized	LFM2-24B-A2B	2.3B/24B MoE	~14 GB	32K	Fastest latency, limited reasoning
Maximum quality	Qwen3-32B	32B dense	~20 GB	128K	Dense model, highest accuracy
Coding-focused	Qwen2.5-Coder-32B	32B dense	~20 GB	128K	Code generation + tool calling
Lightweight	Llama-3.1-8B	8B dense	~5 GB	128K	Good ecosystem, proven

Feature: Local Model Setup Skill — Ollama, llama.cpp & vLLM Configuration Guide with Model Recommendations (inspired by Liquid AI LocalCowork) #523

Description

Overview

Research Findings

Current State of Local Model Support in Hermes

LFM2-24B-A2B Assessment

Recommended Local Models for Hermes Agent

Implementation Plan

Skill vs. Tool Classification

What We'd Need

Phased Rollout

Pros & Cons

Pros

Cons / Risks

Open Questions

References

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions