Skip to content

Feature: Local Model Setup Skill — Ollama, llama.cpp & vLLM Configuration Guide with Model Recommendations (inspired by Liquid AI LocalCowork) #523

@teknium1

Description

@teknium1

Overview

Research into Liquid AI's LocalCowork and their LFM2-24B-A2B model revealed that while Hermes Agent technically supports local models via custom OpenAI-compatible endpoints (OPENAI_BASE_URL), there is zero documentation, no setup guidance, and no model recommendations for users who want to run locally. The current "Custom endpoint" flow is a manual, unguided process that assumes users already know how to set up inference servers and which models support tool calling.

LocalCowork's value proposition — running agents entirely on-device with sub-second tool calling — highlights a growing user demand for local/private agent setups. Hermes should make this easy rather than leaving users to figure it out themselves.

This issue proposes a skill that guides users through setting up local models with Hermes Agent, recommends models for different use cases, and handles the configuration nuances that trip up new users.


Research Findings

Current State of Local Model Support in Hermes

What works today:

  • Custom endpoint configuration via OPENAI_BASE_URL + optional OPENAI_API_KEY
  • Setup wizard has a "Custom endpoint" option (enter URL + key + model name)
  • Context probing handles unknown models by stepping down from 2M→32K tokens on error
  • Error message parsing specifically handles Ollama, LM Studio, and llama.cpp formats

What's missing:

  • No model recommendations: Users must know which models support tool calling. There's no guidance on model selection
  • No setup instructions: No docs or skills for installing/configuring Ollama, llama.cpp, or vLLM
  • No auto-detection: No code to detect running local inference servers (e.g., check if Ollama is running on :11434)
  • No model capability metadata: No way to declare whether a local model supports tool calling, vision, structured output, etc.
  • No context length presets: Known local models (Qwen3, Llama, Mistral) could have hardcoded context lengths instead of wasteful error-based probing
  • No tool-calling format guidance: Some local models need specific chat templates or prompt formats for tool calling to work

LFM2-24B-A2B Assessment

Liquid AI's model was the impetus for this research. Honest assessment:

Attribute Details
Architecture Hybrid MoE (gated convolutions + GQA attention), 24B total, ~2.3B active
Memory (Q4_K_M) 14.4 GB — fits easily in 32GB RAM
Latency ~390ms per tool call on M4 Max — best-in-class
Tool Accuracy 80% single-step (67 tools, 100 prompts)
Context Length 32K tokens
Coding NOT recommended by Liquid AI themselves
Intelligence Below-average for its class (10/50 on Artificial Analysis)
Multi-step Only 26% success rate on 3-6 step chains
License LFM Open License v1.0 — free under $10M revenue, paid above
Inference Ollama, llama.cpp, vLLM, LM Studio, MLX, ExecuTorch

Verdict: Good for speed-optimized simple tool dispatching but NOT a general-purpose agent model. For Hermes users who need quality tool calling + coding + reasoning, Qwen3 models (especially Qwen3-30B-A3B or Qwen3-8B) are stronger choices. LFM2-24B-A2B should be listed as an option with appropriate caveats.

Recommended Local Models for Hermes Agent

Based on research across tool-calling benchmarks, context lengths, and practical requirements:

Use Case Model Active/Total Params RAM (Q4) Context Notes
Best overall Qwen3-30B-A3B 3.3B/30B MoE ~17 GB 128K Best accuracy/efficiency balance
Budget/small GPU Qwen3-8B 8B dense ~5 GB 128K Solid tool calling, fits anywhere
Speed-optimized LFM2-24B-A2B 2.3B/24B MoE ~14 GB 32K Fastest latency, limited reasoning
Maximum quality Qwen3-32B 32B dense ~20 GB 128K Dense model, highest accuracy
Coding-focused Qwen2.5-Coder-32B 32B dense ~20 GB 128K Code generation + tool calling
Lightweight Llama-3.1-8B 8B dense ~5 GB 128K Good ecosystem, proven

Implementation Plan

Skill vs. Tool Classification

This should be a skill because:

  • It's primarily instructions + terminal commands (install Ollama, download model, configure Hermes)
  • It wraps external CLIs (ollama, llama-server, vllm) that the agent can call via terminal
  • It doesn't need custom Python integration or API key management in the harness
  • Per CONTRIBUTING.md criteria: "The capability can be expressed as instructions + shell commands + existing tools"

Should it be bundled? Yes — optional-skills/ is the right home. Local model setup is broadly useful but not universally needed (many users will stay on cloud providers). It should ship with the repo as an official optional skill, discoverable via hermes skills browse.

What We'd Need

  1. Skill SKILL.md: Step-by-step instructions for:

    • Installing Ollama / llama.cpp / vLLM
    • Downloading recommended models
    • Starting inference servers
    • Configuring Hermes to point at the local endpoint
    • Verifying tool calling works
    • Troubleshooting common issues
  2. Model recommendation table: In the skill's references, a maintained list of recommended local models with RAM requirements, context lengths, and use case notes

  3. Config templates: Example config.yaml snippets for each setup (Ollama, llama-server, vLLM)

  4. Context length presets (optional codebase change): Add known local models to DEFAULT_CONTEXT_LENGTHS in agent/model_metadata.py to avoid probing

Phased Rollout

Phase 1: Ollama Setup Skill

  • Skill covering Ollama installation and configuration
  • Model recommendation table
  • Config.yaml templates
  • Verification steps (test tool calling)
  • Common troubleshooting (tool calling format issues, context length errors)

Phase 2: Advanced Backends

  • llama.cpp server setup (for GGUF enthusiasts)
  • vLLM setup (for multi-GPU / production users)
  • MLX setup (Apple Silicon native)
  • LM Studio setup (GUI option)

Phase 3: Auto-Detection & Intelligence

  • Detect running Ollama/llama-server on common ports during hermes setup
  • Auto-populate model name from detected server
  • Add local model entries to DEFAULT_CONTEXT_LENGTHS
  • Model capability tags (supports_tool_calling, supports_vision, etc.)

Pros & Cons

Pros

  • Removes a major friction point: Going from "Custom endpoint" to "follow these steps" dramatically lowers the barrier for local model users
  • Privacy-conscious users: Growing demand for fully local, no-cloud agent setups — this serves that audience
  • Cost savings: Local inference is free after hardware investment — important for power users
  • Complements MCP: Local models + local MCP servers = fully private agent stack
  • Low implementation cost: Phase 1 is just a well-written skill — no codebase changes needed

Cons / Risks

  • Model churn: Local model recommendations change rapidly — the skill needs maintenance
  • Quality expectations: Users may expect cloud-quality performance from local models and be disappointed. Clear expectation-setting is critical
  • Support burden: Local setups have many failure modes (wrong CUDA version, insufficient RAM, model format mismatch) — this could increase support questions
  • Tool calling compatibility: Not all local models support OpenAI-compatible function calling. The skill must be explicit about which models work and which don't

Open Questions

  • Should the skill include a "quick test" command that verifies tool calling works with the configured local model?
  • Should we maintain model recommendations in the skill itself or in a separate frequently-updated reference file?
  • Is there appetite for auto-detecting local inference servers during hermes setup (Phase 3)?
  • Should the skill cover multi-model setups (e.g., local model for fast tasks, cloud model for complex reasoning)?

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions