Feature Request: Intelligent Local Model Detection for Memory System
Problem Statement
Currently, Hermes Agent's built-in memory system (MEMORY.md and USER.md) uses the main cloud-based LLM for all memory operations:
- Memory injection: Built-in memory always injected into system prompt
- Memory extraction: Automatic extraction from conversation context
- Memory summarization: Generating memory summaries
This results in significant token consumption:
- Memory injection: ~1100-1400 tokens per session
- Total cost: 60-80% of tokens consumed by memory operations
Proposed Solution
Add intelligent local model detection with automatic fallback. Users can switch between local and cloud modes via CLI, and the system automatically detects available local models.
CLI Commands
hermes memory mode local # Use local model for memory operations
hermes memory mode cloud # Use cloud model for memory operations
hermes memory mode auto # Auto-detect (local if available, otherwise cloud)
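One possible way to wire up these subcommands is sketched below with `argparse`. The function names `build_parser` and `parse_memory_mode` are hypothetical and only illustrate the command shape; how Hermes Agent actually registers CLI commands is not specified in this request.

```python
import argparse
from typing import List

VALID_MODES = ("local", "cloud", "auto")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for `hermes memory mode <local|cloud|auto>` (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="hermes")
    commands = parser.add_subparsers(dest="command", required=True)

    memory = commands.add_parser("memory", help="memory system commands")
    memory_commands = memory.add_subparsers(dest="memory_command", required=True)

    mode = memory_commands.add_parser("mode", help="choose the model used for memory operations")
    # argparse rejects any value outside VALID_MODES with a usage error.
    mode.add_argument("mode", choices=VALID_MODES)
    return parser


def parse_memory_mode(argv: List[str]) -> str:
    """Return the requested mode from an argument list such as ['memory', 'mode', 'auto']."""
    return build_parser().parse_args(argv).mode
```

Restricting the argument with `choices` means an unsupported mode fails at parse time rather than reaching the memory system.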
Configuration
memory:
  memory_enabled: true
  user_profile_enabled: true
  mode: auto # local / cloud / auto
  # Auto-detection settings
  local:
    detect_ollama: true
    detect_lmstudio: true
    detect_openai_compatible: true
    preferred_models: [qwen3:4b, qwen3:7b, llama3:8b]
    timeout: 60
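As a rough illustration, the settings above could be validated after the YAML is loaded into a dictionary. This is a sketch under assumptions: the class names `MemoryConfig` and `LocalDetectionConfig` and the function `parse_memory_config` are hypothetical, `local` is assumed to be nested under `memory`, and `timeout` is assumed to be in seconds.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

VALID_MODES = ("local", "cloud", "auto")


@dataclass
class LocalDetectionConfig:
    detect_ollama: bool = True
    detect_lmstudio: bool = True
    detect_openai_compatible: bool = True
    preferred_models: List[str] = field(
        default_factory=lambda: ["qwen3:4b", "qwen3:7b", "llama3:8b"]
    )
    timeout: float = 60  # assumed to be seconds


@dataclass
class MemoryConfig:
    memory_enabled: bool = True
    user_profile_enabled: bool = True
    mode: str = "auto"
    local: LocalDetectionConfig = field(default_factory=LocalDetectionConfig)


def parse_memory_config(raw: Dict[str, Any]) -> MemoryConfig:
    """Build a MemoryConfig from an already-parsed config dictionary."""
    section = dict(raw.get("memory", {}))
    local = LocalDetectionConfig(**section.pop("local", {}))
    config = MemoryConfig(local=local, **section)
    if config.mode not in VALID_MODES:
        raise ValueError(f"memory.mode must be one of {VALID_MODES}, got {config.mode!r}")
    return config
```

Missing keys fall back to the defaults shown in the configuration, so an empty config still resolves to `mode: auto`.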
Auto-Detection Logic
When mode: auto:
- Check for local Ollama
  - Test connection to http://localhost:11434
  - List available models
  - Match with preferred_models
- Check for LM Studio
  - Test connection to http://localhost:1234/v1
  - Verify API compatibility
- Check for other OpenAI-compatible endpoints
- Decision:
  - Local model available → Use local model
  - No local model → Use cloud model (fallback)
- Runtime switching:
  - If local model becomes unavailable → Auto-switch to cloud
  - If local model becomes available → User can switch manually
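The detection and decision steps above could be sketched roughly as follows. It relies on Ollama's `GET /api/tags` endpoint and the OpenAI-compatible `GET /v1/models` endpoint that LM Studio exposes. The function names, the returned dictionary shape, and the choice to take the first model LM Studio lists are assumptions for illustration. The "other OpenAI-compatible endpoints" step is omitted because the request does not say where those endpoints would be configured.

```python
import http.client
import json
import urllib.request
from typing import Callable, Dict, List, Optional

OLLAMA_URL = "http://localhost:11434"
LMSTUDIO_URL = "http://localhost:1234/v1"


def fetch_json(url: str, timeout: float = 5.0) -> Optional[dict]:
    """GET a URL and decode its JSON body; return None if the call or parsing fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            data = json.load(response)
    except (OSError, ValueError, http.client.HTTPException):
        return None
    return data if isinstance(data, dict) else None


def pick_preferred(available: List[str], preferred: List[str]) -> Optional[str]:
    """Return the first preferred model that is installed, or None if none match."""
    for name in preferred:
        if name in available:
            return name
    return None


def detect_local_model(
    preferred: List[str],
    fetch: Callable[[str], Optional[dict]] = fetch_json,
) -> Optional[Dict[str, str]]:
    """Probe Ollama first, then LM Studio; return a description of the first usable model."""
    tags = fetch(f"{OLLAMA_URL}/api/tags")
    if tags:
        names = [m.get("name", "") for m in tags.get("models", [])]
        match = pick_preferred(names, preferred)
        if match:
            return {"provider": "ollama", "model": match, "base_url": OLLAMA_URL}

    listing = fetch(f"{LMSTUDIO_URL}/models")
    if listing and listing.get("data"):
        # Assumption: any model LM Studio lists is acceptable; take the first one.
        first = listing["data"][0]["id"]
        return {"provider": "lmstudio", "model": first, "base_url": LMSTUDIO_URL}
    return None


def resolve_backend(
    preferred: List[str],
    fetch: Callable[[str], Optional[dict]] = fetch_json,
) -> str:
    """Decision step for auto mode: 'local' if a model was detected, otherwise 'cloud'."""
    return "local" if detect_local_model(preferred, fetch) else "cloud"
```

Passing `fetch` as a parameter keeps the decision logic separate from the network calls, so it can be exercised with canned responses.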
Benefits
- Zero Configuration: Users don't need to specify model details
- Universal Compatibility: Works with any local LLM (Ollama, LM Studio, custom)
- Automatic Fallback: Gracefully switches to cloud if local fails
- Simple Management: CLI commands to switch modes
- Token Savings: 60-80% reduction when using local models
- Privacy Enhancement: Memory processing stays local when possible
Implementation
1. Local Model Detection Service
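from typing import Dict, List, Optional

class LocalModelDetector:
    """Detect and connect to available local models."""

    def detect_ollama(self) -> Optional[Dict]:
        """Check for an Ollama installation and its models."""
        # Connect to localhost:11434
        # List models
        # Return the first matching preferred model

    def detect_lmstudio(self) -> Optional[Dict]:
        """Check for LM Studio."""
        # Connect to localhost:1234/v1
        # Verify OpenAI compatibility

    def detect_all(self) -> List[Dict]:
        """Return a list of all available local models."""

    def detect_first(self) -> Optional[Dict]:
        """Return the first available local model, or None (used by MemoryModeManager)."""
        models = self.detect_all()
        return models[0] if models else None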
2. Memory Mode Manager
class MemoryModeManager:
    """Manage memory mode (local/cloud/auto)."""

    def __init__(self, detector: "LocalModelDetector") -> None:
        # The detector is injected by BuiltinMemoryProvider.
        self.detector = detector

    def set_mode(self, mode: str) -> None:
        """Switch memory mode."""
        if mode == "auto":
            local_model = self.detector.detect_first()
            if local_model:
                self.use_local_model(local_model)
            else:
                self.use_cloud_model()

    def runtime_health_check(self) -> None:
        """Monitor local model health, switch to cloud if needed."""
3. BuiltinMemoryProvider Integration
class BuiltinMemoryProvider(MemoryProvider):
    def __init__(self, memory_mode: str = "auto"):
        self._mode = memory_mode
        self._detector = LocalModelDetector()
        self._mode_manager = MemoryModeManager(self._detector)

    def initialize(self, session_id: str, **kwargs) -> None:
        """Initialize with auto-detection."""
        if self._mode == "auto":
            self._mode_manager.set_mode("auto")
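The runtime health check in the Memory Mode Manager is only outlined above. A minimal sketch of the switching behaviour might look like the following. The class name `FallbackSwitch`, the `max_failures` threshold, and the `probe` and `notify` callables are all hypothetical and not part of the proposal. The sketch follows the rule stated earlier: fall back to cloud automatically, but return to local only when the user asks.

```python
from typing import Callable


class FallbackSwitch:
    """Track which backend memory operations should use (hypothetical helper)."""

    def __init__(
        self,
        probe: Callable[[], bool],
        max_failures: int = 2,
        notify: Callable[[str], None] = print,
    ) -> None:
        self._probe = probe                # returns True while the local model responds
        self._max_failures = max_failures  # consecutive failures tolerated before fallback
        self._notify = notify
        self._failures = 0
        self.active = "local"

    def check(self) -> str:
        """Run one health check and return the backend to use afterwards."""
        if self.active != "local":
            return self.active  # never switch back automatically
        if self._probe():
            self._failures = 0
        else:
            self._failures += 1
            if self._failures >= self._max_failures:
                self.active = "cloud"
                self._notify("Local model not responding; auto-switching to cloud model")
        return self.active

    def use_local(self) -> None:
        """Manual switch back to the local model, requested by the user."""
        self._failures = 0
        self.active = "local"
```

Requiring more than one consecutive failure is a design choice to avoid switching on a single slow response; a threshold of 1 would switch immediately.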
Use Cases
Use Case 1: New User with Local Ollama
# User has Ollama installed with qwen3:4b
$ hermes memory mode auto
✅ Detected local model: qwen3:4b (Ollama)
✅ Memory will use local model
# Memory operations now use qwen3:4b locally
# Zero cloud tokens for memory
Use Case 2: Ollama Stops Working
# Ollama process crashes or becomes unavailable
⚠️ Local model qwen3:4b not responding
⚠️ Auto-switching to cloud model
# Memory continues working with cloud model
# No interruption to user workflow
Use Case 3: Manual Switching
# User wants to force cloud mode
$ hermes memory mode cloud
✅ Memory now using cloud model
# User wants to force local mode
$ hermes memory mode local
✅ Memory now using local model: qwen3:4b
Security & Error Handling
- Connection Validation: Timeout and retry logic for local model detection
- Model Validation: Verify model capabilities before using
- Graceful Fallback: Always fall back to cloud if local fails
- User Notification: Clear status messages when switching modes
- Config Persistence: Save mode choice in config.yaml
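The timeout and retry logic mentioned under Connection Validation could take roughly this shape. It is a sketch, not a specified behaviour: the function name `probe_with_retry`, the attempt count, and the exponential backoff delays are assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def probe_with_retry(
    probe: Callable[[], Optional[T]],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call probe() up to `attempts` times, doubling the wait between failed tries.

    A probe signals failure by returning None or raising OSError
    (which covers refused connections and socket timeouts).
    """
    for attempt in range(attempts):
        try:
            result = probe()
        except OSError:
            result = None
        if result is not None:
            return result
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return None
```

If every attempt fails the function returns `None`, which the caller can treat as "no local model" and fall back to cloud.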
Testing
- Detection Tests: Mock local model endpoints
- Mode Switching Tests: Verify local ↔ cloud transitions
- Fallback Tests: Simulate local model failures
- Multi-Provider Tests: Test with Ollama, LM Studio, custom endpoints
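A detection test with a mocked endpoint might look like the sketch below, using `unittest.mock` to replace `urllib.request.urlopen` so no real Ollama instance is needed. The helper `list_ollama_models` is a hypothetical stand-in for whatever the detector ends up calling; the response shape follows Ollama's `GET /api/tags` endpoint.

```python
import io
import json
import unittest
import urllib.error
import urllib.request
from unittest import mock


def list_ollama_models(base_url: str = "http://localhost:11434", timeout: float = 2.0):
    """Return installed Ollama model names, or an empty list if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            return [m["name"] for m in json.load(response)["models"]]
    except OSError:
        return []


class DetectionTests(unittest.TestCase):
    def test_lists_models_from_mocked_endpoint(self):
        # BytesIO stands in for the HTTP response object returned by urlopen.
        payload = io.BytesIO(json.dumps({"models": [{"name": "qwen3:4b"}]}).encode())
        with mock.patch("urllib.request.urlopen", return_value=payload):
            self.assertEqual(list_ollama_models(), ["qwen3:4b"])

    def test_unreachable_server_yields_no_models(self):
        # URLError is what urlopen raises when the connection is refused.
        error = urllib.error.URLError("connection refused")
        with mock.patch("urllib.request.urlopen", side_effect=error):
            self.assertEqual(list_ollama_models(), [])
```

The second test doubles as a simple fallback test: an empty result is the condition under which the system would use the cloud model.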
Documentation
docs/user-guide/configuration/memory.md - Memory mode configuration
docs/user-guide/cli/memory.md - CLI commands
docs/development/local-model-detection.md - Detection API
Related Issues
Conclusion
This feature provides universal local model support with zero configuration. Users get automatic detection, seamless fallback, and simple CLI management, while saving 60-80% of memory tokens when local models are available.