Feature: Intelligent Local Model Detection for Memory System #7327

@zjq12333

Feature Request: Intelligent Local Model Detection for Memory System

Problem Statement

Currently, Hermes Agent's built-in memory system (MEMORY.md and USER.md) uses the main cloud-based LLM for all memory operations:

  1. Memory injection: Built-in memory is always injected into the system prompt
  2. Memory extraction: Memories are automatically extracted from the conversation context
  3. Memory summarization: Memory summaries are generated by the same model

This results in significant token consumption:

  • Memory injection: ~1,100-1,400 tokens per session
  • Total cost: an estimated 60-80% of tokens are consumed by memory operations

Proposed Solution

Add intelligent local model detection with automatic fallback. Users can switch between local and cloud modes via CLI, and the system automatically detects available local models.

CLI Commands

hermes memory mode local   # Use local model for memory operations
hermes memory mode cloud   # Use cloud model for memory operations
hermes memory mode auto    # Auto-detect (local if available, otherwise cloud)
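
One way the proposed subcommand could be parsed is sketched below with the standard-library argparse module. This is only an illustration: the helper names `build_parser` and `parse_memory_mode` are hypothetical, and how Hermes Agent actually wires its CLI is not described in this issue.

```python
import argparse
from typing import List

# The three modes proposed in this issue.
VALID_MODES = ("local", "cloud", "auto")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for `hermes memory mode <local|cloud|auto>` (hypothetical wiring)."""
    parser = argparse.ArgumentParser(prog="hermes")
    commands = parser.add_subparsers(dest="command", required=True)

    memory = commands.add_parser("memory", help="Memory system commands")
    memory_commands = memory.add_subparsers(dest="memory_command", required=True)

    mode = memory_commands.add_parser("mode", help="Select the memory backend")
    # argparse rejects anything outside VALID_MODES and exits with a usage error.
    mode.add_argument("mode", choices=VALID_MODES)
    return parser


def parse_memory_mode(argv: List[str]) -> str:
    """Return the requested mode from an argument list such as ["memory", "mode", "auto"]."""
    return build_parser().parse_args(argv).mode
```

Restricting the argument with `choices` means a typo such as `hermes memory mode locl` fails at parse time instead of reaching the mode manager.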

Configuration

memory:
  memory_enabled: true
  user_profile_enabled: true
  mode: auto  # local / cloud / auto
  
  # Auto-detection settings
  local:
    detect_ollama: true
    detect_lmstudio: true
    detect_openai_compatible: true
    preferred_models: ["qwen3:4b", "qwen3:8b", "llama3:8b"]
    timeout: 60
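
A minimal sketch of how the configuration above might be validated once it has been parsed into a dictionary. The dataclass and function names are hypothetical, the field names mirror the YAML keys, and treating `timeout` as seconds is an assumption (the proposal does not state a unit).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

VALID_MODES = {"local", "cloud", "auto"}


@dataclass
class LocalDetectionConfig:
    detect_ollama: bool = True
    detect_lmstudio: bool = True
    detect_openai_compatible: bool = True
    preferred_models: List[str] = field(default_factory=list)
    timeout: float = 60.0  # assumed to be seconds


@dataclass
class MemoryConfig:
    memory_enabled: bool = True
    user_profile_enabled: bool = True
    mode: str = "auto"
    local: LocalDetectionConfig = field(default_factory=LocalDetectionConfig)


def load_memory_config(raw: Dict[str, Any]) -> MemoryConfig:
    """Build a MemoryConfig from an already-parsed config dictionary.

    Unknown keys raise TypeError from the dataclass constructors, so
    misspelled options are reported instead of silently ignored.
    """
    section = dict(raw.get("memory") or {})
    local_raw = section.pop("local", None) or {}
    config = MemoryConfig(local=LocalDetectionConfig(**local_raw), **section)
    if config.mode not in VALID_MODES:
        raise ValueError(f"memory.mode must be one of {sorted(VALID_MODES)}, got {config.mode!r}")
    if config.local.timeout <= 0:
        raise ValueError("memory.local.timeout must be positive")
    return config
```

With this shape, an empty config falls back to `mode: auto`, which matches the proposal's intent that no configuration is required.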

Auto-Detection Logic

When mode: auto:

  1. Check for local Ollama

    • Test connection to http://localhost:11434
    • List available models
    • Match with preferred_models
  2. Check for LM Studio

    • Test connection to http://localhost:1234/v1
    • Verify API compatibility
  3. Check for other OpenAI-compatible endpoints

  4. Decision:

    • Local model available → Use local model
    • No local model → Use cloud model (fallback)
  5. Runtime switching:

    • If local model becomes unavailable → Auto-switch to cloud
    • If local model becomes available → User can switch manually
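
The detection steps above can be sketched as follows. This is a rough illustration under several assumptions: the function names are hypothetical, Ollama is probed through its `/api/tags` model listing, LM Studio through the OpenAI-style `/v1/models` listing, and when no preferred model is installed the sketch falls back to the first available model (the proposal does not specify that case). A return value of `None` stands for "use the cloud model".

```python
import http.client
import json
import urllib.request
from typing import Callable, Dict, List, Optional

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"
LMSTUDIO_MODELS_URL = "http://localhost:1234/v1/models"

FetchJson = Callable[[str, float], Optional[dict]]


def fetch_json(url: str, timeout: float) -> Optional[dict]:
    """Return the decoded JSON body, or None if the endpoint is unreachable or invalid."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response)
    except (OSError, ValueError, http.client.HTTPException):
        return None


def pick_preferred(available: List[str], preferred: List[str]) -> Optional[str]:
    """Return the first preferred model that is installed, else any installed model."""
    for name in preferred:
        if name in available:
            return name
    return available[0] if available else None  # assumption: any model beats none


def detect_local_model(
    preferred: List[str],
    fetch: FetchJson = fetch_json,
    timeout: float = 2.0,
) -> Optional[Dict[str, str]]:
    """Probe Ollama first, then LM Studio; None means fall back to cloud."""
    tags = fetch(OLLAMA_TAGS_URL, timeout)
    if tags:
        names = [entry["name"] for entry in tags.get("models", [])]
        chosen = pick_preferred(names, preferred)
        if chosen:
            return {"provider": "ollama", "model": chosen}

    listing = fetch(LMSTUDIO_MODELS_URL, timeout)
    if listing:
        ids = [entry["id"] for entry in listing.get("data", [])]
        chosen = pick_preferred(ids, preferred)
        if chosen:
            return {"provider": "lmstudio", "model": chosen}

    return None
```

Passing the fetch function as a parameter keeps the decision logic testable without a running Ollama or LM Studio instance.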

Benefits

  1. Zero Configuration: Users don't need to specify model details
  2. Broad Compatibility: Works with Ollama, LM Studio, and other OpenAI-compatible local endpoints
  3. Automatic Fallback: Gracefully switches to cloud if local fails
  4. Simple Management: CLI commands to switch modes
  5. Token Savings: An estimated 60-80% reduction in cloud token usage when memory operations run on a local model
  6. Privacy Enhancement: Memory processing stays local when possible

Implementation

1. Local Model Detection Service

from typing import Dict, List, Optional


class LocalModelDetector:
    """Detect and connect to available local models."""

    def detect_ollama(self) -> Optional[Dict]:
        """Check for Ollama installation and models."""
        # Connect to localhost:11434
        # List models
        # Return first matching preferred model

    def detect_lmstudio(self) -> Optional[Dict]:
        """Check for LM Studio."""
        # Connect to localhost:1234/v1
        # Verify OpenAI compatibility

    def detect_all(self) -> List[Dict]:
        """Return list of all available local models."""
        # Run each detector in priority order and drop the ones that found nothing
        found = [self.detect_ollama(), self.detect_lmstudio()]
        return [model for model in found if model is not None]

2. Memory Mode Manager

class MemoryModeManager:
    """Manage memory mode (local/cloud/auto)."""

    def __init__(self, detector: "LocalModelDetector") -> None:
        self.detector = detector

    def set_mode(self, mode: str) -> None:
        """Switch memory mode."""
        if mode == "auto":
            # detect_all() lists every reachable local model; use the first one
            local_models = self.detector.detect_all()
            if local_models:
                self.use_local_model(local_models[0])
            else:
                self.use_cloud_model()

    def runtime_health_check(self) -> None:
        """Monitor local model health, switch to cloud if needed."""

3. BuiltinMemoryProvider Integration

class BuiltinMemoryProvider(MemoryProvider):
    def __init__(self, memory_mode: str = "auto"):
        self._mode = memory_mode
        self._detector = LocalModelDetector()
        self._mode_manager = MemoryModeManager(self._detector)

    def initialize(self, session_id: str, **kwargs) -> None:
        """Apply the configured memory mode; "auto" triggers detection."""
        self._mode_manager.set_mode(self._mode)

Use Cases

Use Case 1: New User with Local Ollama

# User has Ollama installed with qwen3:4b
$ hermes memory mode auto
✅ Detected local model: qwen3:4b (Ollama)
✅ Memory will use local model

# Memory operations now use qwen3:4b locally
# Zero cloud tokens for memory

Use Case 2: Ollama Stops Working

# Ollama process crashes or becomes unavailable
⚠️ Local model qwen3:4b not responding
⚠️ Auto-switching to cloud model

# Memory continues working with cloud model
# No interruption to user workflow

Use Case 3: Manual Switching

# User wants to force cloud mode
$ hermes memory mode cloud
✅ Memory now using cloud model

# User wants to force local mode
$ hermes memory mode local
✅ Memory now using local model: qwen3:4b

Security & Error Handling

  1. Connection Validation: Timeout and retry logic for local model detection
  2. Model Validation: Verify model capabilities before using
  3. Graceful Fallback: Always fall back to cloud if local fails
  4. User Notification: Clear status messages when switching modes
  5. Config Persistence: Save mode choice in config.yaml
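
The retry logic in item 1 could look roughly like the sketch below. The helper name `retry_probe`, the attempt count, and the exponential backoff schedule are all assumptions for illustration; the proposal only asks for "timeout and retry logic". The per-request timeout is assumed to live inside the `probe` callable itself.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_probe(
    probe: Callable[[], Optional[T]],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call probe() until it returns a non-None result or attempts run out.

    Waits base_delay, then 2 * base_delay, and so on between attempts.
    Returns None when every attempt fails, which the caller can treat
    as the signal to fall back to the cloud model.
    """
    for attempt in range(attempts):
        result = probe()
        if result is not None:
            return result
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None
```

The `sleep` parameter exists so tests can substitute a recorder instead of actually waiting.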

Testing

  1. Detection Tests: Mock local model endpoints
  2. Mode Switching Tests: Verify local ↔ cloud transitions
  3. Fallback Tests: Simulate local model failures
  4. Multi-Provider Tests: Test with Ollama, LM Studio, custom endpoints
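
A detection test with a mocked endpoint might look like the sketch below. The function under test, `list_ollama_models`, is a hypothetical stand-in for the detector, and it assumes Ollama's `/api/tags` listing; `unittest.mock.patch` replaces `urllib.request.urlopen` so no local server is needed.

```python
import io
import json
import unittest
import urllib.error
import urllib.request
from typing import List
from unittest import mock


def list_ollama_models(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> List[str]:
    """Return installed Ollama model names, or an empty list if Ollama is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError):
        return []
    return [entry["name"] for entry in payload.get("models", [])]


class DetectionTests(unittest.TestCase):
    def test_lists_models_from_mocked_endpoint(self) -> None:
        # BytesIO works as a context manager and stands in for the HTTP response.
        body = io.BytesIO(json.dumps({"models": [{"name": "qwen3:4b"}]}).encode())
        with mock.patch("urllib.request.urlopen", return_value=body):
            self.assertEqual(list_ollama_models(), ["qwen3:4b"])

    def test_unreachable_endpoint_yields_empty_list(self) -> None:
        # URLError is what urlopen raises when the connection is refused.
        refused = urllib.error.URLError("connection refused")
        with mock.patch("urllib.request.urlopen", side_effect=refused):
            self.assertEqual(list_ollama_models(), [])
```

Saved as a module, these tests can be run with `python -m unittest`. The second test doubles as a fallback test, since an empty list is the condition under which the cloud model would be used.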

Documentation

  • docs/user-guide/configuration/memory.md - Memory mode configuration
  • docs/user-guide/cli/memory.md - CLI commands
  • docs/development/local-model-detection.md - Detection API

Conclusion

This feature adds local model support for memory operations with no required configuration. Users get automatic detection, automatic fallback to the cloud model, and simple CLI management, with an estimated 60-80% reduction in cloud token usage when a local model is available.

Metadata

Labels

  • P3: Low (cosmetic, nice to have)
  • comp/agent: Core agent loop, run_agent.py, prompt builder
  • tool/memory: Memory tool and memory providers
  • type/feature: New feature or request
