Feature: Intelligent Local Model Detection for Memory System #7327

# Feature Request: Intelligent Local Model Detection for Memory System

## Problem Statement

Currently, Hermes Agent's built-in memory system (`MEMORY.md` and `USER.md`) uses the main cloud-based LLM for all memory operations:

1. **Memory injection**: Built-in memory is always injected into the system prompt
2. **Memory extraction**: Memories are extracted automatically from the conversation context
3. **Memory summarization**: Memory summaries are generated by the same model

This results in significant token consumption:
- Memory injection: ~1100-1400 tokens per session
- Total cost: memory operations account for 60-80% of tokens consumed

## Proposed Solution

Add intelligent local model detection with automatic fallback. Users can switch between local and cloud modes via CLI, and the system automatically detects available local models.

### CLI Commands

```bash
hermes memory mode local   # Use local model for memory operations
hermes memory mode cloud   # Use cloud model for memory operations
hermes memory mode auto    # Auto-detect (local if available, otherwise cloud)
```
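The CLI framework that Hermes Agent actually uses is not shown in this issue, so the following is only a sketch of how the three subcommands could be parsed with Python's standard `argparse`. The `build_parser` function and the `command`, `memory_command`, and `mode` attribute names are assumptions made for illustration.

```python
import argparse

VALID_MODES = ("local", "cloud", "auto")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for `hermes memory mode <local|cloud|auto>` (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="hermes")
    commands = parser.add_subparsers(dest="command", required=True)

    memory = commands.add_parser("memory", help="Memory system commands")
    memory_commands = memory.add_subparsers(dest="memory_command", required=True)

    mode = memory_commands.add_parser("mode", help="Select the memory backend")
    # argparse rejects anything outside VALID_MODES with a usage error
    mode.add_argument("mode", choices=VALID_MODES)
    return parser


args = build_parser().parse_args(["memory", "mode", "auto"])
print(args.command, args.memory_command, args.mode)  # memory mode auto
```

Restricting the positional argument with `choices` means an invalid mode is rejected before any detection code runs.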

### Configuration

```yaml
memory:
  memory_enabled: true
  user_profile_enabled: true
  mode: auto  # local / cloud / auto
  
  # Auto-detection settings
  local:
    detect_ollama: true
    detect_lmstudio: true
    detect_openai_compatible: true
    preferred_models: [qwen3:4b, qwen3:8b, llama3:8b]
    timeout: 60
```
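As a rough illustration of how this configuration could be validated once the YAML file has been loaded into a dictionary, here is a minimal sketch using dataclasses. The class names, the `parse_memory_config` helper, and the reading of `timeout` as seconds are assumptions; the proposal does not specify them.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class LocalDetectionConfig:
    detect_ollama: bool = True
    detect_lmstudio: bool = True
    detect_openai_compatible: bool = True
    preferred_models: List[str] = field(default_factory=list)
    timeout: float = 60.0  # Assumed to be seconds; the proposal does not state a unit


@dataclass
class MemoryConfig:
    memory_enabled: bool = True
    user_profile_enabled: bool = True
    mode: str = "auto"
    local: LocalDetectionConfig = field(default_factory=LocalDetectionConfig)


def parse_memory_config(raw: Dict[str, Any]) -> MemoryConfig:
    """Validate the `memory:` mapping after the YAML file has been loaded."""
    section = dict(raw.get("memory", {}))
    local = LocalDetectionConfig(**section.pop("local", {}))
    config = MemoryConfig(local=local, **section)
    if config.mode not in ("local", "cloud", "auto"):
        raise ValueError(f"memory.mode must be local, cloud, or auto, got {config.mode!r}")
    if config.local.timeout <= 0:
        raise ValueError("memory.local.timeout must be positive")
    return config


config = parse_memory_config(
    {"memory": {"mode": "auto", "local": {"preferred_models": ["qwen3:4b"], "timeout": 60}}}
)
print(config.mode, config.local.preferred_models)  # auto ['qwen3:4b']
```

Unknown keys raise a `TypeError` from the dataclass constructor, which surfaces typos in the config file early.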

### Auto-Detection Logic

When `mode: auto`:

1. **Check for local Ollama**
   - Test connection to `http://localhost:11434`
   - List available models
   - Match with `preferred_models`

2. **Check for LM Studio**
   - Test connection to `http://localhost:1234/v1`
   - Verify API compatibility

3. **Check for other OpenAI-compatible endpoints**

4. **Decision**:
   - Local model available → Use local model
   - No local model → Use cloud model (fallback)

5. **Runtime switching**:
   - If local model becomes unavailable → Auto-switch to cloud
   - If local model becomes available → User can switch manually
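The decision step above can be sketched as a loop over probes in priority order, with cloud as the fallback. This is a sketch under stated assumptions: the `choose_backend` function, the `Probe` alias, and the shape of the returned dictionaries are hypothetical, and the two example probes are stand-ins rather than real network calls.

```python
from typing import Callable, Dict, List, Optional, Tuple

Probe = Callable[[], Optional[Dict]]


def choose_backend(probes: List[Tuple[str, Probe]]) -> Dict:
    """Run probes in priority order; return the first hit or a cloud fallback."""
    for provider, probe in probes:
        try:
            model = probe()
        except OSError:
            # Connection refused, timeout, DNS failure: treat as "not available"
            model = None
        if model is not None:
            return {"backend": "local", "provider": provider, **model}
    return {"backend": "cloud"}


def ollama_down() -> Optional[Dict]:
    raise ConnectionRefusedError("nothing listening on 11434")


def lmstudio_up() -> Optional[Dict]:
    return {"model": "qwen3:4b"}


print(choose_backend([("ollama", ollama_down), ("lmstudio", lmstudio_up)]))
# {'backend': 'local', 'provider': 'lmstudio', 'model': 'qwen3:4b'}
print(choose_backend([("ollama", ollama_down)]))
# {'backend': 'cloud'}
```

Keeping the probes as plain callables makes the ordering explicit and lets tests substitute fakes for each provider.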

### Benefits

1. **Zero Configuration**: Users don't need to specify model details
2. **Universal Compatibility**: Works with local LLM servers that expose the Ollama API or an OpenAI-compatible API (Ollama, LM Studio, custom endpoints)
3. **Automatic Fallback**: Gracefully switches to cloud if local fails
4. **Simple Management**: CLI commands to switch modes
5. **Token Savings**: 60-80% reduction when using local models
6. **Privacy Enhancement**: Memory processing stays local when possible

### Implementation

#### 1. Local Model Detection Service

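```python
from typing import Dict, List, Optional


class LocalModelDetector:
    """Detect and connect to available local models."""

    def detect_ollama(self) -> Optional[Dict]:
        """Check for Ollama installation and models."""
        # Connect to localhost:11434
        # List models
        # Return first matching preferred model
        return None  # Placeholder until detection is implemented

    def detect_lmstudio(self) -> Optional[Dict]:
        """Check for LM Studio."""
        # Connect to localhost:1234/v1
        # Verify OpenAI compatibility
        return None  # Placeholder until detection is implemented

    def detect_all(self) -> List[Dict]:
        """Return list of all available local models."""
        candidates = [self.detect_ollama(), self.detect_lmstudio()]
        return [model for model in candidates if model is not None]
```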
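To make the probing step more concrete, here is one way the two checks might look using only the standard library. It relies on Ollama's `GET /api/tags` endpoint and the OpenAI-compatible `GET /v1/models` endpoint that LM Studio serves. The function names, the returned dictionary shape, and the injectable `fetch` parameter are assumptions for illustration, not part of the proposal.

```python
import json
import urllib.request
from typing import Dict, List, Optional


def fetch_json(url: str, timeout: float = 5.0) -> Dict:
    """GET a URL and decode the JSON body (raises OSError if unreachable)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def pick_preferred(available: List[str], preferred: List[str]) -> Optional[str]:
    """Return the first preferred model that is installed, else None."""
    installed = set(available)
    return next((name for name in preferred if name in installed), None)


def detect_ollama(preferred, base_url="http://localhost:11434", fetch=fetch_json):
    """Query Ollama's /api/tags endpoint and match against preferred models."""
    try:
        payload = fetch(f"{base_url}/api/tags")
    except (OSError, ValueError):
        return None  # Not running, timed out, or the body was not valid JSON
    names = [entry["name"] for entry in payload.get("models", [])]
    model = pick_preferred(names, preferred)
    return {"provider": "ollama", "model": model, "base_url": base_url} if model else None


def detect_lmstudio(base_url="http://localhost:1234/v1", fetch=fetch_json):
    """Query the OpenAI-compatible /models endpoint that LM Studio serves."""
    try:
        payload = fetch(f"{base_url}/models")
    except (OSError, ValueError):
        return None
    ids = [entry["id"] for entry in payload.get("data", [])]
    return {"provider": "lmstudio", "model": ids[0], "base_url": base_url} if ids else None


# Offline demonstration with a fake fetcher standing in for a running Ollama
fake_fetch = lambda url: {"models": [{"name": "llama3:8b"}, {"name": "qwen3:4b"}]}
print(detect_ollama(["qwen3:4b", "llama3:8b"], fetch=fake_fetch))
```

With the fake response, `qwen3:4b` is selected because it appears first in the preference list, even though Ollama lists `llama3:8b` first.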

#### 2. Memory Mode Manager

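```python
class MemoryModeManager:
    """Manage memory mode (local/cloud/auto)."""

    def __init__(self, detector: "LocalModelDetector") -> None:
        self.detector = detector
        self.active_model = None  # None means the cloud model is in use

    def set_mode(self, mode: str) -> None:
        """Switch memory mode."""
        if mode not in ("local", "cloud", "auto"):
            raise ValueError(f"unknown memory mode: {mode!r}")
        # "cloud" skips detection; "local" and "auto" probe for a local model first
        models = [] if mode == "cloud" else self.detector.detect_all()
        if models:
            self.use_local_model(models[0])
        else:
            self.use_cloud_model()  # Fallback when nothing local is available

    def use_local_model(self, model: dict) -> None:
        """Route memory operations to the given local model."""
        self.active_model = model

    def use_cloud_model(self) -> None:
        """Route memory operations to the main cloud LLM."""
        self.active_model = None

    def runtime_health_check(self) -> None:
        """Monitor local model health, switch to cloud if needed."""
```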
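The runtime health check is left unspecified, so the following is one possible shape for it: fall back to cloud after a number of consecutive failed pings, and stay on cloud until the user switches back manually. The `HealthMonitor` class, its `ping` and `on_fallback` callbacks, and the `max_failures` threshold are all hypothetical.

```python
from typing import Callable


class HealthMonitor:
    """Fall back to cloud after consecutive failed pings (hypothetical helper)."""

    def __init__(self, ping: Callable[[], bool], on_fallback: Callable[[], None],
                 max_failures: int = 2) -> None:
        self.ping = ping
        self.on_fallback = on_fallback
        self.max_failures = max_failures
        self.failures = 0
        self.using_local = True

    def check(self) -> bool:
        """Run one health check; return True while the local model stays in use."""
        if not self.using_local:
            return False  # Already on cloud; switching back is a manual action
        try:
            healthy = self.ping()
        except OSError:
            healthy = False
        self.failures = 0 if healthy else self.failures + 1
        if self.failures >= self.max_failures:
            self.using_local = False
            self.on_fallback()
        return self.using_local


events = []
responses = iter([True, False, False])
monitor = HealthMonitor(ping=lambda: next(responses),
                        on_fallback=lambda: events.append("cloud"))
print([monitor.check() for _ in range(3)], events)  # [True, True, False] ['cloud']
```

Requiring more than one failure before switching avoids flapping to cloud on a single slow response, at the cost of a short delay before fallback.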

#### 3. BuiltinMemoryProvider Integration

```python
class BuiltinMemoryProvider(MemoryProvider):
    def __init__(self, memory_mode: str = "auto"):
        self._mode = memory_mode
        self._detector = LocalModelDetector()
        self._mode_manager = MemoryModeManager(self._detector)

    def initialize(self, session_id: str, **kwargs) -> None:
        """Initialize the provider and apply the configured memory mode."""
        # Apply whichever mode was configured; "auto" triggers detection
        self._mode_manager.set_mode(self._mode)
```

### Use Cases

#### Use Case 1: New User with Local Ollama

```bash
# User has Ollama installed with qwen3:4b
$ hermes memory mode auto
✅ Detected local model: qwen3:4b (Ollama)
✅ Memory will use local model

# Memory operations now use qwen3:4b locally
# Zero cloud tokens for memory
```

#### Use Case 2: Ollama Stops Working

```bash
# Ollama process crashes or becomes unavailable
⚠️ Local model qwen3:4b not responding
⚠️ Auto-switching to cloud model

# Memory continues working with cloud model
# No interruption to user workflow
```

#### Use Case 3: Manual Switching

```bash
# User wants to force cloud mode
$ hermes memory mode cloud
✅ Memory now using cloud model

# User wants to force local mode
$ hermes memory mode local
✅ Memory now using local model: qwen3:4b
```

### Security & Error Handling

1. **Connection Validation**: Timeout and retry logic for local model detection
2. **Model Validation**: Verify model capabilities before using
3. **Graceful Fallback**: Always fall back to cloud if local fails
4. **User Notification**: Clear status messages when switching modes
5. **Config Persistence**: Save mode choice in config.yaml
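For the timeout and retry logic in item 1, a small helper along these lines might be enough. This is a sketch only: `with_retries`, its parameters, and the exponential backoff schedule are assumptions, and the injectable `sleep` exists so the helper can be exercised without real delays.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_retries(operation: Callable[[], T], attempts: int = 3, base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Call `operation` up to `attempts` times, backing off exponentially on OSError."""
    for attempt in range(attempts):
        try:
            return operation()
        except OSError:
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    return None  # Caller treats None as "local unavailable" and falls back to cloud


calls = []


def flaky_probe() -> str:
    """Fail twice with a timeout, then succeed."""
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("probe timed out")
    return "qwen3:4b"


print(with_retries(flaky_probe, sleep=lambda seconds: None), len(calls))  # qwen3:4b 3
```

Returning `None` instead of raising keeps the graceful-fallback rule in item 3 simple: any exhausted retry budget is handled the same way as "no local model found".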

### Testing

1. **Detection Tests**: Mock local model endpoints
2. **Mode Switching Tests**: Verify local ↔ cloud transitions
3. **Fallback Tests**: Simulate local model failures
4. **Multi-Provider Tests**: Test with Ollama, LM Studio, custom endpoints
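A detection test with a mocked endpoint could look like the sketch below, which patches `urllib.request.urlopen` so no local server is needed. The `list_ollama_models` function under test is a hypothetical stand-in for whatever the detector ends up exposing; the response shape follows Ollama's `/api/tags` endpoint.

```python
import io
import json
import unittest
import urllib.error
import urllib.request
from unittest import mock


def list_ollama_models(base_url: str = "http://localhost:11434") -> list:
    """Return installed model names, or [] when Ollama cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
            return [entry["name"] for entry in json.load(response)["models"]]
    except OSError:
        return []


class DetectionTests(unittest.TestCase):
    def test_lists_models_from_mocked_endpoint(self):
        # BytesIO stands in for the HTTP response object
        body = io.BytesIO(json.dumps({"models": [{"name": "qwen3:4b"}]}).encode())
        with mock.patch("urllib.request.urlopen", return_value=body):
            self.assertEqual(list_ollama_models(), ["qwen3:4b"])

    def test_unreachable_endpoint_yields_empty_list(self):
        # URLError is what urlopen raises when the connection is refused
        error = urllib.error.URLError("connection refused")
        with mock.patch("urllib.request.urlopen", side_effect=error):
            self.assertEqual(list_ollama_models(), [])
```

Saved as a test module, this runs with `python -m unittest`. The same pattern extends to fallback tests by using `side_effect` to simulate a model that fails partway through a session.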

### Documentation

- `docs/user-guide/configuration/memory.md` - Memory mode configuration
- `docs/user-guide/cli/memory.md` - CLI commands
- `docs/development/local-model-detection.md` - Detection API

### Related Issues

- Issue #3926: Add Ollama Cloud as built-in provider
- Issue #879: Local model routing for auxiliary tasks

### Conclusion

This feature provides universal local model support with zero configuration. Users get automatic detection, seamless fallback, and simple CLI management, while saving 60-80% of memory tokens when local models are available.

