
edit_prediction: Add Ollama as inline completion provider#1

Merged
akhil-p-git merged 6 commits into main from feat/ollama-inline-completion on Dec 13, 2025

Conversation


@akhil-p-git commented Dec 1, 2025

Summary

This PR implements Ollama as an inline code completion provider for Zed, addressing issue #15968. It enables users to get AI-powered code completions from locally-running Large Language Models (LLMs) without requiring external API keys or cloud services.

Why Ollama?

Ollama is a popular open-source tool for running LLMs locally on macOS, Linux, and Windows. By integrating Ollama as an edit prediction provider, Zed users gain:

  • Privacy: All code stays on the local machine; no data is sent to external servers
  • No API costs: Use open-source models without usage fees or rate limits
  • Offline capability: Works without internet connectivity
  • Model flexibility: Choose from hundreds of available models (Qwen, CodeLlama, DeepSeek, etc.)
  • Hardware optimization: Models run on local GPU/CPU, optimized for user's hardware

Features

1. Core Completion Provider

A new ollama_completion crate (~950 lines) implementing the EditPredictionProvider trait:

| Feature | Description |
| --- | --- |
| FIM prompting | Uses the standardized Fill-In-Middle format for context-aware completions |
| Intelligent debouncing | 75ms delay prevents excessive API calls while typing |
| Optimized parameters | 256 max tokens, 0.2 temperature for focused completions |
| Smart stop sequences | Auto-stops at double newlines and FIM tokens |
| Grapheme-aware matching | Proper Unicode handling when computing edit ranges |
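The FIM prompt assembly and stop sequences above can be sketched as follows. This is an illustrative sketch, not the PR's actual code: the helper names (`build_fim_prompt`, `stop_sequences`) are hypothetical, and the sentinel-token spellings shown are the Qwen2.5-Coder style — other model families use different tokens.

```rust
/// Illustrative sketch: assemble a Fill-In-Middle prompt from the text
/// before and after the cursor, using Qwen2.5-Coder-style FIM tokens.
fn build_fim_prompt(prefix: &str, suffix: &str) -> String {
    format!("<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>")
}

/// Stop sequences: generation halts at any FIM token or a blank line,
/// matching the "Smart stop sequences" behavior described above.
fn stop_sequences() -> Vec<&'static str> {
    vec!["<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>", "\n\n"]
}
```

The double-newline stop keeps the model from running past the current block, which pairs with the end-of-line trimming done later in post-processing.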

2. User Configuration

Full settings support via settings.json:

```json
{
  "features": {
    "edit_prediction_provider": "ollama"
  },
  "edit_predictions": {
    "ollama": {
      "api_url": "http://localhost:11434",
      "model": "qwen2.5-coder:7b",
      "temperature": 0.2,
      "max_tokens": 256
    }
  }
}
```
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `api_url` | String | `http://localhost:11434` | Ollama server URL |
| `model` | String | `qwen2.5-coder:7b` | Model for completions |
| `temperature` | Float | 0.2 | Sampling temperature (0.0-2.0) |
| `max_tokens` | Integer | 256 | Maximum tokens to generate |

3. Error Handling & Status Indicators

Comprehensive error handling with visual feedback:

| State | UI Indicator | Context Menu Message |
| --- | --- | --- |
| Connected | Normal | "Connected to Ollama" (green) |
| Server down | Red dot | "Ollama server is not running. Start with `ollama serve`." |
| Model missing | Red dot | "Model not found. Available: [list]. Run `ollama pull <model>`." |
| Other error | Red dot | Truncated error message |

4. Model Health Check

Automatic validation on first completion request:

  • Verifies Ollama server is running via /api/tags endpoint
  • Confirms configured model exists on the server
  • Provides actionable error messages with available models list
  • Runs once per session to minimize overhead
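The model-existence half of the health check can be sketched as pure logic over the model names returned by `/api/tags` (the function name `check_model` and the exact error wording are illustrative, not the PR's actual code):

```rust
/// Illustrative sketch: given the model names reported by Ollama's
/// /api/tags endpoint, verify the configured model is present, and
/// produce an actionable error message (listing available models and
/// the pull command) when it is not.
fn check_model(configured: &str, available: &[String]) -> Result<(), String> {
    if available.iter().any(|m| m.as_str() == configured) {
        return Ok(());
    }
    Err(format!(
        "Model not found. Available: [{}]. Run 'ollama pull {}'.",
        available.join(", "),
        configured
    ))
}
```

Because the result only changes when the user pulls or removes models, running this once per session (and caching the outcome in a `health_checked` flag) avoids a network round-trip on every keystroke.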

5. Context Window Optimization

Smart prompt building that respects model context limits:

| Setting | Value | Purpose |
| --- | --- | --- |
| Prefix limit | 4KB | Code before the cursor |
| Suffix limit | 1KB | Code after the cursor |
| Line boundary | Auto | Clean truncation at newlines |
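The line-boundary truncation can be sketched like this (an assumed implementation, not the PR's actual code; note the explicit UTF-8 boundary handling, since byte offsets can land mid-character):

```rust
/// Illustrative: keep at most `limit` bytes of prefix, then advance to
/// the next newline so the prompt starts on a clean line.
fn truncate_prefix(prefix: &str, limit: usize) -> &str {
    if prefix.len() <= limit {
        return prefix;
    }
    // Align the cut point to a valid UTF-8 boundary.
    let mut start = prefix.len() - limit;
    while !prefix.is_char_boundary(start) {
        start += 1;
    }
    // Prefer to resume at the first full line inside the window.
    match prefix[start..].find('\n') {
        Some(i) => &prefix[start + i + 1..],
        None => &prefix[start..],
    }
}

/// Illustrative: keep at most `limit` bytes of suffix, ending at the
/// last complete line so no trailing line is cut in half.
fn truncate_suffix(suffix: &str, limit: usize) -> &str {
    if suffix.len() <= limit {
        return suffix;
    }
    let mut end = limit;
    while !suffix.is_char_boundary(end) {
        end -= 1;
    }
    match suffix[..end].rfind('\n') {
        Some(i) => &suffix[..i],
        None => &suffix[..end],
    }
}
```

Dropping the partial line at each edge trades a few bytes of context for prompts that never start or end mid-statement, which tends to produce cleaner completions.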

6. Completion Caching

LRU cache eliminates redundant API calls:

| Property | Value |
| --- | --- |
| Cache size | 50 entries |
| Lookup | O(1) hash-based |
| Eviction | Automatic LRU |
| Key | Prompt hash |
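The cache fields shown later in the provider struct (`completion_cache: HashMap<u64, String>` plus `cache_order: Vec<u64>`) suggest a scheme along these lines. This is a sketch under that assumption, not the PR's actual code; a linked-list LRU would make eviction bookkeeping O(1), whereas the `Vec`-based recency update here is O(n) over at most 50 entries.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

const MAX_CACHE_ENTRIES: usize = 50;

/// Hash the FIM prompt to a u64 cache key.
fn prompt_hash(prompt: &str) -> u64 {
    let mut h = DefaultHasher::new();
    prompt.hash(&mut h);
    h.finish()
}

/// Illustrative LRU cache: HashMap for O(1) lookup, Vec for eviction order.
struct CompletionCache {
    entries: HashMap<u64, String>,
    order: Vec<u64>, // front = least recently used, back = most recent
}

impl CompletionCache {
    fn new() -> Self {
        Self { entries: HashMap::new(), order: Vec::new() }
    }

    fn get(&mut self, key: u64) -> Option<&String> {
        if self.entries.contains_key(&key) {
            // Refresh recency: move the key to the back of the order list.
            self.order.retain(|k| *k != key);
            self.order.push(key);
        }
        self.entries.get(&key)
    }

    fn insert(&mut self, key: u64, completion: String) {
        if self.entries.len() >= MAX_CACHE_ENTRIES && !self.entries.contains_key(&key) {
            // Evict the least recently used entry.
            let oldest = self.order.remove(0);
            self.entries.remove(&oldest);
        }
        self.order.retain(|k| *k != key);
        self.order.push(key);
        self.entries.insert(key, completion);
    }
}
```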

7. Streaming Completions

Lower perceived latency with streaming API:

  • Uses stream: true for incremental responses
  • Processes tokens as they arrive
  • Tracks time-to-first-token metric
  • Accumulates into final completion
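The accumulation loop over the streamed chunks can be sketched as below. `StreamChunk` stands in for the parsed form of each NDJSON line from Ollama (whose real response objects carry more fields than shown); the `accumulate` helper is illustrative, not the PR's actual code.

```rust
use std::time::{Duration, Instant};

/// Each line of Ollama's streaming /api/generate response roughly
/// deserializes to this shape (other fields omitted).
struct StreamChunk {
    response: String,
    done: bool,
}

/// Accumulate streamed chunks into the final completion text while
/// recording the time-to-first-token latency metric.
fn accumulate(chunks: impl IntoIterator<Item = StreamChunk>) -> (String, Option<Duration>) {
    let started = Instant::now();
    let mut text = String::new();
    let mut first_token = None;
    for chunk in chunks {
        if first_token.is_none() && !chunk.response.is_empty() {
            // First non-empty token: capture elapsed time once.
            first_token = Some(started.elapsed());
        }
        text.push_str(&chunk.response);
        if chunk.done {
            break;
        }
    }
    (text, first_token)
}
```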

8. Telemetry Integration

Comprehensive analytics for quality measurement:

| Event | Properties | When |
| --- | --- | --- |
| Ollama Completion | model, token_count, total_time_ms, time_to_first_token_ms, cached, success | After each request |
| Ollama Completion Shown | model, completion_length | Ghost text displayed |
| Ollama Completion Accepted | model | User presses Tab |
| Ollama Completion Discarded | model | User dismisses |
| Ollama Health Check Passed | (none) | Successful check |
| Ollama Health Check Failed | error | Failed check |

Technical Implementation

Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         User Types Code                          │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Editor triggers refresh()                     │
│                      (with 75ms debounce)                        │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│               OllamaCompletionProvider.refresh()                 │
│                                                                  │
│  1. Check completion cache ────────────────► Return if cached    │
│  2. Health check (first request only)                            │
│  3. Build optimized FIM prompt                                   │
│  4. Stream completion from Ollama API                            │
│  5. Cache result, report telemetry                               │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│              Ollama API (/api/generate, stream=true)             │
│                                                                  │
│  Request:                                                        │
│  {                                                               │
│    "model": "qwen2.5-coder:7b",                                  │
│    "prompt": "<|fim_prefix|>...<|fim_suffix|>...<|fim_middle|>", │
│    "stream": true,                                               │
│    "options": { "num_predict": 256, "temperature": 0.2 }         │
│  }                                                               │
│                                                                  │
│  Response: Line-delimited JSON with incremental tokens           │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                   completion_from_text()                         │
│                                                                  │
│  1. Trim to end of line (unless multiline)                       │
│  2. Compare with buffer text (grapheme-aware)                    │
│  3. Generate minimal edit operations                             │
│  4. Return EditPrediction::Local { edits, ... }                  │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│              Editor displays ghost text / inline hint            │
│              User presses Tab to accept                          │
└─────────────────────────────────────────────────────────────────┘
```
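The trim-to-end-of-line step in `completion_from_text()` can be sketched as follows. The leading-newline exception mirrors the behavior exercised by `test_trim_to_end_of_line_unless_leading_newline`; the helper shown here is an assumed reconstruction, not the PR's actual code.

```rust
/// Illustrative: keep only the first line of a completion, unless the
/// model deliberately started with a newline (treated as a multiline
/// continuation and passed through unchanged).
fn trim_to_end_of_line(completion: &str) -> &str {
    if completion.starts_with('\n') {
        return completion;
    }
    match completion.find('\n') {
        Some(i) => &completion[..i],
        None => completion,
    }
}
```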

FIM Prompt Format

```
<|fim_prefix|>fn calculate_sum(numbers: &[i32]) -> i32 {
    let mut sum = 0;
    for n in numbers {
        sum += <|fim_suffix|>
    }
    sum
}<|fim_middle|>
```

The model generates: `*n;`

Supported models: Qwen2.5-Coder, CodeLlama, DeepSeek-Coder, StarCoder, Codestral

Files Changed

| File | Lines | Description |
| --- | --- | --- |
| crates/ollama_completion/Cargo.toml | +34 | New crate manifest |
| crates/ollama_completion/src/ollama_completion.rs | +950 | Core implementation |
| crates/settings/src/settings_content/language.rs | +31 | Settings schema |
| crates/language/src/language_settings.rs | +23 | Runtime settings |
| crates/zed/src/zed/edit_prediction_registry.rs | +20 | Provider registration |
| crates/edit_prediction_button/src/edit_prediction_button.rs | +102 | UI integration |
| crates/edit_prediction_button/Cargo.toml | +1 | Dependency |
| crates/agent_ui/src/agent_ui.rs | +1 | Action filter |
| crates/zed/Cargo.toml | +1 | Workspace dependency |
| Cargo.toml | +2 | Workspace member |

Key Data Structures

```rust
pub struct OllamaCompletionProvider {
    http_client: Arc<dyn HttpClient>,
    api_url: String,
    model: String,
    temperature: f32,
    max_tokens: i32,
    buffer_id: Option<EntityId>,
    completion_text: Option<String>,
    pending_refresh: Option<Task<Result<()>>>,
    completion_position: Option<Anchor>,
    completion_cache: HashMap<u64, String>,  // LRU cache
    cache_order: Vec<u64>,                   // Eviction order
    health_checked: bool,                    // One-time flag
}

pub enum OllamaConnectionStatus {
    Unknown,
    Connected,
    Error(String),
}

struct CompletionMetrics {
    model: String,
    token_count: u32,
    total_time_ms: u64,
    time_to_first_token_ms: Option<u64>,
    cached: bool,
}
```

Installation & Usage

Prerequisites

  1. Install Ollama: https://ollama.ai

    # macOS
    brew install ollama
    
    # Or download from https://ollama.ai/download
  2. Pull a code completion model:

    ollama pull qwen2.5-coder:7b

    Recommended models:

    | Model | Size | Notes |
    | --- | --- | --- |
    | qwen2.5-coder:7b | 4.7GB | Best quality/speed balance |
    | codellama:7b | 3.8GB | Meta's code model |
    | deepseek-coder:6.7b | 3.8GB | Strong code understanding |
    | starcoder2:7b | 4.0GB | Multi-language support |
  3. Start Ollama server (if not running as service):

    ollama serve

Configuration

Add to ~/.config/zed/settings.json:

```json
{
  "features": {
    "edit_prediction_provider": "ollama"
  }
}
```

Usage

  1. Open any code file in Zed
  2. Start typing - completions appear as ghost text after 75ms
  3. Press Tab to accept the completion
  4. Check the status bar icon for connection status

Test Plan

Automated Tests (17 total)

```
running 17 tests
test tests::test_completion_metrics ... ok
test tests::test_completion_metrics_cached ... ok
test tests::test_has_leading_newline ... ok
test tests::test_ollama_connection_status_default ... ok
test tests::test_ollama_connection_status_equality ... ok
test tests::test_completion_cache ... ok
test tests::test_cache_lru_eviction ... ok
test tests::test_prompt_hashing ... ok
test tests::test_provider_initial_state ... ok
test tests::test_provider_builder_methods ... ok
test tests::test_generate_response_deserialization ... ok
test tests::test_streaming_response_deserialization ... ok
test tests::test_streaming_request_serialization ... ok
test tests::test_generate_request_serialization ... ok
test tests::test_tags_response_deserialization ... ok
test tests::test_tags_response_empty ... ok
test tests::test_trim_to_end_of_line_unless_leading_newline ... ok

test result: ok. 17 passed; 0 failed; 0 ignored
```

Manual Testing Checklist

  • cargo build -p zed completes without errors
  • cargo test -p ollama_completion passes all tests
  • Ollama provider appears in edit prediction dropdown
  • Selecting Ollama updates settings correctly
  • Completions appear when typing with Ollama running
  • Tab accepts completions correctly
  • Error indicator shows when Ollama not running
  • Context menu shows appropriate status messages
  • Custom URL and model settings are respected
  • Temperature and max_tokens settings work
  • Switching between providers works
  • Health check runs on first request
  • Cache hits avoid redundant API calls
  • Streaming provides responsive completions

Performance Considerations

| Aspect | Implementation |
| --- | --- |
| Debounce | 75ms delay prevents API spam |
| Caching | LRU cache (50 entries) eliminates redundant calls |
| Context | Smart truncation (4KB prefix, 1KB suffix) |
| Health check | One-time per session |
| Streaming | Lower perceived latency |
| Memory | Cache eviction prevents unbounded growth |

Commits

| Commit | Description |
| --- | --- |
| 80df76c | Core provider with FIM support |
| 4e76750 | User configuration |
| f61d00e | Error handling UI and tests |
| 5b5bd8f | Health check, context optimization, caching |
| 1aff3ce | Streaming completions and telemetry |
| dee4579 | Telemetry callbacks and configurable parameters |

Future Enhancements

Potential improvements for future PRs:

| Feature | Complexity | Description |
| --- | --- | --- |
| Multi-completion cycling | Medium | Generate and cycle through multiple suggestions |
| Language filtering | Medium | Enable/disable per language |
| Model capability validation | Medium | Verify the model supports code completion |
| Stop sequences config | Low | User-configurable stop tokens |

Related Issues

  • Closes zed-industries#15968 (Support using ollama as an inline_completion_provider)
Compatibility

  • Zed Version: Built against current main branch
  • Ollama Version: Tested with Ollama 0.4.x+
  • Platforms: macOS (tested), Linux/Windows (should work)

🤖 Generated with Claude Code

akhil-p-git and others added 6 commits December 1, 2025 15:36
Adds support for using Ollama as an edit prediction provider, enabling
users to get code completions from locally-running LLMs without requiring
external API keys.

Implementation:
- New `ollama_completion` crate implementing `EditPredictionProvider` trait
- Uses Fill-In-Middle (FIM) prompt format for context-aware completions
- Configured with 75ms debounce, 256 max tokens, 0.2 temperature
- Default model: qwen2.5-coder:7b at localhost:11434

To use:
1. Install Ollama and run: `ollama pull qwen2.5-coder:7b`
2. Add to settings: `{ "features": { "edit_prediction_provider": "ollama" } }`

Closes zed-industries#15968

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Allows users to configure Ollama edit prediction settings:
- `api_url`: Custom Ollama server URL (default: http://localhost:11434)
- `model`: Model to use for completions (default: qwen2.5-coder:7b)

Settings can be configured in settings.json:
```json
{
  "edit_predictions": {
    "ollama": {
      "api_url": "http://localhost:11434",
      "model": "qwen2.5-coder:7b"
    }
  }
}
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add global OllamaConnectionStatus tracking (Unknown, Connected, Error)
- Show error indicator (red dot) on status bar button when connection fails
- Display helpful error messages in context menu (e.g., "Ollama server is not running")
- Show success indicator (green "Connected") when Ollama is responding
- Add 7 unit tests covering helper functions, status tracking, and serialization

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…for Ollama

This commit adds three enhancements to the Ollama completion provider:

**Model Health Check**
- Validates Ollama server is running on first completion request
- Verifies configured model is available via /api/tags endpoint
- Provides actionable error messages (e.g., lists available models)
- Only runs once per session to avoid repeated network calls

**Context Optimization**
- Limits prefix to 4KB and suffix to 1KB for efficient context windows
- Smart line-boundary detection avoids cutting code mid-line
- Prevents malformed prompts that could degrade completion quality

**Completion Caching**
- LRU cache (max 50 entries) stores recent completions
- Hash-based lookup for O(1) cache hits
- Eliminates redundant API calls for identical prompts
- Automatic eviction of oldest entries when cache is full

Adds 6 new unit tests (13 total):
- test_prompt_hashing
- test_completion_cache
- test_cache_lru_eviction
- test_tags_response_deserialization
- test_tags_response_empty
- test_provider_initial_state

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
**Streaming Completions**
- Replaced non-streaming API with streaming for lower perceived latency
- Processes line-delimited JSON responses incrementally
- Tracks time-to-first-token for performance monitoring
- Accumulates tokens as they arrive from Ollama

**Telemetry Integration**
- Reports completion events: model, token count, latency metrics
- Tracks cache hits separately (zero latency, zero tokens)
- Reports health check success/failure events
- Reports completion errors with error messages
- Uses standard telemetry::event! macro for consistency

**Metrics Captured**
- Model name used for completion
- Token count from streaming response
- Total completion time (ms)
- Time to first token (ms) - key latency metric
- Cache hit status

Adds 4 new unit tests (17 total):
- test_streaming_response_deserialization
- test_streaming_request_serialization
- test_completion_metrics
- test_completion_metrics_cached

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…for Ollama

**Accept/Reject Telemetry**
- Reports "Ollama Completion Accepted" event when user accepts a completion
- Reports "Ollama Completion Discarded" event when user rejects/discards
- Includes model name in event properties for analytics

**did_show() Callback**
- Reports "Ollama Completion Shown" event when completion is displayed
- Tracks completion_length for understanding suggestion quality
- Enables measuring impression-to-acceptance rate

**Configurable Parameters**
- Added `temperature` setting (default: 0.2) - controls randomness
- Added `max_tokens` setting (default: 256) - controls completion length
- Both configurable via edit_predictions.ollama in settings.json
- Example: `"edit_predictions": { "ollama": { "temperature": 0.3, "max_tokens": 512 } }`

**Settings Schema**
- Updated OllamaEditPredictionSettingsContent with new fields
- Updated OllamaSettings runtime struct
- Wired settings through edit_prediction_registry

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@robertherber
Just started trying out Zed today - and was looking for exactly this! Amazing work @akhil-p-git and good timing :) When do you think this will be released?

@akhil-p-git

> Just started trying out Zed today - and was looking for exactly this! Amazing work @akhil-p-git and good timing :) When do you think this will be released?

Sorry about that, life got busy. I can merge it any time!

@akhil-p-git akhil-p-git merged commit a42b879 into main Dec 13, 2025


Development

Successfully merging this pull request may close these issues.

Support using ollama as an inline_completion_provider