
feat: local model management for Ollama and LM Studio (browse, pull, configure) #1030

@Aureliolo

Description

Summary

Local providers (Ollama, LM Studio) have management APIs that SynthOrg doesn't use. Currently, configured provider model counts come from LiteLLM's static model_cost database instead of the actual running instance, and there's no way to download new models or configure launch parameters from the dashboard.

Bug: model count mismatch in setup wizard

The setup wizard's "Auto-detected Providers" section shows one model count (live probe via /api/tags) while the "Configured Providers" card shows a different count (from LiteLLM's hardcoded database). For example, Ollama auto-detect shows 17 models (the real ones) but the configured provider shows 29 (LiteLLM's static list, which includes models like deepseek-v3.1:671b-cloud that don't exist locally).

Root cause: create_from_preset (service.py:394) calls models_from_litellm("ollama"), which returns 29 entries from LiteLLM's static DB. Because that list is truthy, _maybe_discover_preset_models (line 453: if models: return models) skips live discovery entirely.

Fix: For local/no-auth presets (Ollama, LM Studio, vLLM), always prefer live discovery over models_from_litellm. The static DB is irrelevant for local providers -- only what's actually running matters.
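A minimal sketch of the proposed fix, using simplified stand-in signatures (the real `create_from_preset` and `_maybe_discover_preset_models` in service.py take more arguments, and `AuthType`/`discover_live_models` here are illustrative names):

```python
from dataclasses import dataclass
from enum import Enum

class AuthType(Enum):
    NONE = "none"
    API_KEY = "api_key"

@dataclass
class ProviderPreset:
    name: str
    auth_type: AuthType

def models_from_litellm(provider: str) -> list[str]:
    # Stand-in for LiteLLM's static model_cost lookup; for Ollama it
    # returns entries that may not exist on the local instance.
    return ["deepseek-v3.1:671b-cloud"] if provider == "ollama" else []

def discover_live_models(preset: ProviderPreset) -> list[str]:
    # Stand-in for a live probe of /api/tags (Ollama) or /v1/models.
    return ["llama3.1:8b"]

def maybe_discover_preset_models(preset: ProviderPreset,
                                 models: list[str]) -> list[str]:
    # Old behavior: `if models: return models` short-circuited on the
    # static list. New behavior: no-auth (local) presets always prefer
    # live discovery over the static DB.
    if models and preset.auth_type is not AuthType.NONE:
        return models
    return discover_live_models(preset)
```

Remote, authenticated presets keep the current behavior; only local presets change.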

Feature: local model management

Provider capabilities matrix

| Provider | List models | Pull/download | Delete | Launch params | Notes |
|---|---|---|---|---|---|
| Ollama | `GET /api/tags` | `POST /api/pull` | `DELETE /api/delete` | `num_ctx`, `num_gpu`, `num_thread` via `options` | Full management API |
| LM Studio | `GET /v1/models` | `GET /api/v0/models/download` | Yes, via API | Context length, GPU layers via load params | Newer API versions |
| vLLM | `GET /v1/models` | No | No | No (launch-time only) | Read-only -- launched with `--model X` |
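For reference, Ollama's `GET /api/tags` returns a JSON body with a `models` array whose entries carry `name`, `size`, and `modified_at` (per Ollama's API docs); a small parser for the fields the model browser needs might look like:

```python
import json

def parse_ollama_tags(body: str) -> list[dict]:
    """Extract name, size, and last-modified from an /api/tags response."""
    data = json.loads(body)
    return [
        {
            "name": m["name"],
            "size_bytes": m.get("size", 0),
            "modified_at": m.get("modified_at"),
        }
        for m in data.get("models", [])
    ]
```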

Backend requirements

  1. ProviderPreset capability flags: Add supports_model_pull, supports_model_delete, supports_model_config booleans to ProviderPreset so the UI knows what controls to show per provider.

  2. Model pull/download endpoint: POST /api/v1/providers/{name}/models/pull with { "model": "llama3.1:8b" }. Proxies to Ollama's /api/pull or LM Studio's download API. Returns a streaming response for download progress.

  3. Model delete endpoint: DELETE /api/v1/providers/{name}/models/{model_id}. Proxies to the provider's delete API.

  4. Model config endpoint: PUT /api/v1/providers/{name}/models/{model_id}/config with launch parameters (num_ctx, num_gpu, etc.). Stored in ProviderModelConfig and passed to LiteLLM at inference time.

  5. Refresh models endpoint: POST /api/v1/providers/{name}/models/refresh. Re-runs live discovery and updates the stored model list to match reality.

  6. Fix create_from_preset: For presets with auth_type=NONE, skip models_from_litellm and go directly to live discovery.
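For the pull endpoint (item 2), Ollama's `POST /api/pull` streams newline-delimited JSON status events; download events carry `total` and `completed` byte counts (per Ollama's API docs). A hypothetical helper the endpoint could use to turn each event into a progress percentage for the dashboard:

```python
import json
from typing import Optional

def pull_progress_percent(ndjson_line: str) -> Optional[float]:
    # Status-only events (e.g. "verifying sha256 digest") have no byte
    # counts and yield None; download events yield a 0-100 percentage.
    event = json.loads(ndjson_line)
    total = event.get("total")
    if not total:
        return None
    return round(100.0 * event.get("completed", 0) / total, 1)
```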

Dashboard requirements

  1. Model browser: In the provider detail page, show the actual models from the running instance (not LiteLLM's static list). Include model size, quantization info, and last-modified date where available.

  2. Pull new model: Search/input field to pull a model by name (e.g., llama3.1:8b). Show download progress with a progress bar. Only shown for providers with supports_model_pull.

  3. Delete model: Confirmation dialog to remove a model from the local instance. Only shown for providers with supports_model_delete.

  4. Model parameters: Per-model config editor for launch parameters (context window size, GPU layer count, thread count). Only shown for providers with supports_model_config. Values stored in ProviderModelConfig and applied at inference time.

  5. Refresh button: Re-discover models from the running instance and sync the stored list.

Per-model parameters (Ollama)

| Parameter | Description | Default |
|---|---|---|
| `num_ctx` | Context window size (tokens) | Model default |
| `num_gpu` | Number of GPU layers to offload | Auto |
| `num_thread` | CPU thread count | Auto |
| `num_batch` | Batch size for prompt processing | 512 |
| `repeat_penalty` | Repetition penalty | 1.1 |
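At inference time, stored ProviderModelConfig values would be forwarded in the `options` object of Ollama's chat/generate request. A hypothetical sketch of building the raw request body (the helper name and config shape are assumptions, not existing code):

```python
# Keys Ollama accepts in the "options" object (subset relevant here).
OLLAMA_OPTION_KEYS = {"num_ctx", "num_gpu", "num_thread",
                      "num_batch", "repeat_penalty"}

def build_ollama_chat_payload(model: str, messages: list,
                              config: dict) -> dict:
    # Drop unknown keys so stale or provider-specific config entries
    # never reach the Ollama API.
    options = {k: v for k, v in config.items() if k in OLLAMA_OPTION_KEYS}
    payload = {"model": model, "messages": messages, "stream": False}
    if options:
        payload["options"] = options
    return payload
```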

Per-model parameters (LM Studio)

| Parameter | Description | Default |
|---|---|---|
| `n_ctx` | Context window size | Model default |
| `n_gpu_layers` | GPU layers to offload | Auto |

Design Reference

  • Providers design page: docs/design/operations.md (provider configuration section)
  • Provider presets: src/synthorg/providers/presets.py
  • Provider management service: src/synthorg/providers/management/service.py
  • Discovery module: src/synthorg/providers/discovery.py
  • Probing module: src/synthorg/providers/probing.py

Dependencies

None -- all provider infrastructure already exists.

Implementation notes

  • The bug fix (skip models_from_litellm for local presets) should land first as it's the smallest change
  • Streaming pull progress will need a WebSocket or SSE endpoint
  • vLLM is read-only by design -- only show list/refresh, no pull/delete/config controls
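If SSE is chosen over WebSockets for pull progress, each event becomes one `data:` frame. A minimal serializer (framing per the WHATWG Server-Sent Events spec; the function name is hypothetical):

```python
import json
from typing import Optional

def sse_frame(data: dict, event: Optional[str] = None) -> str:
    # One SSE frame: optional "event:" line, then a "data:" line with the
    # JSON payload, terminated by a blank line.
    lines = []
    if event:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"
```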

Metadata


Labels

  • prio:medium (Should do, but not blocking)
  • scope:large (3+ days of work)
  • spec:human-interaction (DESIGN_SPEC Section 13 - Human Interaction Layer)
  • spec:providers (DESIGN_SPEC Section 9 - Model Provider Layer)
  • type:feature (New feature implementation)
  • v0.6 (Minor version v0.6)
  • v0.6.0 (Patch release v0.6.0)
