
Olla with docker model runner #80

@Turboscherbe

Description


Thank you for this great software. I tried to get it running with Docker Compose and its new Model Runner. With the configuration below I can see the (healthy) endpoint and the available models, but I cannot reach the service through the provider endpoints (ollama, openai, ...).

I set the endpoint up with type openai, since Docker Model Runner exposes an OpenAI-compatible API (see https://docs.docker.com/ai/model-runner/api-reference/#available-openai-endpoints).

  • The endpoint is running and healthy
  • Models show up on the internal API routes /internal/status/models and /olla/models
  • The external API model routes return an empty list (e.g. /olla/openai/v1/models or /olla/llamacpp/v1/models)
  • Other external API endpoints return 404 page not found

It feels like I'm nearly there, but I seem to have missed something in my config.
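For reference, the checks above can be reproduced with a small script (my own sketch, not part of Olla; it assumes the 40114 port mapping from the compose file below):

```python
import urllib.error
import urllib.request

# Routes observed above: the internal ones work, the external ones don't.
BASE = "http://localhost:40114"
ROUTES = [
    "/internal/status/models",   # internal: lists both models
    "/olla/models",              # unified list: lists both models
    "/olla/openai/v1/models",    # external: returns an empty list
    "/olla/llamacpp/v1/models",  # external: returns an empty list
    "/olla/openai/v1",           # external: 404 page not found
]

def probe(base: str, route: str) -> str:
    # Return "<route>: <status>" or a short note if the server is down.
    try:
        with urllib.request.urlopen(base + route, timeout=5) as resp:
            return f"{route}: {resp.status}"
    except urllib.error.HTTPError as exc:
        return f"{route}: {exc.code}"
    except OSError as exc:
        return f"{route}: unreachable ({exc})"

if __name__ == "__main__":
    for route in ROUTES:
        print(probe(BASE, route))
```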

## docker-compose.yml ##
services:
  olla:
    image: ghcr.io/thushan/olla:latest
    container_name: olla
    restart: unless-stopped
    ports:
      - "40114:40114"
    volumes:
      - ./olla.yaml:/app/config.yaml:ro
      # - ./logs:/app/logs
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:40114/internal/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s
    environment:
      - LOG_LEVEL=info
    models:
      gptoss:
        endpoint_var: GPTOSS_URL
        model_var: GPTOSS_MODEL

models:
  qwen3coder:
    model: ai/qwen3-coder:30B-A3B-UD-Q4_K_XL
    context_size: 30000
#    runtime_flags:
#      - "--n-gpu-layers 1"

For Olla I used the default reference config:

## olla.yaml ##
server:
  host: "0.0.0.0"
  port: 40114
  read_timeout: 20s
  write_timeout: 0s
  idle_timeout: 120s
  shutdown_timeout: 10s
  request_logging: false
  request_limits:
    max_body_size: 52428800    # 50MB
    max_header_size: 524288     # 512KB
  rate_limits:
    global_requests_per_minute: 0
    per_ip_requests_per_minute: 0
    health_requests_per_minute: 0
    burst_size: 50
    cleanup_interval: 1m
    trust_proxy_headers: false
    trusted_proxy_cidrs: []


proxy:
  engine: "olla"
  profile: "auto"
  load_balancer: "priority"
  connection_timeout: 30s
  response_timeout: 0s
  read_timeout: 0s
  stream_buffer_size: 4096
#  profile_filter:
#    include:
#      - "ollama"        # Include Ollama
#      - "openai*"       # Include all OpenAI variants

translators:
  anthropic:
    enabled: true                   # Enable Anthropic translator
    max_message_size: 10485760     # Max request size (10MB)


discovery:
  type: "static"
  refresh_interval: 5m
  model_discovery:
    enabled: true
    interval: 5m
    timeout: 30s
    concurrent_workers: 5
    retry_attempts: 3
    retry_backoff: 5s
  static:
    endpoints:
      - url: "http://172.17.0.1:12434/engines/llama.cpp/"
        name: "docker-model-loader"
        type: "openai"
        priority: 100
        model_url: "/v1/models"
        health_check_url: "http://172.17.0.1:12434/engines/llama.cpp/v1/models"
        check_interval: 15s
        check_timeout: 10s

model_registry:
  type: "memory"
  enable_unifier: true
  routing_strategy:
    type: "strict"       # strict, optimistic, or discovery
    options:
      fallback_behavior: "all"     # compatible_only, all, or none
      discovery_timeout: 2s
      discovery_refresh_on_miss: false
  unification:
    enabled: true
    stale_threshold: 24h   # Model retention time
    cleanup_interval: 10m  # Cleanup frequency
    cache_ttl: 5m
    custom_rules: []

logging:
  level: "debug"
  format: "text"
  output: "stdout"

So internal endpoints are working fine:

/internal/status/endpoints

{
  "timestamp": "2025-11-05T06:52:26.31293467Z",
  "endpoints": [
    {
      "name": "docker-model-loader",
      "type": "openai",
      "status": "healthy",
      "last_model_sync": "2m ago",
      "health_check": "25s ago",
      "response_time": "2ms",
      "success_rate": "100%",
      "priority": 100,
      "model_count": 2,
      "request_count": 3
    }
  ],
  "total_count": 1,
  "healthy_count": 1,
  "routable_count": 1
}

/olla/models

{
  "object": "list",
  "data": [
    {
      "olla": {
        "family": "",
        "variant": "",
        "parameter_size": "",
        "quantization": "",
        "aliases": [
          "ai/gpt-oss:20B-UD-Q6_K_XL"
        ],
        "availability": [
          {
            "endpoint": "docker-model-loader",
            "state": "unknown"
          }
        ],
        "capabilities": [
          "text-generation"
        ]
      },
      "id": "ai/gpt-oss:20B-UD-Q6_K_XL",
      "object": "model",
      "owned_by": "olla",
      "created": 1762324251
    },
    {
      "olla": {
        "family": "",
        "variant": "",
        "parameter_size": "",
        "quantization": "",
        "aliases": [
          "hf.co/unsloth/qwen3-coder-30b-a3b-instruct-gguf:q4_k_xl"
        ],
        "availability": [
          {
            "endpoint": "docker-model-loader",
            "state": "unknown"
          }
        ],
        "capabilities": [
          "text-generation",
          "code-generation",
          "programming",
          "code-completion",
          "instruction-following",
          "chat"
        ]
      },
      "id": "hf.co/unsloth/qwen3-coder-30b-a3b-instruct-gguf:q4_k_xl",
      "object": "model",
      "owned_by": "olla",
      "created": 1762324251
    }
  ]
}

But the external routes don't:

/olla/openai/v1/models

{
  "object": "list",
  "data": []
}

/olla/openai/v1

404 page not found
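One suspicion on my side (purely a guess, not verified): the trailing slash in the endpoint url, combined with a model_url that starts with a slash, might not join into the path I expect. A variant of the discovery block worth trying:

```yaml
discovery:
  static:
    endpoints:
      - url: "http://172.17.0.1:12434/engines/llama.cpp"   # no trailing slash
        name: "docker-model-loader"
        type: "openai"
        priority: 100
        model_url: "/v1/models"
        health_check_url: "http://172.17.0.1:12434/engines/llama.cpp/v1/models"
        check_interval: 15s
        check_timeout: 10s
```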

How can I get the openai or ollama endpoints up and running?

Best regards
Torsten

Metadata

Labels

  • bug: Something isn't working
  • documentation: Improvements or additions to documentation
  • investigating: We're actively investigating the issue.
  • llm-backend: Issue is about an LLM Backend, provider or type. (Eg. Ollama, vllm)
  • routing: This issue is with routing
