Performance Issues #40

@vk2r

Description

Hi, sorry for bothering you again. I may not have set everything up correctly, and I can't work out how to configure my environment for good performance.

I'm using Proxmox, with two LXC containers:

  • One running OpenWebUI
  • Another running the models, where I have Ollama along with the Olla and Sherpa engines in a single docker-compose setup.

When I connect OpenWebUI directly to Ollama, I receive a smooth and continuous stream of responses without any issues.

However, when I connect OpenWebUI through either the Olla or the Sherpa engine, the response stream arrives in bursts, which makes for a poor user experience.

I've tested both engines (Olla and Sherpa), but I still haven't found a configuration that delivers smooth, consistent streaming.
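To make "bursty" measurable rather than subjective, a small script can time the gaps between chunks of a streaming response and compare the direct Ollama endpoint against the proxy. This is a diagnostic sketch, not part of any of the projects involved; the URLs and model name are assumptions matching the setup described above (Ollama on `:11434`, the proxy on `:40114`), so adjust them to your hosts.

```python
# Sketch: quantify streaming burstiness by timing chunk arrivals.
# A smooth stream has a max gap close to the mean gap; a bursty one
# has a few long stalls followed by large flushes.
import json
import time
import urllib.request


def stream_chunk_gaps(url, payload):
    """POST a streaming generation request and return the list of
    inter-chunk arrival gaps in seconds."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    gaps, last = [], None
    with urllib.request.urlopen(req) as resp:
        while True:
            chunk = resp.read(4096)  # read whatever the server has flushed
            if not chunk:
                break
            now = time.monotonic()
            if last is not None:
                gaps.append(now - last)
            last = now
    return gaps


def burst_stats(gaps):
    """Summarise gaps: compare max against mean to spot burstiness."""
    if not gaps:
        return {"max": 0.0, "mean": 0.0}
    return {"max": max(gaps), "mean": sum(gaps) / len(gaps)}


if __name__ == "__main__":
    # Hypothetical endpoints/model for this setup; change as needed.
    payload = {"model": "llama3", "prompt": "Count to twenty.", "stream": True}
    for name, url in [
        ("direct", "http://ollama:11434/api/generate"),
        ("proxy", "http://localhost:40114/api/generate"),
    ]:
        print(name, burst_stats(stream_chunk_gaps(url, payload)))
```

Running it against both endpoints with the same prompt should show whether the proxy path has markedly larger maximum gaps than the direct path.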

This is my configuration:

```yaml
server:
  host: "0.0.0.0"
  port: 40114
  read_timeout: "5m"
  write_timeout: "0s"  # No timeout for streaming
  shutdown_timeout: "30s"
  request_logging: false  # Reduce log volume

request_limits:
  max_body_size: 104857600    # 100MB
  max_header_size: 1048576    # 1MB

rate_limits:
  global_requests_per_minute: 10000
  per_ip_requests_per_minute: 1000
  burst_size: 50
  health_requests_per_minute: 10000
  cleanup_interval: "5m"
  trust_proxy_headers: true  # Behind load balancer
  trusted_proxy_cidrs:
    - "10.0.0.0/8"
    - "172.16.0.0/12"

proxy:
  engine: "olla"  # High-performance engine
  load_balancer: "least-connections"
  connection_timeout: "30s"
  response_timeout: "30m"  # Long timeout for slow models
  read_timeout: "5m"
  max_retries: 3
  retry_backoff: "1s"
  stream_buffer_size: 65536  # 64KB for better streaming

discovery:
  type: "static"
  refresh_interval: 30s
  static:
    endpoints:
      - url: "http://ollama:11434"
        name: "server"
        type: "ollama"
        priority: 100
        model_url: "/api/tags"
        health_check_url: "/"
        check_interval: 2s
        check_timeout: 1s

      - url: "http://desktop:11434"
        name: "desktop"
        type: "ollama"
        priority: 10
        model_url: "/api/tags"
        health_check_url: "/"
        check_interval: 2s
        check_timeout: 1s

  model_discovery:
    enabled: true
    interval: 5m
    timeout: 30s
    concurrent_workers: 10
    retry_attempts: 3
    retry_backoff: 1s

model_registry:
  type: "memory"
  enable_unifier: true
  unification:
    enabled: true
    stale_threshold: 24h  # How long to keep models in memory after last seen
    cleanup_interval: 10m  # How often to check for stale models

logging:
  level: "info"  # debug, info, warn, error
  format: "json"  # json, text
  output: "stdout"  # stdout, file

engineering:
  show_nerdstats: false
```
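One knob already present above that directly affects stream pacing is `stream_buffer_size`. If the proxy tends to flush to the client only as its buffer fills (an assumption about the engine's behaviour, worth verifying against the Olla docs), a 64KB buffer can accumulate many small tokens before anything is sent, which would look exactly like bursting. A smaller buffer is a cheap experiment that trades a little throughput for latency:

```yaml
# Hypothetical tuning experiment, not a confirmed fix:
proxy:
  engine: "olla"
  stream_buffer_size: 4096  # 4KB: flush more often, smoother token stream
```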

I would really appreciate any guidance or help on how to properly set this up and improve performance. Thank you!

Metadata

Labels: bug (Something isn't working), resolved (Issue is resolved and merged)
