Description
Hi, sorry to bother you again. I'm not sure whether I've set everything up correctly, and I haven't figured out how to configure my environment for good streaming performance.
I'm using Proxmox, with two LXC containers:
- One running OpenWebUI
- Another running the models, where I have Ollama and Olla together in a single docker-compose setup (I've tried both the sherpa and olla proxy engines).
When I connect OpenWebUI directly to Ollama, I receive a smooth and continuous stream of responses without any issues.
However, when I connect OpenWebUI through Olla, the response arrives in bursts rather than as a steady stream, which makes for a poor user experience.
I've tested both proxy engines (olla and sherpa), but I still haven't found a configuration that delivers consistently smooth streaming.
This is my configuration:
```yaml
server:
  host: "0.0.0.0"
  port: 40114
  read_timeout: "5m"
  write_timeout: "0s"  # No timeout for streaming
  shutdown_timeout: "30s"
  request_logging: false  # Reduce log volume
  request_limits:
    max_body_size: 104857600  # 100MB
    max_header_size: 1048576  # 1MB
  rate_limits:
    global_requests_per_minute: 10000
    per_ip_requests_per_minute: 1000
    burst_size: 50
    health_requests_per_minute: 10000
    cleanup_interval: "5m"
    trust_proxy_headers: true  # Behind load balancer
    trusted_proxy_cidrs:
      - "10.0.0.0/8"
      - "172.16.0.0/12"

proxy:
  engine: "olla"  # High-performance engine
  load_balancer: "least-connections"
  connection_timeout: "30s"
  response_timeout: "30m"  # Long timeout for slow models
  read_timeout: "5m"
  max_retries: 3
  retry_backoff: "1s"
  stream_buffer_size: 65536  # 64KB for better streaming

discovery:
  type: "static"
  refresh_interval: 30s
  static:
    endpoints:
      - url: "http://ollama:11434"
        name: "server"
        type: "ollama"
        priority: 100
        model_url: "/api/tags"
        health_check_url: "/"
        check_interval: 2s
        check_timeout: 1s
      - url: "http://desktop:11434"
        name: "desktop"
        type: "ollama"
        priority: 10
        model_url: "/api/tags"
        health_check_url: "/"
        check_interval: 2s
        check_timeout: 1s
  model_discovery:
    enabled: true
    interval: 5m
    timeout: 30s
    concurrent_workers: 10
    retry_attempts: 3
    retry_backoff: 1s

model_registry:
  type: "memory"
  enable_unifier: true
  unification:
    enabled: true
    stale_threshold: 24h   # How long to keep models in memory after last seen
    cleanup_interval: 10m  # How often to check for stale models

logging:
  level: "info"    # debug, info, warn, error
  format: "json"   # json, text
  output: "stdout" # stdout, file

engineering:
  show_nerdstats: false
```
I would really appreciate any guidance on how to set this up properly and improve streaming performance. Thank you!
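To make the burstiness easier to compare, here is a small diagnostic sketch I've been using to time chunk arrivals from the streaming endpoint, once directly against Ollama and once through the proxy. The URLs, model name, and prompt are placeholders for my setup; adjust them to yours. It only uses the standard Ollama `/api/generate` streaming endpoint, reading one NDJSON line per chunk.

```python
import json
import time
import urllib.request

# Placeholder endpoints -- adjust to your environment.
OLLAMA_URL = "http://ollama:11434/api/generate"   # direct to Ollama
PROXY_URL = "http://localhost:40114/api/generate" # via Olla

def gap_stats(timestamps):
    """Summarise inter-chunk gaps (in seconds) from a list of arrival times."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if not gaps:
        return {"chunks": len(timestamps), "max_gap": 0.0, "mean_gap": 0.0}
    return {
        "chunks": len(timestamps),
        "max_gap": max(gaps),
        "mean_gap": sum(gaps) / len(gaps),
    }

def measure(url, model="llama3", prompt="Count to twenty."):
    """Stream one generation and record when each response chunk arrives."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    arrivals = []
    with urllib.request.urlopen(req) as resp:
        for _ in resp:  # Ollama streams one NDJSON line per chunk
            arrivals.append(time.monotonic())
    return gap_stats(arrivals)

if __name__ == "__main__":
    for name, url in [("direct", OLLAMA_URL), ("proxied", PROXY_URL)]:
        print(name, measure(url))
```

In the proxied case I see a much larger `max_gap` for the same model and prompt, which is the bursting I'm describing above.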