Bug Description
In multi-model router mode (--models-dir), the internal HTTP proxy client uses hardcoded timeout values instead of respecting the --timeout CLI parameter. This causes "Failed to read connection" errors when processing large prompts with slow models.
Steps to Reproduce
- Start server in router mode:
  llama-server --models-dir ./models --timeout 36000 --port 8087
- Send a request to a large model (e.g., 70B+ parameters on CPU)
- Observe error after ~5 minutes:
  http client error: Failed to read connection
Expected Behavior
The proxy should respect the --timeout parameter, as single-model mode (-m model.gguf) does.
Actual Behavior
The proxy uses hardcoded timeouts:
- Connection timeout: 200ms
- Read/Write timeout: httplib default (~300 seconds)
This makes router mode unusable for large CPU models where a single request can take several hours.
Environment
- Large models (70B-235B parameters) running on CPU
- Long context prompts (10k+ tokens)
- Request processing time: 30 minutes to several hours
Suggested Fix
In tools/server/server-models.cpp, the server_http_proxy constructor should receive and use timeout_read from base_params instead of hardcoded values:
// Current code:
cli->set_connection_timeout(0, 200000); // 200 milliseconds
// Should be:
cli->set_connection_timeout(timeout_seconds);
cli->set_read_timeout(timeout_seconds);
cli->set_write_timeout(timeout_seconds);
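As a rough illustration of the suggested fix, the three setters could be grouped into one helper that the proxy constructor calls with the configured timeout. This is only a sketch under assumptions: apply_proxy_timeout is a hypothetical name, not an existing llama.cpp function, and the guard for non-positive values is an illustrative choice rather than current behavior. The helper is templated so that it compiles without httplib.h; it only assumes the client exposes set_connection_timeout, set_read_timeout and set_write_timeout taking (seconds, microseconds), as httplib::Client does.

```cpp
#include <ctime>

// Hypothetical helper (not part of llama.cpp): apply one timeout, in seconds,
// to the connection, read and write phases of an httplib-style client.
// Returns false and leaves the client untouched when the value is not positive,
// so the client's own defaults stay in effect (an assumption made for this sketch).
template <typename Client>
bool apply_proxy_timeout(Client & cli, int timeout_seconds) {
    if (timeout_seconds <= 0) {
        return false;
    }
    const time_t sec = static_cast<time_t>(timeout_seconds);
    cli.set_connection_timeout(sec, 0); // (seconds, microseconds)
    cli.set_read_timeout(sec, 0);
    cli.set_write_timeout(sec, 0);
    return true;
}
```

Inside the constructor this might be called as apply_proxy_timeout(*cli, timeout_seconds), where timeout_seconds is taken from the --timeout setting; how that value reaches the constructor is left open here.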
Workaround
Use single-model mode (-m model.gguf) instead of router mode (--models-dir).