Environment
- Qwen Code version: 0.15.6
- OS: Windows 11 Pro for Workstations 10.0.26200
- LM Studio version: 0.4.12
- LM Studio JIT loading: Enabled (confirmed working via direct API calls)
- Local models: Qwen3.6-27B, Qwen3.6-35B-A3B (via LM Studio on localhost:1234)
Description
When switching via /model to a local model that is downloaded but not currently loaded in LM Studio, Qwen Code immediately returns [API Error: Model is unloaded.] without sending the chat completion request.
LM Studio 0.4.x supports Just-In-Time (JIT) model loading: when a chat completion request arrives for a model that is not loaded, LM Studio automatically loads it into GPU memory and then serves the request. This works with other API clients (for example, the curl request in the steps below) but fails with Qwen Code because the request is never sent.
Steps to Reproduce
- Configure a local model in modelProviders:
{
"id": "qwen/qwen3.6-35b-a3b",
"name": "Local Model",
"envKey": "LMSTUDIO_API_KEY",
"baseUrl": "http://localhost:1234/v1"
}
- Ensure LM Studio is running with JIT enabled but no model loaded:
$ lms ps
No models are currently loaded.
- In Qwen Code, switch to the local model:
> /model
→ Select local model
> Say "ready"
✕ [API Error: Model is unloaded.] (Press Ctrl+Y to retry)
- The same request sent via curl succeeds; JIT loads the model and responds:
curl http://localhost:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer lm-studio" \
-d '{"model": "qwen/qwen3.6-35b-a3b", "messages": [{"role": "user", "content": "say ready"}], "max_tokens": 5}'
# → 200 OK, model JIT-loaded in ~15s, response returned
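The curl step can also be scripted. The following is a minimal Python sketch (standard library only) that builds the same request and times it, so the JIT load delay on the first call is visible. The URL, model ID, and placeholder key come from the curl command above; the helper names build_chat_request and send_and_time are illustrative, not part of any tool.

```python
import json
import time
import urllib.request

# Values copied from the curl command in the steps above.
BASE_URL = "http://localhost:1234/v1"
MODEL_ID = "qwen/qwen3.6-35b-a3b"
API_KEY = "lm-studio"


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int, api_key: str) -> urllib.request.Request:
    """Build the same POST /chat/completions request that curl sends."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send_and_time(req: urllib.request.Request, timeout_s: float = 300.0):
    """Send the request; return (HTTP status, elapsed seconds, parsed JSON body)."""
    start = time.monotonic()
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        status = resp.status
        payload = json.loads(resp.read().decode("utf-8"))
    return status, time.monotonic() - start, payload
```

Calling send_and_time(build_chat_request(BASE_URL, MODEL_ID, "say ready", 5, API_KEY)) against LM Studio with no model loaded should return a 200 status, with the elapsed time including the JIT load; exact timings depend on the hardware.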
Root Cause Analysis
Qwen Code appears to perform a pre-flight model state check before sending the chat completion request. It likely:
- Queries GET /v1/models or GET /api/v0/models to check model availability
- Detects the model's state as "not-loaded"
- Returns the error to the user without sending the actual inference request
This prevents LM Studio's JIT loading from ever triggering, since JIT only activates when a chat completion request arrives.
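To make the suspected failure mode concrete, here is a hypothetical Python reconstruction of such a check. It is not Qwen Code's code; it assumes a catalog response shaped like LM Studio's /api/v0/models output (a "data" list whose entries carry "id" and "state" fields), and the class, function, and constant names are invented for illustration.

```python
from typing import Any, Mapping, Optional


class ModelUnloadedError(RuntimeError):
    """Stands in for the "[API Error: Model is unloaded.]" message."""


def find_model(models_payload: Mapping[str, Any],
               model_id: str) -> Optional[Mapping[str, Any]]:
    """Return the catalog entry whose "id" equals model_id, or None."""
    for entry in models_payload.get("data", []):
        if entry.get("id") == model_id:
            return entry
    return None


def suspected_preflight(models_payload: Mapping[str, Any], model_id: str) -> None:
    """HYPOTHETICAL reconstruction of the suspected pre-flight check.

    It raises before any chat completion request is built, so a JIT-capable
    server never receives the request that would have loaded the model.
    """
    entry = find_model(models_payload, model_id)
    if entry is None:
        raise LookupError(f"Model {model_id!r} is not in the catalog.")
    if entry.get("state") == "not-loaded":
        raise ModelUnloadedError("Model is unloaded.")


# Abridged, assumed shape of an /api/v0/models response with nothing loaded.
EXAMPLE_CATALOG = {
    "object": "list",
    "data": [{"id": "qwen/qwen3.6-35b-a3b", "state": "not-loaded"}],
}
```

With the example catalog, suspected_preflight(EXAMPLE_CATALOG, "qwen/qwen3.6-35b-a3b") raises ModelUnloadedError before any request exists, which matches the symptom in the steps above.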
Expected Behavior
When a model is listed in the provider's model catalog (returned by GET /v1/models), Qwen Code should send the chat completion request regardless of the model's loaded state. The server is responsible for model lifecycle management — the client should not second-guess it.
For JIT-capable servers like LM Studio, the first request may take 10-20 seconds while the model loads, but subsequent requests are fast. The existing generationConfig.timeout setting (e.g., 300000 ms, i.e., 5 minutes) already accommodates this delay.
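The expected behavior boils down to gating on catalog membership only. A minimal Python sketch, assuming an OpenAI-style GET /v1/models response (a "data" list of entries with an "id" field); the function names are illustrative rather than Qwen Code APIs, and the constants are the example values from this issue:

```python
from typing import Any, Mapping

# Example value from this issue: generationConfig.timeout = 300000 ms.
GENERATION_TIMEOUT_MS = 300_000
# Upper end of the first-request JIT load time reported above (10-20 s).
WORST_CASE_JIT_LOAD_S = 20


def is_listed(models_payload: Mapping[str, Any], model_id: str) -> bool:
    """True if model_id appears in the provider catalog."""
    return any(e.get("id") == model_id for e in models_payload.get("data", []))


def should_send_chat_request(models_payload: Mapping[str, Any], model_id: str) -> bool:
    """Expected behavior: send whenever the model is listed.

    Any "state" field is deliberately ignored; loading the model is left
    to the server (e.g. LM Studio JIT).
    """
    return is_listed(models_payload, model_id)


def timeout_covers_jit_load(timeout_ms: int, load_s: float) -> bool:
    """True if the configured timeout leaves room for the JIT load."""
    return timeout_ms / 1000 > load_s
```

With the 300000 ms example timeout (300 s), the client allows 15 times the 20 s worst-case load time reported above.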
Impact
This blocks a common local AI workflow: using LM Studio's JIT + Auto-Evict to swap automatically between models that don't fit in GPU memory simultaneously. For example, Qwen3.6-27B (17.5 GB) and Qwen3.6-35B-A3B (22 GB) need 39.5 GB together, more than a 32 GB GPU provides; with JIT + Auto-Evict, only one is loaded at a time and the other is swapped in on demand.
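The arithmetic is simple: 17.5 GB + 22 GB = 39.5 GB exceeds 32 GB, while each model fits on its own. The toy Python model below illustrates the one-at-a-time swapping. It assumes Auto-Evict keeps at most one JIT-loaded model resident, uses only the sizes listed here, and ignores other memory such as the context/KV cache; it is not LM Studio's implementation, and OneSlotJit is a made-up name.

```python
from typing import Dict, Iterable, Optional

# Sizes and VRAM budget as stated in this issue.
GPU_GB = 32.0
MODEL_SIZES_GB: Dict[str, float] = {"Qwen3.6-27B": 17.5, "Qwen3.6-35B-A3B": 22.0}


def fit_together(sizes_gb: Iterable[float], budget_gb: float) -> bool:
    """True if all listed models could be resident at once."""
    return sum(sizes_gb) <= budget_gb


class OneSlotJit:
    """Toy model of JIT + Auto-Evict: at most one JIT-loaded model at a time.

    Illustration only: requesting a model that is not resident evicts the
    current one and loads the requested one.
    """

    def __init__(self, sizes_gb: Dict[str, float], budget_gb: float) -> None:
        self.sizes_gb = sizes_gb
        self.budget_gb = budget_gb
        self.loaded: Optional[str] = None
        self.loads = 0  # number of (slow) JIT loads performed

    def request(self, model: str) -> str:
        if self.sizes_gb[model] > self.budget_gb:
            raise MemoryError(f"{model} alone exceeds {self.budget_gb} GB")
        if self.loaded != model:
            self.loaded = model  # evict the previous model, load this one
            self.loads += 1
        return self.loaded
```

For the request sequence 27B, 27B, 35B-A3B, 27B, the toy performs three loads: repeated requests to the resident model are served without reloading.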
Suggested Fix
Skip the pre-flight model state check, or make it optional, when baseUrl points to a local server. Alternatively, add a modelProviders option such as "skipModelStateCheck": true or "allowJitLoading": true that lets users opt in.
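A minimal Python sketch of the decision logic, assuming "local" means a loopback host (localhost, 127.0.0.1, ::1). skipModelStateCheck and allowJitLoading are the option names proposed above, not existing Qwen Code settings, and the function names are hypothetical:

```python
import ipaddress
from typing import Any, Mapping
from urllib.parse import urlparse


def is_local_base_url(base_url: str) -> bool:
    """True if the baseUrl host is localhost or a loopback IP address."""
    host = urlparse(base_url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:  # a DNS name other than localhost
        return False


def should_run_state_check(provider: Mapping[str, Any]) -> bool:
    """Decide whether to keep the pre-flight state check for a provider entry."""
    # HYPOTHETICAL opt-in flags proposed in this issue.
    if provider.get("skipModelStateCheck") or provider.get("allowJitLoading"):
        return False
    return not is_local_base_url(provider.get("baseUrl", ""))
```

Restricting the automatic bypass to loopback hosts keeps the check for remote providers; a JIT-capable server elsewhere on the LAN would rely on the explicit opt-in flag instead.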
Workaround
Manually load the model via lms load <model> before switching in Qwen Code. This defeats the purpose of JIT but works.
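Until this is fixed, the workaround can be automated. A minimal Python sketch, assuming LM Studio's /api/v0/models endpoint reports a "state" field per model and that lms load accepts the same model ID used in the API; the helper names are illustrative:

```python
import json
import subprocess
import urllib.request
from typing import Any, List, Mapping

# Native LM Studio endpoint mentioned in the root-cause section.
LMSTUDIO_MODELS_URL = "http://localhost:1234/api/v0/models"


def needs_manual_load(models_payload: Mapping[str, Any], model_id: str) -> bool:
    """True if model_id is listed with any state other than "loaded"."""
    for entry in models_payload.get("data", []):
        if entry.get("id") == model_id:
            return entry.get("state") != "loaded"
    return False  # not listed: there is nothing for lms to load


def lms_load_command(model_id: str) -> List[str]:
    """argv for the manual workaround: lms load <model>."""
    return ["lms", "load", model_id]


def ensure_loaded(model_id: str, models_url: str = LMSTUDIO_MODELS_URL) -> bool:
    """Load model_id via the lms CLI if needed; return True if a load was run."""
    with urllib.request.urlopen(models_url, timeout=10) as resp:
        payload = json.loads(resp.read().decode("utf-8"))
    if needs_manual_load(payload, model_id):
        subprocess.run(lms_load_command(model_id), check=True)
        return True
    return False
```

Calling ensure_loaded("qwen/qwen3.6-35b-a3b") before switching models in Qwen Code runs lms load only when the catalog reports the model as not loaded.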