/model switch to local LM Studio model fails with "Model is unloaded" — pre-flight check blocks JIT loading #3802

@T-Phuong-Nguyen

Environment

  • Qwen Code version: 0.15.6
  • OS: Windows 11 Pro for Workstations 10.0.26200
  • LM Studio version: 0.4.12
  • LM Studio JIT loading: Enabled (confirmed working via direct API calls)
  • Local models: Qwen3.6-27B, Qwen3.6-35B-A3B (via LM Studio on localhost:1234)

Description

After switching via /model to a local model that is downloaded but not currently loaded in LM Studio, the first prompt immediately fails with [API Error: Model is unloaded.], apparently without Qwen Code sending the actual chat completion request.

LM Studio 0.4.x supports Just-In-Time (JIT) model loading: when a chat completion request arrives for an unloaded model, LM Studio automatically loads it into GPU memory and serves the request. This works when the same request is sent directly to the API (see the curl call in the steps below), but fails with Qwen Code because the request is never sent.

Steps to Reproduce

  1. Configure a local model in modelProviders:
{
  "id": "qwen/qwen3.6-35b-a3b",
  "name": "Local Model",
  "envKey": "LMSTUDIO_API_KEY",
  "baseUrl": "http://localhost:1234/v1"
}
  2. Ensure LM Studio is running with JIT enabled but no model loaded:
$ lms ps
No models are currently loaded.
  3. In Qwen Code, switch to the local model:
> /model
→ Select local model
> Say "ready"
✕ [API Error: Model is unloaded.] (Press Ctrl+Y to retry)
  4. The same request via curl succeeds (JIT loads the model and responds):
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer lm-studio" \
  -d '{"model": "qwen/qwen3.6-35b-a3b", "messages": [{"role": "user", "content": "say ready"}], "max_tokens": 5}'
# → 200 OK, model JIT-loaded in ~15s, response returned
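
For anyone who wants to time the JIT load from a script rather than curl, the request above can be rebuilt roughly as follows. This is a minimal sketch: the function names are hypothetical, the base URL, model id and bearer token are copied from the curl call, and the 300-second timeout simply mirrors the generationConfig.timeout value mentioned later in this issue.

```python
import json
import time
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build the same POST that the curl command sends."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 5,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer lm-studio",
        },
    )


def timed_completion(request: urllib.request.Request, timeout: float = 300.0):
    """Send the request and return (parsed JSON reply, elapsed seconds)."""
    start = time.monotonic()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload, time.monotonic() - start


# Usage against a running LM Studio server (not executed here):
# req = build_chat_request("http://localhost:1234/v1", "qwen/qwen3.6-35b-a3b", "say ready")
# reply, seconds = timed_completion(req)
# print(f"answered in {seconds:.1f}s")
```

With no model loaded, the elapsed time of the first call should include the JIT load; a second call right after it should return much faster.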

Root Cause Analysis

Qwen Code appears to perform a pre-flight model state check before sending the chat completion request. It likely:

  1. Queries GET /v1/models or GET /api/v0/models to check model availability
  2. Detects the model's state as "not-loaded"
  3. Returns the error to the user without sending the actual inference request

This prevents LM Studio's JIT loading from ever triggering, since JIT only activates when a chat completion request arrives.
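
To see what such a check would observe, the per-model state can be read from LM Studio directly. The sketch below assumes that GET /api/v0/models returns a JSON object with a data list whose entries carry id and state fields; the helper names are hypothetical and nothing here is taken from Qwen Code's source.

```python
import json
import urllib.request

# Assumed LM Studio REST endpoint; adjust host and port to your setup.
MODELS_URL = "http://localhost:1234/api/v0/models"


def model_states(payload: dict) -> dict:
    """Map model id -> reported state from a /api/v0/models-style payload."""
    return {
        entry["id"]: entry.get("state", "unknown")
        for entry in payload.get("data", [])
    }


def fetch_model_states(url: str = MODELS_URL, timeout: float = 5.0) -> dict:
    """Fetch the model list from a running server and summarise it."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return model_states(json.load(response))


# Hand-written sample payload (assumed shape, not captured from a real server).
sample = {
    "data": [
        {"id": "qwen/qwen3.6-35b-a3b", "state": "not-loaded"},
        {"id": "example/other-model", "state": "loaded"},
    ]
}
print(model_states(sample))
```

If fetch_model_states() reports "not-loaded" for the selected model at the moment Qwen Code shows the error, that would be consistent with the pre-flight hypothesis above, though it would not prove it.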

Expected Behavior

When a model is listed in the provider's model catalog (returned by GET /v1/models), Qwen Code should send the chat completion request regardless of the model's loaded state. The server is responsible for model lifecycle management — the client should not second-guess it.

For JIT-capable servers like LM Studio, the first request may take 10 to 20 seconds while the model loads, but subsequent requests will be fast. The existing generationConfig.timeout (e.g., 300000 ms, i.e. 5 minutes) already leaves room for this.
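
The difference between the suspected current behavior and the expected one can be sketched as two small predicates. Both functions are hypothetical illustrations, not Qwen Code's actual code, and the catalog entries use an assumed id/state shape.

```python
def preflight_blocks(model_id: str, catalog: list) -> bool:
    """Suspected current behavior: refuse when the state is "not-loaded"."""
    for entry in catalog:
        if entry.get("id") == model_id:
            return entry.get("state") == "not-loaded"
    return True  # model not listed at all: nothing to send to


def should_send_request(model_id: str, catalog: list) -> bool:
    """Expected behavior: send whenever the model is listed, ignoring state."""
    return any(entry.get("id") == model_id for entry in catalog)


catalog = [{"id": "qwen/qwen3.6-35b-a3b", "state": "not-loaded"}]
print(preflight_blocks("qwen/qwen3.6-35b-a3b", catalog))     # True
print(should_send_request("qwen/qwen3.6-35b-a3b", catalog))  # True
```

For the same unloaded model, the first predicate stops the request while the second lets it through, leaving the load to the server.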

Impact

This blocks a common local AI workflow: using LM Studio's JIT + Auto-Evict to automatically swap between multiple models that don't fit in GPU memory simultaneously. For example, running both Qwen3.6-27B (17.5GB) and Qwen3.6-35B-A3B (22GB) on a 32GB GPU — JIT + Auto-Evict keeps only one loaded at a time and swaps on demand.
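
The memory arithmetic behind that example, using only the sizes quoted above and ignoring KV cache and runtime overhead (so real headroom is smaller):

```python
GPU_MEMORY_GB = 32.0
model_sizes_gb = {"Qwen3.6-27B": 17.5, "Qwen3.6-35B-A3B": 22.0}

combined = sum(model_sizes_gb.values())
fits_together = combined <= GPU_MEMORY_GB
fits_alone = all(size <= GPU_MEMORY_GB for size in model_sizes_gb.values())

print(combined)       # 39.5
print(fits_together)  # False
print(fits_alone)     # True
```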

Suggested Fix

Skip the pre-flight model state check, or make it optional, when baseUrl points to a local server. Alternatively, add a modelProviders option such as "skipModelStateCheck": true or "allowJitLoading": true so users can opt in.
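
A rough sketch of how the two suggestions could combine: an explicit option wins, and otherwise the check is skipped for loopback base URLs. The skipModelStateCheck key is the proposed option from this issue and does not exist today; the function names are hypothetical, and treating only localhost and loopback IPs as "local" is an assumption.

```python
import ipaddress
from urllib.parse import urlparse


def is_local_base_url(base_url: str) -> bool:
    """True for localhost or a loopback IP such as 127.0.0.1 or ::1."""
    host = urlparse(base_url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # a hostname other than localhost


def skip_state_check(provider: dict) -> bool:
    """Decide whether to bypass the pre-flight model state check."""
    # Proposed (hypothetical) opt-in key takes precedence when present.
    if "skipModelStateCheck" in provider:
        return bool(provider["skipModelStateCheck"])
    return is_local_base_url(provider.get("baseUrl", ""))


print(skip_state_check({"baseUrl": "http://localhost:1234/v1"}))    # True
print(skip_state_check({"baseUrl": "https://api.example.com/v1"}))  # False
```

Keeping the explicit option as an override would also cover LM Studio instances reached over a LAN address, which the loopback heuristic alone would miss.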

Workaround

Manually load the model via lms load <model> before switching in Qwen Code. This defeats the purpose of JIT but works.
