fix(studio): adopt server-loaded model before chat auto-load by jimdawdy-hub · Pull Request #5900 · unslothai/unsloth

jimdawdy-hub · 2026-05-31T13:53:40Z

Summary

When the chat checkpoint is empty, sync from `/api/inference/status` before running `autoLoadSmallestModel()`.
This prevents the model loaded by `unsloth studio run -m …` from being replaced by the smallest cached GGUF (or the Gemma fallback download) when the first chat message is sent.
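The ordering described in the summary can be sketched as a pure decision function. This is a minimal illustration under stated assumptions, not the code in `chat-adapter.ts`: `decideFirstSend`, `ServerStatus`, and `FirstSendAction` are hypothetical names, and the status is assumed to expose a nullable `active_model` field like the `/api/inference/status` payload quoted in the verification logs.

```typescript
// Hypothetical types: the real Studio store and API types may differ.
type ChatCheckpoint = string | null;

interface ServerStatus {
  // Assumed to be null while nothing has finished loading on the server.
  active_model: string | null;
}

type FirstSendAction =
  | { kind: "use-checkpoint"; model: string }
  | { kind: "adopt-server-model"; model: string }
  | { kind: "auto-load-smallest" };

// Decide what the first chat send should do, given the UI checkpoint and the
// latest inference status (null if the status request failed).
function decideFirstSend(
  checkpoint: ChatCheckpoint,
  status: ServerStatus | null,
): FirstSendAction {
  // A model already selected in the UI always wins.
  if (checkpoint) {
    return { kind: "use-checkpoint", model: checkpoint };
  }
  // Empty checkpoint: prefer whatever the server already has loaded,
  // e.g. the model started with `unsloth studio run -m ...`.
  if (status?.active_model) {
    return { kind: "adopt-server-model", model: status.active_model };
  }
  // Only fall back to auto-loading when the server has nothing active.
  return { kind: "auto-load-smallest" };
}
```

Keeping the decision pure makes the three outcomes testable without a running server; in this sketch a `null` status falls through to auto-load.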

Problem

If the browser polls `/api/inference/status` before the CLI load finishes, the status reports no active model yet, so the UI has no chat checkpoint. Sending a message then triggers auto-load, which can unload the CLI-loaded model and load a different one (reproduced with Qwen loaded via the CLI and then replaced by Gemma via UI auto-load).

Reproduction logs (Arch Linux, dual RTX 5060 Ti, 2026-05-31)

Command:

unsloth studio run \
  -m unsloth/Qwen3.6-27B-MTP-GGUF \
  --gguf-variant UD-IQ2_XXS \
  --max-seq-length 8192 \
  --no-mmproj \
  --threads 12 --threads-batch 8 --threads-http 4 --threads-draft 4

| Time | Event |
|---|---|
| 13:27:29 | CLI starts `POST /api/inference/load` for Qwen |
| 13:27:34 | Browser `GET /api/inference/status`, 5 s before the load completes |
| 13:27:39 | CLI: `Model loaded: unsloth/Qwen3.6-27B-MTP-GGUF (UD-IQ2_XXS)` |
| 13:28:01 | First chat message → `GET /api/models/cached-gguf` (auto-load path) |
| 13:28:02 | UI `POST /api/inference/load` for `unsloth/gemma-4-E2B-it-GGUF`; Qwen unloaded |
| 13:28:14 | Chat hits `/v1/chat/completions` on Gemma, not Qwen |

Early status poll (no active model yet):

{"timestamp": "2026-05-31T13:27:34.249462Z", "event": "request_completed", "method": "GET", "path": "/api/inference/status", "status_code": 200, "process_time_ms": 43.35}

Auto-load replaces Qwen with Gemma:

{"timestamp": "2026-05-31T13:28:02.197259Z", "event": "Detected remote GGUF repo 'unsloth/gemma-4-E2B-it-GGUF', variant=UD-Q4_K_XL, vision=True"}
{"timestamp": "2026-05-31T13:28:02.349969Z", "event": "Not inheriting llama_extra_args: stored args came from ('unsloth/Qwen3.6-27B-MTP-GGUF', 'UD-IQ2_XXS'), loading ('unsloth/gemma-4-E2B-it-GGUF', 'UD-Q4_K_XL')"}
{"timestamp": "2026-05-31T13:28:13.976568Z", "event": "Loaded GGUF model via llama-server: unsloth/gemma-4-E2B-it-GGUF"}

Test plan

- Start `unsloth studio run -m unsloth/Qwen3.6-27B-MTP-GGUF --gguf-variant UD-IQ2_XXS …` and open the UI before "Model loaded" appears.
- Send a chat message without manually selecting a model.
- Confirm the top bar shows Qwen (not Gemma) and the llama-server logs show no second `/api/inference/load` for a different repo.
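The last check in the test plan can be partly automated by scanning the Studio log. A minimal sketch, assuming one JSON object per line with an `event` field, as in the excerpts quoted in this PR; `loadedModels` and `onlyLoaded` are hypothetical helpers, not part of Studio.

```typescript
// Each log line quoted in this PR is a JSON object with at least an "event" field.
interface LogRecord {
  timestamp?: string;
  event: string;
}

const LOADED_PREFIX = "Loaded GGUF model via llama-server: ";

// Return the repo of every "Loaded GGUF model via llama-server" event, in order.
function loadedModels(lines: string[]): string[] {
  const models: string[] = [];
  for (const line of lines) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("{")) continue; // skip non-JSON output
    let record: LogRecord;
    try {
      record = JSON.parse(trimmed) as LogRecord;
    } catch {
      continue; // tolerate truncated or malformed lines
    }
    if (typeof record.event === "string" && record.event.startsWith(LOADED_PREFIX)) {
      models.push(record.event.slice(LOADED_PREFIX.length).trim());
    }
  }
  return models;
}

// True when at least one model was loaded and every load was the expected repo.
function onlyLoaded(lines: string[], expectedRepo: string): boolean {
  const loads = loadedModels(lines);
  return loads.length > 0 && loads.every((m) => m === expectedRepo);
}
```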

Verification (2026-05-31, patched editable install)

See the verification comment on the PR. Summary:

- No Gemma auto-load after `studio run -m` with Qwen (only the helper pre-cache; no inference load for gemma-4-E2B)
- `--no-mmproj` honored: no mmproj download; llama-server launched with `--no-mmproj` only
- Reload inherits args: `Inheriting llama_extra_args ... ['--no-mmproj']` on same-model reload
- 165 `test_llama_server_args.py` tests passed

When the user starts Studio via `studio run -m`, the web UI could still auto-load a different cached GGUF on the first message because the chat checkpoint was empty. Sync from /api/inference/status before falling back to autoLoadSmallestModel so CLI-loaded models are not replaced. Co-authored-by: Cursor <cursoragent@cursor.com>

gemini-code-assist

Code Review

This pull request introduces a mechanism to adopt an already active model on the inference server into the chat UI checkpoint, avoiding unnecessary model reloading. This is implemented via the new tryAdoptServerActiveModel function, which is integrated into the auto-loading flow. The review feedback highlights two main improvements: wrapping tryAdoptServerActiveModel in a try-catch block to prevent bypassing necessary cleanup logic during errors, and simplifying a redundant conditional check in tryAdoptServerActiveModel which also allows for the removal of an unused import.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 559082ea9f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

jimdawdy-hub · 2026-05-31T14:17:05Z

Verification logs (patched branch, Arch Linux, 2026-05-31)

Installed from editable checkout: pip install -e /home/jim/Projects/unsloth (branch fix/studio-poll-cli-model-load, includes #5900 + #5901 + #5902).

Command:

unsloth studio run \
  -m unsloth/Qwen3.6-27B-MTP-GGUF \
  --gguf-variant UD-IQ2_XXS \
  --max-seq-length 8192 \
  --no-mmproj \
  --port 8889 --host 127.0.0.1 --silent

#5902 — `--no-mmproj` honored; no mmproj download on CLI load

Before (repro): Downloading mmproj: unsloth/Qwen3.6-27B-MTP-GGUF/mmproj-F16.gguf

After (patched):

{"timestamp": "2026-05-31T14:14:14.354872Z", "event": "Vision-capable GGUF loaded without a usable mmproj; image input will be disabled for this session"}
{"timestamp": "2026-05-31T14:14:14.355179Z", "event": "Appending user extra args to llama-server: ['--no-mmproj']"}
{"timestamp": "2026-05-31T14:14:14.355224Z", "event": "Starting llama-server: ... --no-mmproj"}

(No `Downloading mmproj` line; the llama-server command ends with `--no-mmproj`, not `--mmproj`.)
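This check can be written against the llama-server argument list. A sketch that assumes the launched arguments can be recovered from the `Starting llama-server: ...` event (the log lines above elide most of them with `...`); `argvFromStartEvent` and `mmprojDisabled` are hypothetical helpers, and only the long `--mmproj` spelling is checked.

```typescript
const START_PREFIX = "Starting llama-server: ";

// Split the command from a "Starting llama-server: ..." event into tokens.
// Naive whitespace split: quoted arguments containing spaces are not handled.
function argvFromStartEvent(event: string): string[] {
  if (!event.startsWith(START_PREFIX)) return [];
  return event
    .slice(START_PREFIX.length)
    .split(/\s+/)
    .filter((t) => t.length > 0);
}

// True when the projector is disabled and no explicit "--mmproj <file>" is passed.
function mmprojDisabled(argv: readonly string[]): boolean {
  const hasNoMmproj = argv.indexOf("--no-mmproj") !== -1;
  const passesMmproj = argv.indexOf("--mmproj") !== -1;
  return hasNoMmproj && !passesMmproj;
}
```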

#5902 — UI reload inherits `llama_extra_args`

{"timestamp": "2026-05-31T14:15:33.870718Z", "event": "Inheriting llama_extra_args from previous load (same model, shadow-stripped): ['--no-mmproj']"}
{"timestamp": "2026-05-31T14:15:35.282062Z", "event": "Starting llama-server: ... --no-mmproj"}
{"timestamp": "2026-05-31T14:16:01.740749Z", "event": "Loaded GGUF model via llama-server: unsloth/Qwen3.6-27B-MTP-GGUF"}

Reload request: POST /api/inference/load with {"model_path":"unsloth/Qwen3.6-27B-MTP-GGUF","gguf_variant":"UD-IQ2_XXS"} (no llama_extra_args field).
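The inheritance rule visible in the logs (`Not inheriting ...` when the stored args came from a different `(model, variant)` pair, `Inheriting ...` for the same pair) can be illustrated in TypeScript. This is a sketch, not the backend implementation changed in #5902: the types and `resolveExtraArgs` are hypothetical, the "explicit args win" branch is an assumption, and the shadow-stripping step mentioned in the log is not modeled.

```typescript
// Shape of the reload request quoted above; llama_extra_args is optional.
interface LoadRequest {
  model_path: string;
  gguf_variant?: string;
  llama_extra_args?: string[];
}

// Hypothetical record of the previous load kept by the server.
interface PreviousLoad {
  model_path: string;
  gguf_variant?: string;
  llama_extra_args: string[];
}

// Explicit args win (assumption); otherwise inherit the previous args only
// when the (model, variant) pair is identical.
function resolveExtraArgs(req: LoadRequest, prev: PreviousLoad | null): string[] {
  if (req.llama_extra_args !== undefined) return req.llama_extra_args;
  if (
    prev !== null &&
    prev.model_path === req.model_path &&
    (prev.gguf_variant ?? null) === (req.gguf_variant ?? null)
  ) {
    return prev.llama_extra_args.slice(); // copy so callers cannot mutate the stored list
  }
  return [];
}
```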

#5900 / #5901 — no Gemma auto-load; Qwen stays active

Before (repro):

{"timestamp": "2026-05-31T13:28:02.197259Z", "event": "Detected remote GGUF repo 'unsloth/gemma-4-E2B-it-GGUF', variant=UD-Q4_K_XL, vision=True"}

After (patched session): no `Detected remote GGUF repo 'unsloth/gemma-4-E2B-it-GGUF'` event and no inference load of Gemma; only the background helper pre-cache:

{"timestamp": "2026-05-31T14:14:13.453443Z", "event": "Pre-caching helper GGUF: unsloth/gemma-4-E2B-it-GGUF/gemma-4-E2B-it-UD-Q4_K_XL.gguf"}

Status after CLI load:

{"active_model":"unsloth/Qwen3.6-27B-MTP-GGUF","model_identifier":"unsloth/Qwen3.6-27B-MTP-GGUF","gguf_variant":"UD-IQ2_XXS","is_vision":false}
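Adopting the CLI model amounts to mapping this status payload onto a chat checkpoint. A sketch assuming only the four fields shown above; `InferenceStatus`, `AdoptedCheckpoint`, and `checkpointFromStatus` are hypothetical, and preferring `model_identifier` over `active_model` for the id is an assumption (the PR's `resolveInferenceCheckpointId` may choose differently).

```typescript
// Fields taken from the status payload above; other fields may exist.
interface InferenceStatus {
  active_model: string | null;
  model_identifier?: string;
  gguf_variant?: string | null;
  is_vision?: boolean;
}

// Hypothetical checkpoint record kept by the chat UI.
interface AdoptedCheckpoint {
  id: string;
  ggufVariant: string | null;
  isVision: boolean;
}

// Map a status payload to a checkpoint, or null when nothing is loaded.
function checkpointFromStatus(status: InferenceStatus): AdoptedCheckpoint | null {
  const id = status.model_identifier ?? status.active_model;
  if (!id) return null;
  return {
    id,
    ggufVariant: status.gguf_variant ?? null,
    isVision: status.is_vision ?? false,
  };
}
```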

Chat completion stayed on Qwen:

{"model":"unsloth/Qwen3.6-27B-MTP-GGUF","choices":[{"delta":{"content":"The user is asking me to reply with exactly \"OK\"..."}}]}

`POST /api/inference/load` count in the session: 1 CLI load + 1 intentional same-model reload test; no Gemma load.

Unit tests

pytest studio/backend/tests/test_llama_server_args.py — 165 passed

Extract shared inference-status hydration for refresh() and CLI adopt paths so the first chat turn gets reasoning/tools flags. Wrap auto-load (including adopt) in try/catch for image-edit cleanup, and drop the redundant adopt call in run(). Co-authored-by: Cursor <cursoragent@cursor.com>
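The caller-side wrapping described in this commit can be sketched as follows. Assumptions: `ensureModelForSend`, `adopt`, `autoLoad`, and `cleanup` are hypothetical stand-ins; the real image-edit cleanup in `chat-adapter.ts` is not shown in this PR, so `cleanup` is just a callback here.

```typescript
// Ensure a model is ready before sending. Adoption and auto-load share one
// try/catch, so cleanup runs no matter which step throws.
async function ensureModelForSend(
  adopt: () => Promise<boolean>,
  autoLoad: () => Promise<void>,
  cleanup: () => void,
): Promise<void> {
  try {
    const adopted = await adopt();
    if (!adopted) {
      await autoLoad();
    }
  } catch (err) {
    cleanup(); // e.g. reset pending image-edit state
    throw err; // still surface the failure to the caller
  }
}
```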

jimdawdy-hub · 2026-06-09T04:21:56Z

All three Codex concerns (error handling for getInferenceStatus() failures, redundant checkpoint guard, and hydrating adopted CLI model capabilities) were addressed in commits 559082e and 4e42732. No open review threads remain.

@rolandtannous — ready for review when you get a chance. CI is showing action_required; maintainer approval to run the workflows would be appreciated.

Resolve conflicts by keeping shared apply-inference-status-to-store hydration while adopting main's refresh/load paths. Co-authored-by: Cursor <cursoragent@cursor.com>

jimdawdy-hub · 2026-06-10T01:05:28Z

Merge conflict resolution

Merged latest main (436525d6) into fix/studio-respect-server-active-model.

Conflicts resolved:

- chat-adapter.ts: kept the adopt-before-auto-load comment and the existing try/catch around autoLoadSmallestModel().
- use-chat-model-runtime.ts: kept the shared apply-inference-status-to-store hydration path (resolveInferenceCheckpointId + applyActiveModelStatusToStore) instead of duplicating the inline refresh block from main.

The shared helper already includes resolveToolsEnabledOnLoad, speculative-type normalization, and Qwen reasoning defaults, so adopt + refresh stay in sync.
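The point of the shared helper is that both status readers hydrate the store through one function. A minimal sketch: `hydrateFromStatus`, `refresh`, `adopt`, and the `supports_reasoning` / `supports_tools` fields are hypothetical (the PR's actual helpers are `resolveInferenceCheckpointId` and `applyActiveModelStatusToStore`, whose signatures are not shown), and the Qwen reasoning defaults and speculative-type normalization are omitted.

```typescript
// Hypothetical subset of the status payload; real field names may differ.
interface StatusPayload {
  active_model: string | null;
  supports_reasoning?: boolean;
  supports_tools?: boolean;
}

// Hypothetical slice of the chat store written by status hydration.
interface ChatModelState {
  checkpoint: string | null;
  reasoningEnabled: boolean;
  toolsEnabled: boolean;
}

// Single hydration step shared by every path that reads inference status.
function hydrateFromStatus(status: StatusPayload): ChatModelState {
  return {
    checkpoint: status.active_model,
    reasoningEnabled: status.supports_reasoning ?? false,
    toolsEnabled: status.supports_tools ?? false,
  };
}

// Periodic refresh: mirrors the server.
function refresh(status: StatusPayload): ChatModelState {
  return hydrateFromStatus(status);
}

// First-send adopt: only fills an empty checkpoint, but hydrates the same way,
// so the first turn gets the same reasoning/tools flags a refresh would set.
function adopt(current: ChatModelState, status: StatusPayload): ChatModelState {
  if (current.checkpoint || !status.active_model) return current;
  return hydrateFromStatus(status);
}
```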

Review threads: all three Codex/Gemini threads are resolved (redundant checkpoint guard was fixed in 4e42732).

PR is MERGEABLE; waiting on pre-commit.ci.


danielhanchen · 2026-06-11T14:33:00Z

Pushed 4cd23a6 with two small hardening changes to tryAdoptServerActiveModel():

- getInferenceStatus() failures are now caught and treated as no adoption, so a status endpoint hiccup falls back to the normal auto-load path instead of failing the first send.
- The checkpoint is re-checked after the await, so a model the user selects while the status request is in flight is never overwritten by the adoption path.
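The two guards can be sketched as follows. This is a hedged illustration modeled on the PR's `tryAdoptServerActiveModel()`, not its actual code: the `CheckpointStore` interface, the `AdoptResult` values, and the name `adoptIfServerHasModel` are hypothetical, and `getInferenceStatus` is passed in as a dependency so the sketch runs without Studio.

```typescript
interface Status {
  active_model: string | null;
}

// Hypothetical store accessors; the real Studio store API is not shown in this PR.
interface CheckpointStore {
  get(): string | null;
  set(model: string): void;
}

type AdoptResult = "adopted" | "checkpoint-present" | "nothing-to-adopt";

async function adoptIfServerHasModel(
  store: CheckpointStore,
  getInferenceStatus: () => Promise<Status>,
): Promise<AdoptResult> {
  // A model chosen in the UI always takes precedence.
  if (store.get()) return "checkpoint-present";

  let status: Status;
  try {
    status = await getInferenceStatus();
  } catch {
    // Status endpoint hiccup: report "nothing to adopt" so the caller falls
    // back to the normal auto-load path instead of failing the send.
    return "nothing-to-adopt";
  }

  // Re-check after the await: the user may have picked a model while the
  // status request was in flight, and that choice must not be overwritten.
  if (store.get()) return "checkpoint-present";

  if (!status.active_model) return "nothing-to-adopt";
  store.set(status.active_model);
  return "adopted";
}
```

Returning a three-way result lets the caller tell "the user picked a model while we waited" apart from "nothing to adopt", so auto-load runs only in the latter case.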

The shared hydration extraction looks faithful to the original refresh block. Nice catch on the CLI model being replaced by the fallback auto-load.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4dc92f1646


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9899d3bdb8



jimdawdy-hub · 2026-06-11T19:26:29Z

Synced with latest main and cleared open review threads.

Merged origin/main into this branch; PR is mergeable again.
Resolved remaining Codex/Gemini review threads (including items already addressed in @danielhanchen's follow-up commits).

Waiting on maintainer review/approval. pre-commit.ci is the only automated gate visible from fork PRs; GitHub Actions still require maintainer approval for first-time contributors.


jimdawdy-hub requested a review from rolandtannous as a code owner May 31, 2026 13:53

jimdawdy-hub mentioned this pull request May 31, 2026

fix(studio): poll inference status while CLI model is loading #5901

Open

3 tasks

gemini-code-assist Bot reviewed May 31, 2026


Comment thread studio/frontend/src/features/chat/api/chat-adapter.ts Outdated

Comment thread studio/frontend/src/features/chat/api/chat-adapter.ts Outdated

chatgpt-codex-connector Bot reviewed May 31, 2026


Comment thread studio/frontend/src/features/chat/api/chat-adapter.ts Outdated

This was referenced May 31, 2026

fix(studio): inherit llama_extra_args and honor --no-mmproj #5902

Merged

fix(studio): load run.py by path for editable installs #5909

Merged

jimdawdy-hub and others added 2 commits June 2, 2026 19:58

Merge branch 'main' into fix/studio-respect-server-active-model

fd8e809

Merge origin/main into fix/studio-respect-server-active-model

0c2b204


danielhanchen self-assigned this Jun 11, 2026

Guard model adoption against status failures and mid-flight selection…

4cd23a6

… for PR unslothai#5900

Merge main into fix/studio-respect-server-active-model

4dc92f1

chatgpt-codex-connector Bot reviewed Jun 11, 2026


Comment thread studio/frontend/src/features/chat/lib/apply-inference-status-to-store.ts

Merge branch 'main' into fix/studio-respect-server-active-model

9899d3b

chatgpt-codex-connector Bot reviewed Jun 11, 2026


Comment thread studio/frontend/src/features/chat/lib/apply-inference-status-to-store.ts

Merge remote-tracking branch 'origin/main' into fix/studio-respect-se…

ef55a76

…rver-active-model

ci: trigger pre-commit.ci after main merge

8fa331f

Co-authored-by: Cursor <cursoragent@cursor.com>

danielhanchen merged commit 515abca into unslothai:main Jun 12, 2026
1 check passed

danielhanchen merged 9 commits into unslothai:main from jimdawdy-hub:fix/studio-respect-server-active-model
