fix(studio): adopt server-loaded model before chat auto-load #5900

Merged
danielhanchen merged 9 commits into unslothai:main from jimdawdy-hub:fix/studio-respect-server-active-model on Jun 12, 2026
@jimdawdy-hub jimdawdy-hub commented May 31, 2026

Contributor

Summary

  • When the chat checkpoint is empty, sync from /api/inference/status before running autoLoadSmallestModel().
  • Prevents unsloth studio run -m … from being replaced by the smallest cached GGUF (or the Gemma fallback download) on the first chat message.

Problem

If the browser polls status before the CLI load finishes, the UI has no checkpoint. Sending a message then triggers auto-load, which can unload the CLI-loaded model and load a different one (reproduced with Qwen via CLI → Gemma via UI auto-load).
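The intended ordering (use the UI checkpoint, else adopt the server's active model, else auto-load) can be sketched as follows. This is a minimal sketch, not the code in chat-adapter.ts: the `ChatDeps` interface, `ensureModelForChat`, and the return types are hypothetical, and only the names `tryAdoptServerActiveModel`, `autoLoadSmallestModel`, `getInferenceStatus` and the `active_model` / `gguf_variant` fields appear in this PR.

```typescript
// Hypothetical status shape; only the field names come from the PR's logs.
interface InferenceStatus {
  active_model: string | null;
  gguf_variant?: string | null;
}

// Hypothetical dependency bundle so the flow can run without a real store.
interface ChatDeps {
  getCheckpoint(): string | null;                 // current UI checkpoint, null if empty
  setCheckpoint(id: string): void;                // record the active model in the UI
  getInferenceStatus(): Promise<InferenceStatus>; // stands in for GET /api/inference/status
  autoLoadSmallestModel(): Promise<string>;       // fallback load, returns the loaded id
}

// Adopts the server's active model when the UI has none.
// Returns the adopted id, or null when nothing was adopted.
async function tryAdoptServerActiveModel(deps: ChatDeps): Promise<string | null> {
  if (deps.getCheckpoint()) return null; // the UI already has a model
  const status = await deps.getInferenceStatus();
  if (!status.active_model) return null; // the server has nothing loaded yet
  deps.setCheckpoint(status.active_model);
  return status.active_model;
}

// Runs before the first chat message: adopt first, auto-load only as a last resort.
async function ensureModelForChat(deps: ChatDeps): Promise<string> {
  const existing = deps.getCheckpoint();
  if (existing) return existing;
  const adopted = await tryAdoptServerActiveModel(deps);
  if (adopted) return adopted;
  const loaded = await deps.autoLoadSmallestModel();
  deps.setCheckpoint(loaded);
  return loaded;
}
```

With the server reporting Qwen as `active_model`, `ensureModelForChat` returns the Qwen id and never reaches the fallback, which is the behavior the fix aims for.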

Reproduction logs (Arch Linux, dual RTX 5060 Ti, 2026-05-31)

Command:

unsloth studio run \
  -m unsloth/Qwen3.6-27B-MTP-GGUF \
  --gguf-variant UD-IQ2_XXS \
  --max-seq-length 8192 \
  --no-mmproj \
  --threads 12 --threads-batch 8 --threads-http 4 --threads-draft 4

Time      Event
13:27:29  CLI starts POST /api/inference/load for Qwen
13:27:34  Browser GET /api/inference/status, 5s before load completes
13:27:39  CLI: Model loaded: unsloth/Qwen3.6-27B-MTP-GGUF (UD-IQ2_XXS)
13:28:01  First chat message → GET /api/models/cached-gguf (auto-load path)
13:28:02  UI POST /api/inference/load for unsloth/gemma-4-E2B-it-GGUF; Qwen unloaded
13:28:14  Chat hits /v1/chat/completions on Gemma, not Qwen

Early status poll (no active model yet):

{"timestamp": "2026-05-31T13:27:34.249462Z", "event": "request_completed", "method": "GET", "path": "/api/inference/status", "status_code": 200, "process_time_ms": 43.35}

Auto-load replaces Qwen with Gemma:

{"timestamp": "2026-05-31T13:28:02.197259Z", "event": "Detected remote GGUF repo 'unsloth/gemma-4-E2B-it-GGUF', variant=UD-Q4_K_XL, vision=True"}
{"timestamp": "2026-05-31T13:28:02.349969Z", "event": "Not inheriting llama_extra_args: stored args came from ('unsloth/Qwen3.6-27B-MTP-GGUF', 'UD-IQ2_XXS'), loading ('unsloth/gemma-4-E2B-it-GGUF', 'UD-Q4_K_XL')"}
{"timestamp": "2026-05-31T13:28:13.976568Z", "event": "Loaded GGUF model via llama-server: unsloth/gemma-4-E2B-it-GGUF"}

Test plan

  • Start unsloth studio run -m unsloth/Qwen3.6-27B-MTP-GGUF --gguf-variant UD-IQ2_XXS … and open the UI before "Model loaded" appears
  • Send a chat message without manually selecting a model
  • Confirm the top bar shows Qwen (not Gemma) and llama-server logs show no second /api/inference/load for a different repo
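The last check in the test plan can be partly automated by scanning the JSON log lines for loads of an unexpected repo. The sketch below is an assumption-laden helper, not part of the PR: `detectedRepos` and `unexpectedRepos` are hypothetical names, and it relies on the "Detected remote GGUF repo '<repo>'" event text seen in the reproduction logs, which may not be emitted for every kind of load.

```typescript
// Collects repo ids from "Detected remote GGUF repo '<repo>'" events.
// Lines that are not JSON objects with a string "event" field are skipped.
function detectedRepos(logLines: string[]): string[] {
  const pattern = /Detected remote GGUF repo '([^']+)'/;
  const repos: string[] = [];
  for (const line of logLines) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // not JSON, e.g. plain CLI output
    }
    if (typeof parsed !== "object" || parsed === null) continue;
    const event = (parsed as { event?: unknown }).event;
    if (typeof event !== "string") continue;
    const repo = pattern.exec(event)?.[1];
    if (repo) repos.push(repo);
  }
  return repos;
}

// Returns every detected repo that differs from the one passed to `-m`.
function unexpectedRepos(logLines: string[], expectedRepo: string): string[] {
  return detectedRepos(logLines).filter((repo) => repo !== expectedRepo);
}
```

An empty result from `unexpectedRepos(lines, "unsloth/Qwen3.6-27B-MTP-GGUF")` suggests no other repo was loaded for inference. The top bar still needs a manual look.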

Verification (2026-05-31, patched editable install)

See verification comment on the PR — summary:

  • No Gemma auto-load after studio run -m Qwen (only helper pre-cache, no inference load for gemma-4-E2B)
  • --no-mmproj honored — no mmproj download; llama-server launched with --no-mmproj only
  • Reload inherits args: Inheriting llama_extra_args ... ['--no-mmproj'] on same-model reload
  • 165 test_llama_server_args.py tests passed

When the user starts Studio via `studio run -m`, the web UI could still
auto-load a different cached GGUF on the first message because the chat
checkpoint was empty. Sync from /api/inference/status before falling back
to autoLoadSmallestModel so CLI-loaded models are not replaced.

Co-authored-by: Cursor <cursoragent@cursor.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a mechanism to adopt an already active model on the inference server into the chat UI checkpoint, avoiding unnecessary model reloading. This is implemented via the new tryAdoptServerActiveModel function, which is integrated into the auto-loading flow. The review feedback highlights two main improvements: wrapping tryAdoptServerActiveModel in a try-catch block to prevent bypassing necessary cleanup logic during errors, and simplifying a redundant conditional check in tryAdoptServerActiveModel which also allows for the removal of an unused import.

Comment thread studio/frontend/src/features/chat/api/chat-adapter.ts Outdated
Comment thread studio/frontend/src/features/chat/api/chat-adapter.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 559082ea9f

Comment thread studio/frontend/src/features/chat/api/chat-adapter.ts Outdated
@jimdawdy-hub

Contributor Author

Verification logs (patched branch, Arch Linux, 2026-05-31)

Installed from editable checkout: pip install -e /home/jim/Projects/unsloth (branch fix/studio-poll-cli-model-load, includes #5900 + #5901 + #5902).

Command:

unsloth studio run \
  -m unsloth/Qwen3.6-27B-MTP-GGUF \
  --gguf-variant UD-IQ2_XXS \
  --max-seq-length 8192 \
  --no-mmproj \
  --port 8889 --host 127.0.0.1 --silent

#5902: --no-mmproj honored; no mmproj download on CLI load

Before (repro): Downloading mmproj: unsloth/Qwen3.6-27B-MTP-GGUF/mmproj-F16.gguf

After (patched):

{"timestamp": "2026-05-31T14:14:14.354872Z", "event": "Vision-capable GGUF loaded without a usable mmproj; image input will be disabled for this session"}
{"timestamp": "2026-05-31T14:14:14.355179Z", "event": "Appending user extra args to llama-server: ['--no-mmproj']"}
{"timestamp": "2026-05-31T14:14:14.355224Z", "event": "Starting llama-server: ... --no-mmproj"}

(No Downloading mmproj line. llama-server command ends with --no-mmproj, not --mmproj.)

#5902 — UI reload inherits llama_extra_args

{"timestamp": "2026-05-31T14:15:33.870718Z", "event": "Inheriting llama_extra_args from previous load (same model, shadow-stripped): ['--no-mmproj']"}
{"timestamp": "2026-05-31T14:15:35.282062Z", "event": "Starting llama-server: ... --no-mmproj"}
{"timestamp": "2026-05-31T14:16:01.740749Z", "event": "Loaded GGUF model via llama-server: unsloth/Qwen3.6-27B-MTP-GGUF"}

Reload request: POST /api/inference/load with {"model_path":"unsloth/Qwen3.6-27B-MTP-GGUF","gguf_variant":"UD-IQ2_XXS"} (no llama_extra_args field).
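The inheritance rule suggested by these log lines (reuse stored args on a same-model reload, drop them when a different model or variant is loaded) can be sketched as below. This is only an illustration in TypeScript of behavior that lives in the backend: `LoadKey`, `StoredArgs`, and `resolveExtraArgs` are hypothetical names, the "shadow-stripped" step mentioned in the log is not modeled, and letting explicitly supplied args take precedence is an assumption.

```typescript
// Hypothetical identity of a load: the (model, variant) pair shown in the logs.
interface LoadKey {
  modelPath: string;
  ggufVariant: string;
}

// Hypothetical record of the args used by the previous load.
interface StoredArgs {
  key: LoadKey;
  args: string[];
}

// Decides which extra llama-server args a load request should use.
function resolveExtraArgs(
  request: { key: LoadKey; llamaExtraArgs?: string[] },
  stored: StoredArgs | null,
): string[] {
  // Assumption: args sent with the request win over stored ones.
  if (request.llamaExtraArgs !== undefined) return [...request.llamaExtraArgs];
  const sameModel =
    stored !== null &&
    stored.key.modelPath === request.key.modelPath &&
    stored.key.ggufVariant === request.key.ggufVariant;
  // Same model and variant: inherit. Anything else: start with no extra args.
  return sameModel && stored !== null ? [...stored.args] : [];
}
```

Under this sketch, the reload request above (no `llama_extra_args` field, same model and variant) would inherit `['--no-mmproj']`, while a Gemma load after a Qwen load would inherit nothing.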

#5900 / #5901 — no Gemma auto-load; Qwen stays active

Before (repro):

{"timestamp": "2026-05-31T13:28:02.197259Z", "event": "Detected remote GGUF repo 'unsloth/gemma-4-E2B-it-GGUF', variant=UD-Q4_K_XL, vision=True"}

After (patched session): no Detected remote GGUF repo 'unsloth/gemma-4-E2B-it-GGUF' inference load. Only background helper pre-cache:

{"timestamp": "2026-05-31T14:14:13.453443Z", "event": "Pre-caching helper GGUF: unsloth/gemma-4-E2B-it-GGUF/gemma-4-E2B-it-UD-Q4_K_XL.gguf"}

Status after CLI load:

{"active_model":"unsloth/Qwen3.6-27B-MTP-GGUF","model_identifier":"unsloth/Qwen3.6-27B-MTP-GGUF","gguf_variant":"UD-IQ2_XXS","is_vision":false}

Chat completion stayed on Qwen:

{"model":"unsloth/Qwen3.6-27B-MTP-GGUF","choices":[{"delta":{"content":"The user is asking me to reply with exactly \"OK\"..."}}]}

POST /api/inference/load count in session: 1 CLI load + 1 intentional same-model reload test — no Gemma load.

Unit tests

pytest studio/backend/tests/test_llama_server_args.py — 165 passed

jimdawdy-hub and others added 2 commits June 2, 2026 19:58
Extract shared inference-status hydration for refresh() and CLI adopt
paths so the first chat turn gets reasoning/tools flags. Wrap auto-load
(including adopt) in try/catch for image-edit cleanup, and drop the
redundant adopt call in run().

Co-authored-by: Cursor <cursoragent@cursor.com>
@jimdawdy-hub

Contributor Author

All three Codex concerns (error handling for getInferenceStatus() failures, redundant checkpoint guard, and hydrating adopted CLI model capabilities) were addressed in commits 559082e and 4e42732. No open review threads remain.

@rolandtannous — ready for review when you get a chance. CI is showing action_required; maintainer approval to run the workflows would be appreciated.

Resolve conflicts by keeping shared apply-inference-status-to-store
hydration while adopting main's refresh/load paths.

Co-authored-by: Cursor <cursoragent@cursor.com>
@jimdawdy-hub

Contributor Author

Merge conflict resolution

Merged latest main (436525d6) into fix/studio-respect-server-active-model.

Conflicts resolved:

  • chat-adapter.ts — kept the adopt-before-auto-load comment and existing try/catch around autoLoadSmallestModel().
  • use-chat-model-runtime.ts — kept the shared apply-inference-status-to-store hydration path (resolveInferenceCheckpointId + applyActiveModelStatusToStore) instead of duplicating the inline refresh block from main.

The shared helper already includes resolveToolsEnabledOnLoad, speculative-type normalization, and Qwen reasoning defaults, so adopt + refresh stay in sync.

Review threads: all three Codex/Gemini threads are resolved (redundant checkpoint guard was fixed in 4e42732).

PR is MERGEABLE; waiting on pre-commit.ci.

@danielhanchen danielhanchen self-assigned this Jun 11, 2026
@danielhanchen

Member

Pushed 4cd23a6 with two small hardening changes to tryAdoptServerActiveModel():

  1. getInferenceStatus() failures are now caught and treated as no adoption, so a status endpoint hiccup falls back to the normal auto-load path instead of failing the first send.
  2. The checkpoint is re-checked after the await, so a model the user selects while the status request is in flight is never overwritten by the adoption path.
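The two hardening changes can be illustrated with a small sketch. It is not the code pushed in 4cd23a6: `AdoptDeps`, `StatusLike`, and `tryAdoptHardened` are hypothetical names, and only `getInferenceStatus()` and the `active_model` field come from this PR.

```typescript
// Hypothetical minimal status shape.
interface StatusLike {
  active_model: string | null;
}

// Hypothetical dependencies, injected so the race can be exercised in isolation.
interface AdoptDeps {
  getCheckpoint(): string | null;
  setCheckpoint(id: string): void;
  getInferenceStatus(): Promise<StatusLike>;
}

// Returns true only when the server's active model was written to the checkpoint.
async function tryAdoptHardened(deps: AdoptDeps): Promise<boolean> {
  if (deps.getCheckpoint()) return false;

  let status: StatusLike;
  try {
    status = await deps.getInferenceStatus();
  } catch {
    // Change 1: a failed status request counts as "no adoption",
    // so the caller can continue with its normal auto-load path.
    return false;
  }

  // Change 2: the user may have picked a model while the request was in flight;
  // re-read the checkpoint and never overwrite that choice.
  if (deps.getCheckpoint()) return false;

  if (!status.active_model) return false;
  deps.setCheckpoint(status.active_model);
  return true;
}
```

The re-check matters because everything after the `await` runs in a later turn of the event loop, so the checkpoint read before the request can be stale by the time the response arrives.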

The shared hydration extraction looks faithful to the original refresh block. Nice catch on the CLI model being replaced by the fallback auto-load.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4dc92f1646

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9899d3bdb8

@jimdawdy-hub

Contributor Author

Synced with latest main and cleared open review threads.

  • Merged origin/main into this branch; PR is mergeable again.
  • Resolved remaining Codex/Gemini review threads (including items already addressed in @danielhanchen's follow-up commits).

Waiting on maintainer review/approval. pre-commit.ci is the only automated gate visible from fork PRs; GitHub Actions still require maintainer approval for first-time contributors.

Co-authored-by: Cursor <cursoragent@cursor.com>
@danielhanchen danielhanchen merged commit 515abca into unslothai:main Jun 12, 2026
1 check passed