fix(studio): poll inference status while CLI model is loading by jimdawdy-hub · Pull Request #5901 · unslothai/unsloth

jimdawdy-hub · 2026-05-31T13:54:18Z

Summary

Poll /api/inference/status for up to 60s on chat page mount when no checkpoint is set (covers the race where the UI loads before studio run -m finishes).
Extend waitForModelReady() to adopt externally loaded models, not only UI modelLoading state.
Use model_identifier from status when syncing the checkpoint (HF repo id vs display label).
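
The mount-time poll described above can be sketched roughly as follows. This is a language-neutral illustration written in Python, not the PR's TypeScript code: `poll_for_active_model`, `checkpoint_from_status`, the injected `fetch_status` callable, and the one-second interval are all hypothetical. Only the 60 s budget and the `active_model` / `model_identifier` fields come from this thread, and the fallback to `active_model` is an assumption.

```python
import time
from typing import Callable, Optional


def poll_for_active_model(
    fetch_status: Callable[[], dict],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,  # assumed interval; the thread does not state one
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Poll until the server reports an active model or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        try:
            status = fetch_status()
        except OSError:
            # Transient network failure: treat it as "nothing loaded yet".
            status = {}
        if status.get("active_model"):
            return status
        if clock() >= deadline:
            return None
        sleep(interval_s)


def checkpoint_from_status(status: dict) -> Optional[str]:
    """Prefer the repo id (model_identifier) over the display label."""
    # Falling back to active_model is an assumption made for this sketch.
    return status.get("model_identifier") or status.get("active_model")
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting; a caller would pass a function that performs `GET /api/inference/status` as `fetch_status`.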

Reproduction logs (Arch Linux, dual RTX 5060 Ti, 2026-05-31)

Same studio run -m unsloth/Qwen3.6-27B-MTP-GGUF … session as #5900. The UI opened while the CLI load was still in flight:

| Time | Event |
| --- | --- |
| 13:27:34 | `GET /api/inference/status` returns empty checkpoint / no active model |
| 13:27:39 | CLI finishes Qwen load |
| 13:28:02 | UI auto-loads Gemma because checkpoint never synced from server status |

Status polled before CLI load finished:

{"timestamp": "2026-05-31T13:27:34.249462Z", "event": "request_completed", "method": "GET", "path": "/api/inference/status", "status_code": 200}

Wrong model loaded by UI auto-load:

{"timestamp": "2026-05-31T13:28:02.197259Z", "event": "Detected remote GGUF repo 'unsloth/gemma-4-E2B-it-GGUF', variant=UD-Q4_K_XL, vision=True"}

Test plan

Start studio run -m …, open the UI immediately, wait without refreshing
Confirm the model selector updates to the CLI-loaded model within ~60s
Send a chat message during CLI load; confirm it waits/adopts instead of auto-loading Gemma

Verification (2026-05-31, patched editable install)

See verification comment on the PR — summary:

No Gemma auto-load after studio run -m Qwen (only helper pre-cache, no inference load for gemma-4-E2B)
--no-mmproj honored — no mmproj download; llama-server launched with --no-mmproj only
Reload inherits args — Inheriting llama_extra_args ... ['--no-mmproj'] on same-model reload
165 test_llama_server_args.py tests passed

When the user starts Studio via `studio run -m`, the web UI could still auto-load a different cached GGUF on the first message because the chat checkpoint was empty. Sync from /api/inference/status before falling back to autoLoadSmallestModel so CLI-loaded models are not replaced.

Co-authored-by: Cursor <cursoragent@cursor.com>

The chat page could refresh /api/inference/status before `studio run -m` finished loading, leaving the UI checkpoint empty. Poll status on mount when no checkpoint is set, and extend waitForModelReady to adopt external loads.

Co-authored-by: Cursor <cursoragent@cursor.com>

Reloading the same GGUF from the UI without gguf_variant no longer drops CLI pass-through args like --no-mmproj. Skip mmproj download and launch when --no-mmproj is present in llama_extra_args.

Co-authored-by: Cursor <cursoragent@cursor.com>

gemini-code-assist

Code Review

This pull request introduces the ability to adopt a model that is already active on the inference server into the chat UI without triggering a new load, adding polling mechanisms during startup and page refresh. The review feedback suggests simplifying a redundant condition in tryAdoptServerActiveModel and wrapping server status checks in try-catch blocks to prevent crashes from transient network errors. Additionally, it recommends deferring the model and LoRA listing requests until after the active-model polling completes, to improve efficiency and robustness.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ab3f95856b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6cecc631ca

jimdawdy-hub · 2026-05-31T14:17:06Z

Verification logs (patched branch, Arch Linux, 2026-05-31)

Installed from editable checkout: pip install -e /home/jim/Projects/unsloth (branch fix/studio-poll-cli-model-load, includes #5900 + #5901 + #5902).

Command:

unsloth studio run \
  -m unsloth/Qwen3.6-27B-MTP-GGUF \
  --gguf-variant UD-IQ2_XXS \
  --max-seq-length 8192 \
  --no-mmproj \
  --port 8889 --host 127.0.0.1 --silent

#5902 — `--no-mmproj` honored; no mmproj download on CLI load

Before (repro): Downloading mmproj: unsloth/Qwen3.6-27B-MTP-GGUF/mmproj-F16.gguf

After (patched):

{"timestamp": "2026-05-31T14:14:14.354872Z", "event": "Vision-capable GGUF loaded without a usable mmproj; image input will be disabled for this session"}
{"timestamp": "2026-05-31T14:14:14.355179Z", "event": "Appending user extra args to llama-server: ['--no-mmproj']"}
{"timestamp": "2026-05-31T14:14:14.355224Z", "event": "Starting llama-server: ... --no-mmproj"}

(No Downloading mmproj line. llama-server command ends with --no-mmproj, not --mmproj.)

#5902 — UI reload inherits `llama_extra_args`

{"timestamp": "2026-05-31T14:15:33.870718Z", "event": "Inheriting llama_extra_args from previous load (same model, shadow-stripped): ['--no-mmproj']"}
{"timestamp": "2026-05-31T14:15:35.282062Z", "event": "Starting llama-server: ... --no-mmproj"}
{"timestamp": "2026-05-31T14:16:01.740749Z", "event": "Loaded GGUF model via llama-server: unsloth/Qwen3.6-27B-MTP-GGUF"}

Reload request: POST /api/inference/load with {"model_path":"unsloth/Qwen3.6-27B-MTP-GGUF","gguf_variant":"UD-IQ2_XXS"} (no llama_extra_args field).

#5900 / #5901 — no Gemma auto-load; Qwen stays active

Before (repro):

{"timestamp": "2026-05-31T13:28:02.197259Z", "event": "Detected remote GGUF repo 'unsloth/gemma-4-E2B-it-GGUF', variant=UD-Q4_K_XL, vision=True"}

After (patched session): no Detected remote GGUF repo 'unsloth/gemma-4-E2B-it-GGUF' inference load. Only background helper pre-cache:

{"timestamp": "2026-05-31T14:14:13.453443Z", "event": "Pre-caching helper GGUF: unsloth/gemma-4-E2B-it-GGUF/gemma-4-E2B-it-UD-Q4_K_XL.gguf"}

Status after CLI load:

{"active_model":"unsloth/Qwen3.6-27B-MTP-GGUF","model_identifier":"unsloth/Qwen3.6-27B-MTP-GGUF","gguf_variant":"UD-IQ2_XXS","is_vision":false}

Chat completion stayed on Qwen:

{"model":"unsloth/Qwen3.6-27B-MTP-GGUF","choices":[{"delta":{"content":"The user is asking me to reply with exactly \"OK\"..."}}]}

POST /api/inference/load count in session: 1 CLI load + 1 intentional same-model reload test — no Gemma load.

Unit tests

pytest studio/backend/tests/test_llama_server_args.py — 165 passed

…polling

Extract shared inference-status hydration, poll status before listing models on CLI startup, wait for empty checkpoints before auto-load, and reject llama_extra_args inheritance when the resolved GGUF variant differs.

Co-authored-by: Cursor <cursoragent@cursor.com>
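
The inheritance rule discussed in this thread (a same-model reload without a `llama_extra_args` field keeps the CLI pass-through args, but not when the resolved GGUF variant differs) could look something like the sketch below. `PreviousLoad` and `resolve_extra_args` are hypothetical names, not the backend's actual API. Letting an explicit request value take precedence is an assumption, and the "shadow-stripped" step mentioned in the logs is not modelled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PreviousLoad:
    """What the server remembers about the last successful load (assumed shape)."""
    model_path: str
    gguf_variant: Optional[str]
    llama_extra_args: List[str] = field(default_factory=list)


def resolve_extra_args(
    requested_args: Optional[List[str]],
    model_path: str,
    resolved_variant: Optional[str],
    previous: Optional[PreviousLoad],
) -> List[str]:
    """Decide which llama-server extra args a load request should use."""
    if requested_args is not None:
        # Assumption: an explicit value in the request always wins.
        return list(requested_args)
    if previous is None or previous.model_path != model_path:
        return []  # different model: nothing to inherit
    if previous.gguf_variant != resolved_variant:
        return []  # same repo, different variant: reject inheritance
    return list(previous.llama_extra_args)


cli_load = PreviousLoad("unsloth/Qwen3.6-27B-MTP-GGUF", "UD-IQ2_XXS", ["--no-mmproj"])
print(resolve_extra_args(None, "unsloth/Qwen3.6-27B-MTP-GGUF", "UD-IQ2_XXS", cli_load))
```

The comparison uses the resolved variant rather than the raw request field, so a UI reload that omits `gguf_variant` can still match the CLI load once the server has resolved it.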

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1194142d5e

…efresh

Keep CLI poll/adopt logic and combine refresh options with main's AbortSignal cancellation. Retain project instruction helpers from main alongside the extended waitForModelReady adopt loop.

Co-authored-by: Cursor <cursoragent@cursor.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3b59d7b87c

P1: only enter waitForModelReady() when a UI-initiated load is actually in progress (modelLoading). Removing the checkpointEmpty condition means a fresh empty session goes straight to autoLoadSmallestModel(), which already calls tryAdoptServerActiveModel() first. This avoids the 120 s spin-to-deadline on every normal startup where no CLI model is loading.

P2: fetch listModels() / listLoras() and commit them to the store before starting the 60 s CLI-load poll, so the model selector is never blocked for a full minute during an idle Studio session. The poll still runs concurrently on mount when the checkpoint is empty; the final status fetch is re-used from the poll result to avoid an extra round-trip.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6d56af4609

jimdawdy-hub · 2026-06-09T04:08:12Z

Pushed 6d56af4 to address the two P1/P2 Codex concerns:

P1: waitForModelReady() is no longer entered when checkpointEmpty is true but no UI load is in progress. Normal empty-session starts go straight to autoLoadSmallestModel() (which calls tryAdoptServerActiveModel() first) instead of spinning for 120 s.
P2: listModels() / listLoras() are now fetched and committed to the store before the CLI-load poll starts, so the model selector is never blocked for a full minute during an idle session.

@danielhanchen @rolandtannous — would you be able to review when you get a chance? CI is showing action_required on all workflows; maintainer approval to run the workflows would be appreciated.

Addresses the follow-up Codex review on 6d56af4:

- P1: dropping the pre-autoload wait entirely reintroduced the CLI-load race (UI auto-loads the smallest model while `studio run -m` is still loading). autoLoadSmallestModel now calls adoptInFlightServerLoad, which adopts an already-active model, and -- only when load-progress reports phase "mmap" (llama-server genuinely paging weights) -- waits for that load to finish before adopting. An idle session has no such evidence and falls straight through to auto-load with no delay.
- P3: waitForModelReady no longer spins to a 120s deadline. It returns as soon as modelLoading clears, so a cancelled/failed UI-initiated load no longer hangs the send for two minutes.
- P2: refresh() no longer clobbers a model the user picks while the mount-time CLI poll is running -- the poll stops early on selection and the polled active_model is not applied over a fresh local selection.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

# Conflicts:
# studio/backend/routes/inference.py
# studio/frontend/src/features/chat/api/chat-adapter.ts
# studio/frontend/src/features/chat/hooks/use-chat-model-runtime.ts

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4b8f1c516c

Resolve chat-adapter conflict (keep abortSignal on autoLoadSmallestModel) and address latest Codex review: report in-flight GGUF loads on /status, gate adopt waits on loading/mmap evidence, re-check checkpoint before adopt, and skip multimodal reset when the user picked during CLI poll.

Co-authored-by: Cursor <cursoragent@cursor.com>

jimdawdy-hub · 2026-06-10T01:19:04Z

Merge + Codex follow-up (`c5e6e70`)

Merged latest main and addressed the remaining Codex threads:

Merge conflict (chat-adapter.ts): kept autoLoadSmallestModel(abortSignal) so abort/cleanup still works.
P1 — wait before mmap / HF download (chat-adapter.ts + inference.py): /api/inference/status now reports in-flight GGUF loads while _serial_load_lock is held; adoptInFlightServerLoad waits on status.loading or any non-null load-progress phase (adaptive poll — idle sessions still return immediately).
P2 — multimodal reset during poll (use-chat-model-runtime.ts): skip the no-active-model reset branch when userSelectedDuringPoll.
P2 — adopt race (apply-inference-status-to-store.ts): re-read params.checkpoint after getInferenceStatus() before calling setCheckpoint.
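
The evidence-gated decision described in the P1 item above might be sketched as below. The function names and the `"adopt"` / `"wait"` / `"auto_load"` labels are hypothetical. The `loading` and `active_model` status fields and the load-progress `phase` come from this thread, but the exact payload shapes are assumed.

```python
from typing import Mapping, Optional


def has_in_flight_load_evidence(status: Mapping, progress: Optional[Mapping]) -> bool:
    """True when the server shows signs that a load is currently under way."""
    if status.get("loading"):
        return True
    # Any non-null load-progress phase (e.g. "mmap") also counts as evidence.
    return bool(progress) and progress.get("phase") is not None


def next_action(status: Mapping, progress: Optional[Mapping]) -> str:
    """Pick what the chat UI should do before it would otherwise auto-load."""
    if status.get("active_model"):
        return "adopt"      # a model is already active: use it
    if has_in_flight_load_evidence(status, progress):
        return "wait"       # a load is in flight: poll until it finishes
    return "auto_load"      # idle session: fall through with no delay
```

Because the wait branch is only reached when the server reports evidence, an idle session never pays a polling delay.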

PR is MERGEABLE; waiting on pre-commit.ci.

danielhanchen · 2026-06-11T14:37:16Z

Pushed b709b8f with two changes:

extra_args_disable_mmproj() now matches the version on #5902 (fix(studio): inherit llama_extra_args and honor --no-mmproj): it recognises the --no-mmproj-auto alias and mirrors llama-server's last-wins parsing for the --mmproj-auto / --no-mmproj / --no-mmproj-auto boolean group, with tests for both. The content is identical on both branches, so they merge cleanly in either order.
Restored the original shorter wording for three comments in use-chat-model-runtime.ts that were rewritten into longer versions with the same meaning, which shrinks the diff.
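
As a rough illustration of the last-wins parsing mentioned in the first item, the sketch below scans the extra args in order and lets the final flag from the boolean group decide. It is not the body of extra_args_disable_mmproj(); `mmproj_disabled_by_extra_args` is a hypothetical name, and other ways of passing these options are deliberately ignored.

```python
from typing import Iterable

# The three flags named in this thread form one boolean group.
ENABLE_FLAGS = {"--mmproj-auto"}
DISABLE_FLAGS = {"--no-mmproj", "--no-mmproj-auto"}


def mmproj_disabled_by_extra_args(extra_args: Iterable[str]) -> bool:
    """Return True when the last flag of the group turns mmproj off."""
    disabled = False  # no flag from the group present: nothing is disabled
    for arg in extra_args:
        if arg in DISABLE_FLAGS:
            disabled = True
        elif arg in ENABLE_FLAGS:
            disabled = False  # a later enable overrides an earlier disable
    return disabled


print(mmproj_disabled_by_extra_args(["--no-mmproj"]))
print(mmproj_disabled_by_extra_args(["--no-mmproj", "--mmproj-auto"]))
```

The first call prints `True` and the second prints `False`, because the later `--mmproj-auto` overrides the earlier `--no-mmproj`.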

The polling design here is solid: evidence-gated waiting via /load-progress and status.loading means idle sessions still auto-load with zero delay, and the post-poll checkpoint re-check protects user selections.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b709b8f168

jimdawdy-hub · 2026-06-11T19:26:30Z

Synced with latest main and cleared open review threads.

Merged origin/main into this branch; PR is mergeable again.
Resolved remaining Codex/Gemini review threads (including items already addressed in @danielhanchen's follow-up commits).

Waiting on maintainer review/approval. pre-commit.ci is the only automated gate visible from fork PRs; GitHub Actions still require maintainer approval for first-time contributors.

…odel-load

# Conflicts:
# studio/backend/routes/inference.py
# studio/frontend/src/features/chat/api/chat-adapter.ts
# studio/frontend/src/features/chat/hooks/use-chat-model-runtime.ts
# studio/frontend/src/features/chat/lib/apply-inference-status-to-store.ts

The load-orchestrator canary failed on GitHub-hosted runners at 361 ms against a 350 ms ceiling. Widen to 400 ms so the guard still catches pathological serialisation without flaking on shared CI hardware.

Co-authored-by: Cursor <cursoragent@cursor.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c47589b73c

Keep the empty-checkpoint refresh poll running past 60s when inference status still reports an in-flight load, so slow studio run -m sessions adopt into the checkpoint without a manual refresh.

Co-authored-by: Cursor <cursoragent@cursor.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f85c07b566

Only clear multimodal/trust flags when refresh finishes with no active model and no checkpoint in the store, so a selection during in-flight list/status calls is not wiped by stale pre-await guards.

Co-authored-by: Cursor <cursoragent@cursor.com>

jimdawdy-hub · 2026-06-12T17:41:31Z

Addressed in 010142e: the capability-clear branch now gates on checkpointAfterPoll instead of the stale pre-await isExternalSelectionActive / poll-only userSelectedDuringPoll, so a local selection during in-flight refresh no longer clears multimodal flags.
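
The stale-guard problem can be illustrated with a small asyncio sketch: a value read before an `await` may be out of date by the time the call returns, so the store is re-read afterwards. `ChatStore`, `refresh_capabilities`, and the single `is_multimodal` flag are hypothetical stand-ins for the frontend store, written in Python purely for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class ChatStore:
    """Minimal stand-in for the chat store (assumed shape)."""
    checkpoint: str = ""
    is_multimodal: bool = True


async def refresh_capabilities(
    store: ChatStore,
    fetch_status: Callable[[], Awaitable[dict]],
) -> None:
    """Clear capability flags only if nothing is selected once the fetch is done."""
    status = await fetch_status()  # the user may pick a model during this await
    checkpoint_after_poll = store.checkpoint  # re-read the store after the await
    if not status.get("active_model") and not checkpoint_after_poll:
        store.is_multimodal = False


async def demo() -> ChatStore:
    store = ChatStore()

    async def fetch_status() -> dict:
        store.checkpoint = "user/picked-model"  # simulate a selection mid-request
        return {"active_model": None}

    await refresh_capabilities(store, fetch_status)
    return store


print(asyncio.run(demo()).is_multimodal)
```

Here the simulated selection lands while the status request is in flight, so the flag survives; a guard captured before the `await` would have cleared it.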

jimdawdy-hub and others added 3 commits May 31, 2026 08:51

jimdawdy-hub requested a review from rolandtannous as a code owner May 31, 2026 13:54

gemini-code-assist Bot reviewed May 31, 2026

Comment thread studio/frontend/src/features/chat/api/chat-adapter.ts Outdated

Comment thread studio/frontend/src/features/chat/hooks/use-chat-model-runtime.ts Outdated

Merge backend llama_extra_args/mmproj fix for verification

bfea8f4

chatgpt-codex-connector Bot reviewed May 31, 2026

Comment thread studio/frontend/src/features/chat/api/chat-adapter.ts Outdated

Comment thread studio/frontend/src/features/chat/hooks/use-chat-model-runtime.ts Outdated

jimdawdy-hub requested a review from danielhanchen as a code owner May 31, 2026 14:00

[pre-commit.ci] auto fixes from pre-commit.com hooks

6cecc63

for more information, see https://pre-commit.ci

chatgpt-codex-connector Bot reviewed May 31, 2026

Comment thread studio/frontend/src/features/chat/api/chat-adapter.ts Outdated

Comment thread studio/backend/routes/inference.py Outdated

This was referenced May 31, 2026

fix(studio): adopt server-loaded model before chat auto-load #5900

Merged

fix(studio): inherit llama_extra_args and honor --no-mmproj #5902

Merged

fix(studio): load run.py by path for editable installs #5909

Merged

chatgpt-codex-connector Bot reviewed Jun 3, 2026

Comment thread studio/frontend/src/features/chat/api/chat-adapter.ts Outdated

chatgpt-codex-connector Bot reviewed Jun 3, 2026

Comment thread studio/frontend/src/features/chat/hooks/use-chat-model-runtime.ts

jimdawdy-hub and others added 4 commits June 8, 2026 10:16

Merge branch 'main' into fix/studio-poll-cli-model-load

ed2cc42

[pre-commit.ci] auto fixes from pre-commit.com hooks

c8358fe

for more information, see https://pre-commit.ci

Merge branch 'main' into fix/studio-poll-cli-model-load

8fe82b9

chatgpt-codex-connector Bot reviewed Jun 9, 2026

Comment thread studio/frontend/src/features/chat/api/chat-adapter.ts Outdated

Comment thread studio/frontend/src/features/chat/hooks/use-chat-model-runtime.ts Outdated

Comment thread studio/frontend/src/features/chat/api/chat-adapter.ts Outdated

Jim Dawdy and others added 2 commits June 9, 2026 01:22

Merge remote-tracking branch 'upstream/main' into pr5901

4b8f1c5

# Conflicts:
# studio/backend/routes/inference.py
# studio/frontend/src/features/chat/api/chat-adapter.ts
# studio/frontend/src/features/chat/hooks/use-chat-model-runtime.ts

chatgpt-codex-connector Bot reviewed Jun 9, 2026

Comment thread studio/frontend/src/features/chat/hooks/use-chat-model-runtime.ts Outdated

Comment thread studio/frontend/src/features/chat/api/chat-adapter.ts Outdated

Comment thread studio/frontend/src/features/chat/lib/apply-inference-status-to-store.ts Outdated

danielhanchen self-assigned this Jun 11, 2026

Recognize --no-mmproj-auto with last-wins parsing and tighten comments for PR unslothai#5901

b709b8f

chatgpt-codex-connector Bot reviewed Jun 11, 2026

Comment thread studio/frontend/src/features/chat/api/chat-adapter.ts

danielhanchen and others added 2 commits June 11, 2026 15:10

Merge main into fix/studio-poll-cli-model-load

024ee36

Merge remote-tracking branch 'origin/main' into fix/studio-poll-cli-model-load

3a30aa0

danielhanchen and others added 5 commits June 12, 2026 07:34

Merge remote-tracking branch 'origin/main' into fix/studio-poll-cli-model-load

b1a0fc1

Merge branch 'main' into fix/studio-poll-cli-model-load

b42a1c0

Merge branch 'main' into fix/studio-poll-cli-model-load

c47589b

chatgpt-codex-connector Bot reviewed Jun 12, 2026

Comment thread studio/frontend/src/features/chat/hooks/use-chat-model-runtime.ts Outdated

chatgpt-codex-connector Bot reviewed Jun 12, 2026

Comment thread studio/frontend/src/features/chat/hooks/use-chat-model-runtime.ts Outdated
