
fix(studio): poll inference status while CLI model is loading#5901

Open
jimdawdy-hub wants to merge 24 commits into
unslothai:mainfrom
jimdawdy-hub:fix/studio-poll-cli-model-load

Conversation

@jimdawdy-hub jimdawdy-hub commented May 31, 2026

Contributor

Summary

  • Poll /api/inference/status for up to 60s on chat page mount when no checkpoint is set (covers the race where the UI loads before studio run -m finishes).
  • Extend waitForModelReady() to adopt externally loaded models, not only UI modelLoading state.
  • Use model_identifier from status when syncing the checkpoint (HF repo id vs display label).
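The mount-time poll described above can be sketched roughly as follows. This is a control-flow illustration in Python, not the PR's TypeScript code: `poll_for_active_model`, the injected `fetch_status` callable, and the 1 s interval are assumptions made for the example. Only the 60 s budget and the `active_model` / `model_identifier` field names come from the PR text.

```python
import time
from typing import Callable, Optional


def poll_for_active_model(
    fetch_status: Callable[[], dict],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,  # assumed interval; the PR does not state one
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll until the server reports a loaded model or the budget runs out."""
    deadline = clock() + timeout_s
    while True:
        try:
            status = fetch_status()
        except OSError:
            # Treat a transient network error as "nothing loaded yet".
            status = {}
        # Prefer the HF repo id over the display label when both exist.
        model = status.get("model_identifier") or status.get("active_model")
        if model:
            return model
        if clock() >= deadline:
            return None
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting; a caller would pass a function that issues `GET /api/inference/status` and returns the decoded JSON.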

Related

Reproduction logs (Arch Linux, dual RTX 5060 Ti, 2026-05-31)

Same studio run -m unsloth/Qwen3.6-27B-MTP-GGUF … session as #5900. The UI opened while the CLI load was still in flight:

| Time | Event |
|---|---|
| 13:27:34 | GET /api/inference/status returns empty checkpoint / no active model |
| 13:27:39 | CLI finishes Qwen load |
| 13:28:02 | UI auto-loads Gemma because checkpoint never synced from server status |

Status polled before CLI load finished:

{"timestamp": "2026-05-31T13:27:34.249462Z", "event": "request_completed", "method": "GET", "path": "/api/inference/status", "status_code": 200}

Wrong model loaded by UI auto-load:

{"timestamp": "2026-05-31T13:28:02.197259Z", "event": "Detected remote GGUF repo 'unsloth/gemma-4-E2B-it-GGUF', variant=UD-Q4_K_XL, vision=True"}

Test plan

  • Start studio run -m …, open the UI immediately, wait without refreshing
  • Confirm the model selector updates to the CLI-loaded model within ~60s
  • Send a chat message during CLI load; confirm it waits/adopts instead of auto-loading Gemma

Verification (2026-05-31, patched editable install)

See verification comment on the PR — summary:

  • No Gemma auto-load after studio run -m Qwen (only helper pre-cache, no inference load for gemma-4-E2B)
  • --no-mmproj honored — no mmproj download; llama-server launched with --no-mmproj only
  • Reload inherits args: the log shows Inheriting llama_extra_args ... ['--no-mmproj'] on same-model reload
  • 165 test_llama_server_args.py tests passed

jimdawdy-hub and others added 3 commits May 31, 2026 08:51
When the user starts Studio via `studio run -m`, the web UI could still
auto-load a different cached GGUF on the first message because the chat
checkpoint was empty. Sync from /api/inference/status before falling back
to autoLoadSmallestModel so CLI-loaded models are not replaced.

Co-authored-by: Cursor <cursoragent@cursor.com>
The chat page could refresh /api/inference/status before `studio run -m`
finished loading, leaving the UI checkpoint empty. Poll status on mount when
no checkpoint is set, and extend waitForModelReady to adopt external loads.

Co-authored-by: Cursor <cursoragent@cursor.com>
Reloading the same GGUF from the UI without gguf_variant no longer drops
CLI pass-through args like --no-mmproj. Skip mmproj download and launch
when --no-mmproj is present in llama_extra_args.

Co-authored-by: Cursor <cursoragent@cursor.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces the ability to adopt an already active model on the inference server into the chat UI without triggering a new load, adding polling mechanisms during startup and page refresh. The review feedback suggests simplifying a redundant condition in tryAdoptServerActiveModel and wrapping server status checks in try-catch blocks to prevent crashes from transient network errors. Additionally, it is recommended to defer model and Lora listing requests until after the active model polling completes to improve efficiency and robustness.

Comment thread studio/frontend/src/features/chat/api/chat-adapter.ts Outdated
Comment thread studio/frontend/src/features/chat/hooks/use-chat-model-runtime.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ab3f95856b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread studio/frontend/src/features/chat/api/chat-adapter.ts Outdated
Comment thread studio/frontend/src/features/chat/hooks/use-chat-model-runtime.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6cecc631ca

Comment thread studio/frontend/src/features/chat/api/chat-adapter.ts Outdated
Comment thread studio/backend/routes/inference.py Outdated
@jimdawdy-hub

Contributor Author

Verification logs (patched branch, Arch Linux, 2026-05-31)

Installed from editable checkout: pip install -e /home/jim/Projects/unsloth (branch fix/studio-poll-cli-model-load, includes #5900 + #5901 + #5902).

Command:

unsloth studio run \
  -m unsloth/Qwen3.6-27B-MTP-GGUF \
  --gguf-variant UD-IQ2_XXS \
  --max-seq-length 8192 \
  --no-mmproj \
  --port 8889 --host 127.0.0.1 --silent

#5902: --no-mmproj honored; no mmproj download on CLI load

Before (repro): Downloading mmproj: unsloth/Qwen3.6-27B-MTP-GGUF/mmproj-F16.gguf

After (patched):

{"timestamp": "2026-05-31T14:14:14.354872Z", "event": "Vision-capable GGUF loaded without a usable mmproj; image input will be disabled for this session"}
{"timestamp": "2026-05-31T14:14:14.355179Z", "event": "Appending user extra args to llama-server: ['--no-mmproj']"}
{"timestamp": "2026-05-31T14:14:14.355224Z", "event": "Starting llama-server: ... --no-mmproj"}

(No Downloading mmproj line. llama-server command ends with --no-mmproj, not --mmproj.)

#5902 — UI reload inherits llama_extra_args

{"timestamp": "2026-05-31T14:15:33.870718Z", "event": "Inheriting llama_extra_args from previous load (same model, shadow-stripped): ['--no-mmproj']"}
{"timestamp": "2026-05-31T14:15:35.282062Z", "event": "Starting llama-server: ... --no-mmproj"}
{"timestamp": "2026-05-31T14:16:01.740749Z", "event": "Loaded GGUF model via llama-server: unsloth/Qwen3.6-27B-MTP-GGUF"}

Reload request: POST /api/inference/load with {"model_path":"unsloth/Qwen3.6-27B-MTP-GGUF","gguf_variant":"UD-IQ2_XXS"} (no llama_extra_args field).

#5900 / #5901 — no Gemma auto-load; Qwen stays active

Before (repro):

{"timestamp": "2026-05-31T13:28:02.197259Z", "event": "Detected remote GGUF repo 'unsloth/gemma-4-E2B-it-GGUF', variant=UD-Q4_K_XL, vision=True"}

After (patched session): no Detected remote GGUF repo 'unsloth/gemma-4-E2B-it-GGUF' inference load. Only background helper pre-cache:

{"timestamp": "2026-05-31T14:14:13.453443Z", "event": "Pre-caching helper GGUF: unsloth/gemma-4-E2B-it-GGUF/gemma-4-E2B-it-UD-Q4_K_XL.gguf"}

Status after CLI load:

{"active_model":"unsloth/Qwen3.6-27B-MTP-GGUF","model_identifier":"unsloth/Qwen3.6-27B-MTP-GGUF","gguf_variant":"UD-IQ2_XXS","is_vision":false}

Chat completion stayed on Qwen:

{"model":"unsloth/Qwen3.6-27B-MTP-GGUF","choices":[{"delta":{"content":"The user is asking me to reply with exactly \"OK\"..."}}]}

POST /api/inference/load count in session: 1 CLI load + 1 intentional same-model reload test — no Gemma load.

Unit tests

pytest studio/backend/tests/test_llama_server_args.py — 165 passed

…polling

Extract shared inference-status hydration, poll status before listing models
on CLI startup, wait for empty checkpoints before auto-load, and reject
llama_extra_args inheritance when the resolved GGUF variant differs.

Co-authored-by: Cursor <cursoragent@cursor.com>
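
The inheritance rule in this commit (reuse the previous load's llama_extra_args only for the same model and the same resolved GGUF variant) could look roughly like the sketch below. `resolve_extra_args` and the shape of the `previous` record are hypothetical, and the rule that explicit request args always win is an assumption. The real backend also "shadow-strips" inherited args, which is not modelled here.

```python
from typing import List, Optional, Sequence


def resolve_extra_args(
    requested_args: Optional[Sequence[str]],
    previous: Optional[dict],
    model_path: str,
    resolved_variant: Optional[str],
) -> List[str]:
    """Decide which llama-server pass-through args a load should use."""
    # Assumption: args sent explicitly with the request take precedence.
    if requested_args is not None:
        return list(requested_args)
    if previous is None:
        return []
    same_model = previous.get("model_path") == model_path
    # Compare against the variant the loader actually resolved, so a
    # request that omits gguf_variant can still match the previous load.
    same_variant = previous.get("gguf_variant") == resolved_variant
    if same_model and same_variant:
        return list(previous.get("llama_extra_args", []))
    # Different model or variant: do not carry flags over.
    return []
```

How the variant is resolved when the request omits gguf_variant is outside the scope of this sketch.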

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1194142d5e

Comment thread studio/frontend/src/features/chat/api/chat-adapter.ts Outdated
…efresh

Keep CLI poll/adopt logic and combine refresh options with main's AbortSignal
cancellation. Retain project instruction helpers from main alongside the
extended waitForModelReady adopt loop.

Co-authored-by: Cursor <cursoragent@cursor.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3b59d7b87c

Comment thread studio/frontend/src/features/chat/hooks/use-chat-model-runtime.ts
jimdawdy-hub and others added 4 commits June 8, 2026 10:16
P1: only enter waitForModelReady() when a UI-initiated load is actually
in progress (modelLoading). Removing the checkpointEmpty condition means
a fresh empty session goes straight to autoLoadSmallestModel(), which
already calls tryAdoptServerActiveModel() first.  This avoids the 120 s
spin-to-deadline on every normal startup where no CLI model is loading.

P2: fetch listModels() / listLoras() and commit them to the store before
starting the 60 s CLI-load poll, so the model selector is never blocked
for a full minute during an idle Studio session.  The poll still runs
concurrently on mount when the checkpoint is empty; the final status
fetch is re-used from the poll result to avoid an extra round-trip.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6d56af4609

Comment thread studio/frontend/src/features/chat/api/chat-adapter.ts Outdated
Comment thread studio/frontend/src/features/chat/hooks/use-chat-model-runtime.ts Outdated
Comment thread studio/frontend/src/features/chat/api/chat-adapter.ts Outdated
@jimdawdy-hub

Contributor Author

Pushed 6d56af4 to address the two P1/P2 Codex concerns:

  • P1: waitForModelReady() is no longer entered when checkpointEmpty is true but no UI load is in progress. Normal empty-session starts go straight to autoLoadSmallestModel() (which calls tryAdoptServerActiveModel() first) instead of spinning for 120 s.
  • P2: listModels() / listLoras() are now fetched and committed to the store before the CLI-load poll starts, so the model selector is never blocked for a full minute during an idle session.
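
One possible reading of the P2 ordering, sketched in Python with asyncio rather than the frontend's TypeScript: commit the model and LoRA lists first, then poll, and reuse the poll's final status. `refresh_on_mount`, the dict-based `store`, and the injected callables are all hypothetical names for illustration.

```python
import asyncio
from typing import Awaitable, Callable


async def refresh_on_mount(
    list_models: Callable[[], Awaitable[list]],
    list_loras: Callable[[], Awaitable[list]],
    poll_status: Callable[[], Awaitable[dict]],
    store: dict,
) -> dict:
    # Fetch both lists in parallel and commit them first, so the model
    # selector is populated before any long-running poll begins.
    models, loras = await asyncio.gather(list_models(), list_loras())
    store["models"] = models
    store["loras"] = loras

    # Only poll for a CLI load when no checkpoint is set.
    if store.get("checkpoint"):
        return {}
    # Reuse the poll's final status instead of issuing one more request.
    status = await poll_status()
    if status.get("active_model") and not store.get("checkpoint"):
        store["checkpoint"] = status["active_model"]
    return status
```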

@danielhanchen @rolandtannous — would you be able to review when you get a chance? CI is showing action_required on all workflows; maintainer approval to run the workflows would be appreciated.

Jim Dawdy and others added 2 commits June 9, 2026 01:22
Addresses the follow-up Codex review on 6d56af4:

- P1: dropping the pre-autoload wait entirely reintroduced the CLI-load
  race (UI auto-loads the smallest model while `studio run -m` is still
  loading). autoLoadSmallestModel now calls adoptInFlightServerLoad,
  which adopts an already-active model, and -- only when load-progress
  reports phase "mmap" (llama-server genuinely paging weights) -- waits
  for that load to finish before adopting. An idle session has no such
  evidence and falls straight through to auto-load with no delay.

- P3: waitForModelReady no longer spins to a 120s deadline. It returns as
  soon as modelLoading clears, so a cancelled/failed UI-initiated load no
  longer hangs the send for two minutes.

- P2: refresh() no longer clobbers a model the user picks while the
  mount-time CLI poll is running -- the poll stops early on selection and
  the polled active_model is not applied over a fresh local selection.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
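
The evidence-gated wait described in the P1 item can be illustrated with the Python sketch below. It borrows the name adoptInFlightServerLoad from the commit message, but the body, the injected `get_status` / `get_load_phase` callables, the timeout, the interval, and the "phase disappeared means give up" rule are assumptions, not the PR's code.

```python
import time
from typing import Callable, Optional


def adopt_in_flight_server_load(
    get_status: Callable[[], dict],
    get_load_phase: Callable[[], Optional[str]],
    timeout_s: float = 120.0,   # assumed upper bound
    interval_s: float = 0.5,    # assumed poll interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Return a model to adopt, or None so the caller can auto-load."""
    active = get_status().get("active_model")
    if active:
        return active  # already loaded: adopt immediately
    if get_load_phase() != "mmap":
        return None  # no evidence of a load in progress: do not wait
    deadline = clock() + timeout_s
    while clock() < deadline:
        sleep(interval_s)
        active = get_status().get("active_model")
        if active:
            return active
        if get_load_phase() is None:
            return None  # assumed: the load failed or was cancelled
    return None
```

An idle session returns on the second check without sleeping, which is the "no delay" property the commit message calls out.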
# Conflicts:
#	studio/backend/routes/inference.py
#	studio/frontend/src/features/chat/api/chat-adapter.ts
#	studio/frontend/src/features/chat/hooks/use-chat-model-runtime.ts

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4b8f1c516c

Comment thread studio/frontend/src/features/chat/hooks/use-chat-model-runtime.ts Outdated
Comment thread studio/frontend/src/features/chat/api/chat-adapter.ts Outdated
Comment thread studio/frontend/src/features/chat/lib/apply-inference-status-to-store.ts Outdated
Resolve chat-adapter conflict (keep abortSignal on autoLoadSmallestModel)
and address latest Codex review: report in-flight GGUF loads on /status,
gate adopt waits on loading/mmap evidence, re-check checkpoint before
adopt, and skip multimodal reset when the user picked during CLI poll.

Co-authored-by: Cursor <cursoragent@cursor.com>
@jimdawdy-hub

Contributor Author

Merge + Codex follow-up (c5e6e70)

Merged latest main and addressed the remaining Codex threads:

  • Merge conflict (chat-adapter.ts): kept autoLoadSmallestModel(abortSignal) so abort/cleanup still works.
  • P1 — wait before mmap / HF download (chat-adapter.ts + inference.py): /api/inference/status now reports in-flight GGUF loads while _serial_load_lock is held; adoptInFlightServerLoad waits on status.loading or any non-null load-progress phase (adaptive poll — idle sessions still return immediately).
  • P2 — multimodal reset during poll (use-chat-model-runtime.ts): skip the no-active-model reset branch when userSelectedDuringPoll.
  • P2 — adopt race (apply-inference-status-to-store.ts): re-read params.checkpoint after getInferenceStatus() before calling setCheckpoint.
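
The adopt-race fix in the last bullet is a re-read-after-await pattern. A minimal Python sketch follows; `ChatStore` and `adopt_server_model` are hypothetical stand-ins for the frontend store and helper, and the fallback from model_identifier to active_model is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ChatStore:
    checkpoint: Optional[str] = None


def adopt_server_model(store: ChatStore, get_status: Callable[[], dict]) -> bool:
    """Adopt the server's model unless the user picked one in the meantime."""
    if store.checkpoint:
        return False
    # Slow call: the user may select a model while this is in flight.
    status = get_status()
    model = status.get("model_identifier") or status.get("active_model")
    if not model:
        return False
    # Re-read the checkpoint after the call instead of trusting the
    # value observed before it.
    if store.checkpoint:
        return False
    store.checkpoint = model
    return True
```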

PR is MERGEABLE; waiting on pre-commit.ci.

@danielhanchen danielhanchen self-assigned this Jun 11, 2026
@danielhanchen

Member

Pushed b709b8f with two changes:

  1. extra_args_disable_mmproj() now matches the version in #5902 (fix(studio): inherit llama_extra_args and honor --no-mmproj): it recognises the --no-mmproj-auto alias and mirrors llama-server's last-wins parsing for the --mmproj-auto / --no-mmproj / --no-mmproj-auto boolean group, with tests for both. The content is identical on both branches, so they merge cleanly in either order.
  2. Restored the original shorter wording for three comments in use-chat-model-runtime.ts that were rewritten into longer versions with the same meaning, which shrinks the diff.
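
For illustration, last-wins handling of that flag group might be sketched as below. This is not the body of extra_args_disable_mmproj(); the function name `mmproj_disabled_by_args` is hypothetical, and only the three flag names come from the comment above.

```python
from typing import Iterable

# Flags in the boolean group named in the comment above.
_ENABLE_FLAGS = {"--mmproj-auto"}
_DISABLE_FLAGS = {"--no-mmproj", "--no-mmproj-auto"}


def mmproj_disabled_by_args(extra_args: Iterable[str]) -> bool:
    """Return True when the last flag of the group disables mmproj."""
    disabled = False
    for arg in extra_args:
        if arg in _DISABLE_FLAGS:
            disabled = True
        elif arg in _ENABLE_FLAGS:
            disabled = False  # a later enable flag overrides an earlier disable
    return disabled
```

With this rule, `["--no-mmproj", "--mmproj-auto"]` leaves mmproj enabled, while the reverse order disables it.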

The polling design here is solid: evidence-gated waiting via /load-progress and status.loading means idle sessions still auto-load with zero delay, and the post-poll checkpoint re-check protects user selections.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b709b8f168

Comment thread studio/frontend/src/features/chat/api/chat-adapter.ts
@jimdawdy-hub

Contributor Author

Synced with latest main and cleared open review threads.

  • Merged origin/main into this branch; PR is mergeable again.
  • Resolved remaining Codex/Gemini review threads (including items already addressed in @danielhanchen's follow-up commits).

Waiting on maintainer review/approval. pre-commit.ci is the only automated gate visible from fork PRs; GitHub Actions still require maintainer approval for first-time contributors.

danielhanchen and others added 5 commits June 12, 2026 07:34
…odel-load

# Conflicts:
#	studio/backend/routes/inference.py
#	studio/frontend/src/features/chat/api/chat-adapter.ts
#	studio/frontend/src/features/chat/hooks/use-chat-model-runtime.ts
#	studio/frontend/src/features/chat/lib/apply-inference-status-to-store.ts
The load-orchestrator canary failed on GitHub-hosted runners at 361 ms
against a 350 ms ceiling. Widen to 400 ms so the guard still catches
pathological serialisation without flaking on shared CI hardware.

Co-authored-by: Cursor <cursoragent@cursor.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c47589b73c

Comment thread studio/frontend/src/features/chat/hooks/use-chat-model-runtime.ts Outdated
Keep the empty-checkpoint refresh poll running past 60s when inference
status still reports an in-flight load, so slow studio run -m sessions
adopt into the checkpoint without a manual refresh.

Co-authored-by: Cursor <cursoragent@cursor.com>
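
Extending the poll past its 60 s budget while a load is still reported could be sketched like this. `poll_until_adopted` and the injected callables are hypothetical, and the sketch has no hard upper bound, which the real code may well add. The `loading` field name follows the status.loading reference elsewhere in this PR.

```python
import time
from typing import Callable, Optional


def poll_until_adopted(
    fetch_status: Callable[[], dict],
    soft_timeout_s: float = 60.0,
    interval_s: float = 1.0,  # assumed interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll with a soft deadline that only applies when nothing is loading."""
    soft_deadline = clock() + soft_timeout_s
    while True:
        status = fetch_status()
        model = status.get("active_model")
        if model:
            return model
        past_deadline = clock() >= soft_deadline
        # Give up only when the budget is spent AND the server reports
        # no in-flight load; a slow load keeps the poll alive.
        if past_deadline and not status.get("loading"):
            return None
        sleep(interval_s)
```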

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f85c07b566

Comment thread studio/frontend/src/features/chat/hooks/use-chat-model-runtime.ts Outdated
Only clear multimodal/trust flags when refresh finishes with no active
model and no checkpoint in the store, so a selection during in-flight
list/status calls is not wiped by stale pre-await guards.

Co-authored-by: Cursor <cursoragent@cursor.com>
@jimdawdy-hub

Contributor Author

Addressed in 010142e: the capability-clear branch now gates on checkpointAfterPoll instead of the stale pre-await isExternalSelectionActive / poll-only userSelectedDuringPoll, so a local selection during in-flight refresh no longer clears multimodal flags.
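
The gating change can be pictured with a small Python sketch. `ChatStore`, `supports_vision`, and `finish_refresh` are hypothetical names; the point is only that the checkpoint is read after the awaits complete, not captured before them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChatStore:
    checkpoint: Optional[str] = None
    supports_vision: bool = False  # stands in for the multimodal/trust flags


def finish_refresh(store: ChatStore, active_model: Optional[str]) -> None:
    """Clear capability flags only when nothing is selected anywhere."""
    # Read the checkpoint now, after list/status calls have returned.
    checkpoint_after_poll = store.checkpoint
    if not active_model and not checkpoint_after_poll:
        store.supports_vision = False
```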
