Studio: kill in-flight llama-server before spawning a new one#5171
Conversation
Two rapid Apply clicks in the chat settings panel can race two load_model calls. Both pass the Phase 1 _kill_process (because neither has stored its Popen handle yet), both download / read metadata, and both reach Phase 3 and spawn a server. Only the last reference is tracked in self._process. The first server becomes an orphan that holds the model in RAM until the kernel OOM kicks in. Addresses #5161. Two complementary changes in studio/backend/core/inference/llama_cpp.py: 1. At load_model entry, set the existing _cancel_event so any in-flight load aborts at its next checkpoint, then bind a fresh Event for the new load. Subsequent _kill_process and download phases pick up the new event. 2. Inside the Phase 3 lock, immediately before subprocess.Popen, run a defensive _kill_process that removes any orphan handle a racing load might have stored after the first kill ran. Speculative decoding cleanup (also touched while in this code path): The chat UI used to send "ngram-mod" as the wire value when the speculative dropdown was On, and the backend mapped that to the 4-flag combo --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64. Switch the wire value to "default" and have the backend pass the single llama-server flag --spec-default. That flag expands to the exact same params (see common/arg.cpp:3905-3914 in llama.cpp). Default-on for non-vision models is preserved. Verified: - Unit test: planted Popen orphan handle is terminated before self._process is overwritten. - All ten spec-cmd mappings produce the expected llama-server args ("default" -> --spec-default, "off" / null -> no flag, vision -> always disabled, manual "ngram-mod" / "ngram-simple" still work).
There was a problem hiding this comment.
Code Review
This pull request introduces a cancellation mechanism for in-flight model loads and adds defensive process termination to prevent orphan processes. It also updates the speculative decoding configuration to use a "default" setting instead of "ngram-mod" across the backend and frontend. A critical issue was identified in the cancellation logic: re-initializing self._cancel_event immediately after signaling it causes concurrent threads to lose the cancellation signal, as they will reference the new event object instead of the signaled one.
| self._cancel_event.set() | ||
| self._cancel_event = threading.Event() |
There was a problem hiding this comment.
The cancellation logic for in-flight loads is partially ineffective due to how self._cancel_event is accessed in sub-methods. Methods like _download_gguf (e.g., at line 1193) check self._cancel_event.is_set() by accessing the attribute on the instance. When a racing call to load_model replaces self._cancel_event with a new Event object at line 1313, any thread already executing a download will begin checking the new event object instead of the one that was just signaled. This prevents the cancellation action from being successfully performed, which can lead to the main thread getting stuck or redundant resource consumption. To fix this, the download methods should capture a reference to the event at the start of their execution or be modified to accept the event as an argument, ensuring the cancellation signal is reliably received.
References
- A background watcher thread responsible for cancellation (e.g., by closing a resource) should only terminate after successfully performing its action. If the action fails, the thread should retry to prevent the main thread from getting stuck.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f973d2a564
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| self._cancel_event.set() | ||
| self._cancel_event = threading.Event() |
There was a problem hiding this comment.
Avoid dropping unload cancellation when swapping cancel events
Rebinding self._cancel_event here introduces a race with unload_model(): if an unload request lands after set() but before the reassignment, unload_model sets the old event object and then load_model immediately replaces it with a fresh clear event, so the in-flight load no longer sees cancellation and can still spawn llama-server after the user asked to stop. This regression is specific to concurrent unload/load timing and is actionable by keeping a stable event identity (or guarding replacement with synchronization and per-load event snapshots).
Useful? React with 👍 / 👎.
The previous attempt at cancelling in-flight loads via ``self._cancel_event.set()`` followed by ``self._cancel_event = threading.Event()`` was broken in two ways (flagged independently by Gemini and Codex): 1. Sub-methods like _download_gguf consult ``self._cancel_event`` on every check. After the rebind, the in-flight thread reads the FRESH unset Event, not the one we just set, so cancellation never propagates. 2. Worse, if unload_model() lands between ``set()`` and the rebind, unload's signal hits the OLD event and is then immediately discarded when load_model swaps in a fresh Event. The user's stop-request silently no-ops. Revert to the original ``self._cancel_event.clear()``. The Phase 3 defensive ``_kill_process()`` introduced in this branch still closes the orphan-process race that #5161 reports: even if two concurrent loads both pass Phase 1 with self._process == None, the loser's Phase 3 kill terminates the winner's Popen handle before overwriting it, so we end up with exactly one llama-server process.
…down The Speculative Decoding control was simplified to On (default) / Off, but the backend still accepts and reports the older manual modes (ngram-mod, ngram-simple). When a load response or status refresh comes back with one of those values -- whether from an external API caller, a model loaded before this PR landed, or a not-yet-upgraded backend -- the controlled Select renders with an empty trigger because the value is not in the SelectItem list. Add a tiny normaliser at both entry points (status refresh + post-load) so legacy manual modes coerce to "default". The user sees "On" instead of a blank dropdown, and reapplying lets llama.cpp pick its own preferred strategy via --spec-default. Reviewer-flagged finding on PR #5171.
|
Thanks for the review. Pushed 3cda845 addressing the dropdown finding: Fixed: simplified Speculative Decoding dropdown rendered blank for legacy values (1/5 reviewers, P2) The Select was simplified to On (default) / Off, but the backend still accepts and reports the older Added a normaliser at both entry points (status refresh + post-load) so legacy manual modes coerce to "default". The user sees "On" instead of empty; reapplying lets llama.cpp pick its own preferred strategy via Deferred (out of scope or low impact):
|
|
/gemini review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3cda84533a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (v == null) return null; | ||
| if (v === "default" || v === "off") return v; | ||
| return "default"; |
There was a problem hiding this comment.
Preserve explicit speculative modes instead of coercing
normalizeSpeculativeType now rewrites any non-null mode except "default"/"off" to "default", which means a model loaded with explicit ngram-mod/ngram-simple (e.g., via API or older Studio) is silently converted in store state; the next Apply for unrelated settings sends speculative_type: "default" and unintentionally drops that manual override. This is a behavior regression for power users because backend still supports explicit modes but the UI no longer preserves them.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request updates the speculative decoding configuration to use llama.cpp's --spec-default flag by introducing a "default" type, while maintaining backward compatibility for manual draft tuning. It also includes a defensive process cleanup in the backend to prevent orphaned processes during concurrent model loads and updates the frontend to normalize speculative decoding types. I have no feedback to provide.
…lothai#5427) * Studio: serialise GGUF reload and inherit unsloth-run extra args Closes unslothai#5401. Three related GGUF reload bugs reproduced against `unsloth studio run -m unsloth/Qwen3-0.6B-GGUF --gguf-variant Q4_K_M --top-k 20 --seed 42`: 1. The `POST /api/inference/load` already-loaded short-circuit only compared `model_identifier` and `hf_variant`. A same-(model, variant) Apply that flipped `cache_type_kv` / `speculative_type` / `chat_template_override` / `max_seq_length` / `llama_extra_args` returned `status="already_loaded"` and the new setting silently never reached llama-server. 2. The frontend chat-settings Apply path POSTs `/unload` then `/load` without round-tripping `llama_extra_args`. Every reload after `unsloth run --some-flag X` quietly dropped `--some-flag X` from the spawned `llama-server` command line. 3. `LlamaCppBackend.load_model` released `_lock` between Phase 1 (kill) and Phase 3 (spawn) so two concurrent loads each passed Phase 1 with `self._process is None`. Both ran Phase 2 (download), both reached Phase 3, and the Phase 3 defensive `_kill_process()` from unslothai#5171 collapsed them to one survivor only after both `subprocess.Popen` calls had landed. For the 86 GB MoE in unslothai#5161 / the model in unslothai#5401 the overlap window was tens of seconds, long enough to OOM the host. With a 0.6B model the pgrep timeline showed two simultaneous PIDs for 3.3 s on `main`. Fix: `studio/backend/core/inference/llama_cpp.py` * Add `self._serial_load_lock = threading.Lock()`. The whole body of `load_model` runs under this lock so two concurrent `/api/inference/load` requests are strictly sequential. The fine-grained `_lock` and the Phase 3 defensive `_kill_process()` from unslothai#5171 are kept as a second layer. `/unload`, `/status`, and `/load-progress` are unaffected because they only touch the fine-grained lock or read properties. * Add `self._extra_args` plus an `extra_args` property, written inside `load_model` whenever the caller supplies a non-`None` value. `unload_model()` deliberately does not reset it so the route layer can inherit the args across the frontend's `/unload` + `/load` gap. `studio/backend/routes/inference.py` * Add `_request_matches_loaded_settings(request, llama_backend)` that compares `max_seq_length`, `cache_type_kv`, `speculative_type`, `chat_template_override`, and `llama_extra_args` between the incoming request and the live backend. Same-(model, variant) requests whose runtime settings differ now fall through to a real reload instead of returning `already_loaded`. A missing `llama_extra_args` field on the request is treated as "inherit current", so the short-circuit still fires when the only difference is the frontend not echoing the CLI flags back. * GGUF load branch inherits `llama_extra_args` from `llama_backend.extra_args` when the request omits the field, re-validates through `validate_extra_args`, and forwards the result to `load_model(...)`. An explicit `[]` from the caller is still honoured as "clear". Verified end to end against a live `unsloth studio run` instance: | Scenario | Before | After | | --------------------------------------------------------------- | --------- | ------------------------------------------------------------------------ | | `/load` same (model, variant, settings) | 1 PID, `already_loaded` | unchanged | | `/load` same model, variant, new `cache_type_kv=q8_0` ctx=8192 | `already_loaded`, settings dropped | `loaded`, `/status` reports the new settings, new server has `-c 8192 --cache-type-k q8_0 --top-k 20 --seed 42` | | Frontend Apply `/unload` + `/load`, new settings, no `llama_extra_args` field | Drops `--top-k 20 --seed 42` | Preserves `--top-k 20 --seed 42` | | `/unload` + two parallel `/load` | Two PIDs for 3.3 s | Max simultaneous count = 1 across the full pgrep timeline | | `/load` with `llama_extra_args=[]` (explicit clear) | n/a | `loaded`, new server has no `--top-k` / `--seed` | | `/load` with `llama_extra_args=["--top-k","30","--seed","7"]` (override) | n/a | `loaded`, new server has the supplied flags | `pytest studio/backend/tests` is green except for one pre-existing terminal-width-sensitive assertion (`test_studio_api.py::test_help_output`) and the pre-existing `test_studio_api.py` fixture errors that fail on unmodified main too. No new regressions. * Studio: track requested n_ctx so Auto-slider flips trigger a reload Review feedback on PR unslothai#5427 from gemini-code-assist. The original short-circuit compared ``request.max_seq_length`` against ``llama_backend.context_length`` (the effective context). VRAM-fit logic can cap the running server below what the caller asked for, so this comparison incorrectly returns ``already_loaded`` when the user flips the slider from an explicit length (e.g. 8192) back to "Auto" (0): the explicit request was capped to, say, 4096, and the new "Auto" request reads ``backend.context_length == 4096`` and decides nothing changed. Track the originally requested ``n_ctx`` on the backend instead and compare against that. ``requested_n_ctx == 0`` means the last load asked for the model's native length; ``request.max_seq_length == 0`` matches it. Verified in the sandbox suite (now 90 tests): - ``test_explicit_to_auto_triggers_reload`` -- loaded with explicit 8192, then Apply with ``max_seq_length=0`` falls through to a real reload and the new server runs at the native 40960. - ``test_auto_to_explicit_triggers_reload`` -- inverse direction. - ``test_explicit_to_same_explicit_short_circuits`` -- re-Apply with the same explicit value still short-circuits (no needless reload). - Existing scenarios (kv change, spec change, template change, extra args inherit, parallel-load stress, frontend Apply flow) unchanged. ``pytest studio/backend/tests`` still green on the same set of tests; the pre-existing ``test_help_output`` failure and ``test_studio_api`` fixture errors are unaffected. * Studio: tighten comments in the 5401 fix Trim the verbose explanatory comments and docstrings introduced in f9cbec3 and dd0b1d5 down to one-line summaries. The "why" still points at issue unslothai#5401; the multi-paragraph rationale belonged in the PR body, not the source. No behaviour change. * ci: retrigger after zoo drift + IPython fixes landed in main * ci: retrigger Mac Studio UI CI after transient fetch flake * Studio: address six P2 followups on the 5401 reload PR Tightens the inheritance and serial-load paths to close the six P2 findings raised by codex-connector on PR unslothai#5427 against `f9cbec3b` / `dd0b1d58`. 1. Re-check loaded state before killing queued loads. Two duplicate `/api/inference/load` requests both pass the route-level `is_loaded` gate before the first publishes `_healthy = True`. The second waits on `_serial_load_lock`, enters Phase 1, and tears down the just-spawned llama-server for a redundant full reload. Added `LlamaCppBackend._already_in_target_state(...)` and a short-circuit at the top of the serial-lock block: if the live server already satisfies the kwargs, return True without killing. 2. Don't inherit CLI overrides that shadow new first-class settings. `unsloth run -c 4096` is a permitted pass-through; the validator docs explicitly call out `-c`/`--ctx-size`. Stored in `_extra_args` and appended after Studio's own flags, the inherited `-c 4096` silently won the last-wins parse against a new `max_seq_length=8192`. Added `strip_shadowing_flags` in `llama_server_args.py` (covers `-c`, `--cache-type-k/v`, `--spec-*`, `--chat-template*`, `--jinja`/`--no-jinja`) and the route runs the inherited list through it before validate + forward. 3. Restrict inherited llama args to the same GGUF model. `_extra_args` is deliberately preserved across `unload_model()` for the chat- settings Apply flow (`/unload` + `/load` with no `llama_extra_args` field). Now also track `_extra_args_source = (model_identifier, hf_variant)` so the route can refuse cross-model inheritance. `LlamaCppBackend.extra_args_source` exposes the tuple. 4. Persist extras only after a successful load. `_extra_args` was written at the top of `load_model` before Popen + health check, so a failed startup left bad args in place to poison the next UI retry. The write (along with `_requested_n_ctx`) is now deferred until after `_healthy = True`. 5. Ignore speculative diffs for vision loads. `load_model` silently gates speculative decoding on `not is_vision`, so the backend's `_speculative_type` stays `None` for vision models. The route's comparator now normalises the request's value to `"off"` when `llama_backend.is_vision` to avoid a no-op reload of a vision server every time the dropdown defaults to `default`. The `_already_in_target_state` helper applies the same rule. 6. Wait for the replacement server before short-circuiting. `_kill_process` did not clear `_healthy`; the new first-class settings (`_cache_type_kv`, `_speculative_type`, `_chat_template_override`) are written under `_lock` BEFORE Popen + `_wait_for_health`. A duplicate `/load` arriving during the new server's warm-up window could short-circuit against the not-yet-healthy replacement and the caller would start inference against a server that was still loading. `_kill_process` now sets `_healthy = False` in its `finally` block so `is_loaded` returns False from the moment the old server is killed until the new one finishes warm-up. Tests: - Sandbox suite under `./temp/sim_5401/` extended to 136 tests (was 90): new unit coverage for `strip_shadowing_flags` (12 cases), `_kill_process` clears `_healthy`, `extra_args_source` lifecycle and cross-model behaviour, failed-load preserving prior extras, and the duplicate-load short-circuit at `load_model` level. New live integration cases verify shadow-strip via `pgrep` on the live llama-server cmdline, cross-model refusal, and PID stability across a duplicate-load race. All 136 pass. - `pytest studio/backend/tests --deselect test_studio_api.py`: 1079 passed, 46 skipped, identical to the pre-change count. The pre-existing `test_studio_api.py` fixture errors and the terminal-width-sensitive `test_help_output` are unaffected. - Ruff: clean on the three modified files. * Studio: tighten GGUF reload inheritance and duplicate-load guard Re-narrow llama_extra_args to None after validate_extra_args when the incoming request omitted the field, so the backend can distinguish "caller omitted, inherit prior load" from "caller explicitly cleared to []". Without this a queued duplicate /load reaches the backend as [] and fails _already_in_target_state's exact-equality check, killing the just-started llama-server. The pass-through validate call from the original "forward llama-server args from unsloth studio run / unsloth run" change is preserved as-is; only the post-pass narrowing is new. Cross-source loads now explicitly clear extras so a model switch can't accidentally inherit via the backend's "no opinion" semantics. Store the caller's hf_variant kwarg (None for local GGUF files) in _extra_args_source instead of the derived self._hf_variant (an extracted filename quant label like "Q4_K_M"). Same-source check in the route is now symmetric for HF and direct-file loads. Add gguf_path to _already_in_target_state and prefer on-disk path identity when both backend and caller have a path. This stops the duplicate-load guard from killing a healthy server on repeat local loads (where hf_variant is None on the caller side but extracted on the backend side). Split shadow-flag stripping into per-group toggles (context / cache / spec / template). The route now opts into stripping only the groups whose first-class field was actually set on the incoming request, so an inherited --chat-template-file survives an Apply that omits chat_template_override. _request_matches_loaded_settings detects shadowing extras on the inherit path and falls through to a real reload so the strip can run. Mark --spec-default, --jinja, --no-jinja as boolean inside the shadow stripper so the value-consuming heuristic no longer eats the following positional token. * Studio: trim comments around GGUF reload inheritance * Studio: cover GGUF reload inheritance and shadow-flag stripping * Studio: drop redundant issue refs from inheritance comments * Studio: drop redundant issue refs from inheritance comments * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Studio: key inheritance source off resolved gguf_variant codex-connector P2 on PR unslothai#5427 cd14cae: the inheritance gate at ``routes/inference.py:696`` compared the stored ``source[1]`` against ``request.gguf_variant``, but the HF branch loaded with ``hf_variant = config.gguf_variant`` (the *resolved* variant after ModelConfig auto-pick). When the caller omitted ``gguf_variant`` on a follow-up Apply, ``source[1] == "Q4_K_M"`` but ``(request.gguf_variant or "") == ""``, ``same_source`` returned False, and the chat-settings Apply silently dropped CLI pass-through flags for every auto-pick / local-file load. Fix both sides of the comparison to key off ``config.gguf_variant``: * The route compares ``source[1]`` to ``config.gguf_variant`` (the resolved label) rather than the request field. * The local-mode load_model call now passes ``hf_variant = config.gguf_variant`` so ``_extra_args_source`` stores the same string the route reads back. The HF branch already did this. Sandbox: added test_source_records_caller_variant_not_extracted_label to lock the storage key contract. ``pytest studio/backend/tests --deselect test_studio_api.py``: 1100 passed, identical to pre-change. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Studio: deny upstream --ui family on llama-server pass-through The validator's web-UI block named only ``--webui`` / ``--no-webui``, which is llama.cpp's pre-rename spelling. Current upstream (``tools/server/README.md``) uses ``--ui`` / ``--no-ui`` plus ``--ui-config``, ``--ui-config-file``, and ``--ui-mcp-proxy`` / ``--no-ui-mcp-proxy``. Without these in the denylist a user could ``unsloth run --ui`` and enable llama-server's built-in web UI on the port Studio's reverse proxy targets, breaking the UI surface. Keep the legacy ``--webui`` group so the validator still rejects old binaries that haven't been re-spelled. Cross-referenced against the README's full flag list; this was the only gap for the post-unslothai#5401 inheritance / shadow-strip work. Pass- through flags from every other README category (sampling, jinja, ctx, cache, threads, GPU, reasoning, grammar, chat-template-kwargs) already validate cleanly; sandbox suite exercises ~60 of them in the new ``test_08_llama_server_pass_through.py``. ``pytest studio/backend/tests --deselect test_studio_api.py``: 1100 passed, identical to pre-change. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Addresses #5161.
Problem
Two rapid Apply clicks in the chat settings panel can race two
load_modelcalls. Both pass the Phase 1_kill_process(because neither has stored itsPopenhandle yet), both download / read metadata, and both reach Phase 3 and spawn a server. Only the last reference is tracked inself._process. The first server becomes an orphan that holds the model in RAM until the kernel OOM kicks in. For an 86 GB Q8_0 MoE on a 128 GB host, two clicks blow the memory budget.Fix
Two complementary changes in
studio/backend/core/inference/llama_cpp.py:load_model:_cancel_eventso any in-flight load aborts at its next checkpoint, then bind a freshEventfor the new load. Subsequent_kill_processand download phases pick up the new event.subprocess.Popen, run a defensive_kill_process. If a racing load already wrote aPopenhandle intoself._processafter the first kill ran, this drops the orphan before we overwrite the reference.Speculative decoding cleanup
While in this code path, swap the wire value the chat UI sends for the speculative dropdown from
"ngram-mod"to"default". Backend now passes the single llama-server flag--spec-default, which incommon/arg.cpp:3905-3914expands to the exact same params Studio was already passing manually:{"--spec-default"}, string_format("enable default speculative decoding config"), [](common_params & params) { params.speculative.type = COMMON_SPECULATIVE_TYPE_NGRAM_MOD; params.speculative.ngram_size_n = 24; params.speculative.n_min = 48; params.speculative.n_max = 64; }Default-on for non-vision models is preserved. Power users can still pick
"ngram-mod"or"ngram-simple"explicitly and the manual draft tuning still applies for those values.Verified
Popenorphan handle is terminated beforeself._processis overwritten.default->--spec-default,off/ null /""-> no flag, vision -> always disabled,ngram-mod/ngram-simplestill work; case-insensitive).Caveats
The issue also reports two related symptoms which I could not reproduce without the user's specific 80B-class local GGUF on a small-VRAM host:
-c 4096ignoring the UI Context Length.--spec-type ngram-modflags lingering when the toggle is set to Off. With the race fix in this PR, the most likely cause (the user observing the orphan server's cmdline) goes away. If the symptom persists, it's a separate issue worth a dedicated reproduction.Test plan
llama-serverprocess exists.--spec-defaultand--spec-typeare absent from the livellama-servercommand line.--spec-defaultis present.