Skip to content

Studio: kill in-flight llama-server before spawning a new one#5171

Merged
danielhanchen merged 3 commits into
mainfrom
dh/fix-5161-llama-server-race
Apr 24, 2026
Merged

Studio: kill in-flight llama-server before spawning a new one#5171
danielhanchen merged 3 commits into
mainfrom
dh/fix-5161-llama-server-race

Conversation

@danielhanchen

@danielhanchen danielhanchen commented Apr 24, 2026

Copy link
Copy Markdown
Member

Addresses #5161.

Problem

Two rapid Apply clicks in the chat settings panel can race two load_model calls. Both pass the Phase 1 _kill_process (because neither has stored its Popen handle yet), both download / read metadata, and both reach Phase 3 and spawn a server. Only the last reference is tracked in self._process. The first server becomes an orphan that holds the model in RAM until the kernel OOM kicks in. For an 86 GB Q8_0 MoE on a 128 GB host, two clicks blow the memory budget.

Fix

Two complementary changes in studio/backend/core/inference/llama_cpp.py:load_model:

  1. At entry, set the existing _cancel_event so any in-flight load aborts at its next checkpoint, then bind a fresh Event for the new load. Subsequent _kill_process and download phases pick up the new event.
  2. Inside the Phase 3 lock, immediately before subprocess.Popen, run a defensive _kill_process. If a racing load already wrote a Popen handle into self._process after the first kill ran, this drops the orphan before we overwrite the reference.

Speculative decoding cleanup

While in this code path, swap the wire value the chat UI sends for the speculative dropdown from "ngram-mod" to "default". Backend now passes the single llama-server flag --spec-default, which in common/arg.cpp:3905-3914 expands to the exact same params Studio was already passing manually:

{"--spec-default"},
string_format("enable default speculative decoding config"),
[](common_params & params) {
    params.speculative.type = COMMON_SPECULATIVE_TYPE_NGRAM_MOD;
    params.speculative.ngram_size_n = 24;
    params.speculative.n_min = 48;
    params.speculative.n_max = 64;
}

Default-on for non-vision models is preserved. Power users can still pick "ngram-mod" or "ngram-simple" explicitly and the manual draft tuning still applies for those values.

Verified

  • Defensive kill unit test: a planted Popen orphan handle is terminated before self._process is overwritten.
  • Ten spec-cmd mappings produce the expected llama-server args (default -> --spec-default, off / null / "" -> no flag, vision -> always disabled, ngram-mod / ngram-simple still work; case-insensitive).

Caveats

The issue also reports two related symptoms which I could not reproduce without the user's specific 80B-class local GGUF on a small-VRAM host:

  • -c 4096 ignoring the UI Context Length.
  • --spec-type ngram-mod flags lingering when the toggle is set to Off. With the race fix in this PR, the most likely cause (the user observing the orphan server's cmdline) goes away. If the symptom persists, it's a separate issue worth a dedicated reproduction.

Test plan

  • Open Studio chat, load a small GGUF, click Apply twice quickly with a context length change in between, confirm only one llama-server process exists.
  • Toggle Speculative Decoding to Off, click Apply, confirm --spec-default and --spec-type are absent from the live llama-server command line.
  • Toggle Speculative Decoding to On, click Apply, confirm --spec-default is present.

Two rapid Apply clicks in the chat settings panel can race two
load_model calls. Both pass the Phase 1 _kill_process (because neither
has stored its Popen handle yet), both download / read metadata, and
both reach Phase 3 and spawn a server. Only the last reference is
tracked in self._process. The first server becomes an orphan that
holds the model in RAM until the kernel OOM kicks in. Addresses #5161.

Two complementary changes in studio/backend/core/inference/llama_cpp.py:

1. At load_model entry, set the existing _cancel_event so any in-flight
   load aborts at its next checkpoint, then bind a fresh Event for
   the new load. Subsequent _kill_process and download phases pick up
   the new event.
2. Inside the Phase 3 lock, immediately before subprocess.Popen, run a
   defensive _kill_process that removes any orphan handle a racing
   load might have stored after the first kill ran.

Speculative decoding cleanup (also touched while in this code path):

The chat UI used to send "ngram-mod" as the wire value when the
speculative dropdown was On, and the backend mapped that to the
4-flag combo --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min
48 --draft-max 64. Switch the wire value to "default" and have the
backend pass the single llama-server flag --spec-default. That flag
expands to the exact same params (see common/arg.cpp:3905-3914 in
llama.cpp). Default-on for non-vision models is preserved.

Verified:
- Unit test: planted Popen orphan handle is terminated before
  self._process is overwritten.
- All ten spec-cmd mappings produce the expected llama-server args
  ("default" -> --spec-default, "off" / null -> no flag, vision ->
  always disabled, manual "ngram-mod" / "ngram-simple" still work).

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a cancellation mechanism for in-flight model loads and adds defensive process termination to prevent orphan processes. It also updates the speculative decoding configuration to use a "default" setting instead of "ngram-mod" across the backend and frontend. A critical issue was identified in the cancellation logic: re-initializing self._cancel_event immediately after signaling it causes concurrent threads to lose the cancellation signal, as they will reference the new event object instead of the signaled one.

Comment on lines +1312 to +1313
self._cancel_event.set()
self._cancel_event = threading.Event()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The cancellation logic for in-flight loads is partially ineffective due to how self._cancel_event is accessed in sub-methods. Methods like _download_gguf (e.g., at line 1193) check self._cancel_event.is_set() by accessing the attribute on the instance. When a racing call to load_model replaces self._cancel_event with a new Event object at line 1313, any thread already executing a download will begin checking the new event object instead of the one that was just signaled. This prevents the cancellation action from being successfully performed, which can lead to the main thread getting stuck or redundant resource consumption. To fix this, the download methods should capture a reference to the event at the start of their execution or be modified to accept the event as an argument, ensuring the cancellation signal is reliably received.

References
  1. A background watcher thread responsible for cancellation (e.g., by closing a resource) should only terminate after successfully performing its action. If the action fails, the thread should retry to prevent the main thread from getting stuck.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f973d2a564

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1312 to +1313
self._cancel_event.set()
self._cancel_event = threading.Event()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid dropping unload cancellation when swapping cancel events

Rebinding self._cancel_event here introduces a race with unload_model(): if an unload request lands after set() but before the reassignment, unload_model sets the old event object and then load_model immediately replaces it with a fresh clear event, so the in-flight load no longer sees cancellation and can still spawn llama-server after the user asked to stop. This regression is specific to concurrent unload/load timing and is actionable by keeping a stable event identity (or guarding replacement with synchronization and per-load event snapshots).

Useful? React with 👍 / 👎.

The previous attempt at cancelling in-flight loads via
``self._cancel_event.set()`` followed by
``self._cancel_event = threading.Event()`` was broken in two ways
(flagged independently by Gemini and Codex):

1. Sub-methods like _download_gguf consult ``self._cancel_event``
   on every check. After the rebind, the in-flight thread reads the
   FRESH unset Event, not the one we just set, so cancellation never
   propagates.

2. Worse, if unload_model() lands between ``set()`` and the rebind,
   unload's signal hits the OLD event and is then immediately
   discarded when load_model swaps in a fresh Event. The user's
   stop-request silently no-ops.

Revert to the original ``self._cancel_event.clear()``. The Phase 3
defensive ``_kill_process()`` introduced in this branch still closes
the orphan-process race that #5161 reports: even if two concurrent
loads both pass Phase 1 with self._process == None, the loser's
Phase 3 kill terminates the winner's Popen handle before overwriting
it, so we end up with exactly one llama-server process.
…down

The Speculative Decoding control was simplified to On (default) / Off,
but the backend still accepts and reports the older manual modes
(ngram-mod, ngram-simple). When a load response or status refresh comes
back with one of those values -- whether from an external API caller, a
model loaded before this PR landed, or a not-yet-upgraded backend -- the
controlled Select renders with an empty trigger because the value is not
in the SelectItem list.

Add a tiny normaliser at both entry points (status refresh + post-load)
so legacy manual modes coerce to "default". The user sees "On" instead
of a blank dropdown, and reapplying lets llama.cpp pick its own preferred
strategy via --spec-default.

Reviewer-flagged finding on PR #5171.
@danielhanchen

Copy link
Copy Markdown
Member Author

Thanks for the review. Pushed 3cda845 addressing the dropdown finding:

Fixed: simplified Speculative Decoding dropdown rendered blank for legacy values (1/5 reviewers, P2)

The Select was simplified to On (default) / Off, but the backend still accepts and reports the older ngram-mod and ngram-simple modes. A user upgrading with one of those persisted, or any external API caller setting them, would land on a blank dropdown trigger.

Added a normaliser at both entry points (status refresh + post-load) so legacy manual modes coerce to "default". The user sees "On" instead of empty; reapplying lets llama.cpp pick its own preferred strategy via --spec-default.

Deferred (out of scope or low impact):

  • --spec-default backwards-compat probe for older binaries discovered via LLAMA_SERVER_PATH -- the bundled unsloth/llama.cpp build is current and well past the --spec-default introduction. Custom-binary users are vanishingly rare.
  • "Latest Apply wins" race-semantics finding -- the Phase 3 defensive kill ships the orphan-prevention fix the issue requested. The deeper "newest-by-wall-clock" guarantee is a behavioural redesign that needs its own RFC.
  • Stale-branch resolve_model_class drift -- already auto-rebased on the staging branch.

@danielhanchen

Copy link
Copy Markdown
Member Author

/gemini review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3cda84533a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +35 to +37
if (v == null) return null;
if (v === "default" || v === "off") return v;
return "default";

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve explicit speculative modes instead of coercing

normalizeSpeculativeType now rewrites any non-null mode except "default"/"off" to "default", which means a model loaded with explicit ngram-mod/ngram-simple (e.g., via API or older Studio) is silently converted in store state; the next Apply for unrelated settings sends speculative_type: "default" and unintentionally drops that manual override. This is a behavior regression for power users because backend still supports explicit modes but the UI no longer preserves them.

Useful? React with 👍 / 👎.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the speculative decoding configuration to use llama.cpp's --spec-default flag by introducing a "default" type, while maintaining backward compatibility for manual draft tuning. It also includes a defensive process cleanup in the backend to prevent orphaned processes during concurrent model loads and updates the frontend to normalize speculative decoding types. I have no feedback to provide.

@danielhanchen danielhanchen merged commit c2dc2eb into main Apr 24, 2026
6 checks passed
@danielhanchen danielhanchen deleted the dh/fix-5161-llama-server-race branch April 24, 2026 16:05
alkinun pushed a commit to alkinun/unsloth that referenced this pull request May 17, 2026
…lothai#5427)

* Studio: serialise GGUF reload and inherit unsloth-run extra args

Closes unslothai#5401.

Three related GGUF reload bugs reproduced against `unsloth studio run -m unsloth/Qwen3-0.6B-GGUF --gguf-variant Q4_K_M --top-k 20 --seed 42`:

1. The `POST /api/inference/load` already-loaded short-circuit only compared `model_identifier` and `hf_variant`. A same-(model, variant) Apply that flipped `cache_type_kv` / `speculative_type` / `chat_template_override` / `max_seq_length` / `llama_extra_args` returned `status="already_loaded"` and the new setting silently never reached llama-server.

2. The frontend chat-settings Apply path POSTs `/unload` then `/load` without round-tripping `llama_extra_args`. Every reload after `unsloth run --some-flag X` quietly dropped `--some-flag X` from the spawned `llama-server` command line.

3. `LlamaCppBackend.load_model` released `_lock` between Phase 1 (kill) and Phase 3 (spawn) so two concurrent loads each passed Phase 1 with `self._process is None`. Both ran Phase 2 (download), both reached Phase 3, and the Phase 3 defensive `_kill_process()` from unslothai#5171 collapsed them to one survivor only after both `subprocess.Popen` calls had landed. For the 86 GB MoE in unslothai#5161 / the model in unslothai#5401 the overlap window was tens of seconds, long enough to OOM the host. With a 0.6B model the pgrep timeline showed two simultaneous PIDs for 3.3 s on `main`.

Fix:

`studio/backend/core/inference/llama_cpp.py`

* Add `self._serial_load_lock = threading.Lock()`. The whole body of `load_model` runs under this lock so two concurrent `/api/inference/load` requests are strictly sequential. The fine-grained `_lock` and the Phase 3 defensive `_kill_process()` from unslothai#5171 are kept as a second layer. `/unload`, `/status`, and `/load-progress` are unaffected because they only touch the fine-grained lock or read properties.
* Add `self._extra_args` plus an `extra_args` property, written inside `load_model` whenever the caller supplies a non-`None` value. `unload_model()` deliberately does not reset it so the route layer can inherit the args across the frontend's `/unload` + `/load` gap.

`studio/backend/routes/inference.py`

* Add `_request_matches_loaded_settings(request, llama_backend)` that compares `max_seq_length`, `cache_type_kv`, `speculative_type`, `chat_template_override`, and `llama_extra_args` between the incoming request and the live backend. Same-(model, variant) requests whose runtime settings differ now fall through to a real reload instead of returning `already_loaded`. A missing `llama_extra_args` field on the request is treated as "inherit current", so the short-circuit still fires when the only difference is the frontend not echoing the CLI flags back.
* GGUF load branch inherits `llama_extra_args` from `llama_backend.extra_args` when the request omits the field, re-validates through `validate_extra_args`, and forwards the result to `load_model(...)`. An explicit `[]` from the caller is still honoured as "clear".

Verified end to end against a live `unsloth studio run` instance:

| Scenario                                                        | Before    | After                                                                    |
| --------------------------------------------------------------- | --------- | ------------------------------------------------------------------------ |
| `/load` same (model, variant, settings)                         | 1 PID, `already_loaded` | unchanged                                                                |
| `/load` same model, variant, new `cache_type_kv=q8_0` ctx=8192  | `already_loaded`, settings dropped | `loaded`, `/status` reports the new settings, new server has `-c 8192 --cache-type-k q8_0 --top-k 20 --seed 42` |
| Frontend Apply `/unload` + `/load`, new settings, no `llama_extra_args` field | Drops `--top-k 20 --seed 42` | Preserves `--top-k 20 --seed 42`                                          |
| `/unload` + two parallel `/load`                                | Two PIDs for 3.3 s | Max simultaneous count = 1 across the full pgrep timeline                  |
| `/load` with `llama_extra_args=[]` (explicit clear)             | n/a       | `loaded`, new server has no `--top-k` / `--seed`                          |
| `/load` with `llama_extra_args=["--top-k","30","--seed","7"]` (override) | n/a       | `loaded`, new server has the supplied flags                                |

`pytest studio/backend/tests` is green except for one pre-existing terminal-width-sensitive assertion (`test_studio_api.py::test_help_output`) and the pre-existing `test_studio_api.py` fixture errors that fail on unmodified main too. No new regressions.

* Studio: track requested n_ctx so Auto-slider flips trigger a reload

Review feedback on PR unslothai#5427 from gemini-code-assist.

The original short-circuit compared ``request.max_seq_length`` against
``llama_backend.context_length`` (the effective context). VRAM-fit
logic can cap the running server below what the caller asked for, so
this comparison incorrectly returns ``already_loaded`` when the user
flips the slider from an explicit length (e.g. 8192) back to "Auto"
(0): the explicit request was capped to, say, 4096, and the new "Auto"
request reads ``backend.context_length == 4096`` and decides nothing
changed.

Track the originally requested ``n_ctx`` on the backend instead and
compare against that. ``requested_n_ctx == 0`` means the last load
asked for the model's native length; ``request.max_seq_length == 0``
matches it.

Verified in the sandbox suite (now 90 tests):

- ``test_explicit_to_auto_triggers_reload`` -- loaded with explicit
  8192, then Apply with ``max_seq_length=0`` falls through to a real
  reload and the new server runs at the native 40960.
- ``test_auto_to_explicit_triggers_reload`` -- inverse direction.
- ``test_explicit_to_same_explicit_short_circuits`` -- re-Apply with
  the same explicit value still short-circuits (no needless reload).
- Existing scenarios (kv change, spec change, template change, extra
  args inherit, parallel-load stress, frontend Apply flow) unchanged.

``pytest studio/backend/tests`` still green on the same set of tests;
the pre-existing ``test_help_output`` failure and ``test_studio_api``
fixture errors are unaffected.

* Studio: tighten comments in the 5401 fix

Trim the verbose explanatory comments and docstrings introduced in
f9cbec3 and dd0b1d5 down to one-line summaries. The "why" still
points at issue unslothai#5401; the multi-paragraph rationale belonged in the
PR body, not the source. No behaviour change.

* ci: retrigger after zoo drift + IPython fixes landed in main

* ci: retrigger Mac Studio UI CI after transient fetch flake

* Studio: address six P2 followups on the 5401 reload PR

Tightens the inheritance and serial-load paths to close the six P2
findings raised by codex-connector on PR unslothai#5427 against `f9cbec3b` /
`dd0b1d58`.

1. Re-check loaded state before killing queued loads. Two duplicate
   `/api/inference/load` requests both pass the route-level
   `is_loaded` gate before the first publishes `_healthy = True`. The
   second waits on `_serial_load_lock`, enters Phase 1, and tears down
   the just-spawned llama-server for a redundant full reload. Added
   `LlamaCppBackend._already_in_target_state(...)` and a short-circuit
   at the top of the serial-lock block: if the live server already
   satisfies the kwargs, return True without killing.

2. Don't inherit CLI overrides that shadow new first-class settings.
   `unsloth run -c 4096` is a permitted pass-through; the validator
   docs explicitly call out `-c`/`--ctx-size`. Stored in `_extra_args`
   and appended after Studio's own flags, the inherited `-c 4096`
   silently won the last-wins parse against a new
   `max_seq_length=8192`. Added `strip_shadowing_flags` in
   `llama_server_args.py` (covers `-c`, `--cache-type-k/v`, `--spec-*`,
   `--chat-template*`, `--jinja`/`--no-jinja`) and the route runs the
   inherited list through it before validate + forward.

3. Restrict inherited llama args to the same GGUF model. `_extra_args`
   is deliberately preserved across `unload_model()` for the chat-
   settings Apply flow (`/unload` + `/load` with no `llama_extra_args`
   field). Now also track `_extra_args_source = (model_identifier,
   hf_variant)` so the route can refuse cross-model inheritance.
   `LlamaCppBackend.extra_args_source` exposes the tuple.

4. Persist extras only after a successful load. `_extra_args` was
   written at the top of `load_model` before Popen + health check, so
   a failed startup left bad args in place to poison the next UI
   retry. The write (along with `_requested_n_ctx`) is now deferred
   until after `_healthy = True`.

5. Ignore speculative diffs for vision loads. `load_model` silently
   gates speculative decoding on `not is_vision`, so the backend's
   `_speculative_type` stays `None` for vision models. The route's
   comparator now normalises the request's value to `"off"` when
   `llama_backend.is_vision` to avoid a no-op reload of a vision
   server every time the dropdown defaults to `default`. The
   `_already_in_target_state` helper applies the same rule.

6. Wait for the replacement server before short-circuiting. `_kill_process`
   did not clear `_healthy`; the new first-class settings
   (`_cache_type_kv`, `_speculative_type`, `_chat_template_override`)
   are written under `_lock` BEFORE Popen + `_wait_for_health`. A
   duplicate `/load` arriving during the new server's warm-up window
   could short-circuit against the not-yet-healthy replacement and the
   caller would start inference against a server that was still
   loading. `_kill_process` now sets `_healthy = False` in its
   `finally` block so `is_loaded` returns False from the moment the
   old server is killed until the new one finishes warm-up.

Tests:

- Sandbox suite under `./temp/sim_5401/` extended to 136 tests (was
  90): new unit coverage for `strip_shadowing_flags` (12 cases),
  `_kill_process` clears `_healthy`, `extra_args_source` lifecycle and
  cross-model behaviour, failed-load preserving prior extras, and the
  duplicate-load short-circuit at `load_model` level. New live
  integration cases verify shadow-strip via `pgrep` on the live
  llama-server cmdline, cross-model refusal, and PID stability across
  a duplicate-load race. All 136 pass.
- `pytest studio/backend/tests --deselect test_studio_api.py`:
  1079 passed, 46 skipped, identical to the pre-change count. The
  pre-existing `test_studio_api.py` fixture errors and the
  terminal-width-sensitive `test_help_output` are unaffected.
- Ruff: clean on the three modified files.

* Studio: tighten GGUF reload inheritance and duplicate-load guard

Re-narrow llama_extra_args to None after validate_extra_args when the
incoming request omitted the field, so the backend can distinguish
"caller omitted, inherit prior load" from "caller explicitly cleared
to []". Without this a queued duplicate /load reaches the backend as
[] and fails _already_in_target_state's exact-equality check, killing
the just-started llama-server. The pass-through validate call from
the original "forward llama-server args from unsloth studio run /
unsloth run" change is preserved as-is; only the post-pass narrowing
is new. Cross-source loads now explicitly clear extras so a model
switch can't accidentally inherit via the backend's "no opinion"
semantics.

Store the caller's hf_variant kwarg (None for local GGUF files) in
_extra_args_source instead of the derived self._hf_variant
(an extracted filename quant label like "Q4_K_M"). Same-source check
in the route is now symmetric for HF and direct-file loads.

Add gguf_path to _already_in_target_state and prefer on-disk path
identity when both backend and caller have a path. This stops the
duplicate-load guard from killing a healthy server on repeat local
loads (where hf_variant is None on the caller side but extracted on
the backend side).

Split shadow-flag stripping into per-group toggles (context / cache /
spec / template). The route now opts into stripping only the groups
whose first-class field was actually set on the incoming request, so
an inherited --chat-template-file survives an Apply that omits
chat_template_override. _request_matches_loaded_settings detects
shadowing extras on the inherit path and falls through to a real
reload so the strip can run.

Mark --spec-default, --jinja, --no-jinja as boolean inside the
shadow stripper so the value-consuming heuristic no longer eats the
following positional token.

* Studio: trim comments around GGUF reload inheritance

* Studio: cover GGUF reload inheritance and shadow-flag stripping

* Studio: drop redundant issue refs from inheritance comments

* Studio: drop redundant issue refs from inheritance comments

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Studio: key inheritance source off resolved gguf_variant

codex-connector P2 on PR unslothai#5427 cd14cae: the inheritance gate at
``routes/inference.py:696`` compared the stored ``source[1]`` against
``request.gguf_variant``, but the HF branch loaded with
``hf_variant = config.gguf_variant`` (the *resolved* variant after
ModelConfig auto-pick). When the caller omitted ``gguf_variant`` on a
follow-up Apply, ``source[1] == "Q4_K_M"`` but
``(request.gguf_variant or "") == ""``, ``same_source`` returned False,
and the chat-settings Apply silently dropped CLI pass-through flags
for every auto-pick / local-file load.

Fix both sides of the comparison to key off ``config.gguf_variant``:

* The route compares ``source[1]`` to ``config.gguf_variant`` (the
  resolved label) rather than the request field.
* The local-mode load_model call now passes
  ``hf_variant = config.gguf_variant`` so ``_extra_args_source``
  stores the same string the route reads back. The HF branch already
  did this.

Sandbox: added test_source_records_caller_variant_not_extracted_label
to lock the storage key contract.
``pytest studio/backend/tests --deselect test_studio_api.py``:
1100 passed, identical to pre-change.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Studio: deny upstream --ui family on llama-server pass-through

The validator's web-UI block named only ``--webui`` / ``--no-webui``,
which is llama.cpp's pre-rename spelling. Current upstream
(``tools/server/README.md``) uses ``--ui`` / ``--no-ui`` plus
``--ui-config``, ``--ui-config-file``, and ``--ui-mcp-proxy`` /
``--no-ui-mcp-proxy``. Without these in the denylist a user could
``unsloth run --ui`` and enable llama-server's built-in web UI on
the port Studio's reverse proxy targets, breaking the UI surface.

Keep the legacy ``--webui`` group so the validator still rejects
old binaries that haven't been re-spelled.

Cross-referenced against the README's full flag list; this was the
only gap for the post-unslothai#5401 inheritance / shadow-strip work. Pass-
through flags from every other README category (sampling, jinja,
ctx, cache, threads, GPU, reasoning, grammar, chat-template-kwargs)
already validate cleanly; sandbox suite exercises ~60 of them in
the new ``test_08_llama_server_pass_through.py``.

``pytest studio/backend/tests --deselect test_studio_api.py``:
1100 passed, identical to pre-change.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant