Studio: kill in-flight llama-server before spawning a new one by danielhanchen · Pull Request #5171 · unslothai/unsloth

danielhanchen · 2026-04-24T14:07:22Z

Addresses #5161.

Problem

Two rapid Apply clicks in the chat settings panel can race two load_model calls. Both pass the Phase 1 _kill_process (because neither has stored its Popen handle yet), both download / read metadata, and both reach Phase 3 and spawn a server. Only the last reference is tracked in self._process. The first server becomes an orphan that holds the model in RAM until the kernel OOM kicks in. For an 86 GB Q8_0 MoE on a 128 GB host, two clicks blow the memory budget.

Fix

Two complementary changes in studio/backend/core/inference/llama_cpp.py:load_model:

At entry, set the existing _cancel_event so any in-flight load aborts at its next checkpoint, then bind a fresh Event for the new load. Subsequent _kill_process and download phases pick up the new event.
Inside the Phase 3 lock, immediately before subprocess.Popen, run a defensive _kill_process. If a racing load already wrote a Popen handle into self._process after the first kill ran, this drops the orphan before we overwrite the reference.

Speculative decoding cleanup

While in this code path, swap the wire value the chat UI sends for the speculative dropdown from "ngram-mod" to "default". Backend now passes the single llama-server flag --spec-default, which in common/arg.cpp:3905-3914 expands to the exact same params Studio was already passing manually:

{"--spec-default"},
string_format("enable default speculative decoding config"),
[](common_params & params) {
    params.speculative.type = COMMON_SPECULATIVE_TYPE_NGRAM_MOD;
    params.speculative.ngram_size_n = 24;
    params.speculative.n_min = 48;
    params.speculative.n_max = 64;
}

Default-on for non-vision models is preserved. Power users can still pick "ngram-mod" or "ngram-simple" explicitly and the manual draft tuning still applies for those values.

Verified

Defensive kill unit test: a planted Popen orphan handle is terminated before self._process is overwritten.
Ten spec-cmd mappings produce the expected llama-server args (default -> --spec-default, off / null / "" -> no flag, vision -> always disabled, ngram-mod / ngram-simple still work; case-insensitive).

Caveats

The issue also reports two related symptoms which I could not reproduce without the user's specific 80B-class local GGUF on a small-VRAM host:

-c 4096 ignoring the UI Context Length.
--spec-type ngram-mod flags lingering when the toggle is set to Off. With the race fix in this PR, the most likely cause (the user observing the orphan server's cmdline) goes away. If the symptom persists, it's a separate issue worth a dedicated reproduction.

Test plan

Open Studio chat, load a small GGUF, click Apply twice quickly with a context length change in between, confirm only one llama-server process exists.
Toggle Speculative Decoding to Off, click Apply, confirm --spec-default and --spec-type are absent from the live llama-server command line.
Toggle Speculative Decoding to On, click Apply, confirm --spec-default is present.

Two rapid Apply clicks in the chat settings panel can race two load_model calls. Both pass the Phase 1 _kill_process (because neither has stored its Popen handle yet), both download / read metadata, and both reach Phase 3 and spawn a server. Only the last reference is tracked in self._process. The first server becomes an orphan that holds the model in RAM until the kernel OOM kicks in. Addresses #5161. Two complementary changes in studio/backend/core/inference/llama_cpp.py: 1. At load_model entry, set the existing _cancel_event so any in-flight load aborts at its next checkpoint, then bind a fresh Event for the new load. Subsequent _kill_process and download phases pick up the new event. 2. Inside the Phase 3 lock, immediately before subprocess.Popen, run a defensive _kill_process that removes any orphan handle a racing load might have stored after the first kill ran. Speculative decoding cleanup (also touched while in this code path): The chat UI used to send "ngram-mod" as the wire value when the speculative dropdown was On, and the backend mapped that to the 4-flag combo --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64. Switch the wire value to "default" and have the backend pass the single llama-server flag --spec-default. That flag expands to the exact same params (see common/arg.cpp:3905-3914 in llama.cpp). Default-on for non-vision models is preserved. Verified: - Unit test: planted Popen orphan handle is terminated before self._process is overwritten. - All ten spec-cmd mappings produce the expected llama-server args ("default" -> --spec-default, "off" / null -> no flag, vision -> always disabled, manual "ngram-mod" / "ngram-simple" still work).

gemini-code-assist

Code Review

This pull request introduces a cancellation mechanism for in-flight model loads and adds defensive process termination to prevent orphan processes. It also updates the speculative decoding configuration to use a "default" setting instead of "ngram-mod" across the backend and frontend. A critical issue was identified in the cancellation logic: re-initializing self._cancel_event immediately after signaling it causes concurrent threads to lose the cancellation signal, as they will reference the new event object instead of the signaled one.

gemini-code-assist · 2026-04-24T14:09:51Z

+        self._cancel_event.set()
+        self._cancel_event = threading.Event()


The cancellation logic for in-flight loads is partially ineffective due to how self._cancel_event is accessed in sub-methods. Methods like _download_gguf (e.g., at line 1193) check self._cancel_event.is_set() by accessing the attribute on the instance. When a racing call to load_model replaces self._cancel_event with a new Event object at line 1313, any thread already executing a download will begin checking the new event object instead of the one that was just signaled. This prevents the cancellation action from being successfully performed, which can lead to the main thread getting stuck or redundant resource consumption. To fix this, the download methods should capture a reference to the event at the start of their execution or be modified to accept the event as an argument, ensuring the cancellation signal is reliably received.

References

A background watcher thread responsible for cancellation (e.g., by closing a resource) should only terminate after successfully performing its action. If the action fails, the thread should retry to prevent the main thread from getting stuck.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f973d2a564

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-24T14:12:32Z

+        self._cancel_event.set()
+        self._cancel_event = threading.Event()


Avoid dropping unload cancellation when swapping cancel events

Rebinding self._cancel_event here introduces a race with unload_model(): if an unload request lands after set() but before the reassignment, unload_model sets the old event object and then load_model immediately replaces it with a fresh clear event, so the in-flight load no longer sees cancellation and can still spawn llama-server after the user asked to stop. This regression is specific to concurrent unload/load timing and is actionable by keeping a stable event identity (or guarding replacement with synchronization and per-load event snapshots).

Useful? React with 👍 / 👎.

The previous attempt at cancelling in-flight loads via ``self._cancel_event.set()`` followed by ``self._cancel_event = threading.Event()`` was broken in two ways (flagged independently by Gemini and Codex): 1. Sub-methods like _download_gguf consult ``self._cancel_event`` on every check. After the rebind, the in-flight thread reads the FRESH unset Event, not the one we just set, so cancellation never propagates. 2. Worse, if unload_model() lands between ``set()`` and the rebind, unload's signal hits the OLD event and is then immediately discarded when load_model swaps in a fresh Event. The user's stop-request silently no-ops. Revert to the original ``self._cancel_event.clear()``. The Phase 3 defensive ``_kill_process()`` introduced in this branch still closes the orphan-process race that #5161 reports: even if two concurrent loads both pass Phase 1 with self._process == None, the loser's Phase 3 kill terminates the winner's Popen handle before overwriting it, so we end up with exactly one llama-server process.

…down The Speculative Decoding control was simplified to On (default) / Off, but the backend still accepts and reports the older manual modes (ngram-mod, ngram-simple). When a load response or status refresh comes back with one of those values -- whether from an external API caller, a model loaded before this PR landed, or a not-yet-upgraded backend -- the controlled Select renders with an empty trigger because the value is not in the SelectItem list. Add a tiny normaliser at both entry points (status refresh + post-load) so legacy manual modes coerce to "default". The user sees "On" instead of a blank dropdown, and reapplying lets llama.cpp pick its own preferred strategy via --spec-default. Reviewer-flagged finding on PR #5171.

danielhanchen · 2026-04-24T15:57:14Z

Thanks for the review. Pushed 3cda845 addressing the dropdown finding:

Fixed: simplified Speculative Decoding dropdown rendered blank for legacy values (1/5 reviewers, P2)

The Select was simplified to On (default) / Off, but the backend still accepts and reports the older ngram-mod and ngram-simple modes. A user upgrading with one of those persisted, or any external API caller setting them, would land on a blank dropdown trigger.

Added a normaliser at both entry points (status refresh + post-load) so legacy manual modes coerce to "default". The user sees "On" instead of empty; reapplying lets llama.cpp pick its own preferred strategy via --spec-default.

Deferred (out of scope or low impact):

--spec-default backwards-compat probe for older binaries discovered via LLAMA_SERVER_PATH -- the bundled unsloth/llama.cpp build is current and well past the --spec-default introduction. Custom-binary users are vanishingly rare.
"Latest Apply wins" race-semantics finding -- the Phase 3 defensive kill ships the orphan-prevention fix the issue requested. The deeper "newest-by-wall-clock" guarantee is a behavioural redesign that needs its own RFC.
Stale-branch resolve_model_class drift -- already auto-rebased on the staging branch.

danielhanchen · 2026-04-24T15:58:49Z

/gemini review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3cda84533a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-24T16:00:00Z

+  if (v == null) return null;
+  if (v === "default" || v === "off") return v;
+  return "default";


Preserve explicit speculative modes instead of coercing

normalizeSpeculativeType now rewrites any non-null mode except "default"/"off" to "default", which means a model loaded with explicit ngram-mod/ngram-simple (e.g., via API or older Studio) is silently converted in store state; the next Apply for unrelated settings sends speculative_type: "default" and unintentionally drops that manual override. This is a behavior regression for power users because backend still supports explicit modes but the UI no longer preserves them.

Useful? React with 👍 / 👎.

gemini-code-assist

Code Review

This pull request updates the speculative decoding configuration to use llama.cpp's --spec-default flag by introducing a "default" type, while maintaining backward compatibility for manual draft tuning. It also includes a defensive process cleanup in the backend to prevent orphaned processes during concurrent model loads and updates the frontend to normalize speculative decoding types. I have no feedback to provide.

…lothai#5427) * Studio: serialise GGUF reload and inherit unsloth-run extra args Closes unslothai#5401. Three related GGUF reload bugs reproduced against `unsloth studio run -m unsloth/Qwen3-0.6B-GGUF --gguf-variant Q4_K_M --top-k 20 --seed 42`: 1. The `POST /api/inference/load` already-loaded short-circuit only compared `model_identifier` and `hf_variant`. A same-(model, variant) Apply that flipped `cache_type_kv` / `speculative_type` / `chat_template_override` / `max_seq_length` / `llama_extra_args` returned `status="already_loaded"` and the new setting silently never reached llama-server. 2. The frontend chat-settings Apply path POSTs `/unload` then `/load` without round-tripping `llama_extra_args`. Every reload after `unsloth run --some-flag X` quietly dropped `--some-flag X` from the spawned `llama-server` command line. 3. `LlamaCppBackend.load_model` released `_lock` between Phase 1 (kill) and Phase 3 (spawn) so two concurrent loads each passed Phase 1 with `self._process is None`. Both ran Phase 2 (download), both reached Phase 3, and the Phase 3 defensive `_kill_process()` from unslothai#5171 collapsed them to one survivor only after both `subprocess.Popen` calls had landed. For the 86 GB MoE in unslothai#5161 / the model in unslothai#5401 the overlap window was tens of seconds, long enough to OOM the host. With a 0.6B model the pgrep timeline showed two simultaneous PIDs for 3.3 s on `main`. Fix: `studio/backend/core/inference/llama_cpp.py` * Add `self._serial_load_lock = threading.Lock()`. The whole body of `load_model` runs under this lock so two concurrent `/api/inference/load` requests are strictly sequential. The fine-grained `_lock` and the Phase 3 defensive `_kill_process()` from unslothai#5171 are kept as a second layer. `/unload`, `/status`, and `/load-progress` are unaffected because they only touch the fine-grained lock or read properties. * Add `self._extra_args` plus an `extra_args` property, written inside `load_model` whenever the caller supplies a non-`None` value. `unload_model()` deliberately does not reset it so the route layer can inherit the args across the frontend's `/unload` + `/load` gap. `studio/backend/routes/inference.py` * Add `_request_matches_loaded_settings(request, llama_backend)` that compares `max_seq_length`, `cache_type_kv`, `speculative_type`, `chat_template_override`, and `llama_extra_args` between the incoming request and the live backend. Same-(model, variant) requests whose runtime settings differ now fall through to a real reload instead of returning `already_loaded`. A missing `llama_extra_args` field on the request is treated as "inherit current", so the short-circuit still fires when the only difference is the frontend not echoing the CLI flags back. * GGUF load branch inherits `llama_extra_args` from `llama_backend.extra_args` when the request omits the field, re-validates through `validate_extra_args`, and forwards the result to `load_model(...)`. An explicit `[]` from the caller is still honoured as "clear". Verified end to end against a live `unsloth studio run` instance: | Scenario | Before | After | | --------------------------------------------------------------- | --------- | ------------------------------------------------------------------------ | | `/load` same (model, variant, settings) | 1 PID, `already_loaded` | unchanged | | `/load` same model, variant, new `cache_type_kv=q8_0` ctx=8192 | `already_loaded`, settings dropped | `loaded`, `/status` reports the new settings, new server has `-c 8192 --cache-type-k q8_0 --top-k 20 --seed 42` | | Frontend Apply `/unload` + `/load`, new settings, no `llama_extra_args` field | Drops `--top-k 20 --seed 42` | Preserves `--top-k 20 --seed 42` | | `/unload` + two parallel `/load` | Two PIDs for 3.3 s | Max simultaneous count = 1 across the full pgrep timeline | | `/load` with `llama_extra_args=[]` (explicit clear) | n/a | `loaded`, new server has no `--top-k` / `--seed` | | `/load` with `llama_extra_args=["--top-k","30","--seed","7"]` (override) | n/a | `loaded`, new server has the supplied flags | `pytest studio/backend/tests` is green except for one pre-existing terminal-width-sensitive assertion (`test_studio_api.py::test_help_output`) and the pre-existing `test_studio_api.py` fixture errors that fail on unmodified main too. No new regressions. * Studio: track requested n_ctx so Auto-slider flips trigger a reload Review feedback on PR unslothai#5427 from gemini-code-assist. The original short-circuit compared ``request.max_seq_length`` against ``llama_backend.context_length`` (the effective context). VRAM-fit logic can cap the running server below what the caller asked for, so this comparison incorrectly returns ``already_loaded`` when the user flips the slider from an explicit length (e.g. 8192) back to "Auto" (0): the explicit request was capped to, say, 4096, and the new "Auto" request reads ``backend.context_length == 4096`` and decides nothing changed. Track the originally requested ``n_ctx`` on the backend instead and compare against that. ``requested_n_ctx == 0`` means the last load asked for the model's native length; ``request.max_seq_length == 0`` matches it. Verified in the sandbox suite (now 90 tests): - ``test_explicit_to_auto_triggers_reload`` -- loaded with explicit 8192, then Apply with ``max_seq_length=0`` falls through to a real reload and the new server runs at the native 40960. - ``test_auto_to_explicit_triggers_reload`` -- inverse direction. - ``test_explicit_to_same_explicit_short_circuits`` -- re-Apply with the same explicit value still short-circuits (no needless reload). - Existing scenarios (kv change, spec change, template change, extra args inherit, parallel-load stress, frontend Apply flow) unchanged. ``pytest studio/backend/tests`` still green on the same set of tests; the pre-existing ``test_help_output`` failure and ``test_studio_api`` fixture errors are unaffected. * Studio: tighten comments in the 5401 fix Trim the verbose explanatory comments and docstrings introduced in f9cbec3 and dd0b1d5 down to one-line summaries. The "why" still points at issue unslothai#5401; the multi-paragraph rationale belonged in the PR body, not the source. No behaviour change. * ci: retrigger after zoo drift + IPython fixes landed in main * ci: retrigger Mac Studio UI CI after transient fetch flake * Studio: address six P2 followups on the 5401 reload PR Tightens the inheritance and serial-load paths to close the six P2 findings raised by codex-connector on PR unslothai#5427 against `f9cbec3b` / `dd0b1d58`. 1. Re-check loaded state before killing queued loads. Two duplicate `/api/inference/load` requests both pass the route-level `is_loaded` gate before the first publishes `_healthy = True`. The second waits on `_serial_load_lock`, enters Phase 1, and tears down the just-spawned llama-server for a redundant full reload. Added `LlamaCppBackend._already_in_target_state(...)` and a short-circuit at the top of the serial-lock block: if the live server already satisfies the kwargs, return True without killing. 2. Don't inherit CLI overrides that shadow new first-class settings. `unsloth run -c 4096` is a permitted pass-through; the validator docs explicitly call out `-c`/`--ctx-size`. Stored in `_extra_args` and appended after Studio's own flags, the inherited `-c 4096` silently won the last-wins parse against a new `max_seq_length=8192`. Added `strip_shadowing_flags` in `llama_server_args.py` (covers `-c`, `--cache-type-k/v`, `--spec-*`, `--chat-template*`, `--jinja`/`--no-jinja`) and the route runs the inherited list through it before validate + forward. 3. Restrict inherited llama args to the same GGUF model. `_extra_args` is deliberately preserved across `unload_model()` for the chat- settings Apply flow (`/unload` + `/load` with no `llama_extra_args` field). Now also track `_extra_args_source = (model_identifier, hf_variant)` so the route can refuse cross-model inheritance. `LlamaCppBackend.extra_args_source` exposes the tuple. 4. Persist extras only after a successful load. `_extra_args` was written at the top of `load_model` before Popen + health check, so a failed startup left bad args in place to poison the next UI retry. The write (along with `_requested_n_ctx`) is now deferred until after `_healthy = True`. 5. Ignore speculative diffs for vision loads. `load_model` silently gates speculative decoding on `not is_vision`, so the backend's `_speculative_type` stays `None` for vision models. The route's comparator now normalises the request's value to `"off"` when `llama_backend.is_vision` to avoid a no-op reload of a vision server every time the dropdown defaults to `default`. The `_already_in_target_state` helper applies the same rule. 6. Wait for the replacement server before short-circuiting. `_kill_process` did not clear `_healthy`; the new first-class settings (`_cache_type_kv`, `_speculative_type`, `_chat_template_override`) are written under `_lock` BEFORE Popen + `_wait_for_health`. A duplicate `/load` arriving during the new server's warm-up window could short-circuit against the not-yet-healthy replacement and the caller would start inference against a server that was still loading. `_kill_process` now sets `_healthy = False` in its `finally` block so `is_loaded` returns False from the moment the old server is killed until the new one finishes warm-up. Tests: - Sandbox suite under `./temp/sim_5401/` extended to 136 tests (was 90): new unit coverage for `strip_shadowing_flags` (12 cases), `_kill_process` clears `_healthy`, `extra_args_source` lifecycle and cross-model behaviour, failed-load preserving prior extras, and the duplicate-load short-circuit at `load_model` level. New live integration cases verify shadow-strip via `pgrep` on the live llama-server cmdline, cross-model refusal, and PID stability across a duplicate-load race. All 136 pass. - `pytest studio/backend/tests --deselect test_studio_api.py`: 1079 passed, 46 skipped, identical to the pre-change count. The pre-existing `test_studio_api.py` fixture errors and the terminal-width-sensitive `test_help_output` are unaffected. - Ruff: clean on the three modified files. * Studio: tighten GGUF reload inheritance and duplicate-load guard Re-narrow llama_extra_args to None after validate_extra_args when the incoming request omitted the field, so the backend can distinguish "caller omitted, inherit prior load" from "caller explicitly cleared to []". Without this a queued duplicate /load reaches the backend as [] and fails _already_in_target_state's exact-equality check, killing the just-started llama-server. The pass-through validate call from the original "forward llama-server args from unsloth studio run / unsloth run" change is preserved as-is; only the post-pass narrowing is new. Cross-source loads now explicitly clear extras so a model switch can't accidentally inherit via the backend's "no opinion" semantics. Store the caller's hf_variant kwarg (None for local GGUF files) in _extra_args_source instead of the derived self._hf_variant (an extracted filename quant label like "Q4_K_M"). Same-source check in the route is now symmetric for HF and direct-file loads. Add gguf_path to _already_in_target_state and prefer on-disk path identity when both backend and caller have a path. This stops the duplicate-load guard from killing a healthy server on repeat local loads (where hf_variant is None on the caller side but extracted on the backend side). Split shadow-flag stripping into per-group toggles (context / cache / spec / template). The route now opts into stripping only the groups whose first-class field was actually set on the incoming request, so an inherited --chat-template-file survives an Apply that omits chat_template_override. _request_matches_loaded_settings detects shadowing extras on the inherit path and falls through to a real reload so the strip can run. Mark --spec-default, --jinja, --no-jinja as boolean inside the shadow stripper so the value-consuming heuristic no longer eats the following positional token. * Studio: trim comments around GGUF reload inheritance * Studio: cover GGUF reload inheritance and shadow-flag stripping * Studio: drop redundant issue refs from inheritance comments * Studio: drop redundant issue refs from inheritance comments * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Studio: key inheritance source off resolved gguf_variant codex-connector P2 on PR unslothai#5427 cd14cae: the inheritance gate at ``routes/inference.py:696`` compared the stored ``source[1]`` against ``request.gguf_variant``, but the HF branch loaded with ``hf_variant = config.gguf_variant`` (the *resolved* variant after ModelConfig auto-pick). When the caller omitted ``gguf_variant`` on a follow-up Apply, ``source[1] == "Q4_K_M"`` but ``(request.gguf_variant or "") == ""``, ``same_source`` returned False, and the chat-settings Apply silently dropped CLI pass-through flags for every auto-pick / local-file load. Fix both sides of the comparison to key off ``config.gguf_variant``: * The route compares ``source[1]`` to ``config.gguf_variant`` (the resolved label) rather than the request field. * The local-mode load_model call now passes ``hf_variant = config.gguf_variant`` so ``_extra_args_source`` stores the same string the route reads back. The HF branch already did this. Sandbox: added test_source_records_caller_variant_not_extracted_label to lock the storage key contract. ``pytest studio/backend/tests --deselect test_studio_api.py``: 1100 passed, identical to pre-change. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Studio: deny upstream --ui family on llama-server pass-through The validator's web-UI block named only ``--webui`` / ``--no-webui``, which is llama.cpp's pre-rename spelling. Current upstream (``tools/server/README.md``) uses ``--ui`` / ``--no-ui`` plus ``--ui-config``, ``--ui-config-file``, and ``--ui-mcp-proxy`` / ``--no-ui-mcp-proxy``. Without these in the denylist a user could ``unsloth run --ui`` and enable llama-server's built-in web UI on the port Studio's reverse proxy targets, breaking the UI surface. Keep the legacy ``--webui`` group so the validator still rejects old binaries that haven't been re-spelled. Cross-referenced against the README's full flag list; this was the only gap for the post-unslothai#5401 inheritance / shadow-strip work. Pass- through flags from every other README category (sampling, jinja, ctx, cache, threads, GPU, reasoning, grammar, chat-template-kwargs) already validate cleanly; sandbox suite exercises ~60 of them in the new ``test_08_llama_server_pass_through.py``. ``pytest studio/backend/tests --deselect test_studio_api.py``: 1100 passed, identical to pre-change. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

danielhanchen requested review from Manan17 and rolandtannous as code owners April 24, 2026 14:07

gemini-code-assist Bot reviewed Apr 24, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Apr 24, 2026

View reviewed changes

danielhanchen added 2 commits April 24, 2026 14:20

chatgpt-codex-connector Bot reviewed Apr 24, 2026

View reviewed changes

gemini-code-assist Bot reviewed Apr 24, 2026

View reviewed changes

danielhanchen merged commit c2dc2eb into main Apr 24, 2026
6 checks passed

danielhanchen deleted the dh/fix-5161-llama-server-race branch April 24, 2026 16:05

Imagineer99 mentioned this pull request Apr 30, 2026

Apply button spawns new llama-server without killing previous one, causing OOM #5161

Closed

This was referenced May 14, 2026

Studio: serialise GGUF reload and inherit unsloth-run extra args #5427

Merged

Studio: serialise GGUF reload and inherit unsloth-run extra args shimmyshimmer/unsloth-staging-4#44

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Studio: kill in-flight llama-server before spawning a new one#5171

Studio: kill in-flight llama-server before spawning a new one#5171
danielhanchen merged 3 commits into
mainfrom
dh/fix-5161-llama-server-race

danielhanchen commented Apr 24, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Apr 24, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Apr 24, 2026

Uh oh!

danielhanchen commented Apr 24, 2026

Uh oh!

danielhanchen commented Apr 24, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Apr 24, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		self._cancel_event.set()
		self._cancel_event = threading.Event()

Uh oh!

Conversation

danielhanchen commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Speculative decoding cleanup

Verified

Caveats

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

danielhanchen commented Apr 24, 2026

Uh oh!

danielhanchen commented Apr 24, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

danielhanchen commented Apr 24, 2026 •

edited

Loading