Apply button spawns new llama-server without killing previous one, causing OOM
Summary
Clicking "Apply" in the model configuration panel spawns a new llama-server process without terminating the existing one. Each Apply click loads another ~86 GB copy of the model into RAM, so on a system with 128 GB RAM the second click is enough to trigger out-of-memory kills.
Additionally, the UI's Context Length setting does not propagate to the spawned llama-server — it launches with -c 4096 regardless of what the UI shows.
Environment
- OS: Ubuntu 24
- Hardware: RTX 5090 (32 GB VRAM) + 128 GB DDR5 RAM
- CUDA: 13.1, Driver 590.48.01
- Unsloth Studio version: [version number here]
- Install method: official installer script
- Model: Qwen3-Coder-Next (80B-A3B MoE) at Q8_0, ~86 GB
Steps to reproduce
- Install Unsloth Studio via the official installer
- Launch with: unsloth studio -H 127.0.0.1 -p <port>
- Open the web UI and navigate to Model Configuration
- Load a large model (tested with Qwen3-Coder-Next Q8_0, ~86 GB)
- Set Context Length to 131072 via the slider/input
- Set KV Cache Dtype to q8_0
- Set Speculative Decoding to Off
- Click Apply
- Check the process list. The first llama-server is running with -c 4096 regardless of the UI setting, and with --spec-type ngram-mod flags despite Speculative Decoding being Off:
  llama-server -m <model-path> --port <port> -c 4096 --parallel 1 --flash-attn on --no-context-shift --fit on --jinja --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64
- Change Context Length to 262144 and click Apply again
- Observe a second llama-server process running alongside the first, on a different port, also with -c 4096
- RAM usage climbs past 120 GB, the system begins swapping, and an OOM kill follows
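The process-list check in the steps above can be scripted. This is a minimal sketch, assuming a Linux system with procps (`pgrep`, `ps`) and `awk` available; `ctx_of` and `list_llama_servers` are hypothetical helper names, not part of Unsloth Studio or llama.cpp.

```shell
# Print the value that follows -c in a command line (prints nothing if absent).
ctx_of() {
  printf '%s\n' "$1" | awk '{
    for (i = 1; i < NF; i++)
      if ($i == "-c") { print $(i + 1); exit }
  }'
}

# List every running llama-server with its PID, resident memory (kB) and -c value.
list_llama_servers() {
  pgrep -a llama-server | while read -r pid cmd; do
    rss_kb=$(ps -o rss= -p "$pid" | tr -d ' ')
    printf 'pid=%s rss_kb=%s ctx=%s\n' "$pid" "${rss_kb:-?}" "$(ctx_of "$cmd")"
  done
}
```

Running `list_llama_servers` after each Apply click should print one line if the previous process was stopped. In the scenario reported here it would be expected to print two lines, both with ctx=4096.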
Expected behavior
- Clicking Apply should terminate the existing llama-server process before starting a new one
- The new llama-server should launch with flags matching the UI configuration — specifically -c should reflect the Context Length setting
- Speculative decoding flags should be absent when the UI toggle is set to Off
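The stop-before-start behavior described in the first point could look roughly like the sketch below. It is an illustration only and makes no claim about how Unsloth Studio manages its backend: `SERVER_PID`, `stop_server`, and `apply_config` are hypothetical names, and `wait` works here only because the backend is started as a child of the same shell.

```shell
# Hypothetical state: PID of the currently running backend, empty if none.
SERVER_PID=""

# Stop the tracked backend, if any, and block until it has exited.
stop_server() {
  [ -n "$SERVER_PID" ] || return 0
  if kill -0 "$SERVER_PID" 2>/dev/null; then
    kill "$SERVER_PID"                      # polite SIGTERM
    wait "$SERVER_PID" 2>/dev/null || true  # reap it; memory is released after exit
  fi
  SERVER_PID=""
}

# Replace the backend: stop the old process first, then start the new command.
apply_config() {
  stop_server
  "$@" &
  SERVER_PID=$!
}
```

With this ordering at most one backend exists at any time, so a second Apply never holds two copies of the model in memory. A call would look like `apply_config llama-server -m <model-path> --port <port> -c 131072`.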
Actual behavior
- Apply spawns a new llama-server process without stopping the previous one
- New process launches with -c 4096 regardless of UI setting
- --spec-type ngram-mod flags remain in the command even when Speculative Decoding is toggled Off in the UI
- Multiple processes attempt to hold the full model in RAM, exceeding available memory
Diagnostic output
Memory state after one Apply click followed by a config change and a second Apply:
               total        used        free      shared  buff/cache   available
Mem:           125Gi       121Gi       593Mi        46Mi       4.7Gi       4.0Gi
Swap:           14Gi        13Gi       1.0Gi
Process still running after OOM kills earlier ones:
llama-server -m <model-path> --port <port> -c 4096 --parallel 1 --flash-attn on --no-context-shift --fit on --jinja --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64
Only the newest process shows because earlier ones were killed by the kernel OOM killer. Between Apply clicks, multiple processes coexist until the OOM kill fires.
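To confirm which processes the OOM killer removed, the kernel log can be filtered. This is a sketch that assumes the log line contains the text "Killed process <pid> (<name>)", as recent Linux kernels print it; the exact wording may differ on other kernel versions. `oom_victims` is a hypothetical helper name.

```shell
# Read kernel log lines on stdin; print "PID NAME" for each OOM-killed process.
oom_victims() {
  sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}
```

Typical usage would be `journalctl -k | oom_victims` or `dmesg | oom_victims` (the latter may need root).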
Workaround
Manually kill processes between Apply clicks:
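One way to do this is sketched below. It assumes procps (`pkill`, `pgrep`) is installed and that the backend's process name is exactly llama-server; `stop_llama_servers` is a hypothetical helper name.

```shell
# Stop all llama-server processes: SIGTERM first, SIGKILL only if
# something is still alive after about 10 seconds.
stop_llama_servers() {
  # A non-zero status here normally means no process matched.
  pkill -TERM -x llama-server || return 0
  i=0
  while [ "$i" -lt 10 ] && pgrep -x llama-server >/dev/null; do
    sleep 1
    i=$((i + 1))
  done
  # Force-kill any survivors; ignore "nothing matched".
  pkill -KILL -x llama-server || true
}
```

Run `stop_llama_servers` before each Apply click so the previous model load is released first.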
Or bypass Unsloth Studio and run llama-server directly with the desired flags:
llama-server \
-m /path/to/model.gguf \
--host 127.0.0.1 --port <port> \
-c 131072 -ngl 99 \
--override-tensor ".ffn_.*_exps.=CPU" \
--flash-attn on \
--cache-type-k q8_0 --cache-type-v q8_0 \
--jinja
Likely related
UI settings appear to not propagate to the spawned backend process. Beyond context length, the Speculative Decoding toggle shows Off in the UI but --spec-type ngram-mod flags remain in the command line. These may be symptoms of the same config-sync issue between the UI and the llama-server wrapper.
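For illustration, the mapping from UI settings to backend flags that this section expects might be sketched as follows. `build_server_args` is a hypothetical helper, not Unsloth Studio code; it uses only flags that already appear in this report, and it assumes speculative decoding is controlled solely by whether the --spec-type flags are present.

```shell
# Hypothetical helper: print llama-server flags for the given UI settings.
#   $1 = context length, $2 = KV cache dtype, $3 = speculative decoding (on/off)
build_server_args() {
  ctx=$1
  kv=$2
  spec=$3
  args="-c $ctx --cache-type-k $kv --cache-type-v $kv"
  # Only emit speculative decoding flags when the toggle is on.
  if [ "$spec" = "on" ]; then
    args="$args --spec-type ngram-mod"
  fi
  printf '%s\n' "$args"
}
```

With the settings from the reproduction steps, `build_server_args 131072 q8_0 off` yields a -c 131072 command line with no --spec-type flags, which is what the launched process should reflect.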
Notes
Thanks for the project — Unsloth Studio is genuinely impressive. Filing this as a critical bug because it makes the UI unusable for large MoE models on memory-constrained systems.