
Apply button spawns new llama-server without killing previous one, causing OOM #5161

@borris345


Summary

Clicking "Apply" in the model configuration panel spawns a new llama-server process without terminating the existing one. Each Apply click loads another ~86 GB copy of the model into RAM, so the system runs out of memory within two clicks on a machine with 128 GB of RAM.

Additionally, the UI's Context Length setting does not propagate to the spawned llama-server — it launches with -c 4096 regardless of what the UI shows.

Environment

  • OS: Ubuntu 24
  • Hardware: RTX 5090 (32 GB VRAM) + 128 GB DDR5 RAM
  • CUDA: 13.1, Driver 590.48.01
  • Unsloth Studio version: [version number here]
  • Install method: official installer script
  • Model: Qwen3-Coder-Next (80B-A3B MoE) at Q8_0, ~86 GB

Steps to reproduce

  1. Install Unsloth Studio via the official installer

  2. Launch with: unsloth studio -H 127.0.0.1 -p <port>

  3. Open web UI, navigate to Model Configuration

  4. Load a large model (tested with Qwen3-Coder-Next Q8_0, ~86 GB)

  5. Set Context Length to 131072 via the slider/input

  6. Set KV Cache Dtype to q8_0

  7. Set Speculative Decoding to Off

  8. Click Apply

  9. Check process list — observe the first llama-server running with -c 4096 regardless of the UI setting, and with --spec-type ngram-mod flags despite Speculative Decoding being Off:

    llama-server -m <model-path> --port <port> -c 4096 --parallel 1 --flash-attn on --no-context-shift --fit on --jinja --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64

  10. Change Context Length to 262144, click Apply again

  11. Observe a second llama-server process running in addition to the first, on a different port, also with -c 4096

  12. RAM usage climbs past 120 GB, system begins swapping, OOM kill follows
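Steps 9 and 11 can be checked from a terminal. The following is a minimal sketch, assuming procps `pgrep` is installed and that the backend's process name is exactly `llama-server`. The helper names `ctx_from_cmdline` and `list_llama_servers` are made up for illustration and are not part of Unsloth Studio or llama.cpp.

```shell
# Print the value that follows -c in a llama-server command line.
# Prints nothing if there is no -c flag. The body runs in a subshell
# so that "set -f" (no globbing) does not leak into the caller.
ctx_from_cmdline() (
  set -f
  # Intentionally unquoted: split the command line into words.
  set -- $1
  while [ "$#" -gt 1 ]; do
    if [ "$1" = "-c" ]; then
      printf '%s\n' "$2"
      return 0
    fi
    shift
  done
  return 0
)

# List every running llama-server with its PID and context size.
# Assumption: the process name is exactly "llama-server".
list_llama_servers() {
  pgrep -a -x llama-server | while read -r pid cmdline; do
    printf 'pid=%s ctx=%s\n' "$pid" "$(ctx_from_cmdline "$cmdline")"
  done
}
```

Running `list_llama_servers` after each Apply click should show whether more than one process is alive and which `-c` value each one received.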

Expected behavior

  • Clicking Apply should terminate the existing llama-server process before starting a new one
  • The new llama-server should launch with flags matching the UI configuration — specifically -c should reflect the Context Length setting
  • Speculative decoding flags should be absent when the UI toggle is set to Off
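The expected mapping from UI settings to command-line flags can be sketched as below. This is only an illustration of the behavior requested here, not how Unsloth Studio builds its command. `build_llama_args` is a hypothetical helper, and every flag it emits is copied from the command lines quoted elsewhere in this report.

```shell
# Hypothetical helper: turn UI-style settings into llama-server flags.
#   $1 = context length
#   $2 = KV cache dtype
#   $3 = speculative decoding, "on" or "off"
build_llama_args() {
  args="-c $1 --cache-type-k $2 --cache-type-v $2"
  if [ "$3" != "off" ]; then
    # Spec-decoding flags as seen in the reported process list;
    # they are emitted only when the toggle is not "off".
    args="$args --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64"
  fi
  printf '%s\n' "$args"
}
```

With the settings from the reproduction steps, `build_llama_args 131072 q8_0 off` prints `-c 131072 --cache-type-k q8_0 --cache-type-v q8_0`, with no `--spec-type` flags.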

Actual behavior

  • Apply spawns a new llama-server process without stopping the previous one
  • New process launches with -c 4096 regardless of UI setting
  • --spec-type ngram-mod flags remain in the command even when Speculative Decoding is toggled Off in the UI
  • Multiple processes attempt to hold the full model in RAM, exceeding available memory
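The memory arithmetic behind the last point is easy to check. The sketch below takes the ~86 GB per-process figure from this report at face value and ignores any page sharing between processes; the function names are illustrative only.

```shell
# How many whole copies of a model of $2 GB fit in $1 GB of RAM.
copies_that_fit() {
  echo $(( $1 / $2 ))
}

# RAM in GB needed to hold $2 copies of a model of $1 GB.
stacked_gb() {
  echo $(( $1 * $2 ))
}
```

`copies_that_fit 128 86` prints `1`, and `stacked_gb 86 2` prints `172`, which already exceeds the 128 GB installed, so a second concurrent process is enough to push the system into swap.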

Diagnostic output

Memory state after one Apply click followed by a config change and a second Apply:

               total        used        free      shared  buff/cache   available
Mem:           125Gi       121Gi       593Mi        46Mi       4.7Gi       4.0Gi
Swap:           14Gi        13Gi       1.0Gi

Process still running after OOM kills earlier ones:

llama-server -m <model-path> --port <port> -c 4096 --parallel 1 --flash-attn on --no-context-shift --fit on --jinja --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64

Only the newest process shows because earlier ones were killed by the kernel OOM killer. Between Apply clicks, multiple processes coexist until the OOM kill fires.

Workaround

Manually kill processes between Apply clicks:

pkill -f llama-server
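A slightly more careful variant waits for the old process to exit before Apply is clicked again, so its memory has been released first. This is a sketch under the assumption that procps `pkill` and `pgrep` are available; `stop_and_wait` is a hypothetical helper name.

```shell
# Send SIGTERM to every process named $1, wait up to $2 seconds
# (default 30) for them to exit, then fall back to SIGKILL.
stop_and_wait() {
  name=$1
  timeout=${2:-30}
  # pkill exits non-zero when nothing matched (or on error): nothing to wait for.
  pkill -x "$name" || return 0
  while pgrep -x "$name" >/dev/null; do
    if [ "$timeout" -le 0 ]; then
      pkill -KILL -x "$name"
      break
    fi
    sleep 1
    timeout=$((timeout - 1))
  done
  return 0
}
```

Call it as `stop_and_wait llama-server 30`. It uses `-x` (exact process-name match) rather than `-f`, so it does not also match a shell or editor whose command line happens to contain the string `llama-server`.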

Or bypass Unsloth Studio and run llama-server directly with the desired flags:

llama-server \
  -m /path/to/model.gguf \
  --host 127.0.0.1 --port <port> \
  -c 131072 -ngl 99 \
  --override-tensor ".ffn_.*_exps.=CPU" \
  --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --jinja

Likely related

UI settings do not appear to propagate to the spawned backend process. Beyond context length, the Speculative Decoding toggle shows Off in the UI, yet the --spec-type ngram-mod flags remain on the command line. Both may be symptoms of the same config-sync issue between the UI and the llama-server wrapper.

Notes

Thanks for the project — Unsloth Studio is genuinely impressive. Filing this as a critical bug because it makes the UI unusable for large MoE models on memory-constrained systems.
