Apply button spawns new llama-server without killing previous one, causing OOM #5161

## Summary

Clicking "Apply" in the model configuration panel spawns a new llama-server process without terminating the existing one. Each Apply click adds another ~86 GB model load to RAM, so a system with 128 GB RAM hits out-of-memory crashes within two clicks.

Additionally, the UI's Context Length setting does not propagate to the spawned llama-server — it launches with -c 4096 regardless of what the UI shows.

## Environment

- OS: Ubuntu 24
- Hardware: RTX 5090 (32 GB VRAM) + 128 GB DDR5 RAM
- CUDA: 13.1, Driver 590.48.01
- Unsloth Studio version: [version number here]
- Install method: official installer script
- Model: Qwen3-Coder-Next (80B-A3B MoE) at Q8_0, ~86 GB

## Steps to reproduce

1. Install Unsloth Studio via the official installer
2. Launch with: unsloth studio -H 127.0.0.1 -p <port>
3. Open web UI, navigate to Model Configuration
4. Load a large model (tested with Qwen3-Coder-Next Q8_0, ~86 GB)
5. Set Context Length to 131072 via the slider/input
6. Set KV Cache Dtype to q8_0
7. Set Speculative Decoding to Off
8. Click Apply
9. Check process list — observe the first llama-server running with -c 4096 regardless of the UI setting, and with --spec-type ngram-mod flags despite Speculative Decoding being Off:

    llama-server -m <model-path> --port <port> -c 4096 --parallel 1 --flash-attn on --no-context-shift --fit on --jinja --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64

10. Change Context Length to 262144, click Apply again
11. Observe a second llama-server process running in addition to the first, on a different port, also with -c 4096
12. RAM usage climbs past 120 GB, system begins swapping, OOM kill follows
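
Steps 9 and 11 can be checked from a shell. The following is a minimal sketch, assuming the backend's process name is exactly llama-server and that the procps tools pgrep and ps are installed. The helper names ctx_of and list_llama_servers are made up for illustration.

```shell
# Hypothetical helper: print the value that follows -c in a command line.
# Splits on whitespace, so a model path containing spaces could confuse it.
ctx_of() {
  printf '%s\n' "$1" |
    awk '{ for (i = 1; i < NF; i++) if ($i == "-c") { print $(i + 1); exit } }'
}

# List every running llama-server with its PID and effective context size.
# Assumes the process name is exactly "llama-server".
list_llama_servers() {
  local pid cmd
  for pid in $(pgrep -x llama-server); do
    cmd=$(ps -o args= -p "$pid")
    echo "pid=$pid ctx=$(ctx_of "$cmd")"
  done
}
```

Running list_llama_servers after each Apply click should show one line per live process, which makes a second process and a stale -c value easy to spot.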

## Expected behavior

- Clicking Apply should terminate the existing llama-server process before starting a new one
- The new llama-server should launch with flags matching the UI configuration — specifically -c should reflect the Context Length setting
- Speculative decoding flags should be absent when the UI toggle is set to Off
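
The expected mapping from UI settings to flags can be sketched as a small shell function. This only illustrates the expected behavior and is not Unsloth Studio's actual launcher code. The function name build_flags and its three positional parameters are hypothetical; the flags themselves are the ones that appear elsewhere in this report.

```shell
# Hypothetical helper: turn UI settings into llama-server flags.
# $1 = context length, $2 = KV cache dtype, $3 = speculative decoding ("on" or "off")
build_flags() {
  local ctx=$1 kv=$2 spec=$3
  local flags="-c $ctx --cache-type-k $kv --cache-type-v $kv"
  # Only add the speculative decoding flags when the toggle is on.
  if [ "$spec" = "on" ]; then
    flags="$flags --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64"
  fi
  printf '%s\n' "$flags"
}
```

With the settings from the reproduction steps, build_flags 131072 q8_0 off prints -c 131072 --cache-type-k q8_0 --cache-type-v q8_0, with no --spec-type flags.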

## Actual behavior

- Apply spawns a new llama-server process without stopping the previous one
- New process launches with -c 4096 regardless of UI setting
- --spec-type ngram-mod flags remain in the command even when Speculative Decoding is toggled Off in the UI
- Multiple processes attempt to hold the full model in RAM, exceeding available memory

## Diagnostic output

Memory state after one Apply click followed by a config change and a second Apply:

                   total        used        free      shared  buff/cache   available
    Mem:           125Gi       121Gi       593Mi        46Mi       4.7Gi       4.0Gi
    Swap:           14Gi        13Gi       1.0Gi

Process still running after the OOM killer terminated the earlier ones:

    llama-server -m <model-path> --port <port> -c 4096 --parallel 1 --flash-attn on --no-context-shift --fit on --jinja --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64

Only the newest process shows because earlier ones were killed by the kernel OOM killer. Between Apply clicks, multiple processes coexist until the OOM kill fires.
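
To confirm that several model copies are resident at the same time, the memory of all llama-server processes can be summed before the OOM killer fires. A minimal sketch, assuming procps ps and a process name of exactly llama-server; sum_rss_gib and llama_rss_gib are made-up helper names.

```shell
# Hypothetical helper: sum RSS values (KiB, one per line on stdin) and print GiB.
sum_rss_gib() {
  awk '{ total += $1 } END { printf "%.1f\n", total / 1048576 }'
}

# Resident memory of all llama-server processes; -C selects by command name.
llama_rss_gib() {
  ps -C llama-server -o rss= | sum_rss_gib
}
```

Calling llama_rss_gib between Apply clicks shows whether the total grows by roughly one model size per click.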

## Workaround

Manually kill processes between Apply clicks:

    pkill -f llama-server
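
Note that pkill -f matches the pattern against the full command line, so it also hits unrelated processes that merely mention llama-server in their arguments (for example a log viewer). A slightly more careful variant is sketched below, assuming the backend's process name is exactly llama-server. The function name stop_servers and its parameters are hypothetical.

```shell
# Hypothetical helper: stop all processes with an exact name, gracefully first.
# $1 = process name (default llama-server)
# $2 = seconds to wait before sending SIGKILL (default 10)
stop_servers() {
  local name=${1:-llama-server} timeout=${2:-10}
  # Ask politely first; if nothing could be signalled there is nothing left to do.
  pkill -TERM -x "$name" || return 0
  # Poll once per second until the processes are gone or the timeout expires.
  while [ "$timeout" -gt 0 ] && pgrep -x "$name" >/dev/null; do
    sleep 1
    timeout=$((timeout - 1))
  done
  # Force-kill anything that ignored SIGTERM.
  pkill -KILL -x "$name" || true
}
```

Running stop_servers before each Apply click gives the old process a chance to release its memory before the next model load starts.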

Or bypass Unsloth Studio and run llama-server directly with the desired flags:

    llama-server \
      -m /path/to/model.gguf \
      --host 127.0.0.1 --port <port> \
      -c 131072 -ngl 99 \
      --override-tensor ".ffn_.*_exps.=CPU" \
      --flash-attn on \
      --cache-type-k q8_0 --cache-type-v q8_0 \
      --jinja

## Likely related

UI settings appear not to propagate to the spawned backend process. Beyond context length, the Speculative Decoding toggle shows Off in the UI but the --spec-type ngram-mod flags remain in the command line. These may be symptoms of the same config-sync issue between the UI and the llama-server wrapper.

## Notes

Thanks for the project — Unsloth Studio is genuinely impressive. Filing this as a critical bug because it makes the UI unusable for large MoE models on memory-constrained systems.
