Eval bug: [llama-server + MTP + CUDA] - GPU VRAM not freed with --sleep-idle-seconds leading to crash #23395

@ali0une

Name and Version

./bin/llama-cli --version
version: 9246 (871b0b7)
built with GNU 12.2.0 for Linux x86_64

Operating systems

Linux

GGML backends

CUDA

Hardware

12th Gen Intel(R) Core(TM) i5-12600K
NVIDIA GeForce RTX 3090

Models

Qwen3.6-27B-MTP-GGUF Q4_K_M

Problem description & steps to reproduce

If I'm not mistaken, this bug is critical: after several sleep/wake turns, llama-server ends up crashing.
This does not happen with non-MTP models, where GPU VRAM is freed to near 0 by the --sleep-idle-seconds parameter.
I tried setting --fit-target 2048, but it doesn't help.
Please let me know if I'm doing something wrong.

compile flags
cmake -B . --fresh -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86" -DGGML_CUDA_FA_ALL_QUANTS=ON

-fit off launch command
./bin/llama-server -m /whatever/Qwen3.6-27B-MTP-Q4_K_M.gguf --parallel 1 --n-gpu-layers all --host 127.0.0.1 --port 5000 --flash-attn on --spec-type draft-mtp --spec-draft-n-max 2 --spec-draft-type-k q8_0 --spec-draft-type-v q8_0 --cache-type-k q8_0 --cache-type-v q8_0 --kv-unified --sleep-idle-seconds 10 -fit off --ctx-size 131072 --verbose --log-verbosity 4

nvidia-smi GPU Memory Usage at startup
./bin/llama-server 22034MiB

nvidia-smi GPU Memory Usage when idle
./bin/llama-server 1094MiB <-- this is 302MiB with non MTP Qwen3.6-27B-Q4_K_M.gguf

-fit on launch command
./bin/llama-server -m /whatever/Qwen3.6-27B-MTP-Q4_K_M.gguf --parallel 1 --n-gpu-layers all --host 127.0.0.1 --port 5000 --flash-attn on --spec-type draft-mtp --spec-draft-n-max 2 --spec-draft-type-k q8_0 --spec-draft-type-v q8_0 --cache-type-k q8_0 --cache-type-v q8_0 --kv-unified --sleep-idle-seconds 10 -fit on --fit-ctx 131072 --verbose --log-verbosity 4

nvidia-smi GPU Memory Usage at startup
./bin/llama-server 24048MiB

nvidia-smi GPU Memory Usage when idle
./bin/llama-server 1698MiB <-- this is 302MiB with non MTP Qwen3.6-27B-Q4_K_M.gguf
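
For anyone trying to reproduce the idle readings above, here is a minimal Python sketch that parses per-process GPU memory from nvidia-smi and compares it with the non-MTP baseline. It assumes the output of `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader,nounits`; the helper names and the PID in the usage note are hypothetical and not part of llama.cpp.

```python
import subprocess

# Assumed query: one CSV row per compute process, memory in MiB without units.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(text):
    """Parse 'pid, process_name, used_memory' rows into dicts (memory in MiB)."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split the pid off the front and the memory off the back, so a
        # process name that contains commas does not break the parse.
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)
        used = used.strip()
        if not used.isdigit():
            continue  # skip rows where memory is not reported as a number
        rows.append({"pid": int(pid), "name": name.strip(), "used_mib": int(used)})
    return rows


def residual_over_baseline(rows, process_substr, baseline_mib):
    """VRAM held by matching processes above a known idle baseline, in MiB."""
    used = sum(r["used_mib"] for r in rows if process_substr in r["name"])
    return used - baseline_mib


def query_compute_apps():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_compute_apps(out)
```

With the readings reported here, `residual_over_baseline(parse_compute_apps("12345, ./bin/llama-server, 1094"), "llama-server", 302)` gives 792 MiB for -fit off, and the same call with 1698 gives 1396 MiB for -fit on.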

First Bad Commit

git rev-parse --short HEAD
871b0b7

Relevant log output

Relevant logs from log-fit-on
0.00.197.711 I common_init_result: fitting params to device memory ...
0.00.197.712 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.00.197.717 I common_params_fit_impl: getting device memory data for initial parameters:
0.00.483.098 I common_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
0.00.483.104 I common_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24126 = 23403 + (25610 = 15621 +    9152 +     836) +      -24888 |
0.00.483.104 I common_memory_breakdown_print: |   - Host               |                   1214 =   682 +       0 +     532                |
0.00.502.344 I common_params_fit_impl: projected to use 25610 MiB of device memory vs. 23403 MiB of free device memory
0.00.502.347 I common_params_fit_impl: cannot meet free memory target of 1024 MiB, need to reduce device memory by 3231 MiB
0.00.789.718 I common_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
0.00.789.722 I common_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24126 = 23403 + (20927 = 15621 +    4800 +     505) +      -20204 |
0.00.789.722 I common_memory_breakdown_print: |   - Host               |                    958 =   682 +       0 +     276                |
0.00.816.948 I common_params_fit_impl: context size reduced from 262144 to 171520 -> need 3238 MiB less memory in total
0.00.816.951 I common_params_fit_impl: entire model can be fit by reducing context
0.00.816.952 I common_fit_params: successfully fit params to free device memory
0.00.816.954 I common_fit_params: fitting params to free memory took 0.62 seconds
[...]
0.21.380.993 I common_init_result: fitting params to device memory ...
0.21.380.993 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.21.380.995 I common_params_fit_impl: getting device memory data for initial parameters:
0.21.740.870 I common_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
0.21.740.873 I common_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24126 = 21959 + (25610 = 15621 +    9152 +     836) +      -23444 |
0.21.740.873 I common_memory_breakdown_print: |   - Host               |                   1214 =   682 +       0 +     532                |
0.21.785.743 I common_params_fit_impl: projected to use 25610 MiB of device memory vs. 21959 MiB of free device memory
0.21.785.747 I common_params_fit_impl: cannot meet free memory target of 1024 MiB, need to reduce device memory by 4675 MiB
0.22.084.480 I common_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
0.22.084.484 I common_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24126 = 21959 + (20927 = 15621 +    4800 +     505) +      -18760 |
0.22.084.485 I common_memory_breakdown_print: |   - Host               |                    958 =   682 +       0 +     276                |
0.22.128.464 I common_params_fit_impl: context size reduced from 262144 to 131072 -> need 4683 MiB less memory in total
0.22.128.468 I common_params_fit_impl: entire model can be fit by reducing context
0.22.128.469 I common_fit_params: successfully fit params to free device memory
0.22.128.473 I common_fit_params: fitting params to free memory took 0.75 seconds
[...]
4.59.291.647 I common_init_result: fitting params to device memory ...
4.59.291.647 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
4.59.291.649 I common_params_fit_impl: getting device memory data for initial parameters:
4.59.641.091 I common_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
4.59.641.096 I common_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24126 = 21823 + (25610 = 15621 +    9152 +     836) +      -23308 |
4.59.641.096 I common_memory_breakdown_print: |   - Host               |                   1214 =   682 +       0 +     532                |
4.59.686.367 I common_params_fit_impl: projected to use 25610 MiB of device memory vs. 21823 MiB of free device memory
4.59.686.373 I common_params_fit_impl: cannot meet free memory target of 1024 MiB, need to reduce device memory by 4811 MiB
5.00.001.155 I common_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
5.00.001.160 I common_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24126 = 21823 + (20927 = 15621 +    4800 +     505) +      -18624 |
5.00.001.160 I common_memory_breakdown_print: |   - Host               |                    958 =   682 +       0 +     276                |
5.00.045.807 I common_params_fit_impl: context size reduced from 262144 to 131072 -> need 4683 MiB less memory in total
5.00.045.866 W common_fit_params: failed to fit params to free device memory: n_gpu_layers already set by user to -2, abort
5.00.045.874 I common_fit_params: fitting params to free memory took 0.75 seconds
[...]
5.02.170.544 E ggml_backend_cuda_buffer_type_alloc_buffer: allocating 836.28 MiB on device 0: cudaMalloc failed: out of memory
5.02.170.549 E ggml_gallocr_reserve_n_impl: failed to allocate CUDA0 buffer of size 876904448
5.02.170.550 E graph_reserve: failed to allocate compute buffers
5.02.171.413 E llama_init_from_model: failed to initialize the context: failed to allocate compute pp buffers
/whatever/llama.cpp/tools/server/server-context.cpp:734: failed to reload model after sleeping
5.02.219.467 E srv    load_model: failed to create MTP context
/whatever/llama.cpp/bin/libggml-base.so.0(+0x18ba8)[0x7fef45b5eba8]
/whatever/llama.cpp/bin/libggml-base.so.0(ggml_print_backtrace+0x1e4)[0x7fef45b5ef84]
/whatever/llama.cpp/bin/libggml-base.so.0(ggml_abort+0x11e)[0x7fef45b5f10e]
./bin/llama-server(+0x11b98c)[0x556d3966e98c]
./bin/llama-server(+0x197816)[0x556d396ea816]
./bin/llama-server(+0x700fc)[0x556d395c30fc]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7fef4524524a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7fef45245305]
./bin/llama-server(+0x70781)[0x556d395c3781]

full-log-fit-off.log

full-log-fit-on.log
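
The numbers in the fit-on log excerpt can be cross-checked with a short Python sketch. It assumes the "need to reduce device memory by" value equals projected use plus the free-memory target minus free memory. That relation is inferred from the three log entries above, not taken from the llama.cpp source, and the function names are hypothetical.

```python
import re

# Matches the "projected to use ... vs. ... free" line from the log excerpt.
PROJECTED_RE = re.compile(
    r"projected to use (\d+) MiB of device memory vs\. (\d+) MiB of free device memory"
)


def parse_projection(line):
    """Return (projected_mib, free_mib) from a log line, or None if no match."""
    m = PROJECTED_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


def required_reduction(projected_mib, free_mib, target_mib=1024):
    """Assumed relation: memory to shed so that target_mib stays free."""
    return max(0, projected_mib + target_mib - free_mib)


def free_memory_drift(free_readings):
    """Drop in free memory at each reload relative to the first reading."""
    return [free_readings[0] - f for f in free_readings]


# Free-memory values reported at 0.00, 0.21 and 4.59 in the log excerpt.
free_at_reload = [23403, 21959, 21823]
reductions = [required_reduction(25610, f) for f in free_at_reload]
drift = free_memory_drift(free_at_reload)
```

Here `reductions` comes out as 3231, 4675 and 4811 MiB, matching the three log lines, and `drift` shows 1444 and 1580 MiB less free memory at the second and third reload than at first startup. This is consistent with VRAM not being fully released while sleeping, although other processes on the GPU could also contribute.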
