Skip to content

Eval bug: Crash with draft-mtp or draft-simple paired with RPC backend #23242

@GitGerby

Description

@GitGerby

Name and Version

host: version: 9193 (1a68ec9)
built with Clang 19.1.5 for Windows x86_64

rpc server: version: 1 (dd7cad7)
built with GNU 13.3.0 for Linux x86_64

Operating systems

Windows, Linux

GGML backends

RPC

Hardware

Host:
Ryzen 7950x
Radeon 9070xt

RPC:
EPYC 7351p
Quadro RTX 4000 (turing) w/ CUDA 12.9

Models

Qwen 3.6-27B IQ4 NL (MTP GGUF from Unsloth)

Problem description & steps to reproduce

llama-server crashes when using a draft model (including mtp) with RPC backends. Local llama-server instance is vulkan on windows; I've tried both vulkan on windows rpc endpoints as well as cuda on linux rpc endpoints. The command below includes chained speculation because it's what I had on hand but removing the ngram flags results in the same behavior.

The server instance s starts fine; prompt processing completes successfully. The RPC instance crashes shortly after token generation; the webui got as far as sending the below reasoning stream before termination:

Here's a thinking process:

1

exec:

llama-server.exe -m d:\models\Qwen3.6-27B-IQ4_NL.gguf --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00  -np 1 --kv-unified --rpc 172.18.200.11:50052  -dev vulkan0,rpc0 --no-warmup -dio --reasoning-budget-message "... thinking budget exceeded, let's answer now." --reasoning-budget 7500 --spec-type draft-mtp,ngram-mod --spec-draft-n-max 6 --spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 6 --spec-ngram-mod-n-max 16 --cache-ram 0 -lv 4 -ngl 99 -c 15000 -lv 9

Then feed any prompt.

First Bad Commit

unkown

Relevant log output

relevant llama-server output:

1.24.282.415 D slot update_slots: id  0 | task 0 | created speculative checkpoint (pos_min = 46, pos_max = 46, n_tokens = 47, size = 0.001 MiB, draft = 0.000 MiB)
1.24.282.421 D slot update_batch: id  0 | task 0 | generate_draft: id=8160, #tokens=47, #draft=6, pos_next=47
1.24.282.438 D srv  update_slots: decoding batch, n_tokens = 7
1.24.282.439 D set_adapters_lora: adapters = 0000000000000000
1.24.282.440 D adapters_lora_are_same: adapters = 0000000000000000
1.24.282.441 D set_embeddings: value = 1
1.24.447.883 D slot update_slots: id  0 | task 0 | add accepted tokens: sampled=16, ids.size=7, n_draft=6
1.24.447.916 D res          send: sending result for task id = 0
1.24.447.917 D res          send: task id = 0 pushed to result queue
1.24.447.924 D slot process_toke: id  0 | task 0 | n_decoded = 2, n_remaining = -1, next token:   579 ''s'
1.24.447.925 D res          send: sending result for task id = 0
1.24.447.926 D res          send: task id = 0 pushed to result queue
1.24.447.927 D slot process_toke: id  0 | task 0 | n_decoded = 3, n_remaining = -1, next token:   264 ' a'
1.24.447.929 D res          send: sending result for task id = 0
1.24.447.929 D res          send: task id = 0 pushed to result queue
1.24.447.930 D slot process_toke: id  0 | task 0 | n_decoded = 4, n_remaining = -1, next token:  7047 ' thinking'
1.24.447.931 D res          send: sending result for task id = 0
1.24.447.932 D res          send: task id = 0 pushed to result queue
1.24.447.932 D slot process_toke: id  0 | task 0 | n_decoded = 5, n_remaining = -1, next token:  1817 ' process'
1.24.447.933 D res          send: sending result for task id = 0
1.24.447.934 D res          send: task id = 0 pushed to result queue
1.24.447.935 D slot process_toke: id  0 | task 0 | n_decoded = 6, n_remaining = -1, next token:    25 ':'
1.24.447.936 D res          send: sending result for task id = 0
1.24.447.937 D res          send: task id = 0 pushed to result queue
1.24.447.937 D slot process_toke: id  0 | task 0 | n_decoded = 7, n_remaining = -1, next token:   271 '

'
1.24.447.938 D res          send: sending result for task id = 0
1.24.447.939 D res          send: task id = 0 pushed to result queue
1.24.447.939 D slot process_toke: id  0 | task 0 | n_decoded = 8, n_remaining = -1, next token:    16 '1'
1.24.447.940 D slot update_slots: id  0 | task 0 | accepted 6/6 draft tokens, new n_tokens = 54
1.24.447.941 D srv  update_slots: run slots completed
1.24.447.942 D que    start_loop: waiting for new tasks
1.24.447.942 D que    start_loop: processing new tasks
1.24.447.947 D que    start_loop: processing task, id = 3
1.24.447.949 D que    start_loop: update slots
1.24.447.949 D srv  update_slots: posting NEXT_RESPONSE
1.24.447.952 D que          post: new task, id = 4, front = 0
1.24.447.954 D slot get_n_draft_: id  0 | task 0 | max possible draft: 15048
1.24.448.085 D srv    operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":"'s"}}],"created":1779069172,"id":"chatcmpl-h0qiSQIhXs0LLmPFP5pTTRtum9Z3U18P","model":"Qwen3.6-27B-IQ4_NL.gguf","system_fingerprint":"b9193-1a68ec937","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":47,"prompt_ms":510.741,"prompt_per_token_ms":10.866829787234042,"prompt_per_second":92.02315850891156,"predicted_n":2,"predicted_ms":260.279,"predicted_per_token_ms":130.1395,"predicted_per_second":7.684062102589913,"draft_n":6,"draft_n_accepted":6}}


1.24.448.154 D srv    operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":" a"}}],"created":1779069172,"id":"chatcmpl-h0qiSQIhXs0LLmPFP5pTTRtum9Z3U18P","model":"Qwen3.6-27B-IQ4_NL.gguf","system_fingerprint":"b9193-1a68ec937","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":47,"prompt_ms":510.741,"prompt_per_token_ms":10.866829787234042,"prompt_per_second":92.02315850891156,"predicted_n":3,"predicted_ms":260.279,"predicted_per_token_ms":86.75966666666666,"predicted_per_second":11.52609315388487,"draft_n":6,"draft_n_accepted":6}}


1.24.448.588 D srv    operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":" thinking"}}],"created":1779069172,"id":"chatcmpl-h0qiSQIhXs0LLmPFP5pTTRtum9Z3U18P","model":"Qwen3.6-27B-IQ4_NL.gguf","system_fingerprint":"b9193-1a68ec937","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":47,"prompt_ms":510.741,"prompt_per_token_ms":10.866829787234042,"prompt_per_second":92.02315850891156,"predicted_n":4,"predicted_ms":260.279,"predicted_per_token_ms":65.06975,"predicted_per_second":15.368124205179827,"draft_n":6,"draft_n_accepted":6}}


1.24.448.660 D srv    operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":" process"}}],"created":1779069172,"id":"chatcmpl-h0qiSQIhXs0LLmPFP5pTTRtum9Z3U18P","model":"Qwen3.6-27B-IQ4_NL.gguf","system_fingerprint":"b9193-1a68ec937","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":47,"prompt_ms":510.741,"prompt_per_token_ms":10.866829787234042,"prompt_per_second":92.02315850891156,"predicted_n":5,"predicted_ms":260.279,"predicted_per_token_ms":52.0558,"predicted_per_second":19.210155256474785,"draft_n":6,"draft_n_accepted":6}}


1.24.448.719 D srv    operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":":"}}],"created":1779069172,"id":"chatcmpl-h0qiSQIhXs0LLmPFP5pTTRtum9Z3U18P","model":"Qwen3.6-27B-IQ4_NL.gguf","system_fingerprint":"b9193-1a68ec937","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":47,"prompt_ms":510.741,"prompt_per_token_ms":10.866829787234042,"prompt_per_second":92.02315850891156,"predicted_n":6,"predicted_ms":260.279,"predicted_per_token_ms":43.37983333333333,"predicted_per_second":23.05218630776974,"draft_n":6,"draft_n_accepted":6}}


1.24.448.773 D srv    operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":"\n\n"}}],"created":1779069172,"id":"chatcmpl-h0qiSQIhXs0LLmPFP5pTTRtum9Z3U18P","model":"Qwen3.6-27B-IQ4_NL.gguf","system_fingerprint":"b9193-1a68ec937","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":47,"prompt_ms":510.741,"prompt_per_token_ms":10.866829787234042,"prompt_per_second":92.02315850891156,"predicted_n":7,"predicted_ms":260.279,"predicted_per_token_ms":37.18271428571428,"predicted_per_second":26.894217359064697,"draft_n":6,"draft_n_accepted":6}}


1.24.448.827 D srv    operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":"1"}}],"created":1779069172,"id":"chatcmpl-h0qiSQIhXs0LLmPFP5pTTRtum9Z3U18P","model":"Qwen3.6-27B-IQ4_NL.gguf","system_fingerprint":"b9193-1a68ec937","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":47,"prompt_ms":510.741,"prompt_per_token_ms":10.866829787234042,"prompt_per_second":92.02315850891156,"predicted_n":8,"predicted_ms":260.279,"predicted_per_token_ms":32.534875,"predicted_per_second":30.736248410359654,"draft_n":6,"draft_n_accepted":6}}


1.24.464.591 D  - seq_id 0, draft candidate   0, pos   0:     13 (   1.000) '.'
1.24.472.269 D  - seq_id 0, draft candidate   0, pos   1:    220 (   1.000) ' '
1.24.480.100 D  - seq_id 0, draft candidate   0, pos   2:   2972 (   1.000) ' **'
1.24.487.769 D  - seq_id 0, draft candidate   0, pos   3:   2014 (   1.000) 'An'
1.24.495.672 D  - seq_id 0, draft candidate   0, pos   4:  53983 (   1.000) 'alyze'
1.24.503.459 D  - seq_id 0, draft candidate   0, pos   5:    279 (   1.000) ' the'
1.24.503.467 D common_speculative_draft: called impl draft-mtp, hist size = 54, call_count = 2, gen = 6
1.24.517.656 D slot update_slots: id  0 | task 0 | created speculative checkpoint (pos_min = 53, pos_max = 53, n_tokens = 54, size = 0.001 MiB, draft = 0.000 MiB)
1.24.517.664 D slot update_batch: id  0 | task 0 | generate_draft: id=16, #tokens=54, #draft=6, pos_next=54
1.24.517.671 D srv  update_slots: decoding batch, n_tokens = 7
1.24.517.675 D set_adapters_lora: adapters = 0000000000000000
1.24.517.676 D adapters_lora_are_same: adapters = 0000000000000000
1.24.517.679 D set_embeddings: value = 1
1.24.680.849 E recv failed (bytes_recv=0, size_to_recv=8)
D:/a/llama.cpp/llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp:491: Remote RPC server crashed or returned malformed response

RPC server error:

CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at /workspace/gearnsc/internal-tooling/src/ggml/src/ggml-cuda/ggml-cuda.cu:3235
  cudaStreamSynchronize(cuda_ctx->stream())
/workspace/gearnsc/internal-tooling/src/ggml/src/ggml-cuda/ggml-cuda.cu:102: CUDA error
/lib/x86_64-linux-gnu/libggml-base.so.0(+0x1a07b) [0x7f4ea292d07b]
/lib/x86_64-linux-gnu/libggml-base.so.0(ggml_print_backtrace+0x220) [0x7f4ea292d750]
/lib/x86_64-linux-gnu/libggml-base.so.0(ggml_abort+0x163) [0x7f4ea292d933]
/lib/x86_64-linux-gnu/libggml-cuda.so.0(_Z15ggml_cuda_errorPKcS0_S0_iS0_+0xb7) [0x7f4e9a40c1b7]
/lib/x86_64-linux-gnu/libggml-cuda.so.0(+0x20d7f8) [0x7f4e9a40d7f8]
/lib/x86_64-linux-gnu/libggml-base.so.0(ggml_backend_graph_compute+0x25) [0x7f4ea294cab5]
/lib/x86_64-linux-gnu/libggml-rpc.so.0(_ZN10rpc_server15graph_recomputeERK27rpc_msg_graph_recompute_req+0x6a) [0x7f4ea28cadda]
/lib/x86_64-linux-gnu/libggml-rpc.so.0(+0x16b45) [0x7f4ea28d6b45]
/lib/x86_64-linux-gnu/libggml-rpc.so.0(ggml_backend_rpc_start_server+0x5a2) [0x7f4ea28d9c82]
rpc-server(+0x7a0e) [0x556f51543a0e]
/lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7f4ea2435ca8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f4ea2435d65]
rpc-server(+0x83f5) [0x556f515443f5]
Aborted

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions