llama-server crashes when using a draft model (including mtp) with RPC backends. Local llama-server instance is vulkan on windows; I've tried both vulkan on windows rpc endpoints as well as cuda on linux rpc endpoints. The command below includes chained speculation because it's what I had on hand but removing the ngram flags results in the same behavior.
The server instance s starts fine; prompt processing completes successfully. The RPC instance crashes shortly after token generation; the webui got as far as sending the below reasoning stream before termination:
Then feed any prompt.
1.24.282.415 D slot update_slots: id 0 | task 0 | created speculative checkpoint (pos_min = 46, pos_max = 46, n_tokens = 47, size = 0.001 MiB, draft = 0.000 MiB)
1.24.282.421 D slot update_batch: id 0 | task 0 | generate_draft: id=8160, #tokens=47, #draft=6, pos_next=47
1.24.282.438 D srv update_slots: decoding batch, n_tokens = 7
1.24.282.439 D set_adapters_lora: adapters = 0000000000000000
1.24.282.440 D adapters_lora_are_same: adapters = 0000000000000000
1.24.282.441 D set_embeddings: value = 1
1.24.447.883 D slot update_slots: id 0 | task 0 | add accepted tokens: sampled=16, ids.size=7, n_draft=6
1.24.447.916 D res send: sending result for task id = 0
1.24.447.917 D res send: task id = 0 pushed to result queue
1.24.447.924 D slot process_toke: id 0 | task 0 | n_decoded = 2, n_remaining = -1, next token: 579 ''s'
1.24.447.925 D res send: sending result for task id = 0
1.24.447.926 D res send: task id = 0 pushed to result queue
1.24.447.927 D slot process_toke: id 0 | task 0 | n_decoded = 3, n_remaining = -1, next token: 264 ' a'
1.24.447.929 D res send: sending result for task id = 0
1.24.447.929 D res send: task id = 0 pushed to result queue
1.24.447.930 D slot process_toke: id 0 | task 0 | n_decoded = 4, n_remaining = -1, next token: 7047 ' thinking'
1.24.447.931 D res send: sending result for task id = 0
1.24.447.932 D res send: task id = 0 pushed to result queue
1.24.447.932 D slot process_toke: id 0 | task 0 | n_decoded = 5, n_remaining = -1, next token: 1817 ' process'
1.24.447.933 D res send: sending result for task id = 0
1.24.447.934 D res send: task id = 0 pushed to result queue
1.24.447.935 D slot process_toke: id 0 | task 0 | n_decoded = 6, n_remaining = -1, next token: 25 ':'
1.24.447.936 D res send: sending result for task id = 0
1.24.447.937 D res send: task id = 0 pushed to result queue
1.24.447.937 D slot process_toke: id 0 | task 0 | n_decoded = 7, n_remaining = -1, next token: 271 '
'
1.24.447.938 D res send: sending result for task id = 0
1.24.447.939 D res send: task id = 0 pushed to result queue
1.24.447.939 D slot process_toke: id 0 | task 0 | n_decoded = 8, n_remaining = -1, next token: 16 '1'
1.24.447.940 D slot update_slots: id 0 | task 0 | accepted 6/6 draft tokens, new n_tokens = 54
1.24.447.941 D srv update_slots: run slots completed
1.24.447.942 D que start_loop: waiting for new tasks
1.24.447.942 D que start_loop: processing new tasks
1.24.447.947 D que start_loop: processing task, id = 3
1.24.447.949 D que start_loop: update slots
1.24.447.949 D srv update_slots: posting NEXT_RESPONSE
1.24.447.952 D que post: new task, id = 4, front = 0
1.24.447.954 D slot get_n_draft_: id 0 | task 0 | max possible draft: 15048
1.24.448.085 D srv operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":"'s"}}],"created":1779069172,"id":"chatcmpl-h0qiSQIhXs0LLmPFP5pTTRtum9Z3U18P","model":"Qwen3.6-27B-IQ4_NL.gguf","system_fingerprint":"b9193-1a68ec937","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":47,"prompt_ms":510.741,"prompt_per_token_ms":10.866829787234042,"prompt_per_second":92.02315850891156,"predicted_n":2,"predicted_ms":260.279,"predicted_per_token_ms":130.1395,"predicted_per_second":7.684062102589913,"draft_n":6,"draft_n_accepted":6}}
1.24.448.154 D srv operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":" a"}}],"created":1779069172,"id":"chatcmpl-h0qiSQIhXs0LLmPFP5pTTRtum9Z3U18P","model":"Qwen3.6-27B-IQ4_NL.gguf","system_fingerprint":"b9193-1a68ec937","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":47,"prompt_ms":510.741,"prompt_per_token_ms":10.866829787234042,"prompt_per_second":92.02315850891156,"predicted_n":3,"predicted_ms":260.279,"predicted_per_token_ms":86.75966666666666,"predicted_per_second":11.52609315388487,"draft_n":6,"draft_n_accepted":6}}
1.24.448.588 D srv operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":" thinking"}}],"created":1779069172,"id":"chatcmpl-h0qiSQIhXs0LLmPFP5pTTRtum9Z3U18P","model":"Qwen3.6-27B-IQ4_NL.gguf","system_fingerprint":"b9193-1a68ec937","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":47,"prompt_ms":510.741,"prompt_per_token_ms":10.866829787234042,"prompt_per_second":92.02315850891156,"predicted_n":4,"predicted_ms":260.279,"predicted_per_token_ms":65.06975,"predicted_per_second":15.368124205179827,"draft_n":6,"draft_n_accepted":6}}
1.24.448.660 D srv operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":" process"}}],"created":1779069172,"id":"chatcmpl-h0qiSQIhXs0LLmPFP5pTTRtum9Z3U18P","model":"Qwen3.6-27B-IQ4_NL.gguf","system_fingerprint":"b9193-1a68ec937","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":47,"prompt_ms":510.741,"prompt_per_token_ms":10.866829787234042,"prompt_per_second":92.02315850891156,"predicted_n":5,"predicted_ms":260.279,"predicted_per_token_ms":52.0558,"predicted_per_second":19.210155256474785,"draft_n":6,"draft_n_accepted":6}}
1.24.448.719 D srv operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":":"}}],"created":1779069172,"id":"chatcmpl-h0qiSQIhXs0LLmPFP5pTTRtum9Z3U18P","model":"Qwen3.6-27B-IQ4_NL.gguf","system_fingerprint":"b9193-1a68ec937","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":47,"prompt_ms":510.741,"prompt_per_token_ms":10.866829787234042,"prompt_per_second":92.02315850891156,"predicted_n":6,"predicted_ms":260.279,"predicted_per_token_ms":43.37983333333333,"predicted_per_second":23.05218630776974,"draft_n":6,"draft_n_accepted":6}}
1.24.448.773 D srv operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":"\n\n"}}],"created":1779069172,"id":"chatcmpl-h0qiSQIhXs0LLmPFP5pTTRtum9Z3U18P","model":"Qwen3.6-27B-IQ4_NL.gguf","system_fingerprint":"b9193-1a68ec937","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":47,"prompt_ms":510.741,"prompt_per_token_ms":10.866829787234042,"prompt_per_second":92.02315850891156,"predicted_n":7,"predicted_ms":260.279,"predicted_per_token_ms":37.18271428571428,"predicted_per_second":26.894217359064697,"draft_n":6,"draft_n_accepted":6}}
1.24.448.827 D srv operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":"1"}}],"created":1779069172,"id":"chatcmpl-h0qiSQIhXs0LLmPFP5pTTRtum9Z3U18P","model":"Qwen3.6-27B-IQ4_NL.gguf","system_fingerprint":"b9193-1a68ec937","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":47,"prompt_ms":510.741,"prompt_per_token_ms":10.866829787234042,"prompt_per_second":92.02315850891156,"predicted_n":8,"predicted_ms":260.279,"predicted_per_token_ms":32.534875,"predicted_per_second":30.736248410359654,"draft_n":6,"draft_n_accepted":6}}
1.24.464.591 D - seq_id 0, draft candidate 0, pos 0: 13 ( 1.000) '.'
1.24.472.269 D - seq_id 0, draft candidate 0, pos 1: 220 ( 1.000) ' '
1.24.480.100 D - seq_id 0, draft candidate 0, pos 2: 2972 ( 1.000) ' **'
1.24.487.769 D - seq_id 0, draft candidate 0, pos 3: 2014 ( 1.000) 'An'
1.24.495.672 D - seq_id 0, draft candidate 0, pos 4: 53983 ( 1.000) 'alyze'
1.24.503.459 D - seq_id 0, draft candidate 0, pos 5: 279 ( 1.000) ' the'
1.24.503.467 D common_speculative_draft: called impl draft-mtp, hist size = 54, call_count = 2, gen = 6
1.24.517.656 D slot update_slots: id 0 | task 0 | created speculative checkpoint (pos_min = 53, pos_max = 53, n_tokens = 54, size = 0.001 MiB, draft = 0.000 MiB)
1.24.517.664 D slot update_batch: id 0 | task 0 | generate_draft: id=16, #tokens=54, #draft=6, pos_next=54
1.24.517.671 D srv update_slots: decoding batch, n_tokens = 7
1.24.517.675 D set_adapters_lora: adapters = 0000000000000000
1.24.517.676 D adapters_lora_are_same: adapters = 0000000000000000
1.24.517.679 D set_embeddings: value = 1
1.24.680.849 E recv failed (bytes_recv=0, size_to_recv=8)
D:/a/llama.cpp/llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp:491: Remote RPC server crashed or returned malformed response
Name and Version
host: version: 9193 (1a68ec9)
built with Clang 19.1.5 for Windows x86_64
rpc server: version: 1 (dd7cad7)
built with GNU 13.3.0 for Linux x86_64
Operating systems
Windows, Linux
GGML backends
RPC
Hardware
Host:
Ryzen 7950x
Radeon 9070xt
RPC:
EPYC 7351p
Quadro RTX 4000 (turing) w/ CUDA 12.9
Models
Qwen 3.6-27B IQ4 NL (MTP GGUF from Unsloth)
Problem description & steps to reproduce
llama-server crashes when using a draft model (including mtp) with RPC backends. Local llama-server instance is vulkan on windows; I've tried both vulkan on windows rpc endpoints as well as cuda on linux rpc endpoints. The command below includes chained speculation because it's what I had on hand but removing the ngram flags results in the same behavior.
The server instance s starts fine; prompt processing completes successfully. The RPC instance crashes shortly after token generation; the webui got as far as sending the below reasoning stream before termination:
exec:
Then feed any prompt.
First Bad Commit
unkown
Relevant log output
relevant llama-server output:
RPC server error: