Eval bug: PR20908 breaks rpc-server functionality when balance-splitting a model across multiple machines. #21006

@stew675

Description

Name and Version

$ rpc-server --help
load_backend: loaded RPC backend from /llm/runtimes/llama-b8487/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from /llm/runtimes/llama-b8487/libggml-vulkan.so
load_backend: loaded CPU backend from /llm/runtimes/llama-b8487/libggml-cpu-zen4.so

Operating systems

Linux

GGML backends

Vulkan

Hardware

AMD AI Max+ 395 on Fedora 43

Models

Unsloth

IQ4_NL quantization

Problem description & steps to reproduce

Introduced with: #20908

Started rpc-server on two machines.
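The report does not show the exact rpc-server invocation used on the remote machines. A plausible launch sketch (the `-H`/`-p` bind-address and port flags, and the dry-run structure, are assumptions, not taken from the report; the port matches the `--rpc 192.168.2.x:50001` endpoints passed to llama-server below):

```shell
# Hypothetical rpc-server launch for each remote machine (192.168.2.103
# and 192.168.2.101). Built as a dry-run string so the intended command
# is visible without actually starting a server here.
HOST=0.0.0.0   # listen on all interfaces so llama-server can connect
PORT=50001     # must match the --rpc endpoints given to llama-server
CMD="rpc-server -H $HOST -p $PORT"
echo "$CMD"
```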

Issued the following command to balance the load across the two rpc-server machines. The tensor-split designation here appears to load the model evenly between the two remote machines, with nothing on the local machine. That's not an issue for me; it's just what I had to do to make it work.

$ taskset -c 2-15 /llm/bin/llama-server --rpc 192.168.2.103:50001 --rpc 192.168.2.101:50001 --tensor-split 1,1,1 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.01 --threads 14 --batch-size 4096 --ubatch-size 1024 --cache-ram 8192 --ctx-size 131072 --no-mmap --kv-unified --flash-attn on --ctx-checkpoints 128 --n-gpu-layers 999 --parallel 2 --host 0.0.0.0 --port 8033 --jinja --alias "Qwen3.5-397B-A17B-IQ4_NL" --model ./Qwen3.5-397B-A17B-UD-IQ4_NL.gguf

An error is reported by llama-server:
/home/runner/work/llama.cpp/llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp:669: Remote RPC server crashed or returned malformed response

The rpc-server did not crash; it appears to have rejected the model load after the model had been transferred, and returned an error.

This style of crash is seen on all versions from b8492 onwards.

b8487, the version immediately prior, works just fine with the above command: the model gets split evenly, and inferences are successful.

First Bad Commit

b8492 is the first affected build; b8487, the immediately preceding build, is unaffected.

Relevant log output

Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fcebf6879a2 in __syscall_cancel_arch () from /lib64/libc.so.6
#0  0x00007fcebf6879a2 in __syscall_cancel_arch () from /lib64/libc.so.6
#1  0x00007fcebf67bc3c in __internal_syscall_cancel () from /lib64/libc.so.6
#2  0x00007fcebf67bc84 in __syscall_cancel () from /lib64/libc.so.6
#3  0x00007fcebf6ebb8f in wait4 () from /lib64/libc.so.6
#4  0x00007fcebff6cb9b in ggml_print_backtrace () from /llm/runtimes/llama-b8514/libggml-base.so.0
#5  0x00007fcebff6cd32 in ggml_abort () from /llm/runtimes/llama-b8514/libggml-base.so.0
#6  0x00007fcec071aeaf in ggml_backend_rpc_buffer_get_tensor(ggml_backend_buffer*, ggml_tensor const*, void*, unsigned long, unsigned long) () from /llm/runtimes/llama-b8514/libggml-rpc.so
#7  0x00007fcebff84c96 in ggml_backend_tensor_copy () from /llm/runtimes/llama-b8514/libggml-base.so.0
#8  0x00007fcebff89cae in ggml_backend_sched_graph_compute_async () from /llm/runtimes/llama-b8514/libggml-base.so.0
#9  0x00007fcebfcbf941 in llama_context::graph_compute(ggml_cgraph*, bool) () from /llm/runtimes/llama-b8514/libllama.so.0
#10 0x00007fcebfcbfd75 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /llm/runtimes/llama-b8514/libllama.so.0
#11 0x00007fcebfcc826b in llama_context::decode(llama_batch const&) () from /llm/runtimes/llama-b8514/libllama.so.0
#12 0x00007fcebfcc98b0 in llama_decode () from /llm/runtimes/llama-b8514/libllama.so.0
#13 0x000055c63eb631fc in common_init_from_params(common_params&) ()
#14 0x000055c63ea9f1c2 in server_context_impl::load_model(common_params const&) ()
#15 0x000055c63e9e9f78 in main ()
[Inferior 1 (process 21882) detached]
./runme: line 23: 21882 Aborted                    (core dumped) taskset -c 2-15 /llm/bin/llama-server --rpc 192.168.2.103:50001 --rpc 192.168.2.101:50001 --tensor-split 1,1,1 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.01 --threads 14 --batch-size 4096 --ubatch-size 1024 --cache-ram 8192 --ctx-size 131072 --no-mmap --kv-unified --flash-attn on --ctx-checkpoints 128 --n-gpu-layers 999 --parallel 2 --host 0.0.0.0 --port 8033 --jinja --alias "Qwen3.5-397B-A17B-IQ4_NL" --model ./Qwen3.5-397B-A17B-UD-IQ4_NL.gguf

Metadata

Labels

bug (Something isn't working)