Eval bug: PR20908 breaks rpc-server functionality when balance-splitting a model across multiple machines. #21006

@stew675

Description

Name and Version

$ rpc-server --help
load_backend: loaded RPC backend from /llm/runtimes/llama-b8487/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from /llm/runtimes/llama-b8487/libggml-vulkan.so
load_backend: loaded CPU backend from /llm/runtimes/llama-b8487/libggml-cpu-zen4.so

Operating systems

Linux

GGML backends

Vulkan

Hardware

AMD AI Max+ 395 on Fedora 43

Models

Unsloth

IQ4_NL quantization

Problem description & steps to reproduce

Introduced with: #20908

Started rpc-server on two machines.
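The report does not show the exact rpc-server invocation used on the remote machines. A plausible launch sketch (the `-H`/`-p` bind-address and port flags, and the dry-run structure, are assumptions, not taken from the report; the port matches the `--rpc 192.168.2.x:50001` endpoints passed to llama-server below):

```shell
# Hypothetical rpc-server launch for each remote machine (192.168.2.103
# and 192.168.2.101). Built as a dry-run string so the intended command
# is visible without actually starting a server here.
HOST=0.0.0.0   # listen on all interfaces so llama-server can connect
PORT=50001     # must match the --rpc endpoints given to llama-server
CMD="rpc-server -H $HOST -p $PORT"
echo "$CMD"
```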

Issued the following command to balance the load across the two rpc-server machines. The tensor-split designation here appears to load the model evenly between the two remote machines, with nothing on the local machine. That's not an issue for me; it's just what I had to do to make it work.

$ taskset -c 2-15 /llm/bin/llama-server --rpc 192.168.2.103:50001 --rpc 192.168.2.101:50001 --tensor-split 1,1,1 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.01 --threads 14 --batch-size 4096 --ubatch-size 1024 --cache-ram 8192 --ctx-size 131072 --no-mmap --kv-unified --flash-attn on --ctx-checkpoints 128 --n-gpu-layers 999 --parallel 2 --host 0.0.0.0 --port 8033 --jinja --alias "Qwen3.5-397B-A17B-IQ4_NL" --model ./Qwen3.5-397B-A17B-UD-IQ4_NL.gguf

An error is reported by llama-server:
/home/runner/work/llama.cpp/llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp:669: Remote RPC server crashed or returned malformed response

The rpc-server did not crash; it appears to have rejected the model load after the model had been transferred, and returned an error.

This style of crash is seen on all versions from b8492 onwards.

b8487, the version immediately prior, works just fine with the above command: the model gets split evenly, and inferences are successful.

First Bad Commit

b8492 is the first affected build; b8487, the immediately preceding build, is unaffected.

Relevant log output

Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fcebf6879a2 in __syscall_cancel_arch () from /lib64/libc.so.6
#0  0x00007fcebf6879a2 in __syscall_cancel_arch () from /lib64/libc.so.6
#1  0x00007fcebf67bc3c in __internal_syscall_cancel () from /lib64/libc.so.6
#2  0x00007fcebf67bc84 in __syscall_cancel () from /lib64/libc.so.6
#3  0x00007fcebf6ebb8f in wait4 () from /lib64/libc.so.6
#4  0x00007fcebff6cb9b in ggml_print_backtrace () from /llm/runtimes/llama-b8514/libggml-base.so.0
#5  0x00007fcebff6cd32 in ggml_abort () from /llm/runtimes/llama-b8514/libggml-base.so.0
#6  0x00007fcec071aeaf in ggml_backend_rpc_buffer_get_tensor(ggml_backend_buffer*, ggml_tensor const*, void*, unsigned long, unsigned long) () from /llm/runtimes/llama-b8514/libggml-rpc.so
#7  0x00007fcebff84c96 in ggml_backend_tensor_copy () from /llm/runtimes/llama-b8514/libggml-base.so.0
#8  0x00007fcebff89cae in ggml_backend_sched_graph_compute_async () from /llm/runtimes/llama-b8514/libggml-base.so.0
#9  0x00007fcebfcbf941 in llama_context::graph_compute(ggml_cgraph*, bool) () from /llm/runtimes/llama-b8514/libllama.so.0
#10 0x00007fcebfcbfd75 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /llm/runtimes/llama-b8514/libllama.so.0
#11 0x00007fcebfcc826b in llama_context::decode(llama_batch const&) () from /llm/runtimes/llama-b8514/libllama.so.0
#12 0x00007fcebfcc98b0 in llama_decode () from /llm/runtimes/llama-b8514/libllama.so.0
#13 0x000055c63eb631fc in common_init_from_params(common_params&) ()
#14 0x000055c63ea9f1c2 in server_context_impl::load_model(common_params const&) ()
#15 0x000055c63e9e9f78 in main ()
[Inferior 1 (process 21882) detached]
./runme: line 23: 21882 Aborted                    (core dumped) taskset -c 2-15 /llm/bin/llama-server --rpc 192.168.2.103:50001 --rpc 192.168.2.101:50001 --tensor-split 1,1,1 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.01 --threads 14 --batch-size 4096 --ubatch-size 1024 --cache-ram 8192 --ctx-size 131072 --no-mmap --kv-unified --flash-attn on --ctx-checkpoints 128 --n-gpu-layers 999 --parallel 2 --host 0.0.0.0 --port 8033 --jinja --alias "Qwen3.5-397B-A17B-IQ4_NL" --model ./Qwen3.5-397B-A17B-UD-IQ4_NL.gguf

Metadata

Labels

bug (Something isn't working)