Eval bug: PR20908 breaks rpc-server functionality when balancing split a model across multiple machines. #21006
Name and Version
$ rpc-server --help
load_backend: loaded RPC backend from /llm/runtimes/llama-b8487/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from /llm/runtimes/llama-b8487/libggml-vulkan.so
load_backend: loaded CPU backend from /llm/runtimes/llama-b8487/libggml-cpu-zen4.so
Operating systems
Linux
GGML backends
Vulkan
Hardware
AMD AI Max+ 395 on Fedora 43
Models
IQ4_NL quantization
Problem description & steps to reproduce
Introduced with: #20908
Started rpc-server on two machines.
Issued the following command to balance the load across the two rpc-server machines. With this tensor-split setting, the model loads evenly across the two remote machines and nothing on the local machine. That's not an issue for me; it's just what I had to do to make it work.
$ taskset -c 2-15 /llm/bin/llama-server --rpc 192.168.2.103:50001 --rpc 192.168.2.101:50001 --tensor-split 1,1,1 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.01 --threads 14 --batch-size 4096 --ubatch-size 1024 --cache-ram 8192 --ctx-size 131072 --no-mmap --kv-unified --flash-attn on --ctx-checkpoints 128 --n-gpu-layers 999 --parallel 2 --host 0.0.0.0 --port 8033 --jinja --alias "Qwen3.5-397B-A17B-IQ4_NL" --model ./Qwen3.5-397B-A17B-UD-IQ4_NL.gguf
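For context, --tensor-split distributes model layers across the available devices in proportion to the given weights. A rough sketch of proportional layer assignment (my own illustration, not llama.cpp's exact algorithm; the 94-layer count is a hypothetical example):

```python
def split_layers(n_layers, proportions):
    """Distribute n_layers across devices proportionally (illustrative only)."""
    total = sum(proportions)
    cumulative, counts, assigned = 0.0, [], 0
    for p in proportions:
        cumulative += p
        upto = round(n_layers * cumulative / total)
        counts.append(upto - assigned)
        assigned = upto
    return counts

# With --tensor-split 1,1,1 over three devices (two RPC + local),
# a hypothetical 94-layer model would split roughly evenly:
print(split_layers(94, [1, 1, 1]))  # -> [31, 32, 31]
```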
llama-server then reports the following error:
/home/runner/work/llama.cpp/llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp:669: Remote RPC server crashed or returned malformed response
The rpc-server itself did not crash; it appears to have rejected the model load after the model had been transferred, and returned an error.
This crash is reproducible on all versions from b8492 onward.
b8487, the version immediately prior, works just fine with the above command: the model is split evenly and inferences succeed.
First Bad Commit
Not bisected to an exact commit. The crash first appears in b8492 and is seen on all versions from there onward; b8487 works, which points at #20908.
Relevant log output
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fcebf6879a2 in __syscall_cancel_arch () from /lib64/libc.so.6
#0 0x00007fcebf6879a2 in __syscall_cancel_arch () from /lib64/libc.so.6
#1 0x00007fcebf67bc3c in __internal_syscall_cancel () from /lib64/libc.so.6
#2 0x00007fcebf67bc84 in __syscall_cancel () from /lib64/libc.so.6
#3 0x00007fcebf6ebb8f in wait4 () from /lib64/libc.so.6
#4 0x00007fcebff6cb9b in ggml_print_backtrace () from /llm/runtimes/llama-b8514/libggml-base.so.0
#5 0x00007fcebff6cd32 in ggml_abort () from /llm/runtimes/llama-b8514/libggml-base.so.0
#6 0x00007fcec071aeaf in ggml_backend_rpc_buffer_get_tensor(ggml_backend_buffer*, ggml_tensor const*, void*, unsigned long, unsigned long) () from /llm/runtimes/llama-b8514/libggml-rpc.so
#7 0x00007fcebff84c96 in ggml_backend_tensor_copy () from /llm/runtimes/llama-b8514/libggml-base.so.0
#8 0x00007fcebff89cae in ggml_backend_sched_graph_compute_async () from /llm/runtimes/llama-b8514/libggml-base.so.0
#9 0x00007fcebfcbf941 in llama_context::graph_compute(ggml_cgraph*, bool) () from /llm/runtimes/llama-b8514/libllama.so.0
#10 0x00007fcebfcbfd75 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /llm/runtimes/llama-b8514/libllama.so.0
#11 0x00007fcebfcc826b in llama_context::decode(llama_batch const&) () from /llm/runtimes/llama-b8514/libllama.so.0
#12 0x00007fcebfcc98b0 in llama_decode () from /llm/runtimes/llama-b8514/libllama.so.0
#13 0x000055c63eb631fc in common_init_from_params(common_params&) ()
#14 0x000055c63ea9f1c2 in server_context_impl::load_model(common_params const&) ()
#15 0x000055c63e9e9f78 in main ()
[Inferior 1 (process 21882) detached]
./runme: line 23: 21882 Aborted (core dumped) taskset -c 2-15 /llm/bin/llama-server --rpc 192.168.2.103:50001 --rpc 192.168.2.101:50001 --tensor-split 1,1,1 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.01 --threads 14 --batch-size 4096 --ubatch-size 1024 --cache-ram 8192 --ctx-size 131072 --no-mmap --kv-unified --flash-attn on --ctx-checkpoints 128 --n-gpu-layers 999 --parallel 2 --host 0.0.0.0 --port 8033 --jinja --alias "Qwen3.5-397B-A17B-IQ4_NL" --model ./Qwen3.5-397B-A17B-UD-IQ4_NL.gguf