Conversation
@jukofyork I think you were doing some experiments with RPC recently - could you check if this change significantly affects the performance in your RPC use cases?
rgerganov left a comment
Looking at the debug logs on the server side, I believe this change will only increase TTFT and will not affect PP and TG speeds. Do I get this right?
```
@@ -8,4 +7,4 @@
 #endif

 #define RPC_PROTO_MAJOR_VERSION 3
```

Let's bump the protocol version as this is a breaking change.
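Bumping the major version only matters if both sides check it during the initial exchange. A minimal sketch of what such a compatibility check could look like; the struct layout and function names here are hypothetical, not llama.cpp's actual rpc-server code:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical protocol version advertised by this build.
constexpr uint32_t RPC_PROTO_MAJOR_VERSION = 3;

// Hypothetical hello message exchanged when a client connects.
struct rpc_hello {
    uint32_t major; // must match exactly: a bump signals a breaking change
    uint32_t minor; // reserved for backward-compatible additions
};

// Returns true if the client can talk to a server reporting `srv`.
// Major versions must match; minor differences are tolerated.
static bool rpc_version_compatible(const rpc_hello & srv) {
    return srv.major == RPC_PROTO_MAJOR_VERSION;
}
```

With this convention, a server built after this PR (major version 3) would refuse clients still speaking the old protocol instead of failing mid-inference.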
Hm, maybe that's correct, likely because of the graph reuse logic. If you disable the graph reuse, do you see more RPC calls? `LLAMA_GRAPH_REUSE_DISABLE=1 llama-cli ...`
Yes, disabling graph reuse results in a lot of extra RPC calls.
Great, I didn't realize until now that graph reuse saves us almost all of the extra RPC calls. This may make caching the RPC calls almost redundant, as I think this won't have a significant impact on PP.
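If caching the per-tensor RPC calls ever does become worthwhile, the idea would be to memoize the answers on the client so a reused graph never repeats the same round trip. A hypothetical sketch, with all names and types illustrative rather than taken from the actual llama.cpp RPC client:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

// Hypothetical client-side cache for a per-tensor RPC such as a
// get-alloc-size query: key by the tensor's remote id and only hit the
// network the first time each tensor is asked about.
struct rpc_alloc_size_cache {
    std::unordered_map<uint64_t, size_t> cache;
    // Stand-in for the expensive RPC round trip to the server.
    std::function<size_t(uint64_t)> remote_get_alloc_size;

    size_t get(uint64_t tensor_id) {
        auto it = cache.find(tensor_id);
        if (it != cache.end()) {
            return it->second; // cache hit: no network round trip
        }
        const size_t sz = remote_get_alloc_size(tensor_id);
        cache.emplace(tensor_id, sz);
        return sz;
    }
};
```

The trade-off discussed above is that graph reuse already collapses repeated evaluations onto the same cached graph, so a cache like this would mostly benefit the first evaluation, i.e. TTFT.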
Some benches would be useful to confirm that the performance is not significantly affected, and then I think it should be good to merge.
No problem, but it will likely be Thursday before I can run any tests.
I think eventually the proper solution to this and other per-tensor calls such as |
@jukofyork @eugr Do you want to do some perf tests with this branch before merging to verify there is no significant performance hit?
Sorry, I forgot all about this - I'll try and test it this afternoon and report back. |
Using 3 nodes, each with 2x

```shell
./llama-bench \
  --rpc "192.168.1.2:50052,192.168.1.3:50052" \
  --model ~/models/gguf/GLM-4.6-Q5_X.gguf \
  --n-gpu-layers 99 \
  --flash-attn 1 \
  --n-prompt 512 \
  --n-gen 128,256,512
```

Master:

(benchmark table not recovered)

PR:

(benchmark table not recovered)
@ggerganov - sorry for the delay, been trying to get some work done :)

Before patch:

```shell
build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf --rpc 192.168.177.12:15001 -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
```

build: bde188d (7275)

(benchmark table not recovered)

Apply patch:

```shell
curl -L https://patch-diff.githubusercontent.com/raw/ggml-org/llama.cpp/pull/17116.diff | git apply
```

After patch:

build: bde188d (7275)

(benchmark table not recovered)
Hm, that's strange. Btw, the second run seems to have a bit higher uncertainty in the TG numbers. Is it possible that the system was somehow occupied during this test? Either way, I think it's better to merge this change since it fixes RPC support with Metal devices. Optimizations can come later.
* rpc : fix alloc size logic
* rpc : bump version
Is this confirmed to be working? I'm still getting a segfault in llama-cli with Metal as an RPC server, even when just listing devices. All other RPC servers (x86/Vulkan) work fine. Thanks.
fix #16657
ref #16276 (review)
This fixes RPC inference when the Metal backend is involved.
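For context on why the alloc size logic matters: a backend such as Metal may require padded or aligned allocations, so the buffer size a tensor actually needs can exceed its raw byte count, and the RPC client has to use the backend-reported allocation size rather than assuming the two are equal. A toy illustration of the padding effect only; the helper names and alignment value are made up, not llama.cpp's real buffer-type API:

```cpp
#include <cassert>
#include <cstddef>

// Round n up to the next multiple of alignment (alignment > 0).
static size_t align_up(size_t n, size_t alignment) {
    return (n + alignment - 1) / alignment * alignment;
}

// What a Metal-like backend might report as the real allocation size for a
// tensor of `nbytes` raw bytes: the padded size, not the raw size.
static size_t backend_alloc_size(size_t nbytes, size_t alignment) {
    return align_up(nbytes, alignment);
}
```

If the client allocated only the raw byte count, a tensor of e.g. 100 bytes on a 64-byte-aligned backend would be short 28 bytes, which is the kind of mismatch that can crash the server.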
Testing:
TODO: