Ref: #18725 (comment)
It seems that on some backends, repeated calls to ggml_concat can become a bottleneck. I suspect that's because each call allocates a new (bigger) tensor and copies the data over, so nesting multiple concats, concat(a, concat(b, ...)), makes it slow.
So I'm wondering whether multiple concat ops can be fused the same way multiple add ops can be.
This will be necessary for qwen3next, because chunks are concatenated on each loop iteration. So avoiding the copy/reallocation in ggml_concat could improve overall performance at larger batch sizes (?)
llama.cpp/src/models/qwen3next.cpp (line 336 in e06088d):

```cpp
ggml_tensor * gexp_last_chunk = ggml_cont(ctx0, get_slice_2d(ctx0, g_last_exp, chunk));
```