Ref: #18725 (comment)
It seems that on some backends, repeated calls to ggml_concat can become a bottleneck. I suspect that's because each call allocates a new (bigger) tensor and copies the data over, so nesting multiple concats, concat(a, concat(b, ...)), makes it slow.
So I'm wondering whether multiple concat ops can be fused the same way multiple add ops can be.
This will be necessary for qwen3next, because chunks are concatenated on each loop iteration. So avoiding the copy/reallocation in ggml_concat could improve overall performance at larger batch sizes (?)
llama.cpp/src/models/qwen3next.cpp (line 336 in e06088d):

```cpp
ggml_tensor * gexp_last_chunk = ggml_cont(ctx0, get_slice_2d(ctx0, g_last_exp, chunk));
```