Feature Request: Fused GGML_CONCAT #19432

@ngxson

Ref: #18725 (comment)

It seems that on some backends, repeated calls to ggml_concat can be a bottleneck. I suspect that's because each call allocates a new (bigger) tensor and copies the existing data into it. So chaining multiple concats, e.g. concat(a, concat(b, ...)), re-copies the earlier data at every step and becomes slow.

So I'm wondering if multiple concats can be fused the same way multiple adds can be.
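
To make the suspected cost concrete, here is a minimal sketch in plain C++. It is not ggml code: `ConcatResult`, `concat_pairwise` and `concat_fused` are hypothetical names, and `std::vector` stands in for tensors. It only models the copy pattern described above, assuming every pairwise concat allocates a fresh buffer and copies both inputs, while a fused concat allocates the output once and copies each chunk once.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper type: the concatenated data plus the number of
// elements that had to be copied to build it.
struct ConcatResult {
    std::vector<float> data;
    std::size_t elements_copied;
};

// Models chained pairwise concats: each step allocates a new, bigger buffer
// and copies the accumulated result plus the next chunk into it.
ConcatResult concat_pairwise(const std::vector<std::vector<float>> & chunks) {
    ConcatResult r{{}, 0};
    if (chunks.empty()) {
        return r;
    }
    r.data = chunks[0]; // starting operand, not counted as a concat copy
    for (std::size_t i = 1; i < chunks.size(); ++i) {
        std::vector<float> next;
        next.reserve(r.data.size() + chunks[i].size());
        next.insert(next.end(), r.data.begin(), r.data.end());
        next.insert(next.end(), chunks[i].begin(), chunks[i].end());
        r.elements_copied += next.size(); // everything is copied again
        r.data = std::move(next);
    }
    return r;
}

// Models a fused concat: one allocation of the final size, then each chunk
// is copied exactly once into its slot.
ConcatResult concat_fused(const std::vector<std::vector<float>> & chunks) {
    std::size_t total = 0;
    for (const auto & c : chunks) {
        total += c.size();
    }
    ConcatResult r{{}, 0};
    r.data.reserve(total);
    for (const auto & c : chunks) {
        r.data.insert(r.data.end(), c.begin(), c.end());
        r.elements_copied += c.size();
    }
    return r;
}
```

Under these assumptions, the pairwise version copies a number of elements that grows quadratically with the number of equally sized chunks, while the fused version stays linear. Whether a given backend actually behaves this way would need to be measured.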

This will be necessary for qwen3next because chunks are concatenated on each loop iteration. So avoiding the copy/reallocation in ggml_concat could improve overall performance on bigger batches (?)

ggml_tensor * gexp_last_chunk = ggml_cont(ctx0, get_slice_2d(ctx0, g_last_exp, chunk));
