model : refactor QKV into common build_qkv and create_tensor_qkv helpers #21245
Merged
da129d5 to
26e72e0
Compare
CISC
reviewed
Apr 1, 2026
26e72e0 to
bcc69fd
Compare
Contributor
Author
|
hi @CISC,
|
CISC
reviewed
Apr 1, 2026
bcc69fd to
42eae08
Compare
Contributor
Author
|
@CISC Done:
|
CISC
reviewed
Apr 1, 2026
42eae08 to
75d759d
Compare
Contributor
Author
|
@CISC Done:
|
CISC
reviewed
Apr 2, 2026
CISC
left a comment
Member
OP is inaccurate, there's nothing special about these:
• nemotron-h: just add build_qkv in llm_build_nemotron_h::build_attention_layer
• granite-hybrid: just add build_qkv in llm_build_granite_hybrid::build_attention_layer
• olmo/mpt/dbrx: use build_qkv, add clamping
• gemma3n-iswa: just do build_qkv
• t5-dec/t5-enc: do build_qkv on normal self-attention
• bert: use build_qkv
• lfm2: do build_qkv in build_attn_block
Member
|
I meant move the clamping to |
09d8066 to
04506d4
Compare
CISC
reviewed
Apr 6, 2026
050b5a9 to
623ed29
Compare
Contributor
Author
|
@CISC Done:
|
CISC
reviewed
Apr 10, 2026
623ed29 to
ccd1f60
Compare
CISC
reviewed
Apr 10, 2026
ccd1f60 to
67a8492
Compare
CISC
reviewed
Apr 11, 2026
67a8492 to
d8bf733
Compare
Contributor
Author
|
@ngxson @am17an @ggerganov This PR is ready. Could you take a look when you have time? |
ggerganov
approved these changes
Apr 16, 2026
ggerganov
reviewed
Apr 16, 2026
…e-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s
d8bf733 to
51dbd8c
Compare
CISC
approved these changes
Apr 16, 2026
6 tasks
cnsiva
pushed a commit
to saas-home/llama.cpp
that referenced
this pull request
Apr 17, 2026
…ers (ggml-org#21245) * model : refactor QKV into common build_qkv and create_tensor_qkv helpers * model : extend build_qkv to bert/mpt/dbrx/olmo/lfm2/nemotron-h/granite-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s
samuraieng
pushed a commit
to samuraieng/llama.cpp
that referenced
this pull request
Apr 19, 2026
…ers (ggml-org#21245) * model : refactor QKV into common build_qkv and create_tensor_qkv helpers * model : extend build_qkv to bert/mpt/dbrx/olmo/lfm2/nemotron-h/granite-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s
jspadgett
pushed a commit
to jspadgett/llama.cpp
that referenced
this pull request
Apr 20, 2026
…ers (ggml-org#21245) * model : refactor QKV into common build_qkv and create_tensor_qkv helpers * model : extend build_qkv to bert/mpt/dbrx/olmo/lfm2/nemotron-h/granite-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s
mengqin
pushed a commit
to mengqin/llama.cpp
that referenced
this pull request
Apr 20, 2026
…ers (ggml-org#21245) * model : refactor QKV into common build_qkv and create_tensor_qkv helpers * model : extend build_qkv to bert/mpt/dbrx/olmo/lfm2/nemotron-h/granite-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s
ArberSephirotheca
pushed a commit
to ArberSephirotheca/llama.cpp
that referenced
this pull request
Apr 21, 2026
…ers (ggml-org#21245) * model : refactor QKV into common build_qkv and create_tensor_qkv helpers * model : extend build_qkv to bert/mpt/dbrx/olmo/lfm2/nemotron-h/granite-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s
arthw
pushed a commit
to arthw/llama.cpp
that referenced
this pull request
Apr 23, 2026
…ers (ggml-org#21245) * model : refactor QKV into common build_qkv and create_tensor_qkv helpers * model : extend build_qkv to bert/mpt/dbrx/olmo/lfm2/nemotron-h/granite-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s
rsenthilkumar6
pushed a commit
to rsenthilkumar6/llama.cpp
that referenced
this pull request
May 1, 2026
…ers (ggml-org#21245) * model : refactor QKV into common build_qkv and create_tensor_qkv helpers * model : extend build_qkv to bert/mpt/dbrx/olmo/lfm2/nemotron-h/granite-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s
ljubomirj
pushed a commit
to ljubomirj/llama.cpp
that referenced
this pull request
May 6, 2026
…ers (ggml-org#21245) * model : refactor QKV into common build_qkv and create_tensor_qkv helpers * model : extend build_qkv to bert/mpt/dbrx/olmo/lfm2/nemotron-h/granite-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s
JoursBleu
added a commit
to JoursBleu/llama.cpp
that referenced
this pull request
May 7, 2026
…F conversion

Adds an opt-in '--fuse-qkv' flag to convert_hf_to_gguf.py that concatenates separate Q / K / V weight tensors into a single fused attn_qkv tensor during HF -> GGUF conversion. Fusion happens in the shared base ModelBase.modify_tensors() sink that subclass overrides forward into, so per-layer Q/K/V tensors are buffered until all three are seen, then emitted as one fused tensor via torch.cat([q,k,v], dim=0).

At runtime the existing build_qkv helper introduced in ggml-org#21245 already handles the fused path with one matmul + ggml_view_3d split, so a GGUF produced with --fuse-qkv mostly reuses that path.

This branch also adds two small C++ correctness fixes for fused-weight + separate-bias models (qwen2 / phi2 / starcoder2 / stablelm and similar):
- llama-model.cpp: when wqkv is found but wqkv_b is absent, fall back to loading separate wq_b / wk_b / wv_b. --fuse-qkv only fuses weights, not biases.
- llama-graph.cpp: in build_qkv fused path, when wqkv_b is absent but per-head biases exist, concat wq_b + wk_b + wv_b with ggml_concat and add after the fused matmul. Without this fix biases were silently dropped, producing garbage output.

gguf-py/gguf/constants.py registers MODEL_TENSOR.ATTN_QKV on every arch that already declares ATTN_Q + ATTN_K + ATTN_V; without this the writer rejects the fused tensor at format_tensor_name().

Tested on 4x AMD R9700 (gfx1201, ROCm 7.2): 34 architectures match nofuse output bit-for-bit at Q8_0 and Q4_0; representative pp512 speedups: refact +22.7%, qwen2 +13.1%, granite +12.5%, seed-oss +12.3%, phi2 +6.3%, mistral3 +5.3%. test-llama-archs passes (0 FAIL).

This is PR 2 of the split discussed in ggml-org#20628; PR 1 (ggml-org#21245) is already merged.
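The buffering and fused-path behaviour described in this commit message can be pictured with a small, dependency-free Python sketch. It is only an illustration under stated assumptions: plain lists stand in for tensors, and `QKVFuser`, `matvec` and `fused_qkv` are hypothetical names, not functions from convert_hf_to_gguf.py or llama.cpp. It shows that concatenating Q/K/V rows, doing one matmul, adding the concatenated biases and splitting the result gives the same values as three separate projections.

```python
# Hypothetical sketch: plain lists stand in for tensors; none of these names
# come from convert_hf_to_gguf.py or llama.cpp.

def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


class QKVFuser:
    """Buffers per-layer Q/K/V weights and emits one fused matrix once all three arrive."""

    def __init__(self):
        self.pending = {}  # layer index -> {"q": rows, "k": rows, "v": rows}

    def add(self, layer, kind, rows):
        slot = self.pending.setdefault(layer, {})
        slot[kind] = rows
        if len(slot) < 3:
            return None  # still waiting for the remaining tensors
        del self.pending[layer]
        # Row-wise concatenation: the list analogue of torch.cat([q, k, v], dim=0).
        return slot["q"] + slot["k"] + slot["v"]


def fused_qkv(w_qkv, x, n_q, n_k, biases=None):
    """One matmul over the fused weight, optional (b_q, b_k, b_v) bias, then a split."""
    y = matvec(w_qkv, x)
    if biases is not None:
        b_q, b_k, b_v = biases
        # Concatenate the separate biases so they line up with the fused rows.
        y = [yi + bi for yi, bi in zip(y, b_q + b_k + b_v)]
    return y[:n_q], y[n_q:n_q + n_k], y[n_q + n_k:]


# Toy layer: Q has 2 output rows, K and V have 1 each.
wq, wk, wv = [[1, 0], [0, 1]], [[2, 1]], [[0, 3]]
bq, bk, bv = [1, 1], [10], [100]
x = [1, 2]

fuser = QKVFuser()
first = fuser.add(0, "k", wk)   # arrival order does not matter
second = fuser.add(0, "q", wq)
w_fused = fuser.add(0, "v", wv)  # third tensor triggers the fused emit

q, k, v = fused_qkv(w_fused, x, n_q=2, n_k=1, biases=(bq, bk, bv))
q_ref = [a + b for a, b in zip(matvec(wq, x), bq)]
k_ref = [a + b for a, b in zip(matvec(wk, x), bk)]
v_ref = [a + b for a, b in zip(matvec(wv, x), bv)]
print(q == q_ref, k == k_ref, v == v_ref)
```

If `biases` were left as `None` in the fused call while the reference path still added them, the outputs would differ, which matches the "silently dropped" failure mode the message mentions.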
my-other-github-account
pushed a commit
to my-other-github-account/llama.cpp
that referenced
this pull request
May 15, 2026
…ers (ggml-org#21245) * model : refactor QKV into common build_qkv and create_tensor_qkv helpers * model : extend build_qkv to bert/mpt/dbrx/olmo/lfm2/nemotron-h/granite-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s
8 tasks
zapabob
pushed a commit
to zapabob/llama.cpp
that referenced
this pull request
May 17, 2026
…ache

Wholesale sync of 384 upstream commits since merge-base 7fc1c4e (2026-04-22). Headline upstream feature: MTP / Multi-Token Prediction (ggml-org#22673) + spec-decoding stack (ggml-org#22838 parallel drafting, ggml-org#22227 spec-simple checkpoints, ggml-org#19493 server spec checkpointing, plus 5 spec bug-fixes).

11 conflicts resolved across CUDA fattn / Metal / Vulkan / common:

ggml/src/ggml-cuda/fattn-mma-f16.cuh
  RDNA config matrix: union TQ's (640, 512) entries with upstream's expanded (112..576) RDNA matrix. Took upstream's new sentinel fallback (no ampere fallback for RDNA).

ggml/src/ggml-cuda/fattn.cu
  - Extended hoisted ncols2_max to include 640 head-dim.
  - Volta: dropped TQ's local ncols2_max redefinition in favor of upstream's hoisted version (with 640 added).
  - WMMA gate: union exclusions (40, 72, 192, 512, 576, 640).
  - Preserved TQ's RDNA4 vector-kernel branch for TurboQuant cache types (renamed inner gqa_ratio_eff_rdna4 to avoid shadowing); took upstream's restructured MFMA/CDNA path verbatim.

ggml/src/ggml-cuda/ggml-cuda.cu
  Supported-op switch: union TQ's GGML_OP_TURBO_WHT case with upstream's GGML_OP_ADD/SUB/MUL/DIV FP16 cases.

ggml/src/ggml-metal/ggml-metal-device.h
  Kept TQ's get_pipeline_turbo_wht declaration; took upstream's new get_pipeline_mul_mv_ext(lib, const ggml_tensor * op, ...) signature (replaces split tsrc0/tsrc1 args).

ggml/src/ggml-metal/ggml-metal-device.cpp
  Kept TQ's get_pipeline_turbo_wht implementation; took upstream's new get_pipeline_mul_mv_ext signature; body already uses op-> for tsrc0/tsrc1/ne12/r2/r3.

ggml/src/ggml-metal/ggml-metal-ops.cpp
  Preserved TQ's is_tq_weight rotate→matmul→unrotate path with original hardcoded dispatch shape. Updated non-TQ fallback to upstream's pipeline-param dispatch (pipeline.nr0 / nr1 / nsg + (ne11+nr1-1)/nr1 shape).

ggml/src/ggml-vulkan/* (3 files)
  Upstream-wholesale via `git checkout --theirs`. Upstream architecturally refactored FA from compile-time DATA_A_* variants to runtime FaTypeK/FaTypeV spec-constant switches. TQ's TURBO3_0 GLSL path is DEFERRED: Vulkan TURBO3_0 support needs re-implementation against the new architecture in a follow-up PR. Mac mini + M5 Max have no Vulkan; no in-house validation path for an immediate re-adaptation.

common/arg.cpp
  --spec-default: took upstream's new struct shape (params.speculative.types vector + params.speculative.ngram_mod.{n_match,n_min,n_max}).

common/speculative.cpp
  Low-acceptance reset: took upstream's sinfo.n_low / sinfo.i_last (variables moved into sinfo struct).

NOT-CONFLICTED upstream additions that touch TQ-adjacent surface (auto-merged clean, but worth eyes during review):
  - src/llama-memory-recurrent.{cpp,h} (MTP rollback API)
  - src/llama-memory-hybrid.{cpp,h} (recall feedback_llama_memory_types + feedback_layer0_hybrid_trap)
  - src/llama-graph.cpp, src/llama-kv-cache.cpp, src/llama-context.cpp
  - tools/server/server-context.cpp (+~1100 lines: MTP + parallel drafting + spec checkpointing)
  - src/models/qwen35*.cpp, qwen3next.cpp, delta-net-base.cpp (entirely new in upstream: MTP draft-head integration)

Known-deferred follow-ups:
  1. Vulkan TURBO3_0 re-implementation against runtime spec-constant FA arch
  2. PR ggml-org#21245 QKV refactor helpers: landed; TQ models not migrated to use them. Migrate in a focused follow-up; do not bundle here.

Validation gate (pending): M2 mini PPL/decode comparison @ Qwen2.5-7B-Q8_0 K=q8_0/V=turbo4 asymmetric ctx 2048 + 16384; see PR body.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fewtarius
pushed a commit
to fewtarius/llama.cpp
that referenced
this pull request
May 30, 2026
…ers (ggml-org#21245) * model : refactor QKV into common build_qkv and create_tensor_qkv helpers * model : extend build_qkv to bert/mpt/dbrx/olmo/lfm2/nemotron-h/granite-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s
Overview
llama.cpp currently has 112 model files in src/models/. This PR modifies the 85 applicable ones, abstracting the duplicated Q/K/V tensor loading and graph-building code into two reusable helpers, following the create_tensor_gate_up_exps pattern (#19139).
• create_tensor_qkv (llama-model.cpp): tries fused wqkv/bqkv first (TENSOR_NOT_REQUIRED | TENSOR_SKIP_IF_VIRTUAL), falls back to separate wq/wk/wv. Supports adding biases.
• build_qkv (llama-graph.h/cpp): returns {Qcur, Kcur, Vcur} as 3D tensors. Fused case: single fused qkv matmul + ggml_view_3d split. Separate case: 3 separate matmuls + ggml_reshape_3d.
Test: test-llama-archs passes (all OK, 0 FAIL). Zero diff on llama-arch.cpp.
The remaining 27 models are not modified for the following reasons:
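As an aside on the helpers named in this overview, the lookup order described for create_tensor_qkv (fused first, separate as fallback) can be pictured with a minimal Python sketch. This is an illustration under assumptions, not the llama.cpp code: a dict stands in for the model file, `create_tensor_qkv_sketch` is a hypothetical name, the `blk.N.attn_*` keys are used only as example tensor names, and treating the separate biases as optional is an assumption of the sketch.

```python
# Illustrative sketch only: a dict stands in for the model file, and the
# function and key names are assumptions, not the llama.cpp implementation.

def create_tensor_qkv_sketch(tensors, layer):
    """Prefer the fused QKV tensor; otherwise fall back to separate Q/K/V."""
    prefix = f"blk.{layer}."

    # Fused tensors are optional, so a missing entry is not an error here.
    wqkv = tensors.get(prefix + "attn_qkv.weight")
    if wqkv is not None:
        return {"wqkv": wqkv, "bqkv": tensors.get(prefix + "attn_qkv.bias")}

    # Fallback: separate weights are required (KeyError if absent),
    # while their biases stay optional in this sketch.
    loaded = {}
    for short, name in (("q", "attn_q"), ("k", "attn_k"), ("v", "attn_v")):
        loaded["w" + short] = tensors[prefix + name + ".weight"]
        loaded["b" + short] = tensors.get(prefix + name + ".bias")
    return loaded


fused_file = {"blk.0.attn_qkv.weight": "W_qkv"}
split_file = {
    "blk.0.attn_q.weight": "W_q",
    "blk.0.attn_k.weight": "W_k",
    "blk.0.attn_v.weight": "W_v",
    "blk.0.attn_q.bias": "b_q",
}
print(create_tensor_qkv_sketch(fused_file, 0))
print(create_tensor_qkv_sketch(split_file, 0))
```

Keeping the fallback inside one helper is what lets each model file call a single function regardless of how the GGUF stores its attention weights.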
Additional information
Based on the discussion in #20628 (@am17an, @ngxson), the plan is:
the two functions above, and adds handling for the fused qkv case.
--fuse-qkv to convert_hf_to_gguf.py.
Requirements