graph : add optional scale parameter to build_lora_mm #20427

Merged
CISC merged 1 commit into ggml-org:master on Mar 11, 2026
Conversation
ggerganov approved these changes on Mar 11, 2026
Comment on lines 768 to +771:
```diff
     ggml_tensor * build_lora_mm(
             ggml_tensor * w,
-            ggml_tensor * cur) const;
+            ggml_tensor * cur,
+            ggml_tensor * w_s = nullptr) const;
```
For a follow-up PR, we should move `cur` to the front to make the interface more consistent:

```cpp
ggml_tensor * build_lora_mm(
        ggml_tensor * cur,
        ggml_tensor * w,
        ggml_tensor * w_s = nullptr) const;
```

It will touch a lot of lines though.
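For illustration, a hypothetical call site under the current vs. proposed ordering (the layer and weight names below are placeholders, not taken from this PR):

```cpp
// hypothetical call site -- weight/layer names are illustrative only
cur = build_lora_mm(model.layers[il].wq, cur); // current: weight first
cur = build_lora_mm(cur, model.layers[il].wq); // proposed: activation first
```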
CISC approved these changes on Mar 11, 2026
CISC left a comment:

Concur with @ggerganov on ordering.
ProgenyAlpha pushed a commit to ProgenyAlpha/llama.cpp that referenced this pull request on Mar 12, 2026
tekintian added a commit to tekintian/llama.cpp that referenced this pull request on Mar 12, 2026:
```
* 'master' of github.com:ggml-org/llama.cpp: (33 commits)
  convert : better mtp check and fix return [no ci] (ggml-org#20419)
  vulkan: fix SSM_CONV PP scaling with large ubatch sizes (ggml-org#20379)
  New conversations now auto-select the first loaded model (ggml-org#20403)
  ggml-virtgpu: Fix some build commands (ggml-org#20341)
  metal : avoid divisions in bin kernel (ggml-org#20426)
  ci: Setup self-hosted CI for Intel Linux Vulkan backend (ggml-org#20154)
  vulkan: fix l2_norm epsilon handling (ggml-org#20350)
  vulkan: fix OOB check in flash_attn_mask_opt (ggml-org#20296)
  vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large models with --no-mmap (ggml-org#20059)
  opencl: use larger workgroup size for get_rows (ggml-org#20316)
  opencl: add cumsum op (ggml-org#18981)
  hip: compile debug builds with -O2 on hip to avoid a compiler bug (ggml-org#20392)
  common/parser: add GigaChatV3/3.1 models support (ggml-org#19931)
  model : add support for Phi4ForCausalLMV (ggml-org#20168)
  graph : add optional scale parameter to build_lora_mm [no ci] (ggml-org#20427)
  common : fix --n-cpu-moe, --cpu-moe for models with fused gate + up (ggml-org#20416)
  ggml-webgpu: Add supports for `GGML_OP_REPEAT` (ggml-org#20230)
  llama : enable chunked fused GDN path (ggml-org#20340)
  llama : whitespace cleanup (ggml-org#20422)
  ggml : add NVFP4 quantization type support (ggml-org#19769)
  ...
```
@ggerganov As discussed in #19769, this adds an optional `w_s` parameter to `build_lora_mm()` for applying a multiplicative scale after the matmul. It cleans up the pattern used by bitnet and NVFP4 models, where a per-tensor scale is applied after each weight multiplication.
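For context, a minimal sketch of how the new parameter might compose in ggml. This is not the actual implementation: the LoRA-adapter application inside the real `build_lora_mm()` is elided, and `ctx0` is assumed to be the graph builder's ggml context; `ggml_mul_mat` and `ggml_mul` are standard ggml calls.

```cpp
// Sketch only -- the real build_lora_mm() also applies any active LoRA
// adapters between the matmul and the return; that part is omitted here.
ggml_tensor * build_lora_mm(
        ggml_tensor * w,
        ggml_tensor * cur,
        ggml_tensor * w_s = nullptr) const {
    ggml_tensor * res = ggml_mul_mat(ctx0, w, cur); // base weight matmul
    // ... LoRA adapter contributions would be added to res here ...
    if (w_s) {
        // optional per-tensor scale applied after the matmul -- the
        // pattern used by bitnet and NVFP4 models
        res = ggml_mul(ctx0, res, w_s);
    }
    return res;
}
```

Centralizing the scale here means each caller passes its scale tensor once instead of repeating a `ggml_mul` after every weight multiplication.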