
vulkan: f16 mixed-precision state for GATED_DELTA_NET#20376

Draft
ProgenyAlpha wants to merge 1 commit into ggml-org:master from ProgenyAlpha:vulkan-gdn-f16

Conversation

@ProgenyAlpha (Contributor)

Follow-up to #20334. Splits out the f16 mixed-precision state optimization into its own PR per @0cc4m's feedback.

Stores the 128-element state array in float16_t while keeping all arithmetic in float32. No precision loss observed (13/13 backend-ops tests passing). The lower register pressure gives a measurable PP boost.

Depends on #20334

Benchmarks on my 890M (Qwen3-Coder-Next REAM, Q4_K_M):

| Metric | Without f16 | With f16 | Change |
|--------|-------------|----------|--------|
| PP-512 | 165.31 t/s  | 174.54 t/s | +5.6% |
| TG-128 | 21.16 t/s   | 21.48 t/s  | +1.5% |

The f16 pipeline is auto-selected when the device supports shaderFloat16, and falls back to f32 otherwise.
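The storage/compute split described above can be illustrated with a NumPy sketch (a simplified placeholder update rule, not the actual GLSL shader math — only the f16-storage/f32-arithmetic pattern matches the PR):

```python
import numpy as np

rng = np.random.default_rng(0)
S = 128  # matches the 128-element state array

# State is *stored* in float16 (halving register pressure),
# but every arithmetic step widens to float32 first.
state_f16 = np.zeros(S, dtype=np.float16)

def step(state_f16, k, v, decay):
    s = state_f16.astype(np.float32)   # widen for arithmetic
    s = decay * s + np.float32(v) * k  # simplified delta-rule-style update
    return s.astype(np.float16)        # narrow only for storage

# Reference: identical updates with the state kept in float32 throughout.
state_f32 = np.zeros(S, dtype=np.float32)
for _ in range(512):
    k = (rng.standard_normal(S) * 0.05).astype(np.float32)
    v, decay = np.float32(0.1), np.float32(0.95)
    state_f16 = step(state_f16, k, v, decay)
    state_f32 = decay * state_f32 + v * k

max_err = np.max(np.abs(state_f16.astype(np.float32) - state_f32))
print(f"max abs deviation after 512 steps: {max_err:.2e}")
```

With a decaying state the per-step rounding error stays bounded rather than compounding, which is the intuition behind keeping only the storage in f16.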

github-actions bot added labels: testing (Everything test related), Vulkan (Issues specific to the Vulkan backend), ggml (changes relating to the ggml tensor library for machine learning) — Mar 11, 2026
@0cc4m (Contributor) commented Mar 12, 2026

Please rebase and resolve the conflicts.

@ProgenyAlpha (Contributor, Author)

> Please rebase and resolve the conflicts.

Working on this now.

@ggerganov (Member)

Btw, I'm not super confident that this cast is safe in terms of quality. Since this is a recurrent state, even small deviations can accumulate to large errors with time.

@0cc4m (Contributor) commented Mar 12, 2026

> Btw, I'm not super confident that this cast is safe in terms of quality. Since this is a recurrent state, even small deviations can accumulate to large errors with time.

Do you know if the arithmetic could be moved to fp16 if the state stays in fp32? Is there a good way to find out what is safe and what isn't?
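One cheap way to probe this question offline (a hypothetical NumPy harness, not part of the PR, with a simplified placeholder update rule) is to run both mixed-precision placements against an f32 reference over a few thousand simulated tokens and watch the deviation:

```python
import numpy as np

rng = np.random.default_rng(42)
S, T = 128, 4096  # state size and number of simulated tokens

state_ref = np.zeros(S, dtype=np.float32)  # full f32 reference
state_h   = np.zeros(S, dtype=np.float16)  # f16 state, f32 arithmetic
state_a   = np.zeros(S, dtype=np.float32)  # f32 state, f16 arithmetic

for _ in range(T):
    k = (rng.standard_normal(S) * 0.05).astype(np.float32)
    decay, v = np.float32(0.99), np.float32(0.1)

    state_ref = decay * state_ref + v * k

    # Placement 1: only the *storage* is narrowed to f16 each step.
    s = state_h.astype(np.float32)
    state_h = (decay * s + v * k).astype(np.float16)

    # Placement 2: the state stays f32, but the update is *computed* in f16.
    upd = (np.float16(decay) * state_a.astype(np.float16)
           + np.float16(v) * k.astype(np.float16))
    state_a = upd.astype(np.float32)

err_h = np.max(np.abs(state_h.astype(np.float32) - state_ref))
err_a = np.max(np.abs(state_a - state_ref))
print(f"f16-state drift:      {err_h:.2e}")
print(f"f16-arithmetic drift: {err_a:.2e}")
```

A toy model like this cannot prove the real kernel is safe, but it is a fast way to compare placements before paying for full perplexity runs; note that f16 arithmetic also rounds the decay constant itself (f16(0.99) ≈ 0.9902), which introduces a systematic bias rather than just noise.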

@ProgenyAlpha (Contributor, Author)

> Btw, I'm not super confident that this cast is safe in terms of quality. Since this is a recurrent state, even small deviations can accumulate to large errors with time.
>
> Do you know if the arithmetic could be moved to fp16 if the state stays in fp32? Is there a good way to find out what is safe and what isn't?

This was more of a first pass to see if the f16 state approach was worth pursuing at all (the register pressure reduction did give measurable PP gains on my 890M). But given that CUDA and Metal both keep state in f32 for this op, I wanted to do more homework anyway before committing to this path.

Next steps from my end:

  • Run perplexity comparisons at longer context lengths (2048+) to see if the drift is actually measurable in practice
  • Look into the inverse approach you suggested since those intermediate values won't accumulate across tokens

I can also close out the PR if you'd prefer.

@ggerganov (Member) commented Mar 12, 2026

To reduce the register pressure, implement the sharded approach as demonstrated in #20391 and #20361.

> Do you know if the arithmetic could be moved to fp16 if the state stays in fp32? Is there a good way to find out what is safe and what isn't?

@0cc4m I'm not sure - don't have a good intuition about the recurrent state yet.

The change could be fine - I'm just not really sure.

@ProgenyAlpha ProgenyAlpha marked this pull request as draft March 12, 2026 16:29
@ProgenyAlpha (Contributor, Author)

> To reduce the register pressure, implement the sharded approach as demonstrated in #20391 and #20361.
>
> @0cc4m I'm not sure - don't have a good intuition about the recurrent state yet.
>
> The change could be fine - I'm just not really sure.

I'll look into the sharded approach from #20391 for Vulkan. I had sharding on my todo list already, but held off on opening another PR due to the new policy. Will do some benchmarks and validity testing first, so I'm not wasting your time. May end up closing this one out depending on results.

@jeffbolznv (Collaborator)

My hunch is that spreading the values across more invocations and/or shared memory will be better. The "shape" of the algorithm is similar enough to ssm_scan that it seems like the same techniques should work.
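The sharded layout can be modeled roughly in NumPy (a sketch under assumed sizes — 128 state columns spread over a wave64 subgroup, 2 columns per lane, with a plain sum standing in for the subgroupAdd() reduction — not the actual GLSL):

```python
import numpy as np

S, LANES = 128, 64
COLS_PER_LANE = S // LANES  # 2 "registers" per lane instead of 128 per thread

rng = np.random.default_rng(7)
state = rng.standard_normal(S).astype(np.float32)
q = rng.standard_normal(S).astype(np.float32)

# Each lane holds a 2-column shard of the state and the query.
shards = state.reshape(LANES, COLS_PER_LANE)
q_shards = q.reshape(LANES, COLS_PER_LANE)

# Per-lane partial dot products, then a subgroupAdd()-style reduction
# across lanes (here just a sum over the lane axis).
partials = np.sum(shards * q_shards, axis=1)
dot_sharded = np.sum(partials)

assert np.allclose(dot_sharded, np.dot(state, q), rtol=1e-5)
print("sharded reduction matches full dot product")
```

The point of the layout is that register pressure per thread drops from O(S) to O(S / subgroup_size) while the reduction recovers the same result, which is the same shape of trick the ssm_scan shaders use.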

ProgenyAlpha added a commit to ProgenyAlpha/llama.cpp that referenced this pull request Mar 12, 2026
Add subgroup-sharded GATED_DELTA_NET kernel that distributes state
columns across subgroup lanes (2 regs/lane on wave64 vs 128 regs/thread).
Uses subgroupAdd() for reductions with shared memory fallback.

Also add f16 arithmetic variant (f32 state, f16 dot products) for
precision comparison testing against f16 state variant.

Four GDN pipeline variants now available for benchmarking:
- f32 baseline (existing)
- f16 state / f32 arithmetic (existing, PR ggml-org#20376)
- f32 state / f16 arithmetic (new, flip variant)
- sharded f32 (new, preferred when S_V >= subgroup_size)
