
vulkan: f16 mixed-precision state for GATED_DELTA_NET#20376

Draft
ProgenyAlpha wants to merge 1 commit into ggml-org:master from ProgenyAlpha:vulkan-gdn-f16

Conversation

@ProgenyAlpha (Contributor)

Follow-up to #20334. Splits out the f16 mixed-precision state optimization into its own PR per @0cc4m's feedback.

Stores the 128-element state array in float16_t while keeping all arithmetic in float32. No precision loss observed (13/13 backend-ops tests passing). The lower register pressure gives a measurable PP boost.

Depends on #20334

Benchmarks on my 890M (Qwen3-Coder-Next REAM, Q4_K_M):

| Metric | Without f16 | With f16 | Change |
|--------|-------------|----------|--------|
| PP-512 | 165.31 t/s  | 174.54 t/s | +5.6% |
| TG-128 | 21.16 t/s   | 21.48 t/s  | +1.5% |

The f16 pipeline is auto-selected when the device supports shaderFloat16, and falls back to f32 otherwise.
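The storage/compute split described above can be illustrated with a NumPy sketch (a simplified placeholder update rule, not the actual GLSL shader math — only the f16-storage/f32-arithmetic pattern matches the PR):

```python
import numpy as np

rng = np.random.default_rng(0)
S = 128  # matches the 128-element state array

# State is *stored* in float16 (halving register pressure),
# but every arithmetic step widens to float32 first.
state_f16 = np.zeros(S, dtype=np.float16)

def step(state_f16, k, v, decay):
    s = state_f16.astype(np.float32)   # widen for arithmetic
    s = decay * s + np.float32(v) * k  # simplified delta-rule-style update
    return s.astype(np.float16)        # narrow only for storage

# Reference: identical updates with the state kept in float32 throughout.
state_f32 = np.zeros(S, dtype=np.float32)
for _ in range(512):
    k = (rng.standard_normal(S) * 0.05).astype(np.float32)
    v, decay = np.float32(0.1), np.float32(0.95)
    state_f16 = step(state_f16, k, v, decay)
    state_f32 = decay * state_f32 + v * k

max_err = np.max(np.abs(state_f16.astype(np.float32) - state_f32))
print(f"max abs deviation after 512 steps: {max_err:.2e}")
```

With a decaying state the per-step rounding error stays bounded rather than compounding, which is the intuition behind keeping only the storage in f16.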

github-actions bot added labels: testing (Everything test related), Vulkan (Issues specific to the Vulkan backend), ggml (changes relating to the ggml tensor library for machine learning) — Mar 11, 2026
@0cc4m (Contributor) commented Mar 12, 2026

Please rebase and resolve the conflicts.

@ProgenyAlpha (Contributor, Author)

> Please rebase and resolve the conflicts.

Working on this now.

@ggerganov (Member)

Btw, I'm not super confident that this cast is safe in terms of quality. Since this is a recurrent state, even small deviations can accumulate to large errors with time.

@0cc4m (Contributor) commented Mar 12, 2026

> Btw, I'm not super confident that this cast is safe in terms of quality. Since this is a recurrent state, even small deviations can accumulate to large errors with time.

Do you know if the arithmetic could be moved to fp16 if the state stays in fp32? Is there a good way to find out what is safe and what isn't?
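One cheap way to probe this question offline (a hypothetical NumPy harness, not part of the PR, with a simplified placeholder update rule) is to run both mixed-precision placements against an f32 reference over a few thousand simulated tokens and watch the deviation:

```python
import numpy as np

rng = np.random.default_rng(42)
S, T = 128, 4096  # state size and number of simulated tokens

state_ref = np.zeros(S, dtype=np.float32)  # full f32 reference
state_h   = np.zeros(S, dtype=np.float16)  # f16 state, f32 arithmetic
state_a   = np.zeros(S, dtype=np.float32)  # f32 state, f16 arithmetic

for _ in range(T):
    k = (rng.standard_normal(S) * 0.05).astype(np.float32)
    decay, v = np.float32(0.99), np.float32(0.1)

    state_ref = decay * state_ref + v * k

    # Placement 1: only the *storage* is narrowed to f16 each step.
    s = state_h.astype(np.float32)
    state_h = (decay * s + v * k).astype(np.float16)

    # Placement 2: the state stays f32, but the update is *computed* in f16.
    upd = (np.float16(decay) * state_a.astype(np.float16)
           + np.float16(v) * k.astype(np.float16))
    state_a = upd.astype(np.float32)

err_h = np.max(np.abs(state_h.astype(np.float32) - state_ref))
err_a = np.max(np.abs(state_a - state_ref))
print(f"f16-state drift:      {err_h:.2e}")
print(f"f16-arithmetic drift: {err_a:.2e}")
```

A toy model like this cannot prove the real kernel is safe, but it is a fast way to compare placements before paying for full perplexity runs; note that f16 arithmetic also rounds the decay constant itself (f16(0.99) ≈ 0.9902), which introduces a systematic bias rather than just noise.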

@ProgenyAlpha (Contributor, Author)

> Btw, I'm not super confident that this cast is safe in terms of quality. Since this is a recurrent state, even small deviations can accumulate to large errors with time.
>
> Do you know if the arithmetic could be moved to fp16 if the state stays in fp32? Is there a good way to find out what is safe and what isn't?

This was more of a first pass to see if the f16 state approach was worth pursuing at all (the register pressure reduction did give measurable PP gains on my 890M). But given that CUDA and Metal both keep state in f32 for this op, I wanted to do more homework anyway before committing to this path.

Next steps from my end:

  • Run perplexity comparisons at longer context lengths (2048+) to see if the drift is actually measurable in practice
  • Look into the inverse approach you suggested since those intermediate values won't accumulate across tokens

I can also close out the PR if you'd prefer.

@ggerganov (Member) commented Mar 12, 2026

To reduce the register pressure, implement the sharded approach as demonstrated in #20391 and #20361.

> Do you know if the arithmetic could be moved to fp16 if the state stays in fp32? Is there a good way to find out what is safe and what isn't?

@0cc4m I'm not sure - don't have a good intuition about the recurrent state yet.

The change could be fine - I'm just not really sure.

@ProgenyAlpha ProgenyAlpha marked this pull request as draft March 12, 2026 16:29
@ProgenyAlpha (Contributor, Author)

> To reduce the register pressure, implement the sharded approach as demonstrated in #20391 and #20361.
>
> @0cc4m I'm not sure - don't have a good intuition about the recurrent state yet.
>
> The change could be fine - I'm just not really sure.

I'll look into the sharded approach from #20391 for Vulkan. I had sharding on my todo list already, but held off on opening another PR due to the new policy. Will do some benchmarks and validity testing first, so I'm not wasting your time. May end up closing this one out depending on results.

@jeffbolznv (Collaborator)

My hunch is that spreading the values across more invocations and/or shared memory will be better. The "shape" of the algorithm is similar enough to ssm_scan that it seems like the same techniques should work.
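The sharded layout can be modeled roughly in NumPy (a sketch under assumed sizes — 128 state columns spread over a wave64 subgroup, 2 columns per lane, with a plain sum standing in for the subgroupAdd() reduction — not the actual GLSL):

```python
import numpy as np

S, LANES = 128, 64
COLS_PER_LANE = S // LANES  # 2 "registers" per lane instead of 128 per thread

rng = np.random.default_rng(7)
state = rng.standard_normal(S).astype(np.float32)
q = rng.standard_normal(S).astype(np.float32)

# Each lane holds a 2-column shard of the state and the query.
shards = state.reshape(LANES, COLS_PER_LANE)
q_shards = q.reshape(LANES, COLS_PER_LANE)

# Per-lane partial dot products, then a subgroupAdd()-style reduction
# across lanes (here just a sum over the lane axis).
partials = np.sum(shards * q_shards, axis=1)
dot_sharded = np.sum(partials)

assert np.allclose(dot_sharded, np.dot(state, q), rtol=1e-5)
print("sharded reduction matches full dot product")
```

The point of the layout is that register pressure per thread drops from O(S) to O(S / subgroup_size) while the reduction recovers the same result, which is the same shape of trick the ssm_scan shaders use.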

ProgenyAlpha added a commit to ProgenyAlpha/llama.cpp that referenced this pull request Mar 12, 2026
Add subgroup-sharded GATED_DELTA_NET kernel that distributes state
columns across subgroup lanes (2 regs/lane on wave64 vs 128 regs/thread).
Uses subgroupAdd() for reductions with shared memory fallback.

Also add f16 arithmetic variant (f32 state, f16 dot products) for
precision comparison testing against f16 state variant.

Four GDN pipeline variants now available for benchmarking:
- f32 baseline (existing)
- f16 state / f32 arithmetic (existing, PR ggml-org#20376)
- f32 state / f16 arithmetic (new, flip variant)
- sharded f32 (new, preferred when S_V >= subgroup_size)
