
ggml: MXFP flash attention with SoA layout (CPU scalar reference) #20609

Draft

timothyeburke wants to merge 16 commits into ggml-org:master from timlikesai:mxfp-flash-attention

Conversation

@timothyeburke

@timothyeburke timothyeburke commented Mar 15, 2026

Hi llama.cpp maintainers and community!

I'm Tim, the new enterprise AI enablement engineer at The New York Times (not a journalist). A few weeks ago a blizzard snowed out a show I had tickets to (at least I got this video), and to pass the time I decided to do a deep dive into how Flash Attention works so I could teach others.

I decided to try implementing MXFP4 flash attention for my CUDA GPUs (a pair of 5070 Ti cards). To improve perplexity I added a few research methods for recovering loss in low-bit floating-point weight quantization, and I realized that most of the plumbing for MXFP6 and MXFP8 was already there too, which gave us lots of knobs to play with.

I have CUDA (timlikesai#2) and Metal implementations ready as fast follows for this PR.


MXFP Flash Attention — CPU Scalar Reference + SIMD

Add MXFP KV cache quantization for flash attention with Struct-of-Arrays (SoA) memory layout. Three MX format families: MXFP4 (E2M1), MXFP8 (E4M3), MXFP6 (E2M3), per the OCP Microscaling v1.0 spec.

SoA Layout

[qs contiguous][e8m0 contiguous] per row, enabling aligned memory access for all backends. The full flash attention pipeline — set_rows quantization, Q preprocessing, K/V dequantization — uses SoA end-to-end. Existing AoS block layout for MUL_MAT weight quantization is untouched.
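To make the layout concrete, here is a minimal sketch of the SoA row addressing described above. The function and constant names are illustrative, not the PR's actual identifiers; only the layout rule ([qs contiguous][e8m0 contiguous], one E8M0 scale byte per 32-element block) comes from the description.

```c
#include <assert.h>
#include <stddef.h>

/* One SoA row of n elements: all packed element bits first, then one
 * E8M0 scale byte per 32-element block. Names are illustrative. */
enum { QK_MX = 32 };  /* block size shared by all MX formats */

static size_t mx_qs_bytes(int bits_per_elem, int n) {
    return (size_t)n * bits_per_elem / 8;   /* 4, 6, or 8 bits per element */
}

/* Byte offset of the contiguous e8m0 scale region within the row. */
static size_t mx_e8m0_offset(int bits_per_elem, int n) {
    return mx_qs_bytes(bits_per_elem, n);
}

static size_t mx_row_bytes(int bits_per_elem, int n) {
    return mx_qs_bytes(bits_per_elem, n) + (size_t)(n / QK_MX);
}
```

For a single 32-element block this reproduces the per-block sizes quoted later in the PR: 33 bytes for MXFP8 (32 + 1 scale) and 25 bytes for MXFP6 (24 packed + 1 scale).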

Shared Constructs (ggml-common.h)

  • mxfp_dequant_traits_t — single source of truth for IEEE-754 bit reconstruction parameters across all backends (CPU, CUDA, Metal, Vulkan)
  • Static const trait instances for all 4 formats: E4M3, E5M2, E2M3, E3M2
  • Portable element converters, FP6 pack/unpack, E8M0 scale computation, Walsh-Hadamard rotation
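As a rough illustration of the FP6 tight packing mentioned above (four 6-bit codes in 3 bytes, a 25% saving over byte-per-element storage), here is a little-endian bitstream sketch. The exact bit order is an assumption for illustration; the PR's canonical helper is `ggml_mxfp_unpack_fp6x4` in ggml-common.h.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four 6-bit values into a 24-bit little-endian bitstream (3 bytes).
 * Bit order is illustrative, not necessarily the PR's actual layout. */
static void fp6_pack4(const uint8_t v[4], uint8_t out[3]) {
    uint32_t bits = (uint32_t)(v[0] & 0x3F)
                  | (uint32_t)(v[1] & 0x3F) << 6
                  | (uint32_t)(v[2] & 0x3F) << 12
                  | (uint32_t)(v[3] & 0x3F) << 18;
    out[0] =  bits        & 0xFF;
    out[1] = (bits >> 8)  & 0xFF;
    out[2] = (bits >> 16) & 0xFF;
}

static void fp6_unpack4(const uint8_t in[3], uint8_t v[4]) {
    uint32_t bits = (uint32_t)in[0] | (uint32_t)in[1] << 8 | (uint32_t)in[2] << 16;
    v[0] =  bits        & 0x3F;
    v[1] = (bits >> 6)  & 0x3F;
    v[2] = (bits >> 12) & 0x3F;
    v[3] = (bits >> 18) & 0x3F;
}
```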

CPU Implementation

  • Scalar reference in ops.cpp: per-head SoA dequant, hoisted loop invariants, stack-allocated buffers
  • x86 AVX2 and ARM NEON SIMD: shared dequant helpers and FP6 unpack, traits-driven dispatch
  • MXFP4 uses LUT-based dequant (shuffle); MXFP6/MXFP8 use IEEE bit reconstruction via shared helpers
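For readers unfamiliar with the bit reconstruction route, here is a scalar sketch of FP8 E4M3 decoding per the OCP MX element format (sign, 4-bit exponent with bias 7, 3-bit mantissa; no infinities, NaN only at S.1111.111). The function name is illustrative; the SIMD paths in the PR perform the same reconstruction with vector bit operations.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Decode one FP8 E4M3 byte to float via IEEE-style bit reconstruction. */
static float fp8_e4m3_to_float(uint8_t b) {
    int sign = (b >> 7) & 1;
    int exp  = (b >> 3) & 0xF;
    int mant =  b       & 0x7;
    float v;
    if (exp == 0xF && mant == 0x7) return NAN;        /* the only NaN encoding */
    if (exp == 0) v = ldexpf((float)mant / 8.0f, -6); /* subnormal: mant/8 * 2^-6 */
    else          v = ldexpf(1.0f + (float)mant / 8.0f, exp - 7);
    return sign ? -v : v;
}
```

Max finite is 0x7E = (1 + 6/8) * 2^8 = 448, which is why E4M3 saturates rather than overflowing to infinity.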

Hadamard Rotation

Block-32 Walsh-Hadamard rotation on Q and K distributes outlier energy across the shared E8M0 exponent group. Essential for perplexity — without it, MXFP6 E2M3 degrades by +3.34 PPL, MXFP8 E4M3 by +0.22. Skipped for MLA models where V is a view of K.
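The rotation itself is cheap: a standard in-place fast Walsh-Hadamard butterfly over each 32-element block, normalized by 1/sqrt(32). A scalar sketch (the PR also carries NEON/AVX2 versions; names here are illustrative):

```c
#include <assert.h>
#include <math.h>

/* Orthonormal Walsh-Hadamard transform on one 32-float block. Because H
 * is orthogonal, (Hq)·(Hk) = q·k: rotating both Q and K preserves
 * attention scores exactly while spreading an outlier's energy across
 * the whole block that shares one E8M0 exponent. */
enum { HAD_N = 32 };

static void hadamard32(float x[HAD_N]) {
    for (int len = 1; len < HAD_N; len <<= 1) {       /* log2(32) butterfly stages */
        for (int i = 0; i < HAD_N; i += len << 1) {
            for (int j = i; j < i + len; ++j) {
                float a = x[j], b = x[j + len];
                x[j]       = a + b;
                x[j + len] = a - b;
            }
        }
    }
    const float norm = 1.0f / sqrtf((float)HAD_N);
    for (int j = 0; j < HAD_N; ++j) x[j] *= norm;
}
```

A single spike of magnitude 32 becomes 32 equal values of magnitude sqrt(32) ≈ 5.66, i.e. the worst-case outlier is flattened before the shared exponent is chosen; applying the transform twice recovers the input.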

Naming

GGML_TYPE_MXFP4 → GGML_TYPE_MXFP4_E2M1 across all backends for consistency with the MX type family naming (similarly MXFP8_E4M3, MXFP6_E2M3).


KV Cache Perplexity Results

Date: 2026-03-21 | Backend: CPU (scalar) | Chunks: 16 | ctx: 512

Memory columns show GiB per 100K tokens of KV cache (non-SWA + SWA combined).

gpt-oss-20b

Model: ggml-org/gpt-oss-20b-GGUF/gpt-oss-20b-mxfp4.gguf

master

| Config | PPL | Δ vs f16 | K GiB/100K | V GiB/100K | Total GiB/100K | Savings |
|---|---|---|---|---|---|---|
| q8_0 | 362.0037 | -12.7872 | 4.86 | 4.86 | 9.72 | 47% |
| q8_0+q4_0 | 363.4146 | -11.3763 | 4.86 | 2.57 | 7.43 | 60% |
| f16 | 374.7909 | | 9.15 | 9.15 | 18.30 | |
| q4_0 | 817.0624 | +442.2715 | 2.57 | 2.57 | 5.14 | 72% |

mxfp-flash-attention

| Config | PPL | Δ vs f16 | K GiB/100K | V GiB/100K | Total GiB/100K | Savings |
|---|---|---|---|---|---|---|
| mxfp6+mxfp4 | 331.7924 | -42.9985 | 3.57 | 2.43 | 6.00 | 68% |
| mxfp8+mxfp4 | 343.5957 | -31.1952 | 4.72 | 2.43 | 7.15 | 61% |
| mxfp6 | 349.9909 | -24.8000 | 3.57 | 3.57 | 7.14 | 62% |
| q8_0 | 362.0037 | -12.7872 | 4.86 | 4.86 | 9.72 | 47% |
| q8_0+q4_0 | 363.4146 | -11.3763 | 4.86 | 2.57 | 7.43 | 60% |
| mxfp4 | 370.1863 | -4.6046 | 2.43 | 2.43 | 4.86 | 74% |
| mxfp8 | 373.4227 | -1.3682 | 4.72 | 4.72 | 9.44 | 49% |
| f16 | 374.7909 | | 9.15 | 9.15 | 18.30 | |
| q4_0 | 817.0624 | +442.2715 | 2.57 | 2.57 | 5.14 | 72% |

Comparison

| Config | master | mxfp-flash-attention |
|---|---|---|
| f16 | 374.7909 | 374.7909 |
| q8_0 | 362.0037 | 362.0037 |
| q8_0+q4_0 | 363.4146 | 363.4146 |
| q4_0 | 817.0624 | 817.0624 |

qwen3-coder

Model: spectralyst/Qwen3-Coder-30B-A3B-Instruct-MXFP4_MOE-GGUF/Qwen3-Coder-30B-A3B-Instruct-MXFP4_MOE.gguf

master

| Config | PPL | Δ vs f16 | K GiB/100K | V GiB/100K | Total GiB/100K | Savings |
|---|---|---|---|---|---|---|
| q8_0 | 11.1139 | -0.0704 | 9.72 | 9.72 | 19.44 | 47% |
| q8_0+q4_0 | 11.1215 | -0.0628 | 9.72 | 5.14 | 14.86 | 60% |
| f16 | 11.1843 | | 18.31 | 18.31 | 36.62 | |
| q4_0 | 11.3810 | +0.1967 | 5.14 | 5.14 | 10.28 | 72% |

mxfp-flash-attention

| Config | PPL | Δ vs f16 | K GiB/100K | V GiB/100K | Total GiB/100K | Savings |
|---|---|---|---|---|---|---|
| q8_0 | 11.1139 | -0.0704 | 9.72 | 9.72 | 19.44 | 47% |
| q8_0+q4_0 | 11.1215 | -0.0628 | 9.72 | 5.14 | 14.86 | 60% |
| f16 | 11.1843 | | 18.31 | 18.31 | 36.62 | |
| mxfp8 | 11.2512 | +0.0669 | 9.44 | 9.44 | 18.88 | 49% |
| mxfp6 | 11.2574 | +0.0731 | 7.15 | 7.15 | 14.30 | 61% |
| mxfp6+mxfp4 | 11.2794 | +0.0951 | 7.15 | 4.86 | 12.01 | 68% |
| mxfp8+mxfp4 | 11.2938 | +0.1095 | 9.44 | 4.86 | 14.30 | 61% |
| q4_0 | 11.3810 | +0.1967 | 5.14 | 5.14 | 10.28 | 72% |
| mxfp4 | 11.9138 | +0.7295 | 4.86 | 4.86 | 9.72 | 74% |

Comparison

| Config | master | mxfp-flash-attention |
|---|---|---|
| f16 | 11.1843 | 11.1843 |
| q8_0 | 11.1139 | 11.1139 |
| q8_0+q4_0 | 11.1215 | 11.1215 |
| q4_0 | 11.3810 | 11.3810 |

gemma-3n-e4b

Model: lmstudio-community/gemma-3n-E4B-it-text-GGUF/gemma-3n-E4B-it-Q8_0.gguf

master

| Config | PPL | Δ vs f16 | K GiB/100K | V GiB/100K | Total GiB/100K | Savings |
|---|---|---|---|---|---|---|
| q4_0 | 38.8477 | -0.0320 | 2.14 | 2.14 | 4.28 | 72% |
| q8_0 | 38.8676 | -0.0121 | 4.05 | 4.05 | 8.10 | 47% |
| f16 | 38.8797 | | 7.62 | 7.62 | 15.24 | |
| q8_0+q4_0 | 39.1937 | +0.3140 | 4.05 | 2.14 | 6.19 | 60% |

mxfp-flash-attention

| Config | PPL | Δ vs f16 | K GiB/100K | V GiB/100K | Total GiB/100K | Savings |
|---|---|---|---|---|---|---|
| mxfp4 | 35.8952 | -2.9845 | 2.02 | 2.02 | 4.04 | 74% |
| mxfp6+mxfp4 | 37.4399 | -1.4398 | 2.97 | 2.02 | 4.99 | 68% |
| mxfp8+mxfp4 | 37.9486 | -0.9311 | 3.93 | 2.02 | 5.95 | 62% |
| q4_0 | 38.8477 | -0.0320 | 2.14 | 2.14 | 4.28 | 72% |
| mxfp8 | 38.8489 | -0.0308 | 3.93 | 3.93 | 7.86 | 49% |
| q8_0 | 38.8676 | -0.0121 | 4.05 | 4.05 | 8.10 | 47% |
| f16 | 38.8797 | | 7.62 | 7.62 | 15.24 | |
| mxfp6 | 38.8864 | +0.0067 | 2.97 | 2.97 | 5.94 | 62% |
| q8_0+q4_0 | 39.1937 | +0.3140 | 4.05 | 2.14 | 6.19 | 60% |

Comparison

| Config | master | mxfp-flash-attention |
|---|---|---|
| f16 | 38.8797 | 38.8797 |
| q8_0 | 38.8676 | 38.8676 |
| q8_0+q4_0 | 39.1937 | 39.1937 |
| q4_0 | 38.8477 | 38.8477 |

Test logs: https://gist.github.com/timothyeburke/f176199437c9c886b2fdd8205b357451

Add MXFP KV cache quantization for flash attention using Struct-of-Arrays
(SoA) memory layout exclusively. Three MX types: MXFP4 (E2M1), MXFP8
(E4M3), MXFP6 (E2M3), implementing the OCP Microscaling v1.0 spec.

SoA layout stores [qs contiguous][e8m0 contiguous] per row, enabling
aligned memory access patterns for GPU backends. All functions in the
flash attention pipeline — set_rows quantization, Q preprocessing, K/V
dequantization — use SoA end-to-end. The existing AoS block layout
remains for MUL_MAT weight quantization (untouched).

Q preprocessing applies Walsh-Hadamard rotation (block-32) before
quantize/dequant round-trip, distributing outlier energy across the
shared exponent group. This is essential for perplexity:
  MXFP8: +0.22 PPL without rotation
  MXFP6: +3.34 PPL without rotation
Hadamard is skipped for MLA models (DK != DV) where V is a view of K.

Shared infrastructure in ggml-common.h:
- Block structures (block_mxfp8: 33B, block_mxfp6: 25B per 32 elements)
- E8M0 MSE-optimal scale search with ±1 range
- Canonical element converters (FP8 E4M3/E5M2, FP6 E2M3/E3M2)
- FP6 tight packing (4 six-bit values in 3 bytes, 25% savings)
- IEEE-754 bit reconstruction constants for SIMD backends
- SoA layout macros, portable bit cast, type property queries

CPU implementation:
- Scalar reference + ARM NEON + x86 AVX2 optimized paths
- Both FA paths supported: one_chunk (scalar) and tiled (SIMD GEMM)
- Split-KV path extended for single-query decode
- Generic vec_dot via dequant-to-float for MUL_MAT compatibility
- Arch fallbacks for loongarch, powerpc, riscv, s390, wasm

KV cache integration:
- set_rows writes SoA with optional Hadamard (op_params[0] flag)
- K cache block-aligned to 16 for CUDA cp.async compatibility
- CLI: --cache-type-k/v with short aliases (mxfp4, mxfp6, mxfp8)

Tests:
- Flash attention: all 3 types at D=64/128, mixed K/V (mxfp8+mxfp4)
- SET_ROWS: Hadamard rotation for all types
- SoA-aware test initialization and comparison for MXFP tensors
- Quantize functions coverage for all types

Rename GGML_TYPE_MXFP4 → GGML_TYPE_MXFP4_E2M1 across all backends
(CPU, OpenCL, SYCL) for consistency with the MX type family naming.
@timothyeburke timothyeburke requested review from a team and ggerganov as code owners March 15, 2026 21:33
Copilot AI review requested due to automatic review settings March 15, 2026 21:33

Copilot AI left a comment


Pull request overview

Adds MXFP (OCP Microscaling v1.0) KV-cache quantization support for flash attention using an end-to-end Struct-of-Arrays (SoA) layout, plus CPU reference/SIMD paths and CLI/test coverage updates.

Changes:

  • Introduces new GGML MXFP types (MXFP4_E2M1, MXFP8_E4M3, MXFP6_E2M3) and common infrastructure (converters, SoA helpers, Hadamard rotation, MSE scale search).
  • Extends CPU flash-attention pipelines (one_chunk + tiled) and KV-cache write path to support MXFP SoA with optional Hadamard rotation.
  • Updates tools/CLI/tests and several backends to recognize renamed/added MXFP types.

Reviewed changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 11 comments.

| File | Description |
|---|---|
| tools/llama-bench/llama-bench.cpp | Adds bench CLI type-name parsing for MXFP aliases/full names. |
| tests/test-backend-ops.cpp | Adds MXFP SoA init/dequant for tests; expands flash-attn and set_rows tests; updates renamed type usage. |
| src/llama-quant.cpp | Switches MoE default quant type to renamed MXFP4_E2M1. |
| src/llama-kv-cache.cpp | Pads MXFP K-cache allocation to 16-block alignment; uses padded strides in K views; sets Hadamard flag on MXFP K writes. |
| ggml/src/ggml.c | Registers new GGML types/traits; adds ggml_is_type_mxfp + Hadamard/type-property helpers; extends quantize dispatch. |
| ggml/src/ggml-sycl/mmvq.cpp | Updates SYCL mul-mat-vec switch to renamed MXFP4_E2M1. |
| ggml/src/ggml-sycl/convert.cpp | Updates SYCL conversion dispatch to renamed MXFP4_E2M1. |
| ggml/src/ggml-quants.h | Declares new MXFP quant/dequant APIs, SoA functions, element converters, packing, Hadamard. |
| ggml/src/ggml-quants.c | Implements MXFP scale search, element converter wrappers, FP6 packing, SoA quant/dequant, Hadamard wrapper; adds MXFP6/8 quantizers. |
| ggml/src/ggml-opencl/ggml-opencl.cpp | Updates OpenCL type checks/paths to renamed MXFP4_E2M1. |
| ggml/src/ggml-impl.h | Simplifies/clarifies E8M0-to-float helpers (kept local to impl header). |
| ggml/src/ggml-cpu/repack.cpp | Updates MXFP4 repack assertions and selection to renamed MXFP4_E2M1. |
| ggml/src/ggml-cpu/quants.h | Declares MXFP6/8 quantize + CPU dequant + SoA dequant entry points and generic fallbacks. |
| ggml/src/ggml-cpu/quants.c | Adds generic vec_dot for MXFP8/MXFP6 and generic CPU dequant/SoA dequant wrappers. |
| ggml/src/ggml-cpu/ops.cpp | Adds MXFP SoA quantize in set_rows (with optional Hadamard); adds MXFP support in flash-attn one_chunk/tiled + split-KV changes. |
| ggml/src/ggml-cpu/ggml-cpu.c | Registers CPU type traits for new MXFP8/MXFP6 and renamed MXFP4. |
| ggml/src/ggml-cpu/arch/x86/quants.c | Adds AVX2 MXFP8/MXFP6 vec_dot and AoS/SoA dequant kernels. |
| ggml/src/ggml-cpu/arch/wasm/quants.c | Adds wasm fallbacks for MXFP vec_dot entry points. |
| ggml/src/ggml-cpu/arch/s390/quants.c | Adds s390 fallbacks for MXFP8/MXFP6 vec_dot entry points. |
| ggml/src/ggml-cpu/arch/riscv/quants.c | Adds RISC-V fallbacks for MXFP8/MXFP6 vec_dot entry points. |
| ggml/src/ggml-cpu/arch/powerpc/quants.c | Adds PowerPC fallbacks for MXFP8/MXFP6 vec_dot entry points. |
| ggml/src/ggml-cpu/arch/loongarch/quants.c | Adds LoongArch fallbacks for MXFP vec_dot entry points. |
| ggml/src/ggml-cpu/arch/arm/quants.c | Adds NEON MXFP8/MXFP6 vec_dot and AoS/SoA dequant kernels. |
| ggml/src/ggml-cpu/arch-fallback.h | Adds MXFP generic aliasing and attempts to route dequant on non-arm/x86. |
| ggml/src/ggml-common.h | Adds MXFP constants, block structs, SoA offsets, LUTs, and canonical converters/Hadamard/E8M0 helpers. |
| ggml/include/ggml.h | Adds new public ggml_type/ftype enums and MXFP helper APIs. |
| ggml/include/ggml-cpu.h | Extends CPU type-traits struct with optional SIMD to_float hook. |
| common/arg.cpp | Adds MXFP cache types + short aliases to KV-cache CLI parsing. |


@github-actions github-actions bot added testing Everything test related examples ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language OpenCL Issues specific to the OpenCL backend labels Mar 15, 2026
@timothyeburke timothyeburke marked this pull request as draft March 15, 2026 22:48
…lize MXFP6 naming

Fix potential buffer overflows flagged in PR ggml-org#20609 review:
- set_rows: replace fixed float tmp[1024] with std::vector for large n_embd_k_gqa
- tiled FA: size q_mxfp_buf with ggml_row_size guard instead of fixed 1024
- one_chunk FA: pre-allocate k/v dequant buffers from mxfp.{k,v}_soa_elems
  instead of hard-coded float[4096] stack arrays
- kv-cache: assert n_embd_k_gqa % qk == 0 before integer division
- test init: assert soa_bytes % block_size == 0

Normalize MXFP6 function naming to match MXFP8 convention (short form
without element format suffix): mxfp6_e2m3 → mxfp6 in all function
identifiers across 14 files. Format-specific items (type enums, traits,
lookup tables, constants) retain their _e2m3 suffix.
- Per-head dequant: multihead MXFP now extracts only the needed head's
  SoA blocks (e.g. 20 bytes for mxfp4 DK=128) into a stack buffer and
  dequants DK elements, instead of dequanting all heads (nek2*DK).
  For 8 KV heads this is 8x less dequant work per KV position.

- Hoist loop invariants: base pointer offsets (k_base, v_base),
  per-head SoA byte offsets, and multihead row bases are computed once
  per query row instead of per KV position in the inner loop.

- Precompute SoA addressing in mxfp_fa_params_init: qs_per_block,
  blocks_per_head, head_qs_bytes, and head_e8m0_offset are calculated
  once at init rather than derived per iteration.

- Move thread-local buffer pointers (VKQ32, V32, VKQ16, Q_q) and
  v_is_f16 check outside the ir loop.
- Increase q_mxfp_buf from 512 to 2048 bytes (supports DK up to 1024 with MXFP8)
- Replace fixed k_soa[4096]/v_soa[4096] stack arrays with dynamically sized vectors
- Replace fixed k_head_soa[320]/v_head_soa[320] with dynamically sized vectors
- Add soa_bytes divisibility assertion in test init
… helpers

Add mxfp_dequant_traits_t to ggml-common.h as single source of truth for
MXFP IEEE-754 reconstruction parameters. Define static const instances for
all 4 formats (E4M3, E5M2, E2M3, E3M2), ready for CUDA/Metal/Vulkan reuse.

Extract shared dequant and FP6 unpack helpers on both architectures,
replacing duplicated inline code and macros. Net -215 lines.
…d dequant

Replace per-row/per-tile std::vector heap allocations with stack buffers
in set_rows, one_chunk, and tiled flash attention paths. Fix tiled path
to use per-head SoA extraction (matching one_chunk) instead of dequanting
the full multihead region per token.
@narinishi

Could you clarify the testing methodology specifically for gpt-oss-20b-MXFP4? The results seem anomalous.

@timothyeburke
Author

@narinishi I updated to the latest master and fixed a mistake I had made in the tiled GEMM paths. The full results log is at https://gist.github.com/timothyeburke/f176199437c9c886b2fdd8205b357451, and I've updated the results above; the comparison rows against master now match.

* cleanup : hoist mxfp soa functions

* fix: CI failures — CUDA __device__ init, Metal MXFP supports_op, SoA test assert

Three fixes for CI failures:

1. Remove <cmath> from CUDA/HIP/MUSA section of ggml-common.h — the include
   causes NAN/INFINITY to become non-constexpr, breaking __device__ static
   table initialization for the MXFP LUTs.

2. Add MXFP type guards to Metal's supports_op: MXFP8/MXFP6 have no Metal
   shaders yet (reject all ops), MXFP4 has AoS shaders (MUL_MAT, GET_ROWS)
   but no SoA/flash attention support yet (reject FLASH_ATTN_EXT, SET_ROWS).

3. Replace strict assert in test-backend-ops init_tensor_mxfp_soa with a
   conditional fallback — when ne2 is not divisible by heads_per_region,
   fall back to per-head SoA init instead of crashing.

* fix : correct guard for mxfp cpu dequant functions

* fix: CUDA MXFP LUT init and MXFP flash attention SoA test layout

- Add per-platform GGML_TABLE_NAN/GGML_TABLE_INFINITY macros for MXFP
  LUTs — uses __uint_as_float on CUDA to avoid MSVC non-constexpr INFINITY
- Fix init_tensor_mxfp_soa to detect multihead SoA from tensor strides,
  matching the KV cache layout for permuted flash attention tests

* fix: CUDA MXFP LUT init — use __builtin_nanf/__builtin_inff for constexpr device tables

CUDA/HIP/MUSA __device__ static tables require constexpr initializers.
Standard NAN/INFINITY macros may expand to non-constexpr expressions
(e.g. MSVC: (float)(1e+300), nvcc: __uint_as_float is not constexpr
for static init). Previous fix attempted __uint_as_float for nvcc and
__builtin_bit_cast for clang — neither worked universally.

Use __builtin_nanf("") and __builtin_inff() which are constexpr on
all target compilers (nvcc, clang for HIP/MUSA, GCC, MSVC). Define
once before the platform #if chain instead of per-platform copies.

* fix: correct E5M2 LUT precision and add converter-vs-LUT validation tests

The kvalues_mxfp8_e5m2 LUT had 50 values with insufficient decimal
precision, causing bitwise mismatches against the IEEE-754 element
converter. Regenerated from ggml_mxfp_fp8_e5m2_to_float() with %.9e
precision for exact float round-trip on all 256 entries.

Also consolidates GGML_TABLE_NAN/GGML_TABLE_INFINITY into a single
definition using __builtin_nanf/__builtin_inff (constexpr on all
target compilers), and adds LUT validation tests to test-quantize-fns
that verify all 5 MXFP element converters match their canonical LUT
values (FP4 E2M1: 16, FP6 E2M3: 64, FP6 E3M2: 64, FP8 E4M3: 256,
FP8 E5M2: 256 — 656 total values verified).

* fix: MSVC compat for GGML_TABLE_NAN/INFINITY — use builtins only on GCC/Clang/nvcc

MSVC does not support __builtin_nanf/__builtin_inff. Use standard
NAN/INFINITY macros on MSVC (which work for regular static tables),
and compiler builtins only on GCC/Clang/nvcc (needed for CUDA
__device__ table constexpr initialization).

* fix: handle nvcc+MSVC host — check __CUDACC__ before _MSC_VER for NAN/INF macros

When nvcc uses MSVC as the host compiler, both _MSC_VER and __CUDACC__
are defined. The previous fix checked _MSC_VER first, giving nvcc the
MSVC NAN/INFINITY macros which are not constexpr for __device__ tables.
Add __CUDACC__ exclusion so nvcc gets __builtin_nanf/__builtin_inff.

* cleanup: remove AoS MXFP6/MXFP8 dequant code — these types are KV-cache-only (SoA)

MXFP6 (E2M3) and MXFP8 (E4M3) exist only for KV cache flash attention,
which uses SoA (Struct-of-Arrays) layout. The AoS dequant functions
(NEON, AVX2, CPU dispatch, generic wrappers) were incorrectly added
and are dead code — no model stores weights in these formats.

Removed:
- AoS NEON dequant: dequantize_row_mxfp{6,8}_neon, _cpu dispatch
- AoS AVX2 dequant: dequantize_row_mxfp{6,8}_avx2, _cpu dispatch
- AoS generic wrappers: dequantize_row_mxfp{6,8}_cpu_generic
- AoS fallback defines in arch-fallback.h
- CPU traits .to_float entries for MXFP6/MXFP8
- MXFP6/MXFP8 from all_types[] in test-backend-ops (no AoS tests)

Kept (correct SoA code):
- All *_soa_* functions (NEON, AVX2, generic, dispatch)
- CPU traits .from_float_soa / .to_float_soa
- Flash attention and SET_ROWS Hadamard test cases
- Scalar reference dequant in ggml-quants.c (test-quantize-fns roundtrip)
- MXFP4 AoS code (upstream model weight support, untouched)

Fixes ARM64 CI failure: GET_ROWS(mxfp6_e2m3) was testing dead AoS code
that had a NEON bug. The test no longer runs because the type is
correctly excluded from AoS test paths.

* test: guard all MXFP types must have SoA traits for flash attention

All MXFP flash attention uses SoA layout exclusively. Test validates:
- ALL MXFP types (MXFP4, MXFP6, MXFP8) have from_float_soa and to_float_soa
- MXFP6/MXFP8 (KV-cache-only) do NOT have AoS CPU to_float

Prevents regression: if someone adds AoS dequant back for MXFP6/MXFP8,
or removes SoA traits from any MXFP type, CI will catch it.

* test: add Hadamard, SoA cross-check, E8M0, and layout offset tests

* test: add MXFP converter edge cases, FP6 packing, E8M0 known-answer tests

Add comprehensive tests to catch the bugs backend implementers hit most:
- Element converter edge cases: subnormals, max finite, saturation, NaN, sign
- FP6 pack/unpack exhaustive round-trip with known-answer byte verification
- E8M0 known-answer decode + HALF vs FULL scale distinction
- E8M0 rounding boundary at sqrt(2) threshold (catches floor-only bugs)
- Converter exhaustive round-trip: quantize(dequantize(i))==i for all formats
- Consolidate duplicate SoA switches into single table in test-backend-ops

* test: add AoS/SoA cross-check, Hadamard pipeline, format spec, and mxfp_rmse

- MXFP4 AoS vs SoA cross-check: two independent code paths, bitwise match
- Full Hadamard pipeline roundtrip: H→quantize→dequant→H for all 3 types
- mxfp_rmse helper: computes sqrt(sum/n), with named pipeline constants
- Block size consistency: verify QK_MXFP{4,8,6} == 32
- EMAX_OFFSET vs format max: validate constants produce valid E8M0
- Edge case LUT validation: expected_bits verified against canonical LUTs
- FP4 E2M1 exhaustive converter round-trip (16/16)

* cleanup: tighten MXFP test comments to match repo conventions

* fix: platform-specific NaN/Infinity for GPU device table initializers

FP8 E4M3/E5M2 LUTs contain NaN/Inf which cannot be constexpr-initialized
in __device__ tables on any CUDA/HIP/MUSA version. No GPU backend uses
these LUTs (they use converter functions instead), so guard them out of
GPU builds entirely. Simplify GGML_TABLE_NAN/INFINITY to CPU-only macros.
@github-actions github-actions bot added the Apple Metal https://en.wikipedia.org/wiki/Metal_(API) label Mar 22, 2026
* cleanup: consolidate MXFP type aliases, fix SoA linker bug on 5 platforms

- Add GGML_TYPE_MXFP8 and GGML_TYPE_MXFP6 short aliases (matching
  existing GGML_TYPE_MXFP4 pattern) and use short names consistently
  throughout the codebase instead of mixing long/short forms.

- Fix missing SoA dequant symbols (dequantize_row_mxfp{4,8,6}_soa_cpu)
  on loongarch, powerpc, riscv, s390, and wasm by adding proper aliases
  to each arch section in arch-fallback.h. Previously these were only
  defined under GGML_CPU_GENERIC, causing linker failures on those
  platforms when using MXFP flash attention.

- Remove 10 files from the PR diff:
  - 5 arch stub files replaced by arch-fallback.h aliases
  - 5 rename-only files (sycl, opencl, repack, llama-quant) reverted
    since the GGML_TYPE_MXFP4 compat alias handles them

* cleanup: DRY FP6 unpack, extract mxfp_kv_params + mxfp_dequant_head helper

- FP6 unpack: x86 and ARM SIMD versions now call ggml_mxfp_unpack_fp6x4()
  from ggml-common.h instead of duplicating the scalar bit manipulation.

- Extract mxfp_kv_params sub-struct from mxfp_fa_params: the 7 symmetric
  K/V fields (dequantize, multihead, soa_elems, qs_per_block,
  head_qs_bytes, head_e8m0_offset, blocks_per_head) are now in a reusable
  struct accessed as mxfp.k and mxfp.v.

- Add mxfp_dequant_head() helper: replaces 4 instances of the multihead
  SoA extraction pattern (2x memcpy + dequant, with multihead/single-head
  branching) with a single function call. Future backends get the pattern
  for free.

* cleanup: extract mxfp_kv_params_init to DRY the K/V init blocks

The K and V initialization in mxfp_fa_params_init were structurally
identical 10-line blocks differing only by tensor/dimension. Extract
into mxfp_kv_params_init(type, D, nb2, ne2) so future MXFP formats
get the multihead SoA addressing logic automatically.

* cleanup: generic MSE round-trip, replace magic buffer sizes with constants

- Remove mse_error_fp8_e4m3 and mse_error_fp6_e2m3: these were identical
  round-trip functions differing only by converter. mxfp_compute_e8m0_mse
  now uses to_elem/to_float directly when mse_error is NULL (FP8/FP6).
  MXFP4 keeps its custom decision-tree MSE. New formats get MSE for free
  by just setting to_elem/to_float in their traits.

- Replace magic 1024/1088 buffer sizes in flash attention with named
  constants MXFP_FA_MAX_D and MXFP_FA_SOA_BUF. One place to change if
  max head dimension grows.

* cleanup: remove dead AoS vec_dot for MXFP8/MXFP6, unify SoA impls

MXFP8 and MXFP6 are KV-cache-only types that use SoA layout for flash
attention. The AoS vec_dot functions (scalar generic, AVX2, NEON) were
dead code — no matmul path uses them.

Removed:
- ggml_vec_dot_mxfp{8,6}_q8_0 from scalar, x86, ARM, quants.h
- ggml_vec_dot_mxfp_q8_0_impl shared helper
- arch-fallback.h aliases for vec_dot mxfp8/mxfp6 (12 lines)
- vec_dot/vec_dot_type registration in ggml-cpu.c

Also unified SoA quantize/dequant: the separate mxfp8_soa_impl and
mxfp6_soa_impl functions (4 functions, ~80 lines) are replaced by two
generic functions (quantize_row_mxfp_soa_impl, dequantize_row_mxfp_soa_impl)
that use traits->bits_per_elem and traits->qs_per_block to handle both
byte-aligned (FP8) and 6-bit packed (FP6) formats. New MXFP formats
get SoA for free by setting these trait fields.

* cleanup: remove all AoS MXFP8/MXFP6 quantize/dequant — SoA only

MXFP8 and MXFP6 are KV-cache-only types. All quantization and
dequantization goes through the SoA (Struct-of-Arrays) path for flash
attention. The AoS (block_mxfp8/block_mxfp6 struct) implementations
were dead code that should never have been added.

Removed:
- quantize_row_mxfp{8,6}_impl, dequantize_row_mxfp{8,6}_impl
- quantize_row_mxfp{8,6}_ref, dequantize_row_mxfp{8,6}
- quantize_mxfp{8,6} (ggml_quantize_chunk wrappers)
- All declarations from ggml-quants.h and quants.h
- to_float/from_float_ref registrations from ggml.c type traits
- from_float registration from ggml-cpu.c CPU traits

Block struct definitions (block_mxfp8, block_mxfp6) are retained for
sizeof() in type traits and validate_row_data.

* cleanup: fail fast in ggml_quantize_chunk for KV-cache-only types

Add explicit GGML_ABORT for MXFP8/MXFP6 in ggml_quantize_chunk —
these are KV-cache-only types that use SoA layout via from_float_soa.
Attempting AoS quantization through this entry point is a bug.
…t coverage

* fix: correct tiled flash attention SoA pointer math for multihead MXFP

The cleanup refactoring (c919bc4) extracted mxfp_dequant_head as a
shared helper but failed to update the tiled path's data pointers.
The helper expects the full SoA row base (no per-head offset), but the
tiled path was passing a pointer that already included ik2*nbk2, causing
a double head offset that produced NaN during prefill.

Add mxfp_row_ptr helper to centralize the multihead-aware pointer
calculation across both one_chunk and tiled paths. Verified with 16-chunk
perplexity on gpt-oss-20b: all four configs (f16, mxfp4, mxfp6, mxfp8)
produce exact matches with the known-good commit (23e8863).

* perf: reduce E8M0 MSE search range from ±2 to ±1

The base estimate round(log2(amax)) is always within 1 step of optimal.
Empirically verified across 30K blocks and 6 distributions: ±1 and ±2
never disagree. This reduces the scale search from 5 to 3 candidates
(40% fewer inner loop iterations) with zero quality impact.

* perf: eliminate redundant work in MXFP quantize and flash attention

- mse_error_mxfp4: use passed inv_scale instead of recomputing 1/d
- mxfp_compute_e8m0_mse: hoist loop-invariant traits branch out of inner loop
- tiled V path: dequant directly to V32 tile, remove intermediate memcpy and dead buffer

* cleanup: fix comments, unify Hadamard condition, simplify E8M0 helpers

- EMAX_OFFSET comments: fix ceil/floor labels to match actual values
- Hadamard flag: unify write path (llama-kv-cache.cpp) and read path
  (ops.cpp) to both use DK==DV condition instead of is_mla()
- E8M0 helpers in ggml-impl.h: simplify to match ggml-common.h style,
  add cross-reference comment

* fix: MXFP8/6 flash attention tests crash on init

The view base tensors for K/V don't get named "k"/"v" but inherit the
MXFP type. The name-based filter in initialize_tensors missed them,
falling through to init_tensor_uniform which calls quantize_chunk and
aborts for KV-cache-only types. Fix by checking ggml_is_type_mxfp() for
all tensors, matching the pattern set_rows tests already use.

* test: expand MXFP set_rows coverage

- Add MXFP8/MXFP6 to all_types for non-Hadamard set_rows coverage
- Expand Hadamard set_rows tests: add views, broadcast, and multi-head configs
- Coverage: 18 → 51 MXFP set_rows tests

* perf: add AVX2 Hadamard for x86 (matches existing ARM NEON path)

* cleanup: DRY MXFP4 quantize/dequant with shared per-block helpers

Extract quantize_block_mxfp4 and dequantize_block_mxfp4 as shared
helpers used by both AoS (quantize_row_mxfp4_ref, dequantize_row_mxfp4)
and SoA (quantize_row_mxfp4_soa, dequantize_row_mxfp4_soa) paths.
Eliminates duplicated per-block logic while keeping layout-specific
pointer arithmetic in the callers.

* feat: add MXFP8/MXFP6 AoS quantize/dequant (full type support)

Extract quantize_block_mxfp / dequantize_block_mxfp per-block helpers
from the SoA generic impl and use them to build AoS row functions for
MXFP8 (E4M3) and MXFP6 (E2M3). Register to_float and from_float_ref
in type traits, add quantize_chunk dispatch, replacing the GGML_ABORT.

MXFP8 and MXFP6 are no longer KV-cache-only — they can now be used
as general quantization types. The SoA impl is also DRY'd to delegate
to the same per-block helpers.

* cleanup: remove dead soa_elems field from mxfp_kv_params

Computed but never read — leftover from an earlier design.

* feat: add MXFP8/MXFP6 vec_dot and full CPU type support

Add scalar vec_dot_mxfp8_q8_0 and vec_dot_mxfp6_q8_0 implementations,
register from_float + vec_dot + vec_dot_type in CPU traits, and add
fallback remaps for all architectures. MXFP8/6 are now fully tested:
AoS quantization error, reference match, and dot product accuracy all
pass in test-quantize-fns.

* perf: remove E8M0 MSE search — base estimate is perplexity-optimal

The MSE search over ±1 candidates around round(log2(amax)) was found to
HURT perplexity by 4-37 PPL points across all MXFP configs on gpt-oss-20b.
The base estimate alone (no search) produces better attention patterns
because minimizing per-block reconstruction error is not the same as
minimizing attention score distortion through softmax.

Removes mse_error_mxfp4, mse_error field from traits, MSE_RANGE constant,
and the entire search loop. E8M0 computation is now a single amax scan +
integer bit extraction — no inner loop, no function pointers. This also
simplifies future GPU/Metal implementations.

* perf: fuse Hadamard rotation into SoA quantize (one pass, no temp buffer)

Add quantize_row_mxfp{4,8,6}_soa_hadamard that apply Hadamard and
quantize block-by-block with a 32-float stack buffer. Eliminates the
std::vector heap allocation and 2 extra memory passes over the full row.

set_rows now dispatches to the fused path when Hadamard is enabled,
falling through to the unfused quantize for non-Hadamard types.

This pattern maps directly to a CUDA kernel: global memory read →
register Hadamard → register quantize → global memory write.

* cleanup: consistent MXFP type names and variable naming

- Rename type_name "mxfp8_e4m3" → "mxfp8", "mxfp6_e2m3" → "mxfp6"
  to match "mxfp4". Only one variant of each exists — the suffix was
  unnecessary disambiguation that implied alternatives.
- Remove redundant MXFP shortcuts from arg.cpp (fallback loop handles
  all types via ggml_type_name matching).
- Rename kv_is_f32_f16_or_mxfp → k_is_f32_f16_or_mxfp (only checks K).

* perf: fuse Q preprocessing round-trip (no SoA buffer needed)

Add mxfp{4,8,6}_hadamard_roundtrip and mxfp{4,8,6}_roundtrip functions
that apply quantization error to float values without materializing SoA
bytes. Replaces the 3-step Q preprocessing (Hadamard → quantize to SoA
buffer → dequant from SoA buffer) with a single pass through per-block
round-trip helpers.

Eliminates the Q_q intermediate buffer and two function pointer calls
from the flash attention hot path. Maps directly to CUDA: Q stays in
registers, Hadamard butterfly + quantize error applied in-place.

* fix: clamp E8M0 = 255 to 254 in decode (fixes CI NaN failures)

E8M0 = 255 means NaN per MX spec, but our encode path already clamps
to 254. When test data contains random E8M0 = 255 bytes, the decode
produces Inf, and Inf * 0.0 = NaN, causing GET_ROWS and CPY tests to
fail on MXFP6 (and potentially MXFP4/8).

Fix: clamp 255 → 254 in both E8M0 decode functions:
  - ggml_e8m0_to_fp32 / ggml_e8m0_to_fp32_half (ggml-impl.h)
  - ggml_mxfp_e8m0_to_fp32 / ggml_mxfp_e8m0_to_fp32_half (ggml-common.h)

These are unfortunately duplicated across two headers because
ggml-common.h compiles for CUDA (__device__) while ggml-impl.h serves
CPU-only callers that don't include ggml-common.h.
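The clamped decode is tiny; a sketch (function name illustrative) showing why the clamp is safe and why it prevents the NaN cascade:

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* E8M0 decode with the 255 -> 254 clamp. 255 encodes NaN in the MX
 * spec, but the encode path never emits it, so decoding it as the max
 * finite scale avoids Inf (and the downstream Inf * 0.0 = NaN) when
 * test data contains random scale bytes. */
static float e8m0_to_fp32(uint8_t b) {
    if (b == 255) b = 254;
    return ldexpf(1.0f, (int)b - 127);   /* 2^(b - 127) */
}
```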
@ggerganov
Member

Just a heads up that this PR is too large and since it is heavily AI generated by a new contributor to the project, it will not be accepted in similar form. If you want to have any of this contributed, you would need to split it into smaller parts and use your own words to discuss and demonstrate that you really understand what you are doing and that you will be able to maintain it in the future.

