
ggml: MXFP flash attention with SoA layout (CPU scalar reference) #20609

Draft

timothyeburke wants to merge 16 commits into ggml-org:master from timlikesai:mxfp-flash-attention

Conversation

@timothyeburke

@timothyeburke timothyeburke commented Mar 15, 2026

Hi llama.cpp maintainers and community!

I'm Tim, the new enterprise AI enablement engineer at The New York Times (not a journalist). A few weeks ago a blizzard snowed out a show I had tickets to (at least I got this video), and to pass the time I decided to do a deep dive into how Flash Attention works so I could teach others.

I decided to try implementing MXFP4 flash attention for my CUDA GPUs (a pair of 5070 Ti cards). To improve perplexity I added a few research methods for recovering loss in low-bit floating-point weight quantization, and I realized that most of the plumbing for MXFP6 and MXFP8 was already there too, which gave us lots of knobs to play with.

I have CUDA (timlikesai#2) and Metal implementations ready as fast follows for this PR.


MXFP Flash Attention — CPU Scalar Reference + SIMD

Add MXFP KV cache quantization for flash attention with Struct-of-Arrays (SoA) memory layout. Three MX format families: MXFP4 (E2M1), MXFP8 (E4M3), MXFP6 (E2M3), per the OCP Microscaling v1.0 spec.

SoA Layout

[qs contiguous][e8m0 contiguous] per row, enabling aligned memory access for all backends. The full flash attention pipeline — set_rows quantization, Q preprocessing, K/V dequantization — uses SoA end-to-end. Existing AoS block layout for MUL_MAT weight quantization is untouched.
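To make the layout concrete, here is a minimal sketch of the SoA row addressing described above. The function and constant names are illustrative, not the PR's actual identifiers; only the layout rule ([qs contiguous][e8m0 contiguous], one E8M0 scale byte per 32-element block) comes from the description.

```c
#include <assert.h>
#include <stddef.h>

/* One SoA row of n elements: all packed element bits first, then one
 * E8M0 scale byte per 32-element block. Names are illustrative. */
enum { QK_MX = 32 };  /* block size shared by all MX formats */

static size_t mx_qs_bytes(int bits_per_elem, int n) {
    return (size_t)n * bits_per_elem / 8;   /* 4, 6, or 8 bits per element */
}

/* Byte offset of the contiguous e8m0 scale region within the row. */
static size_t mx_e8m0_offset(int bits_per_elem, int n) {
    return mx_qs_bytes(bits_per_elem, n);
}

static size_t mx_row_bytes(int bits_per_elem, int n) {
    return mx_qs_bytes(bits_per_elem, n) + (size_t)(n / QK_MX);
}
```

For a single 32-element block this reproduces the per-block sizes quoted later in the PR: 33 bytes for MXFP8 (32 + 1 scale) and 25 bytes for MXFP6 (24 packed + 1 scale).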

Shared Constructs (ggml-common.h)

  • mxfp_dequant_traits_t — single source of truth for IEEE-754 bit reconstruction parameters across all backends (CPU, CUDA, Metal, Vulkan)
  • Static const trait instances for all 4 formats: E4M3, E5M2, E2M3, E3M2
  • Portable element converters, FP6 pack/unpack, E8M0 scale computation, Walsh-Hadamard rotation
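As a rough illustration of the FP6 tight packing mentioned above (four 6-bit codes in 3 bytes, a 25% saving over byte-per-element storage), here is a little-endian bitstream sketch. The exact bit order is an assumption for illustration; the PR's canonical helper is `ggml_mxfp_unpack_fp6x4` in ggml-common.h.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four 6-bit values into a 24-bit little-endian bitstream (3 bytes).
 * Bit order is illustrative, not necessarily the PR's actual layout. */
static void fp6_pack4(const uint8_t v[4], uint8_t out[3]) {
    uint32_t bits = (uint32_t)(v[0] & 0x3F)
                  | (uint32_t)(v[1] & 0x3F) << 6
                  | (uint32_t)(v[2] & 0x3F) << 12
                  | (uint32_t)(v[3] & 0x3F) << 18;
    out[0] =  bits        & 0xFF;
    out[1] = (bits >> 8)  & 0xFF;
    out[2] = (bits >> 16) & 0xFF;
}

static void fp6_unpack4(const uint8_t in[3], uint8_t v[4]) {
    uint32_t bits = (uint32_t)in[0] | (uint32_t)in[1] << 8 | (uint32_t)in[2] << 16;
    v[0] =  bits        & 0x3F;
    v[1] = (bits >> 6)  & 0x3F;
    v[2] = (bits >> 12) & 0x3F;
    v[3] = (bits >> 18) & 0x3F;
}
```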

CPU Implementation

  • Scalar reference in ops.cpp: per-head SoA dequant, hoisted loop invariants, stack-allocated buffers
  • x86 AVX2 and ARM NEON SIMD: shared dequant helpers and FP6 unpack, traits-driven dispatch
  • MXFP4 uses LUT-based dequant (shuffle); MXFP6/MXFP8 use IEEE bit reconstruction via shared helpers
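For readers unfamiliar with the bit reconstruction route, here is a scalar sketch of FP8 E4M3 decoding per the OCP MX element format (sign, 4-bit exponent with bias 7, 3-bit mantissa; no infinities, NaN only at S.1111.111). The function name is illustrative; the SIMD paths in the PR perform the same reconstruction with vector bit operations.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Decode one FP8 E4M3 byte to float via IEEE-style bit reconstruction. */
static float fp8_e4m3_to_float(uint8_t b) {
    int sign = (b >> 7) & 1;
    int exp  = (b >> 3) & 0xF;
    int mant =  b       & 0x7;
    float v;
    if (exp == 0xF && mant == 0x7) return NAN;        /* the only NaN encoding */
    if (exp == 0) v = ldexpf((float)mant / 8.0f, -6); /* subnormal: mant/8 * 2^-6 */
    else          v = ldexpf(1.0f + (float)mant / 8.0f, exp - 7);
    return sign ? -v : v;
}
```

Max finite is 0x7E = (1 + 6/8) * 2^8 = 448, which is why E4M3 saturates rather than overflowing to infinity.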

Hadamard Rotation

Block-32 Walsh-Hadamard rotation on Q and K distributes outlier energy across the shared E8M0 exponent group. Essential for perplexity — without it, MXFP6 E2M3 degrades by +3.34 PPL, MXFP8 E4M3 by +0.22. Skipped for MLA models where V is a view of K.
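The rotation itself is cheap: a standard in-place fast Walsh-Hadamard butterfly over each 32-element block, normalized by 1/sqrt(32). A scalar sketch (the PR also carries NEON/AVX2 versions; names here are illustrative):

```c
#include <assert.h>
#include <math.h>

/* Orthonormal Walsh-Hadamard transform on one 32-float block. Because H
 * is orthogonal, (Hq)·(Hk) = q·k: rotating both Q and K preserves
 * attention scores exactly while spreading an outlier's energy across
 * the whole block that shares one E8M0 exponent. */
enum { HAD_N = 32 };

static void hadamard32(float x[HAD_N]) {
    for (int len = 1; len < HAD_N; len <<= 1) {       /* log2(32) butterfly stages */
        for (int i = 0; i < HAD_N; i += len << 1) {
            for (int j = i; j < i + len; ++j) {
                float a = x[j], b = x[j + len];
                x[j]       = a + b;
                x[j + len] = a - b;
            }
        }
    }
    const float norm = 1.0f / sqrtf((float)HAD_N);
    for (int j = 0; j < HAD_N; ++j) x[j] *= norm;
}
```

A single spike of magnitude 32 becomes 32 equal values of magnitude sqrt(32) ≈ 5.66, i.e. the worst-case outlier is flattened before the shared exponent is chosen; applying the transform twice recovers the input.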

Naming

GGML_TYPE_MXFP4 → GGML_TYPE_MXFP4_E2M1 across all backends for consistency with the MX type family naming (similarly MXFP8_E4M3, MXFP6_E2M3).


KV Cache Perplexity Results

Date: 2026-03-21 | Backend: CPU (scalar) | Chunks: 16 | ctx: 512

Memory columns show GiB per 100K tokens of KV cache (non-SWA + SWA combined).

gpt-oss-20b

Model: ggml-org/gpt-oss-20b-GGUF/gpt-oss-20b-mxfp4.gguf

master

| Config | PPL | Δ vs f16 | K GiB/100K | V GiB/100K | Total GiB/100K | Savings |
|---|---|---|---|---|---|---|
| q8_0 | 362.0037 | -12.7872 | 4.86 | 4.86 | 9.72 | 47% |
| q8_0+q4_0 | 363.4146 | -11.3763 | 4.86 | 2.57 | 7.43 | 60% |
| f16 | 374.7909 | | 9.15 | 9.15 | 18.30 | |
| q4_0 | 817.0624 | +442.2715 | 2.57 | 2.57 | 5.14 | 72% |

mxfp-flash-attention

| Config | PPL | Δ vs f16 | K GiB/100K | V GiB/100K | Total GiB/100K | Savings |
|---|---|---|---|---|---|---|
| mxfp6+mxfp4 | 331.7924 | -42.9985 | 3.57 | 2.43 | 6.00 | 68% |
| mxfp8+mxfp4 | 343.5957 | -31.1952 | 4.72 | 2.43 | 7.15 | 61% |
| mxfp6 | 349.9909 | -24.8000 | 3.57 | 3.57 | 7.14 | 62% |
| q8_0 | 362.0037 | -12.7872 | 4.86 | 4.86 | 9.72 | 47% |
| q8_0+q4_0 | 363.4146 | -11.3763 | 4.86 | 2.57 | 7.43 | 60% |
| mxfp4 | 370.1863 | -4.6046 | 2.43 | 2.43 | 4.86 | 74% |
| mxfp8 | 373.4227 | -1.3682 | 4.72 | 4.72 | 9.44 | 49% |
| f16 | 374.7909 | | 9.15 | 9.15 | 18.30 | |
| q4_0 | 817.0624 | +442.2715 | 2.57 | 2.57 | 5.14 | 72% |

Comparison

| Config | master | mxfp-flash-attention |
|---|---|---|
| f16 | 374.7909 | 374.7909 |
| q8_0 | 362.0037 | 362.0037 |
| q8_0+q4_0 | 363.4146 | 363.4146 |
| q4_0 | 817.0624 | 817.0624 |

qwen3-coder

Model: spectralyst/Qwen3-Coder-30B-A3B-Instruct-MXFP4_MOE-GGUF/Qwen3-Coder-30B-A3B-Instruct-MXFP4_MOE.gguf

master

| Config | PPL | Δ vs f16 | K GiB/100K | V GiB/100K | Total GiB/100K | Savings |
|---|---|---|---|---|---|---|
| q8_0 | 11.1139 | -0.0704 | 9.72 | 9.72 | 19.44 | 47% |
| q8_0+q4_0 | 11.1215 | -0.0628 | 9.72 | 5.14 | 14.86 | 60% |
| f16 | 11.1843 | | 18.31 | 18.31 | 36.62 | |
| q4_0 | 11.3810 | +0.1967 | 5.14 | 5.14 | 10.28 | 72% |

mxfp-flash-attention

| Config | PPL | Δ vs f16 | K GiB/100K | V GiB/100K | Total GiB/100K | Savings |
|---|---|---|---|---|---|---|
| q8_0 | 11.1139 | -0.0704 | 9.72 | 9.72 | 19.44 | 47% |
| q8_0+q4_0 | 11.1215 | -0.0628 | 9.72 | 5.14 | 14.86 | 60% |
| f16 | 11.1843 | | 18.31 | 18.31 | 36.62 | |
| mxfp8 | 11.2512 | +0.0669 | 9.44 | 9.44 | 18.88 | 49% |
| mxfp6 | 11.2574 | +0.0731 | 7.15 | 7.15 | 14.30 | 61% |
| mxfp6+mxfp4 | 11.2794 | +0.0951 | 7.15 | 4.86 | 12.01 | 68% |
| mxfp8+mxfp4 | 11.2938 | +0.1095 | 9.44 | 4.86 | 14.30 | 61% |
| q4_0 | 11.3810 | +0.1967 | 5.14 | 5.14 | 10.28 | 72% |
| mxfp4 | 11.9138 | +0.7295 | 4.86 | 4.86 | 9.72 | 74% |

Comparison

| Config | master | mxfp-flash-attention |
|---|---|---|
| f16 | 11.1843 | 11.1843 |
| q8_0 | 11.1139 | 11.1139 |
| q8_0+q4_0 | 11.1215 | 11.1215 |
| q4_0 | 11.3810 | 11.3810 |

gemma-3n-e4b

Model: lmstudio-community/gemma-3n-E4B-it-text-GGUF/gemma-3n-E4B-it-Q8_0.gguf

master

| Config | PPL | Δ vs f16 | K GiB/100K | V GiB/100K | Total GiB/100K | Savings |
|---|---|---|---|---|---|---|
| q4_0 | 38.8477 | -0.0320 | 2.14 | 2.14 | 4.28 | 72% |
| q8_0 | 38.8676 | -0.0121 | 4.05 | 4.05 | 8.10 | 47% |
| f16 | 38.8797 | | 7.62 | 7.62 | 15.24 | |
| q8_0+q4_0 | 39.1937 | +0.3140 | 4.05 | 2.14 | 6.19 | 60% |

mxfp-flash-attention

| Config | PPL | Δ vs f16 | K GiB/100K | V GiB/100K | Total GiB/100K | Savings |
|---|---|---|---|---|---|---|
| mxfp4 | 35.8952 | -2.9845 | 2.02 | 2.02 | 4.04 | 74% |
| mxfp6+mxfp4 | 37.4399 | -1.4398 | 2.97 | 2.02 | 4.99 | 68% |
| mxfp8+mxfp4 | 37.9486 | -0.9311 | 3.93 | 2.02 | 5.95 | 62% |
| q4_0 | 38.8477 | -0.0320 | 2.14 | 2.14 | 4.28 | 72% |
| mxfp8 | 38.8489 | -0.0308 | 3.93 | 3.93 | 7.86 | 49% |
| q8_0 | 38.8676 | -0.0121 | 4.05 | 4.05 | 8.10 | 47% |
| f16 | 38.8797 | | 7.62 | 7.62 | 15.24 | |
| mxfp6 | 38.8864 | +0.0067 | 2.97 | 2.97 | 5.94 | 62% |
| q8_0+q4_0 | 39.1937 | +0.3140 | 4.05 | 2.14 | 6.19 | 60% |

Comparison

| Config | master | mxfp-flash-attention |
|---|---|---|
| f16 | 38.8797 | 38.8797 |
| q8_0 | 38.8676 | 38.8676 |
| q8_0+q4_0 | 39.1937 | 39.1937 |
| q4_0 | 38.8477 | 38.8477 |

Test logs: https://gist.github.com/timothyeburke/f176199437c9c886b2fdd8205b357451

Add MXFP KV cache quantization for flash attention using Struct-of-Arrays
(SoA) memory layout exclusively. Three MX types: MXFP4 (E2M1), MXFP8
(E4M3), MXFP6 (E2M3), implementing the OCP Microscaling v1.0 spec.

SoA layout stores [qs contiguous][e8m0 contiguous] per row, enabling
aligned memory access patterns for GPU backends. All functions in the
flash attention pipeline — set_rows quantization, Q preprocessing, K/V
dequantization — use SoA end-to-end. The existing AoS block layout
remains for MUL_MAT weight quantization (untouched).

Q preprocessing applies Walsh-Hadamard rotation (block-32) before
quantize/dequant round-trip, distributing outlier energy across the
shared exponent group. This is essential for perplexity:
  MXFP8: +0.22 PPL without rotation
  MXFP6: +3.34 PPL without rotation
Hadamard is skipped for MLA models (DK != DV) where V is a view of K.

Shared infrastructure in ggml-common.h:
- Block structures (block_mxfp8: 33B, block_mxfp6: 25B per 32 elements)
- E8M0 MSE-optimal scale search with ±1 range
- Canonical element converters (FP8 E4M3/E5M2, FP6 E2M3/E3M2)
- FP6 tight packing (4 six-bit values in 3 bytes, 25% savings)
- IEEE-754 bit reconstruction constants for SIMD backends
- SoA layout macros, portable bit cast, type property queries

CPU implementation:
- Scalar reference + ARM NEON + x86 AVX2 optimized paths
- Both FA paths supported: one_chunk (scalar) and tiled (SIMD GEMM)
- Split-KV path extended for single-query decode
- Generic vec_dot via dequant-to-float for MUL_MAT compatibility
- Arch fallbacks for loongarch, powerpc, riscv, s390, wasm

KV cache integration:
- set_rows writes SoA with optional Hadamard (op_params[0] flag)
- K cache block-aligned to 16 for CUDA cp.async compatibility
- CLI: --cache-type-k/v with short aliases (mxfp4, mxfp6, mxfp8)

Tests:
- Flash attention: all 3 types at D=64/128, mixed K/V (mxfp8+mxfp4)
- SET_ROWS: Hadamard rotation for all types
- SoA-aware test initialization and comparison for MXFP tensors
- Quantize functions coverage for all types

Rename GGML_TYPE_MXFP4 → GGML_TYPE_MXFP4_E2M1 across all backends
(CPU, OpenCL, SYCL) for consistency with the MX type family naming.
@timothyeburke timothyeburke requested review from a team and ggerganov as code owners March 15, 2026 21:33
Copilot AI review requested due to automatic review settings March 15, 2026 21:33

Copilot AI left a comment


Pull request overview

Adds MXFP (OCP Microscaling v1.0) KV-cache quantization support for flash attention using an end-to-end Struct-of-Arrays (SoA) layout, plus CPU reference/SIMD paths and CLI/test coverage updates.

Changes:

  • Introduces new GGML MXFP types (MXFP4_E2M1, MXFP8_E4M3, MXFP6_E2M3) and common infrastructure (converters, SoA helpers, Hadamard rotation, MSE scale search).
  • Extends CPU flash-attention pipelines (one_chunk + tiled) and KV-cache write path to support MXFP SoA with optional Hadamard rotation.
  • Updates tools/CLI/tests and several backends to recognize renamed/added MXFP types.

Reviewed changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 11 comments.

| File | Description |
|---|---|
| tools/llama-bench/llama-bench.cpp | Adds bench CLI type-name parsing for MXFP aliases/full names. |
| tests/test-backend-ops.cpp | Adds MXFP SoA init/dequant for tests; expands flash-attn and set_rows tests; updates renamed type usage. |
| src/llama-quant.cpp | Switches MoE default quant type to renamed MXFP4_E2M1. |
| src/llama-kv-cache.cpp | Pads MXFP K-cache allocation to 16-block alignment; uses padded strides in K views; sets Hadamard flag on MXFP K writes. |
| ggml/src/ggml.c | Registers new GGML types/traits; adds ggml_is_type_mxfp + Hadamard/type-property helpers; extends quantize dispatch. |
| ggml/src/ggml-sycl/mmvq.cpp | Updates SYCL mul-mat-vec switch to renamed MXFP4_E2M1. |
| ggml/src/ggml-sycl/convert.cpp | Updates SYCL conversion dispatch to renamed MXFP4_E2M1. |
| ggml/src/ggml-quants.h | Declares new MXFP quant/dequant APIs, SoA functions, element converters, packing, Hadamard. |
| ggml/src/ggml-quants.c | Implements MXFP scale search, element converter wrappers, FP6 packing, SoA quant/dequant, Hadamard wrapper; adds MXFP6/8 quantizers. |
| ggml/src/ggml-opencl/ggml-opencl.cpp | Updates OpenCL type checks/paths to renamed MXFP4_E2M1. |
| ggml/src/ggml-impl.h | Simplifies/clarifies E8M0-to-float helpers (kept local to impl header). |
| ggml/src/ggml-cpu/repack.cpp | Updates MXFP4 repack assertions and selection to renamed MXFP4_E2M1. |
| ggml/src/ggml-cpu/quants.h | Declares MXFP6/8 quantize + CPU dequant + SoA dequant entry points and generic fallbacks. |
| ggml/src/ggml-cpu/quants.c | Adds generic vec_dot for MXFP8/MXFP6 and generic CPU dequant/SoA dequant wrappers. |
| ggml/src/ggml-cpu/ops.cpp | Adds MXFP SoA quantize in set_rows (with optional Hadamard); adds MXFP support in flash-attn one_chunk/tiled + split-KV changes. |
| ggml/src/ggml-cpu/ggml-cpu.c | Registers CPU type traits for new MXFP8/MXFP6 and renamed MXFP4. |
| ggml/src/ggml-cpu/arch/x86/quants.c | Adds AVX2 MXFP8/MXFP6 vec_dot and AoS/SoA dequant kernels. |
| ggml/src/ggml-cpu/arch/wasm/quants.c | Adds wasm fallbacks for MXFP vec_dot entry points. |
| ggml/src/ggml-cpu/arch/s390/quants.c | Adds s390 fallbacks for MXFP8/MXFP6 vec_dot entry points. |
| ggml/src/ggml-cpu/arch/riscv/quants.c | Adds RISC-V fallbacks for MXFP8/MXFP6 vec_dot entry points. |
| ggml/src/ggml-cpu/arch/powerpc/quants.c | Adds PowerPC fallbacks for MXFP8/MXFP6 vec_dot entry points. |
| ggml/src/ggml-cpu/arch/loongarch/quants.c | Adds LoongArch fallbacks for MXFP vec_dot entry points. |
| ggml/src/ggml-cpu/arch/arm/quants.c | Adds NEON MXFP8/MXFP6 vec_dot and AoS/SoA dequant kernels. |
| ggml/src/ggml-cpu/arch-fallback.h | Adds MXFP generic aliasing and attempts to route dequant on non-arm/x86. |
| ggml/src/ggml-common.h | Adds MXFP constants, block structs, SoA offsets, LUTs, and canonical converters/Hadamard/E8M0 helpers. |
| ggml/include/ggml.h | Adds new public ggml_type/ftype enums and MXFP helper APIs. |
| ggml/include/ggml-cpu.h | Extends CPU type-traits struct with optional SIMD to_float hook. |
| common/arg.cpp | Adds MXFP cache types + short aliases to KV-cache CLI parsing. |


@github-actions github-actions bot added testing Everything test related examples ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language OpenCL Issues specific to the OpenCL backend labels Mar 15, 2026
@timothyeburke timothyeburke marked this pull request as draft March 15, 2026 22:48
…lize MXFP6 naming

Fix potential buffer overflows flagged in PR ggml-org#20609 review:
- set_rows: replace fixed float tmp[1024] with std::vector for large n_embd_k_gqa
- tiled FA: size q_mxfp_buf with ggml_row_size guard instead of fixed 1024
- one_chunk FA: pre-allocate k/v dequant buffers from mxfp.{k,v}_soa_elems
  instead of hard-coded float[4096] stack arrays
- kv-cache: assert n_embd_k_gqa % qk == 0 before integer division
- test init: assert soa_bytes % block_size == 0

Normalize MXFP6 function naming to match MXFP8 convention (short form
without element format suffix): mxfp6_e2m3 → mxfp6 in all function
identifiers across 14 files. Format-specific items (type enums, traits,
lookup tables, constants) retain their _e2m3 suffix.
- Per-head dequant: multihead MXFP now extracts only the needed head's
  SoA blocks (e.g. 20 bytes for mxfp4 DK=128) into a stack buffer and
  dequants DK elements, instead of dequanting all heads (nek2*DK).
  For 8 KV heads this is 8x less dequant work per KV position.

- Hoist loop invariants: base pointer offsets (k_base, v_base),
  per-head SoA byte offsets, and multihead row bases are computed once
  per query row instead of per KV position in the inner loop.

- Precompute SoA addressing in mxfp_fa_params_init: qs_per_block,
  blocks_per_head, head_qs_bytes, and head_e8m0_offset are calculated
  once at init rather than derived per iteration.

- Move thread-local buffer pointers (VKQ32, V32, VKQ16, Q_q) and
  v_is_f16 check outside the ir loop.
- Increase q_mxfp_buf from 512 to 2048 bytes (supports DK up to 1024 with MXFP8)
- Replace fixed k_soa[4096]/v_soa[4096] stack arrays with dynamically sized vectors
- Replace fixed k_head_soa[320]/v_head_soa[320] with dynamically sized vectors
- Add soa_bytes divisibility assertion in test init
… helpers

Add mxfp_dequant_traits_t to ggml-common.h as single source of truth for
MXFP IEEE-754 reconstruction parameters. Define static const instances for
all 4 formats (E4M3, E5M2, E2M3, E3M2), ready for CUDA/Metal/Vulkan reuse.

Extract shared dequant and FP6 unpack helpers on both architectures,
replacing duplicated inline code and macros. Net -215 lines.
…d dequant

Replace per-row/per-tile std::vector heap allocations with stack buffers
in set_rows, one_chunk, and tiled flash attention paths. Fix tiled path
to use per-head SoA extraction (matching one_chunk) instead of dequanting
the full multihead region per token.
@narinishi

Could you clarify the testing methodology specifically for gpt-oss-20b-MXFP4? The results seem anomalous.

@timothyeburke
Author

@narinishi I updated to the latest master and fixed a mistake I had made in the tiled GEMM paths. The full results log is at https://gist.github.com/timothyeburke/f176199437c9c886b2fdd8205b357451, and I've updated the results above; the comparison rows against master now match.

* cleanup : hoist mxfp soa functions

* fix: CI failures — CUDA __device__ init, Metal MXFP supports_op, SoA test assert

Three fixes for CI failures:

1. Remove <cmath> from CUDA/HIP/MUSA section of ggml-common.h — the include
   causes NAN/INFINITY to become non-constexpr, breaking __device__ static
   table initialization for the MXFP LUTs.

2. Add MXFP type guards to Metal's supports_op: MXFP8/MXFP6 have no Metal
   shaders yet (reject all ops), MXFP4 has AoS shaders (MUL_MAT, GET_ROWS)
   but no SoA/flash attention support yet (reject FLASH_ATTN_EXT, SET_ROWS).

3. Replace strict assert in test-backend-ops init_tensor_mxfp_soa with a
   conditional fallback — when ne2 is not divisible by heads_per_region,
   fall back to per-head SoA init instead of crashing.

* fix : correct guard for mxfp cpu dequant functions

* fix: CUDA MXFP LUT init and MXFP flash attention SoA test layout

- Add per-platform GGML_TABLE_NAN/GGML_TABLE_INFINITY macros for MXFP
  LUTs — uses __uint_as_float on CUDA to avoid MSVC non-constexpr INFINITY
- Fix init_tensor_mxfp_soa to detect multihead SoA from tensor strides,
  matching the KV cache layout for permuted flash attention tests

* fix: CUDA MXFP LUT init — use __builtin_nanf/__builtin_inff for constexpr device tables

CUDA/HIP/MUSA __device__ static tables require constexpr initializers.
Standard NAN/INFINITY macros may expand to non-constexpr expressions
(e.g. MSVC: (float)(1e+300), nvcc: __uint_as_float is not constexpr
for static init). Previous fix attempted __uint_as_float for nvcc and
__builtin_bit_cast for clang — neither worked universally.

Use __builtin_nanf("") and __builtin_inff() which are constexpr on
all target compilers (nvcc, clang for HIP/MUSA, GCC, MSVC). Define
once before the platform #if chain instead of per-platform copies.

* fix: correct E5M2 LUT precision and add converter-vs-LUT validation tests

The kvalues_mxfp8_e5m2 LUT had 50 values with insufficient decimal
precision, causing bitwise mismatches against the IEEE-754 element
converter. Regenerated from ggml_mxfp_fp8_e5m2_to_float() with %.9e
precision for exact float round-trip on all 256 entries.

Also consolidates GGML_TABLE_NAN/GGML_TABLE_INFINITY into a single
definition using __builtin_nanf/__builtin_inff (constexpr on all
target compilers), and adds LUT validation tests to test-quantize-fns
that verify all 5 MXFP element converters match their canonical LUT
values (FP4 E2M1: 16, FP6 E2M3: 64, FP6 E3M2: 64, FP8 E4M3: 256,
FP8 E5M2: 256 — 656 total values verified).

* fix: MSVC compat for GGML_TABLE_NAN/INFINITY — use builtins only on GCC/Clang/nvcc

MSVC does not support __builtin_nanf/__builtin_inff. Use standard
NAN/INFINITY macros on MSVC (which work for regular static tables),
and compiler builtins only on GCC/Clang/nvcc (needed for CUDA
__device__ table constexpr initialization).

* fix: handle nvcc+MSVC host — check __CUDACC__ before _MSC_VER for NAN/INF macros

When nvcc uses MSVC as the host compiler, both _MSC_VER and __CUDACC__
are defined. The previous fix checked _MSC_VER first, giving nvcc the
MSVC NAN/INFINITY macros which are not constexpr for __device__ tables.
Add __CUDACC__ exclusion so nvcc gets __builtin_nanf/__builtin_inff.

* cleanup: remove AoS MXFP6/MXFP8 dequant code — these types are KV-cache-only (SoA)

MXFP6 (E2M3) and MXFP8 (E4M3) exist only for KV cache flash attention,
which uses SoA (Struct-of-Arrays) layout. The AoS dequant functions
(NEON, AVX2, CPU dispatch, generic wrappers) were incorrectly added
and are dead code — no model stores weights in these formats.

Removed:
- AoS NEON dequant: dequantize_row_mxfp{6,8}_neon, _cpu dispatch
- AoS AVX2 dequant: dequantize_row_mxfp{6,8}_avx2, _cpu dispatch
- AoS generic wrappers: dequantize_row_mxfp{6,8}_cpu_generic
- AoS fallback defines in arch-fallback.h
- CPU traits .to_float entries for MXFP6/MXFP8
- MXFP6/MXFP8 from all_types[] in test-backend-ops (no AoS tests)

Kept (correct SoA code):
- All *_soa_* functions (NEON, AVX2, generic, dispatch)
- CPU traits .from_float_soa / .to_float_soa
- Flash attention and SET_ROWS Hadamard test cases
- Scalar reference dequant in ggml-quants.c (test-quantize-fns roundtrip)
- MXFP4 AoS code (upstream model weight support, untouched)

Fixes ARM64 CI failure: GET_ROWS(mxfp6_e2m3) was testing dead AoS code
that had a NEON bug. The test no longer runs because the type is
correctly excluded from AoS test paths.

* test: guard all MXFP types must have SoA traits for flash attention

All MXFP flash attention uses SoA layout exclusively. Test validates:
- ALL MXFP types (MXFP4, MXFP6, MXFP8) have from_float_soa and to_float_soa
- MXFP6/MXFP8 (KV-cache-only) do NOT have AoS CPU to_float

Prevents regression: if someone adds AoS dequant back for MXFP6/MXFP8,
or removes SoA traits from any MXFP type, CI will catch it.

* test: add Hadamard, SoA cross-check, E8M0, and layout offset tests

* test: add MXFP converter edge cases, FP6 packing, E8M0 known-answer tests

Add comprehensive tests to catch the bugs backend implementers hit most:
- Element converter edge cases: subnormals, max finite, saturation, NaN, sign
- FP6 pack/unpack exhaustive round-trip with known-answer byte verification
- E8M0 known-answer decode + HALF vs FULL scale distinction
- E8M0 rounding boundary at sqrt(2) threshold (catches floor-only bugs)
- Converter exhaustive round-trip: quantize(dequantize(i))==i for all formats
- Consolidate duplicate SoA switches into single table in test-backend-ops

* test: add AoS/SoA cross-check, Hadamard pipeline, format spec, and mxfp_rmse

- MXFP4 AoS vs SoA cross-check: two independent code paths, bitwise match
- Full Hadamard pipeline roundtrip: H→quantize→dequant→H for all 3 types
- mxfp_rmse helper: computes sqrt(sum/n), with named pipeline constants
- Block size consistency: verify QK_MXFP{4,8,6} == 32
- EMAX_OFFSET vs format max: validate constants produce valid E8M0
- Edge case LUT validation: expected_bits verified against canonical LUTs
- FP4 E2M1 exhaustive converter round-trip (16/16)

* cleanup: tighten MXFP test comments to match repo conventions

* fix: platform-specific NaN/Infinity for GPU device table initializers

FP8 E4M3/E5M2 LUTs contain NaN/Inf which cannot be constexpr-initialized
in __device__ tables on any CUDA/HIP/MUSA version. No GPU backend uses
these LUTs (they use converter functions instead), so guard them out of
GPU builds entirely. Simplify GGML_TABLE_NAN/INFINITY to CPU-only macros.
@github-actions github-actions bot added the Apple Metal https://en.wikipedia.org/wiki/Metal_(API) label Mar 22, 2026
* cleanup: consolidate MXFP type aliases, fix SoA linker bug on 5 platforms

- Add GGML_TYPE_MXFP8 and GGML_TYPE_MXFP6 short aliases (matching
  existing GGML_TYPE_MXFP4 pattern) and use short names consistently
  throughout the codebase instead of mixing long/short forms.

- Fix missing SoA dequant symbols (dequantize_row_mxfp{4,8,6}_soa_cpu)
  on loongarch, powerpc, riscv, s390, and wasm by adding proper aliases
  to each arch section in arch-fallback.h. Previously these were only
  defined under GGML_CPU_GENERIC, causing linker failures on those
  platforms when using MXFP flash attention.

- Remove 10 files from the PR diff:
  - 5 arch stub files replaced by arch-fallback.h aliases
  - 5 rename-only files (sycl, opencl, repack, llama-quant) reverted
    since the GGML_TYPE_MXFP4 compat alias handles them

* cleanup: DRY FP6 unpack, extract mxfp_kv_params + mxfp_dequant_head helper

- FP6 unpack: x86 and ARM SIMD versions now call ggml_mxfp_unpack_fp6x4()
  from ggml-common.h instead of duplicating the scalar bit manipulation.

- Extract mxfp_kv_params sub-struct from mxfp_fa_params: the 7 symmetric
  K/V fields (dequantize, multihead, soa_elems, qs_per_block,
  head_qs_bytes, head_e8m0_offset, blocks_per_head) are now in a reusable
  struct accessed as mxfp.k and mxfp.v.

- Add mxfp_dequant_head() helper: replaces 4 instances of the multihead
  SoA extraction pattern (2x memcpy + dequant, with multihead/single-head
  branching) with a single function call. Future backends get the pattern
  for free.

* cleanup: extract mxfp_kv_params_init to DRY the K/V init blocks

The K and V initialization in mxfp_fa_params_init were structurally
identical 10-line blocks differing only by tensor/dimension. Extract
into mxfp_kv_params_init(type, D, nb2, ne2) so future MXFP formats
get the multihead SoA addressing logic automatically.

* cleanup: generic MSE round-trip, replace magic buffer sizes with constants

- Remove mse_error_fp8_e4m3 and mse_error_fp6_e2m3: these were identical
  round-trip functions differing only by converter. mxfp_compute_e8m0_mse
  now uses to_elem/to_float directly when mse_error is NULL (FP8/FP6).
  MXFP4 keeps its custom decision-tree MSE. New formats get MSE for free
  by just setting to_elem/to_float in their traits.

- Replace magic 1024/1088 buffer sizes in flash attention with named
  constants MXFP_FA_MAX_D and MXFP_FA_SOA_BUF. One place to change if
  max head dimension grows.

* cleanup: remove dead AoS vec_dot for MXFP8/MXFP6, unify SoA impls

MXFP8 and MXFP6 are KV-cache-only types that use SoA layout for flash
attention. The AoS vec_dot functions (scalar generic, AVX2, NEON) were
dead code — no matmul path uses them.

Removed:
- ggml_vec_dot_mxfp{8,6}_q8_0 from scalar, x86, ARM, quants.h
- ggml_vec_dot_mxfp_q8_0_impl shared helper
- arch-fallback.h aliases for vec_dot mxfp8/mxfp6 (12 lines)
- vec_dot/vec_dot_type registration in ggml-cpu.c

Also unified SoA quantize/dequant: the separate mxfp8_soa_impl and
mxfp6_soa_impl functions (4 functions, ~80 lines) are replaced by two
generic functions (quantize_row_mxfp_soa_impl, dequantize_row_mxfp_soa_impl)
that use traits->bits_per_elem and traits->qs_per_block to handle both
byte-aligned (FP8) and 6-bit packed (FP6) formats. New MXFP formats
get SoA for free by setting these trait fields.

* cleanup: remove all AoS MXFP8/MXFP6 quantize/dequant — SoA only

MXFP8 and MXFP6 are KV-cache-only types. All quantization and
dequantization goes through the SoA (Struct-of-Arrays) path for flash
attention. The AoS (block_mxfp8/block_mxfp6 struct) implementations
were dead code that should never have been added.

Removed:
- quantize_row_mxfp{8,6}_impl, dequantize_row_mxfp{8,6}_impl
- quantize_row_mxfp{8,6}_ref, dequantize_row_mxfp{8,6}
- quantize_mxfp{8,6} (ggml_quantize_chunk wrappers)
- All declarations from ggml-quants.h and quants.h
- to_float/from_float_ref registrations from ggml.c type traits
- from_float registration from ggml-cpu.c CPU traits

Block struct definitions (block_mxfp8, block_mxfp6) are retained for
sizeof() in type traits and validate_row_data.

* cleanup: fail fast in ggml_quantize_chunk for KV-cache-only types

Add explicit GGML_ABORT for MXFP8/MXFP6 in ggml_quantize_chunk —
these are KV-cache-only types that use SoA layout via from_float_soa.
Attempting AoS quantization through this entry point is a bug.
…t coverage

* fix: correct tiled flash attention SoA pointer math for multihead MXFP

The cleanup refactoring (c919bc4) extracted mxfp_dequant_head as a
shared helper but failed to update the tiled path's data pointers.
The helper expects the full SoA row base (no per-head offset), but the
tiled path was passing a pointer that already included ik2*nbk2, causing
a double head offset that produced NaN during prefill.

Add mxfp_row_ptr helper to centralize the multihead-aware pointer
calculation across both one_chunk and tiled paths. Verified with 16-chunk
perplexity on gpt-oss-20b: all four configs (f16, mxfp4, mxfp6, mxfp8)
produce exact matches with the known-good commit (23e8863).

* perf: reduce E8M0 MSE search range from ±2 to ±1

The base estimate round(log2(amax)) is always within 1 step of optimal.
Empirically verified across 30K blocks and 6 distributions: ±1 and ±2
never disagree. This reduces the scale search from 5 to 3 candidates
(40% fewer inner loop iterations) with zero quality impact.

* perf: eliminate redundant work in MXFP quantize and flash attention

- mse_error_mxfp4: use passed inv_scale instead of recomputing 1/d
- mxfp_compute_e8m0_mse: hoist loop-invariant traits branch out of inner loop
- tiled V path: dequant directly to V32 tile, remove intermediate memcpy and dead buffer

* cleanup: fix comments, unify Hadamard condition, simplify E8M0 helpers

- EMAX_OFFSET comments: fix ceil/floor labels to match actual values
- Hadamard flag: unify write path (llama-kv-cache.cpp) and read path
  (ops.cpp) to both use DK==DV condition instead of is_mla()
- E8M0 helpers in ggml-impl.h: simplify to match ggml-common.h style,
  add cross-reference comment

* fix: MXFP8/6 flash attention tests crash on init

The view base tensors for K/V don't get named "k"/"v" but inherit the
MXFP type. The name-based filter in initialize_tensors missed them,
falling through to init_tensor_uniform which calls quantize_chunk and
aborts for KV-cache-only types. Fix by checking ggml_is_type_mxfp() for
all tensors, matching the pattern set_rows tests already use.

* test: expand MXFP set_rows coverage

- Add MXFP8/MXFP6 to all_types for non-Hadamard set_rows coverage
- Expand Hadamard set_rows tests: add views, broadcast, and multi-head configs
- Coverage: 18 → 51 MXFP set_rows tests

* perf: add AVX2 Hadamard for x86 (matches existing ARM NEON path)

* cleanup: DRY MXFP4 quantize/dequant with shared per-block helpers

Extract quantize_block_mxfp4 and dequantize_block_mxfp4 as shared
helpers used by both AoS (quantize_row_mxfp4_ref, dequantize_row_mxfp4)
and SoA (quantize_row_mxfp4_soa, dequantize_row_mxfp4_soa) paths.
Eliminates duplicated per-block logic while keeping layout-specific
pointer arithmetic in the callers.

* feat: add MXFP8/MXFP6 AoS quantize/dequant (full type support)

Extract quantize_block_mxfp / dequantize_block_mxfp per-block helpers
from the SoA generic impl and use them to build AoS row functions for
MXFP8 (E4M3) and MXFP6 (E2M3). Register to_float and from_float_ref
in type traits, add quantize_chunk dispatch, replacing the GGML_ABORT.

MXFP8 and MXFP6 are no longer KV-cache-only — they can now be used
as general quantization types. The SoA impl is also DRY'd to delegate
to the same per-block helpers.

* cleanup: remove dead soa_elems field from mxfp_kv_params

Computed but never read — leftover from an earlier design.

* feat: add MXFP8/MXFP6 vec_dot and full CPU type support

Add scalar vec_dot_mxfp8_q8_0 and vec_dot_mxfp6_q8_0 implementations,
register from_float + vec_dot + vec_dot_type in CPU traits, and add
fallback remaps for all architectures. MXFP8/6 are now fully tested:
AoS quantization error, reference match, and dot product accuracy all
pass in test-quantize-fns.

* perf: remove E8M0 MSE search — base estimate is perplexity-optimal

The MSE search over ±1 candidates around round(log2(amax)) was found to
HURT perplexity by 4-37 PPL points across all MXFP configs on gpt-oss-20b.
The base estimate alone (no search) produces better attention patterns
because minimizing per-block reconstruction error is not the same as
minimizing attention score distortion through softmax.

Removes mse_error_mxfp4, mse_error field from traits, MSE_RANGE constant,
and the entire search loop. E8M0 computation is now a single amax scan +
integer bit extraction — no inner loop, no function pointers. This also
simplifies future GPU/Metal implementations.

* perf: fuse Hadamard rotation into SoA quantize (one pass, no temp buffer)

Add quantize_row_mxfp{4,8,6}_soa_hadamard that apply Hadamard and
quantize block-by-block with a 32-float stack buffer. Eliminates the
std::vector heap allocation and 2 extra memory passes over the full row.

set_rows now dispatches to the fused path when Hadamard is enabled,
falling through to the unfused quantize for non-Hadamard types.

This pattern maps directly to a CUDA kernel: global memory read →
register Hadamard → register quantize → global memory write.

* cleanup: consistent MXFP type names and variable naming

- Rename type_name "mxfp8_e4m3" → "mxfp8", "mxfp6_e2m3" → "mxfp6"
  to match "mxfp4". Only one variant of each exists — the suffix was
  unnecessary disambiguation that implied alternatives.
- Remove redundant MXFP shortcuts from arg.cpp (fallback loop handles
  all types via ggml_type_name matching).
- Rename kv_is_f32_f16_or_mxfp → k_is_f32_f16_or_mxfp (only checks K).

* perf: fuse Q preprocessing round-trip (no SoA buffer needed)

Add mxfp{4,8,6}_hadamard_roundtrip and mxfp{4,8,6}_roundtrip functions
that apply quantization error to float values without materializing SoA
bytes. Replaces the 3-step Q preprocessing (Hadamard → quantize to SoA
buffer → dequant from SoA buffer) with a single pass through per-block
round-trip helpers.

Eliminates the Q_q intermediate buffer and two function pointer calls
from the flash attention hot path. Maps directly to CUDA: Q stays in
registers, Hadamard butterfly + quantize error applied in-place.

* fix: clamp E8M0 = 255 to 254 in decode (fixes CI NaN failures)

E8M0 = 255 means NaN per MX spec, but our encode path already clamps
to 254. When test data contains random E8M0 = 255 bytes, the decode
produces Inf, and Inf * 0.0 = NaN, causing GET_ROWS and CPY tests to
fail on MXFP6 (and potentially MXFP4/8).

Fix: clamp 255 → 254 in both E8M0 decode functions:
  - ggml_e8m0_to_fp32 / ggml_e8m0_to_fp32_half (ggml-impl.h)
  - ggml_mxfp_e8m0_to_fp32 / ggml_mxfp_e8m0_to_fp32_half (ggml-common.h)

These are unfortunately duplicated across two headers because
ggml-common.h compiles for CUDA (__device__) while ggml-impl.h serves
CPU-only callers that don't include ggml-common.h.
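The clamped decode is tiny; a sketch (function name illustrative) showing why the clamp is safe and why it prevents the NaN cascade:

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* E8M0 decode with the 255 -> 254 clamp. 255 encodes NaN in the MX
 * spec, but the encode path never emits it, so decoding it as the max
 * finite scale avoids Inf (and the downstream Inf * 0.0 = NaN) when
 * test data contains random scale bytes. */
static float e8m0_to_fp32(uint8_t b) {
    if (b == 255) b = 254;
    return ldexpf(1.0f, (int)b - 127);   /* 2^(b - 127) */
}
```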
@ggerganov
Member

Just a heads up that this PR is too large and since it is heavily AI generated by a new contributor to the project, it will not be accepted in similar form. If you want to have any of this contributed, you would need to split it into smaller parts and use your own words to discuss and demonstrate that you really understand what you are doing and that you will be able to maintain it in the future.

