
ggml-cpu: use LUT for converting e8->f32 scales on x86#19288

Merged
am17an merged 2 commits into ggml-org:master from am17an:mxfp4-cpu-scale on Feb 4, 2026

Conversation

@am17an
Contributor

@am17an am17an commented Feb 3, 2026

perf showed the e8m0->f32 conversion function as a bottleneck. Use a LUT instead. Tested only on x86.

| Model | Test | t/s topk-cuda-refactor | t/s mxfp4-cpu-scale | Speedup |
| --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | pp1024 | 237.74 | 257.53 | 1.08 |
| gpt-oss 20B MXFP4 MoE | pp2048 | 228.16 | 246.40 | 1.08 |
| gpt-oss 20B MXFP4 MoE | pp4096 | 211.92 | 227.59 | 1.07 |
| gpt-oss 20B MXFP4 MoE | pp8192 | 185.53 | 197.05 | 1.06 |

@am17an am17an requested a review from ggerganov as a code owner February 3, 2026 10:03
@ggerganov
Member

On M2 Ultra it does not make a difference:

| Model | Test | t/s master | t/s pr/19288 | Speedup |
| --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | pp512 | 182.14 | 182.79 | 1.00 |
| gpt-oss 20B MXFP4 MoE | pp1024 | 178.52 | 178.44 | 1.00 |
| gpt-oss 20B MXFP4 MoE | pp2048 | 170.37 | 170.38 | 1.00 |

Comment on lines +272 to +273
```c
return _mm256_set_m128(_mm_set1_ps(ggml_table_f32_e8m0_half[x1] * GGML_CPU_FP16_TO_FP32(y1)),
                       _mm_set1_ps(ggml_table_f32_e8m0_half[x0] * GGML_CPU_FP16_TO_FP32(y0)));
```
Member

Instead of using the table explicitly, define 2 macros:

```c
#define GGML_COMPUTE_E8M0_TO_FP32_HALF(...) ...
#define GGML_E8M0_TO_FP32_HALF(x) ggml_table_f32_e8m0_half[(x)]
```

This way the existing code keeps using GGML_E8M0_TO_FP32_HALF, which redirects to the table lookup.

Contributor Author

I didn't do this because other parts of the code also use this function, and I'm not sure what the performance impact would be for them.

Member

I see. You can still use the same pattern as we do here:

```c
// precomputed f32 table for f16 (256 KB)
// defined in ggml-cpu.c, initialized in ggml_cpu_init()
extern float ggml_table_f32_f16[1 << 16];

// On ARM NEON, it's quicker to directly convert x -> x instead of calling into ggml_lookup_fp16_to_fp32,
// so we define GGML_CPU_FP16_TO_FP32 and GGML_CPU_FP32_TO_FP16 elsewhere for NEON.
// This is also true for POWER9.
#if !defined(GGML_CPU_FP16_TO_FP32)
inline static float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
    uint16_t s;
    memcpy(&s, &f, sizeof(uint16_t));
    return ggml_table_f32_f16[s];
}
#define GGML_CPU_FP16_TO_FP32(x) ggml_lookup_fp16_to_fp32(x)
#endif

#if !defined(GGML_CPU_FP32_TO_FP16)
#define GGML_CPU_FP32_TO_FP16(x) GGML_COMPUTE_FP32_TO_FP16(x)
#endif
```

And forward the GGML_E8M0_TO_FP32_HALF to lookup only if __AVX__ is defined for example.

@am17an
Contributor Author

am17an commented Feb 3, 2026

> On M2 Ultra it does not make a difference:

I only made the change for x86; wouldn't the M2 Ultra be ARM?

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Feb 3, 2026
@am17an
Contributor Author

am17an commented Feb 3, 2026

@ggerganov I have an M4 MacBook Air. When I run llama-bench with vanilla settings I get a Metal,BLAS backend. Is there a way to force ARM? Or maybe I am not understanding something w.r.t. unified memory.

Also one more question: is there a way to skip the depth computation when doing llama-bench? I find it takes a long time for the context to fill up; I was wondering if there is a way to fill the kv-cache randomly to a specified depth.

@ggerganov
Member

To disable GPU and AMX acceleration, build with:

```shell
cmake -DGGML_METAL=OFF -DGGML_BLAS=OFF ...
```

For the compare-commit.sh script, use something like:

```shell
CMAKE_OPTS="-DGGML_METAL=OFF -DGGML_BLAS=OFF" scripts/compare-commits.sh master pr/19288 llama-bench -m ~/models/gpt-oss-20b/ggml-model-mxfp4.gguf -fa 1 -ub 2048 -p 0 -n 0 -r 3 -mmp 1 -t 16 -n 32,32,32
```

> Also one more question: is there a way to skip the depth computation when doing llama-bench?

The -d argument is the best we have atm. It computes the context up to d once and then can run multiple -p and -n with that pre-computed context. It still needs to be computed one time.

Filling with random data is not going to work because it will not activate the correct experts for MoE models.
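For illustration, a depth run as described above might look like this (hypothetical invocation; the model path is taken from the compare-commits example above, the depth and prompt/generation sizes are made up):

```shell
# Compute a 4096-token context once, then reuse it for the -p/-n measurements
llama-bench -m ~/models/gpt-oss-20b/ggml-model-mxfp4.gguf -fa 1 -d 4096 -p 512 -n 32
```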

@am17an am17an merged commit 2ceda3f into ggml-org:master Feb 4, 2026
76 of 78 checks passed
@am17an am17an deleted the mxfp4-cpu-scale branch February 4, 2026 01:43
agent-enemy-2 pushed a commit to agent-enemy-2/llama.cpp that referenced this pull request Feb 4, 2026
* ggml-cpu: use LUT for converting e8->f32 scales on x86

* add dispatch based on macro
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
* ggml-cpu: use LUT for converting e8->f32 scales on x86

* add dispatch based on macro