
ggml-cpu: use LUT for converting e8->f32 scales on x86#19288

Merged
am17an merged 2 commits into ggml-org:master from am17an:mxfp4-cpu-scale on Feb 4, 2026

Conversation

@am17an
Contributor

@am17an am17an commented Feb 3, 2026

perf showed the e8m0->f32 conversion function as a bottleneck. Use a LUT instead. Tested only on x86.

| Model | Test | t/s topk-cuda-refactor | t/s mxfp4-cpu-scale | Speedup |
| --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | pp1024 | 237.74 | 257.53 | 1.08 |
| gpt-oss 20B MXFP4 MoE | pp2048 | 228.16 | 246.40 | 1.08 |
| gpt-oss 20B MXFP4 MoE | pp4096 | 211.92 | 227.59 | 1.07 |
| gpt-oss 20B MXFP4 MoE | pp8192 | 185.53 | 197.05 | 1.06 |

@am17an am17an requested a review from ggerganov as a code owner February 3, 2026 10:03
@ggerganov
Member

On M2 Ultra it does not make a difference:

| Model | Test | t/s master | t/s pr/19288 | Speedup |
| --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | pp512 | 182.14 | 182.79 | 1.00 |
| gpt-oss 20B MXFP4 MoE | pp1024 | 178.52 | 178.44 | 1.00 |
| gpt-oss 20B MXFP4 MoE | pp2048 | 170.37 | 170.38 | 1.00 |

Comment on lines +272 to +273
```c
return _mm256_set_m128(_mm_set1_ps(ggml_table_f32_e8m0_half[x1] * GGML_CPU_FP16_TO_FP32(y1)),
                       _mm_set1_ps(ggml_table_f32_e8m0_half[x0] * GGML_CPU_FP16_TO_FP32(y0)));
```
Member

Instead of using the table explicitly, define 2 macros:

```c
#define GGML_COMPUTE_E8M0_TO_FP32_HALF(...) ...
#define GGML_E8M0_TO_FP32_HALF(x) ggml_table_f32_e8m0_half[(x)]
```

This way the existing code keeps using GGML_E8M0_TO_FP32_HALF, which redirects to the table lookup.

Contributor Author

I didn't do this because other parts of the code also use this function, and I'm not sure what the performance impact would be for them.

Member

I see. You can still use the same pattern as we do here:

```c
// precomputed f32 table for f16 (256 KB)
// defined in ggml-cpu.c, initialized in ggml_cpu_init()
extern float ggml_table_f32_f16[1 << 16];

// On ARM NEON, it's quicker to directly convert x -> x instead of calling into ggml_lookup_fp16_to_fp32,
// so we define GGML_CPU_FP16_TO_FP32 and GGML_CPU_FP32_TO_FP16 elsewhere for NEON.
// This is also true for POWER9.
#if !defined(GGML_CPU_FP16_TO_FP32)
inline static float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
    uint16_t s;
    memcpy(&s, &f, sizeof(uint16_t));
    return ggml_table_f32_f16[s];
}
#define GGML_CPU_FP16_TO_FP32(x) ggml_lookup_fp16_to_fp32(x)
#endif

#if !defined(GGML_CPU_FP32_TO_FP16)
#define GGML_CPU_FP32_TO_FP16(x) GGML_COMPUTE_FP32_TO_FP16(x)
#endif
```

And forward the GGML_E8M0_TO_FP32_HALF to lookup only if __AVX__ is defined for example.

@am17an
Contributor Author

am17an commented Feb 3, 2026

> On M2 Ultra it does not make a difference:

I only made the change for x86; wouldn't the M2 Ultra be ARM?

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Feb 3, 2026
@am17an
Contributor Author

am17an commented Feb 3, 2026

@ggerganov I have an M4 MacBook Air. When I run llama-bench with vanilla settings I get a Metal,BLAS backend. Is there a way to force ARM? Or maybe I am not understanding something w.r.t. unified memory.

Also one more question: is there a way to skip the depth computation when doing llama-bench? I find it takes a long time for the context to fill up; I was wondering if there is a way to fill the kv-cache randomly to a specified depth.

@ggerganov
Member

To disable GPU and AMX acceleration, build with:

```shell
cmake -DGGML_METAL=OFF -DGGML_BLAS=OFF ...
```

For the compare-commit.sh script, use something like:

```shell
CMAKE_OPTS="-DGGML_METAL=OFF -DGGML_BLAS=OFF" scripts/compare-commits.sh master pr/19288 llama-bench -m ~/models/gpt-oss-20b/ggml-model-mxfp4.gguf -fa 1 -ub 2048 -p 0 -n 0 -r 3 -mmp 1 -t 16 -n 32,32,32
```

> Also one more question: is there a way to skip the depth computation when doing llama-bench?

The -d argument is the best we have atm. It computes the context up to d once and then can run multiple -p and -n with that pre-computed context. It still needs to be computed one time.

Filling with random data is not going to work because it will not activate the correct experts for MoE models.
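For illustration, a depth run as described above might look like this (hypothetical invocation; the model path is taken from the compare-commits example above, the depth and prompt/generation sizes are made up):

```shell
# Compute a 4096-token context once, then reuse it for the -p/-n measurements
llama-bench -m ~/models/gpt-oss-20b/ggml-model-mxfp4.gguf -fa 1 -d 4096 -p 512 -n 32
```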

@am17an am17an merged commit 2ceda3f into ggml-org:master Feb 4, 2026
76 of 78 checks passed
@am17an am17an deleted the mxfp4-cpu-scale branch February 4, 2026 01:43
agent-enemy-2 pushed a commit to agent-enemy-2/llama.cpp that referenced this pull request Feb 4, 2026
* ggml-cpu: use LUT for converting e8->f32 scales on x86

* add dispatch based on macro
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
* ggml-cpu: use LUT for converting e8->f32 scales on x86

* add dispatch based on macro