ggml-cpu: use LUT for converting e8->f32 scales on x86 #19288

am17an merged 2 commits into ggml-org:master from
Conversation
On M2 Ultra it does not make a difference:
ggml/src/ggml-cpu/arch/x86/quants.c
Outdated
```c
return _mm256_set_m128(_mm_set1_ps(ggml_table_f32_e8m0_half[x1] * GGML_CPU_FP16_TO_FP32(y1)),
                       _mm_set1_ps(ggml_table_f32_e8m0_half[x0] * GGML_CPU_FP16_TO_FP32(y0)));
```
Instead of using the table explicitly, define 2 macros:
```c
#define GGML_COMPUTE_E8M0_TO_FP32_HALF(...) ...
#define GGML_E8M0_TO_FP32_HALF(x) ggml_table_f32_e8m0_half[(x)]
```

This way the existing code keeps using GGML_E8M0_TO_FP32_HALF, which redirects to the table lookup.
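A minimal, compilable sketch of this dispatch, assuming the `__AVX__` gate suggested below; the table here is a local stand-in for ggml's real `ggml_table_f32_e8m0_half`, and `init_e8m0_half_table` is a hypothetical initializer:

```c
#include <math.h>

/* Inline computation: an E8M0 byte x encodes 2^(x - 127); the "half"
 * variant returns half that value, i.e. 2^(x - 127) / 2. */
#define GGML_COMPUTE_E8M0_TO_FP32_HALF(x) ldexpf(0.5f, (int)(x) - 127)

/* Stand-in for the real table; filled once at startup. */
static float ggml_table_f32_e8m0_half[256];

static void init_e8m0_half_table(void) {
    for (int i = 0; i < 256; ++i) {
        ggml_table_f32_e8m0_half[i] = GGML_COMPUTE_E8M0_TO_FP32_HALF(i);
    }
}

#if defined(__AVX__)
    /* x86 hot path: one table load per scale */
    #define GGML_E8M0_TO_FP32_HALF(x) ggml_table_f32_e8m0_half[(x)]
#else
    /* other architectures keep the inline computation */
    #define GGML_E8M0_TO_FP32_HALF(x) GGML_COMPUTE_E8M0_TO_FP32_HALF(x)
#endif
```

Existing call sites keep using GGML_E8M0_TO_FP32_HALF unchanged; only the macro definition decides whether it resolves to a lookup or an inline computation.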
I didn't do this because other parts of the code also use this function, and I'm not sure what the performance impact would be for them.
I see. You can still use the same pattern as we do here:
llama.cpp/ggml/src/ggml-cpu/simd-mappings.h
Lines 115 to 134 in e9a859d
And forward GGML_E8M0_TO_FP32_HALF to the table lookup only if __AVX__ is defined, for example.
I only made the change for x86; the M2 Ultra would be ARM?
@ggerganov I have an M4 MacBook Air. When I run llama-bench on vanilla settings I get a
To disable GPU and AMX acceleration, build with:

```shell
cmake -DGGML_METAL=OFF -DGGML_BLAS=OFF ...
```

For the comparison:

```shell
CMAKE_OPTS="-DGGML_METAL=OFF -DGGML_BLAS=OFF" scripts/compare-commits.sh master pr/19288 \
    llama-bench -m ~/models/gpt-oss-20b/ggml-model-mxfp4.gguf \
    -fa 1 -ub 2048 -p 0 -n 0 -r 3 -mmp 1 -t 16 -n 32,32,32
```
Filling with random data is not going to work because it will not activate the correct experts for MoE models.
* ggml-cpu: use LUT for converting e8->f32 scales on x86
* add dispatch based on macro
perf showed the e8m0->f32 conversion function as a bottleneck. Use a LUT instead. Tested only on x86.
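For context, a sketch of the kind of scalar conversion such a LUT replaces: building the f32 bit pattern directly from the E8M0 exponent byte. The function name is hypothetical, and the edge encodings (which E8M0 reserves for the smallest value and NaN) are deliberately ignored here:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of a scalar e8m0 -> f32 "half" conversion. An E8M0 byte x
 * encodes 2^(x - 127); placing (x - 1) in the f32 exponent field
 * yields half that value. Valid only for the normal range
 * (roughly 2 <= x <= 254); real code must special-case the edges. */
static inline float e8m0_to_fp32_half(uint8_t x) {
    uint32_t bits = (uint32_t)(x - 1) << 23;
    float f;
    memcpy(&f, &bits, sizeof f); /* bit-cast without aliasing UB */
    return f;
}
```

A 256-entry float table precomputing these values is only 1 KiB, so it stays resident in L1 cache, trading the shift and bit-cast in the hot loop for a single load.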