fix: add POSIX functionality for Linux compilation #51
ggerganov merged 2 commits into ggml-org:master from
Conversation
Can you add a short comment why this is required? (I assume some functions are not recognized, but it is not obvious which ones.)
Sure!
I meant into the source code, just above the

Btw, the issue #54 mentions
Which OS and which GCC version do you use? For me, the master compiles just fine on Debian 11 with GCC 10.2.1 (even without the proposed define).
CentOS 7 with GCC 10.2.0
This flag has been discussed in
Now thinking more about this, probably the cleanest option is to add compilation flags to the build system, as suggested in ggml-org/whisper.cpp#37 + ggml-org/whisper.cpp#576. We can probably set this for all compilers on all OSes, because either the compiler understands the flag and sets the value to the supported level, or the flag is ignored.

@valentynbez can you confirm this fixed the issue for you on CentOS 7?
* NEON Flash Attention: add support for Q8_0, Q4_0, Q4_1
* NEON Flash Attention: quantized K*Q for q4_0

  I could finally take advantage of the matrix multiplication templates. We get quite a bit of speedup that way for q4_0: for Gemma-2b, using mul_mat_qX_0_q8_0<DequantizerQ40, q_step> results in PP-2048 = 287 t/s vs 268 t/s when converting the q4_0 k-cache and Q to fp16 and using fp16 multiplication.

* NEON Flash Attention: quantized K*Q for q4_1
* NEON Flash Attention: quantized K*Q for q8_0

  This makes quite a bit of difference: for Gemma2-2b, PP-8192 is 228 t/s with quantized K*Q vs 178 t/s when converting things to fp16 and using fp16 matrix multiplication. We have PP-512 = 307 t/s, so PP-8192 is now ~75% of the performance of PP-512. In contrast, llama.cpp with Q8_0 cache is at 38% of PP-512.

* Zen4 Flash Attention: quantized K*Q for q4_0, q4_1, q8_0
* AVX2 Flash Attention: quantized K*Q for q4_0, q4_1, q8_0
* Tidy up FlashMS
* Delete no longer used stuff

  With the usage of quantized matrix multiplications for the quantized k- and/or v-cache, we no longer need the helper methods loading entire rows.

* Disallow mixing bf16 with other types for kv caches

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
- turbo4 K+V results on Qwen3.5-27B (-0.32% vs q8_0) and Qwen3-14B (+6.3%)
- Sparse V dequant benchmarks: MoE native dequant +10.9% at 8K
- Gemma-3 turbo3 results post-iSWA fix (+3.3%)
- KVLinC no-K-rotation negative result
- Speculative decoding negative result
- CUDA 13.2 compatibility verified
- Experiments TheTom#31, TheTom#39, TheTom#42, TheTom#45, ggml-org#49, ggml-org#50, ggml-org#51 status updates

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Small fix to compile binaries properly on Linux:

CLOCK_MONOTONIC in ggml.c