forked from ggml-org/llama.cpp
Porting TurboQuant to Windows (MSVC): Compatibility fixes and Python IDs #39
Hi Tom!
First of all, thank you for the amazing work on TurboQuant. It's a massive game-changer! I managed to get it compiled and running on Windows with a mobile RTX 4070 (8GB), getting an incredible 17 t/s on Nemotron 30B (MoE).
However, the current feature/turboquant-kv-cache branch doesn't compile out of the box on Windows using the Microsoft C++ compiler (MSVC). I had to apply a few manual patches to get a successful build.
I'm sharing these fixes here in hopes they can be integrated to make Windows builds seamless:
1. Missing M_PI math constant
In ggml-turbo-quant.c, MSVC only exposes math constants such as M_PI when _USE_MATH_DEFINES is defined before the include:
```c
#define _USE_MATH_DEFINES
#include <math.h>
```
2. Missing variable in ops.cpp
The build failed because turbo3_cpu_wht_group_size was never defined. I had to add this definition:
```cpp
int turbo3_cpu_wht_group_size = 1;
```
3. MSVC linker/scope errors with g_innerq_scale_inv_host
MSVC throws "undeclared identifier" and linker errors because of how the extern declarations are handled across files. I had to clean up the existing extern declarations of g_innerq_scale_inv_host in the headers/CUDA files and declare it explicitly at the very top of llama-kv-cache.cpp:
```cpp
/* MSVC LINKER BYPASS */
float * g_innerq_scale_inv_host = nullptr;
bool turbo_innerq_needs_tensor_update(void) { return false; }
void turbo_innerq_mark_tensor_updated(void) {}
```
4. Flash Attention incompatibility
Currently, compiling with -DGGML_FLASH_ATTN=ON crashes the MSVC compiler itself on Windows, so I highly recommend that Windows users build with -DGGML_FLASH_ATTN=OFF for now.
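In practice the workaround is just a configure flag. Assuming the usual CMake flow for this branch (the flag name is taken from the report above), the build would look like:

```shell
# configure without flash attention (works around the MSVC compiler crash)
cmake -B build -DGGML_FLASH_ATTN=OFF
cmake --build build --config Release
```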
Python bindings note:
For anyone using llama-cpp-python, I discovered that the magic IDs to trigger the experimental cache are `type_k=41` and `type_v=41` for TURBO3_0.
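For reference, a minimal sketch of how those IDs would be passed through llama-cpp-python. The Llama constructor call is commented out because it needs an actual model file, and the TURBO3_0 name is just a local alias for readability, not an official constant:

```python
# Magic ggml type ID for the experimental TURBO3_0 KV cache (see note above).
TURBO3_0 = 41

# kwargs for llama_cpp.Llama; type_k/type_v select the K/V cache quantization.
kv_cache_kwargs = {
    "type_k": TURBO3_0,
    "type_v": TURBO3_0,
}

# from llama_cpp import Llama
# llm = Llama(model_path="model.gguf", n_gpu_layers=-1, **kv_cache_kwargs)
print(kv_cache_kwargs)
```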
I wrote a fully automated PowerShell build script with these patches for the Windows community and posted the guide on Reddit here:
[https://www.reddit.com/r/LocalLLaMA/comments/1s931oz/llamacpp_new_turboquant_3bit_kv_cache_is_insane/](https://www.reddit.com/r/LocalLLaMA/comments/1s931oz/llamacpp_new_turboquant_3bit_kv_cache_is_insane/)
Thanks again for this brilliant memory optimization!