
Porting TurboQuant to Windows (MSVC): Compatibility fixes and Python IDs #39

@v75181806

Hi Tom!

First of all, thank you for the amazing work on TurboQuant. It's a massive game-changer! I managed to get it compiled and running on Windows with a mobile RTX 4070 (8GB), getting an incredible 17 t/s on Nemotron 30B (MoE).

However, the current feature/turboquant-kv-cache branch doesn't compile out of the box on Windows using the Microsoft C++ compiler (MSVC). I had to apply a few manual patches to get a successful build.

I'm sharing these fixes here in hopes they can be integrated to make Windows builds seamless:

1. Missing `M_PI` math constant
In `ggml-turbo-quant.c`, MSVC only exposes `M_PI` if `_USE_MATH_DEFINES` is defined before the first include of `<math.h>`:

```c
#define _USE_MATH_DEFINES
#include <math.h>
```
2. Missing variable in `ops.cpp`
The build failed because `turbo3_cpu_wht_group_size` was never defined. I had to add this definition:

```cpp
int turbo3_cpu_wht_group_size = 1;
```
3. MSVC linker/scope errors with `g_innerq_scale_inv_host`
MSVC throws "undeclared identifier" and linker errors because of how the `extern` declarations are resolved across translation units. I had to remove the existing `extern` declarations of `g_innerq_scale_inv_host` from the headers/CUDA files and declare it explicitly at the very top of `llama-kv-cache.cpp`:

```cpp
/* MSVC linker bypass */
float * g_innerq_scale_inv_host = nullptr;
bool turbo_innerq_needs_tensor_update(void) { return false; }
void turbo_innerq_mark_tensor_updated(void) {}
```
4. Flash Attention incompatibility
Compiling with `-DGGML_FLASH_ATTN=ON` currently crashes the MSVC compiler, so I strongly recommend that Windows users build with `-DGGML_FLASH_ATTN=OFF` for now.

Python Bindings Note:
For anyone using llama-cpp-python, I discovered that the magic IDs to trigger the experimental cache are `type_k=41` and `type_v=41` for `TURBO3_0`.
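For concreteness, here is a minimal sketch of how those IDs could be wired into llama-cpp-python, assuming its `Llama` constructor accepts integer `type_k`/`type_v` ggml type IDs as it does for the standard quant types. The constant name `GGML_TYPE_TURBO3_0` and the model filename are my own placeholders, and the value 41 may change as the branch evolves:

```python
# Hypothetical constant name for the experimental TURBO3_0 ggml type ID
# reported in this issue; 41 is not a stable upstream value.
GGML_TYPE_TURBO3_0 = 41

# Keyword arguments for llama_cpp.Llama; flash attention stays off to
# match the MSVC build recommendation in point 4.
kv_cache_kwargs = {
    "type_k": GGML_TYPE_TURBO3_0,
    "type_v": GGML_TYPE_TURBO3_0,
    "flash_attn": False,
}

# from llama_cpp import Llama
# llm = Llama(model_path="nemotron-30b.gguf", n_gpu_layers=-1, **kv_cache_kwargs)
```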

I wrote a fully automated PowerShell build script with these patches for the Windows community and posted the guide on Reddit here:
https://www.reddit.com/r/LocalLLaMA/comments/1s931oz/llamacpp_new_turboquant_3bit_kv_cache_is_insane/

Thanks again for this brilliant memory optimization!
