feat: TBQ4_0 + TBQ3_0 CUDA flash attention for SM121 (DGX Spark) by mihai-chiorean · Pull Request #1 · mihai-chiorean/turbo3-cuda

mihai-chiorean · 2026-03-31T18:14:59Z

Summary

Add TBQ4_0 (4.125 bpw) and TBQ3_0 (3.125 bpw) KV cache types with native CUDA flash attention for DGX Spark (GB10, SM121). Enables ~85K context on MiniMax M2.5 (+30% vs 65K with q4_0).

Key Results (Llama-3.1-8B, WikiText-2, SM121)

Config	PPL	vs f16	bpw
f16	7.6186	--	16.0
tbq4_0/tbq4_0	7.6999	+1.07%	4.125
tbq4_0/tbq3_0	7.7556	+1.80%	3.56
q8_0/tbq3_0	7.6844	+0.86%	5.81

MiniMax M2.5: all TBQ configs within noise of f16 (overlapping error bars at 4 chunks).
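The "vs f16" column can be reproduced from the PPL column alone. The following is a minimal sketch, not part of the PR; the helper names `ppl_delta_percent` and `format_rows` are made up for illustration, and the numbers are copied from the table above.

```python
def ppl_delta_percent(ppl: float, ppl_ref: float) -> float:
    """Relative perplexity change versus a reference, in percent."""
    return (ppl / ppl_ref - 1.0) * 100.0


F16_PPL = 7.6186  # f16 baseline, copied from the table above

# (config, PPL) pairs copied from the table above
RESULTS = [
    ("tbq4_0/tbq4_0", 7.6999),
    ("tbq4_0/tbq3_0", 7.7556),
    ("q8_0/tbq3_0", 7.6844),
]


def format_rows(results=RESULTS, ref=F16_PPL):
    """Render one 'config: +x.xx%' string per quantized configuration."""
    return [f"{name}: {ppl_delta_percent(ppl, ref):+.2f}%" for name, ppl in results]
```

Calling `format_rows()` yields the same three percentages as the table, which is a quick sanity check when new configurations are added.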

Architecture

Lloyd-Max codebook quantization with Walsh-Hadamard transform (WHT) rotation. The FA vec kernel pre-rotates Q via shared memory once per invocation, performs codebook lookups per token, and applies the inverse rotation once on the output. O(1) per-token overhead.
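Why rotating Q once is enough can be illustrated outside CUDA. The sketch below is a plain-Python illustration, not the kernel code from this PR: it assumes an orthonormal Walsh-Hadamard transform and shows that dot products are unchanged when both vectors are rotated. The function names `fwht` and `dot` are hypothetical.

```python
import math


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    y = list(x)
    h = 1
    while h < n:
        # Butterfly step: combine pairs that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthonormal
    return [v * scale for v in y]


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))


q = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5, -1.0, 2.0]
k = [0.25, 1.0, -1.5, 2.0, 0.5, -0.5, 1.0, 3.0]

score_plain = dot(q, k)
score_rotated = dot(fwht(q), fwht(k))  # K stays in the rotated domain
```

Because the transform is orthonormal, `score_rotated` matches `score_plain` up to floating-point error, so keys stored in rotated form never need to be un-rotated per token. The same transform is its own inverse, and it is linear, so a weighted sum of rotated V rows can be un-rotated once at the end.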

Bug Fixes (7)

Shadow path eliminated, turbo4 Q_q8_1 fix, dequant dispatch, launch_bounds, NC stride, multi-GPU rotation init, context size.

29 files, 7454 insertions. 2 rounds of code review passed.

…21056)
* server : add custom socket options to disable SO_REUSEPORT
  Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --reuse-port
  $ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2 --reuse-port
  setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
  setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
  setsockopt(3, SOL_SOCKET, SO_REUSEPORT, [1], 4) = 0
  bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
  $ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2
  setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
  setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
  bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
  Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update tools/server/README.md (llama-gen-docs)
  Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix windows
  Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* CI: fix ARM64 image build error & enable compilation * Update .github/workflows/docker.yml Co-authored-by: Aaron Teo <taronaeo@gmail.com> * CI: revert ggml/src/ggml-cpu/CMakeLists.txt * Update .github/workflows/docker.yml Co-authored-by: Aaron Teo <taronaeo@gmail.com> * CI: update runs-on to ubuntu24.04, and update ARM64 build image ( ubuntu_version: "24.04") * CI: change cpu.Dockerfile gcc to 14; * CI : cpu.Dockerfile , update pip install . * Update .github/workflows/docker.yml Co-authored-by: Aaron Teo <taronaeo@gmail.com> --------- Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* refactor: Make `DialogConfirmation` extensible with children slot * feat: Add conversation forking logic * feat: Conversation forking UI * feat: Update delete/edit dialogs and logic for forks * refactor: Improve Chat Sidebar UX and add MCP Servers entry * refactor: Cleanup * feat: Update message in place when editing leaf nodes * chore: Cleanup * chore: Cleanup * chore: Cleanup * chore: Cleanup * chore: Cleanup * chore: Cleanup * refactor: Post-review improvements * chore: update webui build output * test: Update Storybook test * chore: update webui build output * chore: update webui build output

…schema pattern converter (ggml-org#21124)

The regex-to-grammar converter in _visit_pattern() crashes with SIGSEGV when a JSON schema "pattern" field contains a non-capturing group (?:...).

Root cause: when the parser sees '(' followed by '?', it pushes a warning but does not advance past '?:'. The recursive transform() call then interprets '?' as a quantifier and calls seq.back() on an empty vector, causing undefined behavior.

This commonly occurs when serving OpenAI-compatible tool calls from clients that include complex regex patterns in their JSON schemas (e.g., date validation patterns like ^(?:(?:\d\d[2468][048]|...)-02-29|...)$).

The fix:
- Skip '?:' after '(' to treat non-capturing groups as regular groups
- For unsupported syntax (?=, ?!, etc.), skip to matching ')' safely, handling escaped characters to avoid miscounting parenthesis depth
- Adjust the ')' unbalanced-parentheses check using direct char comparisons instead of substr
- Add test cases for non-capturing groups (C++ only, as the JS/Python implementations do not yet support this syntax)

* hex-fa: add simple dma cache for Mask
  I noticed that we were refetching the mask rows over and over. This simple cache avoids that.
* hex-dma: unset in-order desc bit which caused significant perf regression
  We don't rely on true in-order processing of the DMA descriptors anywhere. Turns out this mode caused a significant regression of around 3-4 TPS during token gen.
* hex-rope: update comment to clarify that we don't need in-order DMA completions

@am17an

* Optimize MOE GEMV kernel for BS > 1.
  The previous MOE kernel for BS > 1 had too many thread blocks (nrows_x, nchannels_dst, ncols_dst), with very little work per block. block of (32, 4) was doing inner dot product for a single row.
  New mul_mat_vec_q_moe kernel is dedicated for MoE multi-token kernel with grid (ceil(nrows_x/rpb), nchannels_dst), block (warp_size, ncols_dst). Each warp handles two rows independently with warp-level reduction only (no shared memory sync).
  This change doesn't increase any compilation time as a single template instance is needed per type.
  This also simplifies the original GEMV kernel and gets rid of `is_multi_token_id` specialization.
* Remove em-dashes
* Cherry-pick changes from @am17an PR ggml-org#20885 to enable small_k optimization only for cases where it benefits
  Increase max batch size for MMVQ kernels for MUL_MAT_ID to 8
* Make the max batch size for MOE GEMV kernel configurable based on GPU arch and datatype
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* server: wrap headers for mcp proxy * Update tools/server/server-cors-proxy.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix build * chore: update webui build output * chore: update webui build output --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

When RPC is running with a remote backend which doesn't have init_tensor function (like CPU and Metal), the server log gets full with error messages saying that init_tensor is being called with null buffer which is incorrect. This patch fixes this.

…l-org#21181)
* CUDA: Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1
  We wrongly calculated offset_grid as `ceildiv(nrows, block_size)`, while it must be `ceildiv(nrows + 1, block_size)`. As a consequence, we had uninitialized values in `offset_iterator[nrows]` for the case when `nrows % block_size == 0`.
  Fixes ggml-org#21162
* Reduce nrows in test case to 256, don't need 768
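The off-by-one is easy to check with small numbers. The sketch below assumes, for illustration only, that each thread writes one entry of an offsets array with `nrows + 1` entries and that the block size is 256; neither detail is stated in the commit message, and the helper `offsets_written` is hypothetical.

```python
def ceildiv(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return (a + b - 1) // b


def offsets_written(nrows: int, block_size: int, grid: int):
    """Indices of the offsets array (valid range 0..nrows) covered by the launch."""
    return [i for i in range(grid * block_size) if i <= nrows]


NROWS, BLOCK = 256, 256  # nrows % block_size == 0, the failing case

old_grid = ceildiv(NROWS, BLOCK)      # one block, threads 0..255
new_grid = ceildiv(NROWS + 1, BLOCK)  # two blocks, thread 256 exists

missing_before = NROWS not in offsets_written(NROWS, BLOCK, old_grid)
covered_after = NROWS in offsets_written(NROWS, BLOCK, new_grid)
```

With `nrows = 255` the old formula already launches enough threads to reach index 255, which is why the bug only appears when `nrows` is an exact multiple of the block size.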

* Reject empty computed member expressions before returning slices[0] from parse_member_expression_arguments(). * Treat empty computed member expressions with Jinja2 undefined semantics Treat empty computed member expressions like `a[]` as undefined instead of raising a parser error, to match Jinja2 behavior. - return a noop expression for empty computed member arguments - return undefined when a computed member key evaluates to undefined - add Jinja tests covering `a[]|default('fallback')` and `a[] is undefined` * Handle undefined computed member properties Move undefined-property handling to the common member access path, and add a test covering `a[undefined] is undefined`. * Use default undefined value in member access Initialize val and then return it when property is undefined. Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * empty statement parses to blank_expression instead of noop_statement --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* CI: Enable CUDA and Vulkan ARM64 runners and fix CI/CD Co-authored-by: Ts-sound <44093942+Ts-sound@users.noreply.github.com> * Obtain source tag name from git tag Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Ts-sound <44093942+Ts-sound@users.noreply.github.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* opencl: add q4_K gemm and gemv kernels for Adreno * opencl: fix whitespace * opencl: add workarounds for compiler bugs on older devices * opencl: handle fp16 denorm on X Elite * opencl: fix kernel build error * opencl: fix whitespace * opencl: make q4_K cvt kernels signature consistent --------- Co-authored-by: Li He <lih@qti.qualcomm.com>

* ggml-zendnn : add MUL_MAT_ID op support for MoE models - Add MUL_MAT_ID op acceleration for Mixture-of-Experts models - MUL_MAT_ID op fallback to CPU backend if total experts > 32 - Point ZenDNN lib to latest bits ZenDNN-2026-WW13 * ggml-zendnn : add braces to sgemm failure condition for consistency Co-authored-by: Aaron Teo <taronaeo@gmail.com> --------- Co-authored-by: Aaron Teo <taronaeo@gmail.com>

…rg#21331) The `HSA_OVERRIDE_GFX_VERSION` variable can be used in ROCm to override an unsupported target architecture with a similar but supported target architecture. This does not and has never worked on Windows. I think the clarification could avoid driving Windows people towards this solution that does not work.

…gml-org#21327)
* common : fix tool call type detection for nullable and enum schemas
* common, tests : fix grammar delegation for nullable/enum schemas and add tests
  Fix enum type inference to scan all enum values (not just index 0) so schemas like {"enum": [0, "celsius"]} correctly detect string type.
  Fix schema_delegates in peg-parser to handle nullable type arrays (["string", "null"]) and typeless enum schemas in raw mode, allowing the tagged parser to use raw text instead of JSON-formatted strings.
  Add test cases for Qwen3-Coder (TAG_WITH_TAGGED format):
  - nullable string ["string", "null"]
  - nullable string with null first ["null", "string"]
  - nullable integer ["integer", "null"]
  - enum without explicit type key

…ity for tag-json parsers (ggml-org#21230) * Fix call ID detection (Mistral parser mostly) + atomicity for tag-json parsers * Rename * Update common/chat-auto-parser-generator.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

…org#20993) * server: clear idle slots KV from VRAM (LLAMA_KV_KEEP_ONLY_ACTIVE) * server: move idle slot KV clearing to slot release The save "cost" is now paid by the finishing request. * server: add --kv-clear-idle flag, enable by default * server: skip clearing last idle slot, clear on launch * server: test --no-kv-clear-idle flag * server: simplify on-release clearing loop * server: remove on-release KV clearing, keep launch-only * cont : clean-up * tests: update log strings after --clear-idle rename * tests: use debug tags instead of log message matching * test: fix Windows CI by dropping temp log file unlink --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* experimenting CI * Experimenting CI fix for MinGW * experimenting CI on Windows * modified script for integration with VisualStudio * added proxy handling * adding python version for Windows execution * fix iterator::end() dereference * fixed proxy handling * Fix errors occurring on Windows * fixed ci script * Reverted to master * Stripping test items to simplify Windows test * adjusting script for windows testing * Changed shell * Fixed shell * Fixed shell * Fix CI setting * Fix CI setting * Fix CI setting * Experimenting ci fix * Experimenting ci fix * Experimenting ci fix * Experimenting ci fix * experimenting fix for unit test error * Changed to use BUILD_LOW_PERF to skip python tests * Fix CI * Added option to specify Ninja generator * Reverted proxy related changes

…fsets (ggml-org#21278) * Work towards removing bitcast * Move rest of existing types over * Add timeout back to wait and remove synchronous set_tensor/memset_tensor * move to unpackf16 for wider compatibility * cleanup * Remove deadlock condition in free_bufs * Start work on removing parameter buffer pools * Simplify and optimize further * simplify profile futures * Fix stride * Try using a single command buffer per batch * formatting

…rg#21038)

Upstream master now applies Walsh-Hadamard rotation to K/V/Q before KV cache storage (commit 744c0c7). This is the same rotation TBQ was doing independently, causing double rotation after the merge.

TBQ types are now pure codebook quantizers:
- SET_ROWS: normalize + codebook quantize + pack (no FWHT)
- FA dequant: codebook lookup + scale (no Q pre-rotation, no V inverse rotation)
- Standalone dequant: codebook lookup + scale (no inverse FWHT)

Removes ~200 lines of rotation code from CUDA and CPU paths. Fixes garbage output caused by double WHT rotation after upstream merge.
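The "normalize + codebook quantize + pack" and "codebook lookup + scale" steps can be sketched in a few lines. Everything below is an assumption made for illustration: the codebook values are placeholders with a roughly Gaussian shape (not the table used in this PR), the normalization uses the block RMS, and the bit packing is a simple little-endian layout rather than the actual block format.

```python
import math

# Placeholder 3-bit (8-entry) codebook; NOT the PR's actual Lloyd-Max table.
CODEBOOK_3BIT = [-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152]


def quantize_block(x, codebook=CODEBOOK_3BIT):
    """Normalize by the block RMS, then map each value to its nearest codebook index."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    scale = rms if rms > 0.0 else 1.0  # avoid dividing by zero on all-zero blocks
    idx = [
        min(range(len(codebook)), key=lambda j: abs(v / scale - codebook[j]))
        for v in x
    ]
    return scale, idx


def pack_3bit(indices):
    """Pack 3-bit indices into bytes, lowest index in the lowest bits."""
    bits = 0
    for n, i in enumerate(indices):
        bits |= (i & 0x7) << (3 * n)
    return bits.to_bytes((3 * len(indices) + 7) // 8, "little")


def unpack_3bit(data, count):
    """Inverse of pack_3bit for `count` indices."""
    bits = int.from_bytes(data, "little")
    return [(bits >> (3 * n)) & 0x7 for n in range(count)]


def dequantize_block(scale, idx, codebook=CODEBOOK_3BIT):
    """Codebook lookup followed by scaling, mirroring the dequant description."""
    return [scale * codebook[i] for i in idx]
```

Under the further assumption of 128-value blocks with one 16-bit scale, the packed size would be 48 + 2 bytes, i.e. 50 × 8 / 128 = 3.125 bits per value, which agrees with the bpw quoted for TBQ3_0 in the summary.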

angt and others added 30 commits March 28, 2026 01:12

cli : add /glob command (ggml-org#21084)

c46758d

* add /glob command * output error when max files reached * support globbing outside curdir

common/parser: fix reasoning whitespace bugs + extra parser tests (gg…

1f5d15e

…ml-org#21085) * fix whitespace reasoning issues + add reconstruction tests * Proper fix * fix Nemotron autoparser test expectations to include newline in marker

vulkan: add noncontiguous GLU support (ggml-org#21081)

0eb4764

* vulkan: add noncontiguous GLU support * fix compile issue

vendor : update cpp-httplib to 0.40.0 (ggml-org#21100)

b0f0dd3

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

Document custom default webui preferences in server README (ggml-org#…

82b703f

…19771)

ci : gracefully shut down the server (ggml-org#21110)

3d66da1

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

server : fix processing of multiple back-to-back mtmd chunks (ggml-or…

edfb440

…g#21107)

common : add reasoning_format = none support to gpt-oss (ggml-org#21094)

e6f2ec0

WebUI: Replace illegal nested button elements (ggml-org#21026)

9681897

* remove/replace nested button elements * map rest props to outer element * solve TODO * chore: update webui build output

common : add character class support to glob_match (ggml-org#21111)

3a14a54

* add character class support to glob_match * remove pointless reference

common/parser: fix handling of tool definition with missing propertie…

98ae0a0

…s key (ggml-org#21128)

fix **/x glob matching (ggml-org#21129)

6509718

[SYCL] Enhance build script to use half cores to build, avoid OS hang (…

afe65aa

…ggml-org#21093) * use half cores to build, avoid OS hang * reduce the output text num to short test time * avoid to return 0

devops: including compute-runtime for intel.Dockerfile (ggml-org#21076)

2405d59

add missing ROPE_FACTORS_LONG/SHORT for MiniCPM (ggml-org#21150)

7c20367

ci : bump ty to 0.0.26 (ggml-org#21156)

e2eb39e

* fix incorrect type ignore comments * bump ty to 0.0.26

llama-model-loader: print warning when using overrides with mmap (ggm…

278521c

…l-org#20978) * llama-model-loader: use pinned memory for tensor overrides * change to warning

webui: Fix branching logic on edit message (ggml-org#21175)

389c7d4

* fix: Branching logic + small refactor * chore: update webui build output

z-vishal and others added 15 commits April 3, 2026 12:19

fix: add openssl to nix dependencies (ggml-org#21353) (ggml-org#21355)

f851fa5

HIP: build eatch ci build test for a different architecture (ggml-org…

43a4ee4

…#21337) This helps improve our chances of finding build failures before the release workflow builds for all architectures.

fix: remove stale assert (ggml-org#21369)

d3416a4

ci: add more binary checks (ggml-org#21349)

887535c

jinja: coerce input for string-specific filters (ggml-org#21370)

1f34806

docker : bump cuda12 to 12.9.1 (ggml-org#20920)

277ff5f

Co-authored-by: M1DNYT3 <m1dnyt3@MacBookPro.lan> Co-authored-by: CISC <CISC@users.noreply.github.com>

Merge remote-tracking branch 'upstream/master' into feat/tbq4-cuda-fa…

3a9a2ab

…-sm121 # Conflicts: # ggml/src/ggml-cuda/fattn.cu # src/llama-graph.cpp

github-actions bot added labels on Apr 4, 2026: documentation, SYCL, Vulkan, AMD ZenDNN, devops, python, script, server, model, nix, jinja, parser, Ascend NPU, OpenCL, Hexagon, WebGPU

mihai-chiorean wants to merge 112 commits into release/turbo3-cuda from feat/tbq4-cuda-fa-sm121

20 participants