fix: Gemma 4 + TurboQuant KV no longer crashes on second prompt when --cache-reuse enabled by sujitvasanth · Pull Request #10 · AtomicBot-ai/atomic-llama-cpp-turboquant

sujitvasanth · 2026-05-11T14:28:26Z

Overview

The previous cache bug (#9) masked a knock-on problem in the RoPE implementation. This fix is needed for TurboQuant to work correctly with cache reuse on Gemma 4.

TurboQuant (turbo2/3/4) uses kernel-level WHT rotation, which is position-invariant -- WHT preserves inner products so no RoPE correction is needed after a KV position shift.
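The inner-product property mentioned above can be checked numerically. The following is a minimal sketch in pure Python, assuming an orthonormal (1/sqrt(n)-scaled) Walsh-Hadamard transform applied identically to both vectors. It does not model the TurboQuant kernels, quantization, or RoPE; the helper names `fwht` and `dot` are illustrative only.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        # Butterfly step: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthonormal
    return [v * scale for v in out]


def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


rng = random.Random(0)
q = [rng.gauss(0.0, 1.0) for _ in range(64)]
k = [rng.gauss(0.0, 1.0) for _ in range(64)]

before = dot(q, k)
after = dot(fwht(q), fwht(k))
# The two values agree up to floating-point rounding.
print(math.isclose(before, after, rel_tol=1e-9, abs_tol=1e-9))
```

Because the scaled transform is orthogonal, rotating both the query and the key leaves their dot product unchanged, which is the property the description relies on.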

build_graph_shift() assumed standard quantized tensors with upstream rotation, but TurboQuant sets attn_rot_k=0 and handles rotation at kernel level. Building the shift graph with turbo-padded tensors causes a null buffer assert and segfault on the second prompt.

Fix: skip build_graph_shift() layers and get_has_shift() entirely for turbo KV types. Position tracking via seq_add() still works correctly -- only the broken RoPE re-rotation kernel is skipped.
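The control flow described in the fix can be sketched as a toy model. This is not the actual llama.cpp code: `ToyKVCache`, `TURBO_KV_TYPES`, and `update` are hypothetical names, and only `seq_add`, `get_has_shift`, and `build_graph_shift` are borrowed from the description above. It assumes the guard is keyed on the K cache type.

```python
from dataclasses import dataclass, field

# Hypothetical set; the type names turbo2/3/4 come from the PR text.
TURBO_KV_TYPES = {"turbo2", "turbo3", "turbo4"}


@dataclass
class ToyKVCache:
    type_k: str                                 # KV cache type name
    pos: list                                   # position stored for each cell
    shift: list = field(default_factory=list)   # pending per-cell position deltas

    def __post_init__(self):
        self.shift = [0] * len(self.pos)

    def seq_add(self, p0, p1, delta):
        """Shift every cell whose position lies in [p0, p1) by delta."""
        for i, p in enumerate(self.pos):
            if p0 <= p < p1:
                self.pos[i] = p + delta
                self.shift[i] += delta

    def get_has_shift(self):
        """Report a pending shift, except for turbo types (the guard from the fix)."""
        if self.type_k in TURBO_KV_TYPES:
            return False
        return any(s != 0 for s in self.shift)


def update(cache):
    """Return which path a cache update would take in this toy model."""
    if cache.get_has_shift():
        return "build_graph_shift"
    return "skip"
```

In this model, a `turbo3` cache still has its positions updated by `seq_add()`, but `update()` returns `"skip"`, whereas a cache with a non-turbo type name takes the `"build_graph_shift"` path.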

Additional information

Combined with the previous PR that recognises caching in Gemma 4, this leads to near-instantaneous chat conversations in the llama-server web GUI. Previously there was a reprocessing lag of 7 seconds or more, and a crash on any prompt that caused a sliding-window shift.
I have tested up to around 6k of the available 250k context and it now works flawlessly.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: Yes, co-authored with Claude. I have built and tested it and confirm it is fully functional on my RTX 3060 + GTX 1660 setup on Ubuntu 20.04.


Brings in Gemma 4 + TurboQuant KV cache fixes:
- fix/turbo-rope-shift-gemma4 (PR #10)
- fix/iswa-get-can-shift-gemma4 (PR #9)
- fix/mtp-assistant-tensor-prefix (PR #7)

Ooooze merged commit b1a7d71 into AtomicBot-ai:feature/turboquant-kv-cache May 12, 2026
1 check passed

Ooooze merged 1 commit into AtomicBot-ai:feature/turboquant-kv-cache from sujitvasanth:fix/turbo-rope-shift-gemma4
