fix: Gemma 4 + TurboQuant KV no longer crashes on second prompt when --cache-reuse enabled #10

Merged
Ooooze merged 1 commit into AtomicBot-ai:feature/turboquant-kv-cache from sujitvasanth:fix/turbo-rope-shift-gemma4
May 12, 2026

sujitvasanth commented May 11, 2026


Overview

The cache bug addressed in #9 had masked a knock-on problem in the RoPE implementation. This fix is needed for TurboQuant to work properly with cache reuse on Gemma 4.

TurboQuant (turbo2/3/4) uses kernel-level WHT rotation, which is position-invariant -- WHT preserves inner products so no RoPE correction is needed after a KV position shift.
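
The inner-product property cited above can be checked numerically. The following is a minimal sketch, assuming an orthonormal Walsh-Hadamard transform (scaled by 1/sqrt(n)) over a power-of-two length. The names `fwht_orthonormal` and `dot` are hypothetical helpers written for illustration; this is not the TurboQuant kernel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative helper (hypothetical name, not part of the codebase):
// in-place fast Walsh-Hadamard transform, scaled by 1/sqrt(n) so that
// the transform is orthonormal. The length must be a power of two.
void fwht_orthonormal(std::vector<double> & v) {
    const std::size_t n = v.size();
    assert(n > 0 && (n & (n - 1)) == 0);
    // Butterfly passes: combine pairs that are `len` apart.
    for (std::size_t len = 1; len < n; len <<= 1) {
        for (std::size_t i = 0; i < n; i += 2 * len) {
            for (std::size_t j = i; j < i + len; ++j) {
                const double a = v[j];
                const double b = v[j + len];
                v[j]       = a + b;
                v[j + len] = a - b;
            }
        }
    }
    const double scale = 1.0 / std::sqrt(static_cast<double>(n));
    for (double & x : v) {
        x *= scale;
    }
}

// Plain dot product of two equal-length vectors.
double dot(const std::vector<double> & a, const std::vector<double> & b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        s += a[i] * b[i];
    }
    return s;
}
```

Applying `fwht_orthonormal` to both a query-like and a key-like vector and comparing `dot` before and after should give the same value up to floating-point rounding, since an orthonormal transform leaves dot products unchanged.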

build_graph_shift() assumed standard quantized tensors with upstream rotation, but TurboQuant sets attn_rot_k=0 and handles rotation at kernel level. Building the shift graph with turbo-padded tensors causes a null buffer assert and segfault on the second prompt.

Fix: skip build_graph_shift() layers and get_has_shift() entirely for turbo KV types. Position tracking via seq_add() still works correctly -- only the broken RoPE re-rotation kernel is skipped.
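
The shape of the guard can be sketched with a toy model. Everything below is a hypothetical stand-in written for illustration (`kv_type`, `toy_kv_cache`, `is_turbo`, `has_shift`, `count_shift_layers`); these are not the real types or signatures from the codebase, and the actual patch may be structured differently. The sketch only shows the idea that positions are still updated while the re-rotation step is skipped for turbo types.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-ins; these are NOT the real llama.cpp / ggml types.
enum class kv_type { f16, q8_0, turbo2, turbo3, turbo4 };

struct toy_kv_cache {
    kv_type          type_k;
    std::vector<int> pos;            // position of each cached cell
    bool             pending_shift;  // set when positions were moved

    toy_kv_cache(kv_type t, std::vector<int> p)
        : type_k(t), pos(std::move(p)), pending_shift(false) {}

    // Plays the role of seq_add(): move every position by delta.
    void seq_add(int delta) {
        for (int & p : pos) {
            p += delta;
        }
        pending_shift = true;
    }
};

bool is_turbo(kv_type t) {
    return t == kv_type::turbo2 || t == kv_type::turbo3 || t == kv_type::turbo4;
}

// Plays the role of get_has_shift(): never report a pending shift for
// turbo types, so no re-rotation graph is requested for them.
bool has_shift(const toy_kv_cache & c) {
    return c.pending_shift && !is_turbo(c.type_k);
}

// Plays the role of build_graph_shift(): returns how many layers would
// receive a RoPE re-rotation node. Turbo types contribute none.
int count_shift_layers(const toy_kv_cache & c, int n_layer) {
    assert(n_layer >= 0);
    if (!has_shift(c)) {
        return 0;
    }
    return n_layer;
}
```

In this model, calling `seq_add()` on a turbo cache still moves the stored positions, but `has_shift()` stays false and `count_shift_layers()` returns 0, while a non-turbo cache takes the usual path.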

Additional information

Combined with the previous PR that recognises caching in Gemma 4, this leads to near-instantaneous chat conversations in the llama-server web GUI. Previously there was a reprocessing lag of more than 7 seconds, and a crash on any prompt that caused a sliding-window shift.
I have tested up to around 6k of the available 250k context, and it is working flawlessly now.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, co-authored with Claude. I have built and tested it, and confirm it is fully functional on my RTX 3060 + GTX 1660 setup on Ubuntu 20.04

Ooooze merged commit b1a7d71 into AtomicBot-ai:feature/turboquant-kv-cache May 12, 2026
1 check passed
Ooooze added a commit that referenced this pull request May 12, 2026
Brings in Gemma 4 + TurboQuant KV cache fixes:
- fix/turbo-rope-shift-gemma4 (PR #10)
- fix/iswa-get-can-shift-gemma4 (PR #9)
- fix/mtp-assistant-tensor-prefix (PR #7)