Compress KV cache by 10.7x and boost throughput by 2.5x on long reasoning tasks -- with no accuracy loss.
Weian Mao1*, Xi Lin3*, Wei Huang2*, Yuxin Xie1, Tianfu Fu1, Bohan Zhuang3, Song Han1,2, Yukang Chen2
1MIT, 2NVIDIA, 3ZJU *Equal contribution
demo.mp4
- 2.5x throughput on AIME25 long reasoning while matching Full Attention accuracy (40.8 vs 40.8)
- 10.7x KV memory reduction with trigonometric frequency-domain compression
- OpenClaw compatible — enables local deployment on 24GB RTX 4090
TriAttention achieves 2.5x higher throughput and 10.7x KV memory reduction on AIME25 while matching Full Attention accuracy.
Pre-RoPE Q/K vectors in long reasoning models concentrate around fixed centers that determine distance preferences via a trigonometric series. TriAttention scores keys using these centers and norms instead of requiring representative query selection, enabling accurate KV cache compression without the overhead of existing attention-based methods.
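To make the center-based scoring concrete, here is a minimal NumPy sketch of the idea — not TriAttention's actual kernel, and `score_keys`, `center`, and `window` are illustrative names. Each pre-RoPE key is scored by its dot product with a fixed center (equivalently, its norm times its cosine to the center direction); a recent window is always kept, and the remaining budget is filled with the highest-scoring older keys.

```python
import numpy as np

def score_keys(keys, center, budget, window=4):
    # Score each pre-RoPE key by its dot product with the fixed center
    # (equals ||k|| * ||center|| * cos(angle to center)).
    # Assumes budget > window.
    scores = keys @ center
    n = len(keys)
    # Always keep the most recent `window` keys.
    keep = set(range(n - window, n))
    # Fill the remaining budget with the highest-scoring older keys.
    older = np.argsort(scores[: n - window])[::-1]
    for i in older[: budget - window]:
        keep.add(int(i))
    return sorted(keep)
```

The key point is that no representative queries are needed at eviction time: the score depends only on the precomputed center and the keys themselves.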
TriAttention's vLLM server exposes an OpenAI-compatible API, which means you can use it directly as a custom provider in OpenClaw.
- Follow the Installation instructions, then start a vLLM server with the recommended settings below.
- In OpenClaw, add a custom provider pointing to your vLLM server (e.g. `http://localhost:8000/v1`).
For manual configuration or troubleshooting, see the OpenClaw Manual Configuration Guide.
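Since the server speaks the standard OpenAI chat-completions protocol, any OpenAI-compatible client works. The sketch below builds such a request with only the standard library; the base URL and model name are placeholders for your deployment.

```python
import json
from urllib import request

def build_chat_request(base_url, model, prompt):
    # Standard OpenAI chat-completions payload; works against any
    # OpenAI-compatible endpoint, including the TriAttention vLLM server.
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("http://localhost:8000/v1", "<model_path>", "Solve: ...")
# Sending it requires the server from the Installation step to be running:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```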
Interactive chat workloads differ from offline benchmarks — conversations are long-running and prefill chunks can trigger compression at unexpected points. We recommend the following adjustments:
```bash
# Required: path to precomputed frequency statistics
export TRIATTN_RUNTIME_SPARSE_STATS_PATH=triattention/vllm/stats/qwen3_32b_int4_stats.pt

# Use a larger KV budget for multi-turn chat (default: 2048)
export TRIATTN_RUNTIME_KV_BUDGET=12000

vllm serve <model_path> \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --enforce-eager \
  --trust-remote-code \
  --enable-prefix-caching false \
  --max-num-batched-tokens 1024
```

Key differences from the default server mode:

- `--enable-prefix-caching false` — Prefix caching is currently incompatible with KV compression; disable it to avoid incorrect cache hits on compressed entries.
- `--max-num-batched-tokens 1024` — Limits the prefill chunk size. Large chunks can overshoot the KV budget in a single step before compression has a chance to trigger, leading to OOM.
- `TRIATTN_RUNTIME_KV_BUDGET=12000` — Chat sessions accumulate context across many turns; a larger budget (e.g. 12k) keeps more history available and avoids aggressive eviction.
```bash
git clone https://github.com/WeianMao/triattention.git
cd triattention
pip install -e .
pip install flash-attn --no-build-isolation  # recommended
```

```bash
python scripts/cli.py run-one \
  --model Qwen3-8B \
  --dataset aime24 \
  --method triattention \
  --budget 2048
```

Benchmark datasets (AIME 2024, AIME 2025, MATH-500) are automatically downloaded from HuggingFace on first run -- no manual data preparation is needed. The evaluation scripts handle downloading, caching, and formatting transparently.
| Model | HuggingFace ID | Status |
|---|---|---|
| Qwen3-8B | Qwen/Qwen3-8B | Verified |
| DeepSeek-R1-Distill-Llama-8B | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | Verified |
| DeepSeek-R1-Distill-Qwen-7B | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | Verified |
Each cell reports AIME24 / AIME25 accuracy:

| Method | Qwen3-8B | DS-Llama-8B | DS-Qwen-7B | GPT-OSS-20B |
|---|---|---|---|---|
| Full Attention | 57.1 / 40.8 | 50.4 / 31.4 | 43.8 / 34.2 | 69.2 / 60.0 |
| SnapKV | 34.6 / 20.0 | 5.0 / 6.7 | 34.6 / 25.0 | 48.3 / 36.7 |
| R-KV | 25.4 / 17.5 | 25.8 / 11.2 | 34.6 / 23.3 | 49.6 / 39.2 |
| TriAttention | 42.1 / 32.9 | 33.8 / 19.6 | 42.5 / 30.0 | 59.2 / 49.2 |
| Benchmark | TriAttn Budget | Full Acc | TriAttn Acc | Full Throughput | TriAttn Throughput | Speedup |
|---|---|---|---|---|---|---|
| MATH-500 | 1024 | 69.6 | 68.4 | 222.8 | 1405.2 | 6.3x |
| AIME24 | 4096 | 57.1 | 54.6 | 222.8 | 413.9 | 1.9x |
| AIME25 | 3072 | 40.8 | 40.8 | 222.8 | 563.5 | 2.5x |
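The speedup column is simply the ratio of the two throughput columns; a quick check in Python (numbers copied from the table above):

```python
# (full attention throughput, TriAttention throughput) per benchmark
rows = {
    "MATH-500": (222.8, 1405.2),
    "AIME24": (222.8, 413.9),
    "AIME25": (222.8, 563.5),
}
for name, (full, tri) in rows.items():
    print(f"{name}: {tri / full:.1f}x")
# MATH-500: 6.3x
# AIME24: 1.9x
# AIME25: 2.5x
```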
See docs/results.md for complete results including MATH-500 accuracy table, accuracy vs. budget curves, and DFS memory retention analysis.
TriAttention includes a vLLM plugin that enables transparent KV cache compression for production deployment. After installation, vLLM automatically discovers and activates the plugin -- no code changes required.
```bash
# Set compression parameters
export TRIATTN_RUNTIME_KV_BUDGET=2048
export TRIATTN_RUNTIME_SPARSE_STATS_PATH=triattention/vllm/stats/qwen3_32b_int4_stats.pt

# Launch vLLM server -- TriAttention activates automatically
vllm serve <model_path> \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --enforce-eager \
  --trust-remote-code \
  --enable-prefix-caching false

# Use the standard OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model_path>", "messages": [{"role": "user", "content": "Solve: ..."}]}'
```

```python
from triattention.vllm.runtime.integration_monkeypatch import (
    install_vllm_integration_monkeypatches,
)

# Install patches before creating the LLM instance
install_vllm_integration_monkeypatches(patch_scheduler=True, patch_worker=True)

# Standard vLLM API -- compression happens transparently
from vllm import LLM, SamplingParams

llm = LLM(
    model="<model_path>",
    dtype="bfloat16",
    max_model_len=32768,
    enforce_eager=True,
    trust_remote_code=True,
)
outputs = llm.generate(["Your prompt here"], SamplingParams(temperature=0.6, top_p=0.95))
print(outputs[0].outputs[0].text)
```

| Environment Variable | Default | Description |
|---|---|---|
| `TRIATTN_RUNTIME_KV_BUDGET` | `2048` | Maximum tokens retained in KV cache per request |
| `TRIATTN_RUNTIME_DIVIDE_LENGTH` | `128` | Compression trigger interval (every N new tokens) |
| `TRIATTN_RUNTIME_WINDOW_SIZE` | `128` | Recent tokens always preserved |
| `TRIATTN_RUNTIME_PRUNING_MODE` | `per_head` | Token selection strategy (`per_head` or `per_layer_per_head`) |
| `TRIATTN_RUNTIME_SPARSE_STATS_PATH` | -- | Path to precomputed frequency statistics `.pt` file |
| `TRIATTN_RUNTIME_PROTECT_PREFILL` | `false` | Protect initial prompt tokens from eviction |
| `TRIATTN_RUNTIME_ENABLE_EXPERIMENTAL_KV_COMPACTION` | `true` | Enable in-place KV cache compaction |
| `TRIATTN_RUNTIME_ENABLE_EXPERIMENTAL_BLOCK_RECLAIM` | `true` | Enable freed block reclamation |
| `ENABLE_TRIATTENTION` | `true` | Master switch to enable/disable the plugin |
TriAttention requires precomputed Q/K frequency statistics for scoring. We provide pre-calibrated stats for supported models in triattention/vllm/stats/. See the Calibration Guide for generating stats for custom models.
- Reproduction Guide -- full experiment commands for all benchmarks
- Calibration Guide -- generating custom Q/K statistics
- Full Results -- complete tables, figures, and analysis
- vLLM integration
- SGLang integration
- Ollama integration
- Support for more model architectures
```bibtex
@article{mao2026triattention,
  title={TriAttention: Efficient Long Reasoning with Trigonometric KV Compression},
  author={Weian Mao and Xi Lin and Wei Huang and Yuxin Xie and Tianfu Fu and Bohan Zhuang and Song Han and Yukang Chen},
  year={2026},
  eprint={2604.04921},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

We thank the following projects for their contributions and inspiration: R-KV | SnapKV
This project is licensed under the Apache License 2.0. See LICENSE for details.
