TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Paper | Project Page | License | Python 3.10+

Compress KV cache by 10.7x and boost throughput by 2.5x on long reasoning tasks -- with no accuracy loss.

Weian Mao¹*, Xi Lin³*, Wei Huang²*, Yuxin Xie¹, Tianfu Fu¹, Bohan Zhuang³, Song Han¹,², Yukang Chen²

¹MIT, ²NVIDIA, ³ZJU    *Equal contribution

Highlights

  • 2.5x throughput on AIME25 long reasoning while matching Full Attention accuracy (40.8 vs 40.8)
  • 10.7x KV memory reduction with trigonometric frequency-domain compression
  • OpenClaw compatible — enables local deployment on 24GB RTX 4090

TriAttention achieves 2.5x higher throughput and 10.7x KV memory reduction on AIME25 while matching Full Attention accuracy.

How It Works

Pre-RoPE Q/K vectors in long reasoning models concentrate around fixed centers that determine distance preferences via a trigonometric series. TriAttention scores keys using these centers and norms instead of requiring representative query selection, enabling accurate KV cache compression without the overhead of existing attention-based methods.
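
To make the "trigonometric series" remark concrete, the sketch below checks the standard RoPE identity: the logit between a rotated query and a rotated key depends only on their relative distance, and can be written as a sum of cosines in that distance. This is not TriAttention's scoring function (the README does not spell it out); the helper names and the toy vectors are illustrative assumptions, and each pair of channels is stored as one complex number.

```python
import cmath
import math


def rope_frequencies(num_pairs, base=10000.0):
    """Standard RoPE angular frequencies, one per 2-D pair of channels."""
    return [base ** (-i / num_pairs) for i in range(num_pairs)]


def rope_rotate(vec, position, thetas):
    """Apply RoPE at `position` to a vector stored as complex pairs."""
    return [v * cmath.exp(1j * position * t) for v, t in zip(vec, thetas)]


def dot(q, k):
    """Real dot product of two vectors stored as complex pairs."""
    return sum((a * b.conjugate()).real for a, b in zip(q, k))


def trig_series_logit(q, k, distance, thetas):
    """The same logit written as a cosine series in the relative distance.

    Each frequency contributes amplitude |q_f||k_f| and a phase set by the
    pre-RoPE vectors, so fixed Q/K centers imply a fixed distance profile.
    """
    total = 0.0
    for a, b, t in zip(q, k, thetas):
        c = a * b.conjugate()
        total += abs(c) * math.cos(distance * t + cmath.phase(c))
    return total


# Toy pre-RoPE vectors (hypothetical values, 4 frequency pairs).
thetas = rope_frequencies(4)
q = [complex(0.9, 0.1), complex(-0.3, 0.7), complex(0.5, -0.2), complex(0.1, 0.4)]
k = [complex(0.8, -0.2), complex(0.2, 0.6), complex(-0.4, 0.3), complex(0.6, 0.1)]

direct = dot(rope_rotate(q, 37, thetas), rope_rotate(k, 5, thetas))
series = trig_series_logit(q, k, 37 - 5, thetas)
print(abs(direct - series) < 1e-9)
```

Because the series depends only on the pre-RoPE vectors and the distance, a stable query center lets a key's future relevance be estimated without sampling representative queries, which is the intuition the paragraph above describes.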

Deploy with OpenClaw

TriAttention's vLLM server exposes an OpenAI-compatible API, which means you can use it directly as a custom provider in OpenClaw.

Quick Setup

  1. Follow the Installation instructions, then start a vLLM server with the recommended settings below.
  2. In OpenClaw, add a custom provider pointing to your vLLM server (e.g. http://localhost:8000/v1).

For manual configuration or troubleshooting, see the OpenClaw Manual Configuration Guide.

Recommended Server Settings for Chat

Interactive chat workloads differ from offline benchmarks — conversations are long-running and prefill chunks can trigger compression at unexpected points. We recommend the following adjustments:

# Required: path to precomputed frequency statistics
export TRIATTN_RUNTIME_SPARSE_STATS_PATH=triattention/vllm/stats/qwen3_32b_int4_stats.pt

# Use a larger KV budget for multi-turn chat (default: 2048)
export TRIATTN_RUNTIME_KV_BUDGET=12000

vllm serve <model_path> \
    --dtype bfloat16 \
    --max-model-len 32768 \
    --enforce-eager \
    --trust-remote-code \
    --enable-prefix-caching false \
    --max-num-batched-tokens 1024

Key differences from the default server mode:

  • --enable-prefix-caching false: Prefix caching is currently incompatible with KV compression; disable it to avoid incorrect cache hits on compressed entries.
  • --max-num-batched-tokens 1024: Limits the prefill chunk size. A large chunk can push the cache past the KV budget in a single step, before compression has a chance to trigger, and cause an out-of-memory (OOM) error.
  • TRIATTN_RUNTIME_KV_BUDGET=12000: Chat sessions accumulate context across many turns; a larger budget (e.g. 12k) keeps more history available and avoids aggressive eviction.
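
The interaction between chunk size and KV budget can be illustrated with a toy model. It assumes, as a simplification that is not taken from the TriAttention code, that a whole prefill chunk is written to the cache before compression can trim it back to the budget; the function name `peak_kv_tokens` is hypothetical.

```python
def peak_kv_tokens(prompt_len, chunk_size, kv_budget):
    """Toy model: compression can only run between prefill chunks.

    Returns the largest number of tokens held in the cache at any point.
    """
    cached = 0
    peak = 0
    remaining = prompt_len
    while remaining > 0:
        step = min(chunk_size, remaining)
        cached += step              # the whole chunk lands in the cache first
        peak = max(peak, cached)
        if cached > kv_budget:      # only then is the cache trimmed to budget
            cached = kv_budget
        remaining -= step
    return peak


# A 30,000-token prompt with a 12,000-token budget (illustrative numbers).
large_chunks = peak_kv_tokens(30000, chunk_size=8192, kv_budget=12000)
small_chunks = peak_kv_tokens(30000, chunk_size=1024, kv_budget=12000)
print(large_chunks, small_chunks)
```

Under this model the peak is bounded by roughly the budget plus one chunk, which is why a smaller --max-num-batched-tokens keeps memory use closer to the configured budget.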

Installation

git clone https://github.com/WeianMao/triattention.git
cd triattention
pip install -e .
pip install flash-attn --no-build-isolation  # recommended

Quick Start

python scripts/cli.py run-one \
    --model Qwen3-8B \
    --dataset aime24 \
    --method triattention \
    --budget 2048
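
To compare several budgets, the command above can be generated from Python. This is a minimal sketch that uses only the flags shown in the Quick Start; the helper name `run_one_command` is hypothetical, and the budget values are picked for illustration.

```python
import shlex


def run_one_command(model, dataset, method="triattention", budget=2048):
    """Build the argv list for one `scripts/cli.py run-one` invocation."""
    return [
        "python", "scripts/cli.py", "run-one",
        "--model", model,
        "--dataset", dataset,
        "--method", method,
        "--budget", str(budget),
    ]


# One command per budget; nothing is executed here.
commands = [run_one_command("Qwen3-8B", "aime24", budget=b) for b in (1024, 2048, 4096)]
for cmd in commands:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the repository root to launch the run.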

Datasets

Benchmark datasets (AIME 2024, AIME 2025, MATH-500) are automatically downloaded from HuggingFace on first run -- no manual data preparation is needed. The evaluation scripts handle downloading, caching, and formatting transparently.

Supported Models

| Model | HuggingFace ID | Status |
| --- | --- | --- |
| Qwen3-8B | Qwen/Qwen3-8B | Verified |
| DeepSeek-R1-Distill-Llama-8B | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | Verified |
| DeepSeek-R1-Distill-Qwen-7B | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | Verified |

Results

AIME24 / AIME25 accuracy (KV budget = 2048; DS-Llama-8B uses 512)

| Method | Qwen3-8B | DS-Llama-8B | DS-Qwen-7B | GPT-OSS-20B |
| --- | --- | --- | --- | --- |
| Full Attention | 57.1 / 40.8 | 50.4 / 31.4 | 43.8 / 34.2 | 69.2 / 60.0 |
| SnapKV | 34.6 / 20.0 | 5.0 / 6.7 | 34.6 / 25.0 | 48.3 / 36.7 |
| R-KV | 25.4 / 17.5 | 25.8 / 11.2 | 34.6 / 23.3 | 49.6 / 39.2 |
| TriAttention | 42.1 / 32.9 | 33.8 / 19.6 | 42.5 / 30.0 | 59.2 / 49.2 |

Throughput (Qwen3-8B, tokens/sec)

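| Benchmark | TriAttn Budget | Full Acc | TriAttn Acc | Full Throughput | TriAttn Throughput | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| MATH-500 | 1024 | 69.6 | 68.4 | 222.8 | 1405.2 | 6.3x |
| AIME24 | 4096 | 57.1 | 54.6 | 222.8 | 413.9 | 1.9x |
| AIME25 | 3072 | 40.8 | 40.8 | 222.8 | 563.5 | 2.5x |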
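
The Speedup column is the ratio of TriAttention throughput to Full Attention throughput. The short check below recomputes it from the figures in the table; the variable names are illustrative.

```python
# (Full Attention tokens/sec, TriAttention tokens/sec), copied from the table.
throughput = {
    "MATH-500": (222.8, 1405.2),
    "AIME24": (222.8, 413.9),
    "AIME25": (222.8, 563.5),
}

# Speedup = TriAttention throughput / Full Attention throughput, to one decimal.
speedups = {
    name: round(tri / full, 1)
    for name, (full, tri) in throughput.items()
}

for name, value in speedups.items():
    print(f"{name}: {value}x")
```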

See docs/results.md for complete results including MATH-500 accuracy table, accuracy vs. budget curves, and DFS memory retention analysis.

vLLM Integration

TriAttention includes a vLLM plugin that enables transparent KV cache compression for production deployment. After installation, vLLM automatically discovers and activates the plugin -- no code changes required.

Server Mode (OpenAI-Compatible API)

# Set compression parameters
export TRIATTN_RUNTIME_KV_BUDGET=2048
export TRIATTN_RUNTIME_SPARSE_STATS_PATH=triattention/vllm/stats/qwen3_32b_int4_stats.pt

# Launch vLLM server -- TriAttention activates automatically
vllm serve <model_path> \
    --dtype bfloat16 \
    --max-model-len 32768 \
    --enforce-eager \
    --trust-remote-code \
    --enable-prefix-caching false

# Use the standard OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "<model_path>", "messages": [{"role": "user", "content": "Solve: ..."}]}'
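
The same request can be issued from Python with the standard library alone. This is a minimal sketch, assuming the server follows the usual OpenAI chat-completions response layout (`choices[0].message.content`); the helper names are hypothetical, and the long default timeout is a guess suited to long reasoning outputs.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=600):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Building the request does not contact the server.
request = build_chat_request("http://localhost:8000/v1", "<model_path>", "Solve: ...")
print(request.full_url)
```

With a server running, `send_chat_request(request)` returns the generated text.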

Python API

from triattention.vllm.runtime.integration_monkeypatch import (
    install_vllm_integration_monkeypatches,
)

# Install patches before creating the LLM instance
install_vllm_integration_monkeypatches(patch_scheduler=True, patch_worker=True)

# Standard vLLM API -- compression happens transparently
from vllm import LLM, SamplingParams

llm = LLM(
    model="<model_path>",
    dtype="bfloat16",
    max_model_len=32768,
    enforce_eager=True,
    trust_remote_code=True,
)

outputs = llm.generate(["Your prompt here"], SamplingParams(temperature=0.6, top_p=0.95))
print(outputs[0].outputs[0].text)

Configuration Reference

| Environment Variable | Default | Description |
| --- | --- | --- |
| TRIATTN_RUNTIME_KV_BUDGET | 2048 | Maximum tokens retained in KV cache per request |
| TRIATTN_RUNTIME_DIVIDE_LENGTH | 128 | Compression trigger interval (every N new tokens) |
| TRIATTN_RUNTIME_WINDOW_SIZE | 128 | Recent tokens always preserved |
| TRIATTN_RUNTIME_PRUNING_MODE | per_head | Token selection strategy (per_head or per_layer_per_head) |
| TRIATTN_RUNTIME_SPARSE_STATS_PATH | -- | Path to precomputed frequency statistics .pt file |
| TRIATTN_RUNTIME_PROTECT_PREFILL | false | Protect initial prompt tokens from eviction |
| TRIATTN_RUNTIME_ENABLE_EXPERIMENTAL_KV_COMPACTION | true | Enable in-place KV cache compaction |
| TRIATTN_RUNTIME_ENABLE_EXPERIMENTAL_BLOCK_RECLAIM | true | Enable freed block reclamation |
| ENABLE_TRIATTENTION | true | Master switch to enable/disable the plugin |
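
Before launching a server it can help to sanity-check these variables. The sketch below is a hypothetical pre-flight helper, not part of TriAttention: the defaults and allowed pruning modes are copied from the table above, while the validation rules (positive integers, a non-empty stats path) are assumptions about what the plugin accepts.

```python
import os

# Defaults copied from the configuration table.
DEFAULTS = {
    "TRIATTN_RUNTIME_KV_BUDGET": "2048",
    "TRIATTN_RUNTIME_DIVIDE_LENGTH": "128",
    "TRIATTN_RUNTIME_WINDOW_SIZE": "128",
    "TRIATTN_RUNTIME_PRUNING_MODE": "per_head",
}
PRUNING_MODES = {"per_head", "per_layer_per_head"}
INTEGER_VARS = (
    "TRIATTN_RUNTIME_KV_BUDGET",
    "TRIATTN_RUNTIME_DIVIDE_LENGTH",
    "TRIATTN_RUNTIME_WINDOW_SIZE",
)


def check_triattention_env(env=None):
    """Return a list of problems found; an empty list means nothing looked wrong."""
    env = os.environ if env is None else env
    problems = []

    # The stats path has no default, so it must be set explicitly.
    if not env.get("TRIATTN_RUNTIME_SPARSE_STATS_PATH"):
        problems.append("TRIATTN_RUNTIME_SPARSE_STATS_PATH is not set")

    # Assumed rule: token counts must be positive integers.
    for name in INTEGER_VARS:
        value = env.get(name, DEFAULTS[name])
        if not value.isdigit() or int(value) <= 0:
            problems.append(f"{name} should be a positive integer, got {value!r}")

    mode = env.get("TRIATTN_RUNTIME_PRUNING_MODE", DEFAULTS["TRIATTN_RUNTIME_PRUNING_MODE"])
    if mode not in PRUNING_MODES:
        problems.append(f"TRIATTN_RUNTIME_PRUNING_MODE should be one of {sorted(PRUNING_MODES)}, got {mode!r}")

    return problems


for problem in check_triattention_env():
    print("warning:", problem)
```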

Precomputed Statistics

TriAttention requires precomputed Q/K frequency statistics for scoring. We provide pre-calibrated stats for supported models in triattention/vllm/stats/. See the Calibration Guide for generating stats for custom models.

Roadmap

  • vLLM integration
  • SGLang integration
  • Ollama integration
  • Support for more model architectures

Citation

@article{mao2026triattention,
    title={TriAttention: Efficient Long Reasoning with Trigonometric KV Compression},
    author={Weian Mao and Xi Lin and Wei Huang and Yuxin Xie and Tianfu Fu and Bohan Zhuang and Song Han and Yukang Chen},
    year={2026},
    eprint={2604.04921},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Acknowledgements

We thank the following projects for their contributions and inspiration: R-KV | SnapKV

License

This project is licensed under the Apache License 2.0. See LICENSE for details.
