
Optimize KV cache dequantization performance #9528

Merged
yaochengji merged 1 commit into pytorch:master from kyuyeunk:optimize_kv_cache_dequant on Aug 1, 2025

Conversation

kyuyeunk (Contributor) commented Aug 1, 2025

This change reduces casting of the quantized KV cache during dequantization to improve performance.
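
For illustration, here is a minimal sketch of the idea, not the actual ragged_paged_attention_v2.py Pallas kernel: dequantize the int8 KV block with a single cast straight to the compute dtype instead of going through a wider intermediate, and fold the quantization scale into the score computation. Names, shapes, and the per-channel scale layout below are hypothetical.

```python
import jax.numpy as jnp

def dequant_kv_block(kv_q, kv_scale, compute_dtype=jnp.bfloat16):
    """Dequantize an int8 KV block with one cast directly to the compute dtype.

    kv_q:     [num_kv, head_dim] int8 quantized keys or values (hypothetical layout).
    kv_scale: per-channel quantization scale (hypothetical layout).
    """
    # Single cast to the compute dtype; no float32 round trip before the matmul.
    return kv_q.astype(compute_dtype) * kv_scale.astype(compute_dtype)

def attention_scores(q, k_q, k_scale):
    """Compute q @ k.T against a quantized K block, casting K only once."""
    k = dequant_kv_block(k_q, k_scale, compute_dtype=q.dtype)
    return jnp.einsum("qd,kd->qk", q, k)
```

The real kernel operates on paged, ragged KV blocks inside Pallas, but the same principle applies: minimize the number of dtype conversions on the dequantization hot path.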

kyuyeunk force-pushed the optimize_kv_cache_dequant branch from 143e835 to 08adc53 on August 1, 2025 at 01:54
Comment thread on torch_xla/experimental/pallas_kernels/ragged_paged_attention_v2.py (outdated)
kyuyeunk force-pushed the optimize_kv_cache_dequant branch from 08adc53 to 6ccf85c on August 1, 2025 at 18:27
yaochengji (Collaborator) left a comment

LGTM, thanks for the improvement!

kyuyeunk (Contributor, Author) commented Aug 1, 2025

> LGTM, thanks for the improvement!

Thanks! Can you press this PR's merge button for me?

Comment thread on torch_xla/experimental/pallas_kernels/ragged_paged_attention_v2.py
bythew3i (Contributor) commented Aug 1, 2025

> LGTM, thanks for the improvement!

> Thanks! Can you press this PR's merge button for me?

Please ping me if you check anything into RPA.

kyuyeunk (Contributor, Author) commented Aug 1, 2025

> LGTM, thanks for the improvement!

> Thanks! Can you press this PR's merge button for me?

> Please ping me if you check anything into RPA.

Ack, will always ping you on any RPA-related changes.

Comment thread on torch_xla/experimental/pallas_kernels/ragged_paged_attention_v2.py
yaochengji enabled auto-merge (squash) on August 1, 2025 at 23:51
yaochengji merged commit 9995e97 into pytorch:master on Aug 1, 2025
23 of 24 checks passed
kyuyeunk deleted the optimize_kv_cache_dequant branch on August 2, 2025 at 01:20
kyuyeunk added a commit to vllm-project/tpu-inference that referenced this pull request on Aug 13, 2025:

Adds the following changes:
- Add support for query quantization (w8a8)
- Optimize performance of KV cache quantization (similar approach to pytorch/xla#9528)

Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com>
kyuyeunk added a commit to vllm-project/tpu-inference that referenced this pull request on Aug 14, 2025:

Adds the following changes:
- Add support for query quantization (w8a8)
- Optimize performance of KV cache quantization (similar approach to pytorch/xla#9528)

Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com>

Labels: none · Projects: none · 3 participants