
Optimize KV cache dequantization performance #9528

Merged
yaochengji merged 1 commit into pytorch:master from kyuyeunk:optimize_kv_cache_dequant on Aug 1, 2025

Conversation

kyuyeunk (Contributor) commented Aug 1, 2025

This change reduces casting of the quantized KV cache during dequantization to improve performance.
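
For illustration, here is a minimal sketch of the idea, not the actual ragged_paged_attention_v2.py Pallas kernel: dequantize the int8 KV block with a single cast straight to the compute dtype instead of going through a wider intermediate, and fold the quantization scale into the score computation. Names, shapes, and the per-channel scale layout below are hypothetical.

```python
import jax.numpy as jnp

def dequant_kv_block(kv_q, kv_scale, compute_dtype=jnp.bfloat16):
    """Dequantize an int8 KV block with one cast directly to the compute dtype.

    kv_q:     [num_kv, head_dim] int8 quantized keys or values (hypothetical layout).
    kv_scale: per-channel quantization scale (hypothetical layout).
    """
    # Single cast to the compute dtype; no float32 round trip before the matmul.
    return kv_q.astype(compute_dtype) * kv_scale.astype(compute_dtype)

def attention_scores(q, k_q, k_scale):
    """Compute q @ k.T against a quantized K block, casting K only once."""
    k = dequant_kv_block(k_q, k_scale, compute_dtype=q.dtype)
    return jnp.einsum("qd,kd->qk", q, k)
```

The real kernel operates on paged, ragged KV blocks inside Pallas, but the same principle applies: minimize the number of dtype conversions on the dequantization hot path.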

kyuyeunk force-pushed the optimize_kv_cache_dequant branch from 143e835 to 08adc53 on August 1, 2025 at 01:54
Comment thread on torch_xla/experimental/pallas_kernels/ragged_paged_attention_v2.py (outdated)
kyuyeunk force-pushed the optimize_kv_cache_dequant branch from 08adc53 to 6ccf85c on August 1, 2025 at 18:27
yaochengji (Collaborator) left a comment

LGTM, thanks for the improvement!

kyuyeunk (Contributor, Author) commented Aug 1, 2025

> LGTM, thanks for the improvement!

Thanks! Can you press this PR's merge button for me?

Comment thread on torch_xla/experimental/pallas_kernels/ragged_paged_attention_v2.py
bythew3i (Contributor) commented Aug 1, 2025

> LGTM, thanks for the improvement!

> Thanks! Can you press this PR's merge button for me?

Please ping me if you check anything into RPA.

kyuyeunk (Contributor, Author) commented Aug 1, 2025

> LGTM, thanks for the improvement!

> Thanks! Can you press this PR's merge button for me?

> Please ping me if you check anything into RPA.

Ack, will always ping you on any RPA-related changes.

Comment thread on torch_xla/experimental/pallas_kernels/ragged_paged_attention_v2.py
yaochengji enabled auto-merge (squash) on August 1, 2025 at 23:51
yaochengji merged commit 9995e97 into pytorch:master on Aug 1, 2025
23 of 24 checks passed
kyuyeunk deleted the optimize_kv_cache_dequant branch on August 2, 2025 at 01:20
kyuyeunk added a commit to vllm-project/tpu-inference that referenced this pull request on Aug 13, 2025:

Adds the following changes:
- Add support for query quantization (w8a8)
- Optimize performance of KV cache quantization (similar approach to pytorch/xla#9528)

Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com>
kyuyeunk added a commit to vllm-project/tpu-inference that referenced this pull request on Aug 14, 2025:

Adds the following changes:
- Add support for query quantization (w8a8)
- Optimize performance of KV cache quantization (similar approach to pytorch/xla#9528)

Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com>

Labels: none · Projects: none · 3 participants