Support kv8 (FP8) with torch_native attention backend #12596
Conversation
Force-pushed from 773c7a0 to 64ff639
@Fridge003 Thanks for your review and approval. Could someone help merge this PR? Thanks~
@JackChuang Please fix the conflict.
Force-pushed from 64ff639 to 5faf913
@Fridge003 Could you please help merge this PR when you have free cycles? Thank you. |
This patch fixes the issue where KV8 could not run when the attention backend was set to torch_native. It also updates the attention backend support document.
Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
Force-pushed from 5faf913 to 8ee8295
@JackChuang Do you have any accuracy benchmark results for enabling the FP8 KV cache with the torch_native backend?
I tested performance but not accuracy. I'll run the accuracy tests and then post an update.
@Fridge003 With torch_native and KV8, the precision is essentially lossless. Results: [KV16] [KV8]
@JackChuang Please merge the main branch. |
This patch fixes the issue where KV8 could not run when the attention backend was set to torch_native.
Motivation
Currently, when using --attention-backend torch_native, the --kv-cache-dtype fp8_e4m3 option is not supported, causing KV cache in FP8 to fail. This patch fixes the issue by ensuring that the query, key, and value tensors are cast to the same dtype before calling scaled_dot_product_attention.
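For reference, the affected configuration corresponds to a launch command along the lines of `python -m sglang.launch_server --model-path <model> --attention-backend torch_native --kv-cache-dtype fp8_e4m3` (model path is a placeholder); before this patch, such a launch fails once attention runs on the FP8 KV cache.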
Modifications
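A minimal sketch of the idea behind the fix, assuming PyTorch ≥ 2.1 for the `float8_e4m3fn` dtype. The names below are illustrative rather than the actual SGLang code, and a production path would also fold in the KV dequantization scales:

```python
import torch
import torch.nn.functional as F

def sdpa_with_fp8_kv(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # scaled_dot_product_attention requires q, k, and v to share a dtype, so
    # an FP8 KV cache must be upcast to the query dtype before the call.
    if k.dtype != q.dtype:
        k = k.to(q.dtype)
    if v.dtype != q.dtype:
        v = v.to(q.dtype)
    return F.scaled_dot_product_attention(q, k, v)

# Usage: a bf16 query attending over an FP8-stored (e4m3) KV cache.
q = torch.randn(1, 8, 128, 64, dtype=torch.bfloat16)
k = torch.randn(1, 8, 128, 64, dtype=torch.bfloat16).to(torch.float8_e4m3fn)
v = torch.randn(1, 8, 128, 64, dtype=torch.bfloat16).to(torch.float8_e4m3fn)
out = sdpa_with_fp8_kv(q, k, v)  # succeeds; passing mixed dtypes would raise
```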
Accuracy Tests
Tested in another PR #12612
Benchmarking and Profiling
Tested in another PR #12612
Checklist