Add disable_chunked_prefix_cache feature to TRTLLM MLA #10178
elfiegg wants to merge 5 commits into sgl-project:main
Conversation
done @zhyncs
Lines 146 to 149 in 7a40e4f
Shall I open an issue for tracking? @Fridge003
Issue for tracking CI for the FP4/FP8 DeepSeek models: #10237
@Fridge003 that's a temporary solution that doesn't seem to address the root cause, and it only applies to the FP4 model. Also, perf is going to drop to 1/2 with the FA2 backend.
@Fridge003 but I was debugging with Shu yesterday: we both ran the FP4 model with the FlashInfer FA2 kernel and the issue went away. Either the cutlass or the TRTLLM kernel would cause the accuracy drop.
Looks like the issue is that the FP4 model is triggering #8995, in which cutlass / trtllm have 100% mismatched elements compared to FA2.
Merged changes into #10180
Motivation
FlashInfer chunked-prefill has an accuracy issue for the DeepSeek FP4 model. This PR adds a feature to disable it as a temporary workaround.
Currently, if the chunked-prefill cache is disabled, it will
accuracy after the fix
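For reproducing the workaround, a minimal sketch of how a launch command with the feature switched off might be assembled is shown here. It assumes the CLI flag is spelled `--disable-chunked-prefix-cache` (derived from the `disable_chunked_prefix_cache` name in the PR title) and that the backend is selected with `--attention-backend trtllm_mla`; the helper `build_launch_command` and the model path are hypothetical placeholders, not part of this PR.

```python
import shlex


def build_launch_command(model_path, disable_chunked_prefix_cache=True):
    """Build an argv list for launching the server (hypothetical helper).

    The flag names below are assumptions based on the PR title; check
    them against the server's --help output before relying on them.
    """
    cmd = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--attention-backend", "trtllm_mla",
    ]
    if disable_chunked_prefix_cache:
        # Assumed flag: turns off the chunked prefix cache path.
        cmd.append("--disable-chunked-prefix-cache")
    return cmd


if __name__ == "__main__":
    # Placeholder path; substitute the FP4 DeepSeek checkpoint under test.
    print(shlex.join(build_launch_command("path/to/deepseek-fp4-checkpoint")))
```

Building the command as a list keeps it easy to toggle the flag when comparing accuracy with and without the chunked prefix cache.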
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist