Motivation
`prepare_lora_batch` is triggered once per forward pass and is one of the main sources of performance overhead from LoRA. Following a suggestion from @Fridge003, there is some low-hanging fruit for performance optimization, such as eliminating unnecessary CUDA device syncs.
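As an illustration of the kind of sync being eliminated, here is a hedged sketch (not SGLang's actual code; the `seg_lens`/`seg_indptr` names are illustrative): round-tripping small per-batch metadata through Python via `.tolist()`/`.item()` forces a device-to-host sync on every forward pass, while the equivalent on-device op does not.

```python
import torch

def build_seg_indptr_with_sync(seg_lens: torch.Tensor) -> torch.Tensor:
    # .tolist() forces a device->host sync when seg_lens lives on the GPU,
    # then the result is copied back to the device -- once per forward pass.
    lens = seg_lens.tolist()
    indptr = [0]
    for n in lens:
        indptr.append(indptr[-1] + n)
    return torch.tensor(indptr, dtype=torch.int64, device=seg_lens.device)

def build_seg_indptr_no_sync(seg_lens: torch.Tensor) -> torch.Tensor:
    # torch.cumsum keeps the computation on-device: no host round trip, no sync.
    indptr = torch.zeros(seg_lens.numel() + 1, dtype=torch.int64,
                         device=seg_lens.device)
    indptr[1:] = torch.cumsum(seg_lens, dim=0)
    return indptr
```

Both functions produce the same prefix-sum tensor; the second simply never leaves the device, which is the pattern the optimizations above apply.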
Current status:

| Metric | Baseline | #6960 | + #6994 | + #8940 |
|---|---|---|---|---|
| ITL@P95 | 78.42 ms | 68.24 ms (-13.0%) | 52.51 ms (-33.0%) | 38.40 ms (-51.0%) |
| ITL@P50 | 34.36 ms | 32.85 ms (-4.4%) | 22.68 ms (-34.0%) | 18.30 ms (-46.7%) |
| TTFT@P50 | 91.37 ms | 85.52 ms (-6.5%) | 62.65 ms (-31.4%) | 53.79 ms (-41.1%) |
Experiment with torch.compile / CUDA graph for `prepare_lora_batch` to reduce gaps between kernels (idea from @hebiao064, to be verified). Deprioritized: most tensor ops are now moved to the server init phase by #8940 (Improve LoRA Perf by Deprecating FlashInfer and Eliminating Redundant Tensor Ops).