[Feature] add LoRADrainer to address high P99 TTFT #17913
Fridge003 merged 13 commits into sgl-project:main
Conversation
Code Review
This pull request introduces a LoRADrainer to address high P99 TTFT for LoRA requests by preventing adapter starvation. The implementation is well-structured, introducing a new LoRADrainer class with a draining mechanism that seems sound. The integration into the Scheduler is clean, and the addition of unit tests for the new functionality is a great practice. I have one minor suggestion to improve code robustness in the scheduler.
|
From the benchmark, it seems that this draining strategy will harm median latency/TTFT. |
Added - it is turned off by default. |
|
/tag-run-ci-label |
|
/rerun-failed-ci |
Head branch was pushed to by a user without write access
|
@Fridge003 I think it's good to merge now |
Motivation
Currently, our LoRA implementation suffers from extremely high P99 TTFT (time to first token). For instance, running the scripts below on an A100-SXM4-80GB:
gives us the following results:
That means 1% of requests take almost 40 seconds to be scheduled, compared with a median of 83 ms.
Modifications
LoRADrainer class to force hot adapters to start draining for cold adapters that have been starved
Accuracy Tests
test_lora_drainer.py
Benchmarking and Profiling
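The draining idea behind the LoRADrainer item under Modifications can be pictured with a minimal sketch. This is not the code from this PR: the class name DrainerSketch, the max_wait_s threshold, and the rule for picking which hot adapter to drain are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DrainerSketch:
    """Hypothetical illustration only; not the LoRADrainer class from this PR."""

    max_wait_s: float  # assumed starvation threshold, in seconds
    draining: set = field(default_factory=set)

    def update(self, now, loaded, oldest_wait_start):
        """Recompute which loaded ("hot") adapters should drain.

        loaded: set of adapter ids that currently occupy an adapter slot.
        oldest_wait_start: adapter id -> arrival time of its oldest queued request.
        """
        # A cold adapter counts as starved once its oldest queued request
        # has waited longer than the threshold without the adapter being loaded.
        starved = [
            adapter
            for adapter, t0 in oldest_wait_start.items()
            if adapter not in loaded and now - t0 > self.max_wait_s
        ]
        # Assumed policy: drain one loaded adapter per starved adapter,
        # chosen in sorted order so the result is deterministic.
        self.draining = set(sorted(loaded)[: len(starved)])
        return self.draining

    def can_admit(self, adapter):
        """A draining adapter gets no new requests, so its slot can free up."""
        return adapter not in self.draining
```

With adapters a and b loaded and a request for adapter c waiting past the threshold, update marks a as draining. A scheduler using this sketch would stop admitting new requests for a until its running requests finish and the slot can go to c.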