[ray.data.llm] Add hint of how to optimize throughput #52634
kouroshHakha merged 4 commits into ray-project:master from
Conversation
Signed-off-by: Linkun Chen <github@lkchen.net>
# Core stage -- the vLLM engine.
if config.batch_size * config.max_concurrent_batches < DEFAULT_VLLM_BATCH_SIZE:
I don't get why DEFAULT_VLLM_BATCH_SIZE is set to 256. And why does this warning make sense?
256 comes from vLLM; I've refactored to always read it from vLLM instead of hardcoding it.
This warning has two parts:
- the product of batch_size and max_concurrent_batches indicates the total number of concurrent prompts; if this product is too small, vLLM is under-utilized
- I want users to increase max_concurrent_batches instead of batch_size, since the latter causes long-tail blocking
Which part doesn't make sense to you? Could you clarify?
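To make the two parts concrete, here is a minimal sketch of the check being discussed. The function name and the way the engine capacity is passed in are assumptions for illustration, not the actual Ray source:

```python
# Hypothetical sketch of the warning logic (names are assumptions, not the
# exact Ray Data code). Warn when the total number of concurrent prompts Ray
# Data can feed the engine is below the engine's own batch capacity.
def check_throughput_hint(batch_size, max_concurrent_batches, engine_max_num_seqs):
    """Return a hint string if vLLM would be under-utilized, else None."""
    total_concurrent_prompts = batch_size * max_concurrent_batches
    if total_concurrent_prompts < engine_max_num_seqs:
        # Recommend raising max_concurrent_batches rather than batch_size:
        # a larger batch waits on its slowest prompt (long-tail blocking).
        return (
            f"batch_size * max_concurrent_batches = {total_concurrent_prompts} "
            f"< engine capacity {engine_max_num_seqs}; consider increasing "
            "max_concurrent_batches to saturate the engine."
        )
    return None
```

The point of the hint is only the product: raising either knob removes the warning, but raising max_concurrent_batches does so without making any single batch wait longer on its slowest prompt.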
OK, the explanation is clear now. The 256 really comes from the engine_kwargs of vLLM; it's not hardcoded inside vLLM either. Basically you are saying Ray Data will adjust itself to the corresponding max_seq number set on the vLLM engine replica by adjusting max_concurrent_batches instead of the batch size. Can we attach some reliable benchmark datapoints to this PR for different combos of batch_size and max_concurrent_batches to show the basis of this choice?
What I mean is that we should run a benchmark sweeping batch_size and max_concurrent_batches under similar max_seqs.
Basically:
for max_seq in [128, 256, 512]:
    for (bsize, max_concurrent_batches) in [(1, max_seq), (2, max_seq / 2), ..., (max_seq, 1)]:
        measure: E2E runtime on a fixed dataset of, say, 10k rows
For baseline comparison, also measure E2E time with bsize=10k, max_concurrent_batches=1 at the same max_seq levels.
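The sweep above can be sketched as a runnable harness. run_pipeline is a stand-in for the real Ray Data + vLLM job (its name and signature are assumptions); the combos double bsize each step, a subset of the full (1, max_seq) ... (max_seq, 1) grid, and the bsize=10k baseline is added per max_seq:

```python
import time

def sweep(run_pipeline, max_seqs=(128, 256, 512)):
    """Time run_pipeline for each (bsize, max_concurrent_batches) combo."""
    results = {}
    for max_seq in max_seqs:
        bsize = 1
        while bsize <= max_seq:
            mcb = max_seq // bsize  # keep bsize * max_concurrent_batches fixed
            start = time.perf_counter()
            run_pipeline(batch_size=bsize, max_concurrent_batches=mcb,
                         max_seq=max_seq)
            results[(max_seq, bsize, mcb)] = time.perf_counter() - start
            bsize *= 2
        # Baseline: one giant batch covering the whole dataset, no concurrency.
        start = time.perf_counter()
        run_pipeline(batch_size=10_000, max_concurrent_batches=1,
                     max_seq=max_seq)
        results[(max_seq, 10_000, 1)] = time.perf_counter() - start
    return results
```

In a real run, run_pipeline would build and execute the 10k-row dataset through the processor; here it is only a hook so the sweep structure itself is testable.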
max_tasks_in_flight_per_actor=max(
    DEFAULT_MAX_TASKS_IN_FLIGHT, config.max_concurrent_batches
),
),
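The intent of the snippet above, sketched in isolation (DEFAULT_MAX_TASKS_IN_FLIGHT's real value is not shown in this thread; 4 below is an illustrative assumption): the actor pool must allow at least max_concurrent_batches tasks in flight per actor, otherwise the extra concurrent batches could never be dispatched.

```python
# Illustrative value only; the real constant lives in Ray Data.
DEFAULT_MAX_TASKS_IN_FLIGHT = 4

def tasks_in_flight_per_actor(max_concurrent_batches: int) -> int:
    """Never let the per-actor in-flight cap fall below the batch concurrency."""
    return max(DEFAULT_MAX_TASKS_IN_FLIGHT, max_concurrent_batches)
```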
@raulchen if this is deprecated, what is the right way to control max_tasks_in_flight_per_actor properly?
That's what the comment says: this deprecated field is the only way to control max_tasks_in_flight_per_actor.
As users of Ray Data, yes, but Ray Data should either not deprecate this or provide a more stable solution. I want to understand whether this is what is recommended for the issue above. cc @alexeykudinkin @gvspraveen @richardliaw
Unfortunately, we haven't exposed a new API for this yet.
I created a ticket here: #52667
For now, let's use the current approach.
@kouroshHakha yes, I started running max_seq=256 an hour ago
OK, discussed offline. With this PR we are basically enabling configuring max_concurrency on a UDF actor pool. By modifying bsize and max_concurrency we can shave the overhead from roughly 20% down to roughly 10% compared to async vLLM for a single replica. The remaining overhead must be Ray serialization, etc., which will be an insignificant cost relative to the value of horizontal scaling. Both @lk-chen and I agree that we should put a pin in this and just be aware that with a single replica there could be a 7-10% overhead compared to async vLLM.
Why are these changes needed?
LLM tasks are usually long-running, and their durations vary a lot. This can easily cause a long-tail problem if the batch size is too large.
For example, within a batch, most prompts have finished while one prompt keeps decoding, blocking the whole batch from finishing. Ray Data cannot schedule more batches if the long tail happens in all running batches. The vLLM engine is not saturated in this case (it is only decoding one prompt from each batch, while vLLM could potentially handle 256 sequences concurrently), causing low throughput.
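The long-tail effect described above can be illustrated with a toy simulation (all numbers below are illustrative assumptions, not benchmark results). A batch occupies its concurrency slot until its slowest prompt finishes, so a large batch serializes everything behind a single long prompt, while small batches release their slots early:

```python
import heapq
import random

def makespan(durations, batch_size, max_concurrent_batches):
    """Total time to finish all prompts, given that a batch holds a
    concurrency slot until its slowest prompt completes."""
    batches = [durations[i:i + batch_size]
               for i in range(0, len(durations), batch_size)]
    in_flight = []  # min-heap of finish times of running batches
    now = 0.0
    for batch in batches:
        if len(in_flight) == max_concurrent_batches:
            now = heapq.heappop(in_flight)  # wait for a slot to free up
        heapq.heappush(in_flight, now + max(batch))
    return max(in_flight)

random.seed(0)
prompts = [random.expovariate(1.0) for _ in range(2048)]  # heavy-ish tail
# Same total concurrency (batch_size * max_concurrent_batches = 256), but
# many small batches overlap their stragglers instead of waiting on them.
few_large = makespan(prompts, batch_size=256, max_concurrent_batches=1)
many_small = makespan(prompts, batch_size=8, max_concurrent_batches=32)
```

Under this toy model, many_small comes out well below few_large, which matches the hint: shrink batch_size, grow max_concurrent_batches.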
This PR adds a hint suggesting a small batch_size to avoid the long tail and a large max_concurrent_batches to saturate the engine.
Benchmarking on a 10k ShareGPT dataset, on an L40S GPU (vLLM 0.8.4, VLLM_USE_V1=0)
Related issue number
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.