[Model] enable data parallel for InternVL vision encoder#23909
vllm-bot merged 14 commits into vllm-project:main
Conversation
Code Review
This pull request introduces data parallelism for the InternVL vision encoder, a valuable optimization for multi-modal models. The implementation correctly uses a use_data_parallel flag to switch between tensor-parallel and replicated linear layers within the vision transformer's attention and MLP blocks. The forward pass for the data-parallel mode is handled by run_dp_sharded_vision_model, which correctly shards the input batch across the available GPUs. The changes are well-contained, follow existing design patterns within the vLLM codebase, and include the necessary updates to the model configuration and documentation. The implementation appears solid and I have no major concerns.
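For readers skimming the review, here is a minimal sketch of the pattern it describes, assuming the standard vLLM linear layers (`ColumnParallelLinear`, `RowParallelLinear`, `ReplicatedLinear`). This is an illustration, not the PR's verbatim code; the real block lives in the InternViT model file and carries quantization and prefix plumbing omitted here:

```python
import torch.nn as nn
from vllm.model_executor.layers.linear import (ColumnParallelLinear,
                                               ReplicatedLinear,
                                               RowParallelLinear)

class VisionMLP(nn.Module):
    """Simplified stand-in for the InternViT MLP block."""

    def __init__(self, hidden_size: int, intermediate_size: int,
                 use_data_parallel: bool = False):
        super().__init__()
        if use_data_parallel:
            # DP mode: every rank keeps the full weights and processes only
            # its shard of the image batch.
            self.fc1 = ReplicatedLinear(hidden_size, intermediate_size)
            self.fc2 = ReplicatedLinear(intermediate_size, hidden_size)
        else:
            # TP mode: weights are sharded across ranks; all ranks process
            # the full image batch together.
            self.fc1 = ColumnParallelLinear(hidden_size, intermediate_size)
            self.fc2 = RowParallelLinear(intermediate_size, hidden_size)
        self.act = nn.GELU()

    def forward(self, x):
        x, _ = self.fc1(x)  # vLLM linear layers return (output, bias)
        x, _ = self.fc2(self.act(x))
        return x
```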
Signed-off-by: Yiwen Chen <yiwen66@berkeley.edu>
Thanks, can you attach benchmark results to show the performance improvement?
Signed-off-by: Yiwen Chen <yiwen66@berkeley.edu>
Hi @DarkLight1337 I tested with InternVL2-1B and dp size 2 but did not see a significant improvement. Should I try a larger dp size? Let me know if you have any insights :)
Test Plan: TP: … DP: … Benchmark: …
Test Result: TP: … DP: …
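(The test-plan commands were not preserved above. Purely as an illustration, an offline A/B comparison of the two encoder modes might look like the sketch below; `mm_encoder_tp_mode` is an assumption about the engine-argument name on recent vLLM builds — older builds exposed `--enable-multimodal-encoder-data-parallel` — and each mode should be timed in a fresh process.)

```python
import sys
import time

from vllm import LLM, SamplingParams

CONVERSATION = [{
    "role": "user",
    "content": [
        {"type": "image_url",
         "image_url": {"url": "https://example.com/sample.jpg"}},  # placeholder
        {"type": "text", "text": "Describe this image."},
    ],
}]

def time_mode(mode: str) -> float:
    # NOTE: `mm_encoder_tp_mode` is an assumption; check `vllm serve --help`
    # on your build for the exact encoder-DP flag.
    llm = LLM(model="OpenGVLab/InternVL2-1B",
              tensor_parallel_size=2,
              mm_encoder_tp_mode=mode)
    start = time.perf_counter()
    llm.chat([CONVERSATION] * 64, SamplingParams(max_tokens=64))
    return time.perf_counter() - start

if __name__ == "__main__":
    # Run once with "weights" (TP baseline) and once with "data" (encoder DP),
    # each in a fresh process.
    print(time_mode(sys.argv[1]))
```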
@666even666 Could you reduce the `max-concurrency`?
I lowered `max-concurrency` to 32 and here are the results: TP (baseline): … DP: … Total token throughput slightly improved. I wonder if a tp size of 2 is too small to see any significant improvement?
Here is my benchmark result, and I did not see a significant improvement either. It's a little weird that the DP: … TP: … When decreasing the maximum request concurrency to 8: DP: … TP: …
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: YiwenC <54658925+666even666@users.noreply.github.com>
@ZJY0516 Thanks a lot for running the test!! This looks much better than what I had with a dp size of 2. It looks like the performance improvement is aligned with other multimodal models (#23168). @DarkLight1337 If the changes look good to you, we can probably merge it? BTW @ZJY0516 what's the dp size you are using?
I will try to profile this today. |
Benchmark Settings: … Result: encoder dp: … normal: …
The profile result is a little weird. According to the profile data below, encoder dp mode has better performance in the language model forward. Additionally, my benchmark results indicate that encoder DP mode leads to an end-to-end speedup (i.e., a decrease in benchmark duration), which contradicts the findings from @666even666. Could you verify it, @666even666?
Vision model part: dp: … tp: …
Language part: dp: … tp: …
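(For reference, per-component timings like these can be collected with `torch.profiler`; the sketch below uses stand-in modules, since the actual profiling script is not shown in the thread.)

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile, record_function

# Stand-in modules; the real numbers above came from profiling inside vLLM,
# but the collection mechanism is the same.
vision_model = nn.Linear(1024, 4096).cuda()
language_model = nn.Linear(4096, 4096).cuda()
pixel_values = torch.randn(256, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with record_function("vision_encoder"):
        embeds = vision_model(pixel_values)
    with record_function("language_model"):
        out = language_model(embeds)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```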
Can you try increasing the number of prompts? Usually I use 500 or 1000. I think 128 might not be enough to get reliable results. |
The profile data will be quite large, but I'll try tomorrow |
Random-mm dataset: each request has 3 images. Command: … dp: … tp: …
However, shouldn't the execution time of each operator remain consistent regardless of the number of prompts? The benchmark results I obtained with 128 prompts are similar to those from 1024 prompts.
Oh, I usually just set infinite QPS for the benchmark. |
@666even666 Could you run some eval benchmarks to make sure this change does not affect model quality? |
I am also implementing data parallelism for InternVL. My code is similar to yours, but I found that your code seems to be missing one piece, namely this processing step:
Interesting! Thanks for letting us know. I put the same logic under forward() of InternVisionModel (https://github.com/vllm-project/vllm/pull/23909/files#diff-2df945bfeea35cb93797dc3b4f8a393a27b06de6d60e2f9b0334920d40721e10R506), which should be executed every time self.vision_model() is called. @DarkLight1337 @ZJY0516 Do you think this could cause the performance difference?
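(For context, the dispatch being discussed has roughly the following shape; the wrapper function and argument names are simplified for illustration, so treat this as a sketch of the diff rather than the exact code.)

```python
# `run_dp_sharded_vision_model` is the vLLM helper this PR uses.
from vllm.multimodal.utils import run_dp_sharded_vision_model

def extract_feature(vision_model, pixel_values, use_data_parallel: bool):
    if use_data_parallel:
        # Split the batch of image tiles across DP ranks, run the full
        # (replicated) encoder on each shard, then gather the outputs.
        return run_dp_sharded_vision_model(pixel_values, vision_model)
    # TP path: every rank sees the whole batch; the weights are sharded.
    return vision_model(pixel_values=pixel_values)
```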
I tested InternVL3-8B on L20 with reference to https://github.com/mistralai/mistral-evals, and this is my result (tp2): …
Can you run the performance benchmark with infinite QPS and show the output as well? |
Sorry for the delay! Here is the result of InternVL3-8B tested against chartqa |
Hello, did you get this result by running that program file?
Yeah, the numbers are exactly the same. I will double-check today.
Signed-off-by: Yiwen Chen <yiwen66@berkeley.edu>
Head branch was pushed to by a user without write access
Signed-off-by: Yiwen Chen <yiwen66@berkeley.edu>
Sorry for the delay, let me merge this |






Purpose
This PR enables data parallelism (DP) for the InternVL vision encoder.
FIX #23876
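A sketch of how a user would enable this, assuming the `mm_encoder_tp_mode` engine argument available on recent vLLM builds (older builds used `--enable-multimodal-encoder-data-parallel`; check `vllm serve --help` for your version):

```python
from vllm import LLM

# Keep the language model tensor-parallel across 2 GPUs while each rank runs
# a full replica of the vision encoder on its shard of the image batch.
llm = LLM(model="OpenGVLab/InternVL3-8B",
          tensor_parallel_size=2,
          mm_encoder_tp_mode="data")  # assumption: arg name on recent builds
```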
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.