[VLM] Introduce Cache for positional embedding ids for Qwen-VL family #14292
yuan-luo merged 2 commits into sgl-project:main
Conversation
A quick question: is a 35s TTFT workload considered reasonable?
This is just a benchmark that feeds a large multi-modal input, e.g. 7 images in one request.
What are the timing comparisons for send_one?
E2E latency improved by 7% for send_one.
/tag-and-rerun-ci
import torch
...
class RotaryPosMixin:
I’m not certain this is the right place for this class. Would there be a more appropriate location for it?
This class is only used by models, and more models will reuse it, so I prefer to keep it in this folder. Do you have another suggestion?
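For context, the mixin discussed above can be sketched roughly as follows. This is a minimal illustration of the caching mechanism only; the actual method names, cache key, and cache implementation in the PR may differ.

```python
from functools import lru_cache

import torch


@lru_cache(maxsize=1024)
def _cached_rot_pos_ids(grid_h: int, grid_w: int) -> torch.Tensor:
    # The (h, w) index grid depends only on the image grid shape, which
    # repeats across requests, so memoizing it avoids recomputation.
    # Note: callers must treat the returned tensor as read-only.
    hpos = torch.arange(grid_h).unsqueeze(1).expand(grid_h, grid_w)
    wpos = torch.arange(grid_w).unsqueeze(0).expand(grid_h, grid_w)
    return torch.stack([hpos, wpos], dim=-1).reshape(-1, 2)


class RotaryPosMixin:
    """Mixin giving vision models a cached rot_pos_emb id computation."""

    def rot_pos_ids(self, grid_h: int, grid_w: int) -> torch.Tensor:
        return _cached_rot_pos_ids(grid_h, grid_w)
```

A second request with the same image grid shape then hits the cache instead of rebuilding the index tensor.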
From the data you posted above, only the TTFT metric improved, while both TPOT and E2E times worsened?
This vision-encoder change mainly affects prefill and is not related to TPOT. The regression is likely noise; I retested the PR and TPOT is more stable.
Can we restart the engine each time and run three separate tests?
I set output=0, restarted the server, and retested; TTFT improved by 2%.
After repeated testing, I found the noise was caused by the number of prompts (100) being too small. With it set to 500, TPOT is stable.
…sgl-project#14292) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Motivation
Introduce a cache for the rot_pos_emb index computation to speed it up.
Introduce a mixin class so this mechanism can be reused by more models. With cached rotary position embedding ids, the improvement is significant.
Moreover, refining the index computation to use NumPy yields an extra speedup, improving E2E latency by 7%.
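The NumPy refinement mentioned above can be illustrated with a small sketch: the index grid is built with vectorized NumPy ops and crosses into torch exactly once. The function name is hypothetical and not taken from the PR.

```python
import numpy as np
import torch


def rot_pos_ids_numpy(grid_h: int, grid_w: int) -> torch.Tensor:
    # Building the (h, w) index grid in NumPy replaces several small torch
    # ops with cheap vectorized array calls; torch.from_numpy then wraps
    # the result without an extra copy.
    hpos = np.repeat(np.arange(grid_h), grid_w)  # 0,0,...,1,1,...
    wpos = np.tile(np.arange(grid_w), grid_h)    # 0,1,...,0,1,...
    return torch.from_numpy(np.stack([hpos, wpos], axis=-1))
```

For small, frequently rebuilt index tensors like these, avoiding per-element torch kernel launches is where the extra speedup plausibly comes from.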
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist