Conversation
I encountered the "assert positions.ndim == 1 or positions.ndim == 2" error as well and tried to locate the root cause. The following finding is for your reference. From debugging, we make sure that during cuda graph capture, general_mm_embed_routine() enters the following branch. I added an assertion “positions must be set for Qwen2/2.5-VL (MRoPE)” in def general_mm_embed_routine, which pinpoints the root cause: When entering language_model.forward(...), the positions required by Qwen2/2.5-VL’s rotary positional encoding (MRoPE) aren’t being passed in, so rotary_embedding.forward() receives None. Then, when TorchDynamo captures it and tries to evaluate positions.ndim == 1 or 2, it crashes (None.ndim). |
The issue you encountered was likely caused by the piecewise CUDA graph not preparing the MRoPE data. This PR should have resolved the problem.
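For anyone still hitting this on an older commit: the symptom should disappear once the capture-time batch hands the model correctly shaped MRoPE positions instead of None. A minimal sketch of that idea, with illustrative names (mrope_positions, positions_for_capture) rather than the actual PR code:

```python
import torch

# Preallocate a (3, max_num_tokens) MRoPE position buffer once, so the
# piecewise CUDA graph always sees a tensor with positions.ndim == 2 during
# capture instead of None.
max_num_tokens = 8192
mrope_positions = torch.zeros(3, max_num_tokens, dtype=torch.int64, device="cuda")

def positions_for_capture(num_tokens: int) -> torch.Tensor:
    # Dummy but correctly shaped positions handed to the model at capture
    # time; real decode steps overwrite this buffer in place before replay.
    return mrope_positions[:, :num_tokens]
```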
I believe this new error might be caused by the fact that general_mm_embed_routine in the multimodal model creates a new inputs_embeds tensor on each call (see sglang/python/sglang/srt/managers/mm_utils.py, line 681, at commit 38a704b). Because embed_mm_inputs allocates a fresh inputs_embeds tensor every time, its device address cannot remain fixed, which likely prevents us from capturing the entire VLM with a single CUDA Graph. I have two potential ideas:
I'm not entirely sure which approach would be better, but it might make sense to start with option (2) and later build on it to explore option (1). I'm not deeply familiar with CUDA Graph internals, so I'd really appreciate your thoughts and feedback. @BBuf @yuan-luo Thank you!
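To make the inputs_embeds concern above concrete: CUDA Graph replay requires the graph's input tensors to keep the same device addresses, so instead of using the tensor that embed_mm_inputs freshly allocates, the embeddings would have to be copied into a preallocated static buffer. A rough sketch of that pattern (static_inputs_embeds and write_mm_embeds are illustrative names, not existing sglang code):

```python
import torch

hidden_size = 4096
max_num_tokens = 8192

# Allocated once; its device address never changes, so a captured CUDA graph
# can keep reading from it on every replay.
static_inputs_embeds = torch.zeros(
    max_num_tokens, hidden_size, dtype=torch.bfloat16, device="cuda"
)

def write_mm_embeds(fresh_embeds: torch.Tensor) -> torch.Tensor:
    # Copy the freshly computed multimodal embeddings into the static buffer
    # rather than feeding the newly allocated tensor (whose address changes on
    # every call) into the captured graph.
    num_tokens = fresh_embeds.shape[0]
    static_inputs_embeds[:num_tokens].copy_(fresh_embeds)
    return static_inputs_embeds[:num_tokens]
```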
Fixes bugs encountered when using piecewise CUDA graph serving for VL models: the "assert positions.ndim == 1 or positions.ndim == 2" error caused by mrope_positions, and the "torch._dynamo.exc.Unsupported: Skip calling torch.compiler.disable()" error caused by the @torch._dynamo.disable() decorator in triton_mrope_wrapper. However, a new bug remains unresolved and still prevents serving VL models.
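For reference, the Dynamo error above can be reproduced outside sglang with a minimal snippet (a sketch only; the real triton_mrope_wrapper launches a Triton kernel): a function wrapped in torch._dynamo.disable() cannot be called from inside a region compiled with fullgraph=True, which appears to be how the piecewise compiler traces the surrounding module.

```python
import torch

@torch._dynamo.disable()
def triton_mrope_wrapper(q: torch.Tensor) -> torch.Tensor:
    # Stand-in for the real wrapper that launches the Triton MRoPE kernel.
    return q * 2.0

@torch.compile(fullgraph=True)
def rotary_block(q: torch.Tensor) -> torch.Tensor:
    # Calling a dynamo-disabled function from a fullgraph-compiled region
    # raises torch._dynamo.exc.Unsupported rather than graph-breaking.
    return triton_mrope_wrapper(q)

rotary_block(torch.randn(4, 8))  # raises torch._dynamo.exc.Unsupported
```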