refactor: refactor engine vlm params by minleminzui · Pull Request #13069 · sgl-project/sglang

minleminzui · 2025-11-11T09:18:04Z

Motivation

Mainly to ensure CI stability for unit-test-backend-1-gpu (0),
which runs pytest test/srt/test_vision_openai_server_a.py.

Fix Qwen3-Omni multimodal embedding handling and relax vision/audio tests

MRotaryEmbedding:
- Make audio-related kwargs (audio_token_id, audio_start_token_id, position_id_per_seconds, audio_seqlens) optional.
- Introduce has_audio guard and skip audio-specific branches when no audio is present.
- Ensure all newly created position_id tensors (torch.arange) are allocated on the same device as input_ids.
mm_utils:
- Change _adjust_embedding_length to take special_multimodal_mask instead of a generic mask.
- Handle length mismatch robustly: pad with zeros when embeddings are shorter than the number of multimodal tokens, truncate when longer.
- Replace hard RuntimeError with warnings and best-effort adjustment to avoid crashing on imperfect MM embeddings.
Qwen3VL:
- Rework get_image_feature to explicitly reconstruct patches based on in_channels * temporal_patch_size * patch_size^2.
- Enforce full patch and patch-group (spatial_merge_size^2) alignment, skipping invalid/too-short inputs.
- Return empty tensor when there are no valid image patches, and ensure all tensors are on the visual module’s device/dtype.
Tests (OpenAI vision server):
- Override verify_single_image_response in TestQwen3OmniServer to only check high-level structure:
  - presence of “1.” and “2.”,
  - mentions of image/picture/photo and audio/sound/speech,
  - valid usage stats.
- Add a Qwen3-Omni-specific verify_speech_recognition_response that checks structural/audio mentions instead of exact transcript words.
- Fix a bug in the common verify_single_image_response where `"person"` was not actually checked with `in text`.
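
The pad-or-truncate behaviour described for `_adjust_embedding_length` can be sketched as follows. This is a minimal illustration under stated assumptions, not the sglang implementation: the function name `adjust_embedding_length_sketch`, its signature, and the use of NumPy arrays in place of torch tensors are all hypothetical choices made for the example.

```python
import warnings

import numpy as np


def adjust_embedding_length_sketch(embedding, special_multimodal_mask):
    """Return `embedding` with exactly one row per True entry in the mask.

    Hypothetical helper: pads with zero rows when the embedding is too
    short, truncates when it is too long, and warns instead of raising.
    """
    num_mm_tokens = int(np.count_nonzero(special_multimodal_mask))
    num_rows = embedding.shape[0]
    if num_rows == num_mm_tokens:
        return embedding

    # Best-effort adjustment: warn rather than raise a RuntimeError.
    warnings.warn(
        f"embedding rows ({num_rows}) != multimodal tokens ({num_mm_tokens}); adjusting"
    )
    if num_rows < num_mm_tokens:
        pad = np.zeros(
            (num_mm_tokens - num_rows, embedding.shape[1]), dtype=embedding.dtype
        )
        return np.concatenate([embedding, pad], axis=0)
    return embedding[:num_mm_tokens]
```

Zero padding keeps the forward pass alive, but padded positions carry no visual or audio signal, so a warning in the logs is still worth investigating.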

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.

…ature
- Relaxed shape check in `get_image_feature`: allow `pixel_values` with dim > 2 (e.g. `[B, T, D]` or `[B, H, W, C]`) instead of hard-asserting `dim() == 2`
- Flatten all leading dims into a single batch dim to match `[N, D]` expected by `self.visual`
- Keeps backward compatibility for existing `[N, D]` image embeddings
- Fixes AssertionError(3) raised when running Qwen3-Omni mixed-modality tests
- Verified passing `TestQwen3OmniServer::test_mixed_modality_chat_completion`
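
A minimal sketch of the flattening step this commit message describes: every leading dimension collapses into a single batch dimension so the input matches the `[N, D]` layout. The helper name `flatten_leading_dims` is hypothetical, and NumPy stands in for torch tensors to keep the example light.

```python
import numpy as np


def flatten_leading_dims(pixel_values):
    """Collapse all leading dims into one batch dim, keeping the last dim.

    Hypothetical helper: an [N, D] input passes through unchanged, while
    [B, T, D] becomes [B*T, D] and [B, H, W, C] becomes [B*H*W, C].
    """
    if pixel_values.ndim < 2:
        raise ValueError(f"expected at least 2 dims, got {pixel_values.ndim}")
    return pixel_values.reshape(-1, pixel_values.shape[-1])
```

Because a 2-D input reshapes to itself, existing `[N, D]` callers see no change in behaviour.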

…rver tests
- In `TestOpenAIMLLMServerBase.setUpClass`, raise `unittest.SkipTest` when `model` is missing/empty
- Prevents pytest from collecting/executing mixin classes and throwing `AttributeError: ... has no attribute 'model'`
- Keeps CI green by marking mixins as skipped instead of erroring
- No impact on concrete test classes that define `model`
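
The skip-on-missing-model pattern can be sketched with the standard library alone. The class names, the `server` attribute, and the model string below are hypothetical stand-ins, not the actual sglang test classes.

```python
import unittest


class ServerTestBaseSketch(unittest.TestCase):
    # Shared base: concrete subclasses are expected to override `model`.
    model = None

    @classmethod
    def setUpClass(cls):
        # Skip the whole class, rather than erroring, when no model is set.
        if not getattr(cls, "model", None):
            raise unittest.SkipTest(f"{cls.__name__} defines no model")
        # Stand-in for launching a server for cls.model (assumption).
        cls.server = f"launched:{cls.model}"

    def test_model_is_set(self):
        self.assertTrue(self.model)


class ConcreteServerTestSketch(ServerTestBaseSketch):
    model = "hypothetical/model-name"
```

Raising `unittest.SkipTest` from `setUpClass` marks the class as skipped, so the base class is reported as a skip while subclasses that define `model` run normally.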

merrymercy and others added 30 commits November 10, 2025 01:51

[Auto Sync] Update batch_invariant_ops.py (20251109) (sgl-project#12916)

9ea2c68

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Stefan He <hebiaobuaa@gmail.com>

[router] bucket policy (sgl-project#11719)

611a4fd

fix missing output_token_logprobs when using ngram speculative decodi…

6f08488

…ng (sgl-project#10702) Co-authored-by: a4zhangfei <a4zhangfei@qq.com>

feat(metrics): add scheduler and hiradix cache metrics (sgl-project#1…

afee284

…0218) (sgl-project#10225) Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>

diffusion: reduce effort of supporting new model (sgl-project#12982)

5639145

vlm: fix tiny multimodal cache bug (sgl-project#12984)

1240ac1

chore: bump sgl-kernel version to 0.3.17 (sgl-project#12966)

37c40a8

[1 / 2] register weak_ref_tensor in sgl-kernel (sgl-project#12999)

547de8c

Support piecewise cuda graph for deepseek v3 (sgl-project#12996)

58b12cc

minor: fix notebook bug with new model_info fields added for warmup (s…

ddfcb7c

…gl-project#13005)

Super tiny fix typo (sgl-project#13001)

b0ee99d

Add process_prefill_chunk back to fix PP event loop (sgl-project#13009

f1f4c45

)

[misc][ci] Add run-ci after auto-labeler (sgl-project#13013)

838bcb0

Unify memory management across `(overlap, non-overlap) x (page>=1) x …

665416f

…(spec, non-spec, spec v2) x (retract, finished)` (sgl-project#12224)

Enhance retract test (page cases, long output cases) (sgl-project#12781)

1086473

[AMD CI] Remove SRT docker build. (sgl-project#11850)

b51d46d

Co-authored-by: Hubert Lu <Hubert.Lu@amd.com>

[CI] Limit the CI trigger frequency of low-privilege actors (sgl-proj…

56c83e0

…ect#13010)

Resolve HF download issue and download models before CI run starts fo…

c022107

…r 8-gpu-h200 runners (sgl-project#12952)

Add pre-suffle weight for new aiter MoE support. (sgl-project#12908)

661c1c9

Co-authored-by: HAI <hixiao@gmail.com> Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com>

chore: bump SGLang version to 0.5.5.post1 (sgl-project#13000)

303cc95

[router][ci] Fix maturin build (sgl-project#13012)

9840bf4

Simplify the BatchMultimodalOutput in io_struct.py (sgl-project#12993)

40b26b4

[router][ci] Quick Improvement to make CI more stable (sgl-project#12869

0493775

)

[9/n] decouple quantization impl from vllm dependency - adjust ci (sg…

012bfc4

…l-project#12753)

[router] add postgres databases data connector (sgl-project#12218)

2fe4e69

[AMD CI] Update docker release workflows docker file name. (sgl-proje…

aea88fa

…ct#13028)

fix tuning_fused_moe_triton_sep tool per_channel_quant bug (sgl-proje…

f18ec92

…ct#13027)

fix(ci): workflow id in permission rate limit (sgl-project#13035)

08c805a

[PieceWise CUDA Graph] Support awq/gptq model in piecewise cudagraph (s…

9caca6a

…gl-project#12518)

Re-enable Flashinfer TRTLLM GEN MHA and Add Unit Test (sgl-project#12885

3594815

)

mickqian and others added 16 commits November 16, 2025 09:59

MultimodalInputFormat, processor_output and precomputed_embedding one…

27af58f

… item passed

tmp

629ae97

use correct image_grid_thw

08f4b7b

update doc

3e5596d

fix docs of offline engine VLM

4e7a27b

upd

d033f57

docs: document processor_output usage in VLM offline engine notebook

b08d998

test: relax VLM input format assertions for image understanding (sgl-…

59619db

…project#12755)

Update mm_utils.py

0c81010

MultimodalInputFormat, processor_output and precomputed_embedding one…

11e8139

… item passed

use correct image_grid_thw

c3e0405

update doc

10bb7f4

more

0e09b6a

minleminzui force-pushed the refactor-engine-vlm-params branch from a0563ed to 0e09b6a Compare November 16, 2025 11:21

github-actions bot added the performance, quant (LLM Quantization), lora, speculative-decoding, hicache (Hierarchical Caching for SGLang), and router-benchmark labels on Nov 16, 2025

minleminzui added 2 commits November 16, 2025 15:22

pass test/srt/test_vision_openai_server_a.py::TestQwen3OmniServer::te…

9b95317

…st_mixed_modality_chat_completion

pass test/srt/test_vision_openai_server_a.py::TestLlavaServer::test_m…

20ebb81

…ulti_images_chat_completion

minleminzui force-pushed the refactor-engine-vlm-params branch from 2904ebf to 20ebb81 Compare November 19, 2025 05:09

minleminzui added 2 commits November 19, 2025 06:01

pass test/srt/test_vision_openai_server_a.py::TestLlavaServer::test_v…

a0fc2a7

…ideo_images_chat_completion

debug

bdacba7

minleminzui closed this Nov 27, 2025

minleminzui deleted the refactor-engine-vlm-params branch November 27, 2025 15:02

minleminzui wants to merge 312 commits into sgl-project:refactor-engine-vlm-params from minleminzui:refactor-engine-vlm-params

20 participants