
refactor: refactor engine vlm params#13069

Closed
minleminzui wants to merge 312 commits into sgl-project:refactor-engine-vlm-params from minleminzui:refactor-engine-vlm-params


minleminzui commented Nov 11, 2025

Motivation

#10532

Mainly to ensure CI stability for unit-test-backend-1-gpu (0),
which runs pytest test/srt/test_vision_openai_server_a.py.

Fix Qwen3-Omni multimodal embedding handling and relax vision/audio tests

  • MRotaryEmbedding:

    • Make audio-related kwargs (audio_token_id, audio_start_token_id, position_id_per_seconds, audio_seqlens) optional.
    • Introduce has_audio guard and skip audio-specific branches when no audio is present.
    • Ensure all newly created position_id tensors (torch.arange) are allocated on the same device as input_ids.
  • mm_utils:

    • Change _adjust_embedding_length to take special_multimodal_mask instead of a generic mask.
    • Handle length mismatch robustly: pad with zeros when embeddings are shorter than the number of multimodal tokens, truncate when longer.
    • Replace hard RuntimeError with warnings and best-effort adjustment to avoid crashing on imperfect MM embeddings.
  • Qwen3VL:

    • Rework get_image_feature to explicitly reconstruct patches based on in_channels * temporal_patch_size * patch_size^2.
    • Enforce full patch and patch-group (spatial_merge_size^2) alignment, skipping invalid/too-short inputs.
    • Return empty tensor when there are no valid image patches, and ensure all tensors are on the visual module’s device/dtype.
  • Tests (OpenAI vision server):

    • Override verify_single_image_response in TestQwen3OmniServer to only check high-level structure:
      • presence of “1.” and “2.”,
      • mentions of image/picture/photo and audio/sound/speech,
      • valid usage stats.
    • Add a Qwen3-Omni-specific verify_speech_recognition_response that checks structural/audio mentions instead of exact transcript words.
    • Fix a bug in the common verify_single_image_response where "person" was never actually tested against the response with in text.
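The last test fix above concerns a keyword check that silently did nothing. The original assertion is not shown in this description, so the following is only a minimal sketch of one common way such a bug arises (an assumption, not the actual test code): a bare string literal inside an `or` chain is always truthy, so the check can never fail. Both function names are hypothetical.

```python
def buggy_mentions_person(text: str) -> bool:
    # Hypothetical reconstruction: "person" is a bare, non-empty string,
    # which is always truthy, so this returns True for any input.
    return bool("man" in text or "person" or "individual" in text)


def fixed_mentions_person(text: str) -> bool:
    # Every keyword is actually tested against the text.
    return any(word in text for word in ("man", "person", "individual"))


caption = "a yellow taxi parked on a street"
print(buggy_mentions_person(caption))  # True, even though no keyword appears
print(fixed_mentions_person(caption))  # False
```

Writing the check with `any(...)` over a tuple of keywords keeps each keyword paired with `in text`, which makes this class of mistake harder to reintroduce.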


merrymercy and others added 30 commits November 10, 2025 01:51
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
Co-authored-by: Hubert Lu <Hubert.Lu@amd.com>
Co-authored-by: HAI <hixiao@gmail.com>
Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com>
mickqian and others added 16 commits November 16, 2025 09:59
…ature

- Relaxed shape check in `get_image_feature`: allow `pixel_values` with dim > 2
  (e.g. `[B, T, D]` or `[B, H, W, C]`) instead of hard-asserting `dim() == 2`
- Flatten all leading dims into a single batch dim to match `[N, D]` expected by `self.visual`
- Keeps backward compatibility for existing `[N, D]` image embeddings
- Fixes AssertionError(3) raised when running Qwen3-Omni mixed-modality tests
- Verified passing `TestQwen3OmniServer::test_mixed_modality_chat_completion`
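The flattening step described in this commit message can be sketched as follows. This is a minimal illustration rather than the code from the PR: `flatten_leading_dims` is a hypothetical helper name, and it assumes the last dimension is the feature dimension `D` expected by `self.visual`.

```python
import torch


def flatten_leading_dims(pixel_values: torch.Tensor) -> torch.Tensor:
    """Collapse every dimension except the last into one batch dimension."""
    if pixel_values.dim() < 2:
        raise ValueError(f"expected at least 2 dims, got {pixel_values.dim()}")
    # [N, D] keeps its shape; [B, T, D] becomes [B * T, D].
    return pixel_values.reshape(-1, pixel_values.shape[-1])


print(flatten_leading_dims(torch.zeros(6, 16)).shape)     # torch.Size([6, 16])
print(flatten_leading_dims(torch.zeros(2, 3, 16)).shape)  # torch.Size([6, 16])
```

Because an `[N, D]` input passes through with its shape unchanged, existing callers that already supply flat image embeddings are unaffected.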
…rver tests

- In `TestOpenAIMLLMServerBase.setUpClass`, raise `unittest.SkipTest` when `model` is missing or empty
- Prevents the `AttributeError: ... has no attribute 'model'` raised when pytest collects and executes mixin classes
- Keeps CI green by marking mixins as skipped instead of erroring
- No impact on concrete test classes that define `model`
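The skip-on-missing-model pattern from this commit message can be sketched with the standard library alone. The class names and the `model` value below are hypothetical placeholders, not the classes from the test suite; the sketch only assumes that concrete classes set a non-empty `model` attribute.

```python
import unittest


class ServerTestBase(unittest.TestCase):
    """Hypothetical stand-in for a shared base or mixin test class."""

    model = None  # concrete subclasses are expected to override this

    @classmethod
    def setUpClass(cls):
        if not getattr(cls, "model", None):
            # Raising SkipTest here marks the whole class as skipped
            # instead of letting later attribute access raise an error.
            raise unittest.SkipTest(f"{cls.__name__} defines no model")

    def test_model_is_set(self):
        self.assertTrue(self.model)


class ConcreteServerTest(ServerTestBase):
    model = "some/model-name"  # placeholder, not a real checkpoint


def run(case):
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)
    return result


base_result = run(ServerTestBase)
concrete_result = run(ConcreteServerTest)
print(len(base_result.skipped), len(base_result.errors))          # 1 0
print(concrete_result.testsRun, concrete_result.wasSuccessful())  # 1 True
```

The base class is reported as a single class-level skip with no errors, while the concrete subclass runs its inherited test normally.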
minleminzui force-pushed the refactor-engine-vlm-params branch from 2904ebf to 20ebb81 on November 19, 2025 05:09
minleminzui deleted the refactor-engine-vlm-params branch on November 27, 2025 15:02

Labels

amd, deepseek, dependencies, documentation, hicache, lora, Multi-modal, performance, quant, router, router-benchmark, run-ci, sgl-kernel, speculative-decoding
