
[VLM] Refactor load_mm_data to improve performance#14644

Merged
mickqian merged 1 commit into sgl-project:main from antgroup:refactor_load_mm_data on Dec 26, 2025

Conversation

@yuan-luo (Collaborator) commented Dec 8, 2025

Motivation

Inspired by @mickqian .

In the load_mm_data method, we currently take a redundant approach: we manually detect the various multimodal tokens and then load the corresponding data. This could instead be simplified to loading all the passed data directly; we originally wrote it this way because the MiniCPM-o model has a mechanism for adjusting video frames. We decided to simplify the default path to improve performance for the majority of VLM models.

For example, both load_mm_data() and submit_data_loading_tasks() contain a redundant "for text_part in text_parts" loop.

In the new implementation, load_mm_data works as "1 token → 1 data": it does not expand frames or rewrite the prompt; it simply loads all the incoming data and aligns each item with its token in order.

We keep the legacy load_mm_data dedicated to the MiniCPM model, preserving MiniCPM's "1 token → multiple frames" behavior.
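
For illustration, a minimal sketch of the "1 token → 1 data" path (the function name, argument shapes, and executor plumbing here are simplified assumptions loosely based on the debug logs below, not the exact implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def load_mm_data_simple(images, videos, audios, load_fn):
    """Sketch: one I/O task per data item; no prompt scanning, no frame expansion."""
    with ThreadPoolExecutor() as executor:
        futures = []
        for modality, items in (("IMAGE", images), ("VIDEO", videos), ("AUDIO", audios)):
            for item in items or []:
                # Mirrors the debug log "submit load task: modality=..., index=..."
                futures.append((modality, executor.submit(load_fn, modality, item)))
        # Results stay aligned with the placeholder tokens purely by submission order.
        return [(modality, fut.result()) for modality, fut in futures]
```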

Log output shows the result is as expected.

Server:

SGLANG_VIT_CUDA_GRAPH=1 \
SGLANG_USE_CUDA_IPC_TRANSPORT=1 \
python -m sglang.launch_server \
  --model-path /home/admin/Qwen2.5-VL-7B-Instruct \
  --enable-piecewise-cuda-graph \
  --piecewise-cuda-graph-max-tokens 8192 \
  --mm-attention-backend fa3 \
  --port 30000 \
  --chunked-prefill-size 8192 \
  --disable-radix-cache \
  --disable-overlap-schedule \
  --piecewise-cuda-graph-compiler eager \
  --attention-backend fa3 \
  --tp 2 \
  --log-level debug \
  --log-level-http debug \
  --log-requests

Client (the text prompt asks, in Chinese, "What does the sign in the video say?"):

$cat bench_remote_video.sh 
for i in {1..1}; do
    time curl 'http://127.0.0.1:30000/v1/chat/completions' --header 'Content-Type: application/json' --data '{
        "model": "auto",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "video_url", "video_url": {"url": "http://dmsint.cn-hangzhou.alipay.aliyun-inc.com/....../video_test.mp4"}},
                    {"type": "text", "text": "视频里的招牌写的什么"}
                ]
            }
        ],
        "temperature": 0.0,
        "max_tokens": 1000,
        "stream": false,
        "chat_template_kwargs": {"enable_thinking": false}
    }'
done
[root  /root/luoyuan.luo/workspace/bench_script] Mon Dec 08 20:37:24 
$bash bench_remote_video.sh 
{"id":"cdb06d6d5c1b428b89473566f40ad2f3","object":"chat.completion","created":1765197964,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":"视频里的招牌上写着“小鞋匠洗鞋”,并附有电话号码“1529521190”。","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":7682,"total_tokens":7712,"completion_tokens":30,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default","e2e_latency":1310.2936744689941,"ttft_latency":1310.302734375,"queue_latency":1.2103579938411713}}
real    0m1.318s
user    0m0.001s
sys     0m0.003s
2025-12-08 20:46:02.862 INFO 104744 [ tokenizer_manager.py:480] Receive: obj="GenerateReqInput(rid='cdb06d6d5c1b428b89473566f40ad2f3', http_worker_ipc=None, metrics={'api_server_arrive_time': 1765197962.8611393}, text='<|im_start|>system\\nYou are a helpful assistant.<|im_end|>\\n<|im_start|>user\\n<|vision_start|><|video_pad|><|vision_end|>视频里的招牌写的什么<|im_end|>\\n<|im_start|>assistant\\n', video_data=['http://dmsint.cn-hangzhou.alipay.aliyun-inc.com/aistudio/temp/20250910/208156dafb7b44a8/video_test.mp4'], sampling_params={'temperature': 0.0, 'max_new_tokens': 1000, 'min_new_tokens': 0, 'stop': None, 'stop_token_ids': None, 'stop_regex': None, 'top_p': 1.0, 'top_k': 50, 'min_p': 0.0, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.05, 'regex': None, 'ebnf': None, 'n': 1, 'no_stop_trim': False, 'ignore_eos': False, 'skip_special_tokens': True, 'logit_bias': None, 'custom_params': None}, return_logprob=False, logprob_start_len=-1, top_logprobs_num=0, token_ids_logprob=None, return_text_in_logprobs=True, stream=False, log_metrics=True, return_hidden_states=False, modalities=[], session_params=None, lora_path=None, lora_id=None, custom_logit_processor=None, bootstrap_host=None, bootstrap_port=None, bootstrap_room=None, bootstrap_pair_key=None, validation_time=8.022785186767578e-05, data_parallel_rank=None, background=False, conversation_id=None, priority=None, extra_key=None, no_logs=False, custom_labels=None, return_bytes=False, return_entropy=False, mm_sampling_kwargs=None, external_trace_headers=None)"
2025-12-08 20:46:02.863 DEBUG 104744 [ tokenizer_manager.py:710] Using regular tokenizer for 1 inputs
2025-12-08 20:46:02.864 DEBUG 104744 [ base_processor.py:791] [_submit_mm_data_loading_tasks_simple] no data for modality=IMAGE
2025-12-08 20:46:02.864 DEBUG 104744 [ base_processor.py:798] [_submit_mm_data_loading_tasks_simple] submit load task: modality=VIDEO, index=0, data_type=<class 'str'>
2025-12-08 20:46:02.864 DEBUG 104744 [ base_processor.py:791] [_submit_mm_data_loading_tasks_simple] no data for modality=AUDIO
2025-12-08 20:46:02.864 DEBUG 104744 [ base_processor.py:411] [_load_single_item] start loading data, modality=VIDEO, frame_count_limit=None, audio_sample_rate=None, raw_type=<class 'str'>
2025-12-08 20:46:02.864 DEBUG 104744 [ base_processor.py:938] [load_mm_data(simple)] total futures submitted: 1
2025-12-08 20:46:03.162 DEBUG 104744 [ base_processor.py:435] [_load_single_item][VIDEO] loaded video: len=389, shape[0]=(720, 1280, 3)
2025-12-08 20:46:03.162 DEBUG 104744 [ base_processor.py:966] [load_mm_data(simple)] loaded counts: images=0, videos=1, audios=0
2025-12-08 20:46:03.508 INFO 104744 [ qwen_vl.py:304] [preprocess_video Perf], get_batch_time: 260.65 ms, smart_resize_time: 0.05 ms, torchvision_resize_time: 84.61 ms, total_time: 345.30 ms
2025-12-08 20:46:03.530 INFO 104744 [ qwen_vl.py:497] [QwenVLProcessor Perf] rid='cdb06d6d5c1b428b89473566f40ad2f3', load_time: 298.95 ms, preprocess_time: 352.29 ms, process_time: 14.30 ms, get_rope_index_time: 0.69 ms, total_time: 666.24 ms
2025-12-08 20:46:03.540 DEBUG 104744 [ cuda_ipc_transport_utils.py:75] [try_to_recycle] area=(0, 144055296), flag=1.0, tp_size=2
2025-12-08 20:46:03.541 INFO 104895 [ TP0 scheduler_metrics_mixin.py:154] Prefill batch, #new-seq: 1, #new-token: 7682, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
2025-12-08 20:46:03.590 DEBUG 104744 [ cuda_ipc_transport_utils.py:75] [try_to_recycle] area=(0, 144055296), flag=2.0, tp_size=2
2025-12-08 20:46:04.111 INFO 104895 [ TP0 scheduler_metrics_mixin.py:309] Decode batch, #running-req: 1, #token: 7699, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.08, #queue-req: 0, 
2025-12-08 20:46:04.170 INFO 104895 [ TP0 schedule_batch.py:1043] Req Time Stats(rid=cdb06d6d5c1b428b89473566f40ad2f3, input len=7682, output len=30, type=unified): queue_duration=1.21ms, forward_duration=627.68ms, start_time=9661822.019
2025-12-08 20:46:04.172 INFO 104744 [ tokenizer_manager.py:1150] "Finish: obj=GenerateReqInput(rid='cdb06d6d5c1b428b89473566f40ad2f3', http_worker_ipc=None, metrics={'api_server_arrive_time': 1765197962.8611393, 'mm_entry_time_ts': 1765197962.8640013, 'mm_entry_time': 9661821.34138114, 'mm_load_time': 9661821.640335323, 'mm_preprocess_time': 9661821.992624268, 'mm_process_time': 9661822.006922927, 'mm_get_rope_index_time': 9661822.007616928}, text='<|im_start|>system\\nYou are a helpful assistant.<|im_end|>\\n<|im_start|>user\\n<|vision_start|><|video_pad|><|vision_end|>视频里的招牌写的什么<|im_end|>\\n<|im_start|>assistant\\n', video_data=['http://dmsint.cn-hangzhou.alipay.aliyun-inc.com/aistudio/temp/20250910/208156dafb7b44a8/video_test.mp4'], sampling_params={'temperature': 0.0, 'max_new_tokens': 1000, 'min_new_tokens': 0, 'stop': None, 'stop_token_ids': None, 'stop_regex': None, 'top_p': 1.0, 'top_k': 50, 'min_p': 0.0, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.05, 'regex': None, 'ebnf': None, 'n': 1, 'no_stop_trim': False, 'ignore_eos': False, 'skip_special_tokens': True, 'logit_bias': None, 'custom_params': None}, return_logprob=False, logprob_start_len=-1, top_logprobs_num=0, token_ids_logprob=None, return_text_in_logprobs=True, stream=False, log_metrics=True, return_hidden_states=False, modalities=[], session_params=None, lora_path=None, lora_id=None, custom_logit_processor=None, bootstrap_host=None, bootstrap_port=None, bootstrap_room=None, bootstrap_pair_key=None, validation_time=8.022785186767578e-05, data_parallel_rank=None, background=False, conversation_id=None, priority=None, extra_key=None, no_logs=False, custom_labels=None, return_bytes=False, return_entropy=False, mm_sampling_kwargs=None, external_trace_headers=None), out={'text': '视频里的招牌上写着“小鞋匠洗鞋”,并附有电话号码“1529521190”。', 'meta_info': {'id': 'cdb06d6d5c1b428b89473566f40ad2f3', 'finish_reason': {'type': 'stop', 'matched': 151645}, 'prompt_tokens': 7682, 'weight_version': 'default', 'total_retractions': 0, 'queue_time': 0.0012103579938411713, 'prefill_launch_delay': 0.001300731673836708, 'prefill_launch_latency': 0.17800299264490604, 'completion_tokens': 30, 'cached_tokens': 0, 'e2e_latency': 1.3102936744689941, 'request_received_ts': 1765197962.8611393, 'request_sent_to_scheduler_ts': 1765197963.531742, 'decode_finished_ts': 1765197964.171433, 'inference_time': 0.6288455203175545, 'ttft_latency': 1.310302734375, 'response_sent_to_client_ts': 1765197964.172052}}"

Modifications

Accuracy Tests

The lmms_eval result shows no accuracy drop.

$python3 -m lmms_eval --model openai_compatible --model_args model_version=Qwen/Qwen2.5-VL-7B-Instruct   --tasks mmmu_val   --batch_size 16
/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
2025-12-10 15:39:50 | INFO     | __main__:cli_evaluate:311 - Verbosity set to INFO
2025-12-10 15:39:52 | INFO     | __main__:cli_evaluate_single:400 - Evaluation tracker args: {'token': 'hf_********'}
2025-12-10 15:39:52 | INFO     | __main__:cli_evaluate_single:480 - Selected Tasks: ['mmmu_val']
2025-12-10 15:39:52 | INFO     | lmms_eval.evaluator:simple_evaluate:161 - Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2025-12-10 15:39:56 | INFO     | lmms_eval.evaluator:evaluate:402 - Running on rank 0 (local rank 0)
2025-12-10 15:39:56 | INFO     | lmms_eval.api.task:build_all_requests:427 - Building contexts for mmmu_val on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 900/900 [00:00<00:00, 14037.16it/s]
2025-12-10 15:39:56 | INFO     | lmms_eval.evaluator:evaluate:495 - Running generate_until requests
Model Responding: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 57/57 [02:56<00:00,  2.31s/it]2025-12-10 15:42:52 | INFO     | lmms_eval.models.model_utils.gen_metrics:log_metrics:48 - Metric summary - Total time: 1251.369s, Total tokens: 2040, Avg speed: 1.6 tokens/s
Model Responding: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 57/57 [02:56<00:00,  3.09s/it]
Postprocessing: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 900/900 [00:00<00:00, 11417.73it/s]
{'Overall-Art and Design': {'num': 120, 'acc': 0.75}, 'Art': {'num': 30, 'acc': 0.8}, 'Art_Theory': {'num': 30, 'acc': 0.93333}, 'Design': {'num': 30, 'acc': 0.83333}, 'Music': {'num': 30, 'acc': 0.43333}, 'Overall-Business': {'num': 150, 'acc': 0.62667}, 'Accounting': {'num': 30, 'acc': 0.63333}, 'Economics': {'num': 30, 'acc': 0.7}, 'Finance': {'num': 30, 'acc': 0.46667}, 'Manage': {'num': 30, 'acc': 0.63333}, 'Marketing': {'num': 30, 'acc': 0.7}, 'Overall-Science': {'num': 150, 'acc': 0.57333}, 'Biology': {'num': 30, 'acc': 0.5}, 'Chemistry': {'num': 30, 'acc': 0.46667}, 'Geography': {'num': 30, 'acc': 0.73333}, 'Math': {'num': 30, 'acc': 0.5}, 'Physics': {'num': 30, 'acc': 0.66667}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.68}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.7}, 'Clinical_Medicine': {'num': 30, 'acc': 0.73333}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.46667}, 'Pharmacy': {'num': 30, 'acc': 0.76667}, 'Public_Health': {'num': 30, 'acc': 0.73333}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.8}, 'History': {'num': 30, 'acc': 0.76667}, 'Literature': {'num': 30, 'acc': 0.9}, 'Sociology': {'num': 30, 'acc': 0.76667}, 'Psychology': {'num': 30, 'acc': 0.76667}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.4619}, 'Agriculture': {'num': 30, 'acc': 0.53333}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.43333}, 'Computer_Science': {'num': 30, 'acc': 0.6}, 'Electronics': {'num': 30, 'acc': 0.4}, 'Energy_and_Power': {'num': 30, 'acc': 0.46667}, 'Materials': {'num': 30, 'acc': 0.4}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.4}, 'Overall': {'num': 900, 'acc': 0.62778}}
fatal: not a git repository (or any of the parent directories): .git
2025-12-10 15:42:52 | INFO     | lmms_eval.loggers.evaluation_tracker:save_results_aggregated:239 - Output path not provided, skipping saving results aggregated
openai_compatible (model_version=Qwen/Qwen2.5-VL-7B-Instruct), gen_kwargs: (), limit: None, num_fewshot: None, batch_size: 16
| Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|--------|------:|------|-----:|--------|---|-----:|---|------|
|mmmu_val|      0|none  |     0|mmmu_acc|↑  |0.6278|±  |   N/A|

Benchmarking and Profiling

As this function accounts for only a small portion of end-to-end time, there is no significant e2e improvement, but the function's own cost is reduced.
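
A rough Amdahl's-law sanity check of this claim, using the load_time (~299 ms) and e2e latency (~1310 ms) from the single-request log above (illustrative arithmetic only):

```python
# If loading is ~299 ms of a ~1310 ms request, even a large speedup in
# load_mm_data alone moves end-to-end latency only modestly.
e2e_ms, load_ms = 1310.0, 299.0
for load_speedup in (1.5, 2.0, 10.0):
    new_e2e = (e2e_ms - load_ms) + load_ms / load_speedup
    print(f"{load_speedup:>4}x faster load -> e2e {new_e2e:.0f} ms "
          f"({e2e_ms / new_e2e:.2f}x overall)")
```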

SGLANG_MM_FEATURE_CACHE_MB=4096 \
SGLANG_USE_CUDA_IPC_TRANSPORT=1 \
SGLANG_VLM_CACHE_SIZE_MB=0 \
python -m sglang.launch_server --model-path /home/admin/Qwen3-VL-30B-A3B-Instruct \
--host 0.0.0.0 --port 30000 --trust-remote-code --tp-size 2 --enable-cache-report \
--log-level info --max-running-requests 48 --mem-fraction-static 0.7 --chunked-prefill-size 8192  \
--attention-backend flashinfer --mm-attention-backend fa3 
                                                                                               
Benchmark:
python3 -m sglang.bench_serving \
  --backend sglang-oai-chat \
  --dataset-name image \
  --num-prompts 256 \
  --apply-chat-template \
  --random-input-len 128 \
  --random-output-len 1 \
  --image-resolution 560x560 \
  --image-format jpeg \
  --image-count 1 \
  --image-content random \
  --random-range-ratio 0.1 \
  --port 30000 \
  --max-concurrency 32
                                                                                               
Baseline:
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 32        
Successful requests:                     256       
Benchmark duration (s):                  10.06     
Total input tokens:                      104023    
Total input text tokens:                 20567     
Total input vision tokens:               83456     
Total generated tokens:                  115       
Total generated tokens (retokenized):    115       
Request throughput (req/s):              25.46     
Input token throughput (tok/s):          10344.45  
Output token throughput (tok/s):         11.44     
Peak output token throughput (tok/s):    39.00     
Peak concurrent requests:                72        
Total token throughput (tok/s):          10355.89  
Concurrency:                             30.69     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1205.71   
Median E2E Latency (ms):                 1204.62   
---------------Time to First Token----------------
Mean TTFT (ms):                          531.57    
Median TTFT (ms):                        0.00      
P99 TTFT (ms):                           1818.75   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================                                                                                        

                                                                                         
PR:
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 32        
Successful requests:                     256       
Benchmark duration (s):                  10.10     
Total input tokens:                      104043    
Total input text tokens:                 20587     
Total input vision tokens:               83456     
Total generated tokens:                  115       
Total generated tokens (retokenized):    115       
Request throughput (req/s):              25.34     
Input token throughput (tok/s):          10298.52  
Output token throughput (tok/s):         11.38     
Peak output token throughput (tok/s):    37.00     
Peak concurrent requests:                65        
Total token throughput (tok/s):          10309.90  
Concurrency:                             30.62     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1208.55   
Median E2E Latency (ms):                 1238.95   
---------------Time to First Token----------------
Mean TTFT (ms):                          535.38    
Median TTFT (ms):                        0.00      
P99 TTFT (ms):                           1729.77   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================    

Checklist

@gemini-code-assist (Contributor)

Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the multimodal data loading mechanism to enhance performance and simplify the default processing path. By introducing a streamlined load_mm_data that assumes a direct token-to-data mapping, it optimizes common use cases. The original, more complex logic is preserved in a legacy function, with plans for a specialized token-aware loader for specific model requirements in the future.

Highlights

  • Refactored load_mm_data for performance: The primary load_mm_data function has been refactored to a simpler, more direct approach, assuming a 1:1 alignment between tokens and multimodal data. This change aims to improve performance by removing complex token detection and frame expansion logic from the default path.
  • Introduction of _submit_mm_data_loading_tasks_simple: A new private helper method, _submit_mm_data_loading_tasks_simple, has been added to streamline the submission of I/O loading tasks for individual multimodal data items (images, videos, audio) without complex prompt scanning.
  • Renamed original load_mm_data to legacy_load_mm_data: The previous, more complex implementation of load_mm_data has been renamed to legacy_load_mm_data and marked for future obsolescence. This preserves its functionality for models like MiniCPM that require specific token-aware processing, while the new load_mm_data serves as the default (see the sketch after this list).
  • Future load_mm_data_token_aware for MiniCPM: The PR description indicates a plan to introduce a dedicated load_mm_data_token_aware function in the future. This will specifically handle models like MiniCPM that require '1 token -> multiple frames' behavior, keeping the current logic separate from the simplified default.
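
A minimal sketch of how this dispatch could look (the needs_token_aware_loading flag and the load wrapper are illustrative assumptions; load_mm_data and legacy_load_mm_data are the names this PR introduces):

```python
class BaseMultimodalProcessorSketch:
    # Illustrative opt-in flag: models such as MiniCPM, where one placeholder
    # token can expand to multiple frames, keep the token-aware legacy path.
    needs_token_aware_loading = False

    def load_mm_data(self, prompt, data):
        return [("loaded", item) for item in data]  # stub for the new fast path

    def legacy_load_mm_data(self, prompt, data):
        return [("loaded-token-aware", item) for item in data]  # stub for the legacy path

    def load(self, prompt, data):
        # Route MiniCPM-style models to the legacy, token-aware loader;
        # everything else takes the simplified 1-token -> 1-data path.
        if self.needs_token_aware_loading:
            return self.legacy_load_mm_data(prompt, data)
        return self.load_mm_data(prompt, data)
```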

@gemini-code-assist Bot left a comment


Code Review

This pull request refactors the load_mm_data method to improve performance by introducing a simpler, faster implementation that avoids prompt scanning. The old logic is preserved in legacy_load_mm_data. A new helper _submit_mm_data_loading_tasks_simple is added to support this.

My review focuses on the new load_mm_data implementation. I've suggested improvements to its signature to remove an unused parameter and correct a type hint. I've also provided a more formal docstring and a refactoring to reduce code duplication. These changes will improve the code's clarity and maintainability.

Two outdated review comment threads on python/sglang/srt/multimodal/processors/base_processor.py
@yuan-luo (Collaborator, Author) commented Dec 8, 2025

/tag-and-rerun-ci

@github-actions Bot added the run-ci label Dec 8, 2025
@yuan-luo force-pushed the refactor_load_mm_data branch 2 times, most recently from 91351c0 to 9c6cc52 on December 10, 2025 02:53
@yuan-luo (Collaborator, Author) commented Dec 10, 2025

The TestMiniCPMo26Server test case failed, unsurprisingly. Fixing.

https://github.com/sgl-project/sglang/actions/runs/20085678223/job/57622671663?pr=14644

ERROR: setUpClass (__main__.TestMiniCPMo26Server)
Video images response:
The video clip is a close-up shot of a man, widely recognized as Steve Jobs, presenting a product on stage. The camera focuses on his face from the nose down and his right hand, which is holding a device.
The man is wearing a black, collared shirt and thin-framed glasses. He is holding a white, rectangular electronic device, which is an early model of the Apple iPod. The device features a small, square screen at the top and a large, circular click wheel below it. He holds the iPod vertically, presenting it to the audience.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/public_sglang_ci/runner-l1a-gpu-1/_work/sglang/sglang/test/srt/test_vision_openai_server_common.py", line 41, in setUpClass
    cls.process = popen_launch_server(
  File "/public_sglang_ci/runner-l1a-gpu-1/_work/sglang/sglang/python/sglang/test/test_utils.py", line 653, in popen_launch_server
    raise Exception(
Exception: Server process exited with code 1. Check server logs for errors.

@yuan-luo changed the title [VLM] Refactor load_mm_data to improve performance [WIP][VLM] Refactor load_mm_data to improve performance Dec 10, 2025
@yuan-luo force-pushed the refactor_load_mm_data branch from 9c6cc52 to 1b945af on December 10, 2025 03:49
@yuan-luo changed the title [WIP][VLM] Refactor load_mm_data to improve performance [VLM] Refactor load_mm_data to improve performance Dec 10, 2025
@yuan-luo force-pushed the refactor_load_mm_data branch from 1b945af to be874c2 on December 11, 2025 02:21
@yuan-luo added the Multi-modal (multi-modal language model) and vlm labels Dec 16, 2025
@yuan-luo force-pushed the refactor_load_mm_data branch from be874c2 to ce186d0 on December 17, 2025 02:15
@yuan-luo (Collaborator, Author)

> The TestMiniCPMo26Server test case failed, unsurprisingly. Fixing.
> https://github.com/sgl-project/sglang/actions/runs/20085678223/job/57622671663?pr=14644

Fixed by adding a fallback branch.

@yuan-luo (Collaborator, Author)

/rerun-failed-ci

@JustinTong0323 self-assigned this Dec 20, 2025
@yuan-luo force-pushed the refactor_load_mm_data branch from 5192073 to 1c4b8e8 on December 24, 2025 04:19
@JustinTong0323 (Collaborator)

/rerun-failed-ci

@JustinTong0323 (Collaborator)

/tag-and-rerun-ci

1 similar comment
@JustinTong0323 (Collaborator)

/tag-and-rerun-ci

@yuan-luo (Collaborator, Author)

/rerun-failed-ci

@yuan-luo force-pushed the refactor_load_mm_data branch from 1c4b8e8 to ba4c1da on December 25, 2025 02:20
@JustinTong0323 (Collaborator)

/tag-and-rerun-ci

1 similar comment
@yuan-luo (Collaborator, Author)

/tag-and-rerun-ci

@yuan-luo (Collaborator, Author)

/rerun-failed-ci

@mickqian (Collaborator)

We should consider splitting base_processor.py, as it's already too large a file.

@mickqian merged commit 086813a into sgl-project:main Dec 26, 2025
213 of 258 checks passed
@yuan-luo (Collaborator, Author)

There's a case that didn't pass; I'm not sure whether it's related to this PR. I'll manually rerun it and follow up.
https://github.com/sgl-project/sglang/actions/runs/20497461686/job/58944251650?pr=14644#logs

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 410, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1135, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 107, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 716, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 736, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 290, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 119, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 105, in app
    response = await f(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 426, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 312, in run_endpoint_function
    return await dependant.call(**values)
  File "/public_sglang_ci/runner-l2a-gpu-23/_work/sglang/sglang/python/sglang/srt/entrypoints/http_server.py", line 643, in generate_request
    ret = await _global_state.tokenizer_manager.generate_request(
  File "/public_sglang_ci/runner-l2a-gpu-23/_work/sglang/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 447, in generate_request
    tokenized_obj = await self._tokenize_one_request(obj)
  File "/public_sglang_ci/runner-l2a-gpu-23/_work/sglang/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 651, in _tokenize_one_request
    mm_inputs: Dict = await self.mm_data_processor.process(
  File "/public_sglang_ci/runner-l2a-gpu-23/_work/sglang/sglang/python/sglang/srt/managers/async_mm_data_processor.py", line 99, in process
    return await asyncio.wait_for(_invoke(), timeout=self.timeout_s)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/public_sglang_ci/runner-l2a-gpu-23/_work/sglang/sglang/python/sglang/srt/managers/async_mm_data_processor.py", line 70, in _invoke
    return await self._proc_async(
  File "/public_sglang_ci/runner-l2a-gpu-23/_work/sglang/sglang/python/sglang/srt/multimodal/processors/qwen_vl.py", line 337, in process_mm_data_async
    mm_items, input_ids, ret = self.process_and_combine_mm_data(
  File "/public_sglang_ci/runner-l2a-gpu-23/_work/sglang/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 970, in process_and_combine_mm_data
    collected_items, input_ids, ret = self._process_and_collect_mm_items(
  File "/public_sglang_ci/runner-l2a-gpu-23/_work/sglang/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 920, in _process_and_collect_mm_items
    ret = self.process_mm_data(
  File "/public_sglang_ci/runner-l2a-gpu-23/_work/sglang/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 327, in process_mm_data
    result = processor.__call__(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py", line 177, in __call__
    num_image_tokens = image_grid_thw[index].prod() // merge_length
IndexError: index 1 is out of bounds for dimension 0 with size 1
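
For context on the IndexError: transformers' Qwen2.5-VL processor consumes one row of image_grid_thw per image placeholder in the text, so it fires when the prompt contains more placeholders than grids. A hypothetical minimal repro (the tensor values and merge_length are illustrative, not taken from the failing run):

```python
import torch

image_grid_thw = torch.tensor([[1, 34, 34]])  # grid rows for only one image
merge_length = 4  # illustrative merge_size ** 2

try:
    # Pretend the prompt contains two <|image_pad|> placeholders.
    for index in range(2):
        num_image_tokens = image_grid_thw[index].prod() // merge_length
except IndexError as e:
    print(e)  # -> index 1 is out of bounds for dimension 0 with size 1
```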

@yuan-luo (Collaborator, Author)

This is an error in transformers. It might be because the CI run was not rebased onto main. It should be fine; I'll keep a close eye on CI.

@yuan-luo (Collaborator, Author)

The issue does exist in main CI. Investigating.

command=python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --skip-tokenizer-init --device cuda --host 127.0.0.1 --port 21000
[2025-12-26 06:03:31] WARNING server_args.py:1543: Attention backend not specified. Use trtllm_mha backend by default.
[2025-12-26 06:03:31] WARNING server_args.py:1613: TensorRT-LLM MHA only supports page_size of 16, 32 or 64, changing page_size from None to 64.
[2025-12-26 06:03:31] server_args=ServerArgs(model_path='Qwen/Qwen2.5-VL-3B-Instruct', tokenizer_path='Qwen/Qwen2.5-VL-3B-Instruct', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=True, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=21000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.7683540624999999, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=16384, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=64, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=1, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, stream_output=False, random_seed=702667536, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, served_model_name='Qwen/Qwen2.5-VL-3B-Instruct', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='trtllm_mha', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', 
disable_flashinfer_autotune=False, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=512, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, 
torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=16384, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096, 4352, 4608, 4864, 5120, 5376, 5632, 5888, 6144, 6400, 6656, 6912, 7168, 7424, 7680, 7936, 8192, 8448, 8704, 8960, 9216, 9472, 9728, 9984, 10240, 10496, 10752, 11008, 11264, 11520, 11776, 12032, 12288, 12544, 12800, 13056, 13312, 13568, 13824, 14080, 14336, 14592, 14848, 15104, 15360, 15616, 15872, 16128, 16384], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, enable_fused_qk_norm_rope=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, disaggregation_decode_enable_fake_auto=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2025-12-26 06:03:32] Ignore import error when loading sglang.srt.multimodal.processors.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2025-12-26 06:03:36] No chat template found, defaulting to 'string' content format
[2025-12-26 06:03:39] Init torch distributed begin.
[rank0]:[W1226 06:03:40.230730208 ProcessGroupGloo.cpp:516] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-26 06:03:40] Init torch distributed ends. mem usage=0.00 GB
[2025-12-26 06:03:40] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2025-12-26 06:03:40] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2025-12-26 06:03:40] Load weight begin. avail mem=177.74 GB
[2025-12-26 06:03:41] Multimodal attention backend not set. Use triton_attn.
[2025-12-26 06:03:41] Using triton_attn as multimodal attention backend.
[2025-12-26 06:03:41] Found local HF snapshot for Qwen/Qwen2.5-VL-3B-Instruct at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-VL-3B-Instruct/snapshots/66285546d2b821cf421d4f5eb2576359d3770cd3; skipping download.
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:16<00:16, 16.45s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:34<00:00, 17.67s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:34<00:00, 17.49s/it]

[2025-12-26 06:04:16] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=170.40 GB, mem usage=7.34 GB.
[2025-12-26 06:04:16] Using KV cache dtype: torch.bfloat16
[2025-12-26 06:04:16] The available memory for KV cache is 129.22 GB.
[2025-12-26 06:04:16] KV Cache is allocated. #tokens: 3763904, K size: 64.61 GB, V size: 64.61 GB
[2025-12-26 06:04:16] Memory pool end. avail mem=39.14 GB
[2025-12-26 06:04:16] Capture cuda graph begin. This can take up to several minutes. avail mem=38.09 GB
[2025-12-26 06:04:16] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512]
Capturing batches (bs=1 avail_mem=37.53 GB): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52/52 [00:03<00:00, 13.05it/s]
[2025-12-26 06:04:21] Capture cuda graph end. Time elapsed: 4.56 s. mem usage=0.56 GB. avail mem=37.53 GB.
[2025-12-26 06:04:21] max_total_num_tokens=3763904, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=128000, available_gpu_mem=37.53 GB
[2025-12-26 06:04:21] INFO:     Started server process [138941]
[2025-12-26 06:04:21] INFO:     Waiting for application startup.
[2025-12-26 06:04:21] Using default chat sampling params from model generation config: {'repetition_penalty': 1.05, 'temperature': 1e-06, 'top_k': 50, 'top_p': 1.0}
[2025-12-26 06:04:21] Using default chat sampling params from model generation config: {'repetition_penalty': 1.05, 'temperature': 1e-06, 'top_k': 50, 'top_p': 1.0}
[2025-12-26 06:04:21] INFO:     Application startup complete.
[2025-12-26 06:04:21] INFO:     Uvicorn running on http://127.0.0.1:21000 (Press CTRL+C to quit)
[2025-12-26 06:04:22] INFO:     127.0.0.1:57760 - "GET /model_info HTTP/1.1" 200 OK
[2025-12-26 06:04:22] Prefill batch, #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-12-26 06:04:23] INFO:     127.0.0.1:57766 - "POST /generate HTTP/1.1" 200 OK
[2025-12-26 06:04:23] The server is fired up and ready to roll!
[2025-12-26 06:04:24] Prefill batch, #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-12-26 06:04:25] INFO:     127.0.0.1:57782 - "GET /health_generate HTTP/1.1" 200 OK
[CI Test Method] TestSkipTokenizerInitVLM.test_eos_behavior
[2025-12-26 06:04:26] INFO:     127.0.0.1:57788 - "POST /generate HTTP/1.1" 500 Internal Server Error
[2025-12-26 06:04:26] ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1135, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 107, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 716, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 736, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 290, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 118, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 104, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 428, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 314, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/entrypoints/http_server.py", line 643, in generate_request
    ret = await _global_state.tokenizer_manager.generate_request(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tokenizer_manager.py", line 480, in generate_request
    tokenized_obj = await self._tokenize_one_request(obj)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tokenizer_manager.py", line 684, in _tokenize_one_request
    mm_inputs: Dict = await self.mm_data_processor.process(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/async_mm_data_processor.py", line 99, in process
    return await asyncio.wait_for(_invoke(), timeout=self.timeout_s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/async_mm_data_processor.py", line 70, in _invoke
    return await self._proc_async(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/qwen_vl.py", line 337, in process_mm_data_async
    mm_items, input_ids, ret = self.process_and_combine_mm_data(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 970, in process_and_combine_mm_data
    collected_items, input_ids, ret = self._process_and_collect_mm_items(
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 920, in _process_and_collect_mm_items
    ret = self.process_mm_data(
          ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 327, in process_mm_data
    result = processor.__call__(
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py", line 177, in __call__
    num_image_tokens = image_grid_thw[index].prod() // merge_length
                       ~~~~~~~~~~~~~~^^^^^^^
IndexError: index 1 is out of bounds for dimension 0 with size 1
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 976, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 338, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 356, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 2504, in retry
    return fn()
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 1712, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 167, in test_eos_behavior
    self.run_decode(max_new_tokens=256)
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 68, in run_decode
    ret = response.json()
          ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 980, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
E[CI Test Method] TestSkipTokenizerInitVLM.test_logprob
[2025-12-26 06:04:26] INFO:     127.0.0.1:57804 - "POST /generate HTTP/1.1" 500 Internal Server Error
[2025-12-26 06:04:26] ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1135, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 107, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 716, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 736, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 290, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 118, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 104, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 428, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 314, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/entrypoints/http_server.py", line 643, in generate_request
    ret = await _global_state.tokenizer_manager.generate_request(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tokenizer_manager.py", line 480, in generate_request
    tokenized_obj = await self._tokenize_one_request(obj)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tokenizer_manager.py", line 684, in _tokenize_one_request
    mm_inputs: Dict = await self.mm_data_processor.process(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/async_mm_data_processor.py", line 99, in process
    return await asyncio.wait_for(_invoke(), timeout=self.timeout_s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/async_mm_data_processor.py", line 70, in _invoke
    return await self._proc_async(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/qwen_vl.py", line 337, in process_mm_data_async
    mm_items, input_ids, ret = self.process_and_combine_mm_data(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 970, in process_and_combine_mm_data
    collected_items, input_ids, ret = self._process_and_collect_mm_items(
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 920, in _process_and_collect_mm_items
    ret = self.process_mm_data(
          ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 327, in process_mm_data
    result = processor.__call__(
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py", line 177, in __call__
    num_image_tokens = image_grid_thw[index].prod() // merge_length
                       ~~~~~~~~~~~~~~^^^^^^^
IndexError: index 1 is out of bounds for dimension 0 with size 1
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 976, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 338, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 356, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 2504, in retry
    return fn()
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 1712, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 164, in test_logprob
    self.run_decode(return_logprob=True, top_logprobs_num=top_logprobs_num)
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 68, in run_decode
    ret = response.json()
          ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 980, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
E[CI Test Method] TestSkipTokenizerInitVLM.test_parallel_sample
[2025-12-26 06:04:26] INFO:     127.0.0.1:57806 - "POST /generate HTTP/1.1" 500 Internal Server Error
[2025-12-26 06:04:26] ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1135, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 107, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 716, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 736, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 290, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 118, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 104, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 428, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 314, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/entrypoints/http_server.py", line 643, in generate_request
    ret = await _global_state.tokenizer_manager.generate_request(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tokenizer_manager.py", line 485, in generate_request
    async for response in self._handle_batch_request(
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tokenizer_manager.py", line 1205, in _handle_batch_request
    tokenized_objs = await asyncio.gather(
                     ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tokenizer_manager.py", line 684, in _tokenize_one_request
    mm_inputs: Dict = await self.mm_data_processor.process(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/async_mm_data_processor.py", line 99, in process
    return await asyncio.wait_for(_invoke(), timeout=self.timeout_s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/async_mm_data_processor.py", line 70, in _invoke
    return await self._proc_async(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/qwen_vl.py", line 337, in process_mm_data_async
    mm_items, input_ids, ret = self.process_and_combine_mm_data(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 970, in process_and_combine_mm_data
    collected_items, input_ids, ret = self._process_and_collect_mm_items(
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 920, in _process_and_collect_mm_items
    ret = self.process_mm_data(
          ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 327, in process_mm_data
    result = processor.__call__(
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py", line 177, in __call__
    num_image_tokens = image_grid_thw[index].prod() // merge_length
                       ~~~~~~~~~~~~~~^^^^^^^
IndexError: index 1 is out of bounds for dimension 0 with size 1
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 976, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 338, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 356, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 2504, in retry
    return fn()
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 1712, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 160, in test_parallel_sample
    self.run_decode(n=3)
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 68, in run_decode
    ret = response.json()
          ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 980, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
E[CI Test Method] TestSkipTokenizerInitVLM.test_simple_decode
[2025-12-26 06:04:26] INFO:     127.0.0.1:57808 - "POST /generate HTTP/1.1" 500 Internal Server Error
[2025-12-26 06:04:26] ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1135, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 107, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 716, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 736, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 290, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 118, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 104, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 428, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 314, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/entrypoints/http_server.py", line 643, in generate_request
    ret = await _global_state.tokenizer_manager.generate_request(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tokenizer_manager.py", line 480, in generate_request
    tokenized_obj = await self._tokenize_one_request(obj)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tokenizer_manager.py", line 684, in _tokenize_one_request
    mm_inputs: Dict = await self.mm_data_processor.process(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/async_mm_data_processor.py", line 99, in process
    return await asyncio.wait_for(_invoke(), timeout=self.timeout_s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/async_mm_data_processor.py", line 70, in _invoke
    return await self._proc_async(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/qwen_vl.py", line 337, in process_mm_data_async
    mm_items, input_ids, ret = self.process_and_combine_mm_data(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 970, in process_and_combine_mm_data
    collected_items, input_ids, ret = self._process_and_collect_mm_items(
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 920, in _process_and_collect_mm_items
    ret = self.process_mm_data(
          ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 327, in process_mm_data
    result = processor.__call__(
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py", line 177, in __call__
    num_image_tokens = image_grid_thw[index].prod() // merge_length
                       ~~~~~~~~~~~~~~^^^^^^^
IndexError: index 1 is out of bounds for dimension 0 with size 1
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 976, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 338, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 356, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 2504, in retry
    return fn()
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 1712, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 157, in test_simple_decode
    self.run_decode()
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 68, in run_decode
    ret = response.json()
          ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 980, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
E[CI Test Method] TestSkipTokenizerInitVLM.test_simple_decode_stream
.
======================================================================
ERROR: test_eos_behavior (__main__.TestSkipTokenizerInitVLM.test_eos_behavior)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 976, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 338, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 356, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 2504, in retry
    return fn()
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 1712, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 167, in test_eos_behavior
    self.run_decode(max_new_tokens=256)
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 68, in run_decode
    ret = response.json()
          ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 980, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 1711, in _callTestMethod
    retry(
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 2512, in retry
    raise Exception(f"retry() exceed maximum number of retries.")
Exception: retry() exceed maximum number of retries.

======================================================================
ERROR: test_logprob (__main__.TestSkipTokenizerInitVLM.test_logprob)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 976, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 338, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 356, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 2504, in retry
    return fn()
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 1712, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 164, in test_logprob
    self.run_decode(return_logprob=True, top_logprobs_num=top_logprobs_num)
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 68, in run_decode
    ret = response.json()
          ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 980, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 1711, in _callTestMethod
    retry(
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 2512, in retry
    raise Exception(f"retry() exceed maximum number of retries.")
Exception: retry() exceed maximum number of retries.

======================================================================
ERROR: test_parallel_sample (__main__.TestSkipTokenizerInitVLM.test_parallel_sample)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 976, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 338, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 356, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 2504, in retry
    return fn()
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 1712, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 160, in test_parallel_sample
    self.run_decode(n=3)
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 68, in run_decode
    ret = response.json()
          ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 980, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 1711, in _callTestMethod
    retry(
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 2512, in retry
    raise Exception(f"retry() exceed maximum number of retries.")
Exception: retry() exceed maximum number of retries.

======================================================================
ERROR: test_simple_decode (__main__.TestSkipTokenizerInitVLM.test_simple_decode)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 976, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 338, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 356, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 2504, in retry
    return fn()
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 1712, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 157, in test_simple_decode
    self.run_decode()
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 68, in run_decode
    ret = response.json()
          ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 980, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 1711, in _callTestMethod
    retry(
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 2512, in retry
    raise Exception(f"retry() exceed maximum number of retries.")
Exception: retry() exceed maximum number of retries.

----------------------------------------------------------------------
Ran 10 tests in 102.655s

FAILED (errors=4)
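All four errors share the same server-side root cause: processing_qwen2_5_vl.py indexes image_grid_thw once per image placeholder in the prompt, so a request that ends up with more <|image_pad|> placeholders than processed grid entries fails with exactly this IndexError. A minimal sketch of the failing indexing, with hypothetical shapes rather than the actual CI request:

import torch

merge_length = 2 * 2  # merge_size**2 for Qwen2.5-VL
image_grid_thw = torch.tensor([[1, 4, 4]])  # grid entries for ONE processed image

# The prompt, however, carries TWO image placeholders, so the processor
# loops index = 0, 1 over image_grid_thw and the second lookup fails:
for index in range(2):
    try:
        num_image_tokens = image_grid_thw[index].prod() // merge_length
        print(index, int(num_image_tokens))
    except IndexError as e:
        print(index, e)  # index 1 is out of bounds for dimension 0 with size 1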

@yuan-luo
Collaborator Author

yuan-luo commented Dec 26, 2025

I tested on main: Qwen3-VL works correctly, but Qwen2.5-VL is broken.

server:

➜  sglang git:(main) python -m sglang.launch_server --host 0.0.0.0 --port 30000 --model-path Qwen/Qwen2.5-VL-7B-Instruct --served-model-name test --trust-remote-code --disable-radix-cache --tp 4 --mem-fraction-static 0.85  --mm-attention-backend triton_attn --attention-backend flashinfer

client:

➜  bench_script bash bench_images.sh
{"id":"a18bf9a84d354d169ab37ea62005fd00","object":"chat.completion","created":1766730374,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":" word书写 ofVEN查Pe受 word案件 syll syll syll syll syll syll syll syll prescribed syll Unicode syll职责 syllenia syll undefined syllGoodene受 syll why syll职责 syll why试点 syll above syll advised syllSSION syll why syll condition syll condition syllSSION condition condition syll exhibiting阮案件案件enia /**\n.persistence succeedingRecommend succeeding syllSSION三点案件 undefined understand understand岫收回彼 leave_ui borough borough syll why后果案件Good undefinedRecommendRecommendReadRead scoff后果收回 leaveReadRead后果后果收回后果收回 blessing argument argument argumentRead argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument.send argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argumentULDIllegalAccessException why condition condition condition condition督查收回 argument argument argument argument督查收回ReadRead argument argument argument argument argument argument argument argument argument argumentreasonGood good good goodGoodGood goodnessGoodGoodGoodGood_uiReadReadGoodGoodGood收回收回对你 argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument Hera Ala utiliz徨收回 argument argument Heraserde Good Good Good Good Good案件 Alande Good Good案件 AlaappleGood完美 Ala完美 Ala Ala Ala受受ываем发酵发酵ываем收回发酵重现重现重现重现重现重现重现重现重现重现重现重现重现 argument徨收回发酵发酵重现 argument徨收回 Hera?$案件GoodGood完美 Ala Ala Ala Ala Ala Alande案件.theme Ala higher受 Again understand understand argument understand understand argument good goodGood goodness徨收回Good Good understand argumentGoodGoodGoodGoodGood_ui收回 argument徨 argument徨GoodGood_ui argument徨徨收回 argument徨收回徨 argument徨 argument徨收回 argument ball案件 argument徨 argument徨收回 argument encontr朋友圈发酵发酵重现 argument徨 gallon徨 gallon.performanceGoodGood_ui_ui收回收回收回收回 argument徨 gallon Regression argumentTk受 argument argument argument argument argument argument argument argument argument argument徨Tk argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argumentuje argument_ui argument(requiredGoodGoodGood收回 argument argumentrez readGoodGood_ui argument argumentGood_ui argument_uirezGood_uirez goodness GoodGood_ui argument gallon过程中 argument argument argument argument argument徨GoodGood收回 argument argument argument argument argument argument argument argument argument argument徨 argument argument argument argument argument徨 argument gallon argument argument argument argument argumentTk argument argument argument argument argument argument argument argument argument argument argument argument argument argument Fault argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument徨 argument gallon argument argument argument argument argument argument argument活性 argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument过了Good argument argument argument argument argument argument argument argument argument argument argument argument argument argument 
argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument.send understand argument argument argument argument argument argument的方法 argument argument argument的方法 argument argument argument argument的方法Good argument argument argument argument argument argument的方法 argument的方法 argument的方法 argument argument argument argument argument argument argument argument的方法 argument的方法活性Good argument argument argument的方法 argument的方法Good argument argument argument argument argument argument的方法 argument的方法 argument argument argument的方法 argument的方法 argument的方法 argument的方法 argument的方法 argument的方法 argument的方法 argument的方法 argument的方法 argument argument argument的方法 argument的方法 argument的方法 argument的方法的方法案件 argument argument argument argument的方法 argument argument argument的方法 argument的方法的方法的方法案件 argument的方法ocy案件 argument argument argument的方法 argument的方法 argument的方法iciency argument的方法(required徨 argument argument argument的方法 argument argument argument的方法 argument的方法的方法.persistence argument的方法 repetition argument的方法的方法收回 argument的方法的方法的方法 McLaren Panasonic measured measured measuredPlease measured measured measured案件案件 argument的方法 argument argument的方法的方法的方法的方法 Ala Ala受加工 thatGoodGood argument argument argument argument argument的方法的方法 Ala受ываем qualifyinguhn argument argument argument的方法的方法ываем案件Good goodness understand understand understand的方法 Ala goodness understand understand的方法要好好 argument的方法 argument的方法的方法 Ala事情Good goodness argument argument的方法的方法.theme加工Good attainedGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGood完美ываемGoodGoodGood goodnessGoodGood_ui argument argument argument argument argument argument的方法 argument的方法的方法 argument argument的方法 argument的方法 argument argument的方法的方法的方法的方法的方法 AlaGoodGoodGoodGoodGoodGoodGood idolGoodGoodGoodGoodGood_ui argument argument的方法的方法的方法的方法ываемываем ball ballGoodGoodGoodGoodGood_ui argument的方法的方法的方法的方法的方法的方法的方法的方法的方法的方法的方法的方法的方法的方法 argument的方法的方法的方法的方法的方法的方法 db argument的方法的方法的方法案件 argument的方法的方法/list understand的方法要好好发酵发酵重现 argument argument argument的方法.table BarbaraываемGoodGoodGoodGood_ui argument的方法 Interpret受GoodGoodGood argument argument argument argument的方法 argument prácticaductory argument的方法 argument的方法.table职责Good goodnessGood_ui argument的方法 argument的方法 argument的方法的方法.table understand argument的方法 argument的方法.table understand understand understand的方法的方法.table understand","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"length","matched_stop":null}],"usage":{"prompt_tokens":977,"total_tokens":1977,"completion_tokens":1000,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real	0m3.913s
user	0m0.006s
sys	0m0.004s
{"id":"dcaa0c944e604e6391cc365fd1e6ca7e","object":"chat.completion","created":1766730378,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":" word书写 ofVEN查Pe受 word案件 syll syll syll syll syll syll syll syll prescribed syll Unicode syll职责 syllenia syll undefined syllGoodene受 syll why syll职责 syll why试点 syll above syll advised syllSSION syll why syll condition syll condition syllSSION condition condition syll exhibiting阮案件案件enia /**\n.persistence succeedingRecommend succeeding syllSSION三点案件 undefined understand understand岫收回彼 leave_ui borough borough syll why后果案件Good undefinedRecommendRecommendReadRead scoff后果收回 leaveReadRead后果后果收回后果收回 blessing argument argument argumentRead argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument.send argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argumentULDIllegalAccessException why condition condition condition condition督查收回 argument argument argument argument督查收回ReadRead argument argument argument argument argument argument argument argument argument argumentreasonGood good good goodGoodGood goodnessGoodGoodGoodGood_uiReadReadGoodGoodGood收回收回对你 argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument Hera Ala utiliz徨收回 argument argument Heraserde Good Good Good Good Good案件 Alande Good Good案件 AlaappleGood完美 Ala完美 Ala Ala Ala受受ываем发酵发酵ываем收回发酵重现重现重现重现重现重现重现重现重现重现重现重现重现 argument徨收回发酵发酵重现 argument徨收回 Hera?$案件GoodGood完美 Ala Ala Ala Ala Ala Alande案件.theme Ala higher受 Again understand understand argument understand understand argument good goodGood goodness徨收回Good Good understand argumentGoodGoodGoodGoodGood_ui收回 argument徨 argument徨GoodGood_ui argument徨徨收回 argument徨收回徨 argument徨 argument徨收回 argument ball案件 argument徨 argument徨收回 argument encontr朋友圈发酵发酵重现 argument徨 gallon徨 gallon.performanceGoodGood_ui_ui收回收回收回收回 argument徨 gallon Regression argumentTk受 argument argument argument argument argument argument argument argument argument argument徨Tk argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argumentuje argument_ui argument(requiredGoodGoodGood收回 argument argumentrez readGoodGood_ui argument argumentGood_ui argument_uirezGood_uirez goodness GoodGood_ui argument gallon过程中 argument argument argument argument argument徨GoodGood收回 argument argument argument argument argument argument argument argument argument argument徨 argument argument argument argument argument徨 argument gallon argument argument argument argument argumentTk argument argument argument argument argument argument argument argument argument argument argument argument argument argument Fault argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument徨 argument gallon argument argument argument argument argument argument argument活性 argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument过了Good argument argument argument argument argument argument argument argument argument argument argument argument argument argument 
argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument.send understand argument argument argument argument argument argument的方法 argument argument argument的方法 argument argument argument argument的方法Good argument argument argument argument argument argument的方法 argument的方法 argument的方法 argument argument argument argument argument argument argument argument的方法 argument的方法活性Good argument argument argument的方法 argument的方法Good argument argument argument argument argument argument的方法 argument的方法 argument argument argument的方法 argument的方法 argument的方法 argument的方法 argument的方法 argument的方法 argument的方法 argument的方法 argument的方法 argument argument argument的方法 argument的方法 argument的方法 argument的方法的方法案件 argument argument argument argument的方法 argument argument argument的方法 argument的方法的方法的方法案件 argument的方法ocy案件 argument argument argument的方法 argument的方法 argument的方法iciency argument的方法(required徨 argument argument argument的方法 argument argument argument的方法 argument的方法的方法.persistence argument的方法 repetition argument的方法的方法收回 argument的方法的方法的方法 McLaren Panasonic measured measured measuredPlease measured measured measured案件案件 argument的方法 argument argument的方法的方法的方法的方法 Ala Ala受加工 thatGoodGood argument argument argument argument argument的方法的方法 Ala受ываем qualifyinguhn argument argument argument的方法的方法ываем案件Good goodness understand understand understand的方法 Ala goodness understand understand的方法要好好 argument的方法 argument的方法的方法 Ala事情Good goodness argument argument的方法的方法.theme加工Good attainedGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGood完美ываемGoodGoodGood goodnessGoodGood_ui argument argument argument argument argument argument的方法 argument的方法的方法 argument argument的方法 argument的方法 argument argument的方法的方法的方法的方法的方法 AlaGoodGoodGoodGoodGoodGoodGood idolGoodGoodGoodGoodGood_ui argument argument的方法的方法的方法的方法ываемываем ball ballGoodGoodGoodGoodGood_ui argument的方法的方法的方法的方法的方法的方法的方法的方法的方法的方法的方法的方法的方法的方法 argument的方法的方法的方法的方法的方法的方法 db argument的方法的方法的方法案件 argument的方法的方法/list understand的方法要好好发酵发酵重现 argument argument argument的方法.table BarbaraываемGoodGoodGoodGood_ui argument的方法 Interpret受GoodGoodGood argument argument argument argument的方法 argument prácticaductory argument的方法 argument的方法.table职责Good goodnessGood_ui argument的方法 argument的方法 argument的方法的方法.table understand argument的方法 argument的方法.table understand understand understand的方法的方法.table understand","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"length","matched_stop":null}],"usage":{"prompt_tokens":977,"total_tokens":1977,"completion_tokens":1000,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real	0m3.700s
user	0m0.003s
sys	0m0.006s

server:

➜  sglang git:(main) python -m sglang.launch_server --host 0.0.0.0 --port 30000 --model-path Qwen/Qwen3-VL-8B-Instruct --served-model-name test --trust-remote-code --disable-radix-cache --tp 4 --mem-fraction-static 0.85  --mm-attention-backend triton_attn --attention-backend flashinfer

client:

➜  bench_script bash bench_images.sh
{"id":"8d283d13fa524f1bb0746ba32339619e","object":"chat.completion","created":1766730132,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":"这张图里展示的是**笔记本电脑**(Laptop Computer),它的学名是**便携式个人计算机**(Portable Personal Computer)。\n\n更具体地说,图中是**一台银色的笔记本电脑**,放置在木质桌面上,背景有暖色调的灯光,营造出温馨的工作或学习氛围。虽然图中没有显示品牌或型号,但从外观来看,它具有典型的笔记本电脑特征:\n\n- 一体式机身(屏幕与键盘合二为一)\n- 便携式设计(可折叠、轻薄)\n- 有键盘、触控板、屏幕等核心部件\n- 通常用于移动办公、学习、娱乐等\n\n**学名解释:**\n- “笔记本电脑”是通俗叫法,其正式学名是“便携式个人计算机”,英文为 **Portable Personal Computer**。\n- 在计算机科学和工程领域,它也常被称为 **Notebook Computer**(注意:Notebook 是笔记本电脑的另一种叫法,与“笔记本”意思相同,但“Notebook”更强调便携性)。\n- 从技术分类上,它属于**个人计算机(PC)**的一个子类,与台式机(Desktop Computer)相对。\n\n所以,图中物品的学名是:**便携式个人计算机**(Portable Personal Computer)或 **笔记本电脑**(Notebook Computer)。","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":742,"total_tokens":1031,"completion_tokens":289,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real	0m1.505s
user	0m0.006s
sys	0m0.009s
{"id":"ab4f8a4649584380a06511ad65541abe","object":"chat.completion","created":1766730133,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":"这张图里展示的是**笔记本电脑**(Laptop Computer),它的学名是**便携式个人计算机**(Portable Personal Computer)。\n\n更具体地说,图中是**一台银色的笔记本电脑**,放置在木质桌面上,背景有暖色调的灯光,营造出温馨的工作或学习氛围。虽然图中没有显示品牌或型号,但从外观来看,它具有典型的笔记本电脑特征:\n\n- 一体式机身(屏幕与键盘合二为一)\n- 便携式设计(可折叠、轻薄)\n- 有键盘、触控板、屏幕等核心部件\n- 通常用于移动办公、学习、娱乐等\n\n**学名解释:**\n- “笔记本电脑”是通俗叫法,其正式学名是“便携式个人计算机”,英文为 **Portable Personal Computer**。\n- 在计算机科学和工程领域,它也常被称为 **Notebook Computer**(注意:Notebook 是笔记本电脑的另一种叫法,与“笔记本”意思相同,但“Notebook”更强调便携性)。\n- 从技术分类上,它属于**个人计算机(PC)**的一个子类,与台式机(Desktop Computer)相对。\n\n所以,图中物品的学名是:**便携式个人计算机**(Portable Personal Computer)或 **笔记本电脑**(Notebook Computer)。","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":742,"total_tokens":1031,"completion_tokens":289,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real	0m1.295s
user	0m0.002s
sys	0m0.006s
{"id":"29de28e8e47b4edd864fe356af2c89e2","object":"chat.completion","created":1766730135,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":"这张图里展示的是**笔记本电脑**(Laptop Computer),它的学名是**便携式个人计算机**(Portable Personal Computer)。\n\n更具体地说,图中是**一台银色的笔记本电脑**,放置在木质桌面上,背景有暖色调的灯光,营造出温馨的工作或学习氛围。虽然图中没有显示品牌或型号,但从外观来看,它具有典型的笔记本电脑特征:\n\n- 一体式机身(屏幕与键盘合二为一)\n- 便携式设计(可折叠、轻薄)\n- 有键盘、触控板、屏幕等核心部件\n- 通常用于移动办公、学习、娱乐等\n\n**学名解释:**\n- “笔记本电脑”是通俗叫法,其正式学名是“便携式个人计算机”,英文为 **Portable Personal Computer**。\n- 在计算机科学和工程领域,它也常被称为 **Notebook Computer**(注意:Notebook 是笔记本电脑的另一种叫法,与“笔记本”意思相同,但“Notebook”更强调便携性)。\n- 从技术分类上,它属于**个人计算机(PC)**的一个子类,与台式机(Desktop Computer)相对。\n\n所以,图中物品的学名是:**便携式个人计算机**(Portable Personal Computer)或 **笔记本电脑**(Notebook Computer)。","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":742,"total_tokens":1031,"completion_tokens":289,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real	0m1.349s
user	0m0.004s
sys	0m0.004s
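bench_images.sh itself is not posted in this thread; a roughly equivalent single request against the same endpoint (the image URL and prompt below are placeholders) looks like:

import requests

payload = {
    "model": "auto",
    "messages": [{
        "role": "user",
        "content": [
            # hypothetical image URL standing in for whatever bench_images.sh sends
            {"type": "image_url", "image_url": {"url": "https://example.com/laptop.jpg"}},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }],
    "temperature": 0.0,
    "max_tokens": 1000,
    "stream": False,
}
resp = requests.post("http://127.0.0.1:30000/v1/chat/completions", json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])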

@yuan-luo
Collaborator Author

Checking whether it is this PR that breaks Qwen2.5-VL. When this PR was submitted, Qwen2.5-VL was working correctly.

@yuan-luo
Collaborator Author

I reverted this PR in my local environment and the problem still exists, so the breakage is not caused by this PR.

$bash bench_n_1m_image.sh 
{"id":"01be6dc6fed949998948713ff1b59279","object":"chat.completion","created":1766733308,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":" word word word revival syllTER活性活性痕il依文书il literary活性il依 literarycher ser痕痕 ser ser ser痕 available available available活性活性活性活性活性活性il available available活性活性il活性依UDiciency $?il available received receivedil available ask ask suggest suggest ask ask ask ask ask ask ask ask ask suggest received ask ask ask ask ask ask ask ask ask ask ask ask ask ask ask ask ask ask suggest ask ask ask received str str str str str str str str str str str str str str str str str str ask ask ask ask ask ask str str str str str str ask ask ask ask str str str str str str str str str str str str ask ask ask ask ask ask ask ask ask str str str str str str str str str str str str str str str str str ask ask ask ask ask ask ask ask str str str str str str str str str str str str str str str str str str str str ask ask ask ask ask ask ask ask ask ask ask str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"length","matched_stop":null}],"usage":{"prompt_tokens":811,"total_tokens":1811,"completion_tokens":1000,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real    0m4.411s
user    0m0.002s
sys     0m0.002s
{"id":"a4cd664baf5e462b82b4315f65aad330","object":"chat.completion","created":1766733312,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":" word word word revival syllTER活性活性痕il依文书il literary活性il依 literarycher ser痕痕 ser ser ser痕 available available available活性活性活性活性活性活性il available available活性活性il活性依UDiciency $?il available received receivedil available ask ask suggest suggest ask ask ask ask ask ask ask ask ask suggest received ask ask ask ask ask ask ask ask ask ask ask ask ask ask ask ask ask ask suggest ask ask ask received str str str str str str str str str str str str str str str str str str ask ask ask ask ask ask str str str str str str ask ask ask ask str str str str str str str str str str str str ask ask ask ask ask ask ask ask ask str str str str str str str str str str str str str str str str str ask ask ask ask ask ask ask ask str str str str str str str str str str str str str str str str str str str str ask ask ask ask ask ask ask ask ask ask ask str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"length","matched_stop":null}],"usage":{"prompt_tokens":811,"total_tokens":1811,"completion_tokens":1000,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real    0m4.347s
user    0m0.001s
sys     0m0.004s
{"id":"af128e771f9c49a896842f2b637cb21c","object":"chat.completion","created":1766733316,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":" word word word revival syllTER活性活性痕il依文书il literary活性il依 literarycher ser痕痕 ser ser ser痕 available available available活性活性活性活性活性活性il available available活性活性il活性依UDiciency $?il available received receivedil available ask ask suggest suggest ask ask ask ask ask ask ask ask ask suggest received ask ask ask ask ask ask ask ask ask ask ask ask ask ask ask ask ask ask suggest ask ask ask received str str str str str str str str str str str str str str str str str str ask ask ask ask ask ask str str str str str str ask ask ask ask str str str str str str str str str str str str ask ask ask ask ask ask ask ask ask str str str str str str str str str str str str str str str str str ask ask ask ask ask ask ask ask str str str str str str str str str str str str str str str str str str str str ask ask ask ask ask ask ask ask ask ask ask str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"length","matched_stop":null}],"usage":{"prompt_tokens":811,"total_tokens":1811,"completion_tokens":1000,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real    0m4.336s
user    0m0.000s
sys     0m0.004s

) -> BaseMultiModalProcessorOutput:
"""
A fast version of `load_mm_data` that loads multimodal data directly.
This version does not scan the prompt to recognize tokens. It assumes
a one-to-one correspondence between multimodal tokens and data items,
aligned in order.
"""
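
For orientation, a minimal sketch of the "1 token → 1 data" loading this docstring describes; `load_one` and the thread pool are illustrative stand-ins, not the actual sglang helpers:

from concurrent.futures import ThreadPoolExecutor

def fast_load_mm_data(data_items):
    """Sketch only: load every item directly, preserving order, so the
    i-th loaded result pairs with the i-th multimodal token in the prompt.
    No prompt scanning and no frame expansion (that behavior stays in the
    legacy MiniCPM-only path)."""

    def load_one(item):
        # Stand-in for the real per-modality loader (image/video/audio).
        return item

    with ThreadPoolExecutor() as pool:
        return list(pool.map(load_one, data_items))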
Collaborator

Should we add a safety check, such as:

expected_count = (
    len(image_data or [])
    + len(video_data or [])
    + len(audio_data or [])
)

assert expected_count == len(tokenizer(prompt))
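
As written, len(tokenizer(prompt)) would count every token in the prompt, whereas the quantity to compare against is presumably the number of multimodal placeholder tokens. A fleshed-out sketch under that assumption, where mm_token_ids is a hypothetical set of placeholder token ids (the real ids would come from the processor config):

def check_mm_alignment(input_ids, image_data, video_data, audio_data, mm_token_ids):
    """Verify the fast path's "1 token -> 1 data" assumption before loading."""
    expected = len(image_data or []) + len(video_data or []) + len(audio_data or [])
    found = sum(1 for t in input_ids if t in mm_token_ids)
    if expected != found:
        raise ValueError(
            f"Multimodal mismatch: {expected} data items vs "
            f"{found} placeholder tokens in the prompt."
        )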

Collaborator Author

Will update in a new PR.

@merrymercy
Contributor

@yuan-luo I reverted this PR. Please resubmit and fix the CI failures.
#15911

@yuan-luo
Collaborator Author

yuan-luo commented Dec 27, 2025

The PR that introduced the Qwen2.5-VL regression is [bug fix][pp] fix weight load for qwen2.5-vl (#15138).
It has been fixed in #15398.

@yuan-luo
Collaborator Author

@yuan-luo I reverted this PR. Please resubmit and fix the CI failures. #15911

It's odd that running the test case test_skip_tokenizer_init.TestSkipTokenizerInitVLM.test_simple_decode_stream manually passes, but it fails when the whole test suite is run.

root@6996fb46042d:/sgl-workspace/sglang_dev3/test/srt# python3 -m unittest test_skip_tokenizer_init.TestSkipTokenizerInitVLM.test_simple_decode_stream
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
command=python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --skip-tokenizer-init --device cuda --host 127.0.0.1 --port 21000
[2025-12-28 14:27:12] WARNING server_args.py:1543: Attention backend not specified. Use flashinfer backend by default.
[2025-12-28 14:27:12] server_args=ServerArgs(model_path='Qwen/Qwen2.5-VL-3B-Instruct', tokenizer_path='Qwen/Qwen2.5-VL-3B-Instruct', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=True, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=21000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.7486296874999999, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=1, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, stream_output=False, random_seed=366367333, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, served_model_name='Qwen/Qwen2.5-VL-3B-Instruct', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='flashinfer', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', 
disable_flashinfer_autotune=False, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=256, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=8192, 
piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096, 4352, 4608, 4864, 5120, 5376, 5632, 5888, 6144, 6400, 6656, 6912, 7168, 7424, 7680, 7936, 8192], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, enable_fused_qk_norm_rope=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, disaggregation_decode_enable_fake_auto=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2025-12-28 14:27:13] Ignore import error when loading sglang.srt.multimodal.processors.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2025-12-28 14:27:18] No chat template found, defaulting to 'string' content format
[2025-12-28 14:27:20] Init torch distributed begin.
[rank0]:[W1228 14:27:20.297433853 ProcessGroupGloo.cpp:516] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-28 14:27:20] Init torch distributed ends. mem usage=0.00 GB
[2025-12-28 14:27:20] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2025-12-28 14:27:20] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2025-12-28 14:27:20] Load weight begin. avail mem=78.81 GB
[2025-12-28 14:27:20] Multimodal attention backend not set. Use fa3.
[2025-12-28 14:27:20] Using fa3 as multimodal attention backend.
[2025-12-28 14:27:21] Found local HF snapshot for Qwen/Qwen2.5-VL-3B-Instruct at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-VL-3B-Instruct/snapshots/66285546d2b821cf421d4f5eb2576359d3770cd3; skipping download.
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.11it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.21it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.20it/s]

[2025-12-28 14:27:23] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=71.47 GB, mem usage=7.34 GB.
[2025-12-28 14:27:23] Using KV cache dtype: torch.bfloat16
[2025-12-28 14:27:23] The available memory for KV cache is 51.66 GB.
[2025-12-28 14:27:23] KV Cache is allocated. #tokens: 1504726, K size: 25.83 GB, V size: 25.83 GB
[2025-12-28 14:27:23] Memory pool end. avail mem=17.70 GB
[2025-12-28 14:27:23] Capture cuda graph begin. This can take up to several minutes. avail mem=17.13 GB
[2025-12-28 14:27:23] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
Capturing batches (bs=1 avail_mem=16.19 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36/36 [00:02<00:00, 12.92it/s]
[2025-12-28 14:27:26] Capture cuda graph end. Time elapsed: 3.27 s. mem usage=0.94 GB. avail mem=16.18 GB.
[2025-12-28 14:27:26] max_total_num_tokens=1504726, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4096, context_len=128000, available_gpu_mem=16.18 GB
[2025-12-28 14:27:26] INFO:     Started server process [212757]
[2025-12-28 14:27:26] INFO:     Waiting for application startup.
[2025-12-28 14:27:26] Using default chat sampling params from model generation config: {'repetition_penalty': 1.05, 'temperature': 1e-06, 'top_k': 50, 'top_p': 1.0}
[2025-12-28 14:27:26] Using default chat sampling params from model generation config: {'repetition_penalty': 1.05, 'temperature': 1e-06, 'top_k': 50, 'top_p': 1.0}
[2025-12-28 14:27:26] INFO:     Application startup complete.
[2025-12-28 14:27:26] INFO:     Uvicorn running on http://127.0.0.1:21000 (Press CTRL+C to quit)
[2025-12-28 14:27:27] INFO:     127.0.0.1:56244 - "GET /model_info HTTP/1.1" 200 OK
[2025-12-28 14:27:27] Prefill batch, #new-seq: 1, #new-token: 3, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-28 14:27:27] INFO:     127.0.0.1:56256 - "POST /generate HTTP/1.1" 200 OK
[2025-12-28 14:27:27] The server is fired up and ready to roll!
[2025-12-28 14:27:35] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-28 14:27:36] INFO:     127.0.0.1:54240 - "GET /health_generate HTTP/1.1" 200 OK
[CI Test Method] TestSkipTokenizerInitVLM.test_simple_decode_stream
.
----------------------------------------------------------------------
Ran 1 test in 36.588s

OK

@yuan-luo
Collaborator Author

Running the whole test suite, it fails on the above test case:

root@6996fb46042d:/sgl-workspace/sglang_dev3# python ./test/srt/test_skip_tokenizer_init.py
command=python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --skip-tokenizer-init --stream-output --device cuda --host 127.0.0.1 --port 21000
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_http.py", line 402, in hf_raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 1026, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/transformers/utils/hub.py", line 479, in cached_files
    hf_hub_download(
  File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/file_download.py", line 1007, in hf_hub_download
    return _hf_hub_download_to_cache_dir(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/file_download.py", line 1114, in _hf_hub_download_to_cache_dir
    _raise_on_head_call_error(head_call_error, force_download, local_files_only)
  File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/file_download.py", line 1655, in _raise_on_head_call_error
    raise head_call_error
  File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/file_download.py", line 1543, in _get_metadata_or_catch_error
    metadata = get_hf_file_metadata(
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/file_download.py", line 1460, in get_hf_file_metadata
    r = _request_wrapper(
        ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/file_download.py", line 283, in _request_wrapper
    response = _request_wrapper(
               ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/file_download.py", line 307, in _request_wrapper
    hf_raise_for_status(response)
  File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_http.py", line 419, in hf_raise_for_status
    raise _format(GatedRepoError, message, response) from e
huggingface_hub.errors.GatedRepoError: 401 Client Error. (Request ID: Root=1-695139c1-20c276683bbfa05022511a7b;379c2def-ca77-4fc6-9489-6fc1f7a44e40)

Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/resolve/main/config.json.
Access to model meta-llama/Llama-3.2-1B-Instruct is restricted. You must have access to it and be authenticated to access it. Please log in.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.12/dist-packages/sglang/launch_server.py", line 29, in <module>
    server_args = prepare_server_args(sys.argv[1:])
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/server_args.py", line 4969, in prepare_server_args
    return ServerArgs.from_cli_args(raw_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/server_args.py", line 4460, in from_cli_args
    return cls(**{attr: getattr(args, attr) for attr in attrs})
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 314, in __init__
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/server_args.py", line 671, in __post_init__
    self._handle_gpu_memory_settings(gpu_mem)
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/server_args.py", line 954, in _handle_gpu_memory_settings
    model_config = self.get_model_config()
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/server_args.py", line 4474, in get_model_config
    self.model_config = ModelConfig.from_server_args(self)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/configs/model_config.py", line 241, in from_server_args
    return ModelConfig(
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/configs/model_config.py", line 126, in __init__
    self.hf_config = get_config(
                     ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 3169, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/hf_transformers_utils.py", line 273, in get_config
    config = AutoConfig.from_pretrained(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/configuration_auto.py", line 1332, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/configuration_utils.py", line 662, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/configuration_utils.py", line 721, in _get_config_dict
    resolved_config_file = cached_file(
                           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/utils/hub.py", line 322, in cached_file
    file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/utils/hub.py", line 543, in cached_files
    raise OSError(
OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct.
401 Client Error. (Request ID: Root=1-695139c1-20c276683bbfa05022511a7b;379c2def-ca77-4fc6-9489-6fc1f7a44e40)

Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/resolve/main/config.json.
Access to model meta-llama/Llama-3.2-1B-Instruct is restricted. You must have access to it and be authenticated to access it. Please log in.
EThe image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
command=python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --skip-tokenizer-init --device cuda --host 127.0.0.1 --port 21000
[2025-12-28 14:08:17] WARNING server_args.py:1543: Attention backend not specified. Use flashinfer backend by default.
[2025-12-28 14:08:17] server_args=ServerArgs(model_path='Qwen/Qwen2.5-VL-3B-Instruct', tokenizer_path='Qwen/Qwen2.5-VL-3B-Instruct', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=True, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=21000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.7486296874999999, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=1, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, stream_output=False, random_seed=552765458, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, served_model_name='Qwen/Qwen2.5-VL-3B-Instruct', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='flashinfer', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', 
disable_flashinfer_autotune=False, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=256, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=8192, 
piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096, 4352, 4608, 4864, 5120, 5376, 5632, 5888, 6144, 6400, 6656, 6912, 7168, 7424, 7680, 7936, 8192], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, enable_fused_qk_norm_rope=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, disaggregation_decode_enable_fake_auto=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2025-12-28 14:08:18] Ignore import error when loading sglang.srt.multimodal.processors.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2025-12-28 14:08:23] No chat template found, defaulting to 'string' content format
[2025-12-28 14:08:24] Init torch distributed begin.
[rank0]:[W1228 14:08:25.122633731 ProcessGroupGloo.cpp:516] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-28 14:08:25] Init torch distributed ends. mem usage=0.00 GB
[2025-12-28 14:08:25] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2025-12-28 14:08:25] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2025-12-28 14:08:25] Load weight begin. avail mem=78.81 GB
[2025-12-28 14:08:25] Multimodal attention backend not set. Use fa3.
[2025-12-28 14:08:25] Using fa3 as multimodal attention backend.
[2025-12-28 14:08:26] Found local HF snapshot for Qwen/Qwen2.5-VL-3B-Instruct at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-VL-3B-Instruct/snapshots/66285546d2b821cf421d4f5eb2576359d3770cd3; skipping download.
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.06it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.16it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.15it/s]

[2025-12-28 14:08:28] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=71.47 GB, mem usage=7.34 GB.
[2025-12-28 14:08:28] Using KV cache dtype: torch.bfloat16
[2025-12-28 14:08:28] The available memory for KV cache is 51.66 GB.
[2025-12-28 14:08:28] KV Cache is allocated. #tokens: 1504726, K size: 25.83 GB, V size: 25.83 GB
[2025-12-28 14:08:28] Memory pool end. avail mem=17.70 GB
[2025-12-28 14:08:28] Capture cuda graph begin. This can take up to several minutes. avail mem=17.13 GB
[2025-12-28 14:08:28] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
Capturing batches (bs=1 avail_mem=16.19 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36/36 [00:02<00:00, 12.38it/s]
[2025-12-28 14:08:31] Capture cuda graph end. Time elapsed: 3.43 s. mem usage=0.94 GB. avail mem=16.18 GB.
[2025-12-28 14:08:31] max_total_num_tokens=1504726, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4096, context_len=128000, available_gpu_mem=16.18 GB
[2025-12-28 14:08:31] INFO:     Started server process [211291]
[2025-12-28 14:08:31] INFO:     Waiting for application startup.
[2025-12-28 14:08:31] Using default chat sampling params from model generation config: {'repetition_penalty': 1.05, 'temperature': 1e-06, 'top_k': 50, 'top_p': 1.0}
[2025-12-28 14:08:31] Using default chat sampling params from model generation config: {'repetition_penalty': 1.05, 'temperature': 1e-06, 'top_k': 50, 'top_p': 1.0}
[2025-12-28 14:08:31] INFO:     Application startup complete.
[2025-12-28 14:08:31] INFO:     Uvicorn running on http://127.0.0.1:21000 (Press CTRL+C to quit)
[2025-12-28 14:08:32] INFO:     127.0.0.1:40676 - "GET /model_info HTTP/1.1" 200 OK
[2025-12-28 14:08:32] Prefill batch, #new-seq: 1, #new-token: 3, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-28 14:08:33] INFO:     127.0.0.1:40682 - "POST /generate HTTP/1.1" 200 OK
[2025-12-28 14:08:33] The server is fired up and ready to roll!
[2025-12-28 14:08:40] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-28 14:08:41] INFO:     127.0.0.1:45896 - "GET /health_generate HTTP/1.1" 200 OK
[CI Test Method] TestSkipTokenizerInitVLM.test_eos_behavior
[2025-12-28 14:08:41] Prefill batch, #new-seq: 1, #new-token: 288, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-28 14:08:41] INFO:     127.0.0.1:45900 - "POST /generate HTTP/1.1" 200 OK
{
  "output_ids": [
    151645
  ],
  "meta_info": {
    "id": "22027cb3a1654b77bcfe625dfa10122a",
    "finish_reason": {
      "type": "stop",
      "matched": 151645
    },
    "prompt_tokens": 288,
    "weight_version": "default",
    "total_retractions": 0,
    "completion_tokens": 1,
    "cached_tokens": 0,
    "e2e_latency": 0.45726776123046875,
    "response_sent_to_client_ts": 1766930921.9123194
  }
}
====================================================================================================
.[CI Test Method] TestSkipTokenizerInitVLM.test_logprob
[2025-12-28 14:08:42] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 287, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-28 14:08:42] INFO:     127.0.0.1:45904 - "POST /generate HTTP/1.1" 200 OK
{
  "output_ids": [
    151645
  ],
  "meta_info": {
    "id": "97bf956c9e3b470b8baccb03cf573e50",
    "finish_reason": {
      "type": "stop",
      "matched": 151645
    },
    "prompt_tokens": 288,
    "weight_version": "default",
    "total_retractions": 0,
    "input_token_logprobs": [
      [
        null,
        30,
        null
      ]
    ],
    "output_token_logprobs": [
      [
        -0.0001134808044298552,
        151645,
        null
      ]
    ],
    "completion_tokens": 1,
    "cached_tokens": 287,
    "e2e_latency": 0.10360574722290039,
    "response_sent_to_client_ts": 1766930922.1584053
  }
}
====================================================================================================
[2025-12-28 14:08:42] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 287, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-28 14:08:42] INFO:     127.0.0.1:45918 - "POST /generate HTTP/1.1" 200 OK
{
  "output_ids": [
    151645
  ],
  "meta_info": {
    "id": "f8c222fb9f184c0b9b2fb66aba5b3c35",
    "finish_reason": {
      "type": "stop",
      "matched": 151645
    },
    "prompt_tokens": 288,
    "weight_version": "default",
    "total_retractions": 0,
    "input_token_logprobs": [
      [
        null,
        30,
        null
      ]
    ],
    "output_token_logprobs": [
      [
        -0.0001134808044298552,
        151645,
        null
      ]
    ],
    "input_top_logprobs": [
      null
    ],
    "output_top_logprobs": [
      [
        [
          -0.0001134808044298552,
          151645,
          null
        ],
        [
          -9.500113487243652,
          151644,
          null
        ],
        [
          -12.375113487243652,
          151657,
          null
        ]
      ]
    ],
    "completion_tokens": 1,
    "cached_tokens": 287,
    "e2e_latency": 0.13478899002075195,
    "response_sent_to_client_ts": 1766930922.416631
  }
}
====================================================================================================
.[CI Test Method] TestSkipTokenizerInitVLM.test_parallel_sample
[2025-12-28 14:08:42] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 287, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-28 14:08:42] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 287, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-28 14:08:42] Prefill batch, #new-seq: 2, #new-token: 2, #cached-token: 574, token usage: 0.00, #running-req: 1, #queue-req: 0, 
[2025-12-28 14:08:42] INFO:     127.0.0.1:45924 - "POST /generate HTTP/1.1" 200 OK
[
  {
    "output_ids": [
      151645
    ],
    "meta_info": {
      "id": "9817a2a0ac774fd8862ea8c46b367fba",
      "finish_reason": {
        "type": "stop",
        "matched": 151645
      },
      "prompt_tokens": 288,
      "weight_version": "default",
      "total_retractions": 0,
      "completion_tokens": 1,
      "cached_tokens": 287,
      "e2e_latency": 0.19336676597595215,
      "response_sent_to_client_ts": 1766930922.7287662
    }
  },
  {
    "output_ids": [
      151645
    ],
    "meta_info": {
      "id": "b3f1663248ab4af7b57540a0bcfa9ac5",
      "finish_reason": {
        "type": "stop",
        "matched": 151645
      },
      "prompt_tokens": 288,
      "weight_version": "default",
      "total_retractions": 0,
      "completion_tokens": 1,
      "cached_tokens": 287,
      "e2e_latency": 0.25332069396972656,
      "response_sent_to_client_ts": 1766930922.78847
    }
  },
  {
    "output_ids": [
      151645
    ],
    "meta_info": {
      "id": "741e5add6c2e401bbde11dcb759ccab1",
      "finish_reason": {
        "type": "stop",
        "matched": 151645
      },
      "prompt_tokens": 288,
      "weight_version": "default",
      "total_retractions": 0,
      "completion_tokens": 1,
      "cached_tokens": 287,
      "e2e_latency": 0.2533283233642578,
      "response_sent_to_client_ts": 1766930922.788476
    }
  }
]
====================================================================================================
.[CI Test Method] TestSkipTokenizerInitVLM.test_simple_decode
[2025-12-28 14:08:42] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 287, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-28 14:08:43] INFO:     127.0.0.1:45932 - "POST /generate HTTP/1.1" 200 OK
{
  "output_ids": [
    151645
  ],
  "meta_info": {
    "id": "65e649177a0142719c237146238fe1ec",
    "finish_reason": {
      "type": "stop",
      "matched": 151645
    },
    "prompt_tokens": 288,
    "weight_version": "default",
    "total_retractions": 0,
    "completion_tokens": 1,
    "cached_tokens": 287,
    "e2e_latency": 0.09384608268737793,
    "response_sent_to_client_ts": 1766930923.0011523
  }
}
====================================================================================================
.[CI Test Method] TestSkipTokenizerInitVLM.test_simple_decode_stream
.
======================================================================
ERROR: setUpClass (__main__.TestSkipTokenizerInit)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/sgl-workspace/sglang_dev3/./test/srt/test_skip_tokenizer_init.py", line 31, in setUpClass
    cls.process = popen_launch_server(
                  ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 667, in popen_launch_server
    raise Exception(
Exception: Server process exited with code 1. Check server logs for errors.

----------------------------------------------------------------------
Ran 5 tests in 48.155s
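
Note: the TestSkipTokenizerInit setUpClass error above is a gated-repo 401 for meta-llama/Llama-3.2-1B-Instruct, i.e. a Hugging Face authentication problem rather than something this PR touches. Authenticating before the run should clear it locally (standard huggingface_hub usage):

huggingface-cli login          # interactive login
# or, for non-interactive runs:
export HF_TOKEN=<your_token>
python ./test/srt/test_skip_tokenizer_init.py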

@yuan-luo
Collaborator Author

I can still reproduce this error on main without this PR, so I believe we can re-land it.

root@6996fb46042d:/sgl-workspace/sglang_dev3/test/srt# python ./test_skip_tokenizer_init.py
......
====================================================================================================
.[CI Test Method] TestSkipTokenizerInitVLM.test_simple_decode
[2025-12-28 14:44:04] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 287, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-28 14:44:04] INFO:     127.0.0.1:35586 - "POST /generate HTTP/1.1" 200 OK
{
  "output_ids": [
    151645
  ],
  "meta_info": {
    "id": "6892bd5ed632403dbb75ed42c280a6fb",
    "finish_reason": {
      "type": "stop",
      "matched": 151645
    },
    "prompt_tokens": 288,
    "weight_version": "default",
    "total_retractions": 0,
    "completion_tokens": 1,
    "cached_tokens": 287,
    "e2e_latency": 0.17981410026550293,
    "response_sent_to_client_ts": 1766933044.440148
  }
}
====================================================================================================
.[CI Test Method] TestSkipTokenizerInitVLM.test_simple_decode_stream
.
======================================================================
ERROR: setUpClass (__main__.TestSkipTokenizerInit)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/sgl-workspace/sglang_dev3/test/srt/./test_skip_tokenizer_init.py", line 31, in setUpClass
    cls.process = popen_launch_server(
                  ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 656, in popen_launch_server
    raise Exception(
Exception: Server process exited with code 1. Check server logs for errors.

----------------------------------------------------------------------
Ran 5 tests in 49.282s

FAILED (errors=1)

YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026

Labels

Multi-modal, multi-modal language model, run-ci, vlm
