
[VLM] Refactor load_mm_data to improve performance#14644

Merged
mickqian merged 1 commit into sgl-project:main from antgroup:refactor_load_mm_data on Dec 26, 2025

Conversation

@yuan-luo (Collaborator) commented Dec 8, 2025

Motivation

Inspired by @mickqian .

In the load_mm_data method, we currently take a redundant approach: we manually detect the various multimodal tokens and then load the corresponding data. This could instead be simplified to loading all the passed data directly; we originally wrote it this way because the MiniCPM-o model has a mechanism for adjusting video frames. We decided to simplify the default path to improve performance for the majority of VLM models.

For example, both load_mm_data() and submit_data_loading_tasks() contain a redundant "for text_part in text_parts" loop.

In the new implementation, load_mm_data works as "1 token → 1 data": it does not expand frames or rewrite the prompt; it simply loads all the incoming data and aligns each item with its token in order.

We keep the legacy load_mm_data dedicated to the MiniCPM model, preserving MiniCPM's "1 token → multiple frames" behavior.
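
For illustration, a minimal sketch of the "1 token → 1 data" path (the function name, argument shapes, and executor plumbing here are simplified assumptions loosely based on the debug logs below, not the exact implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def load_mm_data_simple(images, videos, audios, load_fn):
    """Sketch: one I/O task per data item; no prompt scanning, no frame expansion."""
    with ThreadPoolExecutor() as executor:
        futures = []
        for modality, items in (("IMAGE", images), ("VIDEO", videos), ("AUDIO", audios)):
            for item in items or []:
                # Mirrors the debug log "submit load task: modality=..., index=..."
                futures.append((modality, executor.submit(load_fn, modality, item)))
        # Results stay aligned with the placeholder tokens purely by submission order.
        return [(modality, fut.result()) for modality, fut in futures]
```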

Log output shows the result is as expected.

Server:

SGLANG_VIT_CUDA_GRAPH=1 \
SGLANG_USE_CUDA_IPC_TRANSPORT=1 \
python -m sglang.launch_server \
  --model-path /home/admin/Qwen2.5-VL-7B-Instruct \
  --enable-piecewise-cuda-graph \
  --piecewise-cuda-graph-max-tokens 8192 \
  --mm-attention-backend fa3 \
  --port 30000 \
  --chunked-prefill-size 8192 \
  --disable-radix-cache \
  --disable-overlap-schedule \
  --piecewise-cuda-graph-compiler eager \
  --attention-backend fa3 \
  --tp 2 \
  --log-level debug \
  --log-level-http debug \
  --log-requests

Client (the text prompt asks, in Chinese, "What does the sign in the video say?"):

$cat bench_remote_video.sh 
for i in {1..1}; do
    time curl 'http://127.0.0.1:30000/v1/chat/completions' --header 'Content-Type: application/json' --data '{
        "model": "auto",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "video_url", "video_url": {"url": "http://dmsint.cn-hangzhou.alipay.aliyun-inc.com/....../video_test.mp4"}},
                    {"type": "text", "text": "视频里的招牌写的什么"}
                ]
            }
        ],
        "temperature": 0.0,
        "max_tokens": 1000,
        "stream": false,
        "chat_template_kwargs": {"enable_thinking": false}
    }'
done
[root  /root/luoyuan.luo/workspace/bench_script] Mon Dec 08 20:37:24 
$bash bench_remote_video.sh 
{"id":"cdb06d6d5c1b428b89473566f40ad2f3","object":"chat.completion","created":1765197964,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":"视频里的招牌上写着“小鞋匠洗鞋”,并附有电话号码“1529521190”。","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":7682,"total_tokens":7712,"completion_tokens":30,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default","e2e_latency":1310.2936744689941,"ttft_latency":1310.302734375,"queue_latency":1.2103579938411713}}
real    0m1.318s
user    0m0.001s
sys     0m0.003s
2025-12-08 20:46:02.862 INFO 104744 [ tokenizer_manager.py:480] Receive: obj="GenerateReqInput(rid='cdb06d6d5c1b428b89473566f40ad2f3', http_worker_ipc=None, metrics={'api_server_arrive_time': 1765197962.8611393}, text='<|im_start|>system\\nYou are a helpful assistant.<|im_end|>\\n<|im_start|>user\\n<|vision_start|><|video_pad|><|vision_end|>视频里的招牌写的什么<|im_end|>\\n<|im_start|>assistant\\n', video_data=['http://dmsint.cn-hangzhou.alipay.aliyun-inc.com/aistudio/temp/20250910/208156dafb7b44a8/video_test.mp4'], sampling_params={'temperature': 0.0, 'max_new_tokens': 1000, 'min_new_tokens': 0, 'stop': None, 'stop_token_ids': None, 'stop_regex': None, 'top_p': 1.0, 'top_k': 50, 'min_p': 0.0, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.05, 'regex': None, 'ebnf': None, 'n': 1, 'no_stop_trim': False, 'ignore_eos': False, 'skip_special_tokens': True, 'logit_bias': None, 'custom_params': None}, return_logprob=False, logprob_start_len=-1, top_logprobs_num=0, token_ids_logprob=None, return_text_in_logprobs=True, stream=False, log_metrics=True, return_hidden_states=False, modalities=[], session_params=None, lora_path=None, lora_id=None, custom_logit_processor=None, bootstrap_host=None, bootstrap_port=None, bootstrap_room=None, bootstrap_pair_key=None, validation_time=8.022785186767578e-05, data_parallel_rank=None, background=False, conversation_id=None, priority=None, extra_key=None, no_logs=False, custom_labels=None, return_bytes=False, return_entropy=False, mm_sampling_kwargs=None, external_trace_headers=None)"
2025-12-08 20:46:02.863 DEBUG 104744 [ tokenizer_manager.py:710] Using regular tokenizer for 1 inputs
2025-12-08 20:46:02.864 DEBUG 104744 [ base_processor.py:791] [_submit_mm_data_loading_tasks_simple] no data for modality=IMAGE
2025-12-08 20:46:02.864 DEBUG 104744 [ base_processor.py:798] [_submit_mm_data_loading_tasks_simple] submit load task: modality=VIDEO, index=0, data_type=<class 'str'>
2025-12-08 20:46:02.864 DEBUG 104744 [ base_processor.py:791] [_submit_mm_data_loading_tasks_simple] no data for modality=AUDIO
2025-12-08 20:46:02.864 DEBUG 104744 [ base_processor.py:411] [_load_single_item] start loading data, modality=VIDEO, frame_count_limit=None, audio_sample_rate=None, raw_type=<class 'str'>
2025-12-08 20:46:02.864 DEBUG 104744 [ base_processor.py:938] [load_mm_data(simple)] total futures submitted: 1
2025-12-08 20:46:03.162 DEBUG 104744 [ base_processor.py:435] [_load_single_item][VIDEO] loaded video: len=389, shape[0]=(720, 1280, 3)
2025-12-08 20:46:03.162 DEBUG 104744 [ base_processor.py:966] [load_mm_data(simple)] loaded counts: images=0, videos=1, audios=0
2025-12-08 20:46:03.508 INFO 104744 [ qwen_vl.py:304] [preprocess_video Perf], get_batch_time: 260.65 ms, smart_resize_time: 0.05 ms, torchvision_resize_time: 84.61 ms, total_time: 345.30 ms
2025-12-08 20:46:03.530 INFO 104744 [ qwen_vl.py:497] [QwenVLProcessor Perf] rid='cdb06d6d5c1b428b89473566f40ad2f3', load_time: 298.95 ms, preprocess_time: 352.29 ms, process_time: 14.30 ms, get_rope_index_time: 0.69 ms, total_time: 666.24 ms
2025-12-08 20:46:03.540 DEBUG 104744 [ cuda_ipc_transport_utils.py:75] [try_to_recycle] area=(0, 144055296), flag=1.0, tp_size=2
2025-12-08 20:46:03.541 INFO 104895 [ TP0 scheduler_metrics_mixin.py:154] Prefill batch, #new-seq: 1, #new-token: 7682, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
2025-12-08 20:46:03.590 DEBUG 104744 [ cuda_ipc_transport_utils.py:75] [try_to_recycle] area=(0, 144055296), flag=2.0, tp_size=2
2025-12-08 20:46:04.111 INFO 104895 [ TP0 scheduler_metrics_mixin.py:309] Decode batch, #running-req: 1, #token: 7699, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.08, #queue-req: 0, 
2025-12-08 20:46:04.170 INFO 104895 [ TP0 schedule_batch.py:1043] Req Time Stats(rid=cdb06d6d5c1b428b89473566f40ad2f3, input len=7682, output len=30, type=unified): queue_duration=1.21ms, forward_duration=627.68ms, start_time=9661822.019
2025-12-08 20:46:04.172 INFO 104744 [ tokenizer_manager.py:1150] "Finish: obj=GenerateReqInput(rid='cdb06d6d5c1b428b89473566f40ad2f3', http_worker_ipc=None, metrics={'api_server_arrive_time': 1765197962.8611393, 'mm_entry_time_ts': 1765197962.8640013, 'mm_entry_time': 9661821.34138114, 'mm_load_time': 9661821.640335323, 'mm_preprocess_time': 9661821.992624268, 'mm_process_time': 9661822.006922927, 'mm_get_rope_index_time': 9661822.007616928}, text='<|im_start|>system\\nYou are a helpful assistant.<|im_end|>\\n<|im_start|>user\\n<|vision_start|><|video_pad|><|vision_end|>视频里的招牌写的什么<|im_end|>\\n<|im_start|>assistant\\n', video_data=['http://dmsint.cn-hangzhou.alipay.aliyun-inc.com/aistudio/temp/20250910/208156dafb7b44a8/video_test.mp4'], sampling_params={'temperature': 0.0, 'max_new_tokens': 1000, 'min_new_tokens': 0, 'stop': None, 'stop_token_ids': None, 'stop_regex': None, 'top_p': 1.0, 'top_k': 50, 'min_p': 0.0, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.05, 'regex': None, 'ebnf': None, 'n': 1, 'no_stop_trim': False, 'ignore_eos': False, 'skip_special_tokens': True, 'logit_bias': None, 'custom_params': None}, return_logprob=False, logprob_start_len=-1, top_logprobs_num=0, token_ids_logprob=None, return_text_in_logprobs=True, stream=False, log_metrics=True, return_hidden_states=False, modalities=[], session_params=None, lora_path=None, lora_id=None, custom_logit_processor=None, bootstrap_host=None, bootstrap_port=None, bootstrap_room=None, bootstrap_pair_key=None, validation_time=8.022785186767578e-05, data_parallel_rank=None, background=False, conversation_id=None, priority=None, extra_key=None, no_logs=False, custom_labels=None, return_bytes=False, return_entropy=False, mm_sampling_kwargs=None, external_trace_headers=None), out={'text': '视频里的招牌上写着“小鞋匠洗鞋”,并附有电话号码“1529521190”。', 'meta_info': {'id': 'cdb06d6d5c1b428b89473566f40ad2f3', 'finish_reason': {'type': 'stop', 'matched': 151645}, 'prompt_tokens': 7682, 'weight_version': 'default', 'total_retractions': 0, 'queue_time': 0.0012103579938411713, 'prefill_launch_delay': 0.001300731673836708, 'prefill_launch_latency': 0.17800299264490604, 'completion_tokens': 30, 'cached_tokens': 0, 'e2e_latency': 1.3102936744689941, 'request_received_ts': 1765197962.8611393, 'request_sent_to_scheduler_ts': 1765197963.531742, 'decode_finished_ts': 1765197964.171433, 'inference_time': 0.6288455203175545, 'ttft_latency': 1.310302734375, 'response_sent_to_client_ts': 1765197964.172052}}"

Modifications

Accuracy Tests

The lmms_eval result shows no accuracy drop.

$python3 -m lmms_eval --model openai_compatible --model_args model_version=Qwen/Qwen2.5-VL-7B-Instruct   --tasks mmmu_val   --batch_size 16
/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
2025-12-10 15:39:50 | INFO     | __main__:cli_evaluate:311 - Verbosity set to INFO
2025-12-10 15:39:52 | INFO     | __main__:cli_evaluate_single:400 - Evaluation tracker args: {'token': 'hf_********'}
2025-12-10 15:39:52 | INFO     | __main__:cli_evaluate_single:480 - Selected Tasks: ['mmmu_val']
2025-12-10 15:39:52 | INFO     | lmms_eval.evaluator:simple_evaluate:161 - Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2025-12-10 15:39:56 | INFO     | lmms_eval.evaluator:evaluate:402 - Running on rank 0 (local rank 0)
2025-12-10 15:39:56 | INFO     | lmms_eval.api.task:build_all_requests:427 - Building contexts for mmmu_val on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 900/900 [00:00<00:00, 14037.16it/s]
2025-12-10 15:39:56 | INFO     | lmms_eval.evaluator:evaluate:495 - Running generate_until requests
Model Responding: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 57/57 [02:56<00:00,  2.31s/it]2025-12-10 15:42:52 | INFO     | lmms_eval.models.model_utils.gen_metrics:log_metrics:48 - Metric summary - Total time: 1251.369s, Total tokens: 2040, Avg speed: 1.6 tokens/s
Model Responding: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 57/57 [02:56<00:00,  3.09s/it]
Postprocessing: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 900/900 [00:00<00:00, 11417.73it/s]
{'Overall-Art and Design': {'num': 120, 'acc': 0.75}, 'Art': {'num': 30, 'acc': 0.8}, 'Art_Theory': {'num': 30, 'acc': 0.93333}, 'Design': {'num': 30, 'acc': 0.83333}, 'Music': {'num': 30, 'acc': 0.43333}, 'Overall-Business': {'num': 150, 'acc': 0.62667}, 'Accounting': {'num': 30, 'acc': 0.63333}, 'Economics': {'num': 30, 'acc': 0.7}, 'Finance': {'num': 30, 'acc': 0.46667}, 'Manage': {'num': 30, 'acc': 0.63333}, 'Marketing': {'num': 30, 'acc': 0.7}, 'Overall-Science': {'num': 150, 'acc': 0.57333}, 'Biology': {'num': 30, 'acc': 0.5}, 'Chemistry': {'num': 30, 'acc': 0.46667}, 'Geography': {'num': 30, 'acc': 0.73333}, 'Math': {'num': 30, 'acc': 0.5}, 'Physics': {'num': 30, 'acc': 0.66667}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.68}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.7}, 'Clinical_Medicine': {'num': 30, 'acc': 0.73333}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.46667}, 'Pharmacy': {'num': 30, 'acc': 0.76667}, 'Public_Health': {'num': 30, 'acc': 0.73333}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.8}, 'History': {'num': 30, 'acc': 0.76667}, 'Literature': {'num': 30, 'acc': 0.9}, 'Sociology': {'num': 30, 'acc': 0.76667}, 'Psychology': {'num': 30, 'acc': 0.76667}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.4619}, 'Agriculture': {'num': 30, 'acc': 0.53333}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.43333}, 'Computer_Science': {'num': 30, 'acc': 0.6}, 'Electronics': {'num': 30, 'acc': 0.4}, 'Energy_and_Power': {'num': 30, 'acc': 0.46667}, 'Materials': {'num': 30, 'acc': 0.4}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.4}, 'Overall': {'num': 900, 'acc': 0.62778}}
fatal: not a git repository (or any of the parent directories): .git
2025-12-10 15:42:52 | INFO     | lmms_eval.loggers.evaluation_tracker:save_results_aggregated:239 - Output path not provided, skipping saving results aggregated
openai_compatible (model_version=Qwen/Qwen2.5-VL-7B-Instruct), gen_kwargs: (), limit: None, num_fewshot: None, batch_size: 16
| Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|--------|------:|------|-----:|--------|---|-----:|---|------|
|mmmu_val|      0|none  |     0|mmmu_acc|↑  |0.6278|±  |   N/A|

Benchmarking and Profiling

As this function accounts for only a small portion of end-to-end time, there is no significant e2e improvement, but the function's own cost is reduced.
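
A rough Amdahl's-law sanity check of this claim, using the load_time (~299 ms) and e2e latency (~1310 ms) from the single-request log above (illustrative arithmetic only):

```python
# If loading is ~299 ms of a ~1310 ms request, even a large speedup in
# load_mm_data alone moves end-to-end latency only modestly.
e2e_ms, load_ms = 1310.0, 299.0
for load_speedup in (1.5, 2.0, 10.0):
    new_e2e = (e2e_ms - load_ms) + load_ms / load_speedup
    print(f"{load_speedup:>4}x faster load -> e2e {new_e2e:.0f} ms "
          f"({e2e_ms / new_e2e:.2f}x overall)")
```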

SGLANG_MM_FEATURE_CACHE_MB=4096 \
SGLANG_USE_CUDA_IPC_TRANSPORT=1 \
SGLANG_VLM_CACHE_SIZE_MB=0 \
python -m sglang.launch_server --model-path /home/admin/Qwen3-VL-30B-A3B-Instruct \
--host 0.0.0.0 --port 30000 --trust-remote-code --tp-size 2 --enable-cache-report \
--log-level info --max-running-requests 48 --mem-fraction-static 0.7 --chunked-prefill-size 8192  \
--attention-backend flashinfer --mm-attention-backend fa3 
                                                                                               
Benchmark:
python3 -m sglang.bench_serving \
  --backend sglang-oai-chat \
  --dataset-name image \
  --num-prompts 256 \
  --apply-chat-template \
  --random-input-len 128 \
  --random-output-len 1 \
  --image-resolution 560x560 \
  --image-format jpeg \
  --image-count 1 \
  --image-content random \
  --random-range-ratio 0.1 \
  --port 30000 \
  --max-concurrency 32
                                                                                               
Baseline:
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 32        
Successful requests:                     256       
Benchmark duration (s):                  10.06     
Total input tokens:                      104023    
Total input text tokens:                 20567     
Total input vision tokens:               83456     
Total generated tokens:                  115       
Total generated tokens (retokenized):    115       
Request throughput (req/s):              25.46     
Input token throughput (tok/s):          10344.45  
Output token throughput (tok/s):         11.44     
Peak output token throughput (tok/s):    39.00     
Peak concurrent requests:                72        
Total token throughput (tok/s):          10355.89  
Concurrency:                             30.69     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1205.71   
Median E2E Latency (ms):                 1204.62   
---------------Time to First Token----------------
Mean TTFT (ms):                          531.57    
Median TTFT (ms):                        0.00      
P99 TTFT (ms):                           1818.75   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================                                                                                        

                                                                                         
PR:
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 32        
Successful requests:                     256       
Benchmark duration (s):                  10.10     
Total input tokens:                      104043    
Total input text tokens:                 20587     
Total input vision tokens:               83456     
Total generated tokens:                  115       
Total generated tokens (retokenized):    115       
Request throughput (req/s):              25.34     
Input token throughput (tok/s):          10298.52  
Output token throughput (tok/s):         11.38     
Peak output token throughput (tok/s):    37.00     
Peak concurrent requests:                65        
Total token throughput (tok/s):          10309.90  
Concurrency:                             30.62     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1208.55   
Median E2E Latency (ms):                 1238.95   
---------------Time to First Token----------------
Mean TTFT (ms):                          535.38    
Median TTFT (ms):                        0.00      
P99 TTFT (ms):                           1729.77   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================    

Checklist

@gemini-code-assist (Contributor)

Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the multimodal data loading mechanism to enhance performance and simplify the default processing path. By introducing a streamlined load_mm_data that assumes a direct token-to-data mapping, it optimizes common use cases. The original, more complex logic is preserved in a legacy function, with plans for a specialized token-aware loader for specific model requirements in the future.

Highlights

  • Refactored load_mm_data for performance: The primary load_mm_data function has been refactored to a simpler, more direct approach, assuming a 1:1 alignment between tokens and multimodal data. This change aims to improve performance by removing complex token detection and frame expansion logic from the default path.
  • Introduction of _submit_mm_data_loading_tasks_simple: A new private helper method, _submit_mm_data_loading_tasks_simple, has been added to streamline the submission of I/O loading tasks for individual multimodal data items (images, videos, audio) without complex prompt scanning.
  • Renamed original load_mm_data to legacy_load_mm_data: The previous, more complex implementation of load_mm_data has been renamed to legacy_load_mm_data and marked for future obsolescence. This preserves its functionality for models like MiniCPM that require specific token-aware processing, while the new load_mm_data serves as the default (see the sketch after this list).
  • Future load_mm_data_token_aware for MiniCPM: The PR description indicates a plan to introduce a dedicated load_mm_data_token_aware function in the future. This will specifically handle models like MiniCPM that require '1 token -> multiple frames' behavior, keeping the current logic separate from the simplified default.
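
A minimal sketch of how this dispatch could look (the needs_token_aware_loading flag and the load wrapper are illustrative assumptions; load_mm_data and legacy_load_mm_data are the names this PR introduces):

```python
class BaseMultimodalProcessorSketch:
    # Illustrative opt-in flag: models such as MiniCPM, where one placeholder
    # token can expand to multiple frames, keep the token-aware legacy path.
    needs_token_aware_loading = False

    def load_mm_data(self, prompt, data):
        return [("loaded", item) for item in data]  # stub for the new fast path

    def legacy_load_mm_data(self, prompt, data):
        return [("loaded-token-aware", item) for item in data]  # stub for the legacy path

    def load(self, prompt, data):
        # Route MiniCPM-style models to the legacy, token-aware loader;
        # everything else takes the simplified 1-token -> 1-data path.
        if self.needs_token_aware_loading:
            return self.legacy_load_mm_data(prompt, data)
        return self.load_mm_data(prompt, data)
```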

@gemini-code-assist Bot left a comment


Code Review

This pull request refactors the load_mm_data method to improve performance by introducing a simpler, faster implementation that avoids prompt scanning. The old logic is preserved in legacy_load_mm_data. A new helper _submit_mm_data_loading_tasks_simple is added to support this.

My review focuses on the new load_mm_data implementation. I've suggested improvements to its signature to remove an unused parameter and correct a type hint. I've also provided a more formal docstring and a refactoring to reduce code duplication. These changes will improve the code's clarity and maintainability.

Two outdated review comment threads on python/sglang/srt/multimodal/processors/base_processor.py
@yuan-luo (Collaborator, Author) commented Dec 8, 2025

/tag-and-rerun-ci

@github-actions Bot added the run-ci label Dec 8, 2025
@yuan-luo force-pushed the refactor_load_mm_data branch 2 times, most recently from 91351c0 to 9c6cc52 on December 10, 2025 02:53
@yuan-luo (Collaborator, Author) commented Dec 10, 2025

The TestMiniCPMo26Server test case failed, unsurprisingly. Fixing.

https://github.com/sgl-project/sglang/actions/runs/20085678223/job/57622671663?pr=14644

ERROR: setUpClass (__main__.TestMiniCPMo26Server)
Video images response:
The video clip is a close-up shot of a man, widely recognized as Steve Jobs, presenting a product on stage. The camera focuses on his face from the nose down and his right hand, which is holding a device.
The man is wearing a black, collared shirt and thin-framed glasses. He is holding a white, rectangular electronic device, which is an early model of the Apple iPod. The device features a small, square screen at the top and a large, circular click wheel below it. He holds the iPod vertically, presenting it to the audience.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/public_sglang_ci/runner-l1a-gpu-1/_work/sglang/sglang/test/srt/test_vision_openai_server_common.py", line 41, in setUpClass
    cls.process = popen_launch_server(
  File "/public_sglang_ci/runner-l1a-gpu-1/_work/sglang/sglang/python/sglang/test/test_utils.py", line 653, in popen_launch_server
    raise Exception(
Exception: Server process exited with code 1. Check server logs for errors.

@yuan-luo changed the title [VLM] Refactor load_mm_data to improve performance [WIP][VLM] Refactor load_mm_data to improve performance Dec 10, 2025
@yuan-luo force-pushed the refactor_load_mm_data branch from 9c6cc52 to 1b945af on December 10, 2025 03:49
@yuan-luo changed the title [WIP][VLM] Refactor load_mm_data to improve performance [VLM] Refactor load_mm_data to improve performance Dec 10, 2025
@yuan-luo force-pushed the refactor_load_mm_data branch from 1b945af to be874c2 on December 11, 2025 02:21
@yuan-luo added the Multi-modal (multi-modal language model) and vlm labels Dec 16, 2025
@yuan-luo force-pushed the refactor_load_mm_data branch from be874c2 to ce186d0 on December 17, 2025 02:15
@yuan-luo (Collaborator, Author)

> The TestMiniCPMo26Server test case failed, unsurprisingly. Fixing.
> https://github.com/sgl-project/sglang/actions/runs/20085678223/job/57622671663?pr=14644

Fixed by adding a fallback branch.

@yuan-luo (Collaborator, Author)

/rerun-failed-ci

@JustinTong0323 self-assigned this Dec 20, 2025
@yuan-luo force-pushed the refactor_load_mm_data branch from 5192073 to 1c4b8e8 on December 24, 2025 04:19
@JustinTong0323 (Collaborator)

/rerun-failed-ci

@JustinTong0323 (Collaborator)

/tag-and-rerun-ci

1 similar comment
@JustinTong0323 (Collaborator)

/tag-and-rerun-ci

@yuan-luo (Collaborator, Author)

/rerun-failed-ci

@yuan-luo force-pushed the refactor_load_mm_data branch from 1c4b8e8 to ba4c1da on December 25, 2025 02:20
@JustinTong0323 (Collaborator)

/tag-and-rerun-ci

1 similar comment
@yuan-luo (Collaborator, Author)

/tag-and-rerun-ci

@yuan-luo (Collaborator, Author)

/rerun-failed-ci

@mickqian (Collaborator)

We should consider splitting base_processor.py, as it's already too large a file.

@mickqian merged commit 086813a into sgl-project:main Dec 26, 2025
213 of 258 checks passed
@yuan-luo (Collaborator, Author)

There's a case that didn't pass; I'm not sure whether it's related to this PR. I'll manually rerun it and follow up.
https://github.com/sgl-project/sglang/actions/runs/20497461686/job/58944251650?pr=14644#logs

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 410, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1135, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 107, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 716, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 736, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 290, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 119, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 105, in app
    response = await f(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 426, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 312, in run_endpoint_function
    return await dependant.call(**values)
  File "/public_sglang_ci/runner-l2a-gpu-23/_work/sglang/sglang/python/sglang/srt/entrypoints/http_server.py", line 643, in generate_request
    ret = await _global_state.tokenizer_manager.generate_request(
  File "/public_sglang_ci/runner-l2a-gpu-23/_work/sglang/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 447, in generate_request
    tokenized_obj = await self._tokenize_one_request(obj)
  File "/public_sglang_ci/runner-l2a-gpu-23/_work/sglang/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 651, in _tokenize_one_request
    mm_inputs: Dict = await self.mm_data_processor.process(
  File "/public_sglang_ci/runner-l2a-gpu-23/_work/sglang/sglang/python/sglang/srt/managers/async_mm_data_processor.py", line 99, in process
    return await asyncio.wait_for(_invoke(), timeout=self.timeout_s)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/public_sglang_ci/runner-l2a-gpu-23/_work/sglang/sglang/python/sglang/srt/managers/async_mm_data_processor.py", line 70, in _invoke
    return await self._proc_async(
  File "/public_sglang_ci/runner-l2a-gpu-23/_work/sglang/sglang/python/sglang/srt/multimodal/processors/qwen_vl.py", line 337, in process_mm_data_async
    mm_items, input_ids, ret = self.process_and_combine_mm_data(
  File "/public_sglang_ci/runner-l2a-gpu-23/_work/sglang/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 970, in process_and_combine_mm_data
    collected_items, input_ids, ret = self._process_and_collect_mm_items(
  File "/public_sglang_ci/runner-l2a-gpu-23/_work/sglang/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 920, in _process_and_collect_mm_items
    ret = self.process_mm_data(
  File "/public_sglang_ci/runner-l2a-gpu-23/_work/sglang/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 327, in process_mm_data
    result = processor.__call__(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py", line 177, in __call__
    num_image_tokens = image_grid_thw[index].prod() // merge_length
IndexError: index 1 is out of bounds for dimension 0 with size 1
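
For context on the IndexError: transformers' Qwen2.5-VL processor consumes one row of image_grid_thw per image placeholder in the text, so it fires when the prompt contains more placeholders than grids. A hypothetical minimal repro (the tensor values and merge_length are illustrative, not taken from the failing run):

```python
import torch

image_grid_thw = torch.tensor([[1, 34, 34]])  # grid rows for only one image
merge_length = 4  # illustrative merge_size ** 2

try:
    # Pretend the prompt contains two <|image_pad|> placeholders.
    for index in range(2):
        num_image_tokens = image_grid_thw[index].prod() // merge_length
except IndexError as e:
    print(e)  # -> index 1 is out of bounds for dimension 0 with size 1
```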

@yuan-luo (Collaborator, Author)

This is an error in transformers. It might be because the CI run was not rebased onto main. It should be fine; I'll keep a close eye on CI.

@yuan-luo (Collaborator, Author)

The issue does exist in main CI. Investigating.

command=python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --skip-tokenizer-init --device cuda --host 127.0.0.1 --port 21000
[2025-12-26 06:03:31] WARNING server_args.py:1543: Attention backend not specified. Use trtllm_mha backend by default.
[2025-12-26 06:03:31] WARNING server_args.py:1613: TensorRT-LLM MHA only supports page_size of 16, 32 or 64, changing page_size from None to 64.
[2025-12-26 06:03:31] server_args=ServerArgs(model_path='Qwen/Qwen2.5-VL-3B-Instruct', tokenizer_path='Qwen/Qwen2.5-VL-3B-Instruct', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=True, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=21000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.7683540624999999, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=16384, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=64, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=1, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, stream_output=False, random_seed=702667536, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, served_model_name='Qwen/Qwen2.5-VL-3B-Instruct', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='trtllm_mha', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', 
disable_flashinfer_autotune=False, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=512, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, 
torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=16384, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096, 4352, 4608, 4864, 5120, 5376, 5632, 5888, 6144, 6400, 6656, 6912, 7168, 7424, 7680, 7936, 8192, 8448, 8704, 8960, 9216, 9472, 9728, 9984, 10240, 10496, 10752, 11008, 11264, 11520, 11776, 12032, 12288, 12544, 12800, 13056, 13312, 13568, 13824, 14080, 14336, 14592, 14848, 15104, 15360, 15616, 15872, 16128, 16384], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, enable_fused_qk_norm_rope=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, disaggregation_decode_enable_fake_auto=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2025-12-26 06:03:32] Ignore import error when loading sglang.srt.multimodal.processors.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2025-12-26 06:03:36] No chat template found, defaulting to 'string' content format
[2025-12-26 06:03:39] Init torch distributed begin.
[rank0]:[W1226 06:03:40.230730208 ProcessGroupGloo.cpp:516] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-26 06:03:40] Init torch distributed ends. mem usage=0.00 GB
[2025-12-26 06:03:40] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2025-12-26 06:03:40] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2025-12-26 06:03:40] Load weight begin. avail mem=177.74 GB
[2025-12-26 06:03:41] Multimodal attention backend not set. Use triton_attn.
[2025-12-26 06:03:41] Using triton_attn as multimodal attention backend.
[2025-12-26 06:03:41] Found local HF snapshot for Qwen/Qwen2.5-VL-3B-Instruct at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-VL-3B-Instruct/snapshots/66285546d2b821cf421d4f5eb2576359d3770cd3; skipping download.
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:16<00:16, 16.45s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:34<00:00, 17.67s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:34<00:00, 17.49s/it]

[2025-12-26 06:04:16] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=170.40 GB, mem usage=7.34 GB.
[2025-12-26 06:04:16] Using KV cache dtype: torch.bfloat16
[2025-12-26 06:04:16] The available memory for KV cache is 129.22 GB.
[2025-12-26 06:04:16] KV Cache is allocated. #tokens: 3763904, K size: 64.61 GB, V size: 64.61 GB
[2025-12-26 06:04:16] Memory pool end. avail mem=39.14 GB
[2025-12-26 06:04:16] Capture cuda graph begin. This can take up to several minutes. avail mem=38.09 GB
[2025-12-26 06:04:16] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512]
Capturing batches (bs=1 avail_mem=37.53 GB): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52/52 [00:03<00:00, 13.05it/s]
[2025-12-26 06:04:21] Capture cuda graph end. Time elapsed: 4.56 s. mem usage=0.56 GB. avail mem=37.53 GB.
[2025-12-26 06:04:21] max_total_num_tokens=3763904, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=128000, available_gpu_mem=37.53 GB
[2025-12-26 06:04:21] INFO:     Started server process [138941]
[2025-12-26 06:04:21] INFO:     Waiting for application startup.
[2025-12-26 06:04:21] Using default chat sampling params from model generation config: {'repetition_penalty': 1.05, 'temperature': 1e-06, 'top_k': 50, 'top_p': 1.0}
[2025-12-26 06:04:21] Using default chat sampling params from model generation config: {'repetition_penalty': 1.05, 'temperature': 1e-06, 'top_k': 50, 'top_p': 1.0}
[2025-12-26 06:04:21] INFO:     Application startup complete.
[2025-12-26 06:04:21] INFO:     Uvicorn running on http://127.0.0.1:21000 (Press CTRL+C to quit)
[2025-12-26 06:04:22] INFO:     127.0.0.1:57760 - "GET /model_info HTTP/1.1" 200 OK
[2025-12-26 06:04:22] Prefill batch, #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-12-26 06:04:23] INFO:     127.0.0.1:57766 - "POST /generate HTTP/1.1" 200 OK
[2025-12-26 06:04:23] The server is fired up and ready to roll!
[2025-12-26 06:04:24] Prefill batch, #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-12-26 06:04:25] INFO:     127.0.0.1:57782 - "GET /health_generate HTTP/1.1" 200 OK
[CI Test Method] TestSkipTokenizerInitVLM.test_eos_behavior
[2025-12-26 06:04:26] INFO:     127.0.0.1:57788 - "POST /generate HTTP/1.1" 500 Internal Server Error
[2025-12-26 06:04:26] ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1135, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 107, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 716, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 736, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 290, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 118, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 104, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 428, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 314, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/entrypoints/http_server.py", line 643, in generate_request
    ret = await _global_state.tokenizer_manager.generate_request(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tokenizer_manager.py", line 480, in generate_request
    tokenized_obj = await self._tokenize_one_request(obj)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tokenizer_manager.py", line 684, in _tokenize_one_request
    mm_inputs: Dict = await self.mm_data_processor.process(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/async_mm_data_processor.py", line 99, in process
    return await asyncio.wait_for(_invoke(), timeout=self.timeout_s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/async_mm_data_processor.py", line 70, in _invoke
    return await self._proc_async(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/qwen_vl.py", line 337, in process_mm_data_async
    mm_items, input_ids, ret = self.process_and_combine_mm_data(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 970, in process_and_combine_mm_data
    collected_items, input_ids, ret = self._process_and_collect_mm_items(
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 920, in _process_and_collect_mm_items
    ret = self.process_mm_data(
          ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 327, in process_mm_data
    result = processor.__call__(
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py", line 177, in __call__
    num_image_tokens = image_grid_thw[index].prod() // merge_length
                       ~~~~~~~~~~~~~~^^^^^^^
IndexError: index 1 is out of bounds for dimension 0 with size 1
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 976, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 338, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 356, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 2504, in retry
    return fn()
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 1712, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 167, in test_eos_behavior
    self.run_decode(max_new_tokens=256)
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 68, in run_decode
    ret = response.json()
          ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 980, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
E[CI Test Method] TestSkipTokenizerInitVLM.test_logprob
[2025-12-26 06:04:26] INFO:     127.0.0.1:57804 - "POST /generate HTTP/1.1" 500 Internal Server Error
[2025-12-26 06:04:26] ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1135, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 107, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 716, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 736, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 290, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 118, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 104, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 428, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 314, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/entrypoints/http_server.py", line 643, in generate_request
    ret = await _global_state.tokenizer_manager.generate_request(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tokenizer_manager.py", line 480, in generate_request
    tokenized_obj = await self._tokenize_one_request(obj)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tokenizer_manager.py", line 684, in _tokenize_one_request
    mm_inputs: Dict = await self.mm_data_processor.process(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/async_mm_data_processor.py", line 99, in process
    return await asyncio.wait_for(_invoke(), timeout=self.timeout_s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/async_mm_data_processor.py", line 70, in _invoke
    return await self._proc_async(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/qwen_vl.py", line 337, in process_mm_data_async
    mm_items, input_ids, ret = self.process_and_combine_mm_data(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 970, in process_and_combine_mm_data
    collected_items, input_ids, ret = self._process_and_collect_mm_items(
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 920, in _process_and_collect_mm_items
    ret = self.process_mm_data(
          ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 327, in process_mm_data
    result = processor.__call__(
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py", line 177, in __call__
    num_image_tokens = image_grid_thw[index].prod() // merge_length
                       ~~~~~~~~~~~~~~^^^^^^^
IndexError: index 1 is out of bounds for dimension 0 with size 1
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 976, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 338, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 356, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 2504, in retry
    return fn()
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 1712, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 164, in test_logprob
    self.run_decode(return_logprob=True, top_logprobs_num=top_logprobs_num)
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 68, in run_decode
    ret = response.json()
          ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 980, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
E[CI Test Method] TestSkipTokenizerInitVLM.test_parallel_sample
[2025-12-26 06:04:26] INFO:     127.0.0.1:57806 - "POST /generate HTTP/1.1" 500 Internal Server Error
[2025-12-26 06:04:26] ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1135, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 107, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 716, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 736, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 290, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 118, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 104, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 428, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 314, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/entrypoints/http_server.py", line 643, in generate_request
    ret = await _global_state.tokenizer_manager.generate_request(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tokenizer_manager.py", line 485, in generate_request
    async for response in self._handle_batch_request(
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tokenizer_manager.py", line 1205, in _handle_batch_request
    tokenized_objs = await asyncio.gather(
                     ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tokenizer_manager.py", line 684, in _tokenize_one_request
    mm_inputs: Dict = await self.mm_data_processor.process(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/async_mm_data_processor.py", line 99, in process
    return await asyncio.wait_for(_invoke(), timeout=self.timeout_s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/async_mm_data_processor.py", line 70, in _invoke
    return await self._proc_async(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/qwen_vl.py", line 337, in process_mm_data_async
    mm_items, input_ids, ret = self.process_and_combine_mm_data(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 970, in process_and_combine_mm_data
    collected_items, input_ids, ret = self._process_and_collect_mm_items(
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 920, in _process_and_collect_mm_items
    ret = self.process_mm_data(
          ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 327, in process_mm_data
    result = processor.__call__(
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py", line 177, in __call__
    num_image_tokens = image_grid_thw[index].prod() // merge_length
                       ~~~~~~~~~~~~~~^^^^^^^
IndexError: index 1 is out of bounds for dimension 0 with size 1
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 976, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 338, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 356, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 2504, in retry
    return fn()
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 1712, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 160, in test_parallel_sample
    self.run_decode(n=3)
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 68, in run_decode
    ret = response.json()
          ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 980, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
E[CI Test Method] TestSkipTokenizerInitVLM.test_simple_decode
[2025-12-26 06:04:26] INFO:     127.0.0.1:57808 - "POST /generate HTTP/1.1" 500 Internal Server Error
[2025-12-26 06:04:26] ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1135, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 107, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 716, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 736, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 290, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 118, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 104, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 428, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 314, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/entrypoints/http_server.py", line 643, in generate_request
    ret = await _global_state.tokenizer_manager.generate_request(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tokenizer_manager.py", line 480, in generate_request
    tokenized_obj = await self._tokenize_one_request(obj)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tokenizer_manager.py", line 684, in _tokenize_one_request
    mm_inputs: Dict = await self.mm_data_processor.process(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/async_mm_data_processor.py", line 99, in process
    return await asyncio.wait_for(_invoke(), timeout=self.timeout_s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/async_mm_data_processor.py", line 70, in _invoke
    return await self._proc_async(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/qwen_vl.py", line 337, in process_mm_data_async
    mm_items, input_ids, ret = self.process_and_combine_mm_data(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 970, in process_and_combine_mm_data
    collected_items, input_ids, ret = self._process_and_collect_mm_items(
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 920, in _process_and_collect_mm_items
    ret = self.process_mm_data(
          ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 327, in process_mm_data
    result = processor.__call__(
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py", line 177, in __call__
    num_image_tokens = image_grid_thw[index].prod() // merge_length
                       ~~~~~~~~~~~~~~^^^^^^^
IndexError: index 1 is out of bounds for dimension 0 with size 1
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 976, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 338, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 356, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 2504, in retry
    return fn()
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 1712, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 157, in test_simple_decode
    self.run_decode()
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 68, in run_decode
    ret = response.json()
          ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 980, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
E[CI Test Method] TestSkipTokenizerInitVLM.test_simple_decode_stream
.
======================================================================
ERROR: test_eos_behavior (__main__.TestSkipTokenizerInitVLM.test_eos_behavior)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 976, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 338, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 356, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 2504, in retry
    return fn()
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 1712, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 167, in test_eos_behavior
    self.run_decode(max_new_tokens=256)
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 68, in run_decode
    ret = response.json()
          ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 980, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 1711, in _callTestMethod
    retry(
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 2512, in retry
    raise Exception(f"retry() exceed maximum number of retries.")
Exception: retry() exceed maximum number of retries.

======================================================================
ERROR: test_logprob (__main__.TestSkipTokenizerInitVLM.test_logprob)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 976, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 338, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 356, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 2504, in retry
    return fn()
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 1712, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 164, in test_logprob
    self.run_decode(return_logprob=True, top_logprobs_num=top_logprobs_num)
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 68, in run_decode
    ret = response.json()
          ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 980, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 1711, in _callTestMethod
    retry(
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 2512, in retry
    raise Exception(f"retry() exceed maximum number of retries.")
Exception: retry() exceed maximum number of retries.

======================================================================
ERROR: test_parallel_sample (__main__.TestSkipTokenizerInitVLM.test_parallel_sample)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 976, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 338, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 356, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 2504, in retry
    return fn()
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 1712, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 160, in test_parallel_sample
    self.run_decode(n=3)
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 68, in run_decode
    ret = response.json()
          ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 980, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 1711, in _callTestMethod
    retry(
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 2512, in retry
    raise Exception(f"retry() exceed maximum number of retries.")
Exception: retry() exceed maximum number of retries.

======================================================================
ERROR: test_simple_decode (__main__.TestSkipTokenizerInitVLM.test_simple_decode)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 976, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 338, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 356, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 2504, in retry
    return fn()
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 1712, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 157, in test_simple_decode
    self.run_decode()
  File "/sgl-workspace/sglang/./test/srt/test_skip_tokenizer_init.py", line 68, in run_decode
    ret = response.json()
          ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 980, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 1711, in _callTestMethod
    retry(
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 2512, in retry
    raise Exception(f"retry() exceed maximum number of retries.")
Exception: retry() exceed maximum number of retries.

----------------------------------------------------------------------
Ran 10 tests in 102.655s

FAILED (errors=4)
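All four errors share the same server-side root cause: processing_qwen2_5_vl.py indexes image_grid_thw once per image placeholder in the prompt, so a request that ends up with more <|image_pad|> placeholders than processed grid entries fails with exactly this IndexError. A minimal sketch of the failing indexing, with hypothetical shapes rather than the actual CI request:

import torch

merge_length = 2 * 2  # merge_size**2 for Qwen2.5-VL
image_grid_thw = torch.tensor([[1, 4, 4]])  # grid entries for ONE processed image

# The prompt, however, carries TWO image placeholders, so the processor
# loops index = 0, 1 over image_grid_thw and the second lookup fails:
for index in range(2):
    try:
        num_image_tokens = image_grid_thw[index].prod() // merge_length
        print(index, int(num_image_tokens))
    except IndexError as e:
        print(index, e)  # index 1 is out of bounds for dimension 0 with size 1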

@yuan-luo
Collaborator Author

yuan-luo commented Dec 26, 2025

I tested on main: Qwen3-VL works correctly, but Qwen2.5-VL is broken.

server:

➜  sglang git:(main) python -m sglang.launch_server --host 0.0.0.0 --port 30000 --model-path Qwen/Qwen2.5-VL-7B-Instruct --served-model-name test --trust-remote-code --disable-radix-cache --tp 4 --mem-fraction-static 0.85  --mm-attention-backend triton_attn --attention-backend flashinfer

client:

➜  bench_script bash bench_images.sh
{"id":"a18bf9a84d354d169ab37ea62005fd00","object":"chat.completion","created":1766730374,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":" word书写 ofVEN查Pe受 word案件 syll syll syll syll syll syll syll syll prescribed syll Unicode syll职责 syllenia syll undefined syllGoodene受 syll why syll职责 syll why试点 syll above syll advised syllSSION syll why syll condition syll condition syllSSION condition condition syll exhibiting阮案件案件enia /**\n.persistence succeedingRecommend succeeding syllSSION三点案件 undefined understand understand岫收回彼 leave_ui borough borough syll why后果案件Good undefinedRecommendRecommendReadRead scoff后果收回 leaveReadRead后果后果收回后果收回 blessing argument argument argumentRead argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument.send argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argumentULDIllegalAccessException why condition condition condition condition督查收回 argument argument argument argument督查收回ReadRead argument argument argument argument argument argument argument argument argument argumentreasonGood good good goodGoodGood goodnessGoodGoodGoodGood_uiReadReadGoodGoodGood收回收回对你 argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument Hera Ala utiliz徨收回 argument argument Heraserde Good Good Good Good Good案件 Alande Good Good案件 AlaappleGood完美 Ala完美 Ala Ala Ala受受ываем发酵发酵ываем收回发酵重现重现重现重现重现重现重现重现重现重现重现重现重现 argument徨收回发酵发酵重现 argument徨收回 Hera?$案件GoodGood完美 Ala Ala Ala Ala Ala Alande案件.theme Ala higher受 Again understand understand argument understand understand argument good goodGood goodness徨收回Good Good understand argumentGoodGoodGoodGoodGood_ui收回 argument徨 argument徨GoodGood_ui argument徨徨收回 argument徨收回徨 argument徨 argument徨收回 argument ball案件 argument徨 argument徨收回 argument encontr朋友圈发酵发酵重现 argument徨 gallon徨 gallon.performanceGoodGood_ui_ui收回收回收回收回 argument徨 gallon Regression argumentTk受 argument argument argument argument argument argument argument argument argument argument徨Tk argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argumentuje argument_ui argument(requiredGoodGoodGood收回 argument argumentrez readGoodGood_ui argument argumentGood_ui argument_uirezGood_uirez goodness GoodGood_ui argument gallon过程中 argument argument argument argument argument徨GoodGood收回 argument argument argument argument argument argument argument argument argument argument徨 argument argument argument argument argument徨 argument gallon argument argument argument argument argumentTk argument argument argument argument argument argument argument argument argument argument argument argument argument argument Fault argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument徨 argument gallon argument argument argument argument argument argument argument活性 argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument过了Good argument argument argument argument argument argument argument argument argument argument argument argument argument argument 
argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument.send understand argument argument argument argument argument argument的方法 argument argument argument的方法 argument argument argument argument的方法Good argument argument argument argument argument argument的方法 argument的方法 argument的方法 argument argument argument argument argument argument argument argument的方法 argument的方法活性Good argument argument argument的方法 argument的方法Good argument argument argument argument argument argument的方法 argument的方法 argument argument argument的方法 argument的方法 argument的方法 argument的方法 argument的方法 argument的方法 argument的方法 argument的方法 argument的方法 argument argument argument的方法 argument的方法 argument的方法 argument的方法的方法案件 argument argument argument argument的方法 argument argument argument的方法 argument的方法的方法的方法案件 argument的方法ocy案件 argument argument argument的方法 argument的方法 argument的方法iciency argument的方法(required徨 argument argument argument的方法 argument argument argument的方法 argument的方法的方法.persistence argument的方法 repetition argument的方法的方法收回 argument的方法的方法的方法 McLaren Panasonic measured measured measuredPlease measured measured measured案件案件 argument的方法 argument argument的方法的方法的方法的方法 Ala Ala受加工 thatGoodGood argument argument argument argument argument的方法的方法 Ala受ываем qualifyinguhn argument argument argument的方法的方法ываем案件Good goodness understand understand understand的方法 Ala goodness understand understand的方法要好好 argument的方法 argument的方法的方法 Ala事情Good goodness argument argument的方法的方法.theme加工Good attainedGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGood完美ываемGoodGoodGood goodnessGoodGood_ui argument argument argument argument argument argument的方法 argument的方法的方法 argument argument的方法 argument的方法 argument argument的方法的方法的方法的方法的方法 AlaGoodGoodGoodGoodGoodGoodGood idolGoodGoodGoodGoodGood_ui argument argument的方法的方法的方法的方法ываемываем ball ballGoodGoodGoodGoodGood_ui argument的方法的方法的方法的方法的方法的方法的方法的方法的方法的方法的方法的方法的方法的方法 argument的方法的方法的方法的方法的方法的方法 db argument的方法的方法的方法案件 argument的方法的方法/list understand的方法要好好发酵发酵重现 argument argument argument的方法.table BarbaraываемGoodGoodGoodGood_ui argument的方法 Interpret受GoodGoodGood argument argument argument argument的方法 argument prácticaductory argument的方法 argument的方法.table职责Good goodnessGood_ui argument的方法 argument的方法 argument的方法的方法.table understand argument的方法 argument的方法.table understand understand understand的方法的方法.table understand","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"length","matched_stop":null}],"usage":{"prompt_tokens":977,"total_tokens":1977,"completion_tokens":1000,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real	0m3.913s
user	0m0.006s
sys	0m0.004s
{"id":"dcaa0c944e604e6391cc365fd1e6ca7e","object":"chat.completion","created":1766730378,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":" word书写 ofVEN查Pe受 word案件 syll syll syll syll syll syll syll syll prescribed syll Unicode syll职责 syllenia syll undefined syllGoodene受 syll why syll职责 syll why试点 syll above syll advised syllSSION syll why syll condition syll condition syllSSION condition condition syll exhibiting阮案件案件enia /**\n.persistence succeedingRecommend succeeding syllSSION三点案件 undefined understand understand岫收回彼 leave_ui borough borough syll why后果案件Good undefinedRecommendRecommendReadRead scoff后果收回 leaveReadRead后果后果收回后果收回 blessing argument argument argumentRead argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument.send argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argumentULDIllegalAccessException why condition condition condition condition督查收回 argument argument argument argument督查收回ReadRead argument argument argument argument argument argument argument argument argument argumentreasonGood good good goodGoodGood goodnessGoodGoodGoodGood_uiReadReadGoodGoodGood收回收回对你 argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument Hera Ala utiliz徨收回 argument argument Heraserde Good Good Good Good Good案件 Alande Good Good案件 AlaappleGood完美 Ala完美 Ala Ala Ala受受ываем发酵发酵ываем收回发酵重现重现重现重现重现重现重现重现重现重现重现重现重现 argument徨收回发酵发酵重现 argument徨收回 Hera?$案件GoodGood完美 Ala Ala Ala Ala Ala Alande案件.theme Ala higher受 Again understand understand argument understand understand argument good goodGood goodness徨收回Good Good understand argumentGoodGoodGoodGoodGood_ui收回 argument徨 argument徨GoodGood_ui argument徨徨收回 argument徨收回徨 argument徨 argument徨收回 argument ball案件 argument徨 argument徨收回 argument encontr朋友圈发酵发酵重现 argument徨 gallon徨 gallon.performanceGoodGood_ui_ui收回收回收回收回 argument徨 gallon Regression argumentTk受 argument argument argument argument argument argument argument argument argument argument徨Tk argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argumentuje argument_ui argument(requiredGoodGoodGood收回 argument argumentrez readGoodGood_ui argument argumentGood_ui argument_uirezGood_uirez goodness GoodGood_ui argument gallon过程中 argument argument argument argument argument徨GoodGood收回 argument argument argument argument argument argument argument argument argument argument徨 argument argument argument argument argument徨 argument gallon argument argument argument argument argumentTk argument argument argument argument argument argument argument argument argument argument argument argument argument argument Fault argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument徨 argument gallon argument argument argument argument argument argument argument活性 argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument过了Good argument argument argument argument argument argument argument argument argument argument argument argument argument argument 
argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument argument.send understand argument argument argument argument argument argument的方法 argument argument argument的方法 argument argument argument argument的方法Good argument argument argument argument argument argument的方法 argument的方法 argument的方法 argument argument argument argument argument argument argument argument的方法 argument的方法活性Good argument argument argument的方法 argument的方法Good argument argument argument argument argument argument的方法 argument的方法 argument argument argument的方法 argument的方法 argument的方法 argument的方法 argument的方法 argument的方法 argument的方法 argument的方法 argument的方法 argument argument argument的方法 argument的方法 argument的方法 argument的方法的方法案件 argument argument argument argument的方法 argument argument argument的方法 argument的方法的方法的方法案件 argument的方法ocy案件 argument argument argument的方法 argument的方法 argument的方法iciency argument的方法(required徨 argument argument argument的方法 argument argument argument的方法 argument的方法的方法.persistence argument的方法 repetition argument的方法的方法收回 argument的方法的方法的方法 McLaren Panasonic measured measured measuredPlease measured measured measured案件案件 argument的方法 argument argument的方法的方法的方法的方法 Ala Ala受加工 thatGoodGood argument argument argument argument argument的方法的方法 Ala受ываем qualifyinguhn argument argument argument的方法的方法ываем案件Good goodness understand understand understand的方法 Ala goodness understand understand的方法要好好 argument的方法 argument的方法的方法 Ala事情Good goodness argument argument的方法的方法.theme加工Good attainedGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGoodGood完美ываемGoodGoodGood goodnessGoodGood_ui argument argument argument argument argument argument的方法 argument的方法的方法 argument argument的方法 argument的方法 argument argument的方法的方法的方法的方法的方法 AlaGoodGoodGoodGoodGoodGoodGood idolGoodGoodGoodGoodGood_ui argument argument的方法的方法的方法的方法ываемываем ball ballGoodGoodGoodGoodGood_ui argument的方法的方法的方法的方法的方法的方法的方法的方法的方法的方法的方法的方法的方法的方法 argument的方法的方法的方法的方法的方法的方法 db argument的方法的方法的方法案件 argument的方法的方法/list understand的方法要好好发酵发酵重现 argument argument argument的方法.table BarbaraываемGoodGoodGoodGood_ui argument的方法 Interpret受GoodGoodGood argument argument argument argument的方法 argument prácticaductory argument的方法 argument的方法.table职责Good goodnessGood_ui argument的方法 argument的方法 argument的方法的方法.table understand argument的方法 argument的方法.table understand understand understand的方法的方法.table understand","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"length","matched_stop":null}],"usage":{"prompt_tokens":977,"total_tokens":1977,"completion_tokens":1000,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real	0m3.700s
user	0m0.003s
sys	0m0.006s

server:

➜  sglang git:(main) python -m sglang.launch_server --host 0.0.0.0 --port 30000 --model-path Qwen/Qwen3-VL-8B-Instruct --served-model-name test --trust-remote-code --disable-radix-cache --tp 4 --mem-fraction-static 0.85  --mm-attention-backend triton_attn --attention-backend flashinfer

client:

➜  bench_script bash bench_images.sh
{"id":"8d283d13fa524f1bb0746ba32339619e","object":"chat.completion","created":1766730132,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":"这张图里展示的是**笔记本电脑**(Laptop Computer),它的学名是**便携式个人计算机**(Portable Personal Computer)。\n\n更具体地说,图中是**一台银色的笔记本电脑**,放置在木质桌面上,背景有暖色调的灯光,营造出温馨的工作或学习氛围。虽然图中没有显示品牌或型号,但从外观来看,它具有典型的笔记本电脑特征:\n\n- 一体式机身(屏幕与键盘合二为一)\n- 便携式设计(可折叠、轻薄)\n- 有键盘、触控板、屏幕等核心部件\n- 通常用于移动办公、学习、娱乐等\n\n**学名解释:**\n- “笔记本电脑”是通俗叫法,其正式学名是“便携式个人计算机”,英文为 **Portable Personal Computer**。\n- 在计算机科学和工程领域,它也常被称为 **Notebook Computer**(注意:Notebook 是笔记本电脑的另一种叫法,与“笔记本”意思相同,但“Notebook”更强调便携性)。\n- 从技术分类上,它属于**个人计算机(PC)**的一个子类,与台式机(Desktop Computer)相对。\n\n所以,图中物品的学名是:**便携式个人计算机**(Portable Personal Computer)或 **笔记本电脑**(Notebook Computer)。","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":742,"total_tokens":1031,"completion_tokens":289,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real	0m1.505s
user	0m0.006s
sys	0m0.009s
{"id":"ab4f8a4649584380a06511ad65541abe","object":"chat.completion","created":1766730133,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":"这张图里展示的是**笔记本电脑**(Laptop Computer),它的学名是**便携式个人计算机**(Portable Personal Computer)。\n\n更具体地说,图中是**一台银色的笔记本电脑**,放置在木质桌面上,背景有暖色调的灯光,营造出温馨的工作或学习氛围。虽然图中没有显示品牌或型号,但从外观来看,它具有典型的笔记本电脑特征:\n\n- 一体式机身(屏幕与键盘合二为一)\n- 便携式设计(可折叠、轻薄)\n- 有键盘、触控板、屏幕等核心部件\n- 通常用于移动办公、学习、娱乐等\n\n**学名解释:**\n- “笔记本电脑”是通俗叫法,其正式学名是“便携式个人计算机”,英文为 **Portable Personal Computer**。\n- 在计算机科学和工程领域,它也常被称为 **Notebook Computer**(注意:Notebook 是笔记本电脑的另一种叫法,与“笔记本”意思相同,但“Notebook”更强调便携性)。\n- 从技术分类上,它属于**个人计算机(PC)**的一个子类,与台式机(Desktop Computer)相对。\n\n所以,图中物品的学名是:**便携式个人计算机**(Portable Personal Computer)或 **笔记本电脑**(Notebook Computer)。","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":742,"total_tokens":1031,"completion_tokens":289,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real	0m1.295s
user	0m0.002s
sys	0m0.006s
{"id":"29de28e8e47b4edd864fe356af2c89e2","object":"chat.completion","created":1766730135,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":"这张图里展示的是**笔记本电脑**(Laptop Computer),它的学名是**便携式个人计算机**(Portable Personal Computer)。\n\n更具体地说,图中是**一台银色的笔记本电脑**,放置在木质桌面上,背景有暖色调的灯光,营造出温馨的工作或学习氛围。虽然图中没有显示品牌或型号,但从外观来看,它具有典型的笔记本电脑特征:\n\n- 一体式机身(屏幕与键盘合二为一)\n- 便携式设计(可折叠、轻薄)\n- 有键盘、触控板、屏幕等核心部件\n- 通常用于移动办公、学习、娱乐等\n\n**学名解释:**\n- “笔记本电脑”是通俗叫法,其正式学名是“便携式个人计算机”,英文为 **Portable Personal Computer**。\n- 在计算机科学和工程领域,它也常被称为 **Notebook Computer**(注意:Notebook 是笔记本电脑的另一种叫法,与“笔记本”意思相同,但“Notebook”更强调便携性)。\n- 从技术分类上,它属于**个人计算机(PC)**的一个子类,与台式机(Desktop Computer)相对。\n\n所以,图中物品的学名是:**便携式个人计算机**(Portable Personal Computer)或 **笔记本电脑**(Notebook Computer)。","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":742,"total_tokens":1031,"completion_tokens":289,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real	0m1.349s
user	0m0.004s
sys	0m0.004s
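bench_images.sh itself is not posted in this thread; a roughly equivalent single request against the same endpoint (the image URL and prompt below are placeholders) looks like:

import requests

payload = {
    "model": "auto",
    "messages": [{
        "role": "user",
        "content": [
            # hypothetical image URL standing in for whatever bench_images.sh sends
            {"type": "image_url", "image_url": {"url": "https://example.com/laptop.jpg"}},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }],
    "temperature": 0.0,
    "max_tokens": 1000,
    "stream": False,
}
resp = requests.post("http://127.0.0.1:30000/v1/chat/completions", json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])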

@yuan-luo
Collaborator Author

Checking whether it is this PR that breaks Qwen2.5-VL. When this PR was submitted, Qwen2.5-VL was working correctly.

@yuan-luo
Collaborator Author

I reverted this PR in my local environment and the problem still exists, so the breakage is not caused by this PR.

$bash bench_n_1m_image.sh 
{"id":"01be6dc6fed949998948713ff1b59279","object":"chat.completion","created":1766733308,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":" word word word revival syllTER活性活性痕il依文书il literary活性il依 literarycher ser痕痕 ser ser ser痕 available available available活性活性活性活性活性活性il available available活性活性il活性依UDiciency $?il available received receivedil available ask ask suggest suggest ask ask ask ask ask ask ask ask ask suggest received ask ask ask ask ask ask ask ask ask ask ask ask ask ask ask ask ask ask suggest ask ask ask received str str str str str str str str str str str str str str str str str str ask ask ask ask ask ask str str str str str str ask ask ask ask str str str str str str str str str str str str ask ask ask ask ask ask ask ask ask str str str str str str str str str str str str str str str str str ask ask ask ask ask ask ask ask str str str str str str str str str str str str str str str str str str str str ask ask ask ask ask ask ask ask ask ask ask str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"length","matched_stop":null}],"usage":{"prompt_tokens":811,"total_tokens":1811,"completion_tokens":1000,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real    0m4.411s
user    0m0.002s
sys     0m0.002s
{"id":"a4cd664baf5e462b82b4315f65aad330","object":"chat.completion","created":1766733312,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":" word word word revival syllTER活性活性痕il依文书il literary活性il依 literarycher ser痕痕 ser ser ser痕 available available available活性活性活性活性活性活性il available available活性活性il活性依UDiciency $?il available received receivedil available ask ask suggest suggest ask ask ask ask ask ask ask ask ask suggest received ask ask ask ask ask ask ask ask ask ask ask ask ask ask ask ask ask ask suggest ask ask ask received str str str str str str str str str str str str str str str str str str ask ask ask ask ask ask str str str str str str ask ask ask ask str str str str str str str str str str str str ask ask ask ask ask ask ask ask ask str str str str str str str str str str str str str str str str str ask ask ask ask ask ask ask ask str str str str str str str str str str str str str str str str str str str str ask ask ask ask ask ask ask ask ask ask ask str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"length","matched_stop":null}],"usage":{"prompt_tokens":811,"total_tokens":1811,"completion_tokens":1000,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real    0m4.347s
user    0m0.001s
sys     0m0.004s
{"id":"af128e771f9c49a896842f2b637cb21c","object":"chat.completion","created":1766733316,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":" word word word revival syllTER活性活性痕il依文书il literary活性il依 literarycher ser痕痕 ser ser ser痕 available available available活性活性活性活性活性活性il available available活性活性il活性依UDiciency $?il available received receivedil available ask ask suggest suggest ask ask ask ask ask ask ask ask ask suggest received ask ask ask ask ask ask ask ask ask ask ask ask ask ask ask ask ask ask suggest ask ask ask received str str str str str str str str str str str str str str str str str str ask ask ask ask ask ask str str str str str str ask ask ask ask str str str str str str str str str str str str ask ask ask ask ask ask ask ask ask str str str str str str str str str str str str str str str str str ask ask ask ask ask ask ask ask str str str str str str str str str str str str str str str str str str str str ask ask ask ask ask ask ask ask ask ask ask str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"length","matched_stop":null}],"usage":{"prompt_tokens":811,"total_tokens":1811,"completion_tokens":1000,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real    0m4.336s
user    0m0.000s
sys     0m0.004s

) -> BaseMultiModalProcessorOutput:
"""
A fast version of `load_mm_data` that loads multimodal data directly.
This version does not scan the prompt to recognize tokens. It assumes
a one-to-one correspondence between multimodal tokens and data items,
aligned in order.
"""
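
For orientation, a minimal sketch of the "1 token → 1 data" loading this docstring describes; `load_one` and the thread pool are illustrative stand-ins, not the actual sglang helpers:

from concurrent.futures import ThreadPoolExecutor

def fast_load_mm_data(data_items):
    """Sketch only: load every item directly, preserving order, so the
    i-th loaded result pairs with the i-th multimodal token in the prompt.
    No prompt scanning and no frame expansion (that behavior stays in the
    legacy MiniCPM-only path)."""

    def load_one(item):
        # Stand-in for the real per-modality loader (image/video/audio).
        return item

    with ThreadPoolExecutor() as pool:
        return list(pool.map(load_one, data_items))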
Collaborator

Should we add a safety check, such as:

expected_count = (
    len(image_data or [])
    + len(video_data or [])
    + len(audio_data or [])
)

assert expected_count == len(tokenizer(prompt))
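
As written, len(tokenizer(prompt)) would count every token in the prompt, whereas the quantity to compare against is presumably the number of multimodal placeholder tokens. A fleshed-out sketch under that assumption, where mm_token_ids is a hypothetical set of placeholder token ids (the real ids would come from the processor config):

def check_mm_alignment(input_ids, image_data, video_data, audio_data, mm_token_ids):
    """Verify the fast path's "1 token -> 1 data" assumption before loading."""
    expected = len(image_data or []) + len(video_data or []) + len(audio_data or [])
    found = sum(1 for t in input_ids if t in mm_token_ids)
    if expected != found:
        raise ValueError(
            f"Multimodal mismatch: {expected} data items vs "
            f"{found} placeholder tokens in the prompt."
        )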

Collaborator Author

Will update in a new PR.

@merrymercy
Contributor

@yuan-luo I reverted this PR. Please resubmit and fix the CI failures.
#15911

@yuan-luo
Collaborator Author

yuan-luo commented Dec 27, 2025

The PR that introduced the Qwen2.5-VL regression is [bug fix][pp] fix weight load for qwen2.5-vl (#15138).
It has been fixed in #15398.

@yuan-luo
Collaborator Author

@yuan-luo I reverted this PR. Please resubmit and fix the CI failures. #15911

It's odd that running the test case test_skip_tokenizer_init.TestSkipTokenizerInitVLM.test_simple_decode_stream manually passes, but it fails when the whole test suite is run.

root@6996fb46042d:/sgl-workspace/sglang_dev3/test/srt# python3 -m unittest test_skip_tokenizer_init.TestSkipTokenizerInitVLM.test_simple_decode_stream
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
command=python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --skip-tokenizer-init --device cuda --host 127.0.0.1 --port 21000
[2025-12-28 14:27:12] WARNING server_args.py:1543: Attention backend not specified. Use flashinfer backend by default.
[2025-12-28 14:27:12] server_args=ServerArgs(model_path='Qwen/Qwen2.5-VL-3B-Instruct', tokenizer_path='Qwen/Qwen2.5-VL-3B-Instruct', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=True, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=21000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.7486296874999999, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=1, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, stream_output=False, random_seed=366367333, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, served_model_name='Qwen/Qwen2.5-VL-3B-Instruct', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='flashinfer', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', 
disable_flashinfer_autotune=False, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=256, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=8192, 
piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096, 4352, 4608, 4864, 5120, 5376, 5632, 5888, 6144, 6400, 6656, 6912, 7168, 7424, 7680, 7936, 8192], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, enable_fused_qk_norm_rope=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, disaggregation_decode_enable_fake_auto=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2025-12-28 14:27:13] Ignore import error when loading sglang.srt.multimodal.processors.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2025-12-28 14:27:18] No chat template found, defaulting to 'string' content format
[2025-12-28 14:27:20] Init torch distributed begin.
[rank0]:[W1228 14:27:20.297433853 ProcessGroupGloo.cpp:516] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-28 14:27:20] Init torch distributed ends. mem usage=0.00 GB
[2025-12-28 14:27:20] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2025-12-28 14:27:20] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2025-12-28 14:27:20] Load weight begin. avail mem=78.81 GB
[2025-12-28 14:27:20] Multimodal attention backend not set. Use fa3.
[2025-12-28 14:27:20] Using fa3 as multimodal attention backend.
[2025-12-28 14:27:21] Found local HF snapshot for Qwen/Qwen2.5-VL-3B-Instruct at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-VL-3B-Instruct/snapshots/66285546d2b821cf421d4f5eb2576359d3770cd3; skipping download.
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.11it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.21it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.20it/s]

[2025-12-28 14:27:23] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=71.47 GB, mem usage=7.34 GB.
[2025-12-28 14:27:23] Using KV cache dtype: torch.bfloat16
[2025-12-28 14:27:23] The available memory for KV cache is 51.66 GB.
[2025-12-28 14:27:23] KV Cache is allocated. #tokens: 1504726, K size: 25.83 GB, V size: 25.83 GB
[2025-12-28 14:27:23] Memory pool end. avail mem=17.70 GB
[2025-12-28 14:27:23] Capture cuda graph begin. This can take up to several minutes. avail mem=17.13 GB
[2025-12-28 14:27:23] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
Capturing batches (bs=1 avail_mem=16.19 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36/36 [00:02<00:00, 12.92it/s]
[2025-12-28 14:27:26] Capture cuda graph end. Time elapsed: 3.27 s. mem usage=0.94 GB. avail mem=16.18 GB.
[2025-12-28 14:27:26] max_total_num_tokens=1504726, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4096, context_len=128000, available_gpu_mem=16.18 GB
[2025-12-28 14:27:26] INFO:     Started server process [212757]
[2025-12-28 14:27:26] INFO:     Waiting for application startup.
[2025-12-28 14:27:26] Using default chat sampling params from model generation config: {'repetition_penalty': 1.05, 'temperature': 1e-06, 'top_k': 50, 'top_p': 1.0}
[2025-12-28 14:27:26] Using default chat sampling params from model generation config: {'repetition_penalty': 1.05, 'temperature': 1e-06, 'top_k': 50, 'top_p': 1.0}
[2025-12-28 14:27:26] INFO:     Application startup complete.
[2025-12-28 14:27:26] INFO:     Uvicorn running on http://127.0.0.1:21000 (Press CTRL+C to quit)
[2025-12-28 14:27:27] INFO:     127.0.0.1:56244 - "GET /model_info HTTP/1.1" 200 OK
[2025-12-28 14:27:27] Prefill batch, #new-seq: 1, #new-token: 3, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-28 14:27:27] INFO:     127.0.0.1:56256 - "POST /generate HTTP/1.1" 200 OK
[2025-12-28 14:27:27] The server is fired up and ready to roll!
[2025-12-28 14:27:35] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-28 14:27:36] INFO:     127.0.0.1:54240 - "GET /health_generate HTTP/1.1" 200 OK
[CI Test Method] TestSkipTokenizerInitVLM.test_simple_decode_stream
.
----------------------------------------------------------------------
Ran 1 test in 36.588s

OK

@yuan-luo
Collaborator Author

Running the whole test suite, it fails on the above test case:

root@6996fb46042d:/sgl-workspace/sglang_dev3# python ./test/srt/test_skip_tokenizer_init.py
command=python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --skip-tokenizer-init --stream-output --device cuda --host 127.0.0.1 --port 21000
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_http.py", line 402, in hf_raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 1026, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/transformers/utils/hub.py", line 479, in cached_files
    hf_hub_download(
  File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/file_download.py", line 1007, in hf_hub_download
    return _hf_hub_download_to_cache_dir(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/file_download.py", line 1114, in _hf_hub_download_to_cache_dir
    _raise_on_head_call_error(head_call_error, force_download, local_files_only)
  File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/file_download.py", line 1655, in _raise_on_head_call_error
    raise head_call_error
  File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/file_download.py", line 1543, in _get_metadata_or_catch_error
    metadata = get_hf_file_metadata(
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/file_download.py", line 1460, in get_hf_file_metadata
    r = _request_wrapper(
        ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/file_download.py", line 283, in _request_wrapper
    response = _request_wrapper(
               ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/file_download.py", line 307, in _request_wrapper
    hf_raise_for_status(response)
  File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_http.py", line 419, in hf_raise_for_status
    raise _format(GatedRepoError, message, response) from e
huggingface_hub.errors.GatedRepoError: 401 Client Error. (Request ID: Root=1-695139c1-20c276683bbfa05022511a7b;379c2def-ca77-4fc6-9489-6fc1f7a44e40)

Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/resolve/main/config.json.
Access to model meta-llama/Llama-3.2-1B-Instruct is restricted. You must have access to it and be authenticated to access it. Please log in.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.12/dist-packages/sglang/launch_server.py", line 29, in <module>
    server_args = prepare_server_args(sys.argv[1:])
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/server_args.py", line 4969, in prepare_server_args
    return ServerArgs.from_cli_args(raw_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/server_args.py", line 4460, in from_cli_args
    return cls(**{attr: getattr(args, attr) for attr in attrs})
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 314, in __init__
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/server_args.py", line 671, in __post_init__
    self._handle_gpu_memory_settings(gpu_mem)
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/server_args.py", line 954, in _handle_gpu_memory_settings
    model_config = self.get_model_config()
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/server_args.py", line 4474, in get_model_config
    self.model_config = ModelConfig.from_server_args(self)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/configs/model_config.py", line 241, in from_server_args
    return ModelConfig(
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/configs/model_config.py", line 126, in __init__
    self.hf_config = get_config(
                     ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/common.py", line 3169, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/hf_transformers_utils.py", line 273, in get_config
    config = AutoConfig.from_pretrained(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/configuration_auto.py", line 1332, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/configuration_utils.py", line 662, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/configuration_utils.py", line 721, in _get_config_dict
    resolved_config_file = cached_file(
                           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/utils/hub.py", line 322, in cached_file
    file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/utils/hub.py", line 543, in cached_files
    raise OSError(
OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct.
401 Client Error. (Request ID: Root=1-695139c1-20c276683bbfa05022511a7b;379c2def-ca77-4fc6-9489-6fc1f7a44e40)

Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/resolve/main/config.json.
Access to model meta-llama/Llama-3.2-1B-Instruct is restricted. You must have access to it and be authenticated to access it. Please log in.
EThe image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
command=python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --skip-tokenizer-init --device cuda --host 127.0.0.1 --port 21000
[2025-12-28 14:08:17] WARNING server_args.py:1543: Attention backend not specified. Use flashinfer backend by default.
[2025-12-28 14:08:17] server_args=ServerArgs(model_path='Qwen/Qwen2.5-VL-3B-Instruct', tokenizer_path='Qwen/Qwen2.5-VL-3B-Instruct', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=True, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=21000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.7486296874999999, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=1, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, stream_output=False, random_seed=552765458, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, served_model_name='Qwen/Qwen2.5-VL-3B-Instruct', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='flashinfer', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', 
disable_flashinfer_autotune=False, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=256, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=8192, 
piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096, 4352, 4608, 4864, 5120, 5376, 5632, 5888, 6144, 6400, 6656, 6912, 7168, 7424, 7680, 7936, 8192], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, enable_fused_qk_norm_rope=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, disaggregation_decode_enable_fake_auto=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2025-12-28 14:08:18] Ignore import error when loading sglang.srt.multimodal.processors.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2025-12-28 14:08:23] No chat template found, defaulting to 'string' content format
[2025-12-28 14:08:24] Init torch distributed begin.
[rank0]:[W1228 14:08:25.122633731 ProcessGroupGloo.cpp:516] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-28 14:08:25] Init torch distributed ends. mem usage=0.00 GB
[2025-12-28 14:08:25] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2025-12-28 14:08:25] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2025-12-28 14:08:25] Load weight begin. avail mem=78.81 GB
[2025-12-28 14:08:25] Multimodal attention backend not set. Use fa3.
[2025-12-28 14:08:25] Using fa3 as multimodal attention backend.
[2025-12-28 14:08:26] Found local HF snapshot for Qwen/Qwen2.5-VL-3B-Instruct at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-VL-3B-Instruct/snapshots/66285546d2b821cf421d4f5eb2576359d3770cd3; skipping download.
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.06it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.16it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.15it/s]

[2025-12-28 14:08:28] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=71.47 GB, mem usage=7.34 GB.
[2025-12-28 14:08:28] Using KV cache dtype: torch.bfloat16
[2025-12-28 14:08:28] The available memory for KV cache is 51.66 GB.
[2025-12-28 14:08:28] KV Cache is allocated. #tokens: 1504726, K size: 25.83 GB, V size: 25.83 GB
[2025-12-28 14:08:28] Memory pool end. avail mem=17.70 GB
[2025-12-28 14:08:28] Capture cuda graph begin. This can take up to several minutes. avail mem=17.13 GB
[2025-12-28 14:08:28] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
Capturing batches (bs=1 avail_mem=16.19 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36/36 [00:02<00:00, 12.38it/s]
[2025-12-28 14:08:31] Capture cuda graph end. Time elapsed: 3.43 s. mem usage=0.94 GB. avail mem=16.18 GB.
[2025-12-28 14:08:31] max_total_num_tokens=1504726, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4096, context_len=128000, available_gpu_mem=16.18 GB
[2025-12-28 14:08:31] INFO:     Started server process [211291]
[2025-12-28 14:08:31] INFO:     Waiting for application startup.
[2025-12-28 14:08:31] Using default chat sampling params from model generation config: {'repetition_penalty': 1.05, 'temperature': 1e-06, 'top_k': 50, 'top_p': 1.0}
[2025-12-28 14:08:31] Using default chat sampling params from model generation config: {'repetition_penalty': 1.05, 'temperature': 1e-06, 'top_k': 50, 'top_p': 1.0}
[2025-12-28 14:08:31] INFO:     Application startup complete.
[2025-12-28 14:08:31] INFO:     Uvicorn running on http://127.0.0.1:21000 (Press CTRL+C to quit)
[2025-12-28 14:08:32] INFO:     127.0.0.1:40676 - "GET /model_info HTTP/1.1" 200 OK
[2025-12-28 14:08:32] Prefill batch, #new-seq: 1, #new-token: 3, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-28 14:08:33] INFO:     127.0.0.1:40682 - "POST /generate HTTP/1.1" 200 OK
[2025-12-28 14:08:33] The server is fired up and ready to roll!
[2025-12-28 14:08:40] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-28 14:08:41] INFO:     127.0.0.1:45896 - "GET /health_generate HTTP/1.1" 200 OK
[CI Test Method] TestSkipTokenizerInitVLM.test_eos_behavior
[2025-12-28 14:08:41] Prefill batch, #new-seq: 1, #new-token: 288, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-28 14:08:41] INFO:     127.0.0.1:45900 - "POST /generate HTTP/1.1" 200 OK
{
  "output_ids": [
    151645
  ],
  "meta_info": {
    "id": "22027cb3a1654b77bcfe625dfa10122a",
    "finish_reason": {
      "type": "stop",
      "matched": 151645
    },
    "prompt_tokens": 288,
    "weight_version": "default",
    "total_retractions": 0,
    "completion_tokens": 1,
    "cached_tokens": 0,
    "e2e_latency": 0.45726776123046875,
    "response_sent_to_client_ts": 1766930921.9123194
  }
}
====================================================================================================
.[CI Test Method] TestSkipTokenizerInitVLM.test_logprob
[2025-12-28 14:08:42] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 287, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-28 14:08:42] INFO:     127.0.0.1:45904 - "POST /generate HTTP/1.1" 200 OK
{
  "output_ids": [
    151645
  ],
  "meta_info": {
    "id": "97bf956c9e3b470b8baccb03cf573e50",
    "finish_reason": {
      "type": "stop",
      "matched": 151645
    },
    "prompt_tokens": 288,
    "weight_version": "default",
    "total_retractions": 0,
    "input_token_logprobs": [
      [
        null,
        30,
        null
      ]
    ],
    "output_token_logprobs": [
      [
        -0.0001134808044298552,
        151645,
        null
      ]
    ],
    "completion_tokens": 1,
    "cached_tokens": 287,
    "e2e_latency": 0.10360574722290039,
    "response_sent_to_client_ts": 1766930922.1584053
  }
}
====================================================================================================
[2025-12-28 14:08:42] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 287, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-28 14:08:42] INFO:     127.0.0.1:45918 - "POST /generate HTTP/1.1" 200 OK
{
  "output_ids": [
    151645
  ],
  "meta_info": {
    "id": "f8c222fb9f184c0b9b2fb66aba5b3c35",
    "finish_reason": {
      "type": "stop",
      "matched": 151645
    },
    "prompt_tokens": 288,
    "weight_version": "default",
    "total_retractions": 0,
    "input_token_logprobs": [
      [
        null,
        30,
        null
      ]
    ],
    "output_token_logprobs": [
      [
        -0.0001134808044298552,
        151645,
        null
      ]
    ],
    "input_top_logprobs": [
      null
    ],
    "output_top_logprobs": [
      [
        [
          -0.0001134808044298552,
          151645,
          null
        ],
        [
          -9.500113487243652,
          151644,
          null
        ],
        [
          -12.375113487243652,
          151657,
          null
        ]
      ]
    ],
    "completion_tokens": 1,
    "cached_tokens": 287,
    "e2e_latency": 0.13478899002075195,
    "response_sent_to_client_ts": 1766930922.416631
  }
}
====================================================================================================
.[CI Test Method] TestSkipTokenizerInitVLM.test_parallel_sample
[2025-12-28 14:08:42] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 287, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-28 14:08:42] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 287, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-28 14:08:42] Prefill batch, #new-seq: 2, #new-token: 2, #cached-token: 574, token usage: 0.00, #running-req: 1, #queue-req: 0, 
[2025-12-28 14:08:42] INFO:     127.0.0.1:45924 - "POST /generate HTTP/1.1" 200 OK
[
  {
    "output_ids": [
      151645
    ],
    "meta_info": {
      "id": "9817a2a0ac774fd8862ea8c46b367fba",
      "finish_reason": {
        "type": "stop",
        "matched": 151645
      },
      "prompt_tokens": 288,
      "weight_version": "default",
      "total_retractions": 0,
      "completion_tokens": 1,
      "cached_tokens": 287,
      "e2e_latency": 0.19336676597595215,
      "response_sent_to_client_ts": 1766930922.7287662
    }
  },
  {
    "output_ids": [
      151645
    ],
    "meta_info": {
      "id": "b3f1663248ab4af7b57540a0bcfa9ac5",
      "finish_reason": {
        "type": "stop",
        "matched": 151645
      },
      "prompt_tokens": 288,
      "weight_version": "default",
      "total_retractions": 0,
      "completion_tokens": 1,
      "cached_tokens": 287,
      "e2e_latency": 0.25332069396972656,
      "response_sent_to_client_ts": 1766930922.78847
    }
  },
  {
    "output_ids": [
      151645
    ],
    "meta_info": {
      "id": "741e5add6c2e401bbde11dcb759ccab1",
      "finish_reason": {
        "type": "stop",
        "matched": 151645
      },
      "prompt_tokens": 288,
      "weight_version": "default",
      "total_retractions": 0,
      "completion_tokens": 1,
      "cached_tokens": 287,
      "e2e_latency": 0.2533283233642578,
      "response_sent_to_client_ts": 1766930922.788476
    }
  }
]
====================================================================================================
.[CI Test Method] TestSkipTokenizerInitVLM.test_simple_decode
[2025-12-28 14:08:42] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 287, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-28 14:08:43] INFO:     127.0.0.1:45932 - "POST /generate HTTP/1.1" 200 OK
{
  "output_ids": [
    151645
  ],
  "meta_info": {
    "id": "65e649177a0142719c237146238fe1ec",
    "finish_reason": {
      "type": "stop",
      "matched": 151645
    },
    "prompt_tokens": 288,
    "weight_version": "default",
    "total_retractions": 0,
    "completion_tokens": 1,
    "cached_tokens": 287,
    "e2e_latency": 0.09384608268737793,
    "response_sent_to_client_ts": 1766930923.0011523
  }
}
====================================================================================================
.[CI Test Method] TestSkipTokenizerInitVLM.test_simple_decode_stream
.
======================================================================
ERROR: setUpClass (__main__.TestSkipTokenizerInit)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/sgl-workspace/sglang_dev3/./test/srt/test_skip_tokenizer_init.py", line 31, in setUpClass
    cls.process = popen_launch_server(
                  ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 667, in popen_launch_server
    raise Exception(
Exception: Server process exited with code 1. Check server logs for errors.

----------------------------------------------------------------------
Ran 5 tests in 48.155s
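
Note: the TestSkipTokenizerInit setUpClass error above is a gated-repo 401 for meta-llama/Llama-3.2-1B-Instruct, i.e. a Hugging Face authentication problem rather than something this PR touches. Authenticating before the run should clear it locally (standard huggingface_hub usage):

huggingface-cli login          # interactive login
# or, for non-interactive runs:
export HF_TOKEN=<your_token>
python ./test/srt/test_skip_tokenizer_init.py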

@yuan-luo
Collaborator Author

I can still reproduce this error on main without this PR, so I believe we can re-land it.

root@6996fb46042d:/sgl-workspace/sglang_dev3/test/srt# python ./test_skip_tokenizer_init.py
......
====================================================================================================
.[CI Test Method] TestSkipTokenizerInitVLM.test_simple_decode
[2025-12-28 14:44:04] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 287, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-28 14:44:04] INFO:     127.0.0.1:35586 - "POST /generate HTTP/1.1" 200 OK
{
  "output_ids": [
    151645
  ],
  "meta_info": {
    "id": "6892bd5ed632403dbb75ed42c280a6fb",
    "finish_reason": {
      "type": "stop",
      "matched": 151645
    },
    "prompt_tokens": 288,
    "weight_version": "default",
    "total_retractions": 0,
    "completion_tokens": 1,
    "cached_tokens": 287,
    "e2e_latency": 0.17981410026550293,
    "response_sent_to_client_ts": 1766933044.440148
  }
}
====================================================================================================
.[CI Test Method] TestSkipTokenizerInitVLM.test_simple_decode_stream
.
======================================================================
ERROR: setUpClass (__main__.TestSkipTokenizerInit)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/sgl-workspace/sglang_dev3/test/srt/./test_skip_tokenizer_init.py", line 31, in setUpClass
    cls.process = popen_launch_server(
                  ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 656, in popen_launch_server
    raise Exception(
Exception: Server process exited with code 1. Check server logs for errors.

----------------------------------------------------------------------
Ran 5 tests in 49.282s

FAILED (errors=1)

YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026

Labels

Multi-modal, multi-modal language model, run-ci, vlm
