
[Reland] perf: optimize qwen-vl with symm mem allreduce #11457

Merged
hnyls2002 merged 2 commits into sgl-project:main from antgroup:optimize_qwen2_vl
Oct 13, 2025

Conversation

@yuan-luo
Collaborator

@yuan-luo yuan-luo commented Oct 11, 2025

Motivation

The previous PR #11381 was reverted because it broke the vLLM dependency test; the following case failed: ./test/srt/test_gptqmodel_dynamic.py.

The root cause is that the previous PR included logic to fuse the MLP all-reduce into the next layer to improve performance, and that logic depends on dp_attention being enabled. Qwen2 does not support dp_attention, so the prerequisite check asserts. More details are below. This PR drops the fused all-reduce part and relands #11381.

For more details, please refer to #11381.
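For illustration, a minimal sketch of the failure mode (hypothetical names, not the actual sglang helpers): the fused path calls into dp-attention state that is never initialized when enable_dp_attention is off, so the prerequisite assert fires with the same message seen in the test log further down.

```python
# Hypothetical sketch of the failure mode described above, not the sglang code.

_DP_ATTENTION_INITIALIZED = False  # stays False when --enable-dp-attention is off


def get_attention_dp_group():
    # Prerequisite check inside the dp-attention helpers.
    assert _DP_ATTENTION_INITIALIZED, "dp attention not initialized!"


def fused_mlp_all_reduce(hidden_states):
    # The reverted fusion path reached a helper like this unconditionally,
    # so Qwen2 (which never enables dp_attention) tripped the assertion.
    get_attention_dp_group()
    return hidden_states
```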

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Summary by CodeRabbit

  • New Features
    • Added optional interleaved multimodal rotary embeddings, configurable via settings.
    • Expanded in-place all-reduce with an additional backend and adjusted memory sizing for larger transfers, improving flexibility and performance.
  • Bug Fixes
    • Ensured batch sequence length totals are consistently computed as integers across input types (see the sketch after this list).
  • Tests
    • Updated test runtime metadata for dependency suite to reflect longer execution time.
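A minimal sketch of the integer sequence-length total mentioned in the Bug Fixes item above (assumed helper name; the actual change is in schedule_batch.py and may differ in detail):

```python
import numpy as np
import torch


def seq_lens_sum(seq_lens) -> int:
    """Return the total sequence length as a plain Python int.

    Works whether seq_lens is a torch.Tensor, a numpy.ndarray, or a list,
    so downstream scheduling code never receives a 0-dim tensor by surprise.
    """
    if isinstance(seq_lens, torch.Tensor):
        return int(seq_lens.sum().item())
    if isinstance(seq_lens, np.ndarray):
        return int(seq_lens.sum())
    return int(sum(seq_lens))
```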

@gemini-code-assist
Contributor

Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request aims to successfully re-integrate critical performance optimizations for Qwen-VL, which were previously reverted due to an environmental test timeout. It ensures the stability of the testing suite by adjusting timeout configurations and enhances the model's distributed communication and rotary embedding mechanisms. The changes specifically improve the handling of all-reduce operations and introduce support for interleaved multimodal rotary embeddings, contributing to more efficient and accurate model execution.

Highlights

  • Reland of Previous Optimization: This pull request re-introduces performance optimizations for Qwen-VL that were previously reverted due to a test timeout issue. The root cause was identified as an insufficient timeout for the test_vllm_dependency.py.
  • Increased Test Timeout: The timeout for test_vllm_dependency.py has been significantly increased from 185 seconds to 800 seconds to prevent future false failures caused by long-running tests, especially when models have no cache.
  • Symmetric Memory All-Reduce Configuration: The maximum sizes for symmetric memory all-reduce operations have been adjusted for different GPU architectures (sm90 and sm100) and world sizes, allowing for larger data transfers (e.g., 32MB to 64MB for world size 4 on sm90).
  • All-Reduce Communication Prioritization: The _all_reduce_in_place function now prioritizes pynccl_comm and symm_mem_comm for all-reduce operations before falling back to torch.distributed.all_reduce, enhancing distributed communication efficiency (see the sketch after this list).
  • Interleaved Multimodal Rotary Embedding (MRoPE): A new function apply_interleaved_rope and a mrope_interleaved parameter have been added to MRotaryEmbedding to support interleaved frequency layouts for 3D rotary embeddings, which is crucial for certain multimodal models like Qwen2-VL.
  • MLP All-Reduce Fusion for Qwen2: The Qwen2MLP and Qwen2DecoderLayer have been updated to incorporate LayerCommunicator logic, enabling the fusion of MLP all-reduce operations with the next layer when applicable, which can improve performance.
  • Robust seq_lens_sum Calculation: The calculation of seq_lens_sum in schedule_batch.py has been made more robust to correctly handle both torch.Tensor and numpy.ndarray types for sequence lengths.
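The all-reduce prioritization highlighted above can be pictured with the following sketch; the attribute and predicate names here are assumptions for illustration, not the actual _all_reduce_in_place code in parallel_state.py, whose checks (e.g. per-architecture symmetric-memory size limits) differ.

```python
import torch
import torch.distributed as dist


def all_reduce_in_place_sketch(group, tensor: torch.Tensor) -> None:
    """Try the fast communicators first, then fall back (illustrative only)."""
    if group.pynccl_comm is not None and not group.pynccl_comm.disabled:
        # Fast path 1: the PyNccl communicator.
        group.pynccl_comm.all_reduce(tensor)
    elif group.symm_mem_comm is not None and group.symm_mem_comm.should_use(tensor):
        # Fast path 2: torch symmetric-memory all-reduce for eligible sizes.
        group.symm_mem_comm.all_reduce(tensor)
    else:
        # Fallback: plain torch.distributed all-reduce on the device group.
        dist.all_reduce(tensor, group=group.device_group)
```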

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request relands performance optimizations for qwen-vl using symmetric memory all-reduce and also fixes a flaky test by increasing its timeout. The changes involve enabling symm_mem_comm in the parallel state, adding support for interleaved MRoPE, and integrating LayerCommunicator for all-reduce fusion in the Qwen2 model. My review has identified a critical correctness bug in the all-reduce implementation and a potential performance improvement. Please see the detailed comments below.

Comment thread python/sglang/srt/distributed/parallel_state.py
Comment thread python/sglang/srt/layers/rotary_embedding.py
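As a purely conceptual aid to the rotary_embedding.py thread above: with a 3-way mrope_section (time, height, width), the contiguous layout assigns whole blocks of rotary frequency slots to each axis, while the interleaved layout cycles the axes slot by slot. A toy sketch under that assumption (not the apply_interleaved_rope implementation):

```python
def mrope_slot_layout(num_slots: int, interleaved: bool) -> list[str]:
    """Toy illustration only: which 3D position (t = time, h = height,
    w = width) drives each rotary frequency slot. The real
    apply_interleaved_rope works from the model's mrope_section config,
    which need not split the slots evenly."""
    dims = ["t", "h", "w"]
    if interleaved:
        # Interleaved layout: t, h, w, t, h, w, ... across the frequency dim.
        return [dims[i % 3] for i in range(num_slots)]
    # Contiguous layout: a block of t slots, then h slots, then w slots.
    block = num_slots // 3
    return ["t"] * block + ["h"] * block + ["w"] * (num_slots - 2 * block)


# e.g. mrope_slot_layout(6, interleaved=True)  -> ['t', 'h', 'w', 't', 'h', 'w']
#      mrope_slot_layout(6, interleaved=False) -> ['t', 't', 'h', 'h', 'w', 'w']
```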
@yuan-luo yuan-luo changed the title [Reland] perf: optimize qwen-vl with symm mem allreduce [WIP][Reland] perf: optimize qwen-vl with symm mem allreduce Oct 11, 2025
@yuan-luo yuan-luo marked this pull request as draft October 11, 2025 07:06
@yuan-luo
Collaborator Author

The vLLM dependency CI test includes this case: ./test/srt/test_gptqmodel_dynamic.py
Reproduced the failure locally; investigating.

➜  sglang_dev2 git:(optimize_qwen2_vl) ✗ python ./test/srt/test_gptqmodel_dynamic.py
Auto-configed device: cuda
command=python3 -m sglang.launch_server --model-path ModelCloud/Qwen1.5-1.8B-Chat-GPTQ-4bits-dynamic-cfg-with-lm_head-symFalse --dtype float16 --device cuda --host 127.0.0.1 --port 8000
INFO 10-11 00:09:11 [__init__.py:216] Automatically detected platform cuda.
config.json: 1.70kB [00:00, 8.67MB/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 205/205 [00:00<00:00, 1.09MB/s]
`torch_dtype` is deprecated! Use `dtype` instead!
WARNING:sglang.srt.configs.model_config:Casting torch.bfloat16 to torch.float16.
WARNING:sglang.srt.configs.model_config:gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-10-11 00:09:13] server_args=ServerArgs(model_path='ModelCloud/Qwen1.5-1.8B-Chat-GPTQ-4bits-dynamic-cfg-with-lm_head-symFalse', tokenizer_path='ModelCloud/Qwen1.5-1.8B-Chat-GPTQ-4bits-dynamic-cfg-with-lm_head-symFalse', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, modelopt_quant=None, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=8000, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='float16', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, mem_fraction_static=0.857, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', enable_priority_scheduling=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=1, pp_size=1, pp_max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=467869350, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, crash_on_nan=False, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, enable_trace=False, oltp_traces_endpoint='localhost:4317', api_key=None, served_model_name='ModelCloud/Qwen1.5-1.8B-Chat-GPTQ-4bits-dynamic-cfg-with-lm_head-symFalse', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', load_watch_interval=0.1, prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', max_lora_chunk_size=16, attention_backend=None, decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, nsa_prefill='flashmla_prefill', nsa_decode='fa3', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, 
speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=512, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, 
disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, enable_pdmux=False, sm_group_num=3)
[2025-10-11 00:09:13] Casting torch.bfloat16 to torch.float16.
[2025-10-11 00:09:13] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
tokenizer_config.json: 1.35kB [00:00, 11.0MB/s]
vocab.json: 2.78MB [00:00, 13.9MB/s]
merges.txt: 1.67MB [00:00, 14.3MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11.4M/11.4M [00:01<00:00, 10.7MB/s]
added_tokens.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 80.0/80.0 [00:00<00:00, 993kB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 245/245 [00:00<00:00, 1.86MB/s]
[2025-10-11 00:09:18] Using default HuggingFace chat template with detected content format: string
INFO 10-11 00:09:21 [__init__.py:216] Automatically detected platform cuda.
INFO 10-11 00:09:21 [__init__.py:216] Automatically detected platform cuda.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-10-11 00:09:22] Casting torch.bfloat16 to torch.float16.
[2025-10-11 00:09:22] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-10-11 00:09:23] Casting torch.bfloat16 to torch.float16.
[2025-10-11 00:09:23] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-10-11 00:09:23] Attention backend not explicitly specified. Use flashinfer backend by default.
[2025-10-11 00:09:23] Init torch distributed begin.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-10-11 00:09:23] Init torch distributed ends. mem usage=0.00 GB
[2025-10-11 00:09:23] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2025-10-11 00:09:23] Current Python version 3.10 is below the recommended 3.11 version. It is recommended to upgrade to Python 3.11 or higher for the best experience.
[2025-10-11 00:09:24] Load weight begin. avail mem=177.74 GB
[2025-10-11 00:09:24] Using model weights format ['*.safetensors']
model.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.68G/1.68G [00:10<00:00, 167MB/s]
[2025-10-11 00:09:35] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00,  2.02s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00,  2.02s/it]

[2025-10-11 00:09:38] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=176.01 GB, mem usage=1.73 GB.
[2025-10-11 00:09:38] Using KV cache dtype: torch.float16
[2025-10-11 00:09:38] KV Cache is allocated. #tokens: 822419, K size: 75.29 GB, V size: 75.29 GB
[2025-10-11 00:09:38] Memory pool end. avail mem=24.78 GB
[2025-10-11 00:09:38] Capture cuda graph begin. This can take up to several minutes. avail mem=24.09 GB
[2025-10-11 00:09:38] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512]
Capturing batches (bs=1 avail_mem=22.30 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52/52 [00:20<00:00,  2.54it/s]
[2025-10-11 00:09:59] Capture cuda graph end. Time elapsed: 21.21 s. mem usage=1.81 GB. avail mem=22.29 GB.
[2025-10-11 00:10:00] max_total_num_tokens=822419, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=32768, available_gpu_mem=22.29 GB
[2025-10-11 00:10:01] INFO:     Started server process [90955]
[2025-10-11 00:10:01] INFO:     Waiting for application startup.
[2025-10-11 00:10:01] Using default chat sampling params from model generation config: {'repetition_penalty': 1.1, 'temperature': 1.0, 'top_k': 50, 'top_p': 0.8}
[2025-10-11 00:10:01] Using default chat sampling params from model generation config: {'repetition_penalty': 1.1, 'temperature': 1.0, 'top_k': 50, 'top_p': 0.8}
[2025-10-11 00:10:01] INFO:     Application startup complete.
[2025-10-11 00:10:01] INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
[2025-10-11 00:10:02] INFO:     127.0.0.1:53100 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-10-11 00:10:02] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-11 00:10:02] INFO:     127.0.0.1:53110 - "POST /generate HTTP/1.1" 200 OK
[2025-10-11 00:10:02] The server is fired up and ready to roll!
[2025-10-11 00:10:04] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-11 00:10:05] INFO:     127.0.0.1:53114 - "GET /health_generate HTTP/1.1" 200 OK
INFO 10-11 00:10:05 [__init__.py:216] Automatically detected platform cuda.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
`torch_dtype` is deprecated! Use `dtype` instead!
WARNING:sglang.srt.configs.model_config:Casting torch.bfloat16 to torch.float16.
WARNING:sglang.srt.configs.model_config:gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING:sglang.srt.configs.model_config:Casting torch.bfloat16 to torch.float16.
WARNING:sglang.srt.configs.model_config:gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING:sglang.srt.layers.moe.utils:MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
/usr/local/lib/python3.10/dist-packages/nvidia_cutlass_dsl/python_packages/cutlass/utils/__init__.py:22: DeprecationWarning: SMEM_CAPACITY is deprecated: Use get_smem_capacity_in_bytes from cutlass.utils.smem_capacity instead
  from .blackwell_helpers import (
/usr/local/lib/python3.10/dist-packages/nvidia_cutlass_dsl/python_packages/cutlass/utils/__init__.py:34: DeprecationWarning: SMEM_CAPACITY is deprecated: Use get_smem_capacity_in_bytes from cutlass.utils.smem_capacity instead
  from .hopper_helpers import (
/usr/local/lib/python3.10/dist-packages/flash_attn/cute/flash_fwd.py:19: DeprecationWarning: SMEM_CAPACITY is deprecated: Use get_smem_capacity_in_bytes from cutlass.utils.smem_capacity instead
  import cutlass.utils.ampere_helpers as sm80_utils_basic
WARNING:sglang.srt.layers.attention.fla.utils:Current Python version 3.10 is below the recommended 3.11 version. It is recommended to upgrade to Python 3.11 or higher for the best experience.
E[2025-10-11 00:10:09] Prefill batch. #new-seq: 1, #new-token: 3, #cached-token: 2, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-11 00:10:09] Decode batch. #running-req: 1, #token: 37, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4.46, #queue-req: 0,
[2025-10-11 00:10:09] INFO:     127.0.0.1:46770 - "POST /generate HTTP/1.1" 200 OK
result = `{'text': ' ______.(\u3000\u3000)\nA. Paris\nB. London\nC. Tokyo\nD. New York\n答案:A\nParis巴黎,London伦敦,Tokyo东京,New York纽约。根据常识可知,法国的首都是巴黎。\n解析:法国的首都是巴黎。', 'output_ids': [785, 6722, 315, 9625, 374, 32671, 13, 9909, 22441, 22441, 23083, 32, 13, 12095, 198, 33, 13, 7148, 198, 34, 13, 26194, 198, 35, 13, 1532, 4261, 198, 102349, 5122, 32, 198, 59604, 106004, 3837, 39572, 107074, 3837, 52854, 16032, 107513, 3837, 3564, 4261, 106357, 1773, 100345, 107537, 115059, 3837, 104328, 9370, 59975, 100132, 106004, 8997, 106637, 5122, 104328, 9370, 59975, 100132, 106004, 1773, 151643], 'meta_info': {'id': '7db32a3e6d8b4ee8b379c6c21a321a85', 'finish_reason': {'type': 'stop', 'matched': 151643}, 'prompt_tokens': 5, 'weight_version': 'default', 'completion_tokens': 60, 'cached_tokens': 2, 'e2e_latency': 0.17472076416015625}}`
Throughput: 1423.8370358552645 tokens/s
.Auto-configed device: cuda
command=python3 -m sglang.launch_server --model-path ModelCloud/Qwen1.5-1.8B-Chat-GPTQ-4bits-dynamic-cfg-with-lm_head-symTrue --dtype bfloat16 --device cuda --host 127.0.0.1 --port 8000
INFO 10-11 00:10:16 [__init__.py:216] Automatically detected platform cuda.
config.json: 1.68kB [00:00, 7.33MB/s]
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-10-11 00:10:17] server_args=ServerArgs(model_path='ModelCloud/Qwen1.5-1.8B-Chat-GPTQ-4bits-dynamic-cfg-with-lm_head-symTrue', tokenizer_path='ModelCloud/Qwen1.5-1.8B-Chat-GPTQ-4bits-dynamic-cfg-with-lm_head-symTrue', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, modelopt_quant=None, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=8000, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='bfloat16', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, mem_fraction_static=0.857, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', enable_priority_scheduling=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=1, pp_size=1, pp_max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=649705135, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, crash_on_nan=False, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, enable_trace=False, oltp_traces_endpoint='localhost:4317', api_key=None, served_model_name='ModelCloud/Qwen1.5-1.8B-Chat-GPTQ-4bits-dynamic-cfg-with-lm_head-symTrue', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', load_watch_interval=0.1, prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', max_lora_chunk_size=16, attention_backend=None, decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, nsa_prefill='flashmla_prefill', nsa_decode='fa3', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, 
speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=512, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, 
disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, enable_pdmux=False, sm_group_num=3)
[2025-10-11 00:10:17] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
tokenizer_config.json: 1.35kB [00:00, 7.01MB/s]
vocab.json: 2.78MB [00:00, 12.5MB/s]
merges.txt: 1.67MB [00:00, 11.7MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11.4M/11.4M [00:00<00:00, 14.1MB/s]
added_tokens.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 80.0/80.0 [00:00<00:00, 706kB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 245/245 [00:00<00:00, 2.45MB/s]
[2025-10-11 00:10:22] Using default HuggingFace chat template with detected content format: string
INFO 10-11 00:10:25 [__init__.py:216] Automatically detected platform cuda.
INFO 10-11 00:10:25 [__init__.py:216] Automatically detected platform cuda.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-10-11 00:10:26] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
[2025-10-11 00:10:26] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
[2025-10-11 00:10:26] Attention backend not explicitly specified. Use flashinfer backend by default.
[2025-10-11 00:10:26] Init torch distributed begin.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-10-11 00:10:27] Init torch distributed ends. mem usage=0.00 GB
[2025-10-11 00:10:27] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2025-10-11 00:10:27] Current Python version 3.10 is below the recommended 3.11 version. It is recommended to upgrade to Python 3.11 or higher for the best experience.
[2025-10-11 00:10:27] Load weight begin. avail mem=176.39 GB
[2025-10-11 00:10:28] Using model weights format ['*.safetensors']
model.safetensors:  18%|████████████████████████████████▊                                                                                                                                                       | 583M/3.27G [00:02<00:07, 337MB/s]model.safetensors:  99%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 3.25G/3.27G [00:42<00:02, 9.96MB/s]

model.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.27G/3.27G [00:44<00:00, 72.7MB/s]
[2025-10-11 00:11:13] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.14it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.14it/s]

[2025-10-11 00:11:15] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=173.13 GB, mem usage=3.27 GB.
[2025-10-11 00:11:15] Using KV cache dtype: torch.bfloat16
[2025-10-11 00:11:15] KV Cache is allocated. #tokens: 807755, K size: 73.95 GB, V size: 73.95 GB
[2025-10-11 00:11:15] Memory pool end. avail mem=24.62 GB
[2025-10-11 00:11:15] Capture cuda graph begin. This can take up to several minutes. avail mem=23.97 GB
[2025-10-11 00:11:15] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512]
Capturing batches (bs=1 avail_mem=22.85 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52/52 [00:22<00:00,  2.28it/s]
[2025-10-11 00:11:38] Capture cuda graph end. Time elapsed: 23.46 s. mem usage=1.15 GB. avail mem=22.83 GB.
[2025-10-11 00:11:39] max_total_num_tokens=807755, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=32768, available_gpu_mem=22.83 GB
[2025-10-11 00:11:40] INFO:     Started server process [92518]
[2025-10-11 00:11:40] INFO:     Waiting for application startup.
[2025-10-11 00:11:40] INFO:     Application startup complete.
[2025-10-11 00:11:40] INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
[2025-10-11 00:11:41] INFO:     127.0.0.1:42380 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-10-11 00:11:41] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-11 00:11:50] INFO:     127.0.0.1:35544 - "GET /health_generate HTTP/1.1" 503 Service Unavailable
[2025-10-11 00:12:00] INFO:     127.0.0.1:42382 - "POST /generate HTTP/1.1" 200 OK
[2025-10-11 00:12:00] The server is fired up and ready to roll!
[2025-10-11 00:12:00] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-11 00:12:01] INFO:     127.0.0.1:52584 - "GET /health_generate HTTP/1.1" 200 OK
WARNING:sglang.srt.configs.model_config:Casting torch.bfloat16 to torch.float16.
WARNING:sglang.srt.configs.model_config:Casting torch.bfloat16 to torch.float16.
E[2025-10-11 00:12:01] Prefill batch. #new-seq: 1, #new-token: 3, #cached-token: 2, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-11 00:12:01] Decode batch. #running-req: 1, #token: 37, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1.83, #queue-req: 0,
[2025-10-11 00:12:01] Decode batch. #running-req: 1, #token: 77, token usage: 0.00, cuda graph: True, gen throughput (token/s): 586.98, #queue-req: 0,
[2025-10-11 00:12:01] Decode batch. #running-req: 1, #token: 117, token usage: 0.00, cuda graph: True, gen throughput (token/s): 567.16, #queue-req: 0,
[2025-10-11 00:12:01] Decode batch. #running-req: 1, #token: 157, token usage: 0.00, cuda graph: True, gen throughput (token/s): 559.16, #queue-req: 0,
[2025-10-11 00:12:01] Decode batch. #running-req: 1, #token: 197, token usage: 0.00, cuda graph: True, gen throughput (token/s): 564.11, #queue-req: 0,
[2025-10-11 00:12:01] Decode batch. #running-req: 1, #token: 237, token usage: 0.00, cuda graph: True, gen throughput (token/s): 557.69, #queue-req: 0,
[2025-10-11 00:12:02] INFO:     127.0.0.1:52590 - "POST /generate HTTP/1.1" 200 OK
result = `{'text': ' ________.(\u3000\u3000)\nA. Paris\nB. London\nC. Tokyo\nD. Beijing\n答案:A\n考查英文常识.A.Paris巴黎;B.London伦敦;C.Tokyo东京;D.Beijing北京.根据常识可知,法国的首都是巴黎.\n解析:法国的首都是巴黎.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地', 'output_ids': [785, 6722, 315, 9625, 374, 32671, 563, 58883, 9909, 22441, 22441, 23083, 32, 13, 12095, 198, 33, 13, 7148, 198, 34, 13, 26194, 198, 35, 13, 26549, 198, 102349, 5122, 32, 198, 117246, 105205, 107537, 58883, 32, 58883, 59604, 106004, 24968, 33, 58883, 39572, 107074, 24968, 34, 58883, 52854, 16032, 107513, 24968, 35, 58883, 3430, 23649, 68990, 58883, 100345, 107537, 115059, 3837, 104328, 9370, 59975, 100132, 106004, 58883, 198, 106637, 5122, 104328, 9370, 59975, 100132, 106004, 58883, 151643, 107267, 101066, 3837, 107267, 101300, 58883, 151643, 107267, 101066, 3837, 107267, 101300, 58883, 151643, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267], 'meta_info': {'id': '49f25ebaeb9247498d6462d9baa00a2d', 'finish_reason': {'type': 'length', 'length': 256}, 'prompt_tokens': 5, 'weight_version': 'default', 'completion_tokens': 256, 'cached_tokens': 2, 'e2e_latency': 0.5100152492523193}}`
Throughput: 498.54249447648306 tokens/s
.
======================================================================
ERROR: test_gptq_module (__main__.TestGPTQModelDynamic)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/utils/common.py", line 2318, in retry
    return fn()
  File "/usr/local/lib/python3.10/dist-packages/sglang/test/test_utils.py", line 1617, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
AssertionError: dp attention not initialized!

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sglang/test/test_utils.py", line 1616, in _callTestMethod
    retry(
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/utils/common.py", line 2321, in retry
    raise Exception(f"retry() exceed maximum number of retries.")
Exception: retry() exceed maximum number of retries.

======================================================================
ERROR: test_gptq_marlin_module (__main__.TestGPTQModelDynamicWithMarlin)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/utils/common.py", line 2318, in retry
    return fn()
  File "/usr/local/lib/python3.10/dist-packages/sglang/test/test_utils.py", line 1617, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
AssertionError: dp attention not initialized!

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sglang/test/test_utils.py", line 1616, in _callTestMethod
    retry(
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/utils/common.py", line 2321, in retry
    raise Exception(f"retry() exceed maximum number of retries.")
Exception: retry() exceed maximum number of retries.

----------------------------------------------------------------------
Ran 4 tests in 177.469s

FAILED (errors=2)
Exception ignored in atexit callback: <function move_cutlass_compiled_cache at 0x7abeaa32dcf0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/codegen/cuda/cutlass_utils.py", line 39, in move_cutlass_compiled_cache
    if not os.path.exists(cutlass.CACHE_FILE):
AttributeError: module 'cutlass' has no attribute 'CACHE_FILE'
[rank0]:[W1011 00:12:02.047126220 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

@yuan-luo
Collaborator Author

Tracking the backtrace in question:

➜  sglang_dev2 git:(optimize_qwen2_vl) ✗ pytest -k test_gptq_module -x --maxfail=1 --pdb -vv -s test/srt/test_gptqmodel_dynamic.py
=============================================================================================================== test session starts ===============================================================================================================
platform linux -- Python 3.10.12, pytest-8.4.1, pluggy-1.6.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /sgl-workspace/sglang_dev2/test
configfile: pytest.ini
plugins: typeguard-4.4.4, anyio-4.9.0
collected 4 items / 3 deselected / 1 selected

test/srt/test_gptqmodel_dynamic.py::TestGPTQModelDynamic::test_gptq_module Auto-configed device: cuda
command=python3 -m sglang.launch_server --model-path ModelCloud/Qwen1.5-1.8B-Chat-GPTQ-4bits-dynamic-cfg-with-lm_head-symFalse --dtype float16 --device cuda --host 127.0.0.1 --port 8000
INFO 10-11 08:05:04 [__init__.py:216] Automatically detected platform cuda.
`torch_dtype` is deprecated! Use `dtype` instead!
WARNING:sglang.srt.configs.model_config:Casting torch.bfloat16 to torch.float16.
WARNING:sglang.srt.configs.model_config:gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-10-11 08:05:06] server_args=ServerArgs(model_path='ModelCloud/Qwen1.5-1.8B-Chat-GPTQ-4bits-dynamic-cfg-with-lm_head-symFalse', tokenizer_path='ModelCloud/Qwen1.5-1.8B-Chat-GPTQ-4bits-dynamic-cfg-with-lm_head-symFalse', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, modelopt_quant=None, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=8000, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='float16', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, mem_fraction_static=0.857, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', enable_priority_scheduling=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=1, pp_size=1, pp_max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=200603526, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, crash_on_nan=False, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, enable_trace=False, oltp_traces_endpoint='localhost:4317', api_key=None, served_model_name='ModelCloud/Qwen1.5-1.8B-Chat-GPTQ-4bits-dynamic-cfg-with-lm_head-symFalse', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', load_watch_interval=0.1, prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', max_lora_chunk_size=16, attention_backend=None, decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, nsa_prefill='flashmla_prefill', nsa_decode='fa3', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, 
speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=512, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, 
disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, enable_pdmux=False, sm_group_num=3)
[2025-10-11 08:05:06] Casting torch.bfloat16 to torch.float16.
[2025-10-11 08:05:06] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-10-11 08:05:07] Using default HuggingFace chat template with detected content format: string
INFO 10-11 08:05:13 [__init__.py:216] Automatically detected platform cuda.
INFO 10-11 08:05:13 [__init__.py:216] Automatically detected platform cuda.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-10-11 08:05:14] Casting torch.bfloat16 to torch.float16.
[2025-10-11 08:05:14] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-10-11 08:05:15] Casting torch.bfloat16 to torch.float16.
[2025-10-11 08:05:15] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-10-11 08:05:15] Attention backend not explicitly specified. Use flashinfer backend by default.
[2025-10-11 08:05:15] Init torch distributed begin.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-10-11 08:05:15] Init torch distributed ends. mem usage=0.00 GB
[2025-10-11 08:05:15] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2025-10-11 08:05:16] Current Python version 3.10 is below the recommended 3.11 version. It is recommended to upgrade to Python 3.11 or higher for the best experience.
[2025-10-11 08:05:16] Load weight begin. avail mem=177.74 GB
[2025-10-11 08:05:16] Using model weights format ['*.safetensors']
[2025-10-11 08:05:17] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00,  2.05s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00,  2.05s/it]

[2025-10-11 08:05:19] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=176.01 GB, mem usage=1.73 GB.
[2025-10-11 08:05:19] Using KV cache dtype: torch.float16
[2025-10-11 08:05:20] KV Cache is allocated. #tokens: 822419, K size: 75.29 GB, V size: 75.29 GB
[2025-10-11 08:05:20] Memory pool end. avail mem=24.78 GB
[2025-10-11 08:05:20] Capture cuda graph begin. This can take up to several minutes. avail mem=24.09 GB
[2025-10-11 08:05:20] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512]
Capturing batches (bs=1 avail_mem=22.30 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52/52 [00:04<00:00, 10.50it/s]
[2025-10-11 08:05:26] Capture cuda graph end. Time elapsed: 5.64 s. mem usage=1.81 GB. avail mem=22.29 GB.
[2025-10-11 08:05:26] max_total_num_tokens=822419, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=32768, available_gpu_mem=22.29 GB
[2025-10-11 08:05:27] INFO:     Started server process [97146]
[2025-10-11 08:05:27] INFO:     Waiting for application startup.
[2025-10-11 08:05:27] Using default chat sampling params from model generation config: {'repetition_penalty': 1.1, 'temperature': 1.0, 'top_k': 50, 'top_p': 0.8}
[2025-10-11 08:05:27] Using default chat sampling params from model generation config: {'repetition_penalty': 1.1, 'temperature': 1.0, 'top_k': 50, 'top_p': 0.8}
[2025-10-11 08:05:27] INFO:     Application startup complete.
[2025-10-11 08:05:27] INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
[2025-10-11 08:05:27] INFO:     127.0.0.1:59038 - "GET /health_generate HTTP/1.1" 503 Service Unavailable
[2025-10-11 08:05:28] INFO:     127.0.0.1:59050 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-10-11 08:05:28] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-11 08:05:28] INFO:     127.0.0.1:59062 - "POST /generate HTTP/1.1" 200 OK
[2025-10-11 08:05:28] The server is fired up and ready to roll!
[2025-10-11 08:05:37] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-11 08:05:38] INFO:     127.0.0.1:48488 - "GET /health_generate HTTP/1.1" 200 OK
INFO 10-11 08:05:38 [__init__.py:216] Automatically detected platform cuda.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
`torch_dtype` is deprecated! Use `dtype` instead!
FAILED
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> captured log >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
WARNING  sglang.srt.configs.model_config:model_config.py:786 Casting torch.bfloat16 to torch.float16.
WARNING  sglang.srt.configs.model_config:model_config.py:621 gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING  sglang.srt.configs.model_config:model_config.py:786 Casting torch.bfloat16 to torch.float16.
WARNING  sglang.srt.configs.model_config:model_config.py:621 gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING  sglang.srt.layers.moe.utils:utils.py:154 MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
WARNING  sglang.srt.layers.attention.fla.utils:utils.py:53 Current Python version 3.10 is below the recommended 3.11 version. It is recommended to upgrade to Python 3.11 or higher for the best experience.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> traceback >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

self = <test_gptqmodel_dynamic.TestGPTQModelDynamic testMethod=test_gptq_module>

    def test_gptq_module(self):
>       check_quant_method(self.MODEL_PATH, use_marlin_kernel=False)

test/srt/test_gptqmodel_dynamic.py:145:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/srt/test_gptqmodel_dynamic.py:50: in check_quant_method
    model = get_model(
/usr/local/lib/python3.10/dist-packages/sglang/srt/model_loader/__init__.py:28: in get_model
    return loader.load_model(
/usr/local/lib/python3.10/dist-packages/sglang/srt/model_loader/loader.py:576: in load_model
    model = _initialize_model(
/usr/local/lib/python3.10/dist-packages/sglang/srt/model_loader/loader.py:252: in _initialize_model
    return model_class(
/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py:460: in __init__
    self.model = Qwen2Model(
/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py:325: in __init__
    self.layers, self.start_layer, self.end_layer = make_layers(
/usr/local/lib/python3.10/dist-packages/sglang/srt/utils/common.py:515: in make_layers
    + get_offloader().wrap_modules(
/usr/local/lib/python3.10/dist-packages/sglang/srt/utils/offloader.py:36: in wrap_modules
    return list(all_modules_generator)
/usr/local/lib/python3.10/dist-packages/sglang/srt/utils/common.py:517: in <genexpr>
    layer_fn(idx=idx, prefix=add_prefix(idx, prefix))
/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py:327: in <lambda>
    lambda idx, prefix: decoder_layer_type(
/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2.py:258: in __init__
    self.layer_communicator = LayerCommunicator(
/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/communicator.py:190: in __init__
    self._context = CommunicateContext.init_new()
/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/communicator.py:364: in init_new
    attn_tp_rank = get_attention_tp_rank()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def get_attention_tp_rank() -> int:
>       assert _ATTN_TP_RANK is not None, "dp attention not initialized!"
E       AssertionError: dp attention not initialized!

/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/dp_attention.py:290: AssertionError

During handling of the above exception, another exception occurred:
/usr/local/lib/python3.10/dist-packages/sglang/test/test_utils.py:1616: in _callTestMethod
    retry(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

fn = <function CustomTestCase._callTestMethod.<locals>.<lambda> at 0x7cc7b95430a0>, max_retry = 0, initial_delay = 2.0, max_delay = 60.0, should_retry = <function <lambda> at 0x7ccaa00eb010>

    def retry(
        fn,
        max_retry: int,
        initial_delay: float = 2.0,
        max_delay: float = 60.0,
        should_retry: Callable[[Any], bool] = lambda e: True,
    ):
        for try_index in itertools.count():
            try:
                return fn()
            except Exception as e:
                if try_index >= max_retry:
>                   raise Exception(f"retry() exceed maximum number of retries.")
E                   Exception: retry() exceed maximum number of retries.

/usr/local/lib/python3.10/dist-packages/sglang/srt/utils/common.py:2321: Exception
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> entering PDB >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> PDB post_mortem >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> /usr/local/lib/python3.10/dist-packages/sglang/srt/utils/common.py(2321)retry()
-> raise Exception(f"retry() exceed maximum number of retries.")

@yuan-luo
Collaborator Author

yuan-luo commented Oct 11, 2025

The reason is that Qwen2 doesn't support dp_attention.
Currently, only specific models such as Qwen2/3 MoE, DeepSeek V2, and a few others support dp_attention.
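For illustration, a minimal sketch of the kind of guard that avoids this assertion: only touch attention-TP helpers such as get_attention_tp_rank() after dp_attention has been initialized, and otherwise fall back to a plain tensor-parallel all-reduce. The dp_attention_initialized helper below is a hypothetical stand-in, not an sglang API.

```python
from typing import Optional

import torch
import torch.distributed as dist

# Hypothetical module state mirroring the _ATTN_TP_RANK flag seen in the traceback above.
_ATTN_TP_RANK: Optional[int] = None


def dp_attention_initialized() -> bool:
    # Illustrative stand-in for whatever state check the runtime actually performs.
    return _ATTN_TP_RANK is not None


def reduce_hidden_states(hidden_states: torch.Tensor, tp_group) -> torch.Tensor:
    if not dp_attention_initialized():
        # Safe default for models without dp_attention support (e.g. dense Qwen2):
        # a plain tensor-parallel all-reduce, no attention-TP helpers involved.
        dist.all_reduce(hidden_states, group=tp_group)
        return hidden_states
    # Only past this point is it safe to call helpers such as get_attention_tp_rank().
    ...
    return hidden_states
```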

@yuan-luo yuan-luo force-pushed the optimize_qwen2_vl branch 2 times, most recently from aaff096 to 4fc2cb5 Compare October 11, 2025 15:30
@coderabbitai
Contributor

coderabbitai Bot commented Oct 11, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Updates include adjusted symmetric-memory all-reduce size limits, a new symmetric-memory all-reduce path in parallel_state, interleaved multimodal RoPE support with a constructor parameter and helper, robust seq_lens_sum computation for non-tensor inputs, and a test metadata time change.

Changes

  • All-reduce configuration and path (python/sglang/srt/distributed/device_communicators/all_reduce_utils.py, python/sglang/srt/distributed/parallel_state.py): Updated SYMM_MEM_ALL_REDUCE_MAX_SIZES thresholds; added a symmetric-memory all-reduce branch in _all_reduce_in_place, selecting symm_mem_comm when available before falling back to the PyTorch distributed group.
  • Rotary embedding, multimodal interleaved RoPE (python/sglang/srt/layers/rotary_embedding.py): Added the apply_interleaved_rope helper; extended MRotaryEmbedding with an mrope_interleaved parameter and handling; validated/corrected mrope_section; applied the interleaved RoPE layout conditionally in forward; propagated the flag via get_rope.
  • Scheduler batch handling (python/sglang/srt/managers/schedule_batch.py): Ensured seq_lens_sum is computed as an int for tensor and non-tensor seq_lens_cpu using .sum().item() or np.asarray(...).sum().
  • Test metadata (scripts/sort_testcases_alphabetically.py): Increased estimated_time for test_vllm_dependency.py from 185 to 800.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant ParallelState as parallel_state._all_reduce_in_place
  participant SymmComm as symm_mem_comm
  participant NCCL as pynccl_comm
  participant Torch as torch.distributed

  Caller->>ParallelState: _all_reduce_in_place(input_)
  alt pynccl available and enabled
    ParallelState->>NCCL: all_reduce(input_)
    NCCL-->>ParallelState: done
  else symm_mem_comm available and enabled
    ParallelState->>SymmComm: all_reduce(input_)
    SymmComm-->>ParallelState: done
  else
    ParallelState->>Torch: all_reduce(input_, group=device_group)
    Torch-->>ParallelState: done
  end
  ParallelState-->>Caller: input_ reduced in-place
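For readers who prefer code over the diagram, a minimal Python sketch of the same selection order follows; the communicator objects and their disabled flags follow the names in the diagram, but this is an illustration rather than the exact parallel_state implementation (the out=input_ form reflects the in-place concern raised in the review comments further down).

```python
import torch
import torch.distributed as dist


def _all_reduce_in_place(input_: torch.Tensor, pynccl_comm, symm_mem_comm, device_group) -> None:
    # 1) Prefer the pynccl communicator when it is available and enabled.
    if pynccl_comm is not None and not pynccl_comm.disabled:
        pynccl_comm.all_reduce(input_)
    # 2) Otherwise try the symmetric-memory communicator; write back into the same
    #    buffer so the reduction really happens in place (see the review note below).
    elif symm_mem_comm is not None and not symm_mem_comm.disabled:
        symm_mem_comm.all_reduce(input_, out=input_)
    # 3) Fall back to the plain torch.distributed all-reduce on the device group.
    else:
        dist.all_reduce(input_, group=device_group)
```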
sequenceDiagram
  autonumber
  participant Caller
  participant MRoPE as MRotaryEmbedding.forward
  participant Helper as apply_interleaved_rope

  Caller->>MRoPE: forward(x, positions)
  alt positions.ndim == 2 and mrope_interleaved
    MRoPE->>Helper: transform cos, sin with mrope_section
    Helper-->>MRoPE: interleaved cos/sin
    MRoPE->>MRoPE: apply RoPE with interleaved layout
  else
    MRoPE->>MRoPE: standard per-section concat and apply
  end
  MRoPE-->>Caller: rotated embeddings
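A toy example of the chunked-to-interleaved reordering this diagram refers to, reusing the helper body quoted in the review comment further down; the mrope_section values here are made up for the demonstration, and whether the H/W source slices should be stepped or contiguous is exactly the question that review raises.

```python
import torch


def apply_interleaved_rope(x: torch.Tensor, mrope_section: list[int]) -> torch.Tensor:
    # Same logic as the helper quoted in the review below: start from the temporal
    # axis and overwrite every third slot with the H and W frequencies.
    x_t = x[0].clone()
    x_t[..., 1 : mrope_section[1] * 3 : 3] = x[1, ..., 1 : mrope_section[1] * 3 : 3]
    x_t[..., 2 : mrope_section[2] * 3 : 3] = x[2, ..., 2 : mrope_section[2] * 3 : 3]
    return x_t


mrope_section = [3, 3, 3]            # made-up T/H/W sections summing to rotary_dim // 2 = 9
x = torch.stack([
    torch.full((1, 9), 0.0),         # T frequencies
    torch.full((1, 9), 1.0),         # H frequencies
    torch.full((1, 9), 2.0),         # W frequencies
])
print(apply_interleaved_rope(x, mrope_section))
# tensor([[0., 1., 2., 0., 1., 2., 0., 1., 2.]]) -> T/H/W interleaved at a stride of 3
```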

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I twitch my ears at shifting bytes,
New ropes interleave in modal lights—
A whisper: reduce, but choose the lane,
Symmetric paths or torch to reign.
I thump, I hop, the tests run long,
Carrots queued, the code hops strong. 🥕🐇

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Description Check (⚠️ Warning): The pull request description includes the template sections and a clear Motivation, but it lacks any details under Modifications, Accuracy Tests, and Benchmarking and Profiling, leaving key change descriptions and results empty, so essential information about what was changed and its impact is missing. Resolution: fill in the Modifications section with details of the code changes, add relevant accuracy test results and benchmarking metrics where applicable, and update the checklist to reflect completed or inapplicable items.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 28.57%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
  • Title Check (✅ Passed): The pull request title accurately summarizes the primary change of re-landing a performance optimization for qwen-vl using symmetric memory all-reduce, matching the main intent of the changeset. It is concise, clear, and directly reflects the core update without extraneous detail.

Comment @coderabbitai help to get the list of available commands and usage tips.

@yuan-luo yuan-luo changed the title [WIP][Reland] perf: optimize qwen-vl with symm mem allreduce [Reland] perf: optimize qwen-vl with symm mem allreduce Oct 11, 2025
@yuan-luo
Collaborator Author

With the fix, the test case passed.

Capturing batches (bs=1 avail_mem=22.69 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52/52 [00:04<00:00, 11.91it/s]
[2025-10-11 09:11:03] Capture cuda graph end. Time elapsed: 5.11 s. mem usage=1.15 GB. avail mem=22.67 GB.
[2025-10-11 09:11:04] max_total_num_tokens=803065, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=32768, available_gpu_mem=22.67 GB
[2025-10-11 09:11:05] INFO:     Started server process [102776]
[2025-10-11 09:11:05] INFO:     Waiting for application startup.
[2025-10-11 09:11:05] INFO:     Application startup complete.
[2025-10-11 09:11:05] INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
[2025-10-11 09:11:06] INFO:     127.0.0.1:38992 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-10-11 09:11:06] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-11 09:11:06] INFO:     127.0.0.1:39006 - "POST /generate HTTP/1.1" 200 OK
[2025-10-11 09:11:06] The server is fired up and ready to roll!
[2025-10-11 09:11:07] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-11 09:11:08] INFO:     127.0.0.1:39018 - "GET /health_generate HTTP/1.1" 200 OK
WARNING:sglang.srt.configs.model_config:Casting torch.bfloat16 to torch.float16.
WARNING:sglang.srt.configs.model_config:Casting torch.bfloat16 to torch.float16.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00,  2.17s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00,  2.17s/it]

.[2025-10-11 09:11:12] Prefill batch. #new-seq: 1, #new-token: 3, #cached-token: 2, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-11 09:11:12] Decode batch. #running-req: 1, #token: 37, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5.28, #queue-req: 0,
[2025-10-11 09:11:12] Decode batch. #running-req: 1, #token: 77, token usage: 0.00, cuda graph: True, gen throughput (token/s): 556.67, #queue-req: 0,
[2025-10-11 09:11:12] Decode batch. #running-req: 1, #token: 117, token usage: 0.00, cuda graph: True, gen throughput (token/s): 564.76, #queue-req: 0,
[2025-10-11 09:11:12] Decode batch. #running-req: 1, #token: 157, token usage: 0.00, cuda graph: True, gen throughput (token/s): 574.29, #queue-req: 0,
[2025-10-11 09:11:12] Decode batch. #running-req: 1, #token: 197, token usage: 0.00, cuda graph: True, gen throughput (token/s): 573.34, #queue-req: 0,
[2025-10-11 09:11:12] Decode batch. #running-req: 1, #token: 237, token usage: 0.00, cuda graph: True, gen throughput (token/s): 562.13, #queue-req: 0,
[2025-10-11 09:11:12] INFO:     127.0.0.1:36056 - "POST /generate HTTP/1.1" 200 OK
result = `{'text': ' ________.(\u3000\u3000)\nA. Paris\nB. London\nC. Tokyo\nD. Beijing\n答案:A\n考查英文常识.A.Paris巴黎;B.London伦敦;C.Tokyo东京;D.Beijing北京.根据常识可知,法国的首都是巴黎.\n解析:法国的首都是巴黎.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地努力,不断地进步.不断地', 'output_ids': [785, 6722, 315, 9625, 374, 32671, 563, 58883, 9909, 22441, 22441, 23083, 32, 13, 12095, 198, 33, 13, 7148, 198, 34, 13, 26194, 198, 35, 13, 26549, 198, 102349, 5122, 32, 198, 117246, 105205, 107537, 58883, 32, 58883, 59604, 106004, 24968, 33, 58883, 39572, 107074, 24968, 34, 58883, 52854, 16032, 107513, 24968, 35, 58883, 3430, 23649, 68990, 58883, 100345, 107537, 115059, 3837, 104328, 9370, 59975, 100132, 106004, 58883, 198, 106637, 5122, 104328, 9370, 59975, 100132, 106004, 58883, 151643, 107267, 101066, 3837, 107267, 101300, 58883, 151643, 107267, 101066, 3837, 107267, 101300, 58883, 151643, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267, 101066, 3837, 107267, 101300, 58883, 107267], 'meta_info': {'id': 'b13e4fe4a62a4123a5d9f007d81689d6', 'finish_reason': {'type': 'length', 'length': 256}, 'prompt_tokens': 5, 'weight_version': 'default', 'completion_tokens': 256, 'cached_tokens': 2, 'e2e_latency': 0.5311167240142822}}`
Throughput: 475.9460023089085 tokens/s
.
----------------------------------------------------------------------
Ran 4 tests in 72.036s

OK
Exception ignored in atexit callback: <function move_cutlass_compiled_cache at 0x7d5a80d59d80>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/codegen/cuda/cutlass_utils.py", line 39, in move_cutlass_compiled_cache
    if not os.path.exists(cutlass.CACHE_FILE):
AttributeError: module 'cutlass' has no attribute 'CACHE_FILE'
[rank0]:[W1011 09:11:13.980602655 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

@yuan-luo yuan-luo marked this pull request as ready for review October 11, 2025 16:12
@yuan-luo yuan-luo added run-ci and removed run-ci labels Oct 11, 2025
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/sglang/srt/distributed/parallel_state.py (1)

588-601: Assertion misses symm_mem path in out-of-place reducer.

If only symm_mem is enabled, assert any([qr, ca, pymscclpp]) can trip. Include symm_mem_comm in the assertion.

Apply this diff:

-        assert any([qr_comm, ca_comm, pymscclpp_comm])
+        assert any([qr_comm, ca_comm, pymscclpp_comm, symm_mem_comm])
♻️ Duplicate comments (2)
python/sglang/srt/distributed/parallel_state.py (1)

606-611: Bug: symm_mem all-reduce not in-place (result discarded).

symm_mem_comm.all_reduce returns a new tensor unless out is provided. Current call does nothing to input_. Use the in-place output buffer.

Apply this diff:

-        elif symm_mem_comm is not None and not symm_mem_comm.disabled:
-            symm_mem_comm.all_reduce(input_)
+        elif symm_mem_comm is not None and not symm_mem_comm.disabled:
+            # Perform the reduction into the same buffer
+            symm_mem_comm.all_reduce(input_, out=input_)
python/sglang/srt/layers/rotary_embedding.py (1)

1011-1020: Interleaving logic and extra clone; propose safer, allocation‑efficient mapping.

Issues:

  • clone is unnecessary and adds allocation.
  • Slices like 1: mrope_section[i]*3:3 assume a 3x stride layout in the source x[i]; with chunked input [TT..|HH..|WW..], H/W sources should be taken from [:mrope_section[i]], not stepped positions. Also need bounds so dest indices don’t exceed D.

Use an explicit, bounded interleave into a fresh buffer to avoid aliasing:

-def apply_interleaved_rope(x: torch.Tensor, mrope_section: list[int]) -> torch.Tensor:
-    """Apply interleaved MRoPE to 3D rotary embeddings.
-    Reorganizes frequency layout from chunked [TTT...HHH...WWW] to
-    interleaved [THTHWHTHW...TT], preserving frequency continuity.
-    """
-    x_t = x[0].clone()
-    x_t[..., 1 : mrope_section[1] * 3 : 3] = x[1, ..., 1 : mrope_section[1] * 3 : 3]
-    x_t[..., 2 : mrope_section[2] * 3 : 3] = x[2, ..., 2 : mrope_section[2] * 3 : 3]
-    return x_t
+def apply_interleaved_rope(x: torch.Tensor, mrope_section: list[int]) -> torch.Tensor:
+    """
+    x: [3, ..., D] with chunked layout per axis:
+       x[0][..., :t] (T), x[1][..., :h] (H), x[2][..., :w] (W)
+    Returns: [..., D] with interleaved T,H,W at strides of 3; leftover T fills tail.
+    """
+    t, h, w = mrope_section
+    D = x.shape[-1]
+    out = torch.empty_like(x[0])
+    # Fill from T/H/W sources into 0/1/2 mod-3 slots respectively.
+    kt = min(t, (D + 2) // 3)
+    kh = min(h, (D - 1 + 2) // 3)  # ceil((D-1)/3)
+    kw = min(w, (D - 2 + 2) // 3)  # ceil((D-2)/3)
+    out[..., 0 : 3 * kt : 3] = x[0, ..., :kt]
+    out[..., 1 : 1 + 3 * kh : 3] = x[1, ..., :kh]
+    out[..., 2 : 2 + 3 * kw : 3] = x[2, ..., :kw]
+    # Fill remaining slots (if any) from the tail of T.
+    filled = max(3 * kt, 1 + 3 * kh, 2 + 3 * kw)
+    if filled < D and t > kt:
+        out[..., filled:D] = x[0, ..., kt : kt + (D - filled)]
+    return out

This avoids mutation of shared storage, removes clone, and matches the chunked→interleaved intent.

🧹 Nitpick comments (1)
python/sglang/srt/layers/rotary_embedding.py (1)

1041-1073: Replace print with logging; avoid stdout in library code.

These prints will spam logs and aren’t controllable. Use logger.warning/info and gate behind rank checks if needed.

Apply this diff:

-                print(
-                    f"MRoPE section sum mismatch: expected {expected_sum}, got {actual_sum}. "
-                    f"Adjusting mrope_section to match rotary_dim // 2 = {expected_sum}"
-                )
+                logger.warning(
+                    "MRoPE section sum mismatch: expected %d, got %d. Adjusting to %d",
+                    expected_sum, actual_sum, expected_sum,
+                )
...
-                print(
-                    f"Corrected mrope_section: {self.mrope_section} (sum={sum(self.mrope_section)})"
-                )
+                logger.info(
+                    "Corrected mrope_section: %s (sum=%d)",
+                    self.mrope_section, sum(self.mrope_section),
+                )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b5dcfd4 and 4fc2cb5e0d74e594ab4eeb961f9b8360002d41dd.

📒 Files selected for processing (5)
  • python/sglang/srt/distributed/device_communicators/all_reduce_utils.py (1 hunks)
  • python/sglang/srt/distributed/parallel_state.py (1 hunks)
  • python/sglang/srt/layers/rotary_embedding.py (4 hunks)
  • python/sglang/srt/managers/schedule_batch.py (1 hunks)
  • scripts/sort_testcases_alphabetically.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
scripts/sort_testcases_alphabetically.py (1)
test/srt/run_suite.py (1)
  • TestFile (9-11)
python/sglang/srt/distributed/parallel_state.py (2)
python/sglang/srt/distributed/device_communicators/symm_mem.py (1)
  • all_reduce (134-164)
python/sglang/srt/distributed/device_communicators/pynccl.py (1)
  • all_reduce (126-148)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (25)
  • GitHub Check: vllm-dependency-test
  • GitHub Check: run-all-notebooks
  • GitHub Check: build-test (all)
  • GitHub Check: unit-test-backend-1-gpu-amd (linux-mi325-gpu-1, 11)
  • GitHub Check: unit-test-backend-1-gpu-amd (linux-mi325-gpu-1, 8)
  • GitHub Check: unit-test-backend-1-gpu-amd (linux-mi325-gpu-1, 9)
  • GitHub Check: unit-test-backend-1-gpu-amd (linux-mi325-gpu-1, 10)
  • GitHub Check: unit-test-backend-1-gpu-amd (linux-mi325-gpu-1, 5)
  • GitHub Check: unit-test-backend-1-gpu-amd (linux-mi325-gpu-1, 7)
  • GitHub Check: unit-test-backend-1-gpu-amd (linux-mi325-gpu-1, 0)
  • GitHub Check: unit-test-backend-1-gpu-amd (linux-mi325-gpu-1, 2)
  • GitHub Check: unit-test-backend-1-gpu-amd (linux-mi325-gpu-1, 3)
  • GitHub Check: unit-test-backend-1-gpu-amd (linux-mi325-gpu-1, 1)
  • GitHub Check: unit-test-backend-1-gpu-amd (linux-mi325-gpu-1, 4)
  • GitHub Check: unit-test-backend-1-gpu-amd (linux-mi325-gpu-1, 6)
  • GitHub Check: unit-test-backend-8-gpu-amd (linux-mi300-gpu-8, 0)
  • GitHub Check: unit-test-sgl-kernel-amd (linux-mi325-gpu-1)
  • GitHub Check: unit-test-backend-2-gpu-amd (linux-mi325-gpu-2)
  • GitHub Check: unit-test-backend-8-gpu-amd (linux-mi300-gpu-8, 1)
  • GitHub Check: bench-test-2-gpu-amd (linux-mi325-gpu-2)
  • GitHub Check: performance-test-1-gpu-part-1-amd (linux-mi325-gpu-1)
  • GitHub Check: accuracy-test-1-gpu-amd (linux-mi325-gpu-1)
  • GitHub Check: performance-test-1-gpu-part-2-amd (linux-mi325-gpu-1)
  • GitHub Check: accuracy-test-2-gpu-amd (linux-mi325-gpu-2)
  • GitHub Check: mla-test-1-gpu-amd (linux-mi325-gpu-1)
🔇 Additional comments (3)
scripts/sort_testcases_alphabetically.py (1)

176-176: Bump to 800s looks fine; verify CI scheduling impact.

Large increase may reshuffle buckets or exceed per-job timeouts. Confirm this won’t cause queue starvation on vllm_dependency_test shard.

python/sglang/srt/distributed/device_communicators/all_reduce_utils.py (1)

6-8: Increased SYMM_MEM max sizes—please validate against device memory budget.

The larger caps should help throughput but can raise pressure on symmetric buffers. Sanity-check:

  • Typical per-rank tensor sizes that hit these buckets
  • Multi-model co-location and fragmentation behavior

If available, share a quick microbenchmark delta before/after.

Also applies to: 12-12
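
For intuition only, a sketch of what a per-architecture, per-world-size cap table of this kind can look like, together with the lookup guard that would consult it; every key and value here is a placeholder, and the real SYMM_MEM_ALL_REDUCE_MAX_SIZES in all_reduce_utils.py may be structured differently.

```python
MiB = 1024 * 1024

# Placeholder table: architectures, world sizes, and byte limits are all illustrative.
SYMM_MEM_ALL_REDUCE_MAX_SIZES = {
    "sm90": {4: 64 * MiB, 8: 64 * MiB},
    "sm100": {4: 32 * MiB, 8: 32 * MiB},
}


def symm_mem_allowed(arch: str, world_size: int, nbytes: int) -> bool:
    # Treat a missing bucket as "symmetric-memory all-reduce disabled" for that shape.
    limit = SYMM_MEM_ALL_REDUCE_MAX_SIZES.get(arch, {}).get(world_size, 0)
    return nbytes <= limit
```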

python/sglang/srt/managers/schedule_batch.py (1)

1576-1581: Robust seq_lens_sum computation LGTM.

Covers tensor and list paths; keeps an int. Please sanity-check callers (e.g., get_model_worker_batch) won’t assume seq_lens_cpu is always a Tensor.
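
For reference, a minimal sketch of the tensor-versus-list handling the walkthrough describes (.sum().item() for tensors, np.asarray(...).sum() otherwise); the function name is illustrative, not the exact sglang code.

```python
from typing import Sequence, Union

import numpy as np
import torch


def compute_seq_lens_sum(seq_lens_cpu: Union[torch.Tensor, Sequence[int]]) -> int:
    # Tensor path: .sum() returns a 0-dim tensor, so .item() yields a Python int.
    if isinstance(seq_lens_cpu, torch.Tensor):
        return int(seq_lens_cpu.sum().item())
    # Non-tensor path (list/tuple/ndarray): go through numpy and cast to int.
    return int(np.asarray(seq_lens_cpu).sum())


assert compute_seq_lens_sum(torch.tensor([3, 5, 2])) == 10
assert compute_seq_lens_sum([3, 5, 2]) == 10
```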

Collaborator

@JustinTong0323 JustinTong0323 left a comment

LGTM. But why would it affect the vllm test?

Comment thread scripts/sort_testcases_alphabetically.py Outdated
@yuan-luo yuan-luo enabled auto-merge (squash) October 13, 2025 02:02
@BBuf BBuf disabled auto-merge October 13, 2025 03:08
Comment thread python/sglang/srt/managers/schedule_batch.py Outdated
@hnyls2002 hnyls2002 merged commit 0b6f535 into sgl-project:main Oct 13, 2025
81 of 96 checks passed
mwcrutcher pushed a commit to crutcher-ai/sglang that referenced this pull request Oct 15, 2025
…11457)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
lpc0220 pushed a commit to lpc0220/sglang that referenced this pull request Oct 29, 2025
…11457)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
