
[Qwen3.5] Fuse split/reshape/cat ops in GDN projection with Triton kernel #21019

Merged
BBuf merged 3 commits into sgl-project:main from antgroup:optimize_qwen35_proj on Mar 23, 2026

Conversation

@yuan-luo
Collaborator

@yuan-luo yuan-luo commented Mar 20, 2026

Motivation

In PR #19321 we fused Qwen3-Next GDN's qkvz_proj and ba_proj. This PR is a follow-up. The background is that the Qwen3-Next and Qwen3.5 checkpoint layouts are different.

Qwen3-Next weight loading path

The Qwen3-Next checkpoint directly stores the fused in_proj_qkvz weight (loaded_shard_id=None), which is already in the interleaved layout.
During loading it goes through the contiguous TP-slice case, so the interleaved layout is preserved. As a result, the matmul output is also interleaved, and the existing Triton kernel reads it as interleaved data.

Qwen3.5 weight loading path

The Qwen3.5 checkpoint stores in_proj_qkv and in_proj_z separately. They are mapped through stacked_params_mapping with shard_id=(0,1,2) for q,k,v and shard_id=3 for z.
During loading, a different case is taken: MergedColumnParallelLinear.weight_loader places q, k, and v into contiguous regions according to output_sizes. The matmul output is therefore contiguous, and a new Triton kernel is needed that reads from contiguous positions.

In summary, this PR fuses the split → reshape → cat operations in Qwen3_5GatedDeltaNet into a single Triton kernel (fused_qkvzba_split_reshape_cat), eliminating multiple kernel launches and intermediate tensor allocations during both prefill and decode. More details are in the Modifications section below; a minimal reference sketch of the unfused path follows.
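For orientation, here is a hedged PyTorch sketch of the unfused split → reshape → cat sequence that the fused Triton kernel replaces for the Qwen3.5 contiguous layout. The shape parameters (num_heads, head_k_dim, head_v_dim) and the final concatenation layout are illustrative assumptions, not copied from the actual implementation.

```python
import torch

def split_reshape_cat_reference(mixed_qkvz, mixed_ba, num_heads, head_k_dim, head_v_dim):
    # Qwen3.5-style contiguous layout: q, k, v, z occupy contiguous blocks along the last dim.
    q_dim = k_dim = num_heads * head_k_dim
    v_dim = z_dim = num_heads * head_v_dim
    q, k, v, z = torch.split(mixed_qkvz, [q_dim, k_dim, v_dim, z_dim], dim=-1)
    b, a = torch.split(mixed_ba, [num_heads, num_heads], dim=-1)

    # Reshape each piece to a per-head layout.
    q = q.reshape(*q.shape[:-1], num_heads, head_k_dim)
    k = k.reshape(*k.shape[:-1], num_heads, head_k_dim)
    v = v.reshape(*v.shape[:-1], num_heads, head_v_dim)
    z = z.reshape(*z.shape[:-1], num_heads, head_v_dim)

    # Concatenate q/k/v along the feature dim; this intermediate allocation
    # (and the ones above) is what the fused kernel avoids.
    mixed_qkv = torch.cat([q, k, v], dim=-1)
    return mixed_qkv, z, b, a
```

Each of these ops launches its own kernel and materializes an intermediate tensor; the fused kernel writes the final layouts directly from the projection output.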

Modifications

  • Triton Kernel Fusion: Introduced a new Triton kernel, fused_qkvzba_split_reshape_cat_contiguous, to fuse split, reshape, and cat operations within the Qwen3.5 Gated Delta Net (GDN) projection, reducing kernel launches and intermediate memory allocations.
  • Projection Layer Refactoring: Consolidated separate in_proj_qkv, in_proj_z, in_proj_b, and in_proj_a projection layers into two fused layers: in_proj_qkvz and in_proj_ba.
  • Weight Loader Enhancement: Implemented a robust _make_packed_weight_loader to correctly handle weight loading for both fused (packed) and split checkpoint formats, ensuring proper parameter initialization (see the sketch after this list).
  • Weight Loading Mappings: Updated weight loading configurations across various Qwen3.5 model classes to correctly map original split weights to the new fused projection layers.
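As referenced in the weight-loader bullet above, the following is a hedged sketch of the packed-weight-loader idea: a fused checkpoint tensor (loaded_shard_id=None) fills the whole parameter, while a split shard is copied into its contiguous slice. The function name, signature, and offset handling are illustrative; the actual _make_packed_weight_loader in this PR may differ.

```python
import torch

def make_packed_weight_loader(shard_offsets):
    """shard_offsets maps shard_id -> (start, length) along the output dim (dim 0)."""

    def weight_loader(param: torch.nn.Parameter,
                      loaded_weight: torch.Tensor,
                      loaded_shard_id=None):
        if loaded_shard_id is None:
            # Fused (packed) checkpoint: the whole weight arrives as one tensor.
            param.data.copy_(loaded_weight)
        else:
            # Split checkpoint: copy this shard into its contiguous region.
            start, length = shard_offsets[loaded_shard_id]
            param.data.narrow(0, start, length).copy_(loaded_weight)

    return weight_loader
```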

Accuracy Tests

GSM8K

Main:

➜  sglang git:(main) lm_eval --model local-completions --tasks gsm8k   --model_args base_url=http://localhost:30000/v1/completions,model=Qwen/Qwen3.5-35B-A3B,num_concurrent=109;
2026-03-20:12:08:55 INFO     [_cli.run:376] Selected Tasks: ['gsm8k']
2026-03-20:12:08:55 INFO     [evaluator:211] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-20:12:08:55 INFO     [evaluator:236] Initializing local-completions model, with arguments: {'base_url': 'http://localhost:30000/v1/completions', 'model': 'Qwen/Qwen3.5-35B-A3B', 'num_concurrent': 109}
2026-03-20:12:08:55 INFO     [models.openai_completions:42] Remote tokenizer not supported. Using huggingface tokenizer backend.
2026-03-20:12:08:55 INFO     [models.api_models:172] Using max length 2048 - 1
2026-03-20:12:08:55 INFO     [models.api_models:193] Using tokenizer huggingface
2026-03-20:12:08:58 INFO     [tasks:700] Selected tasks:
2026-03-20:12:08:58 INFO     [tasks:691] Task: gsm8k (gsm8k/gsm8k.yaml)
2026-03-20:12:08:58 INFO     [evaluator:314] gsm8k: Using gen_kwargs: {'until': ['Question:', '</s>', '<|im_end|>'], 'do_sample': False, 'temperature': 0.0}
2026-03-20:12:08:58 INFO     [api.task:311] Building contexts for gsm8k on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:04<00:00, 293.21it/s]
2026-03-20:12:09:03 INFO     [evaluator:584] Running generate_until requests
Requesting API: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [02:13<00:00,  9.84it/s]
2026-03-20:12:11:26 INFO     [loggers.evaluation_tracker:316] Output path not provided, skipping saving results aggregated
local-completions ({'base_url': 'http://localhost:30000/v1/completions', 'model': 'Qwen/Qwen3.5-35B-A3B', 'num_concurrent': 109}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8476|±  |0.0099|
|     |       |strict-match    |     5|exact_match|↑  |0.8347|±  |0.0102|

PR:

➜  bench_script lm_eval --model local-completions --tasks gsm8k   --model_args base_url=http://localhost:30000/v1/completions,model=Qwen/Qwen3.5-35B-A3B,num_concurrent=109;
2026-03-20:12:58:10 INFO     [_cli.run:376] Selected Tasks: ['gsm8k']
2026-03-20:12:58:10 INFO     [evaluator:211] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-20:12:58:10 INFO     [evaluator:236] Initializing local-completions model, with arguments: {'base_url': 'http://localhost:30000/v1/completions', 'model': 'Qwen/Qwen3.5-35B-A3B', 'num_concurrent': 109}
2026-03-20:12:58:10 INFO     [models.openai_completions:42] Remote tokenizer not supported. Using huggingface tokenizer backend.
2026-03-20:12:58:10 INFO     [models.api_models:172] Using max length 2048 - 1
2026-03-20:12:58:10 INFO     [models.api_models:193] Using tokenizer huggingface
2026-03-20:12:58:14 INFO     [tasks:700] Selected tasks:
2026-03-20:12:58:14 INFO     [tasks:691] Task: gsm8k (gsm8k/gsm8k.yaml)
2026-03-20:12:58:14 INFO     [evaluator:314] gsm8k: Using gen_kwargs: {'until': ['Question:', '</s>', '<|im_end|>'], 'do_sample': False, 'temperature': 0.0}
2026-03-20:12:58:14 INFO     [api.task:311] Building contexts for gsm8k on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:04<00:00, 289.69it/s]
2026-03-20:12:58:18 INFO     [evaluator:584] Running generate_until requests
Requesting API: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [02:07<00:00, 10.35it/s]
fatal: not a git repository (or any of the parent directories): .git
2026-03-20:13:00:34 INFO     [loggers.evaluation_tracker:316] Output path not provided, skipping saving results aggregated
local-completions ({'base_url': 'http://localhost:30000/v1/completions', 'model': 'Qwen/Qwen3.5-35B-A3B', 'num_concurrent': 109}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8491|±  |0.0099|
|     |       |strict-match    |     5|exact_match|↑  |0.8340|±  |0.0102|

A quick chat-completion sanity check also produces correct output:

➜  bench_script cat test_openai.py
import openai
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals. Tell me how you rank them"},
    ],
    temperature=0,
    max_tokens=200,
)
print(response)

➜  bench_script python test_openai.py
ChatCompletion(id='a1b1d435525d47dc88f4ec69956cafd0', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='Thinking Process:\n\n1.  **Analyze the Request:**\n    *   Task: List 3 countries and their capitals.\n    *   Task: Tell how I rank them.\n    *   Constraint: The user is asking for a ranking of countries. This is a subjective task. As an AI, I need to be careful not to express personal opinions or biases, but I can explain *criteria* for ranking or acknowledge the subjectivity.\n    *   Safety/Policy: I should avoid making value judgments that could be seen as discriminatory or controversial (e.g., ranking based on wealth, power, etc., without context). However, ranking countries based on neutral, factual criteria (like population, area, GDP) is generally acceptable, but the prompt asks "how *you* rank them," implying personal preference. I need to clarify that I don\'t have personal preferences.\n\n2.  **Determine the Content:**\n    *   Select 3 countries and', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=None)], created=1774007301, model='default', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=200, prompt_tokens=35, total_tokens=235, completion_tokens_details=None, prompt_tokens_details=None, reasoning_tokens=0), metadata={'weight_version': 'default'})

Benchmarking and Profiling

H200

| Metric | Main (baseline) | PR (fused kernel) | Change |
|---|---:|---:|---:|
| Benchmark duration (s) | 125.38 | 116.72 | -6.9% |
| Request throughput (req/s) | 1.60 | 1.71 | +6.9% |
| Input token throughput (tok/s) | 202.88 | 217.92 | +7.4% |
| Output throughput (tok/s) | 3240.71 | 3481.08 | +7.4% |
| Peak output throughput (tok/s) | 4704.00 | 4902.00 | +4.2% |
| Total token throughput (tok/s) | 3443.59 | 3699.00 | +7.4% |
| Mean E2E Latency (ms) | 57949.06 | 52732.48 | -9.0% |
| Median E2E Latency (ms) | 59307.79 | 53722.67 | -9.4% |
| P90 E2E Latency (ms) | 92768.02 | 84853.67 | -8.5% |
| P99 E2E Latency (ms) | 101228.61 | 92919.66 | -8.2% |
| Mean TTFT (ms) | 25876.75 | 23071.14 | -10.8% |
| Median TTFT (ms) | 23999.71 | 21658.93 | -9.8% |
| P99 TTFT (ms) | 67945.42 | 61232.54 | -9.9% |
| Mean TPOT (ms) | 16.16 | 14.95 | -7.5% |
| Median TPOT (ms) | 16.72 | 15.43 | -7.7% |
| P99 TPOT (ms) | 22.12 | 20.52 | -7.2% |
| Mean ITL (ms) | 15.79 | 14.61 | -7.5% |
| Median ITL (ms) | 13.65 | 12.98 | -4.9% |
| P95 ITL (ms) | 20.86 | 14.36 | -31.2% |
| P99 ITL (ms) | 94.65 | 94.30 | -0.4% |
| Max ITL (ms) | 479.09 | 469.66 | -2.0% |
| Peak concurrent requests | 185 | 183 | -1.1% |
| Concurrency | 92.44 | 90.35 | -2.3% |

Server:
➜  sglang git:(main) CUDA_VISIBLE_DEVICES=1,2 python3 -m sglang.launch_server \
  --model Qwen/Qwen3.5-35B-A3B \
  --tp 2 \
  --port 30000 \
  --max-running-requests 64

Client:
➜  sglang_dev git:(optimize_qwen35_proj) ✗ python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --dataset-name random \
  --random-input-len 256 \
  --random-output-len 4096 \
  --num-prompts 200 \
  --request-rate 10

Main:
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    10.0
Max request concurrency:                 not set
Successful requests:                     200
Benchmark duration (s):                  125.38
Total input tokens:                      25437
Total input text tokens:                 25437
Total generated tokens:                  406325
Total generated tokens (retokenized):    399136
Request throughput (req/s):              1.60
Input token throughput (tok/s):          202.88
Output token throughput (tok/s):         3240.71
Peak output token throughput (tok/s):    4704.00
Peak concurrent requests:                185
Total token throughput (tok/s):          3443.59
Concurrency:                             92.44
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   57949.06
Median E2E Latency (ms):                 59307.79
P90 E2E Latency (ms):                    92768.02
P99 E2E Latency (ms):                    101228.61
---------------Time to First Token----------------
Mean TTFT (ms):                          25876.75
Median TTFT (ms):                        23999.71
P99 TTFT (ms):                           67945.42
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          16.16
Median TPOT (ms):                        16.72
P99 TPOT (ms):                           22.12
---------------Inter-Token Latency----------------
Mean ITL (ms):                           15.79
Median ITL (ms):                         13.65
P95 ITL (ms):                            20.86
P99 ITL (ms):                            94.65
Max ITL (ms):                            479.09
==================================================

PR:
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    10.0
Max request concurrency:                 not set
Successful requests:                     200
Benchmark duration (s):                  116.72
Total input tokens:                      25437
Total input text tokens:                 25437
Total generated tokens:                  406325
Total generated tokens (retokenized):    396587
Request throughput (req/s):              1.71
Input token throughput (tok/s):          217.92
Output token throughput (tok/s):         3481.08
Peak output token throughput (tok/s):    4902.00
Peak concurrent requests:                183
Total token throughput (tok/s):          3699.00
Concurrency:                             90.35
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   52732.48
Median E2E Latency (ms):                 53722.67
P90 E2E Latency (ms):                    84853.67
P99 E2E Latency (ms):                    92919.66
---------------Time to First Token----------------
Mean TTFT (ms):                          23071.14
Median TTFT (ms):                        21658.93
P99 TTFT (ms):                           61232.54
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.95
Median TPOT (ms):                        15.43
P99 TPOT (ms):                           20.52
---------------Inter-Token Latency----------------
Mean ITL (ms):                           14.61
Median ITL (ms):                         12.98
P95 ITL (ms):                            14.36
P99 ITL (ms):                            94.30
Max ITL (ms):                            469.66
==================================================

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly optimizes the Qwen3.5 Gated Delta Net by fusing several projection operations into a single, efficient Triton kernel. This fusion reduces computational overhead and memory footprint by minimizing kernel launches and intermediate tensor allocations, leading to improved inference performance, with benchmarks showing a +6.5% increase in output throughput and notable reductions in Time Per Output Token (TPOT) and Time To First Token (TTFT). The changes also include refactoring of projection layers and an enhanced weight loading mechanism to support both fused and split checkpoint formats, ensuring compatibility and maintainability.

Highlights

  • Triton Kernel Fusion: Introduced a new Triton kernel, fused_qkvzba_split_reshape_cat, to fuse split, reshape, and cat operations within the Qwen3.5 Gated Delta Net (GDN) projection, reducing kernel launches and intermediate memory allocations.
  • Projection Layer Refactoring: Consolidated separate in_proj_qkv, in_proj_z, in_proj_b, and in_proj_a projection layers into two fused layers: in_proj_qkvz and in_proj_ba.
  • Weight Loader Enhancement: Implemented a robust _make_packed_weight_loader to correctly handle weight loading for both fused (packed) and split checkpoint formats, ensuring proper parameter initialization.
  • Dynamic Execution Path: The forward pass now dynamically chooses between the optimized Triton kernel (if available and on CPU with AMX support) or a Python-based splitting and reshaping fallback path.
  • Weight Loading Mappings: Updated weight loading configurations across various Qwen3.5 model classes to correctly map original split weights to the new fused projection layers.


@yuan-luo
Collaborator Author

yuan-luo commented Mar 20, 2026

/tag-and-rerun-ci again

@gemini-code-assist
Contributor

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@yuan-luo yuan-luo marked this pull request as draft March 20, 2026 12:23
@yuan-luo yuan-luo force-pushed the optimize_qwen35_proj branch from c149767 to 294ab57 on March 20, 2026 12:24
@yuan-luo yuan-luo marked this pull request as ready for review March 20, 2026 13:01
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@yuan-luo
Collaborator Author

/gemini review

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@yuan-luo yuan-luo force-pushed the optimize_qwen35_proj branch from 8aaf698 to ffec15a on March 22, 2026 01:17
@yizhang2077
Collaborator

could you paste FP8 test results here?

@yuan-luo
Collaborator Author

could you paste FP8 test results here?

It encountered an error; investigating.

root@e1448ef40573:/sgl-workspace/sglang_dev/python# python3 -m sglang.launch_server --model Qwen/Qwen3.5-35B-A3B-FP8 --tp-size 2 --disable-radix-cache --enable-piecewise-cuda-graph --piecewise-cuda-graph-compiler eager
/sgl-workspace/sglang_dev/python/sglang/launch_server.py:51: UserWarning: 'python -m sglang.launch_server' is still supported, but 'sglang serve' is the recommended entrypoint.
  Example: sglang serve --model-path <model> [options]
  warnings.warn(
[2026-03-22 07:41:19] WARNING server_args.py:6487: The command line argument '--enable-piecewise-cuda-graph' is deprecated and will be removed in future versions.
config.json: 20.5kB [00:00, 57.5MB/s]
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 07:41:21] WARNING _http.py:916: Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 244/244 [00:00<00:00, 1.06MB/s]
[2026-03-22 07:41:22] WARNING model_config.py:1098: Transformers version 5.3.0 is used for model type qwen3_5_moe. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
[2026-03-22 07:41:22] INFO server_args.py:2232: Attention backend not specified. Use fa3 backend by default.
/sgl-workspace/sglang_dev/python/sglang/srt/entrypoints/http_server.py:175: FastAPIDeprecationWarning: ORJSONResponse is deprecated, FastAPI now serializes data directly to JSON bytes via Pydantic when a return type or response model is set, which is faster and doesn't need a custom response class. Read more in the FastAPI docs: https://fastapi.tiangolo.com/advanced/custom-response/#orjson-or-response-model and https://fastapi.tiangolo.com/tutorial/response-model/
  from sglang.srt.utils.json_response import (
[2026-03-22 07:41:23] Fail to set RLIMIT_NOFILE: current limit exceeds maximum limit
[2026-03-22 07:41:23] server_args=ServerArgs(model_path='Qwen/Qwen3.5-35B-A3B-FP8', tokenizer_path='Qwen/Qwen3.5-35B-A3B-FP8', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_keyfile_password=None, enable_ssl_refresh=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.7712545312499999, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, disable_priority_preemption=False, default_priority_value=None, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', enable_prefill_delayer=False, prefill_delayer_max_delay_passes=30, prefill_delayer_token_usage_low_watermark=None, prefill_delayer_forward_passes_buckets=None, prefill_delayer_wait_seconds_buckets=None, device='cuda', tp_size=2, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, incremental_streaming_output=False, enable_streaming_session=False, random_seed=870060694, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, model_checksum=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, use_ray=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', log_requests_target=None, uvicorn_access_log_exclude_prefixes=[], crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, extra_metric_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, admin_api_key=None, served_model_name='Qwen/Qwen3.5-35B-A3B-FP8', weight_version='default', chat_template=None, hf_chat_template_name=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', attn_cp_size=1, moe_dp_size=1, dist_init_addr=None, nnodes=1, node_rank=0, 
json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, enable_lora_overlap_loading=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='fa3', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', fp4_gemm_runner_backend='auto', nsa_prefill_backend=None, nsa_decode_backend=None, disable_flashinfer_autotune=False, mamba_backend='triton', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, enable_aiter_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, enable_elastic_expert_backup=False, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype=None, mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, linear_attn_backend='triton', linear_attn_decode_backend=None, linear_attn_prefill_backend=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', disable_hicache_numa_detect=False, hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, hierarchical_sparse_attention_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=True, cuda_graph_max_bs=256, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 
248, 256], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, pre_warm_nccl=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, disable_piecewise_cuda_graph=True, enforce_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=8192, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 832, 896, 960, 1024, 1280, 1536, 1792, 2048, 2304, 2560, 2816, 3072, 3328, 3584, 3840, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, nsa_prefill_cp_mode='round-robin-split', enable_fused_qk_norm_rope=False, enable_precise_embedding_interpolation=False, enable_fused_moe_sum_all_reduce=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], enable_adaptive_dispatch_to_encoder=False, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, modelexpress_config=None, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, 
limit_mm_data_per_request=None, enable_mm_global_cache=False, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2026-03-22 07:41:23] Transformers version 5.3.0 is used for model type qwen3_5_moe. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
preprocessor_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 390/390 [00:00<00:00, 2.21MB/s]
chat_template.jinja: 7.76kB [00:00, 12.1MB/s]
tokenizer_config.json: 16.7kB [00:00, 33.0MB/s]
tokenizer.json:   0%|                                                                                                                                                              | 0.00/12.8M [00:00<?, ?B/s]Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 07:41:31] Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
tokenizer.json:   0%|                                                                                                                                                              | 0.00/12.8M [00:00<?, ?B/s]Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 07:41:32 TP1] Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 07:41:32 TP1] Transformers version 5.3.0 is used for model type qwen3_5_moe. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
tokenizer.json:   0%|                                                                                                                                                              | 0.00/12.8M [00:00<?, ?B/s][2026-03-22 07:41:32 TP0] Transformers version 5.3.0 is used for model type qwen3_5_moe. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12.8M/12.8M [00:01<00:00, 9.06MB/s]
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 07:41:34 TP0] Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 07:41:35] Detected generic TokenizersBackend for Qwen/Qwen3.5-35B-A3B-FP8, retrying with trust_remote_code=True
video_preprocessor_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 385/385 [00:00<00:00, 2.88MB/s]
[2026-03-22 07:41:38] Using default HuggingFace chat template with detected content format: openai
[2026-03-22 07:41:40 TP0] Transformers version 5.3.0 is used for model type qwen3_5_moe. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
[2026-03-22 07:41:40 TP0] Init torch distributed begin.
[2026-03-22 07:41:40 TP1] Transformers version 5.3.0 is used for model type qwen3_5_moe. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
[2026-03-22 07:41:40 TP1] Init torch distributed begin.
[rank1]:[W322 07:41:40.917985090 ProcessGroupGloo.cpp:516] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[rank0]:[W322 07:41:40.930351145 ProcessGroupGloo.cpp:516] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2026-03-22 07:41:40 TP0] sglang is using nccl==2.28.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2026-03-22 07:41:42 TP0] Init torch distributed ends. elapsed=1.84 s, mem usage=0.80 GB
[2026-03-22 07:41:42 TP1] Init torch distributed ends. elapsed=1.72 s, mem usage=0.80 GB
[2026-03-22 07:41:43 TP0] Load weight begin. avail mem=77.99 GB
[2026-03-22 07:41:43 TP0] Detected fp8 checkpoint.
[2026-03-22 07:41:43 TP1] Load weight begin. avail mem=77.99 GB
[2026-03-22 07:41:43 TP0] Multimodal attention backend not set. Use fa3.
[2026-03-22 07:41:43 TP0] Using fa3 as multimodal attention backend.
[2026-03-22 07:41:43 TP1] Multimodal attention backend not set. Use fa3.
[2026-03-22 07:41:43 TP1] Using fa3 as multimodal attention backend.
[2026-03-22 07:41:43 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang_dev/python/sglang/srt/managers/scheduler.py", line 3403, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/sgl-workspace/sglang_dev/python/sglang/srt/managers/scheduler.py", line 376, in __init__
    self.init_model_worker()
  File "/sgl-workspace/sglang_dev/python/sglang/srt/managers/scheduler.py", line 593, in init_model_worker
    self.init_tp_model_worker()
  File "/sgl-workspace/sglang_dev/python/sglang/srt/managers/scheduler.py", line 551, in init_tp_model_worker
    self.tp_worker = TpModelWorker(
                     ^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang_dev/python/sglang/srt/managers/tp_worker.py", line 261, in __init__
    self._init_model_runner()
  File "/sgl-workspace/sglang_dev/python/sglang/srt/managers/tp_worker.py", line 344, in _init_model_runner
    self._model_runner = ModelRunner(
                         ^^^^^^^^^^^^
  File "/sgl-workspace/sglang_dev/python/sglang/srt/model_executor/model_runner.py", line 422, in __init__
    self.initialize(pre_model_load_memory)
  File "/sgl-workspace/sglang_dev/python/sglang/srt/model_executor/model_runner.py", line 502, in initialize
    self.load_model()
  File "/sgl-workspace/sglang_dev/python/sglang/srt/model_executor/model_runner.py", line 1079, in load_model
    self.model = self.loader.load_model(
                 ^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang_dev/python/sglang/srt/model_loader/loader.py", line 683, in load_model
    model = _initialize_model(
            ^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang_dev/python/sglang/srt/model_loader/loader.py", line 277, in _initialize_model
    return model_class(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang_dev/python/sglang/srt/models/qwen3_5.py", line 1358, in __init__
    super().__init__(config, quant_config, prefix, language_model_cls)
  File "/sgl-workspace/sglang_dev/python/sglang/srt/models/qwen3_vl.py", line 1100, in __init__
    self.model = language_model_cls(
                 ^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang_dev/python/sglang/srt/models/qwen3_5.py", line 1040, in __init__
    super().__init__(config=config, quant_config=quant_config, prefix=prefix)
  File "/sgl-workspace/sglang_dev/python/sglang/srt/models/qwen3_5.py", line 872, in __init__
    self.layers, self._start_layer, self._end_layer = make_layers(
                                                      ^^^^^^^^^^^^
  File "/sgl-workspace/sglang_dev/python/sglang/srt/utils/common.py", line 654, in make_layers
    + get_offloader().wrap_modules(
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang_dev/python/sglang/srt/utils/offloader.py", line 36, in wrap_modules
    return list(all_modules_generator)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang_dev/python/sglang/srt/utils/common.py", line 656, in <genexpr>
    layer_fn(idx=idx, prefix=add_prefix(idx, prefix))
  File "/sgl-workspace/sglang_dev/python/sglang/srt/models/qwen3_5.py", line 863, in get_layer
    return layer_class(
           ^^^^^^^^^^^^
  File "/sgl-workspace/sglang_dev/python/sglang/srt/models/qwen3_5.py", line 463, in __init__
    self.linear_attn = Qwen3_5GatedDeltaNet(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang_dev/python/sglang/srt/models/qwen3_5.py", line 165, in __init__
    self.in_proj_qkvz.weight.weight_loader = self._make_packed_weight_loader(
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: property 'weight_loader' of 'ModelWeightParameter' object has no setter

@yuan-luo
Collaborator Author

The reason is that FP8 and unquantized models use different weight parameter types. To be more specific:
unquantized uses: weight = Parameter(torch.empty(...), requires_grad=False)
FP8 quantized uses: weight = ModelWeightParameter(...)

When running self.in_proj_qkvz.weight.weight_loader = self._make_packed_weight_loader(), the FP8 ModelWeightParameter exposes weight_loader as a read-only property with no setter, so the assignment raises an AttributeError.

Fix in progress.
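One possible fix direction, sketched below under the assumption that ModelWeightParameter backs its read-only weight_loader property with a private attribute (as in similar parameter classes): wrap the existing loader instead of assigning to the property. The helper name and attribute layout are illustrative, not the exact change made in this PR.

```python
def attach_packed_weight_loader(weight, make_packed_loader):
    """Install a packed loader on both plain Parameters and ModelWeightParameter-style params."""
    packed_loader = make_packed_loader(getattr(weight, "weight_loader", None))
    try:
        # Plain nn.Parameter: weight_loader is an ordinary attribute.
        weight.weight_loader = packed_loader
    except AttributeError:
        # ModelWeightParameter: weight_loader is a read-only property; assume it is
        # backed by a private attribute and overwrite that instead.
        weight._weight_loader = packed_loader
```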

@yuan-luo
Collaborator Author

FP8 problem fixed.

root@e1448ef40573:/sgl-workspace/sglang_dev/python# python3 -m sglang.launch_server --model Qwen/Qwen3.5-35B-A3B-FP8 --tp-size 2 --disable-radix-cache --enable-piecewise-cuda-graph --piecewise-cuda-graph-compiler eager
/sgl-workspace/sglang_dev/python/sglang/launch_server.py:51: UserWarning: 'python -m sglang.launch_server' is still supported, but 'sglang serve' is the recommended entrypoint.
  Example: sglang serve --model-path <model> [options]
  warnings.warn(
[2026-03-22 08:51:30] WARNING server_args.py:6487: The command line argument '--enable-piecewise-cuda-graph' is deprecated and will be removed in future versions.
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 08:51:32] WARNING _http.py:916: Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 08:51:32] WARNING model_config.py:1098: Transformers version 5.3.0 is used for model type qwen3_5_moe. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
[2026-03-22 08:51:32] INFO server_args.py:2232: Attention backend not specified. Use fa3 backend by default.
/sgl-workspace/sglang_dev/python/sglang/srt/entrypoints/http_server.py:175: FastAPIDeprecationWarning: ORJSONResponse is deprecated, FastAPI now serializes data directly to JSON bytes via Pydantic when a return type or response model is set, which is faster and doesn't need a custom response class. Read more in the FastAPI docs: https://fastapi.tiangolo.com/advanced/custom-response/#orjson-or-response-model and https://fastapi.tiangolo.com/tutorial/response-model/
  from sglang.srt.utils.json_response import (
[2026-03-22 08:51:33] Fail to set RLIMIT_NOFILE: current limit exceeds maximum limit
[2026-03-22 08:51:33] server_args=ServerArgs(model_path='Qwen/Qwen3.5-35B-A3B-FP8', tokenizer_path='Qwen/Qwen3.5-35B-A3B-FP8', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_keyfile_password=None, enable_ssl_refresh=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.7712545312499999, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, disable_priority_preemption=False, default_priority_value=None, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', enable_prefill_delayer=False, prefill_delayer_max_delay_passes=30, prefill_delayer_token_usage_low_watermark=None, prefill_delayer_forward_passes_buckets=None, prefill_delayer_wait_seconds_buckets=None, device='cuda', tp_size=2, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, incremental_streaming_output=False, enable_streaming_session=False, random_seed=517779333, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, model_checksum=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, use_ray=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', log_requests_target=None, uvicorn_access_log_exclude_prefixes=[], crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, extra_metric_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, admin_api_key=None, served_model_name='Qwen/Qwen3.5-35B-A3B-FP8', weight_version='default', chat_template=None, hf_chat_template_name=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', attn_cp_size=1, moe_dp_size=1, dist_init_addr=None, nnodes=1, node_rank=0, 
json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, enable_lora_overlap_loading=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='fa3', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', fp4_gemm_runner_backend='auto', nsa_prefill_backend=None, nsa_decode_backend=None, disable_flashinfer_autotune=False, mamba_backend='triton', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, enable_aiter_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, enable_elastic_expert_backup=False, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype=None, mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, linear_attn_backend='triton', linear_attn_decode_backend=None, linear_attn_prefill_backend=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', disable_hicache_numa_detect=False, hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, hierarchical_sparse_attention_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=True, cuda_graph_max_bs=256, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 
248, 256], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, pre_warm_nccl=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, disable_piecewise_cuda_graph=True, enforce_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=8192, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 832, 896, 960, 1024, 1280, 1536, 1792, 2048, 2304, 2560, 2816, 3072, 3328, 3584, 3840, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, nsa_prefill_cp_mode='round-robin-split', enable_fused_qk_norm_rope=False, enable_precise_embedding_interpolation=False, enable_fused_moe_sum_all_reduce=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], enable_adaptive_dispatch_to_encoder=False, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, modelexpress_config=None, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, 
limit_mm_data_per_request=None, enable_mm_global_cache=False, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2026-03-22 08:51:33] Transformers version 5.3.0 is used for model type qwen3_5_moe. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 08:51:41] Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 08:51:41 TP1] Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 08:51:41 TP1] Transformers version 5.3.0 is used for model type qwen3_5_moe. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
[2026-03-22 08:51:41 TP0] Transformers version 5.3.0 is used for model type qwen3_5_moe. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 08:51:42 TP0] Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 08:51:43] Using default HuggingFace chat template with detected content format: openai
[2026-03-22 08:51:43] Detected generic TokenizersBackend for Qwen/Qwen3.5-35B-A3B-FP8, retrying with trust_remote_code=True
[2026-03-22 08:51:49 TP1] Transformers version 5.3.0 is used for model type qwen3_5_moe. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
[2026-03-22 08:51:49 TP1] Init torch distributed begin.
[2026-03-22 08:51:50 TP0] Transformers version 5.3.0 is used for model type qwen3_5_moe. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
[2026-03-22 08:51:50 TP0] Init torch distributed begin.
[rank1]:[W322 08:51:50.862068276 ProcessGroupGloo.cpp:516] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[rank0]:[W322 08:51:50.870712993 ProcessGroupGloo.cpp:516] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2026-03-22 08:51:50 TP0] sglang is using nccl==2.28.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2026-03-22 08:51:51 TP1] Init torch distributed ends. elapsed=1.73 s, mem usage=0.80 GB
[2026-03-22 08:51:51 TP0] Init torch distributed ends. elapsed=1.58 s, mem usage=0.80 GB
[2026-03-22 08:51:52 TP1] Load weight begin. avail mem=77.99 GB
[2026-03-22 08:51:52 TP0] Load weight begin. avail mem=77.99 GB
[2026-03-22 08:51:52 TP0] Detected fp8 checkpoint.
[2026-03-22 08:51:52 TP1] Multimodal attention backend not set. Use fa3.
[2026-03-22 08:51:52 TP1] Using fa3 as multimodal attention backend.
`torch_dtype` is deprecated! Use `dtype` instead!
[2026-03-22 08:51:52 TP0] Multimodal attention backend not set. Use fa3.
[2026-03-22 08:51:52 TP0] Using fa3 as multimodal attention backend.
[2026-03-22 08:51:52 TP1] using attn output gate!
`torch_dtype` is deprecated! Use `dtype` instead!
[2026-03-22 08:51:52 TP0] using attn output gate!
[2026-03-22 08:51:53 TP0] Found local HF snapshot for Qwen/Qwen3.5-35B-A3B-FP8 at /root/.cache/huggingface/hub/models--Qwen--Qwen3.5-35B-A3B-FP8/snapshots/0b2752837483aa34b3db6e83e151b150c0e00e49; skipping download.
Loading safetensors checkpoint shards:  93% Completed | 13/14 [00:09<00:00,  1.38it/s]
[2026-03-22 08:52:03 TP1] Load weight end. elapsed=11.01 s, type=Qwen3_5MoeForConditionalGeneration, quant=fp8, avail mem=60.82 GB, mem usage=17.17 GB.
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:10<00:00,  1.32it/s]
[2026-03-22 08:52:04 TP0] Load weight end. elapsed=11.55 s, type=Qwen3_5MoeForConditionalGeneration, quant=fp8, avail mem=60.82 GB, mem usage=17.17 GB.
[2026-03-22 08:52:04 TP0] Using KV cache dtype: torch.bfloat16
[2026-03-22 08:52:04 TP0] Mamba Cache is allocated. max_mamba_cache_size: 678, conv_state size: 0.47GB, ssm_state size: 19.89GB 
[2026-03-22 08:52:04 TP1] Mamba Cache is allocated. max_mamba_cache_size: 678, conv_state size: 0.47GB, ssm_state size: 19.89GB 
[2026-03-22 08:52:04 TP0] KV Cache is allocated. #tokens: 2371049, K size: 11.31 GB, V size: 11.31 GB
[2026-03-22 08:52:04 TP1] KV Cache is allocated. #tokens: 2371049, K size: 11.31 GB, V size: 11.31 GB
[2026-03-22 08:52:04 TP0] Memory pool end. avail mem=17.13 GB
[2026-03-22 08:52:04 TP1] Memory pool end. avail mem=17.13 GB
[2026-03-22 08:52:04 TP0] Linear attention kernel backend: decode=triton, prefill=triton
[2026-03-22 08:52:04 TP0] Using hybrid linear attention backend for hybrid GDN models.
[2026-03-22 08:52:04 TP0] GDN kernel dispatcher: decode=TritonGDNKernel, extend=TritonGDNKernel, verify=TritonGDNKernel packed_decode=True
[2026-03-22 08:52:04 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=17.03 GB
[2026-03-22 08:52:04 TP0] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
[2026-03-22 08:52:04 TP1] Using hybrid linear attention backend for hybrid GDN models.
[2026-03-22 08:52:04 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=17.03 GB
Capturing batches (bs=256 avail_mem=16.54 GB):   0%|          | 0/36 [00:00<?, ?it/s]
[2026-03-22 08:52:11 TP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2026-03-22 08:52:11 TP0] Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=6144, K=2048, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2026-03-22 08:52:11 TP0] Required memory for warmup: 0.23046875GB, Available memory: 16.51776123046875GB
[2026-03-22 08:52:17 TP1] [SymmDeviceMemory] Rank: 1, Group size: 2, device_idx: 1, Signal pad offset: 16777216
DeepGEMM warmup: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [00:55<00:00, 293.25it/s]
[2026-03-22 08:53:07 TP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2026-03-22 08:53:07 TP0] Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=2048, K=2048, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2026-03-22 08:53:07 TP0] Required memory for warmup: 0.09765625GB, Available memory: 16.44940185546875GB
DeepGEMM warmup: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [01:56<00:00, 141.19it/s]
[2026-03-22 08:55:03 TP0] [SymmDeviceMemory] Rank: 0, Group size: 2, device_idx: 0, Signal pad offset: 16777216
/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py:4876: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning.
  warnings.warn(  # warn only once
[2026-03-22 08:55:03 TP1] [SymmDeviceMemory] Rank: 1, Group size: 2, device_idx: 1, Signal pad offset: 2097152
[2026-03-22 08:55:03 TP0] [SymmDeviceMemory] Rank: 0, Group size: 2, device_idx: 0, Signal pad offset: 2097152
[2026-03-22 08:55:03 TP1] [SymmDeviceMemory] Rank: 1, Group size: 2, device_idx: 1, Signal pad offset: 50331648
[2026-03-22 08:55:03 TP0] [SymmDeviceMemory] Rank: 0, Group size: 2, device_idx: 0, Signal pad offset: 50331648
[2026-03-22 08:55:03 TP1] FlashInfer workspace initialized for rank 1, world_size 2, backend trtllm
[2026-03-22 08:55:03 TP0] FlashInfer workspace initialized for rank 0, world_size 2, backend trtllm
[2026-03-22 08:55:03 TP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2026-03-22 08:55:03 TP0] Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=512, K=2048, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2026-03-22 08:55:03 TP0] Required memory for warmup: 0.0478515625GB, Available memory: 14.65252685546875GB
[2026-03-22 08:55:06 TP1] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang_dev/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=256,N=256,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2026-03-22 08:55:06 TP1] Using MoE kernel config with down_moe=False. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang_dev/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=256,N=256,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128]_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
DeepGEMM warmup: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [04:01<00:00, 67.74it/s]
[2026-03-22 08:59:05 TP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2026-03-22 08:59:05 TP0] Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=2048, K=256, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2026-03-22 08:59:05 TP0] Required memory for warmup: 0.06689453125GB, Available memory: 14.64666748046875GB
DeepGEMM warmup: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [01:09<00:00, 236.48it/s]
[2026-03-22 09:00:14 TP0] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang_dev/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=256,N=256,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2026-03-22 09:00:14 TP0] Using MoE kernel config with down_moe=False. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang_dev/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=256,N=256,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128]_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2026-03-22 09:00:14 TP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2026-03-22 09:00:14 TP0] Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=4608, K=2048, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2026-03-22 09:00:14 TP0] Required memory for warmup: 0.1806640625GB, Available memory: 14.59393310546875GB
DeepGEMM warmup: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [01:07<00:00, 241.33it/s]
Capturing batches (bs=1 avail_mem=13.97 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36/36 [09:30<00:00, 15.86s/it]
[2026-03-22 09:01:35 TP0] Registering 72 cuda graph addresses
[2026-03-22 09:01:36 TP1] Capture cuda graph end. Time elapsed: 571.66 s. mem usage=3.06 GB. avail mem=13.97 GB.
[2026-03-22 09:01:36 TP1] Disable piecewise CUDA graph because --disable-piecewise-cuda-graph is set
[2026-03-22 09:01:36 TP0] Capture cuda graph end. Time elapsed: 571.66 s. mem usage=3.07 GB. avail mem=13.96 GB.
[2026-03-22 09:01:36 TP0] Disable piecewise CUDA graph because --disable-piecewise-cuda-graph is set
[2026-03-22 09:01:44 TP0] max_total_num_tokens=2371049, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=678, context_len=262144, available_gpu_mem=13.96 GB
[2026-03-22 09:01:45] INFO:     Started server process [50198]
[2026-03-22 09:01:45] INFO:     Waiting for application startup.
[2026-03-22 09:01:45] Using default chat sampling params from model generation config: {'temperature': 1.0, 'top_k': 20, 'top_p': 0.95}
[2026-03-22 09:01:45] INFO:     Application startup complete.
[2026-03-22 09:01:45] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2026-03-22 09:01:46] INFO:     127.0.0.1:46550 - "GET /model_info HTTP/1.1" 200 OK
[2026-03-22 09:01:48 TP0] Prefill batch, #new-seq: 1, #new-token: 80, #cached-token: 0, full token usage: 0.00, mamba usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 0.00
[2026-03-22 09:01:48] INFO:     127.0.0.1:46562 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-03-22 09:01:48] The server is fired up and ready to roll!
[2026-03-22 09:01:53 TP0] Prefill batch, #new-seq: 1, #new-token: 35, #cached-token: 0, full token usage: 0.00, mamba usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 15.64
[2026-03-22 09:01:53 TP0] Decode batch, #running-req: 1, #full token: 68, full token usage: 0.00, mamba num: 1, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 0.07, #queue-req: 0
[2026-03-22 09:01:53 TP0] Decode batch, #running-req: 1, #full token: 108, full token usage: 0.00, mamba num: 1, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 244.26, #queue-req: 0
[2026-03-22 09:01:54 TP0] Decode batch, #running-req: 1, #full token: 148, full token usage: 0.00, mamba num: 1, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 244.08, #queue-req: 0
[2026-03-22 09:01:54 TP0] Decode batch, #running-req: 1, #full token: 188, full token usage: 0.00, mamba num: 1, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 243.97, #queue-req: 0
[2026-03-22 09:01:54 TP0] Decode batch, #running-req: 1, #full token: 228, full token usage: 0.00, mamba num: 1, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 243.84, #queue-req: 0
[2026-03-22 09:01:54] INFO:     127.0.0.1:60384 - "POST /v1/chat/completions HTTP/1.1" 200 OK
root@e1448ef40573:/sgl-workspace/bench_script# python test_openai.py 
ChatCompletion(id='c87bd63dec9844c184b3a4506266d5d0', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='Thinking Process:\n\n1.  **Analyze the Request:**\n    *   Task: List 3 countries and their capitals.\n    *   Task: Tell the user how I rank them.\n    *   Constraint: The user is asking for a ranking of countries/capitals.\n\n2.  **Evaluate Safety & Policy:**\n    *   Is there any sensitive or controversial topic here? No, countries and capitals are general knowledge.\n    *   Is there a risk of bias? Yes, ranking countries can be subjective and potentially offensive or misleading if based on arbitrary criteria (e.g., "best," "most important," "wealthiest" without context).\n    *   Is there a risk of hallucination? No, country/capital data is factual.\n    *   Is there a risk of violating policies on neutrality? Yes, as an AI, I should avoid making subjective value judgments about nations unless there\'s a clear, objective metric (like population', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=None)], created=1774170114, model='default', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=200, prompt_tokens=35, total_tokens=235, completion_tokens_details=None, prompt_tokens_details=None, reasoning_tokens=0), metadata={'weight_version': 'default'})

@yuan-luo yuan-luo force-pushed the optimize_qwen35_proj branch from ffec15a to 77cec9e Compare March 22, 2026 09:08
@jasperjiaguo
Contributor

Thanks @yuan-luo, taking a look.

@edwingao28
Contributor

The GEMM fusion follows the same approach as #19321 for Qwen3-Next.
The reshape kernel overhead on small models is worth investigating separately but shouldn't block this PR.
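For readers skimming the thread, below is a toy, eager-mode sketch of the split → reshape → cat pattern that a fused kernel of this kind replaces. Every shape, head count, and tensor name is made up for illustration and does not reflect the model's actual GDN configuration.

```python
# Hypothetical shapes, for illustration only -- not the real Qwen3.5 GDN layout.
import torch

batch = 2
num_k_heads, num_v_heads = 4, 8
head_k_dim, head_v_dim = 32, 32

qkvz_dim = 2 * num_k_heads * head_k_dim + 2 * num_v_heads * head_v_dim
mixed_qkvz = torch.randn(batch, qkvz_dim)  # stand-in for the fused projection output

# 1) split the fused projection output into q, k, v, z chunks
q, k, v, z = torch.split(
    mixed_qkvz,
    [num_k_heads * head_k_dim, num_k_heads * head_k_dim,
     num_v_heads * head_v_dim, num_v_heads * head_v_dim],
    dim=-1,
)

# 2) reshape each chunk into a per-head layout
q = q.reshape(batch, num_k_heads, head_k_dim)
k = k.reshape(batch, num_k_heads, head_k_dim)
v = v.reshape(batch, num_v_heads, head_v_dim)
z = z.reshape(batch, num_v_heads, head_v_dim)

# 3) cat re-packs q/k/v, materializing a new contiguous tensor
qkv = torch.cat([q.flatten(-2), k.flatten(-2), v.flatten(-2)], dim=-1)
```

In eager mode this sequence adds extra op dispatches and at least one new intermediate allocation (the cat), which is the overhead a single fused kernel avoids by writing the final layouts in one pass.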

Contributor

@jasperjiaguo jasperjiaguo left a comment

Yes, the approach LGTM. I will keep tabs on the small-model perf separately.

@yuan-luo yuan-luo force-pushed the optimize_qwen35_proj branch from 77cec9e to 1b0fa5f Compare March 23, 2026 03:45
@BBuf BBuf merged commit 5bdc07d into sgl-project:main Mar 23, 2026
320 of 375 checks passed
@yuan-luo yuan-luo deleted the optimize_qwen35_proj branch March 24, 2026 02:52
adityavaid pushed a commit to adityavaid/sglang that referenced this pull request Mar 24, 2026
…rnel (sgl-project#21019)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
…rnel (sgl-project#21019)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
@cs-cat
Contributor

cs-cat commented Mar 31, 2026

Thank you for your work, @yuan-luo. This PR does indeed bring significant performance improvements, but it seems to affect the model's accuracy; please refer to #21696.

IzacharyI added a commit to IzacharyI/sglang that referenced this pull request Apr 2, 2026
Fuse split/reshape/cat ops in GDN projection, adapted for contiguous layout.
IzacharyI added a commit to IzacharyI/sglang that referenced this pull request Apr 2, 2026
IzacharyI added a commit to IzacharyI/sglang that referenced this pull request Apr 2, 2026
…sed split/reshape/cat ops in gdn

Adds FP8 quantization support for the fused GDN projection layers:
- _override_weight_loader: robust loader override for FP8/quantized params
- _bind_packed_weight_loaders: covers weight, weight_scale_inv, weight_scale, input_scale
- _get_split_sizes_for_param: handles BlockQuantScaleParameter and PerTensorScaleParameter
- Updated _make_packed_weight_loader to support FP8 scale parameters
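For readers following along, here is a minimal sketch of what a packed weight loader of this kind might look like. The helper names above come from the commit message; everything below (the function name, block size, shard sizes, and loader body) is an illustrative assumption, not the actual implementation in this PR or the follow-up commit.

```python
# Illustrative sketch only -- sizes, block size, and the loader body are assumptions.
import torch

BLOCK_N = 128  # hypothetical block size for block-quantized weight_scale_inv params

def make_packed_weight_loader(output_sizes, is_block_scale=False):
    """Build a loader that writes checkpoint shard `shard_id` into a fused parameter.

    output_sizes: per-shard output dims of the fused projection, e.g. [q, k, v, z].
    is_block_scale: shrink sizes by BLOCK_N for block-quant scale parameters.
    """
    sizes = [s // BLOCK_N for s in output_sizes] if is_block_scale else list(output_sizes)
    offsets = [0]
    for s in sizes:
        offsets.append(offsets[-1] + s)

    def weight_loader(param, loaded_weight, shard_id):
        # Copy the shard into its contiguous slice of the fused parameter (dim 0).
        start, end = offsets[shard_id], offsets[shard_id + 1]
        assert loaded_weight.shape[0] == end - start, "shard size mismatch"
        param.data[start:end].copy_(loaded_weight)

    return weight_loader

# Usage with toy sizes: q/k/v/z shards packed along dim 0 of one fused weight.
sizes = [512, 256, 256, 512]
fused_weight = torch.empty(sum(sizes), 128)
loader = make_packed_weight_loader(sizes)
for shard_id, s in enumerate(sizes):
    loader(fused_weight, torch.randn(s, 128), shard_id)
```

The same pattern, with `is_block_scale=True`, would place the corresponding block-quantized scale shards, which is the case the FP8 follow-up commit appears to address.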
IzacharyI added a commit to IzacharyI/sglang that referenced this pull request Apr 2, 2026
- Cherry-pick PR sgl-project#21019: Fuse GDN split/reshape/cat ops with FP8/BF16 quant support
- Add BF16 qkv z b a fusion and PTPC quant config
IzacharyI added a commit to IzacharyI/sglang that referenced this pull request Apr 3, 2026
- Cherry-pick PR sgl-project#21019: Fuse GDN split/reshape/cat ops with FP8/BF16 quant support
- Add BF16 qkv z b a fusion and PTPC quant config
IzacharyI added a commit to IzacharyI/sglang that referenced this pull request Apr 3, 2026
- Cherry-pick PR sgl-project#21019: Fuse GDN split/reshape/cat ops with FP8/BF16 quant support
- Add BF16 qkv z b a fusion and PTPC quant config
IzacharyI added a commit to IzacharyI/sglang that referenced this pull request Apr 3, 2026
- Cherry-pick PR sgl-project#21019: Fuse GDN split/reshape/cat ops with FP8/BF16 quant support
- Add BF16 qkv z b a fusion and PTPC quant config
qichu-yun pushed a commit to zejunchen-zejun/sglang that referenced this pull request Apr 3, 2026
- Cherry-pick PR sgl-project#21019: Fuse GDN split/reshape/cat ops with FP8/BF16 quant support
- Add BF16 qkv z b a fusion and PTPC quant config
IzacharyI added a commit to IzacharyI/sglang that referenced this pull request Apr 3, 2026
- Cherry-pick PR sgl-project#21019 load weight func
- Add BF16 qkv z b a fusion and PTPC quant config
qichu-yun pushed a commit to zejunchen-zejun/sglang that referenced this pull request Apr 3, 2026
- Cherry-pick PR sgl-project#21019 load weight func
- Add BF16 qkv z b a fusion and PTPC quant config
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
…rnel (sgl-project#21019)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
…rnel (sgl-project#21019)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>