
[Qwen3.5] Fuse split/reshape/cat ops in GDN projection with Triton kernel #21019

Merged
BBuf merged 3 commits into sgl-project:main from antgroup:optimize_qwen35_proj on Mar 23, 2026

Conversation

@yuan-luo
Collaborator

@yuan-luo yuan-luo commented Mar 20, 2026

Motivation

In PR #19321 we fused Qwen3-Next GDN's qkvz_proj and ba_proj. This PR is a follow-up. The background is that the Qwen3-Next and Qwen3.5 checkpoint layouts are different.

Qwen3-Next weight loading path

The Qwen3-Next checkpoint directly stores the fused in_proj_qkvz weight (loaded_shard_id=None), which is already in the interleaved layout.
During loading it goes through the contiguous TP-slice case, so the interleaved layout is preserved. As a result, the matmul output is also interleaved, and the existing Triton kernel reads it as interleaved data.

Qwen3.5 weight loading path

The Qwen3.5 checkpoint stores in_proj_qkv and in_proj_z separately. They are mapped through stacked_params_mapping with shard_id=(0,1,2) for q,k,v and shard_id=3 for z.
During loading, a different case is taken: MergedColumnParallelLinear.weight_loader places q, k, and v into contiguous regions according to output_sizes. The matmul output is therefore contiguous, and a new Triton kernel is needed that reads from contiguous positions.

In summary, this PR fuses the split → reshape → cat operations in Qwen3_5GatedDeltaNet into a single Triton kernel (fused_qkvzba_split_reshape_cat), eliminating multiple kernel launches and intermediate tensor allocations during both prefill and decode. More details are in the Modifications section below; a minimal reference sketch of the unfused path follows.
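For orientation, here is a hedged PyTorch sketch of the unfused split → reshape → cat sequence that the fused Triton kernel replaces for the Qwen3.5 contiguous layout. The shape parameters (num_heads, head_k_dim, head_v_dim) and the final concatenation layout are illustrative assumptions, not copied from the actual implementation.

```python
import torch

def split_reshape_cat_reference(mixed_qkvz, mixed_ba, num_heads, head_k_dim, head_v_dim):
    # Qwen3.5-style contiguous layout: q, k, v, z occupy contiguous blocks along the last dim.
    q_dim = k_dim = num_heads * head_k_dim
    v_dim = z_dim = num_heads * head_v_dim
    q, k, v, z = torch.split(mixed_qkvz, [q_dim, k_dim, v_dim, z_dim], dim=-1)
    b, a = torch.split(mixed_ba, [num_heads, num_heads], dim=-1)

    # Reshape each piece to a per-head layout.
    q = q.reshape(*q.shape[:-1], num_heads, head_k_dim)
    k = k.reshape(*k.shape[:-1], num_heads, head_k_dim)
    v = v.reshape(*v.shape[:-1], num_heads, head_v_dim)
    z = z.reshape(*z.shape[:-1], num_heads, head_v_dim)

    # Concatenate q/k/v along the feature dim; this intermediate allocation
    # (and the ones above) is what the fused kernel avoids.
    mixed_qkv = torch.cat([q, k, v], dim=-1)
    return mixed_qkv, z, b, a
```

Each of these ops launches its own kernel and materializes an intermediate tensor; the fused kernel writes the final layouts directly from the projection output.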

Modifications

  • Triton Kernel Fusion: Introduced a new Triton kernel, fused_qkvzba_split_reshape_cat_contiguous, to fuse split, reshape, and cat operations within the Qwen3.5 Gated Delta Net (GDN) projection, reducing kernel launches and intermediate memory allocations.
  • Projection Layer Refactoring: Consolidated separate in_proj_qkv, in_proj_z, in_proj_b, and in_proj_a projection layers into two fused layers: in_proj_qkvz and in_proj_ba.
  • Weight Loader Enhancement: Implemented a robust _make_packed_weight_loader to correctly handle weight loading for both fused (packed) and split checkpoint formats, ensuring proper parameter initialization (see the sketch after this list).
  • Weight Loading Mappings: Updated weight loading configurations across various Qwen3.5 model classes to correctly map original split weights to the new fused projection layers.
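As referenced in the weight-loader bullet above, the following is a hedged sketch of the packed-weight-loader idea: a fused checkpoint tensor (loaded_shard_id=None) fills the whole parameter, while a split shard is copied into its contiguous slice. The function name, signature, and offset handling are illustrative; the actual _make_packed_weight_loader in this PR may differ.

```python
import torch

def make_packed_weight_loader(shard_offsets):
    """shard_offsets maps shard_id -> (start, length) along the output dim (dim 0)."""

    def weight_loader(param: torch.nn.Parameter,
                      loaded_weight: torch.Tensor,
                      loaded_shard_id=None):
        if loaded_shard_id is None:
            # Fused (packed) checkpoint: the whole weight arrives as one tensor.
            param.data.copy_(loaded_weight)
        else:
            # Split checkpoint: copy this shard into its contiguous region.
            start, length = shard_offsets[loaded_shard_id]
            param.data.narrow(0, start, length).copy_(loaded_weight)

    return weight_loader
```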

Accuracy Tests

GSM8K

Main:

➜  sglang git:(main) lm_eval --model local-completions --tasks gsm8k   --model_args base_url=http://localhost:30000/v1/completions,model=Qwen/Qwen3.5-35B-A3B,num_concurrent=109;
2026-03-20:12:08:55 INFO     [_cli.run:376] Selected Tasks: ['gsm8k']
2026-03-20:12:08:55 INFO     [evaluator:211] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-20:12:08:55 INFO     [evaluator:236] Initializing local-completions model, with arguments: {'base_url': 'http://localhost:30000/v1/completions', 'model': 'Qwen/Qwen3.5-35B-A3B', 'num_concurrent': 109}
2026-03-20:12:08:55 INFO     [models.openai_completions:42] Remote tokenizer not supported. Using huggingface tokenizer backend.
2026-03-20:12:08:55 INFO     [models.api_models:172] Using max length 2048 - 1
2026-03-20:12:08:55 INFO     [models.api_models:193] Using tokenizer huggingface
2026-03-20:12:08:58 INFO     [tasks:700] Selected tasks:
2026-03-20:12:08:58 INFO     [tasks:691] Task: gsm8k (gsm8k/gsm8k.yaml)
2026-03-20:12:08:58 INFO     [evaluator:314] gsm8k: Using gen_kwargs: {'until': ['Question:', '</s>', '<|im_end|>'], 'do_sample': False, 'temperature': 0.0}
2026-03-20:12:08:58 INFO     [api.task:311] Building contexts for gsm8k on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:04<00:00, 293.21it/s]
2026-03-20:12:09:03 INFO     [evaluator:584] Running generate_until requests
Requesting API: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [02:13<00:00,  9.84it/s]
2026-03-20:12:11:26 INFO     [loggers.evaluation_tracker:316] Output path not provided, skipping saving results aggregated
local-completions ({'base_url': 'http://localhost:30000/v1/completions', 'model': 'Qwen/Qwen3.5-35B-A3B', 'num_concurrent': 109}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8476|±  |0.0099|
|     |       |strict-match    |     5|exact_match|↑  |0.8347|±  |0.0102|

PR:

➜  bench_script lm_eval --model local-completions --tasks gsm8k   --model_args base_url=http://localhost:30000/v1/completions,model=Qwen/Qwen3.5-35B-A3B,num_concurrent=109;
2026-03-20:12:58:10 INFO     [_cli.run:376] Selected Tasks: ['gsm8k']
2026-03-20:12:58:10 INFO     [evaluator:211] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-20:12:58:10 INFO     [evaluator:236] Initializing local-completions model, with arguments: {'base_url': 'http://localhost:30000/v1/completions', 'model': 'Qwen/Qwen3.5-35B-A3B', 'num_concurrent': 109}
2026-03-20:12:58:10 INFO     [models.openai_completions:42] Remote tokenizer not supported. Using huggingface tokenizer backend.
2026-03-20:12:58:10 INFO     [models.api_models:172] Using max length 2048 - 1
2026-03-20:12:58:10 INFO     [models.api_models:193] Using tokenizer huggingface
2026-03-20:12:58:14 INFO     [tasks:700] Selected tasks:
2026-03-20:12:58:14 INFO     [tasks:691] Task: gsm8k (gsm8k/gsm8k.yaml)
2026-03-20:12:58:14 INFO     [evaluator:314] gsm8k: Using gen_kwargs: {'until': ['Question:', '</s>', '<|im_end|>'], 'do_sample': False, 'temperature': 0.0}
2026-03-20:12:58:14 INFO     [api.task:311] Building contexts for gsm8k on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:04<00:00, 289.69it/s]
2026-03-20:12:58:18 INFO     [evaluator:584] Running generate_until requests
Requesting API: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [02:07<00:00, 10.35it/s]
fatal: not a git repository (or any of the parent directories): .git
2026-03-20:13:00:34 INFO     [loggers.evaluation_tracker:316] Output path not provided, skipping saving results aggregated
local-completions ({'base_url': 'http://localhost:30000/v1/completions', 'model': 'Qwen/Qwen3.5-35B-A3B', 'num_concurrent': 109}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8491|±  |0.0099|
|     |       |strict-match    |     5|exact_match|↑  |0.8340|±  |0.0102|

A quick chat-completion sanity check also produces correct output:

➜  bench_script cat test_openai.py
import openai
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals. Tell me how you rank them"},
    ],
    temperature=0,
    max_tokens=200,
)
print(response)

➜  bench_script python test_openai.py
ChatCompletion(id='a1b1d435525d47dc88f4ec69956cafd0', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='Thinking Process:\n\n1.  **Analyze the Request:**\n    *   Task: List 3 countries and their capitals.\n    *   Task: Tell how I rank them.\n    *   Constraint: The user is asking for a ranking of countries. This is a subjective task. As an AI, I need to be careful not to express personal opinions or biases, but I can explain *criteria* for ranking or acknowledge the subjectivity.\n    *   Safety/Policy: I should avoid making value judgments that could be seen as discriminatory or controversial (e.g., ranking based on wealth, power, etc., without context). However, ranking countries based on neutral, factual criteria (like population, area, GDP) is generally acceptable, but the prompt asks "how *you* rank them," implying personal preference. I need to clarify that I don\'t have personal preferences.\n\n2.  **Determine the Content:**\n    *   Select 3 countries and', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=None)], created=1774007301, model='default', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=200, prompt_tokens=35, total_tokens=235, completion_tokens_details=None, prompt_tokens_details=None, reasoning_tokens=0), metadata={'weight_version': 'default'})

Benchmarking and Profiling

H200

| Metric | Main (baseline) | PR (fused kernel) | Change |
|---|---:|---:|---:|
| Benchmark duration (s) | 125.38 | 116.72 | -6.9% |
| Request throughput (req/s) | 1.60 | 1.71 | +6.9% |
| Input token throughput (tok/s) | 202.88 | 217.92 | +7.4% |
| Output throughput (tok/s) | 3240.71 | 3481.08 | +7.4% |
| Peak output throughput (tok/s) | 4704.00 | 4902.00 | +4.2% |
| Total token throughput (tok/s) | 3443.59 | 3699.00 | +7.4% |
| Mean E2E Latency (ms) | 57949.06 | 52732.48 | -9.0% |
| Median E2E Latency (ms) | 59307.79 | 53722.67 | -9.4% |
| P90 E2E Latency (ms) | 92768.02 | 84853.67 | -8.5% |
| P99 E2E Latency (ms) | 101228.61 | 92919.66 | -8.2% |
| Mean TTFT (ms) | 25876.75 | 23071.14 | -10.8% |
| Median TTFT (ms) | 23999.71 | 21658.93 | -9.8% |
| P99 TTFT (ms) | 67945.42 | 61232.54 | -9.9% |
| Mean TPOT (ms) | 16.16 | 14.95 | -7.5% |
| Median TPOT (ms) | 16.72 | 15.43 | -7.7% |
| P99 TPOT (ms) | 22.12 | 20.52 | -7.2% |
| Mean ITL (ms) | 15.79 | 14.61 | -7.5% |
| Median ITL (ms) | 13.65 | 12.98 | -4.9% |
| P95 ITL (ms) | 20.86 | 14.36 | -31.2% |
| P99 ITL (ms) | 94.65 | 94.30 | -0.4% |
| Max ITL (ms) | 479.09 | 469.66 | -2.0% |
| Peak concurrent requests | 185 | 183 | -1.1% |
| Concurrency | 92.44 | 90.35 | -2.3% |

Server:
➜  sglang git:(main) CUDA_VISIBLE_DEVICES=1,2 python3 -m sglang.launch_server \
  --model Qwen/Qwen3.5-35B-A3B \
  --tp 2 \
  --port 30000 \
  --max-running-requests 64

Client:
➜  sglang_dev git:(optimize_qwen35_proj) ✗ python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --dataset-name random \
  --random-input-len 256 \
  --random-output-len 4096 \
  --num-prompts 200 \
  --request-rate 10

Main:
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    10.0
Max request concurrency:                 not set
Successful requests:                     200
Benchmark duration (s):                  125.38
Total input tokens:                      25437
Total input text tokens:                 25437
Total generated tokens:                  406325
Total generated tokens (retokenized):    399136
Request throughput (req/s):              1.60
Input token throughput (tok/s):          202.88
Output token throughput (tok/s):         3240.71
Peak output token throughput (tok/s):    4704.00
Peak concurrent requests:                185
Total token throughput (tok/s):          3443.59
Concurrency:                             92.44
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   57949.06
Median E2E Latency (ms):                 59307.79
P90 E2E Latency (ms):                    92768.02
P99 E2E Latency (ms):                    101228.61
---------------Time to First Token----------------
Mean TTFT (ms):                          25876.75
Median TTFT (ms):                        23999.71
P99 TTFT (ms):                           67945.42
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          16.16
Median TPOT (ms):                        16.72
P99 TPOT (ms):                           22.12
---------------Inter-Token Latency----------------
Mean ITL (ms):                           15.79
Median ITL (ms):                         13.65
P95 ITL (ms):                            20.86
P99 ITL (ms):                            94.65
Max ITL (ms):                            479.09
==================================================

PR:
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    10.0
Max request concurrency:                 not set
Successful requests:                     200
Benchmark duration (s):                  116.72
Total input tokens:                      25437
Total input text tokens:                 25437
Total generated tokens:                  406325
Total generated tokens (retokenized):    396587
Request throughput (req/s):              1.71
Input token throughput (tok/s):          217.92
Output token throughput (tok/s):         3481.08
Peak output token throughput (tok/s):    4902.00
Peak concurrent requests:                183
Total token throughput (tok/s):          3699.00
Concurrency:                             90.35
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   52732.48
Median E2E Latency (ms):                 53722.67
P90 E2E Latency (ms):                    84853.67
P99 E2E Latency (ms):                    92919.66
---------------Time to First Token----------------
Mean TTFT (ms):                          23071.14
Median TTFT (ms):                        21658.93
P99 TTFT (ms):                           61232.54
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.95
Median TPOT (ms):                        15.43
P99 TPOT (ms):                           20.52
---------------Inter-Token Latency----------------
Mean ITL (ms):                           14.61
Median ITL (ms):                         12.98
P95 ITL (ms):                            14.36
P99 ITL (ms):                            94.30
Max ITL (ms):                            469.66
==================================================

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly optimizes the Qwen3.5 Gated Delta Net by fusing several projection operations into a single, efficient Triton kernel. This fusion reduces computational overhead and memory footprint by minimizing kernel launches and intermediate tensor allocations, leading to improved inference performance, with benchmarks showing a +6.5% increase in output throughput and notable reductions in Time Per Output Token (TPOT) and Time To First Token (TTFT). The changes also include refactoring of projection layers and an enhanced weight loading mechanism to support both fused and split checkpoint formats, ensuring compatibility and maintainability.

Highlights

  • Triton Kernel Fusion: Introduced a new Triton kernel, fused_qkvzba_split_reshape_cat, to fuse split, reshape, and cat operations within the Qwen3.5 Gated Delta Net (GDN) projection, reducing kernel launches and intermediate memory allocations.
  • Projection Layer Refactoring: Consolidated separate in_proj_qkv, in_proj_z, in_proj_b, and in_proj_a projection layers into two fused layers: in_proj_qkvz and in_proj_ba.
  • Weight Loader Enhancement: Implemented a robust _make_packed_weight_loader to correctly handle weight loading for both fused (packed) and split checkpoint formats, ensuring proper parameter initialization.
  • Dynamic Execution Path: The forward pass now dynamically chooses between the optimized Triton kernel (if available and on CPU with AMX support) or a Python-based splitting and reshaping fallback path.
  • Weight Loading Mappings: Updated weight loading configurations across various Qwen3.5 model classes to correctly map original split weights to the new fused projection layers.


@yuan-luo
Collaborator Author

yuan-luo commented Mar 20, 2026

/tag-and-rerun-ci again

@gemini-code-assist
Contributor

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@yuan-luo yuan-luo marked this pull request as draft March 20, 2026 12:23
@yuan-luo yuan-luo force-pushed the optimize_qwen35_proj branch from c149767 to 294ab57 on March 20, 2026 12:24
@yuan-luo yuan-luo marked this pull request as ready for review March 20, 2026 13:01
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@yuan-luo
Collaborator Author

/gemini review

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@yuan-luo yuan-luo force-pushed the optimize_qwen35_proj branch from 8aaf698 to ffec15a on March 22, 2026 01:17
@yizhang2077
Collaborator

could you paste FP8 test results here?

@yuan-luo
Collaborator Author

could you paste FP8 test results here?

It encountered an error; investigating.

root@e1448ef40573:/sgl-workspace/sglang_dev/python# python3 -m sglang.launch_server --model Qwen/Qwen3.5-35B-A3B-FP8 --tp-size 2 --disable-radix-cache --enable-piecewise-cuda-graph --piecewise-cuda-graph-compiler eager
/sgl-workspace/sglang_dev/python/sglang/launch_server.py:51: UserWarning: 'python -m sglang.launch_server' is still supported, but 'sglang serve' is the recommended entrypoint.
  Example: sglang serve --model-path <model> [options]
  warnings.warn(
[2026-03-22 07:41:19] WARNING server_args.py:6487: The command line argument '--enable-piecewise-cuda-graph' is deprecated and will be removed in future versions.
config.json: 20.5kB [00:00, 57.5MB/s]
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 07:41:21] WARNING _http.py:916: Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 244/244 [00:00<00:00, 1.06MB/s]
[2026-03-22 07:41:22] WARNING model_config.py:1098: Transformers version 5.3.0 is used for model type qwen3_5_moe. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
[2026-03-22 07:41:22] INFO server_args.py:2232: Attention backend not specified. Use fa3 backend by default.
/sgl-workspace/sglang_dev/python/sglang/srt/entrypoints/http_server.py:175: FastAPIDeprecationWarning: ORJSONResponse is deprecated, FastAPI now serializes data directly to JSON bytes via Pydantic when a return type or response model is set, which is faster and doesn't need a custom response class. Read more in the FastAPI docs: https://fastapi.tiangolo.com/advanced/custom-response/#orjson-or-response-model and https://fastapi.tiangolo.com/tutorial/response-model/
  from sglang.srt.utils.json_response import (
[2026-03-22 07:41:23] Fail to set RLIMIT_NOFILE: current limit exceeds maximum limit
[2026-03-22 07:41:23] server_args=ServerArgs(model_path='Qwen/Qwen3.5-35B-A3B-FP8', tokenizer_path='Qwen/Qwen3.5-35B-A3B-FP8', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_keyfile_password=None, enable_ssl_refresh=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.7712545312499999, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, disable_priority_preemption=False, default_priority_value=None, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', enable_prefill_delayer=False, prefill_delayer_max_delay_passes=30, prefill_delayer_token_usage_low_watermark=None, prefill_delayer_forward_passes_buckets=None, prefill_delayer_wait_seconds_buckets=None, device='cuda', tp_size=2, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, incremental_streaming_output=False, enable_streaming_session=False, random_seed=870060694, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, model_checksum=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, use_ray=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', log_requests_target=None, uvicorn_access_log_exclude_prefixes=[], crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, extra_metric_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, admin_api_key=None, served_model_name='Qwen/Qwen3.5-35B-A3B-FP8', weight_version='default', chat_template=None, hf_chat_template_name=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', attn_cp_size=1, moe_dp_size=1, dist_init_addr=None, nnodes=1, node_rank=0, 
json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, enable_lora_overlap_loading=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='fa3', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', fp4_gemm_runner_backend='auto', nsa_prefill_backend=None, nsa_decode_backend=None, disable_flashinfer_autotune=False, mamba_backend='triton', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, enable_aiter_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, enable_elastic_expert_backup=False, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype=None, mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, linear_attn_backend='triton', linear_attn_decode_backend=None, linear_attn_prefill_backend=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', disable_hicache_numa_detect=False, hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, hierarchical_sparse_attention_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=True, cuda_graph_max_bs=256, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 
248, 256], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, pre_warm_nccl=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, disable_piecewise_cuda_graph=True, enforce_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=8192, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 832, 896, 960, 1024, 1280, 1536, 1792, 2048, 2304, 2560, 2816, 3072, 3328, 3584, 3840, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, nsa_prefill_cp_mode='round-robin-split', enable_fused_qk_norm_rope=False, enable_precise_embedding_interpolation=False, enable_fused_moe_sum_all_reduce=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], enable_adaptive_dispatch_to_encoder=False, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, modelexpress_config=None, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, 
limit_mm_data_per_request=None, enable_mm_global_cache=False, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2026-03-22 07:41:23] Transformers version 5.3.0 is used for model type qwen3_5_moe. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
preprocessor_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 390/390 [00:00<00:00, 2.21MB/s]
chat_template.jinja: 7.76kB [00:00, 12.1MB/s]
tokenizer_config.json: 16.7kB [00:00, 33.0MB/s]
tokenizer.json:   0%|                                                                                                                                                              | 0.00/12.8M [00:00<?, ?B/s]Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 07:41:31] Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
tokenizer.json:   0%|                                                                                                                                                              | 0.00/12.8M [00:00<?, ?B/s]Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 07:41:32 TP1] Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 07:41:32 TP1] Transformers version 5.3.0 is used for model type qwen3_5_moe. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
tokenizer.json:   0%|                                                                                                                                                              | 0.00/12.8M [00:00<?, ?B/s][2026-03-22 07:41:32 TP0] Transformers version 5.3.0 is used for model type qwen3_5_moe. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12.8M/12.8M [00:01<00:00, 9.06MB/s]
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 07:41:34 TP0] Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 07:41:35] Detected generic TokenizersBackend for Qwen/Qwen3.5-35B-A3B-FP8, retrying with trust_remote_code=True
video_preprocessor_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 385/385 [00:00<00:00, 2.88MB/s]
[2026-03-22 07:41:38] Using default HuggingFace chat template with detected content format: openai
[2026-03-22 07:41:40 TP0] Transformers version 5.3.0 is used for model type qwen3_5_moe. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
[2026-03-22 07:41:40 TP0] Init torch distributed begin.
[2026-03-22 07:41:40 TP1] Transformers version 5.3.0 is used for model type qwen3_5_moe. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
[2026-03-22 07:41:40 TP1] Init torch distributed begin.
[rank1]:[W322 07:41:40.917985090 ProcessGroupGloo.cpp:516] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[rank0]:[W322 07:41:40.930351145 ProcessGroupGloo.cpp:516] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2026-03-22 07:41:40 TP0] sglang is using nccl==2.28.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2026-03-22 07:41:42 TP0] Init torch distributed ends. elapsed=1.84 s, mem usage=0.80 GB
[2026-03-22 07:41:42 TP1] Init torch distributed ends. elapsed=1.72 s, mem usage=0.80 GB
[2026-03-22 07:41:43 TP0] Load weight begin. avail mem=77.99 GB
[2026-03-22 07:41:43 TP0] Detected fp8 checkpoint.
[2026-03-22 07:41:43 TP1] Load weight begin. avail mem=77.99 GB
[2026-03-22 07:41:43 TP0] Multimodal attention backend not set. Use fa3.
[2026-03-22 07:41:43 TP0] Using fa3 as multimodal attention backend.
[2026-03-22 07:41:43 TP1] Multimodal attention backend not set. Use fa3.
[2026-03-22 07:41:43 TP1] Using fa3 as multimodal attention backend.
[2026-03-22 07:41:43 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang_dev/python/sglang/srt/managers/scheduler.py", line 3403, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/sgl-workspace/sglang_dev/python/sglang/srt/managers/scheduler.py", line 376, in __init__
    self.init_model_worker()
  File "/sgl-workspace/sglang_dev/python/sglang/srt/managers/scheduler.py", line 593, in init_model_worker
    self.init_tp_model_worker()
  File "/sgl-workspace/sglang_dev/python/sglang/srt/managers/scheduler.py", line 551, in init_tp_model_worker
    self.tp_worker = TpModelWorker(
                     ^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang_dev/python/sglang/srt/managers/tp_worker.py", line 261, in __init__
    self._init_model_runner()
  File "/sgl-workspace/sglang_dev/python/sglang/srt/managers/tp_worker.py", line 344, in _init_model_runner
    self._model_runner = ModelRunner(
                         ^^^^^^^^^^^^
  File "/sgl-workspace/sglang_dev/python/sglang/srt/model_executor/model_runner.py", line 422, in __init__
    self.initialize(pre_model_load_memory)
  File "/sgl-workspace/sglang_dev/python/sglang/srt/model_executor/model_runner.py", line 502, in initialize
    self.load_model()
  File "/sgl-workspace/sglang_dev/python/sglang/srt/model_executor/model_runner.py", line 1079, in load_model
    self.model = self.loader.load_model(
                 ^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang_dev/python/sglang/srt/model_loader/loader.py", line 683, in load_model
    model = _initialize_model(
            ^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang_dev/python/sglang/srt/model_loader/loader.py", line 277, in _initialize_model
    return model_class(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang_dev/python/sglang/srt/models/qwen3_5.py", line 1358, in __init__
    super().__init__(config, quant_config, prefix, language_model_cls)
  File "/sgl-workspace/sglang_dev/python/sglang/srt/models/qwen3_vl.py", line 1100, in __init__
    self.model = language_model_cls(
                 ^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang_dev/python/sglang/srt/models/qwen3_5.py", line 1040, in __init__
    super().__init__(config=config, quant_config=quant_config, prefix=prefix)
  File "/sgl-workspace/sglang_dev/python/sglang/srt/models/qwen3_5.py", line 872, in __init__
    self.layers, self._start_layer, self._end_layer = make_layers(
                                                      ^^^^^^^^^^^^
  File "/sgl-workspace/sglang_dev/python/sglang/srt/utils/common.py", line 654, in make_layers
    + get_offloader().wrap_modules(
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang_dev/python/sglang/srt/utils/offloader.py", line 36, in wrap_modules
    return list(all_modules_generator)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang_dev/python/sglang/srt/utils/common.py", line 656, in <genexpr>
    layer_fn(idx=idx, prefix=add_prefix(idx, prefix))
  File "/sgl-workspace/sglang_dev/python/sglang/srt/models/qwen3_5.py", line 863, in get_layer
    return layer_class(
           ^^^^^^^^^^^^
  File "/sgl-workspace/sglang_dev/python/sglang/srt/models/qwen3_5.py", line 463, in __init__
    self.linear_attn = Qwen3_5GatedDeltaNet(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang_dev/python/sglang/srt/models/qwen3_5.py", line 165, in __init__
    self.in_proj_qkvz.weight.weight_loader = self._make_packed_weight_loader(
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: property 'weight_loader' of 'ModelWeightParameter' object has no setter

@yuan-luo
Collaborator Author

The reason is that FP8 and unquantized models use different weight parameter types. To be more specific:
unquantized uses: weight = Parameter(torch.empty(...), requires_grad=False)
FP8 quantized uses: weight = ModelWeightParameter(...)

When running self.in_proj_qkvz.weight.weight_loader = self._make_packed_weight_loader(), the FP8 ModelWeightParameter exposes weight_loader as a read-only property with no setter, so the assignment raises an AttributeError.

Fix in progress.
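One possible fix direction, sketched below under the assumption that ModelWeightParameter backs its read-only weight_loader property with a private attribute (as in similar parameter classes): wrap the existing loader instead of assigning to the property. The helper name and attribute layout are illustrative, not the exact change made in this PR.

```python
def attach_packed_weight_loader(weight, make_packed_loader):
    """Install a packed loader on both plain Parameters and ModelWeightParameter-style params."""
    packed_loader = make_packed_loader(getattr(weight, "weight_loader", None))
    try:
        # Plain nn.Parameter: weight_loader is an ordinary attribute.
        weight.weight_loader = packed_loader
    except AttributeError:
        # ModelWeightParameter: weight_loader is a read-only property; assume it is
        # backed by a private attribute and overwrite that instead.
        weight._weight_loader = packed_loader
```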

@yuan-luo
Collaborator Author

FP8 problem fixed.

root@e1448ef40573:/sgl-workspace/sglang_dev/python# python3 -m sglang.launch_server --model Qwen/Qwen3.5-35B-A3B-FP8 --tp-size 2 --disable-radix-cache --enable-piecewise-cuda-graph --piecewise-cuda-graph-compiler eager
/sgl-workspace/sglang_dev/python/sglang/launch_server.py:51: UserWarning: 'python -m sglang.launch_server' is still supported, but 'sglang serve' is the recommended entrypoint.
  Example: sglang serve --model-path <model> [options]
  warnings.warn(
[2026-03-22 08:51:30] WARNING server_args.py:6487: The command line argument '--enable-piecewise-cuda-graph' is deprecated and will be removed in future versions.
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 08:51:32] WARNING _http.py:916: Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 08:51:32] WARNING model_config.py:1098: Transformers version 5.3.0 is used for model type qwen3_5_moe. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
[2026-03-22 08:51:32] INFO server_args.py:2232: Attention backend not specified. Use fa3 backend by default.
/sgl-workspace/sglang_dev/python/sglang/srt/entrypoints/http_server.py:175: FastAPIDeprecationWarning: ORJSONResponse is deprecated, FastAPI now serializes data directly to JSON bytes via Pydantic when a return type or response model is set, which is faster and doesn't need a custom response class. Read more in the FastAPI docs: https://fastapi.tiangolo.com/advanced/custom-response/#orjson-or-response-model and https://fastapi.tiangolo.com/tutorial/response-model/
  from sglang.srt.utils.json_response import (
[2026-03-22 08:51:33] Fail to set RLIMIT_NOFILE: current limit exceeds maximum limit
[2026-03-22 08:51:33] server_args=ServerArgs(model_path='Qwen/Qwen3.5-35B-A3B-FP8', tokenizer_path='Qwen/Qwen3.5-35B-A3B-FP8', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_keyfile_password=None, enable_ssl_refresh=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.7712545312499999, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, disable_priority_preemption=False, default_priority_value=None, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', enable_prefill_delayer=False, prefill_delayer_max_delay_passes=30, prefill_delayer_token_usage_low_watermark=None, prefill_delayer_forward_passes_buckets=None, prefill_delayer_wait_seconds_buckets=None, device='cuda', tp_size=2, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, incremental_streaming_output=False, enable_streaming_session=False, random_seed=517779333, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, model_checksum=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, use_ray=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', log_requests_target=None, uvicorn_access_log_exclude_prefixes=[], crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, extra_metric_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, admin_api_key=None, served_model_name='Qwen/Qwen3.5-35B-A3B-FP8', weight_version='default', chat_template=None, hf_chat_template_name=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', attn_cp_size=1, moe_dp_size=1, dist_init_addr=None, nnodes=1, node_rank=0, 
json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, enable_lora_overlap_loading=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='fa3', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', fp4_gemm_runner_backend='auto', nsa_prefill_backend=None, nsa_decode_backend=None, disable_flashinfer_autotune=False, mamba_backend='triton', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, enable_aiter_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, enable_elastic_expert_backup=False, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype=None, mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, linear_attn_backend='triton', linear_attn_decode_backend=None, linear_attn_prefill_backend=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', disable_hicache_numa_detect=False, hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, hierarchical_sparse_attention_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=True, cuda_graph_max_bs=256, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 
248, 256], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, pre_warm_nccl=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, disable_piecewise_cuda_graph=True, enforce_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=8192, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 832, 896, 960, 1024, 1280, 1536, 1792, 2048, 2304, 2560, 2816, 3072, 3328, 3584, 3840, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, nsa_prefill_cp_mode='round-robin-split', enable_fused_qk_norm_rope=False, enable_precise_embedding_interpolation=False, enable_fused_moe_sum_all_reduce=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], enable_adaptive_dispatch_to_encoder=False, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, modelexpress_config=None, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, 
limit_mm_data_per_request=None, enable_mm_global_cache=False, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2026-03-22 08:51:33] Transformers version 5.3.0 is used for model type qwen3_5_moe. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 08:51:41] Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 08:51:41 TP1] Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 08:51:41 TP1] Transformers version 5.3.0 is used for model type qwen3_5_moe. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
[2026-03-22 08:51:41 TP0] Transformers version 5.3.0 is used for model type qwen3_5_moe. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 08:51:42 TP0] Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-22 08:51:43] Using default HuggingFace chat template with detected content format: openai
[2026-03-22 08:51:43] Detected generic TokenizersBackend for Qwen/Qwen3.5-35B-A3B-FP8, retrying with trust_remote_code=True
[2026-03-22 08:51:49 TP1] Transformers version 5.3.0 is used for model type qwen3_5_moe. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
[2026-03-22 08:51:49 TP1] Init torch distributed begin.
[2026-03-22 08:51:50 TP0] Transformers version 5.3.0 is used for model type qwen3_5_moe. If you experience issues related to RoPE parameters, they may be due to incompatibilities between Transformers >=5.0.0 and some models. You can try downgrading to transformers==4.57.1 as a workaround.
[2026-03-22 08:51:50 TP0] Init torch distributed begin.
[rank1]:[W322 08:51:50.862068276 ProcessGroupGloo.cpp:516] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[rank0]:[W322 08:51:50.870712993 ProcessGroupGloo.cpp:516] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2026-03-22 08:51:50 TP0] sglang is using nccl==2.28.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2026-03-22 08:51:51 TP1] Init torch distributed ends. elapsed=1.73 s, mem usage=0.80 GB
[2026-03-22 08:51:51 TP0] Init torch distributed ends. elapsed=1.58 s, mem usage=0.80 GB
[2026-03-22 08:51:52 TP1] Load weight begin. avail mem=77.99 GB
[2026-03-22 08:51:52 TP0] Load weight begin. avail mem=77.99 GB
[2026-03-22 08:51:52 TP0] Detected fp8 checkpoint.
[2026-03-22 08:51:52 TP1] Multimodal attention backend not set. Use fa3.
[2026-03-22 08:51:52 TP1] Using fa3 as multimodal attention backend.
`torch_dtype` is deprecated! Use `dtype` instead!
[2026-03-22 08:51:52 TP0] Multimodal attention backend not set. Use fa3.
[2026-03-22 08:51:52 TP0] Using fa3 as multimodal attention backend.
[2026-03-22 08:51:52 TP1] using attn output gate!
`torch_dtype` is deprecated! Use `dtype` instead!
[2026-03-22 08:51:52 TP0] using attn output gate!
[2026-03-22 08:51:53 TP0] Found local HF snapshot for Qwen/Qwen3.5-35B-A3B-FP8 at /root/.cache/huggingface/hub/models--Qwen--Qwen3.5-35B-A3B-FP8/snapshots/0b2752837483aa34b3db6e83e151b150c0e00e49; skipping download.
Loading safetensors checkpoint shards:  93% Completed | 13/14 [00:09<00:00,  1.38it/s]
[2026-03-22 08:52:03 TP1] Load weight end. elapsed=11.01 s, type=Qwen3_5MoeForConditionalGeneration, quant=fp8, avail mem=60.82 GB, mem usage=17.17 GB.
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:10<00:00,  1.32it/s]
[2026-03-22 08:52:04 TP0] Load weight end. elapsed=11.55 s, type=Qwen3_5MoeForConditionalGeneration, quant=fp8, avail mem=60.82 GB, mem usage=17.17 GB.
[2026-03-22 08:52:04 TP0] Using KV cache dtype: torch.bfloat16
[2026-03-22 08:52:04 TP0] Mamba Cache is allocated. max_mamba_cache_size: 678, conv_state size: 0.47GB, ssm_state size: 19.89GB 
[2026-03-22 08:52:04 TP1] Mamba Cache is allocated. max_mamba_cache_size: 678, conv_state size: 0.47GB, ssm_state size: 19.89GB 
[2026-03-22 08:52:04 TP0] KV Cache is allocated. #tokens: 2371049, K size: 11.31 GB, V size: 11.31 GB
[2026-03-22 08:52:04 TP1] KV Cache is allocated. #tokens: 2371049, K size: 11.31 GB, V size: 11.31 GB
[2026-03-22 08:52:04 TP0] Memory pool end. avail mem=17.13 GB
[2026-03-22 08:52:04 TP1] Memory pool end. avail mem=17.13 GB
[2026-03-22 08:52:04 TP0] Linear attention kernel backend: decode=triton, prefill=triton
[2026-03-22 08:52:04 TP0] Using hybrid linear attention backend for hybrid GDN models.
[2026-03-22 08:52:04 TP0] GDN kernel dispatcher: decode=TritonGDNKernel, extend=TritonGDNKernel, verify=TritonGDNKernel packed_decode=True
[2026-03-22 08:52:04 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=17.03 GB
[2026-03-22 08:52:04 TP0] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
[2026-03-22 08:52:04 TP1] Using hybrid linear attention backend for hybrid GDN models.
[2026-03-22 08:52:04 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=17.03 GB
Capturing batches (bs=256 avail_mem=16.54 GB):   0%|          | 0/36 [00:00<?, ?it/s]
[2026-03-22 08:52:11 TP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2026-03-22 08:52:11 TP0] Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=6144, K=2048, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2026-03-22 08:52:11 TP0] Required memory for warmup: 0.23046875GB, Available memory: 16.51776123046875GB
[2026-03-22 08:52:17 TP1] [SymmDeviceMemory] Rank: 1, Group size: 2, device_idx: 1, Signal pad offset: 16777216
DeepGEMM warmup: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [00:55<00:00, 293.25it/s]
[2026-03-22 08:53:07 TP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2026-03-22 08:53:07 TP0] Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=2048, K=2048, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2026-03-22 08:53:07 TP0] Required memory for warmup: 0.09765625GB, Available memory: 16.44940185546875GB
DeepGEMM warmup: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [01:56<00:00, 141.19it/s]
[2026-03-22 08:55:03 TP0] [SymmDeviceMemory] Rank: 0, Group size: 2, device_idx: 0, Signal pad offset: 16777216
/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py:4876: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning.
  warnings.warn(  # warn only once
[2026-03-22 08:55:03 TP1] [SymmDeviceMemory] Rank: 1, Group size: 2, device_idx: 1, Signal pad offset: 2097152
[2026-03-22 08:55:03 TP0] [SymmDeviceMemory] Rank: 0, Group size: 2, device_idx: 0, Signal pad offset: 2097152
[2026-03-22 08:55:03 TP1] [SymmDeviceMemory] Rank: 1, Group size: 2, device_idx: 1, Signal pad offset: 50331648
[2026-03-22 08:55:03 TP0] [SymmDeviceMemory] Rank: 0, Group size: 2, device_idx: 0, Signal pad offset: 50331648
[2026-03-22 08:55:03 TP1] FlashInfer workspace initialized for rank 1, world_size 2, backend trtllm
[2026-03-22 08:55:03 TP0] FlashInfer workspace initialized for rank 0, world_size 2, backend trtllm
[2026-03-22 08:55:03 TP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2026-03-22 08:55:03 TP0] Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=512, K=2048, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2026-03-22 08:55:03 TP0] Required memory for warmup: 0.0478515625GB, Available memory: 14.65252685546875GB
[2026-03-22 08:55:06 TP1] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang_dev/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=256,N=256,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2026-03-22 08:55:06 TP1] Using MoE kernel config with down_moe=False. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang_dev/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=256,N=256,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128]_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
DeepGEMM warmup: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [04:01<00:00, 67.74it/s]
[2026-03-22 08:59:05 TP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2026-03-22 08:59:05 TP0] Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=2048, K=256, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2026-03-22 08:59:05 TP0] Required memory for warmup: 0.06689453125GB, Available memory: 14.64666748046875GB
DeepGEMM warmup: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [01:09<00:00, 236.48it/s]
[2026-03-22 09:00:14 TP0] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang_dev/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=256,N=256,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2026-03-22 09:00:14 TP0] Using MoE kernel config with down_moe=False. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang_dev/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=256,N=256,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128]_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2026-03-22 09:00:14 TP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2026-03-22 09:00:14 TP0] Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=4608, K=2048, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2026-03-22 09:00:14 TP0] Required memory for warmup: 0.1806640625GB, Available memory: 14.59393310546875GB
DeepGEMM warmup: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [01:07<00:00, 241.33it/s]
Capturing batches (bs=1 avail_mem=13.97 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36/36 [09:30<00:00, 15.86s/it]
[2026-03-22 09:01:35 TP0] Registering 72 cuda graph addresses
[2026-03-22 09:01:36 TP1] Capture cuda graph end. Time elapsed: 571.66 s. mem usage=3.06 GB. avail mem=13.97 GB.
[2026-03-22 09:01:36 TP1] Disable piecewise CUDA graph because --disable-piecewise-cuda-graph is set
[2026-03-22 09:01:36 TP0] Capture cuda graph end. Time elapsed: 571.66 s. mem usage=3.07 GB. avail mem=13.96 GB.
[2026-03-22 09:01:36 TP0] Disable piecewise CUDA graph because --disable-piecewise-cuda-graph is set
[2026-03-22 09:01:44 TP0] max_total_num_tokens=2371049, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=678, context_len=262144, available_gpu_mem=13.96 GB
[2026-03-22 09:01:45] INFO:     Started server process [50198]
[2026-03-22 09:01:45] INFO:     Waiting for application startup.
[2026-03-22 09:01:45] Using default chat sampling params from model generation config: {'temperature': 1.0, 'top_k': 20, 'top_p': 0.95}
[2026-03-22 09:01:45] INFO:     Application startup complete.
[2026-03-22 09:01:45] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2026-03-22 09:01:46] INFO:     127.0.0.1:46550 - "GET /model_info HTTP/1.1" 200 OK
[2026-03-22 09:01:48 TP0] Prefill batch, #new-seq: 1, #new-token: 80, #cached-token: 0, full token usage: 0.00, mamba usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 0.00
[2026-03-22 09:01:48] INFO:     127.0.0.1:46562 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-03-22 09:01:48] The server is fired up and ready to roll!
[2026-03-22 09:01:53 TP0] Prefill batch, #new-seq: 1, #new-token: 35, #cached-token: 0, full token usage: 0.00, mamba usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 15.64
[2026-03-22 09:01:53 TP0] Decode batch, #running-req: 1, #full token: 68, full token usage: 0.00, mamba num: 1, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 0.07, #queue-req: 0
[2026-03-22 09:01:53 TP0] Decode batch, #running-req: 1, #full token: 108, full token usage: 0.00, mamba num: 1, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 244.26, #queue-req: 0
[2026-03-22 09:01:54 TP0] Decode batch, #running-req: 1, #full token: 148, full token usage: 0.00, mamba num: 1, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 244.08, #queue-req: 0
[2026-03-22 09:01:54 TP0] Decode batch, #running-req: 1, #full token: 188, full token usage: 0.00, mamba num: 1, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 243.97, #queue-req: 0
[2026-03-22 09:01:54 TP0] Decode batch, #running-req: 1, #full token: 228, full token usage: 0.00, mamba num: 1, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 243.84, #queue-req: 0
[2026-03-22 09:01:54] INFO:     127.0.0.1:60384 - "POST /v1/chat/completions HTTP/1.1" 200 OK
root@e1448ef40573:/sgl-workspace/bench_script# python test_openai.py 
ChatCompletion(id='c87bd63dec9844c184b3a4506266d5d0', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='Thinking Process:\n\n1.  **Analyze the Request:**\n    *   Task: List 3 countries and their capitals.\n    *   Task: Tell the user how I rank them.\n    *   Constraint: The user is asking for a ranking of countries/capitals.\n\n2.  **Evaluate Safety & Policy:**\n    *   Is there any sensitive or controversial topic here? No, countries and capitals are general knowledge.\n    *   Is there a risk of bias? Yes, ranking countries can be subjective and potentially offensive or misleading if based on arbitrary criteria (e.g., "best," "most important," "wealthiest" without context).\n    *   Is there a risk of hallucination? No, country/capital data is factual.\n    *   Is there a risk of violating policies on neutrality? Yes, as an AI, I should avoid making subjective value judgments about nations unless there\'s a clear, objective metric (like population', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=None)], created=1774170114, model='default', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=200, prompt_tokens=35, total_tokens=235, completion_tokens_details=None, prompt_tokens_details=None, reasoning_tokens=0), metadata={'weight_version': 'default'})

@yuan-luo yuan-luo force-pushed the optimize_qwen35_proj branch from ffec15a to 77cec9e Compare March 22, 2026 09:08
@jasperjiaguo
Contributor

Thanks @yuan-luo, taking a look.

@edwingao28
Contributor

The GEMM fusion follows the same approach as #19321 for Qwen3-Next.
The reshape kernel overhead on small models is worth investigating separately but shouldn't block this PR.
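For readers skimming the thread, below is a toy, eager-mode sketch of the split → reshape → cat pattern that a fused kernel of this kind replaces. Every shape, head count, and tensor name is made up for illustration and does not reflect the model's actual GDN configuration.

```python
# Hypothetical shapes, for illustration only -- not the real Qwen3.5 GDN layout.
import torch

batch = 2
num_k_heads, num_v_heads = 4, 8
head_k_dim, head_v_dim = 32, 32

qkvz_dim = 2 * num_k_heads * head_k_dim + 2 * num_v_heads * head_v_dim
mixed_qkvz = torch.randn(batch, qkvz_dim)  # stand-in for the fused projection output

# 1) split the fused projection output into q, k, v, z chunks
q, k, v, z = torch.split(
    mixed_qkvz,
    [num_k_heads * head_k_dim, num_k_heads * head_k_dim,
     num_v_heads * head_v_dim, num_v_heads * head_v_dim],
    dim=-1,
)

# 2) reshape each chunk into a per-head layout
q = q.reshape(batch, num_k_heads, head_k_dim)
k = k.reshape(batch, num_k_heads, head_k_dim)
v = v.reshape(batch, num_v_heads, head_v_dim)
z = z.reshape(batch, num_v_heads, head_v_dim)

# 3) cat re-packs q/k/v, materializing a new contiguous tensor
qkv = torch.cat([q.flatten(-2), k.flatten(-2), v.flatten(-2)], dim=-1)
```

In eager mode this sequence adds extra op dispatches and at least one new intermediate allocation (the cat), which is the overhead a single fused kernel avoids by writing the final layouts in one pass.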

Contributor

@jasperjiaguo jasperjiaguo left a comment

Yes, the approach LGTM. I will keep tabs on the small-model perf separately.

@yuan-luo yuan-luo force-pushed the optimize_qwen35_proj branch from 77cec9e to 1b0fa5f Compare March 23, 2026 03:45
@BBuf BBuf merged commit 5bdc07d into sgl-project:main Mar 23, 2026
320 of 375 checks passed
@yuan-luo yuan-luo deleted the optimize_qwen35_proj branch March 24, 2026 02:52
adityavaid pushed a commit to adityavaid/sglang that referenced this pull request Mar 24, 2026
…rnel (sgl-project#21019)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
…rnel (sgl-project#21019)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
@cs-cat
Contributor

cs-cat commented Mar 31, 2026

Thank you for your work, @yuan-luo. This PR does indeed bring significant performance improvements, but it seems to affect the model's accuracy; please refer to #21696.

IzacharyI added a commit to IzacharyI/sglang that referenced this pull request Apr 2, 2026
Fuse split/reshape/cat ops in GDN projection, adapted for contiguous layout.
IzacharyI added a commit to IzacharyI/sglang that referenced this pull request Apr 2, 2026
IzacharyI added a commit to IzacharyI/sglang that referenced this pull request Apr 2, 2026
…sed split/reshape/cat ops in gdn

Adds FP8 quantization support for the fused GDN projection layers:
- _override_weight_loader: robust loader override for FP8/quantized params
- _bind_packed_weight_loaders: covers weight, weight_scale_inv, weight_scale, input_scale
- _get_split_sizes_for_param: handles BlockQuantScaleParameter and PerTensorScaleParameter
- Updated _make_packed_weight_loader to support FP8 scale parameters
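For readers following along, here is a minimal sketch of what a packed weight loader of this kind might look like. The helper names above come from the commit message; everything below (the function name, block size, shard sizes, and loader body) is an illustrative assumption, not the actual implementation in this PR or the follow-up commit.

```python
# Illustrative sketch only -- sizes, block size, and the loader body are assumptions.
import torch

BLOCK_N = 128  # hypothetical block size for block-quantized weight_scale_inv params

def make_packed_weight_loader(output_sizes, is_block_scale=False):
    """Build a loader that writes checkpoint shard `shard_id` into a fused parameter.

    output_sizes: per-shard output dims of the fused projection, e.g. [q, k, v, z].
    is_block_scale: shrink sizes by BLOCK_N for block-quant scale parameters.
    """
    sizes = [s // BLOCK_N for s in output_sizes] if is_block_scale else list(output_sizes)
    offsets = [0]
    for s in sizes:
        offsets.append(offsets[-1] + s)

    def weight_loader(param, loaded_weight, shard_id):
        # Copy the shard into its contiguous slice of the fused parameter (dim 0).
        start, end = offsets[shard_id], offsets[shard_id + 1]
        assert loaded_weight.shape[0] == end - start, "shard size mismatch"
        param.data[start:end].copy_(loaded_weight)

    return weight_loader

# Usage with toy sizes: q/k/v/z shards packed along dim 0 of one fused weight.
sizes = [512, 256, 256, 512]
fused_weight = torch.empty(sum(sizes), 128)
loader = make_packed_weight_loader(sizes)
for shard_id, s in enumerate(sizes):
    loader(fused_weight, torch.randn(s, 128), shard_id)
```

The same pattern, with `is_block_scale=True`, would place the corresponding block-quantized scale shards, which is the case the FP8 follow-up commit appears to address.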
IzacharyI added a commit to IzacharyI/sglang that referenced this pull request Apr 2, 2026
- Cherry-pick PR sgl-project#21019: Fuse GDN split/reshape/cat ops with FP8/BF16 quant support
- Add BF16 qkv z b a fusion and PTPC quant config
IzacharyI added a commit to IzacharyI/sglang that referenced this pull request Apr 3, 2026
- Cherry-pick PR sgl-project#21019: Fuse GDN split/reshape/cat ops with FP8/BF16 quant support
- Add BF16 qkv z b a fusion and PTPC quant config
IzacharyI added a commit to IzacharyI/sglang that referenced this pull request Apr 3, 2026
- Cherry-pick PR sgl-project#21019: Fuse GDN split/reshape/cat ops with FP8/BF16 quant support
- Add BF16 qkv z b a fusion and PTPC quant config
IzacharyI added a commit to IzacharyI/sglang that referenced this pull request Apr 3, 2026
- Cherry-pick PR sgl-project#21019: Fuse GDN split/reshape/cat ops with FP8/BF16 quant support
- Add BF16 qkv z b a fusion and PTPC quant config
qichu-yun pushed a commit to zejunchen-zejun/sglang that referenced this pull request Apr 3, 2026
- Cherry-pick PR sgl-project#21019: Fuse GDN split/reshape/cat ops with FP8/BF16 quant support
- Add BF16 qkv z b a fusion and PTPC quant config
IzacharyI added a commit to IzacharyI/sglang that referenced this pull request Apr 3, 2026
- Cherry-pick PR sgl-project#21019 load weight func
- Add BF16 qkv z b a fusion and PTPC quant config
qichu-yun pushed a commit to zejunchen-zejun/sglang that referenced this pull request Apr 3, 2026
- Cherry-pick PR sgl-project#21019 load weight func
- Add BF16 qkv z b a fusion and PTPC quant config
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
…rnel (sgl-project#21019)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
…rnel (sgl-project#21019)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>