Describe the bug
[CPU] GDN (chunk_gated_delta_rule_cpu) produces NaN with BF16 when prefill exceeds ~4096 tokens
Summary
When running Qwen3-Coder-Next (GDN + Full Attention hybrid) on CPU with BF16 precision, the GDN kernel chunk_gated_delta_rule_cpu produces NaN values when the prefill length exceeds ~4096 tokens. This corrupts the logits and causes torch.multinomial to crash with RuntimeError: probability tensor contains either inf, nan or element < 0. Setting --chunked-prefill-size 2048 works around the issue.
Reproduction
```bash
numactl --interleave=all python -m sglang.launch_server \
  --model Qwen/Qwen3-Coder-Next \
  --trust-remote-code --disable-overlap-schedule \
  --device cpu --host 0.0.0.0 --port 8000 \
  --tp 4 --disable-cuda-graph \
  --chunked-prefill-size 8192  # default, causes NaN

# Send a request with ~4000+ tokens (system prompt ~3500 tokens + short user message)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Coder-Next",
    "messages": [
      {"role": "system", "content": "<~3500 token system prompt>"},
      {"role": "user", "content": "hello"}
    ],
    "temperature": 0.6,
    "max_tokens": 256
  }'

# All TP schedulers crash simultaneously.
```
Observed threshold
By binary-searching the system prompt length:
| Total tokens (approx) | Result |
| --- | --- |
| ~4085 | OK, normal response |
| ~4107 | NaN crash |
| ~4200+ | NaN crash (consistent) |
The boundary is around ~4096 tokens, which aligns with the internal chunking logic (notably, 64 chunks of chunk_size=64 is exactly 4096 tokens, though the exact trigger has not been traced).
Crash log
All 4 TP schedulers report the same error simultaneously:
```
ERROR srt_server - Scheduler hit an exception: RuntimeError: probability tensor contains either inf, nan or element < 0
```
Full traceback points to:
```
File ".../sglang/srt/layers/sampler.py", line 476, in forward
    batch_next_token_ids = torch.multinomial(probs, ...)
RuntimeError: probability tensor contains either inf, nan or element < 0
```
Workaround
Setting --chunked-prefill-size 2048 resolves the issue. All prompts (including 4000+ token system prompts) work correctly.
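For reference, this is the same launch command with only the prefill size changed:
```bash
numactl --interleave=all python -m sglang.launch_server \
  --model Qwen/Qwen3-Coder-Next \
  --trust-remote-code --disable-overlap-schedule \
  --device cpu --host 0.0.0.0 --port 8000 \
  --tp 4 --disable-cuda-graph \
  --chunked-prefill-size 2048  # workaround: no NaN at any tested prompt length
```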
Notes
- temperature=0 (greedy decoding) avoids the torch.multinomial crash but produces garbage output (e.g., "!!!!!!!!"), confirming the logits themselves are corrupted (NaN), not just a sampling issue.
- FP32 would likely fix the numerical issue but defeats the purpose of AMX BF16 acceleration.
- The issue is specific to the GDN (Gated DeltaNet) attention layers. The relevant CPU kernel is chunk_gated_delta_rule_kernel_impl in sgl-kernel/csrc/cpu/mamba/fla.cpp (template parameter chunk_size=64).
Root cause analysis
The chunk_gated_delta_rule_cpu kernel (fla.cpp:30) performs BF16 matrix multiplications within each chunk (size 64). When --chunked-prefill-size is large, many chunks are processed sequentially within a single prefill pass, and the accumulated state (the h tensor, GDN's recurrent hidden state) appears to degrade numerically after roughly 4096 tokens' worth of chunks, eventually producing NaN.
BF16 keeps FP32's 8-bit exponent, so its dynamic range matches FP32's, but it has only a 7-bit mantissa (vs FP32's 23 bits). That makes it vulnerable to accumulated rounding error in long sequential computations like recurrent state updates, where small per-step errors can compound until values blow up to inf/NaN.
The kernel uses float (FP32) for some intermediate buffers (decay_mask, curr_attn, k_cumdecay), but the input/output tensors and the recurrent state h remain in BF16 (scalar_t), which is the likely site of the numerical blow-up.
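The hypothesis is easy to probe outside the kernel. Below is a minimal, self-contained Python sketch, not the actual fla.cpp code: the head dimensions, the decay constant, and the simplified update h <- g*h + k^T v are assumptions made for illustration. It carries a recurrent state across 64-token chunks in BF16 vs FP32 and prints how far the BF16 state drifts as chunks accumulate:
```python
import torch

CHUNK = 64          # matches the kernel's template parameter chunk_size=64
D_K, D_V = 64, 64   # hypothetical head dimensions, chosen only for illustration

def run(state_dtype: torch.dtype, n_chunks: int) -> torch.Tensor:
    torch.manual_seed(0)                           # identical inputs for both dtypes
    h = torch.zeros(D_K, D_V, dtype=state_dtype)   # recurrent hidden state
    for _ in range(n_chunks):
        k = torch.randn(CHUNK, D_K).to(state_dtype)
        v = torch.randn(CHUNK, D_V).to(state_dtype)
        g = torch.tensor(0.99, dtype=state_dtype)  # decay near 1 (assumption)
        h = g * h + k.T @ v                        # simplified GDN-style state update
    return h.float()

for n_chunks in (16, 64, 256):                     # ~1k, ~4k, ~16k tokens
    ref = run(torch.float32, n_chunks)
    rel_err = ((run(torch.bfloat16, n_chunks) - ref).norm() / ref.norm()).item()
    print(f"{n_chunks * CHUNK:6d} tokens: relative BF16 state error = {rel_err:.4f}")
```
The drift grows with chunk count; the real kernel also applies exponential gating terms per token, so the compounding is plausibly worse there, though pinpointing the exact inf/NaN site would require tracing fla.cpp.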
Suggested fix
- Auto-tune --chunked-prefill-size for CPU — detect BF16 dtype and cap the default chunked-prefill-size to 2048 for CPU backends with GDN/recurrent attention models (a hedged sketch follows this list).
- Document the limitation — add a note to the CPU server docs that --chunked-prefill-size 2048 is recommended for BF16 models with GDN/recurrent attention.
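For the first suggestion, a hedged sketch of what the cap could look like. Everything here is hypothetical: the hook point, model_uses_gdn, the attribute names on server_args/model_config, and the constant are illustrations, not existing SGLang APIs.
```python
import logging

import torch

logger = logging.getLogger(__name__)

# Empirically safe value from this report; the constant name is hypothetical.
GDN_CPU_BF16_SAFE_CHUNKED_PREFILL = 2048

def model_uses_gdn(model_config) -> bool:
    """Hypothetical predicate: does the model contain GDN/recurrent layers?"""
    return "gated_delta" in getattr(model_config, "attention_arch", "")

def cap_chunked_prefill_size(server_args, model_config) -> None:
    """Cap --chunked-prefill-size for CPU + BF16 + GDN, where larger values
    are known to produce NaN in chunk_gated_delta_rule_cpu."""
    if (server_args.device == "cpu"
            and model_config.dtype == torch.bfloat16
            and model_uses_gdn(model_config)
            and server_args.chunked_prefill_size > GDN_CPU_BF16_SAFE_CHUNKED_PREFILL):
        logger.warning(
            "Capping --chunked-prefill-size %d -> %d: the CPU BF16 GDN kernel "
            "produces NaN beyond ~4096 prefill tokens.",
            server_args.chunked_prefill_size,
            GDN_CPU_BF16_SAFE_CHUNKED_PREFILL,
        )
        server_args.chunked_prefill_size = GDN_CPU_BF16_SAFE_CHUNKED_PREFILL
```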
Environment
Note: python3 -m sglang.check_env does not support the CPU backend (it falls through to a NameError because no CPUEnv class exists; see check_env.py:516-525).
Manual environment info:
- Python: 3.12.3 (GCC 13.3.0)
- PyTorch: 2.9.0+cpu
- transformers: 4.57.1
- SGLang: 0.5.9 (v0.5.9 tag, commit bbe9c7e)
- CUDA: N/A (CPU only)
- CPU: Intel Xeon 6768P, 2 sockets, 128 cores / 256 threads (Granite Rapids)
- RAM: 503 GB (4 NUMA nodes, SNC enabled)
- OS: Ubuntu, Linux 6.8.0-101-generic x86_64