Describe the bug
[CPU] GDN (chunk_gated_delta_rule_cpu) produces NaN with BF16 when prefill exceeds ~4096 tokens
Summary
When running Qwen3-Coder-Next (GDN + Full Attention hybrid) on CPU with BF16 precision, the GDN kernel chunk_gated_delta_rule_cpu produces NaN values when the prefill length exceeds ~4096 tokens. This corrupts the logits and causes torch.multinomial to crash with RuntimeError: probability tensor contains either inf, nan or element < 0. Setting --chunked-prefill-size 2048 works around the issue.
Reproduction
```bash
numactl --interleave=all python -m sglang.launch_server \
  --model Qwen/Qwen3-Coder-Next \
  --trust-remote-code --disable-overlap-schedule \
  --device cpu --host 0.0.0.0 --port 8000 \
  --tp 4 --disable-cuda-graph \
  --chunked-prefill-size 8192  # default, causes NaN

# Send a request with ~4000+ tokens (system prompt ~3500 tokens + short user message)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Coder-Next",
    "messages": [
      {"role": "system", "content": "<~3500 token system prompt>"},
      {"role": "user", "content": "hello"}
    ],
    "temperature": 0.6,
    "max_tokens": 256
  }'

# All TP schedulers crash simultaneously.
```
Observed threshold
By binary-searching the system prompt length:
| Total tokens (approx) | Result |
| --- | --- |
| ~4085 | OK, normal response |
| ~4107 | NaN crash |
| ~4200+ | NaN crash (consistent) |
The boundary is around ~4096 tokens, which aligns with the internal chunking logic (notably, 64 chunks of chunk_size=64 is exactly 4096 tokens, though the exact trigger has not been traced).
Crash log
All 4 TP schedulers report the same error simultaneously:
```
ERROR srt_server - Scheduler hit an exception: RuntimeError: probability tensor contains either inf, nan or element < 0
```
Full traceback points to:
```
File ".../sglang/srt/layers/sampler.py", line 476, in forward
    batch_next_token_ids = torch.multinomial(probs, ...)
RuntimeError: probability tensor contains either inf, nan or element < 0
```
Workaround
Setting --chunked-prefill-size 2048 resolves the issue. All prompts (including 4000+ token system prompts) work correctly.
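For reference, this is the same launch command with only the prefill size changed:
```bash
numactl --interleave=all python -m sglang.launch_server \
  --model Qwen/Qwen3-Coder-Next \
  --trust-remote-code --disable-overlap-schedule \
  --device cpu --host 0.0.0.0 --port 8000 \
  --tp 4 --disable-cuda-graph \
  --chunked-prefill-size 2048  # workaround: no NaN at any tested prompt length
```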
Notes
- temperature=0 (greedy decoding) avoids the torch.multinomial crash but produces garbage output (e.g., "!!!!!!!!"), confirming the logits themselves are corrupted (NaN), not just a sampling issue.
- FP32 would likely fix the numerical issue but defeats the purpose of AMX BF16 acceleration.
- The issue is specific to the GDN (Gated DeltaNet) attention layers. The relevant CPU kernel is chunk_gated_delta_rule_kernel_impl in sgl-kernel/csrc/cpu/mamba/fla.cpp (template parameter chunk_size=64).
Root cause analysis
The chunk_gated_delta_rule_cpu kernel (fla.cpp:30) performs BF16 matrix multiplications within each chunk (size 64). When --chunked-prefill-size is large, many chunks are processed sequentially within a single prefill pass, and the accumulated state (the h tensor, GDN's recurrent hidden state) appears to degrade numerically after roughly 4096 tokens' worth of chunks, eventually producing NaN.
BF16 keeps FP32's 8-bit exponent, so its dynamic range matches FP32's, but it has only a 7-bit mantissa (vs FP32's 23 bits). That makes it vulnerable to accumulated rounding error in long sequential computations like recurrent state updates, where small per-step errors can compound until values blow up to inf/NaN.
The kernel uses float (FP32) for some intermediate buffers (decay_mask, curr_attn, k_cumdecay), but the input/output tensors and the recurrent state h remain in BF16 (scalar_t), which is the likely site of the numerical blow-up.
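The hypothesis is easy to probe outside the kernel. Below is a minimal, self-contained Python sketch, not the actual fla.cpp code: the head dimensions, the decay constant, and the simplified update h <- g*h + k^T v are assumptions made for illustration. It carries a recurrent state across 64-token chunks in BF16 vs FP32 and prints how far the BF16 state drifts as chunks accumulate:
```python
import torch

CHUNK = 64          # matches the kernel's template parameter chunk_size=64
D_K, D_V = 64, 64   # hypothetical head dimensions, chosen only for illustration

def run(state_dtype: torch.dtype, n_chunks: int) -> torch.Tensor:
    torch.manual_seed(0)                           # identical inputs for both dtypes
    h = torch.zeros(D_K, D_V, dtype=state_dtype)   # recurrent hidden state
    for _ in range(n_chunks):
        k = torch.randn(CHUNK, D_K).to(state_dtype)
        v = torch.randn(CHUNK, D_V).to(state_dtype)
        g = torch.tensor(0.99, dtype=state_dtype)  # decay near 1 (assumption)
        h = g * h + k.T @ v                        # simplified GDN-style state update
    return h.float()

for n_chunks in (16, 64, 256):                     # ~1k, ~4k, ~16k tokens
    ref = run(torch.float32, n_chunks)
    rel_err = ((run(torch.bfloat16, n_chunks) - ref).norm() / ref.norm()).item()
    print(f"{n_chunks * CHUNK:6d} tokens: relative BF16 state error = {rel_err:.4f}")
```
The drift grows with chunk count; the real kernel also applies exponential gating terms per token, so the compounding is plausibly worse there, though pinpointing the exact inf/NaN site would require tracing fla.cpp.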
Suggested fix
- Auto-tune --chunked-prefill-size for CPU — detect BF16 dtype and cap the default chunked-prefill-size to 2048 for CPU backends with GDN/recurrent attention models (a hedged sketch follows this list).
- Document the limitation — add a note to the CPU server docs that --chunked-prefill-size 2048 is recommended for BF16 models with GDN/recurrent attention.
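For the first suggestion, a hedged sketch of what the cap could look like. Everything here is hypothetical: the hook point, model_uses_gdn, the attribute names on server_args/model_config, and the constant are illustrations, not existing SGLang APIs.
```python
import logging

import torch

logger = logging.getLogger(__name__)

# Empirically safe value from this report; the constant name is hypothetical.
GDN_CPU_BF16_SAFE_CHUNKED_PREFILL = 2048

def model_uses_gdn(model_config) -> bool:
    """Hypothetical predicate: does the model contain GDN/recurrent layers?"""
    return "gated_delta" in getattr(model_config, "attention_arch", "")

def cap_chunked_prefill_size(server_args, model_config) -> None:
    """Cap --chunked-prefill-size for CPU + BF16 + GDN, where larger values
    are known to produce NaN in chunk_gated_delta_rule_cpu."""
    if (server_args.device == "cpu"
            and model_config.dtype == torch.bfloat16
            and model_uses_gdn(model_config)
            and server_args.chunked_prefill_size > GDN_CPU_BF16_SAFE_CHUNKED_PREFILL):
        logger.warning(
            "Capping --chunked-prefill-size %d -> %d: the CPU BF16 GDN kernel "
            "produces NaN beyond ~4096 prefill tokens.",
            server_args.chunked_prefill_size,
            GDN_CPU_BF16_SAFE_CHUNKED_PREFILL,
        )
        server_args.chunked_prefill_size = GDN_CPU_BF16_SAFE_CHUNKED_PREFILL
```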
Environment
Note: python3 -m sglang.check_env does not support the CPU backend (it falls through to a NameError because no CPUEnv class exists; see check_env.py:516-525).
Manual environment info:
- Python: 3.12.3 (GCC 13.3.0)
- PyTorch: 2.9.0+cpu
- transformers: 4.57.1
- SGLang: 0.5.9 (v0.5.9 tag, commit bbe9c7e)
- CUDA: N/A (CPU only)
- CPU: Intel Xeon 6768P, 2 sockets, 128 cores / 256 threads (Granite Rapids)
- RAM: 503 GB (4 NUMA nodes, SNC enabled)
- OS: Ubuntu, Linux 6.8.0-101-generic x86_64