
[Bug] [CPU] GDN (chunk_gated_delta_rule_cpu) produces NaN with BF16 when prefill exceeds ~4096 tokens #20051

@gxoga

Description

Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

[CPU] GDN (chunk_gated_delta_rule_cpu) produces NaN with BF16 when prefill exceeds ~4096 tokens

Summary

When running Qwen3-Coder-Next (GDN + Full Attention hybrid) on CPU with BF16 precision, the GDN kernel chunk_gated_delta_rule_cpu produces NaN values when the prefill length exceeds ~4096 tokens. This corrupts the logits and causes torch.multinomial to crash with RuntimeError: probability tensor contains either inf, nan or element < 0. Setting --chunked-prefill-size 2048 works around the issue.

Reproduction

numactl --interleave=all python -m sglang.launch_server \
    --model Qwen/Qwen3-Coder-Next \
    --trust-remote-code --disable-overlap-schedule \
    --device cpu --host 0.0.0.0 --port 8000 \
    --tp 4 --disable-cuda-graph \
    --chunked-prefill-size 8192   # default, causes NaN

# Send a request with ~4000+ tokens (system prompt ~3500 tokens + short user message)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Coder-Next",
    "messages": [
      {"role": "system", "content": "<~3500 token system prompt>"},
      {"role": "user", "content": "hello"}
    ],
    "temperature": 0.6,
    "max_tokens": 256
  }'
# All TP schedulers crash simultaneously.

Observed threshold

By binary-searching the system prompt length:

Total tokens (approx)   Result
~4085                   OK, normal response
~4107                   NaN crash
~4200+                  NaN crash (consistent)

The boundary is around ~4096 tokens (64 chunks at the kernel's chunk_size=64), which aligns with the internal chunking logic.

Crash log

All 4 TP schedulers report the same error simultaneously:

ERROR srt_server - Scheduler hit an exception: RuntimeError: probability tensor contains either inf, nan or element < 0

Full traceback points to:

File ".../sglang/srt/layers/sampler.py", line 476, in forward
    batch_next_token_ids = torch.multinomial(probs, ...)
RuntimeError: probability tensor contains either inf, nan or element < 0

Workaround

Setting --chunked-prefill-size 2048 resolves the issue. All prompts (including 4000+ token system prompts) work correctly.

Notes

  • temperature=0 (greedy decoding) avoids the torch.multinomial crash but produces garbage output (e.g., "!!!!!!!!"), confirming the logits themselves are corrupted (NaN), not just a sampling issue.
  • FP32 would likely fix the numerical issue but defeats the purpose of AMX BF16 acceleration.
  • The issue is specific to the GDN (Gated DeltaNet) attention layers. The relevant CPU kernel is chunk_gated_delta_rule_kernel_impl in sgl-kernel/csrc/cpu/mamba/fla.cpp (template parameter chunk_size=64).

Root cause analysis

The chunk_gated_delta_rule_cpu kernel (fla.cpp:30) performs BF16 matrix multiplications within each chunk (size 64). When --chunked-prefill-size is large, many chunks are processed sequentially within a single prefill pass. The accumulated state (the h tensor, GDN's recurrent hidden state) appears to lose numerical stability in BF16 after processing ~4096 tokens' worth of chunks, eventually producing NaN.

BF16 keeps FP32's 8-bit exponent but has only a 7-bit mantissa (vs FP32's 23-bit mantissa), so its dynamic range matches FP32 while its precision is far lower. This makes it vulnerable to accumulated rounding error in long sequential computations like recurrent state updates.
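A minimal illustration of the precision gap (not the GDN kernel itself): repeatedly adding 1.0, a BF16 accumulator stalls once the spacing between adjacent representable values exceeds the addend, while FP32 keeps counting.

```python
import torch

# BF16 has only 8 significand bits, so above 256.0 the gap between
# adjacent representable values is 2.0 and adding 1.0 rounds away
# (round-to-nearest-even keeps the accumulator at 256.0).
acc_bf16 = torch.tensor(0.0, dtype=torch.bfloat16)
acc_fp32 = torch.tensor(0.0, dtype=torch.float32)
one = torch.tensor(1.0, dtype=torch.bfloat16)
for _ in range(4096):
    acc_bf16 = acc_bf16 + one
    acc_fp32 = acc_fp32 + 1.0

print(acc_fp32.item())  # 4096.0
print(acc_bf16.item())  # stalls at 256.0
```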

The kernel uses float (FP32) for some intermediate buffers (decay_mask, curr_attn, k_cumdecay), but the input/output tensors and the recurrent state h remain in BF16 (scalar_t), which is likely where the precision loss occurs.
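A rough sketch of the suspected failure mode, assuming a GDN-style state update of the form h = g*h + kᵀv (names, shapes, and the decay value are illustrative stand-ins, not the fla.cpp code): rounding the recurrent state back to BF16 after every chunk accumulates error that an FP32-resident state avoids.

```python
import torch

torch.manual_seed(0)
d, chunk = 64, 64
h_bf16 = torch.zeros(d, d, dtype=torch.bfloat16)  # state stored in BF16
h_fp32 = torch.zeros(d, d, dtype=torch.float32)   # state kept in FP32
g = 0.99  # decay factor (illustrative)
for _ in range(64):  # 64 chunks * chunk_size 64 = 4096 tokens
    k = torch.randn(chunk, d)
    v = torch.randn(chunk, d)
    upd = k.t() @ v
    # BF16 path: compute in FP32 but round the state back to BF16 each chunk
    h_bf16 = (g * h_bf16.float() + upd).to(torch.bfloat16)
    h_fp32 = g * h_fp32 + upd

drift = (h_bf16.float() - h_fp32).abs().max() / h_fp32.abs().max()
print(f"max relative drift after 64 chunks: {drift.item():.4f}")
```

The drift here stays small because this toy update is well-conditioned; in the real kernel the state also feeds back into per-chunk matmuls, which can amplify the error far more.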

Suggested fix

  1. Auto-tune --chunked-prefill-size for CPU — detect BF16 dtype and cap the default chunked-prefill-size to 2048 for CPU backends with GDN/recurrent attention models.

  2. Document the limitation — add a note to the CPU server docs that --chunked-prefill-size 2048 is recommended for BF16 models with GDN/recurrent attention.
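Fix (1) could be sketched as a small guard at server-argument resolution time. This is a hypothetical sketch: resolve_chunked_prefill_size, uses_gdn, and the 2048 cap are assumed names/values, not actual SGLang symbols.

```python
# Hypothetical guard for fix (1): cap the default chunked-prefill size
# on CPU backends running GDN/recurrent-attention models in BF16.
CPU_BF16_GDN_PREFILL_CAP = 2048

def resolve_chunked_prefill_size(device: str, dtype: str,
                                 uses_gdn: bool, requested: int) -> int:
    """Return a chunked-prefill size that is safe for the given backend."""
    if device == "cpu" and dtype == "bfloat16" and uses_gdn:
        return min(requested, CPU_BF16_GDN_PREFILL_CAP)
    return requested

print(resolve_chunked_prefill_size("cpu", "bfloat16", True, 8192))   # 2048
print(resolve_chunked_prefill_size("cuda", "bfloat16", True, 8192))  # 8192
```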

Environment

Note: python3 -m sglang.check_env does not support CPU backend
(falls through to NameError because no CPUEnv class exists in check_env.py:516-525).

Manual environment info:

  • Python: 3.12.3 (GCC 13.3.0)
  • PyTorch: 2.9.0+cpu
  • transformers: 4.57.1
  • SGLang: 0.5.9 (v0.5.9 tag, commit bbe9c7e)
  • CUDA: N/A (CPU only)
  • CPU: Intel Xeon 6768P, 2 sockets, 128 cores / 256 threads (Granite Rapids)
  • RAM: 503 GB (4 NUMA nodes, SNC enabled)
  • OS: Ubuntu, Linux 6.8.0-101-generic x86_64

