[Bug] [GDN] Accuracy degradation with flashinfer `gated_delta_rule_decode_pretranspose` under `no_buffer` scheduling #20791

### Checklist

- [x] I searched related issues but found no solution.
- [x] The bug persists in the latest version.
- [x] Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
- [x] If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
- [x] Please use English. Otherwise, it will be closed.

### Describe the bug

## Summary

FlashInfer GDN decode accuracy on gsm8k drops from 0.990 to 0.940 under SGLang's `no_buffer`
Mamba scheduling strategy, and to ~0.890 when the radix cache is also disabled, while the
Triton FLA kernel stays at 0.990 in every configuration.

**Model**: `nvidia/Qwen3.5-397B-A17B-NVFP4`, 4× B200
**Eval**: gsm8k, 200 questions, temperature=0.6, top_p=0.95, top_k=20, 128 concurrent threads

| Config | FlashInfer | Triton |
|---|---|---|
| `extra_buffer` | **0.990** | 0.990 |
| `no_buffer` | 0.940 | 0.990 |
| `no_buffer` + `--disable-radix-cache` | ~0.890 | 0.990 |
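
To check that the gap is larger than sampling noise at 200 questions, here is a rough estimate. It is a sketch under simplifying assumptions: it treats the two runs as independent binomial samples (an unpaired two-proportion z-score), although both backends answer the same questions, and it takes the approximate `~0.890` figure at face value. The helper name `two_proportion_z` is ours, not part of any library.

```python
import math

def two_proportion_z(p_a, p_b, n):
    """Unpaired z-score for two accuracies, each measured on n questions."""
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
    return (p_a - p_b) / se

N = 200  # questions per run, as in the eval above

# Triton (0.990) vs. FlashInfer under no_buffer (0.940)
z_no_buffer = two_proportion_z(0.990, 0.940, N)
# Triton (0.990) vs. FlashInfer under no_buffer + --disable-radix-cache (~0.890)
z_no_radix = two_proportion_z(0.990, 0.890, N)

print(f"no_buffer: z = {z_no_buffer:.1f}")
print(f"no_buffer + --disable-radix-cache: z = {z_no_radix:.1f}")
```

Under these assumptions the differences are roughly 2.7 and 4.3 standard errors, so sampling at temperature 0.6 alone is unlikely to explain them.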

## Root Cause Hypothesis

The `no_buffer` strategy allows more aggressive SSM state slot reuse. The Triton path does
explicit gather/scatter (copies state in/out per request), while FlashInfer's pool API
(`initial_state + initial_state_indices`) reads and writes state in-place. We suspect this
causes a state aliasing or ordering issue under `no_buffer`, but have not identified the
exact mechanism.

**Question**: Is the pool API safe when the same slot index is reused across requests, or
when a slot is reassigned before the previous write completes?
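
To make the question concrete, here is a toy model of the two update styles. It is only an illustration of the suspected failure mode, not the actual FlashInfer or Triton kernels: `decode_step` is a hypothetical stand-in for the GDN recurrence, the pool is a plain Python list, and the in-place path runs sequentially, whereas a real kernel with an aliased slot would race.

```python
# Toy model only: NOT the real FlashInfer/Triton kernels.

def decode_step(state, token):
    """Hypothetical stand-in for one GDN decode update."""
    return state + token

def gather_scatter_update(pool, slot_indices, tokens):
    """Copy each request's state out, update the copy, then write it back."""
    gathered = [pool[i] for i in slot_indices]               # gather (copy out)
    updated = [decode_step(s, t) for s, t in zip(gathered, tokens)]
    for i, s in zip(slot_indices, updated):                  # scatter (copy in)
        pool[i] = s
    return updated

def in_place_update(pool, slot_indices, tokens):
    """Read and write the pool directly through the slot indices."""
    outputs = []
    for i, t in zip(slot_indices, tokens):
        pool[i] = decode_step(pool[i], t)
        outputs.append(pool[i])
    return outputs

# Distinct slots: both styles produce the same per-request states.
distinct_gs = gather_scatter_update([0.0, 0.0], [0, 1], [1.0, 2.0])
distinct_ip = in_place_update([0.0, 0.0], [0, 1], [1.0, 2.0])

# Aliased slot (two batch entries map to slot 0): the styles diverge.
aliased_gs = gather_scatter_update([0.0, 0.0], [0, 0], [1.0, 2.0])
aliased_ip = in_place_update([0.0, 0.0], [0, 0], [1.0, 2.0])

print("distinct:", distinct_gs, distinct_ip)
print("aliased: ", aliased_gs, aliased_ip)
```

With distinct slots both paths agree. With an aliased slot the gather/scatter path gives each request a private copy of the old state, while the in-place path lets the second request see the first request's write. If `no_buffer` can ever hand the same slot index to two live requests, or reassign a slot while a write is still pending, this is the kind of divergence we would expect.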

## Note

`extra_buffer` and `--disable-radix-cache` are mutually exclusive in SGLang, so any deployment
that disables the radix cache has to run with `no_buffer`. This makes
`no_buffer` + `--disable-radix-cache` a common deployment config, and it is the one that is
currently most affected for FlashInfer GDN decode (~0.890 vs. 0.990 with Triton).
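
The configurations in the table can be enumerated with a small helper like the one below. This is a sketch: `launch_flags` is a hypothetical function, and it assumes `extra_buffer` is selected through the same `--mamba-scheduler-strategy` flag that the reproduction script uses for `no_buffer`. The only constraint it encodes is the mutual exclusion described above.

```python
from itertools import product

def launch_flags(strategy, radix_cache, backend):
    """Return the scheduling-related server flags for one test configuration."""
    if strategy == "extra_buffer" and not radix_cache:
        # Constraint from the note above: these two options cannot be combined.
        raise ValueError("extra_buffer cannot be combined with --disable-radix-cache")
    flags = [
        "--mamba-scheduler-strategy", strategy,
        "--linear-attn-decode-backend", backend,
    ]
    if not radix_cache:
        flags.append("--disable-radix-cache")
    return flags

configs = []
for strategy, radix_cache, backend in product(
    ("extra_buffer", "no_buffer"), (True, False), ("flashinfer", "triton")
):
    try:
        configs.append(launch_flags(strategy, radix_cache, backend))
    except ValueError:
        continue  # skip the mutually exclusive combination

for flags in configs:
    print(" ".join(flags))
```

This yields six combinations (three scheduling configs for each of the two backends), matching the rows and columns of the results table.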


### Reproduction

Server side:
```bash
# Usage: bash 01.run_qwen3_5.sh nvfp4 flashinfer
MODE=${1:-""}
BACKEND=${2:-"triton"}

# Set ENABLE_HF=1 to use HuggingFace weights (requires HF_TOKEN)
ENABLE_HF=${ENABLE_HF:-0}

export HF_TOKEN="hf_xxx"
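if [ "$MODE" = "nvfp4" ]; then
    MODEL="nvidia/Qwen3.5-397B-A17B-NVFP4"
fi
# Fail early if no model was selected (MODEL may also be exported by the caller).
: "${MODEL:?MODEL is unset; pass 'nvfp4' as the first argument or export MODEL}"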

# Auto-detect number of GPUs and set TP_SIZE accordingly
NUM_GPUS=$(nvidia-smi --list-gpus | wc -l)
if [ "$NUM_GPUS" -eq 4 ]; then
    TP_SIZE=4
elif [ "$NUM_GPUS" -eq 8 ]; then
    TP_SIZE=8
else
    echo "Warning: Detected $NUM_GPUS GPUs. Defaulting to TP_SIZE=8"
    TP_SIZE=8
fi

echo "Detected $NUM_GPUS GPUs, using TP_SIZE=$TP_SIZE"
PORT=30000

# Set ENABLE_MTP=1 to enable Multi-Token Prediction (MTP)
ENABLE_MTP=${ENABLE_MTP:-0}

MTP_ARGS=""
if [ "$ENABLE_MTP" -eq 1 ]; then
    MTP_ARGS="--speculative-algo NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4"
fi

MODE_SUFFIX=""
if [ "$MODE" = "fp8" ]; then
    MODE_SUFFIX="_fp8"
elif [ "$MODE" = "nvfp4" ]; then
    MODE_SUFFIX="_nvfp4"
fi

if [ "$ENABLE_MTP" -eq 1 ]; then
    LOG_FILE="/scratch/repo/sglang/lab/qwen3-next-mtp/qwen35_tp${TP_SIZE}_mtp${MODE_SUFFIX}.log"
else
    LOG_FILE="/scratch/repo/sglang/lab/qwen3-next-mtp/qwen35_tp${TP_SIZE}${MODE_SUFFIX}.log"
fi

NVFP4_ARGS=()
if [ "$MODE" = "nvfp4" ]; then
    NVFP4_ARGS=(
        --quantization modelopt_fp4
        --mamba-scheduler-strategy no_buffer
        --mamba-track-interval 128
        --model-loader-extra-config '{"enable_multithread_load": true,"num_threads": 64}'
    )
fi

export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=False
export SGLANG_TORCH_PROFILER_DIR=/scratch/repo/sglang/lab/qwen3-next-mtp/profile_fi_pool

python -m sglang.launch_server \
    --model $MODEL \
    --tp-size ${TP_SIZE} \
    --port ${PORT} \
    --max-running-requests 128 \
    --chunked-prefill-size 2048 \
    --mamba-ssm-dtype bfloat16 \
    --reasoning-parser qwen3 \
    --attention-backend trtllm_mha \
    --linear-attn-decode-backend $BACKEND \
    --disable-radix-cache \
    $MTP_ARGS \
    "${NVFP4_ARGS[@]}" 2>&1 | tee "$LOG_FILE"

```

Client:
```bash
#!/bin/bash
PORT=${1:-30000}
NUM_QUESTIONS=${2:-200}
MODEL=${3:-"nvidia/Qwen3.5-397B-A17B-NVFP4"}

python3 - <<EOF
from types import SimpleNamespace
from sglang.test.run_eval import run_eval

args = SimpleNamespace(
    model="${MODEL}",
    eval_name="gsm8k",
    num_shots=5,
    num_examples=${NUM_QUESTIONS},
    max_tokens=16000,
    num_threads=128,
    repeat=1,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    base_url="http://127.0.0.1:${PORT}",
    host="http://127.0.0.1",
    port=${PORT},
)
metrics = run_eval(args)
print(f"metrics={metrics}")
EOF
```

### Environment

Python: 3.12.3 (main, Mar  3 2026, 12:15:18) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA B200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 10.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 590.48.01
PyTorch: 2.9.1+cu129
sglang: 0.5.8.post1
sglang-kernel: 0.4.0
flashinfer_python: 0.6.6
flashinfer_cubin: 0.6.6
flashinfer_jit_cache: 0.6.6+cu129
triton: 3.5.1
transformers: 4.57.1
torchao: 0.9.0
numpy: 2.4.3
aiohttp: 3.13.3
fastapi: 0.135.1
hf_transfer: 0.1.9
huggingface_hub: 0.36.2
interegular: 0.3.3
modelscope: 1.35.0
orjson: 3.11.7
outlines: 0.1.11
packaging: 26.0
psutil: 7.2.2
pydantic: 2.12.5
python-multipart: 0.0.22
pyzmq: 27.1.0
uvicorn: 0.42.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.27
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.85.0
litellm: Module Not Found
torchcodec: 0.9.1
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    NIC10   NIC11   CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    PXB     NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     0-55,112-167    0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    NODE    NODE    NODE    PXB     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     0-55,112-167    0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    NODE    NODE    NODE    NODE    PXB     NODE    SYS     SYS     SYS     SYS     SYS     SYS     0-55,112-167    0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    NODE    NODE    NODE    NODE    NODE    PXB     SYS     SYS     SYS     SYS     SYS     SYS     0-55,112-167    0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     PXB     NODE    NODE    NODE    NODE    NODE    56-111,168-223  1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    PXB     NODE    NODE    56-111,168-223  1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PXB     NODE    56-111,168-223  1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE    PXB     56-111,168-223  1               N/A
NIC0    PXB     NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS
NIC1    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE     X      PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS
NIC2    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    PIX      X      NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS
NIC3    NODE    PXB     NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS
NIC4    NODE    NODE    PXB     NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE     X      NODE    SYS     SYS     SYS     SYS     SYS     SYS
NIC5    NODE    NODE    NODE    PXB     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE     X      SYS     SYS     SYS     SYS     SYS     SYS
NIC6    SYS     SYS     SYS     SYS     PXB     NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    NODE    NODE
NIC7    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE     X      PIX     NODE    NODE    NODE
NIC8    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE    PIX      X      NODE    NODE    NODE
NIC9    SYS     SYS     SYS     SYS     NODE    PXB     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      NODE    NODE
NIC10   SYS     SYS     SYS     SYS     NODE    NODE    PXB     NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE     X      NODE
NIC11   SYS     SYS     SYS     SYS     NODE    NODE    NODE    PXB     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE     X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_4
  NIC1: mlx5_5
  NIC2: mlx5_6
  NIC3: mlx5_7
  NIC4: mlx5_8
  NIC5: mlx5_9
  NIC6: mlx5_10
  NIC7: mlx5_11
  NIC8: mlx5_12
  NIC9: mlx5_13
  NIC10: mlx5_14
  NIC11: mlx5_15


ulimit soft: 1048576
