[Bug] [GDN] Accuracy degradation with flashinfer gated_delta_rule_decode_pretranspose under no_buffer scheduling #20791

@kaixih

Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

Summary

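FlashInfer GDN decode accuracy on gsm8k drops from 0.990 to 0.940 under SGLang's no_buffer
Mamba scheduling strategy (and to about 0.890 when --disable-radix-cache is added), while
the Triton FLA kernel stays at 0.990 in every configuration tested.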

Model: nvidia/Qwen3.5-397B-A17B-NVFP4, 4× B200
Eval: gsm8k, 200 questions, temperature=0.6, top_p=0.95, top_k=20, 128 concurrent threads

Config                               FlashInfer   Triton
extra_buffer                         0.990        0.990
no_buffer                            0.940        0.990
no_buffer + --disable-radix-cache    ~0.890       0.990
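
As a rough sanity check that the gaps in the table exceed sampling noise at 200 questions, the sketch below computes an unpooled two-proportion z-score from the reported accuracies. It assumes the runs are independent binomial samples, which is only an approximation here because every run uses the same 200 questions with temperature sampling. The helper name `two_proportion_z` is made up for this illustration.

```python
import math


def two_proportion_z(p_a: float, p_b: float, n_a: int, n_b: int) -> float:
    """Unpooled (Wald) z-score for the difference between two accuracies."""
    var_a = p_a * (1.0 - p_a) / n_a
    var_b = p_b * (1.0 - p_b) / n_b
    return (p_a - p_b) / math.sqrt(var_a + var_b)


N = 200  # questions per run, taken from the eval setup above

# Triton (0.990) versus FlashInfer under each no_buffer configuration.
z_no_buffer = two_proportion_z(0.990, 0.940, N, N)
z_no_radix = two_proportion_z(0.990, 0.890, N, N)

print(f"no_buffer: z = {z_no_buffer:.2f}")
print(f"no_buffer + --disable-radix-cache: z = {z_no_radix:.2f}")
```

Under these assumptions both z-scores come out above 2.5, so the drop is unlikely to be run-to-run variance alone. Repeating each run with different seeds would give a firmer estimate.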

Root Cause Hypothesis

The no_buffer strategy allows more aggressive SSM state slot reuse. The Triton path does
explicit gather/scatter (copies state in/out per request), while FlashInfer's pool API
(initial_state + initial_state_indices) reads and writes state in-place. We suspect this
causes a state aliasing or ordering issue under no_buffer, but have not identified the
exact mechanism.
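
To make the hypothesis concrete, here is a toy model of the two update styles in pure Python. It is not SGLang or FlashInfer code: the state is a single float per slot, the recurrence `step` and the constant `DECAY` are invented, and the in-place variant runs sequentially whereas a real kernel processes the batch in parallel. It shows only that the two styles agree when every request owns a distinct slot and diverge once a slot index is aliased.

```python
DECAY = 0.5  # made-up decay factor for the toy recurrence


def step(state: float, x: float) -> float:
    """Toy recurrent update standing in for one decode step."""
    return DECAY * state + x


def decode_gather_scatter(pool, slots, xs):
    """Copy state out of the pool, update the copies, then write back."""
    gathered = [pool[s] for s in slots]  # all reads happen first
    updated = [step(h, x) for h, x in zip(gathered, xs)]
    for s, h in zip(slots, updated):  # all writes happen afterwards
        pool[s] = h
    return updated


def decode_in_place(pool, slots, xs):
    """Read and write each slot directly in the pool, one request at a time."""
    out = []
    for s, x in zip(slots, xs):
        pool[s] = step(pool[s], x)
        out.append(pool[s])
    return out


# Distinct slots: both styles produce the same outputs.
same_a = decode_gather_scatter([1.0, 2.0, 3.0], [0, 2], [1.0, 1.0])
same_b = decode_in_place([1.0, 2.0, 3.0], [0, 2], [1.0, 1.0])

# Aliased slot 0: the second request sees different state in each style.
alias_a = decode_gather_scatter([1.0, 2.0, 3.0], [0, 0], [1.0, 2.0])
alias_b = decode_in_place([1.0, 2.0, 3.0], [0, 0], [1.0, 2.0])

print(same_a, same_b)
print(alias_a, alias_b)
```

If aliasing is what happens under no_buffer, the copy-based path could mask it (each request starts from a snapshot) while the in-place path would expose it. That remains a conjecture until the slot assignments are inspected.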

Question: Is the pool API safe when the same slot index is reused across requests, or
when a slot is reassigned before the previous write completes?
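
One way to test the first half of this question is to check each decode batch for repeated slot indices before the kernel is called. The helper below is hypothetical (`find_aliased_slots` is not an SGLang or FlashInfer function) and takes a plain list of integers; where to call it, and how to obtain the indices from the scheduler, is left as an assumption.

```python
from collections import Counter


def find_aliased_slots(slot_indices):
    """Return {slot: count} for every slot index that appears more than once."""
    counts = Counter(slot_indices)
    return {slot: n for slot, n in counts.items() if n > 1}


# Example: slots 3 and 7 are each claimed by more than one request.
batch = [3, 7, 3, 9, 7, 3]
aliased = find_aliased_slots(batch)
if aliased:
    print(f"aliased slots in batch: {aliased}")
```

A clean result only rules out aliasing within a single batch. It says nothing about a slot being reassigned across steps before an earlier write has completed, which would need stream-level synchronization checks instead.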

Note

extra_buffer and --disable-radix-cache are mutually exclusive in SGLang, so the only
valid production configs are extra_buffer (radix cache on) or no_buffer + --disable-radix-cache.
This means no_buffer + --disable-radix-cache is a common deployment config that is
currently broken for FlashInfer GDN decode.

Reproduction

server side:

# Usage: bash 01.run_qwen3_5.sh nvfp4 flashinfer
MODE=${1:-""}
BACKEND=${2:-"triton"}

# Set ENABLE_HF=1 to use HuggingFace weights (requires HF_TOKEN)
ENABLE_HF=${ENABLE_HF:-0}

export HF_TOKEN="hf_xxx"
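if [ "$MODE" = "nvfp4" ]; then
    MODEL="nvidia/Qwen3.5-397B-A17B-NVFP4"
else
    # Only the nvfp4 model is defined in this reproduction script; fail early
    # instead of launching the server with an empty --model value.
    echo "Error: unsupported MODE '$MODE' (expected: nvfp4)" >&2
    exit 1
fi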

# Auto-detect number of GPUs and set TP_SIZE accordingly
NUM_GPUS=$(nvidia-smi --list-gpus | wc -l)
if [ "$NUM_GPUS" -eq 4 ]; then
    TP_SIZE=4
elif [ "$NUM_GPUS" -eq 8 ]; then
    TP_SIZE=8
else
    echo "Warning: Detected $NUM_GPUS GPUs. Defaulting to TP_SIZE=8"
    TP_SIZE=8
fi

echo "Detected $NUM_GPUS GPUs, using TP_SIZE=$TP_SIZE"
PORT=30000

# Set ENABLE_MTP=1 to enable Multi-Token Prediction (MTP)
ENABLE_MTP=${ENABLE_MTP:-0}

MTP_ARGS=""
if [ "$ENABLE_MTP" -eq 1 ]; then
    MTP_ARGS="--speculative-algo NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4"
fi

MODE_SUFFIX=""
if [ "$MODE" = "fp8" ]; then
    MODE_SUFFIX="_fp8"
elif [ "$MODE" = "nvfp4" ]; then
    MODE_SUFFIX="_nvfp4"
fi

if [ "$ENABLE_MTP" -eq 1 ]; then
    LOG_FILE="/scratch/repo/sglang/lab/qwen3-next-mtp/qwen35_tp${TP_SIZE}_mtp${MODE_SUFFIX}.log"
else
    LOG_FILE="/scratch/repo/sglang/lab/qwen3-next-mtp/qwen35_tp${TP_SIZE}${MODE_SUFFIX}.log"
fi

NVFP4_ARGS=()
if [ "$MODE" = "nvfp4" ]; then
    NVFP4_ARGS=(
        --quantization modelopt_fp4
        --mamba-scheduler-strategy no_buffer
        --mamba-track-interval 128
        --model-loader-extra-config '{"enable_multithread_load": true,"num_threads": 64}'
    )
fi

export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=False
export SGLANG_TORCH_PROFILER_DIR=/scratch/repo/sglang/lab/qwen3-next-mtp/profile_fi_pool

python -m sglang.launch_server \
    --model "$MODEL" \
    --tp-size ${TP_SIZE} \
    --port ${PORT} \
    --max-running-requests 128 \
    --chunked-prefill-size 2048 \
    --mamba-ssm-dtype bfloat16 \
    --reasoning-parser qwen3 \
    --attention-backend trtllm_mha \
    --linear-attn-decode-backend $BACKEND \
    --disable-radix-cache \
    $MTP_ARGS \
    "${NVFP4_ARGS[@]}" 2>&1 | tee "$LOG_FILE"

client:

#!/bin/bash
PORT=${1:-30000}
NUM_QUESTIONS=${2:-200}
MODEL=${3:-"nvidia/Qwen3.5-397B-A17B-NVFP4"}

python3 - <<EOF
from types import SimpleNamespace
from sglang.test.run_eval import run_eval

args = SimpleNamespace(
    model="${MODEL}",
    eval_name="gsm8k",
    num_shots=5,
    num_examples=${NUM_QUESTIONS},
    max_tokens=16000,
    num_threads=128,
    repeat=1,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    base_url="http://127.0.0.1:${PORT}",
    host="http://127.0.0.1",
    port=${PORT},
)
metrics = run_eval(args)
print(f"metrics={metrics}")
EOF

Environment

Python: 3.12.3 (main, Mar 3 2026, 12:15:18) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA B200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 10.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 590.48.01
PyTorch: 2.9.1+cu129
sglang: 0.5.8.post1
sglang-kernel: 0.4.0
flashinfer_python: 0.6.6
flashinfer_cubin: 0.6.6
flashinfer_jit_cache: 0.6.6+cu129
triton: 3.5.1
transformers: 4.57.1
torchao: 0.9.0
numpy: 2.4.3
aiohttp: 3.13.3
fastapi: 0.135.1
hf_transfer: 0.1.9
huggingface_hub: 0.36.2
interegular: 0.3.3
modelscope: 1.35.0
orjson: 3.11.7
outlines: 0.1.11
packaging: 26.0
psutil: 7.2.2
pydantic: 2.12.5
python-multipart: 0.0.22
pyzmq: 27.1.0
uvicorn: 0.42.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.27
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.85.0
litellm: Module Not Found
torchcodec: 0.9.1
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 NIC10 NIC11 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PXB NODE NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS 0-55,112-167 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE NODE PXB NODE NODE SYS SYS SYS SYS SYS SYS 0-55,112-167 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE PXB NODE SYS SYS SYS SYS SYS SYS 0-55,112-167 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE NODE NODE NODE PXB SYS SYS SYS SYS SYS SYS 0-55,112-167 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS SYS SYS PXB NODE NODE NODE NODE NODE 56-111,168-223 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS SYS SYS NODE NODE NODE PXB NODE NODE 56-111,168-223 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE PXB NODE 56-111,168-223 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE NODE PXB 56-111,168-223 1 N/A
NIC0 PXB NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS
NIC1 NODE NODE NODE NODE SYS SYS SYS SYS NODE X PIX NODE NODE NODE SYS SYS SYS SYS SYS SYS
NIC2 NODE NODE NODE NODE SYS SYS SYS SYS NODE PIX X NODE NODE NODE SYS SYS SYS SYS SYS SYS
NIC3 NODE PXB NODE NODE SYS SYS SYS SYS NODE NODE NODE X NODE NODE SYS SYS SYS SYS SYS SYS
NIC4 NODE NODE PXB NODE SYS SYS SYS SYS NODE NODE NODE NODE X NODE SYS SYS SYS SYS SYS SYS
NIC5 NODE NODE NODE PXB SYS SYS SYS SYS NODE NODE NODE NODE NODE X SYS SYS SYS SYS SYS SYS
NIC6 SYS SYS SYS SYS PXB NODE NODE NODE SYS SYS SYS SYS SYS SYS X NODE NODE NODE NODE NODE
NIC7 SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS NODE X PIX NODE NODE NODE
NIC8 SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS NODE PIX X NODE NODE NODE
NIC9 SYS SYS SYS SYS NODE PXB NODE NODE SYS SYS SYS SYS SYS SYS NODE NODE NODE X NODE NODE
NIC10 SYS SYS SYS SYS NODE NODE PXB NODE SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE X NODE
NIC11 SYS SYS SYS SYS NODE NODE NODE PXB SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE NODE X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_4
NIC1: mlx5_5
NIC2: mlx5_6
NIC3: mlx5_7
NIC4: mlx5_8
NIC5: mlx5_9
NIC6: mlx5_10
NIC7: mlx5_11
NIC8: mlx5_12
NIC9: mlx5_13
NIC10: mlx5_14
NIC11: mlx5_15

ulimit soft: 1048576
