Checklist
Describe the bug
Summary
FlashInfer GDN decode accuracy degrades significantly with SGLang's no_buffer Mamba scheduling
strategy, while the Triton FLA kernel is unaffected.
Model: nvidia/Qwen3.5-397B-A17B-NVFP4, 4× B200
Eval: gsm8k, 200 questions, temperature=0.6, top_p=0.95, top_k=20, 128 concurrent threads
| Config | FlashInfer | Triton |
|---|---|---|
| extra_buffer | 0.990 | 0.990 |
| no_buffer | 0.940 | 0.990 |
| no_buffer + --disable-radix-cache | ~0.890 | 0.990 |
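To get a rough sense of whether these gaps exceed sampling noise at temperature=0.6, the accuracies can be compared with a pooled two-proportion z statistic. This is only a sketch under stated assumptions: it treats each run as an independent sample of 200 questions (the runs actually share the same questions, so a paired test would be stricter), and it reads ~0.890 as exactly 0.890. The function name is illustrative.

```python
import math

def two_proportion_z(p1, p2, n1, n2):
    """Pooled two-proportion z statistic for accuracies p1 and p2."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / se

N = 200           # questions per run, from the eval setup
BASELINE = 0.990  # Triton accuracy, identical in every config

for label, acc in [
    ("no_buffer", 0.940),
    ("no_buffer + --disable-radix-cache", 0.890),  # assumption: ~0.890 taken as 0.890
]:
    z = two_proportion_z(BASELINE, acc, N, N)
    extra_wrong = round((BASELINE - acc) * N)
    print(f"{label}: about {extra_wrong} more wrong answers, z = {z:.2f}")
```

Under these assumptions both z values are well above 2, which suggests the drop is unlikely to be explained by sampling variance alone.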
Root Cause Hypothesis
The no_buffer strategy allows more aggressive SSM state slot reuse. The Triton path does
explicit gather/scatter (copies state in/out per request), while FlashInfer's pool API
(initial_state + initial_state_indices) reads and writes state in-place. We suspect this
causes a state aliasing or ordering issue under no_buffer, but have not identified the
exact mechanism.
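The difference between the two access patterns can be illustrated with a toy state pool. This is a hypothetical sketch, not the FlashInfer or Triton implementation: `step` is a stand-in for the real GDN update, the pool is a plain list, and the in-place loop serializes what a GPU kernel would do concurrently (where the ordering would not even be defined). It shows only that the two patterns agree when slot indices are distinct and can diverge once a slot index is aliased within a batch.

```python
def step(state, x):
    """Stand-in for one recurrent decode update (not the real GDN rule)."""
    return 0.5 * state + x

def decode_gather_scatter(pool, slots, inputs):
    """Copy each request's state out, update the copies, then write them back."""
    gathered = [pool[s] for s in slots]      # gather: snapshot before any write
    updated = [step(h, x) for h, x in zip(gathered, inputs)]
    for s, h in zip(slots, updated):         # scatter: last writer wins
        pool[s] = h
    return updated

def decode_in_place(pool, slots, inputs):
    """Read and write the pool directly, one request after another."""
    outputs = []
    for s, x in zip(slots, inputs):
        pool[s] = step(pool[s], x)           # later requests see earlier writes
        outputs.append(pool[s])
    return outputs

# Distinct slots: both patterns produce the same outputs and the same pool.
pool_a, pool_b = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]
assert decode_gather_scatter(pool_a, [0, 2], [1.0, 1.0]) == \
       decode_in_place(pool_b, [0, 2], [1.0, 1.0])
assert pool_a == pool_b

# Aliased slot (index 1 appears twice): the patterns diverge.
pool_a, pool_b = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]
out_copy = decode_gather_scatter(pool_a, [1, 1], [3.0, 4.0])
out_in_place = decode_in_place(pool_b, [1, 1], [3.0, 4.0])
print("gather/scatter:", out_copy, pool_a)
print("in-place:      ", out_in_place, pool_b)
```

Whether anything like this happens under no_buffer is exactly the open question; the sketch only makes the suspected failure mode concrete.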
Question: Is the pool API safe when the same slot index is reused across requests, or
when a slot is reassigned before the previous write completes?
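While that question is open, a host-side sanity check on the slot indices may help narrow things down. The helpers below are hypothetical debugging aids, not part of SGLang or FlashInfer; they assume the caller can obtain the slot indices passed as initial_state_indices as a plain Python list, plus a slot-to-request mapping for consecutive decode steps.

```python
from collections import Counter

def find_aliased_slots(slot_indices):
    """Return {slot: count} for every slot that appears more than once in a batch."""
    counts = Counter(slot_indices)
    return {slot: n for slot, n in counts.items() if n > 1}

def slots_with_new_owner(prev_owner, curr_owner):
    """Return slots whose owning request changed between two consecutive steps.

    prev_owner and curr_owner map slot index -> request id (hypothetical format).
    """
    return sorted(
        slot
        for slot, request_id in curr_owner.items()
        if slot in prev_owner and prev_owner[slot] != request_id
    )

# Example: slot 3 is used three times and slot 7 twice in one batch.
print(find_aliased_slots([3, 7, 3, 9, 7, 3]))

# Example: slot 1 moved from req-b to req-c between steps.
print(slots_with_new_owner(
    {0: "req-a", 1: "req-b"},
    {0: "req-a", 1: "req-c", 2: "req-d"},
))
```

A non-empty result from the first helper would point at in-batch aliasing; a non-empty result from the second identifies slots reused back-to-back, which are the candidates for an ordering problem.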
Note
extra_buffer and --disable-radix-cache are mutually exclusive in SGLang, so the only
valid production configs are extra_buffer (radix cache on) or no_buffer + --disable-radix-cache.
This means no_buffer + --disable-radix-cache is a common deployment config that is
currently broken for FlashInfer GDN decode.
Reproduction
server side:
# Usage: bash 01.run_qwen3_5.sh nvfp4 flashinfer
MODE=${1:-""}
BACKEND=${2:-"triton"}
# Set ENABLE_HF=1 to use HuggingFace weights (requires HF_TOKEN)
ENABLE_HF=${ENABLE_HF:-0}
export HF_TOKEN="hf_xxx"
if [ "$MODE" = "nvfp4" ]; then
    MODEL="nvidia/Qwen3.5-397B-A17B-NVFP4"
fi
# Auto-detect number of GPUs and set TP_SIZE accordingly
NUM_GPUS=$(nvidia-smi --list-gpus | wc -l)
if [ "$NUM_GPUS" -eq 4 ]; then
    TP_SIZE=4
elif [ "$NUM_GPUS" -eq 8 ]; then
    TP_SIZE=8
else
    echo "Warning: Detected $NUM_GPUS GPUs. Defaulting to TP_SIZE=8"
    TP_SIZE=8
fi
echo "Detected $NUM_GPUS GPUs, using TP_SIZE=$TP_SIZE"
PORT=30000
# Set ENABLE_MTP=1 to enable Multi-Token Prediction (MTP)
ENABLE_MTP=${ENABLE_MTP:-0}
MTP_ARGS=""
if [ "$ENABLE_MTP" -eq 1 ]; then
    MTP_ARGS="--speculative-algo NEXTN \
        --speculative-num-steps 3 \
        --speculative-eagle-topk 1 \
        --speculative-num-draft-tokens 4"
fi
MODE_SUFFIX=""
if [ "$MODE" = "fp8" ]; then
    MODE_SUFFIX="_fp8"
elif [ "$MODE" = "nvfp4" ]; then
    MODE_SUFFIX="_nvfp4"
fi
if [ "$ENABLE_MTP" -eq 1 ]; then
    LOG_FILE="/scratch/repo/sglang/lab/qwen3-next-mtp/qwen35_tp${TP_SIZE}_mtp${MODE_SUFFIX}.log"
else
    LOG_FILE="/scratch/repo/sglang/lab/qwen3-next-mtp/qwen35_tp${TP_SIZE}${MODE_SUFFIX}.log"
fi
NVFP4_ARGS=()
if [ "$MODE" = "nvfp4" ]; then
    NVFP4_ARGS=(
        --quantization modelopt_fp4
        --mamba-scheduler-strategy no_buffer
        --mamba-track-interval 128
        --model-loader-extra-config '{"enable_multithread_load": true,"num_threads": 64}'
    )
fi
export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=False
export SGLANG_TORCH_PROFILER_DIR=/scratch/repo/sglang/lab/qwen3-next-mtp/profile_fi_pool
python -m sglang.launch_server \
    --model $MODEL \
    --tp-size ${TP_SIZE} \
    --port ${PORT} \
    --max-running-requests 128 \
    --chunked-prefill-size 2048 \
    --mamba-ssm-dtype bfloat16 \
    --reasoning-parser qwen3 \
    --attention-backend trtllm_mha \
    --linear-attn-decode-backend $BACKEND \
    --disable-radix-cache \
    $MTP_ARGS \
    "${NVFP4_ARGS[@]}" 2>&1 | tee "$LOG_FILE"
client:
#!/bin/bash
PORT=${1:-30000}
NUM_QUESTIONS=${2:-200}
MODEL=${3:-"nvidia/Qwen3.5-397B-A17B-NVFP4"}
python3 - <<EOF
from types import SimpleNamespace
from sglang.test.run_eval import run_eval
args = SimpleNamespace(
    model="${MODEL}",
    eval_name="gsm8k",
    num_shots=5,
    num_examples=${NUM_QUESTIONS},
    max_tokens=16000,
    num_threads=128,
    repeat=1,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    base_url="http://127.0.0.1:${PORT}",
    host="http://127.0.0.1",
    port=${PORT},
)
metrics = run_eval(args)
print(f"metrics={metrics}")
EOF
Environment
Python: 3.12.3 (main, Mar 3 2026, 12:15:18) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA B200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 10.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 590.48.01
PyTorch: 2.9.1+cu129
sglang: 0.5.8.post1
sglang-kernel: 0.4.0
flashinfer_python: 0.6.6
flashinfer_cubin: 0.6.6
flashinfer_jit_cache: 0.6.6+cu129
triton: 3.5.1
transformers: 4.57.1
torchao: 0.9.0
numpy: 2.4.3
aiohttp: 3.13.3
fastapi: 0.135.1
hf_transfer: 0.1.9
huggingface_hub: 0.36.2
interegular: 0.3.3
modelscope: 1.35.0
orjson: 3.11.7
outlines: 0.1.11
packaging: 26.0
psutil: 7.2.2
pydantic: 2.12.5
python-multipart: 0.0.22
pyzmq: 27.1.0
uvicorn: 0.42.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.27
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.85.0
litellm: Module Not Found
torchcodec: 0.9.1
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 NIC10 NIC11 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PXB NODE NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS 0-55,112-167 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE NODE PXB NODE NODE SYS SYS SYS SYS SYS SYS 0-55,112-167 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE PXB NODE SYS SYS SYS SYS SYS SYS 0-55,112-167 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE NODE NODE NODE PXB SYS SYS SYS SYS SYS SYS 0-55,112-167 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS SYS SYS PXB NODE NODE NODE NODE NODE 56-111,168-223 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS SYS SYS NODE NODE NODE PXB NODE NODE 56-111,168-223 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE PXB NODE 56-111,168-223 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE NODE PXB 56-111,168-223 1 N/A
NIC0 PXB NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS
NIC1 NODE NODE NODE NODE SYS SYS SYS SYS NODE X PIX NODE NODE NODE SYS SYS SYS SYS SYS SYS
NIC2 NODE NODE NODE NODE SYS SYS SYS SYS NODE PIX X NODE NODE NODE SYS SYS SYS SYS SYS SYS
NIC3 NODE PXB NODE NODE SYS SYS SYS SYS NODE NODE NODE X NODE NODE SYS SYS SYS SYS SYS SYS
NIC4 NODE NODE PXB NODE SYS SYS SYS SYS NODE NODE NODE NODE X NODE SYS SYS SYS SYS SYS SYS
NIC5 NODE NODE NODE PXB SYS SYS SYS SYS NODE NODE NODE NODE NODE X SYS SYS SYS SYS SYS SYS
NIC6 SYS SYS SYS SYS PXB NODE NODE NODE SYS SYS SYS SYS SYS SYS X NODE NODE NODE NODE NODE
NIC7 SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS NODE X PIX NODE NODE NODE
NIC8 SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS NODE PIX X NODE NODE NODE
NIC9 SYS SYS SYS SYS NODE PXB NODE NODE SYS SYS SYS SYS SYS SYS NODE NODE NODE X NODE NODE
NIC10 SYS SYS SYS SYS NODE NODE PXB NODE SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE X NODE
NIC11 SYS SYS SYS SYS NODE NODE NODE PXB SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_4
NIC1: mlx5_5
NIC2: mlx5_6
NIC3: mlx5_7
NIC4: mlx5_8
NIC5: mlx5_9
NIC6: mlx5_10
NIC7: mlx5_11
NIC8: mlx5_12
NIC9: mlx5_13
NIC10: mlx5_14
NIC11: mlx5_15
ulimit soft: 1048576