[Bug] TypeError in NSA backend: torch.repeat_interleave called with repeats=list during DeepSeek-V3.2 DEEPGEMM warm up (nightly docker) #15428

@momaek

Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

When running the DeepGEMM warm-up (python -m sglang.compile_deep_gemm) for DeepSeek-V3.2 with speculative decoding (EAGLE) enabled, the scheduler crashes at startup with:

TypeError: repeat_interleave() received an invalid combination of arguments - got (Tensor, dim=int, repeats=list)

This happens inside python/sglang/srt/layers/attention/nsa_backend.py during init_forward_metadata().
The PyTorch documentation indicates that torch.repeat_interleave expects repeats to be a Tensor or an int, not a Python list.
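
The overload mismatch can be reproduced outside SGLang. The snippet below is a minimal sketch on CPU with made-up values; the variable names mirror the ones in the traceback, but the tensors are illustrative and not taken from the real backend.

```python
import torch

# Toy stand-ins: two "requests", each with one row in the page table.
page_table = torch.tensor([[10, 11], [20, 21]])
extend_seq_lens_cpu = [2, 1]  # plain Python list, as in the failing call

# Passing the list directly: on the PyTorch build reported in this issue
# this raises TypeError, because no overload takes a list for `repeats`.
try:
    torch.repeat_interleave(page_table, repeats=extend_seq_lens_cpu, dim=0)
    list_error = None
except TypeError as exc:
    list_error = str(exc)

# Accepted overload 1: a single int repeats every row the same number of times.
by_int = torch.repeat_interleave(page_table, repeats=2, dim=0)

# Accepted overload 2: a 1-D integer Tensor gives a per-row repeat count.
by_tensor = torch.repeat_interleave(
    page_table, repeats=torch.tensor(extend_seq_lens_cpu), dim=0
)

print(list_error)
print(by_int.tolist())
print(by_tensor.tolist())
```

Only the Tensor overload expresses the per-request lengths that the NSA backend needs here, which is why the call has to receive a tensor rather than the CPU-side list.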

The SGLang docs also mention that DeepSeek V3.2 uses the NSA attention backend by default (unless overridden).

Reproduction

Using Docker Compose:

version: "3.9"
services:
  deepseek-v32:
    image: lmsysorg/sglang:nightly-dev-20251218-d20699a3
    container_name: sglang-deepseek-v32
    restart: unless-stopped
    ports:
      - "40000:30000"
    shm_size: "256g"
    ipc: host
    volumes:
      - /data3/models/DeepSeek-V3.2:/models/DeepSeek-V3.2:ro
      - /data3/deepgemm-cache-20251219:/root/.cache/deep_gemm
    environment:
      - HF_HOME=/data3/hf-cache
      - HUGGINGFACE_HUB_CACHE=/data3/hf-cache
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
    runtime: nvidia
    command: >
      python -m sglang.compile_deep_gemm
      --model-path /models/DeepSeek-V3.2
      --tp 8
      --dp 1
      --enable-dp-attention
      --host 0.0.0.0
      --port 30000
      --reasoning-parser deepseek-v3
      --tool-call-parser deepseekv32
      --speculative-algorithm EAGLE
      --speculative-num-steps 3
      --speculative-eagle-topk 1
      --speculative-num-draft-tokens 4
      --mem-fraction-static 0.85
      --max-running-requests 64
      --max-prefill-tokens 32768
      --chunked-prefill-size 8192
      --log-requests
      --log-requests-level 3

SGLang crashes with the following stack trace:

Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2798, in run_scheduler_process
    scheduler.event_loop_normal()
  ...
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/nsa_backend.py", line 436, in init_forward_metadata
    page_table = torch.repeat_interleave(..., dim=..., repeats=[...])
TypeError: repeat_interleave() received an invalid combination of arguments - got (Tensor, dim=int, repeats=list), but expected one of:
 * (Tensor input, Tensor repeats, int dim = None, *, int output_size = None)
 * (Tensor input, int repeats, int dim = None, *, int output_size = None)

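I fixed it locally with the following patch, which passes forward_batch.extend_seq_lens instead of the list extend_seq_lens_cpu: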

diff --git a/python/sglang/srt/layers/attention/nsa_backend.py b/python/sglang/srt/layers/attention/nsa_backend.py
index 18b1b9daf..4202501b1 100644
--- a/python/sglang/srt/layers/attention/nsa_backend.py
+++ b/python/sglang/srt/layers/attention/nsa_backend.py
@@ -435,7 +435,7 @@ class NativeSparseAttnBackend(
                 # after verification. Lengths vary per request based on how many tokens
                 # were accepted.
                 page_table = torch.repeat_interleave(
-                    page_table, repeats=extend_seq_lens_cpu, dim=0
+                    page_table, repeats=forward_batch.extend_seq_lens, dim=0
                 )
 
         elif forward_batch.forward_mode.is_extend():
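
If only the CPU-side list were available at that call site, an alternative would be to convert it to a tensor on the page table's device before the call. The helper below is a hypothetical sketch (expand_page_table is not an SGLang function) and has not been tested against the real backend. It assumes the lengths are non-negative integers, one per row of the page table.

```python
import torch


def expand_page_table(page_table: torch.Tensor, extend_seq_lens) -> torch.Tensor:
    """Repeat each row of page_table according to a per-row length.

    Hypothetical helper: accepts either a Python list of ints or a Tensor
    and normalizes it to an int64 Tensor on page_table's device, which is
    a form torch.repeat_interleave accepts for `repeats`.
    """
    repeats = torch.as_tensor(
        extend_seq_lens, dtype=torch.int64, device=page_table.device
    )
    return torch.repeat_interleave(page_table, repeats=repeats, dim=0)


# Toy data: row 0 is repeated once, row 1 three times.
table = torch.tensor([[0, 1], [2, 3]])
from_list = expand_page_table(table, [1, 3])
from_tensor = expand_page_table(table, torch.tensor([1, 3]))
print(from_list.tolist())
```

Converting a list this way adds a host-to-device copy when the page table lives on the GPU, so reusing a tensor that is already on the device, as the patch above does, is likely the cheaper choice.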

Environment

Python: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 570.86.10
PyTorch: 2.9.1+cu129
sglang: 0.5.6
sgl_kernel: 0.3.18.post2
flashinfer_python: 0.5.3
flashinfer_cubin: 0.5.3
flashinfer_jit_cache: Module Not Found
triton: 3.5.1
transformers: 4.57.1
torchao: 0.9.0
numpy: 2.3.5
aiohttp: 3.13.2
fastapi: 0.123.5
hf_transfer: 0.1.9
huggingface_hub: 0.36.0
interegular: 0.3.3
modelscope: 1.32.0
orjson: 3.11.4
outlines: 0.1.11
packaging: 25.0
psutil: 7.1.3
pydantic: 2.12.5
python-multipart: 0.0.20
pyzmq: 27.1.0
uvicorn: 0.38.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.27
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.75.0
litellm: Module Not Found
decord2: 2.0.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX NODE SYS SYS 0,2,4,6,8,10 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE SYS SYS 0,2,4,6,8,10 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE SYS SYS 0,2,4,6,8,10 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE PIX SYS SYS 0,2,4,6,8,10 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS PIX NODE 1,3,5,7,9,11 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS NODE NODE 1,3,5,7,9,11 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS NODE PIX 1,3,5,7,9,11 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS NODE NODE 1,3,5,7,9,11 1 N/A
NIC0 PIX NODE NODE NODE SYS SYS SYS SYS X NODE SYS SYS
NIC1 NODE NODE NODE PIX SYS SYS SYS SYS NODE X SYS SYS
NIC2 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS X NODE
NIC3 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS NODE X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3

ulimit soft: 1048576
