[Bug] illegal memory of BatchQKApplyRotaryPosIdsCosSinCache when spec decoding #10713

@hnyls2002

Description

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the issue you submit lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve it, reducing the likelihood of feedback.
  • 4. If the issue you raised is a question rather than a bug, please open a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

  File "/root/sglang/python/sglang/srt/layers/rotary_embedding.py", line 228, in forward_cuda
    apply_rope_with_cos_sin_cache_inplace(
  File "/usr/local/lib/python3.12/dist-packages/sgl_kernel/elementwise.py", line 323, in apply_rope_with_cos_sin_cache_inplace
    torch.ops.sgl_kernel.apply_rope_pos_ids_cos_sin_cache.default(
  File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 829, in __call__
    return self._op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: BatchQKApplyRotaryPosIdsCosSinCache failed with error code an illegal memory access was encountered
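One plausible failure mode (an assumption, not a confirmed root cause): an illegal memory access in this kernel is consistent with position ids indexing past the end of the cos/sin cache, e.g. when `SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1` allows sequences beyond `max_position_embeddings`, or when speculative decoding advances positions by the draft-token count. A minimal host-side sanity check, as a sketch (the function name and shapes are hypothetical, not sglang API):

```python
# Hypothetical pre-flight check: verify every position id fits inside the
# cos/sin cache before launching the RoPE kernel. The kernel does no bounds
# checking, so an out-of-range id surfaces as an illegal memory access.
def pos_ids_in_bounds(pos_ids, cache_len):
    """Return True if all position ids index valid rows of the cache.

    pos_ids:   iterable of int position ids (one per token in the batch)
    cache_len: number of rows in the cos/sin cache
               (typically max_position_embeddings)
    """
    return all(0 <= p < cache_len for p in pos_ids)

# Example: with --speculative-num-draft-tokens 6, decode positions can run
# 6 past the verified sequence length; near the context limit that can
# exceed the cache and read out of bounds.
seq_len = 8190
draft_tokens = 6
cache_len = 8192
pos_ids = list(range(seq_len, seq_len + draft_tokens))  # 8190..8195
print(pos_ids_in_bounds(pos_ids, cache_len))  # False: 8192..8195 overflow
```

Logging such a check just before `apply_rope_with_cos_sin_cache_inplace` (or running with `CUDA_LAUNCH_BLOCKING=1`) would help confirm or rule out this hypothesis.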

Reproduction

Launch the server

export MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct
export SPEC_MODEL=lmsys/sglang-EAGLE-LLaMA3-Instruct-8B
export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
python -m sglang.launch_server \
    --dtype float16 \
    --model-path $MODEL \
    --attention-backend triton \
    --decode-log-interval 1 \
    --cuda-graph-bs $(seq -s ' ' 1 64) \
    --mem-fraction-static 0.75 \
    --disable-radix-cache \
    --speculative-algorithm EAGLE \
    --speculative-draft-model $SPEC_MODEL \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 6 \
    --host 127.0.0.1 \
    --port 23333

Run the benchmark

export MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct
python3 -m sglang.bench_serving \
    --port 23333 \
    --model $MODEL \
    --dataset-name sharegpt \
    --backend sglang-oai \
    --random-range-ratio 0 \
    --random-input-len 1200 \
    --random-output-len 512 \
    --num-prompts 1000

Environment

H100; latest main branch, running the lmsysorg/sglang:dev image
