Speculative decoding with lookahead #2790
jjjjohnson wants to merge 3 commits into sgl-project:main
Conversation
Hi @jjjjohnson, could you help resolve the conflicts? Thanks.
Done.
Could you share any performance results?
I find this PR cannot run DeepSeek V3. Have you tested this model?
No. What is the error message?
MLA crashes, and it does not show a very useful error message.
Force-pushed from db61dbe to f775d00
I find this PR cannot run Llama 8B with the Triton backend. The error points to: File "/data/peng/sglang/python/sglang/srt/speculative/lookahead_utils.py", line 160, in verify. Does this PR support the Triton backend?
I think MLA attention does not support tree masks, so this PR does not work with DeepSeek.
Lookahead depends on FlashInfer's tree-mask attention; the Triton backend does not support tree masks yet. A rough illustration of such a mask is sketched below.
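To make the constraint concrete, here is a minimal sketch (illustrative only, not this PR's code; the build_tree_mask helper and its parent-index tree encoding are assumptions of this example) of why verifying a draft token tree needs a non-causal attention mask: each draft token must attend to itself and its ancestors, but not to sibling branches.

```python
# Illustrative sketch, not this PR's implementation: build a boolean
# attention mask for a draft token tree. parents[i] is the index of
# node i's parent, or -1 for a root hanging off the last verified token.
import torch

def build_tree_mask(parents: list[int]) -> torch.Tensor:
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        mask[i, i] = True      # each draft token attends to itself...
        p = parents[i]
        while p != -1:         # ...and to every ancestor in its branch
            mask[i, p] = True
            p = parents[p]
    return mask

# Two candidate branches after the current token: 0 -> 1 -> 2 and 0 -> 3.
# Row 3 attends to columns {0, 3} but not {1, 2}, so the mask is not
# lower-triangular; a backend that only supports causal masks cannot
# express it.
print(build_tree_mask([-1, 0, 1, 0]).int())
```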
Force-pushed from f1bbfd4 to da7ab03
True. I have updated the server args to make sure FlashInfer is used when lookahead is on.
Force-pushed from 90ee287 to b5a91cb
Force-pushed from b5a91cb to 25a3a57
@zhyncs /ready |
Any progress on this?
Hello, this PR has been pending for a while. Any updates or blockers I should know about?
@hotelll |
Hi @jjjjohnson, why was this PR closed? Is there any plan to continue it? Thanks.
@jjjjohnson During RL training, model weights are constantly changing, so we cannot train a specialized weight-based draft model (such as EAGLE) for the model. A statistics-based speculative decoding approach like lookahead would therefore be very useful for accelerating the rollout phase in RL training. Do you have plans to continue merging this PR?
@jjjjohnson The original implementation in this PR has relatively low performance. I have optimized it based on the original version, achieving a 2.x speedup in our application scenario. If possible, I would like to submit a PR.
Here is the PR: #9873
Great work! Could you please explain how you built sgl-kernel? The command "export PYTHONPATH=sglang/sgl-kernel/python:$PYTHONPATH" doesn't seem to solve the issue. When I commented out the line
@valorix25

```bash
#!/bin/bash
set -ex

PYTHON_VERSION=$1
CUDA_VERSION=$2
PYTHON_ROOT_PATH=/opt/python/cp${PYTHON_VERSION//.}-cp${PYTHON_VERSION//.}

# export CUDA_NVCC_EXECUTABLE=$(which nvcc)
# export NVCC="ccache $CUDA_NVCC_EXECUTABLE"

# set CUDA envs
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
export CMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
export CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
export CUDA_BIN_PATH=/usr/local/cuda/bin
export TORCH_CUDA_ARCH_LIST="8.9 9.0+PTX"
export SGL_KERNEL_ENABLE_BF16=1
export SGL_KERNEL_ENABLE_FP8=1
export SGL_KERNEL_ENABLE_SM90A=1

# set ccache envs
if command -v ccache &> /dev/null; then
    export CCACHE_DIR="$HOME/.ccache"
    export PATH="/usr/lib/ccache:$PATH"
    export CCACHE_CPP2=yes
    echo "ccache enabled"
fi

# clean build dir
# rm -rf build

if [ -z "$3" ]; then
    ARCH=$(uname -i)
else
    ARCH=$3
fi
echo "ARCH: $ARCH"

if [ "${ARCH}" = "aarch64" ]; then
    LIBCUDA_ARCH="sbsa"
    BUILDER_NAME="pytorch/manylinuxaarch64-builder"
    CMAKE_BUILD_PARALLEL_LEVEL=16
else
    LIBCUDA_ARCH=${ARCH}
    BUILDER_NAME="pytorch/manylinux2_28-builder"
fi

if [ "${CUDA_VERSION}" = "12.9" ]; then
    DOCKER_IMAGE="${BUILDER_NAME}:cuda${CUDA_VERSION}"
    TORCH_INSTALL="pip install --no-cache-dir torch==2.8.0 --index-url https://download.pytorch.org/whl/cu129"
elif [ "${CUDA_VERSION}" = "12.8" ]; then
    DOCKER_IMAGE="${BUILDER_NAME}:cuda${CUDA_VERSION}"
    TORCH_INSTALL="pip install --no-cache-dir torch==2.8.0 --index-url https://download.pytorch.org/whl/cu128"
else
    DOCKER_IMAGE="${BUILDER_NAME}:cuda${CUDA_VERSION}"
    TORCH_INSTALL="pip install --no-cache-dir torch==2.8.0 --index-url https://download.pytorch.org/whl/cu126"
fi

make build
bash rename_wheels.sh
```
It takes a long time to compile the entire project using this script. Is there any way to perform incremental compilation only? |
Motivation
N-gram-based speculative decoding is very effective in retrieval-augmented generation (RAG). Compared with EAGLE, the cost of generating draft tokens is relatively low, so it has great potential for accelerating token generation in RAG. Ant Group has proposed a trie-based retrieval and verification mechanism; they report that lookahead built on vLLM achieves a 1.6x speedup in a real-life single-query scenario. I want to adopt lookahead in SGLang. A sketch of the core mechanism follows.
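Below is a minimal sketch of trie-based retrieval and lossless verification, assuming greedy decoding on the target model; the names NGramTrie and verify are illustrative and are not taken from this PR's code.

```python
# Illustrative sketch, not this PR's lookahead_utils: past token
# sequences are inserted into a trie; at decode time the recent
# context suffix is matched against the trie to retrieve draft
# tokens, which the target model then verifies losslessly.

class NGramTrie:
    def __init__(self) -> None:
        self.children: dict[int, "NGramTrie"] = {}

    def insert(self, tokens: list[int]) -> None:
        """Record a token sequence, e.g. from retrieved documents or
        previously generated output."""
        node = self
        for t in tokens:
            node = node.children.setdefault(t, NGramTrie())

    def retrieve(self, prefix: list[int], max_draft: int) -> list[int]:
        """Follow `prefix` into the trie, then walk child nodes to
        propose up to `max_draft` draft tokens."""
        node = self
        for t in prefix:
            if t not in node.children:
                return []  # no match: fall back to normal decoding
            node = node.children[t]
        draft = []
        while node.children and len(draft) < max_draft:
            t, node = next(iter(node.children.items()))
            draft.append(t)
        return draft


def verify(draft: list[int], target_out: list[int]) -> list[int]:
    """Lossless greedy acceptance: one target-model forward pass over
    the draft yields len(draft) + 1 next-token predictions; accept the
    longest matching prefix plus one bonus token from the target."""
    n_accept = 0
    while n_accept < len(draft) and draft[n_accept] == target_out[n_accept]:
        n_accept += 1
    return draft[:n_accept] + [target_out[n_accept]]


trie = NGramTrie()
trie.insert([5, 6, 7, 8])
print(trie.retrieve([5, 6], max_draft=2))  # -> [7, 8]
print(verify([7, 8], [7, 9, 0]))           # -> [7, 9]
```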
Related resources
Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy
Overall workflow
Features
Checklist