Speculative decoding with lookahead #2790
jjjjohnson wants to merge 3 commits into sgl-project:main
Conversation
Hi @jjjjohnson, could you help resolve the conflicts? Thanks.
Done.
Could you share any performance results?
I find this PR cannot run DeepSeek V3. Have you tested this model?
No. What is the error message?
MLA crashes, and it does not show a very useful error message.
Force-pushed from db61dbe to f775d00
I find this PR cannot run Llama 8B with the Triton backend. The error points to: File "/data/peng/sglang/python/sglang/srt/speculative/lookahead_utils.py", line 160, in verify. Does this PR support the Triton backend?
I think MLA attention does not support tree masks, so this PR does not work with DeepSeek.
Lookahead depends on FlashInfer's tree-mask attention; the Triton backend does not support tree masks yet. A rough illustration of such a mask is sketched below.
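To make the constraint concrete, here is a minimal sketch (illustrative only, not this PR's code; the build_tree_mask helper and its parent-index tree encoding are assumptions of this example) of why verifying a draft token tree needs a non-causal attention mask: each draft token must attend to itself and its ancestors, but not to sibling branches.

```python
# Illustrative sketch, not this PR's implementation: build a boolean
# attention mask for a draft token tree. parents[i] is the index of
# node i's parent, or -1 for a root hanging off the last verified token.
import torch

def build_tree_mask(parents: list[int]) -> torch.Tensor:
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        mask[i, i] = True      # each draft token attends to itself...
        p = parents[i]
        while p != -1:         # ...and to every ancestor in its branch
            mask[i, p] = True
            p = parents[p]
    return mask

# Two candidate branches after the current token: 0 -> 1 -> 2 and 0 -> 3.
# Row 3 attends to columns {0, 3} but not {1, 2}, so the mask is not
# lower-triangular; a backend that only supports causal masks cannot
# express it.
print(build_tree_mask([-1, 0, 1, 0]).int())
```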
Force-pushed from f1bbfd4 to da7ab03
True. I have updated the server args to make sure FlashInfer is used when lookahead is on.
Force-pushed from 90ee287 to b5a91cb
Force-pushed from b5a91cb to 25a3a57
@zhyncs /ready |
Any progress on this?
Hello, this PR has been pending for a while. Any updates or blockers I should know about?
@hotelll |
Hi @jjjjohnson, why was this PR closed? Is there any plan to continue it? Thanks.
@jjjjohnson During RL training, model weights are constantly changing, so we cannot train a specialized weight-based draft model (such as EAGLE) for the model. A statistics-based speculative decoding approach like lookahead would therefore be very useful for accelerating the rollout phase in RL training. Do you have plans to continue merging this PR?
@jjjjohnson The original implementation in this PR has relatively low performance. I have optimized it based on the original version, achieving a 2.x speedup in our application scenario. If possible, I would like to submit a PR.
Here is the PR: #9873
Great work! Could you please explain how you built sgl-kernel? The command "export PYTHONPATH=sglang/sgl-kernel/python:$PYTHONPATH" doesn't seem to solve the issue. When I commented out the line
@valorix25

```bash
#!/bin/bash
set -ex

PYTHON_VERSION=$1
CUDA_VERSION=$2
PYTHON_ROOT_PATH=/opt/python/cp${PYTHON_VERSION//.}-cp${PYTHON_VERSION//.}

# export CUDA_NVCC_EXECUTABLE=$(which nvcc)
# export NVCC="ccache $CUDA_NVCC_EXECUTABLE"

# set CUDA envs
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
export CMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
export CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
export CUDA_BIN_PATH=/usr/local/cuda/bin
export TORCH_CUDA_ARCH_LIST="8.9 9.0+PTX"
export SGL_KERNEL_ENABLE_BF16=1
export SGL_KERNEL_ENABLE_FP8=1
export SGL_KERNEL_ENABLE_SM90A=1

# set ccache envs
if command -v ccache &> /dev/null; then
    export CCACHE_DIR="$HOME/.ccache"
    export PATH="/usr/lib/ccache:$PATH"
    export CCACHE_CPP2=yes
    echo "ccache enabled"
fi

# clean build dir
# rm -rf build

if [ -z "$3" ]; then
    ARCH=$(uname -i)
else
    ARCH=$3
fi
echo "ARCH: $ARCH"

if [ "${ARCH}" = "aarch64" ]; then
    LIBCUDA_ARCH="sbsa"
    BUILDER_NAME="pytorch/manylinuxaarch64-builder"
    CMAKE_BUILD_PARALLEL_LEVEL=16
else
    LIBCUDA_ARCH=${ARCH}
    BUILDER_NAME="pytorch/manylinux2_28-builder"
fi

if [ "${CUDA_VERSION}" = "12.9" ]; then
    DOCKER_IMAGE="${BUILDER_NAME}:cuda${CUDA_VERSION}"
    TORCH_INSTALL="pip install --no-cache-dir torch==2.8.0 --index-url https://download.pytorch.org/whl/cu129"
elif [ "${CUDA_VERSION}" = "12.8" ]; then
    DOCKER_IMAGE="${BUILDER_NAME}:cuda${CUDA_VERSION}"
    TORCH_INSTALL="pip install --no-cache-dir torch==2.8.0 --index-url https://download.pytorch.org/whl/cu128"
else
    DOCKER_IMAGE="${BUILDER_NAME}:cuda${CUDA_VERSION}"
    TORCH_INSTALL="pip install --no-cache-dir torch==2.8.0 --index-url https://download.pytorch.org/whl/cu126"
fi

make build
bash rename_wheels.sh
```
It takes a long time to compile the entire project using this script. Is there any way to perform incremental compilation only? |
Motivation
N-gram-based speculative decoding is very effective in retrieval-augmented generation (RAG). Compared with EAGLE, the cost of generating draft tokens is relatively low, so it has great potential for accelerating token generation in RAG. Ant Group has proposed a trie-based retrieval and verification mechanism; they report that lookahead built on vLLM achieves a 1.6x speedup in a real-life single-query scenario. I want to adopt lookahead in SGLang. A sketch of the core mechanism follows.
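Below is a minimal sketch of trie-based retrieval and lossless verification, assuming greedy decoding on the target model; the names NGramTrie and verify are illustrative and are not taken from this PR's code.

```python
# Illustrative sketch, not this PR's lookahead_utils: past token
# sequences are inserted into a trie; at decode time the recent
# context suffix is matched against the trie to retrieve draft
# tokens, which the target model then verifies losslessly.

class NGramTrie:
    def __init__(self) -> None:
        self.children: dict[int, "NGramTrie"] = {}

    def insert(self, tokens: list[int]) -> None:
        """Record a token sequence, e.g. from retrieved documents or
        previously generated output."""
        node = self
        for t in tokens:
            node = node.children.setdefault(t, NGramTrie())

    def retrieve(self, prefix: list[int], max_draft: int) -> list[int]:
        """Follow `prefix` into the trie, then walk child nodes to
        propose up to `max_draft` draft tokens."""
        node = self
        for t in prefix:
            if t not in node.children:
                return []  # no match: fall back to normal decoding
            node = node.children[t]
        draft = []
        while node.children and len(draft) < max_draft:
            t, node = next(iter(node.children.items()))
            draft.append(t)
        return draft


def verify(draft: list[int], target_out: list[int]) -> list[int]:
    """Lossless greedy acceptance: one target-model forward pass over
    the draft yields len(draft) + 1 next-token predictions; accept the
    longest matching prefix plus one bonus token from the target."""
    n_accept = 0
    while n_accept < len(draft) and draft[n_accept] == target_out[n_accept]:
        n_accept += 1
    return draft[:n_accept] + [target_out[n_accept]]


trie = NGramTrie()
trie.insert([5, 6, 7, 8])
print(trie.retrieve([5, 6], max_draft=2))  # -> [7, 8]
print(verify([7, 8], [7, 9, 0]))           # -> [7, 9]
```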
Related resources
Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy
Overall workflow
Features
Checklist