Enable CPU device on SGLang by chunyuan-w · Pull Request #2806 · sgl-project/sglang

chunyuan-w · 2025-01-09T07:02:22Z

Motivation

This PR enables CPU device on SGLang.
Currently we fallback attention and MoE to the torch native backend and make the functionality work on CPU.
We will submit follow-up PRs to provide optimized kernels to further improvement the performance.

For vllm installation for CPU, users could follow the instruction provided by vllm here.

Modifications

The main modifications include:

Add a native implementation for MoE (moe_forward_native) following the original implementation in the model: moe_infer in deepseek. This performs better than the existing fused_moe_forward_native on CPU.
For CPU, we won't call the code to set the number of threads to 1 anymore: link to the current code in SGLang, otherwise, only 1 CPU core will be used when running the workload. This change will improve the performance on CPU.
For the rotary embedding part, in the DeepseekScalingRotaryEmbedding class defined in vllm, the device has been hard-coded to "cuda" in these two places: _compute_inv_freq, _compute_cos_sin_cache. We temporarily port the related code into SGLang to make it compatible with the CPU version. We will add an optimized rotary embedding kernel for CPU and will remove the ported code then.

Example

Below are some example command lines to use on CPU with this PR. We only support --disable-mla for now.
Supposing we want to use 40 CPU cores on the NUMA node 0:

Bench one batch

numactl --physcpubind=0-39 --membind=0 python3 -m sglang.bench_one_batch --batch-size 1 --input 1024 --output 8 --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --trust-remote-code --device cpu --attention-backend torch_native --disable-mla

Server mode

Command line on server side:

numactl --physcpubind=0-39 --membind=0 python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --disable-radix --trust-remote-code --device cpu --attention-backend torch_native --disable-mla --log-requests

Command line on client side:

# Bench the serving
python3 -m sglang.bench_serving --backend sglang --num-prompts 8

# Collecting the score on mmlu
python3 -m sglang.test.run_eval --eval-name mmlu --num-examples 64 --port 30000

merrymercy · 2025-01-17T05:23:03Z

@chunyuan-w merged. Thanks!

mingfeima mentioned this pull request Jan 9, 2025

[Feature] RFC for adding CPU support for SGLang #2807

Closed

8 tasks

chunyuan-w force-pushed the chunyuan/enable_cpu_device branch 2 times, most recently from ff9b4e1 to dafbe3e Compare January 14, 2025 07:57

chunyuan-w marked this pull request as ready for review January 14, 2025 08:07

chunyuan-w requested review from ByronHsu, HaiShaw, Ying1123, hnyls2002, ispobock, merrymercy and zhyncs as code owners January 14, 2025 08:07

chunyuan-w force-pushed the chunyuan/enable_cpu_device branch 3 times, most recently from 5f8ca68 to ab3b275 Compare January 16, 2025 05:25

zhyncs assigned ispobock and BBuf Jan 16, 2025

chunyuan-w added 5 commits January 16, 2025 13:22

Enable CPU device on SGLang

1f995ce

add wrapper calls for vllm TP and rope functions

021fc4c

simplify TP related changes

af1f425

update pyproject.toml for CPU

ab3b275

Merge branch 'main' into chunyuan/enable_cpu_device

852b06d

merrymercy requested changes Jan 16, 2025

View reviewed changes

Comment thread python/sglang/srt/managers/scheduler.py Outdated

chunyuan-w requested a review from merrymercy January 17, 2025 02:15

merrymercy merged commit 6305173 into sgl-project:main Jan 17, 2025

make self.current_stream.synchronize a no-op during init for CPU

a6473e6

merrymercy mentioned this pull request Jan 19, 2025

feat: remove vllm get_rope #2964

Merged

4 tasks

timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025

Enable CPU device on SGLang (sgl-project#2806)

ec1495a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable CPU device on SGLang#2806

Enable CPU device on SGLang#2806
merrymercy merged 6 commits intosgl-project:mainfrom
chunyuan-w:chunyuan/enable_cpu_device

chunyuan-w commented Jan 9, 2025 •

edited

Loading

Uh oh!

Uh oh!

merrymercy commented Jan 17, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

chunyuan-w commented Jan 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Modifications

Example

Bench one batch

Server mode

Uh oh!

Uh oh!

merrymercy commented Jan 17, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

chunyuan-w commented Jan 9, 2025 •

edited

Loading