
Enable CPU device on SGLang#2806

Merged
merrymercy merged 6 commits into sgl-project:main from chunyuan-w:chunyuan/enable_cpu_device
Jan 17, 2025
Conversation

@chunyuan-w
Contributor

@chunyuan-w chunyuan-w commented Jan 9, 2025

Motivation

This PR enables the CPU device on SGLang.
Currently we fall back attention and MoE to the torch-native backend to make the functionality work on CPU.
We will submit follow-up PRs with optimized kernels to further improve the performance.

To install vLLM for CPU, users can follow the instructions provided by vLLM here.

Modifications

The main modifications include:

  • Add a native implementation for MoE (moe_forward_native) following the original implementation in the model (moe_infer in DeepSeek). It performs better on CPU than the existing fused_moe_forward_native.
  • On CPU, we no longer call the code that sets the number of threads to 1 (link to the current code in SGLang); otherwise only one CPU core is used when running the workload. This change improves performance on CPU.
  • For the rotary embedding part, the DeepseekScalingRotaryEmbedding class defined in vLLM hard-codes the device to "cuda" in two places: _compute_inv_freq and _compute_cos_sin_cache. We temporarily port the related code into SGLang to make it compatible with CPU. Once an optimized rotary embedding kernel for CPU is added, the ported code will be removed.
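The per-expert dispatch described in the first bullet can be sketched as follows. This is a simplified illustration of the moe_infer pattern, not the actual SGLang code: the function name, tensor shapes, and the ungated SiLU MLP are assumptions made for the sketch.

```python
import torch

def moe_forward_native_sketch(x, gate_logits, w1, w2, top_k):
    """Hypothetical native MoE forward.

    x:           (num_tokens, hidden)
    gate_logits: (num_tokens, num_experts)
    w1:          (num_experts, intermediate, hidden)
    w2:          (num_experts, hidden, intermediate)
    """
    probs = torch.softmax(gate_logits, dim=-1)
    topk_weights, topk_ids = torch.topk(probs, top_k, dim=-1)
    # Renormalize the selected routing weights so they sum to 1 per token.
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)

    out = torch.zeros_like(x)
    # Loop over experts: gather the tokens routed to each expert, run the
    # expert MLP once on that batch, and scatter the weighted results back.
    for e in range(w1.shape[0]):
        token_idx, slot = torch.where(topk_ids == e)
        if token_idx.numel() == 0:
            continue
        h = torch.nn.functional.silu(x[token_idx] @ w1[e].t()) @ w2[e].t()
        out.index_add_(0, token_idx,
                       h * topk_weights[token_idx, slot].unsqueeze(-1))
    return out
```

Grouping tokens per expert like this keeps each expert's GEMM dense, which is why a loop of this shape can outperform a fused kernel path that was tuned for GPU when running on CPU.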

Example

Below are some example command lines for running on CPU with this PR. Only --disable-mla is supported for now.
Suppose we want to use 40 CPU cores on NUMA node 0:

Bench one batch

numactl --physcpubind=0-39 --membind=0 python3 -m sglang.bench_one_batch --batch-size 1 --input 1024 --output 8 --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --trust-remote-code --device cpu --attention-backend torch_native --disable-mla

Server mode

Command line on server side:

numactl --physcpubind=0-39 --membind=0 python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --disable-radix --trust-remote-code --device cpu --attention-backend torch_native --disable-mla --log-requests

Command line on client side:

# Benchmark serving
python3 -m sglang.bench_serving --backend sglang --num-prompts 8

# Collect the score on MMLU
python3 -m sglang.test.run_eval --eval-name mmlu --num-examples 64 --port 30000

@chunyuan-w chunyuan-w force-pushed the chunyuan/enable_cpu_device branch 2 times, most recently from ff9b4e1 to dafbe3e on January 14, 2025 07:57
@chunyuan-w chunyuan-w marked this pull request as ready for review January 14, 2025 08:07
@chunyuan-w chunyuan-w force-pushed the chunyuan/enable_cpu_device branch 3 times, most recently from 5f8ca68 to ab3b275 on January 16, 2025 05:25
@chunyuan-w chunyuan-w requested a review from merrymercy January 17, 2025 02:15
@merrymercy merrymercy merged commit 6305173 into sgl-project:main Jan 17, 2025
@merrymercy
Contributor

@chunyuan-w merged. Thanks!
