Enable CPU device on SGLang #2806
Merged
merrymercy merged 6 commits into sgl-project:main, Jan 17, 2025
merrymercy requested changes on Jan 16, 2025

Contributor: @chunyuan-w merged. Thanks!
Motivation
This PR enables the CPU device on SGLang.
Currently we fall back attention and MoE to the torch native backend to make the functionality work on CPU.
We will submit follow-up PRs that provide optimized kernels to further improve performance.
To install vLLM for CPU, users can follow the instructions provided by vLLM here.
Modifications
The main modifications include:
- Added a native MoE forward path (moe_forward_native) following the original implementation in the model (moe_infer in DeepSeek). This performs better than the existing fused_moe_forward_native on CPU.
- In the DeepseekScalingRotaryEmbedding class defined in vLLM, the device has been hard-coded to "cuda" in two places: _compute_inv_freq and _compute_cos_sin_cache. We temporarily port the related code into SGLang to make it compatible with the CPU version. We will add an optimized rotary embedding kernel for CPU and then remove the ported code.
Example
Below are some example command lines to use on CPU with this PR. We only support --disable-mla for now. Supposing we want to use 40 CPU cores on NUMA node 0:
Bench one batch
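The original command line was lost in this capture; below is a sketch of a one-batch benchmark invocation. It assumes SGLang's `python -m sglang.bench_one_batch` entry point; the model path and the batch/length values are illustrative, not copied from the PR.

```shell
# Pin the process to the 40 cores of NUMA node 0 and keep memory local to
# that node. --device cpu selects the CPU path added by this PR, and
# --disable-mla is required for now, as noted above.
# (Model path and batch/length values are illustrative.)
numactl -C 0-39 -m 0 python -m sglang.bench_one_batch \
    --model-path deepseek-ai/DeepSeek-V2-Lite \
    --device cpu \
    --disable-mla \
    --batch-size 1 \
    --input-len 128 \
    --output-len 32
```

Pinning with `numactl -C 0-39 -m 0` avoids cross-NUMA memory traffic, which typically dominates CPU inference latency on multi-socket machines.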
Server mode
Command line on server side:
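The server command was also lost in this capture; a sketch under the same assumptions (SGLang's `python -m sglang.launch_server` entry point; model path, host, and port are illustrative):

```shell
# Launch the SGLang server on CPU, pinned to NUMA node 0 (cores 0-39).
# --device cpu and --disable-mla as required by this PR.
numactl -C 0-39 -m 0 python -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V2-Lite \
    --device cpu \
    --disable-mla \
    --host 0.0.0.0 \
    --port 30000
```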
Command line on client side:
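The client command is likewise missing; a sketch that queries the server's `/generate` endpoint (the port must match the one the server was launched with; prompt and sampling parameters are illustrative):

```shell
# Send a generation request to the running SGLang server.
curl http://localhost:30000/generate \
    -H "Content-Type: application/json" \
    -d '{"text": "The capital of France is", "sampling_params": {"max_new_tokens": 32}}'
```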