Optimize all_reduce by porting the shared memory kernel of deepspeed #5
Conversation
mingfeima
left a comment
Generally LGTM! Try to use a TAB of 2 spaces to align with the PyTorch coding style.
Later on we can simplify this piece of code by applying vec.h dtype conversion and AT_DISPATCH_xxx macros.
```cpp
__m512 cvt_bf16_to_fp32(const __m256i src) __attribute__((target("avx512bw")));
```
we can remove these cvt functions later on once I upload vec.h
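For context, these helpers rely on BF16 being the upper 16 bits of an FP32 value; a minimal sketch of what such conversions typically look like (illustrative only, not necessarily the exact bodies in this PR or in the upcoming vec.h):

```cpp
#include <immintrin.h>

// Widen 16 BF16 values (packed in a 256-bit register) to 16 FP32 values.
// BF16 is the high 16 bits of an FP32, so zero-extend each lane to 32 bits
// and shift it into the upper half. (Real code may carry a target attribute
// such as __attribute__((target("avx512bw"))) as in the declaration above.)
inline __m512 cvt_bf16_to_fp32_sketch(const __m256i src) {
  __m512i widened = _mm512_cvtepu16_epi32(src);
  return _mm512_castsi512_ps(_mm512_slli_epi32(widened, 16));
}

// Inverse direction by truncation; production code usually rounds to
// nearest-even before dropping the low 16 bits.
inline __m256i cvt_fp32_to_bf16_sketch(const __m512 src) {
  __m512i bits = _mm512_castps_si512(src);
  return _mm512_cvtepi32_epi16(_mm512_srli_epi32(bits, 16));
}
```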
```cpp
return _mm512_cvtps_ph(src, (_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC));
}

void reduce_bf16_buffers(int start_elements, int num_elements, char* to_buffer, char** buffers)
```
we can use ATen-style AT_DISPATCH_xx macros to simplify the code. You can just leave it as it is and make the change later on.
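For reference, the ATen dispatch mentioned here would collapse the per-dtype `reduce_fp32_buffers` / `reduce_bf16_buffers` pair into one templated kernel; a minimal sketch with illustrative names (not the PR's actual code):

```cpp
#include <ATen/ATen.h>
#include <ATen/Dispatch.h>

// Hypothetical templated reducer: sums the same element range from several
// rank-local buffers into `to`. A real kernel would use the vectorized
// helpers from vec.h instead of this scalar loop.
template <typename scalar_t>
void reduce_buffers_impl(int64_t start, int64_t num,
                         scalar_t* to, scalar_t** buffers, int num_buffers) {
  for (int64_t i = start; i < start + num; ++i) {
    float acc = 0.0f;
    for (int b = 0; b < num_buffers; ++b) {
      acc += static_cast<float>(buffers[b][i]);
    }
    to[i] = static_cast<scalar_t>(acc);
  }
}

// One dispatch site serves both FP32 and BF16: scalar_t is bound to float or
// at::BFloat16 inside the lambda.
void reduce_buffers(at::ScalarType dtype, int64_t start, int64_t num,
                    void* to, void** buffers, int num_buffers) {
  AT_DISPATCH_FLOATING_TYPES_AND(at::ScalarType::BFloat16, dtype,
                                 "shm_reduce_buffers", [&] {
    reduce_buffers_impl<scalar_t>(start, num,
                                  static_cast<scalar_t*>(to),
                                  reinterpret_cast<scalar_t**>(buffers),
                                  num_buffers);
  });
}
```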
```cpp
static bool is_initialized = 0;

void shm_initialize(int size, int rank, char* addr_string, char* port_string)
{
  if (is_initialized) return;
```
if (is_initialized) { return; }
```cpp
#define positive_mod(num, mod) ((((num) % (mod)) + (mod)) % (mod))
#define rank_mod(rank) positive_mod(rank, world_size)
size_t slice_size(size_t chunk_el, int slice_idx)
```
compute slice_size before calling into this function to remove integer div.
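In other words, the division by `world_size` could be done once per chunk and the per-slice sizes derived from it; a rough sketch of that idea (assuming the last slice absorbs the remainder, which may not match the PR's exact partitioning):

```cpp
#include <cstddef>
#include <vector>

// Precompute every slice size for a chunk so the hot path performs a single
// integer division instead of one inside each slice_size() call.
// (Illustrative only; the actual partitioning in shm.cpp may differ.)
std::vector<size_t> precompute_slice_sizes(size_t chunk_el, int world_size) {
  size_t base = chunk_el / world_size;  // one div/mod per chunk
  size_t tail = chunk_el % world_size;  // remainder handled by the last slice
  std::vector<size_t> sizes(world_size, base);
  sizes[world_size - 1] += tail;
  return sizes;
}
```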
```cpp
if (i != world_rank) wait_buffer_state_until_2(i, reduce_current, copy_next, state_group);
}

auto t4 = std::chrono::system_clock::now();
```
Re-formatted the C++ files in 1f1218d following PyTorch coding styles.
Motivation
Optimize `all_reduce` by porting the shared memory implementation in DeepSpeed.

Modifications
We added a `shm_allreduce` operator in `sgl-kernel`. The implementation is ported from DeepSpeed:
- `sgl-kernel/src/sgl-kernel/csrc/cpu/shm.h` ---> DeepSpeed: shm.h
- `sgl-kernel/src/sgl-kernel/csrc/cpu/shm.cpp` ---> DeepSpeed: shm.cpp
- `sgl-kernel/src/sgl-kernel/csrc/cpu/interface.cpp` ---> DeepSpeed: ccl.cpp
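As a rough idea of how the binding layer hooks the kernel into Python (a sketch only; the actual registration, signatures, and op names in `interface.cpp` may differ), the shared-memory all_reduce is exposed as a torch extension op that the SGLang wrapper can call on CPU tensors:

```cpp
#include <torch/extension.h>

// Hypothetical binding sketch. The real interface.cpp (ported from
// DeepSpeed's ccl.cpp) sets up the per-rank shared-memory segments and
// reduces across all local ranks.
void shm_allreduce_sketch(at::Tensor& data) {
  TORCH_CHECK(data.device().is_cpu(), "shm all_reduce only handles CPU tensors");
  // ... copy into the rank's shared-memory buffer, reduce, copy back ...
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("shm_allreduce", &shm_allreduce_sketch,
        "Shared-memory all_reduce across local ranks (CPU) - sketch");
}
```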
To build the kernel:

```bash
cd sgl-kernel
python setup.py develop
```
We added a wrapper, `tensor_model_parallel_all_reduce_wrapper`, in SGLang that calls `shm_allreduce` on CPU and the original `tensor_model_parallel_all_reduce` API in vLLM on other devices, so that we don't need to change the `tensor_model_parallel_all_reduce` function in vLLM.
In `sgl-kernel/src/sgl-kernel/ops/__init__.py`, we only import CUDA kernels if CUDA is available.

Benchmarks
Accuracy
The score on MMLU (higher is better) for tp=2:

- score without this PR (using `torch.distributed.all_reduce`): 0.594
- score with this PR: 0.594
Note: `--disable-overlap-schedule` is needed in the args for CPU; otherwise, a new thread will be created for the forward batch here and the OMP thread binding will not work on this newly created thread.

Command line:
```bash
# Client side
python3 -m sglang.test.run_eval --eval-name mmlu --num-examples 64 --port 30000
```

Performance
We can observe 25% and 37% speedup on first and next token latency, respectively, after switching from `torch.distributed.all_reduce` to `shm_allreduce` for the below tp=2 command line on GNR.

```bash
SGLANG_CPU_OMP_THREADS_BIND="0-39|40-79" python3 -m sglang.bench_one_batch --batch-size 1 --input 1024 --output 8 --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --trust-remote-code --device cpu --attention-backend torch_native --disable-mla --tp 2
```

Limitations
The shm `all_reduce` only supports FP32 and BF16 and the cases where all the ranks are local. We fall back to `torch.distributed.all_reduce` for unsupported cases.
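For illustration, the kind of guard that decides between the shm path and the fallback might look like this (hypothetical helper, not the PR's actual code):

```cpp
#include <ATen/ATen.h>

// Hypothetical predicate: the shm path is only usable for FP32/BF16 CPU
// tensors when every rank in the group is on the same host; otherwise the
// caller falls back to torch.distributed.all_reduce.
bool can_use_shm_allreduce(const at::Tensor& t, bool all_ranks_local) {
  return all_ranks_local && t.device().is_cpu() &&
         (t.scalar_type() == at::kFloat || t.scalar_type() == at::kBFloat16);
}
```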