
Add endpoints to dump selected expert ids#4435

Merged
zhyncs merged 19 commits into sgl-project:main from yuhsuan-t:yuhsuan-t/expert_id_dump
Mar 25, 2025

Conversation

@yuhsuan-t
Contributor

@yuhsuan-t yuhsuan-t commented Mar 14, 2025

Motivation

When optimizing the performance of MoE models, understanding the expert id distribution helps us identify performance bottlenecks and plan fixes. Such information can be captured in python/sglang/srt/layers/moe/topk.py.

Modifications

  • Created a singleton class in python/sglang/srt/managers/utils.py to record the layer id, expert id, and the topk id in a data structure.
  • Called the singleton class in python/sglang/srt/models/deepseek_v2.py to record the layer id into the data structure. (The layer id recording is optional and can be removed.)
  • Called the singleton class in python/sglang/srt/layers/moe/topk.py to record the expert id and the topk id into the data structure.
  • Added two endpoints in python/sglang/srt/entrypoints/http_server.py to dump the information captured. All the other changes under python/sglang/srt/managers are related to the two endpoints added.
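The recorder described in the list above could be sketched roughly as follows. This is an illustrative singleton, not the PR's actual implementation; the class and method names (ExpertDistributionRecorder, record, dump) are assumptions made for the example:

```python
from collections import defaultdict


class ExpertDistributionRecorder:
    """Hypothetical singleton that counts how often each expert id is selected."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._counts = defaultdict(int)  # expert_id -> selection count
            cls._instance._layer_id = None
        return cls._instance

    def set_layer_id(self, layer_id):
        # Optional layer id bookkeeping, as mentioned in the modifications list.
        self._layer_id = layer_id

    def record(self, topk_ids):
        # topk_ids: iterable of expert ids chosen for the current forward pass.
        for expert_id in topk_ids:
            self._counts[expert_id] += 1

    def dump(self):
        # Mirrors the "expert_id,count" CSV format shown later in the thread.
        lines = ["expert_id,count"]
        lines += [f"{e},{c}" for e, c in sorted(self._counts.items())]
        return "\n".join(lines)


recorder = ExpertDistributionRecorder()
recorder.record([4, 33, 4, 32])
print(recorder.dump())
```

Because `__new__` always returns the same instance, every call site (topk.py, deepseek_v2.py, the HTTP endpoints) would see the same accumulated counts.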

Checklist

@yuhsuan-t
Contributor Author

yuhsuan-t commented Mar 14, 2025

Note that this expert id recording involves copying tensors from GPU to CPU during model execution, so it will affect throughput and latency significantly. I will add an argument to turn off the recording by default.

@ch-wan
Collaborator

ch-wan commented Mar 19, 2025

Hi @yuhsuan-t, thank you for your contribution. Could you please provide more details about the output of dump_record? The output format is not clear to me. Also, can we use non-blocking copy (reference) to address the blocking issue?

@yuhsuan-t
Contributor Author

> Hi @yuhsuan-t, thank you for your contribution. Could you please provide more details about the output of dump_record? The output format is not clear to me. Also, can we use non-blocking copy (reference) to address the blocking issue?

Hello @ch-wan, sure, here is an attached output JSON file from the dump. Right now there will be one JSON file per GPU rank, but the content is the same. I can update the code to dump only on rank 0. Yes, I will update it to use the non-blocking copy.
expert_distribution_1738710857.859208.json

@yuhsuan-t
Contributor Author

Hello @ch-wan, I have updated the PR based on the reviews, and added unit tests and docs.
Now the dumped file looks like this:

expert_id,count
33,12
4,16
32,8
24,12
40,13
37,11
11,17

Can you review the PR again? Thanks!
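A dump in the "expert_id,count" format shown above can be loaded back into a Python dict with the standard csv module. This is a usage sketch; the dump text is inlined here for illustration, whereas in practice it would be read from the dumped file:

```python
import csv
import io

# Illustrative: in practice this text would come from the dumped CSV file.
dump_text = """expert_id,count
33,12
4,16
32,8
"""

counts = {}
for row in csv.DictReader(io.StringIO(dump_text)):
    counts[int(row["expert_id"])] = int(row["count"])

print(counts)  # {33: 12, 4: 16, 32: 8}
```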

@ch-wan ch-wan self-assigned this Mar 22, 2025
@ch-wan
Collaborator

ch-wan commented Mar 22, 2025

@yuhsuan-t Thank you very much for your great effort! This PR would be very useful for analyzing experts' dynamic workloads during MoE serving. I have added some comments. Could you please take a look?

Collaborator

@ch-wan ch-wan left a comment


Another question comes to my mind when I double-check this PR. When DP is enabled, the recorders on different workers will collect different expert distributions. We may need an all-reduce to synchronize their results, and only the master worker should dump the results.

@ch-wan ch-wan mentioned this pull request Mar 24, 2025
@yuhsuan-t
Contributor Author

> Another question comes to my mind when I double-check this PR. When DP is enabled, the recorders on different workers will collect different expert distributions. We may need an all-reduce to synchronize their results, and only the master worker should dump the results.

The current implementation has the server dump one CSV file per rank. I think we can keep it this way so that the server does not have to synchronize on the fly, avoiding the performance cost. We can post-process the dumped CSV files later and aggregate them into one file. Does that sound good to you?
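The offline aggregation step proposed here could be sketched as summing the per-rank "expert_id,count" dumps with a Counter. The per-rank inputs are represented as in-memory strings for illustration; real post-processing code would read one CSV file per rank:

```python
import csv
import io
from collections import Counter


def aggregate_dumps(csv_texts):
    """Sum expert_id counts across per-rank CSV dumps."""
    total = Counter()
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            total[int(row["expert_id"])] += int(row["count"])
    return total


# Illustrative per-rank dumps; real data would come from one file per rank.
rank0 = "expert_id,count\n33,12\n4,16\n"
rank1 = "expert_id,count\n33,3\n40,13\n"
merged = aggregate_dumps([rank0, rank1])
print(merged[33])  # 15
```

Summing offline keeps the serving path free of cross-rank synchronization, which matches the trade-off described above.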

@ch-wan
Collaborator

ch-wan commented Mar 24, 2025

Thank you for your great effort! I have approved the change.

@zhyncs zhyncs merged commit 199bb01 into sgl-project:main Mar 25, 2025
1 check failed
@yuhsuan-t yuhsuan-t deleted the yuhsuan-t/expert_id_dump branch April 1, 2025 21:01