
Add endpoints to dump selected expert ids#4435

Merged
zhyncs merged 19 commits into sgl-project:main from yuhsuan-t:yuhsuan-t/expert_id_dump
Mar 25, 2025

Conversation

@yuhsuan-t
Contributor

@yuhsuan-t yuhsuan-t commented Mar 14, 2025

Motivation

When optimizing the performance of MoE models, understanding the expert id distribution helps us identify performance bottlenecks and plan fixes. Such information can be captured in python/sglang/srt/layers/moe/topk.py.

Modifications

  • Created a singleton class in python/sglang/srt/managers/utils.py to record the layer id, expert id, and the topk id in a data structure.
  • Called the singleton class in python/sglang/srt/models/deepseek_v2.py to record the layer id into the data structure. (The layer id recording is optional and can be removed.)
  • Called the singleton class in python/sglang/srt/layers/moe/topk.py to record the expert id and the topk id into the data structure.
  • Added two endpoints in python/sglang/srt/entrypoints/http_server.py to dump the information captured. All the other changes under python/sglang/srt/managers are related to the two endpoints added.
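The recorder described in the list above could be sketched roughly as follows. This is an illustrative singleton, not the PR's actual implementation; the class and method names (ExpertDistributionRecorder, record, dump) are assumptions made for the example:

```python
from collections import defaultdict


class ExpertDistributionRecorder:
    """Hypothetical singleton that counts how often each expert id is selected."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._counts = defaultdict(int)  # expert_id -> selection count
            cls._instance._layer_id = None
        return cls._instance

    def set_layer_id(self, layer_id):
        # Optional layer id bookkeeping, as mentioned in the modifications list.
        self._layer_id = layer_id

    def record(self, topk_ids):
        # topk_ids: iterable of expert ids chosen for the current forward pass.
        for expert_id in topk_ids:
            self._counts[expert_id] += 1

    def dump(self):
        # Mirrors the "expert_id,count" CSV format shown later in the thread.
        lines = ["expert_id,count"]
        lines += [f"{e},{c}" for e, c in sorted(self._counts.items())]
        return "\n".join(lines)


recorder = ExpertDistributionRecorder()
recorder.record([4, 33, 4, 32])
print(recorder.dump())
```

Because `__new__` always returns the same instance, every call site (topk.py, deepseek_v2.py, the HTTP endpoints) would see the same accumulated counts.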

Checklist

@yuhsuan-t
Contributor Author

yuhsuan-t commented Mar 14, 2025

Note that this expert id recording involves copying tensors from GPU to CPU during model execution, so it will affect throughput and latency significantly. I will add an argument to turn off the recording by default.

@ch-wan
Collaborator

ch-wan commented Mar 19, 2025

Hi @yuhsuan-t, thank you for your contribution. Could you please provide more details about the output of dump_record? The output format is not clear to me. Also, can we use non-blocking copy (reference) to address the blocking issue?

@yuhsuan-t
Contributor Author

> Hi @yuhsuan-t, thank you for your contribution. Could you please provide more details about the output of dump_record? The output format is not clear to me. Also, can we use non-blocking copy (reference) to address the blocking issue?

Hello @ch-wan, sure, here is an attached output JSON file from the dump. Right now there will be one JSON file per GPU rank, but the content is the same. I can update the code to dump only on rank 0. Yes, I will update it to use the non-blocking copy.
expert_distribution_1738710857.859208.json

@yuhsuan-t
Contributor Author

Hello @ch-wan, I have updated the PR based on the reviews, and added unit tests and docs.
Now the dumped file looks like this:

expert_id,count
33,12
4,16
32,8
24,12
40,13
37,11
11,17

Can you review the PR again? Thanks!
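A dump in the "expert_id,count" format shown above can be loaded back into a Python dict with the standard csv module. This is a usage sketch; the dump text is inlined here for illustration, whereas in practice it would be read from the dumped file:

```python
import csv
import io

# Illustrative: in practice this text would come from the dumped CSV file.
dump_text = """expert_id,count
33,12
4,16
32,8
"""

counts = {}
for row in csv.DictReader(io.StringIO(dump_text)):
    counts[int(row["expert_id"])] = int(row["count"])

print(counts)  # {33: 12, 4: 16, 32: 8}
```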

@ch-wan ch-wan self-assigned this Mar 22, 2025
@ch-wan
Collaborator

ch-wan commented Mar 22, 2025

@yuhsuan-t Thank you very much for your great effort! This PR would be very useful for analyzing experts' dynamic workloads during MoE serving. I have added some comments. Could you please take a look?

Collaborator

@ch-wan ch-wan left a comment


Another question comes to my mind when I double-check this PR. When DP is enabled, the recorders on different workers will collect different expert distributions. We may need an all-reduce to synchronize their results, and only the master worker should dump the results.

@ch-wan ch-wan mentioned this pull request Mar 24, 2025
@yuhsuan-t
Contributor Author

> Another question comes to my mind when I double-check this PR. When DP is enabled, the recorders on different workers will collect different expert distributions. We may need an all-reduce to synchronize their results, and only the master worker should dump the results.

The current implementation has the server dump one CSV file per rank. I think we can keep it this way so that the server does not have to synchronize on the fly, avoiding the performance cost. We can post-process the dumped CSV files later and aggregate them into one file. Does that sound good to you?
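The offline aggregation step proposed here could be sketched as summing the per-rank "expert_id,count" dumps with a Counter. The per-rank inputs are represented as in-memory strings for illustration; real post-processing code would read one CSV file per rank:

```python
import csv
import io
from collections import Counter


def aggregate_dumps(csv_texts):
    """Sum expert_id counts across per-rank CSV dumps."""
    total = Counter()
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            total[int(row["expert_id"])] += int(row["count"])
    return total


# Illustrative per-rank dumps; real data would come from one file per rank.
rank0 = "expert_id,count\n33,12\n4,16\n"
rank1 = "expert_id,count\n33,3\n40,13\n"
merged = aggregate_dumps([rank0, rank1])
print(merged[33])  # 15
```

Summing offline keeps the serving path free of cross-rank synchronization, which matches the trade-off described above.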

@ch-wan
Collaborator

ch-wan commented Mar 24, 2025

Thank you for your great effort! I have approved the change.

@zhyncs zhyncs merged commit 199bb01 into sgl-project:main Mar 25, 2025
1 check failed
@yuhsuan-t yuhsuan-t deleted the yuhsuan-t/expert_id_dump branch April 1, 2025 21:01