
[Fix] Add EPLB rebalance support for Kimi K2.5#21004

Merged
Kangyan-Zhou merged 1 commit into sgl-project:main from yafengio:fix/k2.5-eplb-rebalance
Mar 26, 2026

Conversation

@yafengio
Contributor

Motivation

Add the routed_experts_weights_of_layer property to KimiK25ForConditionalGeneration to enable EPLB (Expert Parallel Load Balancing) rebalance support for Kimi K2.5 models.

EPLB Rebalance error:

[2026-03-19 13:32:54] INFO:     10.121.36.12:39116 - "POST /generate HTTP/1.1" 200 OK
[2026-03-19 13:32:54 DP3 PP0 TP3 EP3] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2717, in run_scheduler_process
    scheduler.event_loop_pp_disagg_prefill()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler_pp_mixin.py", line 253, in event_loop_pp_disagg_prefill
    result = self.run_batch(self.cur_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2055, in run_batch
    batch_result = self.model_worker.forward_batch_generation(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 452, in forward_batch_generation
    pp_proxy_tensors, can_run_cuda_graph = self.model_runner.forward(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2679, in forward
    self.eplb_manager.on_forward_pass_end()
  File "/sgl-workspace/sglang/python/sglang/srt/eplb/eplb_manager.py", line 62, in on_forward_pass_end
    next(self._main_generator)
  File "/sgl-workspace/sglang/python/sglang/srt/eplb/eplb_manager.py", line 172, in _entrypoint
    yield from self.rebalance()
  File "/sgl-workspace/sglang/python/sglang/srt/eplb/eplb_manager.py", line 220, in rebalance
    yield from self.transfer_parameter()
  File "/sgl-workspace/sglang/python/sglang/srt/eplb/eplb_manager.py", line 347, in transfer_parameter
    update_layer_ids_chunks = self._compute_update_layer_ids_chunks()
  File "/sgl-workspace/sglang/python/sglang/srt/eplb/eplb_manager.py", line 374, in _compute_update_layer_ids_chunks
    list(self._model_runner.model.routed_experts_weights_of_layer.keys())
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1940, in __getattr__
    raise AttributeError(
AttributeError: 'KimiK25ForConditionalGeneration' object has no attribute 'routed_experts_weights_of_layer'
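
For context, the EPLB manager expects the top-level model to expose routed_experts_weights_of_layer so it can split the MoE layers into update chunks. A minimal sketch of the failing access, following the traceback above (the exact return shape is an assumption; per the review below, the property returns a dictionary):

    # Sketch of the access in _compute_update_layer_ids_chunks (names taken from
    # the traceback above). Assumed shape: {layer_id: routed-expert weight tensors}.
    weights_of_layer = self._model_runner.model.routed_experts_weights_of_layer
    update_layer_ids = list(weights_of_layer.keys())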

Modifications

Added the routed_experts_weights_of_layer property to the KimiK25ForConditionalGeneration class.
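
A minimal sketch of the added property, assuming it delegates to the wrapped language model the same way the class's neighboring start_layer and end_layer properties do (the review comment below confirms the delegation):

    class KimiK25ForConditionalGeneration(nn.Module):
        ...

        @property
        def routed_experts_weights_of_layer(self):
            # Delegate to the underlying language model, which owns the MoE layers.
            return self.language_model.routed_experts_weights_of_layer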

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

Signed-off-by: yafeng.li <yafeng.li@mthreads.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical AttributeError that prevented Expert Parallel Load Balancing (EPLB) rebalancing from functioning correctly with Kimi K2.5 models. By introducing a specific property to the model class, it ensures that the EPLB manager can properly access and utilize expert weight information, thereby restoring and enabling this essential load balancing mechanism.

Highlights

  • EPLB Rebalance Support: Added the routed_experts_weights_of_layer property to the KimiK25ForConditionalGeneration class to enable Expert Parallel Load Balancing (EPLB) rebalance functionality for Kimi K2.5 models, resolving an AttributeError.


@gemini-code-assist (Bot) left a comment

Code Review

The pull request adds the routed_experts_weights_of_layer property to KimiK25ForConditionalGeneration to fix an AttributeError related to EPLB rebalance support. The change is straightforward and correctly delegates the property access to the underlying language model. I have one suggestion to improve code consistency by adding a return type hint to the new property.

Referenced code:

        return self.language_model.end_layer

    @property
    def routed_experts_weights_of_layer(self):

Severity: medium

For consistency with other properties in this class (like start_layer and end_layer) and for improved code clarity, please add a return type hint. Since this property appears to return a dictionary, -> dict would be an appropriate type hint.

Suggested change:

    -    def routed_experts_weights_of_layer(self):
    +    def routed_experts_weights_of_layer(self) -> dict:
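
As a follow-up, a more specific annotation is also possible; a hedged variant (the Dict[int, ...] value shape is an assumption, since the source only confirms that the property returns a dictionary):

    from typing import Dict, List

    import torch

    @property
    def routed_experts_weights_of_layer(self) -> Dict[int, List[torch.Tensor]]:
        # Assumed mapping: layer id -> that layer's routed-expert weight tensors.
        return self.language_model.routed_experts_weights_of_layer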

@yafengio
Contributor Author

Could you please take a look at this PR when you have a moment? Thanks! @yeahdongcn

@yeahdongcn
Collaborator

/tag-run-ci-label

@yeahdongcn
Collaborator

/tag-and-rerun-ci

@Kangyan-Zhou merged commit 01ccdb9 into sgl-project:main on Mar 26, 2026
116 of 132 checks passed
satyamk7054 pushed a commit to satyamk7054/sglang that referenced this pull request Apr 3, 2026
Signed-off-by: yafeng.li <yafeng.li@mthreads.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
Signed-off-by: yafeng.li <yafeng.li@mthreads.com>
@Xiaoctw

Xiaoctw commented Apr 9, 2026

python -m sglang.launch_server \
  --model-path ... \
  --served-model-name Kimi-K2.5 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 3001 \
  --tp-size 8 \
  --ep-size 8 \
  --enable-eplb \
  --mem-fraction-static 0.8 \
  --cuda-graph-max-bs 64 \
  --max-running-requests 64 \
  --enable-mixed-chunk \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --watchdog-timeout 20000

Why does the server crash when it receives a request? (H20 × 8)

@yafengio
Contributor Author

yafengio commented Apr 9, 2026

> Why does the server crash when it receives a request? (H20 × 8)

Could you share the error logs or stack trace when the server crashes? Thanks.

@yafengio
Contributor Author

yafengio commented Apr 9, 2026

> python -m sglang.launch_server --model-path ... --served-model-name Kimi-K2.5 --trust-remote-code --host 0.0.0.0 --port 3001 --tp-size 8 --ep-size 8 --enable-eplb --mem-fraction-static 0.8 --cuda-graph-max-bs 64 --max-running-requests 64 --enable-mixed-chunk --tool-call-parser kimi_k2 --reasoning-parser kimi_k2 --watchdog-timeout 20000
>
> Why does the server crash when it receives a request? (H20 × 8)

I tested it locally and everything works fine.


image: lmsysorg/sglang:v0.5.10.post1

nvidia-smi:

Thu Apr  9 06:41:00 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H20-3e                  On  |   00000000:18:00.0 Off |                    0 |
| N/A   38C    P0             96W /  500W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H20-3e                  On  |   00000000:38:00.0 Off |                    0 |
| N/A   34C    P0            114W /  500W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H20-3e                  On  |   00000000:49:00.0 Off |                    0 |
| N/A   37C    P0             82W /  500W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H20-3e                  On  |   00000000:59:00.0 Off |                    0 |
| N/A   31C    P0             75W /  500W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H20-3e                  On  |   00000000:9B:00.0 Off |                    0 |
| N/A   31C    P0             76W /  500W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H20-3e                  On  |   00000000:BB:00.0 Off |                    0 |
| N/A   38C    P0            126W /  500W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H20-3e                  On  |   00000000:CA:00.0 Off |                    0 |
| N/A   31C    P0             75W /  500W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H20-3e                  On  |   00000000:DA:00.0 Off |                    0 |
| N/A   37C    P0            103W /  500W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+


@Xiaoctw

Xiaoctw commented Apr 10, 2026

> python -m sglang.launch_server --model-path ... --served-model-name Kimi-K2.5 --trust-remote-code --host 0.0.0.0 --port 3001 --tp-size 8 --ep-size 8 --enable-eplb --mem-fraction-static 0.8 --cuda-graph-max-bs 64 --max-running-requests 64 --enable-mixed-chunk --tool-call-parser kimi_k2 --reasoning-parser kimi_k2 --watchdog-timeout 20000
>
> Why does the server crash when it receives a request? (H20 × 8)

[2026-04-10 15:39:20 TP0 EP0] Decode batch, #running-req: 1, #token: 1938, token usage: 0.00, cuda graph: True, gen throughput (token/s): 94.64, #queue-req: 0
[2026-04-10 15:39:20 TP0 EP0] [EPLBManager] rebalance start
[2026-04-10 15:39:20 TP4 EP4] [EPLBManager] rebalance start
[2026-04-10 15:39:20 TP1 EP1] [EPLBManager] rebalance start
[2026-04-10 15:39:20 TP3 EP3] [EPLBManager] rebalance start
[2026-04-10 15:39:20 TP7 EP7] [EPLBManager] rebalance start
[2026-04-10 15:39:20 TP5 EP5] [EPLBManager] rebalance start
[2026-04-10 15:39:20 TP6 EP6] [EPLBManager] rebalance start
[2026-04-10 15:39:20 TP2 EP2] [EPLBManager] rebalance start
[2026-04-10 15:39:21 TP0 EP0] Resetting ExpertDistributionRecorder...
[2026-04-10 15:39:21 TP7 EP7] Resetting ExpertDistributionRecorder...
[2026-04-10 15:39:21 TP3 EP3] Resetting ExpertDistributionRecorder...
[2026-04-10 15:39:21 TP1 EP1] Resetting ExpertDistributionRecorder...
[2026-04-10 15:39:21 TP6 EP6] Resetting ExpertDistributionRecorder...
[2026-04-10 15:39:21 TP4 EP4] Resetting ExpertDistributionRecorder...
[2026-04-10 15:39:21 TP5 EP5] Resetting ExpertDistributionRecorder...
[2026-04-10 15:39:21 TP2 EP2] Resetting ExpertDistributionRecorder...
[2026-04-10 15:39:42] Health check failed. Server couldn't get a response from detokenizer for last 20 seconds. tic start time: 15:39:21. last_heartbeat time: 15:39:20
[2026-04-10 15:39:52] Health check failed. Server couldn't get a response from detokenizer for last 20 seconds. tic start time: 15:39:32. last_heartbeat time: 15:39:20
[2026-04-10 15:40:03] Health check failed. Server couldn't get a response from detokenizer for last 20 seconds. tic start time: 15:39:43. last_heartbeat time: 15:39:20
[2026-04-10 15:40:14] Health check failed. Server couldn't get a response from detokenizer for last 20 seconds. tic start time: 15:39:54. last_heartbeat time: 15:39:20
[2026-04-10 15:40:25] Health check failed. Server couldn't get a response from detokenizer for last 20 seconds. tic start time: 15:40:05. last_heartbeat time: 15:39:20
[2026-04-10 15:40:36] Health check failed. Server couldn't get a response from detokenizer for last 20 seconds. tic start time: 15:40:16. last_heartbeat time: 15:39:20

Here are the error logs with 0.5.10.post1. Thanks @yafengio.

yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
Signed-off-by: yafeng.li <yafeng.li@mthreads.com>
