
[Fix] Add EPLB rebalance support for Kimi K2.5#21004

Merged
Kangyan-Zhou merged 1 commit into sgl-project:main from yafengio:fix/k2.5-eplb-rebalance
Mar 26, 2026

Conversation

@yafengio
Contributor

Motivation

Add the routed_experts_weights_of_layer property to KimiK25ForConditionalGeneration to enable EPLB (Expert Parallel Load Balancing) rebalance support for Kimi K2.5 models.

EPLB Rebalance error:

[2026-03-19 13:32:54] INFO:     10.121.36.12:39116 - "POST /generate HTTP/1.1" 200 OK
[2026-03-19 13:32:54 DP3 PP0 TP3 EP3] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2717, in run_scheduler_process
    scheduler.event_loop_pp_disagg_prefill()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler_pp_mixin.py", line 253, in event_loop_pp_disagg_prefill
    result = self.run_batch(self.cur_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2055, in run_batch
    batch_result = self.model_worker.forward_batch_generation(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 452, in forward_batch_generation
    pp_proxy_tensors, can_run_cuda_graph = self.model_runner.forward(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2679, in forward
    self.eplb_manager.on_forward_pass_end()
  File "/sgl-workspace/sglang/python/sglang/srt/eplb/eplb_manager.py", line 62, in on_forward_pass_end
    next(self._main_generator)
  File "/sgl-workspace/sglang/python/sglang/srt/eplb/eplb_manager.py", line 172, in _entrypoint
    yield from self.rebalance()
  File "/sgl-workspace/sglang/python/sglang/srt/eplb/eplb_manager.py", line 220, in rebalance
    yield from self.transfer_parameter()
  File "/sgl-workspace/sglang/python/sglang/srt/eplb/eplb_manager.py", line 347, in transfer_parameter
    update_layer_ids_chunks = self._compute_update_layer_ids_chunks()
  File "/sgl-workspace/sglang/python/sglang/srt/eplb/eplb_manager.py", line 374, in _compute_update_layer_ids_chunks
    list(self._model_runner.model.routed_experts_weights_of_layer.keys())
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1940, in __getattr__
    raise AttributeError(
AttributeError: 'KimiK25ForConditionalGeneration' object has no attribute 'routed_experts_weights_of_layer'
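
For context, the EPLB manager expects the top-level model to expose routed_experts_weights_of_layer so it can split the MoE layers into update chunks. A minimal sketch of the failing access, following the traceback above (the exact return shape is an assumption; per the review below, the property returns a dictionary):

    # Sketch of the access in _compute_update_layer_ids_chunks (names taken from
    # the traceback above). Assumed shape: {layer_id: routed-expert weight tensors}.
    weights_of_layer = self._model_runner.model.routed_experts_weights_of_layer
    update_layer_ids = list(weights_of_layer.keys())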

Modifications

Added the routed_experts_weights_of_layer property to the KimiK25ForConditionalGeneration class.
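
A minimal sketch of the added property, assuming it delegates to the wrapped language model the same way the class's neighboring start_layer and end_layer properties do (the review comment below confirms the delegation):

    class KimiK25ForConditionalGeneration(nn.Module):
        ...

        @property
        def routed_experts_weights_of_layer(self):
            # Delegate to the underlying language model, which owns the MoE layers.
            return self.language_model.routed_experts_weights_of_layer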

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

Signed-off-by: yafeng.li <yafeng.li@mthreads.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical AttributeError that prevented Expert Parallel Load Balancing (EPLB) rebalancing from functioning correctly with Kimi K2.5 models. By introducing a specific property to the model class, it ensures that the EPLB manager can properly access and utilize expert weight information, thereby restoring and enabling this essential load balancing mechanism.

Highlights

  • EPLB Rebalance Support: Added the routed_experts_weights_of_layer property to the KimiK25ForConditionalGeneration class to enable Expert Parallel Load Balancing (EPLB) rebalance functionality for Kimi K2.5 models, resolving an AttributeError.


@gemini-code-assist (Bot) left a comment

Code Review

The pull request adds the routed_experts_weights_of_layer property to KimiK25ForConditionalGeneration to fix an AttributeError related to EPLB rebalance support. The change is straightforward and correctly delegates the property access to the underlying language model. I have one suggestion to improve code consistency by adding a return type hint to the new property.

Referenced code:

        return self.language_model.end_layer

    @property
    def routed_experts_weights_of_layer(self):

Severity: medium

For consistency with other properties in this class (like start_layer and end_layer) and for improved code clarity, please add a return type hint. Since this property appears to return a dictionary, -> dict would be an appropriate type hint.

Suggested change:

    -    def routed_experts_weights_of_layer(self):
    +    def routed_experts_weights_of_layer(self) -> dict:
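
As a follow-up, a more specific annotation is also possible; a hedged variant (the Dict[int, ...] value shape is an assumption, since the source only confirms that the property returns a dictionary):

    from typing import Dict, List

    import torch

    @property
    def routed_experts_weights_of_layer(self) -> Dict[int, List[torch.Tensor]]:
        # Assumed mapping: layer id -> that layer's routed-expert weight tensors.
        return self.language_model.routed_experts_weights_of_layer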

@yafengio
Contributor Author

Could you please take a look at this PR when you have a moment? Thanks! @yeahdongcn

@yeahdongcn
Collaborator

/tag-run-ci-label

@yeahdongcn
Collaborator

/tag-and-rerun-ci

@Kangyan-Zhou merged commit 01ccdb9 into sgl-project:main on Mar 26, 2026
116 of 132 checks passed
satyamk7054 pushed a commit to satyamk7054/sglang that referenced this pull request Apr 3, 2026
Signed-off-by: yafeng.li <yafeng.li@mthreads.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
Signed-off-by: yafeng.li <yafeng.li@mthreads.com>
@Xiaoctw

Xiaoctw commented Apr 9, 2026

python -m sglang.launch_server \
  --model-path ... \
  --served-model-name Kimi-K2.5 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 3001 \
  --tp-size 8 \
  --ep-size 8 \
  --enable-eplb \
  --mem-fraction-static 0.8 \
  --cuda-graph-max-bs 64 \
  --max-running-requests 64 \
  --enable-mixed-chunk \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --watchdog-timeout 20000

Why does the server crash when it receives a request? (H20 × 8)

@yafengio
Contributor Author

yafengio commented Apr 9, 2026

> Why does the server crash when it receives a request? (H20 × 8)

Could you share the error logs or stack trace when the server crashes? Thanks.

@yafengio
Contributor Author

yafengio commented Apr 9, 2026

> python -m sglang.launch_server --model-path ... --served-model-name Kimi-K2.5 --trust-remote-code --host 0.0.0.0 --port 3001 --tp-size 8 --ep-size 8 --enable-eplb --mem-fraction-static 0.8 --cuda-graph-max-bs 64 --max-running-requests 64 --enable-mixed-chunk --tool-call-parser kimi_k2 --reasoning-parser kimi_k2 --watchdog-timeout 20000
>
> Why does the server crash when it receives a request? (H20 × 8)

I tested it locally and everything works fine.


image: lmsysorg/sglang:v0.5.10.post1

nvidia-smi:

Thu Apr  9 06:41:00 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H20-3e                  On  |   00000000:18:00.0 Off |                    0 |
| N/A   38C    P0             96W /  500W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H20-3e                  On  |   00000000:38:00.0 Off |                    0 |
| N/A   34C    P0            114W /  500W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H20-3e                  On  |   00000000:49:00.0 Off |                    0 |
| N/A   37C    P0             82W /  500W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H20-3e                  On  |   00000000:59:00.0 Off |                    0 |
| N/A   31C    P0             75W /  500W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H20-3e                  On  |   00000000:9B:00.0 Off |                    0 |
| N/A   31C    P0             76W /  500W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H20-3e                  On  |   00000000:BB:00.0 Off |                    0 |
| N/A   38C    P0            126W /  500W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H20-3e                  On  |   00000000:CA:00.0 Off |                    0 |
| N/A   31C    P0             75W /  500W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H20-3e                  On  |   00000000:DA:00.0 Off |                    0 |
| N/A   37C    P0            103W /  500W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+


@Xiaoctw

Xiaoctw commented Apr 10, 2026

> python -m sglang.launch_server --model-path ... --served-model-name Kimi-K2.5 --trust-remote-code --host 0.0.0.0 --port 3001 --tp-size 8 --ep-size 8 --enable-eplb --mem-fraction-static 0.8 --cuda-graph-max-bs 64 --max-running-requests 64 --enable-mixed-chunk --tool-call-parser kimi_k2 --reasoning-parser kimi_k2 --watchdog-timeout 20000
>
> Why does the server crash when it receives a request? (H20 × 8)

[2026-04-10 15:39:20 TP0 EP0] Decode batch, #running-req: 1, #token: 1938, token usage: 0.00, cuda graph: True, gen throughput (token/s): 94.64, #queue-req: 0
[2026-04-10 15:39:20 TP0 EP0] [EPLBManager] rebalance start
[2026-04-10 15:39:20 TP4 EP4] [EPLBManager] rebalance start
[2026-04-10 15:39:20 TP1 EP1] [EPLBManager] rebalance start
[2026-04-10 15:39:20 TP3 EP3] [EPLBManager] rebalance start
[2026-04-10 15:39:20 TP7 EP7] [EPLBManager] rebalance start
[2026-04-10 15:39:20 TP5 EP5] [EPLBManager] rebalance start
[2026-04-10 15:39:20 TP6 EP6] [EPLBManager] rebalance start
[2026-04-10 15:39:20 TP2 EP2] [EPLBManager] rebalance start
[2026-04-10 15:39:21 TP0 EP0] Resetting ExpertDistributionRecorder...
[2026-04-10 15:39:21 TP7 EP7] Resetting ExpertDistributionRecorder...
[2026-04-10 15:39:21 TP3 EP3] Resetting ExpertDistributionRecorder...
[2026-04-10 15:39:21 TP1 EP1] Resetting ExpertDistributionRecorder...
[2026-04-10 15:39:21 TP6 EP6] Resetting ExpertDistributionRecorder...
[2026-04-10 15:39:21 TP4 EP4] Resetting ExpertDistributionRecorder...
[2026-04-10 15:39:21 TP5 EP5] Resetting ExpertDistributionRecorder...
[2026-04-10 15:39:21 TP2 EP2] Resetting ExpertDistributionRecorder...
[2026-04-10 15:39:42] Health check failed. Server couldn't get a response from detokenizer for last 20 seconds. tic start time: 15:39:21. last_heartbeat time: 15:39:20
[2026-04-10 15:39:52] Health check failed. Server couldn't get a response from detokenizer for last 20 seconds. tic start time: 15:39:32. last_heartbeat time: 15:39:20
[2026-04-10 15:40:03] Health check failed. Server couldn't get a response from detokenizer for last 20 seconds. tic start time: 15:39:43. last_heartbeat time: 15:39:20
[2026-04-10 15:40:14] Health check failed. Server couldn't get a response from detokenizer for last 20 seconds. tic start time: 15:39:54. last_heartbeat time: 15:39:20
[2026-04-10 15:40:25] Health check failed. Server couldn't get a response from detokenizer for last 20 seconds. tic start time: 15:40:05. last_heartbeat time: 15:39:20
[2026-04-10 15:40:36] Health check failed. Server couldn't get a response from detokenizer for last 20 seconds. tic start time: 15:40:16. last_heartbeat time: 15:39:20

Here are the error logs with 0.5.10.post1. Thanks @yafengio.

yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
Signed-off-by: yafeng.li <yafeng.li@mthreads.com>
