[bug fix] use npu phy id in container env by jinke446 · Pull Request #14266 · sgl-project/sglang

jinke446 · 2025-12-02T03:22:35Z

Motivation

On Ascend, the Mooncake backend should be given the NPU physical ID (phy id) when SGLang runs inside a container.

Modifications

Use the phy id from ASCEND_NPU_PHY_ID when running in a container environment; otherwise fall back to gpu_id.
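A minimal sketch of the selection logic described above, written against the standard library only rather than SGLang's own environment helpers. The function names `resolve_npu_device_id` and `npu_suffix` are hypothetical, and treating an unparsable value as unset is an assumption of this sketch; only the variable name ASCEND_NPU_PHY_ID, the -1 default, and the npu_<id> suffix come from this PR.

```python
import os


def resolve_npu_device_id(gpu_id: int, env=None) -> int:
    """Return the physical NPU ID if ASCEND_NPU_PHY_ID is set, else the logical gpu_id.

    Hypothetical helper for illustration; not part of SGLang.
    """
    env = os.environ if env is None else env
    raw = env.get("ASCEND_NPU_PHY_ID", "-1")
    try:
        phy_id = int(raw)
    except ValueError:
        phy_id = -1  # assumption: an unparsable value is treated as "unset"
    # -1 is the sentinel meaning "no physical ID provided".
    return gpu_id if phy_id == -1 else phy_id


def npu_suffix(gpu_id: int, env=None) -> str:
    """Build the "npu_<id>" suffix that the PR appends to the hostname string."""
    return f"npu_{resolve_npu_device_id(gpu_id, env)}"


print(npu_suffix(0, {"ASCEND_NPU_PHY_ID": "5"}))  # npu_5
print(npu_suffix(1, {}))                          # npu_1
```

Passing the environment as a parameter keeps the fallback easy to exercise without touching the real process environment.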

Accuracy Tests

Benchmarking and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.
Work with maintainers to merge your PR. See the PR Merge Process

gemini-code-assist · 2025-12-02T03:22:38Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

ShangmingCai · 2025-12-02T06:11:29Z

+            npu_phy_id = get_int_env_var("ASCEND_NPU_PHY_ID", -1)
+            if npu_phy_id == -1:
+                hostname += f":{get_free_port()}:npu_{self.gpu_id}"
+            else:
+                hostname += f":{get_free_port()}:npu_{npu_phy_id}"


If you have initialized ASCEND_NPU_PHY_ID = EnvInt(-1), then maybe you should use

from sglang.srt.environ import envs
npu_phy_id = envs.ASCEND_NPU_PHY_ID

here.

Thanks for the comments, done.

ShangmingCai · 2025-12-02T06:12:08Z

+from sglang.srt.utils import (
+    get_bool_env_var,
+    get_free_port,
+    get_int_env_var,


And this is no longer needed.

ShangmingCai

@iforgetmyname @ping1jing2 Since this is an Ascend PR, do you have time to review it and leave comments?

ShangmingCai

How about changing .value to .get()?

ShangmingCai

LGTM, but this needs approval from @ping1jing2 or @iforgetmyname before merging.

ShangmingCai · 2025-12-02T08:58:26Z

/tag-and-rerun-ci

ping1jing2 · 2025-12-02T14:53:47Z

Hi @Vikram111-pix, thanks for your contribution. Could you show a comparison of the behavior before and after the modifications?

iforgetmyname · 2025-12-02T16:37:15Z

Could you please provide a use case for this newly added environment variable?

For example, when deploying 2 cards for a prefill instance and 4 cards for a decode instance, all inside a container environment, how should I assign this env var? If I understand correctly, I should assign a phy id to each card individually. Does that mean 6 containers would have to be created in this case?

jinke446 · 2025-12-03T01:46:48Z

@ping1jing2 @iforgetmyname Thanks for your comments.

As mentioned in https://github.com/kvcache-ai/Mooncake/blob/main/doc/zh/ascend_transport.md, the phy id should be passed to the Mooncake transfer engine.

Here is the case I ran into: two NPUs in one container are used for 1P1D (one prefill instance and one decode instance), as shown below.

#npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 25.2.0                   Version: 25.2.0                                               |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 5     910B2               | OK            | 106.9       48                0    / 0             |
| 0                         | 0000:41:00.0  | 0           0    / 0          3406 / 65536         |
+===========================+===============+====================================================+
| 6     910B2               | OK            | 97.3        45                0    / 0             |
| 0                         | 0000:82:00.0  | 0           0    / 0          3417 / 65536         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| No running processes found in NPU 5                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 6                                                            |
+===========================+===============+====================================================+

The phy ids are 5 and 6, while the logical ids are 0 and 1, so the launch command for the P instance is:

export ENABLE_ASCEND_TRANSFER_WITH_MOONCAKE=true
export ASCEND_NPU_PHY_ID=5
python -m sglang.launch_server --model-path /root/DeepSeek-R1-Distill-Qwen-1___5B/ --disaggregation-mode prefill --port 30000 --base-gpu-id 0  --host 127.0.0.1 --tp-size 1 --attention-backend ascend

For the D instance, the launch command is:

export ENABLE_ASCEND_TRANSFER_WITH_MOONCAKE=true
export ASCEND_NPU_PHY_ID=6
python -m sglang.launch_server --model-path /root/DeepSeek-R1-Distill-Qwen-1___5B/ --disaggregation-mode decode --port 30001 --base-gpu-id 1 --host 127.0.0.1 --tp-size 1  --attention-backend ascend

Please correct me if I have misunderstood anything.
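The two launch commands above differ only in the physical ID they export. As a rough sketch, the per-instance environment could be derived from a logical-to-physical mapping. The mapping values come from the npu-smi output in this comment; the helper name `instance_env` and the mapping constant are hypothetical.

```python
import os

# Logical ID -> physical ID, taken from the npu-smi output in this comment.
LOGICAL_TO_PHYSICAL = {0: 5, 1: 6}


def instance_env(logical_id: int, base_env=None) -> dict:
    """Return a copy of the environment with the Mooncake-related variables set.

    Hypothetical helper for illustration; not part of SGLang.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["ENABLE_ASCEND_TRANSFER_WITH_MOONCAKE"] = "true"
    env["ASCEND_NPU_PHY_ID"] = str(LOGICAL_TO_PHYSICAL[logical_id])
    return env


prefill_env = instance_env(0, base_env={})  # used with --base-gpu-id 0
decode_env = instance_env(1, base_env={})   # used with --base-gpu-id 1
print(prefill_env["ASCEND_NPU_PHY_ID"], decode_env["ASCEND_NPU_PHY_ID"])  # 5 6
```

Each dictionary can then be passed as the `env` argument of `subprocess.Popen` when starting the corresponding `sglang.launch_server` process, so the logical ID given to --base-gpu-id and the exported physical ID stay paired.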

iforgetmyname · 2025-12-03T02:31:04Z

> for D instance, the launch cmd:
>
> export ENABLE_ASCEND_TRANSFER_WITH_MOONCAKE=true
> export ASCEND_NPU_PHY_ID =5
> python -m sglang.launch_server --model-path /root/DeepSeek-R1-Distill-Qwen-1___5B/ --disaggregation-mode decode --port 30001 --base-gpu-id 1 --host 127.0.0.1 --tp-size 1  --attention-backend ascend
>
> pls correct me if any my misunderstanding.

For the D instance, should ASCEND_NPU_PHY_ID=5 be ASCEND_NPU_PHY_ID=6?

I see what's going on here. Within non-privileged containers, the CANN runtime can only see logical ids (in this case 0 and 1). However, to initialize the transfer backend, MoonCake needs to call an interface provided by the driver, which can only see physical ids (in this case 5 and 6). Using a privileged container can avoid this pain in the butt. This is not limited to MoonCake: we are facing the same issue with the Ascend TransferBackend, and it has already been reported to the driver team.

Anyway, this is an elegant workaround for now, thx
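To illustrate the logical-versus-physical mismatch discussed in this thread, here is a rough sketch that extracts the physical IDs from npu-smi info text like the output posted earlier. It assumes the table layout shown in this PR and, as a further assumption, that the n-th listed device corresponds to logical ID n. The function names are hypothetical.

```python
import re

# Matches device rows such as "| 5     910B2               | OK            | ..."
# Chip rows ("| 0            | 0000:41:00.0 | ...") have no name column and do not match.
_DEVICE_ROW = re.compile(r"^\|\s*(\d+)\s+([^\s|]+)\s*\|\s*(\w+)\s*\|")


def physical_ids(npu_smi_output: str) -> list:
    """Return the physical NPU IDs in the order the device rows appear."""
    ids = []
    for line in npu_smi_output.splitlines():
        match = _DEVICE_ROW.match(line.strip())
        if match:
            ids.append(int(match.group(1)))
    return ids


def logical_to_physical(npu_smi_output: str) -> dict:
    """Assumption: the n-th listed device corresponds to logical ID n."""
    return dict(enumerate(physical_ids(npu_smi_output)))


sample = """\
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| 5     910B2               | OK            | 106.9       48                0    / 0             |
| 0                         | 0000:41:00.0  | 0           0    / 0          3406 / 65536         |
| 6     910B2               | OK            | 97.3        45                0    / 0             |
| 0                         | 0000:82:00.0  | 0           0    / 0          3417 / 65536         |
| No running processes found in NPU 5                                                            |
"""
print(logical_to_physical(sample))  # {0: 5, 1: 6}
```

The result matches the pairing used in the launch commands in this thread (logical 0 with physical 5, logical 1 with physical 6), but the ordering assumption should be checked on the actual host before relying on it.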

iforgetmyname

lgtm

iforgetmyname · 2025-12-03T02:33:27Z

cc @ShangmingCai this is ready-to-merge

Co-authored-by: jinke15 <jinke15@jd.com>

[bug fix] use npu phy id in container env

48097aa

jinke446 requested review from ByronHsu, ShangmingCai and hnyls2002 as code owners December 2, 2025 03:22

github-actions Bot added the documentation Improvements or additions to documentation label Dec 2, 2025

ShangmingCai reviewed Dec 2, 2025

get env value use envs

2713ca5

ShangmingCai reviewed Dec 2, 2025

use get function instead of value property

7bd157d

ShangmingCai approved these changes Dec 2, 2025

github-actions Bot added the run-ci label Dec 2, 2025

ping1jing2 approved these changes Dec 2, 2025

Merge branch 'main' into fix_mooncake_ascend

d551d04

ping1jing2 self-assigned this Dec 2, 2025

iforgetmyname approved these changes Dec 3, 2025

ShangmingCai merged commit 4227137 into sgl-project:main Dec 3, 2025
95 of 100 checks passed

jinke446 deleted the fix_mooncake_ascend branch December 3, 2025 05:00

yingluosanqian pushed a commit to yingluosanqian/sglang that referenced this pull request Dec 4, 2025

[bug fix] use npu phy id in container env (sgl-project#14266)

d692e41

Co-authored-by: jinke15 <jinke15@jd.com>

tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 5, 2025

[bug fix] use npu phy id in container env (sgl-project#14266)

857dd56

Co-authored-by: jinke15 <jinke15@jd.com>

tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 5, 2025

[bug fix] use npu phy id in container env (sgl-project#14266)

364e92b

Co-authored-by: jinke15 <jinke15@jd.com>

yuchengz816-bot pushed a commit to yuchengz816-bot/sglang that referenced this pull request Dec 8, 2025

[bug fix] use npu phy id in container env (sgl-project#14266)

b3476b8

Co-authored-by: jinke15 <jinke15@jd.com>
