Skip to content

[bug fix] use npu phy id in container env#14266

Merged
ShangmingCai merged 4 commits intosgl-project:mainfrom
jinke446:fix_mooncake_ascend
Dec 3, 2025
Merged

[bug fix] use npu phy id in container env#14266
ShangmingCai merged 4 commits intosgl-project:mainfrom
jinke446:fix_mooncake_ascend

Conversation

@jinke446
Copy link
Copy Markdown
Contributor

@jinke446 jinke446 commented Dec 2, 2025

Motivation

for ascend env, npu phy id should be used for mooncake backend when running in container env

Modifications

use phy id from ASCEND_NPU_PHY_ID in container env, otherwise use gpu_id

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added the documentation Improvements or additions to documentation label Dec 2, 2025
Comment on lines +173 to +177
npu_phy_id = get_int_env_var("ASCEND_NPU_PHY_ID", -1)
if npu_phy_id == -1:
hostname += f":{get_free_port()}:npu_{self.gpu_id}"
else:
hostname += f":{get_free_port()}:npu_{npu_phy_id}"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have initialized ASCEND_NPU_PHY_ID = EnvInt(-1), then maybe you should use

from sglang.srt.environ import envs
npu_phy_id = envs.ASCEND_NPU_PHY_ID 

here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for comments, done

from sglang.srt.utils import (
get_bool_env_var,
get_free_port,
get_int_env_var,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this is no longer needed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Copy Markdown
Collaborator

@ShangmingCai ShangmingCai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iforgetmyname @ping1jing2 since this is an ascend PR, do you have time to review and give comments?

Copy link
Copy Markdown
Collaborator

@ShangmingCai ShangmingCai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about change .value to .get()?

Copy link
Copy Markdown
Collaborator

@ShangmingCai ShangmingCai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. But need an approval by @ping1jing2 or @iforgetmyname before merging.

@ShangmingCai
Copy link
Copy Markdown
Collaborator

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci label Dec 2, 2025
@ping1jing2 ping1jing2 self-assigned this Dec 2, 2025
@ping1jing2
Copy link
Copy Markdown
Collaborator

Hi @Vikram111-pix Thanks for your contribution and could you show the comparison between the bug before and after the modifications

@iforgetmyname
Copy link
Copy Markdown
Collaborator

could u pls provide a use case for this newly added environmental variable?

for example when deploying 2 cards for a prefill instance and 4 cards for a decode instance, all of them stay inside container environment, how can i assign this env var? if i understand it correctly, i should assign phy id to each card respectively? thus in this case 6 containers should be created?

@jinke446
Copy link
Copy Markdown
Contributor Author

jinke446 commented Dec 3, 2025

@ping1jing2 @iforgetmyname thanks for your comments,

As mentioned in this link https://github.com/kvcache-ai/Mooncake/blob/main/doc/zh/ascend_transport.md, phy id is should be passed to mooncake transfer engine.

here is the case i met, two npu in one container are used for 1P1D as below

#npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 25.2.0                   Version: 25.2.0                                               |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 5     910B2               | OK            | 106.9       48                0    / 0             |
| 0                         | 0000:41:00.0  | 0           0    / 0          3406 / 65536         |
+===========================+===============+====================================================+
| 6     910B2               | OK            | 97.3        45                0    / 0             |
| 0                         | 0000:82:00.0  | 0           0    / 0          3417 / 65536         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| No running processes found in NPU 5                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 6                                                            |
+===========================+===============+====================================================+

phy id is 5, 6, logical id is 0, 1, so for P instance, the launch cmd:

export ENABLE_ASCEND_TRANSFER_WITH_MOONCAKE=true
export ASCEND_NPU_PHY_ID=5
python -m sglang.launch_server --model-path /root/DeepSeek-R1-Distill-Qwen-1___5B/ --disaggregation-mode prefill --port 30000 --base-gpu-id 0  --host 127.0.0.1 --tp-size 1 --attention-backend ascend

for D instance, the launch cmd:

export ENABLE_ASCEND_TRANSFER_WITH_MOONCAKE=true
export ASCEND_NPU_PHY_ID=6
python -m sglang.launch_server --model-path /root/DeepSeek-R1-Distill-Qwen-1___5B/ --disaggregation-mode decode --port 30001 --base-gpu-id 1 --host 127.0.0.1 --tp-size 1  --attention-backend ascend

pls correct me if any my misunderstanding.

@iforgetmyname
Copy link
Copy Markdown
Collaborator

for D instance, the launch cmd:

export ENABLE_ASCEND_TRANSFER_WITH_MOONCAKE=true
export ASCEND_NPU_PHY_ID =5
python -m sglang.launch_server --model-path /root/DeepSeek-R1-Distill-Qwen-1___5B/ --disaggregation-mode decode --port 30001 --base-gpu-id 1 --host 127.0.0.1 --tp-size 1  --attention-backend ascend

pls correct me if any my misunderstanding.

For D instance, ASCEND_NPU_PHY_ID=5 should this be ASCEND_NPU_PHY_ID=6?

I see what's going on here, within non-privileged containers CANN runtime is only able to see logical ids (in this case 0, 1), however to initialize transfer backend MoonCake needs to call an interface provided by driver which is only able to see physical ids (in this case 5, 6). Using a privileged container can avoid this pain in the butt. Not only MoonCake, we are facing the same issue here with Ascend TransferBackend, this has been reported to driver team already.

Anyway, this is an elegant workaround for now, thx

Copy link
Copy Markdown
Collaborator

@iforgetmyname iforgetmyname left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@iforgetmyname
Copy link
Copy Markdown
Collaborator

cc @ShangmingCai this is ready-to-merge

@ShangmingCai ShangmingCai merged commit 4227137 into sgl-project:main Dec 3, 2025
95 of 100 checks passed
@jinke446 jinke446 deleted the fix_mooncake_ascend branch December 3, 2025 05:00
yingluosanqian pushed a commit to yingluosanqian/sglang that referenced this pull request Dec 4, 2025
Co-authored-by: jinke15 <jinke15@jd.com>
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 5, 2025
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 5, 2025
yuchengz816-bot pushed a commit to yuchengz816-bot/sglang that referenced this pull request Dec 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation run-ci

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants