[bug fix] use npu phy id in container env#14266
[bug fix] use npu phy id in container env#14266ShangmingCai merged 4 commits intosgl-project:mainfrom
Conversation
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
| npu_phy_id = get_int_env_var("ASCEND_NPU_PHY_ID", -1) | ||
| if npu_phy_id == -1: | ||
| hostname += f":{get_free_port()}:npu_{self.gpu_id}" | ||
| else: | ||
| hostname += f":{get_free_port()}:npu_{npu_phy_id}" |
There was a problem hiding this comment.
If you have initialized ASCEND_NPU_PHY_ID = EnvInt(-1), then maybe you should use
from sglang.srt.environ import envs
npu_phy_id = envs.ASCEND_NPU_PHY_ID here.
There was a problem hiding this comment.
thanks for comments, done
| from sglang.srt.utils import ( | ||
| get_bool_env_var, | ||
| get_free_port, | ||
| get_int_env_var, |
There was a problem hiding this comment.
And this is no longer needed.
ShangmingCai
left a comment
There was a problem hiding this comment.
@iforgetmyname @ping1jing2 since this is an ascend PR, do you have time to review and give comments?
ShangmingCai
left a comment
There was a problem hiding this comment.
How about change .value to .get()?
ShangmingCai
left a comment
There was a problem hiding this comment.
LGTM. But need an approval by @ping1jing2 or @iforgetmyname before merging.
|
/tag-and-rerun-ci |
|
Hi @Vikram111-pix Thanks for your contribution and could you show the comparison between the bug before and after the modifications |
|
could u pls provide a use case for this newly added environmental variable? for example when deploying 2 cards for a prefill instance and 4 cards for a decode instance, all of them stay inside container environment, how can i assign this env var? if i understand it correctly, i should assign phy id to each card respectively? thus in this case 6 containers should be created? |
|
@ping1jing2 @iforgetmyname thanks for your comments, As mentioned in this link https://github.com/kvcache-ai/Mooncake/blob/main/doc/zh/ascend_transport.md, phy id is should be passed to mooncake transfer engine. here is the case i met, two npu in one container are used for 1P1D as below phy id is 5, 6, logical id is 0, 1, so for P instance, the launch cmd: for D instance, the launch cmd: pls correct me if any my misunderstanding. |
For D instance, I see what's going on here, within non-privileged containers CANN runtime is only able to see logical ids (in this case 0, 1), however to initialize transfer backend MoonCake needs to call an interface provided by driver which is only able to see physical ids (in this case 5, 6). Using a privileged container can avoid this pain in the butt. Not only MoonCake, we are facing the same issue here with Ascend TransferBackend, this has been reported to driver team already. Anyway, this is an elegant workaround for now, thx |
|
cc @ShangmingCai this is ready-to-merge |
Co-authored-by: jinke15 <jinke15@jd.com>
Co-authored-by: jinke15 <jinke15@jd.com>
Co-authored-by: jinke15 <jinke15@jd.com>
Co-authored-by: jinke15 <jinke15@jd.com>
Motivation
for ascend env, npu phy id should be used for mooncake backend when running in container env
Modifications
use phy id from ASCEND_NPU_PHY_ID in container env, otherwise use gpu_id
Accuracy Tests
Benchmarking and Profiling
Checklist