Checklist
Motivation
1. Motivation
After several features were merged, SGLang can run on Ascend servers in eager mode. To get better performance, we now need to implement ACLGraph or NPUGraph support.
Goals
Goal 1: Define an NPUGraphRunner class for SGLang, which provides basic functions and supports Llama or Qwen models.
Goal 2: Adapt to TP/DP, GraphTree, and dynamic shape scenarios, including memory reuse.
Goal 3: Improve performance based on torch.compile.
2. Technical Design
Key messages
torch_npu provides torch_npu.npu.NPUGraph, whose interfaces and functions are similar to those of torch.cuda.CUDAGraph.
Concerning the level of RTS, we can refer to this document.
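To make the capture/replay idea concrete, here is a minimal, framework-free sketch of the pattern a graph runner typically follows: capture once per batch size into static buffers, pad incoming batches up to the nearest captured size, refresh any host-side values, then replay. Everything in it is hypothetical and for illustration only: ToyGraph, ToyGraphRunner, toy_forward, and the seq_lens argument are made-up names, plain Python lists stand in for device tensors, and no torch_npu or SGLang API is called. The update method only mimics the idea of refreshing a host-side input (such as the host_list mentioned in the roadmap) before each replay; it does not reflect the real signature of torch_npu.npu.NPUGraph.update.

```python
import bisect


class ToyGraph:
    """Hypothetical stand-in for a captured device graph.

    It binds a fixed function to fixed (static) input/output buffers,
    which is the property that makes replay cheap on real hardware.
    """

    def __init__(self, fn, static_in, static_out, host_list):
        self.fn = fn
        self.static_in = static_in
        self.static_out = static_out
        self.host_list = host_list  # host-side values fixed at capture time

    def update(self, host_list):
        # Assumption: host-side arguments must be refreshed before replay.
        self.host_list = list(host_list)

    def replay(self):
        # Re-run the recorded work, writing in place into the static output.
        self.static_out[:] = self.fn(self.static_in, self.host_list)


class ToyGraphRunner:
    """Hypothetical runner: one captured graph per supported batch size."""

    def __init__(self, fn, capture_sizes):
        self.capture_sizes = sorted(capture_sizes)
        self.graphs = {
            bs: ToyGraph(fn, [0] * bs, [0] * bs, [0] * bs)
            for bs in self.capture_sizes
        }

    def can_run(self, batch_size):
        # Larger batches would fall back to eager mode in a real runner.
        return batch_size <= self.capture_sizes[-1]

    def replay(self, inputs, seq_lens):
        bs = len(inputs)
        # Pick the smallest captured batch size that fits the real batch.
        padded = self.capture_sizes[bisect.bisect_left(self.capture_sizes, bs)]
        graph = self.graphs[padded]
        pad = padded - bs
        graph.static_in[:] = list(inputs) + [0] * pad  # copy into static buffer
        graph.update(list(seq_lens) + [0] * pad)       # refresh host-side values
        graph.replay()
        return graph.static_out[:bs]                   # drop the padded rows


def toy_forward(xs, seq_lens):
    # Placeholder computation standing in for a model forward pass.
    return [x * n for x, n in zip(xs, seq_lens)]


runner = ToyGraphRunner(toy_forward, capture_sizes=[1, 2, 4, 8])
result = runner.replay([3, 5, 7], seq_lens=[10, 20, 30])  # [30, 100, 210]
```

In this sketch a batch of 3 is padded to the captured size 4, and the padded row is sliced off after replay. A real runner would reuse device memory across the captured graphs rather than allocating separate buffers per batch size, which is what the memory-reuse goal above refers to.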
3. Roadmap
Phase 1: Basic support for dense models
Implement NPUGraphRunner with reference to CUDAGraphRunner, while handling one special case:
Because we use the torch_npu.npu_fused_infer_attention_score API, which has a host_list input, we have to update its value each time using torch_npu.npu.NPUGraph.update. For more details, please refer to task update.
Phase 2: Basic support for MoE models
Related resources
No response