Checklist

Motivation

Support `silu_and_mul` and `gelu_and_mul` in AMD; remove the current dependencies on `vllm ops.silu_and_mul` and `ops.gelu_and_mul`. Used in `fused_moe_triton.py`. [ROCm] Enable silu_and_mul, gelu_and_mul, gelu_tanh_and_mul in amd platform #4150 @yiakwy-xpu-ml-framework-team
Remove `from vllm.model_executor.layers.activation import GeluAndMul, SiluAndMul` in `sglang/python/sglang/srt/layers/activation.py`.
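For context, these two ops are the fused gated activations used by SwiGLU/GeGLU MLP blocks: the input packs `[gate, up]` along the last dimension. A minimal PyTorch sketch of the expected semantics (the `_ref` names are illustrative, not the sgl-kernel API):

```python
import torch
import torch.nn.functional as F

def silu_and_mul_ref(x: torch.Tensor) -> torch.Tensor:
    # x packs [gate, up] along the last dim; the output has half the width
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]

def gelu_and_mul_ref(x: torch.Tensor, approximate: str = "none") -> torch.Tensor:
    # same layout; approximate="tanh" gives the gelu_tanh_and_mul variant
    d = x.shape[-1] // 2
    return F.gelu(x[..., :d], approximate=approximate) * x[..., d:]
```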
Support `GemmaRMSNorm` and `RMSNorm` in AMD.
Remove `from vllm.model_executor.layers.layernorm import GemmaRMSNorm, RMSNorm` in `sglang/python/sglang/srt/layers/layernorm.py`.
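Both norms share the same RMS statistic; the Gemma variant stores its weight zero-centered and applies `1 + weight`. A sketch of the math to validate an AMD port against (illustrative `_ref` names; the kernels' exact dtype handling may differ):

```python
import torch

def rms_norm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # normalize by the root-mean-square over the last dimension
    var = x.float().pow(2).mean(dim=-1, keepdim=True)
    return (x.float() * torch.rsqrt(var + eps) * weight.float()).to(x.dtype)

def gemma_rms_norm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Gemma folds a +1 into the scale, i.e. the stored weight is zero-centered
    var = x.float().pow(2).mean(dim=-1, keepdim=True)
    return (x.float() * torch.rsqrt(var + eps) * (1.0 + weight.float())).to(x.dtype)
```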
Support the `rotary_embedding` kernel in AMD.
Support `ops.moe_sum` in AMD; remove the dependency on `vllm ops.moe_sum`. Used in `fused_moe_triton.py`.
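`moe_sum` just reduces the top-k per-expert partial outputs back to one hidden state per token. A one-line reference for checking an AMD implementation (the out-parameter signature here is an assumption modeled on the vLLM op):

```python
import torch

def moe_sum_ref(x: torch.Tensor, out: torch.Tensor) -> None:
    # x: [num_tokens, topk, hidden]; out: [num_tokens, hidden]
    torch.sum(x, dim=1, out=out)
```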
Benchmark `vllm ops.moe_align_block_size`, `moe_align_block_size_triton`, and `sgl_moe_align_block_size`, and remove the `num_experts=256` limitation in `sgl_moe_align_block_size`. After that, select the kernel directly from `moe_align_block_size_triton` and `sgl_moe_align_block_size`, and remove the dependency on `vllm ops.moe_align_block_size`. Used in `fused_moe_triton.py`. remove moe_align vllm dep #4249 & refine sgl_moe_align_block_size_benchmark #4327
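What the benchmark compares: all three kernels produce the same alignment metadata for the fused MoE GEMM. A pure-PyTorch reference of that contract (a sketch for validation, not a fast path; names are illustrative):

```python
import torch

def moe_align_block_size_ref(topk_ids: torch.Tensor, block_size: int, num_experts: int):
    """Group the flattened (token, top-k slot) indices by expert and pad each
    expert's segment to a multiple of block_size, so every block maps to a
    single expert."""
    flat = topk_ids.flatten()
    pad_val = flat.numel()  # out-of-range sentinel marking padded slots
    counts = torch.bincount(flat, minlength=num_experts)
    padded = (counts + block_size - 1) // block_size * block_size
    total = int(padded.sum())
    sorted_token_ids = torch.full((total,), pad_val, dtype=torch.int32)
    expert_ids = torch.empty(total // block_size, dtype=torch.int32)
    offset = 0
    for e in range(num_experts):
        idx = (flat == e).nonzero(as_tuple=True)[0].to(torch.int32)
        sorted_token_ids[offset : offset + idx.numel()] = idx
        n_blocks = int(padded[e]) // block_size
        expert_ids[offset // block_size : offset // block_size + n_blocks] = e
        offset += int(padded[e])
    num_tokens_post_pad = torch.tensor([total], dtype=torch.int32)
    return sorted_token_ids, expert_ids, num_tokens_post_pad
```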
Implement `scaled_int8_quant` in sgl-kernel and remove the current dependency on `vllm ops.scaled_int8_quant`. Used in `fused_moe_triton.py`. @zcnrex
Implement `per_token_group_quant_int8` in CUDA, replacing the current `per_token_group_quant_int8` Triton implementation. Used in `fused_moe_triton.py`. @zcnrex
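Both quant ops are symmetric int8 rounding with a scale derived from the max magnitude, per token row or per contiguous group. A reference sketch under that assumption (the real kernels' clamping and rounding details may differ):

```python
import torch

def scaled_int8_quant_ref(x: torch.Tensor):
    # symmetric per-token quantization: one scale per row
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-10).float() / 127.0
    q = torch.clamp(torch.round(x.float() / scale), -128, 127).to(torch.int8)
    return q, scale

def per_token_group_quant_int8_ref(x: torch.Tensor, group_size: int):
    # same idea, but one scale per group along the last dimension
    assert x.shape[-1] % group_size == 0
    g = x.reshape(-1, group_size).float()
    scale = g.abs().amax(dim=-1, keepdim=True).clamp(min=1e-10) / 127.0
    q = torch.clamp(torch.round(g / scale), -128, 127).to(torch.int8)
    return q.reshape(x.shape), scale.reshape(*x.shape[:-1], -1)
```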
Support the `apply_rope_with_cos_sin_cache_inplace` kernel in AMD; remove the current dependency on `vllm ops.rotary_embedding`. Used in `rotary_embedding.py`.
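For validating an AMD port, the rotation itself is the standard NeoX-style pairing of the two halves of the rotary dimensions. A sketch assuming cos/sin have already been gathered per position from the cache (the real kernel updates q and k in place; names are illustrative):

```python
import torch

def apply_rope_ref(q: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # q: [..., rot_dim]; cos/sin: [..., rot_dim // 2], gathered per position
    half = q.shape[-1] // 2
    q1, q2 = q[..., :half], q[..., half:]
    return torch.cat((q1 * cos - q2 * sin, q2 * cos + q1 * sin), dim=-1)
```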
Support `sglang_per_token_group_quant_fp8` in AMD. Used in `fused_moe_triton.py`. [tools] add fp8 max/min constant in utils #3959 [ROCm] Enable per token group quant fp8 in amd #3702 @yiakwy-xpu-ml-framework-team
Implement the `scaled_fp8_quant` kernel and remove the current dependency on `vllm ops.scaled_fp8_quant`. (This is in progress, 50% complete; see [quant kernel] sgl-kernel support per_tensor_quant fp8 #3786 for per-tensor support, and @hebiao064 is working on per-token support. `vllm ops.scaled_fp8_quant` will support both per-tensor and per-token.) Used in `fused_moe_triton.py`, `layer.py` and `fp8.py`. @BBuf @hebiao064 Add sgl_per_token_quant_fp8 #4089, [Refactor] Reducing code duplication across FP8 CUDA quantization kernels #4163, https://github.com/sgl-project/sglang/pull/4231, https://github.com/sgl-project/sglang/pull/4215
Implement the `topk_softmax` kernel and remove the current dependency on `vllm.ops.topk_softmax`. Used in `topk.py`. Add moe topk softmax templated from vllm #4302
Support `topk_softmax` in AMD. [ROCm] enable moe topk softmax in amd #4448
Remove `ops.topk_softmax` in `python/sglang/srt/layers/moe/topk.py`. remove vllm ops.topk_softmax dependency #4498
Implement the `awq_dequantize` kernel and remove the current dependency on `vllm ops.awq_dequantize`. Used in `deepseek_nextn.py` and `deepseek_v2.py`. Add awq dequantize kernel to sgl with 1x to 3x speedup #4104 @zcnrex
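For the `topk_softmax` and `scaled_fp8_quant` items above, the contracts are small enough to pin down with references. A sketch assuming per-tensor scaling to the e4m3fn max (448), with illustrative `_ref` names:

```python
import torch

def topk_softmax_ref(gating_logits: torch.Tensor, topk: int):
    # router helper: softmax over expert logits, then top-k weights and ids per token
    probs = torch.softmax(gating_logits.float(), dim=-1)
    topk_weights, topk_ids = torch.topk(probs, topk, dim=-1)
    return topk_weights, topk_ids

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def scaled_fp8_quant_ref(x: torch.Tensor):
    # per-tensor fp8 quantization: scale so the max magnitude maps to FP8_MAX
    scale = (x.abs().amax().float() / FP8_MAX).clamp(min=1e-12)
    q = (x.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale
```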
Related resources

No response