
Add MlxAttnBackend for macOS #20221

Closed

yeahdongcn wants to merge 1 commit into sgl-project:main from yeahdongcn:xd/MlxAttnBackend


Conversation

yeahdongcn (Collaborator) commented Mar 10, 2026

Motivation

This PR adds an MlxAttnBackend for macOS using mx.fast.scaled_dot_product_attention.
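For reference, MLX's fused kernel takes `[batch, n_heads, seq_len, head_dim]` arrays. Below is a minimal sketch of wrapping it behind a torch-facing call; this is an illustration under stated assumptions (the numpy round-trip, float32 conversion, and the `"causal"` mask string), not the PR's actual backend code:

```python
import mlx.core as mx
import numpy as np
import torch


def mlx_sdpa(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Fused SDPA via MLX; q/k/v are [batch, n_heads, seq_len, head_dim]."""
    scale = q.shape[-1] ** -0.5
    # MLX arrays are built from host memory, so detach and copy to CPU first.
    to_mx = lambda t: mx.array(t.detach().to("cpu", torch.float32).numpy())
    out = mx.fast.scaled_dot_product_attention(
        to_mx(q), to_mx(k), to_mx(v), scale=scale, mask="causal"
    )
    mx.eval(out)  # MLX evaluates lazily; force the kernel to run
    return torch.from_numpy(np.array(out)).to(device=q.device, dtype=q.dtype)
```

In a paged-KV serving stack like SGLang, a real backend additionally has to gather K/V from the cache pool and handle per-request sequence lengths, which is where conversion overhead can dominate at batch size 1.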

Modifications
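(The PR's modification list is not preserved in this capture. Judging from the server_args.py log lines in the benchmarks below, backend selection is gated by a SGLANG_USE_MLX_ATTENTION environment variable; a hypothetical sketch of that gating, with illustrative names only:)

```python
import os


def default_attention_backend() -> str:
    # Hypothetical: SGLANG_USE_MLX_ATTENTION=1 opts a macOS run into the
    # MLX backend; otherwise the torch_native default is kept.
    if os.environ.get("SGLANG_USE_MLX_ATTENTION", "0") == "1":
        return "mlx"
    return "torch_native"
```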

Accuracy Tests

Benchmarking and Profiling

The following tests were run on an M1 MacBook Pro.

SDPA

| Device | Backend      | SDPA Latency (100 runs) |
|--------|--------------|-------------------------|
| MPS    | torch_native | 0.046 s                 |
| CPU    | mlx          | 0.011 s                 |
| CPU    | torch_native | 0.141 s                 |
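A harness along these lines could reproduce a "100 runs" measurement (a sketch only; the shapes, dtype, and absence of a mask are assumptions, and warmup is omitted for brevity):

```python
import time

import mlx.core as mx

# Illustrative shapes, not the ones used for the table above.
B, H, L, D = 1, 16, 512, 128
q = mx.random.normal((B, H, L, D))
k = mx.random.normal((B, H, L, D))
v = mx.random.normal((B, H, L, D))
mx.eval(q, k, v)  # materialize inputs before timing

start = time.perf_counter()
for _ in range(100):
    out = mx.fast.scaled_dot_product_attention(q, k, v, scale=D**-0.5)
    mx.eval(out)  # force execution each iteration; MLX evaluates lazily
print(f"100 runs: {time.perf_counter() - start:.3f}s")
```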

E2E

With SGLANG_USE_MLX_ATTENTION=0 (default):

> uv run python -m sglang.bench_one_batch --model-path /Users/yexiaodong/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B --trust-remote-code --disable-radix-cache --disable-cuda-graph --tp-size 1 --batch-size 1 --input-len 60 --output-len 100 --port 43440
W0311 10:39:37.161000 29488 sglang-diffusion/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/attention/fla/utils.py:212: UserWarning: Triton is not supported on current platform, roll back to CPU.
  warnings.warn(
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/awq.py:87: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/gguf.py:48: UserWarning: Only CUDA and MUSA support GGUF quantization currently.
  warnings.warn(f"Only CUDA and MUSA support GGUF quantization currently.")
[2026-03-11 10:39:39] INFO server_args.py:2154: Attention backend not specified. Use torch_native backend by default.
[2026-03-11 10:39:39] WARNING server_args.py:2160: Cuda graph is disabled because of using torch native attention backend
[2026-03-11 10:39:39] WARNING common.py:1221: Fail to set RLIMIT_STACK: current limit exceeds maximum limit
[2026-03-11 10:39:39 TP0] Init torch distributed begin.
[2026-03-11 10:39:39 TP0] Init torch distributed ends. elapsed=0.05 s, mem usage=0.00 GB
[2026-03-11 10:39:39 TP0] Ignore import error when loading sglang.srt.models.bailing_moe_linear: No module named 'vllm'
[2026-03-11 10:39:39 TP0] Ignore import error when loading sglang.srt.models.bailing_moe_nextn: No module named 'vllm'
[2026-03-11 10:39:39 TP0] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-03-11 10:39:39 TP0] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-03-11 10:39:39 TP0] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/sglang-diffusion/lib/python3.11/site-packages/transformers/__init__.py)
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/attention/fla/utils.py:212: UserWarning: Triton is not supported on current platform, roll back to CPU.
  warnings.warn(
[2026-03-11 10:39:39 TP0] Load weight begin. avail mem=3.55 GB
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
[2026-03-11 10:39:40 TP0] Parameter lm_head.weight not found in params_dict
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.81s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.81s/it]

[2026-03-11 10:39:41 TP0] Load weight end. elapsed=2.07 s, type=Qwen3ForCausalLM, avail mem=2.22 GB, mem usage=1.32 GB.
[2026-03-11 10:39:41 TP0] Using KV cache dtype: torch.bfloat16
[2026-03-11 10:39:42 TP0] KV Cache is allocated. #tokens: 20033, K size: 1.07 GB, V size: 1.07 GB
[2026-03-11 10:39:42 TP0] Memory pool end. avail mem=1.78 GB
[2026-03-11 10:39:42 TP0] Disable piecewise CUDA graph because --disable-piecewise-cuda-graph is set
max_total_num_tokens=20033
Warmup ...
Prefill. latency: 0.46183 s, throughput:    129.92 token/s
Decode 0. Batch size: 1, latency: 1.48535 s, throughput:      0.67 token/s
Decode 1. Batch size: 1, latency: 0.10879 s, throughput:      9.19 token/s
Decode 2. Batch size: 1, latency: 0.11965 s, throughput:      8.36 token/s
Decode 3. Batch size: 1, latency: 0.19780 s, throughput:      5.06 token/s
Decode 4. Batch size: 1, latency: 0.12490 s, throughput:      8.01 token/s
Decode.  median latency: 0.13355 s, median throughput:      7.49 token/s
Total. latency:  6.230 s, throughput:     14.77 token/s
Benchmark ...
Prefill. latency: 0.23880 s, throughput:    251.26 token/s
Decode 0. Batch size: 1, latency: 0.11311 s, throughput:      8.84 token/s
Decode 1. Batch size: 1, latency: 0.13725 s, throughput:      7.29 token/s
Decode 2. Batch size: 1, latency: 0.11654 s, throughput:      8.58 token/s
Decode 3. Batch size: 1, latency: 0.11223 s, throughput:      8.91 token/s
Decode 4. Batch size: 1, latency: 0.11668 s, throughput:      8.57 token/s
Decode.  median latency: 0.14338 s, median throughput:      6.97 token/s
Total. latency: 14.447 s, throughput:     11.08 token/s

With SGLANG_USE_MLX_ATTENTION=1:

> export SGLANG_USE_MLX_ATTENTION=1
> uv run python -m sglang.bench_one_batch --model-path /Users/yexiaodong/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B --trust-remote-code --disable-radix-cache --disable-cuda-graph --tp-size 1 --batch-size 1 --input-len 60 --output-len 100 --port 43440
W0311 10:42:13.546000 30323 sglang-diffusion/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/attention/fla/utils.py:212: UserWarning: Triton is not supported on current platform, roll back to CPU.
  warnings.warn(
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/awq.py:87: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/gguf.py:48: UserWarning: Only CUDA and MUSA support GGUF quantization currently.
  warnings.warn(f"Only CUDA and MUSA support GGUF quantization currently.")
[2026-03-11 10:42:15] INFO server_args.py:2154: Attention backend not specified. Use mlx backend by default.
[2026-03-11 10:42:15] WARNING server_args.py:2166: Cuda graph is disabled because of using MLX attention backend
[2026-03-11 10:42:15] WARNING common.py:1221: Fail to set RLIMIT_STACK: current limit exceeds maximum limit
[2026-03-11 10:42:15 TP0] Init torch distributed begin.
[2026-03-11 10:42:15 TP0] Init torch distributed ends. elapsed=0.07 s, mem usage=0.01 GB
[2026-03-11 10:42:15 TP0] Ignore import error when loading sglang.srt.models.bailing_moe_linear: No module named 'vllm'
[2026-03-11 10:42:15 TP0] Ignore import error when loading sglang.srt.models.bailing_moe_nextn: No module named 'vllm'
[2026-03-11 10:42:15 TP0] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-03-11 10:42:15 TP0] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-03-11 10:42:15 TP0] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/sglang-diffusion/lib/python3.11/site-packages/transformers/__init__.py)
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/attention/fla/utils.py:212: UserWarning: Triton is not supported on current platform, roll back to CPU.
  warnings.warn(
[2026-03-11 10:42:15 TP0] Load weight begin. avail mem=5.26 GB
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
[2026-03-11 10:42:15 TP0] Parameter lm_head.weight not found in params_dict
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.87s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.88s/it]

[2026-03-11 10:42:17 TP0] Load weight end. elapsed=2.05 s, type=Qwen3ForCausalLM, avail mem=2.66 GB, mem usage=2.60 GB.
[2026-03-11 10:42:17 TP0] Using KV cache dtype: torch.bfloat16
[2026-03-11 10:42:18 TP0] KV Cache is allocated. #tokens: 20948, K size: 1.12 GB, V size: 1.12 GB
[2026-03-11 10:42:18 TP0] Memory pool end. avail mem=1.83 GB
[2026-03-11 10:42:18 TP0] Disable piecewise CUDA graph because --disable-piecewise-cuda-graph is set
max_total_num_tokens=20948
Warmup ...
Prefill. latency: 0.35949 s, throughput:    166.90 token/s
Decode 0. Batch size: 1, latency: 0.75724 s, throughput:      1.32 token/s
Decode 1. Batch size: 1, latency: 0.15707 s, throughput:      6.37 token/s
Decode 2. Batch size: 1, latency: 0.14944 s, throughput:      6.69 token/s
Decode 3. Batch size: 1, latency: 0.14776 s, throughput:      6.77 token/s
Decode 4. Batch size: 1, latency: 0.15017 s, throughput:      6.66 token/s
Decode.  median latency: 0.16092 s, median throughput:      6.21 token/s
Total. latency:  6.301 s, throughput:     14.60 token/s
Benchmark ...
Prefill. latency: 0.27506 s, throughput:    218.14 token/s
Decode 0. Batch size: 1, latency: 0.15287 s, throughput:      6.54 token/s
Decode 1. Batch size: 1, latency: 0.16120 s, throughput:      6.20 token/s
Decode 2. Batch size: 1, latency: 0.15242 s, throughput:      6.56 token/s
Decode 3. Batch size: 1, latency: 0.15380 s, throughput:      6.50 token/s
Decode 4. Batch size: 1, latency: 0.16570 s, throughput:      6.04 token/s
Decode.  median latency: 0.18469 s, median throughput:      5.41 token/s
Total. latency: 19.160 s, throughput:      8.35 token/s
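Net result: although the standalone MLX SDPA kernel is fastest in the microbenchmark above, the end-to-end MLX run is slower than torch_native on this machine (benchmark decode median 5.41 vs 6.97 token/s; total 8.35 vs 11.08 token/s), which is consistent with the closing comment below.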

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

The github-actions bot added the documentation, dependencies, and diffusion (SGLang Diffusion) labels on Mar 10, 2026.
yeahdongcn (Collaborator, Author) commented Mar 10, 2026

Will rebase onto main once #19549 gets merged.

Done.

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
yeahdongcn (Collaborator, Author) commented:

The perf is even worse. Closing.

yeahdongcn closed this on Mar 11, 2026.
