
Add MlxAttnBackend for macOS #20221

Closed

yeahdongcn wants to merge 1 commit into sgl-project:main from yeahdongcn:xd/MlxAttnBackend


Conversation

yeahdongcn (Collaborator) commented Mar 10, 2026

Motivation

This PR adds an MlxAttnBackend for macOS using mx.fast.scaled_dot_product_attention.
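For reference, MLX's fused kernel takes `[batch, n_heads, seq_len, head_dim]` arrays. Below is a minimal sketch of wrapping it behind a torch-facing call; this is an illustration under stated assumptions (the numpy round-trip, float32 conversion, and the `"causal"` mask string), not the PR's actual backend code:

```python
import mlx.core as mx
import numpy as np
import torch


def mlx_sdpa(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Fused SDPA via MLX; q/k/v are [batch, n_heads, seq_len, head_dim]."""
    scale = q.shape[-1] ** -0.5
    # MLX arrays are built from host memory, so detach and copy to CPU first.
    to_mx = lambda t: mx.array(t.detach().to("cpu", torch.float32).numpy())
    out = mx.fast.scaled_dot_product_attention(
        to_mx(q), to_mx(k), to_mx(v), scale=scale, mask="causal"
    )
    mx.eval(out)  # MLX evaluates lazily; force the kernel to run
    return torch.from_numpy(np.array(out)).to(device=q.device, dtype=q.dtype)
```

In a paged-KV serving stack like SGLang, a real backend additionally has to gather K/V from the cache pool and handle per-request sequence lengths, which is where conversion overhead can dominate at batch size 1.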

Modifications
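(The PR's modification list is not preserved in this capture. Judging from the server_args.py log lines in the benchmarks below, backend selection is gated by a SGLANG_USE_MLX_ATTENTION environment variable; a hypothetical sketch of that gating, with illustrative names only:)

```python
import os


def default_attention_backend() -> str:
    # Hypothetical: SGLANG_USE_MLX_ATTENTION=1 opts a macOS run into the
    # MLX backend; otherwise the torch_native default is kept.
    if os.environ.get("SGLANG_USE_MLX_ATTENTION", "0") == "1":
        return "mlx"
    return "torch_native"
```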

Accuracy Tests

Benchmarking and Profiling

The following tests were run on an M1 MacBook Pro.

SDPA

| Device | Backend      | SDPA Latency (100 runs) |
|--------|--------------|-------------------------|
| MPS    | torch_native | 0.046 s                 |
| CPU    | mlx          | 0.011 s                 |
| CPU    | torch_native | 0.141 s                 |
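A harness along these lines could reproduce a "100 runs" measurement (a sketch only; the shapes, dtype, and absence of a mask are assumptions, and warmup is omitted for brevity):

```python
import time

import mlx.core as mx

# Illustrative shapes, not the ones used for the table above.
B, H, L, D = 1, 16, 512, 128
q = mx.random.normal((B, H, L, D))
k = mx.random.normal((B, H, L, D))
v = mx.random.normal((B, H, L, D))
mx.eval(q, k, v)  # materialize inputs before timing

start = time.perf_counter()
for _ in range(100):
    out = mx.fast.scaled_dot_product_attention(q, k, v, scale=D**-0.5)
    mx.eval(out)  # force execution each iteration; MLX evaluates lazily
print(f"100 runs: {time.perf_counter() - start:.3f}s")
```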

E2E

With SGLANG_USE_MLX_ATTENTION=0 (default):

> uv run python -m sglang.bench_one_batch --model-path /Users/yexiaodong/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B --trust-remote-code --disable-radix-cache --disable-cuda-graph --tp-size 1 --batch-size 1 --input-len 60 --output-len 100 --port 43440
W0311 10:39:37.161000 29488 sglang-diffusion/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/attention/fla/utils.py:212: UserWarning: Triton is not supported on current platform, roll back to CPU.
  warnings.warn(
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/awq.py:87: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/gguf.py:48: UserWarning: Only CUDA and MUSA support GGUF quantization currently.
  warnings.warn(f"Only CUDA and MUSA support GGUF quantization currently.")
[2026-03-11 10:39:39] INFO server_args.py:2154: Attention backend not specified. Use torch_native backend by default.
[2026-03-11 10:39:39] WARNING server_args.py:2160: Cuda graph is disabled because of using torch native attention backend
[2026-03-11 10:39:39] WARNING common.py:1221: Fail to set RLIMIT_STACK: current limit exceeds maximum limit
[2026-03-11 10:39:39 TP0] Init torch distributed begin.
[2026-03-11 10:39:39 TP0] Init torch distributed ends. elapsed=0.05 s, mem usage=0.00 GB
[2026-03-11 10:39:39 TP0] Ignore import error when loading sglang.srt.models.bailing_moe_linear: No module named 'vllm'
[2026-03-11 10:39:39 TP0] Ignore import error when loading sglang.srt.models.bailing_moe_nextn: No module named 'vllm'
[2026-03-11 10:39:39 TP0] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-03-11 10:39:39 TP0] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-03-11 10:39:39 TP0] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/sglang-diffusion/lib/python3.11/site-packages/transformers/__init__.py)
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/attention/fla/utils.py:212: UserWarning: Triton is not supported on current platform, roll back to CPU.
  warnings.warn(
[2026-03-11 10:39:39 TP0] Load weight begin. avail mem=3.55 GB
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
[2026-03-11 10:39:40 TP0] Parameter lm_head.weight not found in params_dict
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.81s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.81s/it]

[2026-03-11 10:39:41 TP0] Load weight end. elapsed=2.07 s, type=Qwen3ForCausalLM, avail mem=2.22 GB, mem usage=1.32 GB.
[2026-03-11 10:39:41 TP0] Using KV cache dtype: torch.bfloat16
[2026-03-11 10:39:42 TP0] KV Cache is allocated. #tokens: 20033, K size: 1.07 GB, V size: 1.07 GB
[2026-03-11 10:39:42 TP0] Memory pool end. avail mem=1.78 GB
[2026-03-11 10:39:42 TP0] Disable piecewise CUDA graph because --disable-piecewise-cuda-graph is set
max_total_num_tokens=20033
Warmup ...
Prefill. latency: 0.46183 s, throughput:    129.92 token/s
Decode 0. Batch size: 1, latency: 1.48535 s, throughput:      0.67 token/s
Decode 1. Batch size: 1, latency: 0.10879 s, throughput:      9.19 token/s
Decode 2. Batch size: 1, latency: 0.11965 s, throughput:      8.36 token/s
Decode 3. Batch size: 1, latency: 0.19780 s, throughput:      5.06 token/s
Decode 4. Batch size: 1, latency: 0.12490 s, throughput:      8.01 token/s
Decode.  median latency: 0.13355 s, median throughput:      7.49 token/s
Total. latency:  6.230 s, throughput:     14.77 token/s
Benchmark ...
Prefill. latency: 0.23880 s, throughput:    251.26 token/s
Decode 0. Batch size: 1, latency: 0.11311 s, throughput:      8.84 token/s
Decode 1. Batch size: 1, latency: 0.13725 s, throughput:      7.29 token/s
Decode 2. Batch size: 1, latency: 0.11654 s, throughput:      8.58 token/s
Decode 3. Batch size: 1, latency: 0.11223 s, throughput:      8.91 token/s
Decode 4. Batch size: 1, latency: 0.11668 s, throughput:      8.57 token/s
Decode.  median latency: 0.14338 s, median throughput:      6.97 token/s
Total. latency: 14.447 s, throughput:     11.08 token/s

With SGLANG_USE_MLX_ATTENTION=1:

> export SGLANG_USE_MLX_ATTENTION=1
> uv run python -m sglang.bench_one_batch --model-path /Users/yexiaodong/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B --trust-remote-code --disable-radix-cache --disable-cuda-graph --tp-size 1 --batch-size 1 --input-len 60 --output-len 100 --port 43440
W0311 10:42:13.546000 30323 sglang-diffusion/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/attention/fla/utils.py:212: UserWarning: Triton is not supported on current platform, roll back to CPU.
  warnings.warn(
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/awq.py:87: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/gguf.py:48: UserWarning: Only CUDA and MUSA support GGUF quantization currently.
  warnings.warn(f"Only CUDA and MUSA support GGUF quantization currently.")
[2026-03-11 10:42:15] INFO server_args.py:2154: Attention backend not specified. Use mlx backend by default.
[2026-03-11 10:42:15] WARNING server_args.py:2166: Cuda graph is disabled because of using MLX attention backend
[2026-03-11 10:42:15] WARNING common.py:1221: Fail to set RLIMIT_STACK: current limit exceeds maximum limit
[2026-03-11 10:42:15 TP0] Init torch distributed begin.
[2026-03-11 10:42:15 TP0] Init torch distributed ends. elapsed=0.07 s, mem usage=0.01 GB
[2026-03-11 10:42:15 TP0] Ignore import error when loading sglang.srt.models.bailing_moe_linear: No module named 'vllm'
[2026-03-11 10:42:15 TP0] Ignore import error when loading sglang.srt.models.bailing_moe_nextn: No module named 'vllm'
[2026-03-11 10:42:15 TP0] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-03-11 10:42:15 TP0] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-03-11 10:42:15 TP0] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/sglang-diffusion/lib/python3.11/site-packages/transformers/__init__.py)
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/attention/fla/utils.py:212: UserWarning: Triton is not supported on current platform, roll back to CPU.
  warnings.warn(
[2026-03-11 10:42:15 TP0] Load weight begin. avail mem=5.26 GB
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
[2026-03-11 10:42:15 TP0] Parameter lm_head.weight not found in params_dict
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.87s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.88s/it]

[2026-03-11 10:42:17 TP0] Load weight end. elapsed=2.05 s, type=Qwen3ForCausalLM, avail mem=2.66 GB, mem usage=2.60 GB.
[2026-03-11 10:42:17 TP0] Using KV cache dtype: torch.bfloat16
[2026-03-11 10:42:18 TP0] KV Cache is allocated. #tokens: 20948, K size: 1.12 GB, V size: 1.12 GB
[2026-03-11 10:42:18 TP0] Memory pool end. avail mem=1.83 GB
[2026-03-11 10:42:18 TP0] Disable piecewise CUDA graph because --disable-piecewise-cuda-graph is set
max_total_num_tokens=20948
Warmup ...
Prefill. latency: 0.35949 s, throughput:    166.90 token/s
Decode 0. Batch size: 1, latency: 0.75724 s, throughput:      1.32 token/s
Decode 1. Batch size: 1, latency: 0.15707 s, throughput:      6.37 token/s
Decode 2. Batch size: 1, latency: 0.14944 s, throughput:      6.69 token/s
Decode 3. Batch size: 1, latency: 0.14776 s, throughput:      6.77 token/s
Decode 4. Batch size: 1, latency: 0.15017 s, throughput:      6.66 token/s
Decode.  median latency: 0.16092 s, median throughput:      6.21 token/s
Total. latency:  6.301 s, throughput:     14.60 token/s
Benchmark ...
Prefill. latency: 0.27506 s, throughput:    218.14 token/s
Decode 0. Batch size: 1, latency: 0.15287 s, throughput:      6.54 token/s
Decode 1. Batch size: 1, latency: 0.16120 s, throughput:      6.20 token/s
Decode 2. Batch size: 1, latency: 0.15242 s, throughput:      6.56 token/s
Decode 3. Batch size: 1, latency: 0.15380 s, throughput:      6.50 token/s
Decode 4. Batch size: 1, latency: 0.16570 s, throughput:      6.04 token/s
Decode.  median latency: 0.18469 s, median throughput:      5.41 token/s
Total. latency: 19.160 s, throughput:      8.35 token/s
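Net result: although the standalone MLX SDPA kernel is fastest in the microbenchmark above, the end-to-end MLX run is slower than torch_native on this machine (benchmark decode median 5.41 vs 6.97 token/s; total 8.35 vs 11.08 token/s), which is consistent with the closing comment below.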

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

The github-actions bot added the documentation, dependencies, and diffusion (SGLang Diffusion) labels on Mar 10, 2026.
yeahdongcn (Collaborator, Author) commented Mar 10, 2026

Will rebase onto main once #19549 gets merged.

Done.

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
yeahdongcn (Collaborator, Author) commented:

The perf is even worse. Closing.

yeahdongcn closed this on Mar 11, 2026.
