[diffusion][llm] macOS support #19549
Conversation
Current progress: weight loading is working. Since my MacBook Pro has only 16GB of RAM, I'm looking for a more suitable test environment. Log details:

```
sglang generate --model-path /Users/yexiaodong/.cache/modelscope/hub/models/Tongyi-MAI/Z-Image
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/attention/fla/utils.py:212: UserWarning: Triton is not supported on current platform, roll back to CPU.
warnings.warn(
W0228 14:56:53.862000 21179 site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/awq.py:80: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/gguf.py:47: UserWarning: Only CUDA support GGUF quantization currently.
warnings.warn(f"Only CUDA support GGUF quantization currently.")
/opt/homebrew/Caskroom/miniconda/base/envs/sglang/lib/python3.11/site-packages/torch/amp/autocast_mode.py:270: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn(
[2026-02-28 14:56:55] INFO hf_diffusers_utils.py:518: Diffusers version: 0.37.0.dev0
[02-28 14:56:55] Enabling all offloading for GPU with low device memory
[02-28 14:56:55] server_args: {"model_path": "/Users/yexiaodong/.cache/modelscope/hub/models/Tongyi-MAI/Z-Image", "backend": "auto", "attention_backend": null, "attention_backend_config": {}, "cache_dit_config": null, "nccl_port": null, "trust_remote_code": false, "revision": null, "num_gpus": 1, "tp_size": 1, "sp_degree": 1, "ulysses_degree": 1, "ring_degree": 1, "dp_size": 1, "dp_degree": 1, "enable_cfg_parallel": false, "hsdp_replicate_dim": 1, "hsdp_shard_dim": 1, "dist_timeout": 3600, "pipeline_class_name": null, "lora_path": null, "lora_nickname": "default", "lora_scale": 1.0, "component_paths": {}, "transformer_weights_path": null, "lora_target_modules": null, "dit_cpu_offload": true, "dit_layerwise_offload": false, "dit_offload_prefetch_size": 0.0, "text_encoder_cpu_offload": true, "image_encoder_cpu_offload": true, "vae_cpu_offload": true, "use_fsdp_inference": false, "pin_cpu_memory": true, "comfyui_mode": false, "enable_torch_compile": false, "warmup": false, "warmup_resolutions": null, "disable_autocast": false, "master_port": 30082, "host": "127.0.0.1", "port": 30000, "webui": false, "webui_port": 12312, "scheduler_port": 5653, "output_path": "outputs/", "prompt_file_path": null, "model_paths": {}, "model_loaded": {"transformer": true, "vae": true, "video_vae": true, "audio_vae": true, "video_dit": true, "audio_dit": true, "dual_tower_bridge": true}, "boundary_ratio": null, "log_level": "info"}
[02-28 14:56:55] Local mode: True
[02-28 14:56:55] Starting server...
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/attention/fla/utils.py:212: UserWarning: Triton is not supported on current platform, roll back to CPU.
warnings.warn(
W0228 14:56:59.421000 21198 site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/awq.py:80: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/gguf.py:47: UserWarning: Only CUDA support GGUF quantization currently.
warnings.warn(f"Only CUDA support GGUF quantization currently.")
/opt/homebrew/Caskroom/miniconda/base/envs/sglang/lib/python3.11/site-packages/torch/amp/autocast_mode.py:270: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn(
[02-28 14:57:01] Scheduler bind at endpoint: tcp://127.0.0.1:5653
[02-28 14:57:01] Initializing distributed environment with world_size=1, device=mps, timeout=3600
[02-28 14:57:01] Setting distributed timeout to 3600 seconds
[02-28 14:57:01] No pipeline_class_name specified, using model_index.json
[02-28 14:57:01] Diffusers version: 0.37.0.dev0
[02-28 14:57:01] Using pipeline from model_index.json: ZImagePipeline
[02-28 14:57:01] Loading pipeline modules...
[02-28 14:57:01] Model already exists locally and is complete
[02-28 14:57:01] Model path: /Users/yexiaodong/.cache/modelscope/hub/models/Tongyi-MAI/Z-Image
[02-28 14:57:01] Diffusers version: 0.37.0.dev0
[02-28 14:57:01] Loading pipeline modules from config: {'_class_name': 'ZImagePipeline', '_diffusers_version': '0.37.0.dev0', 'scheduler': ['diffusers', 'FlowMatchEulerDiscreteScheduler'], 'text_encoder': ['transformers', 'Qwen3Model'], 'tokenizer': ['transformers', 'Qwen2Tokenizer'], 'transformer': ['diffusers', 'ZImageTransformer2DModel'], 'vae': ['diffusers', 'AutoencoderKL']}
[02-28 14:57:01] Loading required components: ['text_encoder', 'tokenizer', 'vae', 'transformer', 'scheduler']
Loading required modules: 0%| | 0/5 [00:00<?, ?it/s][02-28 14:57:01] Loading text_encoder from /Users/yexiaodong/.cache/modelscope/hub/models/Tongyi-MAI/Z-Image/text_encoder. avail mem: 5.47 GB
[02-28 14:57:01] Using Torch SDPA backend for MPS.
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:12<00:24, 12.41s/it]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:12<00:05,  5.27s/it]
^CTraceback (most recent call last):
```
looks like we're good to go. any other concerns?

/rerun-failed-ci
The XPU CI failure is a known one and will be fixed along with: #13881

/rerun-failed-ci
@mickqian Nvidia CI passed and PR is approved, ready for merge — SGLDHelper bot

/rerun-failed-ci
brilliant
/tag-and-rerun-ci
@yhyang201 Please let me know if you'd like me to rebase this onto the latest upstream


Motivation
Make SGLang run natively on macOS: #19185
Modifications
Since Triton is not available on macOS, Triton kernels cannot be used at all, not even as a fallback. There are 200+ places in the codebase that import Triton, so a Triton stub (`_triton_stub.py`) is introduced to keep those import paths working without errors.
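As a rough illustration of the stub pattern (a minimal sketch, not the PR's actual `_triton_stub.py`): `import triton` and `@triton.jit` decorations succeed, and only a real kernel launch raises.

```python
# Hypothetical Triton import stub; names and structure are illustrative.
import sys
import types


def _fail(*_args, **_kwargs):
    raise RuntimeError("Triton kernels are unavailable on this platform")


class _StubKernel:
    """Stands in for a @triton.jit-decorated function."""

    def __init__(self, fn):
        self.fn = fn

    def __getitem__(self, _grid):  # supports the kernel[grid](...) launch syntax
        return _fail

    def __call__(self, *args, **kwargs):
        _fail()


def jit(fn=None, **_opts):
    # Accepts both the bare @triton.jit and parameterized @triton.jit(...) forms.
    return _StubKernel(fn) if fn is not None else _StubKernel


_triton = types.ModuleType("triton")
_tl = types.ModuleType("triton.language")
_triton.jit = jit
_triton.language = _tl
# Unknown attributes (tl.constexpr, tl.float32, ...) resolve to inert
# placeholders so module-level kernel definitions import cleanly (PEP 562).
_tl.__getattr__ = lambda _name: object()

sys.modules.setdefault("triton", _triton)
sys.modules.setdefault("triton.language", _tl)
```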
For the actual kernel implementations, the NPU backend already handles the case where Triton is unavailable by providing pure-PyTorch native fallbacks. I followed the same pattern for MPS, with platform-specific fallback modules (`npu_fallback.py` for NPU, `mps_fallback.py` for MPS) that replace the Triton kernels at import time.
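For a concrete sense of the shape of such a fallback, here is a minimal pure-PyTorch RMSNorm sketch (the function name and the fp32-accumulation detail are assumptions, not the PR's exact code):

```python
# Illustrative pure-PyTorch replacement for a fused Triton RMSNorm kernel;
# the real npu_fallback.py / mps_fallback.py cover many more ops.
import torch


def rms_norm_fallback(x: torch.Tensor, weight: torch.Tensor,
                      eps: float = 1e-6) -> torch.Tensor:
    # Accumulate in fp32 for numerical stability, as fused kernels
    # typically do internally, then cast back to the input dtype.
    x_fp32 = x.float()
    variance = x_fp32.pow(2).mean(dim=-1, keepdim=True)
    return weight * (x_fp32 * torch.rsqrt(variance + eps)).to(x.dtype)
```

Replacement "at import time" then amounts to rebinding the symbol the call sites import to a function like this, rather than touching every caller.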
On the MPS side, norm ops can optionally be accelerated with MLX (`mx.fast.rms_norm` / `mx.fast.layer_norm`), which are single fused Metal kernels that avoid the multi-step decomposition PyTorch MPS performs. This path is gated behind `SGLANG_USE_MLX=1` and yields ~13% faster denoising steps in end-to-end benchmarks.
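A minimal sketch of the gating (the torch-to-MLX hand-off via NumPy here is purely illustrative and would erase the speedup; the real integration presumably shares buffers more directly):

```python
# Sketch only: dispatch RMSNorm to MLX's fused Metal kernel when enabled.
import os

import numpy as np
import torch

USE_MLX = os.environ.get("SGLANG_USE_MLX") == "1"
if USE_MLX:
    import mlx.core as mx


def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    if USE_MLX:
        # mx.fast.rms_norm runs as a single fused Metal kernel, versus the
        # pow/mean/rsqrt/mul decomposition PyTorch MPS would execute.
        y = mx.fast.rms_norm(mx.array(x.float().cpu().numpy()),
                             mx.array(weight.float().cpu().numpy()), eps)
        return torch.from_numpy(np.array(y)).to(device=x.device, dtype=x.dtype)
    variance = x.float().pow(2).mean(dim=-1, keepdim=True)
    return weight * (x.float() * torch.rsqrt(variance + eps)).to(x.dtype)
```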
Additionally, `torch.mps` exposes a more limited API surface than `torch.cuda`, `torch.npu`, or `torch.musa`. Some APIs (e.g., `Stream`, `set_device`, `get_device_properties`, and memory tracking) need to be stubbed via `_mps_stub.py` to ensure the code runs correctly on macOS.
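The shims can be quite thin; a hypothetical sketch (only `torch.mps.recommended_max_memory()` is a real API here, the surrounding shapes are assumptions):

```python
# Hypothetical _mps_stub.py-style shims: provide torch.cuda-flavored
# entry points that torch.mps lacks so device-generic code doesn't crash.
import contextlib
import types

import torch


def set_device(_index: int) -> None:
    # macOS exposes a single MPS device, so device selection is a no-op.
    return None


@contextlib.contextmanager
def stream(_stream=None):
    # torch.mps has no public Stream abstraction; behave as if all work
    # runs on the default stream.
    yield


def get_device_properties(_index: int = 0):
    # Return just enough for callers that only read a name and total memory.
    return types.SimpleNamespace(
        name="Apple MPS",
        total_memory=torch.mps.recommended_max_memory(),
    )
```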
MLX Related Issues
#17846 #19137
References
Accuracy Tests
Verified `black-forest-labs/FLUX.1-dev` on macOS (with and without `SGLANG_USE_MLX=1`); image generation works correctly. Appreciate @weiqiangt's help!
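For reference, this presumably mirrors the invocation shown in the log above (exact flags may differ):

```
SGLANG_USE_MLX=1 sglang generate --model-path black-forest-labs/FLUX.1-dev
```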
Benchmarking and Profiling
Checklist
Review Process
`/tag-run-ci-label`, `/rerun-failed-ci`, `/tag-and-rerun-ci`