
[diffusion][llm] macOS support #19549

Merged
hnyls2002 merged 19 commits into sgl-project:main from yeahdongcn:xd/mps
Mar 10, 2026

Conversation

@yeahdongcn
Collaborator

yeahdongcn commented Feb 28, 2026

Motivation

Make SGLang run natively on macOS: #19185

Modifications

Since Triton is not available on macOS, Triton kernels cannot be used at all — not even as a fallback. There are 200+ places in the codebase that import Triton, so a Triton stub (_triton_stub.py) is introduced to keep those import paths working without errors.
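
Conceptually, the stub only needs to register importable `triton` / `triton.language` modules whose attributes are inert. A minimal sketch of the idea (illustrative only; the real `_triton_stub.py` presumably covers a wider API surface):

```python
import sys
import types


def _install_triton_stub() -> None:
    """Register no-op `triton` / `triton.language` modules so imports succeed."""
    triton = types.ModuleType("triton")
    tl = types.ModuleType("triton.language")

    class _FailingKernel:
        """Importable kernel object that fails loudly only if launched."""

        def __init__(self, fn):
            self._name = fn.__name__

        def __getitem__(self, _grid):  # supports the kernel[grid](...) launch syntax
            return self

        def __call__(self, *_args, **_kwargs):
            raise RuntimeError(
                f"Triton kernel {self._name!r} is unavailable on this platform"
            )

    def jit(fn=None, **_options):
        # Handles both the bare @triton.jit and parameterized @triton.jit(...) forms.
        return _FailingKernel if fn is None else _FailingKernel(fn)

    tl.constexpr = object()  # enough for `x: tl.constexpr` annotations to evaluate
    triton.jit = jit
    triton.language = tl
    sys.modules.setdefault("triton", triton)
    sys.modules.setdefault("triton.language", tl)


if sys.platform == "darwin":
    _install_triton_stub()
```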

For the actual kernel implementations, the NPU backend already handles the case where Triton is unavailable by providing pure-PyTorch fallbacks. I followed the same pattern for MPS, with platform-specific fallback modules (npu_fallback.py for NPU, mps_fallback.py for MPS) that replace the Triton kernels at import time.
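
The fallback pattern itself is plain PyTorch. As a hedged sketch (names here are illustrative, not the PR's exact API), an RMSNorm Triton kernel might be swapped out like this:

```python
import torch


def rms_norm_native(x: torch.Tensor, weight: torch.Tensor,
                    eps: float = 1e-6) -> torch.Tensor:
    """Pure-PyTorch RMSNorm that runs on any backend, MPS included."""
    orig_dtype = x.dtype
    x = x.float()
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    x = x * torch.rsqrt(variance + eps)
    return x.to(orig_dtype) * weight


if torch.backends.mps.is_available():
    # Rebind the symbol that call sites import, so downstream code needs
    # no platform checks: on MPS it transparently gets the fallback.
    rms_norm_triton = rms_norm_native
```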

On the MPS side, norm ops can optionally be accelerated using MLX (mx.fast.rms_norm / mx.fast.layer_norm), which are single fused Metal kernels that avoid the multi-step decomposition PyTorch MPS performs. This is gated behind SGLANG_USE_MLX=1 and yields ~13% faster denoising steps in end-to-end benchmarks.
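
For reference, `mx.fast.rms_norm(x, weight, eps)` is the real MLX API; everything else in this sketch (the torch-to-MLX bridge in particular) is an illustrative assumption, and a production path would avoid the CPU round-trip since MLX and MPS share unified memory:

```python
import os

import mlx.core as mx
import numpy as np
import torch

USE_MLX = os.environ.get("SGLANG_USE_MLX") == "1"


def _to_mx(t: torch.Tensor) -> mx.array:
    # Naive bridge for illustration: detach, move to CPU, hand to MLX.
    return mx.array(t.detach().to("cpu", torch.float32).numpy())


def rms_norm_mlx(x: torch.Tensor, weight: torch.Tensor,
                 eps: float = 1e-6) -> torch.Tensor:
    # mx.fast.rms_norm runs as one fused Metal kernel, unlike the
    # multi-op decomposition PyTorch MPS performs for the same math.
    out = mx.fast.rms_norm(_to_mx(x), _to_mx(weight), eps)
    return torch.from_numpy(np.array(out)).to(x.device, x.dtype)
```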

Additionally, torch.mps exposes a more limited API surface compared to torch.cuda, torch.npu, or torch.musa. Some APIs (e.g., Stream, set_device, get_device_properties, and memory tracking) need to be stubbed via _mps_stub.py to ensure the code runs correctly on macOS.
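
The stubbing itself can be a thin compatibility shim. A hedged sketch of the approach (the attribute names follow the list above; the real `_mps_stub.py` may patch a different or larger set):

```python
import torch


class _StreamStub:
    """No-op stand-in for torch.cuda.Stream-style usage on MPS.

    MPS executes work on one implicit command queue, so a do-nothing
    stream object keeps stream-scoped code paths running unchanged.
    """

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False

    def synchronize(self):
        torch.mps.synchronize()


def _set_device(_device) -> None:
    # There is exactly one MPS device; nothing to select.
    pass


for _name, _value in (("Stream", _StreamStub), ("set_device", _set_device)):
    if not hasattr(torch.mps, _name):
        setattr(torch.mps, _name, _value)
```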

MLX Related Issues

#17846 #19137

References

Accuracy Tests

Verified black-forest-labs/FLUX.1-dev on macOS (with and without SGLANG_USE_MLX=1); image generation works correctly. Appreciate @weiqiangt's help!

gen
> sglang generate --model-path black-forest-labs/FLUX.1-dev \
    --prompt "A logo With Bold Large text: SGL Diffusion" \
    --save-output
/Users/weiqiangt/sglang/python/sglang/srt/layers/attention/fla/utils.py:212: UserWarning: Triton is not supported on current platform, roll back to CPU.
  warnings.warn(
W0303 21:40:35.983000 66330 torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
/Users/weiqiangt/sglang/python/sglang/srt/layers/quantization/awq.py:87: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")
/Users/weiqiangt/sglang/python/sglang/srt/layers/quantization/gguf.py:47: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
/Users/weiqiangt/sglang/.venv/lib/python3.11/site-packages/torch/amp/autocast_mode.py:270: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
[03-03 21:40:43] Disabling some offloading (except dit, text_encoder) for image generation model
[03-03 21:40:43] server_args: {"model_path": "black-forest-labs/FLUX.1-dev", "model_id": null, "backend": "auto", "attention_backend": null, "attention_backend_config": {}, "cache_dit_config": null, "nccl_port": null, "trust_remote_code": false, "revision": null, "num_gpus": 1, "tp_size": 1, "sp_degree": 1, "ulysses_degree": 1, "ring_degree": 1, "dp_size": 1, "dp_degree": 1, "enable_cfg_parallel": false, "hsdp_replicate_dim": 1, "hsdp_shard_dim": 1, "dist_timeout": 3600, "pipeline_class_name": null, "lora_path": null, "lora_nickname": "default", "lora_scale": 1.0, "component_paths": {}, "transformer_weights_path": null, "lora_target_modules": null, "dit_cpu_offload": true, "dit_layerwise_offload": false, "dit_offload_prefetch_size": 0.0, "text_encoder_cpu_offload": true, "image_encoder_cpu_offload": false, "vae_cpu_offload": false, "use_fsdp_inference": false, "pin_cpu_memory": true, "comfyui_mode": false, "enable_torch_compile": false, "warmup": false, "warmup_resolutions": null, "disable_autocast": true, "master_port": 30008, "host": "127.0.0.1", "port": 30000, "webui": false, "webui_port": 12312, "scheduler_port": 5631, "output_path": "outputs/", "input_save_path": "inputs/uploads", "prompt_file_path": null, "model_paths": {}, "model_loaded": {"transformer": true, "vae": true, "video_vae": true, "audio_vae": true, "video_dit": true, "audio_dit": true, "dual_tower_bridge": true}, "boundary_ratio": null, "log_level": "info"}
[03-03 21:40:43] Local mode: True
[03-03 21:40:43] Starting server...
/Users/weiqiangt/sglang/python/sglang/srt/layers/attention/fla/utils.py:212: UserWarning: Triton is not supported on current platform, roll back to CPU.
  warnings.warn(
W0303 21:40:46.232000 66352 torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
/Users/weiqiangt/sglang/python/sglang/srt/layers/quantization/awq.py:87: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")
/Users/weiqiangt/sglang/python/sglang/srt/layers/quantization/gguf.py:47: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
/Users/weiqiangt/sglang/.venv/lib/python3.11/site-packages/torch/amp/autocast_mode.py:270: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
[03-03 21:40:47] Scheduler bind at endpoint: tcp://127.0.0.1:5631
[03-03 21:40:47] Initializing distributed environment with world_size=1, device=mps, timeout=3600
[03-03 21:40:47] Setting distributed timeout to 3600 seconds
[03-03 21:40:47] No pipeline_class_name specified, using model_index.json
'(ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)"), '(Request ID: d768f3d5-f7d6-4573-8120-f257d2934bc4)')' thrown while requesting HEAD https://huggingface.co/black-forest-labs/FLUX.1-dev/resolve/main/model_index.json
[03-03 21:40:57] '(ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)"), '(Request ID: d768f3d5-f7d6-4573-8120-f257d2934bc4)')' thrown while requesting HEAD https://huggingface.co/black-forest-labs/FLUX.1-dev/resolve/main/model_index.json
Retrying in 1s [Retry 1/5].
[03-03 21:40:57] Retrying in 1s [Retry 1/5].
[03-03 21:41:03] Using pipeline from model_index.json: FluxPipeline
[03-03 21:41:03] Loading pipeline modules...
[03-03 21:41:03] Checking for cached model in HF Hub cache for black-forest-labs/FLUX.1-dev...
[03-03 21:41:03] Found complete model in cache at /Users/weiqiangt/.cache/huggingface/hub/models--black-forest-labs--FLUX.1-dev/snapshots/3de623fc3c33e44ffbe2bad470d0f45bccf2eb21
[03-03 21:41:03] Model path: /Users/weiqiangt/.cache/huggingface/hub/models--black-forest-labs--FLUX.1-dev/snapshots/3de623fc3c33e44ffbe2bad470d0f45bccf2eb21
[03-03 21:41:03] Diffusers version: 0.30.0.dev0
[03-03 21:41:03] Loading pipeline modules from config: {'_class_name': 'FluxPipeline', '_diffusers_version': '0.30.0.dev0', 'scheduler': ['diffusers', 'FlowMatchEulerDiscreteScheduler'], 'text_encoder': ['transformers', 'CLIPTextModel'], 'text_encoder_2': ['transformers', 'T5EncoderModel'], 'tokenizer': ['transformers', 'CLIPTokenizer'], 'tokenizer_2': ['transformers', 'T5TokenizerFast'], 'transformer': ['diffusers', 'FluxTransformer2DModel'], 'vae': ['diffusers', 'AutoencoderKL']}
[03-03 21:41:03] Loading required components: ['text_encoder', 'text_encoder_2', 'tokenizer', 'tokenizer_2', 'vae', 'transformer', 'scheduler']
Loading required modules:   0%|                                                                                                                                                                                         | 0/7 [00:00<?, ?it/s][03-03 21:41:03] Loading text_encoder from /Users/weiqiangt/.cache/huggingface/hub/models--black-forest-labs--FLUX.1-dev/snapshots/3de623fc3c33e44ffbe2bad470d0f45bccf2eb21/text_encoder. avail mem: 94.66 GB
[03-03 21:41:03] Using Torch SDPA backend for MPS.

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.21it/s]

[03-03 21:41:04] Disabling FSDP sharding for MPS platform as it's not compatible
[03-03 21:41:04] Loaded text_encoder: CLIPTextModel (sgl-diffusion version). model size: 0.23 GB, consumed GPU mem: 1.44 GB, avail GPU mem: 93.22 GB
Loading required modules:  14%|█████████████████████████▎                                                                                                                                                       | 1/7 [00:00<00:01,  4.98it/s][03-03 21:41:04] Loading text_encoder_2 from /Users/weiqiangt/.cache/huggingface/hub/models--black-forest-labs--FLUX.1-dev/snapshots/3de623fc3c33e44ffbe2bad470d0f45bccf2eb21/text_encoder_2. avail mem: 93.22 GB

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:02<00:02,  2.12s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:04<00:00,  2.04s/it]

[03-03 21:41:08] Disabling FSDP sharding for MPS platform as it's not compatible
[03-03 21:41:08] Loaded text_encoder_2: T5EncoderModel (sgl-diffusion version). model size: 8.87 GB, consumed GPU mem: 12.71 GB, avail GPU mem: 80.51 GB
Loading required modules:  29%|██████████████████████████████████████████████████▌                                                                                                                              | 2/7 [00:04<00:12,  2.49s/it][03-03 21:41:08] Loading tokenizer from /Users/weiqiangt/.cache/huggingface/hub/models--black-forest-labs--FLUX.1-dev/snapshots/3de623fc3c33e44ffbe2bad470d0f45bccf2eb21/tokenizer. avail mem: 80.51 GB
[03-03 21:41:08] Loaded tokenizer: CLIPTokenizerFast (sgl-diffusion version). model size: NA GB, consumed GPU mem: 0.02 GB, avail GPU mem: 80.49 GB
[03-03 21:41:08] Loading tokenizer_2 from /Users/weiqiangt/.cache/huggingface/hub/models--black-forest-labs--FLUX.1-dev/snapshots/3de623fc3c33e44ffbe2bad470d0f45bccf2eb21/tokenizer_2. avail mem: 80.49 GB
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
[03-03 21:41:08] Loaded tokenizer_2: T5TokenizerFast (sgl-diffusion version). model size: NA GB, consumed GPU mem: -3.95 GB, avail GPU mem: 84.44 GB
Loading required modules:  57%|█████████████████████████████████████████████████████████████████████████████████████████████████████▏                                                                           | 4/7 [00:04<00:03,  1.09s/it][03-03 21:41:08] Loading vae from /Users/weiqiangt/.cache/huggingface/hub/models--black-forest-labs--FLUX.1-dev/snapshots/3de623fc3c33e44ffbe2bad470d0f45bccf2eb21/vae. avail mem: 84.44 GB
[03-03 21:41:08] Loaded vae: AutoencoderKL (sgl-diffusion version). model size: 0.31 GB, consumed GPU mem: 0.73 GB, avail GPU mem: 83.72 GB
Loading required modules:  71%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                                  | 5/7 [00:04<00:01,  1.26it/s][03-03 21:41:08] Loading transformer from /Users/weiqiangt/.cache/huggingface/hub/models--black-forest-labs--FLUX.1-dev/snapshots/3de623fc3c33e44ffbe2bad470d0f45bccf2eb21/transformer. avail mem: 83.72 GB
[03-03 21:41:08] Loading FluxTransformer2DModel from 3 safetensors file(s) , param_dtype: torch.bfloat16
[03-03 21:41:08] Using Torch SDPA backend for MPS.
[03-03 21:41:09] Disabling FSDP for MPS platform as it's not compatible

Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:00<00:00, 33.80it/s]

[03-03 21:41:20] Loaded model with 11.90B parameters
[03-03 21:41:20] Loaded transformer: FluxTransformer2DModel (sgl-diffusion version). model size: 22.17 GB, consumed GPU mem: 7.89 GB, avail GPU mem: 75.83 GB
Loading required modules:  86%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                         | 6/7 [00:16<00:04,  4.22s/it][03-03 21:41:20] Loading scheduler from /Users/weiqiangt/.cache/huggingface/hub/models--black-forest-labs--FLUX.1-dev/snapshots/3de623fc3c33e44ffbe2bad470d0f45bccf2eb21/scheduler. avail mem: 75.83 GB
[03-03 21:41:20] Loaded scheduler: FlowMatchEulerDiscreteScheduler (sgl-diffusion version). model size: NA GB, consumed GPU mem: -0.00 GB, avail GPU mem: 75.83 GB
Loading required modules: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:16<00:00,  2.42s/it]
[03-03 21:41:20] Creating pipeline stages...
[03-03 21:41:20] Using Torch SDPA backend for MPS.
[03-03 21:41:20] Pipeline instantiated
[03-03 21:41:20] Worker 0: Initialized device, model, and distributed environment.
[03-03 21:41:20] Worker 0: Scheduler loop started.
[03-03 21:41:21] Processing prompt 1/1: A logo With Bold Large text: SGL Diffusion
[03-03 21:41:21] Running pipeline stages: ['input_validation_stage', 'prompt_encoding_stage_primary', 'timestep_preparation_stage', 'latent_preparation_stage', 'denoising_stage', 'decoding_stage']
[03-03 21:41:21] [InputValidationStage] started...
[03-03 21:41:21] [InputValidationStage] finished in 0.0000 seconds
[03-03 21:41:21] [TextEncodingStage] started...
[03-03 21:41:21] [TextEncodingStage] finished in 0.1337 seconds
[03-03 21:41:21] [TimestepPreparationStage] started...
[03-03 21:41:21] [TimestepPreparationStage] finished in 0.0027 seconds
[03-03 21:41:21] [LatentPreparationStage] started...
[03-03 21:41:21] [LatentPreparationStage] finished in 0.0020 seconds
[03-03 21:41:21] [DenoisingStage] started...
  0%|                                                                                                                                                                                                                  | 0/50 [00:00<?, ?it/s]/Users/weiqiangt/sglang/python/sglang/multimodal_gen/runtime/models/dits/flux.py:393: UserWarning: FlashInfer not available, using Triton fallback for RoPE
  query, key = apply_flashinfer_rope_qk_inplace(
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [06:25<00:00,  7.71s/it]
[03-03 21:47:47] [DenoisingStage] average time per step: 7.7094 seconds
[03-03 21:47:47] Memory before deallocating transformer: 33963322624
[03-03 21:47:47] Memory after deallocating transformer: 33963322624
[03-03 21:47:47] [DenoisingStage] finished in 385.4929 seconds
[03-03 21:47:47] [DecodingStage] started...
/Users/weiqiangt/sglang/.venv/lib/python3.11/site-packages/torch/amp/autocast_mode.py:350: UserWarning: In MPS autocast, but the target dtype is not supported. Disabling autocast.
MPS Autocast only supports dtype of torch.bfloat16 and torch.float16 currently.
  warnings.warn(error_message)
[03-03 21:47:49] [DecodingStage] finished in 2.0959 seconds
[03-03 21:47:49] Peak GPU memory: 43.13 GB, Peak allocated: 31.65 GB, Memory pool overhead: 11.49 GB (26.6%), Remaining GPU memory at peak: 84.87 GB. Components that could stay resident (based on the last request workload): ['text_encoder', 'text_encoder_2', 'transformer']. Related offload server args to disable: --dit-cpu-offload, --text-encoder-cpu-offload
[03-03 21:47:51] Output saved to outputs/A_logo_With_Bold_Large_text_SGL_Diffusion_20260303-214121_84e56531.png
[03-03 21:47:51] Pixel data generated successfully in 390.42 seconds
[03-03 21:47:51] Completed batch processing. Generated 1 outputs in 390.42 seconds
[03-03 21:47:51] Memory usage - Max peak: 44167.20 MB, Avg peak: 44167.20 MB
[03-03 21:47:51] Generator was garbage collected without being shut down. Attempting to shut down the local server and client.
[03-03 21:47:51] Worker 0: Shutdown complete.

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

github-actions bot added the documentation and dependencies labels Feb 28, 2026
@yeahdongcn
Collaborator Author

yeahdongcn commented Feb 28, 2026

Current progress: weight loading is working. Since my MacBook Pro has only 16GB of RAM, I’m looking for a more suitable test environment.

<===Click to expand log details===>
> sglang generate --model-path /Users/yexiaodong/.cache/modelscope/hub/models/Tongyi-MAI/Z-Image
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/attention/fla/utils.py:212: UserWarning: Triton is not supported on current platform, roll back to CPU.
  warnings.warn(
W0228 14:56:53.862000 21179 site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/awq.py:80: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/gguf.py:47: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
/opt/homebrew/Caskroom/miniconda/base/envs/sglang/lib/python3.11/site-packages/torch/amp/autocast_mode.py:270: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
[2026-02-28 14:56:55] INFO hf_diffusers_utils.py:518: Diffusers version: 0.37.0.dev0
[02-28 14:56:55] Enabling all offloading for GPU with low device memory
[02-28 14:56:55] server_args: {"model_path": "/Users/yexiaodong/.cache/modelscope/hub/models/Tongyi-MAI/Z-Image", "backend": "auto", "attention_backend": null, "attention_backend_config": {}, "cache_dit_config": null, "nccl_port": null, "trust_remote_code": false, "revision": null, "num_gpus": 1, "tp_size": 1, "sp_degree": 1, "ulysses_degree": 1, "ring_degree": 1, "dp_size": 1, "dp_degree": 1, "enable_cfg_parallel": false, "hsdp_replicate_dim": 1, "hsdp_shard_dim": 1, "dist_timeout": 3600, "pipeline_class_name": null, "lora_path": null, "lora_nickname": "default", "lora_scale": 1.0, "component_paths": {}, "transformer_weights_path": null, "lora_target_modules": null, "dit_cpu_offload": true, "dit_layerwise_offload": false, "dit_offload_prefetch_size": 0.0, "text_encoder_cpu_offload": true, "image_encoder_cpu_offload": true, "vae_cpu_offload": true, "use_fsdp_inference": false, "pin_cpu_memory": true, "comfyui_mode": false, "enable_torch_compile": false, "warmup": false, "warmup_resolutions": null, "disable_autocast": false, "master_port": 30082, "host": "127.0.0.1", "port": 30000, "webui": false, "webui_port": 12312, "scheduler_port": 5653, "output_path": "outputs/", "prompt_file_path": null, "model_paths": {}, "model_loaded": {"transformer": true, "vae": true, "video_vae": true, "audio_vae": true, "video_dit": true, "audio_dit": true, "dual_tower_bridge": true}, "boundary_ratio": null, "log_level": "info"}
[02-28 14:56:55] Local mode: True
[02-28 14:56:55] Starting server...
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/attention/fla/utils.py:212: UserWarning: Triton is not supported on current platform, roll back to CPU.
  warnings.warn(
W0228 14:56:59.421000 21198 site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/awq.py:80: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/gguf.py:47: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
/opt/homebrew/Caskroom/miniconda/base/envs/sglang/lib/python3.11/site-packages/torch/amp/autocast_mode.py:270: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
[02-28 14:57:01] Scheduler bind at endpoint: tcp://127.0.0.1:5653
[02-28 14:57:01] Initializing distributed environment with world_size=1, device=mps, timeout=3600
[02-28 14:57:01] Setting distributed timeout to 3600 seconds
[02-28 14:57:01] No pipeline_class_name specified, using model_index.json
[02-28 14:57:01] Diffusers version: 0.37.0.dev0
[02-28 14:57:01] Using pipeline from model_index.json: ZImagePipeline
[02-28 14:57:01] Loading pipeline modules...
[02-28 14:57:01] Model already exists locally and is complete
[02-28 14:57:01] Model path: /Users/yexiaodong/.cache/modelscope/hub/models/Tongyi-MAI/Z-Image
[02-28 14:57:01] Diffusers version: 0.37.0.dev0
[02-28 14:57:01] Loading pipeline modules from config: {'_class_name': 'ZImagePipeline', '_diffusers_version': '0.37.0.dev0', 'scheduler': ['diffusers', 'FlowMatchEulerDiscreteScheduler'], 'text_encoder': ['transformers', 'Qwen3Model'], 'tokenizer': ['transformers', 'Qwen2Tokenizer'], 'transformer': ['diffusers', 'ZImageTransformer2DModel'], 'vae': ['diffusers', 'AutoencoderKL']}
[02-28 14:57:01] Loading required components: ['text_encoder', 'tokenizer', 'vae', 'transformer', 'scheduler']
Loading required modules:   0%|                                                                                       | 0/5 [00:00<?, ?it/s][02-28 14:57:01] Loading text_encoder from /Users/yexiaodong/.cache/modelscope/hub/models/Tongyi-MAI/Z-Image/text_encoder. avail mem: 5.47 GB
[02-28 14:57:01] Using Torch SDPA backend for MPS.

Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:12<00:24, 12.41s/it]
^CTraceback (most recent call last):s:  67% Completed | 2/3 [00:12<00:05,  5.27s/it]

yeahdongcn added the diffusion label Feb 28, 2026
github-actions bot added the npu label Mar 2, 2026
@yeahdongcn
Collaborator Author

yeahdongcn commented Mar 2, 2026

So far, I’ve managed to get FLUX.1-dev running on my M1 MBP (16GB) by manually setting num_layers to 1 in all the config.json files (LOL).

I can see GPU activity in Activity Monitor:
m1_gpu

<===Click to expand log details===>
> sglang generate --model-path /Users/yexiaodong/.cache/modelscope/hub/models/black-forest-labs/FLUX.1-dev --prompt "A logo With Bold Large text: SGL Diffusion" --save-output --num-inference-steps=1 
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/attention/fla/utils.py:212: UserWarning: Triton is not supported on current platform, roll back to CPU.
  warnings.warn(
W0302 15:02:25.423000 44664 site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/awq.py:80: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/gguf.py:47: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
/opt/homebrew/Caskroom/miniconda/base/envs/sglang/lib/python3.11/site-packages/torch/amp/autocast_mode.py:270: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
[2026-03-02 15:02:29] INFO hf_diffusers_utils.py:518: Diffusers version: 0.30.0.dev0
[03-02 15:02:29] Enabling all offloading for GPU with low device memory
[03-02 15:02:29] server_args: {"model_path": "/Users/yexiaodong/.cache/modelscope/hub/models/black-forest-labs/FLUX.1-dev", "backend": "auto", "attention_backend": null, "attention_backend_config": {}, "cache_dit_config": null, "nccl_port": null, "trust_remote_code": false, "revision": null, "num_gpus": 1, "tp_size": 1, "sp_degree": 1, "ulysses_degree": 1, "ring_degree": 1, "dp_size": 1, "dp_degree": 1, "enable_cfg_parallel": false, "hsdp_replicate_dim": 1, "hsdp_shard_dim": 1, "dist_timeout": 3600, "pipeline_class_name": null, "lora_path": null, "lora_nickname": "default", "lora_scale": 1.0, "component_paths": {}, "transformer_weights_path": null, "lora_target_modules": null, "dit_cpu_offload": true, "dit_layerwise_offload": false, "dit_offload_prefetch_size": 0.0, "text_encoder_cpu_offload": true, "image_encoder_cpu_offload": true, "vae_cpu_offload": true, "use_fsdp_inference": false, "pin_cpu_memory": true, "comfyui_mode": false, "enable_torch_compile": false, "warmup": false, "warmup_resolutions": null, "disable_autocast": true, "master_port": 30035, "host": "127.0.0.1", "port": 30000, "webui": false, "webui_port": 12312, "scheduler_port": 5586, "output_path": "outputs/", "prompt_file_path": null, "model_paths": {}, "model_loaded": {"transformer": true, "vae": true, "video_vae": true, "audio_vae": true, "video_dit": true, "audio_dit": true, "dual_tower_bridge": true}, "boundary_ratio": null, "log_level": "info"}
[03-02 15:02:29] Local mode: True
[03-02 15:02:29] Starting server...
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/attention/fla/utils.py:212: UserWarning: Triton is not supported on current platform, roll back to CPU.
  warnings.warn(
W0302 15:02:33.966000 44719 site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/awq.py:80: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/gguf.py:47: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
/opt/homebrew/Caskroom/miniconda/base/envs/sglang/lib/python3.11/site-packages/torch/amp/autocast_mode.py:270: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
[03-02 15:02:36] Scheduler bind at endpoint: tcp://127.0.0.1:5586
[03-02 15:02:36] Initializing distributed environment with world_size=1, device=mps, timeout=3600
[03-02 15:02:36] Setting distributed timeout to 3600 seconds
[03-02 15:02:36] No pipeline_class_name specified, using model_index.json
[03-02 15:02:36] Diffusers version: 0.30.0.dev0
[03-02 15:02:36] Using pipeline from model_index.json: FluxPipeline
[03-02 15:02:36] Loading pipeline modules...
[03-02 15:02:36] Model already exists locally and is complete
[03-02 15:02:36] Model path: /Users/yexiaodong/.cache/modelscope/hub/models/black-forest-labs/FLUX.1-dev
[03-02 15:02:36] Diffusers version: 0.30.0.dev0
[03-02 15:02:36] Loading pipeline modules from config: {'_class_name': 'FluxPipeline', '_diffusers_version': '0.30.0.dev0', 'scheduler': ['diffusers', 'FlowMatchEulerDiscreteScheduler'], 'text_encoder': ['transformers', 'CLIPTextModel'], 'text_encoder_2': ['transformers', 'T5EncoderModel'], 'tokenizer': ['transformers', 'CLIPTokenizer'], 'tokenizer_2': ['transformers', 'T5TokenizerFast'], 'transformer': ['diffusers', 'FluxTransformer2DModel'], 'vae': ['diffusers', 'AutoencoderKL']}
[03-02 15:02:36] Loading required components: ['text_encoder', 'text_encoder_2', 'tokenizer', 'tokenizer_2', 'vae', 'transformer', 'scheduler']
Loading required modules:   0%|                                                                                     | 0/7 [00:00<?, ?it/s][03-02 15:02:36] Loading text_encoder from /Users/yexiaodong/.cache/modelscope/hub/models/black-forest-labs/FLUX.1-dev/text_encoder. avail mem: 4.32 GB
[03-02 15:02:36] Using Torch SDPA backend for MPS.

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.14it/s]

[03-02 15:02:37] Disabling FSDP sharding for MPS platform as it's not compatible
[03-02 15:02:37] Loaded text_encoder: CLIPTextModel (sgl-diffusion version). model size: 0.08 GB, consumed GPU mem: 0.84 GB, avail GPU mem: 3.48 GB
Loading required modules:  14%|███████████                                                                  | 1/7 [00:00<00:02,  2.04it/s][03-02 15:02:37] Loading text_encoder_2 from /Users/yexiaodong/.cache/modelscope/hub/models/black-forest-labs/FLUX.1-dev/text_encoder_2. avail mem: 3.48 GB

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  3.14it/s]

[03-02 15:02:37] Disabling FSDP sharding for MPS platform as it's not compatible
[03-02 15:02:37] Loaded text_encoder_2: T5EncoderModel (sgl-diffusion version). model size: 0.6 GB, consumed GPU mem: 0.55 GB, avail GPU mem: 2.93 GB
Loading required modules:  29%|██████████████████████                                                       | 2/7 [00:01<00:02,  1.70it/s][03-02 15:02:37] Loading tokenizer from /Users/yexiaodong/.cache/modelscope/hub/models/black-forest-labs/FLUX.1-dev/tokenizer. avail mem: 2.93 GB
[03-02 15:02:37] Loaded tokenizer: CLIPTokenizerFast (sgl-diffusion version). model size: NA GB, consumed GPU mem: -0.52 GB, avail GPU mem: 3.45 GB
[03-02 15:02:37] Loading tokenizer_2 from /Users/yexiaodong/.cache/modelscope/hub/models/black-forest-labs/FLUX.1-dev/tokenizer_2. avail mem: 3.45 GB
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
[03-02 15:02:37] Loaded tokenizer_2: T5TokenizerFast (sgl-diffusion version). model size: NA GB, consumed GPU mem: 0.03 GB, avail GPU mem: 3.42 GB
Loading required modules:  57%|████████████████████████████████████████████                                 | 4/7 [00:01<00:00,  3.30it/s][03-02 15:02:37] Loading vae from /Users/yexiaodong/.cache/modelscope/hub/models/black-forest-labs/FLUX.1-dev/vae. avail mem: 3.42 GB
[03-02 15:02:38] Loaded vae: AutoencoderKL (sgl-diffusion version). model size: 0.31 GB, consumed GPU mem: 0.19 GB, avail GPU mem: 3.23 GB
Loading required modules:  71%|███████████████████████████████████████████████████████                      | 5/7 [00:01<00:00,  3.00it/s][03-02 15:02:38] Loading transformer from /Users/yexiaodong/.cache/modelscope/hub/models/black-forest-labs/FLUX.1-dev/transformer. avail mem: 3.23 GB
[03-02 15:02:38] Loading FluxTransformer2DModel from 3 safetensors file(s) , param_dtype: torch.bfloat16
[03-02 15:02:38] Using Torch SDPA backend for MPS.
[03-02 15:02:38] Disabling FSDP for MPS platform as it's not compatible

Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:00<00:00, 18.39it/s]

[03-02 15:02:40] Checkpoint keys not loaded (no matching model parameter) ['single_transformer_blocks.10.attn.norm_k.weight', 'single_transformer_blocks.10.attn.norm_q.weight', 'single_transformer_blocks.10.attn.to_k.bias', 'single_transformer_blocks.10.attn.to_k.weight', 'single_transformer_blocks.10.attn.to_q.bias', 'single_transformer_blocks.10.attn.to_q.weight', 'single_transformer_blocks.10.attn.to_v.bias', 'single_transformer_blocks.10.attn.to_v.weight', 'single_transformer_blocks.10.norm.linear.bias', 'single_transformer_blocks.10.norm.linear.weight', 'single_transformer_blocks.10.proj_mlp.bias', 'single_transformer_blocks.10.proj_mlp.weight', 'single_transformer_blocks.10.proj_out.bias', 'single_transformer_blocks.10.proj_out.weight', 'single_transformer_blocks.11.attn.norm_k.weight', 'single_transformer_blocks.11.attn.norm_q.weight', 'single_transformer_blocks.11.attn.to_k.bias', 'single_transformer_blocks.11.attn.to_k.weight', 'single_transformer_blocks.11.attn.to_q.bias', 'single_transformer_blocks.11.attn.to_q.weight']
[03-02 15:02:40] ... and 1060 more skipped keys.
[03-02 15:02:40] Loaded model with 0.69B parameters
[03-02 15:02:40] Loaded transformer: FluxTransformer2DModel (sgl-diffusion version). model size: 1.28 GB, consumed GPU mem: -0.16 GB, avail GPU mem: 3.39 GB
Loading required modules:  86%|██████████████████████████████████████████████████████████████████           | 6/7 [00:04<00:01,  1.01s/it][03-02 15:02:40] Loading scheduler from /Users/yexiaodong/.cache/modelscope/hub/models/black-forest-labs/FLUX.1-dev/scheduler. avail mem: 3.39 GB
[03-02 15:02:40] Loaded scheduler: FlowMatchEulerDiscreteScheduler (sgl-diffusion version). model size: NA GB, consumed GPU mem: 0.02 GB, avail GPU mem: 3.38 GB
Loading required modules: 100%|█████████████████████████████████████████████████████████████████████████████| 7/7 [00:04<00:00,  1.59it/s]
[03-02 15:02:40] Creating pipeline stages...
[03-02 15:02:40] Using Torch SDPA backend for MPS.
[03-02 15:02:40] Pipeline instantiated
[03-02 15:02:40] Worker 0: Initialized device, model, and distributed environment.
[03-02 15:02:40] Worker 0: Scheduler loop started.
[03-02 15:02:40] Diffusers version: 0.30.0.dev0
[03-02 15:02:40] Processing prompt 1/1: A logo With Bold Large text: SGL Diffusion
[03-02 15:02:40] Running pipeline stages: ['input_validation_stage', 'prompt_encoding_stage_primary', 'timestep_preparation_stage', 'latent_preparation_stage', 'denoising_stage', 'decoding_stage']
[03-02 15:02:40] [InputValidationStage] started...
[03-02 15:02:40] [InputValidationStage] finished in 0.0012 seconds
[03-02 15:02:40] [TextEncodingStage] started...
[03-02 15:02:41] [TextEncodingStage] finished in 0.6860 seconds
[03-02 15:02:41] [TimestepPreparationStage] started...
[03-02 15:02:41] [TimestepPreparationStage] finished in 0.0077 seconds
[03-02 15:02:41] [LatentPreparationStage] started...
[03-02 15:02:41] [LatentPreparationStage] finished in 0.0051 seconds
[03-02 15:02:41] [DenoisingStage] started...
  0%|                                                                                                               | 0/1 [00:00<?, ?it/s]/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/multimodal_gen/runtime/models/dits/flux.py:393: UserWarning: FlashInfer not available, using Triton fallback for RoPE
  query, key = apply_flashinfer_rope_qk_inplace(
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.38it/s]
[03-02 15:02:42] [DenoisingStage] average time per step: 0.7256 seconds
[03-02 15:02:42] Memory before deallocating transformer: 2523118848
[03-02 15:02:42] Memory after deallocating transformer: 2523118848
[03-02 15:02:42] [DenoisingStage] finished in 0.7843 seconds
[03-02 15:02:42] [DecodingStage] started...
/opt/homebrew/Caskroom/miniconda/base/envs/sglang/lib/python3.11/site-packages/torch/amp/autocast_mode.py:350: UserWarning: In MPS autocast, but the target dtype is not supported. Disabling autocast.
MPS Autocast only supports dtype of torch.bfloat16 and torch.float16 currently.
  warnings.warn(error_message)
[03-02 15:02:54] [DecodingStage] finished in 12.1077 seconds
[03-02 15:02:54] Peak GPU memory: 13.19 GB, Peak allocated: 2.05 GB, Memory pool overhead: 11.13 GB (84.4%), Remaining GPU memory at peak: 2.81 GB. Components that could stay resident (based on the last request workload): ['text_encoder', 'text_encoder_2', 'vae', 'transformer']. Related offload server args to disable: --dit-cpu-offload, --text-encoder-cpu-offload, --vae-cpu-offload
[03-02 15:02:59] Output saved to outputs/A_logo_With_Bold_Large_text_SGL_Diffusion_20260302-150240_51f80fa8.png
[03-02 15:02:59] Pixel data generated successfully in 18.53 seconds
[03-02 15:02:59] Completed batch processing. Generated 1 outputs in 18.61 seconds
[03-02 15:02:59] Memory usage - Max peak: 13502.12 MB, Avg peak: 13502.12 MB
[03-02 15:02:59] Generator was garbage collected without being shut down. Attempting to shut down the local server and client.
[03-02 15:03:00] Worker 0: Shutdown complete.

@yeahdongcn
Collaborator Author

yeahdongcn commented Mar 3, 2026

LLM also works on macOS now:

llm-mps
<===Click to expand log details===>
> python -m sglang.launch_server --model-path /Users/yexiaodong/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B --host 0.0.0.0 --trust-remote-code --disable-radix-cache --disable-cuda-graph --tp-size 1 --port 43436
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/attention/fla/utils.py:212: UserWarning: Triton is not supported on current platform, roll back to CPU.
  warnings.warn(
W0303 10:32:54.139000 66050 site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/awq.py:80: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/gguf.py:47: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
[2026-03-03 10:32:54] INFO server_args.py:1859: Attention backend not specified. Use torch_native backend by default.
[2026-03-03 10:32:54] WARNING server_args.py:1865: Cuda graph is disabled because of using torch native attention backend
[2026-03-03 10:32:55] Fail to set RLIMIT_STACK: current limit exceeds maximum limit
[2026-03-03 10:32:55] server_args=ServerArgs(model_path='/Users/yexiaodong/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B', tokenizer_path='/Users/yexiaodong/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=True, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='0.0.0.0', port=43436, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.88, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=4096, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', enable_prefill_delayer=False, prefill_delayer_max_delay_passes=30, prefill_delayer_token_usage_low_watermark=None, prefill_delayer_forward_passes_buckets=None, prefill_delayer_wait_seconds_buckets=None, device='mps', tp_size=1, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, stream_output=False, random_seed=392111665, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, model_checksum=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', log_requests_target=None, uvicorn_access_log_exclude_prefixes=[], crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, extra_metric_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, admin_api_key=None, served_model_name='/Users/yexiaodong/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B', weight_version='default', chat_template=None, hf_chat_template_name=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', attn_cp_size=1, moe_dp_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, enable_lora_overlap_loading=None, max_lora_rank=None, 
lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='torch_native', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='pytorch', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', fp4_gemm_runner_backend='flashinfer_cutlass', nsa_prefill_backend=None, nsa_decode_backend=None, disable_flashinfer_autotune=False, mamba_backend='triton', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, enable_aiter_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, enable_elastic_expert_backup=False, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype=None, mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, linear_attn_backend='triton', linear_attn_decode_backend=None, linear_attn_prefill_backend=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', disable_hicache_numa_detect=False, hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, hierarchical_sparse_attention_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=True, cuda_graph_max_bs=160, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160], disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, 
enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=4096, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 832, 896, 960, 1024, 1280, 1536, 1792, 2048, 2304, 2560, 2816, 3072, 3328, 3584, 3840, 4096], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, nsa_prefill_cp_mode='round-robin-split', enable_fused_qk_norm_rope=False, enable_precise_embedding_interpolation=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, enable_mm_global_cache=False, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2026-03-03 10:32:55] Using default HuggingFace chat template with detected content format: string
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/attention/fla/utils.py:212: UserWarning: Triton is not supported on current platform, roll back to CPU.
  warnings.warn(
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/attention/fla/utils.py:212: UserWarning: Triton is not supported on current platform, roll back to CPU.
  warnings.warn(
W0303 10:32:59.893000 66116 site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
W0303 10:32:59.893000 66115 site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/awq.py:80: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/awq.py:80: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/gguf.py:47: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/quantization/gguf.py:47: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
[2026-03-03 10:33:01] WARNING common.py:2456: kill_itself_when_parent_died is only supported in linux.
[2026-03-03 10:33:01] WARNING common.py:2456: kill_itself_when_parent_died is only supported in linux.
[2026-03-03 10:33:01] Mamba selective_state_update backend initialized: triton
[2026-03-03 10:33:01] Init torch distributed begin.
[2026-03-03 10:33:01] Init torch distributed ends. elapsed=0.05 s, mem usage=0.00 GB
[2026-03-03 10:33:02] Ignore import error when loading sglang.srt.models.bailing_moe_linear: No module named 'vllm'
[2026-03-03 10:33:02] Ignore import error when loading sglang.srt.models.bailing_moe_nextn: No module named 'vllm'
[2026-03-03 10:33:02] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-03-03 10:33:02] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-03-03 10:33:02] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/opt/homebrew/Caskroom/miniconda/base/envs/sglang/lib/python3.11/site-packages/transformers/__init__.py)
/Users/yexiaodong/go/src/github.com/yeahdongcn/sglang/python/sglang/srt/layers/attention/fla/utils.py:212: UserWarning: Triton is not supported on current platform, roll back to CPU.
  warnings.warn(
[2026-03-03 10:33:02] Load weight begin. avail mem=4.38 GB
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
[2026-03-03 10:33:02] Parameter lm_head.weight not found in params_dict
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00,  2.44s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00,  2.44s/it]

[2026-03-03 10:33:05] Load weight end. elapsed=2.79 s, type=Qwen3ForCausalLM, dtype=torch.bfloat16, avail mem=2.08 GB, mem usage=2.30 GB.
[2026-03-03 10:33:05] Using KV cache dtype: torch.bfloat16
[2026-03-03 10:33:05] KV Cache is allocated. #tokens: 19909, K size: 1.06 GB, V size: 1.06 GB
[2026-03-03 10:33:05] Memory pool end. avail mem=1.70 GB
[2026-03-03 10:33:06] max_total_num_tokens=19909, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2048, context_len=40960, available_gpu_mem=1.78 GB
[2026-03-03 10:33:07] INFO:     Started server process [66050]
[2026-03-03 10:33:07] INFO:     Waiting for application startup.
[2026-03-03 10:33:07] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
[2026-03-03 10:33:07] INFO:     Application startup complete.
[2026-03-03 10:33:07] INFO:     Uvicorn running on http://0.0.0.0:43436 (Press CTRL+C to quit)
[2026-03-03 10:33:08] INFO:     127.0.0.1:52577 - "GET /model_info HTTP/1.1" 200 OK
[2026-03-03 10:33:10] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 0.00, cuda graph: False
[2026-03-03 10:33:12] INFO:     127.0.0.1:52578 - "POST /generate HTTP/1.1" 200 OK
[2026-03-03 10:33:12] The server is fired up and ready to roll!
[2026-03-03 10:34:02] INFO:     127.0.0.1:52751 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-03-03 10:34:04] Prefill batch, #new-seq: 1, #new-token: 60, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 0.11, cuda graph: False
[2026-03-03 10:34:10] Decode batch, #running-req: 1, #token: 93, token usage: 0.00, cuda graph: False, gen throughput (token/s): 0.58, #queue-req: 0
[2026-03-03 10:34:16] Decode batch, #running-req: 1, #token: 133, token usage: 0.01, cuda graph: False, gen throughput (token/s): 6.26, #queue-req: 0

yeahdongcn marked this pull request as ready for review March 4, 2026 02:14
github-actions bot added the run-ci label Mar 6, 2026
@mickqian
Collaborator

mickqian commented Mar 6, 2026

looks like we're good to go. any other concerns?

@yhyang201
Collaborator

/rerun-failed-ci

1 similar comment

@yeahdongcn
Collaborator Author

yeahdongcn commented Mar 7, 2026

The XPU CI failure is a known one and will be fixed along with: #13881

@yhyang201
Collaborator

/rerun-failed-ci

1 similar comment

@yhyang201
Collaborator

@mickqian Nvidia CI passed and PR is approved, ready for merge

— SGLDHelper bot

@yhyang201
Collaborator

/rerun-failed-ci

5 similar comments

@mickqian
Collaborator

mickqian commented Mar 9, 2026

brilliant

yeahdongcn mentioned this pull request Mar 10, 2026
@yhyang201
Collaborator

/tag-and-rerun-ci

@yeahdongcn
Collaborator Author

yeahdongcn commented Mar 10, 2026

The XPU CI failure is a known one and will be fixed along with: #13881

@yhyang201 Please let me know if you'd like me to rebase this onto the latest upstream main to help get the AMD CI passing (not sure if there have been any CI updates). Thanks!

hnyls2002 merged commit db97f19 into sgl-project:main Mar 10, 2026
704 of 765 checks passed
liubiyongge pushed a commit to liubiyongge/sglang that referenced this pull request Mar 13, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026