[Diffusion] Support peak memory record in offline generate and serving #15610

Merged
BBuf merged 7 commits into main from diffusion_support_peak_memory on Dec 22, 2025
Conversation

@BBuf (Collaborator) commented on Dec 22, 2025

Image Generation

CUDA_VISIBLE_DEVICES=7 sglang generate \
    --pin-cpu-memory \
    --prompt='A curious raccoon' \
    --save-output \
    --log-level=debug \
    --width=720 --height=720 \
    --output-path=outputs \
    --model-path=/home/lmsys/bbuf/FLUX.1-dev
[12-22 05:28:45] [TimestepPreparationStage] finished in 0.0202 seconds
[12-22 05:28:45] [LatentPreparationStage] started...
[12-22 05:28:45] [LatentPreparationStage] finished in 0.0003 seconds
[12-22 05:28:45] [DenoisingStage] started...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:05<00:00,  9.60it/s]
[12-22 05:28:50] [DenoisingStage] average time per step: 0.1041 seconds
[12-22 05:29:00] Offloaded denoiser transformer weights to CPU after denoising to reduce peak VRAM during VAE decoding.
[12-22 05:29:00] [DenoisingStage] finished in 15.1199 seconds
[12-22 05:29:00] [DecodingStage] started...
[12-22 05:29:00] [DecodingStage] finished in 0.6206 seconds
[12-22 05:29:01] Output saved to outputs/A_curious_raccoon_20251222-052844_eed73624.jpg
[12-22 05:29:01] Pixel data generated successfully in 16.21 seconds
[12-22 05:29:01] Completed batch processing. Generated 1 outputs in 16.22 seconds.
[12-22 05:29:01] Memory usage - Max peak: 23239.76 MB, Avg peak: 23239.76 MB
[12-22 05:29:01] Generator was garbage collected without being shut down. Attempting to shut down the local server and client.
CUDA_VISIBLE_DEVICES=6,7 sglang generate \
    --pin-cpu-memory \
    --prompt='A curious raccoon' \
    --save-output \
    --log-level=debug \
    --width=720 --height=720 \
    --output-path=outputs \
    --model-path=/home/lmsys/bbuf/FLUX.1-dev \
    --tp-size 2
[12-22 05:27:33] [TimestepPreparationStage] finished in 0.0580 seconds
[12-22 05:27:33] [LatentPreparationStage] started...
[12-22 05:27:33] [LatentPreparationStage] finished in 0.0003 seconds
[12-22 05:27:33] [DenoisingStage] started...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:04<00:00, 10.09it/s]
[12-22 05:27:38] [DenoisingStage] average time per step: 0.0991 seconds
[12-22 05:27:50] Offloaded denoiser transformer weights to CPU after denoising to reduce peak VRAM during VAE decoding.
[12-22 05:27:50] [DenoisingStage] finished in 16.9346 seconds
[12-22 05:27:50] [DecodingStage] started...
[12-22 05:27:50] [DecodingStage] finished in 0.5279 seconds
[12-22 05:27:50] Output saved to outputs/A_curious_raccoon_20251222-052731_eed73624.jpg
[12-22 05:27:50] Pixel data generated successfully in 18.91 seconds
[12-22 05:27:50] Completed batch processing. Generated 1 outputs in 18.91 seconds.
[12-22 05:27:50] Memory usage - Max peak: 23105.06 MB, Avg peak: 23105.06 MB
[12-22 05:27:50] Generator was garbage collected without being shut down. Attempting to shut down the local server and client.
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
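The `Memory usage - Max peak / Avg peak` line in the logs above suggests the generator collects one peak-memory sample per request and then reports the maximum and the average across the batch. The pattern can be sketched as follows; this is a hypothetical illustration, not the PR's actual code, and `read_peak_mb` stands in for whatever device query the runtime uses (e.g. something like `torch.cuda.max_memory_allocated() / 2**20`):

```python
# Hypothetical sketch of per-request peak-memory recording, NOT the PR's
# actual implementation. `read_peak_mb` is a stand-in for a device query.

class PeakMemoryRecorder:
    """Collects one peak-memory sample (in MB) per generated output."""

    def __init__(self):
        self.samples = []

    def record(self, read_peak_mb):
        # Called once per request; `read_peak_mb` reports the device's
        # peak allocation since the last reset of the peak-memory stats.
        self.samples.append(read_peak_mb())

    def summary(self):
        # Mirrors the format of the "Memory usage" log line above.
        max_peak = max(self.samples)
        avg_peak = sum(self.samples) / len(self.samples)
        return (f"Memory usage - Max peak: {max_peak:.2f} MB, "
                f"Avg peak: {avg_peak:.2f} MB")


recorder = PeakMemoryRecorder()
recorder.record(lambda: 23239.76)  # single-request run, as in the log above
print(recorder.summary())
# → Memory usage - Max peak: 23239.76 MB, Avg peak: 23239.76 MB
```

With a single request, max and average coincide, which matches the logs above where both fields show the same value.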
CUDA_VISIBLE_DEVICES=7 sglang serve \
    --model-path /home/lmsys/bbuf/FLUX.1-dev \
    --pin-cpu-memory \
    --log-level debug \
    --host 0.0.0.0 \
    --port 30000

python -m sglang.multimodal_gen.benchmarks.bench_serving \
    --backend sglang-image \
    --host localhost \
    --port 30000 \
    --dataset vbench \
    --task t2v \
    --num-prompts 20 \
    --width 720 \
    --height 720 \
    --max-concurrency 2
================= Serving Benchmark Result =================
Backend:                                 sglang-image   
Model:                                   /home/lmsys/bbuf/FLUX.1-dev
Dataset:                                 vbench         
Task:                                    t2v            
--------------------------------------------------
Benchmark duration (s):                  98.25          
Request rate:                            inf            
Max request concurrency:                 2              
Successful requests:                     20/20             
--------------------------------------------------
Request throughput (req/s):              0.20           
Latency Mean (s):                        9.5812         
Latency Median (s):                      9.6837         
Latency P99 (s):                         10.8238        
--------------------------------------------------
Peak Memory Max (MB):                    25470.11       
Peak Memory Mean (MB):                   25469.28       
Peak Memory Median (MB):                 25469.23       

============================================================
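In serving mode the benchmark table additionally reports Max, Mean, and Median of the per-request peaks. A minimal sketch of that reduction, using Python's `statistics` module; the three sample values below are invented for illustration (chosen so the summary lines up with the table's shape), not the actual 20 measurements behind the run above:

```python
# Sketch of reducing per-request peak-memory samples to the Max / Mean /
# Median rows of the benchmark table. Sample values are hypothetical.
import statistics

peaks_mb = [25470.11, 25469.23, 25468.50]  # illustrative per-request peaks

print(f"Peak Memory Max (MB):    {max(peaks_mb):.2f}")
print(f"Peak Memory Mean (MB):   {statistics.mean(peaks_mb):.2f}")
print(f"Peak Memory Median (MB): {statistics.median(peaks_mb):.2f}")
```

Because every request loads the same model and generates at the same resolution, the three statistics stay close together, as they do in the table above.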

Video Generation

sglang generate \
    --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
    --text-encoder-cpu-offload \
    --pin-cpu-memory \
    --num-gpus 8 \
    --ulysses-degree 8 \
    --attention-backend sage_attn \
    --enable-torch-compile \
    --prompt "A cat walks on the grass, realistic" \
    --num-frames 81 --height 720 --width 1280 \
    --num-inference-steps 27 \
    --guidance-scale 3.5 --guidance-scale-2 4.0
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 27/27 [02:14<00:00,  4.98s/it]
[12-22 06:11:13] [DenoisingStage] average time per step: 4.9841 seconds
[12-22 06:11:28] Offloaded denoiser transformer weights to CPU after denoising to reduce peak VRAM during VAE decoding.
[12-22 06:11:28] [DenoisingStage] finished in 149.9252 seconds
[12-22 06:11:28] [DecodingStage] started...
[12-22 06:11:38] [DecodingStage] finished in 9.9124 seconds
[12-22 06:11:44] Output saved to outputs/A_cat_walks_on_the_grass_realistic_20251222-060856_f4e677f2.mp4
[12-22 06:11:44] Pixel data generated successfully in 167.79 seconds
[12-22 06:11:44] Completed batch processing. Generated 1 outputs in 167.79 seconds.
[12-22 06:11:44] Memory usage - Max peak: 62651.28 MB, Avg peak: 62651.28 MB
[12-22 06:11:44] Generator was garbage collected without being shut down. Attempting to shut down the local server and client.
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 8 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
sglang generate \
    --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
    --text-encoder-cpu-offload \
    --pin-cpu-memory \
    --num-gpus 8 \
    --ulysses-degree 8 \
    --attention-backend sage_attn \
    --enable-torch-compile \
    --prompt "A cat walks on the grass, realistic" \
    --num-frames 81 --height 720 --width 1280 \
    --num-inference-steps 27 \
    --guidance-scale 3.5 --guidance-scale-2 4.0 \
    --dit-layerwise-offload true
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 27/27 [01:36<00:00,  3.56s/it]
[12-22 06:23:27] [DenoisingStage] average time per step: 3.5597 seconds
[12-22 06:23:29] Offloaded denoiser transformer weights to CPU after denoising to reduce peak VRAM during VAE decoding.
[12-22 06:23:29] [DenoisingStage] finished in 97.8069 seconds
[12-22 06:23:29] [DecodingStage] started...
[12-22 06:23:39] [DecodingStage] finished in 10.1658 seconds
[12-22 06:23:46] Output saved to outputs/A_cat_walks_on_the_grass_realistic_20251222-062148_f4e677f2.mp4
[12-22 06:23:46] Pixel data generated successfully in 117.56 seconds
[12-22 06:23:46] Completed batch processing. Generated 1 outputs in 117.56 seconds.
[12-22 06:23:46] Memory usage - Max peak: 21394.29 MB, Avg peak: 21394.29 MB
[12-22 06:23:46] Generator was garbage collected without being shut down. Attempting to shut down the local server and client.
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 8 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
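The two runs above differ only in `--dit-layerwise-offload true`, and the new peak-memory line makes the effect directly measurable from the logs: 62651.28 MB down to 21394.29 MB. A small sketch of pulling that figure out of the log text to quantify the saving (the parsing helper `max_peak_mb` is hypothetical, written against the log format shown above):

```python
# Sketch: extract the "Max peak" figure from the generator logs above to
# quantify the effect of --dit-layerwise-offload. The log lines are copied
# from the two runs; max_peak_mb is an illustrative helper.
import re

log_lines = [
    "[12-22 06:11:44] Memory usage - Max peak: 62651.28 MB, Avg peak: 62651.28 MB",
    "[12-22 06:23:46] Memory usage - Max peak: 21394.29 MB, Avg peak: 21394.29 MB",
]

def max_peak_mb(line):
    # Matches the peak-memory log line format shown in the runs above.
    m = re.search(r"Max peak: ([\d.]+) MB", line)
    return float(m.group(1)) if m else None

baseline, offload = (max_peak_mb(line) for line in log_lines)
saving = 100 * (1 - offload / baseline)
print(f"Layer-wise offload cut peak VRAM by {saving:.1f}%")
```

On these numbers the layer-wise offload run uses roughly a third of the baseline peak VRAM while also finishing the denoising loop faster.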
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 sglang serve \
    --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
    --text-encoder-cpu-offload \
    --pin-cpu-memory \
    --num-gpus 8 \
    --ulysses-degree 8 \
    --attention-backend sage_attn \
    --enable-torch-compile \
    --host 0.0.0.0 \
    --port 30000

python -m sglang.multimodal_gen.benchmarks.bench_serving \
    --backend sglang-video \
    --host localhost \
    --port 30000 \
    --dataset vbench \
    --task t2v \
    --num-prompts 2 \
    --max-concurrency 1
================= Serving Benchmark Result =================
Backend:                                 sglang-video   
Model:                                   Wan-AI/Wan2.2-T2V-A14B-Diffusers
Dataset:                                 vbench         
Task:                                    t2v            
--------------------------------------------------
Benchmark duration (s):                  395.74         
Request rate:                            inf            
Max request concurrency:                 1              
Successful requests:                     2/2              
--------------------------------------------------
Request throughput (req/s):              0.01           
Latency Mean (s):                        197.8686       
Latency Median (s):                      197.8686       
Latency P99 (s):                         198.3657       
--------------------------------------------------
Peak Memory Max (MB):                    62650.28       
Peak Memory Mean (MB):                   55689.21       
Peak Memory Median (MB):                 55689.21       

============================================================

github-actions bot added the diffusion (SGLang Diffusion) label Dec 22, 2025
BBuf added the run-ci label Dec 22, 2025
BBuf merged commit d77f3fc into main Dec 22, 2025 (100 of 102 checks passed)
BBuf deleted the diffusion_support_peak_memory branch Dec 22, 2025
Liwansi added a commit to iforgetmyname/sglang that referenced this pull request Dec 23, 2025
…n_eagle3_dp

* 'main' of https://github.com/sgl-project/sglang: (208 commits)
  MoE: Skip SiLU/GELU activation for masked experts (sgl-project#15539)
  [GLM-ASR] GLM-ASR Support  (sgl-project#15570)
  Improve engine customization interface (sgl-project#15635)
  chore: bump sgl-kernel version to 0.3.20 (sgl-project#15590)
  bugfix[schedule]: Refactor sort method and add related UT (sgl-project#13576)
  Adjust wrong `mtp` meaning introduce by mimo (sgl-project#15632)
  Tiny add back missing router per attempt response metric (sgl-project#15621)
  Fix router gRPC mode launch error caused by async loading (sgl-project#15368)
  [model-gateway] return 503 when all workers are circuit-broken (sgl-project#15611)
  [Diffusion] Support peak memory record in offline generate and serving (sgl-project#15610)
  [VLM] Tiny: Unify VLM environment variables (sgl-project#15572)
  [diffusion] chore: remove default post-denoising dit offload in local mode (sgl-project#15573)
  Tiny enable soft watchdog in CI for stuck without logs (sgl-project#15616)
  Tiny add stuck simulation (sgl-project#15613)
  Support soft watchdog for tokenizer/detokenizer/dp-controller processes (sgl-project#15607)
  Tiny avoid EnvField misuse (sgl-project#15612)
  add decode round robin policy (sgl-project#15164)
  Add glm-4.6-fp8 with/without mtp in nightly ci (sgl-project#15566)
  Adapt fixture-kit to gsm8k mixin (sgl-project#15599)
  [model-gateway] add retry support to OpenAI router chat endpoint (sgl-project#15589)
  ...