[Diffusion] Support peak memory record in offline generate and serving #15610

Merged
BBuf merged 7 commits into main from diffusion_support_peak_memory on Dec 22, 2025
Conversation

@BBuf (Collaborator) commented on Dec 22, 2025

Image Generation

CUDA_VISIBLE_DEVICES=7 sglang generate \
    --pin-cpu-memory \
    --prompt='A curious raccoon' \
    --save-output \
    --log-level=debug \
    --width=720 --height=720 \
    --output-path=outputs \
    --model-path=/home/lmsys/bbuf/FLUX.1-dev
[12-22 05:28:45] [TimestepPreparationStage] finished in 0.0202 seconds
[12-22 05:28:45] [LatentPreparationStage] started...
[12-22 05:28:45] [LatentPreparationStage] finished in 0.0003 seconds
[12-22 05:28:45] [DenoisingStage] started...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:05<00:00,  9.60it/s]
[12-22 05:28:50] [DenoisingStage] average time per step: 0.1041 seconds
[12-22 05:29:00] Offloaded denoiser transformer weights to CPU after denoising to reduce peak VRAM during VAE decoding.
[12-22 05:29:00] [DenoisingStage] finished in 15.1199 seconds
[12-22 05:29:00] [DecodingStage] started...
[12-22 05:29:00] [DecodingStage] finished in 0.6206 seconds
[12-22 05:29:01] Output saved to outputs/A_curious_raccoon_20251222-052844_eed73624.jpg
[12-22 05:29:01] Pixel data generated successfully in 16.21 seconds
[12-22 05:29:01] Completed batch processing. Generated 1 outputs in 16.22 seconds.
[12-22 05:29:01] Memory usage - Max peak: 23239.76 MB, Avg peak: 23239.76 MB
[12-22 05:29:01] Generator was garbage collected without being shut down. Attempting to shut down the local server and client.
CUDA_VISIBLE_DEVICES=6,7 sglang generate \
    --pin-cpu-memory \
    --prompt='A curious raccoon' \
    --save-output \
    --log-level=debug \
    --width=720 --height=720 \
    --output-path=outputs \
    --model-path=/home/lmsys/bbuf/FLUX.1-dev \
    --tp-size 2
[12-22 05:27:33] [TimestepPreparationStage] finished in 0.0580 seconds
[12-22 05:27:33] [LatentPreparationStage] started...
[12-22 05:27:33] [LatentPreparationStage] finished in 0.0003 seconds
[12-22 05:27:33] [DenoisingStage] started...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:04<00:00, 10.09it/s]
[12-22 05:27:38] [DenoisingStage] average time per step: 0.0991 seconds
[12-22 05:27:50] Offloaded denoiser transformer weights to CPU after denoising to reduce peak VRAM during VAE decoding.
[12-22 05:27:50] [DenoisingStage] finished in 16.9346 seconds
[12-22 05:27:50] [DecodingStage] started...
[12-22 05:27:50] [DecodingStage] finished in 0.5279 seconds
[12-22 05:27:50] Output saved to outputs/A_curious_raccoon_20251222-052731_eed73624.jpg
[12-22 05:27:50] Pixel data generated successfully in 18.91 seconds
[12-22 05:27:50] Completed batch processing. Generated 1 outputs in 18.91 seconds.
[12-22 05:27:50] Memory usage - Max peak: 23105.06 MB, Avg peak: 23105.06 MB
[12-22 05:27:50] Generator was garbage collected without being shut down. Attempting to shut down the local server and client.
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
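The `Memory usage - Max peak / Avg peak` line in the logs above suggests the generator collects one peak-memory sample per request and then reports the maximum and the average across the batch. The pattern can be sketched as follows; this is a hypothetical illustration, not the PR's actual code, and `read_peak_mb` stands in for whatever device query the runtime uses (e.g. something like `torch.cuda.max_memory_allocated() / 2**20`):

```python
# Hypothetical sketch of per-request peak-memory recording, NOT the PR's
# actual implementation. `read_peak_mb` is a stand-in for a device query.

class PeakMemoryRecorder:
    """Collects one peak-memory sample (in MB) per generated output."""

    def __init__(self):
        self.samples = []

    def record(self, read_peak_mb):
        # Called once per request; `read_peak_mb` reports the device's
        # peak allocation since the last reset of the peak-memory stats.
        self.samples.append(read_peak_mb())

    def summary(self):
        # Mirrors the format of the "Memory usage" log line above.
        max_peak = max(self.samples)
        avg_peak = sum(self.samples) / len(self.samples)
        return (f"Memory usage - Max peak: {max_peak:.2f} MB, "
                f"Avg peak: {avg_peak:.2f} MB")


recorder = PeakMemoryRecorder()
recorder.record(lambda: 23239.76)  # single-request run, as in the log above
print(recorder.summary())
# → Memory usage - Max peak: 23239.76 MB, Avg peak: 23239.76 MB
```

With a single request, max and average coincide, which matches the logs above where both fields show the same value.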
CUDA_VISIBLE_DEVICES=7 sglang serve \
    --model-path /home/lmsys/bbuf/FLUX.1-dev \
    --pin-cpu-memory \
    --log-level debug \
    --host 0.0.0.0 \
    --port 30000

python -m sglang.multimodal_gen.benchmarks.bench_serving \
    --backend sglang-image \
    --host localhost \
    --port 30000 \
    --dataset vbench \
    --task t2v \
    --num-prompts 20 \
    --width 720 \
    --height 720 \
    --max-concurrency 2
================= Serving Benchmark Result =================
Backend:                                 sglang-image   
Model:                                   /home/lmsys/bbuf/FLUX.1-dev
Dataset:                                 vbench         
Task:                                    t2v            
--------------------------------------------------
Benchmark duration (s):                  98.25          
Request rate:                            inf            
Max request concurrency:                 2              
Successful requests:                     20/20             
--------------------------------------------------
Request throughput (req/s):              0.20           
Latency Mean (s):                        9.5812         
Latency Median (s):                      9.6837         
Latency P99 (s):                         10.8238        
--------------------------------------------------
Peak Memory Max (MB):                    25470.11       
Peak Memory Mean (MB):                   25469.28       
Peak Memory Median (MB):                 25469.23       

============================================================
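In serving mode the benchmark table additionally reports Max, Mean, and Median of the per-request peaks. A minimal sketch of that reduction, using Python's `statistics` module; the three sample values below are invented for illustration (chosen so the summary lines up with the table's shape), not the actual 20 measurements behind the run above:

```python
# Sketch of reducing per-request peak-memory samples to the Max / Mean /
# Median rows of the benchmark table. Sample values are hypothetical.
import statistics

peaks_mb = [25470.11, 25469.23, 25468.50]  # illustrative per-request peaks

print(f"Peak Memory Max (MB):    {max(peaks_mb):.2f}")
print(f"Peak Memory Mean (MB):   {statistics.mean(peaks_mb):.2f}")
print(f"Peak Memory Median (MB): {statistics.median(peaks_mb):.2f}")
```

Because every request loads the same model and generates at the same resolution, the three statistics stay close together, as they do in the table above.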

Video Generation

sglang generate \
    --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
    --text-encoder-cpu-offload \
    --pin-cpu-memory \
    --num-gpus 8 \
    --ulysses-degree 8 \
    --attention-backend sage_attn \
    --enable-torch-compile \
    --prompt "A cat walks on the grass, realistic" \
    --num-frames 81 --height 720 --width 1280 \
    --num-inference-steps 27 \
    --guidance-scale 3.5 --guidance-scale-2 4.0
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 27/27 [02:14<00:00,  4.98s/it]
[12-22 06:11:13] [DenoisingStage] average time per step: 4.9841 seconds
[12-22 06:11:28] Offloaded denoiser transformer weights to CPU after denoising to reduce peak VRAM during VAE decoding.
[12-22 06:11:28] [DenoisingStage] finished in 149.9252 seconds
[12-22 06:11:28] [DecodingStage] started...
[12-22 06:11:38] [DecodingStage] finished in 9.9124 seconds
[12-22 06:11:44] Output saved to outputs/A_cat_walks_on_the_grass_realistic_20251222-060856_f4e677f2.mp4
[12-22 06:11:44] Pixel data generated successfully in 167.79 seconds
[12-22 06:11:44] Completed batch processing. Generated 1 outputs in 167.79 seconds.
[12-22 06:11:44] Memory usage - Max peak: 62651.28 MB, Avg peak: 62651.28 MB
[12-22 06:11:44] Generator was garbage collected without being shut down. Attempting to shut down the local server and client.
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 8 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
sglang generate \
    --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
    --text-encoder-cpu-offload \
    --pin-cpu-memory \
    --num-gpus 8 \
    --ulysses-degree 8 \
    --attention-backend sage_attn \
    --enable-torch-compile \
    --prompt "A cat walks on the grass, realistic" \
    --num-frames 81 --height 720 --width 1280 \
    --num-inference-steps 27 \
    --guidance-scale 3.5 --guidance-scale-2 4.0 \
    --dit-layerwise-offload true
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 27/27 [01:36<00:00,  3.56s/it]
[12-22 06:23:27] [DenoisingStage] average time per step: 3.5597 seconds
[12-22 06:23:29] Offloaded denoiser transformer weights to CPU after denoising to reduce peak VRAM during VAE decoding.
[12-22 06:23:29] [DenoisingStage] finished in 97.8069 seconds
[12-22 06:23:29] [DecodingStage] started...
[12-22 06:23:39] [DecodingStage] finished in 10.1658 seconds
[12-22 06:23:46] Output saved to outputs/A_cat_walks_on_the_grass_realistic_20251222-062148_f4e677f2.mp4
[12-22 06:23:46] Pixel data generated successfully in 117.56 seconds
[12-22 06:23:46] Completed batch processing. Generated 1 outputs in 117.56 seconds.
[12-22 06:23:46] Memory usage - Max peak: 21394.29 MB, Avg peak: 21394.29 MB
[12-22 06:23:46] Generator was garbage collected without being shut down. Attempting to shut down the local server and client.
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 8 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
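The two runs above differ only in `--dit-layerwise-offload true`, and the new peak-memory line makes the effect directly measurable from the logs: 62651.28 MB down to 21394.29 MB. A small sketch of pulling that figure out of the log text to quantify the saving (the parsing helper `max_peak_mb` is hypothetical, written against the log format shown above):

```python
# Sketch: extract the "Max peak" figure from the generator logs above to
# quantify the effect of --dit-layerwise-offload. The log lines are copied
# from the two runs; max_peak_mb is an illustrative helper.
import re

log_lines = [
    "[12-22 06:11:44] Memory usage - Max peak: 62651.28 MB, Avg peak: 62651.28 MB",
    "[12-22 06:23:46] Memory usage - Max peak: 21394.29 MB, Avg peak: 21394.29 MB",
]

def max_peak_mb(line):
    # Matches the peak-memory log line format shown in the runs above.
    m = re.search(r"Max peak: ([\d.]+) MB", line)
    return float(m.group(1)) if m else None

baseline, offload = (max_peak_mb(line) for line in log_lines)
saving = 100 * (1 - offload / baseline)
print(f"Layer-wise offload cut peak VRAM by {saving:.1f}%")
```

On these numbers the layer-wise offload run uses roughly a third of the baseline peak VRAM while also finishing the denoising loop faster.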
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 sglang serve \
    --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
    --text-encoder-cpu-offload \
    --pin-cpu-memory \
    --num-gpus 8 \
    --ulysses-degree 8 \
    --attention-backend sage_attn \
    --enable-torch-compile \
    --host 0.0.0.0 \
    --port 30000

python -m sglang.multimodal_gen.benchmarks.bench_serving \
    --backend sglang-video \
    --host localhost \
    --port 30000 \
    --dataset vbench \
    --task t2v \
    --num-prompts 2 \
    --max-concurrency 1
================= Serving Benchmark Result =================
Backend:                                 sglang-video   
Model:                                   Wan-AI/Wan2.2-T2V-A14B-Diffusers
Dataset:                                 vbench         
Task:                                    t2v            
--------------------------------------------------
Benchmark duration (s):                  395.74         
Request rate:                            inf            
Max request concurrency:                 1              
Successful requests:                     2/2              
--------------------------------------------------
Request throughput (req/s):              0.01           
Latency Mean (s):                        197.8686       
Latency Median (s):                      197.8686       
Latency P99 (s):                         198.3657       
--------------------------------------------------
Peak Memory Max (MB):                    62650.28       
Peak Memory Mean (MB):                   55689.21       
Peak Memory Median (MB):                 55689.21       

============================================================

github-actions bot added the diffusion (SGLang Diffusion) label Dec 22, 2025
BBuf added the run-ci label Dec 22, 2025
BBuf merged commit d77f3fc into main Dec 22, 2025 (100 of 102 checks passed)
BBuf deleted the diffusion_support_peak_memory branch Dec 22, 2025
Liwansi added a commit to iforgetmyname/sglang that referenced this pull request Dec 23, 2025
…n_eagle3_dp

* 'main' of https://github.com/sgl-project/sglang: (208 commits)
  MoE: Skip SiLU/GELU activation for masked experts (sgl-project#15539)
  [GLM-ASR] GLM-ASR Support  (sgl-project#15570)
  Improve engine customization interface (sgl-project#15635)
  chore: bump sgl-kernel version to 0.3.20 (sgl-project#15590)
  bugfix[schedule]: Refactor sort method and add related UT (sgl-project#13576)
  Adjust wrong `mtp` meaning introduce by mimo (sgl-project#15632)
  Tiny add back missing router per attempt response metric (sgl-project#15621)
  Fix router gRPC mode launch error caused by async loading (sgl-project#15368)
  [model-gateway] return 503 when all workers are circuit-broken (sgl-project#15611)
  [Diffusion] Support peak memory record in offline generate and serving (sgl-project#15610)
  [VLM] Tiny: Unify VLM environment variables (sgl-project#15572)
  [diffusion] chore: remove default post-denoising dit offload in local mode (sgl-project#15573)
  Tiny enable soft watchdog in CI for stuck without logs (sgl-project#15616)
  Tiny add stuck simulation (sgl-project#15613)
  Support soft watchdog for tokenizer/detokenizer/dp-controller processes (sgl-project#15607)
  Tiny avoid EnvField misuse (sgl-project#15612)
  add decode round robin policy (sgl-project#15164)
  Add glm-4.6-fp8 with/without mtp in nightly ci (sgl-project#15566)
  Adapt fixture-kit to gsm8k mixin (sgl-project#15599)
  [model-gateway] add retry support to OpenAI router chat endpoint (sgl-project#15589)
  ...