
MultiGPU Work Units For Accelerated Sampling #7063

Open
Kosinkadink wants to merge 113 commits into master from worksplit-multigpu

Conversation

@Kosinkadink
Member

Kosinkadink commented Mar 4, 2025

Overview

This PR adds support for MultiGPU acceleration via 'work unit' splitting - by default, conditioning is treated as work units. Any model that uses more than a single conditioning can be sped up via MultiGPU Work Units - positive+negative, multiple positive/masked conditionings, etc. The code is extensible, allowing extensions to implement their own work units; as a proof of concept, I have implemented AnimateDiff-Evolved contexts to behave as work units.

As long as there is a heavy bottleneck on the GPU, there will be a noticeable performance improvement. If the GPU is only lightly loaded (e.g. an RTX 4090 sampling a single 512x512 SD1.5 image), the overhead of splitting and combining work units will result in a performance loss compared to using just one GPU.

The MultiGPU Work Units node can be placed in (almost) any existing workflow. When only one device is found, the node does effectively nothing, so workflows making use of the node will stay compatible between single and multi-GPU setups:
image

The feature works best when work splitting is symmetrical (the GPUs are the same or have roughly the same performance), with the slowest GPU acting as the limiter. For asymmetrical setups, the MultiGPU Options node can be used to inform the load balancing code about the relative performance of the GPUs in the setup:
image
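
To give a feel for what relative-performance-aware splitting means, here is a rough, purely illustrative sketch (the split_work_units helper and the relative_speed weights are assumptions for the example, not the node's actual internals): work units are handed out roughly in proportion to each device's weight, so a slower card simply receives fewer units.

# Illustrative only: distribute num_units work units across devices in
# proportion to a per-device relative_speed weight.
def split_work_units(num_units: int, relative_speed: dict) -> dict:
    total = sum(relative_speed.values())
    allocation = {}
    remaining = num_units
    devices = list(relative_speed.items())
    for i, (device, speed) in enumerate(devices):
        if i == len(devices) - 1:
            share = remaining  # last device takes whatever is left
        else:
            share = min(remaining, round(num_units * speed / total))
        allocation[device] = share
        remaining -= share
    return allocation

# e.g. 8 work units on a fast + slow pair weighted 1.0 : 0.5
print(split_work_units(8, {"cuda:0": 1.0, "cuda:1": 0.5}))  # {'cuda:0': 5, 'cuda:1': 3}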

Nvidia (CUDA): Tested, works ✅.
AMD (ROCm): Untested, will validate soon.
AMD (DirectML): Untested.
Intel (Arc XPU): Tested, does not work on Windows but works on Linux ⚠️.

Implementation Details

Based on max_gpus and the number of available devices, the main ModelPatcher is cloned and relevant properties (like model) are deepcloned after the values are unloaded. MultiGPU clones are stored in the ModelPatcher's additional_models under the key multigpu. During sampling, the deepcloned ModelPatchers are re-cloned with the values from the main ModelPatcher, with any additional_models kept consistent. To avoid unnecessarily deepcloning models, currently_loaded_models from comfy.model_management is checked for a matching deepcloned model, in which case it is (soft) cloned and made to match the main ModelPatcher.
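
As a rough sketch of the clone-or-reuse decision described above (helper names like deepclone_for and is_multigpu_clone_of are illustrative placeholders, not the exact comfy.multigpu API):

# Illustrative sketch: for each extra device, reuse an already-loaded deepclone
# if one exists, otherwise create a fresh deep copy, then register the clones
# as additional_models under the "multigpu" key.
def build_device_clones(main_patcher, extra_devices, currently_loaded):
    clones = {}
    for device in extra_devices:
        existing = next((m for m in currently_loaded
                         if getattr(m, "is_multigpu_clone_of", None) is main_patcher
                         and m.load_device == device), None)
        if existing is not None:
            clones[device] = existing.clone()  # soft clone, weights already on the device
        else:
            clones[device] = main_patcher.deepclone_for(device)  # hypothetical deep-copy helper
    main_patcher.set_additional_models("multigpu", list(clones.values()))
    return clones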

When native conds are used as the work units, _calc_cond_batch calls and returns _calc_cond_batch_multigpu, avoiding a potential performance regression that refactoring the single-GPU code could introduce. In the future, this can be revisited so the same code path is reused, with careful performance comparisons across various models. No processes are created, only Python threads; while the GIL limits CPU parallelism, the GPU being the bottleneck makes diffusion I/O-bound rather than CPU-bound, so threads suffice. This vastly improves compatibility with existing code.
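
The general shape of the threaded path (simplified; the helper below illustrates the idea and is not the actual _calc_cond_batch_multigpu code) is one Python thread per device, each running its share of the conds on that device's model clone, with outputs moved back to the output device and gathered on the main thread:

import threading

# Illustrative sketch: GPU kernels release the GIL while they run, so plain
# Python threads are enough to keep multiple devices busy at once.
def run_cond_work_units(device_models, device_conds, x, timestep, output_device):
    results, errors = {}, []

    def worker(device, model, conds):
        try:
            out = model.apply_model(x.to(device), timestep.to(device), **conds)
            results[device] = out.to(output_device)
        except Exception as e:  # surface worker errors on the main thread
            errors.append(e)

    threads = [threading.Thread(target=worker, args=(d, device_models[d], device_conds[d]))
               for d in device_models]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    if errors:
        raise errors[0]
    return [results[d] for d in device_models]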

Since deepcloning requires that the base model is 'clean', comfy.model_management has received an unload_model_and_clones function to unload only specific models and their clones.

The --cuda-device startup argument has been refactored to accept a string rather than an int, allowing multiple ids to be provided while not breaking any existing usage:
image
image
This can be used not only to limit ComfyUI's visibility to a subset of devices per instance, but also to control their order (the first id is treated as device 0, the second as device 1, etc.).
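
As a sketch of what the string form implies (the parsing below is illustrative, not the exact cli_args/model_management code), a comma-separated value like "1,0" maps the listed ids onto device 0, device 1, ... in order:

# Illustrative: turn a --cuda-device string like "1,0" into an ordered list of
# physical device ids; index 0 of the result becomes this instance's device 0.
def parse_cuda_device_arg(value: str) -> list:
    return [int(part) for part in value.split(",") if part.strip() != ""]

print(parse_cuda_device_arg("1,0"))  # [1, 0] -> physical GPU 1 is device 0, GPU 0 is device 1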

Performance (will add more examples soon)

Wan 1.3B t2v: 1.85x uplift for 2 RTX 4090s vs 1 RTX 4090.
image
image

Wan 14B t2v: 1.89x uplift for 2 RTX 4090s vs 1 RTX 4090
image
image

…about the full scope of current sampling run, fix Hook Keyframes' guarantee_steps=1 inconsistent behavior with sampling split across different Sampling nodes/sampling runs by referencing 'sigmas'
…ches to use target_dict instead of target so that more information can be provided about the current execution environment if needed
… to separate out Wrappers/Callbacks/Patches into different hook types (all affect transformer_options)
… hook_type, modified necessary code to no longer need to manually separate out hooks by hook_type
…ptions to not conflict with the "sigmas" that will overwrite "sigmas" in _calc_cond_batch
…ade AddModelsHook operational and compliant with should_register result, moved TransformerOptionsHook handling out of ModelPatcher.register_all_hook_patches, support patches in TransformerOptionsHook properly by casting any patches/wrappers/hooks to proper device at sample time
…ops nodes by properly caching between positive and negative conds, make hook_patches_backup behave as intended (in the case that something pre-registers WeightHooks on the ModelPatcher instead of registering it at sample time)
…added some doc strings and removed a so-far unused variable
…ok to InjectionsHook (not yet implemented, but at least getting the naming figured out)
@monstari

I noticed that there’s no speed boost when using distilled models with CFG=1. Since Normalized Attention Guidance already provides similar negative conditioning at CFG=1, would it be possible to explore solutions similar to XDit’s parallel processing in the future?

Also, if we have more than two GPUs, I assume this solution wouldn’t be as useful, since we can only apply two conditioning streams.

Thanks again for all your work!

@QUTGXX

QUTGXX commented Aug 29, 2025

I tested this with a single RTX 5090 using the Wan 14B t2v model; generation took about 19.54 seconds. When combining the RTX 5090 with an RTX 4090 via MultiGPU work unit splitting, the time dropped to 11.52 seconds.

That is roughly a 1.7x speedup (RTX 5090 + RTX 4090).

That said, I have not yet been able to get Sage Attention and Torch Compile working with the MultiGPU setup, and I hope that can be resolved soon.

Overall, this feature is very promising, especially for users running mixed GPU configurations.

Note on the errors encountered when running Sage:

!!! Exception during processing !!! Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
Traceback (most recent call last):
  File "/home/rtl-6/execution.py", line 496, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
  ...
    execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
  ...
    process_inputs(input_dict, i)
  File "/home/rtl-6/execution.py", line 277, in process_inputs
    result = f(**inputs)
  File "/home/rtl-6/nodes.py", line 1521, in sample
    return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
  File "/home/rtl-6/nodes.py", line 1488, in common_ksampler
    samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
  File "/home/rtl-6/comfy/sample.py", line 45, in sample
    samples = sampler.sample(noise, positive, negative, cfg=cfg, latent_image=latent_image, start_step=start_step, last_step=last_step, force_full_denoise=force_full_denoise, denoise_mask=noise_mask, sigmas=sigmas, callback=callback, disable_pbar=disable_pbar, seed=seed)
  File "/home/rtl-6/comfy/samplers.py", line 1355, in sample
    return sample(self.model, noise, positive, negative, cfg, self.device, sampler, sigmas, self.model_options, latent_image=latent_image, denoise_mask=denoise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)
  ...
    cfg_guider.sample(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
  File "/home/rtl-6/comfy/samplers.py", line 1230, in sample
    output = executor.execute(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
  File "/home/rtl-6/comfy/patcher_extension.py", line 113, in execute
    return self.original(*args, **kwargs)
  File "/home/rtl-6/comfy/samplers.py", line 1196, in outer_sample
    output = self.inner_sample(noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
  File "/home/rtl-6/comfy/samplers.py", line 1175, in inner_sample
    samples = executor.execute(self, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
  File "/home/rtl-6/comfy/patcher_extension.py", line 113, in execute
    return self.original(*args, **kwargs)
  File "/home/rtl-6/comfy/samplers.py", line 954, in sample
    samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
  ...
  File "/home/rtl-6/comfy/k_diffusion/sampling.py", line 190, in sample_euler
    denoised = model(x, sigma_hat * s_in, **extra_args)
  File "/home/rtl-6/comfy/samplers.py", line 604, in __call__
    out = self.inner_model(x, sigma, model_options=model_options, seed=seed)
  File "/home/rtl-6/comfy/samplers.py", line 1155, in __call__
    return self.predict_noise(*args, **kwargs)
  File "/home/rtl-6/comfy/samplers.py", line 1158, in predict_noise
    return sampling_function(self.inner_model, x, timestep, self.conds.get("negative", None), self.conds.get("positive", None), self.cfg, model_options=model_options, seed=seed)
  ...
    conds, x, timestep, model_options)
  File "/home/rtl-6/comfy/samplers.py", line 211, in calc_cond_batch
    return executor.execute(model, conds, x_in, timestep, model_options)
  File "/home/rtl-6/comfy/patcher_extension.py", line 113, in execute
    return self.original(*args, **kwargs)
  File "/home/rtl-6/comfy/samplers.py", line 215, in _calc_cond_batch
    return _calc_cond_batch_multigpu(model, conds, x_in, timestep, model_options)
  File "/home/rtl-6/comfy/samplers.py", line 530, in _calc_cond_batch_multigpu
    raise error
  File "/usr/local/lib/python3.12/threading.py", line 1052, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.12/threading.py", line 989, in run
    self._target(*self._args, **self._kwargs)
  File "/home/rtl-6/comfy/samplers.py", line 511, in _handle_batch
    output = model_current.apply_model(input_x, timestep_, **c).to(output_device).chunk(batch_chunks)
  File "/home/rtl-6/comfy/model_base.py", line 152, in apply_model
    return comfy.patcher_extension.WrapperExecutor.new_class_executor(
  File "/home/rtl-6/comfy/patcher_extension.py", line 113, in execute
    return self.original(*args, **kwargs)
  File "/home/rtl-6/comfy/model_base.py", line 190, in apply_model
    model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds).float()
  File "/home/rtl-6/Python-3.12.0/comfy-env-3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/rtl-6/Python-3.12.0/comfy-env-3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/rtl-6/comfy/ldm/wan/model.py", line 580, in forward
    return self.forward_orig(x, timestep, context, clip_fea=clip_fea, freqs=freqs, transformer_options=transformer_options, **kwargs)[:, :, :t, :h, :w]
  File "/home/rtl-6/comfy/ldm/wan/model.py", line 550, in forward_orig
    x = block(x, e=e0, freqs=freqs, context=context, context_img_len=context_img_len)
  File "/home/rtl-6/Python-3.12.0/comfy-env-3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/rtl-6/Python-3.12.0/comfy-env-3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/rtl-6/comfy/ldm/wan/model.py", line 221, in forward
    y = self.self_attn(
  File "/home/rtl-6/Python-3.12.0/comfy-env-3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/rtl-6/Python-3.12.0/comfy-env-3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/rtl-6/comfy/ldm/wan/model.py", line 72, in forward
    x = optimized_attention(
  File "/home/rtl-6/Python-3.12.0/comfy-env-3.12/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 899, in _fn
    return fn(*args, **kwargs)
  File "/home/rtl-6/custom_nodes/comfyui-kjnodes/nodes/model_optimization_nodes.py", line 81, in attention_sage
    out = sage_func(q, k, v, attn_mask=mask, is_causal=False, tensor_layout=tensor_layout)
  File "/home/rtl-6/custom_nodes/comfyui-kjnodes/nodes/model_optimization_nodes.py", line 36, in func
    return sageattn(q, k, v, is_causal=is_causal, attn_mask=attn_mask, tensor_layout=tensor_layout)
  File "/home/rtl-6/Python-3.12.0/comfy-env-3.12/lib/python3.12/site-packages/sageattention/core.py", line 105, in sageattn
    q_int8, q_scale, k_int8, k_scale = per_block_int8(q, k, sm_scale=sm_scale, tensor_layout=tensor_layout)
  File "/home/rtl-6/Python-3.12.0/comfy-env-3.12/lib/python3.12/site-packages/sageattention/quant_per_block.py", line 63, in per_block_int8
    quant_per_block_int8_kernel[grid](
  File "/home/rtl-6/Python-3.12.0/comfy-env-3.12/lib/python3.12/site-packages/triton/runtime/jit.py", line 347, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/home/rtl-6/Python-3.12.0/comfy-env-3.12/lib/python3.12/site-packages/triton/runtime/jit.py", line 591, in run
    kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata,
  File "/home/rtl-6/Python-3.12.0/comfy-env-3.12/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 529, in __call__
    self.launch(gridX, gridY, gridZ, stream, function, self.launch_cooperative_grid, global_scratch, *args)
ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)

When I set max_gpus to 1, Sage Attention and Torch Compile work, but they do not work with multiple GPUs.

Hello, I also want to run this model with 4 RTX 4090 GPUs. Can you share the workflow?

@DKingAlpha

DKingAlpha commented Aug 30, 2025

@Kosinkadink AMD users will need this patch

diff --git a/comfy/model_management.py b/comfy/model_management.py
index 4ac04b8b..b396a034 100644
--- a/comfy/model_management.py
+++ b/comfy/model_management.py
@@ -194,7 +194,7 @@ def get_all_torch_devices(exclude_current=False):
     global cpu_state
     devices = []
     if cpu_state == CPUState.GPU:
-        if is_nvidia():
+        if is_nvidia() or is_amd():
             for i in range(torch.cuda.device_count()):
                 devices.append(torch.device(i))
         elif is_intel_xpu():

I couldn't confirm further at first - my multigpu setup broke recently and I couldn't tell whether the problem was in this branch or in my ROCm stack.
Oh, I got it working. Turned out my local (outdated) flash attention installation was broken to some extent.

@jkyamog

jkyamog commented Sep 3, 2025

Thanks for all the work done here. I added 1 more GPU to my 3x 3090 setup. I was trying with WAN video models but it only used 2 GPUs, which I assumed was because 3 is not a power of two. So I took a smaller 3060 12GB GPU from another system so I could run 4 GPUs. I then downgraded to WAN2.1 t2v 1.3B so it fit in VRAM on all GPUs, including the 3060. But it behaves the same as running 3 GPUs: only 2 GPUs are actually doing work, even though all of the GPUs have had VRAM loaded onto them. Is this expected? Here is what the typical load looks like during a video generation where only 2 GPUs are doing work. Btw, I did reorder the GPUs using CUDA_VISIBLE_DEVICES.
Screenshot 2025-09-03 at 2 27 22 PM

@Kosinkadink
Member Author

Kosinkadink commented Sep 10, 2025

Thank you for the additional info!

@DKingAlpha thanks for the heads up!

Firstly, has anyone here been able to get this working on Linux (not WSL)? And if so, what type of GPUs were they?

Secondly, @jkyamog this PR currently only does conditioning splitting - making conds run on separate GPUs. Wan2.1 has only two conds (positive and negative) without masking, so you can only accelerate it 2x with 2 GPUs - the remaining GPUs have no work to be split off for them. The same issue applies to models that only have one cond - there is nothing to split. I will be looking at some parallel attention schemes to try to overcome this limitation soon.
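
As a toy illustration of why the extra GPUs idle here (not the PR's actual scheduling code): with conds as the work units, a two-cond model can occupy at most two devices no matter how many are available.

# Illustrative only: round-robin conds over devices; with just positive+negative,
# devices beyond the first two never receive work.
def assign_conds_to_devices(conds, devices):
    assignment = {dev: [] for dev in devices}
    for i, cond in enumerate(conds):
        assignment[devices[i % len(devices)]].append(cond)
    return assignment

print(assign_conds_to_devices(["positive", "negative"], ["cuda:0", "cuda:1", "cuda:2", "cuda:3"]))
# {'cuda:0': ['positive'], 'cuda:1': ['negative'], 'cuda:2': [], 'cuda:3': []}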

I did not have as much time to look into the remaining issues as I expected - I apologize for the delay. I will keep looking into them, plus into acceleration beyond just conditioning splitting, soon.

@ExpandedMancho

Hi, in order to make this setup work with 2 GPUs, do you need enough VRAM to be able to run the Wan model twice on your first GPU?

I noticed I get OOM errors when the deepclone part starts. I'm guessing that the clone requires the full model to be loaded, and then also the copy of the model, before it can be moved to the 2nd GPU?

Thanks.

@Kosinkadink
Member Author

That should not be a requirement. What are your exact errors? (post full stack trace + workflow)

@ExpandedMancho

That should not be a requirement. What are your exact errors? (post full stack trace + workflow)

Hey, I found out that I'm going OOM when I use load_device = main_device in the WanVideo Model Loader with the MultiGPU Work Units WAN node set to 2 max_gpus. However, with 1 GPU and this exact setup (load_device = main_device), it does work.

  • I've tried with and without block swap and accelerator LoRAs, and it didn't make a difference for me.
  • If I use load_device = offload_device instead of main_device, the workflow works and I get no OOM, but then there's no deep cloning happening (checked the CLI) and the 2nd GPU doesn't get used at all.
  • image

Specs

GPU: 2x RTX 5090 (32GB VRAM each)
CUDA Version: 12.8
RAM: 186 GB
OS: Linux
Python version: Python 3.11.11

Relevant libs

nvidia-cublas-cu12==12.8.4.1
nvidia-cuda-cupti-cu12==12.8.90
nvidia-cuda-nvrtc-cu12==12.8.93
nvidia-cuda-runtime-cu12==12.8.90
nvidia-cudnn-cu12==9.10.2.21
nvidia-cufft-cu12==11.3.3.83
nvidia-cufile-cu12==1.13.1.3
nvidia-curand-cu12==10.3.9.90
nvidia-cusolver-cu12==11.7.3.90
nvidia-cusparse-cu12==12.5.8.93
nvidia-cusparselt-cu12==0.7.1
nvidia-nccl-cu12==2.27.3
nvidia-nvjitlink-cu12==12.8.93
nvidia-nvshmem-cu12==3.2.5
nvidia-nvtx-cu12==12.8.90
open_clip_torch==2.32.0
pytorch-triton==3.3.1+gitc8757738
rotary-embedding-torch==0.8.8
torch==2.9.0.dev20250629+cu128
torchaudio==2.8.0.dev20250629+cu128
torchsde==0.2.6
torchvision==0.23.0.dev20250629+cu128

ComfyUI startup (pytorch attention)

Set cuda device to: 0,1
Total VRAM 32120 MB, total RAM 386469 MB
pytorch version: 2.9.0.dev20250629+cu128
Enabled fp16 accumulation.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 5090 : cudaMallocAsync
Device: cuda:1 NVIDIA GeForce RTX 5090 : cudaMallocAsync
Using pytorch attention
Python version: 3.11.11 (main, Dec 4 2024, 08:55:07) [GCC 11.4.0]
ComfyUI version: 0.3.46
ComfyUI frontend version: 1.23.4

Workflow

https://pastebin.com/g2xnixXt

Stack Trace

==========SERVER got prompt==========
prompt event_id: []
Failed to copy /tmp/models/diffusion_models/MelBandRoformer_fp16.safetensors to temp dir: '/tmp/models/diffusion_models/MelBandRoformer_fp16.safetensors' is not in the subpath of '/workspace/ComfyUI' OR one path is relative and the other is absolute. falling back to original path
Converted mono input to stereo.
Resampling input 8000 to 44100
Processing chunks: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  2.97it/s]
[MultiTalk] --- Raw speaker lengths (samples) ---
  speaker 1: 244000 samples (shape: torch.Size([1, 1, 244000]))
[MultiTalk] total raw duration = 15.250s
[MultiTalk] multi_audio_type=para | final waveform shape=torch.Size([1, 1, 244000]) | length=244000 samples | seconds=15.250s (expected max of raw)
Failed to copy /tmp/models/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors to temp dir: '/tmp/models/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors' is not in the subpath of '/workspace/ComfyUI' OR one path is relative and the other is absolute. falling back to original path
CLIP layer names written to clip_layers.txt
clip_target: <comfy.sd.load_text_encoder_state_dicts.<locals>.EmptyClass object at 0x7785e868ee10> parameters: 5685458817 model_options: {'load_device': device(type='cpu'), 'offload_device': device(type='cpu')}
Using scaled fp8: fp8 matrix mult: False, scale input: False
CLIP/text encoder model load device: cpu, offload device: cpu, current: cuda:0, dtype: torch.float16
Requested to load WanTEModel
loaded completely 9.5367431640625e+25 6419.477203369141 True
Requested to load CLIPVisionModelProjection
loaded completely 28918.5119140625 1208.09814453125 True
Clip embeds shape: torch.Size([1, 257, 1280]), dtype: torch.float32
Combined clip embeds shape: torch.Size([1, 257, 1280])
CUDA Compute Capability: 12.0
Detected model in_channels: 36
Model cross attention type: i2v, num_heads: 40, num_layers: 40
Model variant detected: i2v_480
InfiniteTalk detected, patching model...
model_type FLOW
Creating deepclone of WanVideoModel for cuda:1.
!!! Exception during processing !!! class_type: MultiGPU_WorkUnitsWAN node_id: 377 ex: Allocation on device 
Traceback (most recent call last):
  File "/workspace/ComfyUI/execution.py", line 482, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/ComfyUI/execution.py", line 292, in get_output_data
    return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/ComfyUI/execution.py", line 266, in _async_map_node_over_list
    await process_inputs(input_dict, i)
  File "/workspace/ComfyUI/execution.py", line 254, in process_inputs
    result = f(**inputs)
             ^^^^^^^^^^^
  File "/workspace/ComfyUI/comfy_extras/nodes_multigpu.py", line 71, in init_multigpu
    model = comfy.multigpu.create_multigpu_deepclones(model, max_gpus, gpu_options, reuse_loaded=True)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/ComfyUI/comfy/multigpu.py", line 90, in create_multigpu_deepclones
    device_patcher = model.deepclone_multigpu(new_load_device=device)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/ComfyUI/comfy/model_patcher.py", line 349, in deepclone_multigpu
    n.model = copy.deepcopy(n.model)
              ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 271, in _reconstruct
    state = deepcopy(state, memo)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 146, in deepcopy
    y = copier(x, memo)
        ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 146, in deepcopy
    y = copier(x, memo)
        ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 146, in deepcopy
    y = copier(x, memo)
        ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 153, in deepcopy
    y = copier(memo)
        ^^^^^^^^^^^^
  File "/workspace/venv_cu128/lib/python3.11/site-packages/torch/_tensor.py", line 178, in __deepcopy__
    new_storage = self._typed_storage()._deepcopy(memo)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/venv_cu128/lib/python3.11/site-packages/torch/storage.py", line 1139, in _deepcopy
    return self._new_wrapped_storage(copy.deepcopy(self._untyped_storage, memo))
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 153, in deepcopy
    y = copier(memo)
        ^^^^^^^^^^^^
  File "/workspace/venv_cu128/lib/python3.11/site-packages/torch/storage.py", line 243, in __deepcopy__
    new_storage = self.clone()
                  ^^^^^^^^^^^^
  File "/workspace/venv_cu128/lib/python3.11/site-packages/torch/storage.py", line 257, in clone
    return type(self)(self.nbytes(), device=self.device).copy_(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: Allocation on device 

Got an OOM, unloading all loaded models. node_id: 377 class_type: MultiGPU_WorkUnitsWAN class_def: <class '/workspace/ComfyUI/comfy_extras/nodes_multigpu.MultiGPUWorkUnitsNodeWAN'>
Prompt executed in 32.86 seconds

@Kosinkadink
Member Author

Kosinkadink commented Sep 17, 2025

Multigpu work units are a feature only for nodes that use native sampling or specifically reimplement support - the node you're looking at is a wrapper custom node that does not use native sampling.

@edflyer

edflyer commented Oct 19, 2025

Do you have a workflow that I can test to see if I installed this correctly?

@edflyer

edflyer commented Oct 19, 2025

Thanks for all the work done here. I added 1 more GPU to my 3x 3090 setup. I was trying with WAN video models but it only used 2 GPUs because 3 is not a binary number. So I took a smaller 3060 12GB GPU from another system, so I can run 4 GPUs. I then downgraded to WAN2.1 t2v 1.3B so it fit on all GPU VRAM including the 3060. But it seems it behaves similarly to running 3 GPUs, only 2 GPUs are actually doing work even all of the GPU has had VRAM loaded in. Is this expected? Here is what the typical load looks like through a video generation where only 2 GPUs are doing work. Btw I did reorder the GPUs using CUDA_VISIBLE_DEVICES. Screenshot 2025-09-03 at 2 27 22 PM

What workflow are you using?

@rattus128
Contributor

Firstly, has anyone here been able to get this working on Linux (not WSL)? And if so, what type of GPUs were they?

I think I have it working with a fix.

2xA40 on a runpod. I reproduced black outputs, and colorful noise in flux-dev fp8, cfg=1.1.

root@f00a481f73a0:~/ComfyUI# cat /etc/*-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=24.04
DISTRIB_CODENAME=noble
DISTRIB_DESCRIPTION="Ubuntu 24.04.3 LTS"
PRETTY_NAME="Ubuntu 24.04.3 LTS"

I got a black screen on some async tensor casting experiments I was doing for another change, and debugged it to a race between the CUDA streams and the PyTorch garbage collector, so I thought I'd check for the same bug here. I remember @Kosinkadink saying in a Discord post that this was blocked by black screens.

So I think something similar is going on here: the GPU->GPU ops are asynchronous with respect to the CPU, and the CPU can run ahead and queue a cudaAsyncFree on one GPU while the other is still bus-mastering the .to transfers, depending on who is the bus master and tensor owner. In the case of pull DMA this can easily become a race that corrupts tensors before the transfer completes. PyTorch documentation is sparse on this, so it's all theory.

So if I'm right, this can be fixed by always bounce-buffering through RAM, which syncs the CPU:

diff --git a/comfy/samplers.py b/comfy/samplers.py
index ed702304..a93dbde4 100755
--- a/comfy/samplers.py
+++ b/comfy/samplers.py
@@ -158,7 +158,7 @@ def cond_cat(c_list, device=None):
         conds = temp[k]
         out[k] = conds[0].concat(conds[1:])
         if device is not None and hasattr(out[k], 'to'):
-            out[k] = out[k].to(device)
+            out[k] = out[k].cpu().to(device)
 
     return out
 
@@ -470,7 +470,7 @@ def _calc_cond_batch_multigpu(model: BaseModel, conds: list[list[dict]], x_in: t
                         patches = p.patches
 
                     batch_chunks = len(cond_or_uncond)
-                    input_x = torch.cat(input_x).to(device)
+                    input_x = torch.cat(input_x).cpu().to(device)
                     c = cond_cat(c, device=device)
                     timestep_ = torch.cat([timestep.to(device)] * batch_chunks)
 
@@ -500,9 +500,9 @@ def _calc_cond_batch_multigpu(model: BaseModel, conds: list[list[dict]], x_in: t
                         c['control'] = device_control.get_control(input_x, timestep_, c, len(cond_or_uncond), transformer_options)
 
                     if 'model_function_wrapper' in model_options:
-                        output = model_options['model_function_wrapper'](model_current.apply_model, {"input": input_x, "timestep": timestep_, "c": c, "cond_or_uncond": cond_or_uncond}).to(output_device).chunk(batch_chunks)
+                        output = model_options['model_function_wrapper'](model_current.apply_model, {"input": input_x, "timestep": timestep_, "c": c, "cond_or_uncond": cond_or_uncond}).cpu().to(output_device).chunk(batch_chunks)
                     else:
-                        output = model_current.apply_model(input_x, timestep_, **c).to(output_device).chunk(batch_chunks)
+                        output = model_current.apply_model(input_x, timestep_, **c).cpu().to(output_device).chunk(batch_chunks)
                     results.append(thread_result(output, mult, area, batch_chunks, cond_or_uncond))
         except Exception as e:
             results.append(thread_result(None, None, None, None, None, error=e))

There is in theory a performance penalty here, as it changes the DMA path from master-slave to master-RAM-master, but I'm not observing any penalty in my initial tests.

Here is B=4 1024x1024 cfg=1.1 Flux dev speeds:

1 GPU

100%|████████████████████████████████████████████████████████| 20/20 [02:20<00:00,  7.00s/it]

2 GPUs - This branch unchanged (corrupted output)

100%|███████████████████████████████████████████████████| 20/20 [01:22<00:00,  4.11s/it]

2 GPUs - With above fix

100%|███████████████████████████████████████████████████| 20/20 [01:22<00:00,  4.11s/it]


Properly syncing the GPU->GPU DMA is a complex web of driver specifics, so this is a lot easier.

If this ends up being slow for other use cases (very large latents), you could chunk the .to as a series of queued copies instead, so the two bus masters start overlapping work and performance will likely converge on something very close to master-slave, given the above.
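
A minimal sketch of that chunked idea (a hypothetical helper, not part of this PR, and deliberately simplified - it still bounces through host RAM and does not show stream-level overlap):

import torch

# Hypothetical: move a tensor to another device slice by slice so the copies are
# queued piecewise instead of as one monolithic .to() transfer.
def chunked_transfer(src: torch.Tensor, device, chunks: int = 4) -> torch.Tensor:
    pieces = []
    for piece in src.chunk(chunks, dim=0):
        pieces.append(piece.cpu().to(device))  # bounce each slice through host RAM
    return torch.cat(pieces)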

Screenshot from 2025-10-26 18-20-53

@comfy-pr-bot
Member

Test Evidence Check

Kosinkadink and others added 3 commits February 17, 2026 02:53
…clone actual model object, fixed issues with merge, turn off cuda backend as it causes device mismatch issue with rope (and potentially other ops), will investigate
@coderabbitai

coderabbitai bot commented Mar 18, 2026

📝 Walkthrough

This PR introduces comprehensive multi-GPU support throughout the codebase. The --cuda-device CLI argument type changes from int to str to support multiple device specification. New infrastructure includes a multigpu module with GPU options management and load balancing utilities. ControlNet and ModelPatcher classes gain deep cloning methods for per-device instances. Sampling flows are refactored to parallelize conditioning across devices using threading. Model management is extended to handle unloading across multiple devices. New node classes expose multi-GPU configuration in workflows. The CUDA backend in quant_ops is unconditionally disabled. Changes span model management, sampling orchestration, patcher logic, and checkpoint loading initialization.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 16.47%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The title accurately and clearly summarizes the main change: introducing MultiGPU Work Units for accelerated sampling, which is the primary objective of this PR.
Description check | ✅ Passed | The description provides comprehensive context about the MultiGPU acceleration feature, implementation details, performance metrics, and hardware compatibility, all of which directly relate to the changeset.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 11

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@comfy_extras/nodes_multigpu.py`:
- Around line 66-69: The GPUOptionsGroup.clone() return value is being discarded
in create_gpu_options; capture and use the cloned object so we don't mutate the
caller-supplied gpu_options. Change the behavior in create_gpu_options to assign
the result of GPUOptionsGroup.clone() back to gpu_options (i.e., gpu_options =
gpu_options.clone()) and then continue using that local gpu_options, ensuring
each node gets its own cloned GPUOptionsGroup rather than sharing state.

In `@comfy/cli_args.py`:
- Line 52: The --cuda-device argument currently only accepts a single token;
update the parser.add_argument call for "--cuda-device" to accept multiple
space-separated device IDs by adding nargs='+' (and set type=int if you want
integer IDs) so that invocations like "--cuda-device 0 1" parse correctly;
alternatively, if you prefer comma-separated input, change the help text to
explicitly state the required format instead of implying plural support.

In `@comfy/controlnet.py`:
- Around line 322-328: The multigpu clone path in deepclone_multigpu currently
builds c = self.copy() which does not carry the previous_controlnet chain,
causing stacked ControlNets/T2IAdapters to be lost on secondary GPUs; update
deepclone_multigpu to copy previous_controlnet (and any linked
.previous_controlnet chain) from self to c after c = self.copy() so that the
full chain is preserved, then continue deep-copying control_model and wrapping
it as before (ensure multigpu_clones[load_device] assignment remains unchanged);
apply the same preservation of previous_controlnet chaining to the similar clone
code paths that use copy_to()/get_instance_for_device() so all per-device clones
keep the full previous_controlnet chain.

In `@comfy/model_management.py`:
- Around line 214-231: The function get_all_torch_devices currently only handles
NVIDIA/Intel/Ascend and can return an empty list (breaking exclude_current and
unload_all_models); update it to (1) add detection for other common backends
(e.g., ROCm/DirectML/MLU/MPS) or at minimum attempt generic torch backend checks
such as torch.cuda.device_count() and torch.backends.mps.is_available() and
append appropriate torch.device entries, (2) if after all backend checks devices
is still empty, append get_torch_device() as a safe fallback so callers always
get at least the current device, and (3) make the exclude_current branch robust
by checking membership before calling devices.remove(get_torch_device()); refer
to get_all_torch_devices, cpu_state, CPUState.GPU, is_nvidia, is_intel_xpu,
is_ascend_npu, and get_torch_device when implementing these fixes.

In `@comfy/model_patcher.py`:
- Around line 1315-1321: The ON_PREPARE_STATE callbacks are being invoked with
four positional args in prepare_state, breaking backward-compatibility for
callbacks that expect three; update prepare_state to detect each callback's
accepted arity (e.g., via inspect.signature or callable.__code__.co_argcount)
and call either callback(self, timestep, model_options, ignore_multigpu) if it
accepts 4 args or callback(self, timestep, model_options) if it only accepts 3
(or attempt the 4-arg call and fall back to 3-arg on TypeError), and apply the
same arity-gated invocation when recursing into multigpu clones; reference
prepare_state and CallbacksMP.ON_PREPARE_STATE to locate where to change the
callsite.

In `@comfy/multigpu.py`:
- Around line 60-112: create_multigpu_deepclones clones existing "multigpu"
additional models but never removes ones that exceed the new max_gpus; to fix,
after computing limit_extra_devices (the allowed device list) retrieve
model.get_additional_models_with_key("multigpu"), filter out any clone whose
load_device is not in ([model.load_device] + limit_extra_devices) (use each
ModelPatcher.load_device to decide), then call
model.set_additional_models("multigpu", filtered_list) before
match_multigpu_clones()/gpu_options.register; ensure reuse_loaded logic still
can find matching clones and that is_multigpu_base_clone flags remain correct
for retained clones.

In `@comfy/quant_ops.py`:
- Line 23: The unconditional call ck.registry.disable("cuda") in
comfy/quant_ops.py should be removed and only invoked when the unsupported
multigpu+cuda combination is actually active; locate the
ck.registry.disable("cuda") invocation and wrap it with a guard that checks the
real multigpu/backend state (for example an existing multigpu flag or function
like is_multigpu_enabled(), a config/ENV check, or the code path that handles
multigpu setup) so that CUDA is only disabled when multigpu is enabled and the
specific backend combination is unsupported, otherwise leave CUDA enabled for
normal single-GPU runs.

In `@comfy/sampler_helpers.py`:
- Line 200: Add the missing BaseModel import used in the type annotation for
real_model (the line "real_model: BaseModel = model.model") by adding "from
typing import TYPE_CHECKING" already present and then inside the existing
TYPE_CHECKING block import BaseModel from its module (e.g., "from <module>
import BaseModel") so the annotation is defined at type-check time;
alternatively remove the BaseModel annotation if you prefer not to add the
import.

In `@comfy/samplers.py`:
- Around line 391-397: The multigpu scheduler currently ignores multigpu_options
and uses integer floor division (//) inside math.ceil, producing coarse,
incorrect splits; update the batching logic around devices,
device_batched_hooked_to_run, total_conds, hooked_to_run and conds_per_device to
consult multigpu_options (specifically the relative_speed entry for each device
clone) and distribute total_conds proportionally to those relative_speed weights
(then ceil each device's share and ensure at least 1 if there are any
conditions), replacing the math.ceil(total_conds//len(devices)) approach with a
proper float division and per-device allocation; keep device ordering based on
model_options['multigpu_clones'].keys() and ensure the same proportional logic
is applied in the other affected blocks (lines mentioned: 403-416, 433-435) so
the MultiGPU Options node actually affects work distribution.
- Around line 847-850: The code calls x['control'].pre_run(model, ...) for the
base control and then calls device_cnet.pre_run(model, ...) for each control
clone, incorrectly passing the base model to per-device controls; update the
loop to pass the matching per-device model clone instead. Specifically, when
iterating x['control'].multigpu_clones (the device_cnet clones), look up the
corresponding model clone (e.g., from model.multigpu_clones using the same
keys/ids) and call device_cnet.pre_run(model_clone,
percent_to_timestep_function) so each control clone receives its matching model
clone; keep the initial x['control'].pre_run(model, ...) for the base control.

In `@comfy/sd.py`:
- Line 1557: The assignment to out[0].cached_patcher_init can raise when out[0]
is None (e.g. when called from load_checkpoint_guess_config_clip_only()), so
guard it: check that out[0] is not None before assigning to
out[0].cached_patcher_init and, if the CLIP patcher is created separately for
checkpoint-backed models, set its own cached_patcher_init instead (or attach the
init tuple to the patcher instance). Locate the assignment line and ensure both
code paths (when out[0] exists and when only a CLIP patcher exists) receive the
same cached_patcher_init tuple so no NullReference occurs and checkpoint-backed
CLIP patchers keep their cached_patcher_init.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: acbb973a-f861-46f7-805a-a50660ad4671

📥 Commits

Reviewing files that changed from the base of the PR and between b941913 and f410d28.

📒 Files selected for processing (12)
  • comfy/cli_args.py
  • comfy/controlnet.py
  • comfy/model_management.py
  • comfy/model_patcher.py
  • comfy/multigpu.py
  • comfy/patcher_extension.py
  • comfy/quant_ops.py
  • comfy/sampler_helpers.py
  • comfy/samplers.py
  • comfy/sd.py
  • comfy_extras/nodes_multigpu.py
  • nodes.py

Comment on lines +66 to +69
    def create_gpu_options(self, device_index: int, relative_speed: float, gpu_options: comfy.multigpu.GPUOptionsGroup=None):
        if not gpu_options:
            gpu_options = comfy.multigpu.GPUOptionsGroup()
        gpu_options.clone()

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -e
FILE=$(fd '^multigpu\.py$' comfy)
rg -n -C4 'class GPUOptionsGroup|def clone\(' "$FILE"

Repository: Comfy-Org/ComfyUI

Length of output: 793


🏁 Script executed:

#!/bin/bash
set -e
FILE=$(fd '^nodes_multigpu\.py$' comfy_extras)
sed -n '66,80p' "$FILE"

Repository: Comfy-Org/ComfyUI

Length of output: 529


Capture the result of gpu_options.clone() on line 69.

The cloned object is being discarded. Since GPUOptionsGroup.clone() returns a new object, the current code ignores it and then mutates the caller-supplied gpu_options directly. When multiple MultiGPU Options nodes share the same GPU options group, this causes unintended state sharing across the node chain.

Change line 69 from:

gpu_options.clone()

to:

gpu_options = gpu_options.clone()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy_extras/nodes_multigpu.py` around lines 66 - 69, The
GPUOptionsGroup.clone() return value is being discarded in create_gpu_options;
capture and use the cloned object so we don't mutate the caller-supplied
gpu_options. Change the behavior in create_gpu_options to assign the result of
GPUOptionsGroup.clone() back to gpu_options (i.e., gpu_options =
gpu_options.clone()) and then continue using that local gpu_options, ensuring
each node gets its own cloned GPUOptionsGroup rather than sharing state.

parser.add_argument("--auto-launch", action="store_true", help="Automatically launch ComfyUI in the default browser.")
parser.add_argument("--disable-auto-launch", action="store_true", help="Disable auto launching the browser.")
parser.add_argument("--cuda-device", type=int, default=None, metavar="DEVICE_ID", help="Set the id of the cuda device this instance will use. All other devices will not be visible.")
parser.add_argument("--cuda-device", type=str, default=None, metavar="DEVICE_ID", help="Set the ids of cuda devices this instance will use. All other devices will not be visible.")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--cuda-device", type=str, default=None)

for argv in (
    ["--cuda-device", "0,1"],
    ["--cuda-device", "0", "1"],
):
    try:
        ns = parser.parse_args(argv)
        print(argv, "->", ns.cuda_device)
    except SystemExit as e:
        print(argv, "-> parse failed with exit", e.code)
PY

Repository: Comfy-Org/ComfyUI

Length of output: 243


🏁 Script executed:

rg "cuda_device" -A 3 -B 3 --max-count=20

Repository: Comfy-Org/ComfyUI

Length of output: 702


🏁 Script executed:

rg "(cuda-device|cuda_device)" -l

Repository: Comfy-Org/ComfyUI

Length of output: 86


🏁 Script executed:

git log --oneline --all -n 1

Repository: Comfy-Org/ComfyUI

Length of output: 113


🏁 Script executed:

git diff HEAD~1 HEAD -- comfy/cli_args.py 2>/dev/null || git show HEAD:comfy/cli_args.py 2>/dev/null | grep -A 5 -B 5 "cuda-device"

Repository: Comfy-Org/ComfyUI

Length of output: 1367


--cuda-device 0 1 still won't parse.

Changing the type to str only changes the value type, not the option arity. argparse still accepts one token here, so --cuda-device 0 1 fails with "unrecognized arguments: 1". The help text mentions "ids" (plural), implying multi-device support, but the current implementation requires comma-separated format: --cuda-device 0,1. Either add nargs='+' to accept space-separated device IDs or clarify the help text to document the required comma-separated input format.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/cli_args.py` at line 52, The --cuda-device argument currently only
accepts a single token; update the parser.add_argument call for "--cuda-device"
to accept multiple space-separated device IDs by adding nargs='+' (and set
type=int if you want integer IDs) so that invocations like "--cuda-device 0 1"
parse correctly; alternatively, if you prefer comma-separated input, change the
help text to explicitly state the required format instead of implying plural
support.

Comment on lines +322 to +328
    def deepclone_multigpu(self, load_device, autoregister=False):
        c = self.copy()
        c.control_model = copy.deepcopy(c.control_model)
        c.control_model_wrapped = comfy.model_patcher.ModelPatcher(c.control_model, load_device=load_device, offload_device=comfy.model_management.unet_offload_device())
        if autoregister:
            self.multigpu_clones[load_device] = c
        return c

⚠️ Potential issue | 🟠 Major

Preserve the previous_controlnet chain in multigpu clones.

These new clone paths build c from copy(), but copy_to() does not carry previous_controlnet. Once get_instance_for_device() returns the per-device clone, stacked ControlNets/T2IAdapters on earlier links are silently dropped on secondary GPUs.

As per coding guidelines, comfy/** changes should focus on backward compatibility.

Also applies to: 952-958

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/controlnet.py` around lines 322 - 328, The multigpu clone path in
deepclone_multigpu currently builds c = self.copy() which does not carry the
previous_controlnet chain, causing stacked ControlNets/T2IAdapters to be lost on
secondary GPUs; update deepclone_multigpu to copy previous_controlnet (and any
linked .previous_controlnet chain) from self to c after c = self.copy() so that
the full chain is preserved, then continue deep-copying control_model and
wrapping it as before (ensure multigpu_clones[load_device] assignment remains
unchanged); apply the same preservation of previous_controlnet chaining to the
similar clone code paths that use copy_to()/get_instance_for_device() so all
per-device clones keep the full previous_controlnet chain.

Comment on lines +214 to +231
def get_all_torch_devices(exclude_current=False):
    global cpu_state
    devices = []
    if cpu_state == CPUState.GPU:
        if is_nvidia():
            for i in range(torch.cuda.device_count()):
                devices.append(torch.device(i))
        elif is_intel_xpu():
            for i in range(torch.xpu.device_count()):
                devices.append(torch.device(i))
        elif is_ascend_npu():
            for i in range(torch.npu.device_count()):
                devices.append(torch.device(i))
    else:
        devices.append(get_torch_device())
    if exclude_current:
        devices.remove(get_torch_device())
    return devices

⚠️ Potential issue | 🟠 Major

Handle non-CUDA backends in get_all_torch_devices.

This helper only enumerates CUDA/XPU/NPU devices, so ROCm/DirectML/MLU-style paths leave devices empty. With exclude_current=True that turns into a remove() failure, and unload_all_models() also stops freeing anything on those backends because it now routes through this helper.

As per coding guidelines, comfy/** changes should focus on backward compatibility and memory management/GPU resource handling.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/model_management.py` around lines 214 - 231, The function
get_all_torch_devices currently only handles NVIDIA/Intel/Ascend and can return
an empty list (breaking exclude_current and unload_all_models); update it to (1)
add detection for other common backends (e.g., ROCm/DirectML/MLU/MPS) or at
minimum attempt generic torch backend checks such as torch.cuda.device_count()
and torch.backends.mps.is_available() and append appropriate torch.device
entries, (2) if after all backend checks devices is still empty, append
get_torch_device() as a safe fallback so callers always get at least the current
device, and (3) make the exclude_current branch robust by checking membership
before calling devices.remove(get_torch_device()); refer to
get_all_torch_devices, cpu_state, CPUState.GPU, is_nvidia, is_intel_xpu,
is_ascend_npu, and get_torch_device when implementing these fixes.

Comment on lines +1315 to +1321
    def prepare_state(self, timestep, model_options, ignore_multigpu=False):
        for callback in self.get_all_callbacks(CallbacksMP.ON_PREPARE_STATE):
-           callback(self, timestep)
+           callback(self, timestep, model_options, ignore_multigpu)
        if not ignore_multigpu and "multigpu_clones" in model_options:
            for p in model_options["multigpu_clones"].values():
                p: ModelPatcher
                p.prepare_state(timestep, model_options, ignore_multigpu=True)

⚠️ Potential issue | 🟠 Major

Keep ON_PREPARE_STATE callback arity backward-compatible.

This now passes ignore_multigpu as a fourth positional argument to every existing ON_PREPARE_STATE callback. Any custom node still implementing the old (patcher, timestep, model_options) signature will fail with TypeError during sampling unless this is gated behind an arity check or opt-in keyword path.

As per coding guidelines, comfy/** changes should focus on backward compatibility.
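
A minimal sketch of the arity-gated call described here (illustrative; inspect-based detection is one of several possible options):

import inspect

# Illustrative: use the new 4-arg form only if the callback accepts it,
# otherwise fall back to the legacy 3-arg form.
def call_prepare_state_callback(callback, patcher, timestep, model_options, ignore_multigpu):
    try:
        params = inspect.signature(callback).parameters
        accepts_four = len(params) >= 4 or any(
            p.kind == inspect.Parameter.VAR_POSITIONAL for p in params.values())
    except (TypeError, ValueError):
        accepts_four = True  # uninspectable callables: just try the new form
    if accepts_four:
        callback(patcher, timestep, model_options, ignore_multigpu)
    else:
        callback(patcher, timestep, model_options)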

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/model_patcher.py` around lines 1315 - 1321, The ON_PREPARE_STATE
callbacks are being invoked with four positional args in prepare_state, breaking
backward-compatibility for callbacks that expect three; update prepare_state to
detect each callback's accepted arity (e.g., via inspect.signature or
callable.__code__.co_argcount) and call either callback(self, timestep,
model_options, ignore_multigpu) if it accepts 4 args or callback(self, timestep,
model_options) if it only accepts 3 (or attempt the 4-arg call and fall back to
3-arg on TypeError), and apply the same arity-gated invocation when recursing
into multigpu clones; reference prepare_state and CallbacksMP.ON_PREPARE_STATE
to locate where to change the callsite.

ck.registry.disable("cuda")
logging.warning("WARNING: You need pytorch with cu130 or higher to use optimized CUDA operations.")

ck.registry.disable("cuda") # multigpu will not work rn with comfy-kitchen on cuda backend

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's see the full context of the file around line 23
cat -n comfy/quant_ops.py | head -50

Repository: Comfy-Org/ComfyUI

Length of output: 1910


🏁 Script executed:

# Let's also check if there's any git context to see if this is a new change
git diff HEAD -- comfy/quant_ops.py | head -100

Repository: Comfy-Org/ComfyUI

Length of output: 43


🏁 Script executed:

# Search for multigpu-related logic in the codebase to understand the gating/conditions
rg -n "multigpu" --type py -B 2 -A 2

Repository: Comfy-Org/ComfyUI

Length of output: 25235


Don't disable the comfy-kitchen CUDA backend globally.

Line 23 disables it unconditionally at import time, which affects all CUDA runs including single-GPU and non-multigpu workflows. This regresses quantization performance for users not using multigpu. Gate the disable behind actual multigpu activation or the specific unsupported backend combination instead of doing it unconditionally here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/quant_ops.py` at line 23, The unconditional call
ck.registry.disable("cuda") in comfy/quant_ops.py should be removed and only
invoked when the unsupported multigpu+cuda combination is actually active;
locate the ck.registry.disable("cuda") invocation and wrap it with a guard that
checks the real multigpu/backend state (for example an existing multigpu flag or
function like is_multigpu_enabled(), a config/ENV check, or the code path that
handles multigpu setup) so that CUDA is only disabled when multigpu is enabled
and the specific backend combination is unsupported, otherwise leave CUDA
enabled for normal single-GPU runs.
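
A hedged sketch of the kind of guard the comment is asking for; `multigpu_active` stands in for whatever real flag or check ComfyUI exposes (hypothetical, not taken from the PR), and `ck` is the comfy-kitchen registry object already referenced in quant_ops.py:

```python
import logging

def maybe_disable_kitchen_cuda(ck, multigpu_active: bool) -> None:
    # Only disable the comfy-kitchen CUDA backend when the unsupported
    # multigpu + CUDA combination is actually in play, instead of doing it
    # unconditionally at import time and penalizing single-GPU runs.
    if multigpu_active:
        ck.registry.disable("cuda")
        logging.warning(
            "Disabling comfy-kitchen CUDA backend: MultiGPU work units are not supported on it yet."
        )
```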

minimum_memory_required += inference_memory
comfy.model_management.load_models_gpu([model] + models, memory_required=memory_required, minimum_memory_required=minimum_memory_required, force_full_load=force_full_load)
real_model = model.model
real_model: BaseModel = model.model

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, check the imports and line 200 in the file
head -20 comfy/sampler_helpers.py && echo "---" && sed -n '195,205p' comfy/sampler_helpers.py

Repository: Comfy-Org/ComfyUI

Length of output: 1130


🏁 Script executed:

# Search for all imports of BaseModel or where it might come from
rg -i "basemodel|from.*import|import" comfy/sampler_helpers.py | head -30

Repository: Comfy-Org/ComfyUI

Length of output: 463


🏁 Script executed:

# Check if there's a BaseModel defined or imported elsewhere that should be in this file
rg "class BaseModel|from.*BaseModel" comfy/

Repository: Comfy-Org/ComfyUI

Length of output: 50375


🏁 Script executed:

# Check git diff to see if line 200 is new
git diff HEAD^ comfy/sampler_helpers.py | grep -A 5 -B 5 "BaseModel"

Repository: Comfy-Org/ComfyUI

Length of output: 246


🏁 Script executed:

# Look for similar type annotations in the file to understand the pattern
rg ":\s*\w+\s*=" comfy/sampler_helpers.py | head -20

Repository: Comfy-Org/ComfyUI

Length of output: 100


🏁 Script executed:

# Check if there are any existing TYPE_CHECKING imports or patterns
sed -n '1,30p' comfy/sampler_helpers.py

Repository: Comfy-Org/ComfyUI

Length of output: 894


Add the missing BaseModel import to fix undefined name.

Line 200 uses BaseModel in the type annotation, but it's not imported. Add it to the TYPE_CHECKING block to match the file's existing pattern, or remove the annotation.

Suggested fix
 if TYPE_CHECKING:
     from comfy.model_patcher import ModelPatcher
     from comfy.controlnet import ControlBase
+    from comfy.model_base import BaseModel
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
real_model: BaseModel = model.model
if TYPE_CHECKING:
    from comfy.model_patcher import ModelPatcher
    from comfy.controlnet import ControlBase
    from comfy.model_base import BaseModel
🧰 Tools
🪛 GitHub Actions: Python Linting

[error] 200-200: F821 Undefined name BaseModel detected by ruff check. Ensure BaseModel is imported or defined.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/sampler_helpers.py` at line 200, Add the missing BaseModel import used
in the type annotation for real_model (the line "real_model: BaseModel =
model.model"). The "from typing import TYPE_CHECKING" import is already present,
so import BaseModel inside the existing TYPE_CHECKING block (e.g., "from
<module> import BaseModel") so the annotation is defined at type-check time;
alternatively, remove the BaseModel annotation if you prefer not to add the
import.
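
For reference, a self-contained illustration of the TYPE_CHECKING pattern the suggested fix relies on; the function below is a placeholder, not code from the PR:

```python
from __future__ import annotations  # keeps annotations as strings at runtime

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by static type checkers (mypy, pyright); never imported
    # at runtime, so it cannot create import cycles or slow down startup.
    from comfy.model_base import BaseModel

def unwrap(model) -> BaseModel:
    # Local variable annotations are not evaluated at runtime either, so
    # referencing BaseModel here is safe even though it was never imported.
    real_model: BaseModel = model.model
    return real_model
```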

Comment on lines +391 to +397
devices = [dev_m for dev_m in model_options['multigpu_clones'].keys()]
device_batched_hooked_to_run: dict[torch.device, list[tuple[comfy.hooks.HookGroup, tuple]]] = {}

total_conds = 0
for to_run in hooked_to_run.values():
    total_conds += len(to_run)
conds_per_device = max(1, math.ceil(total_conds//len(devices)))

⚠️ Potential issue | 🟠 Major

relative_speed is not used by the multigpu scheduler.

This branch still computes a fixed conds_per_device and round-robins by raw condition count; multigpu_options is never consulted here. The new MultiGPU Options node therefore has no effect on work distribution, and the // inside math.ceil(...) floors the ratio before the ceiling is applied, so the per-device quota comes out too small on uneven counts.

As per coding guidelines, comfy/** changes should focus on performance implications in hot paths.

Also applies to: 403-416, 433-435

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/samplers.py` around lines 391 - 397, The multigpu scheduler currently
ignores multigpu_options and uses integer floor division (//) inside math.ceil,
producing coarse, incorrect splits; update the batching logic around devices,
device_batched_hooked_to_run, total_conds, hooked_to_run and conds_per_device to
consult multigpu_options (specifically the relative_speed entry for each device
clone) and distribute total_conds proportionally to those relative_speed weights
(then ceil each device's share and ensure at least 1 if there are any
conditions), replacing the math.ceil(total_conds//len(devices)) approach with a
proper float division and per-device allocation; keep device ordering based on
model_options['multigpu_clones'].keys() and ensure the same proportional logic
is applied in the other affected blocks (lines mentioned: 403-416, 433-435) so
the MultiGPU Options node actually affects work distribution.
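
A sketch of the proportional allocation the prompt describes, assuming each device's relative_speed is available as a float weight (the names and the exact shape of multigpu_options are assumptions, not taken from the PR):

```python
import math

def allocate_conds_per_device(total_conds: int, device_weights: dict[str, float]) -> dict[str, int]:
    # Split total_conds across devices in proportion to their relative speed,
    # guaranteeing at least one cond per device whenever any conds exist.
    total_weight = sum(device_weights.values())
    shares: dict[str, int] = {}
    for device, weight in device_weights.items():
        share = math.ceil(total_conds * (weight / total_weight))
        shares[device] = max(1, share) if total_conds > 0 else 0
    return shares

# Example: 10 conds over a fast and a slow GPU (weights 2.0 and 1.0)
# -> {'cuda:0': 7, 'cuda:1': 4}; a caller would cap the running total at total_conds.
print(allocate_conds_per_device(10, {"cuda:0": 2.0, "cuda:1": 1.0}))
```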

Comment on lines 847 to +850
if 'control' in x:
    x['control'].pre_run(model, percent_to_timestep_function)
    for device_cnet in x['control'].multigpu_clones.values():
        device_cnet.pre_run(model, percent_to_timestep_function)

⚠️ Potential issue | 🟠 Major

Run per-device controls against the matching model clone.

These new pre_run() calls feed every device clone the base model. Any control that snapshots model-specific state during pre_run() will capture the wrong device/model; QwenFunControlNet.pre_run() in this file already stores model.diffusion_model, so its multigpu clone will still point at the base UNet.

As per coding guidelines, comfy/** changes should focus on backward compatibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/samplers.py` around lines 847 - 850, The code calls
x['control'].pre_run(model, ...) for the base control and then calls
device_cnet.pre_run(model, ...) for each control clone, incorrectly passing the
base model to per-device controls; update the loop to pass the matching
per-device model clone instead. Specifically, when iterating
x['control'].multigpu_clones (the device_cnet clones), look up the corresponding
model clone (e.g., from model.multigpu_clones using the same keys/ids) and call
device_cnet.pre_run(model_clone, percent_to_timestep_function) so each control
clone receives its matching model clone; keep the initial
x['control'].pre_run(model, ...) for the base control.
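
A sketch of the per-device lookup the prompt suggests, under the assumption that the model object exposes a multigpu_clones dict keyed by the same devices as the control's clones (that mapping is an assumption, not verified against the PR):

```python
def pre_run_controls(x, model, percent_to_timestep_function):
    # Base control runs against the base model; each per-device control clone
    # runs against the model clone living on the same device, falling back to
    # the base model when no matching clone exists for that device.
    if 'control' in x:
        x['control'].pre_run(model, percent_to_timestep_function)
        model_clones = getattr(model, 'multigpu_clones', {})
        for device, device_cnet in x['control'].multigpu_clones.items():
            device_model = model_clones.get(device, model)
            device_cnet.pre_run(device_model, percent_to_timestep_function)
```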

out[0].cached_patcher_init = (load_checkpoint_guess_config_model_only, (ckpt_path, embedding_directory, model_options, te_model_options))
if output_clip and out[1] is not None:
    out[1].patcher.cached_patcher_init = (load_checkpoint_guess_config_clip_only, (ckpt_path, embedding_directory, model_options, te_model_options))
out[0].cached_patcher_init = (load_checkpoint_guess_config, (ckpt_path, False, False, False, embedding_directory, output_model, model_options, te_model_options), 0)

⚠️ Potential issue | 🔴 Critical

Guard the model-side cache assignment.

load_checkpoint_guess_config_clip_only() reaches this path with output_model=False, so out[0] is None and this line raises before the CLIP patcher can be returned. It also leaves checkpoint-backed CLIP patchers without their own cached_patcher_init.

Possible fix
-    out[0].cached_patcher_init = (load_checkpoint_guess_config, (ckpt_path, False, False, False, embedding_directory, output_model, model_options, te_model_options), 0)
+    if out[0] is not None:
+        out[0].cached_patcher_init = (
+            load_checkpoint_guess_config,
+            (ckpt_path, False, False, False, embedding_directory, True, model_options, te_model_options),
+            0,
+        )
+    if out[1] is not None:
+        out[1].patcher.cached_patcher_init = (
+            load_checkpoint_guess_config,
+            (ckpt_path, False, True, False, embedding_directory, False, model_options, te_model_options),
+            1,
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/sd.py` at line 1557, The assignment to out[0].cached_patcher_init can
raise when out[0] is None (e.g. when called from
load_checkpoint_guess_config_clip_only()), so guard it: check that out[0] is not
None before assigning to out[0].cached_patcher_init and, if the CLIP patcher is
created separately for checkpoint-backed models, set its own cached_patcher_init
instead (or attach the init tuple to the patcher instance). Locate the
assignment line and ensure both code paths (when out[0] exists and when only a
CLIP patcher exists) receive a suitable cached_patcher_init tuple so no
AttributeError on None occurs and checkpoint-backed CLIP patchers keep their
cached_patcher_init.
