
MultiGPU Work Units For Accelerated Sampling #7063

Open
Kosinkadink wants to merge 113 commits into master from worksplit-multigpu

Conversation

@Kosinkadink
Member

Kosinkadink commented Mar 4, 2025

Overview

This PR adds support for MultiGPU acceleration via 'work unit' splitting - by default, conditioning is treated as work units. Any model that uses more than a single conditioning can be sped up via MultiGPU Work Units - positive+negative, multiple positive/masked conditionings, etc. The code is extensible, allowing extensions to implement their own work units; as a proof of concept, I have implemented AnimateDiff-Evolved contexts to behave as work units.

As long as there is a heavy bottleneck on the GPU, there will be a noticeable performance improvement. If the GPU is only lightly loaded (e.g. an RTX 4090 sampling a single 512x512 SD1.5 image), the overhead of splitting and combining work units will result in a performance loss compared to using just one GPU.

The MultiGPU Work Units node can be placed in (almost) any existing workflow. When only one device is found, the node does effectively nothing, so workflows making use of the node will stay compatible between single and multi-GPU setups:
image

The feature works best when work splitting is symmetrical (the GPUs are the same or have roughly the same performance), with the slowest GPU acting as the limiter. For asymmetrical setups, the MultiGPU Options node can be used to inform the load balancing code about the relative performance of the GPUs in the setup:
image
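
To give a feel for what relative-performance-aware splitting means, here is a rough, purely illustrative sketch (the split_work_units helper and the relative_speed weights are assumptions for the example, not the node's actual internals): work units are handed out roughly in proportion to each device's weight, so a slower card simply receives fewer units.

# Illustrative only: distribute num_units work units across devices in
# proportion to a per-device relative_speed weight.
def split_work_units(num_units: int, relative_speed: dict) -> dict:
    total = sum(relative_speed.values())
    allocation = {}
    remaining = num_units
    devices = list(relative_speed.items())
    for i, (device, speed) in enumerate(devices):
        if i == len(devices) - 1:
            share = remaining  # last device takes whatever is left
        else:
            share = min(remaining, round(num_units * speed / total))
        allocation[device] = share
        remaining -= share
    return allocation

# e.g. 8 work units on a fast + slow pair weighted 1.0 : 0.5
print(split_work_units(8, {"cuda:0": 1.0, "cuda:1": 0.5}))  # {'cuda:0': 5, 'cuda:1': 3}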

Nvidia (CUDA): Tested, works ✅.
AMD (ROCm): Untested, will validate soon.
AMD (DirectML): Untested.
Intel (Arc XPU): Tested, does not work on Windows but works on Linux ⚠️.

Implementation Details

Based on max_gpus and the number of available devices, the main ModelPatcher is cloned and relevant properties (like model) are deepcloned after the values are unloaded. MultiGPU clones are stored in the ModelPatcher's additional_models under the key multigpu. During sampling, the deepcloned ModelPatchers are re-cloned with the values from the main ModelPatcher, with any additional_models kept consistent. To avoid unnecessarily deepcloning models, currently_loaded_models from comfy.model_management is checked for a matching deepcloned model, in which case it is (soft) cloned and made to match the main ModelPatcher.
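
As a rough sketch of the clone-or-reuse decision described above (helper names like deepclone_for and is_multigpu_clone_of are illustrative placeholders, not the exact comfy.multigpu API):

# Illustrative sketch: for each extra device, reuse an already-loaded deepclone
# if one exists, otherwise create a fresh deep copy, then register the clones
# as additional_models under the "multigpu" key.
def build_device_clones(main_patcher, extra_devices, currently_loaded):
    clones = {}
    for device in extra_devices:
        existing = next((m for m in currently_loaded
                         if getattr(m, "is_multigpu_clone_of", None) is main_patcher
                         and m.load_device == device), None)
        if existing is not None:
            clones[device] = existing.clone()  # soft clone, weights already on the device
        else:
            clones[device] = main_patcher.deepclone_for(device)  # hypothetical deep-copy helper
    main_patcher.set_additional_models("multigpu", list(clones.values()))
    return clones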

When native conds are used as the work units, _calc_cond_batch calls and returns _calc_cond_batch_multigpu, avoiding a potential performance regression that refactoring the single-GPU code could introduce. In the future, this can be revisited so the same code path is reused, with careful performance comparisons across various models. No processes are created, only Python threads; while the GIL limits CPU parallelism, the GPU being the bottleneck makes diffusion I/O-bound rather than CPU-bound, so threads suffice. This vastly improves compatibility with existing code.
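
The general shape of the threaded path (simplified; the helper below illustrates the idea and is not the actual _calc_cond_batch_multigpu code) is one Python thread per device, each running its share of the conds on that device's model clone, with outputs moved back to the output device and gathered on the main thread:

import threading

# Illustrative sketch: GPU kernels release the GIL while they run, so plain
# Python threads are enough to keep multiple devices busy at once.
def run_cond_work_units(device_models, device_conds, x, timestep, output_device):
    results, errors = {}, []

    def worker(device, model, conds):
        try:
            out = model.apply_model(x.to(device), timestep.to(device), **conds)
            results[device] = out.to(output_device)
        except Exception as e:  # surface worker errors on the main thread
            errors.append(e)

    threads = [threading.Thread(target=worker, args=(d, device_models[d], device_conds[d]))
               for d in device_models]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    if errors:
        raise errors[0]
    return [results[d] for d in device_models]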

Since deepcloning requires that the base model is 'clean', comfy.model_management has received an unload_model_and_clones function to unload only specific models and their clones.

The --cuda-device startup argument has been refactored to accept a string rather than an int, allowing multiple ids to be provided while not breaking any existing usage:
image
image
This can be used not only to limit ComfyUI's visibility to a subset of devices per instance, but also to control their order (the first id is treated as device 0, the second as device 1, etc.).
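
As a sketch of what the string form implies (the parsing below is illustrative, not the exact cli_args/model_management code), a comma-separated value like "1,0" maps the listed ids onto device 0, device 1, ... in order:

# Illustrative: turn a --cuda-device string like "1,0" into an ordered list of
# physical device ids; index 0 of the result becomes this instance's device 0.
def parse_cuda_device_arg(value: str) -> list:
    return [int(part) for part in value.split(",") if part.strip() != ""]

print(parse_cuda_device_arg("1,0"))  # [1, 0] -> physical GPU 1 is device 0, GPU 0 is device 1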

Performance (will add more examples soon)

Wan 1.3B t2v: 1.85x uplift for 2 RTX 4090s vs 1 RTX 4090.
image
image

Wan 14B t2v: 1.89x uplift for 2 RTX 4090s vs 1 RTX 4090
image
image

…about the full scope of current sampling run, fix Hook Keyframes' guarantee_steps=1 inconsistent behavior with sampling split across different Sampling nodes/sampling runs by referencing 'sigmas'
…ches to use target_dict instead of target so that more information can be provided about the current execution environment if needed
… to separate out Wrappers/Callbacks/Patches into different hook types (all affect transformer_options)
… hook_type, modified necessary code to no longer need to manually separate out hooks by hook_type
…ptions to not conflict with the "sigmas" that will overwrite "sigmas" in _calc_cond_batch
…ade AddModelsHook operational and compliant with should_register result, moved TransformerOptionsHook handling out of ModelPatcher.register_all_hook_patches, support patches in TransformerOptionsHook properly by casting any patches/wrappers/hooks to proper device at sample time
…ops nodes by properly caching between positive and negative conds, make hook_patches_backup behave as intended (in the case that something pre-registers WeightHooks on the ModelPatcher instead of registering it at sample time)
…added some doc strings and removed a so-far unused variable
…ok to InjectionsHook (not yet implemented, but at least getting the naming figured out)
@monstari

I noticed that there’s no speed boost when using distilled models with CFG=1. Since Normalized Attention Guidance already provides similar negative conditioning at CFG=1, would it be possible to explore solutions similar to XDit’s parallel processing in the future?

Also, if we have more than two GPUs, I assume this solution wouldn’t be as useful, since we can only apply two conditioning streams.

Thanks again for all your work!

@QUTGXX

QUTGXX commented Aug 29, 2025

I tested this with a single RTX 5090 using the Wan 14B t2v model; generation took about 19.54 seconds. When combining the RTX 5090 with an RTX 4090 via MultiGPU work unit splitting, the time dropped to 11.52 seconds.

That is roughly a 1.7x speedup (RTX 5090 + RTX 4090).

That said, I have not yet been able to get Sage Attention and Torch Compile working with the MultiGPU setup, and I hope that can be resolved soon.

Overall, this feature is very promising, especially for users running mixed GPU configurations.

Note on the errors encountered when running Sage:

!!! Exception during processing !!! Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
Traceback (most recent call last):
  File "/home/rtl-6/execution.py", line 496, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
  ...
    execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
  ...
    process_inputs(input_dict, i)
  File "/home/rtl-6/execution.py", line 277, in process_inputs
    result = f(**inputs)
  File "/home/rtl-6/nodes.py", line 1521, in sample
    return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
  File "/home/rtl-6/nodes.py", line 1488, in common_ksampler
    samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
  File "/home/rtl-6/comfy/sample.py", line 45, in sample
    samples = sampler.sample(noise, positive, negative, cfg=cfg, latent_image=latent_image, start_step=start_step, last_step=last_step, force_full_denoise=force_full_denoise, denoise_mask=noise_mask, sigmas=sigmas, callback=callback, disable_pbar=disable_pbar, seed=seed)
  File "/home/rtl-6/comfy/samplers.py", line 1355, in sample
    return sample(self.model, noise, positive, negative, cfg, self.device, sampler, sigmas, self.model_options, latent_image=latent_image, denoise_mask=denoise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)
  ...
    cfg_guider.sample(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
  File "/home/rtl-6/comfy/samplers.py", line 1230, in sample
    output = executor.execute(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
  File "/home/rtl-6/comfy/patcher_extension.py", line 113, in execute
    return self.original(*args, **kwargs)
  File "/home/rtl-6/comfy/samplers.py", line 1196, in outer_sample
    output = self.inner_sample(noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
  File "/home/rtl-6/comfy/samplers.py", line 1175, in inner_sample
    samples = executor.execute(self, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
  File "/home/rtl-6/comfy/patcher_extension.py", line 113, in execute
    return self.original(*args, **kwargs)
  File "/home/rtl-6/comfy/samplers.py", line 954, in sample
    samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
  ...
  File "/home/rtl-6/comfy/k_diffusion/sampling.py", line 190, in sample_euler
    denoised = model(x, sigma_hat * s_in, **extra_args)
  File "/home/rtl-6/comfy/samplers.py", line 604, in __call__
    out = self.inner_model(x, sigma, model_options=model_options, seed=seed)
  File "/home/rtl-6/comfy/samplers.py", line 1155, in __call__
    return self.predict_noise(*args, **kwargs)
  File "/home/rtl-6/comfy/samplers.py", line 1158, in predict_noise
    return sampling_function(self.inner_model, x, timestep, self.conds.get("negative", None), self.conds.get("positive", None), self.cfg, model_options=model_options, seed=seed)
  ...
    conds, x, timestep, model_options)
  File "/home/rtl-6/comfy/samplers.py", line 211, in calc_cond_batch
    return executor.execute(model, conds, x_in, timestep, model_options)
  File "/home/rtl-6/comfy/patcher_extension.py", line 113, in execute
    return self.original(*args, **kwargs)
  File "/home/rtl-6/comfy/samplers.py", line 215, in _calc_cond_batch
    return _calc_cond_batch_multigpu(model, conds, x_in, timestep, model_options)
  File "/home/rtl-6/comfy/samplers.py", line 530, in _calc_cond_batch_multigpu
    raise error
  File "/usr/local/lib/python3.12/threading.py", line 1052, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.12/threading.py", line 989, in run
    self._target(*self._args, **self._kwargs)
  File "/home/rtl-6/comfy/samplers.py", line 511, in _handle_batch
    output = model_current.apply_model(input_x, timestep_, **c).to(output_device).chunk(batch_chunks)
  File "/home/rtl-6/comfy/model_base.py", line 152, in apply_model
    return comfy.patcher_extension.WrapperExecutor.new_class_executor(
  File "/home/rtl-6/comfy/patcher_extension.py", line 113, in execute
    return self.original(*args, **kwargs)
  File "/home/rtl-6/comfy/model_base.py", line 190, in apply_model
    model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds).float()
  File "/home/rtl-6/Python-3.12.0/comfy-env-3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/rtl-6/Python-3.12.0/comfy-env-3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/rtl-6/comfy/ldm/wan/model.py", line 580, in forward
    return self.forward_orig(x, timestep, context, clip_fea=clip_fea, freqs=freqs, transformer_options=transformer_options, **kwargs)[:, :, :t, :h, :w]
  File "/home/rtl-6/comfy/ldm/wan/model.py", line 550, in forward_orig
    x = block(x, e=e0, freqs=freqs, context=context, context_img_len=context_img_len)
  File "/home/rtl-6/Python-3.12.0/comfy-env-3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/rtl-6/Python-3.12.0/comfy-env-3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/rtl-6/comfy/ldm/wan/model.py", line 221, in forward
    y = self.self_attn(
  File "/home/rtl-6/Python-3.12.0/comfy-env-3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/rtl-6/Python-3.12.0/comfy-env-3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/rtl-6/comfy/ldm/wan/model.py", line 72, in forward
    x = optimized_attention(
  File "/home/rtl-6/Python-3.12.0/comfy-env-3.12/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 899, in _fn
    return fn(*args, **kwargs)
  File "/home/rtl-6/custom_nodes/comfyui-kjnodes/nodes/model_optimization_nodes.py", line 81, in attention_sage
    out = sage_func(q, k, v, attn_mask=mask, is_causal=False, tensor_layout=tensor_layout)
  File "/home/rtl-6/custom_nodes/comfyui-kjnodes/nodes/model_optimization_nodes.py", line 36, in func
    return sageattn(q, k, v, is_causal=is_causal, attn_mask=attn_mask, tensor_layout=tensor_layout)
  File "/home/rtl-6/Python-3.12.0/comfy-env-3.12/lib/python3.12/site-packages/sageattention/core.py", line 105, in sageattn
    q_int8, q_scale, k_int8, k_scale = per_block_int8(q, k, sm_scale=sm_scale, tensor_layout=tensor_layout)
  File "/home/rtl-6/Python-3.12.0/comfy-env-3.12/lib/python3.12/site-packages/sageattention/quant_per_block.py", line 63, in per_block_int8
    quant_per_block_int8_kernel[grid](
  File "/home/rtl-6/Python-3.12.0/comfy-env-3.12/lib/python3.12/site-packages/triton/runtime/jit.py", line 347, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/home/rtl-6/Python-3.12.0/comfy-env-3.12/lib/python3.12/site-packages/triton/runtime/jit.py", line 591, in run
    kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata,
  File "/home/rtl-6/Python-3.12.0/comfy-env-3.12/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 529, in __call__
    self.launch(gridX, gridY, gridZ, stream, function, self.launch_cooperative_grid, global_scratch, *args)
ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)

When I set max_gpus to 1, Sage Attention and Torch Compile work, but they do not work with multiple GPUs.

Hello, I also want to run this model with 4 RTX 4090 GPUs. Can you share the workflow?

@DKingAlpha

DKingAlpha commented Aug 30, 2025

@Kosinkadink AMD users will need this patch

diff --git a/comfy/model_management.py b/comfy/model_management.py
index 4ac04b8b..b396a034 100644
--- a/comfy/model_management.py
+++ b/comfy/model_management.py
@@ -194,7 +194,7 @@ def get_all_torch_devices(exclude_current=False):
     global cpu_state
     devices = []
     if cpu_state == CPUState.GPU:
-        if is_nvidia():
+        if is_nvidia() or is_amd():
             for i in range(torch.cuda.device_count()):
                 devices.append(torch.device(i))
         elif is_intel_xpu():

I couldn't confirm further at first - my multigpu setup broke recently and I couldn't tell whether the problem was in this branch or in my ROCm stack.
Oh, I got it working. Turned out my local (outdated) flash attention installation was broken to some extent.

@jkyamog

jkyamog commented Sep 3, 2025

Thanks for all the work done here. I added 1 more GPU to my 3x 3090 setup. I was trying with WAN video models but it only used 2 GPUs, which I assumed was because 3 is not a power of two. So I took a smaller 3060 12GB GPU from another system so I could run 4 GPUs. I then downgraded to WAN2.1 t2v 1.3B so it fit in VRAM on all GPUs, including the 3060. But it behaves the same as running 3 GPUs: only 2 GPUs are actually doing work, even though all of the GPUs have had VRAM loaded onto them. Is this expected? Here is what the typical load looks like during a video generation where only 2 GPUs are doing work. Btw, I did reorder the GPUs using CUDA_VISIBLE_DEVICES.
Screenshot 2025-09-03 at 2 27 22 PM

@Kosinkadink
Member Author

Kosinkadink commented Sep 10, 2025

Thank you for the additional info!

@DKingAlpha thanks for the heads up!

Firstly, has anyone here been able to get this working on Linux (not WSL)? And if so, what type of GPUs were they?

Secondly, @jkyamog this PR currently only does conditioning splitting - making conds run on separate GPUs. Wan2.1 has only two conds (positive and negative) without masking, so you can only accelerate it 2x with 2 GPUs - the remaining GPUs have no work to be split off for them. The same issue applies to models that only have one cond - there is nothing to split. I will be looking at some parallel attention schemes to try to overcome this limitation soon.
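
As a toy illustration of why the extra GPUs idle here (not the PR's actual scheduling code): with conds as the work units, a two-cond model can occupy at most two devices no matter how many are available.

# Illustrative only: round-robin conds over devices; with just positive+negative,
# devices beyond the first two never receive work.
def assign_conds_to_devices(conds, devices):
    assignment = {dev: [] for dev in devices}
    for i, cond in enumerate(conds):
        assignment[devices[i % len(devices)]].append(cond)
    return assignment

print(assign_conds_to_devices(["positive", "negative"], ["cuda:0", "cuda:1", "cuda:2", "cuda:3"]))
# {'cuda:0': ['positive'], 'cuda:1': ['negative'], 'cuda:2': [], 'cuda:3': []}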

I did not have as much time to look into the remaining issues as I expected - I apologize for the delay. I will keep looking into them, plus into acceleration beyond just conditioning splitting, soon.

@ExpandedMancho

Hi, in order to make this setup work with 2 GPUs, do you need enough VRAM to be able to run the Wan model twice on your first GPU?

I noticed I get OOM errors when the deepclone part starts. I'm guessing that the clone requires the full model to be loaded, and then also the copy of the model, before it can be moved to the 2nd GPU?

Thanks.

@Kosinkadink
Member Author

That should not be a requirement. What are your exact errors? (post full stack trace + workflow)

@ExpandedMancho

That should not be a requirement. What are your exact errors? (post full stack trace + workflow)

Hey, I found out that I'm going OOM when I use load_device = main_device in the WanVideo Model Loader with the MultiGPU Work Units WAN node set to 2 max_gpus. However, with 1 GPU and this exact setup (load_device = main_device), it does work.

  • I've tried with and without block swap and accelerator LoRAs, and it didn't make a difference for me.
  • If I use load_device = offload_device instead of main_device, the workflow works and I get no OOM, but then there's no deep cloning happening (checked the CLI) and the 2nd GPU doesn't get used at all.
  • image

Specs

GPU: 2x RTX 5090 (32GB VRAM each)
CUDA Version: 12.8
RAM: 186 GB
OS: Linux
Python version: Python 3.11.11

Relevant libs

nvidia-cublas-cu12==12.8.4.1
nvidia-cuda-cupti-cu12==12.8.90
nvidia-cuda-nvrtc-cu12==12.8.93
nvidia-cuda-runtime-cu12==12.8.90
nvidia-cudnn-cu12==9.10.2.21
nvidia-cufft-cu12==11.3.3.83
nvidia-cufile-cu12==1.13.1.3
nvidia-curand-cu12==10.3.9.90
nvidia-cusolver-cu12==11.7.3.90
nvidia-cusparse-cu12==12.5.8.93
nvidia-cusparselt-cu12==0.7.1
nvidia-nccl-cu12==2.27.3
nvidia-nvjitlink-cu12==12.8.93
nvidia-nvshmem-cu12==3.2.5
nvidia-nvtx-cu12==12.8.90
open_clip_torch==2.32.0
pytorch-triton==3.3.1+gitc8757738
rotary-embedding-torch==0.8.8
torch==2.9.0.dev20250629+cu128
torchaudio==2.8.0.dev20250629+cu128
torchsde==0.2.6
torchvision==0.23.0.dev20250629+cu128

ComfyUI startup (pytorch attention)

Set cuda device to: 0,1
Total VRAM 32120 MB, total RAM 386469 MB
pytorch version: 2.9.0.dev20250629+cu128
Enabled fp16 accumulation.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 5090 : cudaMallocAsync
Device: cuda:1 NVIDIA GeForce RTX 5090 : cudaMallocAsync
Using pytorch attention
Python version: 3.11.11 (main, Dec 4 2024, 08:55:07) [GCC 11.4.0]
ComfyUI version: 0.3.46
ComfyUI frontend version: 1.23.4

Workflow

https://pastebin.com/g2xnixXt

Stack Trace

==========SERVER got prompt==========
prompt event_id: []
Failed to copy /tmp/models/diffusion_models/MelBandRoformer_fp16.safetensors to temp dir: '/tmp/models/diffusion_models/MelBandRoformer_fp16.safetensors' is not in the subpath of '/workspace/ComfyUI' OR one path is relative and the other is absolute. falling back to original path
Converted mono input to stereo.
Resampling input 8000 to 44100
Processing chunks: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  2.97it/s]
[MultiTalk] --- Raw speaker lengths (samples) ---
  speaker 1: 244000 samples (shape: torch.Size([1, 1, 244000]))
[MultiTalk] total raw duration = 15.250s
[MultiTalk] multi_audio_type=para | final waveform shape=torch.Size([1, 1, 244000]) | length=244000 samples | seconds=15.250s (expected max of raw)
Failed to copy /tmp/models/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors to temp dir: '/tmp/models/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors' is not in the subpath of '/workspace/ComfyUI' OR one path is relative and the other is absolute. falling back to original path
CLIP layer names written to clip_layers.txt
clip_target: <comfy.sd.load_text_encoder_state_dicts.<locals>.EmptyClass object at 0x7785e868ee10> parameters: 5685458817 model_options: {'load_device': device(type='cpu'), 'offload_device': device(type='cpu')}
Using scaled fp8: fp8 matrix mult: False, scale input: False
CLIP/text encoder model load device: cpu, offload device: cpu, current: cuda:0, dtype: torch.float16
Requested to load WanTEModel
loaded completely 9.5367431640625e+25 6419.477203369141 True
Requested to load CLIPVisionModelProjection
loaded completely 28918.5119140625 1208.09814453125 True
Clip embeds shape: torch.Size([1, 257, 1280]), dtype: torch.float32
Combined clip embeds shape: torch.Size([1, 257, 1280])
CUDA Compute Capability: 12.0
Detected model in_channels: 36
Model cross attention type: i2v, num_heads: 40, num_layers: 40
Model variant detected: i2v_480
InfiniteTalk detected, patching model...
model_type FLOW
Creating deepclone of WanVideoModel for cuda:1.
!!! Exception during processing !!! class_type: MultiGPU_WorkUnitsWAN node_id: 377 ex: Allocation on device 
Traceback (most recent call last):
  File "/workspace/ComfyUI/execution.py", line 482, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/ComfyUI/execution.py", line 292, in get_output_data
    return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/ComfyUI/execution.py", line 266, in _async_map_node_over_list
    await process_inputs(input_dict, i)
  File "/workspace/ComfyUI/execution.py", line 254, in process_inputs
    result = f(**inputs)
             ^^^^^^^^^^^
  File "/workspace/ComfyUI/comfy_extras/nodes_multigpu.py", line 71, in init_multigpu
    model = comfy.multigpu.create_multigpu_deepclones(model, max_gpus, gpu_options, reuse_loaded=True)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/ComfyUI/comfy/multigpu.py", line 90, in create_multigpu_deepclones
    device_patcher = model.deepclone_multigpu(new_load_device=device)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/ComfyUI/comfy/model_patcher.py", line 349, in deepclone_multigpu
    n.model = copy.deepcopy(n.model)
              ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 271, in _reconstruct
    state = deepcopy(state, memo)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 146, in deepcopy
    y = copier(x, memo)
        ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 146, in deepcopy
    y = copier(x, memo)
        ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 146, in deepcopy
    y = copier(x, memo)
        ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 153, in deepcopy
    y = copier(memo)
        ^^^^^^^^^^^^
  File "/workspace/venv_cu128/lib/python3.11/site-packages/torch/_tensor.py", line 178, in __deepcopy__
    new_storage = self._typed_storage()._deepcopy(memo)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/venv_cu128/lib/python3.11/site-packages/torch/storage.py", line 1139, in _deepcopy
    return self._new_wrapped_storage(copy.deepcopy(self._untyped_storage, memo))
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/copy.py", line 153, in deepcopy
    y = copier(memo)
        ^^^^^^^^^^^^
  File "/workspace/venv_cu128/lib/python3.11/site-packages/torch/storage.py", line 243, in __deepcopy__
    new_storage = self.clone()
                  ^^^^^^^^^^^^
  File "/workspace/venv_cu128/lib/python3.11/site-packages/torch/storage.py", line 257, in clone
    return type(self)(self.nbytes(), device=self.device).copy_(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: Allocation on device 

Got an OOM, unloading all loaded models. node_id: 377 class_type: MultiGPU_WorkUnitsWAN class_def: <class '/workspace/ComfyUI/comfy_extras/nodes_multigpu.MultiGPUWorkUnitsNodeWAN'>
Prompt executed in 32.86 seconds

@Kosinkadink
Member Author

Kosinkadink commented Sep 17, 2025

Multigpu work units are a feature only for nodes that use native sampling or specifically reimplement support - the node you're looking at is a wrapper custom node that does not use native sampling.

@edflyer

edflyer commented Oct 19, 2025

Do you have a workflow that I can test to see if I installed this correctly?

@edflyer

edflyer commented Oct 19, 2025

Thanks for all the work done here. I added 1 more GPU to my 3x 3090 setup. I was trying with WAN video models but it only used 2 GPUs because 3 is not a binary number. So I took a smaller 3060 12GB GPU from another system, so I can run 4 GPUs. I then downgraded to WAN2.1 t2v 1.3B so it fit on all GPU VRAM including the 3060. But it seems it behaves similarly to running 3 GPUs, only 2 GPUs are actually doing work even all of the GPU has had VRAM loaded in. Is this expected? Here is what the typical load looks like through a video generation where only 2 GPUs are doing work. Btw I did reorder the GPUs using CUDA_VISIBLE_DEVICES. Screenshot 2025-09-03 at 2 27 22 PM

What workflow are you using?

@rattus128
Contributor

Firstly, has anyone here been able to get this working on Linux (not WSL)? And if so, what type of GPUs were they?

I think I have it working with a fix.

2xA40 on a runpod. I reproduced black outputs, and colorful noise in flux-dev fp8, cfg=1.1.

root@f00a481f73a0:~/ComfyUI# cat /etc/*-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=24.04
DISTRIB_CODENAME=noble
DISTRIB_DESCRIPTION="Ubuntu 24.04.3 LTS"
PRETTY_NAME="Ubuntu 24.04.3 LTS"

I got a black screen on some async tensor casting experiments I was doing for another change, and debugged it to a race between the CUDA streams and the PyTorch garbage collector, so I thought I'd check for the same bug here. I remember @Kosinkadink saying in a Discord post that this was blocked by black screens.

So I think something similar is going on here: the GPU->GPU ops are asynchronous with respect to the CPU, and the CPU can run ahead and queue a cudaAsyncFree on one GPU while the other is still bus-mastering the .to transfers, depending on who is the bus master and tensor owner. In the case of pull DMA this can easily become a race that corrupts tensors before the transfer completes. PyTorch documentation is sparse on this, so it's all theory.

So if I'm right, this can be fixed by always bounce-buffering through RAM, which syncs the CPU:

diff --git a/comfy/samplers.py b/comfy/samplers.py
index ed702304..a93dbde4 100755
--- a/comfy/samplers.py
+++ b/comfy/samplers.py
@@ -158,7 +158,7 @@ def cond_cat(c_list, device=None):
         conds = temp[k]
         out[k] = conds[0].concat(conds[1:])
         if device is not None and hasattr(out[k], 'to'):
-            out[k] = out[k].to(device)
+            out[k] = out[k].cpu().to(device)
 
     return out
 
@@ -470,7 +470,7 @@ def _calc_cond_batch_multigpu(model: BaseModel, conds: list[list[dict]], x_in: t
                         patches = p.patches
 
                     batch_chunks = len(cond_or_uncond)
-                    input_x = torch.cat(input_x).to(device)
+                    input_x = torch.cat(input_x).cpu().to(device)
                     c = cond_cat(c, device=device)
                     timestep_ = torch.cat([timestep.to(device)] * batch_chunks)
 
@@ -500,9 +500,9 @@ def _calc_cond_batch_multigpu(model: BaseModel, conds: list[list[dict]], x_in: t
                         c['control'] = device_control.get_control(input_x, timestep_, c, len(cond_or_uncond), transformer_options)
 
                     if 'model_function_wrapper' in model_options:
-                        output = model_options['model_function_wrapper'](model_current.apply_model, {"input": input_x, "timestep": timestep_, "c": c, "cond_or_uncond": cond_or_uncond}).to(output_device).chunk(batch_chunks)
+                        output = model_options['model_function_wrapper'](model_current.apply_model, {"input": input_x, "timestep": timestep_, "c": c, "cond_or_uncond": cond_or_uncond}).cpu().to(output_device).chunk(batch_chunks)
                     else:
-                        output = model_current.apply_model(input_x, timestep_, **c).to(output_device).chunk(batch_chunks)
+                        output = model_current.apply_model(input_x, timestep_, **c).cpu().to(output_device).chunk(batch_chunks)
                     results.append(thread_result(output, mult, area, batch_chunks, cond_or_uncond))
         except Exception as e:
             results.append(thread_result(None, None, None, None, None, error=e))

There is in theory a performance penalty here, as it changes the DMA path from master-slave to master-RAM-master, but I'm not observing any penalty in my initial tests.

Here is B=4 1024x1024 cfg=1.1 Flux dev speeds:

1 GPU

100%|████████████████████████████████████████████████████████| 20/20 [02:20<00:00,  7.00s/it]

2 GPUs - This branch unchanged (corrupted output)

100%|███████████████████████████████████████████████████| 20/20 [01:22<00:00,  4.11s/it]

2 GPUs - With above fix

100%|███████████████████████████████████████████████████| 20/20 [01:22<00:00,  4.11s/it]


Properly syncing the GPU->GPU DMA is a complex web of driver specifics, so this is a lot easier.

If this ends up being slow for other use cases (very large latents), you could chunk the .to as a series of queued copies instead, so the two bus masters start overlapping work and performance will likely converge on something very close to master-slave, given the above.
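
A minimal sketch of that chunked idea (a hypothetical helper, not part of this PR, and deliberately simplified - it still bounces through host RAM and does not show stream-level overlap):

import torch

# Hypothetical: move a tensor to another device slice by slice so the copies are
# queued piecewise instead of as one monolithic .to() transfer.
def chunked_transfer(src: torch.Tensor, device, chunks: int = 4) -> torch.Tensor:
    pieces = []
    for piece in src.chunk(chunks, dim=0):
        pieces.append(piece.cpu().to(device))  # bounce each slice through host RAM
    return torch.cat(pieces)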

Screenshot from 2025-10-26 18-20-53

@comfy-pr-bot
Member

Test Evidence Check

Kosinkadink and others added 3 commits February 17, 2026 02:53
…clone actual model object, fixed issues with merge, turn off cuda backend as it causes device mismatch issue with rope (and potentially other ops), will investigate
@coderabbitai

coderabbitai bot commented Mar 18, 2026

📝 Walkthrough

This PR introduces comprehensive multi-GPU support throughout the codebase. The --cuda-device CLI argument type changes from int to str to support multiple device specification. New infrastructure includes a multigpu module with GPU options management and load balancing utilities. ControlNet and ModelPatcher classes gain deep cloning methods for per-device instances. Sampling flows are refactored to parallelize conditioning across devices using threading. Model management is extended to handle unloading across multiple devices. New node classes expose multi-GPU configuration in workflows. The CUDA backend in quant_ops is unconditionally disabled. Changes span model management, sampling orchestration, patcher logic, and checkpoint loading initialization.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 16.47%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The title accurately and clearly summarizes the main change: introducing MultiGPU Work Units for accelerated sampling, which is the primary objective of this PR.
Description check | ✅ Passed | The description provides comprehensive context about the MultiGPU acceleration feature, implementation details, performance metrics, and hardware compatibility, all of which directly relate to the changeset.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 11

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@comfy_extras/nodes_multigpu.py`:
- Around line 66-69: The GPUOptionsGroup.clone() return value is being discarded
in create_gpu_options; capture and use the cloned object so we don't mutate the
caller-supplied gpu_options. Change the behavior in create_gpu_options to assign
the result of GPUOptionsGroup.clone() back to gpu_options (i.e., gpu_options =
gpu_options.clone()) and then continue using that local gpu_options, ensuring
each node gets its own cloned GPUOptionsGroup rather than sharing state.

In `@comfy/cli_args.py`:
- Line 52: The --cuda-device argument currently only accepts a single token;
update the parser.add_argument call for "--cuda-device" to accept multiple
space-separated device IDs by adding nargs='+' (and set type=int if you want
integer IDs) so that invocations like "--cuda-device 0 1" parse correctly;
alternatively, if you prefer comma-separated input, change the help text to
explicitly state the required format instead of implying plural support.

In `@comfy/controlnet.py`:
- Around line 322-328: The multigpu clone path in deepclone_multigpu currently
builds c = self.copy() which does not carry the previous_controlnet chain,
causing stacked ControlNets/T2IAdapters to be lost on secondary GPUs; update
deepclone_multigpu to copy previous_controlnet (and any linked
.previous_controlnet chain) from self to c after c = self.copy() so that the
full chain is preserved, then continue deep-copying control_model and wrapping
it as before (ensure multigpu_clones[load_device] assignment remains unchanged);
apply the same preservation of previous_controlnet chaining to the similar clone
code paths that use copy_to()/get_instance_for_device() so all per-device clones
keep the full previous_controlnet chain.

In `@comfy/model_management.py`:
- Around line 214-231: The function get_all_torch_devices currently only handles
NVIDIA/Intel/Ascend and can return an empty list (breaking exclude_current and
unload_all_models); update it to (1) add detection for other common backends
(e.g., ROCm/DirectML/MLU/MPS) or at minimum attempt generic torch backend checks
such as torch.cuda.device_count() and torch.backends.mps.is_available() and
append appropriate torch.device entries, (2) if after all backend checks devices
is still empty, append get_torch_device() as a safe fallback so callers always
get at least the current device, and (3) make the exclude_current branch robust
by checking membership before calling devices.remove(get_torch_device()); refer
to get_all_torch_devices, cpu_state, CPUState.GPU, is_nvidia, is_intel_xpu,
is_ascend_npu, and get_torch_device when implementing these fixes.

In `@comfy/model_patcher.py`:
- Around line 1315-1321: The ON_PREPARE_STATE callbacks are being invoked with
four positional args in prepare_state, breaking backward-compatibility for
callbacks that expect three; update prepare_state to detect each callback's
accepted arity (e.g., via inspect.signature or callable.__code__.co_argcount)
and call either callback(self, timestep, model_options, ignore_multigpu) if it
accepts 4 args or callback(self, timestep, model_options) if it only accepts 3
(or attempt the 4-arg call and fall back to 3-arg on TypeError), and apply the
same arity-gated invocation when recursing into multigpu clones; reference
prepare_state and CallbacksMP.ON_PREPARE_STATE to locate where to change the
callsite.

In `@comfy/multigpu.py`:
- Around line 60-112: create_multigpu_deepclones clones existing "multigpu"
additional models but never removes ones that exceed the new max_gpus; to fix,
after computing limit_extra_devices (the allowed device list) retrieve
model.get_additional_models_with_key("multigpu"), filter out any clone whose
load_device is not in ([model.load_device] + limit_extra_devices) (use each
ModelPatcher.load_device to decide), then call
model.set_additional_models("multigpu", filtered_list) before
match_multigpu_clones()/gpu_options.register; ensure reuse_loaded logic still
can find matching clones and that is_multigpu_base_clone flags remain correct
for retained clones.

In `@comfy/quant_ops.py`:
- Line 23: The unconditional call ck.registry.disable("cuda") in
comfy/quant_ops.py should be removed and only invoked when the unsupported
multigpu+cuda combination is actually active; locate the
ck.registry.disable("cuda") invocation and wrap it with a guard that checks the
real multigpu/backend state (for example an existing multigpu flag or function
like is_multigpu_enabled(), a config/ENV check, or the code path that handles
multigpu setup) so that CUDA is only disabled when multigpu is enabled and the
specific backend combination is unsupported, otherwise leave CUDA enabled for
normal single-GPU runs.

In `@comfy/sampler_helpers.py`:
- Line 200: Add the missing BaseModel import used in the type annotation for
real_model (the line "real_model: BaseModel = model.model") by adding "from
typing import TYPE_CHECKING" already present and then inside the existing
TYPE_CHECKING block import BaseModel from its module (e.g., "from <module>
import BaseModel") so the annotation is defined at type-check time;
alternatively remove the BaseModel annotation if you prefer not to add the
import.

In `@comfy/samplers.py`:
- Around line 391-397: The multigpu scheduler currently ignores multigpu_options
and uses integer floor division (//) inside math.ceil, producing coarse,
incorrect splits; update the batching logic around devices,
device_batched_hooked_to_run, total_conds, hooked_to_run and conds_per_device to
consult multigpu_options (specifically the relative_speed entry for each device
clone) and distribute total_conds proportionally to those relative_speed weights
(then ceil each device's share and ensure at least 1 if there are any
conditions), replacing the math.ceil(total_conds//len(devices)) approach with a
proper float division and per-device allocation; keep device ordering based on
model_options['multigpu_clones'].keys() and ensure the same proportional logic
is applied in the other affected blocks (lines mentioned: 403-416, 433-435) so
the MultiGPU Options node actually affects work distribution.
- Around line 847-850: The code calls x['control'].pre_run(model, ...) for the
base control and then calls device_cnet.pre_run(model, ...) for each control
clone, incorrectly passing the base model to per-device controls; update the
loop to pass the matching per-device model clone instead. Specifically, when
iterating x['control'].multigpu_clones (the device_cnet clones), look up the
corresponding model clone (e.g., from model.multigpu_clones using the same
keys/ids) and call device_cnet.pre_run(model_clone,
percent_to_timestep_function) so each control clone receives its matching model
clone; keep the initial x['control'].pre_run(model, ...) for the base control.

In `@comfy/sd.py`:
- Line 1557: The assignment to out[0].cached_patcher_init can raise when out[0]
is None (e.g. when called from load_checkpoint_guess_config_clip_only()), so
guard it: check that out[0] is not None before assigning to
out[0].cached_patcher_init and, if the CLIP patcher is created separately for
checkpoint-backed models, set its own cached_patcher_init instead (or attach the
init tuple to the patcher instance). Locate the assignment line and ensure both
code paths (when out[0] exists and when only a CLIP patcher exists) receive the
same cached_patcher_init tuple so no NullReference occurs and checkpoint-backed
CLIP patchers keep their cached_patcher_init.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: acbb973a-f861-46f7-805a-a50660ad4671

📥 Commits

Reviewing files that changed from the base of the PR and between b941913 and f410d28.

📒 Files selected for processing (12)
  • comfy/cli_args.py
  • comfy/controlnet.py
  • comfy/model_management.py
  • comfy/model_patcher.py
  • comfy/multigpu.py
  • comfy/patcher_extension.py
  • comfy/quant_ops.py
  • comfy/sampler_helpers.py
  • comfy/samplers.py
  • comfy/sd.py
  • comfy_extras/nodes_multigpu.py
  • nodes.py

Comment on lines +66 to +69
    def create_gpu_options(self, device_index: int, relative_speed: float, gpu_options: comfy.multigpu.GPUOptionsGroup=None):
        if not gpu_options:
            gpu_options = comfy.multigpu.GPUOptionsGroup()
        gpu_options.clone()

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -e
FILE=$(fd '^multigpu\.py$' comfy)
rg -n -C4 'class GPUOptionsGroup|def clone\(' "$FILE"

Repository: Comfy-Org/ComfyUI

Length of output: 793


🏁 Script executed:

#!/bin/bash
set -e
FILE=$(fd '^nodes_multigpu\.py$' comfy_extras)
sed -n '66,80p' "$FILE"

Repository: Comfy-Org/ComfyUI

Length of output: 529


Capture the result of gpu_options.clone() on line 69.

The cloned object is being discarded. Since GPUOptionsGroup.clone() returns a new object, the current code ignores it and then mutates the caller-supplied gpu_options directly. When multiple MultiGPU Options nodes share the same GPU options group, this causes unintended state sharing across the node chain.

Change line 69 from:

gpu_options.clone()

to:

gpu_options = gpu_options.clone()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy_extras/nodes_multigpu.py` around lines 66 - 69, The
GPUOptionsGroup.clone() return value is being discarded in create_gpu_options;
capture and use the cloned object so we don't mutate the caller-supplied
gpu_options. Change the behavior in create_gpu_options to assign the result of
GPUOptionsGroup.clone() back to gpu_options (i.e., gpu_options =
gpu_options.clone()) and then continue using that local gpu_options, ensuring
each node gets its own cloned GPUOptionsGroup rather than sharing state.

parser.add_argument("--auto-launch", action="store_true", help="Automatically launch ComfyUI in the default browser.")
parser.add_argument("--disable-auto-launch", action="store_true", help="Disable auto launching the browser.")
parser.add_argument("--cuda-device", type=int, default=None, metavar="DEVICE_ID", help="Set the id of the cuda device this instance will use. All other devices will not be visible.")
parser.add_argument("--cuda-device", type=str, default=None, metavar="DEVICE_ID", help="Set the ids of cuda devices this instance will use. All other devices will not be visible.")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--cuda-device", type=str, default=None)

for argv in (
    ["--cuda-device", "0,1"],
    ["--cuda-device", "0", "1"],
):
    try:
        ns = parser.parse_args(argv)
        print(argv, "->", ns.cuda_device)
    except SystemExit as e:
        print(argv, "-> parse failed with exit", e.code)
PY

Repository: Comfy-Org/ComfyUI

Length of output: 243


🏁 Script executed:

rg "cuda_device" -A 3 -B 3 --max-count=20

Repository: Comfy-Org/ComfyUI

Length of output: 702


🏁 Script executed:

rg "(cuda-device|cuda_device)" -l

Repository: Comfy-Org/ComfyUI

Length of output: 86


🏁 Script executed:

git log --oneline --all -n 1

Repository: Comfy-Org/ComfyUI

Length of output: 113


🏁 Script executed:

git diff HEAD~1 HEAD -- comfy/cli_args.py 2>/dev/null || git show HEAD:comfy/cli_args.py 2>/dev/null | grep -A 5 -B 5 "cuda-device"

Repository: Comfy-Org/ComfyUI

Length of output: 1367


--cuda-device 0 1 still won't parse.

Changing the type to str only changes the value type, not the option arity. argparse still accepts one token here, so --cuda-device 0 1 fails with "unrecognized arguments: 1". The help text mentions "ids" (plural), implying multi-device support, but the current implementation requires comma-separated format: --cuda-device 0,1. Either add nargs='+' to accept space-separated device IDs or clarify the help text to document the required comma-separated input format.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/cli_args.py` at line 52, The --cuda-device argument currently only
accepts a single token; update the parser.add_argument call for "--cuda-device"
to accept multiple space-separated device IDs by adding nargs='+' (and set
type=int if you want integer IDs) so that invocations like "--cuda-device 0 1"
parse correctly; alternatively, if you prefer comma-separated input, change the
help text to explicitly state the required format instead of implying plural
support.

Comment on lines +322 to +328
    def deepclone_multigpu(self, load_device, autoregister=False):
        c = self.copy()
        c.control_model = copy.deepcopy(c.control_model)
        c.control_model_wrapped = comfy.model_patcher.ModelPatcher(c.control_model, load_device=load_device, offload_device=comfy.model_management.unet_offload_device())
        if autoregister:
            self.multigpu_clones[load_device] = c
        return c

⚠️ Potential issue | 🟠 Major

Preserve the previous_controlnet chain in multigpu clones.

These new clone paths build c from copy(), but copy_to() does not carry previous_controlnet. Once get_instance_for_device() returns the per-device clone, stacked ControlNets/T2IAdapters on earlier links are silently dropped on secondary GPUs.

As per coding guidelines, comfy/** changes should focus on backward compatibility.

Also applies to: 952-958

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/controlnet.py` around lines 322 - 328, The multigpu clone path in
deepclone_multigpu currently builds c = self.copy() which does not carry the
previous_controlnet chain, causing stacked ControlNets/T2IAdapters to be lost on
secondary GPUs; update deepclone_multigpu to copy previous_controlnet (and any
linked .previous_controlnet chain) from self to c after c = self.copy() so that
the full chain is preserved, then continue deep-copying control_model and
wrapping it as before (ensure multigpu_clones[load_device] assignment remains
unchanged); apply the same preservation of previous_controlnet chaining to the
similar clone code paths that use copy_to()/get_instance_for_device() so all
per-device clones keep the full previous_controlnet chain.

Comment on lines +214 to +231
def get_all_torch_devices(exclude_current=False):
    global cpu_state
    devices = []
    if cpu_state == CPUState.GPU:
        if is_nvidia():
            for i in range(torch.cuda.device_count()):
                devices.append(torch.device(i))
        elif is_intel_xpu():
            for i in range(torch.xpu.device_count()):
                devices.append(torch.device(i))
        elif is_ascend_npu():
            for i in range(torch.npu.device_count()):
                devices.append(torch.device(i))
    else:
        devices.append(get_torch_device())
    if exclude_current:
        devices.remove(get_torch_device())
    return devices

⚠️ Potential issue | 🟠 Major

Handle non-CUDA backends in get_all_torch_devices.

This helper only enumerates CUDA/XPU/NPU devices, so ROCm/DirectML/MLU-style paths leave devices empty. With exclude_current=True that turns into a remove() failure, and unload_all_models() also stops freeing anything on those backends because it now routes through this helper.

As per coding guidelines, comfy/** changes should focus on backward compatibility and memory management/GPU resource handling.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/model_management.py` around lines 214 - 231, The function
get_all_torch_devices currently only handles NVIDIA/Intel/Ascend and can return
an empty list (breaking exclude_current and unload_all_models); update it to (1)
add detection for other common backends (e.g., ROCm/DirectML/MLU/MPS) or at
minimum attempt generic torch backend checks such as torch.cuda.device_count()
and torch.backends.mps.is_available() and append appropriate torch.device
entries, (2) if after all backend checks devices is still empty, append
get_torch_device() as a safe fallback so callers always get at least the current
device, and (3) make the exclude_current branch robust by checking membership
before calling devices.remove(get_torch_device()); refer to
get_all_torch_devices, cpu_state, CPUState.GPU, is_nvidia, is_intel_xpu,
is_ascend_npu, and get_torch_device when implementing these fixes.

Comment on lines +1315 to +1321
    def prepare_state(self, timestep, model_options, ignore_multigpu=False):
        for callback in self.get_all_callbacks(CallbacksMP.ON_PREPARE_STATE):
-           callback(self, timestep)
+           callback(self, timestep, model_options, ignore_multigpu)
        if not ignore_multigpu and "multigpu_clones" in model_options:
            for p in model_options["multigpu_clones"].values():
                p: ModelPatcher
                p.prepare_state(timestep, model_options, ignore_multigpu=True)

⚠️ Potential issue | 🟠 Major

Keep ON_PREPARE_STATE callback arity backward-compatible.

This now passes ignore_multigpu as a fourth positional argument to every existing ON_PREPARE_STATE callback. Any custom node still implementing the old (patcher, timestep, model_options) signature will fail with TypeError during sampling unless this is gated behind an arity check or opt-in keyword path.

As per coding guidelines, comfy/** changes should focus on backward compatibility.
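
A minimal sketch of the arity-gated call described here (illustrative; inspect-based detection is one of several possible options):

import inspect

# Illustrative: use the new 4-arg form only if the callback accepts it,
# otherwise fall back to the legacy 3-arg form.
def call_prepare_state_callback(callback, patcher, timestep, model_options, ignore_multigpu):
    try:
        params = inspect.signature(callback).parameters
        accepts_four = len(params) >= 4 or any(
            p.kind == inspect.Parameter.VAR_POSITIONAL for p in params.values())
    except (TypeError, ValueError):
        accepts_four = True  # uninspectable callables: just try the new form
    if accepts_four:
        callback(patcher, timestep, model_options, ignore_multigpu)
    else:
        callback(patcher, timestep, model_options)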

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/model_patcher.py` around lines 1315 - 1321, The ON_PREPARE_STATE
callbacks are being invoked with four positional args in prepare_state, breaking
backward-compatibility for callbacks that expect three; update prepare_state to
detect each callback's accepted arity (e.g., via inspect.signature or
callable.__code__.co_argcount) and call either callback(self, timestep,
model_options, ignore_multigpu) if it accepts 4 args or callback(self, timestep,
model_options) if it only accepts 3 (or attempt the 4-arg call and fall back to
3-arg on TypeError), and apply the same arity-gated invocation when recursing
into multigpu clones; reference prepare_state and CallbacksMP.ON_PREPARE_STATE
to locate where to change the callsite.

ck.registry.disable("cuda")
logging.warning("WARNING: You need pytorch with cu130 or higher to use optimized CUDA operations.")

ck.registry.disable("cuda") # multigpu will not work rn with comfy-kitchen on cuda backend

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's see the full context of the file around line 23
cat -n comfy/quant_ops.py | head -50

Repository: Comfy-Org/ComfyUI

Length of output: 1910


🏁 Script executed:

# Let's also check if there's any git context to see if this is a new change
git diff HEAD -- comfy/quant_ops.py | head -100

Repository: Comfy-Org/ComfyUI

Length of output: 43


🏁 Script executed:

# Search for multigpu-related logic in the codebase to understand the gating/conditions
rg -n "multigpu" --type py -B 2 -A 2

Repository: Comfy-Org/ComfyUI

Length of output: 25235


Don't disable the comfy-kitchen CUDA backend globally.

Line 23 disables it unconditionally at import time, which affects all CUDA runs including single-GPU and non-multigpu workflows. This regresses quantization performance for users not using multigpu. Gate the disable behind actual multigpu activation or the specific unsupported backend combination instead of doing it unconditionally here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/quant_ops.py` at line 23, The unconditional call
ck.registry.disable("cuda") in comfy/quant_ops.py should be removed and only
invoked when the unsupported multigpu+cuda combination is actually active;
locate the ck.registry.disable("cuda") invocation and wrap it with a guard that
checks the real multigpu/backend state (for example an existing multigpu flag or
function like is_multigpu_enabled(), a config/ENV check, or the code path that
handles multigpu setup) so that CUDA is only disabled when multigpu is enabled
and the specific backend combination is unsupported, otherwise leave CUDA
enabled for normal single-GPU runs.
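
A hedged sketch of the kind of guard the comment is asking for; `multigpu_active` stands in for whatever real flag or check ComfyUI exposes (hypothetical, not taken from the PR), and `ck` is the comfy-kitchen registry object already referenced in quant_ops.py:

```python
import logging

def maybe_disable_kitchen_cuda(ck, multigpu_active: bool) -> None:
    # Only disable the comfy-kitchen CUDA backend when the unsupported
    # multigpu + CUDA combination is actually in play, instead of doing it
    # unconditionally at import time and penalizing single-GPU runs.
    if multigpu_active:
        ck.registry.disable("cuda")
        logging.warning(
            "Disabling comfy-kitchen CUDA backend: MultiGPU work units are not supported on it yet."
        )
```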

minimum_memory_required += inference_memory
comfy.model_management.load_models_gpu([model] + models, memory_required=memory_required, minimum_memory_required=minimum_memory_required, force_full_load=force_full_load)
real_model = model.model
real_model: BaseModel = model.model

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, check the imports and line 200 in the file
head -20 comfy/sampler_helpers.py && echo "---" && sed -n '195,205p' comfy/sampler_helpers.py

Repository: Comfy-Org/ComfyUI

Length of output: 1130


🏁 Script executed:

# Search for all imports of BaseModel or where it might come from
rg -i "basemodel|from.*import|import" comfy/sampler_helpers.py | head -30

Repository: Comfy-Org/ComfyUI

Length of output: 463


🏁 Script executed:

# Check if there's a BaseModel defined or imported elsewhere that should be in this file
rg "class BaseModel|from.*BaseModel" comfy/

Repository: Comfy-Org/ComfyUI

Length of output: 50375


🏁 Script executed:

# Check git diff to see if line 200 is new
git diff HEAD^ comfy/sampler_helpers.py | grep -A 5 -B 5 "BaseModel"

Repository: Comfy-Org/ComfyUI

Length of output: 246


🏁 Script executed:

# Look for similar type annotations in the file to understand the pattern
rg ":\s*\w+\s*=" comfy/sampler_helpers.py | head -20

Repository: Comfy-Org/ComfyUI

Length of output: 100


🏁 Script executed:

# Check if there are any existing TYPE_CHECKING imports or patterns
sed -n '1,30p' comfy/sampler_helpers.py

Repository: Comfy-Org/ComfyUI

Length of output: 894


Add the missing BaseModel import to fix undefined name.

Line 200 uses BaseModel in the type annotation, but it's not imported. Add it to the TYPE_CHECKING block to match the file's existing pattern, or remove the annotation.

Suggested fix
 if TYPE_CHECKING:
     from comfy.model_patcher import ModelPatcher
     from comfy.controlnet import ControlBase
+    from comfy.model_base import BaseModel
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
real_model: BaseModel = model.model
if TYPE_CHECKING:
    from comfy.model_patcher import ModelPatcher
    from comfy.controlnet import ControlBase
    from comfy.model_base import BaseModel
🧰 Tools
🪛 GitHub Actions: Python Linting

[error] 200-200: F821 Undefined name BaseModel detected by ruff check. Ensure BaseModel is imported or defined.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/sampler_helpers.py` at line 200, Add the missing BaseModel import used
in the type annotation for real_model (the line "real_model: BaseModel =
model.model"). The "from typing import TYPE_CHECKING" import is already present,
so import BaseModel inside the existing TYPE_CHECKING block (e.g., "from
<module> import BaseModel") so the annotation is defined at type-check time;
alternatively, remove the BaseModel annotation if you prefer not to add the
import.
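
For reference, a self-contained illustration of the TYPE_CHECKING pattern the suggested fix relies on; the function below is a placeholder, not code from the PR:

```python
from __future__ import annotations  # keeps annotations as strings at runtime

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by static type checkers (mypy, pyright); never imported
    # at runtime, so it cannot create import cycles or slow down startup.
    from comfy.model_base import BaseModel

def unwrap(model) -> BaseModel:
    # Local variable annotations are not evaluated at runtime either, so
    # referencing BaseModel here is safe even though it was never imported.
    real_model: BaseModel = model.model
    return real_model
```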

Comment on lines +391 to +397
devices = [dev_m for dev_m in model_options['multigpu_clones'].keys()]
device_batched_hooked_to_run: dict[torch.device, list[tuple[comfy.hooks.HookGroup, tuple]]] = {}

total_conds = 0
for to_run in hooked_to_run.values():
    total_conds += len(to_run)
conds_per_device = max(1, math.ceil(total_conds//len(devices)))

⚠️ Potential issue | 🟠 Major

relative_speed is not used by the multigpu scheduler.

This branch still computes a fixed conds_per_device and round-robins by raw condition count; multigpu_options is never consulted here. The new MultiGPU Options node therefore has no effect on work distribution, and the // inside math.ceil(...) floors the ratio before the ceiling is applied, so the per-device quota comes out too small on uneven counts.

As per coding guidelines, comfy/** changes should focus on performance implications in hot paths.

Also applies to: 403-416, 433-435

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/samplers.py` around lines 391 - 397, The multigpu scheduler currently
ignores multigpu_options and uses integer floor division (//) inside math.ceil,
producing coarse, incorrect splits; update the batching logic around devices,
device_batched_hooked_to_run, total_conds, hooked_to_run and conds_per_device to
consult multigpu_options (specifically the relative_speed entry for each device
clone) and distribute total_conds proportionally to those relative_speed weights
(then ceil each device's share and ensure at least 1 if there are any
conditions), replacing the math.ceil(total_conds//len(devices)) approach with a
proper float division and per-device allocation; keep device ordering based on
model_options['multigpu_clones'].keys() and ensure the same proportional logic
is applied in the other affected blocks (lines mentioned: 403-416, 433-435) so
the MultiGPU Options node actually affects work distribution.
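
A sketch of the proportional allocation the prompt describes, assuming each device's relative_speed is available as a float weight (the names and the exact shape of multigpu_options are assumptions, not taken from the PR):

```python
import math

def allocate_conds_per_device(total_conds: int, device_weights: dict[str, float]) -> dict[str, int]:
    # Split total_conds across devices in proportion to their relative speed,
    # guaranteeing at least one cond per device whenever any conds exist.
    total_weight = sum(device_weights.values())
    shares: dict[str, int] = {}
    for device, weight in device_weights.items():
        share = math.ceil(total_conds * (weight / total_weight))
        shares[device] = max(1, share) if total_conds > 0 else 0
    return shares

# Example: 10 conds over a fast and a slow GPU (weights 2.0 and 1.0)
# -> {'cuda:0': 7, 'cuda:1': 4}; a caller would cap the running total at total_conds.
print(allocate_conds_per_device(10, {"cuda:0": 2.0, "cuda:1": 1.0}))
```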

Comment on lines 847 to +850
if 'control' in x:
    x['control'].pre_run(model, percent_to_timestep_function)
    for device_cnet in x['control'].multigpu_clones.values():
        device_cnet.pre_run(model, percent_to_timestep_function)

⚠️ Potential issue | 🟠 Major

Run per-device controls against the matching model clone.

These new pre_run() calls feed every device clone the base model. Any control that snapshots model-specific state during pre_run() will capture the wrong device/model; QwenFunControlNet.pre_run() in this file already stores model.diffusion_model, so its multigpu clone will still point at the base UNet.

As per coding guidelines, comfy/** changes should focus on backward compatibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/samplers.py` around lines 847 - 850, The code calls
x['control'].pre_run(model, ...) for the base control and then calls
device_cnet.pre_run(model, ...) for each control clone, incorrectly passing the
base model to per-device controls; update the loop to pass the matching
per-device model clone instead. Specifically, when iterating
x['control'].multigpu_clones (the device_cnet clones), look up the corresponding
model clone (e.g., from model.multigpu_clones using the same keys/ids) and call
device_cnet.pre_run(model_clone, percent_to_timestep_function) so each control
clone receives its matching model clone; keep the initial
x['control'].pre_run(model, ...) for the base control.
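
A sketch of the per-device lookup the prompt suggests, under the assumption that the model object exposes a multigpu_clones dict keyed by the same devices as the control's clones (that mapping is an assumption, not verified against the PR):

```python
def pre_run_controls(x, model, percent_to_timestep_function):
    # Base control runs against the base model; each per-device control clone
    # runs against the model clone living on the same device, falling back to
    # the base model when no matching clone exists for that device.
    if 'control' in x:
        x['control'].pre_run(model, percent_to_timestep_function)
        model_clones = getattr(model, 'multigpu_clones', {})
        for device, device_cnet in x['control'].multigpu_clones.items():
            device_model = model_clones.get(device, model)
            device_cnet.pre_run(device_model, percent_to_timestep_function)
```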

out[0].cached_patcher_init = (load_checkpoint_guess_config_model_only, (ckpt_path, embedding_directory, model_options, te_model_options))
if output_clip and out[1] is not None:
    out[1].patcher.cached_patcher_init = (load_checkpoint_guess_config_clip_only, (ckpt_path, embedding_directory, model_options, te_model_options))
out[0].cached_patcher_init = (load_checkpoint_guess_config, (ckpt_path, False, False, False, embedding_directory, output_model, model_options, te_model_options), 0)

⚠️ Potential issue | 🔴 Critical

Guard the model-side cache assignment.

load_checkpoint_guess_config_clip_only() reaches this path with output_model=False, so out[0] is None and this line raises before the CLIP patcher can be returned. It also leaves checkpoint-backed CLIP patchers without their own cached_patcher_init.

Possible fix
-    out[0].cached_patcher_init = (load_checkpoint_guess_config, (ckpt_path, False, False, False, embedding_directory, output_model, model_options, te_model_options), 0)
+    if out[0] is not None:
+        out[0].cached_patcher_init = (
+            load_checkpoint_guess_config,
+            (ckpt_path, False, False, False, embedding_directory, True, model_options, te_model_options),
+            0,
+        )
+    if out[1] is not None:
+        out[1].patcher.cached_patcher_init = (
+            load_checkpoint_guess_config,
+            (ckpt_path, False, True, False, embedding_directory, False, model_options, te_model_options),
+            1,
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/sd.py` at line 1557, The assignment to out[0].cached_patcher_init can
raise when out[0] is None (e.g. when called from
load_checkpoint_guess_config_clip_only()), so guard it: check that out[0] is not
None before assigning to out[0].cached_patcher_init and, if the CLIP patcher is
created separately for checkpoint-backed models, set its own cached_patcher_init
instead (or attach the init tuple to the patcher instance). Locate the
assignment line and ensure both code paths (when out[0] exists and when only a
CLIP patcher exists) receive a suitable cached_patcher_init tuple so no
AttributeError on None occurs and checkpoint-backed CLIP patchers keep their
cached_patcher_init.
