MultiGPU Work Units For Accelerated Sampling #7063
Kosinkadink wants to merge 113 commits into master
Conversation
…about the full scope of current sampling run, fix Hook Keyframes' guarantee_steps=1 inconsistent behavior with sampling split across different Sampling nodes/sampling runs by referencing 'sigmas'
…ches to use target_dict instead of target so that more information can be provided about the current execution environment if needed
… to separate out Wrappers/Callbacks/Patches into different hook types (all affect transformer_options)
… hook_type, modified necessary code to no longer need to manually separate out hooks by hook_type
…ptions to not conflict with the "sigmas" that will overwrite "sigmas" in _calc_cond_batch
…ade AddModelsHook operational and compliant with should_register result, moved TransformerOptionsHook handling out of ModelPatcher.register_all_hook_patches, support patches in TransformerOptionsHook properly by casting any patches/wrappers/hooks to proper device at sample time
…nsHook are not yet operational
…ops nodes by properly caching between positive and negative conds, make hook_patches_backup behave as intended (in the case that something pre-registers WeightHooks on the ModelPatcher instead of registering it at sample time)
…added some doc strings and removed a so-far unused variable
…ok to InjectionsHook (not yet implemented, but at least getting the naming figured out)
…t torch hardware device
I noticed that there's no speed boost when using distilled models with CFG=1. Since Normalized Attention Guidance already provides similar negative conditioning at CFG=1, would it be possible to explore solutions similar to XDit's parallel processing in the future? Also, if we have more than two GPUs, I assume this solution wouldn't be as useful, since we can only apply two conditioning streams. Thanks again for all your work!
Hello, I also want to run this model with 4 4090 GPUs. Can you share the workflow?
@Kosinkadink AMD users will need this patch:

```diff
diff --git a/comfy/model_management.py b/comfy/model_management.py
index 4ac04b8b..b396a034 100644
--- a/comfy/model_management.py
+++ b/comfy/model_management.py
@@ -194,7 +194,7 @@ def get_all_torch_devices(exclude_current=False):
     global cpu_state
     devices = []
     if cpu_state == CPUState.GPU:
-        if is_nvidia():
+        if is_nvidia() or is_amd():
             for i in range(torch.cuda.device_count()):
                 devices.append(torch.device(i))
         elif is_intel_xpu():
```
Thank you for the additional info! @DKingAlpha thanks for the heads up!

Firstly, has anyone here been able to get this working on Linux (not WSL)? And if so, what type of GPUs were they?

Secondly, @jkyamog this PR currently only does conditioning splitting - making conditioning run on separate GPUs. Wan2.1 has only two conditionings (positive and negative) without masking, so you can only accelerate it 2x with 2 GPUs - the other GPUs will have no work split for them. This is also the issue with models that have only one conditioning - there is nothing to split. I will be looking at some parallel attention schemes to try to overcome this limitation soon.

I did not have as much time to look into the remaining issues as I thought; I apologize for the delay. I will keep looking into it, plus acceleration beyond conditioning splitting, soon.
Hi, in order to make this setup work with 2 GPUs, do you need enough VRAM to run the Wan model twice on your first GPU? I noticed I get OOM errors when the deepclone part starts. I'm guessing that the clone requires the full model to load, and then also the copy of the model, before it can paste it onto the 2nd GPU? Thanks.
That should not be a requirement. What are your exact errors? (Post the full stack trace + workflow.)
Multigpu work units are a feature only for nodes that use native sampling or specifically reimplement support - the node you're looking at is a wrapper custom node that does not use native sampling.
Do you have a workflow that I can test to see if I installed this correctly? |
I think I have it working with a fix. 2xA40 on a RunPod. I reproduced black outputs and colorful noise in flux-dev fp8 at cfg=1.1.

I got a black screen on some async tensor casting experiments I was doing for another change, and debugged it to a race between the CUDA streams and the PyTorch garbage collector, so I thought I'd check for the same bug here. I remember @Kosinkadink saying this was blocked by black screens in a Discord post. So I think something similar is going on here: the GPU->GPU ops are asynchronous with respect to the CPU, and the CPU can run ahead and queue a cudaAsyncFree on one GPU while the other is still bus mastering the .to transfers, depending on who is the bus master and tensor owner. In the case of pull DMA this can easily be a race that corrupts tensors before the transfer completes. PyTorch documentation is sparse on this, so it's all theory.

So if I'm right, this can be fixed by always bounce buffering through RAM, which syncs the CPU. There is in theory a performance penalty here, as it changes the DMA path from master-slave to master-RAM-master, but I'm not observing any penalty in my initial tests. Here are B=4 1024x1024 cfg=1.1 Flux dev speeds:

Properly syncing the GPU->GPU DMA is a complex web of driver specifics, so this is a lot easier. If this ends up being slow for other use cases (very large latents), you could chunk the .to as a series of queued copies instead, so the two bus masters start overlapping work and the performance will likely converge on something very close to master-slave, given the above.
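The bounce-buffer fix described above can be sketched as follows. This is a minimal illustration, not the actual patch: `bounce_to` is a hypothetical helper name, and it relies on the default synchronous `.to("cpu")` copy, which blocks the CPU until the source GPU's pending work on that tensor has completed.

```python
import torch

def bounce_to(t: torch.Tensor, device: torch.device) -> torch.Tensor:
    # Hypothetical helper: move a tensor between devices via a CPU
    # staging copy instead of a direct peer-to-peer transfer.
    if t.device == device:
        return t
    # A synchronous copy to host RAM makes the CPU wait on the source
    # GPU, so the destination-side copy cannot be queued until the
    # source data is safely staged.
    staged = t.to("cpu")
    return staged.to(device)
```

The chunked variant mentioned above would replace the single `.to` with a loop over slices of `t`, so the two transfers can overlap.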
Test Evidence Check
…clone actual model object, fixed issues with merge, turn off cuda backend as it causes device mismatch issue with rope (and potentially other ops), will investigate
Amp-Thread-ID: https://ampcode.com/threads/T-019d009d-e059-7623-85ca-401168168516
Co-authored-by: Amp <amp@ampcode.com>
📝 Walkthrough
This PR introduces comprehensive multi-GPU support throughout the codebase.

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 11
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy_extras/nodes_multigpu.py`:
- Around line 66-69: The GPUOptionsGroup.clone() return value is being discarded
in create_gpu_options; capture and use the cloned object so we don't mutate the
caller-supplied gpu_options. Change the behavior in create_gpu_options to assign
the result of GPUOptionsGroup.clone() back to gpu_options (i.e., gpu_options =
gpu_options.clone()) and then continue using that local gpu_options, ensuring
each node gets its own cloned GPUOptionsGroup rather than sharing state.
In `@comfy/cli_args.py`:
- Line 52: The --cuda-device argument currently only accepts a single token;
update the parser.add_argument call for "--cuda-device" to accept multiple
space-separated device IDs by adding nargs='+' (and set type=int if you want
integer IDs) so that invocations like "--cuda-device 0 1" parse correctly;
alternatively, if you prefer comma-separated input, change the help text to
explicitly state the required format instead of implying plural support.
In `@comfy/controlnet.py`:
- Around line 322-328: The multigpu clone path in deepclone_multigpu currently
builds c = self.copy() which does not carry the previous_controlnet chain,
causing stacked ControlNets/T2IAdapters to be lost on secondary GPUs; update
deepclone_multigpu to copy previous_controlnet (and any linked
.previous_controlnet chain) from self to c after c = self.copy() so that the
full chain is preserved, then continue deep-copying control_model and wrapping
it as before (ensure multigpu_clones[load_device] assignment remains unchanged);
apply the same preservation of previous_controlnet chaining to the similar clone
code paths that use copy_to()/get_instance_for_device() so all per-device clones
keep the full previous_controlnet chain.
In `@comfy/model_management.py`:
- Around line 214-231: The function get_all_torch_devices currently only handles
NVIDIA/Intel/Ascend and can return an empty list (breaking exclude_current and
unload_all_models); update it to (1) add detection for other common backends
(e.g., ROCm/DirectML/MLU/MPS) or at minimum attempt generic torch backend checks
such as torch.cuda.device_count() and torch.backends.mps.is_available() and
append appropriate torch.device entries, (2) if after all backend checks devices
is still empty, append get_torch_device() as a safe fallback so callers always
get at least the current device, and (3) make the exclude_current branch robust
by checking membership before calling devices.remove(get_torch_device()); refer
to get_all_torch_devices, cpu_state, CPUState.GPU, is_nvidia, is_intel_xpu,
is_ascend_npu, and get_torch_device when implementing these fixes.
In `@comfy/model_patcher.py`:
- Around line 1315-1321: The ON_PREPARE_STATE callbacks are being invoked with
four positional args in prepare_state, breaking backward-compatibility for
callbacks that expect three; update prepare_state to detect each callback's
accepted arity (e.g., via inspect.signature or callable.__code__.co_argcount)
and call either callback(self, timestep, model_options, ignore_multigpu) if it
accepts 4 args or callback(self, timestep, model_options) if it only accepts 3
(or attempt the 4-arg call and fall back to 3-arg on TypeError), and apply the
same arity-gated invocation when recursing into multigpu clones; reference
prepare_state and CallbacksMP.ON_PREPARE_STATE to locate where to change the
callsite.
In `@comfy/multigpu.py`:
- Around line 60-112: create_multigpu_deepclones clones existing "multigpu"
additional models but never removes ones that exceed the new max_gpus; to fix,
after computing limit_extra_devices (the allowed device list) retrieve
model.get_additional_models_with_key("multigpu"), filter out any clone whose
load_device is not in ([model.load_device] + limit_extra_devices) (use each
ModelPatcher.load_device to decide), then call
model.set_additional_models("multigpu", filtered_list) before
match_multigpu_clones()/gpu_options.register; ensure reuse_loaded logic still
can find matching clones and that is_multigpu_base_clone flags remain correct
for retained clones.
In `@comfy/quant_ops.py`:
- Line 23: The unconditional call ck.registry.disable("cuda") in
comfy/quant_ops.py should be removed and only invoked when the unsupported
multigpu+cuda combination is actually active; locate the
ck.registry.disable("cuda") invocation and wrap it with a guard that checks the
real multigpu/backend state (for example an existing multigpu flag or function
like is_multigpu_enabled(), a config/ENV check, or the code path that handles
multigpu setup) so that CUDA is only disabled when multigpu is enabled and the
specific backend combination is unsupported, otherwise leave CUDA enabled for
normal single-GPU runs.
In `@comfy/sampler_helpers.py`:
- Line 200: Add the missing BaseModel import used in the type annotation for
real_model (the line "real_model: BaseModel = model.model") by adding "from
typing import TYPE_CHECKING" already present and then inside the existing
TYPE_CHECKING block import BaseModel from its module (e.g., "from <module>
import BaseModel") so the annotation is defined at type-check time;
alternatively remove the BaseModel annotation if you prefer not to add the
import.
In `@comfy/samplers.py`:
- Around line 391-397: The multigpu scheduler currently ignores multigpu_options
and uses integer floor division (//) inside math.ceil, producing coarse,
incorrect splits; update the batching logic around devices,
device_batched_hooked_to_run, total_conds, hooked_to_run and conds_per_device to
consult multigpu_options (specifically the relative_speed entry for each device
clone) and distribute total_conds proportionally to those relative_speed weights
(then ceil each device's share and ensure at least 1 if there are any
conditions), replacing the math.ceil(total_conds//len(devices)) approach with a
proper float division and per-device allocation; keep device ordering based on
model_options['multigpu_clones'].keys() and ensure the same proportional logic
is applied in the other affected blocks (lines mentioned: 403-416, 433-435) so
the MultiGPU Options node actually affects work distribution.
- Around line 847-850: The code calls x['control'].pre_run(model, ...) for the
base control and then calls device_cnet.pre_run(model, ...) for each control
clone, incorrectly passing the base model to per-device controls; update the
loop to pass the matching per-device model clone instead. Specifically, when
iterating x['control'].multigpu_clones (the device_cnet clones), look up the
corresponding model clone (e.g., from model.multigpu_clones using the same
keys/ids) and call device_cnet.pre_run(model_clone,
percent_to_timestep_function) so each control clone receives its matching model
clone; keep the initial x['control'].pre_run(model, ...) for the base control.
In `@comfy/sd.py`:
- Line 1557: The assignment to out[0].cached_patcher_init can raise when out[0]
is None (e.g. when called from load_checkpoint_guess_config_clip_only()), so
guard it: check that out[0] is not None before assigning to
out[0].cached_patcher_init and, if the CLIP patcher is created separately for
checkpoint-backed models, set its own cached_patcher_init instead (or attach the
init tuple to the patcher instance). Locate the assignment line and ensure both
code paths (when out[0] exists and when only a CLIP patcher exists) receive the
same cached_patcher_init tuple so no NullReference occurs and checkpoint-backed
CLIP patchers keep their cached_patcher_init.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: acbb973a-f861-46f7-805a-a50660ad4671
📒 Files selected for processing (12)
comfy/cli_args.py, comfy/controlnet.py, comfy/model_management.py, comfy/model_patcher.py, comfy/multigpu.py, comfy/patcher_extension.py, comfy/quant_ops.py, comfy/sampler_helpers.py, comfy/samplers.py, comfy/sd.py, comfy_extras/nodes_multigpu.py, nodes.py
```python
def create_gpu_options(self, device_index: int, relative_speed: float, gpu_options: comfy.multigpu.GPUOptionsGroup=None):
    if not gpu_options:
        gpu_options = comfy.multigpu.GPUOptionsGroup()
    gpu_options.clone()
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
set -e
FILE=$(fd '^multigpu\.py$' comfy)
rg -n -C4 'class GPUOptionsGroup|def clone\(' "$FILE"
```

🏁 Script executed:

```bash
#!/bin/bash
set -e
FILE=$(fd '^nodes_multigpu\.py$' comfy_extras)
sed -n '66,80p' "$FILE"
```
Capture the result of gpu_options.clone() on line 69.
The cloned object is being discarded. Since GPUOptionsGroup.clone() returns a new object, the current code ignores it and then mutates the caller-supplied gpu_options directly. When multiple MultiGPU Options nodes share the same GPU options group, this causes unintended state sharing across the node chain.
Change line 69 from:

```python
gpu_options.clone()
```

to:

```python
gpu_options = gpu_options.clone()
```
```diff
 parser.add_argument("--auto-launch", action="store_true", help="Automatically launch ComfyUI in the default browser.")
 parser.add_argument("--disable-auto-launch", action="store_true", help="Disable auto launching the browser.")
-parser.add_argument("--cuda-device", type=int, default=None, metavar="DEVICE_ID", help="Set the id of the cuda device this instance will use. All other devices will not be visible.")
+parser.add_argument("--cuda-device", type=str, default=None, metavar="DEVICE_ID", help="Set the ids of cuda devices this instance will use. All other devices will not be visible.")
```
🧩 Analysis chain
🏁 Script executed:

```bash
python - <<'PY'
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--cuda-device", type=str, default=None)
for argv in (
    ["--cuda-device", "0,1"],
    ["--cuda-device", "0", "1"],
):
    try:
        ns = parser.parse_args(argv)
        print(argv, "->", ns.cuda_device)
    except SystemExit as e:
        print(argv, "-> parse failed with exit", e.code)
PY
```

🏁 Script executed:

```bash
rg "cuda_device" -A 3 -B 3 --max-count=20
```

🏁 Script executed:

```bash
rg "(cuda-device|cuda_device)" -l
```

🏁 Script executed:

```bash
git log --oneline --all -n 1
```

🏁 Script executed:

```bash
git diff HEAD~1 HEAD -- comfy/cli_args.py 2>/dev/null || git show HEAD:comfy/cli_args.py 2>/dev/null | grep -A 5 -B 5 "cuda-device"
```
`--cuda-device 0 1` still won't parse.
Changing the type to `str` only changes the value type, not the option arity. argparse still accepts one token here, so `--cuda-device 0 1` fails with "unrecognized arguments: 1". The help text mentions "ids" (plural), implying multi-device support, but the current implementation requires comma-separated input: `--cuda-device 0,1`. Either add `nargs='+'` to accept space-separated device IDs or clarify the help text to document the required comma-separated format.
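A minimal sketch of the `nargs='+'` variant the review suggests. The option name matches the PR, but the snippet is illustrative rather than the PR's final code:

```python
import argparse

parser = argparse.ArgumentParser()
# nargs='+' collects one or more space-separated tokens into a list,
# and type=int converts each token to an integer device id.
parser.add_argument("--cuda-device", type=int, nargs="+", default=None,
                    metavar="DEVICE_ID",
                    help="Ids of the cuda devices this instance will use, e.g. --cuda-device 0 1.")

ns = parser.parse_args(["--cuda-device", "0", "1"])
print(ns.cuda_device)  # → [0, 1]
```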
```python
def deepclone_multigpu(self, load_device, autoregister=False):
    c = self.copy()
    c.control_model = copy.deepcopy(c.control_model)
    c.control_model_wrapped = comfy.model_patcher.ModelPatcher(c.control_model, load_device=load_device, offload_device=comfy.model_management.unet_offload_device())
    if autoregister:
        self.multigpu_clones[load_device] = c
    return c
```
Preserve the previous_controlnet chain in multigpu clones.
These new clone paths build c from copy(), but copy_to() does not carry previous_controlnet. Once get_instance_for_device() returns the per-device clone, stacked ControlNets/T2IAdapters on earlier links are silently dropped on secondary GPUs.
As per coding guidelines, comfy/** changes should focus on backward compatibility.
Also applies to: 952-958
```python
def get_all_torch_devices(exclude_current=False):
    global cpu_state
    devices = []
    if cpu_state == CPUState.GPU:
        if is_nvidia():
            for i in range(torch.cuda.device_count()):
                devices.append(torch.device(i))
        elif is_intel_xpu():
            for i in range(torch.xpu.device_count()):
                devices.append(torch.device(i))
        elif is_ascend_npu():
            for i in range(torch.npu.device_count()):
                devices.append(torch.device(i))
    else:
        devices.append(get_torch_device())
    if exclude_current:
        devices.remove(get_torch_device())
    return devices
```
Handle non-CUDA backends in get_all_torch_devices.
This helper only enumerates CUDA/XPU/NPU devices, so ROCm/DirectML/MLU-style paths leave devices empty. With exclude_current=True that turns into a remove() failure, and unload_all_models() also stops freeing anything on those backends because it now routes through this helper.
As per coding guidelines, comfy/** changes should focus on backward compatibility and memory management/GPU resource handling.
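The defensive pattern the review asks for (fallback when enumeration finds nothing, plus a membership check before `remove`) can be sketched generically. Plain strings stand in for torch devices here, and `all_devices`/`enumerate_backend` are illustrative names, not ComfyUI API:

```python
def all_devices(enumerate_backend, current_device, exclude_current=False):
    # enumerate_backend returns the backend's device list, possibly empty
    devices = list(enumerate_backend())
    if not devices:
        # fallback: callers always get at least the current device
        devices.append(current_device)
    if exclude_current and current_device in devices:
        # membership check avoids ValueError from list.remove
        devices.remove(current_device)
    return devices
```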
```diff
 def prepare_state(self, timestep, model_options, ignore_multigpu=False):
     for callback in self.get_all_callbacks(CallbacksMP.ON_PREPARE_STATE):
-        callback(self, timestep)
+        callback(self, timestep, model_options, ignore_multigpu)
     if not ignore_multigpu and "multigpu_clones" in model_options:
         for p in model_options["multigpu_clones"].values():
             p: ModelPatcher
             p.prepare_state(timestep, model_options, ignore_multigpu=True)
```
Keep ON_PREPARE_STATE callback arity backward-compatible.
This now passes ignore_multigpu as a fourth positional argument to every existing ON_PREPARE_STATE callback. Any custom node still implementing the old (patcher, timestep, model_options) signature will fail with TypeError during sampling unless this is gated behind an arity check or opt-in keyword path.
As per coding guidelines, comfy/** changes should focus on backward compatibility.
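The arity-gated invocation suggested above can be sketched as a small helper. `call_prepare_state` is a hypothetical name, not ComfyUI API; it inspects each callback's signature and falls back to the three-argument call for old-style callbacks:

```python
import inspect

def call_prepare_state(callback, patcher, timestep, model_options, ignore_multigpu):
    # Old-style ON_PREPARE_STATE callbacks take (patcher, timestep,
    # model_options); new-style ones also accept ignore_multigpu.
    try:
        n = len(inspect.signature(callback).parameters)
    except (TypeError, ValueError):
        n = 4  # builtins without inspectable signatures: assume new style
    if n >= 4:
        return callback(patcher, timestep, model_options, ignore_multigpu)
    return callback(patcher, timestep, model_options)
```

The `TypeError`-fallback variant the review also mentions (try the 4-arg call, retry with 3 on failure) trades the `inspect` lookup for a broader except clause.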
```python
ck.registry.disable("cuda")
logging.warning("WARNING: You need pytorch with cu130 or higher to use optimized CUDA operations.")
```

```python
ck.registry.disable("cuda")  # multigpu will not work rn with comfy-kitchen on cuda backend
```
🧩 Analysis chain
🏁 Script executed:

```bash
# First, let's see the full context of the file around line 23
cat -n comfy/quant_ops.py | head -50
```

🏁 Script executed:

```bash
# Let's also check if there's any git context to see if this is a new change
git diff HEAD -- comfy/quant_ops.py | head -100
```

🏁 Script executed:

```bash
# Search for multigpu-related logic in the codebase to understand the gating/conditions
rg -n "multigpu" --type py -B 2 -A 2
```
Don't disable the comfy-kitchen CUDA backend globally.
Line 23 disables it unconditionally at import time, which affects all CUDA runs including single-GPU and non-multigpu workflows. This regresses quantization performance for users not using multigpu. Gate the disable behind actual multigpu activation or the specific unsupported backend combination instead of doing it unconditionally here.
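The gating asked for above can be sketched generically. `Registry` is a minimal stand-in for comfy-kitchen's backend registry, and the `multigpu_enabled` flag is a placeholder for whatever real check the codebase exposes:

```python
class Registry:
    """Minimal stand-in for a backend registry (illustrative only)."""
    def __init__(self):
        self.disabled = set()

    def disable(self, backend):
        self.disabled.add(backend)

def configure_backends(registry, multigpu_enabled):
    # Guarded disable: single-GPU runs keep the optimized cuda backend
    if multigpu_enabled:
        registry.disable("cuda")  # multigpu unsupported on this backend for now
    return registry
```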
```diff
     minimum_memory_required += inference_memory
     comfy.model_management.load_models_gpu([model] + models, memory_required=memory_required, minimum_memory_required=minimum_memory_required, force_full_load=force_full_load)
-    real_model = model.model
+    real_model: BaseModel = model.model
```
🧩 Analysis chain
🏁 Script executed:

```bash
# First, check the imports and line 200 in the file
head -20 comfy/sampler_helpers.py && echo "---" && sed -n '195,205p' comfy/sampler_helpers.py
```

🏁 Script executed:

```bash
# Search for all imports of BaseModel or where it might come from
rg -i "basemodel|from.*import|import" comfy/sampler_helpers.py | head -30
```

🏁 Script executed:

```bash
# Check if there's a BaseModel defined or imported elsewhere that should be in this file
rg "class BaseModel|from.*BaseModel" comfy/
```

🏁 Script executed:

```bash
# Check git diff to see if line 200 is new
git diff HEAD^ comfy/sampler_helpers.py | grep -A 5 -B 5 "BaseModel"
```

🏁 Script executed:

```bash
# Look for similar type annotations in the file to understand the pattern
rg ":\s*\w+\s*=" comfy/sampler_helpers.py | head -20
```

🏁 Script executed:

```bash
# Check if there are any existing TYPE_CHECKING imports or patterns
sed -n '1,30p' comfy/sampler_helpers.py
```
Add the missing BaseModel import to fix undefined name.
Line 200 uses BaseModel in the type annotation, but it's not imported. Add it to the TYPE_CHECKING block to match the file's existing pattern, or remove the annotation.
Suggested fix:

```diff
 if TYPE_CHECKING:
     from comfy.model_patcher import ModelPatcher
     from comfy.controlnet import ControlBase
+    from comfy.model_base import BaseModel
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
real_model: BaseModel = model.model

if TYPE_CHECKING:
    from comfy.model_patcher import ModelPatcher
    from comfy.controlnet import ControlBase
    from comfy.model_base import BaseModel
```
🧰 Tools
🪛 GitHub Actions: Python Linting
[error] 200-200: F821 Undefined name BaseModel detected by ruff check. Ensure BaseModel is imported or defined.
```python
devices = [dev_m for dev_m in model_options['multigpu_clones'].keys()]
device_batched_hooked_to_run: dict[torch.device, list[tuple[comfy.hooks.HookGroup, tuple]]] = {}

total_conds = 0
for to_run in hooked_to_run.values():
    total_conds += len(to_run)
conds_per_device = max(1, math.ceil(total_conds//len(devices)))
```
relative_speed is not used by the multigpu scheduler.
This branch still computes a fixed conds_per_device and round-robins by raw condition count; multigpu_options is never consulted here. The new MultiGPU Options node therefore has no effect on work distribution, and the // inside math.ceil(...) makes the split even coarser on uneven counts.
As per coding guidelines, comfy/** changes should focus on performance implications in hot paths.
Also applies to: 403-416, 433-435
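The proportional allocation described above can be sketched as follows. `split_by_speed` and its arguments are illustrative names, not ComfyUI API; the point is the float division before `ceil` and the at-least-one floor per device:

```python
import math

def split_by_speed(total_conds, device_speeds):
    """Distribute work units proportionally to per-device relative speed.

    device_speeds maps device -> relative_speed weight.
    """
    total_speed = sum(device_speeds.values())
    shares = {}
    for dev, speed in device_speeds.items():
        # true float division, then ceil; at least 1 when any work exists
        share = math.ceil(total_conds * speed / total_speed)
        shares[dev] = max(1, share) if total_conds > 0 else 0
    return shares
```

Because each share is rounded up, the shares can sum to slightly more than `total_conds`; a real scheduler would cap the last device's allocation at the remaining count.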
```python
if 'control' in x:
    x['control'].pre_run(model, percent_to_timestep_function)
    for device_cnet in x['control'].multigpu_clones.values():
        device_cnet.pre_run(model, percent_to_timestep_function)
```
Run per-device controls against the matching model clone.
These new pre_run() calls feed every device clone the base model. Any control that snapshots model-specific state during pre_run() will capture the wrong device/model; QwenFunControlNet.pre_run() in this file already stores model.diffusion_model, so its multigpu clone will still point at the base UNet.
As per coding guidelines, comfy/** changes should focus on backward compatibility.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@comfy/samplers.py` around lines 847-850, the code calls x['control'].pre_run(model, ...) for the base control and then device_cnet.pre_run(model, ...) for each control clone, incorrectly passing the base model to the per-device controls. Update the loop to pass the matching per-device model clone instead: when iterating x['control'].multigpu_clones (the device_cnet clones), look up the corresponding model clone (e.g. from model.multigpu_clones using the same keys/ids) and call device_cnet.pre_run(model_clone, percent_to_timestep_function) so each control clone receives its matching model clone. Keep the initial x['control'].pre_run(model, ...) for the base control.
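A tiny self-contained illustration of the keyed lookup the prompt describes; StubControl/StubModel and the shared `multigpu_clones` key layout are hypothetical stand-ins for the ComfyUI classes, not the real implementation:

```python
class StubControl:
    """Hypothetical stand-in for a ControlNet; records the model it saw in pre_run."""
    def __init__(self):
        self.multigpu_clones = {}
        self.seen_model = None

    def pre_run(self, model, percent_to_timestep_function=None):
        # Mirrors controls like QwenFunControlNet that snapshot model state here.
        self.seen_model = model


class StubModel:
    """Hypothetical stand-in for a ModelPatcher with per-device clones."""
    def __init__(self, name):
        self.name = name
        self.multigpu_clones = {}


base = StubModel("cuda:0")
clone1 = StubModel("cuda:1")
base.multigpu_clones["cuda:1"] = clone1

control = StubControl()
control.multigpu_clones["cuda:1"] = StubControl()

# Corrected dispatch: the base control sees the base model, while each
# per-device control clone is paired with the model clone under the same key.
control.pre_run(base)
for device, device_cnet in control.multigpu_clones.items():
    model_clone = base.multigpu_clones.get(device, base)
    device_cnet.pre_run(model_clone)
```

After the loop, each control clone has snapshotted its own device's model rather than the base UNet.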
```python
out[0].cached_patcher_init = (load_checkpoint_guess_config_model_only, (ckpt_path, embedding_directory, model_options, te_model_options))
if output_clip and out[1] is not None:
    out[1].patcher.cached_patcher_init = (load_checkpoint_guess_config_clip_only, (ckpt_path, embedding_directory, model_options, te_model_options))
out[0].cached_patcher_init = (load_checkpoint_guess_config, (ckpt_path, False, False, False, embedding_directory, output_model, model_options, te_model_options), 0)
```
Guard the model-side cache assignment.
load_checkpoint_guess_config_clip_only() reaches this path with output_model=False, so out[0] is None and this line raises before the CLIP patcher can be returned. It also leaves checkpoint-backed CLIP patchers without their own cached_patcher_init.
Possible fix:

```diff
- out[0].cached_patcher_init = (load_checkpoint_guess_config, (ckpt_path, False, False, False, embedding_directory, output_model, model_options, te_model_options), 0)
+ if out[0] is not None:
+     out[0].cached_patcher_init = (
+         load_checkpoint_guess_config,
+         (ckpt_path, False, False, False, embedding_directory, True, model_options, te_model_options),
+         0,
+     )
+ if out[1] is not None:
+     out[1].patcher.cached_patcher_init = (
+         load_checkpoint_guess_config,
+         (ckpt_path, False, True, False, embedding_directory, False, model_options, te_model_options),
+         1,
+     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@comfy/sd.py` at line 1557, the assignment to out[0].cached_patcher_init can raise when out[0] is None (e.g. when called from load_checkpoint_guess_config_clip_only()), so guard it: check that out[0] is not None before assigning to out[0].cached_patcher_init, and if the CLIP patcher is created separately for checkpoint-backed models, set its own cached_patcher_init instead (or attach the init tuple to the patcher instance). Ensure both code paths (when out[0] exists and when only a CLIP patcher exists) receive the appropriate cached_patcher_init tuple so no NoneType attribute error occurs and checkpoint-backed CLIP patchers keep their cached_patcher_init.




Overview
This PR adds support for MultiGPU acceleration via 'work unit' splitting; by default, each conditioning is treated as a work unit. Any model that uses more than a single conditioning can be sped up via MultiGPU Work Units: positive+negative, multiple positive/masked conditionings, etc. The code is extensible, allowing extensions to implement their own work units; as a proof of concept, I have implemented AnimateDiff-Evolved contexts as work units.
As long as the GPU is a heavy bottleneck, there will be a noticeable performance improvement. If the GPU is only lightly loaded (e.g. an RTX 4090 sampling a single 512x512 SD1.5 image), the overhead of splitting and combining work units will cause a performance loss compared to using a single GPU.
The MultiGPU Work Units node can be placed in (almost) any existing workflow. When only one device is found, the node effectively does nothing, so workflows that use it stay compatible across single- and multi-GPU setups:

The feature works best when work splitting is symmetrical (the GPUs are identical or have roughly the same performance), with the slowest GPU acting as the limiter. For asymmetrical setups, the MultiGPU Options node can be used to inform the load-balancing code about the relative performance of each GPU in the setup:

Nvidia (CUDA): Tested, works ✅
AMD (ROCm): Untested, will validate soon ⚠️
AMD (DirectML): Untested ⚠️
Intel (Arc XPU): Tested, works on Linux but not on Windows ⚠️
Implementation Details
Based on `max_gpus` and the number of available devices, the main ModelPatcher is cloned and relevant properties (like `model`) are deepcloned after their values are unloaded. MultiGPU clones are stored on the ModelPatcher's `additional_models` under the key `multigpu`. During sampling, the deepcloned ModelPatchers are re-cloned with the values from the main ModelPatcher, with any `additional_models` kept consistent. To avoid unnecessarily deepcloning models, `currently_loaded_models` from `comfy.model_management` is checked for a matching deepcloned model, in which case it is (soft) cloned and made to match the main ModelPatcher.

When native conds are used as the work units,
`_calc_cond_batch` calls and returns `_calc_cond_batch_multigpu` to avoid a potential performance regression that refactoring the single-GPU code might introduce. In the future, this can be revisited to reuse the same code while carefully comparing performance across various models. No processes are created, only Python threads; while the GIL limits CPU parallelism, the GPU being the bottleneck makes diffusion I/O-bound rather than CPU-bound. This vastly improves compatibility with existing code.

Since deepcloning requires that the base model is 'clean',
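The thread-per-device dispatch described above can be sketched roughly as follows; the function name and round-robin assignment are illustrative, not the actual `_calc_cond_batch_multigpu` implementation:

```python
import threading

def run_work_units_threaded(work_units, devices, run_fn):
    """Dispatch work units across devices using one Python thread per device.

    run_fn(device, unit) performs the GPU-bound work; because compute releases
    the GIL while the GPU is busy, the threads overlap despite the GIL.
    """
    results = [None] * len(work_units)

    def worker(device_index, indices):
        for i in indices:
            results[i] = run_fn(devices[device_index], work_units[i])

    # Round-robin assignment of unit indices to devices, preserving output order.
    threads = []
    for d in range(len(devices)):
        indices = list(range(d, len(work_units), len(devices)))
        t = threading.Thread(target=worker, args=(d, indices))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return results
```

Results are written back by original index, so combining the per-device outputs needs no reordering step.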
`comfy.model_management` has received an `unload_model_and_clones` function to unload only specific models and their clones.

The `--cuda-device` startup argument has been refactored to accept a string rather than an int, allowing multiple ids to be provided while not breaking any existing usage. This can be used not only to limit ComfyUI's visibility to a subset of devices per instance, but also to set their order (the first id is treated as device 0, the second as device 1, etc.).
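ComfyUI applies `--cuda-device` by setting `CUDA_VISIBLE_DEVICES` before torch initializes; the sketch below is an illustrative approximation of parsing the new string form, not the actual argument-handling code:

```python
import os

def apply_cuda_device_arg(arg: str) -> list[int]:
    """Parse a comma-separated --cuda-device string such as "1,0".

    Setting CUDA_VISIBLE_DEVICES both restricts which GPUs are visible and
    reorders them: the first id becomes device 0, the second device 1, etc.
    A single id like "2" behaves exactly as the old int form did.
    """
    ids = [int(part) for part in arg.split(",") if part.strip()]
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in ids)
    return ids
```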
Performance (will add more examples soon)
Wan 1.3B t2v: 1.85x uplift for 2 RTX 4090s vs 1 RTX 4090.


Wan 14B t2v: 1.89x uplift for 2 RTX 4090s vs 1 RTX 4090.

