[inductor] Runtime estimations: use nccl estimator; mm only benchmark mode #161405
IvanKobzarev wants to merge 16 commits into gh/IvanKobzarev/140/base
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161405
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 3e76e04 with merge base 5b90e85.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
During comms reordering, the sink-wait-iterative pass observed that the previous runtime estimations were pretty far off for collectives and mms.

Adding optional usage of:
- c10d.time_estimator for collectives, which is based on the NCCL estimator
- Benchmark mode only for matmuls, as they are highly dependent on the mm backend

The logic is mostly copied from Ruisi's PRs for inductor simple_fsdp #157572. These estimation corrections are in the default `BaseSchedulerNode.estimate_runtime()`.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben
@IvanKobzarev has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
```python
    }
elif name == "torch.ops._c10d_functional.all_gather_into_tensor_out.default":
    # TODO: use real all_gather_into_tensor_out
    fn = torch.ops._c10d_functional.all_gather_into_tensor
```
Curious why we use all_gather_into_tensor here for all_gather_into_tensor_out?
I reused the argument parsing from all_gather_into_tensor, as the collective work should be the same. But in the _out variant we do not have the output memory allocation.
```python
fn = torch.ops._c10d_functional.all_to_all_single
# Artificial uniform-split assumption,
# which may not hold in case of uneven sharding.
split_sizes = [in_t.size(0) // pg_size] * pg_size
```
If in_t.size(0) is 5 and pg_size is 2, then split_sizes is [2, 2]? If so, it only covers 4 elements, not 5.
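The reviewer's concern can be checked with a few lines of pure Python (the helper name and sizes here are illustrative, not from the PR): floor division drops the remainder, so the uniform-split assumption undercounts elements under uneven sharding.

```python
def uniform_split_sizes(total: int, pg_size: int) -> list[int]:
    # Artificial uniform-split assumption from the snippet above:
    # every rank is assumed to receive total // pg_size elements.
    return [total // pg_size] * pg_size

sizes = uniform_split_sizes(5, 2)
covered = sum(sizes)
# sizes == [2, 2]; only 4 of the 5 elements are covered, 1 is dropped.
```

Since this is only used to size the estimator's inputs, the error is bounded by pg_size - 1 elements, which is small for large tensors.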
```python
tensor_size_mult = 1.0
if coll == NCCL_COLL.ALL_TO_ALL:
    tensor_size_mult = 2.0 / group_size
```
Why is 2.0 divided by group_size?
Yeah, I just added a hacky multiplier to get closer to the values observed during benchmarking.
The rest of this code is a direct port of the nccl logic, which states: "The following heuristics are copied from https://github.com/NVIDIA/nccl/blob/master/src/graph/tuning.cc. We aim to estimate the runtime as accurately as possible." Does it make sense to keep that property?
Also, I don't see AllToAll there, for whatever reason: https://github.com/NVIDIA/nccl/blob/master/src/graph/tuning.cc
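For intuition on what the multiplier does numerically (my reading of the diff, not an explanation from the PR author, who calls it a hacky empirical fit): the tensor size fed to the bandwidth model is scaled by 2.0 / group_size, so the modeled all_to_all traffic shrinks as the group grows. The helper name below is hypothetical.

```python
def effective_all_to_all_bytes(tensor_bytes: float, group_size: int) -> float:
    # Hacky multiplier from the diff above: scale the tensor size
    # by 2.0 / group_size before feeding it to the bandwidth model.
    tensor_size_mult = 2.0 / group_size
    return tensor_bytes * tensor_size_mult

# e.g. a 64 MiB tensor on an 8-rank group is modeled as 16 MiB of traffic.
```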
```python
if name == "torch.ops._c10d_functional.all_gather_into_tensor.default":
    fn = torch.ops._c10d_functional.all_gather_into_tensor
    return fn, {
        "input": in_t,
        "group_size": pg_size,
        "group_name": pg_name,
    }
elif name == "torch.ops._c10d_functional.all_gather_into_tensor_out.default":
    # TODO: use real all_gather_into_tensor_out
    fn = torch.ops._c10d_functional.all_gather_into_tensor
    return fn, {
        "input": in_t,
        "group_size": pg_size,
        "group_name": pg_name,
    }
elif name == "torch.ops._c10d_functional.reduce_scatter_tensor.default":
```
I think the generic way you generate input nodes below would be better here as well.
```python
args = snode.node.inputs
args = snode.node.fill_non_provided_args(
    [*args, *snode.node.constant_args], snode.node.kwargs
)
kwargs = snode.node.kwargs
flat_args, flat_args_pytree_spec = pytree.tree_flatten((args, kwargs))
```
These are the same utilities we could have used above for getting the nccl inputs.
```python
num_iters = 3
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
cpu_start = time.time()
start_event.record(torch.cuda.current_stream())
for _ in range(num_iters):
    fn(*args, **kwargs)
end_event.record(torch.cuda.current_stream())
cpu_end = time.time()
torch.cuda.synchronize()
cpu_time = cpu_end - cpu_start
total_op_time = start_event.elapsed_time(end_event) - cpu_time
mean_op_time_ms = total_op_time / num_iters
del flat_args
mean_op_time_ns = mean_op_time_ms * 1e6
cache.put(cache_key, mean_op_time_ns)
return mean_op_time_ns
```
Could we reuse benchmark_gpu here?
```python
    return value
```

```python
def estimate_runtime_benchmark(snode: BaseSchedulerNode) -> Optional[float]:
```
Have you tried benchmark_fused_nodes? It should already accomplish what this function does.
```python
super().__init__()
V.graph.scheduler = self
self.backends: dict[torch.device, BaseScheduling] = {}
self.estimate_runtime_cache = EstimateRuntimeCache()
```
Can we make this a machine-local cache? See: pytorch/torch/_inductor/fx_passes/pad_mm.py, lines 250 to 268 in 1f820de.
ruisizhang123
left a comment
I found there is a config.estimate_op_runtime, which allows users to pass a customized op-estimation function to inductor: pytorch/torch/_inductor/comms.py, lines 1227 to 1236 in b7e207c.

It might make more sense to expose runtime_estimations_use_nccl_lib_estimations and runtime_estimations_mms_benchmark as a callable function passed through estimate_op_runtime?
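For illustration of the reviewer's suggestion: a custom estimator in the shape config.estimate_op_runtime accepts is just a callable from a scheduler node to an estimated runtime. The node stub, constants, and dispatch below are hypothetical stand-ins, not PyTorch APIs; only `estimate_op_runtime` itself comes from the inductor config.

```python
from dataclasses import dataclass


@dataclass
class FakeNode:
    """Stand-in for a scheduler node; a real estimator would receive
    an inductor node and inspect its op and tensor sizes."""
    op_name: str
    num_bytes: int


def my_estimate_op_runtime(node: FakeNode) -> float:
    # Route collectives to one model and leave everything else on a
    # flat fallback, mirroring the PR's split between the NCCL-based
    # estimator for collectives and benchmarking for matmuls.
    if "all_gather" in node.op_name or "reduce_scatter" in node.op_name:
        bandwidth_bytes_per_ns = 50.0  # assumed effective bus bandwidth
        return node.num_bytes / bandwidth_bytes_per_ns
    return 1000.0  # flat fallback estimate in ns (made-up constant)


# In inductor this would be registered roughly as:
#   torch._inductor.config.estimate_op_runtime = my_estimate_op_runtime
```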
```python
# for built-in estimation function, pass in "default"; for user-defined
# estimation function, pass in the function handle
estimate_op_runtime = "default"
```

```python
runtime_estimations_mms_benchmark: bool = False
```
@pytorchbot merge |
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: command failed (details for Dev Infra team). Raised by workflow job.
@pytorchbot merge |
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
[inductor] Runtime estimations: use nccl estimator; mm only benchmark mode (pytorch#161405)
Differential Revision: [D81152294](https://our.internmc.facebook.com/intern/diff/D81152294)
Pull Request resolved: pytorch#161405
Approved by: https://github.com/eellison
Stack from ghstack (oldest at bottom):

During comms reordering, the sink-wait-iterative pass observed that the previous runtime estimations were pretty far off for collectives and mms.

Adding optional usage of:
- c10d.time_estimator for collectives, which is based on the NCCL estimator
- Benchmark mode only for matmuls, as they are highly dependent on the mm backend

These estimation corrections are in the default `BaseSchedulerNode.estimate_runtime()`.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

Differential Revision: D81152294