[Inductor] User kernel unary epilogue fusion #173662
AmesingFlank wants to merge 10 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/173662
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (3 Unrelated Failures) As of commit c6605fa with merge base f26ec24:
- FLAKY: the following job failed, but it was likely due to flakiness present on trunk.
- BROKEN TRUNK: the following job failed, but it was present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
- UNSTABLE: the following job is marked as unstable, possibly due to flakiness on trunk.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from 4a3cf5f to 69413d7.
Force-pushed from 5e60dd8 to 335b8c5.
eellison left a comment:
Very good start! Left some comments.
```python
for op in reversed(epilogue.data.origins):  # `origins` contains the ops in reverse
    if op.name == "relu":
        store_value_expr = f"triton_helpers.maximum(0, {store_value_expr})"
    elif op.name == "sigmoid":
        store_value_expr = f"tl.sigmoid({store_value_expr})"
    else:
        raise AssertionError(f"unsupported epilogue op: {op.name}")
```
Ultimately, we'll need to actually codegen the unary fn.
| "tt.atomic_cas": [0], | ||
| "tt.atomic_rmw": [0], | ||
| "tt.experimental_descriptor_store": [0], | ||
| "tt.experimental_tensormap_create": [0], |
`tt.experimental_tensormap_create` is not a write op; it just creates a descriptor which will then be used for either loading or storing.
This is unfortunately a part of the codebase that has no owner currently. Maybe it will be you :)
I see. Given that this is pre-existing, OK if we look into this in a follow-up?
```python
WRITE_OPS = {  # renamed from MUTATION_OPS
    "tt.store": [0],
    "tt.atomic_cas": [0],
```
We're missing a bunch of atomic ops... we should also record which writes are atomic and update the write Dep mode to reflect that: https://github.com/pytorch/pytorch/blob/main/torch/_inductor/dependencies.py#L83
We should also update the typing hint to be `Optional[Literal[...]]` so it's clearer what the modes can be.
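A minimal sketch of what that typing could look like; the helper and the concrete mode names below are hypothetical, not the actual set used in torch/_inductor/dependencies.py:

```python
from typing import Literal, Optional

# Hypothetical sketch: narrow the mode from Optional[str] to a closed set.
WriteMode = Optional[Literal["atomic_add", "atomic_cas", "atomic_rmw"]]

def record_write(buffer_name: str, mode: WriteMode = None) -> tuple[str, WriteMode]:
    # A type checker now rejects any mode string outside the Literal set,
    # whereas a plain Optional[str] silently accepts anything.
    return (buffer_name, mode)
```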
```python
@functools.cache
def identify_triton_stores(source_code: str):
    """
    Parse the Python source code of a triton kernel and find all tl.store calls.
```
The existing parser uses ttgir, and this uses the ast. Are there any concerns there?
The two paths are for different purposes. The TTIR-based analysis is for extracting read/write info out of the code, whereas the AST parsing is so that we can modify certain parts of the code to achieve fusion. I do think these scenarios call for different approaches, and it's best to use the IR for program analysis and the AST for source code manipulation.
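For illustration, a minimal self-contained sketch of the AST-based technique; this shows the general approach, not necessarily the PR's exact implementation:

```python
import ast
import functools

@functools.cache
def identify_triton_stores(source_code: str):
    """Collect every `tl.store(...)` call node in a Triton kernel's source."""
    stores = []

    class StoreVisitor(ast.NodeVisitor):
        def visit_Call(self, node: ast.Call) -> None:
            func = node.func
            # Match attribute calls of the exact form `tl.store(...)`.
            if (
                isinstance(func, ast.Attribute)
                and func.attr == "store"
                and isinstance(func.value, ast.Name)
                and func.value.id == "tl"
            ):
                stores.append(node)
            self.generic_visit(node)

    StoreVisitor().visit(ast.parse(source_code))
    return stores
```

Because the calls are located as AST nodes (with line/column info), the fusion pass can rewrite the stored value expression in place rather than pattern-matching raw text.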
```python
if next(iter(self.mutable_args[0].origins)).name != "empty":
    return False
```
Similarly, we should not rely on the origins. You can check that this is a Nop input:
pytorch/torch/_inductor/lowering.py, lines 3675 to 3676 at a182b08
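A rough sketch of that suggestion; `get_defining_op` is a hypothetical accessor, though `ir.NopKernel` is a real Inductor IR class:

```python
from torch._inductor import ir

def is_nop_allocation(buf) -> bool:
    # Sketch only: instead of string-matching the FX origin name ("empty"),
    # test whether the node producing the buffer is a no-op allocation.
    producer = buf.get_defining_op()  # hypothetical accessor
    return isinstance(producer, ir.NopKernel)
```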
Force-pushed from 335b8c5 to 9a51e40.
```python
    We do this by pruning the `out` tensor allocation and directly writing the relu-output.
    """
    # only do epilogue fusion if the kernel has a single output tensor
    if len(self.args_read_writes.writes) != 1:
```
Let's also check that the input & output size/stride/dtype are the same, at least until we have logic to fix up mismatches.
Good call. I updated the PR to include that check in Scheduler::can_fuse.
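A minimal sketch of that check, assuming Inductor IR buffers with the usual `get_size`/`get_stride`/`get_dtype` accessors; the helper itself is hypothetical:

```python
def same_layout(a, b) -> bool:
    # Hypothetical helper: fusion is only safe when the user kernel's output
    # and the epilogue's input agree on size, stride, and dtype.
    return (
        a.get_size() == b.get_size()
        and a.get_stride() == b.get_stride()
        and a.get_dtype() == b.get_dtype()
    )
```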
```python
if isinstance(node, ir.UserDefinedTritonKernel) and node.can_fuse_epilogues():
    numel = math.prod(node.mutable_args[0].shape)
    rnumel = 1
    device = node.get_device_or_error()
    # pyrefly: ignore [bad-assignment]
    self.group = (device, (numel, rnumel))
```
So this will only support fusion for one of the outputs. I think supporting only a single epilogue is fine initially. While we have this limitation, we could potentially see which output has a potential epilogue fusion.
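To make the group computation concrete, a small worked example; the shape and device here are illustrative:

```python
import math

# A kernel whose single mutated output has shape (128, 256) gets group key
# (device, (32768, 1)), the same key a pointwise scheduler node over 32768
# elements would get, which is what makes the two nodes fusion candidates.
shape = (128, 256)
numel = math.prod(shape)  # 32768
rnumel = 1                # no reduction dimension for a pointwise epilogue
group = ("cuda", (numel, rnumel))  # device shown as a string for brevity
```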
```python
def fusable_pointwise_ops(cls):
    return OrderedSet(["relu", "sigmoid"])
```
I guess this is because of our current codegen limitations?
Correct. I've updated the PR to properly call node.codegen(), so this is now removed.
Force-pushed from df2508a to 1ec9830.
Reverting PR 173662 failed. Reason: Command. Details for Dev Infra team: raised by workflow job.
I'm going to disable the test until we fix it.
If it's only cpp_wrapper failures, I can take care of that. I recently enabled more cpp-wrapper tests on CI, so this was a landing race.
This PR increases the instruction count on mm_loop_inductor_gpu by 10%.
Summary: 1. #173662 added more tests to test/inductor/test_triton_kernels.py, and #175416 enabled cpp-wrapper tests on test/inductor/test_triton_kernels.py. So there was a land race, and #173662 didn't have the failing CI signal at landing time. Forward fix by updating the code-checking target for cpp-wrapper. 2. #176353 also had a land race. Skip now; the fix is coming later. Pull Request resolved: #176745. Approved by: https://github.com/AmesingFlank, https://github.com/zou3519

This PR enables fusing user-defined triton kernels with unary epilogues such as relu(). This roughly involves the following changes:
- Detecting when fusion is possible (see `UserDefinedTritonKernel::can_fuse_epilogues` for detailed rationale).
- Adding `FusedExternTritonKernelSchedulerNode` and updating fusion logic to allow extern kernel nodes (triton kernels) to epilogue-fuse with pointwise scheduler nodes.
- Rewriting the user kernel's `tl.store()`: we generate an expr for the value after the epilogue and substitute that into the `tl.store`.

@eellison
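For context, a minimal end-to-end example of the pattern this PR targets; the kernel, shapes, and launch parameters are illustrative, not taken from the PR's tests:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    offsets = tl.program_id(0) * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)  # fusion rewrites this store

def f(x, y):
    out = torch.empty_like(x)
    n = out.numel()
    add_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK_SIZE=1024)
    # The unary epilogue: with this PR, Inductor can fold the relu into the
    # user kernel's tl.store instead of launching a separate pointwise kernel.
    return torch.relu(out)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
print(torch.compile(f)(x, y))
```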
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @mlazos