[RFC] Helion kernel + Inductor ops prologue / epilogue fusion #1346

@yf225

Description


Detailed Flow

Phase 1: Dynamo Tracing

Dynamo creates a HelionKernelVariable that stores the kernel in HelionKernelSideTable with a unique kernel_idx.
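A minimal sketch of what such a side table might look like (names and structure are illustrative, not the actual Dynamo implementation):

```python
# Illustrative sketch of a kernel side table; the real
# HelionKernelSideTable lives inside Dynamo and differs in detail.
class KernelSideTable:
    def __init__(self):
        self._kernels = []

    def add_kernel(self, kernel) -> int:
        # Each kernel gets a unique integer index (kernel_idx) so FX
        # graph nodes can refer to it without holding the object itself.
        self._kernels.append(kernel)
        return len(self._kernels) - 1

    def get_kernel(self, kernel_idx: int):
        return self._kernels[kernel_idx]

table = KernelSideTable()
idx = table.add_kernel(lambda x: x * 2)
```

Storing kernels out-of-band keeps the FX graph serializable: nodes carry only the integer index.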

Phase 2: HOP Creation and Output Spec Inference

HelionKernelVariable.call_function() binds the kernel to infer output specs (shapes, dtypes, mutations, aliases), then emits a helion_kernel_wrapper_mutation HOP node in the FX graph.
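The spec-inference idea can be modeled as running the kernel against lightweight tensor descriptions instead of real data. This is a hypothetical sketch (TensorSpec, infer_output_spec, and matmul_spec are made-up names); the real flow binds the actual Helion kernel and also records mutations and aliases:

```python
from dataclasses import dataclass

# Model output-spec inference: evaluate the kernel on "meta"
# descriptions to learn output shapes/dtypes without any compute.
@dataclass(frozen=True)
class TensorSpec:
    shape: tuple
    dtype: str

def infer_output_spec(kernel, input_specs):
    # The real flow also tracks mutations and aliases; here we
    # only model shape/dtype propagation.
    return kernel(*input_specs)

def matmul_spec(a: TensorSpec, b: TensorSpec) -> TensorSpec:
    # Shape rule for a (M, K) @ (K, N) matmul.
    assert a.shape[1] == b.shape[0]
    return TensorSpec((a.shape[0], b.shape[1]), a.dtype)

out = infer_output_spec(
    matmul_spec,
    [TensorSpec((4, 8), "float16"), TensorSpec((8, 16), "float16")],
)
```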

Phase 3: Inductor Lowering

When Inductor processes the FX graph, it encounters the HOP and lowers it to a HelionTemplateBuffer (subclass of Inductor's TemplateBuffer).
Inductor's TemplateBuffer is the standard mechanism for integrating external/custom kernels that generate their own code. It provides:

  • Scheduler integration for fusion decisions (prologue/epilogue)
  • render() callback for lazy code generation
  • Input/output buffer tracking for dependency analysis
  • Multi-output support via MultiOutput nodes
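The TemplateBuffer contract above can be sketched schematically (illustrative only, not Inductor's actual class): inputs are tracked eagerly for dependency analysis, while code generation is deferred to a callback so it can run after fusion decisions are known:

```python
# Schematic model of the TemplateBuffer contract described above.
class TemplateBufferSketch:
    def __init__(self, inputs, make_kernel_render):
        self.inputs = inputs                # tracked for dependency analysis
        self._render = make_kernel_render   # lazy codegen hook

    def render(self) -> str:
        # Called late, after the scheduler has made fusion decisions,
        # so the emitted code can include fused prologue/epilogue ops.
        return self._render()

buf = TemplateBufferSketch(["x", "y"], lambda: "triton_kernel_src")
```

The key design point is laziness: by the time render() runs, the scheduler already knows which neighboring ops fuse in.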

Phase 4: Inductor Scheduling and Fusion Decisions

The Inductor Scheduler decides which prologue/epilogue nodes can fuse with the Helion kernel based on compatibility analysis.
The fusion decisions are stored in _prologue_specs and _epilogue_specs.

Prologue fusion criteria (op feeding into kernel input):

Epilogue fusion criteria (op consuming kernel output):

  • Kernel must not use atomic operations (atomics require precise store order)
  • Shape and stride must match exactly between kernel output and epilogue input
  • No non-trivial view operations (only offset=0 views with matching shape/stride)
  • Epilogue can only read from kernel outputs (main output or multi-outputs)
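The epilogue criteria above amount to a compatibility predicate. A minimal sketch (the real scheduler check is more involved; Layout and can_fuse_epilogue are illustrative names):

```python
from dataclasses import dataclass

# Illustrative predicate mirroring the epilogue fusion criteria above.
@dataclass
class Layout:
    shape: tuple
    stride: tuple
    offset: int = 0

def can_fuse_epilogue(kernel_uses_atomics: bool, kernel_out: Layout,
                      epilogue_in: Layout,
                      reads_only_kernel_outputs: bool) -> bool:
    if kernel_uses_atomics:            # atomics require precise store order
        return False
    if kernel_out.shape != epilogue_in.shape:
        return False                   # shape must match exactly
    if kernel_out.stride != epilogue_in.stride:
        return False                   # stride must match exactly
    if epilogue_in.offset != 0:
        return False                   # only trivial (offset=0) views
    return reads_only_kernel_outputs   # may only read kernel outputs

out = Layout((128, 64), (64, 1))
```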

Phase 5 & 6: Inside HelionTemplateBuffer.render()

Inductor Codegen calls HelionTemplateBuffer.render(), which performs both autotuning and code generation:

Phase 5: Autotuning via ensure_config_exists()

  • If config= provided: Uses the provided config directly (no autotuning)
  • If config= not provided: Full autotuning search

This autotunes the Helion kernel without the prologue/epilogue ops. (In the future we can take those ops into consideration.)
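The config-or-autotune branch can be sketched as follows (a standalone model, not the actual ensure_config_exists implementation; the benchmark callable and config dicts are illustrative):

```python
# Sketch of the config-or-autotune branch described above.
def ensure_config_exists(kernel_state, candidate_configs, benchmark):
    if kernel_state.get("config") is not None:
        return kernel_state["config"]   # user-provided config: no search
    # Full autotuning: benchmark every candidate and keep the fastest.
    # (Note: the search runs the kernel *without* the fused
    # prologue/epilogue ops.)
    best = min(candidate_configs, key=benchmark)
    kernel_state["config"] = best       # cache so later renders skip the search
    return best

state = {"config": None}
cfg = ensure_config_exists(
    state,
    [{"block": 64}, {"block": 128}],
    benchmark=lambda c: 1.0 / c["block"],  # pretend larger blocks run faster
)
```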

Phase 6: Code Generation via generate_ast()

Triton codegen for the Helion kernel, with hl.load/hl.store interception to insert the prologue/epilogue ops.

+-------------------------------------------+
|    render()                               |
|                   |                       |
|                   v                       |
|   +-------------------------------+       |
|   | ensure_config_exists()        |       |  <-- Phase 5: Autotune if needed
|   +-------------------------------+       |
|                   |                       |
|                   v                       |
|   +-------------------------------+       |
|   | generate_ast(host_function,   |       |  <-- Phase 6: Triton codegen
|   |              config)          |       |
|   +-------------------------------+       |
|                   |                       |
|   During AST generation, hl.load()        |
|   and hl.store() are intercepted to       |
|   insert prologue/epilogue ops            |
|                   |                       |
|                   v                       |
|          Fused Triton Code                |
+-------------------------------------------+
hl.load Interception (Prologue Fusion)
+-------------------------------------------+
|    memory_ops.py: hl.load codegen         |
|                   |                       |
|   Check: env.is_fusion_enabled()?         |
|               |           |               |
|              Yes          No              |
|               |           |               |
|               v           v               |
|   +------------------+  Return raw load   |
|   | codegen_prologue |                    |
|   | _fusion()        |                    |
|   +------------------+                    |
|               |                           |
|               v                           |
|       Fused load expression               |
|   e.g., "(tl.load(x + idx0) * 2.0)"       |
+-------------------------------------------+
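The load interception in the diagram above can be modeled as string rewriting over the Triton load expression (a toy sketch; codegen_load and prologue_fn are illustrative names, not the actual memory_ops.py API):

```python
# Toy model of the hl.load interception: when fusion is enabled, the
# raw Triton load expression is wrapped with the prologue computation.
def codegen_load(ptr_expr, index_expr, prologue_fn=None):
    raw = f"tl.load({ptr_expr} + {index_expr})"
    if prologue_fn is None:
        return raw              # fusion disabled: return the raw load
    return prologue_fn(raw)     # fusion enabled: wrap the load expression

fused = codegen_load("x", "idx0", prologue_fn=lambda e: f"({e} * 2.0)")
```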
hl.store Interception (Epilogue Fusion)
+-------------------------------------------+
|    memory_ops.py: store codegen           |
|                   |                       |
|   Check: env.is_fusion_enabled()?         |
|               |           |               |
|              Yes          No              |
|               |           |               |
|               v           v               |
|   +------------------+  Return raw store  |
|   | codegen_epilogue |                    |
|   | _fusion()        |                    |
|   +------------------+                    |
|               |                           |
|               v                           |
|       Fused store expression              |
|   e.g., "tl.store(ptr, (val + 1.0))"      |
|   + extra_stores for epilogue outputs     |
+-------------------------------------------+
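Similarly, the store interception can be modeled as rewriting the value expression before the store and appending extra stores for additional epilogue outputs (a toy sketch with illustrative names):

```python
# Toy model of the hl.store interception: the epilogue rewrites the
# stored value, and extra stores may be emitted for epilogue outputs.
def codegen_store(ptr_expr, val_expr, epilogue_fn=None, extra_stores=()):
    if epilogue_fn is not None:
        val_expr = epilogue_fn(val_expr)  # fuse epilogue into the value
    lines = [f"tl.store({ptr_expr}, {val_expr})"]
    lines.extend(extra_stores)            # e.g. stores for extra outputs
    return lines

lines = codegen_store(
    "ptr", "val",
    epilogue_fn=lambda v: f"({v} + 1.0)",
    extra_stores=["tl.store(out2, acc)"],
)
```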

FAQ

Why intercept at hl.load/store codegen?

At the time the fusion decision is made in the Inductor scheduler, the prologue/epilogue ops to be fused are already in Inductor IR form (i.e. a low-level IR closest to the actual Triton code). The most natural place to intercept the Helion compilation/codegen pipeline and insert these ops is the hl.load/hl.store codegen function, which already deals with Triton-level codegen.

A theoretical alternative is to let the Inductor scheduler modify the Helion kernel on the fly to take an additional prologue/epilogue callable (an FX graph), and have the Helion kernel call that FX callable on its inputs/outputs. However, this requires "going up the IR level", i.e. converting the prologue/epilogue ops from Inductor IR (low-level) back to FX IR (high-level), which is difficult in general, so we avoid that approach here.

Does Inductor invoke Helion which invokes Inductor? Is this a circular dependency?

Yes, Inductor invokes Helion which invokes Inductor—but for different purposes, and there is no circular dependency.

Inductor has three main stages: lowering → scheduler → Triton codegen. Depending on the use case, we don't have to use all three.

As described above, the prologue/epilogue fusion flow works as follows:

  1. Inductor lowering processes the torch.compile program and produces IR containing a call to the Helion kernel. This IR is then fed into the Inductor scheduler.

  2. Inductor scheduler encounters the Helion kernel and triggers autotuning to find the optimal config. During autotuning, Helion internally uses Inductor's lowering and Triton codegen stages, but not the scheduler—so there is no circular dependency.

  3. Inductor codegen inserts the prologue/epilogue ops via the Helion kernel's hl.load/hl.store codegen, so these ops are generated within the Helion Triton kernel. Inductor codegen then emits a call to that fused Helion Triton kernel within the overall torch.compile output code.
