[RFC] Helion kernel + Inductor ops prologue / epilogue fusion #1346

@yf225

Description


Detailed Flow

Phase 1: Dynamo Tracing

Dynamo creates a HelionKernelVariable that stores the kernel in HelionKernelSideTable with a unique kernel_idx.
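A minimal sketch of what such a side table might look like (names and structure are illustrative, not the actual Dynamo implementation):

```python
# Illustrative sketch of a kernel side table; the real
# HelionKernelSideTable lives inside Dynamo and differs in detail.
class KernelSideTable:
    def __init__(self):
        self._kernels = []

    def add_kernel(self, kernel) -> int:
        # Each kernel gets a unique integer index (kernel_idx) so FX
        # graph nodes can refer to it without holding the object itself.
        self._kernels.append(kernel)
        return len(self._kernels) - 1

    def get_kernel(self, kernel_idx: int):
        return self._kernels[kernel_idx]

table = KernelSideTable()
idx = table.add_kernel(lambda x: x * 2)
```

Storing kernels out-of-band keeps the FX graph serializable: nodes carry only the integer index.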

Phase 2: HOP Creation and Output Spec Inference

HelionKernelVariable.call_function() binds the kernel to infer output specs (shapes, dtypes, mutations, aliases), then emits a helion_kernel_wrapper_mutation HOP node in the FX graph.
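The spec-inference idea can be modeled as running the kernel against lightweight tensor descriptions instead of real data. This is a hypothetical sketch (TensorSpec, infer_output_spec, and matmul_spec are made-up names); the real flow binds the actual Helion kernel and also records mutations and aliases:

```python
from dataclasses import dataclass

# Model output-spec inference: evaluate the kernel on "meta"
# descriptions to learn output shapes/dtypes without any compute.
@dataclass(frozen=True)
class TensorSpec:
    shape: tuple
    dtype: str

def infer_output_spec(kernel, input_specs):
    # The real flow also tracks mutations and aliases; here we
    # only model shape/dtype propagation.
    return kernel(*input_specs)

def matmul_spec(a: TensorSpec, b: TensorSpec) -> TensorSpec:
    # Shape rule for a (M, K) @ (K, N) matmul.
    assert a.shape[1] == b.shape[0]
    return TensorSpec((a.shape[0], b.shape[1]), a.dtype)

out = infer_output_spec(
    matmul_spec,
    [TensorSpec((4, 8), "float16"), TensorSpec((8, 16), "float16")],
)
```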

Phase 3: Inductor Lowering

When Inductor processes the FX graph, it encounters the HOP and lowers it to a HelionTemplateBuffer (subclass of Inductor's TemplateBuffer).
Inductor's TemplateBuffer is the standard mechanism for integrating external/custom kernels that generate their own code. It provides:

  • Scheduler integration for fusion decisions (prologue/epilogue)
  • render() callback for lazy code generation
  • Input/output buffer tracking for dependency analysis
  • Multi-output support via MultiOutput nodes
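The TemplateBuffer contract above can be sketched schematically (illustrative only, not Inductor's actual class): inputs are tracked eagerly for dependency analysis, while code generation is deferred to a callback so it can run after fusion decisions are known:

```python
# Schematic model of the TemplateBuffer contract described above.
class TemplateBufferSketch:
    def __init__(self, inputs, make_kernel_render):
        self.inputs = inputs                # tracked for dependency analysis
        self._render = make_kernel_render   # lazy codegen hook

    def render(self) -> str:
        # Called late, after the scheduler has made fusion decisions,
        # so the emitted code can include fused prologue/epilogue ops.
        return self._render()

buf = TemplateBufferSketch(["x", "y"], lambda: "triton_kernel_src")
```

The key design point is laziness: by the time render() runs, the scheduler already knows which neighboring ops fuse in.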

Phase 4: Inductor Scheduling and Fusion Decisions

The Inductor Scheduler decides which prologue/epilogue nodes can fuse with the Helion kernel based on compatibility analysis.
The fusion decisions are stored in _prologue_specs and _epilogue_specs.

Prologue fusion criteria (op feeding into kernel input):

Epilogue fusion criteria (op consuming kernel output):

  • Kernel must not use atomic operations (atomics require precise store order)
  • Shape and stride must match exactly between kernel output and epilogue input
  • No non-trivial view operations (only offset=0 views with matching shape/stride)
  • Epilogue can only read from kernel outputs (main output or multi-outputs)
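The epilogue criteria above amount to a compatibility predicate. A minimal sketch (the real scheduler check is more involved; Layout and can_fuse_epilogue are illustrative names):

```python
from dataclasses import dataclass

# Illustrative predicate mirroring the epilogue fusion criteria above.
@dataclass
class Layout:
    shape: tuple
    stride: tuple
    offset: int = 0

def can_fuse_epilogue(kernel_uses_atomics: bool, kernel_out: Layout,
                      epilogue_in: Layout,
                      reads_only_kernel_outputs: bool) -> bool:
    if kernel_uses_atomics:            # atomics require precise store order
        return False
    if kernel_out.shape != epilogue_in.shape:
        return False                   # shape must match exactly
    if kernel_out.stride != epilogue_in.stride:
        return False                   # stride must match exactly
    if epilogue_in.offset != 0:
        return False                   # only trivial (offset=0) views
    return reads_only_kernel_outputs   # may only read kernel outputs

out = Layout((128, 64), (64, 1))
```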

Phase 5 & 6: Inside HelionTemplateBuffer.render()

Inductor Codegen calls HelionTemplateBuffer.render(), which performs both autotuning and code generation:

Phase 5: Autotuning via ensure_config_exists()

  • If config= provided: Uses the provided config directly (no autotuning)
  • If config= not provided: Full autotuning search

This autotunes the Helion kernel without the prologue/epilogue ops. (In the future we can take those ops into consideration.)
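The config-or-autotune branch can be sketched as follows (a standalone model, not the actual ensure_config_exists implementation; the benchmark callable and config dicts are illustrative):

```python
# Sketch of the config-or-autotune branch described above.
def ensure_config_exists(kernel_state, candidate_configs, benchmark):
    if kernel_state.get("config") is not None:
        return kernel_state["config"]   # user-provided config: no search
    # Full autotuning: benchmark every candidate and keep the fastest.
    # (Note: the search runs the kernel *without* the fused
    # prologue/epilogue ops.)
    best = min(candidate_configs, key=benchmark)
    kernel_state["config"] = best       # cache so later renders skip the search
    return best

state = {"config": None}
cfg = ensure_config_exists(
    state,
    [{"block": 64}, {"block": 128}],
    benchmark=lambda c: 1.0 / c["block"],  # pretend larger blocks run faster
)
```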

Phase 6: Code Generation via generate_ast()

Triton codegen for the Helion kernel, with hl.load/hl.store interception to insert the prologue/epilogue ops.

+-------------------------------------------+
|    render()                               |
|                   |                       |
|                   v                       |
|   +-------------------------------+       |
|   | ensure_config_exists()        |       |  <-- Phase 5: Autotune if needed
|   +-------------------------------+       |
|                   |                       |
|                   v                       |
|   +-------------------------------+       |
|   | generate_ast(host_function,   |       |  <-- Phase 6: Triton codegen
|   |              config)          |       |
|   +-------------------------------+       |
|                   |                       |
|   During AST generation, hl.load()        |
|   and hl.store() are intercepted to       |
|   insert prologue/epilogue ops            |
|                   |                       |
|                   v                       |
|          Fused Triton Code                |
+-------------------------------------------+
hl.load Interception (Prologue Fusion)
+-------------------------------------------+
|    memory_ops.py: hl.load codegen         |
|                   |                       |
|   Check: env.is_fusion_enabled()?         |
|               |           |               |
|              Yes          No              |
|               |           |               |
|               v           v               |
|   +------------------+  Return raw load   |
|   | codegen_prologue |                    |
|   | _fusion()        |                    |
|   +------------------+                    |
|               |                           |
|               v                           |
|       Fused load expression               |
|   e.g., "(tl.load(x + idx0) * 2.0)"       |
+-------------------------------------------+
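The load interception in the diagram above can be modeled as string rewriting over the Triton load expression (a toy sketch; codegen_load and prologue_fn are illustrative names, not the actual memory_ops.py API):

```python
# Toy model of the hl.load interception: when fusion is enabled, the
# raw Triton load expression is wrapped with the prologue computation.
def codegen_load(ptr_expr, index_expr, prologue_fn=None):
    raw = f"tl.load({ptr_expr} + {index_expr})"
    if prologue_fn is None:
        return raw              # fusion disabled: return the raw load
    return prologue_fn(raw)     # fusion enabled: wrap the load expression

fused = codegen_load("x", "idx0", prologue_fn=lambda e: f"({e} * 2.0)")
```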
hl.store Interception (Epilogue Fusion)
+-------------------------------------------+
|    memory_ops.py: store codegen           |
|                   |                       |
|   Check: env.is_fusion_enabled()?         |
|               |           |               |
|              Yes          No              |
|               |           |               |
|               v           v               |
|   +------------------+  Return raw store  |
|   | codegen_epilogue |                    |
|   | _fusion()        |                    |
|   +------------------+                    |
|               |                           |
|               v                           |
|       Fused store expression              |
|   e.g., "tl.store(ptr, (val + 1.0))"      |
|   + extra_stores for epilogue outputs     |
+-------------------------------------------+
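Similarly, the store interception can be modeled as rewriting the value expression before the store and appending extra stores for additional epilogue outputs (a toy sketch with illustrative names):

```python
# Toy model of the hl.store interception: the epilogue rewrites the
# stored value, and extra stores may be emitted for epilogue outputs.
def codegen_store(ptr_expr, val_expr, epilogue_fn=None, extra_stores=()):
    if epilogue_fn is not None:
        val_expr = epilogue_fn(val_expr)  # fuse epilogue into the value
    lines = [f"tl.store({ptr_expr}, {val_expr})"]
    lines.extend(extra_stores)            # e.g. stores for extra outputs
    return lines

lines = codegen_store(
    "ptr", "val",
    epilogue_fn=lambda v: f"({v} + 1.0)",
    extra_stores=["tl.store(out2, acc)"],
)
```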

FAQ

Why intercept at hl.load/store codegen?

At the time the fusion decision is made in the Inductor scheduler, the prologue/epilogue ops to be fused are already in Inductor IR form (i.e. a low-level IR closest to the actual Triton code). The most natural place to intercept the Helion compilation/codegen pipeline and insert these ops is the hl.load/hl.store codegen function, which already deals with Triton-level codegen.

A theoretical alternative is to let the Inductor scheduler modify the Helion kernel on the fly to take an additional prologue/epilogue callable (an FX graph), and have the Helion kernel call that FX callable on its inputs/outputs. However, this requires "going up the IR level", i.e. converting the prologue/epilogue ops from Inductor IR (low-level) back to FX IR (high-level), which is difficult in general, so we avoid that approach here.

Does Inductor invoke Helion which invokes Inductor? Is this a circular dependency?

Yes, Inductor invokes Helion which invokes Inductor—but for different purposes, and there is no circular dependency.

Inductor has three main stages: lowering → scheduler → Triton codegen. Depending on the use case, we don't have to use all three.

As described above, the prologue/epilogue fusion flow works as follows:

  1. Inductor lowering processes the torch.compile program and produces IR containing a call to the Helion kernel. This IR is then fed into the Inductor scheduler.

  2. Inductor scheduler encounters the Helion kernel and triggers autotuning to find the optimal config. During autotuning, Helion internally uses Inductor's lowering and Triton codegen stages, but not the scheduler—so there is no circular dependency.

  3. Inductor codegen inserts the prologue/epilogue ops via the Helion kernel's hl.load/hl.store codegen, so these ops are generated within the Helion Triton kernel. Inductor codegen then emits a call to that fused Helion Triton kernel within the overall torch.compile output code.
