Detailed Flow
Phase 1: Dynamo Tracing
Dynamo creates a HelionKernelVariable that stores the kernel in HelionKernelSideTable with a unique kernel_idx.
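The side-table pattern exists because Dynamo cannot embed arbitrary Python objects in the FX graph; the graph carries only an integer index. A minimal sketch of that idea (all names here are illustrative, not the actual Helion/Dynamo implementation):

```python
# Hypothetical sketch of a kernel side table: the kernel object lives in a
# process-wide table, and the FX graph references it by integer kernel_idx.
class KernelSideTable:
    def __init__(self):
        self._kernels = []

    def add_kernel(self, kernel) -> int:
        # Returns a stable integer index the FX graph can carry as a constant.
        self._kernels.append(kernel)
        return len(self._kernels) - 1

    def get_kernel(self, kernel_idx: int):
        return self._kernels[kernel_idx]

side_table = KernelSideTable()
```

At lowering time, the HOP looks the kernel back up by `kernel_idx`, so the graph itself stays serializable.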
Phase 2: HOP Creation and Output Spec Inference
HelionKernelVariable.call_function() binds the kernel to infer output specs (shapes, dtypes, mutations, aliases), then emits a helion_kernel_wrapper_mutation HOP node in the FX graph.
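The spec-inference step can be pictured as running a shape-level version of the kernel on abstract inputs, analogous to tracing with fake tensors. A simplified sketch (the `OutputSpec` fields and `infer_output_specs` helper are hypothetical; the real flow also records mutation and aliasing info):

```python
from dataclasses import dataclass

# Illustrative sketch, not the real Helion API: infer output specs by running
# a shape function on abstract inputs instead of real tensors.
@dataclass
class OutputSpec:
    shape: tuple
    dtype: str

def infer_output_specs(shape_fn, input_specs):
    # shape_fn maps input specs to (shape, dtype) pairs without real compute.
    return [OutputSpec(*s) for s in shape_fn(input_specs)]

# Example: a matmul-like kernel's shape function.
def matmul_shapes(inputs):
    (m, k), (k2, n) = inputs[0].shape, inputs[1].shape
    assert k == k2
    return [((m, n), inputs[0].dtype)]

specs = infer_output_specs(
    matmul_shapes,
    [OutputSpec((4, 8), "float16"), OutputSpec((8, 16), "float16")],
)
```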
Phase 3: Inductor Lowering
When Inductor processes the FX graph, it encounters the HOP and lowers it to a HelionTemplateBuffer (subclass of Inductor's TemplateBuffer).
Inductor's TemplateBuffer is the standard mechanism for integrating external/custom kernels that generate their own code. It provides:
- Scheduler integration for fusion decisions (prologue/epilogue)
- render() callback for lazy code generation
- Input/output buffer tracking for dependency analysis
- Multi-output support via MultiOutput nodes
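A loose pure-Python model of the role such a buffer plays (this mirrors the shape of Inductor's TemplateBuffer but is not its real API; all names are illustrative):

```python
# Hypothetical model of a template buffer: it records inputs for dependency
# analysis and defers code generation to a lazy render() callback, so the
# scheduler can finalize fusion decisions before any code is emitted.
class SketchTemplateBuffer:
    def __init__(self, name, inputs, make_kernel_render):
        self.name = name
        self.inputs = inputs              # upstream buffers for dep tracking
        self._make_render = make_kernel_render
        self._code = None

    def render(self):
        # Lazy: generated once, after fusion decisions are final.
        if self._code is None:
            self._code = self._make_render()
        return self._code

buf = SketchTemplateBuffer("buf0", ["arg0", "arg1"], lambda: "def kernel(): ...")
```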
Phase 4: Inductor Scheduling and Fusion Decisions
The Inductor Scheduler decides which prologue/epilogue nodes can fuse with the Helion kernel based on compatibility analysis.
The fusion decisions are stored in _prologue_specs and _epilogue_specs.
Prologue fusion criteria (op feeding into kernel input):
- Must be a ComputedBuffer (not raw buffers or views)
Epilogue fusion criteria (op consuming kernel output):
- Kernel must not use atomic operations (atomics require precise store order)
- Shape and stride must match exactly between kernel output and epilogue input
- No non-trivial view operations (only offset=0 views with matching shape/stride)
- Epilogue can only read from kernel outputs (main output or multi-outputs)
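The epilogue criteria above can be sketched as a single predicate. A minimal sketch, assuming a dict-based representation of kernel and epilogue metadata (all field names are hypothetical):

```python
# Hypothetical compatibility check mirroring the epilogue fusion criteria.
def can_fuse_epilogue(kernel, epilogue):
    if kernel["uses_atomics"]:
        return False  # atomics require a precise store order
    if kernel["out_shape"] != epilogue["in_shape"]:
        return False  # shape must match exactly
    if kernel["out_stride"] != epilogue["in_stride"]:
        return False  # stride must match exactly
    if epilogue.get("view_offset", 0) != 0:
        return False  # only trivial (offset=0) views allowed
    # Epilogue may only read from kernel outputs.
    return set(epilogue["reads"]) <= set(kernel["outputs"])

kernel = {"uses_atomics": False, "out_shape": (4, 4),
          "out_stride": (4, 1), "outputs": {"buf0"}}
epilogue = {"in_shape": (4, 4), "in_stride": (4, 1), "reads": {"buf0"}}
```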
Phase 5 & 6: Inside HelionTemplateBuffer.render()
Inductor Codegen calls HelionTemplateBuffer.render(), which performs both autotuning and code generation:
Phase 5: Autotuning via ensure_config_exists()
- If config= provided: Uses the provided config directly (no autotuning)
- If config= not provided: Full autotuning search
This autotunes the Helion kernel without the prologue/epilogue ops. (In the future we can take those ops into consideration.)
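The decision logic amounts to a cache-then-search pattern. A minimal sketch (the function signature and state layout are illustrative, not the actual ensure_config_exists() implementation):

```python
# Sketch of the ensure_config_exists() decision: a user-supplied config
# short-circuits autotuning; otherwise a search runs and the result is
# cached so subsequent render() calls reuse it.
def ensure_config_exists(kernel_state, user_config=None, autotune=None):
    if kernel_state.get("config") is not None:
        return kernel_state["config"]           # already resolved: reuse
    if user_config is not None:
        kernel_state["config"] = user_config    # config= provided: no search
    else:
        kernel_state["config"] = autotune()     # full autotuning search
    return kernel_state["config"]

state = {}
cfg = ensure_config_exists(state, user_config={"block": 64})
```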
Phase 6: Code Generation via generate_ast()
Triton codegen for the Helion kernel, with hl.load/hl.store intercept to insert prologue/epilogue ops.
+-------------------------------------------+
| render() |
| | |
| v |
| +-------------------------------+ |
| | ensure_config_exists() | | <-- Phase 5: Autotune if needed
| +-------------------------------+ |
| | |
| v |
| +-------------------------------+ |
| | generate_ast(host_function, | | <-- Phase 6: Triton codegen
| | config) | |
| +-------------------------------+ |
| | |
| During AST generation, hl.load() |
| and hl.store() are intercepted to |
| insert prologue/epilogue ops |
| | |
| v |
| Fused Triton Code |
+-------------------------------------------+
hl.load Interception (Prologue Fusion)
+-------------------------------------------+
| memory_ops.py: hl.load codegen |
| | |
| Check: env.is_fusion_enabled()? |
| | | |
| Yes No |
| | | |
| v v |
| +------------------+ Return raw load |
| | codegen_prologue | |
| | _fusion() | |
| +------------------+ |
| | |
| v |
| Fused load expression |
| e.g., "(tl.load(x + idx0) * 2.0)" |
+-------------------------------------------+
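The load-side interception boils down to a string rewrite of the raw Triton load expression. A minimal sketch, assuming the prologue is represented as an expression template (the function name and `{load}` placeholder convention are hypothetical):

```python
# Illustrative sketch of prologue interception at hl.load codegen: when
# fusion is enabled, the raw tl.load expression is substituted into the
# fused prologue expression; otherwise the raw load is returned unchanged.
def codegen_load(ptr_expr, prologue_spec=None, fusion_enabled=True):
    raw = f"tl.load({ptr_expr})"
    if not fusion_enabled or prologue_spec is None:
        return raw  # no fusion: emit the raw load
    # e.g. a prologue "x * 2.0" turns the load into "(tl.load(...) * 2.0)"
    return "(" + prologue_spec.replace("{load}", raw) + ")"

expr = codegen_load("x + idx0", prologue_spec="{load} * 2.0")
```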
hl.store Interception (Epilogue Fusion)
+-------------------------------------------+
| memory_ops.py: store codegen |
| | |
| Check: env.is_fusion_enabled()? |
| | | |
| Yes No |
| | | |
| v v |
| +------------------+ Return raw store |
| | codegen_epilogue | |
| | _fusion() | |
| +------------------+ |
| | |
| v |
| Fused store expression |
| e.g., "tl.store(ptr, (val + 1.0))" |
| + extra_stores for epilogue outputs |
+-------------------------------------------+
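The store side is symmetric, with one extra wrinkle: the epilogue may have its own outputs, which become additional stores of the fused value. A minimal sketch under the same hypothetical placeholder convention as the load case:

```python
# Illustrative sketch of epilogue interception at hl.store codegen: the
# stored value is rewritten with the epilogue ops, and extra stores are
# emitted for any additional epilogue outputs.
def codegen_store(ptr_expr, val_expr, epilogue_spec=None, extra_outputs=()):
    if epilogue_spec is None:
        return [f"tl.store({ptr_expr}, {val_expr})"]  # raw store
    fused_val = "(" + epilogue_spec.replace("{val}", val_expr) + ")"
    stores = [f"tl.store({ptr_expr}, {fused_val})"]
    # extra_stores: each epilogue output gets its own store of the fused value
    for out_ptr in extra_outputs:
        stores.append(f"tl.store({out_ptr}, {fused_val})")
    return stores

lines = codegen_store("ptr", "val", epilogue_spec="{val} + 1.0",
                      extra_outputs=["out1_ptr"])
```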
FAQ
Why intercept at hl.load/store codegen?
At the time the fusion decision is made in the Inductor scheduler, the prologue/epilogue ops to be fused are already in Inductor IR form (i.e., the low-level IR closest to actual Triton code). The most natural place to intercept the Helion compilation/codegen pipeline and insert these ops is the hl.load/store codegen function, which already deals with Triton-level codegen.
A theoretical alternative is to let the Inductor scheduler modify the Helion kernel on the fly to take an additional prologue/epilogue callable FX graph and, within the Helion kernel, call that FX callable on the inputs/outputs. However, this requires "going up the IR level", i.e., converting the prologue/epilogue ops from Inductor IR (low-level) back to FX IR (high-level), which is difficult in general, so we avoid that approach here.
Does Inductor invoke Helion which invokes Inductor? Is this a circular dependency?
Yes, Inductor invokes Helion which invokes Inductor—but for different purposes, and there is no circular dependency.
Inductor has three main stages: lowering → scheduler → Triton codegen. Depending on the use case, we don't have to use all three.
As described above, the prologue/epilogue fusion flow works as follows:
- Inductor lowering processes the torch.compile program and produces IR containing a call to the Helion kernel. This IR is then fed into the Inductor scheduler.
- Inductor scheduler encounters the Helion kernel and triggers autotuning to find the optimal config. During autotuning, Helion internally uses Inductor's lowering and Triton codegen stages, but not the scheduler, so there is no circular dependency.
- Inductor codegen inserts the prologue/epilogue ops into the Helion kernel's hl.load/store codegen, so those ops are generated within the Helion Triton kernel. Inductor codegen then emits a call to that fused Helion Triton kernel within the overall torch.compile output code.