
Non-record: Fused Triton Megakernels — RMSNorm + LeakyReLU² (val_bpb 1.3560)#1192

Open
dentity007 wants to merge 3 commits into openai:main from NathanMaine:research/megakernels

@dentity007

Summary

Custom Triton kernels for RMSNorm and LeakyReLU(0.75)². The run beats the baseline by 0.0017 BPB, attributed to the eval speedup.

val_bpb: 1.3560 | 1×RTX 5090, 600s

  • Triton forward kernels are used for eval; training stays on the PyTorch path, which remains compatible with torch.compile(fullgraph=True)
  • autograd.Function wrappers included for training-time kernel use
  • MEGAKERNEL_ENABLED=0 falls back to identical baseline behavior
  • Implements OpenAI's requested "Megakernels" research direction
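
For readers unfamiliar with the two ops, here is a plain-Python reference sketch of the math and of the environment-variable fallback described above. It is not the Triton code from this PR. It assumes RMSNorm has no learned gain, that LeakyReLU(0.75)² means a leaky ReLU with negative slope 0.75 followed by squaring, and that an unset MEGAKERNEL_ENABLED is treated as enabled. The function names and the `eps` default are hypothetical.

```python
import math
import os


def rmsnorm(x, eps=1e-6):
    """Reference RMSNorm over one vector: x / sqrt(mean(x^2) + eps).

    Assumption: no learned gain; eps default is illustrative only.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]


def leaky_relu_sq(x, negative_slope=0.75):
    """Assumed reading of LeakyReLU(0.75)^2: leaky ReLU, then square."""
    out = []
    for v in x:
        a = v if v >= 0 else negative_slope * v
        out.append(a * a)
    return out


def megakernel_enabled(env=None):
    """Return False only when MEGAKERNEL_ENABLED is set to "0".

    Assumption: an unset variable means the kernels are enabled.
    """
    env = os.environ if env is None else env
    return env.get("MEGAKERNEL_ENABLED", "1") != "0"


if __name__ == "__main__":
    print(rmsnorm([3.0, 4.0]))
    print(leaky_relu_sq([2.0, -2.0]))
    print(megakernel_enabled({"MEGAKERNEL_ENABLED": "0"}))
```

A reference like this is mainly useful as a correctness oracle: a fused kernel's output can be compared against it elementwise within a floating-point tolerance before any timing is measured.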

🤖 Generated with Claude Code

@dentity007 dentity007 closed this Apr 1, 2026
@dentity007 dentity007 reopened this Apr 1, 2026