EcholoKernel

So long, and thanks for all the fish. - Douglas Adams

TL;DR: Writing multi-GPU kernels is hard. We built a slick UI + multi-turn LLM agent (RAG, RLM, profiler feedback, etc.) to make it easier: a better assisted, iterative, human-in-the-loop way to write real, useful multi-GPU code. EcholoKernel can generate NVSHMEM4Py + Triton-compatible kernels (despite the near-absence of documentation), reaching up to 2.4x speedup over AG + GEMM baselines on small tensors and ~1.3x speedup on the Ulysses attention combine step.

Inspiration

Writing GPU kernels is painful. Writing multi-GPU kernels is worse.

Companies love to advertise TFLOPs, but the real bottleneck in modern AI systems is networking. Especially in sparse and MoE-style models, runtime is dominated by just getting data where it needs to be, not doing the math inside a kernel.

There’s been real progress on LLM agents that can write GPU kernels using RL and RAG, but most of that work stops at a single GPU. To use multiple GPUs, researchers default to NCCL via torch.distributed, which works well for standard patterns but leaves performance on the table for workloads that need fine-grained communication.
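The standard pattern the agent starts from looks roughly like this: an all-gather collective followed by a matmul, where compute cannot begin until the communication finishes. This is a minimal sketch of that AG + GEMM baseline (the function name is ours, not from the project's code):

```python
import torch
import torch.distributed as dist

def allgather_gemm(x_local: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Baseline AG + GEMM: gather row-shards of x from all ranks, then matmul.

    The all_gather is an NCCL/gloo collective that blocks until every rank's
    shard has arrived, so the GEMM cannot overlap with communication --
    exactly the serialization a fused compute + communication kernel removes.
    """
    world_size = dist.get_world_size()
    gathered = [torch.empty_like(x_local) for _ in range(world_size)]
    dist.all_gather(gathered, x_local)   # comm: collect every rank's shard
    x_full = torch.cat(gathered, dim=0)  # reassemble the full activation
    return x_full @ w                    # compute: starts only after comm ends
```

A fused NVSHMEM-style kernel can instead begin multiplying tiles as soon as the shards they depend on land, rather than waiting for the whole gather.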

So we asked: can an AI agent take high-level torch.distributed code and turn it into fine-grained, fused compute + communication kernels? And can we get better performance?

What it does

We built a UI and an RLM-powered agent that translates torch.distributed code into NVSHMEM-style multi-GPU kernels. The core idea is to keep a human in the loop, but let the agent handle the monotony and grind of kernel iteration and ideation.

Our backend uses Triton + NVSHMEM4Py, chosen deliberately to stay close to a DSL kernel developers already understand.

The agent generates candidate kernels, compiles and runs them, and then reasons over profiler and compiler feedback to iteratively improve performance. Through the UI, a human can see prior agent attempts and step in at any point to nudge the algorithm or syntax if needed. We found this makes multi-GPU kernel prototyping a lot less painful!

Results

We got a 2.4x speedup on AG + GEMM for 256 x 256 matrices, and consistently lower latency on smaller matrix sizes for AG + GEMM (see slides).

We also tried something harder, Ulysses attention, and got a ~1.3x speedup across 2 H100s for sequence lengths ranging from 256 to 16384.

These are real-world workloads, kernels, and algorithms used in production today: dropping in our replacement gave GPT-2 consistently higher tokens/second. We're excited to see what would happen if we took the time to drop in a model end-to-end and have a giant reasoning model go to town on optimizing it.

How we built it

The UI is built with Next.js and FastAPI. Benchmarking and orchestration are written primarily in Python, with heavy use of Modal and Together to compile, run, and evaluate kernels across GPUs.
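The benchmarking side boils down to timing candidate kernels repeatably. This is a minimal CPU-side sketch of such a harness (our names, not the project's); for actual GPU kernels the calls would need to be bracketed with `torch.cuda.synchronize()` or CUDA events so the measurement covers the kernel rather than just its launch:

```python
import time
import statistics

def bench_ms(fn, warmup=10, iters=50):
    """Median wall-clock latency of fn() in milliseconds.

    Warmup iterations absorb one-time costs (JIT compilation, caches)
    before timing; the median is more robust to outliers than the mean.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)
```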

At the core is a multi-turn RLM agent that generates candidate kernels, runs them, and incorporates compiler errors, runtime behavior, and profiler feedback into subsequent iterations. Rather than treating kernel generation as a one-shot problem, the system is explicitly designed around having a human in the loop, able to intervene between LLM loops using the UI. See the slides for more details on the agentic system.

Challenges we ran into

Writing correct multi-GPU kernels is still hard, for humans and for LLMs.

Correctness was the dominant challenge: many generated kernels deadlocked, produced subtly wrong outputs, or violated NVSHMEM and Triton semantics in ways that only showed up at runtime. The agent often struggled with synchronization, memory ordering, and the implicit assumptions baked into these DSLs.

Accomplishments that we're proud of

That said, it was still striking how far we could get given our limited prior experience with NVSHMEM + Triton and the scarcity of documentation. We produced coherent NVSHMEM4Py + Triton kernels despite minimal public examples, and built a working end-to-end system that generates, benchmarks, and iterates on real multi-GPU kernels.

We also demonstrated real-world applications: legitimate performance gains on production-relevant GEMM and attention workloads, showing that agent-assisted kernel development is both within reach and genuinely useful.

What's next for EcholoKernel

Improving the agent, fleshing out benchmarks, and getting some sleep.
