<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Posts on Lei.Chat()</title>
    <link>https://www.lei.chat/posts/</link>
    <description>Recent content in Posts on Lei.Chat()</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <copyright>&amp;copy; 2018 - 2026 &lt;a href=&#34;https://www.lei.chat/&#34;&gt;Lei Zhang&lt;/a&gt;
</copyright>
    <lastBuildDate>Sat, 28 Feb 2026 15:48:43 -0800</lastBuildDate><atom:link href="https://www.lei.chat/posts/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Gluon: Explicit Performance</title>
      <link>https://www.lei.chat/posts/gluon-explicit-performance/</link>
      <pubDate>Sat, 28 Feb 2026 15:48:43 -0800</pubDate>
      
      <guid>https://www.lei.chat/posts/gluon-explicit-performance/</guid>
      <description>&lt;p&gt;Gluon augments the Triton language and compiler with an additional approach to GPU
kernel programming.
It strikes a different balance on the portability and performance spectrum by exposing more compiler
internals, thus giving developers more explicit control to reach a higher performance ceiling.
In this blog post I&amp;rsquo;ll explain Gluon as I understand it.
I will also use this as an opportunity to talk about domain-specific languages, particularly
in the context of rapidly evolving agentic software development.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Triton Bespoke Layouts</title>
      <link>https://www.lei.chat/posts/triton-bespoke-layouts/</link>
      <pubDate>Sun, 25 Jan 2026 10:12:19 -0800</pubDate>
      
      <guid>https://www.lei.chat/posts/triton-bespoke-layouts/</guid>
      <description>&lt;p&gt;Hopefully the previous articles covering linear layout &lt;a href=&#34;../triton-linear-layout-concept/&#34;&gt;concepts&lt;/a&gt; and
&lt;a href=&#34;../triton-linear-layout-examples/&#34;&gt;examples&lt;/a&gt; have helped build a solid understanding of the core generic layer powering
various Triton code generation lowerings and optimizations.
Now let&amp;rsquo;s turn our focus to the bespoke layouts, which we still constantly interact with
when working on Triton compiler internals.
Additionally, developers can now program layouts directly with Gluon; writing these bespoke layouts
is generally more intuitive than writing linear layouts.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Triton Linear Layout: Examples</title>
      <link>https://www.lei.chat/posts/triton-linear-layout-examples/</link>
      <pubDate>Sat, 10 Jan 2026 14:09:38 -0800</pubDate>
      
      <guid>https://www.lei.chat/posts/triton-linear-layout-examples/</guid>
      <description>&lt;p&gt;&lt;a href=&#34;../triton-linear-layout-concept/&#34;&gt;The previous blog post&lt;/a&gt; talked about Triton linear layout concepts, aiming to provide
the underlying motivations and an intuitive understanding.
As a companion, in this one I&amp;rsquo;d like to touch on linear layout internals and follow up with
concrete examples that show its usage in action and make it even more comprehensible.
In the same vein, plain language and explanations are preferred over mathematical
terms and interpretations.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Triton Linear Layout: Concept</title>
      <link>https://www.lei.chat/posts/triton-linear-layout-concept/</link>
      <pubDate>Tue, 31 Dec 2024 14:21:28 -0800</pubDate>
      
      <guid>https://www.lei.chat/posts/triton-linear-layout-concept/</guid>
      <description>&lt;p&gt;Layout is a core concept in Triton for representing and optimizing distribution
mappings from source problems to the target hardware compute and memory
hierarchy.
In this blog post I will talk about linear layout in Triton, the new mechanism
unifying existing bespoke layouts for different purposes.
The aim is to provide motivation and an intuitive understanding of linear
layout;
I will rely on examples and illustrations instead of theories and proofs.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Triton Compiler Development Tips</title>
      <link>https://www.lei.chat/posts/triton-compiler-development-tips/</link>
      <pubDate>Wed, 25 Dec 2024 15:13:01 -0800</pubDate>
      
      <guid>https://www.lei.chat/posts/triton-compiler-development-tips/</guid>
      <description>&lt;p&gt;&lt;a href=&#34;https://triton-lang.org/&#34;&gt;Triton&lt;/a&gt; provides an elegant solution for programming GPU kernels in Python,
positioning itself as a critical component in the modern AI software stack.
To deliver performance and portability, it leverages a compiler, whose capability
determines its potential.
Hacking on the compiler internals is not a simple task.
Here are some tips that will hopefully be useful.
I&amp;rsquo;ll try to keep this blog post updated periodically.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Leaving Google</title>
      <link>https://www.lei.chat/posts/leaving-google/</link>
      <pubDate>Tue, 26 Sep 2023 14:50:03 -0700</pubDate>
      
      <guid>https://www.lei.chat/posts/leaving-google/</guid>
      <description>&lt;p&gt;Time flies&amp;mdash;almost 9 years have passed since I joined Google.
Now the time has come for me to leave and move on.
While here, I was super lucky to mostly work on open source projects that I can
publicly talk about.
So at the end of my tenure with Google, I&amp;rsquo;d like to reflect on and summarize this
incredible journey, which I am super grateful for and thoroughly enjoyed,
before I forget the details.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Single-node ML Runtime Foundation</title>
      <link>https://www.lei.chat/posts/single-node-ml-runtime-foundation/</link>
      <pubDate>Sat, 01 Apr 2023 14:02:36 -0700</pubDate>
      
      <guid>https://www.lei.chat/posts/single-node-ml-runtime-foundation/</guid>
      <description>&lt;p&gt;Previous blog posts overviewed the MLIR dialect hierarchy for &lt;a href=&#34;../mlir-codegen-dialects-for-machine-learning-compilers/&#34;&gt;kernel code
generation&lt;/a&gt; (CodeGen) and zoomed in on the
&lt;a href=&#34;../mlir-linalg-dialect-and-patterns/&#34;&gt;Linalg&lt;/a&gt; and &lt;a href=&#34;../mlir-vector-dialect-and-patterns/&#34;&gt;Vector&lt;/a&gt; dialects among them.
Now I will switch to discussing the runtime side a bit, in order to provide
a holistic view of MLIR-based machine learning (ML) compilers.
This one touches on the foundation and basics, including the target landscape,
runtime requirements, and designs to meet them.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>MLIR Linalg Dialect and Patterns</title>
      <link>https://www.lei.chat/posts/mlir-linalg-dialect-and-patterns/</link>
      <pubDate>Wed, 31 Aug 2022 14:59:09 -0700</pubDate>
      
      <guid>https://www.lei.chat/posts/mlir-linalg-dialect-and-patterns/</guid>
      <description>&lt;p&gt;I explained the Vector dialect and related patterns in the &lt;a href=&#34;../mlir-vector-dialect-and-patterns/&#34;&gt;previous blog
post&lt;/a&gt;. In this one let us move one layer higher and
talk about the Linalg dialect and the transformations around it.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>MLIR Vector Dialect and Patterns</title>
      <link>https://www.lei.chat/posts/mlir-vector-dialect-and-patterns/</link>
      <pubDate>Sun, 31 Jul 2022 15:07:00 -0700</pubDate>
      
      <guid>https://www.lei.chat/posts/mlir-vector-dialect-and-patterns/</guid>
      <description>&lt;p&gt;The &lt;code&gt;vector&lt;/code&gt; dialect and related transformations are crucial components in the
MLIR CodeGen flow for machine learning (ML).
Today I will zoom in on it to explain its positioning in the overall
picture, its characteristics, important operations and transformations,
and best practices for using it based on my experience.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>MLIR CodeGen Dialects for Machine Learning Compilers</title>
      <link>https://www.lei.chat/posts/mlir-codegen-dialects-for-machine-learning-compilers/</link>
      <pubDate>Sun, 20 Feb 2022 15:21:03 -0500</pubDate>
      
      <guid>https://www.lei.chat/posts/mlir-codegen-dialects-for-machine-learning-compilers/</guid>
      <description>&lt;p&gt;The &lt;a href=&#34;../compilers-and-irs-llvm-ir-spirv-and-mlir/&#34;&gt;initial blog post&lt;/a&gt; in this series captured my overall take
on the evolution trends of compilers and IRs.
It also touched on &lt;a href=&#34;../compilers-and-irs-llvm-ir-spirv-and-mlir/#llvm-ir&#34;&gt;LLVM IR&lt;/a&gt;, &lt;a href=&#34;../compilers-and-irs-llvm-ir-spirv-and-mlir/#spir-v&#34;&gt;SPIR-V&lt;/a&gt;, and
&lt;a href=&#34;../compilers-and-irs-llvm-ir-spirv-and-mlir/#mlir&#34;&gt;MLIR&lt;/a&gt;, explaining the problems they address and their design
focuses.
Today I will expand on MLIR and systematically discuss its dialect hierarchy for machine
learning (ML) compilers.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Compilers and IRs: LLVM IR, SPIR-V, and MLIR</title>
      <link>https://www.lei.chat/posts/compilers-and-irs-llvm-ir-spirv-and-mlir/</link>
      <pubDate>Sat, 08 Jan 2022 13:58:34 -0500</pubDate>
      
      <guid>https://www.lei.chat/posts/compilers-and-irs-llvm-ir-spirv-and-mlir/</guid>
      <description>&lt;p&gt;Compilers are often critical components in various development toolchains that
boost developer productivity.
A compiler is normally used as a monolithic black box that consumes a high-level
source program and produces a semantically equivalent low-level one.
It is still structured inside though; what flows between internal layers
are called intermediate representations (IRs).&lt;/p&gt;
&lt;p&gt;IRs are critical to compilers. Just as there are many compilers, there are also
many IRs in use.
I&amp;rsquo;m fortunate to have direct experience with three major schools of IRs or
infrastructures thus far&amp;mdash;LLVM IR, SPIR-V, and MLIR&amp;mdash;particularly extensive for
the last two, both of which I joined at an early stage of development.
So I&amp;rsquo;d like to write a series of blog posts to record my understanding of
compilers and IRs. Hopefully they will be beneficial to others.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>CodeGen Performant Convolution Kernels for Mobile GPUs</title>
      <link>https://www.lei.chat/posts/codegen-performant-convolution-kernels-for-mobile-gpus/</link>
      <pubDate>Sun, 19 Sep 2021 19:17:07 -0400</pubDate>
      
      <guid>https://www.lei.chat/posts/codegen-performant-convolution-kernels-for-mobile-gpus/</guid>
      <description>&lt;p&gt;This blog post talks about how to generate performant code for convolution ops
using MLIR’s multiple levels of abstractions and transformations.
I initially created it for targeting ARM Mali GPUs in IREE. But given that it is
just direct tiling and vectorization, it should be widely applicable.&lt;/p&gt;
&lt;p&gt;I will walk through the lowering steps, so if you are interested in knowing how to
organize MLIR’s various dialects/patterns together to achieve similar tasks,
this blog post might also be useful.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Android Native Library Benchmarking Pipeline for Open Source Projects</title>
      <link>https://www.lei.chat/posts/android-native-library-benchmarking-pipeline-for-open-source-projects/</link>
      <pubDate>Sat, 21 Aug 2021 22:47:43 -0400</pubDate>
      
      <guid>https://www.lei.chat/posts/android-native-library-benchmarking-pipeline-for-open-source-projects/</guid>
      <description>&lt;p&gt;Today I would like to describe one way to build a scalable and frictionless
benchmarking pipeline for Android native libraries, aiming to support different
benchmark and device variants.
It is for open source projects, so it composes public services, which are commonly
free under such conditions.
The ingredients are cloud virtual machines for building, local single-board
computers (e.g., Raspberry Pi) for hosting Android devices and executing
benchmarks, a &lt;a href=&#34;https://github.com/google/dana&#34;&gt;Dana&lt;/a&gt; server for keeping track of benchmark results of
landed changes, and Python scripts for posting benchmark comparisons to pull
requests.
A &lt;a href=&#34;https://buildkite.com&#34;&gt;Buildkite&lt;/a&gt; pipeline chains them together and drives the full flow.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>GPGPU, ML Inference, and Vulkan Compute</title>
      <link>https://www.lei.chat/posts/gpgpu-ml-inference-and-vulkan-compute/</link>
      <pubDate>Sun, 25 Jul 2021 11:25:26 -0400</pubDate>
      
      <guid>https://www.lei.chat/posts/gpgpu-ml-inference-and-vulkan-compute/</guid>
      <description>&lt;p&gt;Nowadays GPUs are utilized for both graphics rendering and general-purpose
compute (GPGPU). For the latter, CUDA is the indisputable leading solution.
Still, with so many other GPU vendors, the quest for a GPGPU standard never
stops. OpenCL was a great attempt and is used widely; but it still falls
short in many aspects.
Given the success of Vulkan in graphics and it being both a graphics and
compute API, one would wonder whether it can actually be the next-generation
GPGPU standard. I certainly believe so; but the road is not paved with roses.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Edge/Mobile ML Inference Challenges</title>
      <link>https://www.lei.chat/posts/edge-mobile-ml-inference-challenges/</link>
      <pubDate>Sat, 17 Jul 2021 13:48:27 -0400</pubDate>
      
      <guid>https://www.lei.chat/posts/edge-mobile-ml-inference-challenges/</guid>
      <description>&lt;p&gt;These days if you would like to learn about machine learning, there are
abundant great resources on the web discussing model architectures and how to
code and train them.
Materials about inference, though, are generally much harder to find,
especially for edge and mobile. You might ask: inference is just the forward
pass of training, so how hard can it be? Actually, it faces lots of unique
challenges, to the extent that we are basically solving completely different
problems.
I have been working on inference at the edge for a while, so let me capture
those challenges in this blog post by contrasting with training and inference in the cloud.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Sampling Performance Counters from Mobile GPU Drivers</title>
      <link>https://www.lei.chat/posts/sampling-performance-counters-from-gpu-drivers/</link>
      <pubDate>Thu, 08 Jul 2021 19:16:41 -0400</pubDate>
      
      <guid>https://www.lei.chat/posts/sampling-performance-counters-from-gpu-drivers/</guid>
      <description>&lt;p&gt;In a &lt;a href=&#34;../android-linux-gpu-drivers-internals-and-resources&#34;&gt;previous blog post&lt;/a&gt; I gave a general introduction
to GPU driver internals in Android/Linux systems. Following up on it, today
I will explain how a specific functionality, hardware performance counter
(perf counter) queries, is handled in both Qualcomm Adreno and ARM Mali drivers,
by walking through the kernel driver source code.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Android/Linux GPU Drivers: Internals and Resources</title>
      <link>https://www.lei.chat/posts/android-linux-gpu-drivers-internals-and-resources/</link>
      <pubDate>Mon, 05 Jul 2021 18:20:07 -0400</pubDate>
      
      <guid>https://www.lei.chat/posts/android-linux-gpu-drivers-internals-and-resources/</guid>
      <description>&lt;p&gt;Recently I have been working on a library that needs to directly interact with
GPU kernel drivers from various vendors on Android/Linux systems. Compared to
the various GPU APIs, information at this level is quite sparse; so it is not a
straightforward task, to say the least, and it ended up requiring me to piece
multiple sources together to figure out the details. So I am writing these driver
internals and resources down in case they can be useful to others who are
interested in these low-level bits.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>What is Vulkan Compute?</title>
      <link>https://www.lei.chat/posts/what-is-vulkan-compute/</link>
      <pubDate>Fri, 25 Jun 2021 10:15:58 -0400</pubDate>
      
      <guid>https://www.lei.chat/posts/what-is-vulkan-compute/</guid>
      <description>&lt;p&gt;Vulkan is designed to be both a graphics and compute API. However, there is no
formal definition of the compute subset from the Khronos group, the industry
consortium behind Vulkan. The unified specification of Vulkan does not help here
either as it contains everything, both graphics and compute. Unlike the
complicated graphics subset, the compute subset is actually quite
straightforward and clean. So in this blog post I try to explain what Vulkan
compute is, from my point of view.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Shader Toolchain: HLSL in Vulkan</title>
      <link>https://www.lei.chat/posts/shader-toolchain-hlsl-in-vulkan/</link>
      <pubDate>Sat, 12 May 2018 17:44:14 -0400</pubDate>
      
      <guid>https://www.lei.chat/posts/shader-toolchain-hlsl-in-vulkan/</guid>
      <description>&lt;p&gt;On &lt;a href=&#34;https://www.khronos.org/events/2018-vulkan-developer-day-in-montreal&#34;&gt;2018 Vulkan Developer Day in Montréal&lt;/a&gt;, I gave a talk
regarding “Shader Toolchain: HLSL in Vulkan”. Here are the links to the
video recording, slides, and documentation/downloads for DirectX Shader
Compiler (DXC) SPIR-V CodeGen.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>HLSL for Vulkan: Semantic Strings and Location Numbers</title>
      <link>https://www.lei.chat/posts/hlsl-for-vulkan-semantic-strings-and-location-numbers/</link>
      <pubDate>Fri, 11 May 2018 13:15:45 -0400</pubDate>
      
      <guid>https://www.lei.chat/posts/hlsl-for-vulkan-semantic-strings-and-location-numbers/</guid>
      <description>&lt;p&gt;This blog post discusses how HLSL semantic strings are translated into
SPIR-V location numbers for Vulkan shader inter-stage interface matching
in the &lt;a href=&#34;https://github.com/Microsoft/DirectXShaderCompiler/blob/master/docs/SPIR-V.rst&#34;&gt;SPIR-V CodeGen&lt;/a&gt; of &lt;a href=&#34;https://github.com/Microsoft/DirectXShaderCompiler&#34;&gt;DirectXShaderCompiler&lt;/a&gt; (DXC).
It is one of the “HLSL for Vulkan” series.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>HLSL for Vulkan: Resources</title>
      <link>https://www.lei.chat/posts/hlsl-for-vulkan-resources/</link>
      <pubDate>Tue, 24 Apr 2018 16:39:21 -0400</pubDate>
      
      <guid>https://www.lei.chat/posts/hlsl-for-vulkan-resources/</guid>
      <description>&lt;p&gt;This blog post discusses how to manage resources in HLSL for Vulkan, using the
&lt;a href=&#34;https://github.com/Microsoft/DirectXShaderCompiler/blob/master/docs/SPIR-V.rst&#34;&gt;SPIR-V CodeGen&lt;/a&gt; of &lt;a href=&#34;https://github.com/Microsoft/DirectXShaderCompiler&#34;&gt;DirectXShaderCompiler&lt;/a&gt; (DXC).
It is one of the “HLSL for Vulkan” series.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>HLSL for Vulkan: Matrices</title>
      <link>https://www.lei.chat/posts/hlsl-for-vulkan-matrices/</link>
      <pubDate>Wed, 18 Apr 2018 20:13:20 -0400</pubDate>
      
      <guid>https://www.lei.chat/posts/hlsl-for-vulkan-matrices/</guid>
      <description>&lt;p&gt;This blog post discusses how HLSL matrices are translated into SPIR-V for Vulkan
consumption in the &lt;a href=&#34;https://github.com/Microsoft/DirectXShaderCompiler/blob/master/docs/SPIR-V.rst&#34;&gt;SPIR-V CodeGen&lt;/a&gt; of &lt;a href=&#34;https://github.com/Microsoft/DirectXShaderCompiler&#34;&gt;DirectXShaderCompiler&lt;/a&gt;.
It is one of the “HLSL for Vulkan” series.&lt;/p&gt;</description>
    </item>
    
  </channel>
</rss>
