<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Posts on Lei.Chat()</title>
    <link>https://www.lei.chat/posts/</link>
    <description>Recent content in Posts on Lei.Chat()</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <copyright>&amp;copy; 2018 - 2026 &lt;a href=&#34;https://www.lei.chat/&#34;&gt;Lei Zhang&lt;/a&gt;
</copyright>
    <lastBuildDate>Sat, 28 Feb 2026 15:48:43 -0800</lastBuildDate><atom:link href="https://www.lei.chat/posts/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Gluon: Explicit Performance</title>
      <link>https://www.lei.chat/posts/gluon-explicit-performance/</link>
      <pubDate>Sat, 28 Feb 2026 15:48:43 -0800</pubDate>
      
      <guid>https://www.lei.chat/posts/gluon-explicit-performance/</guid>
      <description>&lt;p&gt;Gluon augments the Triton language and compiler with an additional approach to GPU
kernel programming.
It strikes a different balance on the portability and performance spectrum by exposing more compiler
internals, thus giving developers more explicit control to reach a higher performance ceiling.
In this blog post I&amp;rsquo;ll explain Gluon as I understand it.
I will also use this as an opportunity to talk about domain-specific languages, particularly
in the context of rapidly evolving agentic software development.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Triton Bespoke Layouts</title>
      <link>https://www.lei.chat/posts/triton-bespoke-layouts/</link>
      <pubDate>Sun, 25 Jan 2026 10:12:19 -0800</pubDate>
      
      <guid>https://www.lei.chat/posts/triton-bespoke-layouts/</guid>
      <description>&lt;p&gt;Hopefully the previous articles covering linear layout &lt;a href=&#34;../triton-linear-layout-concept/&#34;&gt;concepts&lt;/a&gt; and
&lt;a href=&#34;../triton-linear-layout-examples/&#34;&gt;examples&lt;/a&gt; have helped build a solid understanding of the core generic layer powering
various Triton code generation lowerings and optimizations.
Now let&amp;rsquo;s turn our focus to the bespoke layouts, which we still constantly interact with
when working on Triton compiler internals.
Additionally, developers can now program layouts directly with Gluon; writing these bespoke layouts
is generally more intuitive than writing linear layouts.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Triton Linear Layout: Examples</title>
      <link>https://www.lei.chat/posts/triton-linear-layout-examples/</link>
      <pubDate>Sat, 10 Jan 2026 14:09:38 -0800</pubDate>
      
      <guid>https://www.lei.chat/posts/triton-linear-layout-examples/</guid>
      <description>&lt;p&gt;&lt;a href=&#34;../triton-linear-layout-concept/&#34;&gt;The previous blog post&lt;/a&gt; talked about Triton linear layout concepts, aiming to provide
the underlying motivations and an intuitive understanding.
As a companion, in this one I&amp;rsquo;d like to touch on linear layout internals and follow up with
concrete examples that show its usage in action and make it even more comprehensible.
In the same vein, plain language and explanations are preferred over mathematical
terms and interpretations.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Triton Linear Layout: Concept</title>
      <link>https://www.lei.chat/posts/triton-linear-layout-concept/</link>
      <pubDate>Tue, 31 Dec 2024 14:21:28 -0800</pubDate>
      
      <guid>https://www.lei.chat/posts/triton-linear-layout-concept/</guid>
      <description>&lt;p&gt;Layout is a core concept in Triton for representing and optimizing distribution
mappings from source problems to the target hardware compute and memory
hierarchy.
In this blog post I will talk about linear layout in Triton, the new mechanism
unifying existing bespoke layouts for different purposes.
The aim is to provide motivation and an intuitive understanding of linear
layout;
I will rely on examples and illustrations instead of theories and proofs.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Triton Compiler Development Tips</title>
      <link>https://www.lei.chat/posts/triton-compiler-development-tips/</link>
      <pubDate>Wed, 25 Dec 2024 15:13:01 -0800</pubDate>
      
      <guid>https://www.lei.chat/posts/triton-compiler-development-tips/</guid>
      <description>&lt;p&gt;&lt;a href=&#34;https://triton-lang.org/&#34;&gt;Triton&lt;/a&gt; provides an elegant solution for programming GPU kernels in Python,
positioning itself as a critical component in the modern AI software stack.
To deliver performance and portability, it leverages a compiler, whose capability
determines its potential.
Hacking on the compiler internals is not a simple task.
Here are some tips that will hopefully be useful.
I&amp;rsquo;ll try to keep this blog post updated periodically.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Leaving Google</title>
      <link>https://www.lei.chat/posts/leaving-google/</link>
      <pubDate>Tue, 26 Sep 2023 14:50:03 -0700</pubDate>
      
      <guid>https://www.lei.chat/posts/leaving-google/</guid>
      <description>&lt;p&gt;Time flies&amp;mdash;almost 9 years have passed since I joined Google.
Now the time has come for me to leave and move on.
While here, I was super lucky to mostly work on open source projects that I can
publicly talk about.
So at the end of my tenure with Google, I&amp;rsquo;d like to reflect on and summarize this
incredible journey, which I am super grateful for and thoroughly enjoyed,
before I forget the details.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Single-node ML Runtime Foundation</title>
      <link>https://www.lei.chat/posts/single-node-ml-runtime-foundation/</link>
      <pubDate>Sat, 01 Apr 2023 14:02:36 -0700</pubDate>
      
      <guid>https://www.lei.chat/posts/single-node-ml-runtime-foundation/</guid>
      <description>&lt;p&gt;Previous blog posts overviewed the MLIR dialect hierarchy for &lt;a href=&#34;../mlir-codegen-dialects-for-machine-learning-compilers/&#34;&gt;kernel code
generation&lt;/a&gt; (CodeGen) and zoomed in on the
&lt;a href=&#34;../mlir-linalg-dialect-and-patterns/&#34;&gt;Linalg&lt;/a&gt; and &lt;a href=&#34;../mlir-vector-dialect-and-patterns/&#34;&gt;Vector&lt;/a&gt; dialects among them.
Now I will switch to discussing the runtime side a bit, in order to provide
a holistic view of MLIR-based machine learning (ML) compilers.
This one touches on the foundation and basics, including the target landscape,
runtime requirements, and designs to meet them.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>MLIR Linalg Dialect and Patterns</title>
      <link>https://www.lei.chat/posts/mlir-linalg-dialect-and-patterns/</link>
      <pubDate>Wed, 31 Aug 2022 14:59:09 -0700</pubDate>
      
      <guid>https://www.lei.chat/posts/mlir-linalg-dialect-and-patterns/</guid>
      <description>&lt;p&gt;I explained the Vector dialect and related patterns in the &lt;a href=&#34;../mlir-vector-dialect-and-patterns/&#34;&gt;previous blog
post&lt;/a&gt;. In this one let us move one layer higher and
talk about the Linalg dialect and the transformations around it.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>MLIR Vector Dialect and Patterns</title>
      <link>https://www.lei.chat/posts/mlir-vector-dialect-and-patterns/</link>
      <pubDate>Sun, 31 Jul 2022 15:07:00 -0700</pubDate>
      
      <guid>https://www.lei.chat/posts/mlir-vector-dialect-and-patterns/</guid>
      <description>&lt;p&gt;The &lt;code&gt;vector&lt;/code&gt; dialect and related transformations are crucial components in the
MLIR CodeGen flow for machine learning (ML).
Today I will zoom in on it to explain its positioning in the overall
picture, its characteristics, important operations and transformations,
and best practices for using it based on my experience.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>MLIR CodeGen Dialects for Machine Learning Compilers</title>
      <link>https://www.lei.chat/posts/mlir-codegen-dialects-for-machine-learning-compilers/</link>
      <pubDate>Sun, 20 Feb 2022 15:21:03 -0500</pubDate>
      
      <guid>https://www.lei.chat/posts/mlir-codegen-dialects-for-machine-learning-compilers/</guid>
      <description>&lt;p&gt;The &lt;a href=&#34;../compilers-and-irs-llvm-ir-spirv-and-mlir/&#34;&gt;initial blog post&lt;/a&gt; in this series captured my overall take
on the evolution trends of compilers and IRs.
It also touched on &lt;a href=&#34;../compilers-and-irs-llvm-ir-spirv-and-mlir/#llvm-ir&#34;&gt;LLVM IR&lt;/a&gt;, &lt;a href=&#34;../compilers-and-irs-llvm-ir-spirv-and-mlir/#spir-v&#34;&gt;SPIR-V&lt;/a&gt;, and
&lt;a href=&#34;../compilers-and-irs-llvm-ir-spirv-and-mlir/#mlir&#34;&gt;MLIR&lt;/a&gt;, explaining the problems they address and their design
focuses.
Today I will expand on MLIR and systematically discuss its dialect hierarchy for machine
learning (ML) compilers.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Compilers and IRs: LLVM IR, SPIR-V, and MLIR</title>
      <link>https://www.lei.chat/posts/compilers-and-irs-llvm-ir-spirv-and-mlir/</link>
      <pubDate>Sat, 08 Jan 2022 13:58:34 -0500</pubDate>
      
      <guid>https://www.lei.chat/posts/compilers-and-irs-llvm-ir-spirv-and-mlir/</guid>
      <description>&lt;p&gt;Compilers are often critical components in various development toolchains that
boost developer productivity.
A compiler is normally used as a monolithic black box that consumes a high-level
source program and produces a semantically equivalent low-level one.
It is still structured inside though; what flows between internal layers
are called intermediate representations (IRs).&lt;/p&gt;
&lt;p&gt;IRs are critical to compilers. Just as there are many compilers, there are also
many IRs in use.
I&amp;rsquo;m fortunate to have direct experience with three major schools of IRs or
infrastructures thus far&amp;mdash;LLVM IR, SPIR-V, and MLIR&amp;mdash;particularly extensive for
the last two, both of which I joined at an early stage of development.
So I&amp;rsquo;d like to write a series of blog posts to record my understanding of
compilers and IRs. Hopefully they will be beneficial to others.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>CodeGen Performant Convolution Kernels for Mobile GPUs</title>
      <link>https://www.lei.chat/posts/codegen-performant-convolution-kernels-for-mobile-gpus/</link>
      <pubDate>Sun, 19 Sep 2021 19:17:07 -0400</pubDate>
      
      <guid>https://www.lei.chat/posts/codegen-performant-convolution-kernels-for-mobile-gpus/</guid>
      <description>&lt;p&gt;This blog post talks about how to generate performant code for convolution ops
using MLIR’s multiple levels of abstractions and transformations.
I initially created it for targeting ARM Mali GPUs in IREE. But given that it is
just direct tiling and vectorization, it should be widely applicable.&lt;/p&gt;
&lt;p&gt;I will walk through the lowering steps, so if you are interested in knowing how to
organize MLIR’s various dialects/patterns together to achieve similar tasks,
this blog post might also be useful.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Android Native Library Benchmarking Pipeline for Open Source Projects</title>
      <link>https://www.lei.chat/posts/android-native-library-benchmarking-pipeline-for-open-source-projects/</link>
      <pubDate>Sat, 21 Aug 2021 22:47:43 -0400</pubDate>
      
      <guid>https://www.lei.chat/posts/android-native-library-benchmarking-pipeline-for-open-source-projects/</guid>
      <description>&lt;p&gt;Today I would like to describe one way to build a scalable and frictionless
benchmarking pipeline for Android native libraries, aiming to support different
benchmark and device variants.
It is for open source projects, so it composes public services, which are commonly
free under such conditions.
The ingredients are cloud virtual machines for building, local single-board
computers (e.g., Raspberry Pi) for hosting Android devices and executing
benchmarks, a &lt;a href=&#34;https://github.com/google/dana&#34;&gt;Dana&lt;/a&gt; server for keeping track of benchmark results of
landed changes, and Python scripts for posting benchmark comparisons to pull
requests.
A &lt;a href=&#34;https://buildkite.com&#34;&gt;Buildkite&lt;/a&gt; pipeline chains them together and drives the full flow.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>GPGPU, ML Inference, and Vulkan Compute</title>
      <link>https://www.lei.chat/posts/gpgpu-ml-inference-and-vulkan-compute/</link>
      <pubDate>Sun, 25 Jul 2021 11:25:26 -0400</pubDate>
      
      <guid>https://www.lei.chat/posts/gpgpu-ml-inference-and-vulkan-compute/</guid>
      <description>&lt;p&gt;Nowadays GPUs are utilized for both graphics rendering and general-purpose
compute (GPGPU). For the latter, CUDA is the indisputable leading solution.
Still, with so many other GPU vendors, the quest for a GPGPU standard never
stops. OpenCL was a great attempt and is used widely; but it still falls
short in many aspects.
Given the success of Vulkan in graphics and it being both a graphics and
compute API, one would wonder whether it can actually be the next-generation
GPGPU standard. I certainly believe so; but the road is not paved with roses.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Edge/Mobile ML Inference Challenges</title>
      <link>https://www.lei.chat/posts/edge-mobile-ml-inference-challenges/</link>
      <pubDate>Sat, 17 Jul 2021 13:48:27 -0400</pubDate>
      
      <guid>https://www.lei.chat/posts/edge-mobile-ml-inference-challenges/</guid>
      <description>&lt;p&gt;These days if you would like to learn about machine learning, there are
abundant great resources on the web discussing model architectures and how to
code and train them.
Materials about inference, though, are generally much harder to find,
especially for edge and mobile. You might ask: inference is just the forward
pass of training, so how hard can it be? Actually, it faces lots of unique
challenges, to the extent that we are basically solving completely different
problems.
I have been working on inference at the edge for a while, so let me capture
those challenges in this blog post by contrasting with training and inference in the cloud.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Sampling Performance Counters from Mobile GPU Drivers</title>
      <link>https://www.lei.chat/posts/sampling-performance-counters-from-gpu-drivers/</link>
      <pubDate>Thu, 08 Jul 2021 19:16:41 -0400</pubDate>
      
      <guid>https://www.lei.chat/posts/sampling-performance-counters-from-gpu-drivers/</guid>
      <description>&lt;p&gt;In a &lt;a href=&#34;../android-linux-gpu-drivers-internals-and-resources&#34;&gt;previous blog post&lt;/a&gt; I gave a general introduction
to GPU driver internals in Android/Linux systems. Following up on it, today
I will explain how a specific functionality, hardware performance counter
(perf counter) queries, is handled in both Qualcomm Adreno and ARM Mali drivers,
by walking through the kernel driver source code.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Android/Linux GPU Drivers: Internals and Resources</title>
      <link>https://www.lei.chat/posts/android-linux-gpu-drivers-internals-and-resources/</link>
      <pubDate>Mon, 05 Jul 2021 18:20:07 -0400</pubDate>
      
      <guid>https://www.lei.chat/posts/android-linux-gpu-drivers-internals-and-resources/</guid>
      <description>&lt;p&gt;Recently I have been working on a library that needs to directly interact with
GPU kernel drivers from various vendors on Android/Linux systems. Compared to
the various GPU APIs, information at this level is quite sparse; so it is not a
straightforward task, to say the least, and it ended up requiring me to piece
multiple sources together to figure out the details. So I am writing these driver
internals and resources down in case they can be useful to others who are
interested in these low-level bits.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>What is Vulkan Compute?</title>
      <link>https://www.lei.chat/posts/what-is-vulkan-compute/</link>
      <pubDate>Fri, 25 Jun 2021 10:15:58 -0400</pubDate>
      
      <guid>https://www.lei.chat/posts/what-is-vulkan-compute/</guid>
      <description>&lt;p&gt;Vulkan is designed to be both a graphics and compute API. However, there is no
formal definition of the compute subset from the Khronos group, the industry
consortium behind Vulkan. The unified specification of Vulkan does not help here
either as it contains everything, both graphics and compute. Unlike the
complicated graphics subset, the compute subset is actually quite
straightforward and clean. So in this blog post I try to explain what Vulkan
compute is, from my point of view.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Shader Toolchain: HLSL in Vulkan</title>
      <link>https://www.lei.chat/posts/shader-toolchain-hlsl-in-vulkan/</link>
      <pubDate>Sat, 12 May 2018 17:44:14 -0400</pubDate>
      
      <guid>https://www.lei.chat/posts/shader-toolchain-hlsl-in-vulkan/</guid>
      <description>&lt;p&gt;On &lt;a href=&#34;https://www.khronos.org/events/2018-vulkan-developer-day-in-montreal&#34;&gt;2018 Vulkan Developer Day in Montréal&lt;/a&gt;, I gave a talk
regarding “Shader Toolchain: HLSL in Vulkan”. Here are the links to the
video recording, slides, and documentation/downloads for DirectX Shader
Compiler (DXC) SPIR-V CodeGen.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>HLSL for Vulkan: Semantic Strings and Location Numbers</title>
      <link>https://www.lei.chat/posts/hlsl-for-vulkan-semantic-strings-and-location-numbers/</link>
      <pubDate>Fri, 11 May 2018 13:15:45 -0400</pubDate>
      
      <guid>https://www.lei.chat/posts/hlsl-for-vulkan-semantic-strings-and-location-numbers/</guid>
      <description>&lt;p&gt;This blog post discusses how HLSL semantic strings are translated into
SPIR-V location numbers for Vulkan shader inter-stage interface matching
in the &lt;a href=&#34;https://github.com/Microsoft/DirectXShaderCompiler/blob/master/docs/SPIR-V.rst&#34;&gt;SPIR-V CodeGen&lt;/a&gt; of &lt;a href=&#34;https://github.com/Microsoft/DirectXShaderCompiler&#34;&gt;DirectXShaderCompiler&lt;/a&gt; (DXC).
It is one of the “HLSL for Vulkan” series.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>HLSL for Vulkan: Resources</title>
      <link>https://www.lei.chat/posts/hlsl-for-vulkan-resources/</link>
      <pubDate>Tue, 24 Apr 2018 16:39:21 -0400</pubDate>
      
      <guid>https://www.lei.chat/posts/hlsl-for-vulkan-resources/</guid>
      <description>&lt;p&gt;This blog post discusses how to manage resources in HLSL for Vulkan, using the
&lt;a href=&#34;https://github.com/Microsoft/DirectXShaderCompiler/blob/master/docs/SPIR-V.rst&#34;&gt;SPIR-V CodeGen&lt;/a&gt; of &lt;a href=&#34;https://github.com/Microsoft/DirectXShaderCompiler&#34;&gt;DirectXShaderCompiler&lt;/a&gt; (DXC).
It is one of the “HLSL for Vulkan” series.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>HLSL for Vulkan: Matrices</title>
      <link>https://www.lei.chat/posts/hlsl-for-vulkan-matrices/</link>
      <pubDate>Wed, 18 Apr 2018 20:13:20 -0400</pubDate>
      
      <guid>https://www.lei.chat/posts/hlsl-for-vulkan-matrices/</guid>
      <description>&lt;p&gt;This blog post discusses how HLSL matrices are translated into SPIR-V for Vulkan
consumption in the &lt;a href=&#34;https://github.com/Microsoft/DirectXShaderCompiler/blob/master/docs/SPIR-V.rst&#34;&gt;SPIR-V CodeGen&lt;/a&gt; of &lt;a href=&#34;https://github.com/Microsoft/DirectXShaderCompiler&#34;&gt;DirectXShaderCompiler&lt;/a&gt;.
It is one of the “HLSL for Vulkan” series.&lt;/p&gt;</description>
    </item>
    
  </channel>
</rss>
