<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://cmeraki.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://cmeraki.github.io/" rel="alternate" type="text/html" /><updated>2024-11-21T08:14:02+00:00</updated><id>https://cmeraki.github.io/feed.xml</id><title type="html">LLM Labs</title><subtitle>LLM Labs</subtitle><author><name>Meraki Labs</name></author><entry><title type="html">GPUs Part 3 - Going from here</title><link href="https://cmeraki.github.io/gpu-part3.html" rel="alternate" type="text/html" title="GPUs Part 3 - Going from here" /><published>2024-06-30T00:00:00+00:00</published><updated>2024-06-30T00:00:00+00:00</updated><id>https://cmeraki.github.io/gpu-part3</id><content type="html" xml:base="https://cmeraki.github.io/gpu-part3.html"><![CDATA[<!-- markdownlint-disable MD036 MD029 -->

<p>Written by <a href="https://www.linkedin.com/in/r0m1t/">Romit Jain</a></p>

<p>Hopefully, you have read <a href="./gpu-part1.html">part 1</a> and <a href="./gpu-part2.html">part 2</a> of the Learning about GPUs series. This part provides an index of useful resources for building a more advanced understanding of GPUs.</p>

<h2 id="learning-about-the-fundamentals">Learning about the fundamentals</h2>

<ol>
  <li>[Book] Programming Massively Parallel Processors, A Hands-on Approach By David B. Kirk, Wen-mei W. Hwu
    <ol>
      <li>This is the best resource to learn about parallel programming and GPUs. The first 4 chapters explain the fundamentals of GPU hardware and its programming model</li>
    </ol>
  </li>
  <li>[YouTube playlist] 12 to 14 videos in <a href="https://www.youtube.com/playlist?list=PLG3vBTUJlY2HdwYsdFCdXQraInoc3j9DU">COS 436</a></li>
  <li><a href="https://www.youtube.com/channel/UCJgIbYl6C5no72a0NUAPcTA">CUDA Mode</a>
    <ol>
      <li>Very good resource for learning about GPUs/CUDA/Triton. They also have a very active Discord</li>
    </ol>
  </li>
  <li><a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html">CUDA C++ programming guide</a>
    <ol>
      <li>Official guide from Nvidia which can be used as a reference</li>
    </ol>
  </li>
  <li>[YouTube playlist] <a href="https://www.youtube.com/playlist?list=PLC6u37oFvF40BAm7gwVP7uDdzmW83yHPe">CUDA teaching center</a>
    <ol>
      <li>Short series to get started in CUDA and get a refresher on GPU hardware</li>
    </ol>
  </li>
</ol>

<h2 id="notable-talks">Notable Talks</h2>

<ol>
  <li><a href="https://www.youtube.com/watch?v=3l10o0DYJXg">GTC 2021 - How GPU Computing Works</a></li>
  <li><a href="https://www.youtube.com/live/v_q2JTIqE20">GPU Optimization session hosted by Chip Huyen</a></li>
  <li><a href="https://www.youtube.com/watch?v=QQceTDjA4f4">GTC 2022 - How CUDA Programming Works - Stephen Jones, CUDA Architect, NVIDIA</a></li>
  <li><a href="https://www.youtube.com/watch?v=KHa-OSrZPGo">Bringing Clang and C++ to GPUs: An Open-Source, CUDA-Compatible GPU C++ Compiler</a></li>
</ol>

<h2 id="notable-blogs">Notable blogs</h2>

<ol>
  <li><a href="https://codeconfessions.substack.com/p/gpu-computing">What every developer should know about GPU computing</a>
    <ol>
      <li>Gentle introduction to the GPU programming model</li>
    </ol>
  </li>
  <li><a href="https://www.thonking.ai/p/what-shapes-do-matrix-multiplications">What Shapes Do Matrix Multiplications Like?</a>
    <ol>
      <li>Puzzles to test your understanding of GPU hardware</li>
    </ol>
  </li>
  <li><a href="https://horace.io/brrr_intro.html">Making Deep Learning Go Brrrr From First Principles</a></li>
  <li><a href="https://finbarr.ca/how-is-llama-cpp-possible/">How is LLaMa.cpp possible?</a></li>
</ol>

<h2 id="programming-tutorials">Programming tutorials</h2>

<ol>
  <li><a href="https://penny-xu.github.io/blog/tiled-matrix-multiplication">Tiled matrix multiplication</a> in CUDA</li>
  <li>Matrix multiplication in pure CUDA: <a href="https://siboehm.com/articles/22/CUDA-MMM">How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog</a></li>
  <li><a href="https://github.com/srush/GPU-Puzzles">GPU puzzles by Srush</a></li>
  <li><a href="https://github.com/srush/Triton-Puzzles">Triton puzzles by Srush</a></li>
  <li><a href="https://github.com/karpathy/llm.c">LLM.c</a> LLM training in raw C/CUDA</li>
</ol>

<h2 id="citations">Citations</h2>

<p>For attribution, please cite this as</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{romit2024gpus3,
  title   = {GPUs Part 3},
  author  = {Jain, Romit},
  journal = {cmeraki.github.io},
  year    = {2024},
  month   = {June},
  url     = {https://cmeraki.github.io/gpu-part3.html}
}
</code></pre></div></div>]]></content><author><name>Meraki Labs</name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">GPUs Part 2 - Understanding the GPU programming model</title><link href="https://cmeraki.github.io/gpu-part2.html" rel="alternate" type="text/html" title="GPUs Part 2 - Understanding the GPU programming model" /><published>2024-05-26T00:00:00+00:00</published><updated>2024-05-26T00:00:00+00:00</updated><id>https://cmeraki.github.io/gpu-part2</id><content type="html" xml:base="https://cmeraki.github.io/gpu-part2.html"><![CDATA[<!-- markdownlint-disable MD036 MD029 -->

<p>Written by <a href="https://www.linkedin.com/in/r0m1t/">Romit Jain</a></p>

<p><a href="./gpu-part1.html">Part 1</a> in the series gives a basic understanding of the GPU hardware. This blog will describe the programming model that is used to run programs on GPUs.</p>

<h2 id="hardware-to-software-mapping-and-programming-model-of-the-gpu">Hardware to software mapping and programming model of the GPU</h2>

<blockquote>
  <p>2 things to keep in mind before we start:</p>

  <ol>
    <li>The physical concepts of hardware do not necessarily translate one-to-one to logical concepts in software.</li>
    <li>In GPU programming, a kernel is a function that is written to be executed on the GPU. A program can have multiple kernels and they can be “launched” from the CPU.</li>
  </ol>
</blockquote>

<h3 id="threads">Threads</h3>

<p>Each kernel is executed by a thread in the GPU. And every thread executes the same kernel (assuming there is only a single kernel in the program). This makes it necessary to write kernels such that a single function can operate on all the data points. When a kernel is launched, multiple GPU threads are spawned that execute instructions written inside that kernel. The number of threads that are spawned at once is configurable.</p>

<p>Each thread has a small amount of memory associated with it, which is called local memory. Apart from that, threads can also access the shared memory, L2 cache, and global memory.</p>

<p>Physically, threads are assigned to cores. Cores execute software threads.</p>

<h3 id="blocks">Blocks</h3>

<p>Threads are logically organized into blocks. Every block has a pre-defined number of threads assigned to it. <em>Just for logical purposes</em>, threads can be arranged inside a block in either a 1D, 2D, or 3D array layout. Blocks can be thought of as an array of threads. It’s important to understand that this 1D, 2D, or 3D arrangement is purely logical and for the developer’s convenience only. This arrangement is provided so it’s easier to visualize input and output data. For example, if the kernel needs to operate on a 100x100 matrix, then conceptually a kernel with a block size of 100 by 100 threads can be launched. That will start a total of $10^4$ (100x100) threads which can be mapped to the matrix. The kernel can be written such that each thread operates on exactly one element of the matrix. (In practice, CUDA limits a block to 1024 threads, so a matrix of this size would be split across several blocks.)</p>

<p>In the physical world, every block is assigned an SM (Streaming multiprocessor). Throughout its execution, the block will only be executed on the same SM. Since every block is assigned an SM, it also has access to the SM’s shared memory (refer to Part 1 of the series for more context). All the threads that are part of a single block can access and share this memory.</p>

<h3 id="grids">Grids</h3>

<p>Similar to how threads are organized in blocks, blocks are themselves organized into a grid. That allows the GPU to launch multiple blocks at one time. A single GPU has multiple SMs, so multiple blocks can be launched at once so that all of the SMs and cores are utilized. Let’s assume that the program executes 25 blocks and the GPU has 10 SMs. Then the program will execute 10 blocks in the first wave, 10 blocks in the second wave, and 5 blocks in the third wave. The first two waves will have 100% utilization but the last wave will have only 50% utilization.</p>
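<p>The wave arithmetic above can be sketched in a few lines of Python. This is only an illustration under the same simplifying assumption as the text (one block per SM per wave); the function name <code class="language-plaintext highlighter-rouge">wave_utilization</code> is made up for this example and is not part of CUDA.</p>

```python
import math


def wave_utilization(num_blocks, num_sms):
    """Return the fraction of SMs kept busy in each wave.

    Assumes one block per SM per wave, which is a simplification.
    """
    num_waves = math.ceil(num_blocks / num_sms)
    utilization = []
    remaining = num_blocks
    for _ in range(num_waves):
        # A wave can hold at most one block per SM
        blocks_in_wave = min(remaining, num_sms)
        utilization.append(blocks_in_wave / num_sms)
        remaining -= blocks_in_wave
    return utilization


# 25 blocks on a GPU with 10 SMs: two full waves and one half-empty wave
print(wave_utilization(num_blocks=25, num_sms=10))  # [1.0, 1.0, 0.5]
```

<p>Picking a block count that is a multiple of the number of SMs avoids the partially filled last wave.</p>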

<p>Blocks inside a grid can be organized in the same way that threads are organized inside a block. A grid can have a 1D, 2D, or 3D array layout of the blocks. The arrangement of blocks and threads is just logical. A single program only executes a single grid at a time. The grid has access to the global memory or HBM of the GPU.</p>

<p><img src="assets/images/post3/image-2.png" alt="threads-blocks" /></p>

<p>Figure 1: Grids/Blocks/Threads layout
Source: Borrowed from <a href="https://siboehm.com/articles/22/CUDA-MMM">this</a> excellent blog.</p>

<p>During execution, a total of <code class="language-plaintext highlighter-rouge">threads per block (b) * number of blocks (num)</code> physical threads are spawned. Each physical thread is numbered from <code class="language-plaintext highlighter-rouge">0</code> to <code class="language-plaintext highlighter-rouge">(b*num)-1</code>. So, how is the 2D or 3D structure of logical thread blocks mapped to the physical threads? By unrolling.</p>

<p>A 2D array layout can be unrolled to 1D. If it’s row-major ordering, then a 2D matrix after unrolling will look like this:</p>

<p><img src="assets/images/post3/image-1.png" alt="matrix unrolling" /></p>

<p>Figure 2: Element <code class="language-plaintext highlighter-rouge">A[2][3]</code> in the 2D matrix will be <code class="language-plaintext highlighter-rouge">A[5]</code> in the flattened 1D array. This is how the mapping of 2D blocks of threads to the 1D thread array is accomplished.</p>

<p>When blocks and threads are arranged in this 1D, 2D, or 3D layout, CUDA maps them to the x-axis, y-axis, and z-axis in its programming model. This will be useful in the next section.</p>
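<p>As a rough sketch of the unrolling, the snippet below flattens an <code class="language-plaintext highlighter-rouge">(x, y, z)</code> thread index into a 1D id, with x varying fastest, then y, then z. The helper names are hypothetical, and the numbering across blocks is only a conceptual picture, not a statement about the order in which the hardware schedules threads.</p>

```python
def linear_thread_id(thread_idx, block_dim):
    """Flatten an (x, y, z) index inside a block into a 1D id.

    x varies fastest, then y, then z (row-major over the block).
    """
    x, y, z = thread_idx
    dim_x, dim_y, _ = block_dim
    return x + y * dim_x + z * dim_x * dim_y


def global_thread_id(block_id, threads_per_block, local_id):
    """Conceptual numbering of a thread across a 1D grid of blocks."""
    return block_id * threads_per_block + local_id


# A block with 4 threads along x and 3 along y, i.e. 3 rows by 4 columns
block_dim = (4, 3, 1)
ids = [linear_thread_id((x, y, 0), block_dim) for y in range(3) for x in range(4)]
print(ids)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# Thread 5 of block 2, with 12 threads per block
print(global_thread_id(block_id=2, threads_per_block=12, local_id=5))  # 29
```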

<h2 id="a-simple-example-in-cuda">A simple example in CUDA</h2>

<p>CUDA is a programming extension of C/C++ that helps write heterogeneous programs (that run on CPU and GPU). These programs can define and launch kernels from the CPU. CUDA is very powerful and offers a lot of ways to optimize the kernels. It’s just a bit … too verbose. Let’s write a very naive implementation of matrix multiplication to understand how CUDA works. A few CUDA function calls will be used throughout the code. They should be self-explanatory, but in case they are not, just google the syntax. This is a relatively simple kernel, so it should be easy to follow along.</p>

<p>Here are the general steps of writing and launching a kernel from CUDA:</p>

<ol>
  <li>Allocate the memory for the data (both input and output) in CPU memory (the CPU is also called the host). Allocate memory for the input (<code class="language-plaintext highlighter-rouge">X</code>), weight matrix (<code class="language-plaintext highlighter-rouge">W</code>), and output (<code class="language-plaintext highlighter-rouge">O</code>). Here, <code class="language-plaintext highlighter-rouge">B</code> is the batch size, <code class="language-plaintext highlighter-rouge">N</code> is the number of rows (the sequence length in transformers), <code class="language-plaintext highlighter-rouge">D_in</code> is the number of columns (the embedding dimension), and <code class="language-plaintext highlighter-rouge">D_out</code> is the hidden dimension.</li>
</ol>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="o">*</span><span class="n">X</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span><span class="o">*</span><span class="p">)</span><span class="n">malloc</span><span class="p">(</span><span class="n">B</span><span class="o">*</span><span class="n">N</span><span class="o">*</span><span class="n">D_in</span><span class="o">*</span><span class="nf">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>      <span class="c1">// Input data</span>
<span class="kt">float</span> <span class="o">*</span><span class="n">W</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span><span class="o">*</span><span class="p">)</span><span class="n">malloc</span><span class="p">(</span><span class="n">D_in</span><span class="o">*</span><span class="n">D_out</span><span class="o">*</span><span class="nf">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>    <span class="c1">// Weights</span>
<span class="kt">float</span> <span class="o">*</span><span class="n">O</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span><span class="o">*</span><span class="p">)</span><span class="n">malloc</span><span class="p">(</span><span class="n">B</span><span class="o">*</span><span class="n">N</span><span class="o">*</span><span class="n">D_out</span><span class="o">*</span><span class="nf">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>     <span class="c1">// Output data</span>
</code></pre></div></div>
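<p>To get a feel for the sizes involved, here is a small sketch that computes how many bytes each of the three buffers needs. The concrete values of <code class="language-plaintext highlighter-rouge">B</code>, <code class="language-plaintext highlighter-rouge">N</code>, <code class="language-plaintext highlighter-rouge">D_in</code>, and <code class="language-plaintext highlighter-rouge">D_out</code> are arbitrary picks for illustration, and <code class="language-plaintext highlighter-rouge">buffer_bytes</code> is a hypothetical helper.</p>

```python
import math
import struct

# Size of a C float on this machine (typically 4 bytes)
FLOAT_BYTES = struct.calcsize("f")


def buffer_bytes(*dims):
    """Bytes needed for a dense float buffer with the given dimensions."""
    return math.prod(dims) * FLOAT_BYTES


# Illustrative sizes only, not taken from the post
B, N, D_in, D_out = 2, 4, 8, 16

sizes = {
    "X": buffer_bytes(B, N, D_in),    # input:   B * N * D_in floats
    "W": buffer_bytes(D_in, D_out),   # weights: D_in * D_out floats
    "O": buffer_bytes(B, N, D_out),   # output:  B * N * D_out floats
}
print(sizes)
```

<p>The same byte counts are passed to the allocation and copy calls on both the host and the device.</p>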

<ol>
  <li>Allocate the memory for the data on the GPU (also called the device).</li>
</ol>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="o">*</span><span class="n">d_X</span><span class="p">,</span> <span class="o">*</span><span class="n">d_W</span><span class="p">,</span> <span class="o">*</span><span class="n">d_O</span><span class="p">;</span>

<span class="n">cudaMalloc</span><span class="p">((</span><span class="kt">void</span><span class="o">**</span><span class="p">)</span> <span class="o">&amp;</span><span class="n">d_X</span><span class="p">,</span> <span class="n">B</span><span class="o">*</span><span class="n">N</span><span class="o">*</span><span class="n">D_in</span><span class="o">*</span><span class="nf">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>      <span class="c1">//cudaMalloc is a CUDA function and allocates memory on the GPU memory</span>
<span class="n">cudaMalloc</span><span class="p">((</span><span class="kt">void</span><span class="o">**</span><span class="p">)</span> <span class="o">&amp;</span><span class="n">d_W</span><span class="p">,</span> <span class="n">D_in</span><span class="o">*</span><span class="n">D_out</span><span class="o">*</span><span class="nf">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>
<span class="n">cudaMalloc</span><span class="p">((</span><span class="kt">void</span><span class="o">**</span><span class="p">)</span> <span class="o">&amp;</span><span class="n">d_O</span><span class="p">,</span> <span class="n">B</span><span class="o">*</span><span class="n">N</span><span class="o">*</span><span class="n">D_out</span><span class="o">*</span><span class="nf">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>
</code></pre></div></div>

<ol>
  <li>Copy the relevant data from the CPU memory to the GPU memory. Let’s assume <code class="language-plaintext highlighter-rouge">X</code> and <code class="language-plaintext highlighter-rouge">W</code> are loaded with the relevant data. Next, transfer that data to the GPU. Just for convenience, I have prefixed the variables that will reside in GPU memory with <code class="language-plaintext highlighter-rouge">d_</code>. These variables are copies of <code class="language-plaintext highlighter-rouge">X</code> and <code class="language-plaintext highlighter-rouge">W</code> but allocated in the GPU memory.</li>
</ol>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cudaMemcpy</span><span class="p">(</span><span class="n">d_X</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">B</span><span class="o">*</span><span class="n">N</span><span class="o">*</span><span class="n">D_in</span><span class="o">*</span><span class="nf">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">),</span> <span class="n">cudaMemcpyHostToDevice</span><span class="p">);</span>     <span class="c1">// cudaMemcpy is again a CUDA function</span>
<span class="n">cudaMemcpy</span><span class="p">(</span><span class="n">d_W</span><span class="p">,</span> <span class="n">W</span><span class="p">,</span> <span class="n">D_in</span><span class="o">*</span><span class="n">D_out</span><span class="o">*</span><span class="nf">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">),</span> <span class="n">cudaMemcpyHostToDevice</span><span class="p">);</span>

</code></pre></div></div>

<ol>
  <li>Launch the kernel. Assuming that the kernel is called <code class="language-plaintext highlighter-rouge">matMul</code>, <code class="language-plaintext highlighter-rouge">grid</code> defines how the blocks are arranged and <code class="language-plaintext highlighter-rouge">blocks</code> defines how threads are arranged in each block. For this example, the <code class="language-plaintext highlighter-rouge">grid</code> will be a 1D array whose length equals the batch size. <code class="language-plaintext highlighter-rouge">blocks</code> will have the same layout as a single output matrix (<code class="language-plaintext highlighter-rouge">N*D_out</code>). This means that every block will process a single output matrix from the batch and every thread will process a single cell of the output matrix.</li>
</ol>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Launch B blocks, each block processing a single batch</span>
<span class="n">dim3</span> <span class="nf">grid</span><span class="p">(</span><span class="n">B</span><span class="p">);</span>
<span class="cm">/*
Arrange the threads inside a block in the same dimension as the output
i.e N*D_out, so that logically each thread corresponds to a single element in the
output matrix. Hence, each thread is responsible for computing a single element of the output.
*/</span>
<span class="n">dim3</span> <span class="nf">blocks</span><span class="p">(</span><span class="n">D_out</span><span class="p">,</span> <span class="n">N</span><span class="p">);</span> <span class="c1">//D_out is first instead of N, because the function dim3 takes input in x, y, z notation. x axis is the columnar axis and y axis is the row axis</span>

<span class="n">matMul</span><span class="o">&lt;&lt;&lt;</span><span class="n">grid</span><span class="p">,</span> <span class="n">blocks</span><span class="o">&gt;&gt;&gt;</span><span class="p">(</span>
    <span class="n">d_X</span><span class="p">,</span>
    <span class="n">d_W</span><span class="p">,</span>
    <span class="n">d_O</span><span class="p">,</span>
    <span class="n">B</span><span class="p">,</span>
    <span class="n">N</span><span class="p">,</span>
    <span class="n">D_in</span><span class="p">,</span>
    <span class="n">D_out</span>
<span class="p">);</span>
</code></pre></div></div>

<p>In total <code class="language-plaintext highlighter-rouge">B*N*D_out</code> threads are spawned, arranged in <code class="language-plaintext highlighter-rouge">B</code> blocks.</p>
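<p>The launch configuration can be sanity-checked with a short sketch. It assumes the 1024 threads-per-block limit of current NVIDIA GPUs (the actual limit of a device can be queried through <code class="language-plaintext highlighter-rouge">maxThreadsPerBlock</code> in <code class="language-plaintext highlighter-rouge">cudaDeviceProp</code>); <code class="language-plaintext highlighter-rouge">launch_shape</code> and the sizes below are made up for illustration.</p>

```python
# Assumed per-block thread limit on current NVIDIA GPUs
MAX_THREADS_PER_BLOCK = 1024


def launch_shape(B, N, D_out):
    """Mirror the grid/block layout used for the matMul launch."""
    grid = (B,)            # one block per matrix in the batch
    block = (D_out, N)     # x axis = columns, y axis = rows
    threads_per_block = D_out * N
    total_threads = B * threads_per_block
    # The launch is only valid if a block stays within the hardware limit
    fits = threads_per_block in range(1, MAX_THREADS_PER_BLOCK + 1)
    return grid, block, total_threads, fits


print(launch_shape(B=8, N=16, D_out=32))  # ((8,), (32, 16), 4096, True)
print(launch_shape(B=8, N=64, D_out=64))  # ((8,), (64, 64), 32768, False)
```

<p>With this naive layout, an output matrix with more than 1024 cells does not fit into one block, so larger problems need the output to be tiled across several blocks.</p>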

<ol>
  <li>Copy the relevant data (usually only the output) from the GPU memory to the CPU memory. Once the kernel execution is completed, the output is copied from the GPU memory back to the CPU memory so that it can be used for any downstream processing.</li>
</ol>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cudaMemcpy</span><span class="p">(</span><span class="n">O</span><span class="p">,</span> <span class="n">d_O</span><span class="p">,</span> <span class="n">B</span><span class="o">*</span><span class="n">N</span><span class="o">*</span><span class="n">D_out</span><span class="o">*</span><span class="nf">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">),</span> <span class="n">cudaMemcpyDeviceToHost</span><span class="p">);</span>
</code></pre></div></div>

<p>These 5 steps are followed in almost all GPU programs. Let’s now dive deep into the actual kernel:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__global__</span> <span class="kt">void</span> <span class="nf">matMul</span><span class="p">(</span>
    <span class="kt">float</span><span class="o">*</span> <span class="n">X</span><span class="p">,</span>
    <span class="kt">float</span><span class="o">*</span> <span class="n">W</span><span class="p">,</span>
    <span class="kt">float</span><span class="o">*</span> <span class="n">OO</span><span class="p">,</span>
    <span class="kt">int</span> <span class="n">B</span><span class="p">,</span>
    <span class="kt">int</span> <span class="n">N</span><span class="p">,</span>
    <span class="kt">int</span> <span class="n">D_in</span><span class="p">,</span>
    <span class="kt">int</span> <span class="n">D_out</span>
<span class="p">)</span> <span class="p">{</span>
    <span class="cm">/*
    This kernel takes a batch of data: (B x N x Din)
    and a weight matrix: (Din X Dout)
    and produces: (B x N x Dout)
    */</span>

    <span class="kt">int</span> <span class="n">batch</span> <span class="o">=</span> <span class="n">blockIdx</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">row</span> <span class="o">=</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">y</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">col</span> <span class="o">=</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>

    <span class="kt">int</span> <span class="n">out_offset</span> <span class="o">=</span> <span class="n">N</span><span class="o">*</span><span class="n">D_out</span><span class="o">*</span><span class="n">batch</span> <span class="o">+</span> <span class="n">row</span><span class="o">*</span><span class="n">D_out</span> <span class="o">+</span> <span class="n">col</span><span class="p">;</span>

    <span class="k">if</span> <span class="p">((</span><span class="n">batch</span> <span class="o">&lt;</span> <span class="n">B</span><span class="p">)</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="n">col</span> <span class="o">&lt;</span> <span class="n">D_out</span><span class="p">)</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="n">row</span> <span class="o">&lt;</span> <span class="n">N</span><span class="p">))</span> <span class="p">{</span>
        <span class="kt">float</span> <span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">D_in</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">sum</span> <span class="o">+=</span> <span class="n">X</span><span class="p">[</span><span class="n">N</span> <span class="o">*</span> <span class="n">D_in</span> <span class="o">*</span> <span class="n">batch</span> <span class="o">+</span> <span class="n">row</span> <span class="o">*</span> <span class="n">D_in</span> <span class="o">+</span> <span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">W</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">D_out</span> <span class="o">+</span> <span class="n">col</span><span class="p">];</span>
        <span class="p">}</span>
        <span class="n">OO</span><span class="p">[</span><span class="n">out_offset</span><span class="p">]</span> <span class="o">=</span> <span class="n">sum</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Remember that physically there is no 2D or 3D arrangement of threads. That construct is just provided by CUDA to help developers map the problems appropriately. Physically it’s just a single 1D array of threads. Since <code class="language-plaintext highlighter-rouge">B*N*D_out</code> threads are spawned, they map exactly onto the 1D layout of the output matrix.</p>

<p>To figure out which data a particular thread should process, the kernel just needs to figure out which thread is executing it. Depending on the batch, row, and column, each thread will load different parts of the input and weight matrix. These are called offsets and there are 4 offsets calculated in the code:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">batch</code>: Figure out which matrix in the batch this thread is processing. <code class="language-plaintext highlighter-rouge">blockIdx.x</code> gives the block ID in the x-axis of the grid layout. Since there is a 1D grid, this is the only direction available.</li>
  <li><code class="language-plaintext highlighter-rouge">row</code>: Figure out which row within a matrix the thread is processing. Rows are mapped to the y-axis of the block layout.</li>
  <li><code class="language-plaintext highlighter-rouge">col</code>: Figure out which column within a matrix the thread is processing. Columns are mapped to the x-axis of the block layout.</li>
  <li><code class="language-plaintext highlighter-rouge">out_offset</code>: Finally, map the thread ID to the exact cell in the output matrix:
    <ol>
      <li>Skipping <code class="language-plaintext highlighter-rouge">batch</code> matrices to arrive at the current matrix. To skip one single matrix, move ahead <code class="language-plaintext highlighter-rouge">N*D_out</code> number of elements in the flattened 1D array</li>
      <li>Skipping <code class="language-plaintext highlighter-rouge">row</code> number of rows. In a 1D flattened layout, a row can be skipped by moving ahead <code class="language-plaintext highlighter-rouge">D_out</code> elements.</li>
      <li>Finally, adding <code class="language-plaintext highlighter-rouge">col</code> to the summation of the above two to arrive at the element.</li>
    </ol>
  </li>
</ol>
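<p>The <code class="language-plaintext highlighter-rouge">out_offset</code> arithmetic can be checked in plain Python. The sketch below uses small, arbitrary sizes and walks over every (batch, row, col) combination to show that each thread lands on a distinct cell of the flattened output.</p>

```python
def out_offset(batch, row, col, N, D_out):
    """Same arithmetic as the kernel: skip matrices, then rows, then columns."""
    return N * D_out * batch + row * D_out + col


# Small illustrative sizes
B, N, D_out = 2, 3, 4

offsets = [
    out_offset(b, r, c, N, D_out)
    for b in range(B)
    for r in range(N)
    for c in range(D_out)
]

# Every cell of the flattened output is hit exactly once, in order
print(offsets == list(range(B * N * D_out)))  # True

# Last thread: batch 1, row 2, col 3
print(out_offset(1, 2, 3, N, D_out))  # 23
```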

<p>Hopefully, this figure makes the offset calculation clearer.</p>

<p><img src="assets/images/post3/cudakernels.png" alt="alt text" /></p>

<p>Figure 3: If the output data and threads have exactly the same length (which is true in this case), they can be mapped 1 to 1. <code class="language-plaintext highlighter-rouge">B</code>, <code class="language-plaintext highlighter-rouge">N</code>, and <code class="language-plaintext highlighter-rouge">D_out</code> are the batch size, number of rows, and number of columns in the output data respectively. <code class="language-plaintext highlighter-rouge">b</code>, <code class="language-plaintext highlighter-rouge">n</code>, and <code class="language-plaintext highlighter-rouge">d</code> are the <code class="language-plaintext highlighter-rouge">i th</code> batch, row, and column respectively.</p>

<p>After calculating these offsets, the corresponding row from <code class="language-plaintext highlighter-rouge">X</code> and the corresponding column from <code class="language-plaintext highlighter-rouge">W</code> are loaded, and their dot product is computed in a for loop. The offset calculation for these loads is similar to the <code class="language-plaintext highlighter-rouge">out_offset</code> calculation and should be easy to follow.</p>
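<p>To make the per-thread work concrete, here is a pure Python emulation of the kernel. It is a sketch for reasoning about the indexing only: the nested loops stand in for the threads that a GPU would run in parallel, and the function names are hypothetical.</p>

```python
def matmul_thread(X, W, O, batch, row, col, N, D_in, D_out):
    """Work done by one thread: one dot product, one output cell."""
    total = 0.0
    for i in range(D_in):
        # Row `row` of matrix `batch` in X, column `col` of W
        total += X[N * D_in * batch + row * D_in + i] * W[i * D_out + col]
    O[N * D_out * batch + row * D_out + col] = total


def matmul_emulated(X, W, B, N, D_in, D_out):
    """Run every (batch, row, col) 'thread' sequentially on flat lists."""
    O = [0.0] * (B * N * D_out)
    for batch in range(B):          # plays the role of blockIdx.x
        for row in range(N):        # plays the role of threadIdx.y
            for col in range(D_out):  # plays the role of threadIdx.x
                matmul_thread(X, W, O, batch, row, col, N, D_in, D_out)
    return O


# Two 2x2 inputs (flattened) multiplied by one 2x2 weight matrix
X = [1, 2, 3, 4,   1, 0, 0, 1]
W = [5, 6, 7, 8]
print(matmul_emulated(X, W, B=2, N=2, D_in=2, D_out=2))
# [19.0, 22.0, 43.0, 50.0, 5.0, 6.0, 7.0, 8.0]
```

<p>The second input is the identity matrix, so the second half of the output is simply <code class="language-plaintext highlighter-rouge">W</code>, which is a quick way to eyeball that the offsets are right.</p>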

<p>The complete code is present <a href="https://github.com/cmeraki/vit.triton/blob/main/examples/matmul_batch.cu">here</a>. Running the code requires <code class="language-plaintext highlighter-rouge">nvcc</code> (the compiler for CUDA programs), an NVIDIA GPU to run the program, the CUDA drivers, and the CUDA toolkit installed.</p>

<h2 id="a-simple-example-in-triton">A simple example in Triton</h2>

<p>CUDA is amazing and allows a lot of optimizations. But it is quite verbose. Plus, it might not be comfortable for those coming from the machine learning or data science domain. OpenAI released a package called <a href="https://triton-lang.org/">Triton</a> that provides a Python environment to write kernels and compile them for supported GPUs. Triton allows us to write very performant kernels in Python directly.</p>

<p>But instead of working with individual threads, Triton works with blocks. Instead of each kernel being assigned a thread, in Triton each kernel is assigned a block. Triton abstracts out the thread computation completely.</p>

<p>In the above example of matrix multiplication, instead of computing a single element of the output in the kernel, Triton can compute values for small “blocks” of the output matrix at once.</p>

<p><img src="assets/images/post3/image-4.png" alt="alt text" /></p>

<p>Figure 4: (Left) CUDA execution model vs (Right) Triton execution model
Source: <a href="https://triton-lang.org/main/programming-guide/chapter-1/introduction.html">Triton documentation</a></p>

<p>Let’s reimplement the matrix multiplication example using Triton. The steps for Triton are very simple.</p>

<ol>
  <li>Implement a “wrapper” function to call the kernel. Below, the Triton kernel is invoked as <code class="language-plaintext highlighter-rouge">matmul_kernel</code>. Define the grid and the block sizes similar to how it is done in CUDA. The assert statements catch invalid inputs before they are passed to the kernel. Triton implicitly converts all torch tensors into pointers. It just needs to be verified that all tensors passed to the kernel are already on the GPU (for example, moved there with <code class="language-plaintext highlighter-rouge">x.to('cuda:0')</code>).
    <ol>
      <li>Unlike the CUDA example, however, the grid has 3 axes in this implementation. The first axis corresponds to the batch size, and the second axis corresponds to the number of blocks of <code class="language-plaintext highlighter-rouge">BLOCK_SIZE_ROW</code> rows needed to cover all the rows (similarly for <code class="language-plaintext highlighter-rouge">BLOCK_SIZE_COL</code> and the columns on the third axis).</li>
      <li>During execution, this means that each kernel instance will process a <code class="language-plaintext highlighter-rouge">BLOCK_SIZE_ROW x BLOCK_SIZE_COL</code> sub-matrix in the input for every input in the batch.</li>
    </ol>
  </li>
</ol>
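<p>The 3-axis grid described above boils down to a ceiling division per axis. The sketch below is an assumption about how such a grid could be computed, not the code from the post: <code class="language-plaintext highlighter-rouge">triton_grid</code> is a hypothetical helper, and it assumes the third axis tiles the <code class="language-plaintext highlighter-rouge">D_out</code> columns of the output. Triton ships <code class="language-plaintext highlighter-rouge">triton.cdiv</code> for the same rounding; plain Python is used here so the snippet runs without a GPU.</p>

```python
def ceil_div(a, b):
    """Smallest number of size-b blocks needed to cover a elements."""
    return (a + b - 1) // b


def triton_grid(B, N, D_out, BLOCK_SIZE_ROW, BLOCK_SIZE_COL):
    """Hypothetical 3-axis launch grid: (batch, row blocks, column blocks)."""
    return (
        B,
        ceil_div(N, BLOCK_SIZE_ROW),
        ceil_div(D_out, BLOCK_SIZE_COL),
    )


# 100 rows need 7 blocks of 16 rows; the last block is only partially filled
grid = triton_grid(B=4, N=100, D_out=64, BLOCK_SIZE_ROW=16, BLOCK_SIZE_COL=16)
print(grid)  # (4, 7, 4)
print(grid[0] * grid[1] * grid[2])  # 112 kernel instances in total
```

<p>When a dimension is not a multiple of the block size, the last block along that axis overhangs the matrix, which is why Triton kernels usually mask their loads and stores.</p>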

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">matmul</span><span class="p">(</span><span class="nb">input</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">weight</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">:</span>
    <span class="s">"""
    Implements matrix multiplication between two matrices. The input matrix is 3-dimensional, where
    the first dimension is the batch size. The weight matrix is multiplied with each entry in the
    batch of the input matrix.

    Args:
        input (torch.Tensor): Matrix with dimension (B x N x D_in)
        weight (torch.Tensor): Matrix with dimension (D_in x D_out)

    Returns:
        torch.Tensor: Output matrix with dimension (B x N x D_out)
    """</span>
    <span class="k">assert</span> <span class="nb">input</span><span class="p">.</span><span class="n">is_cuda</span><span class="p">,</span> <span class="s">'Inputs are not on GPU, ensure the input matrix is loaded on the GPU'</span>
    <span class="k">assert</span> <span class="n">weight</span><span class="p">.</span><span class="n">is_cuda</span><span class="p">,</span> <span class="s">'Weights are not on GPU, ensure the weight matrix is loaded on the GPU'</span>
    <span class="k">assert</span> <span class="nb">input</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">==</span> <span class="n">weight</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">],</span> <span class="s">'Input and weight matrix are not compatible'</span>

    <span class="n">B</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">D_in</span> <span class="o">=</span> <span class="nb">input</span><span class="p">.</span><span class="n">shape</span>
    <span class="n">_</span><span class="p">,</span> <span class="n">D_out</span> <span class="o">=</span> <span class="n">weight</span><span class="p">.</span><span class="n">shape</span>

    <span class="n">output</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">empty</span><span class="p">((</span><span class="n">B</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">D_out</span><span class="p">),</span> <span class="n">device</span><span class="o">=</span><span class="nb">input</span><span class="p">.</span><span class="n">device</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">input</span><span class="p">.</span><span class="n">dtype</span><span class="p">)</span>

    <span class="n">BLOCK_SIZE_ROW</span><span class="p">,</span> <span class="n">BLOCK_SIZE_COL</span> <span class="o">=</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">16</span>
    <span class="c1"># Grid is aligned with the output matrix
</span>    <span class="n">grid</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">meta</span><span class="p">:</span> <span class="p">(</span><span class="n">B</span><span class="p">,</span> <span class="n">triton</span><span class="p">.</span><span class="n">cdiv</span><span class="p">(</span><span class="n">N</span><span class="p">,</span> <span class="n">meta</span><span class="p">[</span><span class="s">"BLOCK_SIZE_ROW"</span><span class="p">]),</span> <span class="n">triton</span><span class="p">.</span><span class="n">cdiv</span><span class="p">(</span><span class="n">D_out</span><span class="p">,</span> <span class="n">meta</span><span class="p">[</span><span class="s">"BLOCK_SIZE_COL"</span><span class="p">]))</span>

    <span class="n">matmul_kernel</span><span class="p">[</span><span class="n">grid</span><span class="p">](</span>
        <span class="n">input_ptr</span><span class="o">=</span><span class="nb">input</span><span class="p">,</span>
        <span class="n">input_batch_stride</span><span class="o">=</span><span class="nb">input</span><span class="p">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span>
        <span class="n">input_row_stride</span><span class="o">=</span><span class="nb">input</span><span class="p">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span>
        <span class="n">input_col_stride</span><span class="o">=</span><span class="nb">input</span><span class="p">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">2</span><span class="p">),</span>
        <span class="n">weight_ptr</span><span class="o">=</span><span class="n">weight</span><span class="p">,</span>
        <span class="n">weight_row_stride</span><span class="o">=</span><span class="n">weight</span><span class="p">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span>
        <span class="n">weight_col_stride</span><span class="o">=</span><span class="n">weight</span><span class="p">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span>
        <span class="n">output_ptr</span><span class="o">=</span><span class="n">output</span><span class="p">,</span>
        <span class="n">output_batch_stride</span><span class="o">=</span><span class="n">output</span><span class="p">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span>
        <span class="n">output_row_stride</span><span class="o">=</span><span class="n">output</span><span class="p">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span>
        <span class="n">output_col_stride</span><span class="o">=</span><span class="n">output</span><span class="p">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">2</span><span class="p">),</span>
        <span class="n">num_rows</span><span class="o">=</span><span class="n">N</span><span class="p">,</span>
        <span class="n">num_input_cols</span><span class="o">=</span><span class="n">D_in</span><span class="p">,</span>
        <span class="n">num_output_cols</span><span class="o">=</span><span class="n">D_out</span><span class="p">,</span>
        <span class="n">BLOCK_SIZE_ROW</span><span class="o">=</span><span class="n">BLOCK_SIZE_ROW</span><span class="p">,</span>
        <span class="n">BLOCK_SIZE_COL</span><span class="o">=</span><span class="n">BLOCK_SIZE_COL</span>
    <span class="p">)</span>

    <span class="k">return</span> <span class="n">output</span>
</code></pre></div></div>

<ol>
  <li>That’s it. Tensor strides<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> are passed to the kernel because they give the step size, in elements, needed to move to the next batch, row, or column in the 1D flattened view of the 3D tensor. This will come in handy in the actual kernel. Once the kernel’s execution is complete, the result is available in the tensor that was passed in (<code class="language-plaintext highlighter-rouge">output</code>).</li>
</ol>
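
<p>To make the idea of strides concrete, here is a minimal pure-Python sketch of how the strides of a contiguous, row-major tensor map a 3D index to a position in the flattened 1D view. The helper names <code class="language-plaintext highlighter-rouge">contiguous_strides</code> and <code class="language-plaintext highlighter-rouge">flat_index</code> and the shape <code class="language-plaintext highlighter-rouge">(2, 16, 12)</code> are made up for illustration; they are not part of PyTorch or Triton.</p>

```python
def contiguous_strides(shape):
    """Strides (in elements) of a row-major, contiguous tensor with this shape."""
    strides = []
    step = 1
    # Walk the dimensions from last to first: the last one has stride 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def flat_index(index, strides):
    """Position of a multi-dimensional index in the flattened 1D view."""
    return sum(i * s for i, s in zip(index, strides))


# Hypothetical tensor: a batch of 2 matrices, each 16 x 12
strides = contiguous_strides((2, 16, 12))
print(strides)                          # (192, 12, 1)
print(flat_index((1, 8, 4), strides))   # 1*192 + 8*12 + 4*1 = 292
```

<p>For a contiguous tensor of that shape, <code class="language-plaintext highlighter-rouge">tensor.stride()</code> in PyTorch should report the same three numbers, which is why the wrapper can simply forward them to the kernel.</p>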

<p>The Triton kernel is decorated with <code class="language-plaintext highlighter-rouge">@triton.jit</code> so that Triton knows this function has to be compiled for and executed on the GPU.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">triton</span><span class="p">.</span><span class="n">jit</span>
<span class="k">def</span> <span class="nf">matmul_kernel</span><span class="p">(</span>
    <span class="n">input_ptr</span><span class="p">,</span>
    <span class="n">input_batch_stride</span><span class="p">,</span>
    <span class="n">input_row_stride</span><span class="p">,</span>
    <span class="n">input_col_stride</span><span class="p">,</span>
    <span class="n">weight_ptr</span><span class="p">,</span>
    <span class="n">weight_row_stride</span><span class="p">,</span>
    <span class="n">weight_col_stride</span><span class="p">,</span>
    <span class="n">output_ptr</span><span class="p">,</span>
    <span class="n">output_batch_stride</span><span class="p">,</span>
    <span class="n">output_row_stride</span><span class="p">,</span>
    <span class="n">output_col_stride</span><span class="p">,</span>
    <span class="n">num_rows</span><span class="p">,</span>
    <span class="n">num_input_cols</span><span class="p">:</span> <span class="n">tl</span><span class="p">.</span><span class="n">constexpr</span><span class="p">,</span>
    <span class="n">num_output_cols</span><span class="p">,</span>
    <span class="n">BLOCK_SIZE_ROW</span><span class="p">:</span> <span class="n">tl</span><span class="p">.</span><span class="n">constexpr</span><span class="p">,</span>
    <span class="n">BLOCK_SIZE_COL</span><span class="p">:</span> <span class="n">tl</span><span class="p">.</span><span class="n">constexpr</span><span class="p">,</span>
<span class="p">):</span>
    <span class="c1"># Getting block indexes in all 3 dimensions
</span>    <span class="n">batch_idx</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="n">program_id</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
    <span class="n">row_idx</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="n">program_id</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">col_idx</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="n">program_id</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>

    <span class="c1"># Offsets for input data
</span>    <span class="n">input_batch_offset</span> <span class="o">=</span> <span class="n">batch_idx</span> <span class="o">*</span> <span class="n">input_batch_stride</span>                                 <span class="c1"># Offsets to reach to the correct batch. Similar to CUDA, but instead strides are being used here
</span>
    <span class="n">input_row_offset</span> <span class="o">=</span> <span class="n">row_idx</span><span class="o">*</span><span class="n">BLOCK_SIZE_ROW</span> <span class="o">+</span> <span class="n">tl</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_SIZE_ROW</span><span class="p">)</span>
    <span class="n">input_row_mask</span> <span class="o">=</span> <span class="n">input_row_offset</span><span class="p">[:,</span> <span class="bp">None</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">num_rows</span>
    <span class="n">input_row_offset</span> <span class="o">=</span> <span class="n">input_row_offset</span><span class="p">[:,</span> <span class="bp">None</span><span class="p">]</span> <span class="o">*</span> <span class="n">input_row_stride</span> <span class="c1"># Selecting relevant rows from input
</span>
    <span class="n">input_col_offset</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">num_input_cols</span><span class="p">)</span>
    <span class="n">input_col_mask</span> <span class="o">=</span> <span class="n">input_col_offset</span><span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="p">:]</span> <span class="o">&lt;</span> <span class="n">num_input_cols</span>
    <span class="n">input_col_offset</span> <span class="o">=</span> <span class="n">input_col_offset</span><span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="p">:]</span> <span class="o">*</span> <span class="n">input_col_stride</span> <span class="c1"># Selecting all columns from input
</span>
    <span class="n">input_data_ptr</span> <span class="o">=</span> <span class="n">input_ptr</span> <span class="o">+</span> <span class="n">input_batch_offset</span> <span class="o">+</span> <span class="n">input_row_offset</span> <span class="o">+</span> <span class="n">input_col_offset</span>
    <span class="n">input_data</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">input_data_ptr</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="p">(</span><span class="n">input_row_mask</span> <span class="o">&amp;</span> <span class="n">input_col_mask</span><span class="p">))</span> <span class="c1"># BLOCK_SIZE_ROW x D_in
</span>
    <span class="c1"># Offsets for weight data
</span>    <span class="n">weight_row_offset</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">num_input_cols</span><span class="p">)</span>
    <span class="n">weight_row_mask</span> <span class="o">=</span> <span class="n">weight_row_offset</span><span class="p">[:,</span> <span class="bp">None</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">num_input_cols</span>
    <span class="n">weight_row_offset</span> <span class="o">=</span> <span class="n">weight_row_offset</span><span class="p">[:,</span> <span class="bp">None</span><span class="p">]</span> <span class="o">*</span> <span class="n">weight_row_stride</span> <span class="c1"># Selecting all rows from weight
</span>
    <span class="n">weight_col_offset</span> <span class="o">=</span> <span class="n">col_idx</span><span class="o">*</span><span class="n">BLOCK_SIZE_COL</span> <span class="o">+</span> <span class="n">tl</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_SIZE_COL</span><span class="p">)</span>
    <span class="n">weight_col_mask</span> <span class="o">=</span> <span class="n">weight_col_offset</span> <span class="o">&lt;</span> <span class="n">num_output_cols</span>
    <span class="n">weight_col_offset</span> <span class="o">=</span> <span class="n">weight_col_offset</span><span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="p">:]</span> <span class="o">*</span> <span class="n">weight_col_stride</span> <span class="c1"># Selecting relevant columns from weight
</span>
    <span class="n">weight_data_ptr</span> <span class="o">=</span> <span class="n">weight_ptr</span> <span class="o">+</span> <span class="n">weight_row_offset</span> <span class="o">+</span> <span class="n">weight_col_offset</span>
    <span class="n">weight_data</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">weight_data_ptr</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="p">(</span><span class="n">weight_row_mask</span> <span class="o">&amp;</span> <span class="n">weight_col_mask</span><span class="p">))</span> <span class="c1"># D_in x BLOCK_SIZE_COL
</span>
    <span class="c1"># Computation
</span>    <span class="n">result</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">input_data</span><span class="p">,</span> <span class="n">weight_data</span><span class="p">)</span> <span class="c1"># Matmul of a small block, BLOCK_SIZE_ROW x BLOCK_SIZE_COL
</span>
    <span class="c1"># Offsets for output data
</span>    <span class="n">output_batch_offset</span> <span class="o">=</span> <span class="n">batch_idx</span> <span class="o">*</span> <span class="n">output_batch_stride</span>                               <span class="c1"># Offsets to reach to the correct batch. Similar to CUDA, but instead strides are being used here
</span>
    <span class="n">output_row_offset</span> <span class="o">=</span> <span class="n">row_idx</span><span class="o">*</span><span class="n">BLOCK_SIZE_ROW</span> <span class="o">+</span> <span class="n">tl</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_SIZE_ROW</span><span class="p">)</span>
    <span class="n">output_row_mask</span> <span class="o">=</span> <span class="n">output_row_offset</span><span class="p">[:,</span> <span class="bp">None</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">num_rows</span>
    <span class="n">output_row_offset</span> <span class="o">=</span> <span class="n">output_row_offset</span><span class="p">[:,</span> <span class="bp">None</span><span class="p">]</span> <span class="o">*</span> <span class="n">output_row_stride</span>

    <span class="n">output_col_offset</span> <span class="o">=</span> <span class="n">col_idx</span><span class="o">*</span><span class="n">BLOCK_SIZE_COL</span> <span class="o">+</span> <span class="n">tl</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_SIZE_COL</span><span class="p">)</span>
    <span class="n">output_col_mask</span> <span class="o">=</span> <span class="n">output_col_offset</span><span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="p">:]</span> <span class="o">&lt;</span> <span class="n">num_output_cols</span>
    <span class="n">output_col_offset</span> <span class="o">=</span> <span class="n">output_col_offset</span><span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="p">:]</span> <span class="o">*</span> <span class="n">output_col_stride</span>

    <span class="n">output_data_ptr</span> <span class="o">=</span> <span class="n">output_ptr</span> <span class="o">+</span> <span class="n">output_batch_offset</span> <span class="o">+</span> <span class="n">output_row_offset</span> <span class="o">+</span> <span class="n">output_col_offset</span>
    <span class="n">tl</span><span class="p">.</span><span class="n">store</span><span class="p">(</span><span class="n">output_data_ptr</span><span class="p">,</span> <span class="n">result</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="p">(</span><span class="n">output_row_mask</span> <span class="o">&amp;</span> <span class="n">output_col_mask</span><span class="p">))</span>
</code></pre></div></div>
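
<p>The masks in the kernel above matter whenever a dimension is not a multiple of the block size, because the last block then reaches past the end of the data. Below is a minimal pure-Python sketch of that idea. <code class="language-plaintext highlighter-rouge">masked_load</code> is a hypothetical stand-in for a masked load (it is not Triton’s API), the sizes are made up, and the fill value of <code class="language-plaintext highlighter-rouge">0.0</code> is an assumption of this sketch.</p>

```python
def masked_load(flat, offsets, mask, other=0.0):
    """Emulate a masked load: read flat[o] where the mask is True, else use `other`."""
    return [flat[o] if m else other for o, m in zip(offsets, mask)]


# Hypothetical sizes: 10 rows covered by blocks of 4 rows -> the last block overhangs
num_rows, block_size_row = 10, 4
row_idx = 2  # third block: would cover rows 8..11, but only rows 8 and 9 exist

row_offsets = [row_idx * block_size_row + i for i in range(block_size_row)]
row_mask = [r < num_rows for r in row_offsets]

column = [float(r) for r in range(num_rows)]  # one value per existing row

print(row_offsets)                                  # [8, 9, 10, 11]
print(row_mask)                                     # [True, True, False, False]
print(masked_load(column, row_offsets, row_mask))   # [8.0, 9.0, 0.0, 0.0]
```

<p>Without the mask, the reads at offsets 10 and 11 would go out of bounds. The same reasoning applies to the store at the end of the kernel.</p>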

<p>Similar to CUDA, the kernel first calculates the index of the current block. But keep in mind that, unlike the CUDA kernel where a single element of the output matrix is processed, here a single block (a 2D arrangement of a few elements) is processed. The <code class="language-plaintext highlighter-rouge">tl.program_id</code> function returns the block’s index along each axis of the grid.</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">batch_idx</code> selects which output matrix in the batch is being processed</li>
  <li><code class="language-plaintext highlighter-rouge">row_idx</code> gets the block number along the rows. Remember, this is not equal to the row number as in CUDA</li>
  <li><code class="language-plaintext highlighter-rouge">col_idx</code> gets the block number along the columns. Remember, this is not equal to the column number as in CUDA</li>
</ol>

<p>Once these 3 numbers are calculated, a 2D representation is created of the data that needs to be processed by each block. Let’s take some dummy numbers to understand how that is achieved. Assume that <code class="language-plaintext highlighter-rouge">B = 1</code>, <code class="language-plaintext highlighter-rouge">N = 16</code>, and <code class="language-plaintext highlighter-rouge">D_out = 12</code>. The block size in both the row and column dimensions is 4 (i.e. <code class="language-plaintext highlighter-rouge">BLOCK_SIZE_ROW</code> and <code class="language-plaintext highlighter-rouge">BLOCK_SIZE_COL</code> are both 4). So each block will be a 2D matrix of dimension (4 x 4).</p>

<p>Based on this</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">grid</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">meta</span><span class="p">:</span> <span class="p">(</span><span class="n">B</span><span class="p">,</span> <span class="n">triton</span><span class="p">.</span><span class="n">cdiv</span><span class="p">(</span><span class="n">N</span><span class="p">,</span> <span class="n">meta</span><span class="p">[</span><span class="s">"BLOCK_SIZE_ROW"</span><span class="p">]),</span> <span class="n">triton</span><span class="p">.</span><span class="n">cdiv</span><span class="p">(</span><span class="n">D_out</span><span class="p">,</span> <span class="n">meta</span><span class="p">[</span><span class="s">"BLOCK_SIZE_COL"</span><span class="p">]))</span>
</code></pre></div></div>

<p>Based on the assumptions, the grid configuration is (1, 4, 3), so a total of 12 blocks will be launched. Now, which block covers rows 8 to 11 and columns 4 to 7? Program indices are zero-based, so simple arithmetic gives the <code class="language-plaintext highlighter-rouge">(0, 2, 1)</code>th block: the first index is along the batch dimension (0, because <code class="language-plaintext highlighter-rouge">B = 1</code>), the second is along the row dimension (8 / 4 = 2), and the third is along the column dimension (4 / 4 = 1). This would correspond to</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tl</span><span class="p">.</span><span class="n">program_id</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span>
<span class="n">tl</span><span class="p">.</span><span class="n">program_id</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span>
<span class="n">tl</span><span class="p">.</span><span class="n">program_id</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span>
</code></pre></div></div>
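
<p>The grid arithmetic above can be checked on the CPU with a few lines of plain Python. This is only a sketch: <code class="language-plaintext highlighter-rouge">cdiv</code> here is a local re-implementation of ceiling division (the same arithmetic that <code class="language-plaintext highlighter-rouge">triton.cdiv</code> performs), and <code class="language-plaintext highlighter-rouge">block_for</code> is a hypothetical helper that maps an output element to the block that owns it.</p>

```python
def cdiv(a, b):
    """Ceiling division: how many blocks of size b are needed to cover a elements."""
    return (a + b - 1) // b


def block_for(row, col, block_size_row, block_size_col):
    """Zero-based (row block, column block) that contains element (row, col)."""
    return row // block_size_row, col // block_size_col


# Dummy numbers from the text
B, N, D_out = 1, 16, 12
BLOCK_SIZE_ROW = BLOCK_SIZE_COL = 4

grid = (B, cdiv(N, BLOCK_SIZE_ROW), cdiv(D_out, BLOCK_SIZE_COL))
print(grid)                                             # (1, 4, 3)
print(grid[0] * grid[1] * grid[2])                      # 12 blocks in total
print(block_for(8, 4, BLOCK_SIZE_ROW, BLOCK_SIZE_COL))  # (2, 1)
```

<p>With a size that is not a multiple of the block size, for example 10 rows with blocks of 4, <code class="language-plaintext highlighter-rouge">cdiv</code> rounds up to 3 blocks, and the masks in the kernel take care of the overhang.</p>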

<p><img src="assets/images/post3/tritonblocks.png" alt="tritonblocks" /></p>

<p>Figure 5: A <code class="language-plaintext highlighter-rouge">(1 x 16 x 12)</code> matrix is divided into blocks of size <code class="language-plaintext highlighter-rouge">(4 x 4)</code>. The block with zero-based index <code class="language-plaintext highlighter-rouge">(0, 2, 1)</code> is highlighted. The value at every place is the index of that position in the 1D flattened array.</p>

<p>For this block (rows 8 to 11, columns 4 to 7), how are the correct offsets prepared? In the 1D representation of the matrix, the elements highlighted in green need to be loaded.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Offsets for output data
</span><span class="n">output_batch_offset</span> <span class="o">=</span> <span class="n">batch_idx</span> <span class="o">*</span> <span class="n">output_batch_stride</span>                           

<span class="n">output_row_offset</span> <span class="o">=</span> <span class="n">row_idx</span><span class="o">*</span><span class="n">BLOCK_SIZE_ROW</span> <span class="o">+</span> <span class="n">tl</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_SIZE_ROW</span><span class="p">)</span>       <span class="c1"># This arrangement happens in 1D; tl.arange is like NumPy's arange
</span><span class="n">output_row_mask</span> <span class="o">=</span> <span class="n">output_row_offset</span><span class="p">[:,</span> <span class="bp">None</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">num_rows</span>                         <span class="c1"># Think of masks as protection against accessing invalid memory
</span><span class="n">output_row_offset</span> <span class="o">=</span> <span class="n">output_row_offset</span><span class="p">[:,</span> <span class="bp">None</span><span class="p">]</span> <span class="o">*</span> <span class="n">output_row_stride</span>              <span class="c1"># Indexing with [:, None] reshapes the 1D vector into a 2D column of shape (n, 1)
</span>
<span class="n">output_col_offset</span> <span class="o">=</span> <span class="n">col_idx</span><span class="o">*</span><span class="n">BLOCK_SIZE_COL</span> <span class="o">+</span> <span class="n">tl</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_SIZE_COL</span><span class="p">)</span>
<span class="n">output_col_mask</span> <span class="o">=</span> <span class="n">output_col_offset</span><span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="p">:]</span> <span class="o">&lt;</span> <span class="n">num_output_cols</span>
<span class="n">output_col_offset</span> <span class="o">=</span> <span class="n">output_col_offset</span><span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="p">:]</span> <span class="o">*</span> <span class="n">output_col_stride</span>
</code></pre></div></div>

<p>Let’s decode what is happening here</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">output_row_offset</span> <span class="o">=</span> <span class="n">row_idx</span><span class="o">*</span><span class="n">BLOCK_SIZE_ROW</span> <span class="o">+</span> <span class="n">tl</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_SIZE_ROW</span><span class="p">)</span>
<span class="c1"># row_idx = tl.program_id(1) = 2, BLOCK_SIZE_ROW = 4
# output_row_offset = 2*4 + (0, 1, 2, 3) = (8, 9, 10, 11)
</span></code></pre></div></div>

<p>If <code class="language-plaintext highlighter-rouge">output_row_offset</code> is added to <code class="language-plaintext highlighter-rouge">output_ptr</code> directly, the 8th, 9th, 10th, and 11th elements of the 1D flattened array will be accessed. That is not what is desired: these are row numbers, not positions in the flattened array. The desired offsets are obtained by multiplying with the row stride:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">output_row_offset</span> <span class="o">=</span> <span class="n">output_row_offset</span><span class="p">[:,</span> <span class="bp">None</span><span class="p">]</span> <span class="o">*</span> <span class="n">output_row_stride</span>
<span class="c1"># This multiplies each element by output_row_stride, which is equal to 12 (the number of columns): the number of elements to skip in the 1D array to reach the start of the next row
# output_row_offset becomes (96, 108, 120, 132). The [:, None] indexing also turns it into a column vector of shape (4, 1)
</span></code></pre></div></div>

<p>A similar transformation is done for the columns:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">output_col_offset</span> <span class="o">=</span> <span class="n">col_idx</span><span class="o">*</span><span class="n">BLOCK_SIZE_COL</span> <span class="o">+</span> <span class="n">tl</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_SIZE_COL</span><span class="p">)</span>
<span class="c1"># col_idx = tl.program_id(2) = 1, BLOCK_SIZE_COL = 4
# output_col_offset = 1*4 + (0, 1, 2, 3) = (4, 5, 6, 7)
</span></code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">output_col_offset</span> <span class="o">=</span> <span class="n">output_col_offset</span><span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="p">:]</span> <span class="o">*</span> <span class="n">output_col_stride</span>
<span class="c1"># This multiplies each element by output_col_stride, which is equal to 1: the number of elements to skip in the 1D array to advance by one column.
# Since this is row-major ordering, columns are adjacent to each other.
# output_col_offset stays (4, 5, 6, 7). The [None, :] indexing also turns it into a row vector of shape (1, 4)
</span></code></pre></div></div>

<p>Finally,</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">output_data_ptr</span> <span class="o">=</span> <span class="n">output_ptr</span> <span class="o">+</span> <span class="n">output_batch_offset</span> <span class="o">+</span> <span class="n">output_row_offset</span> <span class="o">+</span> <span class="n">output_col_offset</span>       <span class="c1"># Adds all the offsets to the pointer
</span></code></pre></div></div>

<p>First, add <code class="language-plaintext highlighter-rouge">output_row_offset</code> and <code class="language-plaintext highlighter-rouge">output_col_offset</code>. Since one of them is a row vector and the other is a column vector, on addition a 2D array is produced with all the desired indices of all the elements that need to be loaded. After that, add <code class="language-plaintext highlighter-rouge">output_batch_offset</code> to get to the correct matrix in the batch.</p>

<p><img src="assets/images/post3/offset_addition.png" alt="offset_addition" /></p>

<p>Figure 6: How 2D blocks are created from two 1D offsets</p>
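
<p>The broadcasted addition of the two offset vectors can be emulated in plain Python to check the numbers for the block covering rows 8 to 11 and columns 4 to 7. This is a sketch only: <code class="language-plaintext highlighter-rouge">block_offsets</code> is a hypothetical helper, and the nested list comprehension plays the role of broadcasting a <code class="language-plaintext highlighter-rouge">(4, 1)</code> column against a <code class="language-plaintext highlighter-rouge">(1, 4)</code> row.</p>

```python
def block_offsets(row_idx, col_idx, block_rows, block_cols, row_stride, col_stride):
    """Flattened-array offsets of every element in one block, as a 2D list."""
    # Row part: like output_row_offset[:, None] * output_row_stride
    rows = [(row_idx * block_rows + i) * row_stride for i in range(block_rows)]
    # Column part: like output_col_offset[None, :] * output_col_stride
    cols = [(col_idx * block_cols + j) * col_stride for j in range(block_cols)]
    # "Broadcasted" sum: every row offset is combined with every column offset
    return [[r + c for c in cols] for r in rows]


# Row block 2, column block 1, 4 x 4 blocks, row stride 12, column stride 1
offsets = block_offsets(2, 1, 4, 4, 12, 1)
for row in offsets:
    print(row)
# [100, 101, 102, 103]
# [112, 113, 114, 115]
# [124, 125, 126, 127]
# [136, 137, 138, 139]
```

<p>The batch offset is added on top of this. For the first (and only) matrix in the batch it is 0, so these are already the final positions in the flattened array.</p>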

<p>This gives the appropriate offsets for the data this block is interested in computing. Similarly, the relevant data for the other two tensors can be computed. The core idea is understanding the block calculation and offset calculation. The rest of the code is more about syntax rather than any core logic.</p>

<p>The complete code is available <a href="https://github.com/cmeraki/vit.triton/blob/main/examples/matmul_batch.py">here</a>. Triton and PyTorch are needed to run this code.</p>
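
<p>When experimenting with a kernel like this, it can help to have a slow but obviously correct reference for tiny inputs. The sketch below is one such reference in plain Python on nested lists. <code class="language-plaintext highlighter-rouge">matmul_reference</code> is a hypothetical helper written for this post’s shapes (<code class="language-plaintext highlighter-rouge">B x N x D_in</code> times <code class="language-plaintext highlighter-rouge">D_in x D_out</code>); it is not taken from the linked repository.</p>

```python
def matmul_reference(inputs, weight):
    """Batched matmul on nested lists: (B x N x D_in) @ (D_in x D_out) -> (B x N x D_out)."""
    d_in, d_out = len(weight), len(weight[0])
    output = []
    for matrix in inputs:  # one (N x D_in) matrix per batch entry
        assert all(len(row) == d_in for row in matrix), "Input and weight are not compatible"
        output.append([
            [sum(row[k] * weight[k][j] for k in range(d_in)) for j in range(d_out)]
            for row in matrix
        ])
    return output


# Tiny made-up example: B = 2, N = 2, D_in = 2, D_out = 3
inputs = [[[1, 2], [3, 4]],
          [[0, 1], [1, 0]]]
weight = [[1, 0, 2],
          [0, 1, 3]]

print(matmul_reference(inputs, weight))
# [[[1, 2, 8], [3, 4, 18]], [[0, 1, 3], [1, 0, 2]]]
```

<p>The same weight matrix is reused for every entry in the batch, which mirrors why the kernel only needs a batch stride for the input and the output but not for the weight.</p>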

<h2 id="how-you-can-rewrite-the-complete-architecture-using-optimized-kernel">How you can rewrite the complete architecture using optimized kernels</h2>

<p>Congrats on making it this far. Now that you understand the basics of GPU hardware and its programming model, you can go ahead and implement any network from scratch, this time not relying on PyTorch for operations but writing your own kernels in CUDA or Triton.</p>

<p>If you want to implement a transformer encoder network, for example, you would need to implement all the basic layers and operations in Triton or CUDA:</p>

<ol>
  <li>Matrix multiplication</li>
  <li>Layernorm</li>
  <li>Softmax</li>
  <li>Addition</li>
  <li>Concatenation</li>
</ol>

<p>You can then wrap these kernels in a PyTorch module and load weights from Hugging Face (HF) to compare your implementation with other PyTorch/TF native implementations. If this sounds interesting, this is exactly what we did too. We implemented most of the operations used in Vision Transformer (ViT), including the patching and addition operations, in Triton and loaded weights from a checkpoint to run a forward pass. You can look at the code at <a href="https://github.com/cmeraki/vit.triton">ViT.triton</a> and maybe implement your favorite model too using custom kernels!</p>

<h2 id="citations">Citations</h2>

<p>For attribution, please cite this as</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{romit2024gpus2,
  title   = {GPUs Part 2},
  author  = {Jain, Romit},
  journal = {cmeraki.github.io},
  year    = {2024},
  month   = {May},
  url     = {https://cmeraki.github.io/gpu-part2.html}
}
</code></pre></div></div>

<h2 id="references">References</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p><a href="https://pytorch.org/docs/stable/generated/torch.Tensor.stride.html">Tensor strides</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Meraki Labs</name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">GPUs Part 1 - Understanding GPU internals</title><link href="https://cmeraki.github.io/gpu-part1.html" rel="alternate" type="text/html" title="GPUs Part 1 - Understanding GPU internals" /><published>2024-05-25T00:00:00+00:00</published><updated>2024-05-25T00:00:00+00:00</updated><id>https://cmeraki.github.io/gpu-part1</id><content type="html" xml:base="https://cmeraki.github.io/gpu-part1.html"><![CDATA[<!-- markdownlint-disable MD036 MD029 -->

<p>Written by <a href="https://www.linkedin.com/in/r0m1t/">Romit Jain</a></p>

<p>LLMs are pretty big and can use a lot of computing power. This makes them slow in terms of latency and makes them tougher (than ML models) to deploy. Hence, there is some alpha in learning how to run them as fast as possible, because that is what the real bottleneck currently is. If you can reduce latency or increase throughput, that opens up a lot of doors for LLM applications.</p>

<p>To learn how to run these big models as fast as possible, understanding the hardware (both CPU and GPU) on which they run is crucial.</p>

<p>This blog and others in the series (<a href="./gpu-part2.html">part 2</a>) will help you learn about the basic layout of GPU hardware, a mental model of how the GPU programming model works, and how to progress from there to become a kernel master. (If you are asking, what’s a kernel, read till the end of the series)</p>

<p>PS, there is just a deep satisfaction in knowing how things work on the hardware. It gives you a deeper understanding of the models and an immense appreciation of all the abstractions.</p>

<h2 id="hardware">Hardware</h2>

<p>What is so special about GPUs that makes them extremely efficient for certain applications, especially LLMs? Understanding the hardware of GPUs is essential to answer this question. In one line, “GPUs are optimized for throughput whereas CPUs are optimized for latency”. In more lines -</p>

<h3 id="why-are-gpus-faster-for-llms">Why are GPUs faster for LLMs?</h3>

<p>The fastest way to run LLMs currently is to run them on GPUs. But why are GPUs faster than CPUs for LLMs? One valid answer is that a GPU can process data in parallel because it operates in a <a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_data">SIMD</a> fashion, whereas CPUs are mostly designed for sequential tasks. But even then, what lets a GPU process data in parallel? Mainly two things:</p>

<ol>
  <li>CPUs have a lot of space on their chip dedicated to cache and registers. GPUs make a design choice that reduces the size of the cache and increases the number of cores. This way they can fit more cores in the same chip area. Cores are essentially the processing units that process data.</li>
  <li>CPUs have a lot of functionalities in their cores. These functionalities help them handle a wide variety of tasks, which makes CPUs very versatile. GPUs strip out many of these special functionalities, which shrinks each core. With smaller cores, GPUs can fit more of them in the same area.</li>
</ol>

<p><img src="assets/images/post2/image.png" alt="CPU v GPU" /></p>

<p><strong>Figure 1</strong>: The figure above illustrates the differences in cache and control logic sizes between GPUs and CPUs. The GPU features significantly reduced cache and control logic sizes, as well as smaller core sizes. These tradeoffs allow for a higher number of cores in the GPU.
<a href="https://docs.nvidia.com/cuda/archive/11.2.0/pdf/CUDA_C_Programming_Guide.pdf">Source</a></p>

<p>The more cores a GPU has, the greater the potential for parallel execution, leading to improved performance. However, it’s not solely about the number of cores. Other factors contribute to the overall efficiency and performance of GPUs.</p>

<h3 id="gpu-hardware-layout">GPU hardware layout</h3>

<p>Let’s now understand how these cores are organized and arranged on the hardware.</p>

<h4 id="cuda-cores">CUDA Cores</h4>

<p>Here is where the magic actually happens. These are the processing units of the GPU and come in different flavors, eg: Tensor Cores, Single precision cores, Double precision cores, etc. All of these cores handle different kinds of operations. The GPU decides where to send the operation based on the data type and instruction. The amount of operations these bad boys can do per second is what gives rise to FLOP numbers. Each of these different flavors has different performance numbers in terms of FLOPs because all of them do different kinds of operations.</p>

<p>For example, H100 has 16896 FP32 CUDA cores that can each do 2 floating point operations per cycle. The clock speed of the GPU is 1593 MHz. Theoretically, if all the cores are processing data at all times, it can achieve $1.593 \times 10^9 \times 2 \times 16.896 \times 10^3 \approx 53.8 \times 10^{12}$ FLOPS, or about 54 teraFLOPS.</p>

<p>This is close but not the same as what is shown on the official specs of <a href="https://www.nvidia.com/en-us/data-center/h100/">H100</a>. (I am not able to figure out the reason for the difference. If you know, please drop me an email!)</p>
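<p>The estimate above can be reproduced with a few lines of Python. This is a minimal sketch: <code class="language-plaintext highlighter-rouge">peak_flops</code> is a hypothetical helper, and the 1980 MHz boost clock used in the second call is an assumption that does not come from this post. If the published number is computed at a boost clock of roughly that value, it would account for most of the gap, but treat that as a possible explanation rather than a confirmed one.</p>

```python
def peak_flops(num_cores: int, clock_hz: float, flops_per_core_per_cycle: int = 2) -> float:
    """Theoretical peak: every core does useful work on every clock cycle."""
    return num_cores * clock_hz * flops_per_core_per_cycle


# 132 SMs * 128 FP32 cores per SM (the H100 SXM figures quoted in this post)
H100_FP32_CORES = 132 * 128

# Clock speed used in the text
base_estimate = peak_flops(H100_FP32_CORES, 1.593e9)

# Assumed boost clock (not from this post), shown only for comparison
boost_estimate = peak_flops(H100_FP32_CORES, 1.98e9)

print(f"{base_estimate / 1e12:.1f} teraFLOPS at 1593 MHz")
print(f"{boost_estimate / 1e12:.1f} teraFLOPS at 1980 MHz (assumed)")
```

<p>Either way, the number is an upper bound that assumes no core ever waits for data.</p>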

<h4 id="streaming-multiprocessors-sms">Streaming Multiprocessors (SMs)</h4>

<p>All the cores in a GPU are organized into groups. Each of these groups is called a streaming multiprocessor (SM). Every SM has some memory associated with it. This memory can be shared amongst all the cores inside an SM but not by any other core outside this SM. This memory is called shared memory and is extremely fast in terms of data transfer speed or memory bandwidth. But this is also small in terms of capacity. So it’s essential to use this memory judiciously.</p>

<p>Why are GPUs divided like this? It’s to enable smaller groups of cores to share memory amongst themselves and work together. With every new generation of GPUs, the number of SMs and the number of cores per SM typically go up.</p>

<p>Let’s take some real numbers to understand the capacity. An H100 SXM GPU contains:</p>

<ol>
  <li>132 streaming multiprocessors (SM)</li>
  <li>Each SM has 128 FP32 CUDA cores (so a total of 16896 (132 * 128) CUDA cores)</li>
  <li>Each SM has 227 KB of shared memory</li>
  <li>And this memory has a bandwidth of 33 TB/s</li>
</ol>

<blockquote>
  <p>SMs are also grouped into TPCs (Texture/Processor Cluster). For reference, the above hardware has 2 SMs per single TPC. But that can be safely skipped for now.</p>
</blockquote>

<h4 id="memory">Memory</h4>

<p>There are three kinds of memory on the GPU</p>

<ol>
  <li>HBM/Global memory - This can be thought of as the equivalent of CPU memory. This is the slowest and largest memory available on the GPU.
    <ol>
      <li>For reference, H100 SXM has 80GB of HBM with 3 TB/s of bandwidth (i.e. it can transfer 3 TB per second either to or from HBM)</li>
      <li>This is where the model is loaded when we do <code class="language-plaintext highlighter-rouge">model.to(device='cuda:0')</code></li>
    </ol>
  </li>
  <li>L2 Cache - Faster than HBM but limited in size. This is shared among all the SMs.
    <ol>
      <li>For reference, H100 SXM has 50 MB (lol, in comparison to HBM) of L2 cache with 12 TB/s of bandwidth.</li>
    </ol>
  </li>
  <li>Shared memory - Fastest and smallest memory available on the GPU. Every SM has its own shared memory and all the cores executing instructions in an SM have access to it.</li>
</ol>

<h2 id="back-to-llms">Back to LLMs</h2>

<p>Let me cite working examples to drive home a point - For LLMs, one should probably not worry about teraFLOPs. This answers the question that we asked at the end of the section Why are GPUs faster for LLMs?
Take an example of the H100 SXM GPU that can do 67 teraFlops (FP32) of computation. The memory bandwidth of the HBM is 3 TB/s. That means the GPU can transfer about 3 TB of data to the compute layer per second. Considering FP32 (4 bytes), we can transfer about 750 billion numbers to the compute layer in one second. In contrast, the compute layer can perform 67 trillion operations per second. Just to break even with the computation speed, we would either:</p>

<ol>
  <li>Need to transfer ~90x the data (67 trillion/750 billion) from the memory to the compute layer per second</li>
  <li>Or perform ~90 operations on every number that is transferred</li>
</ol>
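<p>The break-even arithmetic above can be written out as a short sketch. The constants are the rounded H100 SXM figures used in this section; the variable names are only illustrative.</p>

```python
HBM_BANDWIDTH_BYTES_PER_S = 3e12  # ~3 TB/s of HBM bandwidth
PEAK_FP32_FLOPS = 67e12           # 67 teraFLOPS (FP32)
BYTES_PER_FP32 = 4

# How many FP32 numbers can reach the compute layer every second
numbers_per_second = HBM_BANDWIDTH_BYTES_PER_S / BYTES_PER_FP32

# Operations needed per transferred number to keep every core busy
break_even_ops_per_number = PEAK_FP32_FLOPS / numbers_per_second

print(f"{numbers_per_second / 1e9:.0f} billion FP32 numbers per second")
print(f"~{break_even_ops_per_number:.0f} operations per number to break even")
```

<p>Any kernel that does fewer operations per number than this break-even value spends most of its time waiting on memory rather than computing.</p>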

<p>So, it’s tough to keep up with the computing power of the GPU. The bottleneck comes in transferring the data. There are three good resources on this topic to understand it better:</p>

<ol>
  <li><a href="https://docs.nvidia.com/deeplearning/performance/dl-performance-gpu-background/index.html#understand-perf">NVIDIA docs</a></li>
  <li>An article by Horace He <a href="https://horace.io/brrr_intro.html">here</a>.</li>
  <li>Another practical example is stated in the article: <a href="https://finbarr.ca/how-is-llama-cpp-possible/">How is Llama.cpp possible?</a></li>
</ol>

<blockquote>
  <p>Apart from the above, we also have warps in GPUs. Warps are a collection of 32 threads that are executed at once by the GPU. It’s slightly more complex to understand how warps work, so I will leave it out of the scope of this blog.</p>
</blockquote>

<p>By now, you should be able to understand how GPU hardware is organized. There are a few other hardware concepts that I did not go through here, like the warp scheduler and register files, but those are not crucial to get started.</p>

<p>You are now all ready to start with the <a href="./gpu-part2.html">part 2</a> of this series.</p>

<h2 id="citations">Citations</h2>

<p>For attribution, please cite this as</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{romit2024gpus1,
  title   = {GPUs Part 1},
  author  = {Jain, Romit},
  journal = {cmeraki.github.io},
  year    = {2024},
  month   = {May},
  url     = {https://cmeraki.github.io/gpu-part1.html}
}
</code></pre></div></div>]]></content><author><name>Meraki Labs</name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Throughput is all you need</title><link href="https://cmeraki.github.io/throughput-is-all-you-need.html" rel="alternate" type="text/html" title="Throughput is all you need" /><published>2024-04-11T00:00:00+00:00</published><updated>2024-04-11T00:00:00+00:00</updated><id>https://cmeraki.github.io/throughput-is-all-you-need</id><content type="html" xml:base="https://cmeraki.github.io/throughput-is-all-you-need.html"><![CDATA[<!-- markdownlint-disable MD033 MD036 MD053-->

<p>Written by <a href="https://www.linkedin.com/in/r0m1t/">Romit Jain</a></p>

<h2 id="throughput-why">Throughput, why?</h2>

<p>If we want to build efficient applications on top of current LLMs, there are currently two challenges:</p>

<ol>
  <li>Improving <strong>Inference latency</strong>: How quickly the model returns tokens for a single request, usually measured in tokens per second</li>
  <li>Improving <strong>Inference throughput</strong>: The total number of requests that the model can serve in parallel</li>
</ol>

<p>Inferencing LLMs with lower latency comes down to working around the limitations of the GPU’s memory bandwidth <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. FlashAttention, speculative decoding, and KV caching are ways in which one can improve the latency of the model.</p>

<p>Increasing inference throughput comes down to effectively managing the available VRAM of the GPU. Given a limited budget of GPU VRAM, there are various areas where improvements can be made:</p>

<ol>
  <li>Reducing the size of the model: By quantization or knowledge distillation eg: GPTQ</li>
  <li>Batching<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>: Batching more requests in the same amount of GPU VRAM</li>
  <li>Separating prefill and decoding stages of generation<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></li>
</ol>

<p>One can refer to blog<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> or <sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup> for an overview of the above concepts.</p>

<p>For this blog, let’s zoom into one specific aspect of improving throughput, i.e. batching. After the model is loaded in the GPU VRAM, whatever remaining memory is available to us is reserved for the KV cache and serving the requests. The only lever that we can control here apart from the model size is the KV cache. Efficiently managing this KV cache can help us dramatically increase throughput by enabling us to batch more requests. For certain use cases, it can increase the throughput by 20x compared to native HuggingFace implementation.</p>

<p>vLLM is one such library that helps us achieve very high throughput. vLLM deploys LLMs on GPUs and focuses on:</p>

<ol>
  <li>Allocating the KV cache in the most efficient way possible</li>
  <li>This, in turn, allows us to increase the batch size and serve more requests per minute</li>
</ol>

<p>In this blog, we will learn about the intuition behind vLLM and its inner workings, and also simulate it for a real-world application to understand the nuances and limitations of the library.</p>

<h2 id="setup">Setup</h2>

<p>Taking real-world numbers around model sizes and GPU VRAM can help visualize and validate the workings of vLLM. Let us consider a case of deploying a Mistral 7B model on the highest-end consumer-grade GPU (Nvidia RTX 4090). If we choose to deploy the model at half-precision (FP 16, each parameter taking 2 bytes), the model would occupy ~14 GB of the VRAM from the available 24 GB VRAM on a 4090 GPU. Assuming an overhead of 3 GBs, the GPU would have 7 GB of VRAM available. This 7 GB of available VRAM will be reserved for the KV cache.</p>

<p><img src="assets/images/post1/image1.png" alt="Memory layout of the GPU" />
Figure 1: Memory layout of the GPU</p>

<p>In our scenario, we would assume 8k as the context length to serve the model. Whenever a request arrives, the model computes the attention scores for all the prompt tokens and then generates one token at a time using autoregressive decoding. While decoding, it requires some VRAM on the GPU to store the token. A single token would take 0.125MB of VRAM to be stored in the KV cache.</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Token size calculation

For every token, we need to store its key and value vectors (K and V). We also need to store them for all the layers and all the KV heads.

The general formula is: 2*2*n*h*d, where the first 2 is for FP 16 values (2 bytes each), the second 2 is for K and V, n is the number of layers, h is the number of KV heads, and d is the dimension of each head
For Mistral 7B, 2*2* 32*8*128 = 0.125 MB

The KV cache for a single request on the complete context length of the model would be 1 GB (8k * 0.125 MB).
</code></pre></div></div>
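<p>The same calculation as a small Python sketch. <code class="language-plaintext highlighter-rouge">kv_cache_bytes_per_token</code> is a hypothetical helper name, and the "MB" and "GB" used in this post are treated as binary units (MiB and GiB), which is what makes the numbers come out round.</p>

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache needed to store one token across all layers."""
    # The leading 2 accounts for storing both K and V
    return 2 * bytes_per_value * n_layers * n_kv_heads * head_dim


MIB = 1024 ** 2
GIB = 1024 ** 3

# Mistral 7B values used in this post: 32 layers, 8 KV heads, head dimension 128
per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
per_request = per_token * 8192  # a request that fills the whole 8k context

print(per_token / MIB)    # 0.125
print(per_request / GIB)  # 1.0
```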

<p><strong>A case for a single GPU serving a single request</strong></p>

<p>If we decide to serve only a single request at a time with this GPU, we would be wasting a lot of resources. Given that 7 GB of VRAM is available for KV cache, the model can store cache for 56k tokens (7 GB/ 0.125 MB). Considering all of the VRAM to be reserved for a single request, the space for 48k tokens (56k-8k) would be wasted since the model has a context length of only 8k tokens. The throughput of the model would be very low (only a single request is being processed at a time) and it is not using all of the VRAM of the GPU available to it. It would be wasting 6 GB of memory for every request.</p>

<p>This is termed as external fragmentation. This is clearly not the best way to utilize the GPU for serving LLMs. Figure 2 shows the extreme version of external fragmentation.</p>

<p><img src="assets/images/post1/image2.png" alt="Inside the KV cache: single request" /></p>

<p>Figure 2: Inside the KV cache: Single request</p>

<p><strong>A case for a single GPU serving multiple requests</strong></p>

<p>How can we improve upon this? Enter batching. In batching, we serve multiple requests at the same time taking advantage of the parallelism of GPUs. Let’s consider a scenario where we are serving multiple requests at the same time of 8k context length each. GPU would need to pre-allocate the space for 8k tokens for every request. For every request, the GPU would need 1 GB of VRAM to store the KV cache. Hence, it would be able to serve 7 requests concurrently (7 GB/ 1 GB). This would avoid external fragmentation in our scenario, but it could lead to another problem.</p>

<p>One thing to note here is that every request might not generate 8k tokens. Request 1 may end up generating 4k tokens, Request 2 may end up just generating 2k tokens, and so on. But since we had already reserved space for all the 8k tokens, we are wasting the memory and not utilizing the complete memory. This is called internal fragmentation.</p>
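<p>To put a number on internal fragmentation, here is a rough sketch that uses the two requests mentioned above. The token counts are the illustrative ones from the paragraph (taken as 4096 and 2048), and <code class="language-plaintext highlighter-rouge">wasted_mib</code> is a hypothetical helper.</p>

```python
MIB_PER_TOKEN = 0.125   # KV cache per token, from the calculation earlier
RESERVED_TOKENS = 8192  # naive batching reserves the full context per request


def wasted_mib(tokens_used: int, reserved: int = RESERVED_TOKENS) -> float:
    """KV cache that was reserved for a request but never filled."""
    return (reserved - tokens_used) * MIB_PER_TOKEN


# Illustrative usage for the two requests described above
used = {"request 1": 4096, "request 2": 2048}

for name, tokens in used.items():
    print(f"{name}: {wasted_mib(tokens):.0f} MiB reserved but unused")

total_wasted = sum(wasted_mib(tokens) for tokens in used.values())
print(f"total: {total_wasted:.0f} MiB")
```

<p>In this example the two requests together leave more than 1 GB of reserved cache untouched, which is more than a full additional request would need.</p>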

<p>There can be another scenario where after allocating the memory for all the requests, the available VRAM of the GPU is less than the memory required for a single request. In this scenario, the memory for the request will not be allocated and the remaining memory will be wasted. This is again a case of external fragmentation.</p>

<p><img src="assets/images/post1/image3.png" alt="Inside the KV cache: multiple requests" /></p>

<p>Figure 3: Inside the KV cache: Multiple requests</p>

<p><strong>A case for a single GPU serving multiple requests efficiently</strong></p>

<p>So, is there any improvement possible over the naive batching method we discussed earlier? Yes, indeed there is a way. Enter vLLM.</p>

<p>Let’s assume that the complete memory of the GPU is broken down into small chunks of memory called blocks. Each block is equivalent to the memory required for 16 tokens (i.e. in our example, 0.125 MB * 16 = 2 MB). Once we allocate memory for a block, even partially, it won’t be available for any other allocation.</p>

<p>Since every request might not need 8k tokens, let’s assume that on average every request would require 5000 tokens. The GPU will allocate 313 blocks (5000/16, rounded up) of memory for the request. These blocks are not stored in a contiguous layout in the memory. Hence, we would need to maintain an address book that maps every request to its corresponding blocks. There’s another optimization here: since the blocks do not have to be contiguous, we don’t need to allocate all of the memory at once. We can allocate a new block only when the previous blocks are filled to capacity. This is the core of how vLLM allocates memory.</p>

<p><img src="assets/images/post1/image4.png" alt="vLLM token to block mapping" /></p>

<p>Figure 4: vLLM token to block mapping.
Source <sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup></p>

<p>The above solves 2 problems:</p>

<ol>
  <li>The request only allocates memory required for its generation instead of pre-allocating for the complete context length of the model. The memory allocation happens at the block level, so technically memory is allocated for 16 tokens at a time. This reduces internal fragmentation significantly
    <ol>
      <li>If the request uses 1.5k tokens, we need to allocate memory only for 94 blocks i.e. 94 * 2 MB = 188 MB, instead of 1 GB for the complete 8k context length of the model</li>
      <li>A single request’s tokens can be stored in multiple blocks</li>
    </ol>
  </li>
  <li>The complete memory is broken down into equally sized blocks, so even external fragmentation is minimized. The block size is chosen such that it fills the available GPU memory evenly.</li>
</ol>
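<p>The "address book" idea can be illustrated with a toy allocator. This is only a sketch of the concept and not vLLM's actual implementation: the <code class="language-plaintext highlighter-rouge">BlockTable</code> class, its methods, and the request names are all hypothetical.</p>

```python
class BlockTable:
    """Toy paged KV-cache allocator: maps a request id to its physical blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.blocks = {}  # request id -> physical block ids, in logical order
        self.tokens = {}  # request id -> number of tokens stored so far

    def append_token(self, request_id: str) -> None:
        used = self.tokens.get(request_id, 0)
        # A new block is needed only when all blocks allocated so far are full
        if used % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("KV cache is full; the request has to wait")
            self.blocks.setdefault(request_id, []).append(self.free_blocks.pop())
        self.tokens[request_id] = used + 1

    def free(self, request_id: str) -> None:
        # Return every block of a finished request to the free pool
        self.free_blocks.extend(self.blocks.pop(request_id, []))
        self.tokens.pop(request_id, None)


table = BlockTable(num_blocks=8)
for _ in range(16):
    table.append_token("chat-a")  # fills exactly one block
for _ in range(5):
    table.append_token("chat-b")  # partially fills a second block
for _ in range(4):
    table.append_token("chat-a")  # spills into a third, non-adjacent block

print(table.blocks)
```

<p>In this run <code class="language-plaintext highlighter-rouge">chat-a</code> ends up with two blocks that are not next to each other, and <code class="language-plaintext highlighter-rouge">chat-b</code> holds a block with 11 empty slots. That partially filled last block is the only internal fragmentation left per request.</p>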

<p>The approaches defined above help in utilizing the GPU VRAM efficiently. Given the block size of 2 MB, vLLM can store a total of ~3500 blocks in the available memory of 7 GB. If each request needs 313 blocks (5k tokens on average) during its lifetime, the GPU would have memory to serve 11 requests in parallel. By using the KV cache more effectively and allocating memory in blocks instead of complete context length, vLLM has increased the throughput from 7 to 11 in our example.</p>

<p>This is how vLLM helps in increasing the batch size and throughput of any model. For computing attention over tokens distributed in non-contiguous blocks, vLLM has introduced Paged Attention. Paged Attention is implemented as optimized CUDA kernels that access tokens from different blocks and compute attention scores over them.</p>

<h2 id="inside-the-simulation">Inside the simulation</h2>

<p>To understand the behavior of vLLM in production, let us simulate a real scenario of a chat application. This chat application uses an LLM and is being served by vLLM. For chat applications, we have another dimension where a single chat can have multiple turns of conversation alternating between user and assistant messages.</p>

<p><img src="assets/images/post1/image5.png" alt="A multi-turn conversation" /></p>

<p>Figure 5: A multi-turn conversation. From the perspective of an LLM, all of these messages are a part of a single request. As the conversation progresses, every new message from the user gets appended to the same request and is sent to the LLM again</p>

<p>Our objective is to predict the behavior of vLLM and then try to replicate it in experiments. To start with, let’s consider some simulation parameters (similar to our example in the previous section):</p>

<ol>
  <li>Block size (number of tokens stored together in one block): 16</li>
  <li>The average number of turns in each chat: 10</li>
  <li>Average input token length at each turn in the chat: 150</li>
  <li>Average output token length at each turn in the chat: 350</li>
  <li>Average latency for each turn in the chat: 10s <sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup></li>
  <li>Average number of tokens required for a single chat session: (150 + 350) * 10 = 5000</li>
  <li>The average number of blocks required for a single chat session: 313 (5000/16, rounded up)</li>
</ol>

<p>For serving an LLM, let’s take any flavor of the Mistral 7B model deployed at half precision. Taking the model parameters,</p>

<ol>
  <li>Model dimension (per attention head): 128</li>
  <li>Number of layers: 32</li>
  <li>Number of KV heads: 8</li>
  <li>Input sequence length: 8192</li>
</ol>

<p>According to these parameters, we would require:</p>

<ol>
  <li>0.125 MB of memory per token in KV cache</li>
  <li>2 MB of memory per block (assuming block size to be 16 tokens, 0.125 * 16 = 2 MB)</li>
</ol>

<p>Assuming 7 GB of KV cache available for our use</p>

<ol>
  <li>We can store ~3500 blocks in GPU VRAM (7GB/2MB)</li>
  <li>As calculated above, given an average of 313 blocks per chat session and 3500 blocks available, we can hold 11 (floor(3500/313)) conversations in a single GPU and serve them in parallel</li>
</ol>
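<p>The capacity numbers above, along with the naive-batching limit from the earlier section, can be checked with a short sketch. All inputs are the values listed in this post, with "MB" and "GB" read as binary units; the variable names are only illustrative.</p>

```python
import math

GIB = 1024 ** 3

kv_cache_bytes = 7 * GIB                # VRAM left for the KV cache
bytes_per_token = 2 * 2 * 32 * 8 * 128  # 0.125 MB per token
block_size = 16                         # tokens per block
context_length = 8192
tokens_per_chat = (150 + 350) * 10      # input + output tokens over 10 turns

bytes_per_block = bytes_per_token * block_size
total_blocks = kv_cache_bytes // bytes_per_block
blocks_per_chat = math.ceil(tokens_per_chat / block_size)

# Paged allocation: each chat only holds the blocks it actually fills
paged_chats = total_blocks // blocks_per_chat

# Naive batching: each chat reserves the full context length up front
naive_chats = kv_cache_bytes // (bytes_per_token * context_length)

print(total_blocks, blocks_per_chat, paged_chats, naive_chats)
```

<p>This gives 3584 blocks in total, 313 blocks per chat, 11 concurrent chats with block-level allocation, and 7 with naive batching.</p>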

<p>Based on our simulation, we calculated that an LLM served by vLLM can serve 11 requests in parallel for our setup. With naive batching, it would not have been able to serve more than 7 requests in parallel (as discussed in the previous sections). Let’s run an experiment to test the calculation. I send <em>N</em> requests at once to a model hosted using the vLLM backend. Note that these requests are long-running (each request has multiple turns).</p>

<p>Below you can find the results from the experiments, where you can see two things:</p>

<ol>
  <li>Scheduler State: Number of requests being served concurrently by vLLM</li>
  <li>Cache Utilization: % of GPU memory being used. Note that this percentage is based on the KV cache space we calculated earlier (i.e. 7 GB is the total GPU memory for the KV cache in our setup. If the utilization is 50%, that would translate to 3.5 GB of KV cache being used)</li>
</ol>

<p>N = 10, we can see that the GPU utilization never reached 100%.</p>

<p><img src="assets/images/post1/image6.png" alt="Scheduler state and cache utilization for N = 10" /></p>

<p>N = 12, we can see that the GPU utilization reached 100%, and 1 of the requests is moved to a waiting queue for some time (where it is not processed). This indicates that the experimental results are close to the limit of 11 concurrent requests that we calculated.</p>

<p><img src="assets/images/post1/image7.png" alt="Scheduler state and cache utilization for N = 12" /></p>

<p>N = 14, we can see that the GPU utilization hits 100% and then approximately 2 requests are moved to the waiting queue</p>

<p><img src="assets/images/post1/image8.png" alt="Scheduler state and cache utilization for N = 14" /></p>

<p>We can notice two things here:</p>

<ol>
  <li>It takes some time for the GPU to reach 100% utilization. This is because currently we have deployed a chat application where each turn takes 10 seconds and we have a total of 10 turns. So, the KV cache keeps on getting larger and larger as time goes by. But once the chat conversation ends after 10 turns, we will notice a drop in the GPU utilization.</li>
  <li>If we go above the calculated parallel limit of our chats, we will eventually see some requests being transferred to a waiting queue. That implies the GPU is completely utilized and it can not process all the requests in a single batch.</li>
</ol>
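<p>Both observations can be approximated with a back-of-the-envelope model. The sketch below is not the experiment code: it assumes that all <em>N</em> chats start together, advance one turn at a time in lockstep, and use exactly the average number of tokens per turn. The function <code class="language-plaintext highlighter-rouge">utilization_by_turn</code> is hypothetical.</p>

```python
import math

TOTAL_BLOCKS = 3584          # blocks that fit in the 7 GB KV cache
BLOCK_SIZE = 16              # tokens per block
TOKENS_PER_TURN = 150 + 350  # average input + output tokens per turn
TURNS = 10


def utilization_by_turn(num_chats: int) -> list:
    """Fraction of KV-cache blocks in use after each turn, capped at 1.0."""
    result = []
    for turn in range(1, TURNS + 1):
        # The whole conversation so far stays in the cache
        blocks_per_chat = math.ceil(turn * TOKENS_PER_TURN / BLOCK_SIZE)
        result.append(min(1.0, num_chats * blocks_per_chat / TOTAL_BLOCKS))
    return result


for n in (10, 12, 14):
    peak = max(utilization_by_turn(n))
    print(f"N={n}: peak cache utilization {peak:.0%}")
```

<p>Under these assumptions, N = 10 peaks at about 87% while N = 12 and N = 14 hit the cap, which is in line with the plots above. A real scheduler behaves differently once the cap is reached because it queues requests instead of letting the cache overflow.</p>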

<p>The complete experiment can be rerun and you can find the code used to run the experiments <a href="https://github.com/cmeraki/vllm-simulation">here</a>.</p>

<p>An overview of all the parameters we discussed is mentioned below for reference. You can make a copy of the following <a href="https://docs.google.com/spreadsheets/d/1BsLg2zcqSgiEyssqH9Wt0qG-mKGJ9S7rHjuSt8iH3hA/edit?usp=sharing">sheet</a> and play with simulation parameters to understand the requirements. Yellow blocks can be updated, and green blocks are calculated ones.</p>

<table>
  <tr>
   <td><strong>Model Parameters</strong>
   </td>
   <td><strong>Value</strong>
   </td>
   <td><strong>Units</strong>
   </td>
  </tr>
  <tr>
   <td>Model size
   </td>
   <td><p style="text-align: right">
7.00</p>

   </td>
   <td>B
   </td>
  </tr>
  <tr>
   <td>Model dim
   </td>
   <td><p style="text-align: right">
128</p>

   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>Model layers
   </td>
   <td><p style="text-align: right">
32</p>

   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>Model KV heads
   </td>
   <td><p style="text-align: right">
8</p>

   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>Bytes per parameter
   </td>
   <td><p style="text-align: right">
2</p>

   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>Input sequence length
   </td>
   <td><p style="text-align: right">
8192</p>

   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>
   </td>
   <td>
   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td><strong>vLLM Parameters</strong>
   </td>
   <td>
   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>Block size
   </td>
   <td><p style="text-align: right">
16</p>

   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>
   </td>
   <td>
   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td><strong>GPU Parameters</strong>
   </td>
   <td>
   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>Memory
   </td>
   <td><p style="text-align: right">
24</p>

   </td>
   <td>GB
   </td>
  </tr>
  <tr>
   <td>Utilization
   </td>
   <td><p style="text-align: right">
100%</p>

   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>Buffer
   </td>
   <td><p style="text-align: right">
3</p>

   </td>
   <td>GB
   </td>
  </tr>
  <tr>
   <td>
   </td>
   <td>
   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td><strong>Simulation params</strong>
   </td>
   <td>
   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>Total turns in a chat
   </td>
   <td><p style="text-align: right">
10</p>

   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>Input tokens in a turn
   </td>
   <td><p style="text-align: right">
150</p>

   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>Output tokens in a turn
   </td>
   <td><p style="text-align: right">
350</p>

   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>
   </td>
   <td>
   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td><strong>Experimental results</strong>
   </td>
   <td>
   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>Average latency per turn
   </td>
   <td><p style="text-align: right">
10</p>

   </td>
   <td>s
   </td>
  </tr>
  <tr>
   <td>
   </td>
   <td>
   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td><strong>Calculations</strong>
   </td>
   <td>
   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>Memory per token
   </td>
   <td><p style="text-align: right">
0.125</p>

   </td>
   <td>MB
   </td>
  </tr>
  <tr>
   <td>Memory per block
   </td>
   <td><p style="text-align: right">
2</p>

   </td>
   <td>MB
   </td>
  </tr>
  <tr>
   <td>Memory remaining for KV cache
   </td>
   <td><p style="text-align: right">
7</p>

   </td>
   <td>GB
   </td>
  </tr>
  <tr>
   <td>Total token length of a chat
   </td>
   <td><p style="text-align: right">
5000</p>

   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>Total blocks required for a chat
   </td>
   <td><p style="text-align: right">
313</p>

   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>Blocks that can be stored in KV cache
   </td>
   <td><p style="text-align: right">
3584</p>

   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>Total chats that can be served concurrently at full context length
   </td>
   <td><p style="text-align: right">
11</p>

   </td>
   <td>
   </td>
  </tr>
</table>

<h2 id="notes">Notes</h2>

<p>vLLM does a few more things:</p>

<ol>
  <li>KV cache reuse: By reusing the KV cache for different requests, a new request can skip computing the attention scores for the common tokens. This translates to lower latency. However, this is not a contribution of the vLLM paper; KV caching is a common technique used during LLM serving
    <ol>
      <li>Single prompt, multiple generations: vLLM can cache a common prompt or prefix and use that for multiple generations. This is similar to the above and helps in reducing latency</li>
      <li>Parallel sampling and beam search: Following on from the above, vLLM also implements KV cache reuse for parallel sampling and beam search.</li>
    </ol>
  </li>
  <li>Pause the world: Whenever a new request comes in between the decoding stage of ongoing requests in the batch, vLLM pauses the generation of requests in the batch and computes the KV cache for the new request. Once the KV cache is computed, it adds it to the batch and continues decoding the new batch
    <ol>
      <li>This results in higher latency if too many requests are coming back to back</li>
      <li>vLLM is working to update this behavior</li>
    </ol>
  </li>
  <li>Queue: vLLM also provides a FastAPI server on top of its backend. It implements a queue that holds the requests that vLLM cannot serve while the GPU memory is full</li>
</ol>

<h2 id="citations">Citations</h2>

<p>For attribution, please cite this as</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{romit2024throughput,
  title   = {Throughput is all you need},
  author  = {Jain, Romit},
  journal = {cmeraki.github.io},
  year    = {2024},
  month   = {April},
  url     = {https://cmeraki.github.io/throughput-is-all-you-need.html}
}
</code></pre></div></div>

<h2 id="references">References</h2>

<p>These are some of the references that I have linked throughout the blog and some general recommended reading for getting a better understanding of the concepts we discussed in the blog.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p><a href="https://horace.io/brrr_intro.html">Making Deep Learning go Brrrr From First Principles</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p><a href="https://www.anyscale.com/blog/continuous-batching-llm-inference#the-basics-of-llm-inference">How continuous batching enables 23x throughput in LLM inference while reducing p50 latency</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p><a href="https://hao-ai-lab.github.io/blogs/distserve/">Throughput is Not All You Need: Maximizing Goodput in LLM Serving using Prefill-Decode Disaggregation</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p><a href="https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices">LLM Inference Performance Engineering: Best Practices</a> <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p><a href="https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/">Mastering LLM Techniques: Inference Optimization</a> <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:6" role="doc-endnote">
      <p><a href="https://blog.vllm.ai/2023/06/20/vllm.html">vLLM</a> <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:7" role="doc-endnote">
      <p>For single-token generation, latency is usually bound by the memory bandwidth of the GPU, because all the model weights have to be read for every token. Consider an Nvidia 4090, which has a memory bandwidth of 1008 GB/s, and Mistral 7B, which has about 14 GB of parameters in 16-bit precision. The ideal decoding speed is then 72 tok/s (1008/14). In the real world, you can expect to get around 60 tok/s</p>

      <ol>
        <li>Generating 600 tokens therefore takes about 10s (600/60)</li>
        <li>Refer to this blog for more explanation: <a href="https://kipp.ly/transformer-inference-arithmetic/#latency-calculations">Transformer Inference Arithmetic</a></li>
      </ol>
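      <p>The same arithmetic as a short sketch, using only the numbers quoted in this footnote. The function name is a hypothetical helper, and the estimate assumes every generated token requires reading all the weights from GPU memory once.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def ideal_decode_speed(bandwidth_gb_per_s, model_size_gb):
    """Upper bound on tokens/s when each token reads all weights once."""
    return bandwidth_gb_per_s / model_size_gb


ideal_tok_per_s = ideal_decode_speed(1008, 14)  # Nvidia 4090, Mistral 7B
ideal_ms_per_token = 1000 / ideal_tok_per_s
realistic_time_s = 600 / 60                     # 600 tokens at 60 tok/s
print(ideal_tok_per_s, round(ideal_ms_per_token, 1), realistic_time_s)
# prints 72.0 13.9 10.0
</code></pre></div></div>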
      <p><a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Meraki Labs</name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Building Tts</title><link href="https://cmeraki.github.io/building-tts.html" rel="alternate" type="text/html" title="Building Tts" /><published>2021-11-21T00:00:00+00:00</published><updated>2021-11-21T00:00:00+00:00</updated><id>https://cmeraki.github.io/building-tts</id><content type="html" xml:base="https://cmeraki.github.io/building-tts.html"><![CDATA[<h2 id="indri-tts--asr">Indri TTS / ASR</h2>
<h6 id="nov-21-2024">Nov 21, 2024</h6>

<p>Today, we are releasing the Indri TTS model series: multilingual, fully autoregressive TTS models with 124M and 350M parameters that can produce hyper-realistic human voices. You can try the models at https://indrivoice.ai, or download them from GitHub or Hugging Face and run them on your own machine. The models currently support English, Hindi and Kannada. New languages are easy to add using the scripts provided in the Git repo.</p>

<p>Indri can generate hyper-realistic audio that is very hard to differentiate from real speech. It faithfully reproduces background noises, echoes, music and non-speech sounds along with speech. Here are a few examples of generations:</p>

<h3 id="data">Data</h3>
<p>We used 20k hours of available English TTS data, along with 5k hours of data per language.</p>

<p>We collected videos with clean audio from sharing websites and passed them through whisper-v3-turbo to generate transcriptions. The transcribed chunks are limited to 15s in length. We also post-process the chunks and remove any silences longer than 250ms.</p>

<h4 id="what-to-look-for-in-data-">What to look for in data?</h4>
<ol>
  <li>Clearly spoken speech. You should be able to make out the words that are being spoken. For example, podcasts and talks make great sources, whereas on-site news or action movie clips do not.</li>
  <li>No background music or other sounds. Although the background can be removed with source separation, this leaves artifacts that the model learns to replicate.</li>
</ol>

<h3 id="modelling">Modelling</h3>
<h4 id="audio-tokenizer">Audio Tokenizer</h4>
<p>Tokenizers are covered in detail in the previous blog. If you haven’t read it, go through the tokenizers blog to understand how to choose an audio tokenizer.</p>

<h4 id="impact-of-tokenizer">Impact of tokenizer</h4>
<ol>
  <li>Small context length: A tokenizer with a low frame rate produces shorter sequences, which are easier to model. E.g. Hubert is 50Hz, and encoded</li>
  <li>Speed:</li>
  <li>Final model size:</li>
</ol>

<p>We use the Mimi tokenizer (link), which produces 32 codebooks at 12.5Hz. We found 8 codebooks to be sufficient to faithfully reproduce the audio under consideration.</p>
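
<p>To get a feel for what these numbers mean for sequence length, here is a rough count of audio tokens per clip. The helper name is hypothetical, and rounding a partial last frame up is an assumption about the tokenizer rather than something we verified.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

FRAME_RATE_HZ = 12.5  # Mimi frame rate quoted above


def audio_token_count(duration_s, n_codebooks, frame_rate_hz=FRAME_RATE_HZ):
    """Tokens needed for a clip, assuming a partial last frame is padded."""
    frames = math.ceil(duration_s * frame_rate_hz)
    return frames * n_codebooks


# A 15s chunk with 8 codebooks versus all 32 codebooks.
print(audio_token_count(15, 8), audio_token_count(15, 32))
# prints 1504 6016
</code></pre></div></div>

<p>Keeping 8 of the 32 codebooks makes every sequence four times shorter for the same audio.</p>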

<h4 id="handling-audio-tokens">Handling audio tokens</h4>
<p>Transformers are good at modelling 1D sequences. Audio tokenizers convert audio into n codebooks at a frame rate of k Hz, giving a 2D sequence of tokens.</p>

<p>We convert this to a 1D sequence by weaving codebooks together. Tokens of the n-th codebook are offset by (n-1) x tokens_per_codebook. Both semantic and audio tokens are woven together in a single sequence.</p>

<p>For n_codebooks = 2, tokens_per_codebook = 16:</p>

<p>\(\begin{bmatrix}
1 &amp; 5 &amp; 3 \\
12 &amp; 8 &amp; 9 \\
\end{bmatrix}\)
converts to 
\(\begin{bmatrix}
1 &amp; 5 &amp; 3 &amp; 12 + 16 &amp; 8 + 16 &amp; 9 + 16 \\
\end{bmatrix}\)</p>

<p>This results in an audio vocab of size n_codebooks x tokens_per_codebook.</p>
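
<p>The offsetting step can be sketched in a few lines of plain Python. This is an illustration, not the code from our repo: <code class="language-plaintext highlighter-rouge">flatten_codebooks</code> and its <code class="language-plaintext highlighter-rouge">interleave</code> flag are hypothetical names. The default order reproduces the worked example above (codebook by codebook). The frame-by-frame order is shown as an alternative, since it is the order a streaming decoder would need.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def flatten_codebooks(codes, tokens_per_codebook, interleave=False):
    """Flatten an (n_codebooks x n_frames) grid of token ids into 1D."""
    # Shift the n-th codebook (0-based index) by index * tokens_per_codebook
    # so that ids from different codebooks never collide.
    shifted = [
        [token + index * tokens_per_codebook for token in row]
        for index, row in enumerate(codes)
    ]
    if interleave:
        # Frame by frame: every codebook of frame 0, then frame 1, ...
        return [token for frame in zip(*shifted) for token in frame]
    # Codebook by codebook, as in the worked example.
    return [token for row in shifted for token in row]


codes = [[1, 5, 3], [12, 8, 9]]
print(flatten_codebooks(codes, 16))
# prints [1, 5, 3, 28, 24, 25]
print(flatten_codebooks(codes, 16, interleave=True))
# prints [1, 28, 5, 24, 3, 25]
</code></pre></div></div>

<p>In both orders every id stays below n_codebooks x tokens_per_codebook = 32, which is the audio vocab size.</p>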

<p>We bring text and audio tokens into a common embedding space and train a small transformer (gpt2) over text+audio sequences.</p>

<h3 id="sequences">Sequences</h3>
<p>Indri is a multimodal, decoder-only transformer (GPT-2 architecture) that consumes and generates both audio and text tokens as part of the same sequence. We cast tasks such as TTS, ASR and continuation as sequence-to-sequence problems and indicate the task with special tokens.</p>

<p>TTS systems such as SPEAR-TTS use a tiered approach in which two models are trained:</p>
<ol>
  <li>text to semantic tokens : learns to read</li>
  <li>semantic to acoustic tokens : learns to speak</li>
</ol>

<p>This separates the speaker's voice characteristics (e.g. pitch) from reading (e.g. speed, accent). However, the first model has to finish its generation before the next model can start producing output, so streaming can only start once all semantic tokens are ready.</p>

<p>We use a single model to generate both semantic and acoustic tokens. Hence we can stream output as soon as the first audio tokens have been generated.</p>

<h4 id="token-sequence">Token Sequence</h4>

<p>We use special tokens to indicate:</p>
<ol>
  <li>start of modality <code class="language-plaintext highlighter-rouge">&lt;text&gt;, &lt;audio&gt;</code></li>
  <li>a common stop token <code class="language-plaintext highlighter-rouge">&lt;stop&gt;</code></li>
  <li>speaker identifier <code class="language-plaintext highlighter-rouge">&lt;speaker_idx&gt;</code></li>
  <li>task <code class="language-plaintext highlighter-rouge">&lt;tts&gt;, &lt;asr&gt;</code></li>
</ol>
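
<p>A rough sketch of how such sequences could be assembled from these special tokens. The ordering of the tokens, the helper names and the per-speaker token naming are all assumptions made for illustration. The exact layout used for training is not specified here.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TEXT, AUDIO, STOP = "&lt;text&gt;", "&lt;audio&gt;", "&lt;stop&gt;"
TTS, ASR = "&lt;tts&gt;", "&lt;asr&gt;"


def speaker_token(idx):
    # Hypothetical naming: one special token per speaker id.
    return f"&lt;speaker_{idx}&gt;"


def tts_sequence(text_tokens, audio_tokens, speaker_idx):
    """Text in, audio out (assumed layout)."""
    return [TEXT, *text_tokens, TTS, speaker_token(speaker_idx),
            AUDIO, *audio_tokens, STOP]


def asr_sequence(audio_tokens, text_tokens):
    """Audio in, text out (assumed layout)."""
    return [AUDIO, *audio_tokens, ASR, TEXT, *text_tokens, STOP]


print(tts_sequence(["hello"], [17, 42], speaker_idx=3))
print(asr_sequence([17, 42], ["hello"]))
</code></pre></div></div>

<p>At inference time the sequence would be cut right after the modality token of the output, and the model generates until it emits the stop token.</p>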

<h3 id="references">References</h3>]]></content><author><name>Meraki Labs</name></author><summary type="html"><![CDATA[Indri TTS / ASR Nov 21, 2024]]></summary></entry></feed>