Memory Manager | Internals for Interns

In the previous article we looked at how a user program crosses the ring 3 → ring 0 boundary to ask the kernel for help. The example we used was read() — a file descriptor, a buffer pointer, a byte count. But we glossed over something important: what is that buffer? Who decided it existed? Who owns the physical RAM behind it?

Those questions are what the memory manager answers. And it answers them for every process on the machine, simultaneously, for every allocation that has ever happened since boot. It’s one of the most complex subsystems in the kernel, so I want to approach it the way you’d approach an unfamiliar library — start at the front desk with the catalog, then walk back through the stacks.

A Metaphor to Carry Us Through

Here’s the mental model I want you to hold onto: think of memory management as an enormous public library in your town.

The library has one vast collection of physical shelves, and every shelf is identical: one shelf is a page frame — 4 KB of physical RAM. There’s a fixed number of them: on a machine with 16 GB of RAM, about four million shelves, and that’s all you’ll ever have. This is the real, finite, physical stuff.

Now the trick. No reader ever walks the shelves directly. Instead, the reader goes to the library’s service desk and asks for an entry by its number. The desk keeps a private catalog for every reader — a personal set of index cards — and looks yours up to find where the entry really lives. A card says “the thing you call entry #5,000 lives on physical shelf 19.” Two different readers can both have a card numbered #5,000, and those cards can point at completely different shelves — or, sometimes, deliberately at the same shelf.

With that in mind, let’s go from the ground up. We start with the hardware.

The MMU and Page Tables

Remember the service desk: you hand it an entry number, it looks the entry up in your catalog, and it tells you the real shelf the data sits on — and it does this for every single thing you ask for. That desk is real hardware, the MMU (Memory Management Unit), and the catalog it reads is a tree of kernel-managed tables called page tables.

Now the concepts behind the metaphor. Every time an application touches memory — reading a variable, calling a function, fetching a string — it uses a virtual address (the entry number you hand the desk), and that address never reaches the RAM chips directly. The MMU has to translate it into the physical address where the data actually lives, and to do that it walks the process’s page tables. The kernel writes the cards; the MMU reads them, billions of times a second.

But the library has a catalog for every reader — how does the MMU know it’s your application’s catalog it should be reading, and not someone else’s? From the CR3 register. (CR3 is the x86-64 name; other architectures use a different register for the same job — TTBR0_EL1 on ARM64, satp on RISC-V — but the idea is identical.) CR3 holds the physical address of the root of your process’s page table tree — its PGD (Page Global Directory) — and that root is the single thing that points at your exclusive catalog and defines what memory belongs to you. On every context switch, the kernel just reloads CR3 with the next process’s PGD, swapping in a completely different catalog: from that instant the same virtual addresses resolve to entirely different physical memory.

That catalog isn’t one giant flat table — that would be impossibly large. It’s a tree of nested drawers, and on modern x86-64 with 5-level paging the tree is five levels deep: PGD → P4D → PUD → PMD → PTE. (Most x86-64 machines actually run just four levels — PGD → PUD → PMD → PTE, with the P4D folded away — since 5-level paging needs both newer hardware and a kernel built for it; but five is the general case, and the extra level changes nothing about how the walk works.) The clever part is that the virtual address itself tells the MMU which drawer to open at each level. The hardware slices the address into six pieces — one index per level, plus a final offset. Say our process reads from:

Diagram slicing the virtual address 0x0005A0320C82A000 into its six fields: the PGD, P4D, PUD, PMD and PTE indexes (9 bits each; here 5, 320, 200, 100 and 42) plus the 12-bit offset within the page (here 0x000)

Each index is 9 bits — exactly enough to pick one of 512 entries (2⁹ = 512), because every drawer is a single 4 KB page holding 512 entries of 8 bytes (64 bits) each (512 × 8 = 4096). The trailing 12 bits aren’t an index at all; they’re the offset inside the final page (2¹² = 4096, our 4 KB page size).
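To make the slicing concrete, here is a minimal sketch in C that pulls the six fields out of a 64-bit virtual address using only shifts and masks. The struct and function names (va_fields, split_va) are made up for this illustration and are not kernel identifiers; the bit positions assume 5-level paging with 4 KB pages, as described above.

```c
#include <stdint.h>

#define TOY_PAGE_SHIFT 12                        /* low 12 bits: offset in the page */
#define TOY_INDEX_BITS 9                         /* 9 bits per level: 512 entries   */
#define TOY_INDEX_MASK ((1u << TOY_INDEX_BITS) - 1)

/* Hypothetical container for the six fields; not a kernel type. */
struct va_fields {
    unsigned pgd, p4d, pud, pmd, pte;            /* one 9-bit index per level */
    unsigned offset;                             /* 12-bit offset in the page */
};

/* Extract the 9-bit index that starts at bit `shift`. */
static unsigned index_at(uint64_t va, unsigned shift)
{
    return (unsigned)((va >> shift) & TOY_INDEX_MASK);
}

static struct va_fields split_va(uint64_t va)
{
    struct va_fields f;
    f.offset = (unsigned)(va & ((1u << TOY_PAGE_SHIFT) - 1));
    f.pte = index_at(va, TOY_PAGE_SHIFT);                       /* bits 12..20 */
    f.pmd = index_at(va, TOY_PAGE_SHIFT + 1 * TOY_INDEX_BITS);  /* bits 21..29 */
    f.pud = index_at(va, TOY_PAGE_SHIFT + 2 * TOY_INDEX_BITS);  /* bits 30..38 */
    f.p4d = index_at(va, TOY_PAGE_SHIFT + 3 * TOY_INDEX_BITS);  /* bits 39..47 */
    f.pgd = index_at(va, TOY_PAGE_SHIFT + 4 * TOY_INDEX_BITS);  /* bits 48..56 */
    return f;
}
```

Calling split_va(0x0005A0320C82A000) yields the indexes 5, 320, 200, 100 and 42 with an offset of 0, which are the drawer numbers used in the walk.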

Now the walk. The MMU starts at the PGD that CR3 points to, uses the first index (5) to pick an entry, and that entry hands it the physical address of the next drawer down. It uses the next index (320) to pick an entry there, which points to the next drawer, and so on, one level at a time:

Diagram of the five-level page table walk starting from CR3: each level (PGD, P4D, PUD, PMD, PTE) uses one index from the virtual address to pick an entry that points to the next drawer down, ending at the page frame @ 0x8E33000

You might wonder why those entry values look like 0x3A05007 instead of a clean 0x3A05000. Since every drawer is page-aligned, the low 12 bits of any next-drawer address are always zero, so the entry format reuses them for a handful of flag bits describing what you’re allowed to do with whatever the entry points to: whether the page is writable, whether it’s actually in RAM right now, whether user-space is allowed to reach it. (One important flag sits at the other end of the entry: the no-execute bit, which controls whether code can be executed from the page, is the top bit, bit 63.) That trailing 0x007, for instance, just means present, writable, and user-accessible. These bits are also how memory protection is enforced: a code page leaves the writable bit cleared so you can’t overwrite your own instructions, and a stack page is marked no-execute so injected machine code won’t run. (They’re all defined in arch/x86/include/asm/pgtable_types.h.)
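As a rough illustration of how one entry packs an address and its flags together, the sketch below decodes a value like 0x3A05007. The macro and function names are invented for this example (the kernel’s own definitions live in the header mentioned above); the bit positions are the x86-64 ones: present is bit 0, writable bit 1, user bit 2, and no-execute bit 63. The address mask assumes the next-level address occupies bits 12 to 51.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names; bit positions follow the x86-64 entry layout. */
#define TOY_PTE_PRESENT   (1ULL << 0)
#define TOY_PTE_WRITABLE  (1ULL << 1)
#define TOY_PTE_USER      (1ULL << 2)
#define TOY_PTE_NX        (1ULL << 63)
#define TOY_PTE_ADDR_MASK 0x000FFFFFFFFFF000ULL   /* bits 12..51 */

/* Physical address of the next drawer (or final frame), flags stripped. */
static uint64_t pte_next_addr(uint64_t entry) { return entry & TOY_PTE_ADDR_MASK; }

static bool pte_present(uint64_t entry)  { return (entry & TOY_PTE_PRESENT)  != 0; }
static bool pte_writable(uint64_t entry) { return (entry & TOY_PTE_WRITABLE) != 0; }
static bool pte_user(uint64_t entry)     { return (entry & TOY_PTE_USER)     != 0; }
static bool pte_no_exec(uint64_t entry)  { return (entry & TOY_PTE_NX)       != 0; }
```

For the entry 0x3A05007, pte_next_addr() gives 0x3A05000 and the three low flags all read as set, matching the “present, writable, user-accessible” reading of 0x007.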

The entry at the bottom — the PTE (Page Table Entry) — is the one that finally names a real page frame: 4 KB of physical RAM (one shelf in our library). The MMU takes that frame address and adds the 12-bit offset it set aside at the very start, and that is the physical address it reads from memory — the exact byte our virtual address was pointing at all along:

physical = 0x8E33000 + 0x000 = 0x8E33000
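The whole walk can be imitated in user space with a toy model. The sketch below is not how the hardware or the kernel does it: the “physical addresses” of the intermediate tables are just host pointers, only a single present flag is modeled, and every name (toy_map, toy_translate, and so on) is invented for the example. It assumes a 64-bit host and C11’s aligned_alloc, so that each table is page-aligned and its low 12 bits are free for flags.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_LEVELS    5                     /* PGD, P4D, PUD, PMD, PTE */
#define TOY_ENTRIES   512
#define TOY_PAGE_SIZE 4096ULL
#define TOY_FLAG_MASK (TOY_PAGE_SIZE - 1)   /* low 12 bits hold flags */
#define TOY_PRESENT   0x1ULL

/* Index for a level: level 0 is the PGD (bits 48..56), level 4 the PTE. */
static unsigned toy_level_index(uint64_t va, int level)
{
    int shift = 12 + 9 * (TOY_LEVELS - 1 - level);
    return (unsigned)((va >> shift) & (TOY_ENTRIES - 1));
}

/* One zeroed, page-aligned table ("drawer"). Never freed in this sketch. */
static uint64_t *toy_new_table(void)
{
    uint64_t *t = aligned_alloc(TOY_PAGE_SIZE, TOY_PAGE_SIZE);
    if (t)
        memset(t, 0, TOY_PAGE_SIZE);
    return t;
}

/* Install va -> frame, creating missing drawers on the way. 0 on success. */
static int toy_map(uint64_t *root, uint64_t va, uint64_t frame)
{
    uint64_t *table = root;
    for (int level = 0; level < TOY_LEVELS - 1; level++) {
        uint64_t *slot = &table[toy_level_index(va, level)];
        if (!(*slot & TOY_PRESENT)) {
            uint64_t *next = toy_new_table();
            if (!next)
                return -1;
            *slot = (uint64_t)(uintptr_t)next | TOY_PRESENT;
        }
        table = (uint64_t *)(uintptr_t)(*slot & ~TOY_FLAG_MASK);
    }
    table[toy_level_index(va, TOY_LEVELS - 1)] = (frame & ~TOY_FLAG_MASK) | TOY_PRESENT;
    return 0;
}

/* Walk the tree. Returns 1 and fills *pa, or 0 where real hardware would fault. */
static int toy_translate(const uint64_t *root, uint64_t va, uint64_t *pa)
{
    const uint64_t *table = root;
    for (int level = 0; level < TOY_LEVELS; level++) {
        uint64_t entry = table[toy_level_index(va, level)];
        if (!(entry & TOY_PRESENT))
            return 0;
        if (level == TOY_LEVELS - 1) {
            *pa = (entry & ~TOY_FLAG_MASK) | (va & TOY_FLAG_MASK);
            return 1;
        }
        table = (const uint64_t *)(uintptr_t)(entry & ~TOY_FLAG_MASK);
    }
    return 0;
}
```

After toy_map(root, 0x0005A0320C82A000, 0x8E33000), translating that address returns 0x8E33000, and translating the same address plus 0x123 returns 0x8E33123: same frame, different offset.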

Five lookups to turn one virtual address into one physical address. (The CPU caches recent translations in a small buffer called the TLB so it doesn’t repeat this walk on every access; and the L1/L2/L3 caches keep recently-used data close to the CPU so it doesn’t hit RAM every time. We won’t cover either here.)

Now that we have a physical address, we can go to the right place to fetch the information — but who keeps track of which pages are occupied, which are free, and what hardware rules apply to each? That’s what the physical layer is for.

Shelves, Wings and the Shelf Registry

The library keeps a registry with one entry describing the state of every single shelf. The kernel does exactly this: for every 4 KB page frame in the machine there’s one small record, called a struct page. Conceptually, you can think of those records as a registry indexed by the page’s serial number, its PFN (Page Frame Number): give the kernel a PFN, and it can find the corresponding struct page through its memmap/vmemmap machinery. On a machine with 16 GB of RAM, that’s over four million records.
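The PFN arithmetic is simple enough to write down. This is a small sketch with invented helper names; the 64-byte record size is an assumption (a common figure for struct page on x86-64, but it depends on kernel version and configuration), used only to estimate how much RAM the registry itself occupies.

```c
#include <stdint.h>

#define TOY_PAGE_SHIFT 12                  /* 4 KB pages */
#define ASSUMED_STRUCT_PAGE_BYTES 64ULL    /* assumption: varies with kernel config */

/* A PFN is just the physical address with the in-page offset dropped. */
static uint64_t phys_to_pfn(uint64_t phys) { return phys >> TOY_PAGE_SHIFT; }
static uint64_t pfn_to_phys(uint64_t pfn)  { return pfn << TOY_PAGE_SHIFT; }

/* Number of page frames (and therefore registry records) for a RAM size. */
static uint64_t frame_count(uint64_t ram_bytes) { return ram_bytes >> TOY_PAGE_SHIFT; }

/* Rough size of the registry itself under the assumed record size. */
static uint64_t registry_bytes(uint64_t ram_bytes)
{
    return frame_count(ram_bytes) * ASSUMED_STRUCT_PAGE_BYTES;
}
```

For 16 GB of RAM, frame_count() gives 4,194,304 frames, and under the 64-byte assumption the registry would take 256 MB, which shows why the kernel works to keep each record small.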

Each record only holds the essentials the kernel needs to manage that one page. It knows whether the page is free or in use. It keeps a reference count — how many parts of the system are still relying on that page — so the kernel can tell when nobody needs it anymore and it’s safe to reclaim. It remembers what the page is backing, whether that’s a piece of a file’s cached contents or anonymous memory like a process’s heap or stack, and how many page tables are pointing at it, which matters when a page is shared between several processes. And it carries a few status flags: is the page dirty (written to but not yet saved to disk), is it locked, is it being written back right now. That handful of fields, multiplied across all four million pages, is enough for the kernel to track the life of every piece of physical memory in the machine.

One modern wrinkle worth flagging before we move on: the kernel increasingly manages shelves not one at a time but in shrink-wrapped bundles of adjacent ones called folios — a single index card for a whole run of pages, so a larger file chunk or a huge page can be tracked, reclaimed, and moved as one unit instead of page by page. In fact, most of the bookkeeping just described (the reference count, the dirty/locked flags, what the page is backing, its place on the reclaim lists we’ll meet later) lives on the folio now, with struct page itself being slimmed down. We’ll keep saying “page” throughout for clarity, but folio is the word you’ll run into all over current kernel source — just read it as “one or more pages handled together.” We won’t cover it further here.

That registry treats every page the same way, but the hardware underneath doesn’t — some come with strings attached.

Zones: Pages With Special Rules

Not all pages are equally usable by everyone. Picture a delivery courier whose cart only reaches the lowest shelves — anything that courier handles has to be placed down low, never on the upper floors. Physical RAM has the same kind of quirk: some old devices can only do DMA (Direct Memory Access) into the lowest 16 MB of memory, and some 32-bit devices can’t address anything above 4 GB. To keep track of these constraints, the kernel splits RAM into zones — a low zone for those restricted devices, a normal zone where the vast majority of allocations land, and a few special-purpose ones. Most of the time everything comes from the normal zone; the restricted zones only matter when a driver specifically needs low-address memory.

Zones sort pages by what they’re allowed to do; a second dimension matters just as much — where in the machine a page physically sits.

NUMA Nodes: Different Wings, Different Walks

Picture the library spread across several wings. Books in your own wing are a short walk away; fetching one from a distant wing means a long trek down the corridor. On multi-socket machines, RAM is physically attached to different CPU sockets in just this way. A CPU reading from RAM attached to its own socket is fast (local access); reading from another socket’s RAM is slower (remote access). The kernel models this with NUMA (Non-Uniform Memory Access) nodes, each with its own set of zones. That’s why, whenever it can, it tries to satisfy an allocation from the node’s local memory and dodge the remote-access penalty; on a single-socket machine the question doesn’t even come up, since there’s only one node.

Now we know how the pages are organized — so how does the library actually hand them out?

The Buddy Allocator — Reserving Runs of Pages

The librarians don’t just give out one page at a time. They keep their free space organized as runs of adjacent empty pages, and when you ask for space they find a run of the right length — splitting a long run in half when a shorter one will do, and gluing two short runs back together when both come free. That’s the buddy allocator, implemented in mm/page_alloc.c (~7,800 lines). It manages free physical pages in lists grouped by order, where order n means a block of 2ⁿ contiguous pages:

Order 0: 4 KB (1 page)
Order 1: 8 KB (2 pages)
Order 2: 16 KB (4 pages)
…
Order 10: 4 MB (1,024 pages)

When the kernel wants to allocate a 16 KB block (order 2), the buddy allocator looks at the order-2 free list. If there’s a run there, it takes it. If not, it grabs an order-3 block (32 KB), splits it in half (the two halves are buddies, adjacent runs of pages), hands one half to the caller, and puts the other on the order-2 list. When the block is later freed, the allocator checks if its buddy is also free; if so, it glues them back into a longer run. This merging keeps large contiguous stretches of memory available for things like huge pages and DMA buffers.
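Part of what makes the scheme cheap is that finding a block’s buddy needs no search: because blocks of order n are aligned to 2ⁿ pages, a block and its buddy differ only in bit n of their starting PFN. The sketch below shows that arithmetic with made-up helper names; as far as I know the kernel’s own buddy lookup relies on the same XOR, but treat this as an illustration rather than the kernel’s code.

```c
#include <stdint.h>

/* Smallest order whose block (2^order pages) can hold `pages` pages. */
static unsigned order_for_pages(uint64_t pages)
{
    unsigned order = 0;
    while ((1ULL << order) < pages)
        order++;
    return order;
}

/* A block and its buddy differ only in bit `order` of the starting PFN. */
static uint64_t buddy_pfn(uint64_t pfn, unsigned order)
{
    return pfn ^ (1ULL << order);
}

/* Starting PFN of the order+1 block formed when the two buddies merge. */
static uint64_t merged_pfn(uint64_t pfn, unsigned order)
{
    return pfn & ~(1ULL << order);
}
```

For example, a 16 KB request is 4 pages, so order_for_pages(4) is 2; the order-2 block starting at PFN 12 has its buddy at PFN 8, and merging the pair gives an order-3 block starting at PFN 8.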

Diagram of the buddy allocator: free runs grouped by order (order n = 2ⁿ pages), the SPLIT path borrowing an order-3 block and cutting it into two buddies, and the MERGE path gluing two free buddies back into one larger run

Splitting and gluing runs keeps fragmentation in check, but there’s still one bottleneck left on the fast path: the lock itself.

Per-CPU Caches: Skipping the Lock

The buddy allocator protects its free lists with a zone lock — think of it as needing the head librarian’s key every time you touch the shelf registry. Taking that lock on every single allocation would be a bottleneck in a system with hundreds of CPUs. So each CPU has a small Per-CPU Page cache (PCP) — a personal cache of pre-allocated pages (mostly single pages, plus a few small runs) that the CPU can give out and take back without bothering the head librarian for the common case. The PCP is defined in include/linux/mmzone.h and refills from the buddy allocator in batches when it runs low.

The buddy allocator deals in whole pages, but most of what the kernel needs is far smaller — so the next layer up is built to carve those pages into something finer-grained.

Kernel Allocators — Specialized Drawers

The structures the kernel needs room for are its own small internal records — the bookkeeping it uses to run the system, which lives entirely inside the kernel and never in user space. Handing over an entire 4 KB page for a 64-byte structure is like dedicating a whole shelf to a single sticky note, wasting 4,032 bytes. The answer is a slab allocator: a slab is a page (or a few pages) grabbed once from the buddy allocator and dedicated to objects of one single size, lined up side by side like books of equal height on a shelf. A request for a 64-byte object simply hands back the next free spot instead of burning a whole page on it — one shelf, many identical books.
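Here is a toy version of a single slab, to show the “one shelf, many identical books” idea in code. It is a user-space sketch with invented names: malloc() stands in for a page from the buddy allocator, and the free list is threaded through the free objects themselves (each free slot stores a pointer to the next one), which is a common slab technique. Object sizes are assumed to be a multiple of the pointer size.

```c
#include <stddef.h>
#include <stdlib.h>

#define TOY_SLAB_BYTES 4096

struct toy_slab {
    unsigned char *mem;   /* the backing "page" */
    void *free_list;      /* first free object, or NULL when the slab is full */
    size_t obj_size;
    size_t in_use;
};

/* Carve one page into equal slots and chain them into a free list. */
static int toy_slab_init(struct toy_slab *s, size_t obj_size)
{
    if (obj_size < sizeof(void *) || obj_size % sizeof(void *) != 0 ||
        obj_size > TOY_SLAB_BYTES)
        return -1;
    s->mem = malloc(TOY_SLAB_BYTES);   /* stand-in for a buddy-allocator page */
    if (!s->mem)
        return -1;
    s->obj_size = obj_size;
    s->in_use = 0;
    s->free_list = NULL;
    /* Push slots in reverse so the first allocation returns slot 0. */
    for (size_t i = TOY_SLAB_BYTES / obj_size; i-- > 0; ) {
        void *obj = s->mem + i * obj_size;
        *(void **)obj = s->free_list;
        s->free_list = obj;
    }
    return 0;
}

/* Pop the next free slot; NULL means a real allocator would fetch a new slab. */
static void *toy_slab_alloc(struct toy_slab *s)
{
    void *obj = s->free_list;
    if (!obj)
        return NULL;
    s->free_list = *(void **)obj;
    s->in_use++;
    return obj;
}

/* Push a slot back onto the free list. */
static void toy_slab_free(struct toy_slab *s, void *obj)
{
    *(void **)obj = s->free_list;
    s->free_list = obj;
    s->in_use--;
}
```

With 64-byte objects one page yields 64 slots, so the 4,032 wasted bytes from the sticky-note example become room for 63 more objects.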

SLUB: The Default Kernel Object Allocator

Linux uses SLUB (implemented in mm/slub.c) as its default slab allocator. To organize all this, SLUB groups slabs into caches: a cache is the whole set of slabs dedicated to one kind of book — objects of a single specific size, like a task_struct or an inode. So there’s a cache for task_structs, another for inodes, and so on for every kind of structure; each one manages its own collection of slabs.

When you call kmalloc(size, GFP_KERNEL), SLUB rounds the request up to one of a handful of size classes (8, 16, 32, 64, 96, 128, … bytes) and goes to the cache for that class. And here’s the key detail: each cache keeps a small per-CPU stash of free objects. So when your code asks for a 64-byte object, the CPU it’s running on can usually take the next one straight from its own stash without touching a shared global list. Only when that stash runs empty does SLUB take the slow path, refilling it from the cache’s slabs — and grabbing a fresh slab (a new page with room for more objects) from the buddy allocator when those run out too. (In kernel code this per-CPU layer is called sheaves, backed by a per-NUMA-node structure called a barn that shuffles objects between CPUs as some run low and others pile up. Don’t worry about the names; the point is the idea: each CPU pulls from its own stash, so in the common case nobody has to compete for a shared global resource.)
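The rounding step can be sketched as a simple table lookup. The first six classes below are the ones listed above; the rest of the table (192, then powers of two up to 8192) is an assumption made to complete the example, and the function name is invented. The real set of caches depends on the kernel version and configuration.

```c
#include <stddef.h>

/* Returns the size class a request would land in, or 0 if it is too large
 * for this sketch. Classes beyond 128 are assumed for illustration. */
static size_t toy_kmalloc_class(size_t size)
{
    static const size_t classes[] = {
        8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
    };
    for (size_t i = 0; i < sizeof classes / sizeof classes[0]; i++) {
        if (size <= classes[i])
            return classes[i];
    }
    return 0;
}
```

A 65-byte request lands in the 96-byte class, so 31 bytes of that object go unused; that internal waste is the price of keeping every slot in a slab the same size.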

If this sounds familiar, it’s because it’s similar to the trick the Go runtime’s allocator uses. Go gives every P (its scheduling processor) a private mcache holding one mspan per size class, each span carved into fixed-size slots, so a goroutine can grab small objects without ever touching a global lock. Same idea, different runtime; I went through it in detail in The Memory Allocator.

SLUB tackles objects smaller than a page. The opposite problem — needing a buffer bigger than any contiguous run the buddy allocator can spare — calls for a different trick.

vmalloc: When Physical Contiguity Isn’t Needed

The buddy allocator hands out physically contiguous runs of pages. For large allocations this gets harder and harder as memory fragments over time. But sometimes you just need a big buffer and don’t care whether the backing pages are next to each other in physical memory.

That’s what vmalloc() is for (mm/vmalloc.c). It grabs individual pages from the buddy allocator wherever they happen to be free, then sets up a run of consecutive page-table entries that point at those scattered pages. The pages are spread all over the building, but to the code using the buffer it looks like one unbroken range.

The downside: if what you really need, for hardware reasons, is a single physically contiguous block of memory (a DMA buffer, say), this won’t do.

Buddy, SLUB and vmalloc are all about the kernel feeding its own appetite for memory. A user-space process gets its memory through a different layer entirely — one built on top of everything we’ve seen.

Virtual Memory for Processes — The Address Space and Its Regions

Here’s the key idea. The page tables we saw earlier only record what’s actually mapped at this instant. But a process’s address space is mostly promises: huge ranges of addresses the process is allowed to use, but that aren’t backed by any physical memory yet. The page tables can’t represent those promises, so the kernel keeps a separate, higher-level picture of what each process’s memory is supposed to look like.

That picture starts with one record per process, its mm_struct — the single master description of that process’s entire view of memory. It points at the process’s page tables and tracks the broad layout, but mostly it’s a container for the regions the address space is divided into.

Those regions are where the real detail lives. Each one is a VMA (Virtual Memory Area) (struct vm_area_struct): a contiguous range of addresses that all share the same rules. For our purposes, the important pieces of a VMA are the range of addresses it covers, the permissions on that range (read, write, execute, and whether it’s shared), and what’s behind it: a specific file it was mapped from, or nothing at all, in which case it’s anonymous memory like the heap or the stack. A typical process is just a handful of these stitched together — one for its code, one for its data, one for the heap, one for the stack, and more for each shared library or mapped file. You can list them for any running process with /proc/<pid>/maps.

But a VMA is still just a promise — a labeled range with rules and nothing behind it. The moment a process actually reaches into one of those addresses, that promise has to be turned into real memory, and that’s the job of the page fault.

Page Faults — The Librarian Who Fetches on Demand

When you call malloc() and the C library asks the kernel for memory, the kernel usually hands back a virtual address range without allocating any physical pages yet. The address range exists, but no page is shelved and nothing in the page tables points anywhere yet.

The first time your code actually reads or writes that memory, the CPU tries to translate the address, finds nothing in the page tables for it, and raises a #PF (page fault) exception — the reader has opened an entry with no book behind it. This is the bell that summons the librarian: the CPU jumps into the kernel’s fault handler (handle_page_fault()).

The handler’s job is to figure out what should be at that address and make it so. First it finds the VMA covering the faulting address. If there isn’t one, or the access breaks the VMA’s permissions (writing to read-only memory, say), that’s a genuine bug and the process gets a SIGSEGV. Otherwise the kernel looks at what’s missing and reacts accordingly: if nothing was ever mapped there, it provides a fresh blank page or, for a file-backed region, reads the right chunk of the file in. (There’s one neat shortcut: when the very first access to an anonymous page is a read, a page that should just be zeros doesn’t need its own allocation at all. The kernel points it at a single shared, read-only zero page, and only allocates a real page once you actually write.)
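The decision logic for a fault on a not-yet-mapped address can be condensed into a few lines. This is a deliberately simplified model with invented types and names (toy_vma, toy_handle_fault); the real handler deals with many more cases, such as swap entries and copy-on-write, and looks VMAs up in a tree rather than scanning an array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum toy_prot { TOY_READ = 1, TOY_WRITE = 2, TOY_EXEC = 4 };

/* A region of address space: [start, end) with uniform rules. */
struct toy_vma {
    uint64_t start, end;
    unsigned prot;        /* bitmask of toy_prot */
    bool file_backed;     /* false means anonymous memory */
};

enum toy_fault {
    TOY_SEGV,             /* no VMA, or access violates its permissions */
    TOY_ZERO_PAGE,        /* anonymous read: map the shared zero page */
    TOY_NEW_ANON_PAGE,    /* anonymous write: allocate a fresh zeroed page */
    TOY_READ_FILE         /* file-backed: bring the page in from the file */
};

/* Linear search stands in for the kernel's VMA lookup. */
static const struct toy_vma *toy_find_vma(const struct toy_vma *vmas, size_t n,
                                          uint64_t addr)
{
    for (size_t i = 0; i < n; i++) {
        if (addr >= vmas[i].start && addr < vmas[i].end)
            return &vmas[i];
    }
    return NULL;
}

static enum toy_fault toy_handle_fault(const struct toy_vma *vmas, size_t n,
                                       uint64_t addr, bool is_write)
{
    const struct toy_vma *vma = toy_find_vma(vmas, n, addr);
    if (!vma)
        return TOY_SEGV;
    if (is_write && !(vma->prot & TOY_WRITE))
        return TOY_SEGV;
    if (!is_write && !(vma->prot & TOY_READ))
        return TOY_SEGV;
    if (vma->file_backed)
        return TOY_READ_FILE;
    return is_write ? TOY_NEW_ANON_PAGE : TOY_ZERO_PAGE;
}
```

The order of the checks mirrors the prose: find the region, check the rules, and only then decide where the page comes from.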

This lazy approach — demand paging — is why a process that allocates 1 GB but only uses 10 MB doesn’t consume 1 GB of RAM. The kernel only builds what the process actually touches.
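Demand paging can be observed from user space. The sketch below reserves a large anonymous range with mmap(), writes to only a few pages, and then asks the kernel through mincore() how many pages of the range are resident. The helper names are invented, and the exact count depends on the system: with transparent huge pages enabled, touching one 4 KB page may make a much larger chunk resident, so the only safe expectation is “at least the pages touched, and far fewer than the range”.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for mincore() and MAP_ANONYMOUS under strict -std */
#endif
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count the pages of [addr, addr+len) currently resident in RAM; -1 on error. */
static long resident_pages(void *addr, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t n = (len + page - 1) / page;
    unsigned char *vec = malloc(n);
    long count = -1;

    if (!vec)
        return -1;
    if (mincore(addr, len, vec) == 0) {
        count = 0;
        for (size_t i = 0; i < n; i++)
            count += vec[i] & 1;        /* bit 0 set: page is resident */
    }
    free(vec);
    return count;
}

/* Reserve `len` bytes, write to the first `touch` pages, report residency. */
static long touch_and_count(size_t len, size_t touch)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    for (size_t i = 0; i < touch; i++)
        p[i * page] = 1;                /* each write faults one page in */
    long resident = resident_pages(p, len);
    munmap(p, len);
    return resident;
}
```

Calling touch_and_count(64 MB, 10) reserves 16,384 pages of address space on a 4 KB system but should report only a small resident count.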

Demand paging keeps a fresh allocation cheap. A closely related trick keeps copying an entire address space cheap too.

Copy-on-Write: Photocopy When You Annotate

When a process calls fork(), the child gets its own address space that starts as an exact copy of the parent’s. Actually copying all that memory would be slow and wasteful. It’d be like photocopying every book a reader owns just so a second reader can have a set they’ll probably never write in.

So the kernel cheats. With Copy-on-Write (CoW), parent and child simply share the same physical pages, but every writable private page is marked read-only in both. As long as both only ever read, nothing else has to happen — they read the same books side by side.

The trick springs the moment either one tries to write. The CPU faults on the read-only page, and only then does the kernel make a private copy: it grabs a fresh page, copies the contents across, and repoints that process’s page-table entry at the new, writable version. The other process keeps using the original, undisturbed. Each side ends up with its own copy of only the pages it actually changed.

Demand paging and copy-on-write are both about filling in a region that already exists. But all along we’ve taken those regions for granted — so let’s back up and see how one gets created in the first place.

mmap — Claiming a Range of Addresses

So where do VMAs actually come from? Most of them come from mmap(). You can think of mmap() as the call that says “give me a region of address space, and remember what it’s for” — and what it produces, under the hood, is a new VMA. The work happens in do_mmap().

The kernel doesn’t have to do much here. It rounds your requested size up to a whole number of pages, turns the protections you asked for (PROT_READ, PROT_WRITE, PROT_EXEC) into the new region’s permissions, and finds a free gap in the address space big enough to hold it. Then it creates the VMA describing that range and adds it to the process’s set of regions. If you’re mapping a file, it also notes which file backs the region, so that a later page fault knows where to fetch the contents from.
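The rounding step is a one-liner worth seeing once. This sketch assumes 4 KB pages and uses an invented function name; the kernel has its own macro for the same job.

```c
#include <stddef.h>

#define TOY_PAGE_BYTES 4096UL

/* Round a length up to the next multiple of the page size. */
static size_t toy_page_align(size_t len)
{
    return (len + TOY_PAGE_BYTES - 1) & ~(TOY_PAGE_BYTES - 1);
}
```

So a request for a single byte still reserves a full 4,096-byte page, and a request for 4,097 bytes reserves two.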

And for an ordinary lazy mapping, that’s it — mmap() returns. Notice what didn’t happen: no physical memory was handed out and no page tables were filled in. All mmap() did was reserve a range of addresses and write down the rules for it. The actual pages show up later, one page fault at a time, the first time you touch them. Some mappings ask the kernel to do more up front — MAP_POPULATE, locked mappings, huge pages, and some device mappings are examples — but lazy mappings are the normal case to keep in mind.

This is also, indirectly, how malloc() works. When the C library needs more room, it asks the kernel for raw address space — either with brk(), which simply moves the top of the heap, or with an anonymous mmap() — and then hands out small pieces of that range to your program on its own. From the kernel’s point of view there’s no such thing as malloc: there’s just a request for address space, followed by page faults as the program starts using it.

Every mechanism so far has been about handing memory out — reserving ranges, faulting pages in, sharing them between processes. Eventually, though, the machine runs out of room, and the kernel has to start taking pages back.

Memory Pressure: When the Shelves Fill Up

The shelves are finite, so sooner or later they fill up. When that happens, the library does what a real library does with books nobody has opened in years: it boxes them up and ships them to an off-site annex, freeing the shelf for something people actually want. The annex is swap — disk space the kernel keeps around as overflow — and the worker running the operation is kswapd, a kernel thread that wakes up whenever free memory runs low and quietly reclaims pages until there’s breathing room again.

So how does kswapd decide which books to box up? It goes after the coldest ones — pages nobody has touched in a while. To keep track of which those are, the kernel files pages on LRU (Least Recently Used) lists: an active list for pages in regular use, and an inactive list for pages that have gone cold. A page that stops being touched slowly drifts from active to inactive, and the inactive list is exactly where kswapd goes shopping for things to evict. (It keeps separate lists for anonymous memory and for file-backed page-cache pages, since the two are reclaimed in different ways. The kernel can also be built to use Multi-Gen LRU, which sorts pages into several finer-grained generations by age, but the goal is the same: find the coldest pages.)

Once kswapd picks a cold page, what happens next depends on what kind of page it is:

A clean file-backed page — just drop it. The original is still sitting on disk, so it can be read back later if anyone needs it.
A dirty file-backed page — write the changes back to disk first, then drop it.
An anonymous page (heap or stack — nothing on disk to fall back on) — ship it off to the annex: write it out to swap, leave a note in the page-table entry saying “this one’s at the annex,” and free the page.
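The three cases above boil down to a small decision table. A minimal sketch, with an invented enum and function name, and ignoring everything else real reclaim has to consider (locked pages, pages under writeback, missing swap space):

```c
#include <stdbool.h>

enum toy_reclaim {
    TOY_DROP,                  /* clean file page: the copy on disk is enough */
    TOY_WRITEBACK_THEN_DROP,   /* dirty file page: save changes first */
    TOY_SWAP_OUT               /* anonymous page: only swap can hold it */
};

/* Pick the reclaim action for a cold page from the two facts that matter here. */
static enum toy_reclaim toy_reclaim_action(bool file_backed, bool dirty)
{
    if (!file_backed)
        return TOY_SWAP_OUT;
    return dirty ? TOY_WRITEBACK_THEN_DROP : TOY_DROP;
}
```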

Later, if a process reaches for a page that got shipped off to swap, it simply faults. The librarian spots the “at the annex” note, do_swap_page() fetches the page back from disk, and the program carries on none the wiser — just a little slower for that one access.

And if even swap fills up and there’s truly nothing left to reclaim? The library’s last resort is to throw a reader out completely. That’s the OOM (Out-Of-Memory) killer (mm/oom_kill.c): it looks over the killable processes, scores them mostly by memory footprint while respecting policy knobs like oom_score_adj, and kills the worst candidate. It’s brutal, but losing one process beats the whole machine locking up.

That’s the kernel reclaiming memory under duress. Most of the time, though, memory is handed back far more peacefully — by the programs that asked for it in the first place.

Freeing Memory: Returning Pages to the Pool

Here’s the surprise: calling free() usually doesn’t return anything to the kernel — the C library just keeps it in its own free-list to reuse for your next malloc().

Pages only really go back when a whole region is torn down (a munmap(), or the process exiting). Even then, a page is freed only once the last user lets go — its reference count hits zero — because copy-on-write and shared mappings mean several processes can point at the same page. When that count reaches zero, the page returns to the buddy allocator, ready to be handed out again.

That’s every piece of the machine on its own. The best way to make them stick is to watch them work together — so let’s trace, step by step, what happens behind a few ordinary lines of C.

End-to-End: From malloc() to First Write

We’ll start with the simplest thing of all: allocate some memory and write to it.

char *buf = malloc(4096);
buf[0] = 'A';

malloc(4096). Remember that malloc lives in the C library, not the kernel. If the C library already has spare space in its own heap, it may not call the kernel at all. But when it does need more room, it asks the kernel for raw address space (with brk() or mmap()), and the normal response is cheap: the kernel extends or adds an anonymous VMA — a promise covering that range of addresses. No physical memory is touched yet. So when malloc hands back buf, the address is valid but may have nothing behind it: no page-table entry, no physical page.

buf[0] = 'A'. Now you write, and that’s when the promise gets cashed in. The CPU asks the MMU to translate the address, the MMU walks the page tables, finds no entry there — and raises a page fault. That rings the bell for the librarian: the fault handler finds the VMA covering the address, confirms it’s writable, and sees this is a fresh anonymous page that needs filling in. So it grabs a free page from the buddy allocator, often through the per-CPU cache, wipes it clean, and installs a page-table entry pointing at it.

The faulting instruction restarts, and this time the translation succeeds: 'A' lands on the shelf. Later accesses to that page no longer fault; the translation is normally served straight from the TLB, and at worst costs one more page-table walk if the entry has been evicted. That’s demand paging: the page was just a promise until the exact moment you touched it.

Allocating and writing is the simple case. It gets more interesting when a process forks and parent and child end up sharing memory.

End-to-End: fork() and Copy-on-Write

pid_t pid = fork();
if (pid == 0) {
    buf[0] = 'B'; // child writes
}

When fork() is called, the child gets a copy of the parent’s address space — but, as we saw, the kernel doesn’t actually copy the memory. Instead it points the child’s page tables at the same physical pages the parent is using, and marks every writable private page read-only in both. As long as both processes only read, they share the same books happily, with no faults at all.

The moment the child writes, the trap springs. buf[0] = 'B' lands on a read-only page, so the CPU raises a protection fault. The librarian recognizes this as copy-on-write: it grabs a fresh page from the buddy allocator, photocopies the parent’s page into it (still holding 'A'), repoints the child’s page-table entry at the copy, and makes it writable again. It also drops the child’s reference on the original page, which stays alive because the parent still holds its own.

Now each side has its own page — the parent’s untouched with 'A', the child’s a private copy with 'B'. Only the single page that was actually written ever got duplicated.
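A program can’t see the copy-on-write fault itself, but it can see the result. The sketch below fleshes the fragment out into a self-contained function (cow_demo is an invented name): the parent writes 'A', forks, the child writes 'B' and reports what it sees through its exit status, and the parent then checks that its own copy is untouched.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the parent's view of buf[0] after the child wrote to its copy,
 * or -1 on error. *child_saw receives the byte the child read back. */
static int cow_demo(int *child_saw)
{
    char *buf = malloc(4096);
    if (!buf)
        return -1;
    buf[0] = 'A';                    /* first write: faults the page in */

    pid_t pid = fork();              /* page is now shared, read-only */
    if (pid < 0) {
        free(buf);
        return -1;
    }
    if (pid == 0) {
        buf[0] = 'B';                /* child write: copy-on-write fault */
        _exit(buf[0]);               /* report the child's view */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) {
        free(buf);
        return -1;
    }
    *child_saw = WEXITSTATUS(status);

    int parent_sees = buf[0];        /* the parent's page was never changed */
    free(buf);
    return parent_sees;
}
```

The function returns 'A' while *child_saw holds 'B': two processes, the same virtual address, two different physical pages.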

Anonymous memory is only half the story. Mapping a file into the address space runs the same machinery with one twist: where the page comes from.

End-to-End: mmap() a File

int fd = open("data.bin", O_RDONLY);
char *p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
char c = p[0]; // first read

With an ordinary lazy file mapping, mmap() does almost nothing up front, just like malloc. The kernel creates a VMA for the range — but this time it labels it with which file backs it: “the books here come from data.bin.” The address p comes back right away, with no I/O and no page allocated yet.

The first read, p[0], faults — the same bell as always. The only difference is what the librarian does about it. Instead of handing back a blank page, it sees the VMA is file-backed and fetches the right page from the file: it reads it from the page cache (the kernel’s in-memory copy of recently-used file data), or from disk if it isn’t cached yet. That page gets wired into a page-table entry, and the read completes.

What happens on a write depends on how you mapped the file. (The mapping above is PROT_READ only, so a write through it would simply end in a SIGSEGV; assume for a moment it had been created with PROT_WRITE as well.) With MAP_PRIVATE the page starts out protected from direct modification, so writing to it triggers a copy-on-write fault: the page is photocopied to a private page and your change goes to the copy, leaving the file untouched. With MAP_SHARED, which for a writable mapping also requires the file to have been opened for writing, writes land directly in the page cache, and the kernel writes those dirty pages back to the file later on.
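The MAP_PRIVATE behaviour is easy to check from user space. The sketch below (private_write_demo is an invented name, and the temporary file path is arbitrary) creates a one-page file filled with 'A', maps it privately with write permission, overwrites the first byte through the mapping, and then reads the file again with pread() to see what it actually contains.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for mkstemp() and pread() under strict -std */
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns the first byte of the file after writing 'B' through a
 * MAP_PRIVATE mapping, or -1 on error. */
static int private_write_demo(void)
{
    char path[] = "/tmp/mmap_demo_XXXXXX";   /* arbitrary temp-file template */
    char page[4096];
    char on_disk = 0;
    int result = -1;

    int fd = mkstemp(path);                  /* opened read-write */
    if (fd < 0)
        return -1;
    unlink(path);                            /* file vanishes when fd closes */

    memset(page, 'A', sizeof page);
    if (write(fd, page, sizeof page) == (ssize_t)sizeof page) {
        char *p = mmap(NULL, sizeof page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            char first = p[0];               /* read fault: page from the file */
            p[0] = 'B';                      /* write: copy-on-write fault */
            if (first == 'A' && p[0] == 'B' &&
                pread(fd, &on_disk, 1, 0) == 1)
                result = on_disk;            /* what the file really holds */
            munmap(p, sizeof page);
        }
    }
    close(fd);
    return result;
}
```

The function returns 'A': the mapping shows 'B', but the change lives only in the process’s private copy and the file never sees it. Swapping MAP_PRIVATE for MAP_SHARED in the same code would make the write visible through the file.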

With those three traces, we’ve watched every piece of the machine work together. Here’s the whole picture in one place.

Summary

Let’s put the library back together. Every address a program uses is virtual, and the MMU translates it on every access by walking the process’s page tables (up to five levels, rooted at CR3), with hot translations cached in the TLB. Underneath, each 4 KB page has a struct page record; pages are sorted into zones and NUMA nodes, and the buddy allocator hands them out in power-of-two runs, fronted by a per-CPU cache that avoids the zone lock in the common case. On top of it, the kernel’s own allocators carve pages up further — SLUB lines a page with same-sized objects (one shelf, many identical books) for its tiny internal structures, while vmalloc stitches scattered pages into one contiguous virtual range.

User space gets its own view: an mm_struct with a tree of VMAs, each a labeled range of addresses. VMAs are just promises, though — real pages appear only when the program touches an address and trips a page fault, where the librarian provides a blank page, reads one from a file, or photocopies a shared one (copy-on-write). This demand paging is what lets a process reserve gigabytes while consuming only what it touches. Memory flows back the same way: under pressure kswapd ages pages through LRU lists and reclaims the coldest ones, dropping file-cache pages or pushing anonymous pages to swap, with the OOM killer as last resort; otherwise a page returns to the buddy allocator once its region is gone and its last reference drops.

That’s the memory manager: one mechanism giving every process its own private, lazily-filled view of a finite pile of physical RAM. The next article turns to the part of the kernel that decides who gets to run, and when — the scheduler.