Understanding the Linux Kernel: System Calls


In the previous article we followed the kernel from the very first instruction the bootloader handed us all the way to the moment kernel_init called execve() on /sbin/init. That was a long ride, but it ended with a quiet handover: the kernel stepped aside, userspace took the wheel, and /sbin/init started spawning the rest of the services.

Here’s the thing though. Those processes that just started don’t actually have keys to anything. They can’t touch the disk. They can’t talk to the network card. They can’t even draw a pixel on the screen. Every piece of hardware in the machine is still owned by the kernel, and the CPU itself enforces this: user programs run in a restricted execution mode (ring 3 — the unprivileged mode where hardware blocks direct access to kernel memory or devices). Userspace is sandboxed by hardware, on purpose.

So how does bash print a prompt? How does cat read a file? How does curl send a packet? They have to ask the kernel to do it for them. And the mechanism for that ask is what this article is about: the system call.

One note before we dive in: the implementation details here are specific to x86_64. The concepts — privilege levels, the boundary crossing, the dispatch table — apply to every architecture Linux runs on, but the instructions, register names, and file paths will differ on ARM, RISC-V, and others.

What You See From the Outside: strace

Before we dive into how the machinery works, let me show you what it looks like from above. There’s a tool called strace that intercepts every syscall a process makes and prints it. Let’s run it on something embarrassingly simple:

$ strace -e trace=openat,read,write,close cat /etc/hostname
# ... (startup, libc loading, locale lookups omitted) ...
openat(AT_FDCWD, "/etc/hostname", O_RDONLY) = 3
read(3, "myhost\n", 262144)             = 7
write(1, "myhost\n", 7)                 = 7
read(3, "", 262144)                     = 0
close(3)                                = 0

The actual output is longer — cat also has to load libc and look up locales before doing any real work. But this is the part that matters: open the file, read its bytes, write them to stdout, read again to confirm EOF, close. Our app interacts with the outside world through syscalls.
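The same open, read, write, close sequence can be reproduced from Go with the thin wrappers in the standard syscall package. This is a minimal sketch, assuming Linux; catFile is a hypothetical helper name, and a short write is not retried. On Linux, syscall.Open is implemented on top of openat, so the calls line up with the trace above.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// catFile copies the file at path to the descriptor out, one syscall
// wrapper per step, and returns the number of bytes copied.
func catFile(path string, out int) (int, error) {
	fd, err := syscall.Open(path, syscall.O_RDONLY, 0) // openat(AT_FDCWD, path, O_RDONLY)
	if err != nil {
		return 0, err
	}
	defer syscall.Close(fd) // close(fd)

	buf := make([]byte, 262144) // same buffer size as in the trace
	total := 0
	for {
		n, err := syscall.Read(fd, buf) // read(fd, buf, 262144)
		if err != nil {
			return total, err
		}
		if n == 0 { // a zero-byte read signals end of file
			return total, nil
		}
		// write(out, buf, n); this sketch does not retry short writes.
		if _, err := syscall.Write(out, buf[:n]); err != nil {
			return total, err
		}
		total += n
	}
}

func main() {
	n, err := catFile("/etc/hostname", 1) // descriptor 1 is stdout
	if err != nil {
		fmt.Fprintln(os.Stderr, "cat:", err)
		os.Exit(1)
	}
	fmt.Fprintf(os.Stderr, "copied %d bytes\n", n)
}
```

Running the compiled program under strace with the same filter should show the same pattern of calls, preceded by the Go runtime's own startup syscalls.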

Linux exposes hundreds of numbered native x86_64 syscalls, all listed in arch/x86/entry/syscalls/syscall_64.tbl. Each entry has a number, an ABI (64 for regular 64-bit processes, x32 for the x32 ABI, or common for both), a name, and the kernel function that handles it:

# number  ABI     name    kernel function
0         common  read    sys_read
1         common  write   sys_write
2         common  open    sys_open
9         common  mmap    sys_mmap

So the question is: what actually happens between cat calling read(3, buf, 262144) and the kernel handing back seven bytes? Let’s take it apart.

A System Call Is Not a Function Call

This is the most important thing to get straight up front. When you call printf() from C, the CPU just jumps to another address in your own process — same memory, same privileges, same stack. A system call is not that. It’s a privilege-level transition.

The CPU has two ring levels we care about: ring 3 (user) and ring 0 (kernel). In ring 3 you can’t touch kernel memory, you can’t execute privileged instructions, you can’t talk to I/O ports. The kernel runs in ring 0, where none of those restrictions apply. A syscall is the controlled, hardware-mediated way to cross that boundary — a dedicated CPU instruction that moves execution from ring 3 to ring 0, runs some kernel code, and comes back.

Diagram comparing a regular function call to a system call, showing the ring boundary crossing

Let’s see how that works in practice, starting from the application side.

The Userspace Side

We saw from strace that cat calls read(), write(), and friends. But what does that look like in actual code? Let’s look at the bottom of that stack — the assembly stub where the actual boundary crossing happens. We’ll use Go as a concrete example — but C, Rust, Python all end up in the same place.

No matter the language, the contract with the CPU is the same: put the syscall number in RAX and up to six arguments in RDI, RSI, RDX, R10, R8, R9, then fire the SYSCALL instruction. Here’s how Go gets there.

In Go, functions like os.File.Write or os.File.Read go through a few internal layers and eventually call Syscall6 (src/internal/runtime/syscall/linux/asm_linux_amd64.s), which is where the actual SYSCALL instruction lives:

TEXT ·Syscall6<ABIInternal>(SB),NOSPLIT,$0
    // a6 already in R9.
    // a5 already in R8.
    MOVQ    SI, R10 // a4
    MOVQ    DI, DX  // a3
    MOVQ    CX, SI  // a2
    MOVQ    BX, DI  // a1
    // num already in AX.
    SYSCALL
    CMPQ    AX, $0xfffffffffffff001
    JLS     ok
    NEGQ    AX
    MOVQ    AX, CX  // errno
    MOVQ    $-1, AX // r1
    MOVQ    $0, BX  // r2
    RET
ok:
    MOVQ    DX, BX  // r2
    MOVQ    $0, CX  // errno
    RET

What’s happening here is straightforward: the first MOVQ instructions move the arguments from the registers Go’s calling convention uses into the registers the kernel expects, then SYSCALL fires. After it returns, AX holds either the result or an error. Linux signals failure by returning a negated errno, a value between -4095 and -1, which as an unsigned 64-bit number falls in the 0xfffffffffffff001..0xffffffffffffffff range. The code checks for that range and returns either the success value or the positive errno accordingly.
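The error check at the end of the stub can be restated in plain Go. This is a sketch of the convention only, not the runtime's code; decode and maxErrno are names chosen here for illustration.

```go
package main

import "fmt"

// maxErrno reflects the convention that error returns lie in [-4095, -1].
const maxErrno = 4095

// decode splits the raw value left in AX into (result, errno).
// A non-zero errno means the call failed and result is -1.
func decode(raw uint64) (result int64, errno uint64) {
	// ^uint64(0)-maxErrno is 0xfffffffffffff000, so this matches
	// 0xfffffffffffff001..0xffffffffffffffff.
	if raw > ^uint64(0)-maxErrno {
		return -1, -raw // two's-complement negation recovers the positive errno
	}
	return int64(raw), 0
}

func main() {
	fmt.Println(decode(7))                  // 7 0
	fmt.Println(decode(0xfffffffffffffffe)) // -1 2 (errno 2 is ENOENT)
}
```

Folding errors into the top of the unsigned range lets a single register carry both outcomes, which is why no second return register is needed for the error.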

Now let’s look at what the CPU actually does the moment that instruction fires.

The Hardware Instruction: SYSCALL

We just watched Go shuffle registers and fire SYSCALL. The CPU takes it from there — and it does surprisingly little. Its job is just to make the jump safely; the kernel does the heavy lifting afterward. Here’s what the hardware does on its own:

  • Saves where to return: the address of the next instruction (where userspace should resume after the syscall) is copied into RCX. This is why the fourth syscall argument travels in R10 rather than RCX.
  • Saves the CPU flags: RFLAGS, the register that tracks things like “was the last comparison equal?”, is copied into R11.
  • Disables interrupts: the flag bits listed in a mask register (MSR_SYSCALL_MASK) are cleared, and Linux includes the interrupt flag in that mask, so no interrupt can arrive mid-transition.
  • Jumps to the kernel: loads the kernel entry point address from a special register (MSR_LSTAR) and starts running kernel code. That address was registered during boot, so every SYSCALL on every CPU core always lands in the same place: a hand-written assembly function called entry_SYSCALL_64.
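The four steps above can be sketched as a toy model in Go. Nothing here touches real hardware: the cpu struct, the addresses, and the mask value are illustrative assumptions. The registers SYSCALL uses for the saved instruction pointer and flags are RCX and R11.

```go
package main

import "fmt"

const flagIF = 1 << 9 // interrupt-enable bit in RFLAGS

// cpu is a toy model holding only the state the SYSCALL instruction touches.
type cpu struct {
	RIP, RCX, R11, RFLAGS uint64
	Ring                  int
	LSTAR, FMASK          uint64 // model-specific registers programmed at boot
}

// syscallInsn models SYSCALL; next is the address of the instruction
// that follows SYSCALL in the user program.
func (c *cpu) syscallInsn(next uint64) {
	c.RCX = next         // 1. save where to return
	c.R11 = c.RFLAGS     // 2. save the flags
	c.RFLAGS &^= c.FMASK // 3. clear the masked flags, the interrupt flag among them
	c.RIP = c.LSTAR      // 4. jump to the registered kernel entry point
	c.Ring = 0           // now running privileged
}

func main() {
	c := cpu{
		RFLAGS: 0x246,              // typical user flags with interrupts enabled
		Ring:   3,
		LSTAR:  0xffffffff81e00000, // hypothetical address of entry_SYSCALL_64
		FMASK:  flagIF,             // simplified: the real mask clears more bits
	}
	c.syscallInsn(0x401005) // hypothetical user return address
	fmt.Printf("ring=%d rip=%#x rcx=%#x r11=%#x rflags=%#x\n",
		c.Ring, c.RIP, c.RCX, c.R11, c.RFLAGS)
}
```

Note what the model does not do: it never changes a stack pointer. The hardware leaves the stack alone, which is why the kernel entry code has to deal with it first.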

So what does the kernel do the moment it takes control?

The Kernel Entry Path: entry_SYSCALL_64

The moment the CPU jumps to entry_SYSCALL_64, the kernel is in a delicate situation. We’ve just arrived from userspace: SYSCALL did not switch stacks, so we are still running on the user’s stack, still using the user’s page tables (on kernels with page-table isolation enabled), and every register potentially contains user-controlled data. So the very first thing entry_SYSCALL_64 does is switch to the kernel’s own stack and memory map, so the rest of the kernel code can execute safely without touching anything that belongs to the user process.

Next it saves all the CPU registers — every general-purpose register the user process had — into a structure called pt_regs on the kernel stack. This snapshot is how the kernel knows the syscall arguments, and it’s also what gets restored when we return.

Then come a handful of security mitigations. These exist because of real, published CPU vulnerabilities — Spectre, Retbleed, and a few others — and they run unconditionally on every single syscall on affected hardware.

Only after all of that does the assembly hand off to a C function called do_syscall_64, passing it the saved register state and the syscall number.

The C Dispatcher: do_syscall_64

Now we’re in C. On kernels with stack offset randomization enabled (CONFIG_RANDOMIZE_KSTACK_OFFSET), the first thing do_syscall_64 does is shift the kernel stack pointer by a small random number of bytes. It’s a cheap defensive trick: if an attacker ever manages to overflow a buffer on the kernel stack, the random offset makes it much harder to predict where anything useful lives.

Next, before the actual syscall runs, the kernel gives a few subsystems a chance to intercept it. First, if a debugger is attached via ptrace, it gets to inspect and even modify the call. Then seccomp runs — this is the sandboxing mechanism that containers use to block dangerous syscalls like mount(). If either of these decides the syscall shouldn’t proceed, execution stops here and control returns to userspace without ever running the real handler.

If the syscall is allowed through, the dispatcher looks up the right kernel function in a generated dispatch table and calls it, storing the return value back into the saved register state so it can be handed back to userspace on the way out.
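The filter-then-dispatch flow can be sketched in Go as a model. Everything here is illustrative: ptRegs, table, blocked, and doSyscall are hypothetical stand-ins, the handlers return made-up values, and a real seccomp filter can choose among several actions rather than always returning EPERM.

```go
package main

import "fmt"

const (
	eperm  = 1  // EPERM: operation not permitted
	enosys = 38 // ENOSYS: function not implemented
)

// ptRegs is a cut-down stand-in for the saved register snapshot.
// AX is signed here so negative error codes read naturally.
type ptRegs struct {
	AX         int64
	DI, SI, DX uint64
}

type handler func(regs *ptRegs) int64

// table plays the role of the generated dispatch table, keyed by syscall number.
var table = map[int]handler{
	0:  func(r *ptRegs) int64 { return int64(r.DX) }, // "read": pretend every requested byte was read
	39: func(r *ptRegs) int64 { return 4242 },        // "getpid": fixed, made-up PID
}

// blocked plays the role of a seccomp filter; 165 is mount on x86_64.
var blocked = map[int]bool{165: true}

// doSyscall filters the call, looks up the handler, and stores the result
// back into the saved registers.
func doSyscall(regs *ptRegs, nr int) {
	if blocked[nr] {
		regs.AX = -eperm // rejected before any handler runs
		return
	}
	h, ok := table[nr]
	if !ok {
		regs.AX = -enosys // unknown syscall number
		return
	}
	regs.AX = h(regs)
}

func main() {
	regs := &ptRegs{DI: 3, DX: 262144}
	for _, nr := range []int{0, 39, 165, 9999} {
		doSyscall(regs, nr)
		fmt.Printf("syscall %d -> %d\n", nr, regs.AX)
	}
}
```

The point of the shape is that the handler never sees a rejected call: the filter and the bounds check both sit in front of the lookup.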

But what does that handler actually look like, and how does it safely receive its arguments?

The Per-Syscall Stubs: Layers of Macros

Each syscall lives in the kernel subsystem that logically owns it — read in the filesystem code, socket in the network stack, fork in process management. The dispatcher doesn’t call them directly though. There are a few thin generated wrapper layers in between — built by the SYSCALL_DEFINE macros and the x86-specific syscall_wrapper.h — that take care of extracting the arguments from the saved register snapshot and cleaning them up before the real implementation runs.

Diagram showing the wrapper layers between the dispatcher and the real syscall implementation
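What the wrappers do with the register snapshot can be sketched in Go for read, whose first argument is a 32-bit unsigned int. This is a model under stated assumptions: ptRegs, readArgs, sysRead, and implRead are hypothetical names, and the implementation is a stub.

```go
package main

import "fmt"

// ptRegs holds the six argument registers from the saved snapshot.
type ptRegs struct {
	DI, SI, DX, R10, R8, R9 uint64
}

// readArgs pulls read's three arguments out of the snapshot and narrows
// each one to its declared type. Converting DI to uint32 discards any
// stray upper bits the caller left in the 64-bit register.
func readArgs(regs *ptRegs) (fd uint32, buf uint64, count uint64) {
	return uint32(regs.DI), regs.SI, regs.DX
}

// implRead is a stub for the real implementation: it pretends to read
// count bytes and ignores the buffer.
func implRead(fd uint32, buf uint64, count uint64) int64 {
	return int64(count)
}

// sysRead is the entry the dispatcher would call: it only sees registers.
func sysRead(regs *ptRegs) int64 {
	fd, buf, count := readArgs(regs)
	return implRead(fd, buf, count)
}

func main() {
	// Upper half of DI is deliberately garbage; only the low 32 bits matter.
	regs := &ptRegs{DI: 0xdeadbeef00000003, SI: 0x7ffd12340000, DX: 7}
	fd, _, count := readArgs(regs)
	fmt.Println(fd, count, sysRead(regs)) // 3 7 7
}
```

Keeping the narrowing in a generated layer means the implementation can be written with ordinary typed parameters and never has to think about register contents.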

Once inside the real implementation, the kernel does the work — reading from a file, creating a socket, whatever the syscall asks for.

At this point the kernel has the result. Now it has to hand both the result and control back to the application.

The Return Path

The handler’s return value (a file descriptor, a byte count, or a negative error code) is stored by the dispatcher in the saved register state, where it will ride back to userspace in AX.

Some syscalls also need to return data into a buffer the caller provided, and read is the obvious example. For those, the kernel can’t do a plain memory copy, because the pointer the user passed might be invalid, point to kernel memory, or simply be garbage. So it uses dedicated safe functions (copy_to_user for writing into user memory, copy_from_user for reading from it) that validate the address and handle any faults gracefully instead of crashing.
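The kind of checking involved can be sketched as a toy model in Go. The userMem type and the address limit are assumptions made for illustration. The real copy_to_user returns the number of bytes it could not copy, which callers typically turn into -EFAULT; this sketch folds both steps into one return value.

```go
package main

import "fmt"

const (
	efault      = 14                    // EFAULT: bad address
	userAddrMax = 0x0000_7fff_ffff_f000 // assumed top of the user address space
)

// userMem is a toy user address space: a single mapped region at base.
type userMem struct {
	base uint64
	data []byte
}

// copyToUser copies src to the user address dst and returns 0 on success
// or -EFAULT if the destination is not a valid, mapped user range.
func (m *userMem) copyToUser(dst uint64, src []byte) int64 {
	end := dst + uint64(len(src))
	if end < dst || end > userAddrMax {
		return -efault // range wraps around or reaches into kernel addresses
	}
	if dst < m.base || end > m.base+uint64(len(m.data)) {
		return -efault // unmapped: a real kernel discovers this via a page fault
	}
	copy(m.data[dst-m.base:], src)
	return 0
}

func main() {
	m := &userMem{base: 0x1000, data: make([]byte, 16)}
	fmt.Println(m.copyToUser(0x1000, []byte("myhost\n")))          // 0
	fmt.Println(m.copyToUser(0xffff_8000_0000_0000, []byte("x"))) // -14
	fmt.Println(m.copyToUser(0x9000, []byte("x")))                 // -14
}
```

In all three cases the caller gets an ordinary return value; a bad pointer becomes an error code for the process rather than a crash in the kernel.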

With that done, the kernel drains any pending work — signals, scheduling decisions — before heading back to userspace.

Once all of that is settled, the kernel needs to undo everything it did on entry: restore the saved registers, switch back to the user’s page tables, and swap back to the user’s stack.

Finally, control returns to userspace. There are two ways to do this: a fast path using the SYSRET instruction, which is the mirror of SYSCALL, and a slower fallback using IRET. The fast path only works when the saved CPU state is clean and normal — if anything looks unusual, the kernel takes the safe route. Either way, the CPU drops back to ring 3 and the application resumes right where it left off, with the syscall result sitting in the return register.
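What "clean and normal" can mean is easier to see as a predicate. The following Go sketch is a simplified model, not the kernel's code: the field names and the address limit are assumptions, and the real check also looks at the saved segment selectors, which are omitted here. The underlying idea is that SYSRET loads the instruction pointer from RCX and the flags from R11, so it can only be used when those still match the state the kernel wants to resume.

```go
package main

import "fmt"

const (
	flagTF      = 1 << 8                // trap flag, set while single-stepping
	flagRF      = 1 << 16               // resume flag
	userAddrMax = 0x0000_7fff_ffff_f000 // assumed top of the user address space
)

// ptRegs holds the part of the saved state that matters for this decision.
type ptRegs struct {
	IP, Flags uint64 // where and how userspace should resume
	CX, R11   uint64 // what SYSRET would load into RIP and RFLAGS
}

// canSysret reports whether the fast return path is usable in this model.
func canSysret(r *ptRegs) bool {
	if r.CX != r.IP || r.R11 != r.Flags {
		return false // something rewrote the resume state, for example a debugger
	}
	if r.IP >= userAddrMax {
		return false // resume address is outside the user range
	}
	if r.Flags&(flagTF|flagRF) != 0 {
		return false // flags that need the slower, more general return
	}
	return true
}

func main() {
	clean := &ptRegs{IP: 0x401005, Flags: 0x246, CX: 0x401005, R11: 0x246}
	fmt.Println(canSysret(clean)) // true

	moved := *clean
	moved.IP = 0x402000 // resume address changed while inside the kernel
	fmt.Println(canSysret(&moved)) // false
}
```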

Now that we’ve seen each piece, let’s watch them all work together in one concrete example.

End to End: One read() Call

Diagram of the full syscall flow, from the Go Gopher in userspace through the kernel and back

Let’s trace what happens when a cat-like program written in Go calls read to fetch the contents of /etc/hostname.

It starts in Go’s standard library. os.File.Read passes the call down through a few internal layers until it reaches Syscall6, which loads the syscall number and arguments into the right registers and fires the SYSCALL instruction. In this case: syscall number 0 (read) in AX, file descriptor 3 in DI, the buffer address in SI, and the max count 262144 in DX.
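That register setup can be reproduced by hand with syscall.Syscall, which takes the syscall number and three arguments in the same order. This is a minimal sketch, assuming Linux; rawRead and readViaPipe are hypothetical helpers, and a pipe stands in for the file so the example does not depend on /etc/hostname.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

// rawRead issues read(fd, buf, len(buf)) directly through syscall.Syscall.
func rawRead(fd uintptr, buf []byte) (int, error) {
	if len(buf) == 0 {
		return 0, nil
	}
	n, _, errno := syscall.Syscall(
		syscall.SYS_READ,                 // syscall number (0 on x86_64)
		fd,                               // first argument: file descriptor
		uintptr(unsafe.Pointer(&buf[0])), // second argument: buffer address
		uintptr(len(buf)),                // third argument: maximum byte count
	)
	if errno != 0 {
		return 0, errno
	}
	return int(n), nil
}

// readViaPipe writes msg into a pipe and reads it back with rawRead.
// msg must be non-empty, otherwise the read would block.
func readViaPipe(msg string) (string, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return "", err
	}
	defer r.Close()
	defer w.Close()

	if _, err := w.WriteString(msg); err != nil {
		return "", err
	}
	buf := make([]byte, 262144) // same size as the buffer in the trace
	n, err := rawRead(r.Fd(), buf)
	if err != nil {
		return "", err
	}
	return string(buf[:n]), nil
}

func main() {
	got, err := readViaPipe("myhost\n")
	if err != nil {
		fmt.Fprintln(os.Stderr, "read:", err)
		os.Exit(1)
	}
	fmt.Printf("read %d bytes: %q\n", len(got), got)
}
```

In everyday code os.File.Read is the right call; the direct form is shown only to make the number-plus-arguments contract visible.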

The CPU saves the return address, disables interrupts, and jumps to entry_SYSCALL_64. The kernel is now running, but in a fragile state — still on the user’s stack, still with the user’s memory map. The first thing it does is switch to its own stack and page tables, save all the registers, run the security mitigations, and then hand off to do_syscall_64.

The C dispatcher adds a random stack offset, checks whether ptrace or seccomp want to intercept this call, and then looks up the handler for syscall number 0 — read. The wrapper layers extract the arguments from the saved registers, zero out the unused ones, and call down into the actual filesystem code, which reads the 7 bytes of "myhost\n" from the file, copies them into the user’s buffer with copy_to_user, and sets AX to 7 — the number of bytes read.

With the result in hand, the kernel checks for any pending signals or scheduling decisions, restores the saved registers, switches back to the user’s page tables, and returns to userspace via SYSRET. The CPU drops back to ring 3, and Syscall6 picks up exactly where it left off — the return value in the register, ready to hand back to the caller as a Go (n int, err error).

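For cat /etc/hostname to print seven bytes, that whole journey runs at least five times: once for openat, twice for read (the second call returns 0 to signal EOF), and once each for write and close. And that is before counting the startup syscalls we filtered out of the trace.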

With the full picture in mind, let’s recap what we covered.

Summary

A system call is not a function call — it’s a hardware-mediated crossing from ring 3 into ring 0, and a lot happens along the way.

On the application side, a language like Go funnels all I/O through a small assembly stub that loads the syscall number and arguments into specific registers and fires the SYSCALL instruction. The CPU saves the return address, disables interrupts, and jumps to the kernel’s single entry point.

From there the kernel works its way through several layers: switching to its own stack and page tables, saving all registers, running security mitigations, checking whether ptrace or seccomp want to intercept the call, and then dispatching to the right handler via a generated dispatch table. The handler itself unpacks the arguments from the saved register snapshot, does the actual work, and uses dedicated primitives to safely move data across the user/kernel boundary.

On the way back out, the kernel drains any pending signals, decides whether to use the fast SYSRET or the safe IRET path, cleans up its microarchitectural footprint, and drops back to ring 3. The application resumes exactly where it left off, with the result in hand.

In the next article, we’ll look at another fundamental kernel responsibility: managing memory. How does the kernel track which process owns which chunk of RAM, what happens when a process asks for more memory, and how does it handle the fact that physical RAM is finite? Memory management is up next.