System Design Nuggets

Split Brain Problem in System Design: Quorum, Fencing Tokens, and Network Partitions

What is Split Brain in system design? Learn how network partitions create two primary databases and the strategies engineers use to prevent data corruption.

Arslan Ahmad
Jan 17, 2026

Distributed systems are built on a fundamental promise: reliability through redundancy.

To ensure an application remains online even when hardware fails, engineers duplicate data and services across multiple physical servers.

The logic is simple. If one server crashes, another takes its place, and the user experiences no interruption.

However, this reliance on redundancy introduces a specific, catastrophic failure mode that does not exist in single-server architectures.

This failure mode is not caused by a crash, but by a misunderstanding. It occurs when a system's safeguards turn against it, causing redundant components to fight for control rather than cooperate.

This phenomenon is known as Split Brain.

It is one of the most critical concepts in system design. It highlights the fragility of networks and the difficulty of maintaining a single source of truth in a distributed environment.

When a Split Brain occurs, the integrity of the data, the most valuable asset of any software system, is compromised.

For developers and architects, understanding the mechanics of this failure is not optional; it is a prerequisite for building scalable, resilient software.

The Standard Architecture: Primary and Replica

To understand how a Split Brain occurs, one must first understand the healthy state of a high-availability system.

Most data-intensive applications, from banking ledgers to social media feeds, use a Primary-Replica (or Leader-Follower) architecture.

In this configuration, the system assigns distinct roles to different nodes in the cluster:

1. The Primary Node (The Leader)

This node is the authoritative source of truth. It is the only server permitted to accept “write” operations (creating, updating, or deleting data).

When a client application needs to save data, it must connect to the Primary.

The Primary processes the request, writes it to its local storage, and logs the transaction.
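
The write path described above can be sketched in a few lines of Python. This is a minimal illustration, not a real database API: the names `Node`, `Role`, and `NotPrimaryError` are hypothetical, and an in-memory dict and list stand in for local storage and the transaction log.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    PRIMARY = "primary"
    REPLICA = "replica"


class NotPrimaryError(Exception):
    """Raised when a write reaches a node that is not the Primary."""


@dataclass
class Node:
    # All names here are hypothetical, chosen only for illustration.
    name: str
    role: Role
    storage: dict = field(default_factory=dict)  # stand-in for local storage
    log: list = field(default_factory=list)      # stand-in for the transaction log

    def write(self, key, value):
        # Only the Primary is permitted to accept write operations.
        if self.role is not Role.PRIMARY:
            raise NotPrimaryError(f"{self.name} is not the Primary")
        self.storage[key] = value              # write to local storage
        self.log.append(("set", key, value))   # log the transaction
        return len(self.log)                   # 1-based position in the log


primary = Node("db-1", Role.PRIMARY)
replica = Node("db-2", Role.REPLICA)

primary.write("balance", 100)      # accepted, stored, and logged
try:
    replica.write("balance", 50)   # rejected: writes must go to the Primary
except NotPrimaryError as err:
    print(err)
```

The important design point is that the role check sits in front of every write. As long as exactly one node holds the `PRIMARY` role, there is a single ordered log and a single source of truth.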

© 2026 Arslan Ahmad