<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Multigres Blog</title>
        <link>https://multigres.com/blog</link>
        <description>Multigres Blog</description>
        <lastBuildDate>Thu, 05 Feb 2026 00:00:00 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <item>
            <title><![CDATA[A 2.5x faster Postgres parser with Claude Code]]></title>
            <link>https://multigres.com/blog/ai-parser-engineering</link>
            <guid>https://multigres.com/blog/ai-parser-engineering</guid>
            <pubDate>Thu, 05 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Same engineer. Same complexity. A year last time, eight weeks this time. This isn't a story about AI writing code. It's a story about the system, expertise, and discipline that made AI actually useful.]]></description>
            <content:encoded><![CDATA[<p>Building a production-grade parser is an exercise in discipline. You need to translate thousands of grammar rules exactly. You need to catch subtle bugs that only surface on edge cases you've never seen. You need to verify every decision against a reference implementation. There are no shortcuts.</p>
<p>I know this because I've done it before. I led the effort to build the MySQL parser for Vitess. That took over a year with help from talented contributors. So when we needed a Postgres parser for <a href="https://www.multigres.com/" target="_blank" rel="noopener noreferrer">Multigres</a>, I expected a similar timeline.</p>
<p>It took eight weeks. 287,786 lines of code. 304 files. 130 commits. 71.2% test coverage. 2.5x faster than the cgo (Go's C interop) alternative.</p>
<p>The difference wasn't AI writing code for me. It was three things: a <strong>system</strong> for coordinating work across sessions, the <strong>expertise</strong> to recognize when the output was wrong, and the <strong>discipline</strong> to verify everything. Claude amplified what I brought to the table, but without all three, it wouldn't have worked.</p>
<p>Claude typed. I engineered.</p>
<p>AI multiplies your expertise, but only if you already know what right looks like.</p>
<p>Here's what I learned.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="why-build-a-parser-at-all">Why build a parser at all?<a href="https://multigres.com/blog/ai-parser-engineering#why-build-a-parser-at-all" class="hash-link" aria-label="Direct link to Why build a parser at all?" title="Direct link to Why build a parser at all?">​</a></h2>
<p><a href="https://www.multigres.com/" target="_blank" rel="noopener noreferrer">Multigres</a> is Vitess for Postgres, a horizontally scalable layer that sits in front of your database. It distributes data across multiple database servers called shards. Each shard holds a subset of your data. When a query comes in, Multigres figures out which shard (or shards) should handle it and routes accordingly.</p>
<p>To route queries intelligently, we need to understand them. To understand them, we need a parser.</p>
<p>What does "understand" mean?</p>
<p>Say a user sends <code>select * from orders where customer_id = 12345</code>. Multigres needs to know: Which table is it hitting? What's the filter? Is it a read or a write? (Yes, a <code>select</code> can write! Call a non-read-only function and you've got a write operation.) The answers determine which database server handles the query.</p>
<p>You can't answer those questions by looking at a string of text. You need to parse it into a structure you can actually inspect, an abstract syntax tree (AST) in computer jargon.</p>
<p>Once we have that structure, we can do more with it. We can extract the value <code>12345</code> and use it to route the query to the correct shard, the one that holds that customer's data. We can also normalize the query, replacing <code>12345</code> with a placeholder to get <code>select * from orders where customer_id = ?</code>. That normalized form becomes a cache key: the next time we see the same query shape with a different customer ID, we can reuse the query plan instead of computing it from scratch.</p>
<p>This also means we need the reverse operation: taking an AST and turning it back into SQL. When Multigres rewrites a query, say, adding a shard filter or changing a table name, we need to serialize that modified tree back into a string we can send to Postgres. That's deparsing, and we'd need to build that too.</p>
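<p>To make the round trip concrete, here is a minimal Go sketch of the operations described above. The <code>SelectStmt</code> type and its methods are toy stand-ins for illustration only, not the actual Multigres AST or API.</p>
<pre><code class="language-go">package main

import "fmt"

// A toy AST for a single-table equality select, just to make the
// parse/normalize/deparse flow concrete. The real Multigres AST mirrors
// Postgres's node types and is far richer.
type SelectStmt struct {
	Table string
	Col   string
	Value string // literal as written, e.g. "12345"
}

// Normalize replaces the literal with a placeholder and returns the extracted
// value. The normalized form serves as a plan-cache key; the value is used
// for shard routing.
func (s SelectStmt) Normalize() (SelectStmt, string) {
	value := s.Value
	s.Value = "?"
	return s, value
}

// Deparse serializes the AST back into SQL text.
func (s SelectStmt) Deparse() string {
	return fmt.Sprintf("select * from %s where %s = %s", s.Table, s.Col, s.Value)
}

func main() {
	// Pretend this came out of the parser.
	stmt := SelectStmt{Table: "orders", Col: "customer_id", Value: "12345"}

	normalized, val := stmt.Normalize()
	fmt.Println(normalized.Deparse()) // select * from orders where customer_id = ? (cache key)
	fmt.Println(val)                  // 12345 (routing value)
}
</code></pre>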
<p>There's already a Go library for parsing Postgres: <a href="https://github.com/pganalyze/pg_query_go" target="_blank" rel="noopener noreferrer">pg_query_go</a>, which extracts the parser directly from Postgres source code. It works. We contemplated using it initially. But that would have required us to rely on cgo. That means cross-compilation headaches, platform-specific builds, and a runtime dependency on C libraries. And there's a real performance penalty: cgo calls have overhead that adds up fast when parsing is on the hot path, and for point queries, every microsecond matters.</p>
<p>All of these considerations pushed us towards choosing a pure Go parser. One that matched Postgres's grammar exactly, so we wouldn't be chasing compatibility bugs forever. One that was maintainable, well-tested, and fast.</p>
<p>This meant we were committing to porting the real Postgres grammar. The actual grammar rules, translated into Go's yacc equivalent. All AST node types. All the edge cases that make SQL parsing surprisingly hard.</p>
<p>This is the kind of project that sits on a roadmap for quarters. The kind you staff a small team for. But like I mentioned before, this wasn't my first rodeo. So we knew it was possible, but the timeline was still in question. I'd been using Claude for other work and had a hunch it could change the math on this project.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="my-system-the-directory-that-ran-the-project">My system: the directory that ran the project<a href="https://multigres.com/blog/ai-parser-engineering#my-system-the-directory-that-ran-the-project" class="hash-link" aria-label="Direct link to My system: the directory that ran the project" title="Direct link to My system: the directory that ran the project">​</a></h2>
<p>Here's the thing about working with Claude: it has very little memory. Every conversation starts fresh. There's a memory feature, but it wasn't reliable enough. Crucial context would vanish at compaction. If you're doing something that spans multiple sessions, which any real project does, you need your own system.</p>
<p>My system was a directory.</p>
<p>I broke the project into multiple phases. Inside the directory, for each phase, I kept a master checklist of every task. AST structs to port, grammar rules to implement, tests to add. Each task had a status. Each phase had sub-phases (1A, 1B, 1C…) with clear scope.</p>
<p>For example, I generated a list of every AST struct in Postgres and used it as a checklist while porting. Each node got a checkbox, a name, and a reference to its location in the Postgres source:</p>
<div class="language-markdown codeBlockContainer_w6_G theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_zISX"><pre tabindex="0" class="prism-code language-markdown codeBlock_h3dj thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_BLyd"><span class="token-line" style="color:#F8F8F2"><span class="token title important punctuation" style="color:rgb(248, 248, 242)">###</span><span class="token title important"> JSON Nodes</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token list punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain"> [ ] </span><span class="token bold punctuation" style="color:rgb(248, 248, 242)">**</span><span class="token bold content">JsonOutput</span><span class="token bold punctuation" style="color:rgb(248, 248, 242)">**</span><span class="token plain"> - JSON output specification (</span><span class="token code-snippet code keyword" style="color:rgb(189, 147, 249);font-style:italic">`src/include/nodes/parsenodes.h:1751`</span><span class="token plain">)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token list punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain"> [ ] </span><span class="token bold punctuation" style="color:rgb(248, 248, 242)">**</span><span class="token bold content">JsonArgument</span><span class="token bold punctuation" style="color:rgb(248, 248, 242)">**</span><span class="token plain"> - JSON function argument (</span><span class="token code-snippet code keyword" style="color:rgb(189, 147, 249);font-style:italic">`src/include/nodes/parsenodes.h:1762`</span><span class="token plain">)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token list punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain"> [ ] </span><span class="token bold punctuation" style="color:rgb(248, 248, 242)">**</span><span class="token bold content">JsonFuncExpr</span><span class="token bold punctuation" style="color:rgb(248, 248, 242)">**</span><span class="token plain"> - JSON function expression (</span><span class="token code-snippet code keyword" style="color:rgb(189, 147, 249);font-style:italic">`src/include/nodes/parsenodes.h:1785`</span><span class="token plain">)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token list punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain"> [ ] </span><span class="token bold punctuation" style="color:rgb(248, 248, 242)">**</span><span class="token bold content">JsonTable</span><span class="token bold punctuation" style="color:rgb(248, 248, 242)">**</span><span class="token plain"> - JSON_TABLE (</span><span class="token code-snippet code keyword" style="color:rgb(189, 147, 249);font-style:italic">`src/include/nodes/parsenodes.h:1821`</span><span class="token plain">)</span><br></span></code></pre><div class="buttonGroup__fp5"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_9_8k" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_Modl"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 
2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_HqZD"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>Nothing fancy, just a markdown file Claude could check off as we went. You can see <a href="https://github.com/multigres/multigres/pull/109/changes/077e2eb527a75e674b30f33735559317648b6f5b#diff-e4100bd6b814bb938f932492b295b09d998dbc8cc2882c7c834b760bd2cd06f5" target="_blank" rel="noopener noreferrer">an example in the commit history</a>.</p>
<p>I also kept session documents. At the end of each working session, I'd get Claude to write a summary: what we accomplished, what we tried that didn't work, what the next session should pick up. When I started a new session, Claude would read these files and have full context.</p>
<p>You can actually check out the internals by looking at <a href="https://github.com/multigres/multigres/pull/109/changes/22f043315158a6ddd9dcd9857e8b7339f455c59c" target="_blank" rel="noopener noreferrer">the commit history</a>. I didn't always commit all the internal files, but sometimes they'd slip in with <code>git add .</code>.</p>
<p>This sounds simple. It was also the entire job.</p>
<p>The actual coding, translating a grammar rule from C to Go, writing a test case, implementing an AST node, Claude could do that. Often faster than I could, and without getting tired. Of course it doesn't get it all right, but I'm better off having it write 3,000 lines of code and fixing 500 of them than writing all 3,000 from scratch myself.</p>
<p>Deciding what to work on next, recognizing when we'd gone down a wrong path, understanding why the grammar was ambiguous and how to resolve it, that was me. I wasn't pair programming. I was directing: scoping well-defined tasks, reviewing every output, and fixing what needed fixing.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="using-claude-and-the-expertise-of-knowing-what-right-looks-like">Using Claude and the expertise of knowing what right looks like<a href="https://multigres.com/blog/ai-parser-engineering#using-claude-and-the-expertise-of-knowing-what-right-looks-like" class="hash-link" aria-label="Direct link to Using Claude and the expertise of knowing what right looks like" title="Direct link to Using Claude and the expertise of knowing what right looks like">​</a></h2>
<p>While porting the Postgres grammar to Go, Claude made mistakes. Constantly.</p>
<p>Sometimes it would implement a grammar rule that was subtly wrong, accepting inputs that Postgres would reject, or vice versa. Sometimes it would take an architectural shortcut that would cause problems three steps later. Sometimes it would confidently explain why its incorrect code was correct.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="example-1-the-type-system-subtlety"><strong>Example 1 (the type system subtlety):</strong><a href="https://multigres.com/blog/ai-parser-engineering#example-1-the-type-system-subtlety" class="hash-link" aria-label="Direct link to example-1-the-type-system-subtlety" title="Direct link to example-1-the-type-system-subtlety">​</a></h3>
<p>Here's a concrete example. In Go, I had a <code>Node</code> interface that all AST nodes implement, and a <code>NodeList</code> type for holding sequences of nodes. <code>NodeList</code> itself also implements <code>Node</code>—that's important because sometimes you need to nest lists within the tree.</p>
<p>Claude kept using <code>[]Node</code> (a slice of nodes) instead of <code>*NodeList</code> (a pointer to a NodeList). Both hold multiple nodes. Both seem interchangeable at first glance. But <code>[]Node</code> doesn't implement <code>Node</code>, so it can't be placed where the grammar expects a node.</p>
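<p>A stripped-down sketch of that type relationship (the marker method here is illustrative, not the real interface):</p>
<pre><code class="language-go">package ast

// Node is the interface every AST node implements.
type Node interface {
	IsNode() // illustrative marker method
}

// NodeList holds an ordered sequence of nodes. Crucially, *NodeList itself
// implements Node, so a list can be nested anywhere the grammar expects a node.
type NodeList struct {
	Items []Node
}

func (*NodeList) IsNode() {}

// A plain []Node has no method set satisfying Node, so it cannot sit where
// the grammar expects a node:
//
//	var _ Node = &amp;NodeList{} // compiles
//	var _ Node = []Node{}    // compile error
</code></pre>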
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="example-2-fixing-symptoms-instead-of-causes"><strong>Example 2 (fixing symptoms instead of causes):</strong><a href="https://multigres.com/blog/ai-parser-engineering#example-2-fixing-symptoms-instead-of-causes" class="hash-link" aria-label="Direct link to example-2-fixing-symptoms-instead-of-causes" title="Direct link to example-2-fixing-symptoms-instead-of-causes">​</a></h3>
<p>Another pattern: Claude would use the wrong type in a grammar rule, then "fix" the resulting type errors by adding conversion functions. The grammar says this rule produces an <code>X</code>, the code returns a <code>Y</code>, so Claude writes a <code>convertYToX()</code> function and moves on.</p>
<p>This technically compiles. It's also a mess.</p>
<p>The right fix is to change the grammar rule to produce the correct type in the first place, no conversion needed. But that requires understanding the grammar's structure and thinking a few steps ahead. Claude would take the local fix, the one that made the immediate error go away, without seeing that it was papering over a design mistake.</p>
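<p>In Go terms, the difference looks roughly like this (the names are hypothetical):</p>
<pre><code class="language-go">// Symptom fix: the rule's action built a []Node, so a shim converts it
// afterwards. It compiles, but every consumer now depends on the shim.
func convertSliceToNodeList(items []Node) *NodeList {
	return &amp;NodeList{Items: items}
}

// Cause fix: the rule's action builds a *NodeList directly, so the value
// already has the type the grammar expects and no shim is needed.
func appendToNodeList(list *NodeList, item Node) *NodeList {
	list.Items = append(list.Items, item)
	return list
}
</code></pre>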
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="example-3-no-reference-to-copy-from"><strong>Example 3 (no reference to copy from):</strong><a href="https://multigres.com/blog/ai-parser-engineering#example-3-no-reference-to-copy-from" class="hash-link" aria-label="Direct link to example-3-no-reference-to-copy-from" title="Direct link to example-3-no-reference-to-copy-from">​</a></h3>
<p>The parser itself was relatively straightforward: we had Postgres's grammar as a reference, so the job was translation. Deparsing was harder. Deparsing means taking the AST and turning it back into a SQL string.</p>
<p>Postgres doesn't have a standalone deparsing module we could copy from. Claude had to generate this logic from scratch. The error rate went up noticeably. Without a reference implementation to validate against, Claude would produce output that looked plausible but was subtly wrong: missing parentheses, incorrect operator precedence, edge cases that produced invalid SQL. This was where I spent a disproportionate amount of debugging time, and it reinforced the pattern: <strong>Claude is much better at translating existing logic than inventing new logic correctly</strong>.</p>
<p>None of this was surprising.</p>
<p>The surprising thing was how much it didn't matter.</p>
<p>It didn't matter because I knew what right looked like.</p>
<p>I've been working on parsers for years. I know how yacc grammars behave. I know what grammar conflicts mean and how to debug them. When Claude's output was wrong, I could see it, usually quickly. At that point, if it's a small fix, I would just go ahead and fix it myself. If it was bigger, I'd start a new session and get Claude to fix it, telling it that <em>I</em> made that mistake 😝 (This works better than you'd think.)</p>
<p>If I didn't already know how to build a parser, Claude wouldn't have helped. I would have accepted wrong output, made bad structural decisions, and ended up with a mess that didn't work.</p>
<p>AI doesn't replace expertise. It multiplies it.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="trust-but-verify-the-discipline-of-working-with-ai">Trust, but verify: the discipline of working with AI<a href="https://multigres.com/blog/ai-parser-engineering#trust-but-verify-the-discipline-of-working-with-ai" class="hash-link" aria-label="Direct link to Trust, but verify: the discipline of working with AI" title="Direct link to Trust, but verify: the discipline of working with AI">​</a></h2>
<p>Fast output means nothing if the output is wrong. A parser that mostly works is a parser that fails in production at 3am on queries you've never seen.</p>
<p>So I verified, and then re-verified.</p>
<p>There were parts with small enough scope that I could let Claude take over with less oversight. The lexer, for example, is mostly a massive switch statement on characters and states. But the grammar was different.</p>
<p>First: As we kept implementing the grammar, I would read every grammar rule. Not skim, read. I compared each one against the Postgres source to confirm it matched. This took time, and it was boring, but it was necessary. Grammar rules are precise. A missing optional clause or a wrong precedence and you're silently accepting invalid SQL.</p>
<p>Second: I ported Postgres's own regression tests. The Postgres source tree includes extensive SQL test files in <code>src/test/regress/sql/</code>. I wrote a script to extract the queries and run them through our parser. Thousands of queries, covering edge cases the Postgres team has accumulated over decades.</p>
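<p>The harness itself can be simple. Here is one way to sketch it in Go; the actual script may differ, the <code>parser</code> import path and <code>parser.Parse</code> are placeholders, and the naive semicolon split skips real-world complications like dollar-quoting and comments:</p>
<pre><code class="language-go">package parser_test

import (
	"os"
	"path/filepath"
	"strings"
	"testing"

	"github.com/multigres/multigres/go/parser" // hypothetical import path
)

// Feed every statement from Postgres's regression suite through the parser.
func TestPostgresRegressionQueries(t *testing.T) {
	files, err := filepath.Glob("testdata/regress/sql/*.sql")
	if err != nil {
		t.Fatal(err)
	}
	for _, file := range files {
		data, err := os.ReadFile(file)
		if err != nil {
			t.Fatal(err)
		}
		for _, stmt := range strings.Split(string(data), ";") {
			stmt = strings.TrimSpace(stmt)
			if stmt == "" {
				continue
			}
			if _, err := parser.Parse(stmt); err != nil {
				// Some statements are supposed to fail; a real harness
				// checks them against an expected-failure list instead.
				t.Errorf("%s: %q: %v", file, stmt, err)
			}
		}
	}
}
</code></pre>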
<p>When tests failed, I investigated each one. Some failures were bugs in our parser, we fixed those. Some were tests that were supposed to fail, checking that invalid syntax was rejected. I confirmed each of those too.</p>
<p>The result: 71.2% code coverage, validation against the real Postgres test suite, and confidence that when we say "Postgres-compatible grammar," we mean it. Will we find bugs? Probably. I'm 71.2% sure we won't 😝. But I wouldn't have fared any better writing it by hand. The test coverage would look the same.</p>
<p>And the performance? We benchmarked against pg_query_go, the cgo-based alternative. On individual queries, our pure Go parser is 2-3.5x faster:</p>
<table><thead><tr><th>Query Type</th><th>Multigres</th><th>pg_query_go</th><th>Speedup</th></tr></thead><tbody><tr><td>Simple SELECT</td><td>1.6µs</td><td>3.1µs</td><td>2x</td></tr><tr><td>Complex SELECT</td><td>3.2µs</td><td>11.0µs</td><td>3.5x</td></tr><tr><td>CREATE TABLE</td><td>7.7µs</td><td>26.4µs</td><td>3.5x</td></tr></tbody></table>
<p>Across the full regression test suite, thousands of queries, Multigres parses in 145ms versus pg_query_go's 366ms. That's 2.5x faster, with no cgo overhead and no cross-compilation headaches.</p>
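<p>The per-query numbers above come from standard Go benchmarks of roughly this shape; <code>parser.Parse</code> and the import path are placeholders, and an equivalent loop around pg_query_go produces the comparison column:</p>
<pre><code class="language-go">package parser_test

import (
	"testing"

	"github.com/multigres/multigres/go/parser" // hypothetical import path
)

func BenchmarkParseSimpleSelect(b *testing.B) {
	const query = "select * from orders where customer_id = 12345"
	b.ReportAllocs()
	for i := 0; i &lt; b.N; i++ {
		if _, err := parser.Parse(query); err != nil {
			b.Fatal(err)
		}
	}
}
</code></pre>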
<p>The memory numbers are harder to compare directly since Go's runtime doesn't track allocations on the C side of cgo. But for a hot-path component like a parser, the speed difference alone justifies the pure Go approach.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="the-new-normal">The new normal<a href="https://multigres.com/blog/ai-parser-engineering#the-new-normal" class="hash-link" aria-label="Direct link to The new normal" title="Direct link to The new normal">​</a></h2>
<p>I think this is what the future looks like for a lot of software work. Not AI replacing engineers, but engineers operating at a different level of abstraction. <strong>Less time typing, more time thinking</strong>. Less time on the mechanical translation of ideas to code, more time on the ideas themselves.</p>
<p>The MySQL parser took a year because the bottleneck was implementation. The Postgres parser took two months because the bottleneck moved. Now it's about how fast you can make good decisions, verify correctness, and course-correct when things go wrong.</p>
<p>The leverage is real. But it requires systems (my coordination directory), genuine expertise (knowing when Claude is wrong), and discipline (reading every grammar rule). This isn't a shortcut. It's a different way of working. One that demands more from you as an engineer, not less. That's how a year became eight weeks.</p>]]></content:encoded>
            <category>postgres</category>
            <category>multigres</category>
            <category>ai</category>
            <category>parser</category>
            <category>engineering</category>
        </item>
        <item>
            <title><![CDATA[Generalized Consensus: Recap]]></title>
            <link>https://multigres.com/blog/generalized-consensus-part11</link>
            <guid>https://multigres.com/blog/generalized-consensus-part11</guid>
            <pubDate>Tue, 04 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[We covered a lot of ground in this series. We started with the following objectives:]]></description>
            <content:encoded><![CDATA[<p>We covered a lot of ground in this series. We started with the following objectives:</p>
<ul>
<li>Propose an alternate and more approachable definition of consensus.</li>
<li>Expand the definition into concrete requirements.</li>
<li>Break the problem down into goal-oriented rules.</li>
<li>Provide algorithms and approaches to satisfy the rules with adequate explanations to prove correctness and safety.</li>
<li>Show that existing algorithms are special cases of this generalization.</li>
</ul>
<p>Below is a summary of the topics we covered.</p>
<h1>Definition</h1>
<p>We introduced an alternate informal definition for a consensus system:</p>
<p><em>A consensus system must ensure that every request is saved elsewhere before it is completed and acknowledged. If there is a failure after the acknowledgment, the system must have the ability to find the saved requests, complete them, and resume operations from that point.</em></p>
<h1>Durability Policy</h1>
<p>We declared that durability policies can be specified externally, for example as a plugin. The algorithm should not have to change if the rules change. The rules have the following restrictions:</p>
<ul>
<li>The rules must depend on the current nodes in the cohort.</li>
<li>Properties of cohort nodes (like AZ) can be used, as long as they are static.</li>
<li>The rules cannot depend on external variables, such as time.</li>
<li>Each leader can have different rules.</li>
</ul>
<p>The ruleset data structure would conceptually look like this (see the sketch after this list):</p>
<ul>
<li>List of cohort nodes</li>
<li>List of eligible primaries, each containing:
<ul>
<li>A list of node groups, where each node group is a valid durability combination</li>
</ul>
</li>
</ul>
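<p>As a concrete rendering of that structure, here is a hypothetical Go sketch; the field names are illustrative:</p>
<pre><code class="language-go">package consensus

type NodeID string

// Ruleset instantiates the durability policy against the current cohort.
type Ruleset struct {
	Cohort            []NodeID          // all nodes in the cohort
	EligiblePrimaries []EligiblePrimary // who may lead, and with what quorums
}

// EligiblePrimary lists, for one candidate leader, every node group that
// counts as a valid durability combination for its requests.
type EligiblePrimary struct {
	Primary NodeID
	Groups  [][]NodeID // each inner slice is one valid durability combination
}
</code></pre>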
<p>It may be possible to add more flexibility to the rules, but we think this is sufficient for most of today’s requirements.</p>
<h1>Governing Rules</h1>
<p>We introduced a set of governing rules.</p>
<p>Using these rules as a foundation, we proposed multiple ways to achieve consensus by focusing on different sections of the rules. We also included existing approaches and explained how they adhered to the governing rules.</p>
<p>The rules are as follows:</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="definitions">Definitions<a href="https://multigres.com/blog/generalized-consensus-part11#definitions" class="hash-link" aria-label="Direct link to Definitions" title="Direct link to Definitions">​</a></h3>
<ul>
<li>A consensus system executes a series of consistent distributed decisions made by multiple agents.</li>
<li>A <code>decision</code> is an intent to make a change to the state of the system.</li>
<li>An <code>agent</code> fulfills decisions.</li>
</ul>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="rules">Rules<a href="https://multigres.com/blog/generalized-consensus-part11#rules" class="hash-link" aria-label="Direct link to Rules" title="Direct link to Rules">​</a></h3>
<ol>
<li>Durability: Every decision is a distributed decision.
<ol>
<li>A distributed decision must be made durable.</li>
<li>A decision that has been made durable can be applied.</li>
</ol>
</li>
<li>Consistency: Decisions must be applied sequentially.
<ol>
<li>Every agent must revoke the ability of all previous agents to make further progress before taking any action.
<ol>
<li>Inference: Every agent must provide a way for future agents to revoke its ability to make progress.</li>
</ol>
</li>
<li>Every agent must discover decisions that were previously made durable, but not applied, and honor them.
<ol>
<li>There are situations where it will be impossible to know if a decision met the durability criteria. If so, the agent must honor such decisions because they might have been applied.</li>
<li>Decisions that get honored must be made durable and applied as a new decision made by the current agent (rule 1).</li>
<li>Inference: If an agent discovers multiple conflicting timelines, the newest one must be chosen.</li>
</ol>
</li>
</ol>
</li>
</ol>
<p>These rules have a hierarchy. If you can satisfy the top-level rule, you do not have to follow the sub-parts. To accommodate all possible algorithms, the rules also avoid dictating any approach or implementation.</p>
<p>We demonstrated that these rules could be applied to the three types of <code>decisions</code> that a leader-based system would make:</p>
<ul>
<li>Fulfilling requests</li>
<li>Changing leadership</li>
<li>Changing durability rules</li>
</ul>
<h1>Scoping</h1>
<p>For the sake of practicality, we narrowed down the scope of our analysis when exploring solutions:</p>
<ul>
<li>We adopted Raft’s log replication as a requirement.</li>
<li>We assumed a leader-based approach.</li>
</ul>
<p>We re-introduced the following terminology from existing consensus protocols:</p>
<ul>
<li>A <code>leader</code> is an <code>agent</code>. It is a designated node in the cohort that is empowered to accept and complete requests. It continues to serve requests until its leadership is revoked.</li>
<li>The rest of the nodes in the cohort are <code>followers</code>. Their role is to assist the leader in its workflow to make requests durable.</li>
<li><code>Observers</code> are nodes that are not part of the cohort. They are replicas that receive requests only after the leader has completed them.</li>
</ul>
<h1>Coordinator</h1>
<p>We introduced a specialized agent called the <code>coordinator</code>. This separate role is necessary because a generalized approach allows for a large number of nodes in the cohort, making it impractical for every node to health check and respond to failures.</p>
<p>The coordinator is responsible for the following actions:</p>
<ul>
<li>Perform health checks on all the nodes of the cohort.</li>
<li>In case of a failure, perform a failover by appointing a new leader and ensuring that requests that the previous leader might have applied are honored.</li>
<li>Optionally, provide the functionality to perform a “planned leader change”.</li>
</ul>
<p>Coordinators are not part of the cohort. Multiple coordinators can be deployed, and they do not need to be aware of each other’s existence.</p>
<p>In Raft, <code>leaders</code> are agents that fulfill requests, and <code>followers</code> act as agents when they choose to become candidates. In our approach, the task of changing leadership is taken on by the <code>coordinators</code> instead.</p>
<p>For small cohort sizes, nodes can take on the role of coordinators, just like in Raft.</p>
<h1>Term Numbers</h1>
<p>We analyzed the problem of ordering in a distributed system. We concluded that assigning monotonically increasing and unique term numbers to each decision resulted in safer solutions.</p>
<p>The agents would use this term number to recruit nodes from older terms. If they succeed at recruiting a sufficient number of them to execute a leadership change, they move forward with the rest of the actions.</p>
<p>As a counterpoint, we demonstrated an engineering approach that utilized locks and timeouts to achieve ordering without relying on term numbers. However, it has trade-offs due to the reliance on clocks and execution time.</p>
<h1>A Raft inspired approach</h1>
<p>We will now cover an example inspired by Raft that demonstrates one approach to implementing a system that can accommodate an externally specified durability policy.</p>
<p>As a bonus, we will also show how a change in durability rules can be trivially included as part of this algorithm.</p>
<p>This approach assumes that you are familiar with Raft. For brevity, we will skip over the common parts.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="components">Components<a href="https://multigres.com/blog/generalized-consensus-part11#components" class="hash-link" aria-label="Direct link to Components" title="Direct link to Components">​</a></h2>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="coordinator-1">Coordinator<a href="https://multigres.com/blog/generalized-consensus-part11#coordinator-1" class="hash-link" aria-label="Direct link to Coordinator" title="Direct link to Coordinator">​</a></h3>
<p>The coordinator does not need to persist any information. However, it needs either the current ruleset or a way to discover the existing cohort nodes to initialize itself. While active, it needs the following information:</p>
<ul>
<li>Term number</li>
<li>Ruleset</li>
</ul>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="node">Node<a href="https://multigres.com/blog/generalized-consensus-part11#node" class="hash-link" aria-label="Direct link to Node" title="Direct link to Node">​</a></h3>
<p>A node needs to persist the following:</p>
<ul>
<li>Term number</li>
<li>Ruleset</li>
<li>A log that allows requests to be appended. The log can also be truncated at a specified point, deleting all entries from that point to the end.
<ul>
<li>Every log entry contains the term under which the request was made.</li>
</ul>
</li>
<li>An “applied index” that trails behind the end of the log. It is the point up to which it is safe to apply requests that are present in the log.</li>
</ul>
<p>Term number rules for cohort nodes:</p>
<ul>
<li>A node must honor requests from an agent with a matching term number.</li>
<li>A node must reject requests from an agent with a lower term number.</li>
<li>In response to a recruitment, a node must agree to participate if the proposed term number is higher than its current one.</li>
<li>A node responds to only one agent for a new term number. If another agent attempts to recruit the node with the same term number, it is rejected.</li>
</ul>
<p>When recruited, each node returns the following information (see the sketch after this list):</p>
<ul>
<li>The current log index</li>
<li>The term number of the last log entry</li>
<li>The current ruleset</li>
<li>The list of ruleset changes in the unapplied parts of the log</li>
</ul>
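<p>Putting the persisted state and the recruitment response together in the same hypothetical Go shape, reusing the <code>Ruleset</code> sketch from earlier (names are illustrative):</p>
<pre><code class="language-go">package consensus

type LogEntry struct {
	Term    uint64 // term under which the request was made
	Request []byte
}

// NodeState is what each cohort node persists.
type NodeState struct {
	Term         uint64
	Ruleset      Ruleset
	Log          []LogEntry // append-only; truncation deletes a suffix
	AppliedIndex int        // safe-to-apply watermark, trails the end of the log
}

// RecruitResponse is what a node returns when it agrees to join a new term.
type RecruitResponse struct {
	LastLogIndex    int
	LastLogTerm     uint64
	Ruleset         Ruleset
	PendingRulesets []Ruleset // ruleset changes in the unapplied part of the log
}
</code></pre>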
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="ruleset">Ruleset<a href="https://multigres.com/blog/generalized-consensus-part11#ruleset" class="hash-link" aria-label="Direct link to Ruleset" title="Direct link to Ruleset">​</a></h3>
<p>The Ruleset is a data structure that is embedded in the node and persisted by it. Functionally, the ruleset must answer questions about durability, revocation, and candidacy. The following is an example of what a ruleset could look like.</p>
<ul>
<li>Name</li>
<li>List of cohort nodes</li>
<li>List of eligible primaries, each containing:
<ul>
<li>A list of node groups, where each node group is a valid durability combination</li>
</ul>
</li>
</ul>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="completing-requests">Completing requests<a href="https://multigres.com/blog/generalized-consensus-part11#completing-requests" class="hash-link" aria-label="Direct link to Completing requests" title="Direct link to Completing requests">​</a></h2>
<p>Unlike Raft, which validates durability by waiting for acks from a majority of followers, the generalized approach validates the acks against the ruleset. Aside from this, the entire algorithm remains the same.</p>
<p>If the request is a ruleset change, its acks must satisfy both the previous and the new ruleset. After the change is applied, the leader can proceed with the newer ruleset.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="leadership-change">Leadership Change<a href="https://multigres.com/blog/generalized-consensus-part11#leadership-change" class="hash-link" aria-label="Direct link to Leadership Change" title="Direct link to Leadership Change">​</a></h2>
<p>In Raft, failure detection and leadership change are handled by individual nodes. In our generalized approach, separate coordinators perform these tasks. However, the actions taken by a coordinator still closely resemble those performed by the candidate in Raft. We do want to highlight how it “thinks” differently.</p>
<p>A coordinator that has decided to change leadership has the following goals:</p>
<ul>
<li>Obtain a term number</li>
<li>Revocation</li>
<li>Candidacy</li>
<li>Discovery</li>
<li>Propagation</li>
<li>Establishment</li>
</ul>
<p>Note that these are a restatement of rule 2, except that they are goal-oriented.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="obtaining-a-term-number-revocation-candidacy-and-discovery">Obtaining a term number, Revocation, Candidacy, and Discovery<a href="https://multigres.com/blog/generalized-consensus-part11#obtaining-a-term-number-revocation-candidacy-and-discovery" class="hash-link" aria-label="Direct link to Obtaining a term number, Revocation, Candidacy, and Discovery" title="Direct link to Obtaining a term number, Revocation, Candidacy, and Discovery">​</a></h3>
<p>A single step achieves the above four goals: The coordinator increments its current term number and sends a message to recruit all the nodes in the cohort.</p>
<p>Among the nodes that were successfully recruited, it identifies the most progressed one:</p>
<ul>
<li>The log with the highest term number is the most progressed.</li>
<li>For logs with identical term numbers, the one with the highest index is the most progressed.</li>
</ul>
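<p>Those two rules reduce to a tiny Go comparison:</p>
<pre><code class="language-go">package consensus

// moreProgressed reports whether log position a is more progressed than b:
// the higher last-entry term wins, and ties break on log index.
func moreProgressed(aTerm, bTerm uint64, aIndex, bIndex int) bool {
	if aTerm != bTerm {
		return aTerm > bTerm
	}
	return aIndex > bIndex
}
</code></pre>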
<p>From the most progressed node:</p>
<ul>
<li>It saves the ruleset returned by that node for subsequent attempts.</li>
<li>If additional rulesets were returned, they are stored in a temporary list.</li>
</ul>
<p>It validates revocation and candidacy by ensuring that the recruited nodes satisfy all the rulesets, which include the one returned by the node and the secondary list of in-flight ruleset changes:</p>
<ul>
<li>No leader of an older term must be able to complete any more requests.</li>
<li>The recruited nodes must include a candidate (or the intended candidate), along with enough nodes for it to reach quorum.</li>
</ul>
<p>If these criteria are not met, then the change of leadership cannot proceed. This can happen either because the coordinator was unable to reach all the necessary nodes or because a different coordinator recruited those nodes under the same term.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="propagation">Propagation<a href="https://multigres.com/blog/generalized-consensus-part11#propagation" class="hash-link" aria-label="Direct link to Propagation" title="Direct link to Propagation">​</a></h3>
<p>By now, the coordinator should have identified a candidate and the most progressed node. At this stage, the propagation mechanism can be the same as how Raft’s <code>AppendEntries</code> works. However, if there are multiple rulesets, the propagation must satisfy the quorum rules for all of them.</p>
<p>Raft requires that you can commit only when the log's term number matches the current term. We interpret this as an additional constraint on the durability requirements. This requires the entire timeline to become durable under the new term before it can be applied. It is an indirect way to satisfy Rule 2b(ii).</p>
<p>Propagation succeeds when the quorum rules for the candidate are met.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="establishment">Establishment<a href="https://multigres.com/blog/generalized-consensus-part11#establishment" class="hash-link" aria-label="Direct link to Establishment" title="Direct link to Establishment">​</a></h3>
<p>The coordinator moves the applied index to the end of the log and delegates its term number to the candidate. At this point, the candidate becomes the new leader.</p>
<h1>Conclusion</h1>
<p>We believe that we have satisfied our goal of generalizing consensus in the following ways:</p>
<ul>
<li>Accommodating arbitrarily complex durability requirements.</li>
<li>Providing a set of governing rules that can be used for different approaches and implementations.</li>
</ul>
<p>We also have a goal of implementing this approach in Multigres.</p>
<p>This is the last part of the series. To start from the beginning, visit the <a href="https://multigres.com/blog/generalized-consensus">Full Series Overview</a>.</p>]]></content:encoded>
            <category>planetpg</category>
            <category>consensus</category>
            <category>distributed-systems</category>
            <category>durability</category>
        </item>
        <item>
            <title><![CDATA[Generalized Consensus: Addenda]]></title>
            <link>https://multigres.com/blog/generalized-consensus-part10</link>
            <guid>https://multigres.com/blog/generalized-consensus-part10</guid>
            <pubDate>Mon, 03 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[This section covers topics that were deferred in the previous parts. However, these are necessary to deploy consensus systems in production.]]></description>
            <content:encoded><![CDATA[<p>This section covers topics that were deferred in the previous parts. However, these are necessary to deploy consensus systems in production.</p>
<h1>Health checks</h1>
<p>In a distributed system, there are no accurate methods of detecting failure. When a node becomes unreachable, the cause could be any of the following:</p>
<ul>
<li>It could be a network partition</li>
<li>The node could have crashed</li>
<li>The network could be too slow</li>
</ul>
<p>Making failover decisions based on an incorrect diagnosis may actually end up disrupting an already healthy system.</p>
<p>However, we must do the best we can.</p>
<p>We have previously stated that coordinators will perform health checks on all nodes in the cohort. We also assume that the coordinators are strategically positioned to handle expected failure scenarios. This approach offers several advantages because it allows us to draw reliable inferences.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="responsibilities">Responsibilities<a href="https://multigres.com/blog/generalized-consensus-part10#responsibilities" class="hash-link" aria-label="Direct link to Responsibilities" title="Direct link to Responsibilities">​</a></h3>
<p>Each coordinator must connect to all cohort nodes and perform regular health checks. This can be achieved either through polling or by having the nodes stream their health status at regular intervals.</p>
<p>During health checks, the coordinator can keep the current leader, term, and ruleset up to date.</p>
<p>Each leader must send regular heartbeats to all nodes in the cohort.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="failure-detection">Failure detection<a href="https://multigres.com/blog/generalized-consensus-part10#failure-detection" class="hash-link" aria-label="Direct link to Failure detection" title="Direct link to Failure detection">​</a></h3>
<p>This is a topic that requires its own study. However, Raft’s simple approach seems to have satisfied most deployments. The coordinator performing health checks is slightly better than Raft because it checks the health of all nodes before making a decision. In Raft, a follower makes a decision simply because it hasn’t received a heartbeat from the leader.</p>
<p>When the coordinator detects a failure, it must answer these two questions:</p>
<ol>
<li>Is the leader able to complete requests? We determine the answer to this question using the following data:
<ul>
<li>Is the leader itself reachable?</li>
<li>Among reachable nodes, are they receiving heartbeats from the leader?</li>
<li>Among those receiving heartbeats, are they enough for the leader to complete its requests?</li>
<li>Are the nodes still completing requests from the leader?</li>
<li>How long has this been going on?</li>
</ul>
</li>
<li>Can the coordinator reach enough nodes to perform a leader change?</li>
</ol>
<p>Answers to these questions should lead us through a decision tree where the outcome is either a decision to perform a leadership change or not to take any action. If the decision is to change the leader, the coordinator can follow the steps described in the previous sections.</p>
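<p>A hypothetical distillation of that decision tree in Go; the inputs mirror the questions above, and the names and threshold are illustrative:</p>
<pre><code class="language-go">package consensus

import "time"

type HealthView struct {
	LeaderReachable     bool
	QuorumSeesHeartbeat bool // enough nodes for the leader's quorum still hear it
	RequestsCompleting  bool
	OutageDuration      time.Duration
	CanReachQuorum      bool // can this coordinator recruit enough nodes to change leaders?
}

func shouldFailover(v HealthView, patience time.Duration) bool {
	leaderHealthy := v.LeaderReachable &amp;&amp; v.QuorumSeesHeartbeat &amp;&amp; v.RequestsCompleting
	if leaderHealthy || v.OutageDuration &lt; patience {
		return false // nothing is wrong, or it is too early to be sure
	}
	return v.CanReachQuorum // act only if a leader change can actually succeed
}
</code></pre>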
<h1>Term numbers</h1>
<p>We previously promised that we would cover ways to generate term numbers. Here are some options:</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="the-raft-approach">The Raft approach<a href="https://multigres.com/blog/generalized-consensus-part10#the-raft-approach" class="hash-link" aria-label="Direct link to The Raft approach" title="Direct link to The Raft approach">​</a></h3>
<p>Raft uses a clever method that lets nodes compete by using the same term number. The first coordinator to reach a majority of nodes wins that term number and gets permission to change leadership.</p>
<p>Those who do not win must wait for a timeout period and then try again using a higher term number. If the cluster is healed by that time, they have no action to take. This approach provides a mitigation for the livelock problem, where nodes can continuously race with each other, preventing anyone from succeeding.</p>
<p><em>Figure 1: Term number competition (animation).</em></p>
<p>The animation above is a reproduction of the one from the section on Revocation and Candidacy.</p>
<p>Applying the same approach to our pluggable durability, the coordinators do not need to reach a majority. If you examine the rightmost recruitment options, you will see that each option  shares at least one node with every other option. This is a necessary property of recruitment.</p>
<p>We can utilize this property, similar to Raft, to have the coordinators compete against each other to recruit the necessary nodes for a leadership change. Whoever succeeds first wins the term. By definition, others must fail.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="time">Time<a href="https://multigres.com/blog/generalized-consensus-part10#time" class="hash-link" aria-label="Direct link to Time" title="Direct link to Time">​</a></h3>
<p>The current time can be used as a term number, as sketched after the list below. There are a couple of risk factors associated with this:</p>
<ul>
<li>Timestamps can theoretically collide. Adding extra bits, such as a unique coordinator ID, may be necessary to ensure collision avoidance.</li>
<li>Rogue clocks can accelerate by a vast margin. Such incidents will require human intervention to reset the system.</li>
</ul>
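<p>A minimal sketch of the timestamp scheme with coordinator-ID bits appended; the 48/16 bit split is an assumption, not a recommendation:</p>
<pre><code class="language-go">package consensus

import "time"

// termFromClock builds a term number from the wall clock plus a unique
// coordinator ID to avoid collisions. It remains vulnerable to rogue
// clocks, as noted above.
func termFromClock(coordinatorID uint16) uint64 {
	ms := uint64(time.Now().UnixMilli()) // 48 bits of milliseconds lasts for millennia
	return ms&lt;&lt;16 | uint64(coordinatorID)
}
</code></pre>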
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="etcd">etcd<a href="https://multigres.com/blog/generalized-consensus-part10#etcd" class="hash-link" aria-label="Direct link to etcd" title="Direct link to etcd">​</a></h3>
<p>You can use an external system, such as etcd, to acquire a lock and generate a unique, monotonically increasing number. This method also solves the livelock problem. Some might say that this is impure. But it is still a wise engineering choice.</p>
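<p>One hedged sketch of the etcd route using <code>go.etcd.io/etcd/client/v3</code>: take a distributed lock, then use the store revision, which increases monotonically on every write, as the term number. The key names here are arbitrary.</p>
<pre><code class="language-go">package consensus

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func nextTerm(ctx context.Context, cli *clientv3.Client) (int64, error) {
	session, err := concurrency.NewSession(cli)
	if err != nil {
		return 0, err
	}
	defer session.Close()

	mu := concurrency.NewMutex(session, "/consensus/term-lock")
	if err := mu.Lock(ctx); err != nil { // the lock also mitigates livelock
		return 0, err
	}
	defer mu.Unlock(ctx)

	// Any write bumps etcd's store revision, which is unique and monotonic.
	resp, err := cli.Put(ctx, "/consensus/term-seq", "")
	if err != nil {
		return 0, err
	}
	return resp.Header.Revision, nil
}
</code></pre>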
<h1>Alternate durability</h1>
<p>We previously discussed the need to revoke all possible leaderships for a safe leadership change. With this assumption, it is sufficient that a request reach any leader’s quorum. The current leader can consider that the request is durable and apply it.</p>
<p>If there is a failure, the act of global revocation will also discover any unapplied logs from the alternate group of nodes.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="navigating-the-series">Navigating the Series<a href="https://multigres.com/blog/generalized-consensus-part10#navigating-the-series" class="hash-link" aria-label="Direct link to Navigating the Series" title="Direct link to Navigating the Series">​</a></h2>
<p>In the next part, we'll bring it all together with a recap, and conclude with a complete Raft-inspired design of a generalized consensus algorithm.</p>
<ul>
<li><a href="https://multigres.com/blog/generalized-consensus">Full Series Overview</a></li>
</ul>]]></content:encoded>
            <category>planetpg</category>
            <category>consensus</category>
            <category>distributed-systems</category>
            <category>durability</category>
        </item>
        <item>
            <title><![CDATA[Generalized Consensus: Consistent Reads]]></title>
            <link>https://multigres.com/blog/generalized-consensus-part9</link>
            <guid>https://multigres.com/blog/generalized-consensus-part9</guid>
            <pubDate>Sun, 02 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[This section discusses various approaches to achieving consistent reads in a consensus protocol.]]></description>
            <content:encoded><![CDATA[<p>This section discusses various approaches to achieving consistent reads in a consensus protocol.</p>
<p>Most official publications of consensus protocols have paid only lip service to the issue of consistent reads. Implementors of these protocols have each developed their own methods for achieving consistency, and they all involve trade-offs.</p>
<p>The reason for this avoidance is that no solution is both perfect and performant. The following properties determine the trade-offs:</p>
<ul>
<li>Consensus is a replicated system. There is no guarantee that a follower has the latest data.</li>
<li>The current leader is guaranteed to have the latest data, but there is no guarantee that you know who the current leader is.</li>
</ul>
<p>One important factor to keep in mind is that leader terms are expected to last a long time, on the order of days. A planned leader change typically happens once a week. Unplanned leader changes might be even less common. Keep this in mind when choosing your solution.</p>
<p>At this point, we have the opportunity to reintroduce <code>observers</code>. These nodes are not part of the cohort, but they receive completed requests and can be used for reads. We will refer to the combination of followers and observers as replicas.</p>
<p>Here are a few approaches:</p>
<h1>Leader lease</h1>
<p>The lease approach involves giving a leader a lease once appointed. During this period, the system will not revoke its leadership. The leader can renew its lease either by completing requests or through heartbeats. If a leader cannot renew its lease, it will stop serving reads before the lease expires.</p>
<p>There are a few disadvantages:</p>
<ul>
<li>We trust the clock.</li>
<li>If the leader becomes unreachable, you have to wait until the lease expires before appointing another leader.</li>
<li>We lose the opportunity to distribute reads across the replicas.</li>
</ul>
<p><em>For reference, Spanner supposedly uses this approach with a lease period of ten seconds.</em></p>
<h1>Leader heartbeat read</h1>
<p>In this approach, the leader sends out heartbeats for every read. If a valid quorum of followers respond with the same term number, then it knows the leadership has not been revoked yet. It can respond to the read request.</p>
<p>Downsides:</p>
<ul>
<li>The cost of a read is as high as the network cost of completing requests.</li>
<li>We lose the opportunity to distribute reads across the replicas.</li>
</ul>
<h1>Log index based read</h1>
<p>This method works for a single client. For each successful write request, the leader returns the log position of the request. The client can then request a read from any replica, requiring the replica to wait until it has applied up to that position before serving the read.</p>
<p>An advantage of this approach is that reads can be load-balanced across multiple replicas. A replica-side sketch follows the downsides below.</p>
<p>Downsides:</p>
<ul>
<li>Replica lag or network partitions can impact read performance.</li>
<li>Only the client that wrote the last request knows the latest position of its request.</li>
</ul>
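<p>From the replica's side, the wait is simple. A minimal sketch with illustrative types; <code>Replica</code> and its fields are assumptions:</p>
<pre><code class="language-go">package reads

import "context"

type Row struct{}

type Replica struct {
	applied chan struct{} // signaled whenever the apply index advances
	index   func() int    // current apply index
	query   func() []Row  // runs the actual read
}

// ReadAtIndex blocks until the replica has applied up to minIndex, the log
// position returned by the client's last write, then serves the read.
func (r *Replica) ReadAtIndex(ctx context.Context, minIndex int) ([]Row, error) {
	for r.index() &lt; minIndex {
		select {
		case &lt;-ctx.Done():
			return nil, ctx.Err() // replica lag or a partition surfaces as a timeout
		case &lt;-r.applied:
		}
	}
	return r.query(), nil
}
</code></pre>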
<h1>Replica heartbeat read</h1>
<p>This is a combination of the leader heartbeat read and the log index-based read. In this case, the read is sent to a replica, which sends out a heartbeat to the current leader and its quorum nodes. For its response, it must receive matching term numbers as well as the latest apply index from the leader. The replica waits until its own apply index reaches that of the leader, and then it can serve the read.</p>
<p>This allows reads to be load-balanced across multiple replicas. Also, this read works even if the client did not perform the last write.</p>
<p>Downsides:</p>
<ul>
<li>The cost of a read is as high as the network cost of completing requests.</li>
<li>Replica lag or network partitions can impact read performance.</li>
</ul>
<h1>Quorum read</h1>
<p>This approach was pointed out by <a href="https://github.com/huesflash" target="_blank" rel="noopener noreferrer">@huesflash</a>. It was <a href="https://www.usenix.org/conference/hotstorage19/presentation/charapko" target="_blank" rel="noopener noreferrer">presented at HotStorage '19</a> by <em>Aleksey Charapko, Ailidani Ailijiang, and Murat Demirbas</em>. In this approach, the client multicasts the quorum-read request to a majority quorum of replicas and waits for their replies. At this point, it knows the highest accepted slot. It then waits for any of the replicas to apply that slot. Following this, the client can read from that replica.</p>
<p>This approach can be mapped to the generalized consensus model by multicasting the reads to enough replicas to ensure that they cover the revocation requirements of all possible primaries. After that, the approach described above can be used to fetch a consistent read.</p>
<p>We have not seen this approach used in practice. Theoretically, we believe it has the following trade-offs:</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="upsides">Upsides<a href="https://multigres.com/blog/generalized-consensus-part9#upsides" class="hash-link" aria-label="Direct link to Upsides" title="Direct link to Upsides">​</a></h3>
<ul>
<li>It has the same advantages as the log index based read.</li>
<li>Unlike the log index based read, it can be used by any client.</li>
</ul>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="downsides">Downsides<a href="https://multigres.com/blog/generalized-consensus-part9#downsides" class="hash-link" aria-label="Direct link to Downsides" title="Direct link to Downsides">​</a></h3>
<ul>
<li>Network packet amplification: the number of packets sent scales with the number of replicas that must be reached. Some cloud providers limit packets sent per second, and you may hit this limit sooner than with other methods.</li>
<li>There is a possibility that the highest accepted slot may later be abandoned due to a failover. If so, the client will need to detect this and take corrective action.</li>
</ul>
<h1>Eventually consistent reads</h1>
<p>If the application can tolerate stale reads, those reads can be directed to any replica. There are many use cases where a certain level of staleness is acceptable. Based on this, we recommend setting a staleness tolerance and having the system reject reads that exceed this limit.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="navigating-the-series">Navigating the Series<a href="https://multigres.com/blog/generalized-consensus-part9#navigating-the-series" class="hash-link" aria-label="Direct link to Navigating the Series" title="Direct link to Navigating the Series">​</a></h2>
<p>In the next part, we'll cover topics we deferred, like health checks, term numbers, etc.</p>
<ul>
<li><a href="https://multigres.com/blog/generalized-consensus">Full Series Overview</a></li>
</ul>]]></content:encoded>
            <category>planetpg</category>
            <category>consensus</category>
            <category>distributed-systems</category>
            <category>durability</category>
        </item>
        <item>
            <title><![CDATA[Generalized Consensus: Changing the Rules]]></title>
            <link>https://multigres.com/blog/generalized-consensus-part8</link>
            <guid>https://multigres.com/blog/generalized-consensus-part8</guid>
            <pubDate>Sat, 01 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[This section discusses ways to change the durability rules for a consensus system. As you will see below, any approach that respects the governing rules is sufficient to safely effect these changes.]]></description>
            <content:encoded><![CDATA[<p>This section discusses ways to change the durability rules for a consensus system. As you will see below, any approach that respects the governing rules is sufficient to safely effect these changes.</p>
<p>So far, we have analyzed fulfillment of requests and leadership changes for a consensus system. In reality, these two actions alone are not sufficient to maintain long-running clusters. In addition to these, we also need the following capabilities:</p>
<ul>
<li>Adding and removing nodes to the cohort.</li>
<li>Changing the durability rules</li>
<li>Adding and removing agents</li>
</ul>
<p>The ability to add and remove agents is already satisfied since the proposed approach has no explicit constraints on them. However, there was an implicit assumption that the agents knew the current durability rules. If these rules are going to change, we need to discuss how the agents will learn about these changes and maintain the cluster's safety.</p>
<p>Conceptually, adding and removing nodes to the cohort is a change in the durability rules. We wanted to list them out separately because they are different use cases. Otherwise, the general approach of changing durability rules should work equally well for adding and removing nodes.</p>
<p>We will present two approaches for changing the durability rules.</p>
<h1>Policy vs Rules</h1>
<p>So far, we have not explicitly distinguished between the terms 'policy' and 'rules.' They are subtly different: A policy is an abstract requirement. For example, “I want my data to be in more than one AZ” is a policy. When the policy is combined with the list of nodes in the cohort, it results in a set of rules.</p>
<p>A change of policy may require you to install a new plugin. This type of change will be out of scope for this discussion. Any type of rule change that a single plugin can handle is in scope.</p>
<p>We will call this set of rules the <code>ruleset</code>. This ruleset must be known and understood by all agents. Additionally, since changes to rulesets are treated as distributed decisions, they must also reach quorum, which means that each cohort node must store the ruleset.</p>
<p>This also makes the cohort nodes the authoritative source for rulesets.</p>
<p>A coordinator can be initialized by pointing it at one of the nodes of the cohort. From that node, it can fetch the ruleset and the current term. Using this information, it can discover the rest of the nodes in the cohort.</p>
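<p>As a sketch, that bootstrap sequence could look like the following. <code>NodeClient</code> and its methods are hypothetical stand-ins for whatever RPC surface the cohort nodes expose:</p>
<pre><code class="language-go">package bootstrap

// Ruleset is modeled only as far as this sketch needs: it carries the
// cohort membership, which is what makes discovery fall out directly.
type Ruleset struct {
    CohortNodes []string
}

// NodeClient is a hypothetical handle to a single cohort node.
type NodeClient interface {
    FetchRuleset() (Ruleset, error)
    FetchTerm() (uint64, error)
}

// Bootstrap initializes a coordinator from one known node: fetch the
// ruleset and the current term, then discover the rest of the cohort
// from the ruleset itself.
func Bootstrap(seed NodeClient) (Ruleset, uint64, error) {
    rs, err := seed.FetchRuleset()
    if err != nil {
        return Ruleset{}, 0, err
    }
    term, err := seed.FetchTerm()
    if err != nil {
        return Ruleset{}, 0, err
    }
    return rs, term, nil // rs.CohortNodes lists the remaining nodes
}
</code></pre>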
<h1>Coordinator method</h1>
<p>We can use the coordinator to modify the ruleset. For this, we have to interpret and apply the rules for the type of change we are making.</p>
<p>Using our running example cluster, let us assume that we want to change the leadership rules for N1 from “both N2 and N3” to “either N2 or N3”. We will call these rulesets rs1 and rs2, respectively.</p>
<p><img decoding="async" loading="lazy" alt="Figure 1: Ruleset change validation" src="https://multigres.com/assets/images/part08-fig1-f438c2f3811485eca79bd898415c5866.svg" width="1680" height="755" class="img_qM5N"></p>
<p>The coordinator performs the same actions as a leader change, but validates the recruited nodes for revocation and candidacy against both rulesets, as shown in Figure 1. Additionally, instead of inserting a standard <code>completion</code> event, it inserts a special <code>ruleset change</code> event.</p>
<p>Every node that receives and applies this ruleset change event updates its ruleset accordingly. If N1 is the new leader, it changes its behavior to “either N2 or N3” for all subsequent requests.</p>
<p><em>It is actually sufficient if the coordinator satisfies rs1. However, recovery from subsequent failures will need to satisfy both rulesets. For uniformity, it is preferable to apply both rulesets to all situations.</em></p>
<p><em>Figure 3a: Ruleset change scenario 1 (animation)</em></p>
<p>The animation above shows an example where a coordinator in term 6 propagates N3 to N1 and N2, thereby satisfying rs1 and rs2 for N1’s candidacy.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="corner-case">Corner case<a href="https://multigres.com/blog/generalized-consensus-part8#corner-case" class="hash-link" aria-label="Direct link to Corner case" title="Direct link to Corner case">​</a></h3>
<p>Suppose the coordinator made the ruleset change durable and delegated leadership to N1. This allows N1 to apply the change and proceed. At this stage, if N3 gets partitioned, N1 can still complete requests because it now uses rs2, which can be satisfied with an ack from N2.</p>
<p>Let's consider a scenario where a different coordinator (C7) assumes the system is still using rs1 and attempts to change leadership. From its perspective, recruiting N3 is enough to revoke N1’s leadership. This recruitment leads to the discovery of a pending ruleset change in the logs. This discovery informs the coordinator that it needs to recruit N2 to revoke N1’s leadership successfully.</p>
<p>There are two possibilities here:</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="scenario-1">Scenario 1<a href="https://multigres.com/blog/generalized-consensus-part8#scenario-1" class="hash-link" aria-label="Direct link to Scenario 1" title="Direct link to Scenario 1">​</a></h4>
<p>After the new ruleset rs2 becomes durable, N1 is promoted and completes additional requests. It uses only N2 for its acks, which is sufficient to satisfy rs2.</p>
<p>C7 assumes rs1 is currently active. It recruits N3, which it thinks is sufficient to revoke N1’s leadership. However, it notices the ruleset change in the unapplied logs. Therefore, it must continue its revocation and also recruit N2. Recruitment of N2 leads to the discovery of a more progressed timeline. It must therefore propagate N2’s timeline instead of N3’s. In this case, it can use rs2 because the ruleset change has already been applied. At this point, it will realize that the minimum conditions are already met, and it can delegate leadership of the 7th term back to N1. N1 will eventually propagate the changes to N3. This scenario is shown in the animation below:</p>
<p><em>Figure 3b: Ruleset change scenario 2 (animation)</em></p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="scenario-2">Scenario 2<a href="https://multigres.com/blog/generalized-consensus-part8#scenario-2" class="hash-link" aria-label="Direct link to Scenario 2" title="Direct link to Scenario 2">​</a></h4>
<p>In this scenario, let us assume that no further progress was made after rs2 became durable.</p>
<p>The story starts off the same as scenario 1: C7 recruits N3, discovers the ruleset change, which makes it recruit N2. This time, it discovers the same timeline as N3. This allows it to append a completion event for the 7th term. However, the log now contains a mix of events from both rulesets. Therefore, its propagation must satisfy both rulesets: the requests must reach N1, N2 and N3. This scenario is shown in the animation below:</p>
<p><em>Figure 3c: Ruleset change scenario 3 (animation)</em></p>
<p>In reality, the coordinator would try to recruit all nodes. We presented it as a two-step process to demonstrate safety. If it could only recruit N3 and not N2, it would mean N2 was unreachable, which would cause the attempt to fail.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="summary-of-rules">Summary of rules<a href="https://multigres.com/blog/generalized-consensus-part8#summary-of-rules" class="hash-link" aria-label="Direct link to Summary of rules" title="Direct link to Summary of rules">​</a></h3>
<p>A coordinator that intends to change leadership must perform an initial discovery using its last known ruleset.</p>
<p>Among the discovered nodes, it must obtain the ruleset of the most advanced node. It must also inspect the logs for any changes to the ruleset. If changes are present, then its recruitment and propagation must satisfy the ruleset of the current node as well as the rulesets present in the log.</p>
<p>For this to work, we need to make one change to the node’s behavior: upon recruitment, the node should return the current term number <em>as well as the current ruleset</em>. The coordinator must correspondingly preserve the last known ruleset.</p>
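<p>Here is a sketch of the amended exchange, with illustrative types: the recruitment reply carries the node’s term and ruleset, and the coordinator gathers every ruleset that its recruitment and propagation must satisfy:</p>
<pre><code class="language-go">package recruit

// Ruleset is treated as opaque here; only its identity matters.
type Ruleset struct{ ID string }

// RecruitReply is what a node returns upon recruitment: its current term,
// its current ruleset, and any ruleset-change events in its unapplied logs.
type RecruitReply struct {
    Term           uint64
    Ruleset        Ruleset
    PendingChanges []Ruleset
}

// RulesetsToSatisfy returns every ruleset the coordinator must honor:
// the ruleset of the most advanced node, plus all pending changes
// discovered in the logs. It assumes at least one successful reply.
func RulesetsToSatisfy(replies []RecruitReply) []Ruleset {
    best := replies[0]
    for _, r := range replies[1:] {
        if r.Term > best.Term {
            best = r
        }
    }
    out := []Ruleset{best.Ruleset}
    for _, r := range replies {
        out = append(out, r.PendingChanges...)
    }
    return out
}
</code></pre>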
<h1>Leader Method</h1>
<p>One problem with the coordinator method is that it is disruptive because it requires revoking the previous leadership. However, there is a way to implement this exact ruleset change with no disruption in traffic.</p>
<p>For this, we issue a request to the leader for the ruleset change. The leader fulfills this like any other request. The only difference is that the quorum rules for this specific request are expanded to include both rulesets. This is the same rule that was followed by the coordinator method. Once this is applied, the leader can switch to the new ruleset.</p>
<p>In this case, there is no change in the term number. Other than the different quorum rules, there is nothing special about this request.</p>
<p>If a failure occurs during this process, the above coordinator method can be used to appoint a new leader safely.</p>
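<p>One way to make the expanded quorum concrete is to represent a ruleset, purely for illustration, as a list of alternative node sets, any one of which satisfies the leader’s durability requirement. In this modeling, “both N2 and N3” becomes <code>Ruleset{{"N2", "N3"}}</code> and “either N2 or N3” becomes <code>Ruleset{{"N2"}, {"N3"}}</code>:</p>
<pre><code class="language-go">package quorum

// Ruleset lists alternative node sets; fully acking any one alternative
// satisfies the durability requirement.
type Ruleset [][]string

// Satisfied reports whether the acks received so far meet the ruleset.
func Satisfied(acks map[string]bool, rs Ruleset) bool {
    for _, alt := range rs {
        ok := true
        for _, node := range alt {
            if !acks[node] {
                ok = false
            }
        }
        if ok {
            return true
        }
    }
    return false
}

// RulesetChangeDurable applies the expanded rule for this one request:
// the acks must satisfy the old ruleset as well as the new one.
func RulesetChangeDurable(acks map[string]bool, oldRS, newRS Ruleset) bool {
    if Satisfied(acks, oldRS) {
        return Satisfied(acks, newRS)
    }
    return false
}
</code></pre>
<p>The planned leadership change described next uses the same idea: the quorum predicate for that one special request additionally includes the intended leader’s rules.</p>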
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="planned-leadership-change">Planned leadership change<a href="https://multigres.com/blog/generalized-consensus-part8#planned-leadership-change" class="hash-link" aria-label="Direct link to Planned leadership change" title="Direct link to Planned leadership change">​</a></h3>
<p>The request-based approach of changing rulesets can also be used to make planned leadership changes. In this case, we create a special request called <code>leadership change</code>, and the quorum rules are expanded to include those of the intended leader.</p>
<p>Once the request is successfully applied, the current leader can step down to be a follower. The intended leader will observe this event and promote itself as the leader. The followers will also start expecting requests from the new leader as they see this event.</p>
<p>Again, there is no need to start a new term number for this method.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="adding-and-removing-cohort-nodes">Adding and removing cohort nodes<a href="https://multigres.com/blog/generalized-consensus-part8#adding-and-removing-cohort-nodes" class="hash-link" aria-label="Direct link to Adding and removing cohort nodes" title="Direct link to Adding and removing cohort nodes">​</a></h3>
<p>Adding and removing cohort nodes is, in fact, a special case of a ruleset change, because the ruleset contains the list of cohort nodes. There are policies where the addition or removal of a node changes the quorum rules of a leader; a majority quorum is one such example. If so, that has to be taken into account while applying this special event.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="navigating-the-series">Navigating the Series<a href="https://multigres.com/blog/generalized-consensus-part8#navigating-the-series" class="hash-link" aria-label="Direct link to Navigating the Series" title="Direct link to Navigating the Series">​</a></h2>
<p>In the next part, we'll cover how to serve consistent reads.</p>
<ul>
<li><a href="https://multigres.com/blog/generalized-consensus">Full Series Overview</a></li>
</ul>]]></content:encoded>
            <category>planetpg</category>
            <category>consensus</category>
            <category>distributed-systems</category>
            <category>durability</category>
        </item>
        <item>
            <title><![CDATA[Generalized Consensus: Discovery and Propagation]]></title>
            <link>https://multigres.com/blog/generalized-consensus-part7</link>
            <guid>https://multigres.com/blog/generalized-consensus-part7</guid>
            <pubDate>Fri, 31 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[This section covers the second part of a leadership change: The discovery of the best timeline and its propagation to prepare for the next leadership.]]></description>
            <content:encoded><![CDATA[<p>This section covers the second part of a leadership change: The discovery of the best timeline and its propagation to prepare for the next leadership.</p>
<p>Rules covered in this section:</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="rules">Rules<a href="https://multigres.com/blog/generalized-consensus-part7#rules" class="hash-link" aria-label="Direct link to Rules" title="Direct link to Rules">​</a></h3>
<ol>
<li>Durability: Every decision is a distributed decision.
<ol>
<li>A distributed decision must be made durable.</li>
<li>A decision that has been made durable can be applied.</li>
</ol>
</li>
<li>Consistency: Decisions must be applied sequentially.
<ol>
<li>(skipped)</li>
<li>Every agent must discover decisions that were previously made durable, but not applied, and honor them. Clarifications:
<ol>
<li>There are situations where it will be impossible to know if a decision met the durability criteria. If so, the agent must honor such decisions because they might have been applied.</li>
<li>Decisions that get honored must be made durable and applied as a new decision made by the current agent (rule 1).</li>
<li>Inference: If an agent discovers multiple conflicting timelines, the newest one must be chosen.</li>
</ol>
</li>
</ol>
</li>
</ol>
<p>In the previous section, we covered how new coordinators follow rule 2a, which essentially guarantees that only one of them can take action at a given point in time. We discussed revocation and candidacy.</p>
<p>In this post, we will discuss:</p>
<ul>
<li>Discovery of timelines</li>
<li>Propagation</li>
<li>Establishment of leadership</li>
</ul>
<h1>Discovery</h1>
<p>The act of revocation has a serendipitous side effect: it also lets you discover all completed requests. The recruited nodes were the ones the leader needed in order to complete its requests. By definition, at least one of those nodes must contain every request that was completed.</p>
<p>Beyond the completed requests, some of those nodes may also contain requests that were attempted.</p>
<p><img decoding="async" loading="lazy" alt="Generalized durability policy" src="https://multigres.com/assets/images/genpolicy-109eb884854b148b5c198fc4b068a56e.svg" width="323" height="572" class="img_qM5N"></p>
<p><img decoding="async" loading="lazy" alt="Figure 1: Discovery" src="https://multigres.com/assets/images/part07-fig1-0a1b5bfe5dfe370b18cc840379b2a7b8.svg" width="591" height="350" class="img_qM5N"></p>
<p>Let us take the example in Figure 1. The durability rules are the same as the previous examples: N1 requires requests to reach both N2 and N3 for completion. In the above scenario:</p>
<ul>
<li>N1 has completed A. This request must not be lost.</li>
<li>B has met the durability criteria. This request must also not be lost.</li>
<li>C and D have not met the durability criteria.</li>
</ul>
<p>If the coordinator manages to recruit all three nodes, it will know the whole truth: Requests A and B must be completed. C and D can be discarded. This is the hard requirement from rule 2b.</p>
<p>The follow-up question is: Is there harm in also completing C and D? There is no harm. After all, they were valid requests that the leader was trying to complete. We will need and use this flexibility in other failure scenarios.</p>
<p>Suppose there is a network partition, and the coordinator is only able to recruit N3, with no visibility into N1 or N2. Based on the log information, all it can infer is that A and B might have been applied. This is where we use clarification 2b(i): we honor A and B.</p>
<p>If the coordinator recruits N2, the same logic applies. But in this case, all it can infer is that A, B, and C might have been applied. Here, we honor A, B, and C.</p>
<p>If the coordinator recruits N2 and N3, it knows that C was not complete, and it has the option of discarding it. In this case, we can choose to honor just A and B, or A, B, and C. However, a general rule to honor the most progressed timeline is safe and simpler.</p>
<p><em>The outcome of what will be honored after a failure is non-deterministic: If N3 were the only discovered node, C would be abandoned. If N2 were discovered, C would be included in the recovery. However, B will not be abandoned because it has already met the durability criteria.</em></p>
<p>This algorithm would be simple if a coordinator always succeeded in establishing leadership. However, multiple failures can occur during propagation. If that happens, newer coordinators may see conflicting timelines. We will discuss these scenarios after analyzing propagation.</p>
<h1>Propagation</h1>
<p>For a leader, the <code>decisions</code> it was fulfilling were <code>requests</code>.</p>
<p>A coordinator’s intent is not to fulfill requests. The <code>decision</code> it needs to fulfill is to establish a leadership using the <code>timeline</code> it has selected.</p>
<p>If the rules from the previous post were followed, the coordinator would have already recruited the candidacy nodes into the current term number. The goal now is to propagate this timeline to those nodes. Once this occurs, it can delegate its term to the candidate. This will establish its leadership, allowing the new leader to begin accepting external requests.</p>
<p>Since a timeline includes multiple requests, the standard action performed by a leader cannot be used for propagation: a leader has the right to apply each request individually, whereas, according to Rule 1, the entire timeline must be made durable before it is applied.</p>
<p>Before discussing implementation options, let's briefly review Rule 2b(ii). It states that propagation must be made durable as a new decision. This means that decisions should be versioned and their sequence should be known. We need to do this because we assume these attempts may fail. If they do, we must know the order in which these propagations were attempted. Without this, we cannot apply Rule 2b(iii).</p>
<p>This gives rise to a few implementation choices:</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="the-paxos-way">The Paxos way<a href="https://multigres.com/blog/generalized-consensus-part7#the-paxos-way" class="hash-link" aria-label="Direct link to The Paxos way" title="Direct link to The Paxos way">​</a></h3>
<p>Paxos is a protocol meant for finalizing a single value. The way to reconcile this with logs that have multiple values is to treat each timeline as a composite value. Our goal will be to finalize a chosen timeline.</p>
<p>For those who are not familiar with Paxos, the protocol actually needs to track three variables:</p>
<ol>
<li>The proposal number the node has agreed to participate in</li>
<li>The current value</li>
<li>The proposal number that was used to accept the value</li>
</ol>
<p>The proposal number for the value is stored in a variable different from the proposal number that was accepted from the <code>prepare</code> request. We will now explain why it should be tracked separately.</p>
<p>Figure 2 shows an illustration of how this works.</p>
<p><img decoding="async" loading="lazy" alt="Figure 2: Timeline propagation" src="https://multigres.com/assets/images/part07-fig2-d3fddea44d33ed86a2d26edd5a99bfc4.svg" width="860" height="400" class="img_qM5N"></p>
<p>As mentioned above, we will treat each timeline as a composite value like T0, T1, and T2, as shown in Figure 2.</p>
<p>In this scenario, let us start with N6, which holds timeline T1, consisting of two requests in its log.</p>
<p>The node has the following variables:</p>
<ul>
<li>Node name: N6</li>
<li>Node’s term: 5</li>
<li>Value: T1 (two requests)</li>
<li>Value’s term: 5. This is the new variable we are introducing, which stores the term number when the value was accepted.</li>
</ul>
<p>When C6 recruits N6, this will update the node’s term to 6. However, the value T1 and the <code>value’s term</code> 5 stay the same. At this point, if a new coordinator asks about N6, it will see the node’s term as 6 and the value as T1. But that timeline was set under term 5 and is not the correct value for term 6. That’s why we need to add a new variable to track the value’s term.</p>
<p>When C6 requests the node to change its timeline to T2, both the timeline and the value’s term are updated. This, in essence, is rule 2b(ii). The decision made by C6 to change N6’s timeline from T1 to T2 is executed as a new decision under term number 6.</p>
<p><strong>This change must be atomic</strong>: If C6 crashes while still writing T2, then no change should happen. It would not be acceptable for part of T2 to overwrite T1. In this state, it would have destroyed the previous timeline and replaced it with an incomplete portion of itself. This can lead to data loss.</p>
<p><strong>The change is authoritative</strong>: The previous timeline may conflict with the new one. No matter, the new timeline must completely overwrite the previous one.</p>
<p>An extension of this rule is that regular leadership requests also have the value’s term associated with them. But they don’t change during the completion of requests because they are all under the same term.</p>
<p><em>The value’s term must be persisted for the same reasons why the node’s term must be persisted.</em></p>
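<p>The node state for this scheme is small enough to sketch directly. This is in-memory for brevity; a real node would persist all three variables, and the timeline swap would have to be atomic on disk:</p>
<pre><code class="language-go">package paxosnode

// Node tracks the three variables described above.
type Node struct {
    term      uint64   // proposal number the node agreed to participate in
    value     []string // the timeline, treated as one composite value
    valueTerm uint64   // proposal number under which the value was accepted
}

// Recruit accepts a newer term. The value and valueTerm are untouched,
// which is exactly why valueTerm must be tracked separately.
func (n *Node) Recruit(term uint64) bool {
    if term > n.term {
        n.term = term
        return true
    }
    return false
}

// AcceptTimeline authoritatively replaces the whole timeline under the
// coordinator's term. The replacement must be all-or-nothing: a crash
// mid-write may not leave a partial new timeline over the old one.
func (n *Node) AcceptTimeline(term uint64, timeline []string) bool {
    if term != n.term {
        return false
    }
    n.value = append([]string(nil), timeline...)
    n.valueTerm = term
    return true
}
</code></pre>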
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="the-raft-way">The Raft way<a href="https://multigres.com/blog/generalized-consensus-part7#the-raft-way" class="hash-link" aria-label="Direct link to The Raft way" title="Direct link to The Raft way">​</a></h3>
<p>Raft has a different approach.</p>
<p>Raft does not store the value’s term as a separate variable. Instead, each request includes a term number, which is part of the log. When it comes to completing requests, Raft and Paxos are equivalent. One could say that Paxos is more storage-efficient than Raft, because storing a single value’s term is functionally equivalent to what Raft achieves by storing the term number for every request.</p>
<p>But they differ on how the timeline is propagated. Let us take the following animated example:</p>
<p><em>Figure 3: Raft timeline propagation (animation)</em></p>
<p>Let us assume that term 7 is trying to propagate N1’s timeline ABCD to N6, which initially has AB in its log.</p>
<p>In Raft, the log is propagated non-atomically. Since N1 has two additional entries, it can propagate them to N6 in two steps (steps 1 and 2). Additionally, the term number associated with those log entries remains the old term 5. This appears to violate rule 2b(ii), according to which propagation must use the latest term number.</p>
<p>However, the rule is valid, and Raft follows it. The reason: Raft has an addendum to how it implements durability. The updated rule is as follows:</p>
<p><em>For a request to be durable, it must reach quorum. Additionally, the term number of the request must match the current term.</em></p>
<p>In other words, from a new term’s perspective, events from all previous terms are considered non-durable. They only become durable when a new event using the current term is appended to the logs. This requires the entire timeline to become durable under the new term before it can be applied.</p>
<p>In step 3, a new request with term 7 is created and replicated. This is what makes the timeline meet the term-number matching requirement for durability. Once the necessary followers have also received the amended timeline, it can be safely applied. This behavior meets the requirements of rule 2b(ii).</p>
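<p>The addendum can be sketched as a predicate. The types are illustrative; <code>logs[0]</code> stands for the leader’s log, and Raft’s log-matching property lets us treat entries at equal indexes as equal:</p>
<pre><code class="language-go">package raftrule

// Entry is one log record carrying the term under which it was created.
type Entry struct {
    Term uint64
    Op   string
}

// Durable applies the amended rule: the entry must be present on a quorum
// of logs, and it must carry the current term. Entries from older terms
// become durable only transitively, once a current-term entry appended
// after them reaches quorum.
func Durable(logs [][]Entry, index int, currentTerm uint64, quorumSize int) bool {
    replicas := 0
    for _, l := range logs {
        if len(l) > index {
            replicas++
        }
    }
    if replicas >= quorumSize {
        if len(logs[0]) > index {
            return logs[0][index].Term == currentTerm
        }
    }
    return false
}
</code></pre>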
<p>We intentionally left a gap to accommodate a complication that term 6 might have brought in, which term 7 is unaware of. We will cover this in a later section.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="the-timestamp-way">The timestamp way<a href="https://multigres.com/blog/generalized-consensus-part7#the-timestamp-way" class="hash-link" aria-label="Direct link to The timestamp way" title="Direct link to The timestamp way">​</a></h3>
<p>Sometimes, you might not have control over the data you can add to the log replication. The specific use case is Postgres WAL replication: There is no simple way to add extra metadata, like a term number, to that log.</p>
<p>However, WAL commit events have timestamps. Assuming that clock skews are within tolerable limits, these timestamps can serve the same purpose as term numbers: they mark the order in which decisions are made.</p>
<p>This means that algorithms that resolve conflicting timelines will work equally well if we use event timestamps instead of term numbers.</p>
<p>In part 5 of our series, we discussed an alternate way of using locks and timeouts to enforce the sequencing of coordinators. Combining this timestamp method with the locks and timeouts approach creates a complete system that satisfies all our rules. This combination eliminates the need for term numbers entirely.</p>
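<p>Under those assumptions, the comparison that term numbers normally provide could be sketched as follows. The field names are illustrative; the LSN is Postgres’s log sequence number:</p>
<pre><code class="language-go">package tsorder

import "time"

// WALTimeline summarizes one node's WAL for comparison purposes.
type WALTimeline struct {
    LastCommitAt time.Time // timestamp of the newest commit event
    LSN          uint64    // how far the WAL has progressed
}

// Newer reports whether timeline a should be preferred over b: the latest
// commit timestamp plays the role of the term number, with WAL progress
// as the tie-breaker.
func Newer(a, b WALTimeline) bool {
    if a.LastCommitAt.After(b.LastCommitAt) {
        return true
    }
    if a.LastCommitAt.Equal(b.LastCommitAt) {
        return a.LSN > b.LSN
    }
    return false
}
</code></pre>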
<h1>Discovery revisited</h1>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="selecting-timelines">Selecting Timelines<a href="https://multigres.com/blog/generalized-consensus-part7#selecting-timelines" class="hash-link" aria-label="Direct link to Selecting Timelines" title="Direct link to Selecting Timelines">​</a></h3>
<p>In a network with intermittent failures, multiple coordinator attempts can fail, and each time, a coordinator may get to see only a subset of the nodes. Over time, a coordinator may see a variety of timelines and has to ensure that it chooses a safe one that does not violate the requirements of durability and consistency.</p>
<p>The rules for selecting a safe timeline are simple:</p>
<ul>
<li>The coordinator must recruit enough nodes to ensure that all possible leaderships are revoked. If this is not possible, then no progress can be made.</li>
<li>Among the recruited nodes, the timeline with the latest decision (term) is always safe.</li>
<li>If there are multiple timelines with the same term, then the most progressed timeline is always safe.</li>
</ul>
<p>The reasoning is as follows:</p>
<ul>
<li>Every previous decision that was made was a safe one for that term. This applies recursively back to the oldest decision.</li>
<li>The last discovered decision might have reached durability. It is even possible that some nodes have started applying that decision. This possibility makes all decisions previous to the last one unsafe.</li>
</ul>
<p>What if we encounter a timeline that is more progressed than a newer decision? This only means that the timeline did not reach durability. Otherwise, the newer decision would have honored it. But now, we have to discard that progressed timeline because there is a chance that the newer decision has been applied already.</p>
<p>What if there exists a decision that is newer, but we don’t see it among the recruited nodes? If so, the decision did not reach durability, and need not be honored. We can choose the most appropriate timeline among those we discovered, and make sure to propagate it as the newest decision.</p>
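<p>These selection rules reduce to a short comparison. In the sketch below, each discovered timeline is summarized, illustratively, by the term of its latest decision and by how far it has progressed:</p>
<pre><code class="language-go">package selection

// Timeline summarizes one discovered node's log.
type Timeline struct {
    LatestTerm uint64 // term of the newest decision on the timeline
    Length     int    // how progressed the timeline is
}

// SelectSafe picks the safe timeline: the highest latest term wins, and
// the most progressed timeline wins among equals. It assumes at least
// one timeline was discovered.
func SelectSafe(discovered []Timeline) Timeline {
    best := discovered[0]
    for _, t := range discovered[1:] {
        if t.LatestTerm > best.LatestTerm {
            best = t
        } else if t.LatestTerm == best.LatestTerm {
            if t.Length > best.Length {
                best = t
            }
        }
    }
    return best
}
</code></pre>
<p>In the timeline-priority example later in this post, 5568 (latest term 8) beats the longer 55557 (latest term 7): the highest term wins before length is even considered.</p>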
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="failure-scenarios">Failure scenarios<a href="https://multigres.com/blog/generalized-consensus-part7#failure-scenarios" class="hash-link" aria-label="Direct link to Failure scenarios" title="Direct link to Failure scenarios">​</a></h3>
<p>We will now cover the following failure scenarios:</p>
<ol>
<li>A coordinator may not be able to reach enough nodes to make any progress.</li>
<li>A coordinator may attempt to propagate a timeline and fail before making it durable.</li>
<li>A coordinator may attempt a timeline that differs from the previous one, try to propagate it, and fail.</li>
<li>A coordinator may succeed at propagating a timeline, but fail before promoting the leader.</li>
<li>A final coordinator may see all these attempts and must make a decision that does not compromise safety.</li>
</ol>
<p>In the above sequence of failures, the most critical requirement is that attempt 5 must successfully discover attempt 4 and honor it.</p>
<p>We will analyze these scenarios assuming that we are using the Raft method of propagation. However, the strategy will work for all methods.</p>
<div><div class="animated-svg-container" style="display:block;margin:1rem 0;overflow:visible" role="img" aria-label="Figure 4: Initial state"></div></div>
<p>Let us restart with the example shown in the Raft section:</p>
<ul>
<li>N1 is the primary at term 5. It has received requests ABCD.</li>
<li>N2 is a quorum requirement for N1. It has received requests ABC.</li>
<li>N3 is a quorum requirement for N1. It has received requests AB.</li>
<li>N5 is not a quorum requirement of N1 and has received A.</li>
<li>N4 and N6 are not quorum requirements of N1 and have both received AB.</li>
</ul>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="scenario-1">Scenario 1<a href="https://multigres.com/blog/generalized-consensus-part7#scenario-1" class="hash-link" aria-label="Direct link to Scenario 1" title="Direct link to Scenario 1">​</a></h3>
<p><em>Scenario 1 is a no-op: the coordinator cannot make any progress.</em></p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="scenario-2">Scenario 2<a href="https://multigres.com/blog/generalized-consensus-part7#scenario-2" class="hash-link" aria-label="Direct link to Scenario 2" title="Direct link to Scenario 2">​</a></h3>
<p><em>A coordinator may attempt to propagate a timeline and fail before making it durable.</em></p>
<p><em>Figure 5: Scenario 2 (animation)</em></p>
<p>C6 recruits N3, N4 and N5:</p>
<ul>
<li>N3 &amp; N4 for revocation</li>
<li>N4 &amp; N5 for candidacy of N4</li>
</ul>
<p>C6 crashes after propagating N3 to N5.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="scenario-3">Scenario 3<a href="https://multigres.com/blog/generalized-consensus-part7#scenario-3" class="hash-link" aria-label="Direct link to Scenario 3" title="Direct link to Scenario 3">​</a></h3>
<p><em>A coordinator may attempt a timeline that is different from the previous one, try to propagate it, and fail.</em></p>
<p><em>Figure 6: Scenario 3 (animation)</em></p>
<p>C7 recruits N1, N4 and N6:</p>
<ul>
<li>N1 &amp; N4 for revocation</li>
<li>N4 &amp; N6 for candidacy of N4</li>
</ul>
<p>In this scenario, C7 did not discover any of C6’s activity. Based on what it discovered, it decides to propagate N1 to N6.</p>
<p>C7 crashes at this point.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="scenario-4">Scenario 4<a href="https://multigres.com/blog/generalized-consensus-part7#scenario-4" class="hash-link" aria-label="Direct link to Scenario 4" title="Direct link to Scenario 4">​</a></h3>
<p><em>A coordinator may succeed at propagating a timeline, but fail before promoting the leader.</em></p>
<p><em>Figure 7: Scenario 4 (animation)</em></p>
<p>Let us now assume that Coordinator 8 (C8) attempts another leadership change. Let us also assume that it recruits the same nodes that C6 recruited. It will discover the following terms in the timelines, where each digit is the term number of one log entry:</p>
<ul>
<li>N3: 556</li>
<li>N5: 556</li>
<li>N4: 55</li>
</ul>
<p>From this, C8 infers that C6 tried to propagate timeline AB (55), which makes it a legitimate decision. It propagates <code>6:ok</code> to N4. Following this, it appends <code>8:ok</code> to N5, and propagates it to N3 and N4.</p>
<p>This action makes the timeline durable. C8 can delegate leadership to N4, which can then apply this timeline and request that N5 and N3 apply it as well.</p>
<p>But let us assume that C8 crashes at this point.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="scenario-5">Scenario 5<a href="https://multigres.com/blog/generalized-consensus-part7#scenario-5" class="hash-link" aria-label="Direct link to Scenario 5" title="Direct link to Scenario 5">​</a></h3>
<p><em>A final coordinator may see all these attempts and must make a decision that does not compromise safety.</em></p>
<p><em>Figure 8: Scenario 5 (animation)</em></p>
<p>After scenario 4, the cluster’s state is as shown in the animation above, which shows three distinct timelines.</p>
<p>In this particular scenario, the coordinator C9 sees all the nodes. It can see that N6 has a more progressed timeline. However, its term is lower than the highest term so far, which is 8.</p>
<p>It could make a "smart" inference and choose N6's timeline. However, the most safe decision would be to choose the timeline with the highest term. This is because choosing the highest term can never be wrong.</p>
<p>The animation shows the outcome of C9 choosing the timeline of N4 that is on term 8. You will also notice that the propagation overwrites any conflicting timelines by truncating the logs of the targets as needed.</p>
<p>Let us repeat the reasoning from above using this specific example:</p>
<ul>
<li>When C6 made its decision, that decision was based on its visibility. Even though it did not discover the most progressed timeline, its decision was valid because it satisfied the requirements of revocation and candidacy, which transitively satisfies the durability requirements.</li>
<li>C7 also made a decision, but it did not discover the actions of C6. That means that C6 failed at reaching quorum. C7 had the authority to choose the most progressed timeline among the nodes it recruited, which it propagated to N6.</li>
<li>C8 discovered artifacts of C6, but not of C7. That only means that C7 also failed at reaching quorum. However, C8 does not know that there was even a C7. From its point of view, it sees the work by C6. For safety, it has to assume that C6 might have reached quorum. So, it must honor every action taken by term 6. This time, C8 succeeds at reaching quorum.</li>
<li>We finally come to C9, which may discover any combination of the above nodes. However, every combination is guaranteed to include the work done by C8.</li>
</ul>
<p>In other words, we expect each term to make a safe decision. This remains true even if the decision conflicts with a previous term’s decision. For a new term, the only safe option is to honor the actions of the most recent term among those discovered.</p>
<p>This, in essence, is rule 2b(iii).</p>
<p><img decoding="async" loading="lazy" alt="Figure 6: Timeline Priority" src="https://multigres.com/assets/images/part07-fig6-5cd4303e26ac4894db1d71a3dfbbb73b.svg" width="258" height="330" class="img_qM5N"></p>
<p>The timeline selection priority is as shown in Figure 6. If timeline 5568 were not discovered, 55557 would be chosen, and so on.</p>
<p>If N4 had applied its timeline before C9 intervened, the system would stay consistent, and the end result would remain the same. The only difference is that the applied indexes would be at different points.</p>
<p>As mentioned earlier, the action that supersedes these timelines must be either non-destructive or atomic. In other words, events A and B should not be deleted before accepting the new timeline. Instead, anything following A and B should be truncated, and any remaining events from the source should be appended after the truncation.</p>
<p>At this point, C9 can delegate its term to N4, allowing it to accept new requests.</p>
<h1>Intermission</h1>
<p>This completes the parts of a consensus system that are traditionally required to prove correctness. However, we will discuss a few more points in upcoming blog posts; these are necessary for a consensus system to work effectively.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="navigating-the-series">Navigating the Series<a href="https://multigres.com/blog/generalized-consensus-part7#navigating-the-series" class="hash-link" aria-label="Direct link to Navigating the Series" title="Direct link to Navigating the Series">​</a></h2>
<p>In the next part, we'll show how the governing rules can be used to safely make changes to the durability rules of a system.</p>
<ul>
<li><a href="https://multigres.com/blog/generalized-consensus">Full Series Overview</a></li>
</ul>]]></content:encoded>
            <category>planetpg</category>
            <category>consensus</category>
            <category>distributed-systems</category>
            <category>durability</category>
        </item>
        <item>
            <title><![CDATA[Generalized Consensus: Revocation and Candidacy]]></title>
            <link>https://multigres.com/blog/generalized-consensus-part6</link>
            <guid>https://multigres.com/blog/generalized-consensus-part6</guid>
            <pubDate>Thu, 30 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[This section covers ways to satisfy the prerequisites for a leader change: Revoke previous leaderships and recruit nodes for the new leader candidate.]]></description>
            <content:encoded><![CDATA[<p>This section covers ways to satisfy the prerequisites for a leader change: Revoke previous leaderships and recruit nodes for the new leader candidate.</p>
<p>Let us reiterate the relevant part of the rules.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="rules">Rules<a href="https://multigres.com/blog/generalized-consensus-part6#rules" class="hash-link" aria-label="Direct link to Rules" title="Direct link to Rules">​</a></h3>
<ol>
<li>Durability: Every decision is a distributed decision.
<ol>
<li>A distributed decision must be made durable.</li>
<li>A decision that has been made durable can be applied.</li>
</ol>
</li>
<li>Consistency: Decisions must be applied sequentially.
<ol>
<li>Every agent must revoke the ability of all previous agents to make further progress before taking any action.
<ol>
<li>Inference: Every agent must provide a way for future agents to revoke its ability to make progress.</li>
</ol>
</li>
<li>(skipped)</li>
</ol>
</li>
</ol>
<p>In the previous post:</p>
<ul>
<li>We covered the need for term numbers and some guidelines about how they should be generated.</li>
<li>We discussed the need for nodes to participate in terms, and also covered the governing rules about what they can and cannot do.</li>
<li>We also concluded that a leader can execute multiple requests within its current term.</li>
</ul>
<p>In this post, we will conclude Rule 2a by focusing on Revocation along with its counterpart: Candidacy.</p>
<h1>Recruitment</h1>
<p>For a coordinator to successfully give instructions to a node, their term numbers must match. To enable this, the coordinator should first recruit the node to participate in its term. If the node’s own term number is lower, it will accept the recruitment and update its term number to match the coordinator’s term. Otherwise, it will reject the recruitment.</p>
<p>The coordinator need not specify a reason at the time of recruitment. It can choose what to do with the recruited nodes at a later time.</p>
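<p>The node-side rule is small enough to sketch. Persistence is elided here, though a real node must durably store the term before acknowledging:</p>
<pre><code class="language-go">package cohort

// Node holds the one piece of state that recruitment touches.
type Node struct {
    term uint64 // persisted in a real implementation
}

// HandleRecruit accepts recruitment only into a strictly newer term.
// No reason accompanies the request; the coordinator decides later
// what the recruited node will be asked to do.
func (n *Node) HandleRecruit(coordinatorTerm uint64) bool {
    if coordinatorTerm > n.term {
        n.term = coordinatorTerm
        return true
    }
    return false
}
</code></pre>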
<h1>Leader revocation</h1>
<p>To revoke an existing leadership, a coordinator can:</p>
<ol>
<li>Directly recruit the leader. This will make it relinquish its leadership and wait for further requests.</li>
<li>Recruit enough of its quorum nodes to deny it a quorum. Recruited nodes stop accepting requests from the current leader.</li>
</ol>
<p>Performing one of these actions will satisfy rule 2a.</p>
<p><em>Figure 1: Revocation methods (animation)</em></p>
<p>Recruiting the leader, if reachable, gives us the advantage of a clean shutdown. The leader can ensure that in-flight requests are completed. It could inform its callers of an impending change in leadership, among other things.</p>
<p>The advantage of the second method is that it can succeed when the leader may be unreachable. This works even if it is still attempting to process requests on the other side of a partition. In our use case, if N1 is the leader, recruiting N2 or N3 is sufficient to revoke its leadership.</p>
<p>Both these examples are illustrated in Figure 1.</p>
<h1>Coordinator revocation</h1>
<p>Rule 1 states that every decision is distributed and must be made durable. This rule also applies to coordinators.</p>
<p>We also noted that durability rules depend on which node is the leader. The coordinator’s role is to appoint a new leader, hereafter called the <code>candidate</code>. The coordinator is expected to interact with the candidate and the nodes it relies on for its quorum. The specifics of these changes will be explained in the next post. For now, we can assume that it will need to:</p>
<ul>
<li>Recruit the candidate.</li>
<li>Recruit the minimum number of nodes necessary for the candidate to fulfill requests successfully.</li>
</ul>
<p>These recruitment actions with the intent to establish leadership are what constitute the <code>Candidacy</code>. This satisfies the rule 2a(i) requirement, because this candidacy can be revoked by requesting these nodes to participate in a newer term.</p>
<p>The revocation action for the candidacy is the same as the revocation action for leadership. This is not a coincidence because revocation is achieved by disrupting the ability for decisions to be made durable, and the only durability rules that exist in the system apply to leaderships.</p>
<p><em><strong>What if the coordinator completes its work of appointing a leader before the revocation process begins?</strong> The answer to this question depends on whether we want the appointed leader to start a term that is newer than the process that is performing the revocation. Deciding to go this route makes things more complicated: As a newer agent, it must also follow rule 2a, which will be a repetition of the coordinator’s work.</em></p>
<p>It is simpler for the established leadership to inherit the same term as the coordinator. The reasoning is that, under the coordinator's assigned term, the goal is to follow all the steps needed to appoint a leader, which involves rules 2a, 2b, and 1. Once this is accomplished, the term is delegated to the leader. Because this is a delegation, the leader does not have to revoke anything. It can therefore fulfill requests by continuing to iterate on rule 1. This is also consistent with Raft, where the candidate acts as the coordinator and eventually becomes the leader, all within the same term.</p>
<p>A newer agent will be capable of revoking the progress of the above term at any stage, even long after the leadership is established.</p>
<p>This sounds suboptimal for the use case we are trying to address: A newer coordinator may unnecessarily disrupt a leadership that was just established. Since our rules do not allow the usage of elapsed time, we have to accept this as a possibility. However, there are other ways to avoid such disruptions, and we will cover those options later.</p>
<h1>All possible leaders</h1>
<p>As discussed earlier, a coordinator is unlikely to know upfront whether other coordinators are active and, if so, who their candidate is. For this reason, a coordinator must assume that there may be multiple other coordinators racing with it, and they could be aiming to promote any of the eligible leaders. Therefore, it must revoke all possible leaderships in the cohort. We will cover how to do this with an example.</p>
<h1>Overlapping nodes</h1>
<p>Can there be an overlap between the nodes that are needed for revocation and the nodes that are needed for the candidate?</p>
<p>The answer is yes. In fact, it is likely the case for most practical scenarios. Fortunately, the act of recruitment does not have to differentiate between these two intents. This is also simpler and more efficient because a single recruitment message can be sent to all the nodes in parallel.</p>
<p>Once recruited, the nodes will be asked to do different things depending on their role in the new Candidacy.</p>
<h1>Example</h1>
<p>In this example, we will first illustrate targeted revocations and then outline the requirements for a general revocation.</p>
<p><img decoding="async" loading="lazy" alt="Generalized durability policy" src="https://multigres.com/assets/images/genpolicy-109eb884854b148b5c198fc4b068a56e.svg" width="323" height="572" class="img_qM5N"></p>
<p>As a reminder, the example config is as follows:</p>
<ul>
<li>The cohort has six nodes: N1-N6.</li>
<li>Only N1 and N4 are eligible leaders.</li>
<li>Durability criteria for N1: Data must reach both N2 and N3</li>
<li>Durability criteria for N4: Data must reach either N5 or N6</li>
</ul>
<p>Let us assume that the current leader is N1 at term 5.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="revocation">Revocation<a href="https://multigres.com/blog/generalized-consensus-part6#revocation" class="hash-link" aria-label="Direct link to Revocation" title="Direct link to Revocation">​</a></h2>
<p>A coordinator decides to appoint a new leader and begins term 6, now called C6. Method 1 revocation requires recruiting N1 into term 6, which will cause N1 to step down from leadership. For method 2, recruiting the quorum nodes N2 or N3 into term 6 is sufficient. This will cause them to reject requests from N1, which is still on term 5. Both actions meet the requirements of rule 2a for the current leader.</p>
<p>This is illustrated in Figure 1. Ideally, the coordinator would try to recruit all the nodes. However, the two examples shown are sufficient for the revocation.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="candidacy">Candidacy<a href="https://multigres.com/blog/generalized-consensus-part6#candidacy" class="hash-link" aria-label="Direct link to Candidacy" title="Direct link to Candidacy">​</a></h2>
<p>For the following scenarios, we will assume that N3 has been recruited by C6 for the sake of revoking N1’s leadership.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="scenario-1-no-race">Scenario 1: no race<a href="https://multigres.com/blog/generalized-consensus-part6#scenario-1-no-race" class="hash-link" aria-label="Direct link to Scenario 1: no race" title="Direct link to Scenario 1: no race">​</a></h3>
<p>C6 must now satisfy 2a(i) by recruiting the nodes needed for candidacy. Let us assume that it chose N4 to be the candidate. Then it must recruit N4 and N5, or N4 and N6, or all three. After the recruitment, those nodes will be on term 6. In the animation below, C6 recruits N4 and N5, which is sufficient for candidacy. This action is sufficient even if the other three nodes, N1, N2, and N6, are not reachable.</p>
<p>If the network partition was what caused C6 to act, then N1 might not have known that N3 was recruited and may still think that it is the leader. But it would not be able to fulfill any requests.</p>
<p><em>Figure 2: Scenario 1 - No race (animation)</em></p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="scenario-2-newer-term-steals-the-nodes">Scenario 2: newer term steals the nodes<a href="https://multigres.com/blog/generalized-consensus-part6#scenario-2-newer-term-steals-the-nodes" class="hash-link" aria-label="Direct link to Scenario 2: newer term steals the nodes" title="Direct link to Scenario 2: newer term steals the nodes">​</a></h3>
<p>If a different coordinator decides on a newer term 7 (C7), it must attempt to revoke both terms 5 and 6. For revoking term 5, it has the same goal as C6, but does not have to follow the same method. For revoking term 6, it must recruit N4, or both N5 and N6.</p>
<p>If this happens before C6 reaches these nodes, then C6 will fail to recruit them due to them being on a higher term.</p>
<p><em>Figure 2: Scenario 2 - Newer term steals nodes (animation)</em></p>
<p>In the above example, C7 revokes N1’s leadership by recruiting N2, which is different from what C6 recruited. This is acceptable because it is still a successful revocation of N1’s leadership. C7 also revokes the candidacy for N4 by recruiting N5 and N6, which is different from what C6 recruited. This is also sufficient because C6 will fail to make progress. After all, N5, which it recruited, is now in term 7.</p>
<p>In other words, coordinators can each recruit a different set of nodes for revocation and candidacy, and they will still preserve safety.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="scenario-3-newer-term-starts-after-scenario-1">Scenario 3: newer term starts after scenario 1<a href="https://multigres.com/blog/generalized-consensus-part6#scenario-3-newer-term-starts-after-scenario-1" class="hash-link" aria-label="Direct link to Scenario 3: newer term starts after scenario 1" title="Direct link to Scenario 3: newer term starts after scenario 1">​</a></h3>
<p>If C7 started after scenario 1 finishes, it will still end up recruiting the nodes that were recruited by C6, which will prevent C6 from making further progress.</p>
<p>C6 could have completed the rest of the actions needed to establish the new leadership. If so, C7 will end up revoking that leadership.</p>
<p>The result of scenario 3 would look the same as the result of scenario 2.</p>
<p><em>Figure 2: Scenario 3 - Newer term starts after scenario 1 (animation)</em></p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="all-possible-leaders-1">All possible leaders<a href="https://multigres.com/blog/generalized-consensus-part6#all-possible-leaders-1" class="hash-link" aria-label="Direct link to All possible leaders" title="Direct link to All possible leaders">​</a></h3>
<p>So far, we targeted specific nodes for revocation and candidacy. This was mainly to illustrate the logic. As explained before, a coordinator must actually attempt to revoke all possible leaderships in the cohort. To achieve this, it must recruit a combination from each group:</p>
<p>For N1:</p>
<ul>
<li>N1</li>
<li>N2</li>
<li>N3</li>
</ul>
<p>For N4:</p>
<ul>
<li>N4</li>
<li>N5, N6</li>
</ul>
<p>For example, {N1, N4} is a valid combination. {N1, N5, N6} is also a valid combination, and so on.</p>
<p>To recruit for leadership:</p>
<ul>
<li>For N1, it must recruit N1, N2, N3.</li>
<li>For N4, it must recruit N4, N5 or N4, N6.</li>
</ul>
<p>To perform a leadership change to N4, a coordinator must recruit for both revocation and candidacy. This would be any combination from the first set and a combination needed for N4’s leadership. A valid set would be: N3, N4, and N5, which is illustrated in scenario 1. The animation below shows a few examples of valid combinations:</p>
<p><em>Figure 3: All possible leaders (animation)</em></p>
<h1>Summarizing the rules</h1>
<p>The summarized rules are more straightforward than the explanation: the coordinator must try to recruit all reachable nodes to participate in the new term. After the recruitment, the following criteria must be met among the nodes that were successfully recruited (a code sketch follows the list):</p>
<ul>
<li>No leader from an older term can complete any requests.</li>
<li>They must include the candidate (or the intended candidate) along with enough nodes to satisfy its quorum.</li>
</ul>
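<p>These two checks can be expressed mechanically. In the sketch below (the types are illustrative), each eligible leader carries its quorum alternatives: N1’s rule “both N2 and N3” becomes <code>[][]string{{"N2", "N3"}}</code>, and N4’s rule “either N5 or N6” becomes <code>[][]string{{"N5"}, {"N6"}}</code>:</p>
<pre><code class="language-go">package coverage

// Leader is an eligible leader with its quorum alternatives; fully
// acking any single alternative satisfies its durability requirement.
type Leader struct {
    Name string
    Alts [][]string
}

// Revoked reports whether a leadership is revoked: the leader itself was
// recruited, or every quorum alternative contains at least one recruited
// node, so no quorum survives intact.
func Revoked(l Leader, recruited map[string]bool) bool {
    if recruited[l.Name] {
        return true
    }
    for _, alt := range l.Alts {
        touched := false
        for _, n := range alt {
            if recruited[n] {
                touched = true
            }
        }
        if !touched {
            return false
        }
    }
    return true
}

// Candidacy reports whether the candidate plus at least one complete
// quorum alternative were recruited.
func Candidacy(candidate Leader, recruited map[string]bool) bool {
    if !recruited[candidate.Name] {
        return false
    }
    for _, alt := range candidate.Alts {
        complete := true
        for _, n := range alt {
            if !recruited[n] {
                complete = false
            }
        }
        if complete {
            return true
        }
    }
    return false
}
</code></pre>
<p>With recruited = {N3, N4, N5}, <code>Revoked</code> holds for both N1 and N4, and <code>Candidacy</code> holds for N4, matching scenario 1 above.</p>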
<h1>Which parts of Paxos or Raft do this?</h1>
<p>For Paxos, this is the <code>prepare</code> message, where it sends a proposal number to all nodes. For Raft, it is the <code>RequestVote</code> message.</p>
<p>For both algorithms, the requirement is that the candidate recruit a majority of the nodes. This is sufficient because a majority satisfies both the requirements of revocation and candidacy.</p>
<p>Suppose a majority is not needed for quorum, as in the case of Flexible Paxos (FlexPaxos). In that case, the nodes required for revocation will be different from those necessary for candidacy. FlexPaxos uses intersecting quorums to ensure safety; in essence, it implements Rule 2a without being explicit about it.</p>
<p>It took a lot of explanation to unravel the concepts behind such a simple action. But without this understanding, we can't safely modify these algorithms. Additionally, this understanding will help us when discussing rule changes in a later post.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="navigating-the-series">Navigating the Series<a href="https://multigres.com/blog/generalized-consensus-part6#navigating-the-series" class="hash-link" aria-label="Direct link to Navigating the Series" title="Direct link to Navigating the Series">​</a></h2>
<p>In the next part, we'll look at how to discover timelines and ways to propagate them.</p>
<ul>
<li><a href="https://multigres.com/blog/generalized-consensus">Full Series Overview</a></li>
</ul>]]></content:encoded>
            <category>planetpg</category>
            <category>consensus</category>
            <category>distributed-systems</category>
            <category>durability</category>
        </item>
        <item>
            <title><![CDATA[Generalized Consensus: Ordering Decisions]]></title>
            <link>https://multigres.com/blog/generalized-consensus-part5</link>
            <guid>https://multigres.com/blog/generalized-consensus-part5</guid>
            <pubDate>Wed, 29 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[In this section, we will discuss the challenges of ordering decisions when multiple agents are involved.]]></description>
            <content:encoded><![CDATA[<p>In this section, we will discuss the challenges of ordering decisions when multiple agents are involved.</p>
<p>Let us reiterate the relevant rules from part 3.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="rules">Rules<a href="https://multigres.com/blog/generalized-consensus-part5#rules" class="hash-link" aria-label="Direct link to Rules" title="Direct link to Rules">​</a></h3>
<ol>
<li>(not needed for this section)</li>
<li>Consistency: Decisions must be applied sequentially.
<ol>
<li>Every agent must revoke the ability of all previous agents to make further progress before taking any action.
<ol>
<li>Inference: Every agent must provide a way for future agents to revoke its ability to make progress.</li>
</ol>
</li>
</ol>
</li>
</ol>
<p>There will come a time when a new Leader must be chosen. It may be due to a planned event, such as a software rollout, or a failure.</p>
<p>A leadership change is essentially a new decision, but it differs from a traditional request because it involves a change in roles. This requires applying the complete ruleset to execute a new decision. The outcome will be a leadership change. In this post, we will lay the groundwork for the approach we will use to satisfy rule 2a.</p>
<p>Before diving into the details, let us introduce some concepts.</p>
<h1>Failure detection</h1>
<p>Failure detection is a necessary component for high availability in consensus systems. However, it is not a necessity for reasoning about safety. Therefore, we will cover this topic after we have completed the analysis of the core algorithms. For now, we can assume that failures can trigger an action to change leadership.</p>
<h1>The coordinator</h1>
<p>Majority-based systems, such as Raft, typically have either three or five nodes in their cohort. A larger number becomes inefficient because the number of acks needed to make a request durable becomes too high. Due to this limited number, the nodes also take on the tasks of health checking and performing leadership changes.</p>
<p>However, in a generalized setup, the number of nodes can be much larger. You might have ten or twenty nodes in the cohort. In that case, it isn't practical for all of them to perform health checks or coordinate leadership changes.</p>
<p><img decoding="async" loading="lazy" alt="Figure 1: Coordinator setup" src="https://multigres.com/assets/images/part05-fig1-aec33bb064aafca7108730aafb6637e2.svg" width="580" height="385" class="img_qM5N"></p>
<p>This task is logically separate and can be performed by agents that are not part of the cohort. We will name them the <code>coordinators</code>. Figure 1 illustrates an example setup of six nodes deployed across three availability zones, each with its own coordinator.</p>
<p>To summarize, the role of a coordinator is as follows:</p>
<ul>
<li>Perform health checks on all the nodes of the cohort.</li>
<li>In case of a failure, perform a failover by appointing a new leader and ensuring that requests that the previous leader might have applied are honored.</li>
<li>A coordinator can optionally provide the functionality to perform a “planned leader change”.</li>
</ul>
<p>A smaller number of these coordinators can be strategically placed in different availability zones so that at least one of them has the necessary connectivity to appoint a new leader.</p>
<p>This does not preclude a cohort node from acting as a coordinator. We are highlighting that it is an independent role.</p>
<h1>A detour</h1>
<p>One way to satisfy Rule 2a is to ensure that no two coordinators act simultaneously.</p>
<p>For example, a coordinator could obtain a distributed lock with exclusive rights to take action until a timeout, and then act. The <a href="https://redis.io/docs/latest/develop/clients/patterns/distributed-locks/" target="_blank" rel="noopener noreferrer">Redis distributed</a> lock is one such example. The coordinator that obtains the lock must ensure that it finishes its work before the timeout.</p>
<p>The advantage of this approach is that it eliminates races, thereby simplifying implementation. As we will see below, an algorithm that allows coordinators to race will be substantially more complex.</p>
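<p>Here is the shape of that approach, with a generic <code>LockService</code> standing in for something like the Redis lock; the interface is hypothetical:</p>
<pre><code class="language-go">package locking

import (
    "context"
    "time"
)

// LockService is a stand-in for a distributed lock with a TTL.
type LockService interface {
    // Acquire returns a release function and true on success, or false
    // if another coordinator currently holds the lock.
    Acquire(name string, ttl time.Duration) (release func(), ok bool)
}

// ChangeLeadership runs the coordinator's work under an exclusive lock.
// The work must finish before the TTL expires; as discussed below, that
// cannot be guaranteed, which is why this approach is practical but not
// a basis for a theoretical proof.
func ChangeLeadership(ls LockService, work func(ctx context.Context) error) error {
    const ttl = 30 * time.Second
    release, ok := ls.Acquire("leadership-change", ttl)
    if !ok {
        return nil // another coordinator is acting; back off
    }
    defer release()
    ctx, cancel := context.WithTimeout(context.Background(), ttl)
    defer cancel()
    return work(ctx)
}
</code></pre>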
<p>Unfortunately, this approach cannot be used for a theoretical proof due to the following reasons:</p>
<ul>
<li>It is impossible to guarantee how long a process will take to complete its work. While it is in the middle of taking a critical action, the timeout may pass, and a new coordinator may start to act, thereby violating sequentiality.</li>
<li>Clocks are imperfect: Clock skews could cause the coordinator to think that it still has time to finish its work, while the time might have elapsed for the other clocks. Another coordinator may then start to act, and again, violate sequentiality.</li>
</ul>
<p>We shouldn’t dismiss this approach. After all, real-life systems rely on clocks and timing. Even High Availability, which is essential for consensus protocols, depends on timing. From a practical standpoint, using locks and timeouts remains viable as long as we understand the trade-offs and implement safeguards against potential issues. In fact, Vitess employs this approach.</p>
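<p>To make the trade-off concrete, here is a minimal Go sketch of the lock-based approach. The <code>DistLock</code> interface is hypothetical (it is not the Redis client API); the point is the hazard noted in the comment:</p>
<pre><code class="language-go">package coordinator

import (
	"errors"
	"time"
)

// DistLock is a hypothetical distributed-lock client.
type DistLock interface {
	// Acquire returns true if the lock was obtained; the lock
	// auto-expires after ttl.
	Acquire(key string, ttl time.Duration) (bool, error)
	Release(key string) error
}

// changeLeader performs a failover under an exclusive lock.
func changeLeader(lock DistLock, failover func() error) error {
	ok, err := lock.Acquire("leadership", 30*time.Second)
	if err != nil {
		return err
	}
	if !ok {
		return errors.New("another coordinator holds the lock")
	}
	defer lock.Release("leadership")
	// Hazard: if failover stalls past the 30s TTL, or our clock runs
	// slow, the lock expires and a second coordinator may start acting
	// while we are still mid-flight, violating sequentiality.
	return failover()
}
</code></pre>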
<p>The bigger point we’d like to make here is that you can approach this problem in radically different ways.</p>
<h1>Elapsed time</h1>
<p>A theoretically correct solution should not depend on elapsed time: there should be no reliance on a clock or assumptions about how long actions take. For example, our reasoning should consider that an action can take one microsecond or one year. The same assumption applies to observing a previous action: it might have occurred a few seconds ago or many weeks ago.</p>
<p>On the other hand, multiple coordinators could decide to act simultaneously and compete with each other. If so, we must ensure a consistent outcome.</p>
<p>The intuitive approach to solving a race condition is to favor the first coordinator. However, if the first coordinator takes too long, or even crashes before completing its work, no other coordinator can ever supersede it. In other words, this works only if we set a time limit for the completion of the task. This is the same as the lock-based approach described above.</p>
<p>We are now left with the alternate approach where a newer coordinator must be able to supersede an older one. This is why rule 2a uses the terms “current” and “previous” agents. It’s a lock-free algorithm and, therefore, naturally more complex than a lock-based algorithm.</p>
<h1>Ordering</h1>
<p>When two coordinators decide to act and are expected to race, we need a way to ensure that their actions are serialized. This means that the system must assign an order to those decisions. In a distributed system, there are two types of ordering:</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="time-ordering">Time ordering<a href="https://multigres.com/blog/generalized-consensus-part5#time-ordering" class="hash-link" aria-label="Direct link to Time ordering" title="Direct link to Time ordering">​</a></h3>
<p>Time ordering is the use of timestamps to determine the order in which coordinators make their decisions. The problem with timestamps is that clocks are unreliable.</p>
<p>In other words, time ordering is inaccurate.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="encounter-ordering">Encounter ordering<a href="https://multigres.com/blog/generalized-consensus-part5#encounter-ordering" class="hash-link" aria-label="Direct link to Encounter ordering" title="Direct link to Encounter ordering">​</a></h3>
<p>Encounter ordering refers to the physical sequence in which coordinators interact with a common node. This is also equivalent to causal ordering. It is accurate.</p>
<p>However, encounter ordering is unpredictable.</p>
<p>This unpredictability is acute because a coordinator can crash and never finish. Per the <a href="https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf" target="_blank" rel="noopener noreferrer">FLP theorem</a>, this is theoretically indistinguishable from a slow coordinator.</p>
<p><em>The lock-based approach is an attempt to control this unpredictability.</em></p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="choosing-the-order">Choosing the order<a href="https://multigres.com/blog/generalized-consensus-part5#choosing-the-order" class="hash-link" aria-label="Direct link to Choosing the order" title="Direct link to Choosing the order">​</a></h3>
<p>The unpredictability of encounter ordering is unavoidable because coordinators need to ask the cohort nodes to do work. If they race against each other, their actions are likely to be interleaved.</p>
<p>We need an algorithm that can withstand this unpredictability. The best approach is to assign an order to these coordinators in advance and set rules that cover actions occurring out of order.</p>
<p>Assigning an approximate timestamp when a coordinator decides to act can satisfy these requirements as long as we can ensure that the timestamps don’t collide. Additionally, we need to consider rogue clocks.</p>
<p>Raft offers a better approach: have the coordinators visit a set of overlapping nodes and use that information to determine a sequence. This method is precise because it relies on encounter ordering. The clever part is that it does this before taking any action, meeting the above constraints. We will explain this in part 10.</p>
<h1>The term number</h1>
<p>The assignment of an order between two independent nodes deciding to act is what Paxos calls a proposal number, and Raft calls a term number. This number must be universally unique and must increase monotonically. For clarity, we will use the Raft terminology and refer to it as the term number. The rules around term numbers apply to all agents. This includes coordinators as well as leaders.</p>
<p>To handle agents acting out of sequence, we’ll specify that a newer agent always supersedes an older one. This is a prerequisite for Rule 2a(i). To achieve this, we will make agents <code>recruit</code> nodes into their term:</p>
<ul>
<li>An existing agent is expected to give instructions to a node using its current term number as authority.</li>
<li>A newer agent can use its term number as authority and instruct those nodes to stop accepting further requests from the existing agent.</li>
</ul>
<p><img decoding="async" loading="lazy" alt="Figure 2: Term number rules" src="https://multigres.com/assets/images/part05-fig2-8af209661da4fb5133cec789f55f9268.svg" width="773" height="345" class="img_qM5N"></p>
<p>For this to work correctly, the nodes should obey the following rules, also shown in Figure 2:</p>
<ul>
<li>Every node in the cohort must have a persistent term number.</li>
<li>A node must honor requests from an agent with a matching term number.</li>
<li>A node must reject requests from an agent with a lower term number.</li>
<li>A node can be recruited into a term whose number is higher than the current one.</li>
</ul>
<p>In Figure 2, the last example shows an agent implicitly recruiting a node that is from a lower term. This is allowed because it is equivalent to a recruitment followed by a request.</p>
<p><em>The term number must be persisted to survive restarts. Otherwise, a restarted node that does not remember the term number it last agreed to may accept requests from a coordinator with a lower term and break rule 2a.</em></p>
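<p>Here is a minimal sketch, in Go, of how a cohort node might enforce these rules. The types and the persistence call are illustrative, not Multigres code:</p>
<pre><code class="language-go">package cohort

import "fmt"

// node holds the state relevant to term handling.
type node struct {
	term uint64 // persisted before any reply is sent
}

// handle applies the rules shown in Figure 2 to an incoming request.
func (n *node) handle(agentTerm uint64, do func() error) error {
	if n.term > agentTerm {
		// Reject: the agent belongs to an older term.
		return fmt.Errorf("request with term %d rejected: node is at term %d", agentTerm, n.term)
	}
	if agentTerm > n.term {
		// Recruit: adopt the newer term. This also covers implicit
		// recruitment, where the request itself carries a higher term.
		n.term = agentTerm
		if err := persist(n.term); err != nil {
			return err
		}
	}
	// Honor: the term matches; carry out the instruction.
	return do()
}

// persist is a stand-in for writing the term to stable storage.
func persist(term uint64) error { return nil }
</code></pre>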
<h1>Reinterpreting Leadership</h1>
<p>In the previous post, we talked about a leader being able to fulfill multiple requests. We now need to define how the term numbers interact with these actions.</p>
<p>One approach would be to assign a term number for each request. This has a disadvantage: a new coordinator that intends to change the leadership must come up with a number that is not only greater than the current term but also greater than other terms the leader could be generating as it fulfills more requests.</p>
<p>An alternate approach is to treat these requests as sub-terms. So, if a leader started under term 5, its requests would have the terms 5-1, 5-2, etc., or alternatively, a log position under term 5. This way, a coordinator that starts a term 6 is guaranteed to supersede the current leadership.</p>
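<p>A sketch of this ordering in Go, with illustrative names: a log position qualified by its term, compared lexicographically so that anything under term 6 supersedes everything under term 5.</p>
<pre><code class="language-go">// Position identifies a decision within a leadership term.
type Position struct {
	Term uint64 // assigned when the leadership was established
	Seq  uint64 // sub-term: 5-1, 5-2, ... within term 5
}

// Supersedes reports whether p overrides q.
func (p Position) Supersedes(q Position) bool {
	if p.Term != q.Term {
		return p.Term > q.Term
	}
	return p.Seq > q.Seq
}
</code></pre>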
<p>This is why we chose the word “term”: it implies a leadership that is long-lived and can fulfill multiple requests.</p>
<p>This also simplifies our reasoning: Within a term, we only pay attention to rule 1. To start a new term, we have to follow the entire ruleset.</p>
<p><em><strong>Do we need a term number once leadership is established? Yes.</strong> If the system becomes chaotic with multiple coordinators and leaderships partially succeeding and failing, we need the ability to know the order in which events took place. In these situations, the term number can be used as an authoritative source to determine this order.</em></p>
<p>Essentially, every leadership starts under a term that is newer than the previous one, until a newer term replaces it.</p>
<p>In the next section, we will see how to safely perform these leadership changes by following rule 2a.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="navigating-the-series">Navigating the Series<a href="https://multigres.com/blog/generalized-consensus-part5#navigating-the-series" class="hash-link" aria-label="Direct link to Navigating the Series" title="Direct link to Navigating the Series">​</a></h2>
<p>In the next part, we'll discuss how to Revoke Leadership and recruit for Candidacy.</p>
<ul>
<li><a href="https://multigres.com/blog/generalized-consensus">Full Series Overview</a></li>
</ul>]]></content:encoded>
            <category>planetpg</category>
            <category>consensus</category>
            <category>distributed-systems</category>
            <category>durability</category>
        </item>
        <item>
            <title><![CDATA[Generalized Consensus: Fulfilling Requests]]></title>
            <link>https://multigres.com/blog/generalized-consensus-part4</link>
            <guid>https://multigres.com/blog/generalized-consensus-part4</guid>
            <pubDate>Tue, 28 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[In this section we will cover how to fulfill requests in a generalized consensus system.]]></description>
            <content:encoded><![CDATA[<p>In this section we will cover how to fulfill requests in a generalized consensus system.</p>
<p>Let’s restate the subset of rules that are relevant to this section:</p>
<ol>
<li>Durability: Every decision is a distributed decision.<!-- -->
<ol>
<li>A distributed decision must be made durable.</li>
<li>A decision that has been made durable can be applied.</li>
</ol>
</li>
</ol>
<h1>Definitions</h1>
<ul>
<li>The <code>cohort</code> is the full set of nodes that are responsible for fulfilling the durability requirements of the system. In other words, these nodes are responsible for persisting their logs.</li>
<li>A <code>quorum</code> is any combination of nodes that are needed to meet the <code>durability criteria</code>. We will use these terms interchangeably depending on the context.</li>
</ul>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="roles">Roles<a href="https://multigres.com/blog/generalized-consensus-part4#roles" class="hash-link" aria-label="Direct link to Roles" title="Direct link to Roles">​</a></h3>
<ul>
<li>A <code>leader</code> is a designated node in the cohort that is empowered to accept and complete requests. It continues to serve requests until its leadership is revoked.</li>
<li>The rest of the nodes in the cohort are <code>followers</code>. Their role is to assist the leader in its workflow to make requests durable.</li>
<li><code>Observers</code> are nodes that are not part of the cohort. They are replicas that only accept requests that are ready to be applied by the leader.</li>
</ul>
<h1>Sample use case</h1>
<p>For better understanding, we will use the following example setup:</p>
<ul>
<li>A six-node cohort: N1-N6.</li>
<li>Only N1 and N4 are eligible leaders.</li>
<li>Durability criteria for N1: Data must reach both N2 and N3.</li>
<li>Durability criteria for N4: Data must reach either N5 or N6.</li>
</ul>
<p><img decoding="async" loading="lazy" alt="Generalized durability policy" src="https://multigres.com/assets/images/genpolicy-109eb884854b148b5c198fc4b068a56e.svg" width="323" height="572" class="img_qM5N"></p>
<p>This is an impractical configuration. However, it has some unique properties that will help us demonstrate that a system can work with arbitrary rules:</p>
<ul>
<li>Eligible leaders have different durability rules.</li>
<li>Not all nodes are eligible leaders.</li>
<li>It has an even number of nodes.</li>
</ul>
<p>The role of a leader is well understood in practical terms: It is a node that is authorized to accept requests from the application and fulfill them while also making the requests durable. As covered before, this is achieved by replicating a log to other followers.</p>
<h1>Initial State</h1>
<p>Let us start with the initial state as follows:</p>
<ul>
<li>N1 is the leader</li>
<li>Nodes N2-N6 are followers</li>
</ul>
<h1>Processing requests</h1>
<p>The algorithm explained below is very similar to the way Raft fulfills requests. The only difference is in the method used to determine if the durability criteria are met. For Raft, durability is determined by counting the number of followers that have acked a request. If it’s a majority, the request has become durable. In the generalized approach, the specific criteria have to be met. For N1’s leadership, acks from N2 and N3 must be received. No other acks count.</p>
<p>In Raft, each node must have a log of the requests being processed. Requests get appended to the log, and there is a trailing commit index (aka applied index) that determines the point up to which the events of the log have been applied.</p>
<p>The figure below shows a step-by-step animation of how requests get fulfilled by a generalized consensus system.</p>
<p><em>[Animated figure: Multigres consensus and replication diagram]</em></p>
<p>When a leader receives a request, it appends it to the log and also sends it out in sequential order to all the nodes of the cohort. Every node that receives the event appends the request to its own log and responds with an acknowledgement (ack) stating that the event has been received. At this point, nothing has been applied.</p>
<p>The leader may receive other requests while waiting for an ack. If so, it can continue to append them to the log and transmit them to the followers.</p>
<p>The leader (N1) must wait until it receives the necessary acks to reach quorum. In this case, the acks must come from nodes N2 and N3. Once both those acks are received, N1 is allowed to move the applied index forward and apply the event. At this point, N1 can also return to the caller with a success response. At the same time, it must send an apply message (update applied index) to all the followers to apply the event.</p>
<p>We call this method of replication “Two-Phase Sync”.</p>
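<p>As a rough illustration, here is what the two phases might look like in Go for N1’s leadership. All types are simplified stand-ins, not the actual implementation:</p>
<pre><code class="language-go">package twophase

import "errors"

type entry struct {
	index int
	req   string
}

type ack struct {
	from  string
	index int
}

type leader struct {
	log          []entry
	appliedIndex int
	acks         chan ack
	// durable is the pluggable check: for N1, “both N2 and N3 acked”.
	durable func(acked map[string]bool) bool
}

// fulfill processes one request through both phases.
func (l *leader) fulfill(req string) error {
	e := entry{index: len(l.log), req: req}
	l.log = append(l.log, e) // phase 1: append to own log...
	broadcastAppend(e)       // ...and send it to every cohort node

	acked := map[string]bool{}
	for a := range l.acks { // collect follower acks
		if a.index == e.index {
			acked[a.from] = true
		}
		if l.durable(acked) {
			// Phase 2: durable. Apply, answer the caller, and tell
			// the followers to advance their applied index too.
			l.appliedIndex = e.index
			broadcastApply(e.index)
			return nil
		}
	}
	return errors.New("ack stream closed before durability was met")
}

func broadcastAppend(e entry)  {} // stand-in: send entry to followers
func broadcastApply(index int) {} // stand-in: send the apply message
</code></pre>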
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="additional-observations">Additional Observations<a href="https://multigres.com/blog/generalized-consensus-part4#additional-observations" class="hash-link" aria-label="Direct link to Additional Observations" title="Direct link to Additional Observations">​</a></h3>
<ul>
<li>Acks from any nodes other than N2 and N3 do not count towards the durability criteria, and must be ignored by N1.</li>
<li>If N4 were the leader, a single ack from either N5 or N6 would be sufficient.</li>
<li>Other nodes are still required to apply the logs as they receive the apply messages.</li>
<li>While N1 is the leader, acks from N4, N5, and N6 do not count. They could optionally be configured as observers for as long as N1 is the leader. However, as we will see much later, there are some advantages to having them continue to act as followers, with N1 simply ignoring their acks.</li>
</ul>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="followers">Followers<a href="https://multigres.com/blog/generalized-consensus-part4#followers" class="hash-link" aria-label="Direct link to Followers" title="Direct link to Followers">​</a></h3>
<p>A follower can be in one of these states (see the sketch after this list):</p>
<ul>
<li>Rebuilt or lagging: The follower might have just been rebuilt from a backup, or it may be lagging in replication. In this state, the follower’s latest logs are behind the leader’s commit index. If so, the leader sends the committed logs as final until the commit index is reached.</li>
<li>Caught up: In this state, the follower’s latest logs are past the leader’s commit index. The leader sends events in two-phase mode, and the follower responds with corresponding acks.</li>
<li>Conflicting entries: In case of a conflict where a follower’s unapplied log does not match that of the leader, the follower should accept the leader’s logs as authoritative and discard conflicting entries.</li>
</ul>
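<p>Continuing the sketch above, a follower might reconcile an incoming two-phase entry like this (illustrative only; entries arrive in sequential order):</p>
<pre><code class="language-go">type follower struct {
	id  string
	log []entry
}

// receive appends the leader’s entry, discarding any conflicting
// unapplied suffix first, and acks receipt. Nothing is applied here;
// that happens only when the apply message arrives.
func (f *follower) receive(e entry) ack {
	if e.index >= len(f.log) {
		f.log = append(f.log, e) // caught up: plain append
	} else if f.log[e.index].req != e.req {
		f.log = f.log[:e.index] // conflict: the leader’s log wins
		f.log = append(f.log, e)
	}
	return ack{from: f.id, index: e.index}
}
</code></pre>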
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="observers">Observers<a href="https://multigres.com/blog/generalized-consensus-part4#observers" class="hash-link" aria-label="Direct link to Observers" title="Direct link to Observers">​</a></h3>
<p>Observers only receive finalized apply requests.</p>
<h1>Validation</h1>
<p>Let us now validate whether the above algorithm follows Rule 1. In this scenario, the fulfillment of a request is a decision. From this perspective:</p>
<ul>
<li>Sending the request to all followers and waiting for the necessary acks is rule 1a.</li>
<li>Moving the commit index forward and applying the event is rule 1b.</li>
</ul>
<p>There are other replication modes, but they don’t follow rule 1:</p>
<ul>
<li>Async replication: The leader appends and applies the request, and asynchronously sends the events to the followers. This breaks rules 1a and 1b. This can lead to data loss if the leader node crashes.</li>
<li>Synchronous replication: The leader sends the request to the followers as final. The followers apply the change and send an ack to the leader. The leader applies the change when the ack is received. This follows rule 1a, but breaks rule 1b. This is because the followers apply the change before the request has met the durability criteria. This can lead to inconsistent “split-brain” states.</li>
</ul>
<p><em>Postgres can rewind transactions. When a split-brain scenario happens, it is possible to identify the transactions that must be rewound to restore system consistency. Using this approach, it is possible to design a system that meets the necessary durability criteria. Details about this are covered in our earlier <a href="https://multigres.com/blog/postgres-ha-full-sync#existing-replication-pitfalls" target="_blank" rel="noopener noreferrer">blog post</a>.</em></p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="rule-2">Rule 2<a href="https://multigres.com/blog/generalized-consensus-part4#rule-2" class="hash-link" aria-label="Direct link to Rule 2" title="Direct link to Rule 2">​</a></h3>
<p>During this explanation, we did not pay attention to rule 2: Consistency: Decisions must be applied sequentially.</p>
<p>Because this is the initial state, there is no previous agent. So, there is no need to revoke anything or honor any previous work.</p>
<p>However, rule 2 does apply to subsequent requests that follow the first one. In this case, we follow the title of rule 2: “Decisions must be applied sequentially”. As long as we append to the log, we are following the entire rule, which is sufficient.</p>
<h1>Roles</h1>
<p>In the above scenarios, nodes have taken on two roles: Leader and Follower. Of these, the Leader is the active agent making decisions. The followers support the leader by making requests durable and by responding with acks.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="an-alternate-type-of-leadership">An alternate type of leadership<a href="https://multigres.com/blog/generalized-consensus-part4#an-alternate-type-of-leadership" class="hash-link" aria-label="Direct link to An alternate type of leadership" title="Direct link to An alternate type of leadership">​</a></h3>
<p>Reviewing our cohort setup again, we had instinctively linked a log to each node. However, the rules do not require this; they only specify that decisions must be made durable.</p>
<p>For example, we could detach the leader from its log and make it an independent node. This detached leader will still need to send events to all nodes required for a quorum. In this case, it would be N1, N2, and N3, where all three nodes are followers. This is also valid because it satisfies the same rules.</p>
<p>This is how systems like Aurora and Neon achieve distributed durability even though they don’t resemble Paxos or Raft.</p>
<p>In the interest of focus, we will not expand on this scenario.</p>
<h1>Steady State</h1>
<p>A leader can continue to serve a large number of requests in the current state. This is usually interrupted if there is a need to update the software or if a failure occurs.</p>
<p>When the leader has to change, we have new decisions to make. We will talk about those in the next post.</p>
<h1>How is this different from traditional consensus?</h1>
<p>The two-phase mechanism of sending out the requests, waiting for the necessary acks, and then sending out messages to apply the requests is nothing new. This is how Raft also works. The part that differs is that the rules for what counts as durable can be arbitrary.</p>
<p>If we provided a plugin mechanism for the rules, these acks would be handled by the plugin, which would validate them against the durability rules. This would allow the main algorithm to remain agnostic of the durability policy.</p>
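<p>Such a plugin could be as small as the following hypothetical interface, shown here with the two policies from our sample cohort:</p>
<pre><code class="language-go">// DurabilityPolicy is the pluggable rule. The core algorithm stays
// agnostic: it simply consults the policy after each ack.
type DurabilityPolicy interface {
	Durable(acked map[string]bool) bool
}

// n1Policy: acks from both N2 and N3 are required; no other ack counts.
type n1Policy struct{}

func (n1Policy) Durable(acked map[string]bool) bool {
	if acked["N2"] {
		return acked["N3"]
	}
	return false
}

// n4Policy: an ack from either N5 or N6 suffices.
type n4Policy struct{}

func (n4Policy) Durable(acked map[string]bool) bool {
	return acked["N5"] || acked["N6"]
}
</code></pre>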
<p>We have also shown that the rules allow for an alternate way to meet the durability criteria with a leader that is detached from its logs.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="navigating-the-series">Navigating the Series<a href="https://multigres.com/blog/generalized-consensus-part4#navigating-the-series" class="hash-link" aria-label="Direct link to Navigating the Series" title="Direct link to Navigating the Series">​</a></h2>
<p>In the next part, we'll look at how to order events consistently.</p>
<ul>
<li><a href="https://multigres.com/blog/generalized-consensus">Full Series Overview</a></li>
</ul>]]></content:encoded>
            <category>planetpg</category>
            <category>consensus</category>
            <category>distributed-systems</category>
            <category>durability</category>
        </item>
        <item>
            <title><![CDATA[Generalized Consensus: Governing Rules]]></title>
            <link>https://multigres.com/blog/generalized-consensus-part3</link>
            <guid>https://multigres.com/blog/generalized-consensus-part3</guid>
            <pubDate>Mon, 27 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[This section discusses the fundamental governing rules of any consensus system.]]></description>
            <content:encoded><![CDATA[<p>This section discusses the fundamental governing rules of any consensus system.</p>
<p>As we solve the problem of durability, we will realize that there is a simple set of governing rules that we will be applying repetitively. We will develop these as we progress in our design. However, we will share them in their entirety upfront.</p>
<p>If you follow these rules, you should be able to implement any kind of consensus system. Here are some definitions and rules:</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="definitions">Definitions<a href="https://multigres.com/blog/generalized-consensus-part3#definitions" class="hash-link" aria-label="Direct link to Definitions" title="Direct link to Definitions">​</a></h3>
<ul>
<li>A consensus system executes a series of consistent distributed decisions made by multiple agents.</li>
<li>A <code>decision</code> is an intent to make a change to the state of the system.</li>
<li>An <code>agent</code> fulfills decisions.</li>
</ul>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="rules">Rules<a href="https://multigres.com/blog/generalized-consensus-part3#rules" class="hash-link" aria-label="Direct link to Rules" title="Direct link to Rules">​</a></h3>
<ol>
<li>Durability: Every decision is a distributed decision.<!-- -->
<ol>
<li>A distributed decision must be made durable.</li>
<li>A decision that has been made durable can be applied.</li>
</ol>
</li>
<li>Consistency: Decisions must be applied sequentially.<!-- -->
<ol>
<li>Every agent must revoke the ability of all previous agents to make further progress before taking any action.<!-- -->
<ol>
<li>Inference: Every agent must provide a way for future agents to revoke its ability to make progress.</li>
</ol>
</li>
<li>Every agent must discover decisions that were previously made durable, but not applied, and honor them. Clarifications:<!-- -->
<ol>
<li>There are situations where it will be impossible to know if a decision met the durability criteria. If so, the agent must honor such decisions because they might have been applied.</li>
<li>Decisions that get honored must be made durable and applied as a new decision made by the current agent (rule 1).</li>
<li>Inference: If an agent discovers multiple conflicting timelines, the newest one must be chosen.</li>
</ol>
</li>
</ol>
</li>
</ol>
<p>These rules have a hierarchy. If you can satisfy the top-level rule, you do not have to follow the sub-parts. To accommodate all possible algorithms, the rules also avoid dictating any approach or implementation.</p>
<p>For a leader-based system, there are three types of <code>decisions</code>:</p>
<ul>
<li>Fulfilling requests</li>
<li>Changing leadership</li>
<li>Changing durability rules</li>
</ul>
<p>In the next few posts, we will discuss implementation strategies for these decisions.</p>
<p>In Raft, a <code>leader</code> is an <code>agent</code>. In our analysis, we will introduce one other agent: the <code>coordinator</code>.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="questions">Questions<a href="https://multigres.com/blog/generalized-consensus-part3#questions" class="hash-link" aria-label="Direct link to Questions" title="Direct link to Questions">​</a></h3>
<p>Are we claiming that algorithms like Paxos and Raft follow these rules?</p>
<p><em>Yes. We’ll validate this as we expand on rules.</em></p>
<p>If one were to implement a system that followed these rules, but didn’t follow anything like what Paxos or Raft did, would it still work?</p>
<p><em>Yes.</em></p>
<p>What do these generalizations allow that previous algorithms didn’t?</p>
<ul>
<li><em>Durability rules can be arbitrarily complex.</em></li>
<li><em>The number of nodes need not dictate the durability rules. FlexPaxos already demonstrated this, and our generalization retains that flexibility.</em></li>
<li><em>The rules don’t dictate implementation: We have the flexibility to separate concerns in an implementation or implement them differently. We can also reuse existing parts of other systems to compose a full system.</em></li>
</ul>
<h1>Durability vs Discoverability</h1>
<p>Durability and discoverability are two sides of the same coin. We need to define durability rules for two purposes:</p>
<ol>
<li>Data must survive node failures.</li>
<li>Data must be discoverable if there are network partitions.</li>
</ol>
<p>For a majority quorum, if there is a single network partition, data that reached durability can always be discovered. This is because one side of the partition will have a majority, and one of those nodes will have the data. However, multiple simultaneous partitions can leave durable data undiscoverable.</p>
<p>In real life, network partitions are not totally random. So, you can craft durability rules based on expected failure patterns.</p>
<p>If there is a failure, the agent that performs the discovery can compute the minimum set of nodes that need to be visited to ensure that it discovers all completed requests. If one of those nodes is not reachable, the recovery will stall. People will get paged, and everyone can panic.</p>
<p>In other words, if the durability criteria do not take discovery into account, the system can stall. For all practical purposes, this is equivalent to data loss, because production systems are required to meet specific availability requirements, and business priorities may force us to abandon the unreachable node in favor of serving new requests.</p>
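<p>This discovery computation can be framed as a set-intersection check: a set of nodes to visit is safe only if it intersects every node group that could have satisfied the durability criteria. A sketch, with illustrative types:</p>
<pre><code class="language-go">// canDiscoverAll reports whether visiting the nodes in visit is
// guaranteed to surface every request that met the durability
// criteria, given the acceptable node groups of the previous leader.
func canDiscoverAll(visit []string, groups [][]string) bool {
	seen := map[string]bool{}
	for _, n := range visit {
		seen[n] = true
	}
	for _, g := range groups {
		found := false
		for _, n := range g {
			if seen[n] {
				found = true // some node in g is being visited
			}
		}
		if !found {
			return false // a durable request could hide entirely in g
		}
	}
	return true
}
</code></pre>
<p>For a five-node majority quorum, any three nodes pass this check. With custom policies, the minimum visit set can be much smaller, but if it contains an unreachable node, recovery stalls as described above.</p>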
<h1>Meaning of Apply</h1>
<p>The meaning of ‘apply’ depends on the system that implements the protocol. For example, in the case of a database, a <code>commit</code> would count as an apply. For a file system, an <code>fsync</code> would count as an apply.</p>
<p>An apply is considered to be an irreversible process. It should be done only when we are certain that a request will not be abandoned.</p>
<p>The consensus system is not concerned with the semantics of apply. However, the request stored in the log should be such that the outcome of the apply is deterministic.</p>
<h1>Missing terminology</h1>
<p>You’ll notice that there are some expected terms that are missing:</p>
<ul>
<li><strong>Leader, Follower, Candidate</strong>: These are states that agents go through during the process of fulfilling their decisions. We will introduce these as needed.</li>
<li><strong>Proposal number/term</strong>: These are implementation details, used to enforce the ordering of decisions. There are other options.</li>
<li><strong>Majority quorum</strong>: We’ve already covered this. This is not a necessity.</li>
<li><strong>Intersecting quorums</strong>: This concept was introduced by FlexPaxos in place of majority quorums. We will instead discuss discovery, revocation, and candidacy.</li>
<li><strong>Voting</strong> is also not used, because it is misleading. There is no election either. A leader is appointed, not elected.</li>
</ul>
<p>With all the groundwork laid out, it’s time to jump into the actual algorithms.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="navigating-the-series">Navigating the Series<a href="https://multigres.com/blog/generalized-consensus-part3#navigating-the-series" class="hash-link" aria-label="Direct link to Navigating the Series" title="Direct link to Navigating the Series">​</a></h2>
<p>In the next part, we'll look at how to fulfill requests.</p>
<ul>
<li><a href="https://multigres.com/blog/generalized-consensus">Full Series Overview</a></li>
</ul>]]></content:encoded>
            <category>planetpg</category>
            <category>consensus</category>
            <category>distributed-systems</category>
            <category>durability</category>
        </item>
        <item>
            <title><![CDATA[Generalized Consensus: Setting the Requirements]]></title>
            <link>https://multigres.com/blog/generalized-consensus-part2</link>
            <guid>https://multigres.com/blog/generalized-consensus-part2</guid>
            <pubDate>Fri, 24 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[This section sets the requirements for a generalized consensus system.]]></description>
            <content:encoded><![CDATA[<p>This section sets the requirements for a generalized consensus system.</p>
<p>In our previous post, we came up with an informal definition:</p>
<p><em>A consensus system must ensure that every request is saved elsewhere before it is completed and acknowledged. If there is a failure after the acknowledgment, the system must have the ability to find the saved requests, complete them, and resume operations from that point.</em></p>
<p>Let us stick to this definition and expand on some of these rules.</p>
<h1>Single value vs log replication</h1>
<p>The original <a href="https://lamport.azurewebsites.net/pubs/lamport-paxos.pdf" target="_blank" rel="noopener noreferrer">Paxos</a> paper was for a set of nodes to accept a single value. Although not practical, it is foundational. Understanding the single-value behavior will help us extend it for multiple values.</p>
<p>If we ask a Paxos system to accept a value and it succeeds, subsequent attempts to set a different value will fail. If the first attempt had an ambiguous outcome, the system might still finalize it later. A subsequent attempt may succeed or fail depending on the outcome of the first. This is shown in Figure 1 below.</p>
<p><img decoding="async" loading="lazy" alt="Figure 1: Single value consensus" src="https://multigres.com/assets/images/part02-fig1-05c25d354cec802f58a94eeb6f9841e0.svg" width="1002" height="644" class="img_qM5N"></p>
<p>Most practical systems need to accept multiple requests. To accommodate this, we have to modify this rule a bit: If the first attempt (A) succeeds, then a subsequent attempt (B) must be accepted and recorded as having occurred after A. If the outcome of A was ambiguous, then B requires the system to make a final resolution on A. If A is recovered and accepted, B is recorded after A. Otherwise, A is discarded, and only B is accepted. The system changes into one that consistently orders a series of requests. This is illustrated in Figure 2 below.</p>
<p><img decoding="async" loading="lazy" alt="Figure 2: Log replication" src="https://multigres.com/assets/images/part02-fig2-011bf69eb61507caa744e195159db520.svg" width="1122" height="639" class="img_qM5N"></p>
<p>This was well understood by Raft, which is why it redefined this as a log replication problem. Since this is more practical, we will adopt Raft’s approach of replicating a log.</p>
<p>Depending on the type of system being implemented, these attempts can mean different things. For a key-value store, it may be a <code>SetKey</code>. For a database, it may be a <code>transaction</code>. For the sake of uniformity, we will generalize these as <code>requests</code>. Also, the data needed to persist a request may be physically different from the request sent by the application. For simplicity, we will treat them as equivalent.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="consensus-state-diagram">Consensus state diagram<a href="https://multigres.com/blog/generalized-consensus-part2#consensus-state-diagram" class="hash-link" aria-label="Direct link to Consensus state diagram" title="Direct link to Consensus state diagram">​</a></h3>
<p><img decoding="async" loading="lazy" alt="Figure 3: Request state diagram" src="https://multigres.com/assets/images/part02-fig3-7180a6bfbbc961e092227c9d60ea6ea6.svg" width="1000" height="352" class="img_qM5N"></p>
<p>Figure 3 above shows the state diagram for a request.</p>
<ul>
<li>A node can crash as soon as a request is received. This results in an abandonment.</li>
<li>A received request could have been logged, but might not have met the durability criteria. If there is a failure, the request may not be discovered by the recovery process. If so, it will be abandoned.</li>
<li>A request that has not yet become durable might get discovered by the recovery process. The process will replicate the request to make it durable.</li>
<li>A request that has become durable will not be abandoned. This gives every node in the system the confidence to apply the request.</li>
</ul>
<p>If a request gets applied without experiencing any failures, it will be acknowledged as a success to the requester. Otherwise, its outcome will be resolved later by a recovery process.</p>
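<p>For readers who prefer code to diagrams, the lifecycle can be summarized as a small state type (a sketch mirroring Figure 3):</p>
<pre><code class="language-go">// reqState mirrors the request lifecycle shown in Figure 3.
type reqState int

const (
	received  reqState = iota // accepted by a node, nothing persisted yet
	logged                    // appended locally, durability not yet met
	durable                   // criteria met: will never be abandoned
	applied                   // irreversibly applied, success returned
	abandoned                 // lost in a failure before becoming durable
)
</code></pre>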
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="rejections-and-failures">Rejections and failures<a href="https://multigres.com/blog/generalized-consensus-part2#rejections-and-failures" class="hash-link" aria-label="Direct link to Rejections and failures" title="Direct link to Rejections and failures">​</a></h3>
<p>The system can reject an invalid request. If so, the application can assume that it was not accepted. However, if a failure occurs due to a timeout or a node failing, the outcome is unknown. The application must reconnect to the system and verify whether the previous request was accepted. It is the application’s responsibility to distinguish between these two kinds of errors.</p>
<p>Many of us have experienced this while shopping online: we click the “Pay” button, and it spins and times out 😂.</p>
<h1>Durability Requirements</h1>
<p>The problem definition states: “<em>every request is saved elsewhere”.</em></p>
<p>This requirement is open-ended because durability requirements are user-defined. We want to accommodate all reasonable use cases.</p>
<p>Today’s cloud environments have complex architectures with nodes, racks, zones, and regions. They have pricing structures that may encourage specific layouts. Additionally, enterprises often bring in their own policies. Combining these could result in complex requirements.</p>
<p>Here are some examples:</p>
<ul>
<li>We want X nodes to receive the data before a request is deemed durable.</li>
<li>We need Y total nodes to ensure availability when there’s a failure.</li>
<li>Something more sophisticated: We want to deploy eight nodes across four zones, with two nodes in each zone. Our durability requirement is that at least one node in a zone other than the primary must hold the data. This ensures protection against a zone failure and a network partition between zones. We choose two nodes per zone to prevent leadership from switching zones during routine maintenance.</li>
</ul>
<p>These requirements do not necessarily fit the pattern of a majority quorum. What ends up happening is that we configure a majority quorum system in such a way that these requirements are met. Sometimes, the configurations end up being sub-optimal.</p>
<p>We need a design that can accommodate these types of complexities.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="pluggable-durability">Pluggable Durability<a href="https://multigres.com/blog/generalized-consensus-part2#pluggable-durability" class="hash-link" aria-label="Direct link to Pluggable Durability" title="Direct link to Pluggable Durability">​</a></h3>
<p>Since such durability requirements can be arbitrarily complex, let’s make these rules pluggable, but add some restrictions:</p>
<ul>
<li>The rules must be defined over the current set of nodes.</li>
<li>Properties of nodes (like AZ) can be used, as long as they are static.</li>
<li>The rules cannot depend on external variables, such as time.</li>
<li>Each leader can have different rules.</li>
</ul>
<p>Additionally, the rules must be sensible for the system to function effectively. If not, it may lose data, not perform well, or stall.</p>
<p>The ruleset data structure would conceptually look like this (see the sketch after this list):</p>
<ul>
<li>List of participants</li>
<li>List of eligible primaries. For every primary:<!-- -->
<ul>
<li>A list of node groups, where each node group is a valid durability combination</li>
</ul>
</li>
</ul>
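<p>Here is a rough Go rendering of that structure. The field names are illustrative, not the Multigres configuration format:</p>
<pre><code class="language-go">type NodeID string

// Ruleset is a sketch of the conceptual data structure above.
type Ruleset struct {
	Participants []NodeID
	Leaders      []LeaderRule
}

// LeaderRule describes the durability policy for one eligible primary.
type LeaderRule struct {
	Leader NodeID
	// Groups lists the valid durability combinations: a request is
	// durable once every node in any one group has received it.
	Groups [][]NodeID
}
</code></pre>
<p>The zone-aware example above would enumerate, for each eligible primary, the groups that contain at least one node outside the primary’s zone.</p>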
<p>This could be further generalized by removing leadership from the picture and specifying durability as a set of acceptable node combinations. This approach would be more theoretically pure, but it would not improve the flexibility of the system. On the other hand, a leader-based approach is easier to reason about.</p>
<p>This sounds ambitious, but it is possible to build such a system.</p>
<h1>Orders of Magnitude</h1>
<p>In real-life scenarios, a leader is expected to fulfill a large volume of requests, in the range of thousands of requests per second. A leadership term also lasts a long time, typically many days, and sometimes longer. The durability policies can be tuned to take this into account.</p>
<p>For example, you can choose to have a five-node system, but require the leader to reach only one other node for durability. This configuration will give you the performance benefit of a three-node cluster. At the same time, a node crashing will cause less anxiety because you still have four other nodes running. The trade-off is that a leadership change will require the coordination of more nodes.</p>
<p>You might find this hard to believe: Vitess operated a consensus system at YouTube with over fifty replicas worldwide. We mainly depended on the fact that a neighboring replica is likely to have received the transaction before the distant ones. There was one incident when a transaction somehow reached a single node at a remote location. Fortunately, the system detected this and still managed to preserve the transaction.</p>
<p>Although I wouldn't recommend something this audacious, it shows that you can run a system with an unusually large number of nodes without sacrificing performance or safety.</p>
<h1>Leader-Based Consensus</h1>
<p>We will focus on leader-based consensus systems. I am aware of the existence of some leaderless algorithms, but I am not familiar with how they operate. I also don’t know if the principles we discover during this design will cover those approaches.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="navigating-the-series">Navigating the Series<a href="https://multigres.com/blog/generalized-consensus-part2#navigating-the-series" class="hash-link" aria-label="Direct link to Navigating the Series" title="Direct link to Navigating the Series">​</a></h2>
<p>In the next part, we'll establish the governing rules that every consensus system must follow to maintain safety guarantees.</p>
<ul>
<li><a href="https://multigres.com/blog/generalized-consensus">Full Series Overview</a></li>
</ul>]]></content:encoded>
            <category>planetpg</category>
            <category>consensus</category>
            <category>distributed-systems</category>
            <category>durability</category>
        </item>
        <item>
            <title><![CDATA[Generalized Consensus: Defining the Problem]]></title>
            <link>https://multigres.com/blog/generalized-consensus-part1</link>
            <guid>https://multigres.com/blog/generalized-consensus-part1</guid>
            <pubDate>Thu, 23 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[In this blog series, I have the following goals:]]></description>
            <content:encoded><![CDATA[<p>In this blog series, I have the following goals:</p>
<ul>
<li>Propose an alternate and more approachable definition of consensus.</li>
<li>Expand the definition into concrete requirements.</li>
<li>Break the problem down into goal-oriented rules.</li>
<li>Provide algorithms and approaches to satisfy the rules with adequate explanations to prove correctness and safety.</li>
<li>Show that existing algorithms are special cases of this generalization.</li>
</ul>
<p>The first consensus research paper to gain popularity was <a href="https://lamport.azurewebsites.net/pubs/lamport-paxos.pdf" target="_blank" rel="noopener noreferrer">Paxos</a>, and it was intimidating. Most people still don't fully understand it. Around the same time, another paper called <a href="http://pmg.csail.mit.edu/papers/vr.pdf" target="_blank" rel="noopener noreferrer">Viewstamped Replication</a> was published, but it didn't achieve as much popularity. Later, <a href="https://raft.github.io/" target="_blank" rel="noopener noreferrer">Raft</a> was introduced, providing an alternative approach that was easier to understand. It also included practical improvements that made it more usable in real-world scenarios. Specifically, it added failure detection and an enhancement to support log replication instead of the single-decree algorithm used by Paxos.</p>
<p>However, Raft remains a monolithic algorithm and is mostly used as a black box these days. Making changes to it is risky because you don't know what rules you might break. This fear has halted most progress in this area.</p>
<p>There are two reasons why consensus has remained a mystery for most:</p>
<ol>
<li>The problem is not well-defined.</li>
<li>Previous research has focused on proving the correctness and safety of specific algorithms, rather than conceptualization.</li>
</ol>
<p>Let's conceptualize instead. If we succeed, verifying the correctness of existing algorithms will become easier. More importantly, we can be bolder about modifying them to meet our needs better or creating entirely new ones.</p>
<p>There is a paper by Heidi Howard on <a href="https://arxiv.org/abs/1902.06776" target="_blank" rel="noopener noreferrer">Generalized Consensus</a>. I have read it, but I cannot claim to fully understand it. The paper is too theoretical for me, and I couldn't find an easy way to adapt it to real-world problems. It's quite possible that, if translated, it would be even more generic than what I intend to propose. However, I believe the goals differ: the paper focuses on a unified algorithm that can accommodate all existing consensus protocols. My goal is to develop a conceptual framework that enables the adaptation of consensus systems across diverse environments. Still, I did notice some overlaps between the topics discussed here and the paper. The concepts of revocation and flexible durability rules are definitely present in that paper.</p>
<p>I've made a previous attempt at this in my earlier <a href="https://planetscale.com/blog/consensus-algorithms-at-scale-part-1" target="_blank" rel="noopener noreferrer">blog series</a>, but it was incomplete. The series also had a bias because I wanted to demonstrate how to achieve this in <a href="http://vitess.io/" target="_blank" rel="noopener noreferrer">Vitess</a>, despite its constraints. This time, I intend to be more precise and provide a foundation for something that can lead to a formal proof.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="why-are-we-even-doing-this">Why are we even doing this?<a href="https://multigres.com/blog/generalized-consensus-part1#why-are-we-even-doing-this" class="hash-link" aria-label="Direct link to Why are we even doing this?" title="Direct link to Why are we even doing this?">​</a></h2>
<p>Above all, it never hurts to gain a better understanding of a system we depend so much on.</p>
<p>Additionally, the existing implementations are based on a majority quorum, which is too rigid. We are continuing to live with them because we don't have better options. <a href="https://fpaxos.github.io/" target="_blank" rel="noopener noreferrer">FlexPaxos</a> proved that you don't need a majority quorum. However, no implementation has yet adopted those learnings.</p>
<p>We are also stuck with implementations that cannot be separated into meaningful concerns. This makes it hard to adapt them to other systems.</p>
<p>For this reason, there is still no native consensus protocol in Postgres. The few commercial organizations that offer solutions appear to have utilized Raft, but the details are not publicly known. Anecdotal information seems to imply that they used Raft as a black box.</p>
<p>Instead, we should ask how to make consensus work for Postgres WAL replication. In Multigres, we plan to do precisely this. The additions we will make to Postgres will enable the implementation of many consensus protocols, including Raft.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="redefining-the-problem">Redefining the problem<a href="https://multigres.com/blog/generalized-consensus-part1#redefining-the-problem" class="hash-link" aria-label="Direct link to Redefining the problem" title="Direct link to Redefining the problem">​</a></h2>
<p><img decoding="async" loading="lazy" alt="correctproblem.png" src="https://multigres.com/assets/images/correctproblem-cfcbe0e455322b4cacef9c670ef0e8f9.png" width="365" height="419" class="img_qM5N"></p>
<p>I've asked people about what they think consensus is. I've heard a variety of answers:</p>
<ul>
<li>An algorithm to make a group of nodes agree on a value</li>
<li>Consistency</li>
<li>Majority quorum</li>
</ul>
<p>There is some truth to all those answers. But there is a more appropriate definition:</p>
<div class="theme-admonition theme-admonition-tip admonition_SdGN alert alert--success"><div class="admonitionHeading_X6BD"><span class="admonitionIcon_GcAm"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Key Definition</div><div class="admonitionContent_f1Oq"><p>Consensus solves the problem of Distributed Durability.</p></div></div>
<p>If you look back at all the places where consensus has been used, you'll realize that durability is the primary reason why it gets used.</p>
<p>Beyond durability, we want the system to recover and resume operation quickly in case of a failure. For this, we need automation that detects and responds to such failures. From a theoretical viewpoint, failure detection isn't in scope. However, we can't build a usable system without this capability. Therefore, we should make it a requirement:</p>
<div class="theme-admonition theme-admonition-tip admonition_SdGN alert alert--success"><div class="admonitionHeading_X6BD"><span class="admonitionIcon_GcAm"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Key Definition</div><div class="admonitionContent_f1Oq"><p>Consensus also solves the problem of High Availability.</p></div></div>
<p>Of course, we also want to ensure that nodes don't diverge while fulfilling the above two requirements. In a way, this is an implicit requirement, because a system that diverges has essentially failed at durability.</p>
<p>To restate in simple words:</p>
<p><em>A consensus system must ensure that every request is saved elsewhere before it is completed and acknowledged. If there is a failure after the acknowledgment, the system must have the ability to find the saved requests, complete them, and resume operations from that point.</em></p>
<p>With the problem defined this way, we will work on a focused solution. As we progress, we will learn concepts and establish rules. We will also explore different implementation options. Then, we can verify the current algorithms against these rules.</p>
<ul>
<li><a href="https://multigres.com/blog/generalized-consensus">Full Series Overview</a></li>
</ul>]]></content:encoded>
            <category>planetpg</category>
            <category>consensus</category>
            <category>distributed-systems</category>
            <category>durability</category>
        </item>
        <item>
            <title><![CDATA[Introducing Generalized Consensus: An Alternate Approach to Distributed Durability]]></title>
            <link>https://multigres.com/blog/generalized-consensus</link>
            <guid>https://multigres.com/blog/generalized-consensus</guid>
            <pubDate>Mon, 20 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Today, we're releasing a series that presents a fresh perspective on consensus algorithms. Rather than treating consensus as a monolithic black box, we propose a conceptual framework that makes these systems more approachable, adaptable, and flexible.]]></description>
            <content:encoded><![CDATA[<p>Today, we're releasing a series that presents a fresh perspective on consensus algorithms. Rather than treating consensus as a monolithic black box, we propose a conceptual framework that makes these systems more approachable, adaptable, and flexible.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="why-another-take-on-consensus">Why Another Take on Consensus?<a href="https://multigres.com/blog/generalized-consensus#why-another-take-on-consensus" class="hash-link" aria-label="Direct link to Why Another Take on Consensus?" title="Direct link to Why Another Take on Consensus?">​</a></h2>
<p>Consensus algorithms like Paxos and Raft have been foundational to distributed systems for decades. Paxos, while powerful, has been notoriously difficult to understand. Raft improved accessibility but remains a monolithic algorithm that's risky to modify. This has effectively limited our flexibility in adapting consensus systems to modern cloud architectures.</p>
<p>The problem is twofold:</p>
<ol>
<li><strong>The problem itself is not well-defined</strong> - most explanations focus on what consensus does rather than what problem it solves</li>
<li><strong>Research has focused on proving specific algorithms</strong> rather than building conceptual frameworks</li>
</ol>
<p>This series takes a different approach.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="what-we-cover">What We Cover<a href="https://multigres.com/blog/generalized-consensus#what-we-cover" class="hash-link" aria-label="Direct link to What We Cover" title="Direct link to What We Cover">​</a></h2>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="a-new-definition">A New Definition<a href="https://multigres.com/blog/generalized-consensus#a-new-definition" class="hash-link" aria-label="Direct link to A New Definition" title="Direct link to A New Definition">​</a></h3>
<p>We start by redefining consensus around the actual problem it solves:</p>
<p><em>Consensus solves the problem of Distributed Durability and High Availability.</em></p>
<p>In simpler terms: A consensus system must ensure that every request is saved elsewhere before it is acknowledged. If there is a failure, the system must have the ability to find the saved requests, complete them, and resume operations.</p>
<p>This definition shifts the focus from the algorithm to the goal, making it easier to reason about different approaches and implementations.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="breaking-free-from-majority-quorum">Breaking Free from Majority Quorum<a href="https://multigres.com/blog/generalized-consensus#breaking-free-from-majority-quorum" class="hash-link" aria-label="Direct link to Breaking Free from Majority Quorum" title="Direct link to Breaking Free from Majority Quorum">​</a></h3>
<p>Today's cloud environments have complex topologies with nodes, racks, availability zones, and regions. Yet we're stuck with rigid majority quorum requirements that don't align with these realities.</p>
<p>What if you want durability that requires:</p>
<ul>
<li>At least one replica in a different availability zone?</li>
<li>Two replicas across any two distinct regions?</li>
<li>A specific combination based on your network topology and cost structure?</li>
</ul>
<p>The series demonstrates how to accommodate arbitrarily complex durability policies without changing the core algorithm. We introduce the concept of <strong>pluggable durability policies</strong> that can be specified externally, like a plugin.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="goal-oriented-rules">Goal-Oriented Rules<a href="https://multigres.com/blog/generalized-consensus#goal-oriented-rules" class="hash-link" aria-label="Direct link to Goal-Oriented Rules" title="Direct link to Goal-Oriented Rules">​</a></h3>
<p>Instead of prescribing a specific algorithm, we establish a set of governing rules that any consensus implementation must satisfy:</p>
<ol>
<li><strong>Durability</strong>: Every decision must be made durable according to the policy</li>
<li><strong>Consistency</strong>: Decisions must be applied sequentially, with proper revocation and discovery mechanisms</li>
</ol>
<p>These rules avoid dictating specific implementations, allowing for diverse approaches while maintaining safety guarantees.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="practical-applications">Practical Applications<a href="https://multigres.com/blog/generalized-consensus#practical-applications" class="hash-link" aria-label="Direct link to Practical Applications" title="Direct link to Practical Applications">​</a></h3>
<p>The series isn't purely theoretical. We demonstrate:</p>
<ul>
<li>How existing algorithms (Paxos, Raft) are special cases of this generalization</li>
<li>A concrete implementation approach inspired by Raft that supports flexible durability policies</li>
<li>How to separate concerns like failure detection into independent components (coordinators)</li>
<li>Practical considerations for building real systems, including handling ruleset changes</li>
</ul>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="the-complete-series">The Complete Series<a href="https://multigres.com/blog/generalized-consensus#the-complete-series" class="hash-link" aria-label="Direct link to The Complete Series" title="Direct link to The Complete Series">​</a></h2>
<p>The series consists of 11 parts:</p>
<ol>
<li><strong><a href="https://multigres.com/blog/generalized-consensus-part1">Defining the Problem</a></strong> - Reframing consensus around distributed durability</li>
<li><strong><a href="https://multigres.com/blog/generalized-consensus-part2">Setting the Requirements</a></strong> - Log replication, durability requirements, and pluggable policies</li>
<li><strong><a href="https://multigres.com/blog/generalized-consensus-part3">Governing Rules</a></strong> - The core rules every consensus system must follow</li>
<li><strong><a href="https://multigres.com/blog/generalized-consensus-part4">Fulfilling Requests</a></strong> - How leaders process and durably commit requests</li>
<li><strong><a href="https://multigres.com/blog/generalized-consensus-part5">Ordering Decisions</a></strong> - Finding a way to order events consistently</li>
<li><strong><a href="https://multigres.com/blog/generalized-consensus-part6">Revocation and Candidacy</a></strong> - Satisfying the prerequisites for a leader change</li>
<li><strong><a href="https://multigres.com/blog/generalized-consensus-part7">Discovery and Propagation</a></strong> - Finding and honoring previously committed decisions</li>
<li><strong><a href="https://multigres.com/blog/generalized-consensus-part8">Changing the Rules</a></strong> - Changing the ruleset dynamically</li>
<li><strong><a href="https://multigres.com/blog/generalized-consensus-part9">Consistent Reads</a></strong> - How to serve consistent reads</li>
<li><strong><a href="https://multigres.com/blog/generalized-consensus-part10">Addenda</a></strong> - Topics we deferred, like health checks, term numbers, etc.</li>
<li><strong><a href="https://multigres.com/blog/generalized-consensus-part11">Recap</a></strong> - Bringing it all together with a complete Raft-inspired design</li>
</ol>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="why-this-matters-for-multigres">Why This Matters for Multigres<a href="https://multigres.com/blog/generalized-consensus#why-this-matters-for-multigres" class="hash-link" aria-label="Direct link to Why This Matters for Multigres" title="Direct link to Why This Matters for Multigres">​</a></h2>
<p>This work directly supports our goals for <a href="https://multigres.com/" target="_blank" rel="noopener noreferrer">Multigres</a>. Postgres currently lacks a native consensus protocol. Existing solutions appear to use Raft as a black box, which limits their ability to optimize for Postgres's WAL replication model.</p>
<p>By building on this generalized framework, Multigres will support:</p>
<ul>
<li>Flexible durability policies (cross-AZ, cross-region, custom combinations)</li>
<li>Better integration with Postgres's replication mechanisms</li>
<li>The ability to scale to larger cohort sizes without performance degradation</li>
<li>Native two-phase sync replication as a foundation for consensus</li>
</ul>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="start-reading">Start Reading<a href="https://multigres.com/blog/generalized-consensus#start-reading" class="hash-link" aria-label="Direct link to Start Reading" title="Direct link to Start Reading">​</a></h2>
<p>The series is designed to be read sequentially, with each part building on previous concepts. Familiarity with Raft or Paxos will be beneficial, though not required.</p>
<p><a href="https://multigres.com/blog/generalized-consensus-part1">Start with: Defining the Problem</a></p>
<p>We believe this framework opens new possibilities for consensus systems that can adapt to modern cloud architectures while maintaining the safety guarantees we depend on.</p>
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="credits">Credits<a href="https://multigres.com/blog/generalized-consensus#credits" class="hash-link" aria-label="Direct link to Credits" title="Direct link to Credits">​</a></h3>
<p><em>Many people from the community have reviewed this series and provided valuable feedback. It will be hard to name all of them. My heartfelt thanks go to the following members of the Multigres maintainer team, who went above and beyond to ensure the series is clear and approachable:</em></p>
<ul>
<li><a href="https://github.com/deepthi" target="_blank" rel="noopener noreferrer">Deepthi Sigireddi</a></li>
<li><a href="https://github.com/rafael" target="_blank" rel="noopener noreferrer">Rafael Chacon</a></li>
<li><a href="https://github.com/GuptaManan100" target="_blank" rel="noopener noreferrer">Manan Gupta</a></li>
<li><a href="https://github.com/dweitzman" target="_blank" rel="noopener noreferrer">David Weitzman</a></li>
<li><a href="https://github.com/cuongdo" target="_blank" rel="noopener noreferrer">Cuong Do</a></li>
</ul>]]></content:encoded>
            <category>planetpg</category>
            <category>consensus</category>
            <category>distributed-systems</category>
            <category>durability</category>
        </item>
        <item>
            <title><![CDATA[High Availability and Postgres full-sync replication]]></title>
            <link>https://multigres.com/blog/postgres-ha-full-sync</link>
            <guid>https://multigres.com/blog/postgres-ha-full-sync</guid>
            <pubDate>Tue, 22 Jul 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[In order to achieve High Availability (HA) in Postgres, we need to have at least one other standby replica that keeps up with the primary’s changes near-real-time. Using Postgres physical replication is the most common method to achieve this.]]></description>
            <content:encoded><![CDATA[<p>In order to achieve High Availability (HA) in Postgres, we need to have at least one other standby replica that keeps up with the primary’s changes near-real-time. Using Postgres physical replication is the most common method to achieve this.</p>
<p>High Availability also requires us to solve the problem of distributed durability. After all, we have to make sure that no transactions are lost when we fail over to a standby. So, if we can make this work, we can avoid the need for an external system like a mounted cloud drive or other exotic solutions to ensure that we don’t lose data. We could just have all the servers use their local NVMe drives for storage. This will serendipitously improve performance, since the drives are an order of magnitude faster than the network, and reduce costs, since disk I/O does not incur network cost.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="multigres-features">Multigres features<a href="https://multigres.com/blog/postgres-ha-full-sync#multigres-features" class="hash-link" aria-label="Direct link to Multigres features" title="Direct link to Multigres features">​</a></h2>
<p>We plan to support HA by configuring Postgres in full-sync replication mode, mainly because this is a widely used feature. However, there are some pitfalls to watch out for. We will cover these in the next section.</p>
<p>Multigres will bring the flexibility of a pluggable durability policy. There will be a few predefined ones, but if your needs are bespoke, you can just write an extension. Examples of durability policies include:</p>
<ul>
<li>cross-availability zone (cross-AZ): a replica in at least one other AZ must have my data</li>
<li>cross-region: same as cross-AZ, but for regions</li>
<li>at least two AZs: replicas in two distinct AZs must have my data</li>
<li>majority quorum: traditional consensus rules like Raft</li>
<li>etc.</li>
</ul>
<p>The advantages of policy-based durability go beyond ease of use. Essentially, the policy does not need to depend on the number of nodes in the quorum. This flexibility allows you to deploy more nodes or additional zones without affecting the performance of the cluster.</p>
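<p>To illustrate, here is what the “at least two AZs” policy above could look like as code (invented names; a sketch rather than the real implementation). Note that nothing in it refers to the total number of nodes, which is why adding nodes or zones does not slow down commits:</p>
<pre><code class="language-go">package durability

// AtLeastNZones reports whether the acknowledged zones (one entry per
// acking replica, plus the primary's own zone) cover at least n
// distinct availability zones. The rule depends on zones, never on
// the total number of nodes in the cluster.
func AtLeastNZones(n int, primaryZone string, ackZones []string) bool {
	zones := map[string]bool{primaryZone: true}
	for _, z := range ackZones {
		zones[z] = true
	}
	return len(zones) >= n
}
</code></pre>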
<p>More about how this works will be covered in subsequent blog posts related to generalized consensus.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="existing-replication-pitfalls">Existing replication pitfalls<a href="https://multigres.com/blog/postgres-ha-full-sync#existing-replication-pitfalls" class="hash-link" aria-label="Direct link to Existing replication pitfalls" title="Direct link to Existing replication pitfalls">​</a></h2>
<p>Postgres has many options for how you can configure replication. They broadly fall into two categories (a rough <code>postgresql.conf</code> sketch follows the list):</p>
<ol>
<li>Asynchronous replication (async)<!-- -->
<ul>
<li>the primary commits the data and returns success to the caller.</li>
<li>changes are then sent asynchronously to the standby replica(s).</li>
<li>replica applies the changes.</li>
</ul>
</li>
<li>Synchronous replication (full-sync)<!-- -->
<ul>
<li>primary flushes the commit information to the <a href="https://www.postgresql.org/docs/current/wal-intro.html" target="_blank" rel="noopener noreferrer">WAL</a>.</li>
<li>changes are shipped to the standby replica(s).</li>
<li>replica acknowledges the message: Sub-configurations here control the exact time when the replica acknowledges the message and applies the changes.</li>
<li>primary externalizes the transaction by releasing locks, etc., and returns success to the caller.</li>
</ul>
</li>
</ol>
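<p>For reference, these two categories map onto familiar <code>postgresql.conf</code> settings roughly as follows; the standby names below are placeholders:</p>
<pre><code class="language-ini"># Asynchronous (the default): commit returns after the local WAL flush;
# no standby is waited on.
synchronous_standby_names = ''

# Synchronous (full-sync): wait for at least one of the named standbys.
# synchronous_commit controls when the standby acknowledges:
#   remote_write -> standby has written (not yet flushed) the WAL
#   on           -> standby has flushed the WAL
#   remote_apply -> standby has also applied the changes
synchronous_standby_names = 'ANY 1 (standby1, standby2)'
synchronous_commit = on
</code></pre>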
<p>The problem with async replication is that there are failure modes where you may lose data. For example, if a primary node crashes after acknowledging a commit but before sending the data to the replica, that commit is lost. In the case of a network partition, this can go on for a long time risking substantial data loss. Patroni has a feature controlled by the <code>maximum_lag_on_failover</code> setting that allows you to limit how much data you can lose in such a scenario. This feature is out of scope for Multigres as we are not planning to support async replication.</p>
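<p>For context, in Patroni that limit is a one-line setting (the value is in bytes; the figure below is just the default):</p>
<pre><code class="language-yaml"># patroni.yml (illustrative): a standby that is more than this many
# bytes behind the primary is not eligible for promotion on failover,
# bounding how much data you can lose with async replication.
maximum_lag_on_failover: 1048576
</code></pre>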
<p>Synchronous replication addresses this problem, but it also has its own pitfalls, explained below. This <a href="https://youtu.be/PFn9qRGzTMc?si=neDR5sgti16Hfni0" target="_blank" rel="noopener noreferrer">video from Kukushkin</a> covers many possible ways full-sync replication can fail on you. Overall, the problems are due to one of the following reasons:</p>
<ol>
<li>
<p>The primary writes to the WAL and crashes before sending the data to the standby:</p>
<img src="https://multigres.com/img/blog/ha-postgres-full-sync/fsync-1--dark.svg" alt="fsync1">
<ul>
<li>In setups where clients are subscribed to the WAL on the primary (like a CDC process), they would view and transmit this as a committed transaction. However, a failover will abandon this transaction, while the subscriber might have irreversibly acted on it.</li>
<li>If the primary is restarted after a failover, it will have a timeline that is incompatible with the rest of the cluster, and will not be able to rejoin the cluster.</li>
<li>If the client that initiated the transaction breaks the connection, the transaction becomes visible and can be read by other connections. After such a read, the node can fail before the WAL replicates, resulting in phantom reads. Although this is rare, it’s theoretically possible.</li>
<li>A replica that is not a hot standby may receive a transaction before the standby does. This becomes a problem if the failover process does not account for this possibility.</li>
</ul>
</li>
<li>
<p>The primary externalizes a transaction only after receiving an ack from the replica. But the replica itself will externalize it as soon as it has received it: an application may see this committed transaction on the replica, but may not be able to find it on the primary.</p>
<img src="https://multigres.com/img/blog/ha-postgres-full-sync/fsync-2--dark.svg" alt="fsync2">
</li>
<li>
<p>Failures during failover: If there are multiple failures during a failover, we may end up with multiple conflicting timelines. There is no methodical way to determine which one is authoritative.</p>
<img src="https://multigres.com/img/blog/ha-postgres-full-sync/fsync-3--dark.svg" alt="fsync3">
</li>
</ol>
<p>The above failure modes get progressively more unwieldy as the number of nodes increases.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="multigres-solution">Multigres solution<a href="https://multigres.com/blog/postgres-ha-full-sync#multigres-solution" class="hash-link" aria-label="Direct link to Multigres solution" title="Direct link to Multigres solution">​</a></h2>
<p>Some of these problems have mitigations, but not all are solvable. With Multigres, we plan to address some of these issues as follows:</p>
<ul>
<li><strong>1a: CDC picking up a transaction early:</strong> Run logical replication on a replica. Also, the new <a href="https://postgresqlco.nf/doc/en/param/synchronized_standby_slots/" target="_blank" rel="noopener noreferrer">Synchronized Standby Slots</a> feature in PG17+ helps with this issue (see the snippet after this list).</li>
<li><strong>1b: Primary crashes after writing a transaction to the WAL:</strong> Multigres can automate the repair of the primary when it comes back up.</li>
<li><strong>1c: Client reading phantom records:</strong> There is no mitigation, but this is a rare occurrence.</li>
<li><strong>1d: A replica that is not a standby receives an errant transaction:</strong> Multigres can automate the repair.</li>
<li><strong>2: Replica committing before primary:</strong> There is no mitigation, but the application can work around this possibility on a case-by-case basis.</li>
<li><strong>3: Failures during failover:</strong> Assuming clocks are reliable, Multigres can use the WAL timestamps to determine which of the failover attempts was the latest, and adopt it. We will cover why this is the most authoritative timeline when we explain consensus algorithms in upcoming blogs.</li>
</ul>
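<p>For reference, the PG17+ setting mentioned in 1a looks like this; the slot names are placeholders:</p>
<pre><code class="language-ini"># postgresql.conf (PG17+, illustrative slot names): logical WAL senders
# wait for these physical standby slots to confirm receipt before
# streaming a change, so a CDC consumer can never get ahead of the
# synchronous standby.
synchronized_standby_slots = 'standby1_slot, standby2_slot'
</code></pre>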
<p>Many of these problems can be tolerated by the application, but it would be nice to eventually eliminate them altogether. This is why we intend to implement a two-phase sync replication solution, which can be the foundation for consensus protocols.</p>
<p>Furthermore, the two-phase approach will have the same performance overhead as the full-sync approach, so you have nothing to lose and everything to gain by transitioning to it. In any case, until it becomes robust, the full-sync approach can continue to be used as an acceptable compromise.</p>
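<p>To sketch the shape of that idea (invented names; a concept sketch, not the actual design): phase one persists the WAL on the standby and is the only round trip on the commit path, while phase two confirms the commit in the background. That is why the client-visible overhead matches full-sync.</p>
<pre><code class="language-go">package twophasesync

// Standby is an invented interface for this sketch.
type Standby interface {
	// Phase 1: persist the WAL for the transaction and acknowledge.
	PersistWAL(txnID string, wal []byte) error
	// Phase 2: mark the persisted transaction as committable.
	// This happens off the commit path.
	ConfirmCommit(txnID string) error
}

// Commit shows why the latency matches full-sync: only phase 1 is
// waited on before acknowledging the client.
func Commit(s Standby, txnID string, wal []byte) error {
	if err := s.PersistWAL(txnID, wal); err != nil {
		return err // not durable elsewhere: do not acknowledge
	}
	// Durable on the standby: safe to acknowledge the client now.
	go s.ConfirmCommit(txnID) // phase 2 runs in the background
	return nil
}
</code></pre>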
<h3 class="anchor anchorWithStickyNavbar_B8JT" id="relationship-with-consensus">Relationship with consensus<a href="https://multigres.com/blog/postgres-ha-full-sync#relationship-with-consensus" class="hash-link" aria-label="Direct link to Relationship with consensus" title="Direct link to Relationship with consensus">​</a></h3>
<p>Can you implement a consensus protocol using full-sync replication? The answer depends on how you want to define consensus.</p>
<p>If we use the minimal definition of consensus: “If a system acknowledges a transaction, it will not lose it, and all its nodes must eventually converge”, the answer is yes. But clearly, there are intermediate states in the system that are inconsistent. So, some may argue that this loss of consistency disqualifies it.</p>
<p>Regardless of which definition you consider authoritative, the minimal one allows us to implement consensus algorithms on top of full-sync and still benefit from their other features. This is why we can implement the flexible “policy-based durability” approach for both the full-sync and the two-phase sync implementations.</p>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="whats-next">What’s next?<a href="https://multigres.com/blog/postgres-ha-full-sync#whats-next" class="hash-link" aria-label="Direct link to What’s next?" title="Direct link to What’s next?">​</a></h2>
<p>In subsequent blog posts, we will cover generalized consensus algorithms, two-phase sync replication in Postgres, and how they can be used to achieve the policy-based durability we mentioned earlier.</p>
            <category>planetpg</category>
            <category>postgres</category>
            <category>high-availability</category>
            <category>replication</category>
            <category>durability</category>
        </item>
        <item>
            <title><![CDATA[Interview: Multigres on Postgres.FM]]></title>
            <link>https://multigres.com/blog/postgres-fm-interview</link>
            <guid>https://multigres.com/blog/postgres-fm-interview</guid>
            <pubDate>Fri, 11 Jul 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Sugu discusses Multigres on the Postgres.FM YouTube channel.]]></description>
            <content:encoded><![CDATA[<p>Sugu discusses Multigres on the Postgres.FM YouTube channel.</p>
<iframe width="700" height="450" src="https://www.youtube-nocookie.com/embed/KOepJivmWTg" title="YouTube video" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="chapters">Chapters<a href="https://multigres.com/blog/postgres-fm-interview#chapters" class="hash-link" aria-label="Direct link to Chapters" title="Direct link to Chapters">​</a></h2>
<ul>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg" target="_blank" rel="noopener noreferrer">00:00:00</a> - Meet Sugu &amp; Multigres</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=99s" target="_blank" rel="noopener noreferrer">00:01:39</a> - Why Sharding Now?</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=200s" target="_blank" rel="noopener noreferrer">00:03:20</a> - Timing is Everything</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=295s" target="_blank" rel="noopener noreferrer">00:04:55</a> - Building Postgres-Native</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=386s" target="_blank" rel="noopener noreferrer">00:06:26</a> - Go vs Rust Decision</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=536s" target="_blank" rel="noopener noreferrer">00:08:56</a> - Local Storage Strategy</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=642s" target="_blank" rel="noopener noreferrer">00:10:42</a> - MySQL to Postgres Port</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=738s" target="_blank" rel="noopener noreferrer">00:12:18</a> - License Philosophy</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=858s" target="_blank" rel="noopener noreferrer">00:14:18</a> - Apache vs BSD Choice</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=930s" target="_blank" rel="noopener noreferrer">00:15:30</a> - RDS Compatibility Trade-offs</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=1024s" target="_blank" rel="noopener noreferrer">00:17:04</a> - Managed-Only Approach</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=1164s" target="_blank" rel="noopener noreferrer">00:19:24</a> - Learning from Vitess</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=1294s" target="_blank" rel="noopener noreferrer">00:21:34</a> - Protection Before Sharding</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=1400s" target="_blank" rel="noopener noreferrer">00:23:20</a> - Observability Built-in</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=1492s" target="_blank" rel="noopener noreferrer">00:24:52</a> - OLTP vs OLAP Focus</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=1602s" target="_blank" rel="noopener noreferrer">00:26:42</a> - YouTube's Scale Lessons</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=1702s" target="_blank" rel="noopener noreferrer">00:28:22</a> - When to Start Sharding</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=1816s" target="_blank" rel="noopener noreferrer">00:30:16</a> - Small Instances Win</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=1912s" target="_blank" rel="noopener noreferrer">00:31:52</a> - Physical Replication Limits</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=1980s" target="_blank" rel="noopener noreferrer">00:33:00</a> - Logical Replication Plans</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=2112s" target="_blank" rel="noopener noreferrer">00:35:12</a> - Schema Change Handling</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=2196s" target="_blank" rel="noopener noreferrer">00:36:36</a> - Sync Replication Problems</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=2304s" target="_blank" rel="noopener noreferrer">00:38:24</a> - Data Loss Scenarios</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=2412s" target="_blank" rel="noopener noreferrer">00:40:12</a> - Two-Phase Sync Solution</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=2498s" target="_blank" rel="noopener noreferrer">00:41:38</a> - Beyond Raft Consensus</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=2638s" target="_blank" rel="noopener noreferrer">00:43:58</a> - FlexPaxos Introduction</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=2760s" target="_blank" rel="noopener noreferrer">00:46:00</a> - Durability Over Quorums</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=2876s" target="_blank" rel="noopener noreferrer">00:47:56</a> - Wild Goose Chase Recovery</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=2982s" target="_blank" rel="noopener noreferrer">00:49:42</a> - Distributed System Reality</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=3096s" target="_blank" rel="noopener noreferrer">00:51:36</a> - Query Planner Decisions</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=3198s" target="_blank" rel="noopener noreferrer">00:53:18</a> - Parser Compatibility</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=3280s" target="_blank" rel="noopener noreferrer">00:54:40</a> - Function Routing Challenge</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=3396s" target="_blank" rel="noopener noreferrer">00:56:36</a> - Select Function Writes</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=3494s" target="_blank" rel="noopener noreferrer">00:58:14</a> - Aurora Global Inspiration</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=3602s" target="_blank" rel="noopener noreferrer">01:00:02</a> - Cross-Shard Transactions</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=3696s" target="_blank" rel="noopener noreferrer">01:01:36</a> - Materialized Views Magic</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=3828s" target="_blank" rel="noopener noreferrer">01:03:48</a> - Reference Table Distribution</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=3938s" target="_blank" rel="noopener noreferrer">01:05:38</a> - 2PC Performance Reality</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=4040s" target="_blank" rel="noopener noreferrer">01:07:20</a> - Isolation Trade-offs</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=4122s" target="_blank" rel="noopener noreferrer">01:08:42</a> - Distance Matters</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=4242s" target="_blank" rel="noopener noreferrer">01:10:42</a> - Local Disk Advantages</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=4434s" target="_blank" rel="noopener noreferrer">01:13:54</a> - Backup Recovery Speed</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=4532s" target="_blank" rel="noopener noreferrer">01:15:32</a> - Edge Case Problems</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=4620s" target="_blank" rel="noopener noreferrer">01:17:00</a> - Current Progress Update</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=4700s" target="_blank" rel="noopener noreferrer">01:18:20</a> - Team Building Plans</li>
<li><a href="https://www.youtube.com/watch?v=KOepJivmWTg&amp;t=4754s" target="_blank" rel="noopener noreferrer">01:19:14</a> - Final Thoughts</li>
</ul>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="transcription">Transcription<a href="https://multigres.com/blog/postgres-fm-interview#transcription" class="hash-link" aria-label="Direct link to Transcription" title="Direct link to Transcription">​</a></h2>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="meet-sugu--multigres">Meet Sugu &amp; Multigres<a href="https://multigres.com/blog/postgres-fm-interview#meet-sugu--multigres" class="hash-link" aria-label="Direct link to Meet Sugu &amp; Multigres" title="Direct link to Meet Sugu &amp; Multigres">​</a></h4>
<p>Hello and welcome to Postgres FM, a weekly show to share about all things PostgreSQL. I am Nikolai Samokhvalov of Postgres AI, and I'm joined as usual by Michael Christofides. Hey Nick, hey Michael, and welcome to our guest. Yeah, we are joined by a very special guest, Sugu, who is a co-creator of Vitess, co-founded PlanetScale, and is now at Supabase working on an exciting project called Multigres. So welcome Sugu. Thank you. Glad to be here.</p>
<p>Alright, it's our pleasure. So it's my job to ask you a couple of the easy questions to start off.</p>
<p>So what is Multigres and why are you working on it?</p>
<p>Multigres is a Vitess adaptation for Postgres. It's been on my mind for a long time, many years, and we even had a few false starts with this project. And I guess there is a time for everything, and finally the time has come. So I'm very excited to finally get started on this.</p>
<p>Yeah, timing is an interesting one. It feels like for many years I was looking at PlanetScale and Vitess specifically, very jealously, thinking you can promise the world, you can promise this, you know, horizontal scaling with a relational database for OLTP. And it's, you know, all of the things that people want, and we didn't really have a good answer for it in Postgres, but all of a sudden in the last few months, it seems almost, there are now three or four competing.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="why-sharding-now">Why Sharding Now?<a href="https://multigres.com/blog/postgres-fm-interview#why-sharding-now" class="hash-link" aria-label="Direct link to Why Sharding Now?" title="Direct link to Why Sharding Now?">​</a></h4>
<p>All doing it. So why now? Why is it all happening now?</p>
<p>About Vitess for Postgres - we started, and a couple of times we had calls where I tried to involve a couple of guys, and from my understanding it never worked, because people could not do it themselves, being busy with, I guess, MySQL-related things, and the guys who looked at the complexity of Postgres didn't proceed. And actually, in one case, they decided to build from scratch - it was a spectacular project, it's still alive, and there is sharding for Postgres.</p>
<p>Yeah, and you borrowed and you borrowed and they borrowed it.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="timing-is-everything">Timing is Everything<a href="https://multigres.com/blog/postgres-fm-interview#timing-is-everything" class="hash-link" aria-label="Direct link to Timing is Everything" title="Direct link to Timing is Everything">​</a></h4>
<p>Yeah, so other folks were also involved, and so for me it was disappointing that it didn't work, and at some point I saw a message in Vitess, I think, that we are not going to do it, so like, don't expect it. I felt so bad, because I was so excited about doing it, and then I realized, oh my god, you know. But now PlanetScale started to support Postgres, so what's happening? I don't understand - just the right time, right? Enough companies using Postgres really needed it; at least one horse will win. So yeah, it's great, but yeah, long, long story to this point.</p>
<p>Yeah, sometimes when there are multiple projects there's kind of slight differences in philosophy or approach or trade-offs, like willing to trade one thing off in relation to another, and I saw your plan, I really liked that you mentioned building incrementally, so Vitess is a huge project, lots and lots of features, but I've heard you talking in the past about building it quite incrementally while at YouTube, you know, it didn't start off as complex as it is now, obviously, and you did it kind of one feature at a time, and it sounds like that's the plan again with Multigres, is that different to some of the other projects, or what do you see as your philosophy and how it might differ slightly to some of the others?</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="building-postgres-native">Building Postgres-Native<a href="https://multigres.com/blog/postgres-fm-interview#building-postgres-native" class="hash-link" aria-label="Direct link to Building Postgres-Native" title="Direct link to Building Postgres-Native">​</a></h4>
<p>I think my philosophy is, I would say, that I don't want to compromise on what the final project is going to look like. Which is: a project that should feel native, as if it was for Postgres, by Postgres, kind of thing. I want it to be a pure Postgres project. And Go definitely will bring a few hundred microseconds of latency overhead. Usually it's not a big deal, but maybe in some cases it's some deal, right? Are you happy with Go?</p>
<p>Yeah, because you're one of the first big Go language users, building Vitess, as we know from various interviews and so on. So, is it still a good choice? Because now there is Rust, right?</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="go-vs-rust-decision">Go vs Rust Decision<a href="https://multigres.com/blog/postgres-fm-interview#go-vs-rust-decision" class="hash-link" aria-label="Direct link to Go vs Rust Decision" title="Direct link to Go vs Rust Decision">​</a></h4>
<p>Yes, I would say, by the way, when we started, compared to where Go is today, it was a nightmare. Like, 10 milliseconds or something round-trip is what we were paying for. Those days we had hard disks, by the way, so that's another 3-5 milliseconds just within the database. But things are a lot better now, and at this point, the way I would put it is: the trade-offs are in favor of Go. Let's put it that way, mainly because there is a huge amount of existing code that we can just lift and port. And rewriting all of that in Rust is going to just delay us. And at least in Vitess, it has proven itself to scale to like multi-hundreds of terabytes. And the latencies that people see are not affected by a couple of hundred microseconds. Plus, I think, there's this inherent acceptance of network latency for storage and stuff. And if you bring the storage local, then this actually wins out over anything that's there.</p>
<p>That's exactly what I wanted to mention.</p>
<p>Yeah, I see PlanetScale right now. They came out with Postgres support, but no Vitess. I'm very curious how long it will take for them to bring it, like, into competition with you. It's an interesting question. But from the past week, my impression is, my take is: it's local storage. And this is great, because local storage for Postgres - we use it in some places where we struggle with EBS volumes and so on. But it's considered not standard, not safe, blah, blah. There are companies who use it, I know myself, right? And it's great. Today, for example, with Patroni, and since Postgres 12, we don't need to restart nodes when we have a failover. So if you lose a node, forget about the node - we just fail over and so on. And with local storage, not a big deal. But now, with your plans to bring local storage, I expect it will become more and more popular, and that's great. So you shave off latency there and keep going...</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="local-storage-strategy">Local Storage Strategy<a href="https://multigres.com/blog/postgres-fm-interview#local-storage-strategy" class="hash-link" aria-label="Direct link to Local Storage Strategy" title="Direct link to Local Storage Strategy">​</a></h4>
<p>It's a win because eliminating one network hop completely cancels out the language-level overhead. It might be that Go will improve additionally, but yeah, good. I wanted to go back - you mentioned not wanting to compromise on feeling Postgres native, and that feels to me like a really big statement, coming from Vitess being very MySQL specific. Saying you want to be Postgres native feels like it adds a lot of work to the project, or, you know, it feels like a lot to me. What is it - is that about compatibility with the project? Like, what does it mean to be Postgres native? There's two answers - one is why we still think we can bring Vitess if it was built for MySQL, and how do you make it Postgres native. That's because of Vitess's history - for the longest time, Vitess was built not to be tied to MySQL; it was built to be a generic SQL 92 compliant database. That was actually our restriction for a very long time, until the MySQL community said, you know, you need to support all these MySQL features, otherwise we won't use it, right - common table expressions with... right, it's, yeah, I guess a SQL 99 feature, not 92.</p>
<p>Yeah I think the first part that I built was SQL 92 which is the most popular one.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="mysql-to-postgres-port">MySQL to Postgres Port<a href="https://multigres.com/blog/postgres-fm-interview#mysql-to-postgres-port" class="hash-link" aria-label="Direct link to MySQL to Postgres Port" title="Direct link to MySQL to Postgres Port">​</a></h4>
<p>So, that's Answer 1. Answer 2 is more about the behavior of Postgres. What we want is to completely mimic the Postgres behavior right from the beginning. Basically, in other words, we plan to actually copy or translate what we can from the Postgres engine itself, where that behavior is very specific to Postgres. And the goal is not compatibility just at the communication layer, but even internally - possibly even at the risk of recreating bugs. In this case, it's very... so there are products - in the hands of Microsoft, it got everything open sourced; before, resharding was only in the paid version, now it's in the free version, open sourced, so it's fully open sourced - and they put Postgres in between, so they don't need to mimic it, they can just use it, right. And the latency overhead is surprisingly low - we checked it.</p>
<p>Well, let's see - but it's a whole database in between. But it's sub-millisecond, so it's acceptable as well. I think it's a half-millisecond overhead in our experiments with a simple select 1 or something. So don't you think, in comparison, it's quite a challenging point: you say you're going to mimic a lot of stuff, but they just use Postgres in between?</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="license-philosophy">License Philosophy<a href="https://multigres.com/blog/postgres-fm-interview#license-philosophy" class="hash-link" aria-label="Direct link to License Philosophy" title="Direct link to License Philosophy">​</a></h4>
<p>Yeah, I think there's a difference in architecture, or approach, between Multigres and Citus. I think the main difference is that it's a single coordinator for Citus, and there is some bottleneck issue with that if you scale to extremely large workloads that go into millions of QPS, hundreds of terabytes. So having that single bottleneck, I think, would be a problem. Now, from what I understand, you can put multiple coordinator nodes there, and you can also put a load balancer to mitigate the connection issues. So it can scale as well.</p>
<p>That's good - that's something I didn't know before. So it's possible, then, that they may also have something that can viably scale for OLTP. So we're still exploring this, and more benchmarks are needed.</p>
<p>Actually, I'm surprised how few, and how non-comprehensive, the benchmarks published for this are.</p>
<p>Yeah, what I know of Citus is probably what you told me when we met. So it's about five years old.</p>
<p>Another big difference - and this is typically Nikolai's question - is on the license front. I think you picked about as open a license as you could possibly pick, which is not the case, I think, for many of the other projects. So that feels to me like a very Supabase thing to do, and also in line with what Postgres is, and that seems to me like a major advantage in terms of collaborating with others - other providers also adopting this or working with you to make it better. What's your philosophy on that side of things? My philosophy follows from my metric for success.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="apache-vs-bsd-choice">Apache vs BSD Choice<a href="https://multigres.com/blog/postgres-fm-interview#apache-vs-bsd-choice" class="hash-link" aria-label="Direct link to Apache vs BSD Choice" title="Direct link to Apache vs BSD Choice">​</a></h4>
<p>The only way to have a project widely adopted is to have a good license - a license that people are confident to use. That has been the case from day one of Vitess. We actually first launched under a BSD license, which is even more permissive than Apache. Why do they do that, do you know? Why does CNCF want Apache?</p>
<p>I think Apache is a pretty good license. They just made it a policy. I mean, had we asked to keep the BSD license, they would have allowed us, but we didn't feel like it was a problem to move to Apache. I remember you described that when you did it at YouTube, you thought about external users. You need external users for this project. And I guess at Google, GPL is not popular at all, we know.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="rds-compatibility-trade-offs">RDS Compatibility Trade-offs<a href="https://multigres.com/blog/postgres-fm-interview#rds-compatibility-trade-offs" class="hash-link" aria-label="Direct link to RDS Compatibility Trade-offs" title="Direct link to RDS Compatibility Trade-offs">​</a></h4>
<p>Also, compared to Citus, I think you have chances to be compatible with RDS and other Managed Postgres, to work on top of them, right? Unlike Citus, which requires extensions and so on, right?</p>
<p>Correct, correct, yes. This was actually something that we learnt very early on. We made a five-line change to MySQL, just to make Vitess work initially, and it was such a nightmare to keep that binary up, to keep that forked build running and, yeah, to keep that fork alive. So we decided: no, we are going to make this work without a single line of code change in MySQL. And that actually is what helped us move forward, because people would come in with all kinds of configurations and say, you know, make it work for this.</p>
<p>So in this case, actually - we will probably talk about the consensus part - that is one part where we think it is worth making a patch for Postgres, and we're going to work hard at getting that patch accepted. But I think what we will do is, we will also make Multigres work for unpatched Postgres, for those who want it that way, except they will lose all the cool things that consensus can give you. I'm smiling because we have so many variations.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="managed-only-approach">Managed-Only Approach<a href="https://multigres.com/blog/postgres-fm-interview#managed-only-approach" class="hash-link" aria-label="Direct link to Managed-Only Approach" title="Direct link to Managed-Only Approach">​</a></h4>
<p>This sync might happen as well. Don't they claim full compatibility with Postgres? Not fully, but most of it. They did interesting stuff in memory, like column storage in memory for tables. It's row storage on disk, but column storage in memory. But it looks like kind of Postgres and we actually even had to get some questions answered from my team unexpectedly because we don't normally work with OLAP. But it looks like Postgres. So I could imagine the request, let's support OLAP as well.</p>
<p>But my question: I remember a feature in Vitess - working with RDS and managed MySQL. Has this feature survived?</p>
<p>No, actually, later we decided... we actually call it managed versus unmanaged. Managed meaning that Vitess manages its databases, and unmanaged means that the database is managed by somebody else, and Vitess just has access and proxies the queries.</p>
<p>At some point in time, we realized that supporting both is diluting our efforts. And that's when we decided it's not worth it to try and make this work with every available version that exists out there in the world. And we said, okay, we will do only managed, which means that we will manage it ourselves. And if you want, we will build the tools to migrate out of wherever you are. And we'll make it safe, we'll make it completely transparent. In other words, you deploy with us on both, and then we'll migrate your data out without you having to change your application. But then...</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="learning-from-vitess">Learning from Vitess<a href="https://multigres.com/blog/postgres-fm-interview#learning-from-vitess" class="hash-link" aria-label="Direct link to Learning from Vitess" title="Direct link to Learning from Vitess">​</a></h4>
<p>Vitess can be more intentional about its features, more opinionated about how clusters are to be managed, and we were able to commit to that because, at that point, Vitess had become mature enough that people completely trusted it. They actually preferred it over other managed solutions, so it wasn't a problem at that time.</p>
<p>Yeah, five-nines is like what Vitess shoots for, and like most big companies that run Vitess do operate at that level of availability with this team.</p>
<p>So what's the plan for Multigres? Are you going to support not only the managed version, right?</p>
<p>Yes, it would be only managed versions, because I believe that the cluster management section of Vitess will port directly over to Postgres, which means that once it goes live, it will come with batteries included on cluster management, which should hopefully be equal to or better than what is already out there. So I don't see a reason why we should try to make it work with everything that exists today. So it means this is the same as with Citus: it doesn't work with RDS on one hand, but on the other hand, it's not only a sharding solution, it's everything, which is great. I mean, it's interesting, super interesting. A lot of problems will be solved, and I expect even more managed services will be created. I don't know how it will continue, like in terms of Supabase, because of the very open license and so on, but I also expect that many people will reconsider their opinion about managed - we have an episode about this. This is my usual complaint about managed services: they hide superuser from you, you don't have full access.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="protection-before-sharding">Protection Before Sharding<a href="https://multigres.com/blog/postgres-fm-interview#protection-before-sharding" class="hash-link" aria-label="Direct link to Protection Before Sharding" title="Direct link to Protection Before Sharding">​</a></h4>
<p>It's hard to troubleshoot problems. In this case, if problems are solved with this, and this gives you a new way to run Postgres - if many problems are solved, it's great, right? If you want to solve this.</p>
<p>Yeah, you may not know, but the initial focus of Vitess was actually solving these problems first. Sharding actually came much later. Protecting the database, making sure that it survives abusive queries - basically, that's what we built Vitess for initially. And the counterpart of taking away power from the user, like you said, is: one, well, we now know exactly how to make sure that the cluster doesn't go down. And two, we countered that by building really, really good metrics. So when there is an outage, you can very quickly zero in on a query. If a query was responsible, Vitess will have it at the top of the list.</p>
<p>Saying: this is the query that's killing your database. So we built some really, really good metrics, which should become available in Multigres, probably from day one. That's interesting. Maybe I missed it - I didn't see it in the README you're writing right now in the project. There's a last section called observability. I missed that. We're actually building something there as well. I'm very curious; I will definitely revisit this. Interesting. So yeah, great. And yeah, I feel like this is quite a big difference, at least with Citus, in terms of the philosophy, or at least the origin story. I feel like that started...</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="observability-built-in">Observability Built-in<a href="https://multigres.com/blog/postgres-fm-interview#observability-built-in" class="hash-link" aria-label="Direct link to Observability Built-in" title="Direct link to Observability Built-in">​</a></h4>
<p>Added much more in terms of OLAP-focused features - distributed queries parallelized across multiple shards, aggregations, columnar, and loads of things that really benefit OLAP workloads - whereas this has come from a philosophy of: let's not worry about optimizing for those cross-shard queries; this is much more: let's optimize for the single-shard, very short, quick OLTP queries, and let's make sure we protect against abusive queries. So it feels like the architecture is coming from a very different place of what to optimize for first. And historically, that was YouTube's problem: surviving the onslaught of a huge number of QPS, and making sure that one single query doesn't take, you know, the rest of the site down. Yeah, perfect, it makes loads of sense. So actually, before we move on too much from that: where do you see sharding as becoming necessary? Is it just a case of a total number of QPS, or, like, writes per second? We've talked about sharding in the past and talked about kind of a max that you can scale up to, perhaps in terms of writes, in terms of WAL - the WAL per second, I think, was the metric we ended up discussing. Are there other reasons, or kinds of bottlenecks, that you see people getting to that then make sharding...</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="oltp-vs-olap-focus">OLTP vs OLAP Focus<a href="https://multigres.com/blog/postgres-fm-interview#oltp-vs-olap-focus" class="hash-link" aria-label="Direct link to OLTP vs OLAP Focus" title="Direct link to OLTP vs OLAP Focus">​</a></h4>
<p>What makes sense as the point in time when you should be considering this? Well, there is a physical limiting factor, which is: if you max out your single machine, that is your Postgres server, then that's the end of your scale. There is nothing more to do beyond that. And there are a lot of people already hitting those limits, from what I hear. And the sad part of it is they probably don't realize it. As soon as that limit is hit, in order to protect the database, they actually push back on engineering features. Indirectly, saying, you know: this data, can you make it smaller, can you somehow lower the QPS, or could you put it elsewhere?</p>
<p>Let's stop showing this number on the front page, and so on.</p>
<p>Yeah, and it affects the entire organization. It's a very subtle change, but the entire organization slows down. Like, we experienced that at YouTube: when we were at our limits, the default answer from a DBA was always no. We even used to kid: the answer is no, what's your question?</p>
<p>And when we started sharding, it took us a while to change our answer to say that, you know, bring your data, we can scale as much as you want. Believe it or not, we went from 16 shards to 256 in no time. And the number of features in YouTube exploded during that time, because there was just no restriction on how much data you wanted to put. And coming back here: reaching the upper limit of a machine is actually something you should never do. It's very unhealthy for a large number of reasons.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="youtubes-scale-lessons">YouTube's Scale Lessons<a href="https://multigres.com/blog/postgres-fm-interview#youtubes-scale-lessons" class="hash-link" aria-label="Direct link to YouTube's Scale Lessons" title="Direct link to YouTube's Scale Lessons">​</a></h4>
<p>Like, even if there is a crash, how long is it going to take to recover? Like, the thing that we found out is: once you can shard, it actually makes sense to keep your instances way, way small. So, we used to run like 20 to 50 instances of MySQL per machine, and that was a lot healthier than running big ones. For a couple of reasons. One is, if you try to run so many threads within a process, that itself is a huge overhead for the machine. And it doesn't do that very efficiently, whereas it does better if you run it as smaller instances. I think it's more of a feeling, but I don't know if there is proof or whatever. But, like, Go, for example, wouldn't do well: Go, I think, beyond a certain memory size, or beyond a certain number of goroutines, would start to slow down, would not be as efficient as it was before, mainly because the data structures to keep track of those threads and stuff are growing bigger. But more importantly, on an outage, a smaller number of users are affected. If you have 256 shards and one shard goes down, it is 1/256th of the outage. And so, the site looks a lot healthier, behaves a lot healthier, there's less panic if a shard goes down. So, people are, you know, a lot less stressed managing such instances. Right - I wanted to mention that this discussion was with Lev Kokotov.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="when-to-start-sharding">When to Start Sharding<a href="https://multigres.com/blog/postgres-fm-interview#when-to-start-sharding" class="hash-link" aria-label="Direct link to When to Start Sharding" title="Direct link to When to Start Sharding">​</a></h4>
<p>In previous conversations - competitors, new sharding solutions written in Rust - we discussed that there is a big limitation in Postgres: physical replication has a limitation, because it's a single-threaded process on the standby. If we reach something like 150, 200, 250 megabytes per second, depending on the core, and also on the structure of the writes, we hit one single CPU - 100%, one process - and it becomes bottlenecked, and replication standbys start lagging. It's a big nightmare, because usually at that scale you have multiple replicas and you offload a lot of read-only queries there, and then you don't know what to do, except, as you describe, "let's remove this feature" and slow down development, and this is not fun at all. So what I'm trying to do here is move us to a discussion of replication - not physical, but logical. I noticed your plans involve heavy use of logical replication in Postgres, and we know it's improving every year. When we started the discussion 5, 6 years ago, it was much worse; right now it's much better - many things are solved, improved - but many things still are not solved. For example, schema changes are not replicated. And sequences - there is work in progress, but if it's committed, it will be only in Postgres 19, not in 18, so it means, like, a long wait for many people. So, what are your plans here - are you ready to deal with problems like this? In Postgres - pure Postgres problems, you know? Yeah! Yes! Yes! Ah, you ask me everything!</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="small-instances-win">Small Instances Win<a href="https://multigres.com/blog/postgres-fm-interview#small-instances-win" class="hash-link" aria-label="Direct link to Small Instances Win" title="Direct link to Small Instances Win">​</a></h4>
<p>I think the Postgres problems are fewer than what we faced with MySQL. I wanted to involve physical as well, because of this great talk by Kukushkin, which describes very bad anomalies where data loss happens and so on.</p>
<p>Yeah, let's talk about this. Yeah, we should talk about both. I think overall the Postgres design is cleaner, is what I would say. Like, you can feel that from things. Like, the design somewhat supersedes performance, which I think, in my case, is a good trade-off, especially for sharded solutions, because some of these design decisions affect you only if you are maxing out. If you are pushing it really, really hard, then these design decisions affect you, but if your instances are small to medium size, you won't even know, and then you benefit from the fact that these designs are good. So I actually like the approaches that Postgres has taken with respect to the WAL as well as logical replication. And by the way, I think logical replication can theoretically do better things than what it does now, and we should push those limits.</p>
<p>But yes, I think the issue of schema changes not being part of logical replication feels like a theoretically solvable problem, except that people haven't gotten to it. I think there are issues about the transactionality of DDLs, which doesn't even exist in MySQL, so at least in Postgres it exists in most cases; there are only a few cases where...</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="physical-replication-limits">Physical Replication Limits<a href="https://multigres.com/blog/postgres-fm-interview#physical-replication-limits" class="hash-link" aria-label="Direct link to Physical Replication Limits" title="Direct link to Physical Replication Limits">​</a></h4>
<p>We don't want you to get the wrong impression: we'll let you do it non-transactionally, and we know that it's non-transactional, and therefore we can do something about it. Those abilities didn't exist previously. But eventually, if it becomes transactional, then we can actually include it in a transaction.</p>
<p>Yeah, just for those who are curious, because there is, like, the concept of transactional DDL in Postgres. Actually, here we're talking about things like CREATE INDEX CONCURRENTLY, because we had a discussion offline about this before recording. So yeah, CREATE INDEX CONCURRENTLY can be an issue, but you obviously have a solution for it. That's great. The way I would say it is: we have dealt with much worse in MySQL, so this is much better than what was there then. Sounds good.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="logical-replication-plans">Logical Replication Plans<a href="https://multigres.com/blog/postgres-fm-interview#logical-replication-plans" class="hash-link" aria-label="Direct link to Logical Replication Plans" title="Direct link to Logical Replication Plans">​</a></h4>
<p>There is an interesting talk by Kukushkin - he presented it recently at a conference, by Microsoft - describing that synchronous replication in Postgres is not what you think. Correct. Correct. What are you going to do about this? Well, I was just chatting with someone. Essentially, synchronous replication is theoretically impure when it comes to consensus - I think it's provable. If you use synchronous replication, then you will hit corner cases that you can't handle. And the most egregious situation is that it can definitely lead to some level of split brain. But in some cases, it can even lead to downstream issues, because it's a leaky implementation. There are situations where you can see a transaction and think that it is committed. Later, the system may fail, and in the recovery, you may choose not to propagate that transaction, or may not be able to, and it's going to discard that transaction and move forward. But this is the same as with asynchronous replication, it's the same.</p>
<p>We're just losing some data, right?</p>
<p>It is the same as asynchronous replication. It's just data loss. It's data loss.</p>
<p>Correct. It's data loss. But, for example, if you're running logical replication off one of those, then that logical replication may actually propagate it into an external system. And now you have a corrupted downstream system.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="schema-change-handling">Schema Change Handling<a href="https://multigres.com/blog/postgres-fm-interview#schema-change-handling" class="hash-link" aria-label="Direct link to Schema Change Handling" title="Direct link to Schema Change Handling">​</a></h4>
<p>So, those risks exist, and at Vitess scale, people see this all the time, for example, and they have to build defenses against this, and it's very, very painful. It's not impossible, but it's very hard to reason about failures when a system is behaving like this. So, that is the problem with synchronous replication. And this is the reason why I feel like it may be worth patching Postgres: because there is no existing primitive in Postgres on which you can build a clean consensus system. I feel like that primitive should be in Postgres.</p>
<p>I now remember from Kukushkin's talk, there is another case when on primary transaction looks like not committed, because we wait replica, but replica somehow lost connection or something, and when we suddenly, and client thinks it's not committed, because the commit was not returned, but then it suddenly looks committed. It's like not data loss, it's data un-loss somehow, suddenly, and this is not all right as well. And when you think about consensus, I think it's a very good describing these things, like concept and distributed systems, it feels like if you have two places to ride, definitely there will be corner cases where...</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="sync-replication-problems">Sync Replication Problems<a href="https://multigres.com/blog/postgres-fm-interview#sync-replication-problems" class="hash-link" aria-label="Direct link to Sync Replication Problems" title="Direct link to Sync Replication Problems">​</a></h4>
<p>Things go off the rails if you don't use two-phase commit, right? And here we have this. But when you say you're going to bring something with consensus, it immediately triggers my memory of how difficult it is, and how many attempts were made to bring built-in HA to Postgres, just to have auto-failover. All of them failed, all of them, and it was left outside of Postgres. So here, maybe it will be of similar complexity to bring this inside Postgres. Is it possible to build this thing outside? It is not possible to build it outside, because if it were, that is what I would have proposed. The reason is that building it outside is like putting a bandaid over the problem. It will not solve the core problem. The core problem is that you've committed data in one place, that data can be lost, and there is a gap when the data can be read by someone. That root cause is unsolvable from outside: later, Raft may choose to honor that transaction or not, and that becomes ambiguous, and we don't want ambiguity.</p>
<p>What if we create some extension to commit, make it extensible to talk to some external thing to decide whether the commit can be finalized, or something? I don't know what consensus will bring. Correct, correct. So essentially, if you reason about this, your answer will become a two-phase system.</p>
<p>Yeah, without a two-phase system. But as I told you, two-phase commit in the OLTP world, the Postgres OLTP world, is considered something to dread.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="data-loss-scenarios">Data Loss Scenarios<a href="https://multigres.com/blog/postgres-fm-interview#data-loss-scenarios" class="hash-link" aria-label="Direct link to Data Loss Scenarios" title="Direct link to Data Loss Scenarios">​</a></h4>
<p>It's really slow, and the rule is usually: just avoid it. I see your enthusiasm, but I couldn't find good benchmarks; zero published. This is not two-phase commit, by the way. This is two-phase synchronization. I understand, it's not two-phase commit, but more communication happens. I understand. So, with two-phase synchronization, the network overhead is exactly the same as full sync, because the transaction completes on the first sync. Later, it sends an acknowledgement saying that, yes, I'm happy, you can commit it. But the transaction completes on the first sync. So, it will be no worse than full sync.</p>
<p>Yeah, compared to the current situation, when the primary commit happens but there is a lock which is being held until... It is the same cost. We wait for the standby. And to the user it looks like the lock is released; it thinks, okay, the commit happened. But the problem with this design is that if, for example, Postgres restarts, the lock is automatically released and the commit is there, and it's unexpected. This is data un-loss, right? So, you are saying we can redesign this, the network cost will be the same, but it will be pure.</p>
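<p>To make the two-phase synchronization idea concrete, here is a minimal Go sketch of the flow as described above. This is an illustration only, with hypothetical names; it is not Multigres code, and the real design would live inside the WAL shipping path.</p>
<pre><code class="language-go">// Sketch of two-phase synchronization as described in the conversation.
// Phase 1 makes the transaction durable on a standby before the commit
// completes; phase 2 is a lagging acknowledgement that finalizes it.
package sync2pc

import "fmt"

// Standby is a hypothetical interface to one replica.
type Standby interface {
	AppendWAL(rec string) error // ship the WAL record, wait for the ack
	Finalize(rec string) error  // "you may expose this commit" message
}

func commit(rec string, standby Standby) error {
	// Phase 1: the commit completes after exactly one network round
	// trip, the same cost as plain synchronous replication.
	if err := standby.AppendWAL(rec); err != nil {
		return fmt.Errorf("not durable, refusing to commit: %w", err)
	}
	// Phase 2: sent after the commit has already completed, so it adds
	// no user-visible latency. Until it arrives, the standby has the
	// record but does not yet treat it as committed.
	go func() { _ = standby.Finalize(rec) }()
	return nil
}
</code></pre>
<p>The point of the second message, as I read the discussion, is that a recovering node can tell "received but unresolved" records apart from finalized ones, which is what removes the ambiguity described earlier.</p>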
<p>Yeah, that's great. I like this. I'm just thinking, will it be acceptable? Because bringing auto-failover in was not acceptable. There was another attempt last year from someone, with great enthusiasm: let's bring auto-failover inside Postgres. Actually, maybe you know this guy: it was Konstantin Osipov, who built the Tarantool database system. It's an in-memory database. He's ex-MySQL; at MySQL, Osipov worked on performance.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="two-phase-sync-solution">Two-Phase Sync Solution<a href="https://multigres.com/blog/postgres-fm-interview#two-phase-sync-solution" class="hash-link" aria-label="Direct link to Two-Phase Sync Solution" title="Direct link to Two-Phase Sync Solution">​</a></h4>
<p>Let's build this, great enthusiasm, but it's extremely hard to convince people that such a big thing should be in core. So, are you saying it's not a big thing already? I'll probably have to explain it in a bigger blog post. But essentially, now that I've studied the problem well enough: the reason why it's hard to implement consensus in Postgres with the WAL is that people are trying to make Raft work with the WAL, and there are limitations in how commits work within Postgres that mismatch with how Raft wants commits to be processed, and so far I have not found a way to work around that mismatch. But a variation of Raft can be made to work. Interesting. I don't know if you know about the blog series that I wrote when I was at PlanetScale; it's an eight-part blog series about generalized consensus. People think that Raft is the only way to do consensus, but it is one of a thousand ways to do consensus. So, that blog series explains the rules you must follow if you have to build a consensus system.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="beyond-raft-consensus">Beyond Raft Consensus<a href="https://multigres.com/blog/postgres-fm-interview#beyond-raft-consensus" class="hash-link" aria-label="Direct link to Beyond Raft Consensus" title="Direct link to Beyond Raft Consensus">​</a></h4>
<p>If you follow those rules, you will get all the properties that are required of a consensus system. So, the design that I have in mind follows those rules, and I am able to prove to myself that it will work, but it's not Raft. It's going to be similar to Raft. I think we could make Raft itself work too, but that may require changes to the WAL, which I don't want to do. So, this system I want to implement without changes to the WAL, possibly as a plugin.</p>
<p>Well, now I understand another reason why you cannot take Patroni: not only because it's Python, but also because you need another version of the consensus algorithm. Correct, correct. And among those hundreds of thousands, millions of ways. By the way, Patroni could take this and use it, because it's very close to how the full-sync setup works.</p>
<p>I was just thinking, watching Alexander Kukushkin's talk, he said a couple of things that were interesting. One is that he was surprised this hasn't happened upstream. So you definitely have an ally in Kukushkin in terms of trying to get this upstreamed. He also thinks every cloud provider has had to patch Postgres in order to offer their own high-availability products. Each one has had to patch it, and they have to maintain it; you mentioned earlier today how painful it is to maintain even a small patch on something. I don't think it's every one; I think it's Microsoft for sure, knowing where Kukushkin works. But maybe more, not everybody. All I mean is that there is a growing number of committers working for hyperscalers and hosting providers. So I suspect you might have more optimism for consensus, or at least a few allies in terms of getting something committed upstream. I personally think there might be a growing chance of this happening, even if it hasn't in the past for some reason.</p>
<p>Yeah, I feel like also being new to the Postgres community, I am feeling a little shy about proposing this.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="flexpaxos-introduction">FlexPaxos Introduction<a href="https://multigres.com/blog/postgres-fm-interview#flexpaxos-introduction" class="hash-link" aria-label="Direct link to FlexPaxos Introduction" title="Direct link to FlexPaxos Introduction">​</a></h4>
<p>So, what I am thinking of doing is at least show it working. Have people gain confidence that, no, this is actually efficient and performant and safe. Synchronous replication is actually very hard to configure if your needs are different, which is something FlexPaxos does handle. It's actually something I'm a co-inventor of, of some sort.</p>
<p>And this blog post... I don't know the name, that's it. Can you explain it? It would be super interesting.</p>
<p>Oh sure, yeah, so actually let me explain the reason why. FlexPaxos was published a few years ago, about seven years ago or so, and you'll see my name mentioned there, which I feel very proud of. And the blog series that I wrote is actually a refinement of FlexPaxos, and it explains better why these things are important. The reason why it's important is that people think of consensus as a bunch of nodes agreeing on a value; that's what you commonly hear. Or you think reaching a majority, reaching a quorum, is what's important. But the true reason for consensus is just durability. When you ask for a commit and the system says, yes, I have it, you don't want the system to lose it. So instead of defining quorums and all those things, define the problem as it is and solve it the way it was asked: how do you solve the problem of durability in a transactional system?</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="durability-over-quorums">Durability Over Quorums<a href="https://multigres.com/blog/postgres-fm-interview#durability-over-quorums" class="hash-link" aria-label="Direct link to Durability Over Quorums" title="Direct link to Durability Over Quorums">​</a></h4>
<p>The simple answer to that is: make sure your data is elsewhere. If there is a failure, your challenge is to find out where the data is and continue from there. That is all that consensus is about. Then all you have to do is have rules to make sure that these properties are preserved. Raft is just one way to do this. If you approach the problem this way, you could ask for something like: I just want my data to go across availability zones. As long as it's in a different availability zone, I'm happy. Or you can say, I want the data to be across regions. Or, I want at least two other nodes to have it. So that's your durability requirement. But you could say, I want two other nodes to have it, and I want to run seven nodes in the system, or 20 nodes.</p>
<p>It sounds outrageous, but it was actually very practical at YouTube. We had 70 replicas, but the data had to be in only one other node for it to be durable. And we were able to run this at scale. The trade-off is that when you do a failover, you have a wild goose chase looking for the transaction that went elsewhere. But you find it and then you continue. So that is basically the principle of this consensus system, and that's what I want to bring to Multigres, while making sure that the same primitives also work for people who want simpler majority-based quorums. Just quickly to clarify, when you say the wild goose chase, is it the same transaction?</p>
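<p>Expressed as code, the idea is that durability becomes a pluggable rule over acknowledgements rather than a fixed majority. A minimal sketch with hypothetical types follows; Vitess has a similar notion of configurable durability policies, but this is not its actual code:</p>
<pre><code class="language-go">// Sketch: durability as a pluggable rule over acknowledgements, rather
// than a fixed majority quorum. All types here are hypothetical.
package durability

// Ack is one node's confirmation that it has persisted the transaction.
type Ack struct {
	NodeID string
	Zone   string // availability zone (or region)
}

// Policy reports whether a set of acks satisfies the durability rule.
type Policy interface {
	Durable(primaryZone string, acks []Ack) bool
}

// CrossZone: "the data must reach at least one node in a different AZ".
type CrossZone struct{}

func (CrossZone) Durable(primaryZone string, acks []Ack) bool {
	for _, a := range acks {
		if a.Zone != primaryZone {
			return true
		}
	}
	return false
}

// AtLeastN: "any N other nodes must have it", independent of cluster
// size; this is how one ack can be enough even with 70 replicas.
type AtLeastN struct{ N int }

func (p AtLeastN) Durable(_ string, acks []Ack) bool {
	return len(acks) >= p.N
}
</code></pre>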
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="wild-goose-chase-recovery">Wild Goose Chase Recovery<a href="https://multigres.com/blog/postgres-fm-interview#wild-goose-chase-recovery" class="hash-link" aria-label="Direct link to Wild Goose Chase Recovery" title="Direct link to Wild Goose Chase Recovery">​</a></h4>
<p>It's the same one, but you have to know which one that is. There was a time when we found that transaction in a different country, so we had to bring it back home and then continue. It happened once in the whatever, ten years that we ran it. It's interesting that, talking about sharding, we need to discuss these things, which are not sharding per se, right? It's about the replication chain inside each shard, right? It's actually what I would call healthy database principles, which is, I think, somewhat more important than sharding. It is true that it has to do with it being a distributed system, and that is because it's sharded, right? I think they are orthogonal.</p>
<p>Yeah, I think with sharding, you can do sharding on anything, right? You can do sharding on RDS. Somebody asked me, what about Neon? I said, you can do sharding on Neon too, you put a proxy in front.</p>
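<p>To pin down the wild goose chase mechanically: it is the recovery procedure of asking every reachable node how far it got, and continuing from the most advanced one, wherever it is. A sketch with hypothetical types:</p>
<pre><code class="language-go">// Sketch: after a failure, find which node holds the most advanced
// WAL position and resume from there. Hypothetical types only.
package recovery

type Replica struct {
	Name string
	Pos  uint64 // last durable WAL position, e.g. an LSN
}

// MostAdvanced scans all reachable replicas, possibly a long chase
// across zones or regions, and returns the one to continue from.
func MostAdvanced(replicas []Replica) (best Replica, ok bool) {
	for _, r := range replicas {
		if !ok || r.Pos > best.Pos {
			best, ok = r, true
		}
	}
	return best, ok
}
</code></pre>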
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="distributed-system-reality">Distributed System Reality<a href="https://multigres.com/blog/postgres-fm-interview#distributed-system-reality" class="hash-link" aria-label="Direct link to Distributed System Reality" title="Direct link to Distributed System Reality">​</a></h4>
<p>The problem with sharding is that it is not just a proxy. That's what people think when they first approach the problem, because they haven't looked ahead. Once you have sharded, you have to evolve. You start with 4 shards, then you have to go to 8 shards. At some point, your sharding scheme itself will not scale. For example, if you are in a multi-tenant workload and you say: shard by tenant. At some point, a single tenant is going to be so big that they won't fit in an instance, and that we have seen. At that time, you have to change the sharding scheme.</p>
<p>So how do you change the sharding scheme?</p>
<p>Slack had to go through this: they had a tenant-based sharding scheme, and a single tenant just became too big. They couldn't even fit one tenant in one shard. So they had to change their sharding scheme to be user-based. They actually talk about it in one of their presentations. And Vitess has the tools to do these changes without you incurring any kind of downtime, which, again, Multigres will have.</p>
<p>I keep talking about Vitess, but these are all things that Multigres will have, which means that you're future-proofed when it comes to this. And these are extremely difficult problems to solve, because when you are talking about changing the sharding scheme, you are basically looking at a full crisscross replication of data. And across data centers.</p>
<p>Yeah, and also I know about Vitess version 3, right? It was when you...</p>
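<p>To see why this is a full data migration, note that a sharding scheme is ultimately just a function from a key to a shard; change the key and almost every row can land on a different shard. A hedged sketch, hypothetical code rather than Vitess's actual vindex machinery:</p>
<pre><code class="language-go">// Sketch: a sharding scheme is just a function from a key to a shard.
// Changing the scheme (tenant-based to user-based) changes that
// function, which is why the data must be re-spread across shards.
package sharding

import "hash/fnv"

func shardOf(key string, shardCount uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % shardCount
}

// Before: all rows of a tenant land on one shard, so a huge tenant
// can outgrow its shard.
func byTenant(tenantID string, shards uint32) uint32 {
	return shardOf(tenantID, shards)
}

// After: rows spread by user, so no single tenant overflows a shard.
func byUser(userID string, shards uint32) uint32 {
	return shardOf(userID, shards)
}
</code></pre>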
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="query-planner-decisions">Query Planner Decisions<a href="https://multigres.com/blog/postgres-fm-interview#query-planner-decisions" class="hash-link" aria-label="Direct link to Query Planner Decisions" title="Direct link to Query Planner Decisions">​</a></h4>
<p>...you changed, basically created a new planner to deal with arbitrary queries and understand how to route them properly and where to execute them. Is it a single shard, or is it global, or is it different shards, and so on? Are you going to do the same with Postgres? I think yes, right?</p>
<p>So that's the part I'm still on the fence about.</p>
<p>By the way, V3 has now become Gen 4; it's actually much better than what it was when I built it.</p>
<p>The problem with V3 is that it still doesn't support the full query set yet. It supports, I would say, 90% of it, but not everything.</p>
<p>On the other side, there's the Postgres engine, which supports everything.</p>
<p>So I'm still debating how do we bring the two together?</p>
<p>If it were possible to do in a simple git merge, I would do it. But obviously this one is in C, and that was in Go.</p>
<p>And the part that I'm trying to figure out is how much of the sharding bias exists in the current engine in Vitess.</p>
<p>If we brought the Postgres engine as is, without the sharding bias, would this engine work well for a sharded system?</p>
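<p>For context, the core decision such a planner makes looks roughly like this. It is a simplified sketch with hypothetical types, not the Vitess or Multigres planner:</p>
<pre><code class="language-go">// Sketch: the core routing decision for a sharded planner. If the
// query pins the sharding key to one value, send it to one shard;
// otherwise scatter it to all shards and merge. Hypothetical types.
package planner

import "hash/fnv"

type Query struct {
	Table    string
	EqFilter map[string]string // column to literal, from the WHERE clause
}

type Route struct {
	Shards []uint32 // which shards execute the query
}

func shardOf(key string, shardCount uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % shardCount
}

func plan(q Query, shardKey string, shardCount uint32) Route {
	if v, found := q.EqFilter[shardKey]; found {
		// Sharding-key equality: a single-shard query.
		return Route{Shards: []uint32{shardOf(v, shardCount)}}
	}
	// No sharding-key predicate: scatter to every shard.
	all := make([]uint32, shardCount)
	for i := range all {
		all[i] = uint32(i)
	}
	return Route{Shards: all}
}
</code></pre>
<p>The "sharding bias" question is essentially about everything beyond this: scatter-gather, cross-shard joins and aggregates, and whether an engine built for a single node makes the right choices when a table is spread across many.</p>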
<p>So it looks like Postgres without the storage side, if you bring the whole Postgres.</p>
<p>There's a library, libpg_query, by Lukas Fittl, which is...</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="parser-compatibility">Parser Compatibility<a href="https://multigres.com/blog/postgres-fm-interview#parser-compatibility" class="hash-link" aria-label="Direct link to Parser Compatibility" title="Direct link to Parser Compatibility">​</a></h4>
<p>It takes the parser part of Postgres and brings it over, and there is a Go version of it as well. So, I mean, you can build on top of it.</p>
<p>Yes, it's also used in the latest version.</p>
<p>Is it like 100% Postgres-compatible?</p>
<p>Well, it's based on the Postgres source code. So the parser is fully brought over, but it's not the whole of Postgres.</p>
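<p>For reference, this is the libpg_query family; the Go wrapper is pg_query_go from pganalyze. Usage looks roughly like this, though the module version and path may differ:</p>
<pre><code class="language-go">// Parse a statement with the Go wrapper around Postgres's own parser.
package main

import (
	"fmt"

	pg_query "github.com/pganalyze/pg_query_go/v6"
)

func main() {
	// Returns the Postgres parse tree serialized as JSON.
	tree, err := pg_query.ParseToJSON("SELECT id FROM users WHERE email = 'a@example.com'")
	if err != nil {
		panic(err)
	}
	fmt.Println(tree)
}
</code></pre>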
<p>So, maybe you should consider this.</p>
<p>If you're thinking about parsing, I mean, queries and so on. But I'm very curious.</p>
<p>I also noticed you mentioned routing, as in read-only queries routed to replicas automatically, and this concerns me a lot, because many Postgres developers... I mean, who uses it?</p>
<p>Users. They use PL/pgSQL functions, PL/Python functions and so on, which write data, and the standard way to call a function is SELECT.</p>
<p>SELECT function_name().</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="function-routing-challenge">Function Routing Challenge<a href="https://multigres.com/blog/postgres-fm-interview#function-routing-challenge" class="hash-link" aria-label="Direct link to Function Routing Challenge" title="Direct link to Function Routing Challenge">​</a></h4>
<p>So, understanding that this function is actually writing data is not trivial, right? And we know pgpool, which all my life I just avoided; I touched it a few times and decided not to use it at all, because it tries to do a lot of stuff at once, and I always considered, like, no, I'm not going to use this tool. So, pgpool solves it by saying, okay, let's build a list of functions which actually write, or something like this. So, it's a patch approach, you know, a workaround approach. So, this is going to be a huge challenge, I think, for automatic routing. It's a huge challenge.</p>
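<p>The list-based workaround boils down to something like the following hedged sketch; the hard part is not the lookup but keeping the list complete and correct, which is why it feels like a patch:</p>
<pre><code class="language-go">// Sketch of function-list routing: a SELECT is only safe to send to a
// replica if none of the functions it calls is known (or suspected)
// to write. Hypothetical helper, not pgpool's actual implementation.
package routing

var writeFunctions = map[string]bool{
	"process_payment": true,
	"log_event":       true,
	"nextval":         true, // even built-ins can write
}

// routeToReplica decides whether a SELECT that calls the given
// functions can be served by a read replica.
func routeToReplica(calledFunctions []string) bool {
	for _, f := range calledFunctions {
		if writeFunctions[f] {
			return false // must go to the primary
		}
	}
	return true
}
</code></pre>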
<p>Yeah, I think this is the reason why it is important to have the full Postgres functional engine in Multigres, because then these things will work as intended, is my hope. What we will have to do is add our own sharding understanding to these functions and figure out: what does it mean to call this function, right? If this function is going to call out to a different shard, then that interpretation has to happen at the higher level. But if that function is going to be accessing something within the shard, then push the whole SELECT along with the function down and let the individual Postgres instance do it. Yeah, but how do you understand it? A function can contain another function, and so on; it can be so complex in some cases. Yeah. It's also funny, but there is still... Google Cloud SQL actually supports it... a kind of language, called PL/Proxy, which is sharding for those who have their workload only in functions. It can route to...</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="select-function-writes">Select Function Writes<a href="https://multigres.com/blog/postgres-fm-interview#select-function-writes" class="hash-link" aria-label="Direct link to Select Function Writes" title="Direct link to Select Function Writes">​</a></h4>
<p>Welcome to PL/pgSQL. It still exists, but it's not super popular these days. But there is a big requirement to write everything in functions. In your case, I half expect that at some point you would say: okay, don't use functions. But I'm afraid that's not possible; people love functions. Actually, Supabase loves functions, because they use PostgREST, right? PostgREST kind of provokes you to use functions. Oh, really? Oh, yeah, yeah. Actually, I saw that, yeah. But in Vitess, I feel like this was a mistake that we made: if we felt that any functionality you used didn't make sense (if I were you, I wouldn't do this, right, because it won't scale, it's a bad idea), those things we didn't support. We didn't want to support them. We said, no, we'll never do this for you, because, you know, we'll not give you a rope long enough to hang yourself. Basically, that was our philosophy. But in Multigres, we want to move away from that, which means that if you want to call a function that writes, have at it. Just put a comment saying it's going to write something. If you want a function that calls a function that writes, have at it. The worst-case scenario for us is that we don't know how to optimize it, and what we'll do then is execute the whole thing on the coordinator.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="aurora-global-inspiration">Aurora Global Inspiration<a href="https://multigres.com/blog/postgres-fm-interview#aurora-global-inspiration" class="hash-link" aria-label="Direct link to Aurora Global Inspiration" title="Direct link to Aurora Global Inspiration">​</a></h4>
<p>There is another interesting solution in AWS, RDS Proxy which, as far as I know (maybe I'm wrong), they needed when they created the global offering, I think Aurora Global Database or something like this. So there is a secondary cluster living in a different region, and it's purely read-only, but it accepts writes. When a write comes, the proxy routes it to the original primary, waits until the write is propagated back to the replica, and responds.</p>
<p>Oh wow! I don't think that feature can be supported.</p>
<p>No, it's just some exotic, interesting solution I wanted to share. Maybe, you know, if, for example, you originally route a write to a replica, then somehow in Postgres you understand: oh, it's actually a write.</p>
<p>Yeah, so maybe 100% is theoretically impossible to support.</p>
<p>Yes, it's super exotic, okay. But I think if people are doing things like that, it means that they are trying to solve a problem that doesn't have a good existing solution.</p>
<p>Exactly. So if we can find a good existing solution I think they'll be very happy to adopt that instead of whatever they were trying to do.</p>
<p>Well, this is just a multi-region setup, and I saw more than one CTO who wanted it. Dealing with Postgres, they say: we are still single-region, but we need to be present in multiple regions in case one AWS region goes down.</p>
<p>Right, it's about availability.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="cross-shard-transactions">Cross-Shard Transactions<a href="https://multigres.com/blog/postgres-fm-interview#cross-shard-transactions" class="hash-link" aria-label="Direct link to Cross-Shard Transactions" title="Direct link to Cross-Shard Transactions">​</a></h4>
<p>Yeah, so availability and the business characteristics. So, yeah. Anyway. Okay. Yeah, it's exotic, but interesting still. Yeah. So you've got a lot of work ahead of you, Sugu. I feel like we barely covered one of so many topics. Let's touch something else, maybe something very little. It's a long episode, but it's worth it, I think. It's super interesting. What else? I think the other interesting one would be 2PC and isolation. Hmm. Isolation from what? Like, the one issue with a sharded solution is that, again, this was a philosophy for the longest time in Vitess: we didn't allow 2PC. We said: you shard in such a way that you do not have distributed transactions. And many people lived with that. And some people... actually, let me interrupt you here, because this is the best feature I liked about Vitess. It's this materialize feature, when data is brought... Oh, yeah, materialize is another topic. That's actually a better topic than 2PC. Well, yeah, because this is your strength, right? So, I love this idea. Basically, the distributed materialized view, which is...</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="materialized-views-magic">Materialized Views Magic<a href="https://multigres.com/blog/postgres-fm-interview#materialized-views-magic" class="hash-link" aria-label="Direct link to Materialized Views Magic" title="Direct link to Materialized Views Magic">​</a></h4>
<p>Incremental update...</p>
<p>That's great. We need it in the Postgres ecosystem, maybe as a separate project; we lack it everywhere. So yeah, this is how you avoid distributed transactions, basically, right?</p>
<p>No, this is one way to avoid it. There are two use cases where materialized views are super awesome. You know the table that has multiple foreign keys, foreign keys to two different tables? The classic use case, the example I gave, was a user that's producing music and listeners that are listening to it, which means that the row "I listened to this music" has two foreign keys: one to the creator and one to the listener. And where should this row live? Should it live with the creator, or should it live with the listener? That is a classic problem, and there is no perfect solution for it; it depends on your traffic pattern. And if the traffic pattern is one way in one case and another way in another case, there is no perfect solution.</p>
<p>So, this is where, in Multigres, what you could do is say: okay, in most cases this row should live with the creator. Let's assume that, right?</p>
<p>So, then you say this row lives with the creator and we shard it this way, which means that if you join the creator table with this event row, it'll be all local joins. But if you join the listeners table with this event row, it's a huge cross-shard wild goose chase. So, in this case, you can say: materialize this table using a different foreign key, the listener's foreign key, into the same sharded database as a different table name. And now you can do a local join between the listener and this event table. And this materialized view is near real time; basically, the time it takes to read the WAL and apply it. And this can go on forever.</p>
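<p>Here is a sketch of the resulting layout, with hypothetical schema names: the same rows exist twice, each copy sharded by a different foreign key, and the second copy is maintained continuously from the WAL rather than by a periodic refresh:</p>
<pre><code class="language-go">// Sketch of the two-keys pattern described above. The source table is
// sharded by creator_id; a continuously maintained copy of the same
// rows is sharded by listener_id, so joins from either side stay
// shard-local. Hypothetical schema and names.
package example

const (
	// Source of truth, co-located with the creator's shard.
	listensByCreator = `
	    CREATE TABLE listens (
	        listen_id   bigint,
	        creator_id  bigint,  -- sharding key of this copy
	        listener_id bigint
	    )`

	// Materialized copy on the listener's shard, kept near real time
	// by tailing the WAL and applying changes.
	listensByListener = `
	    CREATE TABLE listens_by_listener (
	        listen_id   bigint,
	        creator_id  bigint,
	        listener_id bigint   -- sharding key of this copy
	    )`
)
</code></pre>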
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="reference-table-distribution">Reference Table Distribution<a href="https://multigres.com/blog/postgres-fm-interview#reference-table-distribution" class="hash-link" aria-label="Direct link to Reference Table Distribution" title="Direct link to Reference Table Distribution">​</a></h4>
<p>And this is actually also the secret behind re-sharding, behind changing the sharding key. This is essentially a table that has a real-time presence with two sharding keys. If you say, oh, at some point this one is more authoritative, all you have to do is swap it out: make one the source, the other the target, and you change your sharding key. Changing the sharding key actually works exactly like this for a table; it's built on this same materialization technique. And the other use case is when you re-shard, you leave behind smaller tables. Reference tables, we call them. And they have to live in a different database because they are too small; even if you shard them, they won't shard well. Like, if you have a billion rows in one table and 1000 rows in a smaller table, you don't want to shard your 1000-row table. And there is no benefit to sharding it either. So, it's better that that table lives in a separate database. But if you want to join between these two, how do you do it? The only way is to join at the application level, or read one and then read the other. And at high QPS, that's not efficient.</p>
<p>So, what we can do is actually materialize this table on all the shards as a reference. Yeah. And then all joins become local. Yeah. And you definitely need logical replication for all of this. So, this is where we started: the challenges with logical replication.</p>
<p>Yeah. Yeah. Great. So, the reason why 2PC is still important is that there are trade-offs to this solution, which is: there's a lag. It takes time for the things to go to the...</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="2pc-performance-reality">2PC Performance Reality<a href="https://multigres.com/blog/postgres-fm-interview#2pc-performance-reality" class="hash-link" aria-label="Direct link to 2PC Performance Reality" title="Direct link to 2PC Performance Reality">​</a></h4>
<p>2PC is essentially the transaction system itself trying to complete a transaction, which means that it will handle cases where there are race conditions, right? If somebody else tries to change that row elsewhere while this row is being changed, 2PC will block that from happening, whereas in the other case you cannot do that. For something like videos on YouTube, we can say: okay, there will be some lag, probably some small mistake, it's fine. But if it's financial data, it should be 2PC. But the latency of writes will be high, and throughput will be low, right?</p>
<p>This is... I actually want to... I read the design, which is, again, by the way, a very elegant API, and I can see the implementation behind the API, and I don't think we will see performance problems with 2PC. We need to benchmark it. We will benchmark it, but I would be very surprised.</p>
<p>I think there are some isolation issues that we may not have time to go through today, because it's a long topic. But the way 2PC is currently supported in Postgres, I think it will perform really well. The isolation issues when we sit at read committed and use 2PC, you mean this, right?</p>
<p>Not at repeatable read. At read committed, I think there will be some trade-offs, but not the kind that will affect most applications. MVCC will be the bigger challenge.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="isolation-trade-offs">Isolation Trade-offs<a href="https://multigres.com/blog/postgres-fm-interview#isolation-trade-offs" class="hash-link" aria-label="Direct link to Isolation Trade-offs" title="Direct link to Isolation Trade-offs">​</a></h4>
<p>The thing here is, most people don't use... like, the most common use case is read committed. Of course, since it's the default, yeah. So people won't even... I think they're already in a bad state there; it won't be worse. It won't be worse, yes.</p>
<p>Yes, yeah. As for 2PC, it of course depends a lot on the distance between nodes, right? Like, if they are far apart, we need to talk: the client is somewhere, the two nodes are somewhere, and if they're in different availability zones, it depends, right?</p>
<p>So this distance is, it's a big contributor to latency, right?</p>
<p>Network, because there are four communication messages that are needed. Correct, correct. Actually, I have the mathematics for it, but you're probably right: it's about double the number of round trips.</p>
<p>Yeah, if you put everything in one AZ, the client and both primaries, we are fine. But in reality they will be in different places, and if it's different regions, it's another story, of course. But at least... yeah. The 2PC is not done by the client, by the way; the 2PC would be done by the VTGate equivalent, which should be near the nodes.</p>
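<p>As a back-of-the-envelope illustration (my numbers, not from the episode): if the prepares and commits go out to the participants in parallel, the four messages collapse into two round trips, versus one round trip for a single-shard commit:</p>
<pre><code class="language-go">// Rough latency model for the "double the round trips" point.
// A same-AZ round trip is on the order of a millisecond; cross-region
// would be tens of milliseconds, which is why distance dominates.
package main

import "fmt"

func main() {
	rtt := 1.0 // ms per coordinator-to-primary round trip, same AZ

	singleShard := 1 * rtt // one round trip: send commit, get ack
	twoPC := 2 * rtt       // prepare round trip, then commit round trip

	fmt.Printf("single shard: %.1f ms, cross-shard 2PC: %.1f ms\n",
		singleShard, twoPC)
}
</code></pre>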
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="distance-matters">Distance Matters<a href="https://multigres.com/blog/postgres-fm-interview#distance-matters" class="hash-link" aria-label="Direct link to Distance Matters" title="Direct link to Distance Matters">​</a></h4>
<p>The availability zone is only for durability, at the replica level. But a 2PC is coordinating between two primaries, which may actually be on the same machine for all you care. Imagine a real practical case: every shard has a primary and a couple of standbys. So are you saying that we need to keep the primaries all in the same availability zone? That's usually how things are.</p>
<p>Interesting, I didn't know about this. I wanted to rant a little bit about the PlanetScale benchmarks from last week. They compare to everyone. I'm sorry, I will take a little bit of time. They compare to everyone, and they just publish "PlanetScale versus something". And this is a hot topic.</p>
<p>On the charts, we have PlanetScale in a single AZ, with the client and server in the same AZ, and a line for the normal case, where the client is in a different AZ. And the same-AZ line is active by default; the normal line is not active. And the others, like Neon, Supabase, everyone, it's different. And of course PlanetScale looks really good, because by default they presented numbers for the same availability zone. And below the chart everything is explained, but who reads it, right? People just see the graphs. And you can unselect and select the proper PlanetScale numbers and see that they are similar. But by default, it's the same-AZ numbers.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="local-disk-advantages">Local Disk Advantages<a href="https://multigres.com/blog/postgres-fm-interview#local-disk-advantages" class="hash-link" aria-label="Direct link to Local Disk Advantages" title="Direct link to Local Disk Advantages">​</a></h4>
<p>This is benchmarking, though. If you look at the architecture, even in a fair comparison PlanetScale should come out ahead; the performance of a local disk, of course, should win. But this was SELECT 1, and SELECT 1 is not a benchmark. Well, it was part of the benchmark; it's just checking the query path, but it fully depends on where the client and server are located.</p>
<p>So, what's the point of showing better numbers by just putting the client closer? I don't like that part of that benchmark. Also, I saw the publications, but I didn't go into the details, because it has to be faster, because it's on local disks. For data which doesn't fit fully in cache, of course, local disks are amazing. You're right; if the data is in cache, then the performance of everything would be the same.</p>
<p>Yeah, well, I wanted to share this; I didn't know about it. But I fully support the idea of local disks, it's great. I think we need to use them in more and more systems. I wouldn't be surprised if you reached out to PlanetScale... if you want to run your benchmarks, they may be willing to give you the...</p>
<p>It was all published, and in general the benchmarks look great, the idea is great. And actually, with local disks, the only concern is usually the hard limit: we cannot have more space. But if we have a sharded solution, there is no such limit. But speaking about the hard limit, with today's SSDs you can buy a 100-plus-terabyte SSD, a single SSD, and you can probably stack them up next to each other. I saw an AWS SSD over 100 terabytes. In Google Cloud, 72 terabytes is the hard limit, for Z3 metal, and I didn't see more. So 72 terabytes is a lot, but sometimes it's already... At that limit, your storage is not the bottleneck; you will not be able to run a database of that size on a single machine. Why not? We have such cases. CPU. Well, again, the problem will be replication. If we talk about a single node, we can...</p>
<p>Or replication. 360 cores in AWS, almost 1000 cores already for Xeon Scalable generation 5 or something. So hundreds of cores. Well, the problem is the Postgres design: if physical replication were multi-threaded, we could scale more. By the way, replication is not the only problem. Backup and recovery: if your machine goes down, you are down for hours.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="backup-recovery-speed">Backup Recovery Speed<a href="https://multigres.com/blog/postgres-fm-interview#backup-recovery-speed" class="hash-link" aria-label="Direct link to Backup Recovery Speed" title="Direct link to Backup Recovery Speed">​</a></h4>
<p>[Discussion about backup and restore speeds] I'm always saying that one terabyte per hour is what you should achieve for restore; if it's below that, it's bad. Now I think one terabyte per hour is already not enough. Yes, yes. So with the best EBS volumes, we managed to achieve, I think, seven terabytes per hour for restore with WAL-G. And that's great. The greatest danger there is that you could become a noisy neighbor. So we actually built throttling into our restore, just to avoid being noisy neighbors. With local disks, you lose the ability to use EBS snapshots, cloud disk snapshots. Correct, correct. That's what you lose, unfortunately. And they're great, and people enjoy them more and more.</p>
<p>Yeah. So I agree. And, I just remember, for 17 terabytes it was 128 threads of WAL-G or pgBackRest, I don't remember which. Wow. But with local disks and S3, I need to update my knowledge.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="edge-case-problems">Edge Case Problems<a href="https://multigres.com/blog/postgres-fm-interview#edge-case-problems" class="hash-link" aria-label="Direct link to Edge Case Problems" title="Direct link to Edge Case Problems">​</a></h4>
<p>Technology is changing too fast. Certainly, hundreds of cores, terabytes of RAM already, right? But it does go straight to your point: the smaller they are, the faster you can recover. And you don't hit some of these limits; these systems were not designed with these types of limits in mind. Some weird data structure where suddenly the limit is only, you know, a hundred items, and you hit those limits and then you are stuck.</p>
<p>Like recently, Metronome had an issue. Yeah, they had that outage, the multixact thing, which nobody had ever run into before, but they hit that problem. Yeah, we have so many problems like that at the edge, and yeah, it actually pushes Postgres forward sometimes. But if you want to be on the safe side... I really like this resilience characteristic: even if something is down, only a small part of your system is down. That's great.</p>
<p>Yeah, that's mature architecture already. That actually makes it easier to achieve five nines of uptime, because that's the way you calculate it: if only one node is down, you divide by the share of users affected.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="current-progress-update">Current Progress Update<a href="https://multigres.com/blog/postgres-fm-interview#current-progress-update" class="hash-link" aria-label="Direct link to Current Progress Update" title="Direct link to Current Progress Update">​</a></h4>
<p>Let's go. Cool. I think it's maybe one of the longest episodes we've had. I enjoyed that.</p>
<p>Oh, my God. I enjoyed it. I hope we will continue the discussion of issues with logical replication, for example, and so on, and maybe how things get improved. Looking forward to testing the POC once you have it. Thank you so much. Thank you.</p>
<p>Are there any last things you wanted to add? Or anything you wanted help from people on? I would say it feels like nothing is happening on the repository, except me pushing, you know, a few changes. But a huge amount of work is happening in the background. Some of this design work about consensus is almost ready to go. And there's also hiring going on; there are people coming on board very soon. So you will see this snowball. It's a very tiny snowball right now, but it's going to get very big as momentum builds up. So I'm pretty excited about that. We may still have one or two spots open to add to the team, but it's filling up fast. So if any of you are very familiar... this is a very high bar to...</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="team-building-plans--final-thoughts">Team Building Plans &amp; Final Thoughts<a href="https://multigres.com/blog/postgres-fm-interview#team-building-plans--final-thoughts" class="hash-link" aria-label="Direct link to Team Building Plans &amp; Final Thoughts" title="Direct link to Team Building Plans &amp; Final Thoughts">​</a></h4>
<p>You have to understand consensus, query processing. But if there are people who want to contribute, we are still looking for maybe one or two people, also on the orchestration side and the Kubernetes side of things. I wish. Oh my God. I almost hope that day never comes, but it is so fun working on this project, creating it. Why would I want to give it to an AI to do, you know?</p>
<p>Good, thank you, I enjoyed it a lot.</p>
<p>Yeah, thank you so much for joining us, it's great to have you as part of the Postgres community now and I'm excited to see what you get up to. And we too. Thank you.</p>
<p>Wonderful, thanks so much. Thank you. Bye bye.</p>]]></content:encoded>
            <category>postgres</category>
            <category>multigres</category>
            <category>interview</category>
            <category>sharding</category>
        </item>
        <item>
            <title><![CDATA[Interview: Multigres on Database School]]></title>
            <link>https://multigres.com/blog/database-school</link>
            <guid>https://multigres.com/blog/database-school</guid>
            <pubDate>Tue, 01 Jul 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Sugu discusses Multigres on the Database School YouTube channel. He shares the history of Vitess, its evolution, and the journey to creating Multigres for Postgres. The conversation covers the challenges faced at YouTube, the design decisions made in Vitess, and the vision for Multigres.]]></description>
            <content:encoded><![CDATA[<p>Sugu discusses Multigres on the Database School YouTube channel. He shares the history of Vitess, its evolution, and the journey to creating Multigres for Postgres. The conversation covers the challenges faced at YouTube, the design decisions made in Vitess, and the vision for Multigres.</p>
<iframe width="700" height="450" src="https://www.youtube-nocookie.com/embed/28q9mFh87KY" title="YouTube video" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="chapters">Chapters<a href="https://multigres.com/blog/database-school#chapters" class="hash-link" aria-label="Direct link to Chapters" title="Direct link to Chapters">​</a></h2>
<ul>
<li><a href="https://www.youtube.com/watch?v=28q9mFh87KY" target="_blank" rel="noopener noreferrer">00:00</a> - Intro</li>
<li><a href="https://www.youtube.com/watch?v=28q9mFh87KY&amp;t=98s" target="_blank" rel="noopener noreferrer">1:38</a> - The birth of Vitess at YouTube</li>
<li><a href="https://www.youtube.com/watch?v=28q9mFh87KY&amp;t=199s" target="_blank" rel="noopener noreferrer">3:19</a> - The spreadsheet that started it all</li>
<li><a href="https://www.youtube.com/watch?v=28q9mFh87KY&amp;t=377s" target="_blank" rel="noopener noreferrer">6:17</a> - Intelligent query parsing and connection pooling</li>
<li><a href="https://www.youtube.com/watch?v=28q9mFh87KY&amp;t=586s" target="_blank" rel="noopener noreferrer">9:46</a> - Preventing outages with query limits</li>
<li><a href="https://www.youtube.com/watch?v=28q9mFh87KY&amp;t=822s" target="_blank" rel="noopener noreferrer">13:42</a> - Growing Vitess beyond a connection pooler</li>
<li><a href="https://www.youtube.com/watch?v=28q9mFh87KY&amp;t=961s" target="_blank" rel="noopener noreferrer">16:01</a> - Choosing Go for Vitess</li>
<li><a href="https://www.youtube.com/watch?v=28q9mFh87KY&amp;t=1200s" target="_blank" rel="noopener noreferrer">20:00</a> - The life of a query in Vitess</li>
<li><a href="https://www.youtube.com/watch?v=28q9mFh87KY&amp;t=1392s" target="_blank" rel="noopener noreferrer">23:12</a> - How sharding worked at YouTube</li>
<li><a href="https://www.youtube.com/watch?v=28q9mFh87KY&amp;t=1563s" target="_blank" rel="noopener noreferrer">26:03</a> - Hiding the keyspace ID from applications</li>
<li><a href="https://www.youtube.com/watch?v=28q9mFh87KY&amp;t=1982s" target="_blank" rel="noopener noreferrer">33:02</a> - How Vitess evolved to hide complexity</li>
<li><a href="https://www.youtube.com/watch?v=28q9mFh87KY&amp;t=2165s" target="_blank" rel="noopener noreferrer">36:05</a> - Founding PlanetScale &amp; maintaining Vitess solo</li>
<li><a href="https://www.youtube.com/watch?v=28q9mFh87KY&amp;t=2362s" target="_blank" rel="noopener noreferrer">39:22</a> - Sabbatical, rediscovering empathy, and volunteering</li>
<li><a href="https://www.youtube.com/watch?v=28q9mFh87KY&amp;t=2528s" target="_blank" rel="noopener noreferrer">42:08</a> - The itch to bring Vitess to Postgres</li>
<li><a href="https://www.youtube.com/watch?v=28q9mFh87KY&amp;t=2690s" target="_blank" rel="noopener noreferrer">44:50</a> - Why Multigres focuses on compatibility and usability</li>
<li><a href="https://www.youtube.com/watch?v=28q9mFh87KY&amp;t=2940s" target="_blank" rel="noopener noreferrer">49:00</a> - The Postgres codebase vs. MySQL codebase</li>
<li><a href="https://www.youtube.com/watch?v=28q9mFh87KY&amp;t=3126s" target="_blank" rel="noopener noreferrer">52:06</a> - Joining Supabase &amp; building the Multigres team</li>
<li><a href="https://www.youtube.com/watch?v=28q9mFh87KY&amp;t=3260s" target="_blank" rel="noopener noreferrer">54:20</a> - Starting Multigres from scratch with lessons from Vitess</li>
<li><a href="https://www.youtube.com/watch?v=28q9mFh87KY&amp;t=3422s" target="_blank" rel="noopener noreferrer">57:02</a> - MVP goals for Multigres</li>
<li><a href="https://www.youtube.com/watch?v=28q9mFh87KY&amp;t=3662s" target="_blank" rel="noopener noreferrer">1:01:02</a> - Integration with Supabase &amp; database branching</li>
<li><a href="https://www.youtube.com/watch?v=28q9mFh87KY&amp;t=3921s" target="_blank" rel="noopener noreferrer">1:05:21</a> - Sugu's dream for Multigres</li>
<li><a href="https://www.youtube.com/watch?v=28q9mFh87KY&amp;t=4145s" target="_blank" rel="noopener noreferrer">1:09:05</a> - Small teams, hiring, and open positions</li>
<li><a href="https://www.youtube.com/watch?v=28q9mFh87KY&amp;t=4267s" target="_blank" rel="noopener noreferrer">1:11:07</a> - Community response to Multigres announcement</li>
<li><a href="https://www.youtube.com/watch?v=28q9mFh87KY&amp;t=4351s" target="_blank" rel="noopener noreferrer">1:12:31</a> - Where to find Sugu</li>
</ul>
<h2 class="anchor anchorWithStickyNavbar_B8JT" id="transcription">Transcription<a href="https://multigres.com/blog/database-school#transcription" class="hash-link" aria-label="Direct link to Transcription" title="Direct link to Transcription">​</a></h2>
<!-- -->
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="intro">Intro<a href="https://multigres.com/blog/database-school#intro" class="hash-link" aria-label="Direct link to Intro" title="Direct link to Intro">​</a></h4>
<p>Welcome back to Database School, I am your host, Aaron Francis. In this episode, I talk to the co-creator of Vitess and the co-founder of PlanetScale. The same person. His name is Sugu Sougoumarane. We talk about his time at YouTube and the invention of Vitess, moving on to PayPal, then founding PlanetScale. And finally, his time in the wilderness. He took a little sabbatical and then he comes back to create Vitess for Postgres. He's joining Supabase to bring Vitess to Postgres. You're really going to enjoy this - he's incredibly smart but also very, very humble and very thoughtful. So, enough yapping from me, let's enjoy this episode of Database School.</p>
<p>This is kind of some breaking news.</p>
<p>So, I'm super excited to be here. I am here with Sugu Sougoumarane. He is the co-creator of Vitess, the co-founder of PlanetScale. And he is back to talk with us about his next adventure. But before we get into that, we're going to dive into a little bit about the history of Vitess. And then we'll talk about what's up next. So, Sugu, thank you for being here. Do you want to introduce yourself a little bit? Sure, yeah. I'm Sugu. And I have been involved with databases for a pretty long time, I think since the 90s. Wow. I was working at Informix and then I moved on to PayPal in the early days and took care of scalability there. And from there, I moved</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="the-birth-of-vitess-at-youtube">The birth of Vitess at YouTube<a href="https://multigres.com/blog/database-school#the-birth-of-vitess-at-youtube" class="hash-link" aria-label="Direct link to The birth of Vitess at YouTube" title="Direct link to The birth of Vitess at YouTube">​</a></h4>
<p>to YouTube, and that is where I co-created Vitess with my colleague Mike Solomon. This was because YouTube was falling apart on the database side, and we had to come up with something that would leap us ahead in terms of the troubles it was going through. We can talk about that more later, but eventually Vitess, we decided to donate it to CNCF, and by that time Vitess had a lot of adoption - Slack started using it, Uber was using it, and more users were coming on board. At that time CashApp was also using it, so it was time to start a company fully dedicated to backing this software, which is how I ended up co-founding PlanetScale with Jiten, who was also with me at YouTube. He was actually on the SRE side, taking care of Vitess. And about three years ago, by the time it had been 12 years since I was working on Vitess, and three years ago, I had successfully built a full team that was taking care of the project, and I said, okay, it is time to take a break, so I went on sabbatical. And recently this itch came back that I need to do something, and when I saw Postgres, I just couldn't control myself, and so here I am. I love it.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="the-spreadsheet-that-started-it-all">The spreadsheet that started it all<a href="https://multigres.com/blog/database-school#the-spreadsheet-that-started-it-all" class="hash-link" aria-label="Direct link to The spreadsheet that started it all" title="Direct link to The spreadsheet that started it all">​</a></h4>
<p>That's a lot, and we're going to cover it all. Let's start, take me back to the YouTube days. So you're at YouTube, and YouTube is taking off, and y'all just can't keep up. Everything's falling apart all over the place. And so, what was the conversation like, when you or you and your co-creator went to somebody and said, what if we invented something entirely new and wrote it from scratch? How did that come to be? The solution that ended up working? So that's actually a very good story. It was actually my co-creator who did this, who thought of this idea, who said that this seems like a losing war, because every day was actually worse than the previous day. And every time there was an outage, it was the database. So what he did was he went to a Starbucks - it wasn't a Starbucks, it's actually a coffee shop called Dana Street in Mountain View - and wrote a huge spreadsheet. I still actually have this spreadsheet, where he listed all the problems that we were facing, and also all the problems we are going to face as this thing grows, and listed a bunch of solutions, saying that, okay, how do we solve this problem, how do we solve this problem, and then you take a step back and look at everything, we knew what we had to build. And the answer at that time was, we needed a proxy layer that stands between the application and MySQL and protects it. That was kind of the high level statement, and that is kind of how Vitess was born, and we basically said that in order to build this, we need a very different focus from where we are today, which is, our focus was always like, how do we fight the next fire, or how do we survive till next Tuesday? So those were the kind of thoughts that were in our mind. We actually pulled ourselves out of the day-to-day operations and said, we are going to take ourselves out, you guys are on your own, and we'll build this and come back with something that will solve all our problems. It was very ambitious.</p>
<p>Yeah, but it worked, so you and your co-creator set up a special unit, and you said, we're not involved with firefighting, we're not involved with fixing problems, we're going to do this big bet, and this big bet is, we're going to basically rewrite, not really, we're going to write a proxy layer that pretends to be MySQL and sits between the application and the actual databases. And we will do a bunch of stuff in there, so tell the people, what exactly, just kind of at a high level, what is Vitess, and why did...</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="intelligent-query-parsing-and-connection-pooling">Intelligent query parsing and connection pooling<a href="https://multigres.com/blog/database-school#intelligent-query-parsing-and-connection-pooling" class="hash-link" aria-label="Direct link to Intelligent query parsing and connection pooling" title="Direct link to Intelligent query parsing and connection pooling">​</a></h4>
<p>What is it about Vitess that solves all of those problems?</p>
<p>So, what Vitess is today is nothing like what we initially conceived. It evolved kind of iteratively over time. The initial conception of Vitess was just cluster management and protection of the database. And the MySQL database at that time had connection problems, which is actually still a problem with Postgres now, but now there are solutions coming up, but that was a problem with MySQL also. So, the first product that we built was just a connection pooler. Get traffic and pool it across just a few connections. And that took a while actually because we didn't build it as a simple connection pooler. We kind of knew that a dumb connection pooler will not take us far. We actually built an intelligent connection pooler that actually understood queries. We actually parsed every query because we knew that we had to filter queries, judge them and say that this is a bad query. We did not let it go to the database. Those were kind of the thoughts that we had because the bad queries would be the ones that could actually take our databases down all the time. So, that's the reason why we went one level above the immediate requirement. And it took us about a year finally to get it launched. And at that time, it was not Vitess, but what happened was once that server was out there in the middle, people realized, oh my God, there are things we can do here before we reach the database. So, feature requests started coming in and slowly, and because we had the parser, we could do some very intelligent things with our queries. And so, that is kind of what led to the large number of features that we could add to Vitess - without the parser, we wouldn't have been able to add them. And that eventually evolved into what Vitess is today, essentially emulating an entire database.</p>
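<p>Reduced to a sketch, the "intelligent pooler" shape is: parse and judge every query before it is allowed to consume one of a small number of real connections. Hypothetical code, nothing like the real VTTablet internals:</p>
<pre><code class="language-go">// Sketch: a pooler that parses and judges every query before it gets
// one of a small number of real database connections.
package pooler

import (
	"database/sql"
	"errors"
)

// Parsed is whatever the parser learned about the query.
type Parsed struct {
	Table    string
	IsWrite  bool
	HasLimit bool
}

// parse is a stand-in for a real SQL parser.
func parse(query string) (Parsed, error) { return Parsed{}, nil }

type Pool struct {
	db *sql.DB // database/sql already multiplexes over few connections
}

func (p *Pool) Exec(query string) (*sql.Rows, error) {
	parsed, err := parse(query)
	if err != nil {
		return nil, errors.New("rejected: unparseable query")
	}
	// Judge it here: block known-bad patterns, enforce limits, etc.
	_ = parsed
	return p.db.Query(query)
}
</code></pre>
<p>The design point is that once every query flows through a parser, each new protection (limits, filtering, routing) becomes an incremental feature rather than a rewrite, which is exactly the evolution described next.</p>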
<p>Yeah, so that was an interesting and insightful call there at the beginning: instead of just doing dumb connection pooling, we're going to add a little bit of intelligence here to inspect the queries. And it sounds like that was the golden insight there. And then once you have that server in the middle, it's like you have a new bucket in your brain where you can put things. And you're like, oh, because we have this server in the middle, what if we did this, and what if we did that? Whereas when that doesn't exist, your brain doesn't even think that way; it's like, well, we can't do that because we're just talking to the database, and I'm not going to go rewrite MySQL. But now that you have this thing, you can cast all of your hopes and dreams upon it. So what were some of the first things that came in? You're in this secluded team writing this thing, and you put it out to the world, and everybody's like, great, you solved some of our problems, now can you do this? What were some of the first few big ones that came in?</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="preventing-outages-with-query-limits">Preventing outages with query limits<a href="https://multigres.com/blog/database-school#preventing-outages-with-query-limits" class="hash-link" aria-label="Direct link to Preventing outages with query limits" title="Direct link to Preventing outages with query limits">​</a></h4>
<p>So, the first one is adding a limit clause to every query that the applications sent. Interesting, because in OLTP, you are not expected to fetch 100,000 rows to serve your web page. Right. So, what we did was, if your query had no limit clause, we would put one, and 10,000 was the limit. If the number of rows exceeded that limit, we would return an error. So, that was actually one of the biggest problems at YouTube, because people would assume how many videos can somebody have, and you would have a hypothetical answer. So, you would say, oh, just fetch all the videos, but there are people who have a huge number of videos, for example. Sure, yeah. And so, when this happened, this protected against a whole bunch of outages.</p>
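<p>Mechanically, this can be done by fetching one row more than the cap: rewrite unbounded queries to LIMIT 10001, and if 10,001 rows come back, the query exceeded the cap. A hedged sketch follows; the real implementation would work on the parse tree, not on strings:</p>
<pre><code class="language-go">// Sketch: enforce a row cap by appending LIMIT cap+1 to unbounded
// queries; seeing cap+1 rows back means the query exceeded the cap.
package limits

import (
	"fmt"
	"strings"
)

const maxRows = 10000

// addLimit naively appends a LIMIT if none is present. (A real
// implementation inspects the parse tree, not the query string.)
func addLimit(query string) string {
	if strings.Contains(strings.ToUpper(query), " LIMIT ") {
		return query
	}
	return fmt.Sprintf("%s LIMIT %d", query, maxRows+1)
}

// checkRowCount is applied to the result: cap+1 rows means "too many",
// so the proxy returns an error instead of a silently truncated set.
func checkRowCount(n int) error {
	if n > maxRows {
		return fmt.Errorf("query exceeded %d row limit", maxRows)
	}
	return nil
}
</code></pre>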
<p>Another really cool feature that we added in the beginning - this happened, again, related to the number of videos. On the YouTube home page, we used to highlight certain users and show their videos. One of the users that got highlighted had 250,000 videos, and that query had no limit clause. Which meant that every page hit would go and fetch those 250,000 videos. So when you put this limit in there, stuff like that just hits a hard wall: sorry, that doesn't work.</p>
<p>What do the application developers think at that point? Are they super happy that you've surfaced some sort of error? As far as the internal teams go: you put this blocker on the database, and then what happens on the application developer side?</p>
<p>Not all of them were happy.</p>
<p>Yeah. I think we had some level of arrogance to tell them to live with it.</p>
<p>Yeah. I'm not going to change this. And I guess in our case, it came from the fact that protecting the database is more important than keeping one developer happy. So, yes, I think we might have pissed off a few developers.</p>
<p>So, at that point, are you on the ops team? Is it ops versus application developers?</p>
<p>Yeah. At that time, we were the architecture team. We also kind of dictated how applications were written, what rules to follow to access the database; we were deploying software. Later, we became the ops team, but at that time, we were the architecture team.</p>
<p>Yeah. You're talking 2007? I know - that's why I was thinking DevOps probably wasn't around. No, DevOps didn't exist.</p>
<p>Yeah. So, you're just ops or architecture. Man, that's so interesting.</p>
<p>Okay. So, the first one was taking this query that shouldn't ever exist - that could result in degraded performance or just knock the whole thing over - and stopping it at the border, saying: sorry, you can't come in here, error, go fix your stuff. That seems pretty reasonable. You said at the beginning, um...</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="growing-vitess-beyond-a-connection-pooler">Growing Vitess beyond a connection pooler<a href="https://multigres.com/blog/database-school#growing-vitess-beyond-a-connection-pooler" class="hash-link" aria-label="Direct link to Growing Vitess beyond a connection pooler" title="Direct link to Growing Vitess beyond a connection pooler">​</a></h4>
<p>Vitess was fundamentally smaller and very different from what it's grown into now. So how did you guide that growth, from an intelligent connection pooler to what it is now, this massive, sprawling thing?</p>
<p>Talk us through that journey of, I assume, a few years - I know the whole thing has been running for several, but I have to imagine those first few years were pretty active.</p>
<p>These are good questions, because nobody has asked these questions. So when we first deployed Vitess, it was just VTTablet, by the way. And even VTTablet faced a lot of obstacles, because it was an additional layer and it added latency.</p>
<p>2011 is when we first launched; we started working on Vitess in 2010. In 2011, Go 1.0 was not even out yet - I think Go 1.0 came out in 2012 or so - so we were launching on a non-released version of Go, and it was not efficient. I think at that time the latency was like 10 milliseconds. Just horrendous. Now it's sub-millisecond. And we were on hard disks, by the way; there were no SSDs in those days.</p>
<p>So pause there for one second. What was the choice to go with Go, with it being so early? It must have had some massive upsides that papered over all of these other problems. So what was the decision there?</p>
<p>You wouldn't believe it - it was a very quick decision. I hated Java, and my co-creator hated C. He said: not Python, obviously, because this needs to perform. How about this new language called Go?</p>
<p>It's incredible. I love how human it all is. You picked Go because it was the mutually agreed upon non-hated language. And it's still pre-1.0.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="choosing-go-for-vitess">Choosing Go for Vitess<a href="https://multigres.com/blog/database-school#choosing-go-for-vitess" class="hash-link" aria-label="Direct link to Choosing Go for Vitess" title="Direct link to Choosing Go for Vitess">​</a></h4>
<p>Okay, so carry on: you're working with Go, it's got 10 milliseconds of latency, and you're starting to grow.</p>
<p>Yeah. And fortunately, when we later learned about Go, we really, really liked the Go team's approach to the language. They were thinking exactly the way we were approaching problems. You know, there is this fascination with complexity out there - and we were like, things should be as simple as you can make them; you should not overcomplicate. So we were pretty much in sync with them, and we really liked them. And since they were also at Google, they got wind that we were developing this, so they helped us quite a bit. They actually prioritized our problems and worked very hard at making things work for us, which is pretty awesome.</p>
<p>So we were basically adding more features to VTTablet, and by that time YouTube was in a happy place. But then came the first resharding experiment - I don't remember anymore, I think we had 8 shards and we were going to 16. (So quaint, so long ago. That's crazy.) And that was quite a nightmare. So we said: oh, we need to add resharding, make it more automatic. That's when we introduced this whole metadata layer. At the time it was ZooKeeper - there was no etcd - so we stored metadata in ZooKeeper and used that to manage resharding. But that required the application to connect to ZooKeeper to figure out what to connect to, which made the application more complex. Every time you resharded, you had to push things to ZooKeeper and have the application reload that information: this is the new sharding configuration, you have to go there. That's when we decided to introduce the VTGate layer, because we can't have changes from resharding affecting the application. We'll have this VTGate layer, which will watch over the topology; you just send traffic to VTGate, and VTGate will know which shard to send it to.</p>
<p>It was not a database connection, though. It was actually a specialized RPC connection, because the application still had to know which shard - at the time we had this concept of a keyspace ID, which mapped to a shard. The keyspace ID concept is still there in Vitess. Previously, the application would say: send this query to that shard. We changed the API to say: send this query to this keyspace ID. VTGate would take the keyspace ID, look at the topology, and say: oh, that maps to this shard, I'm sending it there. So if you resharded, the application was unaffected. That's what happened.</p>
<p>Okay. So you started with the VTTablet, and then, as a function of the painful reshard, you introduced the VTGate. So for the people listening: at this point in the history, talk us through, from the application to the actual data, where we are connecting and who's talking to whom. Where do VTGate, VTTablet, and the running MySQL instance fit in, from the application all the way through?</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="the-life-of-a-query-in-vitess">The life of a query in Vitess<a href="https://multigres.com/blog/database-school#the-life-of-a-query-in-vitess" class="hash-link" aria-label="Direct link to The life of a query in Vitess" title="Direct link to The life of a query in Vitess">​</a></h4>
<p>Let me tell you the life of a query. I don't know if you've heard of the "life of a query" stories. That's it. That's perfect.</p>
<p>So, the application uses gRPC. Actually, in Google it is Stubby, right?</p>
<p>Stubby, yes. It's a...</p>
<p>You tell me, I don't know. It's Google's version of gRPC.</p>
<p>Okay - gRPC was born off that project. I should actually confirm that, but I think it is Stubby. Anyway, the application connects via gRPC to one of the VTGates and sends a query, you know, SELECT *. But along with the query, it says: this query belongs to this keyspace ID - it's a SELECT for a user, and the user lives at this keyspace ID. So then VTGate says: okay, let me go look up this keyspace ID in my topology. And the topology says: this keyspace ID is in this key range, and that maps to this shard. Then VTGate asks: which VTTablets are present that serve this shard? That is also in the topology, and it gets a list. For a read query, it chooses one randomly - it's a random choice. Then it sends the query to that VTTablet, saying: please serve this query for me. And VTTablet has all the protection there: if a limit clause is missing, it'll add it; if it's an invalid query, it'll reject it.</p>
<p>And then VTTablet sends the query to MySQL, which serves it, and the results are returned back to the application. That's the life of a query.</p>
<p>I love it. That's very straightforward. So is it fair to say that at this point in the history of Vitess, VTGate just handles pointing to the correct VTTablet? And how is the application developer signaling to VTGate the - what did you call it - keyspace ID, which is not mapped to a shard directly, but loosely mapped to a shard by somebody else? How is the application developer saying "this user Aaron Francis is in keyspace ID one-two-three"? Where does that come from?</p>
<p>That knowledge was already present when we sharded - when we did the initial shard. Got it.</p>
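<p>Here is a minimal Go sketch of that routing walk - keyspace ID to key range to shard to a randomly chosen tablet. The types, shard layout, and names are illustrative stand-ins, not the actual Vitess topology API:</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"math/rand"
)

// shard covers a half-open key range [start, end) and is served by tablets.
type shard struct {
	name       string
	start, end uint64
	tablets    []string
}

// route finds the shard whose key range contains keyspaceID and picks a
// random tablet for a read query, as described above.
func route(keyspaceID uint64, shards []shard) (string, string, error) {
	for _, s := range shards {
		if keyspaceID &gt;= s.start &amp;&amp; keyspaceID &lt; s.end {
			t := s.tablets[rand.Intn(len(s.tablets))]
			return s.name, t, nil
		}
	}
	return "", "", fmt.Errorf("no shard covers keyspace ID %x", keyspaceID)
}

func main() {
	// Two shards splitting the keyspace at the halfway point, in the style
	// of Vitess's "-80" / "80-" range naming (toy ranges, end exclusive).
	shards := []shard{
		{name: "-80", start: 0, end: 1 &lt;&lt; 63, tablets: []string{"tablet-1", "tablet-2"}},
		{name: "80-", start: 1 &lt;&lt; 63, end: ^uint64(0), tablets: []string{"tablet-3"}},
	}
	s, t, _ := route(0x1234, shards)
	fmt.Println(s, t) // "-80" and one of its tablets
}
</code></pre>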
<p>Okay. And there were two methods. The first sharding technique was actually not a hash of the user ID - it was a random assignment of a user ID to a shard, and we had a lookup table that said where that user lived. We moved away from that and changed it to a hash of the user ID.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="how-sharding-worked-at-youtube">How sharding worked at YouTube<a href="https://multigres.com/blog/database-school#how-sharding-worked-at-youtube" class="hash-link" aria-label="Direct link to How sharding worked at YouTube" title="Direct link to How sharding worked at YouTube">​</a></h4>
<p>So the user ID would get hashed to a keyspace ID, and that keyspace ID mapped to a range-based shard. In our case the ranges were equally split - 8, 16, 32, 64, that way. That was for user IDs. But if you had a SELECT by video ID - we wanted to keep videos with their users, so for video IDs, you had to know which user ID the video ID belonged to, and we held that in a table. You're smiling because you know where this is going. We would look up the table, find the user ID for the video ID, then do the hash and send the query to the shard that had that user.</p>
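<p>In code, something like this hypothetical Go sketch: a user ID hashes straight to a keyspace ID, while a video ID goes through a lookup table first. The FNV hash and the in-memory map are assumptions standing in for whatever YouTube actually used:</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"hash/fnv"
)

// keyspaceIDForUser hashes a user ID into the keyspace; the keyspace ID
// then maps onto a range-based shard (8, 16, 32, 64 equal splits...).
func keyspaceIDForUser(userID uint64) uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d", userID)
	return h.Sum64()
}

// videoOwner stands in for the lookup table that mapped video ID to the
// owning user ID, so videos stay co-located with their users.
var videoOwner = map[uint64]uint64{9001: 42}

func keyspaceIDForVideo(videoID uint64) (uint64, bool) {
	userID, ok := videoOwner[videoID]
	if !ok {
		return 0, false
	}
	return keyspaceIDForUser(userID), true
}

func main() {
	fmt.Printf("user 42 lives at keyspace ID %x\n", keyspaceIDForUser(42))
	if kid, ok := keyspaceIDForVideo(9001); ok {
		fmt.Printf("video 9001 routes to keyspace ID %x\n", kid)
	}
}
</code></pre>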
<p>Okay. And so where do we live now? Like, you know, this is all groundbreaking at the time, but I'm sure looking back on it, you're like, oh my goodness, what in the world?</p>
<p>So where does it live now - all of the pieces of Vitess, and who is responsible for what?</p>
<p>Because from an outsider's perspective - somebody, say, connecting to a PlanetScale database - it seems pretty easy to me; I don't really have to do anything. Some of those components have gotten a lot smarter and taken a lot of that off of the application developer. So what is the state of it right now - or when it became mature, maybe, rather than right now - how do those components fit in with each other?</p>
<p>The biggest leap that we made, which transitioned Vitess from where it was then to where it is today, was a decision at that time that was really, really scary. This keyspace ID that I talked about was physically present as a column on every table. If you wanted to know the keyspace ID of a row, you could select that row, and there was a column that said: this is your keyspace ID. The question that I asked myself was: can we live without that column? Can we hide this from the application - what if the application didn't know the keyspace ID? Would the system continue to work as efficiently as it did before?</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="hiding-the-keyspace-id-from-applications">Hiding the keyspace ID from applications<a href="https://multigres.com/blog/database-school#hiding-the-keyspace-id-from-applications" class="hash-link" aria-label="Direct link to Hiding the keyspace ID from applications" title="Direct link to Hiding the keyspace ID from applications">​</a></h4>
<p>question further, which means that if the application - or the converse of that question is, can you compute the keyspace ID using the WHERE clause of your SELECT statement? So, it took a few months of pure thinking to come to the conclusion that yes, this can work. So, before we get into how it could work, what made you ask that question? Because I was not happy that at that time, you could not write an ODBC driver that could connect to Vitess. So, you were looking at, I'm casting this upon you, so if this is wrong, tell me, you were looking at adoption and you were saying, hey, if this thing is going to get adoption, we got to play the game and we can't have custom connectors for everything. So, I'm surprised at how mature we were - we actually at that time felt that for Vitess to have a real future, even within YouTube, it needs to be adopted outside YouTube. That is very mature. That's an interesting insight. We felt that if YouTube was the only user of Vitess, eventually somebody is going to say, why maintain something this bespoke? So, that is what motivated me to think that we need a generic database API for Vitess.</p>
<p>So you asked the hard question, and then you spent many months thinking about it. And then what happened?</p>
<p>That was actually one of the hardest problems for me to solve. And it didn't stop there. What I had to do was go back and learn relational algebra, and figure out - prove to myself - that sharding, and combining those shards into a single unit, can be modeled using traditional relational algebra. That's why it was difficult. And then, eventually, mapping the relational algebra back to SQL.</p>
<p>The initial document should still be there in Vitess from when I first wrote it. I wrote an enormous query - with all possible constructs of SQL - and showed how it maps to relational operations, and how such a relational operation would work in a sharded database. This is fascinating.</p>
<p>Okay, so let me gather my thoughts here. You decided: we are going to parse the query, and then we are going to figure out which shard it needs to go to. So you're going to eat all of that pain on behalf of the application developers. And that's what takes it from "Vitess is pretty useful but pretty specialized" to "Vitess is extremely useful" - the application needs to know even less, and Vitess will just do a lot more work. So did you push the same query parser up to VTGate and use that, or did you write a new one?</p>
<p>Okay, so you use that same one.</p>
<p>Okay, so the parser is the same. Now - and you're going to have to keep it high level for me - how did you determine from the query which shard it should go to?</p>
<p>Yes. So what I implemented then is not what is there today. But at that time, the way I started was: let's take the simple case. If it is a SELECT statement where the WHERE clause is on the sharding key, it's very straightforward. We like that.</p>
<p>Yeah. So you start with that, and then you go to a slightly more complex construct, which is a join, where the join is on the sharding key and the WHERE clause is also on the sharding key. Then the query still goes to a single shard - that was the next level of complexity. The third level of complexity is a query where you do the join on the sharding key, but there is no WHERE clause. If that happens, you don't have to break that query up into smaller parts: you send it to all shards, without changing it, because the rows that join with each other live within the same shard. And the fourth level of complexity is when you realize: oh, the rows for this query are not all in the same place. This is where the relational algebra came into play - can you identify parts within this complex query where you can say: this portion of the query can still be preserved and sent to a single shard?</p>
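<p>Those four levels can be caricatured in a few lines of Go. This is a toy decision table over a simplified query shape - an assumption for illustration, not the real Vitess planner, which works on a full AST:</p>
<pre><code class="language-go">package main

import "fmt"

// query is a deliberately simplified stand-in for a parsed statement.
type query struct {
	hasJoin            bool
	joinOnShardingKey  bool
	whereOnShardingKey bool
}

func choosePlan(q query) string {
	switch {
	case q.whereOnShardingKey &amp;&amp; (!q.hasJoin || q.joinOnShardingKey):
		// Levels 1 and 2: the whole query pins to a single shard.
		return "route to one shard"
	case q.hasJoin &amp;&amp; q.joinOnShardingKey:
		// Level 3: joined rows are co-located, so send the query unchanged
		// to every shard.
		return "scatter to all shards as is"
	default:
		// Level 4: rows live on different shards; decompose the query into
		// parts that can each be preserved and sent to a single shard.
		return "break up and recombine"
	}
}

func main() {
	fmt.Println(choosePlan(query{whereOnShardingKey: true}))
	fmt.Println(choosePlan(query{hasJoin: true, joinOnShardingKey: true}))
	fmt.Println(choosePlan(query{hasJoin: true}))
}
</code></pre>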
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="how-vitess-evolved-to-hide-complexity">How Vitess evolved to hide complexity<a href="https://multigres.com/blog/database-school#how-vitess-evolved-to-hide-complexity" class="hash-link" aria-label="Direct link to How Vitess evolved to hide complexity" title="Direct link to How Vitess evolved to hide complexity">​</a></h4>
<p>That was the aha moment for me: even if there is a complex query, you can identify parts that can be preserved as-is and sent to individual shards.</p>
<p>Incredible. So, you know, I don't think learning relational algebra and deconstructing queries is broadly applicable, but what you did is broadly applicable: you started with "can we make the simple case work, can we just make the easy one work?"</p>
<p>Yes, we can. Then move on to slightly harder, then slightly harder, until you have this framework built up in your head and you get to the hardest one - having taken confidence-building, but also groundwork-building, steps along the way, such that you can tackle the hardest one at the end. And is there a place still online where we can read this relational algebra paper, post, whatever? Is that still available somewhere?</p>
<p>I did a video series. I lost most of my audience about three-fourths of the way through - it's still there on my YouTube channel - because it is a very hard problem. Yeah, sounds like it. And unless it's a matter of survival, you would not want to put yourself through that pain.</p>
<p>Okay. Well, I'm going to find that, and I'll leave a link down below, so if any of you fellow database nerds want to watch it, you can.</p>
<p>It was not planned. It's pure free-form, where I said: I'm just going to start talking. I just kept talking; an hour passed; I said, let's stop here. And I think it's six or seven parts - so seven, eight hours of me just rambling. It's very boring. I would not recommend it. Actually, I should mention Andres, who is at PlanetScale. He is the first one who ingested that entire thing and said: you know what, there's a better way to do this.</p>
<p>Oh, man, I love that. I am very pro reading original source material - and in this case, watching original source material. Because, like you said, nobody does it. And if nobody is doing it, there's a huge advantage to be gained when you go straight to the source. So it sounds like that happened at PlanetScale, which brings us to the PlanetScale years, because I want to talk about</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="founding-planetscale--maintaining-vitess-solo">Founding PlanetScale &amp; maintaining Vitess solo<a href="https://multigres.com/blog/database-school#founding-planetscale--maintaining-vitess-solo" class="hash-link" aria-label="Direct link to Founding PlanetScale &amp; maintaining Vitess solo" title="Direct link to Founding PlanetScale &amp; maintaining Vitess solo">​</a></h4>
<p>Vitess for Postgres, and spend a lot of our time there. So tell me, as briefly or as long as you want: you founded PlanetScale with a co-founder, and then you decided, "I need some time off," and you took three years off. Let's lump those two things together. Tell me about that part of your story, before you come back out of retirement for your triumphant return.</p>
<p>So I went on sabbatical - I even called it retirement for a while.</p>
<p>When I co-founded PlanetScale with Jiten, I was the only one who left Google from the original Vitess team. I was actually the sole maintainer of Vitess even by that time, because YouTube was migrating to Spanner. That had nothing to do with Vitess being bad; it was just a policy decision at Google - they wanted a uniform data store, and they didn't want to maintain something bespoke. So at that time, I was the only one maintaining Vitess.</p>
<p>Vitess, by the way, is too complex to fit in one person's brain. That includes mine. The way I managed it was by time compartmentalizing, which means I would allocate a few months to just one area of Vitess, and during that time, I was incapable of answering questions or helping anyone with anything else. I would focus on that, then move to something else. That's how I maintained it for about one and a half years. It was very stressful. But over this time, at PlanetScale, I managed to find great people like Vicken and Rishen - the entire Vitess team - and they ramped up very quickly. There was a time when I was outproducing the entire team, and by 2022, I could barely keep up with a single one of them. They were all incredible. And that's when I realized: you know what, I've succeeded.</p>
<p>Yeah, no kidding.</p>
<p>I hadn't taken a break in something like 12 years. That's how I went on this sabbatical.</p>
<p>Okay. That's incredible, by the way. So you built up this foundational piece of software, and you get these great people in. When I was at PlanetScale, I worked with Deepthi, and I just love her. She's amazing, she's wonderful - I'm sure the rest of the team is great, but I worked with Deepthi, and she's a delight.</p>
<p>Well, she's doing it. She's doing a great job.</p>
<p>So you built up this team of geniuses, and you stepped away from it, which was, I thought, going to be the hardest part. You decided: all right, it's time to take a little rest. And then you go and do what for three years? Just chill out?</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="sabbatical-rediscovering-empathy-and-volunteering">Sabbatical, rediscovering empathy, and volunteering<a href="https://multigres.com/blog/database-school#sabbatical-rediscovering-empathy-and-volunteering" class="hash-link" aria-label="Direct link to Sabbatical, rediscovering empathy, and volunteering" title="Direct link to Sabbatical, rediscovering empathy, and volunteering">​</a></h4>
<p>I would call it reinventing myself. I look and behave like the same person. The way I would put it is: I had a value system, which I think was good, which I think people appreciated. But I don't think I was connected to it emotionally, within myself. It was more of just a behavior. And what I did in the last three years was basically work on connecting myself to myself, if that makes sense.</p>
<p>Fascinating. Can you give us any examples? Maybe one of the values that you felt was just behavior, but that you weren't connected to?</p>
<p>Empathy. When you see someone in pain, you empathize with them - but do you actually feel what they are feeling? Sometimes you see something bad happening to someone, and you know you have to be nice to them, you have to say nice things. But do you actually feel what they are feeling? Those are the kinds of things I worked on: truly understanding people's pains, truly understanding people's joys, and actually relating to them. The transformation was so fundamental that I went into a lot of volunteer work, because now I understand why you have to work with other humans and help others. For a while I thought of doing that full time, until I felt I was starting to miss the tech work too. So I still do all the volunteer work, but now I am also doing the technical work.</p>
<p>This is such a delightful story. So you go off into the wilderness, you reinvent yourself, you rediscover, you find yourself - and then you have the original call pulling you back. You've done the thing, you've gone off, you've become more yourself, and then the siren song of the thing you walked away from starts calling. So tell me...</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="the-itch-to-bring-vitess-to-postgres">The itch to bring Vitess to Postgres<a href="https://multigres.com/blog/database-school#the-itch-to-bring-vitess-to-postgres" class="hash-link" aria-label="Direct link to The itch to bring Vitess to Postgres" title="Direct link to The itch to bring Vitess to Postgres">​</a></h4>
<p>What is it that started to bring you back, and why was it Postgres this time, instead of MySQL?</p>
<p>The idea of Postgres has always been sitting in the back of my mind, constantly nagging me. It's not new - it's been there since 2020. In 2020, a few people from the Postgres community talked to me and said: we need to figure out a way to do this. There are actually even issues open in Vitess about it, saying this problem needs solving. I was excited, I wanted to do it - and I still feel guilty - but I had to say: I can't focus on this. There was too much to take care of at PlanetScale. Too many things: we were still making Vitess work better, bringing up features, making the serverless product work. I was very sad to set that project aside. I didn't put an end to it; I said there will be another time, but not now. It has been in the back of my mind ever since. Even during my sabbatical, I reached out to an old Vitess contributor and told them: you start this Postgres thing for me. Nobody did. So this has been slowly growing, and it became an obsession in the last couple of months, when I realized Postgres is exploding. I am an advisor to a company called Metronome, and they had an outage and published a postmortem. I saw that and said: oh my god, this problem needs to be solved today. That's when I decided to figure out what we could do to restart this project.</p>
<p>And I will say, I don't think you should feel guilty, because the thing you told that person - "I can't do it right now, but someday" - is now true. Now is the day; you're doing it. So I don't know if that issue is still open, but you've got to go find it and say: we're back, baby. So you've had this itch in the back of your head for a long time - let's bring Vitess to Postgres. Now that you're focusing on it, talk to me about some of the things...</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="why-multigres-focuses-on-compatibility-and-usability">Why Multigres focuses on compatibility and usability<a href="https://multigres.com/blog/database-school#why-multigres-focuses-on-compatibility-and-usability" class="hash-link" aria-label="Direct link to Why Multigres focuses on compatibility and usability" title="Direct link to Why Multigres focuses on compatibility and usability">​</a></h4>
<p>What are some things you either want to do the same or differently from the original - or current - Vitess, which we'll just call Vitess? And we're going to call the new one Multigres, is that right? Multigres, okay. So what are some things you want to do with Multigres that are the same as, or different from, Vitess - things that worked great, or things where you're like: man, now that I have a clean slate, let's redo it? And to color that answer: what are some of the things about Postgres that require you to do something slightly different? You can take that anywhere you want, but I want to hear about your vision for Multigres.</p>
<p>So let me talk about the things I feel we did not prioritize in Vitess.</p>
<p>Yeah - and this is no indictment, by the way. Vitess is a fundamentally successful project that is powering huge applications. But you have a chance to start over, so go on.</p>
<p>The first one is compatibility. I would focus on compatibility: make sure that, to the extent possible, every Postgres construct works like before. Multigres should feel and act like Postgres. (Love that.) That is the highest priority. When we built Vitess, that was not our goal, mainly because we thought Vitess would become its own standard as a database. And that turned out not to be true: even today, the first question people ask is, how compatible is it?</p>
<p>The other thing we did not really pay attention to is approachability and usability. Vitess is a very, very complex piece of software - extremely flexible. You could go to Vitess and ask: can you do this very specific thing? The answer is most likely yes, because all you have to do is take this piece and that piece, put them together, and write a script, and it will do it for you. There's a command-line option for everything - literally, if you run vttablet -h, you will get pages of command-line options, and you can change any of them to make Vitess do what you like. But that has been a huge barrier to adoption, because it's daunting to think about bringing up a Vitess cluster. So that is the second problem I would definitely address in Multigres.</p>
<p>What would be a third one? Let me see if I can bring up my notes here - I had some notes.</p>
<p>I would say the other problem is something we struggled with for a long time, which is 2PC. Actually, we finally have it - I shouldn't say we; it's the Vitess team that did this. The Vitess world has always discouraged relying on two-phase commit, on distributed transactions, because we always believed that two-phase transactions are a slippery slope: once you allow them, people may abuse them and end up in a state where they're not happy. So we highly discouraged it, to the extent that we didn't even want to support it.</p>
<p>And finally, the Vitess team realized we should support it, and they've added support. That's something I would prioritize for Postgres, because again, it goes back to compatibility: if you don't do 2PC, you're not going to have compatibility. So those are the three things I would prioritize going forward.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="the-postgres-codebase-vs-mysql-codebase">The Postgres codebase vs. MySQL codebase<a href="https://multigres.com/blog/database-school#the-postgres-codebase-vs-mysql-codebase" class="hash-link" aria-label="Direct link to The Postgres codebase vs. MySQL codebase" title="Direct link to The Postgres codebase vs. MySQL codebase">​</a></h4>
<p>Everything else I'll mostly bring over. Yeah, that's huge - that's a huge endorsement of Vitess.</p>
<p>And so, with the architecture of Postgres: does that make those things harder, easier, or does it have no effect? Because MySQL and Postgres are fundamentally different beasts, and you're not going to be able to just grab it all and move it over. Does it make your life a lot easier that you're now doing this for Postgres instead of MySQL?</p>
<p>Yes, I think there are a few reasons why it's easier. One is that Postgres has a much cleaner 2PC API - very beautiful, I've seen it - whereas MySQL's is not that simple. So that's number one.</p>
<p>Compatibility is actually going to be easier, mainly because of how approachable the Postgres codebase is, and how much tribal knowledge exists about the Postgres engine itself. For example, if we wanted to implement stored procedures in Vitess right now, we would have to do it from the ground up. But doing it for Postgres, we can take the entire existing stored procedure implementation in Postgres and just transport it in. So compare the source code of Postgres and MySQL for me real quick.</p>
<p>I've read horror stories on Hacker News about the source code for MySQL, and you said Postgres is so clean and understandable. What's going on there? What's the big difference?</p>
<p>I think the big difference is that there are not as many experts in the MySQL code. For example, I wouldn't even know where to ask whether there is somebody who knows how the engine works - whereas in Postgres, there are so many places to ask. I would say it's the openness of the community. Yeah, that makes sense. And I've heard - no, please, go ahead.</p>
<p>Yeah - it feels like any question you have, you can probably get answered. And also the licensing: because it's the PostgreSQL license, I can freely copy the code. With the MySQL license, even if I can see the code, I cannot copy it.</p>
<p>Gotcha. Well, that makes a huge difference. I have heard amazing things - not about the Postgres user community, though I haven't heard bad things there, but about the Postgres hackers, which I think is what they call themselves. I've heard amazing things about that community, the people that actually work down on the core.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="joining-supabase--building-the-multigres-team">Joining Supabase &amp; building the Multigres team<a href="https://multigres.com/blog/database-school#joining-supabase--building-the-multigres-team" class="hash-link" aria-label="Direct link to Joining Supabase &amp; building the Multigres team" title="Direct link to Joining Supabase &amp; building the Multigres team">​</a></h4>
<p>Everything is just done out in the open: you put it on the mailing list, you get to talk to people and gather consensus. And while that may sound like a slow and arduous process, it all happens in public, and there are people you can talk to - you can find them at conferences, or email them, or hop on a call with them, and get all of that knowledge downloaded into your head. So that makes a lot of sense to me. That part is going to make your life a lot easier.</p>
<p>How are you - or, I should say, you and the team, since you have joined Supabase to further this effort -</p>
<p>So how are you, and whatever team you're building around yourself inside of Supabase, going to develop it, structure it, organize it - both your budding new community and the actual software project? Are you going to do VTGates and VTTablets - is all of that architecture coming over? Maybe start with the organization of the team and your decision to join Supabase, and then we can move into the technical organization.</p>
<p>Yeah, that's a great question. My idea in joining Supabase is: you know, I have done this before. I was able to hire and train people, so building a team is not a problem. And there are also old contributors out there that I should be able to reach and get on board. So I have confidence in building the right team, and with the right people, I know it won't be hard to bring them up to speed. That part I'm not worried about at all.</p>
<p>And is that your mandate inside Supabase? Let's go grab some people and form this Multigres team?</p>
<p>Find the right talent, get them ramped up, and do it. So that's the team strategy.</p>
<p>Okay. And the technical strategy?</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="starting-multigres-from-scratch-with-lessons-from-vitess">Starting Multigres from scratch with lessons from Vitess<a href="https://multigres.com/blog/database-school#starting-multigres-from-scratch-with-lessons-from-vitess" class="hash-link" aria-label="Direct link to Starting Multigres from scratch with lessons from Vitess" title="Direct link to Starting Multigres from scratch with lessons from Vitess">​</a></h4>
<p>What I want Multigres to be is a Postgres-native project. In other words, I should put up a "no MySQL" sign.</p>
<p>There are two reasons for that. One is that we don't want anything MySQL-specific: for a project that is meant to last for many years for Postgres, we don't want to inherit anything we did just because it was MySQL. So that's one thing.</p>
<p>And the other thing is that Vitess is an old project - it's 15 years old. It has legacy features that are either not in use, or that we support only because we don't want to break somebody. There's no reason to bring those into this new project.</p>
<p>So for these two reasons, what we've decided to do is start from scratch and copy over what we think we want, instead of taking Vitess as-is and retrofitting it to work for Postgres.</p>
<p>Cool. So you get to do the thing that every developer always wants to do, and that is a rewrite. You get to start over - but not only are you starting over, you're starting over with a solid foundation from which you can pull, and with a mandate from a successful company. So you can actually go out and build the team, and you can take the time to do this right. Normally rewrites happen while the plane's on the way down and you're building the parachute, but you're not doing that here. You've got the time, the bandwidth, the space. So where do you even begin? Earlier you said you started with the tablet and "select star where ID equals," and it's like: okay, we can do that. So in this situation, walk me through: where are you actually going to start? What component, or what Postgres extension - where are you going to get your hooks in?</p>
<p>The first one is actually to build a VTTablet and a VTGate - a very basic version of the two. And our goal is to deploy it throughout Supabase, which means that once this is ready, this will be the Supabase load balancer, this will be your Supabase front end, unified. And the reason we want to do this is that once you've got your connection, that connection stays, no matter what happens underneath.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="mvp-goals-for-multigres">MVP goals for Multigres<a href="https://multigres.com/blog/database-school#mvp-goals-for-multigres" class="hash-link" aria-label="Direct link to MVP goals for Multigres" title="Direct link to MVP goals for Multigres">​</a></h4>
<p>So your database grows to 500 terabytes, and you don't have to know. That's the good stuff.</p>
<p>Yeah - you never have to know that things are now completely different underneath. You should not know that you went sharded, you should not know that all these transitions took place. It should be completely transparent. For that reason, what we want is for Multigres to manage your life from the beginning. "Build in a weekend, scale to millions" just changes to "build in a weekend, scale to billions."</p>
<p>That's amazing. So you're going to start with the router, the VTGate, which is the one the outside world - the application developers - will connect to. It will hold that connection and serve as kind of an opaque layer that says: don't worry what's going on behind the curtain, we've got it, we're in control, you connect to us and we'll handle the rest. Are you going to call them MG-gates, for Multigres?</p>
<p>I think it's going to remain VTGate. I looked at it; there is just too much VT.</p>
<p>Yeah, I was going to say: that's going to not only be hard technically, but you are going to continue to say VTGate forever, I guarantee it. So you're going to start with the VTGate, which is the router, and the VTTablet, which sits next to - in this case - Postgres. Is it still handling the blocking of bad queries and things like that?</p>
<p>That's actually a good question. We may not port all of those features, initially at least, because we don't see many people using them anymore. I think when we were there, because we were pushing the databases to the limit, we were very sensitive to these things. But once you are scaled out, I don't think it's as big a deal.</p>
<p>That's my guess, anyway.</p>
<p>Okay, cool. So that raises my next question: if you had to put together a punch list of the things you definitely have to have to hit a 0.1 release, or whatever - do you have those in mind? If so, let's hear them. What are the big ones you're trying to hit?</p>
<p>The VTTablet and the proxy layer. Where I want to be is being able to serve sharded queries. Maybe not stored procedures, I don't know - but at least insert, update, delete, and select, those four query types, in a fully sharded Postgres.</p>
<p>Okay. So, that would be what I would call my MVP.</p>
<p>Okay, that's your north star: you want to be able to serve sharded queries. Yes. Transparently. Right - with no knowledge needed from the application developer, which is what makes Vitess work, in my opinion. So what about the next features? Does that include resharding? And, I guess, this is an interesting situation you find yourself in, because you're coming into a mature company that has a lot of features. Are you going to be trying to pull some of those features into Multigres?</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="integration-with-supabase--database-branching">Integration with Supabase &amp; database branching<a href="https://multigres.com/blog/database-school#integration-with-supabase--database-branching" class="hash-link" aria-label="Direct link to Integration with Supabase &amp; database branching" title="Direct link to Integration with Supabase &amp; database branching">​</a></h4>
<p>How does that interplay work within Supabase? Well, the advantage is that Supabase treats Postgres as a black box. And we become that black box.</p>
<p>I'm thinking about something like database branching. Does that not feel like it crosses over between what Supabase is already doing and where Multigres is going to come in or is that something you have thought about? Because I have to imagine there are some places where it's going to be kind of blurry as to who's responsible.</p>
<p>That's a good question. I actually don't know how database branching works in Supabase, but I would presume that, at the end of the day, when you branch a database, you are going to create another one. So I think it will still work fine, by the fact that Multigres is a black box that masquerades as Postgres. You want to create a copy? You create a copy, and this will create a copy for you. You should not care that it is a sharded Postgres under the covers. That's the way I would approach it. But what you say has a point: some of these abstractions may not be as clean. In which case, I am in Supabase, so I can go talk to that team, even.</p>
<p>Oh, I can't wait to have you back in a year to figure out how all this stuff finally plays out. This is so fascinating to me. So where are you? Where's the project right now?</p>
<p>Your announcement came maybe last week - I feel like it was pretty recent, last week or the week before, something like that - and you are now inside of Supabase. Talk to us about the status of both the project and the team.</p>
<p>I spent two days learning the Mac, because I was previously a Linux user. I think I last used a Mac at least over 10 years ago.</p>
<p>I'm just laughing because you're one of the foremost computer science brains in the world, and you join this company and you're like: I don't know what this thing is.</p>
<p>Okay, so you've got a little bit of a learning curve. You're on Mac. All right, that's a good start.</p>
<p>Meanwhile, you know, I can still type. So I'm actually making a feature list. I'm making a project plan.</p>
<p>I'm also thinking about some core changes - things the old Vitess has that the new one need not have, and some cool extensions that we didn't have the luxury to do in the old Vitess that we can do now, because we have a clean slate. So I'm brainstorming with myself right now, because there's nobody else. And that's pretty exciting.</p>
<p>Yeah, so you're getting to dream a little bit. That sounds fun. So, in your wildest dreams - let's not talk about reality - where do you see Multigres going? Either from an adoption and impact standpoint, or from a feature standpoint, where you're like: you know what, I don't even want to say this out loud, but wouldn't it be crazy if? As you're in this dreaming phase, planning and trying to map out what is going to be a huge contribution to the database ecosystem, what are the things where you go: oh, that would be incredible? Do you have any?</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="sugus-dream-for-multigres">Sugu's dream for Multigres<a href="https://multigres.com/blog/database-school#sugus-dream-for-multigres" class="hash-link" aria-label="Direct link to Sugu's dream for Multigres" title="Direct link to Sugu's dream for Multigres">​</a></h4>
<p>This dream may sound simple, but my dream is someone coming in and being able to say: this is just Postgres, except it can scale massively. I will say that that is a simple dream from a consumer's perspective, but probably a big, hairy, audacious goal from your perspective, right? It has to feel like a huge goal. It's almost unachievable - but can we do it? That's what makes it so fun. They should not feel like it's a different system, because that hurt us so much in Vitess. Other than that - and this is something Vitess is totally capable of - I would like to see databases that run into the petabytes, comfortably.</p>
<p>Okay, those are good goals. I like that. I especially like the focus - and maybe this is your empathy showing - on compatibility. Because one of the frustrating things is connecting to a Vitess cluster or a PlanetScale database and finding out: oh, this doesn't work, it can't do correlated subqueries or whatever weird things we developers like to do. And they've been hammering pretty hard on compatibility, but to hear it as a founding principle for Multigres seems directionally correct to me. For every Sugu who wants to build out a brand new database, there are five million developers that just want to run a database, and they just want it to work. So having compatibility so high up on the list seems like a wise decision from my point of view. But I know it's going to be super hard from your point of view - and that's what makes it a good goal, and what makes it exciting. You're going to have to sit around for three months and think about things again. Simple and hard.</p>
<p>Yes, exactly. And from the perspective of building the team: is there anything you want to say to anyone that might be listening, about the people you're looking for, or how that process is going? Where are you trying to pull these people from? How big of a team are you going to start with? Tell me about your team-building thoughts.</p>
<p>I've always liked smaller teams. I would say something in the range of five to ten people is the sweet spot. I still feel proud that Vitess has never had a team size greater than 10, for example, and we were able to compete with all the other distributed systems with that small team. So I believe in the efficacy of a small team: a small team that works well together can outperform extremely large teams. Vitess has proven that, and that has been the case even in my previous lives, both at YouTube and PayPal.</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="small-teams-hiring-and-open-positions">Small teams, hiring, and open positions<a href="https://multigres.com/blog/database-school#small-teams-hiring-and-open-positions" class="hash-link" aria-label="Direct link to Small teams, hiring, and open positions" title="Direct link to Small teams, hiring, and open positions">​</a></h4>
<p>So 5 to 10 people is the number I have in mind.</p>
<p>And anyone that has worked on Vitess will obviously be welcome. And I assume it's all going to be the same stack - it's all going to be Go, same as Vitess - because you're going to pull over a lot of code from Vitess.</p>
<p>As a matter of fact, 90% of Vitess is actually agnostic of MySQL. Most people don't know this, but that's how we built Vitess: we did not want to get locked into MySQL - which was smart - until we were pressured to. That's also why it was not fully compatible: we did not want to do anything MySQL-specific. And then later we said, okay, we'll start working on that.</p>
<p>So, are you actively hiring right now? Yes, we are actively hiring. There is a link in the blog post at Supabase. So, if you are interested, click on that link, and please apply. Wonderful.</p>
<p>Yeah, we already have a bunch of applicants, and we're interviewing a few. At some point we will actually say we have enough, because I don't want to build a big team - 5 to 10 is not very many.</p>
<p>Well, I'll leave a link down below for anyone that has worked with Vitess, or is a super Go genius, or anything like that.</p>
<p>Apply - anybody with distributed systems or deep database knowledge.</p>
<p>Perfect, yes, I'll leave a link below for all of that. So, speaking of your blog post: you kind of came out of nowhere. You've been out in the wilderness, and then you come back and say: by the way, I'm joining Supabase, and we're doing Multigres. What was the response like? I mean, I saw it, but tell me from your point of view: what was the public's reaction?</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="community-response-to-multigres-announcement">Community response to Multigres announcement<a href="https://multigres.com/blog/database-school#community-response-to-multigres-announcement" class="hash-link" aria-label="Direct link to Community response to Multigres announcement" title="Direct link to Community response to Multigres announcement">​</a></h4>
<p>It was overwhelmingly positive. I did not expect what I saw. Coming from the MySQL world, I thought I might not be accepted by the Postgres community. So I feel very humbled that people are welcoming me like this - I didn't expect to be welcomed this well, and I will keep that in mind and respect it, for sure.</p>
<p>I think that is a testament both to you and your proof of work and to the Postgres community, that they are so excited to have you become like a pretty fundamental part of it doing this massive project.</p>
<p>So I'm glad to hear it. That's what I saw from the outside - everybody was like, aw, this totally rules - and I'm glad that was felt by you as well, because I think it is well deserved. I don't want to take up too much more of your time, but tell me: is there anything else about Multigres, or things you're looking forward to, or things that are exciting for you, that you want to leave the audience with? Any broad thoughts?</p>
<h4 class="anchor anchorWithStickyNavbar_B8JT" id="where-to-find-sugu">Where to find Sugu<a href="https://multigres.com/blog/database-school#where-to-find-sugu" class="hash-link" aria-label="Direct link to Where to find Sugu" title="Direct link to Where to find Sugu">​</a></h4>
<p>I think you covered everything that I wanted to say. The thing at the top of my mind at this point is that we are definitely looking for talent - so if you are interested, please reach out.</p>
<p>Wonderful. If you're listening and you want to work with a thoughtful genius, I'll leave some links down below to where you can find Sugu. I have to tell you: you should be incredibly proud of all the things you have done. You're extremely thoughtful, but also clearly massively intelligent, and the stuff you have done is very impressive. I hope you feel that. I know you worked on it for 12 years and then had to take a step back, but I hope you're coming back with tons of energy, and I hope you feel like: man, this is going to be great. Because I think it is, and I think you should be super proud of all the stuff you've done.</p>
<p>Oh, thank you very much.</p>
<p>Yeah, and thanks for coming on here. I'll leave links down below for everything we've talked about, so people can find out more. And tell us, just as we go: where can people find you online if they want to connect?</p>
<p>You can find me on LinkedIn as Sugu, and on x.com - and where else do I go? I can't even remember.</p>
<p>Okay. Perfect - so if you want to go see his Reddit comments over there, I'll leave the links to all of that down below. Thank you all for listening.</p>
<p>Yeah, I know. I realize I should have been more active.</p>
<p>I don't know - if you're going to go on sabbatical, logging off seems like the right thing to do. So we'll say that was the right thing to do. But now you're back, so people can go on LinkedIn and Twitter, or Reddit, I suppose, and find you if they want to hang out. So thank you, Sugu, for being here. This has been just a delight. This has been so much fun - I hope you enjoyed it.</p>
<p>It was just as much fun for me too.</p>
<p>Oh good, well, that's very nice. If y'all are listening on audio, there is a video on YouTube; and if you're watching on YouTube, there is an audio-only RSS feed. I'll leave links to both in the show notes or the description. Until next time - we'll see you later.</p>]]></content:encoded>
            <category>postgres</category>
            <category>multigres</category>
            <category>interview</category>
            <category>database</category>
        </item>
    </channel>
</rss>