The previous article Database Replication Explained: Lag, Consistency, and the Bugs You Don’t Expect covered replication lag — the gap between when a write happens and when all replicas see it. But it left a more fundamental question unanswered: which node actually accepts the write in the first place?
This is the replication architecture question, and the answer shapes everything about how a distributed database behaves. There are three fundamentally different answers, each with real tradeoffs and real failure modes.
Single-Leader Replication
The simplest and most common approach. One node is designated the leader — also called the primary or master. All writes go to the leader. The leader processes the write, records it, and replicates it to follower nodes asynchronously or synchronously.
Reads can go to any node — the leader or any follower. Followers serve reads using whatever data they have replicated so far.
Writes
↓
[Leader]
/ \
[Follower 1] [Follower 2]
↑ ↑
Reads Reads
Used by PostgreSQL, MySQL, MongoDB (default), and most traditional databases.
Why It Works Well
Single-leader replication is easy to reason about. There is one authoritative source of truth. The write order is unambiguous — if write A arrives at the leader before write B, every follower sees A before B. Transactions are straightforward. Conflict resolution is not needed because only one node accepts writes.
The Failure Mode: Leader Goes Down
What happens when the leader fails? A follower needs to be promoted to leader — a process called failover. This sounds simple but is genuinely hard in practice.
The new leader needs to have all the writes the old leader had. If the old leader was using asynchronous replication, some writes may not have reached any follower yet. Those writes are lost when the old leader is gone.
There’s also a split-brain risk: if the old leader comes back online while a new leader has already been elected, you now have two nodes both believing they are the leader. Both may accept writes, creating conflicting data with no way to reconcile automatically. Most databases handle this by fencing the old leader — explicitly telling it to stop accepting writes — but the implementation details are subtle and failure-prone.
Failover in production is one of those things that looks clean in documentation and is messy in reality. The single leader is also a write bottleneck — all writes must pass through one node, which limits write throughput.
Multi-Leader Replication
Instead of one leader, multiple nodes can accept writes. Each leader processes writes locally and replicates to the other leaders and their followers asynchronously.
[Leader 1] ←——→ [Leader 2]
↓ ↓
[Follower] [Follower] [Follower]
When Multi-Leader Makes Sense
Multi-datacenter deployments are the primary use case. With single-leader replication across datacenters, every write has to travel to the leader’s datacenter — adding significant latency for users far away. With multi-leader, each datacenter has its own leader. Writes go to the local leader and replicate to other datacenters in the background. Users in each region get low-latency writes.
Offline applications are the second major use case. When your phone, laptop, and tablet all need to accept writes without internet connectivity, each device effectively acts as a leader. Changes sync when connectivity is restored. Apple Calendar, Google Docs, and Notion all work this way — your device writes locally and syncs later.
Collaborative editing — multiple users editing the same document simultaneously — is the third case. Each participant’s edits are accepted locally and propagated to others. This is explored more deeply in a separate article.
The Hard Problem: Conflict Resolution
Multi-leader replication introduces a problem single-leader never has: two leaders can independently accept conflicting writes to the same data.
User updates their email address from two devices simultaneously — one writes “abhijeet@gmail.com”, the other writes “abhijeet@outlook.com”. Both writes succeed locally. Now both leaders need to replicate to each other. Who wins?
There is no correct automatic answer. Different systems make different choices, all with tradeoffs:
Last Write Wins (LWW) uses timestamps to decide — whichever write has the later timestamp survives. Simple, but timestamps on different machines are never perfectly in sync. The “winner” can be arbitrary, and the losing write is silently discarded. Data loss with no indication anything went wrong.
Merge combines both values somehow — for a set of items, take the union. This is what some shopping cart implementations do. As we’ll cover in a later article, this can produce surprising results: items removed from a cart sometimes reappear because the merge kept both the “add” and the “remove” operations.
Explicit conflict presentation surfaces the conflict to the user or application to resolve manually. Git does this. Calendar apps sometimes do it by creating duplicate events. Honest but requires the application to handle it.
Multi-leader replication is powerful for specific use cases but operationally complex. Most databases support it only as an add-on feature, not the default. The conflict resolution problem is genuinely hard, and custom conflict resolution code is a reliable source of subtle bugs.
Leaderless Replication
The most radical departure: there is no designated leader at all. Any node can accept any write. Reads are fetched from multiple nodes simultaneously. The client determines the correct value by comparing what different nodes return.
This approach was popularised by Amazon’s Dynamo paper (2007) and is used by Cassandra, DynamoDB, and Riak.
First: What n Actually Means
Before explaining how leaderless replication works, there is a misconception worth correcting upfront. When you see n, w, and r in distributed systems literature, most engineers assume n means the total number of nodes in the cluster. It does not.
n is the replication factor — the number of nodes designated to hold a specific piece of data. A Cassandra cluster might have 100 nodes total, but with n=3, every piece of data lives on exactly 3 of those nodes. Which 3 depends on the data’s key:
Total cluster: 100 nodes
n = 3 (replication factor)
Key "user:abhijeet" → lives on nodes 4, 11, 17
Key "user:gazal" → lives on nodes 2, 8, 15
Key "user:john" → lives on nodes 6, 12, 19
Different data lives on different sets of n nodes. This is how a 100-node cluster distributes load — each node holds a fraction of the data, but each piece of data has n replicas for redundancy.
Quorum Reads and Writes
With n nodes responsible for a piece of data, two more variables control consistency:
- w — how many nodes must confirm a write before it’s considered successful
- r — how many nodes must respond to a read before returning a result
The consistency condition is:
w + r > n
When this holds, at least one node in any read set must have seen the latest write. With n=3, w=2, r=2: write goes to all 3 nodes, waits for 2 confirmations. Read goes to all 3 nodes, waits for 2 responses. The 2 write nodes and 2 read nodes must overlap by at least 1 — guaranteed to include the latest write.
An important misconception to correct: quorum does not mean “send the write to w nodes only.” Writes and reads are sent to all n nodes simultaneously. w and r determine how many confirmations you wait for before returning. Like sending a message to all 9 people in a group chat and proceeding once 5 have replied — you still sent it to all 9. This is why read repair works naturally: you already have responses from all n nodes and can identify which are stale without an extra network round trip.
Tuning the Consistency Dial
The quorum condition w + r > n is a recommendation, not a rule. You can deliberately break it:
| Setting | Use case | Tradeoff |
|---|---|---|
| w=1, r=1 | Twitter timelines | Maximum availability, weak consistency |
| w=2, r=2 | Most applications | Balanced |
| w=n, r=1 | Critical writes | Maximum durability, slow writes |
| w=1, r=n | Audit logs | Maximum read consistency, slow reads |
Breaking the quorum condition — w + r ≤ n — means accepting stale reads in exchange for higher availability. A Twitter timeline that shows a tweet a few seconds late is acceptable. A bank balance that shows the wrong amount is not. The dial is explicit and configurable.
What Happens When Nodes Fail
With n=3, w=2 and 1 node down, you can still meet quorum — 2 of the remaining nodes confirm the write. The system stays available with degraded redundancy.
With 2 nodes down, you cannot meet w=2. Two approaches:
Strict quorum — fail the operation. Return an error. Consistent but unavailable during the outage.
Sloppy quorum — accept the write on a substitute node that is not one of the normal n replicas. The substitute holds the write with a hint: “this belongs to node 11, deliver it when node 11 comes back.” When the failed node recovers, the substitute delivers the queued writes via hinted handoff and deletes its local copy.
Sloppy quorum keeps the system available during outages but temporarily weakens consistency — reads during the outage may still return stale data because the substitute’s copy doesn’t count toward normal quorum guarantees.
Conflict Resolution in Leaderless Systems
When multiple nodes accept writes to the same key simultaneously, conflicts arise — just as in multi-leader replication. Leaderless systems typically resolve these using version numbers: each write increments a version counter, and the reader takes the highest version.
For truly concurrent writes — where two writes happen with no causal relationship — most leaderless databases fall back to Last Write Wins. The write with the higher timestamp survives, the other is silently discarded.
This is where leaderless replication shows its sharpest edge. LWW has a clock skew problem: timestamps on different machines drift. Two clients writing simultaneously might have clocks that differ by milliseconds. The “winner” is determined by which client’s clock happened to be slightly ahead — effectively arbitrary. Cassandra uses LWW as its only conflict resolution method, and engineers often don’t realise data is being quietly lost until they investigate a missing update.
The only safe use of LWW is to ensure each key is written exactly once and never updated — using a UUID as the key makes every write unique and eliminates concurrent writes to the same key entirely.
The Three Approaches Side by Side
| Single-Leader | Multi-Leader | Leaderless | |
|---|---|---|---|
| Write authority | One designated node | Multiple designated nodes | Any node |
| Consistency | Strong | Eventual (conflicts possible) | Tunable |
| Conflict resolution | Not needed | Required, complex | LWW or version numbers |
| Transactions | Full ACID | Limited | Very limited |
| Write throughput | Limited by leader | Higher | Highest |
| Operational complexity | Low | High | Medium |
| Used by | PostgreSQL, MySQL, MongoDB | CouchDB, multi-DC setups | Cassandra, DynamoDB, Riak |
The Question Behind the Question
The choice of replication architecture is really one question asked three ways:
Where does write authority live? At one node (single-leader), at several designated nodes (multi-leader), or anywhere (leaderless).
What happens when that authority is unavailable? Failover to a new leader (single), continue with remaining leaders (multi), or fall back to sloppy quorum (leaderless).
What do you do when two writes conflict? Can’t happen (single), resolve it somehow (multi), LWW or version numbers (leaderless).
Single-leader gives you the simplest answers to all three questions. Leaderless gives you the most flexible answers. Multi-leader is the right answer for specific geographic or offline use cases, with complexity proportional to the benefit.
The next article takes multi-leader replication into familiar territory — your calendar app, Google Docs, Git — and shows how the conflict resolution problem plays out in systems you use every day.
