Blog

From SQL to Graph to LSM-Trees: A Beginner’s Guide to How Databases Actually Work


I’ve been reading Designing Data-Intensive Applications by Martin Kleppmann. It’s a brilliant book — and also one of the densest, most diagram-heavy technical books I’ve picked up. There are pages where I’d stare at a diagram for five minutes thinking “what does this even mean?”

So I decided to write the guide I wish existed when I started: one that builds intuition first and gives you definitions second. One where every new concept is introduced because the previous one broke.

This is not a reference article. It’s a story with a cause-and-effect flow:

  • We start with SQL — which is genuinely great
  • We find where it breaks
  • We introduce the tool invented to fix that break
  • That tool introduces a new problem
  • We introduce the next tool to fix that
  • And finally — we put two storage engines head to head and ask: B-Tree or LSM-Tree, and when?

By the end, you’ll understand not just what these database types are, but why each one exists — and how to actually choose between them. Let’s go.


Part 1: Relational Databases Are Genuinely Great

Before we criticise anything, let’s be fair. Relational databases have powered most of the world’s software for 50 years — and for good reason.

Think of a relational database as a collection of spreadsheets (tables) where every row follows a strict structure, and tables can reference each other. You’ve almost certainly used PostgreSQL, MySQL, or SQLite. This is the default choice for most applications, and it should be.

They’re excellent at:

  • Structured, consistent data. Every row follows a schema. No surprises.
  • Aggregation. SUMCOUNTGROUP BY — SQL is purpose-built for crunching numbers across millions of rows.
  • ACID transactions. Atomicity, Consistency, Isolation, Durability — the guarantee that operations either fully complete or fully roll back. Your bank account doesn’t lose money mid-transfer because of this.
  • Simple relationships. Joining two or three tables together is fast, readable, and battle-tested.

The rule of thumb: use a relational database for most things. It is the right default.

But there are specific, well-defined scenarios where relational databases start to fight against you. Understanding exactly where and exactly why is the key to understanding everything else in this article.


Part 2: The First Crack — Many-to-Many Relationships

The Problem

Let’s say you’re building a university system. Students enrol in courses. One student can take many courses. One course has many students. This is called a many-to-many relationship.

In a relational database, you can’t directly link one row in the students table to multiple rows in the courses table. You need a third table — a junction table — to hold those links:

students    (id, name, age)
courses     (id, title, credits)
enrollments (student_id, course_id)   ← the junction table

To answer “which courses is Alice taking?”, you write:

SELECT courses.title
FROM students
JOIN enrollments ON students.id = enrollments.student_id
JOIN courses     ON courses.id  = enrollments.course_id
WHERE students.name = 'Alice';

One extra table, one JOIN. That’s manageable.

Where It Starts to Break

Now your university adds course prerequisites. To graduate, you must complete Course A before Course B. And Course A itself requires Course C. In SQL, this is another junction table:

course_prerequisites (course_id, required_course_id)

And now your query needs another JOIN. And if prerequisites have their own prerequisites? Another JOIN.

This is called the JOIN explosion — and it gets ugly fast.

Here’s the SQL to answer a seemingly simple question: “Who are the friends-of-friends of Alice?” — just two degrees of separation:

SELECT u3.name
FROM   users u1
JOIN   friendships f1 ON u1.id = f1.user_id
JOIN   users       u2 ON u2.id = f1.friend_id
JOIN   friendships f2 ON u2.id = f2.user_id
JOIN   users       u3 ON u3.id = f2.friend_id
WHERE  u1.name = 'Alice';

Four JOINs for two hops. A hop is simply one step along a relationship — getting from Alice to her friend is one hop, getting from Alice to her friend’s friend is two hops. At six hops, this becomes seven JOINs — unreadable, unmaintainable, and slow.

The Root Cause

The reason this hurts so much is a fundamental characteristic of relational databases: they compute relationships at query time.

SQL doesn’t store connections between rows. It re-discovers them every single query by scanning tables and matching IDs. Every JOIN is the database saying: “let me scan this entire table to find all rows where this ID matches.”

For a table with a million rows, that’s a million comparisons — per JOIN. Chain six JOINs together and the cost compounds dramatically.

This isn’t a bug in SQL. It’s a deliberate design that works brilliantly for tabular, row-oriented data — inventory, transactions, user accounts. But when your data is fundamentally a network of interconnected entities, you’re using the wrong tool.

The question this raises: is there a database that stores relationships directly, so it doesn’t have to recompute them on every query?

Yes. That’s what graph databases do.


Part 3: Graph Databases — Relationships as First-Class Citizens

The Big Idea

A graph database throws away the concept of tables entirely. Instead, everything is either a node (an entity) or an edge (a named, directed connection between two entities). Both nodes and edges can carry properties.

[Alice] --KNOWS--> [Bob] --WORKS_AT--> [Acme Corp]
  node     edge     node     edge          node

Alice is a node. Bob is a node. “KNOWS” is an edge between them. “WORKS_AT” is an edge between Bob and Acme Corp. Edges can carry data too — like KNOWS since: 2019 or WORKS_AT role: "Engineer".

Why Traversal Is Fast

Here’s the key difference. In a relational database, the connection between two rows is just a matching ID — the database has to find the connection by scanning tables. In a graph database, each node physically stores a direct pointer to its neighbouring nodes on disk.

Think of it like the difference between a phone book and a speed dial:

  • SQL: to call Bob, open the phone book and scan every entry until you find “Bob” — every time
  • Graph DB: Bob’s number is already saved and directly accessible — one press

Traversing a relationship in a graph DB is just following a pointer — O(1) per hop. No scanning. No matching. No growing cost.

And crucially: the cost scales with how many nodes you touch, not the size of the entire database. You could have a billion users — if your query only traverses 50 nodes, you only pay for 50 hops.

The same friends-of-friends query in Cypher (Neo4j’s query language):

MATCH (alice {name: 'Alice'})-[:FRIEND*1..2]->(person)
RETURN person.name

One line. Change *2 to *6 for six degrees of separation — still one line, still fast.

When to Use a Graph Database

Graph databases win when your data is fundamentally a network and your questions are about paths through that network:

  • Social networks — friends-of-friends, influence chains, mutual connections
  • Fraud detection — spotting rings of suspicious accounts connected through shared transactions
  • Recommendation engines — “people who bought what you bought also connected with…”
  • Access control — does this user, through any chain of role inheritance, have this permission?
  • Network topology — which services depend on which, and what’s the blast radius if this one goes down?

When NOT to Use a Graph Database

If you just need students and courses with no deeper traversal — a relational DB is perfectly fine. One junction table, one JOIN, done. Graph databases are not universally better. They’re optimised for a specific shape of question: “how are these things connected, and through what path?”

If your questions are shaped like “give me all orders from last month grouped by region” — that’s SQL’s territory. Stay there.

Mental model: SQL = set operations across rows. Graph DB = relationship traversal across connected entities. Different shapes of questions, different tools.


Part 4: Zooming In — How Does Any Database Actually Store Data?

So far we’ve been talking about how different databases model relationships at the query level. Now let’s go one level deeper: how does a database actually store and retrieve data on disk?

This is where key-value stores and the hash index come in — and understanding this will make everything about SSTables and LSM-Trees click.

The Simplest Possible Database

Imagine the simplest database you could build. It’s just a text file that you keep appending to. Every time you store a city, you write: key,{value} and move on.

123456,{"name":"London","attractions":["Big Ben","London Eye"]}
42,{"name":"San Francisco","attractions":["Golden Gate Bridge"]}

Here, 123456 and 42 are not property names — they’re record IDs, like a primary key in SQL. The value is the entire object for that record. Key 42 means “city number 42” — you use it to fetch the whole city object, not a single field.

Writing is trivially fast — just append to the end. But reading? To find key 42, you’d have to scan every single byte from the beginning of the file. On a file with millions of records, that’s unacceptably slow.

The question this raises: how do you find a specific record in a large file without scanning all of it?

The Hash Index Solution

The answer is to keep a hash map in RAM — a lookup table where every key maps to a byte offset: the exact position in the file where that record begins.

Think of a byte offset like a page number in a book. If byte offset is 64, it means: “jump to position 64 in the file and start reading there” — just like “turn to page 64.” You skip everything before it.

key      byte offset
123456   0           ← record starts at the very beginning
42       64          ← record starts at position 64

Writing a new record:

  1. Append the record to the end of the file — note the byte position it landed at
  2. Update the hash map: key 42 → byte 64

Reading a record:

  1. Look up key 42 in the hash map → get byte 64 (instant — it’s in RAM)
  2. Jump directly to byte 64 in the file — one disk seek, no scanning
  3. Read the value starting at that position

This is the strategy used by Bitcask, the storage engine inside Riak. Reads are always one disk seek. Writes are always a fast append. It’s simple and effective.

The Fatal Flaw

But there’s a hard limit: every key must live in the hash map, which lives in RAM.

If you have 100 million unique city IDs, all 100 million entries must fit in memory simultaneously. At some point, you run out of RAM — and there’s no elegant solution, because the hash map must be in memory to work.

Worse: hash maps are unordered. The key 42 has no relationship to key 43 inside a hash map. So if someone asks “give me all cities with IDs between 100 and 200”, you have no choice but to scan every single key and check each one individually. Range queries are impossible to do efficiently.

The question this raises: what if we kept the file sorted by key? What would that unlock?


Part 5: SSTables — Sort Everything, Fix Everything

The Insight

The entire idea behind an SSTable (Sorted String Table) is one simple change: instead of writing records in the order they arrive, always keep the file sorted by key.

apple,  {"color":"red"}
cherry, {"color":"red"}
fig,    {"taste":"sweet"}
mango,  {"color":"yellow"}
peach,  {"season":"summer"}
zebra,  {"type":"animal"}

This one change — sorting — fixes both problems from the hash index.

Fix 1: You No Longer Need Every Key in RAM

Because the file is sorted, you can build a sparse index — an index that only stores some keys, not all of them.

apple  → byte 0
mango  → byte 144

Now if someone asks for “fig”, you think: “fig comes after apple and before mango alphabetically, so it must live somewhere between byte 0 and byte 144.” You jump to byte 0 and scan forward — a small, bounded scan, not the whole file.

The sparse index might store one entry for every 4KB of file. For a 4GB file, that’s about a million entries instead of potentially billions. It fits in RAM easily regardless of how many total keys exist. The RAM problem is solved.

Fix 2: Range Queries Are Now Trivial

Because the file is sorted, range queries are natural. “Give me all keys between cherry and peach”:

  1. Find “cherry” in the sparse index → jump to that position
  2. Read forward, collecting every key
  3. The moment you see a key greater than “peach”, stop

You never touch data outside the range. No full scan needed. The range query problem is solved.

Fix 3: Merging Multiple Files Is Elegant

SSTables are immutable — once written, you never modify them. When you need to merge two SSTable files (say, to clean up old data or combine smaller files), you use merge sort:

Walk through both sorted files simultaneously with one pointer per file. At each step, compare the two current keys and write the smaller one to the output. Because both files are sorted, you only need one pass through each — O(n) total.

File 1:  apple, cherry, mango
File 2:  banana, cherry, zebra

Merge →  apple, banana, cherry (newer wins), mango, zebra

When the same key appears in both files, the newer file’s value wins and the old one is discarded. Storage reclaimed.

The New Problem SSTables Introduce

Here’s the catch: you can’t write to disk in sorted order as data arrives, because data arrives in random order. If “zebra” arrives first and “apple” arrives second, you can’t insert “apple” before “zebra” in a file without rewriting the entire file.

So how does an SSTable database accept random writes but keep files sorted?

The question this raises: what if you sort the data in memory first, and only write to disk once it’s sorted?


Part 6: The LSM-Tree — The Complete Picture

The LSM-Tree (Log-Structured Merge Tree) is the answer. It uses SSTables as its on-disk format but adds a clever in-memory layer that absorbs writes, sorts them, and flushes them to disk in batches.

It’s built on four rules. Read them as a cause-and-effect chain.


Rule 1: Writes Go Into the Memtable (Not Disk)

Every incoming write goes into a memtable — an in-memory balanced tree (like a red-black tree or AVL tree). A balanced tree automatically keeps all entries in sorted order as you insert them.

Write "fig"    →  memtable: [fig]
Write "apple"  →  memtable: [apple, fig]       ← auto-sorted
Write "mango"  →  memtable: [apple, fig, mango] ← auto-sorted

Writing to RAM is microsecond-fast. No disk access. No file rewriting. The sort happens automatically as a property of the tree structure.

One safety detail: remember that the memtable lives entirely in RAM — which means a machine crash wipes it before it gets flushed to disk. To handle this, writes also go to a simple append-only WAL (Write-Ahead Log) on disk. Think of the WAL as a rough draft — it’s not sorted, not indexed, not used for reads. Its only job is: if we crash, replay these writes to rebuild the memtable from scratch. Once the memtable is safely flushed to an SSTable, the corresponding WAL entries are discarded.


Rule 2: When the Memtable Fills Up, Flush It to Disk

Once the memtable hits a size threshold — typically a few megabytes — it gets written to disk as a new SSTable file. Because the tree was already sorted in memory, writing it out is one fast sequential disk write.

memtable: [apple, fig, mango]
    ↓ flush
segment-001.sst: apple | fig | mango    ← sorted ✓

A fresh, empty memtable opens immediately. Writes continue without pausing — flushing and accepting new writes happen in parallel.


Rule 3: Reads Check Newest to Oldest

To find a key, the engine checks in order:

  1. Memtable (newest — if the key was written recently, it’s here)
  2. Most recent SSTable on disk
  3. Next older SSTable
  4. …and so on until found

Stop the moment you find the key. Because you always check newest first, if a key has been updated, you automatically get the latest value — older versions are simply never reached.

A concrete example: you write key 42 with value "London". Later you update it to "San Francisco". Now 42 → "San Francisco" is in the memtable, and 42 → "London" might be in segment-001 on disk. When you read key 42, the memtable is checked first → you get "San Francisco". The old value in segment-001 is invisible until compaction removes it.


Rule 4: Background Compaction Merges Old Segments

Over time, multiple SSTable files accumulate on disk. Some keys appear in multiple files — the original value and one or more updates. Reads get slower as they have to check more files.

Compaction runs quietly in the background while normal reads and writes continue. It merge-sorts multiple SSTable files into one clean, compact file, keeping only the newest value for each key.

segment-001:  apple, cherry(old), mango
segment-002:  banana, cherry(new), zebra

    ↓ compaction

segment-003:  apple, banana, cherry(new), mango, zebra

The old segments are deleted. The number of files stays manageable. Read performance recovers.


The Question That Makes It All Click

Here’s the question that trips most people up: What happens when A, C, F are flushed to disk as Segment 1 — and then B and D arrive?

B and D go into a brand new memtable. The existing segment on disk is never touched. It is immutable — forever.

When B and D’s memtable fills and flushes, you now have two segments on disk:

segment-001:  A, C, F    ← sorted within itself ✓
segment-002:  B, D       ← sorted within itself ✓

Each segment is sorted within itself. But globally — across both segments — the order is broken. A reader looking for B has to check segment-002; a reader looking for C has to check segment-001.

This is intentional. The “messiness” of multiple out-of-order segments is the price you pay for having blindingly fast writes. You never go back and squeeze B between A and C on disk — that would mean rewriting the whole file. B lives in segment-002 temporarily, and compaction eventually produces:

segment-003:  A, B, C, D, F    ← globally sorted ✓

The LSM-Tree’s core tradeoff: dramatically faster writes in exchange for slightly slower reads. Writes always hit the memtable — they never touch existing disk files. This is why Cassandra and RocksDB can sustain extraordinary write throughput even on modest hardware.


When Does Compaction Run?

Compaction is always a background process — it triggers when the number of segments crosses a threshold, when total disk usage grows too large, or on a regular schedule. Different databases implement this differently. RocksDB uses “levelled compaction” — segments are organised into levels, each level roughly 10x larger than the previous one, and compaction moves data downward through levels as it grows.

Crucially: if the machine crashes before compaction runs, no data is lost. All segments are safely on disk. Compaction is a performance optimisation, not a correctness requirement.

One more trick: databases layer Bloom filters on top of each SSTable. A Bloom filter is a small probabilistic structure that can tell you, in microseconds, whether a key definitely does not exist in a segment — letting the engine skip reading that segment entirely. For reads of keys that don’t exist, this avoids expensive disk seeks across every segment.


Part 7: B-Trees vs LSM-Trees — The Real Comparison

We’ve now seen both storage engines properly. Let’s put them head to head — not to declare a winner, but to understand exactly when each one is the right choice.

This is the question DDIA builds to over an entire chapter. Here’s the short version.


Write Amplification — the hidden cost of every write

Before comparing them, you need to understand one concept the book introduces here: write amplification — the fact that one logical write to your database results in multiple physical writes to disk over the database’s lifetime.

Both B-Trees and LSM-Trees suffer from it. But they suffer differently.

B-Tree write amplification:

1. Write the change to the WAL (crash safety)
2. Write to the actual tree page
3. If the page splits → write two new pages + update parent page
4. Some engines write each page twice to avoid partial updates on power loss

LSM-Tree write amplification:

1. Write to the WAL (crash safety)
2. Write to memtable (RAM — fast)
3. Flush memtable to a new SSTable on disk
4. Compaction reads and rewrites that SSTable
5. Compaction at the next level rewrites it again

Both pay the cost — but at different times and in different ways. B-Trees pay it upfront on every write. LSM-Trees spread it over time through compaction.

Why does this matter beyond performance? SSDs wear out. Every SSD block can only be overwritten a finite number of times. High write amplification = faster SSD wear = shorter hardware lifespan. This is a real operational cost in production.


Where LSM-Trees win

1. Higher write throughput

LSM-Trees always write sequentially — appending to the end of SSTables. B-Trees write randomly — jumping to whatever page needs updating, wherever it sits on disk.

On magnetic hard drives, sequential writes are dramatically faster than random writes:

Magnetic HDD sequential write:  ~200 MB/s
Magnetic HDD random write:        ~2 MB/s   ← 100x slower

This is why Cassandra is the go-to database for write-heavy workloads — activity feeds, IoT sensor data, event logging, time series. The workload plays directly to LSM-Trees’ strength.

2. Better compression, less wasted space

B-Trees leave empty gaps inside pages. When a page splits, both halves are only half full. When a row doesn’t quite fit a page, the leftover space is wasted. This is called fragmentation.

LSM-Trees don’t have this problem. SSTables are written densely. Compaction periodically rewrites them, eliminating any gaps. The result: smaller files, better compression ratios, less wasted disk — often significantly so in high-volume systems.

3. Range scans are naturally sequential

Because compaction rewrites large sorted segments in one operation, related keys naturally end up physically adjacent on disk. Scanning a key range = one sequential read. Fast.

B-Trees have to work hard to achieve this — pages can end up scattered across disk as the tree grows, requiring a disk seek per page for range scans.


Where B-Trees win

1. More predictable read latency

Every B-Tree read is the same: walk from root to leaf, O(log n) page reads, done. No surprises.

LSM-Tree reads are less predictable — in the worst case, you check the memtable, then every SSTable segment newest to oldest. Compaction normally keeps the number of segments small, but when compaction is running hard, it competes for disk bandwidth with your reads and writes. At high percentiles — p99, p999 response times — LSM-Trees can produce occasional spikes that B-Trees simply don’t.

If you need consistently fast responses — not just fast on average — B-Trees are the safer choice.

2. Compaction can fall behind — and nobody tells you

If writes arrive faster than compaction can process them, unmerged segments pile up on disk. More segments means reads must check more files, which means reads get slower. Eventually you run out of disk space entirely.

The dangerous part: LSM-Tree engines typically don’t throttle incoming writes even when compaction is falling behind. They just let the backlog grow silently. You need explicit monitoring to catch this before it becomes a crisis.

3. Transactions are simpler

In a B-Tree, every key exists in exactly one place in the index. To lock a key for a transaction, you lock that one location. Clean, simple, reliable.

In an LSM-Tree, the same key can exist in multiple segments simultaneously — the current value in a newer segment, old versions in older segments. Implementing transactional locking on top of this is significantly more complex.

This is why PostgreSQL, MySQL, and virtually every traditional relational database uses B-Tree indexes. ACID transactions — the bedrock of financial systems, e-commerce, and most business applications — are much easier to build correctly on top of B-Trees.


The honest summary

“There is no quick and easy rule for determining which type of storage engine is better for your use case, so it is worth testing empirically.” — DDIA

But here’s the practical guide:

If your priority is…Reach for…
Very high write throughputLSM-Tree (Cassandra, RocksDB)
Predictable, consistent read latencyB-Tree (PostgreSQL, MySQL)
Storage efficiency and compressionLSM-Tree
Strong ACID transaction supportB-Tree
Fast range scans on large datasetsLSM-Tree
Simpler operations and monitoringB-Tree
SSD lifespan and lower write amplificationLSM-Tree

The rule of thumb that holds in practice: if writes are your bottleneck, LSM-Tree. If reads and transactions are your bottleneck, B-Tree.


Part 8: A New Problem — Real-Time Event Logging

So far we’ve talked about databases that store entities — cities, users, products. But what if your data isn’t really entities at all? What if it’s a stream of events over time?

Imagine you’re building a dashboard that shows how many requests your API is handling every second. You’re logging:

2024-01-15 10:00:01  service=auth  status=ok    latency=42ms
2024-01-15 10:00:01  service=auth  status=error latency=310ms
2024-01-15 10:00:02  service=api   status=ok    latency=18ms
...

Thousands of rows per second. Every row has a timestamp. Every query is time-bounded: “show me the last 5 minutes” or “what was the average latency between 9am and 10am?”

Why Relational Databases Struggle Here

A relational database can technically store this data. But it wasn’t built for it:

  • High-frequency writes hurt. Each insert updates indexes, checks constraints, writes to the WAL. At thousands of writes per second, this overhead compounds.
  • Time-range queries are expensive. WHERE timestamp BETWEEN X AND Y needs a well-tuned index. Without it, it’s a full table scan. Even with it, it’s not as fast as it could be.
  • Storage bloats. Every row stores all columns even if most values are repetitive. There’s no built-in compression for sequential time-based data.
  • Old data management is manual. Keeping 7 days of per-second data but only hourly rollups for the past year requires custom scripts and jobs.

Time Series Databases — Built for This Exact Shape

time series database (TSDB) treats time as the primary index. Everything about its storage and query engine is optimised around the assumption that:

  • Data always arrives in time order
  • Queries always filter by time range
  • Old data can be automatically rolled up or expired

Internally, TSDBs use columnar storage — instead of storing row-by-row, they store all timestamps together, all values together, all tags together. Sequential values of the same type compress dramatically — often 10-40x compared to row storage. That per-second data that would bloat a SQL database fits in a fraction of the space.

Built-in features that would require custom code in SQL:

  • Automatic downsampling: keep raw data for 7 days, hourly averages for 3 months, daily averages forever — configured with one policy
  • Rolling time windows: moving_average(latency, 5m) is a first-class operation
  • Retention policies: old data expires automatically

Popular choices:

  • InfluxDB — purpose-built TSDB, own query language (Flux), great for IoT and event logging
  • TimescaleDB — PostgreSQL extension, full SQL syntax + time functions. Best if your team knows SQL already
  • Prometheus — pull-based metrics collection, pairs natively with Grafana, the industry standard for infrastructure monitoring
  • QuestDB — SQL-like, extremely fast ingestion, great for high-frequency financial data

Rule of thumb: the moment your primary query is “give me data between time A and time B” and you’re writing thousands of rows per second — reach for a TSDB.


Part 9: Which Database, When?

We’ve covered a lot of ground. Here’s the complete decision guide, as a cause-and-effect reference.


Your data is tabular and your questions are about aggregating or filtering rows? → Relational database (PostgreSQL, MySQL)

Example: e-commerce orders, user accounts, inventory, financial transactions


Your relational DB is write-heavy and you need high throughput over strict consistency? → Consider an LSM-Tree based database (Cassandra, RocksDB, HBase)

Example: activity feeds, event logging, IoT data, analytics pipelines


Your relational DB is read-heavy and transactions / predictable latency matter most? → Stay with a B-Tree based database (PostgreSQL, MySQL, SQLite)

Example: financial systems, e-commerce, anything with strong ACID requirements


Your data is a network and your questions are about paths and connections across multiple hops? → Graph database (Neo4j, Amazon Neptune, ArangoDB)

Example: social networks, fraud detection, recommendation engines, permission inheritance


You need ultra-fast point lookups on a known, bounded set of keys that fit in RAM? → Key-value store with hash index (Redis, Bitcask)

Example: session caching, feature flags, real-time counters


Your data is a stream of timestamped events and queries are always time-bounded? → Time series database (InfluxDB, TimescaleDB, Prometheus)

Example: API metrics, IoT sensor readings, real-time dashboards, infrastructure monitoring


Closing Thought

Every database type in this article was invented because a real engineer hit a real wall with the existing tools. The wall had a specific shape. The new tool was shaped to fit through it.

That’s the entire chain we just walked through:

  • Relational DB works great → but hits a wall with multi-hop relationship traversal → Graph databases were invented
  • Hash indexes work great → but hit a wall with RAM limits and range queries → SSTables were invented
  • SSTables work great → but hit a wall with random writes arriving out of order → LSM-Trees were invented
  • B-Trees win on reads and transactions → but hit a wall on write throughput → LSM-Trees trade predictability for speed
  • Relational DBs work great → but hit a wall with high-frequency timestamped event data → Time series databases were invented

Each tool is the right answer to a specific question. None of them is universally best.

The right question is never “which database is the best?” It’s: what is the shape of my data, and what is the shape of my questions?

Once you think about it that way, the answer usually becomes obvious — and when you see a new type of database you’ve never heard of, you’ll know exactly what question to ask: what wall did this tool get built to break through?


This is part of the BytesByAbhi series on database internals — building intuition from first principles, one concept at a time. Heavily inspired by Designing Data-Intensive Applications by Martin Kleppmann — the most important technical book I’ve read this year.

Leave a Reply

Your email address will not be published. Required fields are marked *