OLTP vs OLAP: Why Your Analytics Queries Are Killing Production

abhijeetpnwr

3 weeks ago

Every backend engineer has been there. Someone from the data team sends a Slack message: “Can you run this quick query on the production database? It’ll take two minutes.”

You run it. The database slows to a crawl. API response times spike. Alerts fire. Two minutes later the query finishes and everything recovers — but not before half the engineering team has opened their laptops wondering what just happened.

This isn’t bad luck. It’s a fundamental mismatch between two completely different types of database workloads — and understanding the difference is one of the most practically useful things you can know as a backend engineer.

Two completely different jobs

Every database operation falls into one of two categories, even if you’ve never thought about it this way before.

OLTP — Online Transaction Processing is the workload your application generates in real time. A user logs in. An order is placed. A message is sent. Each operation touches a small number of rows, happens frequently, and needs to complete in milliseconds. This is what your PostgreSQL or MySQL database was built for.

OLAP — Online Analytical Processing is the workload your data team generates when answering business questions. “What was our revenue by region last quarter?” “Which users churned in the last 30 days?” “What’s the average session length across all users this month?” Each operation touches millions or billions of rows, happens less frequently, and can take seconds or minutes to complete. This is what your data warehouse was built for.

The same database trying to do both simultaneously is like asking a surgeon to also answer the hospital’s phone calls. Both are legitimate jobs. They require completely different tools, different skills, and most importantly — they should never happen at the same time on the same system.

Why the workloads are fundamentally incompatible

The difference isn’t just about query speed. It’s about what the storage engine is optimised for at the physical level.

OLTP access patterns:

Read or write a small number of rows — often just one
Access rows by primary key or index — a precise lookup
Write frequently — hundreds or thousands of transactions per second
Example: SELECT * FROM orders WHERE id = 7291

OLAP access patterns:

Read millions or billions of rows
Touch only a few columns — but across the entire table
Write infrequently — bulk loads, not real-time transactions
Example: SELECT region, SUM(amount) FROM orders GROUP BY region

That last example is the key one. To calculate total revenue by region, you need the region column and the amount column from every single order. You don’t need user_id, status, created_at, shipping_address, or any of the other columns in the table.

But in a row-oriented database — which is what PostgreSQL, MySQL, and every OLTP database uses — those columns are all stored together. To get region and amount from a row, the database reads the entire row off disk. For 50 million orders, that’s 50 million full row reads, most of which return columns you immediately throw away.

This is exactly why that “quick query” from the data team killed your production database. It wasn’t a bad query. It was the right query running on the wrong type of system.

The data warehouse — a separate system for a separate workload

The solution is architectural: run OLAP workloads on a completely separate system, purpose-built for analytical queries. This is called a data warehouse.

Data is extracted from your production OLTP database, transformed into an analytics-friendly format, and loaded into the warehouse — a process called ETL (Extract, Transform, Load). In practice this means a pipeline that runs on a schedule — say, every hour — pulling new rows from production, cleaning and reshaping them, and writing them into the warehouse in the columnar format that makes analytics fast. The production database never sees the analytical queries. The data warehouse never sees the real-time writes.

Production database (OLTP)          Data warehouse (OLAP)
─────────────────────────           ──────────────────────
PostgreSQL / MySQL                  BigQuery / Redshift / Snowflake
Real-time writes                    Batch loaded (hourly / daily)
Row-oriented storage                Columnar storage
Optimised for point lookups         Optimised for aggregate scans
Your application queries here       Your data team queries here

The data in the warehouse is typically a few hours or a day behind the production database — but that’s fine for analytical workloads. Nobody needs yesterday’s revenue figures to be real-time accurate to the millisecond.

How a data warehouse is structured — the star schema

Data warehouses don’t just store data differently — they model it differently too. The standard approach is called a star schema, and it’s built around two types of tables.

Fact tables

A fact table is the centre of the star. It stores the events that actually happened — every order, every page view, every transaction, every click. Each row represents one event.

Fact tables are typically enormous — billions of rows — because they store every event that has ever occurred. They contain measurable numeric values (the “facts”) and foreign keys pointing to dimension tables.

FACT TABLE: orders
──────────────────────────────────────────────────────────────────
order_id  │ date_key │ customer_key │ product_key │ amount │ quantity
──────────────────────────────────────────────────────────────────
10001     │ 20240115 │ 7291         │ 442         │ 250.00 │ 2
10002     │ 20240115 │ 8834         │ 117         │ 80.00  │ 1
10003     │ 20240116 │ 7291         │ 890         │ 510.00 │ 3

Notice what’s not in this table: the customer’s name, the product’s description, the date’s day of week. Those live in dimension tables. The fact table stays lean — just IDs and numbers — deliberately. A fact table with billions of rows needs to be scanned in its entirety for many analytical queries. Every column you add to it is a column that gets read across billions of rows. Keeping it narrow and numeric makes those scans dramatically faster.

Dimension tables

Dimension tables surround the fact table like points of a star (hence “star schema”). They store the descriptive context for the facts — who, what, when, where.

DIMENSION TABLE: customers          DIMENSION TABLE: products
──────────────────────────────      ─────────────────────────────────
customer_key │ name  │ city         product_key │ name     │ category
──────────────────────────────      ─────────────────────────────────
7291         │ Alice │ London       442         │ Keyboard │ Electronics
8834         │ Bob   │ Paris        117         │ Book     │ Education
                                    890         │ Monitor  │ Electronics

DIMENSION TABLE: dates
────────────────────────────────────────────────
date_key │ date       │ day_of_week │ month │ quarter
────────────────────────────────────────────────
20240115 │ 2024-01-15 │ Monday      │ Jan   │ Q1
20240116 │ 2024-01-16 │ Tuesday     │ Jan   │ Q1

The date dimension table is particularly important. By pre-computing “this date is a Monday in Q1” you can answer questions like “show me sales by quarter” or “compare weekday vs weekend revenue” without any date calculation at query time.

A query that answers “total revenue by product category in Q1” joins across these tables:

SELECT
    p.category,
    SUM(o.amount) AS total_revenue
FROM orders o
JOIN products p ON o.product_key = p.product_key
JOIN dates   d ON o.date_key     = d.date_key
WHERE d.quarter = 'Q1'
GROUP BY p.category;

Clean, readable, and fast — because the data warehouse is built for exactly this shape of query.

Snowflake schema — when dimensions get complex

The star schema works well for most cases. But sometimes dimension tables have their own sub-dimensions. For example, a products table might reference a categories table, which references a departments table.

When dimension tables are normalised this way — broken into multiple related tables — it’s called a snowflake schema, because the diagram starts to look like a snowflake with branching arms rather than a clean star.

Star schema:                         Snowflake schema:
                                
   [dates]                               [dates]
      |                                     |
[customers]─[ORDERS]─[products]         [customers]─[ORDERS]─[products]
                                                                |
                                                            [categories]
                                                                |
                                                            [departments]

Star schemas are simpler and faster to query — fewer joins. Snowflake schemas are more normalised and use less storage — no repeated category names in every product row. In practice, most data warehouses use a mix, favouring star schema for the most frequently queried tables.

The Kimball principle — how to actually choose

Ralph Kimball essentially invented modern data warehouse design, and he was strongly opinionated about this: favour star schema almost always.

His reasoning is practical rather than theoretical. Query performance and usability matter far more than storage normalisation in an analytical context. Storage is cheap. Analyst time and query latency are not. Every extra join in a snowflake schema is a potential mistake for a non-technical analyst writing SQL, and a slower query for everyone.

Modern cloud warehouses like BigQuery and Snowflake have made this argument even stronger. Columnar compression is so good that the storage savings from normalising dimensions have shrunk significantly. The performance and simplicity argument for star schema has only gotten stronger with time.

The practical rule most experienced data engineers follow:

Denormalise aggressively for your hottest tables. Add snowflake normalisation only when a dimension has a deep, independently-queried hierarchy — or updates so frequently that maintaining denormalised copies becomes genuinely painful.

Three questions that settle it in practice:

Who is querying? Non-technical analysts writing SQL → star schema. Every extra join is friction. Data engineers running automated pipelines → snowflake is manageable.

How often does dimension data change? If your products table has a category_name that changes frequently, a star schema means updating that string in every product row. A snowflake schema means updating it once in a categories table. High change frequency nudges toward snowflake.

How big and repetitive are your dimensions? A customers table with 10 million rows storing a repeated country_name string means 10 million copies of “United Kingdom.” A snowflake schema stores the country name once and stores a tiny integer key in customers instead. Massive dimensions with highly repeated string values can justify snowflake.

In practice: most dimension tables should be flat. The date dimension, the customer dimension, the product dimension — star. Snowflake only when you have a real, concrete reason. Not a theoretical one.

Columnar storage — the secret behind fast analytical queries

This is the most important idea in OLAP — the one that makes everything else possible.

The problem with row-oriented storage for analytics is straightforward: analytical queries touch a few columns across millions of rows, but row-oriented storage forces you to read entire rows to get those columns. You pay for data you don’t need on every single read. The fix is to flip the storage layout entirely.

Row-oriented storage (OLTP): Stores all columns of each row together. Reading one column means reading entire rows.

Row 1: [id:1, name:"Alice", city:"London", amount:250, status:"done"]
Row 2: [id:2, name:"Bob",   city:"Paris",  amount:80,  status:"pending"]
Row 3: [id:3, name:"Carol", city:"London", amount:510, status:"done"]

Columnar storage (OLAP): Stores all values of each column together across all rows. Reading one column reads only that column’s data — nothing else.

id column:      [1, 2, 3]
name column:    ["Alice", "Bob", "Carol"]
city column:    ["London", "Paris", "London"]
amount column:  [250, 80, 510]
status column:  ["done", "pending", "done"]

Now for the query SELECT SUM(amount) FROM orders WHERE city = 'London':

Read the city column → find rows where city is London → rows 1 and 3
Read the amount column → get values for rows 1 and 3 → 250 and 510
Never touch id, name, or status at all

For 50 million rows, the difference between reading 2 columns and reading 8 columns is enormous — especially when each column compresses independently.

Why columnar storage compresses so well

Columns contain repetitive data. The city column might have “London” repeated 10 million times. The status column might have only three possible values — “done”, “pending”, “cancelled” — across a billion rows.

Compression algorithms love this. A column of a billion status values with only three distinct options can be compressed to a tiny fraction of its raw size using dictionary encoding — store a lookup table {0:"done", 1:"pending", 2:"cancelled"} and then just store the number instead of the string. In row-oriented storage, the repeated strings are scattered across rows and don’t compress nearly as well.

The numbers are striking in practice. A table that takes 1TB in a row-oriented database might compress to 100GB or less in a columnar warehouse — sometimes far smaller depending on how repetitive the data is. That’s not just cheaper storage. It’s 10x less data to read off disk for every query, which means 10x faster scans before any other optimisation is applied.

This is why data warehouses can store petabytes of data in a fraction of the space you’d expect — and why reading from them is so fast. Less data on disk means less data to read. Less data to read means faster queries.

The real-world stack

Understanding OLTP vs OLAP explains why your company’s data infrastructure looks the way it does:

Your application
      ↓ writes
Production DB (PostgreSQL / MySQL)     ← OLTP
      ↓ ETL pipeline (hourly / nightly)
Data Warehouse (BigQuery / Redshift / Snowflake)   ← OLAP
      ↓ queries
BI tools (Looker / Tableau / Metabase)
      ↓ dashboards
Business stakeholders

BigQuery — Google’s data warehouse, built on the same columnar query engine Google uses internally to analyse petabytes of search and ads data. Serverless — you pay per query, not per server. Scales automatically.

Amazon Redshift — AWS’s data warehouse. Columnar, cluster-based, the most widely used in enterprise.

Snowflake — Cloud-agnostic data warehouse. Known for separating compute from storage, meaning you scale query capacity independently of data volume.

Apache Parquet — Not a database, but a columnar file format. The standard way to store columnar data on disk or in cloud storage (S3, GCS). Used by Spark, Hive, BigQuery, and almost every modern data tool.

dbt (data build tool) — The tool that transforms raw data in your warehouse into clean, modelled tables ready for analysis. Sits between the ETL load and the BI tool.

The practical takeaway

If you take nothing else from this post, take these three things:

Never run analytical queries on your production OLTP database. If your data team needs to query production data, give them a read replica at minimum — and push toward a proper data warehouse as soon as scale allows.

The star schema is the lingua franca of data warehouses. Fact tables store events. Dimension tables store context. If you ever need to design a data warehouse schema or talk to a data engineer, this vocabulary is essential.

Columnar storage is not magic — it’s physics. Reading less data from disk is faster than reading more data. Storing repeated values together compresses better than storing them scattered. Every modern data warehouse is built on these two simple facts.

The next time someone says “just run it on the production database” — you’ll know exactly why the answer is no, and exactly what to suggest instead.

This is part of the BytesByAbhi series on database internals — building intuition from first principles, one concept at a time. Heavily inspired by Designing Data-Intensive Applications by Martin Kleppmann.