Database Internals

A Deep Dive into How Distributed Data Systems Work

Alex Petrov · 2019

sufficient

reading path: overview → analysis → narration

overview

Overview

Database Internals: A Deep Dive into How Distributed Data Systems Work (2019) by Alex Petrov is the definitive guide to understanding what happens inside databases. Unlike traditional database textbooks that focus on SQL and query design, Petrov goes deep into storage engines, distributed coordination, and the fundamental data structures that power modern databases.

------|-------|-------------| | I: Storage Engines | How data is organized on disk and in memory | B-Trees, LSM-Trees, File Formats, Caching | | II: Distributed Systems | How multiple nodes coordinate | Replication, Partitioning, Consensus, Distributed Transactions |

Key Takeaways

Understanding storage engines is essential for database selection. The difference between a B-Tree database (PostgreSQL, MySQL) and an LSM-Tree database (LevelDB, RocksDB, Cassandra) explains most of their performance characteristics.
B-Trees are the foundation of relational databases. They provide excellent read performance, predictable latency, and support range queries efficiently. Write amplification is their main weakness.
LSM-Trees optimize for writes. Log-Structured Merge Trees batch writes in memory and flush them to disk in sorted runs. Compaction merges runs. Read performance suffers compared to B-Trees, but write throughput is dramatically higher.
ACID transactions require careful concurrency control. Locking, MVCC (Multi-Version Concurrency Control), and optimistic concurrency control each make different trade-offs.
Consistency models span a spectrum. From strict serializability (most consistent, least available) to eventual consistency (least consistent, most available). The CAP theorem bounds the options.
Consensus algorithms enable fault tolerance. Paxos and Raft allow distributed systems to agree on a value even when nodes fail. Raft is more understandable; Paxos is more efficient.
Replication comes in two flavors. Leader-based (single node handles writes) and leaderless (any node accepts writes). Each has trade-offs for consistency, latency, and availability.
Partitioning distributes data across nodes. Hash partitioning is simple but makes range queries hard. Range partitioning enables efficient scans but risks hotspots.

Who Should Read

| Reader Type | Why | |---|---| | Database engineers | The closest thing to a practitioner's guide to storage engines | | Backend engineers building on databases | Understanding internals helps in schema design and query optimization | | Distributed systems engineers | Part II covers replication, consensus, and partitioning | | SREs and infrastructure engineers | Understanding storage internals helps in capacity planning and troubleshooting | | Anyone choosing a database | The framework for evaluating storage engine trade-offs |

Who Should Skip

Application developers who just use ORMs — this is far deeper than most need
Beginners to databases — read a SQL or database design book first
Readers seeking distributed systems theory alone — the distributed chapters are practical, not theoretical

Why This Book Matters

Most engineers treat databases as black boxes. Database Internals opens the box and shows the gears turning. For engineers working with data at scale, understanding these internals is the difference between guessing at performance and predicting it. The book fills a gap between high-level database design books and research papers.

| Book | Author | Connection | |------|--------|------------| | Designing Data-Intensive Applications | Martin Kleppmann | The companion book — covers systems design at a higher level | | Readings in Database Systems | Hellerstein, Stonebraker | The academic paper collection | | The Art of PostgreSQL | Dimitri Fontaine | PostgreSQL-specific internals | | Transaction Processing | Gray & Reuter | The (out of print) classic on transaction systems |

Final Verdict

Database Internals is an excellent engineering reference that covers territory few books address. Its treatment of storage engines is outstanding. The distributed systems section, while valuable, covers more familiar ground. The book's main limitation is that it is a survey — it covers many topics at a depth sufficient for understanding but not for implementing. For implementation, you need the original papers and source code.

Rating: 8/10 — Essential for database engineers and serious backend developers. A readable bridge between academic concepts and practical engineering.

content map

B-Tree vs LSM-Tree

flowchart TB
    subgraph BTree["B-Tree Database"]
        B1["Write to page in-place<br/>(random I/O)"]
        B2["Read page from disk<br/>(~1-2 seeks per read)"]
        B3["WAL for crash recovery"]
    end

    subgraph LSMTree["LSM-Tree Database"]
        L1["Write to MemTable<br/>(in memory)"]
        L2["Flush to SSTable<br/>(sequential write)"]
        L3["Compaction: merge<br/>multiple SSTables"]
        L4["Bloom filters speed<br/>up point reads"]
    end

    subgraph Tradeoffs["Trade-offs"]
        T1["B-Tree: fast reads,<br/>slower writes"]
        T2["LSM-Tree: fast writes,<br/>slower reads (compaction)"]
    end

    BTree --> T1
    LSMTree --> T2

B-Tree

Read-optimized: data is organized in fixed-size pages (usually 4-16 KB), linked in a balanced tree structure
Write amplification: each write may touch multiple pages; WAL adds additional write overhead
Predictable latency: reads typically take 2-4 page fetches
Used by: PostgreSQL, MySQL/InnoDB, SQLite, Oracle, DB2

LSM-Tree

Write-optimized: writes go to an in-memory structure (MemTable) first, then flushed to disk in large sequential batches
Compaction: background process merges sorted runs; this causes write amplification (10-50x) but enables high throughput
Read amplification: may need to check multiple SSTables (mitigated by Bloom filters)
Used by: LevelDB, RocksDB, Cassandra, HBase, BigTable

Transaction Processing and MVCC

flowchart LR
    subgraph MVCC["Multi-Version Concurrency Control"]
        T1["Transaction 1<br/>writes row A"]
        T2["Transaction 2<br/>reads row A"]
        V1["Version 1 of row A"]
        V2["Version 2 of row A"]
    end

    T1 --> V1
    T1 --> V2
    T2 --> V1
    T2 -.->|"Reads consistent<br/>snapshot"| T1

MVCC enables concurrent transactions to see a consistent snapshot of data without locking readers. Each row has multiple versions; each transaction sees the version that was committed before it began. Implemented by PostgreSQL, Oracle, MySQL/InnoDB.

Distributed Consensus: Raft

flowchart LR
    subgraph Raft["Raft Consensus"]
        L["LEADER<br/>(handles all writes)"]
        F1["FOLLOWER<br/>(replicates)"]
        F2["FOLLOWER<br/>(replicates)"]
        C["CLIENT"]
    end

    C -->|"Write request"| L
    L -->|"AppendEntries RPC"| F1
    L -->|"AppendEntries RPC"| F2
    F1 -->|"Ack"| L
    F2 -->|"Ack"| L
    L -->|"Commit after<br/>majority ack"| C

Raft elects a leader through a voting process. The leader handles all writes, replicated via AppendEntries RPC to followers. A write is committed when a majority of nodes acknowledge it. Raft guarantees safety even when up to (N-1)/2 nodes fail.

Key Lessons

There is no perfect storage engine. B-Trees and LSM-Trees make fundamentally different trade-offs. Your workload determines which is better.
Write amplification is the hidden cost. Every write to a database causes 2-50x actual I/O. Understanding this helps explain SSD wear and unexpected performance characteristics.
Consensus algorithms are the foundation of distributed consistency. Paxos and Raft are not optional theoretical curiosities — they are what make systems like etcd, ZooKeeper, and Consul work.
Concurrency control is where theory meets practice. Serializable isolation is ideal but expensive. Most databases default to Read Committed or Snapshot Isolation for a reason.

Practical Applications

For Database Selection

Write-heavy workload: LSM-Tree (Cassandra, RocksDB)
Read-heavy with complex queries: B-Tree (PostgreSQL)
Mixed workload: consider both or a hybrid approach

For Schema Design

B-Tree: create indexes carefully (each index amplifies writes)
LSM-Tree: wide rows reduce compaction overhead

For Operations

Monitor compaction pressure in LSM databases
Tune checkpoint frequency in B-Tree databases
Understand that fsync behavior determines crash safety vs write performance trade-offs

analysis

Strengths

Fills a genuine gap. There are surprisingly few books that cover storage engine internals for practitioners. Database Internals is the best available.
Excellent treatment of B-Trees and LSM-Trees. The comparison is nuanced and practical. Petrov explains not just how they work but why you would choose one over the other.
Practical and implementation-focused. Unlike academic textbooks, this book connects concepts to real databases (RocksDB, LevelDB, WiredTiger, etc.).
Good diagrams and code snippets. The visual explanations of B-Tree page splits and LSM compaction are clear.

Weaknesses

Distributed systems section is less novel. The distributed chapters cover replication, consensus, and partitioning — topics already well-covered by Kleppmann's Designing Data-Intensive Applications.
Lacks hands-on exercises. There are no implementation projects. To really understand storage engines, you need to build one.
Some topics are too brief. File formats, caching, and buffer management each deserve more depth.
Occasionally verbose. Some explanations could be more concise. The book could be 20% shorter.

Criticism

Too broad for its length. Covering both storage engines and distributed systems in 374 pages means neither is as deep as it could be.
Knowledge of C/C++ is assumed. The book references data structures and memory management concepts that require comfort with systems programming.
Already somewhat dated. The landscape moves fast — newer databases (FoundationDB, CockroachDB internals) are not covered.

Comparison

| Book | Author | Focus | |------|--------|-------| | Database Internals | Petrov | Storage engines + distributed systems | | Designing Data-Intensive Applications | Kleppmann | Distributed systems design; lighter on storage | | Readings in Database Systems | Hellerstein | Academic papers (much harder) | | Transaction Processing | Gray & Reuter | Transactions only (out of print) |

Final Assessment

| Dimension | Rating | Notes | |-----------|--------|-------| | Niche Coverage | 9/10 | Best available on storage engines | | Practical Utility | 8/10 | Directly applicable to database selection and tuning | | Depth | 7/10 | Good for understanding; light for implementation | | Clarity | 7/10 | Clear but sometimes verbose | | Completeness | 7/10 | Broad but not deep enough in some areas | | Overall | 7.5/10 | Excellent reference; best paired with Kleppmann |

narration

Introduction

Welcome to BookAtlas. Today: Database Internals by Alex Petrov. Published 2019 by O'Reilly. The book that opens the black box of databases and shows what is really happening on disk.

Most engineers use databases every day without understanding how they work internally. Petrov argues that this ignorance has a real cost: you cannot optimize what you do not understand.

The Two Great Families: B-Trees and LSM-Trees

Engineer: At the highest level, every storage engine falls into one of two families. B-Tree databases — PostgreSQL, MySQL, SQLite — are read-optimized. LSM-Tree databases — Cassandra, RocksDB, BigTable — are write-optimized. Understanding this one distinction explains most of the performance characteristics you encounter.

Skeptic: So why does anyone use LSM-Trees? Read performance is usually what matters.

Engineer: Because writes are what limits scale. In a write-heavy workload — logging, time-series, IoT, analytics ingestion — the B-Tree's in-place writes become the bottleneck. The LSM-Tree turns random writes into sequential writes, which is 10-100x faster on hard drives and significantly better on SSDs too.

The Write Amplification Problem

flowchart LR
    subgraph BTree_Write["B-Tree Write"]
        W1["Write to WAL<br/>(1x)"]
        W2["Write to page<br/>(1-2x)"]
        W3["Write to index<br/>(1x)"]
    end

    subgraph LSM_Write["LSM-Tree Write"]
        LW1["Write to WAL<br/>(1x)"]
        LW2["Write to MemTable<br/>(in memory)"]
        LW3["Flush to SSTable<br/>(1x)"]
        LW4["Compaction<br/>(10-50x)"]
    end

Engineer: Every write to a database causes more I/O than you expect. A single row insert in PostgreSQL might write 2-3 pages. A single row insert in RocksDB might eventually rewrite 10-50 copies of that data through compaction. Understanding write amplification is essential for predicting SSD lifespan and database performance.

Skeptic: 50x write amplification? That sounds terrible.

Engineer: It is the price of high-throughput writes. And it is manageable because the writes are sequential — which is efficient on modern storage. The real cost is that compaction competes with user reads for I/O bandwidth. That is why tuning an LSM database is an art.

Distributed Consensus

Engineer: The second half of the book covers distributed databases. The key concept is consensus — multiple nodes agreeing on a value even when some nodes fail. Raft and Paxos are the two main algorithms. Raft is designed for understandability: leader election, log replication, safety guarantees.

Skeptic: I have read about Raft. It is for coordination, not for general data storage. etcd and ZooKeeper use it. But large-scale databases do not use consensus for every write — it is too slow.

Engineer: Correct. Most databases use weaker consistency models for throughput and reserve consensus for critical metadata (leader elections, schema changes). The trade-off between consistency and performance is the central tension in distributed database design.

The Verdict

Engineer: Database Internals is not a book for everyone. But if you are a database engineer, a backend engineer working at scale, or simply someone who wants to understand how databases actually work, it is the best resource available.

Skeptic: It is also not deep enough to be a reference. When I need to know exactly how PostgreSQL's MVCC works, I go to the PostgreSQL documentation, not this book.

Engineer: That is true. But Database Internals gives you the map before you visit the city. You know what questions to ask, what trade-offs to look for, and how the pieces fit together. That contextual understanding is what makes it valuable.

Final Thoughts

Database Internals fills an important niche: storage engine internals for practitioners. It is not the deepest book on any single topic, but it is the best survey of the area. Read it with Kleppmann's Data-Intensive Applications for the complete picture.

This has been a BookAtlas narration of Database Internals by Alex Petrov. Thanks for listening.

section status
section	state
overview	written
content map	written
analysis	written
narration	written