SQL Antipatterns
Avoiding the Pitfalls of Database Programming
sufficient
reading path: overview → analysis → narration
overview
Overview
SQL Antipatterns: Avoiding the Pitfalls of Database Programming by Bill Karwin, published by Pragmatic Bookshelf (first edition 2010, revised edition 2022), ISBN 9781934356555, ~320 pages. Karwin — software engineer, database architect, and speaker — draws on decades of production database work to catalog the most common, costly mistakes developers make when writing SQL and designing schemas. A Caltech alum and seasoned conference speaker, Karwin has seen these antipatterns repeated across companies of every size.
The book is organized into four parts — logical database design, physical database design, queries, and application development — each presenting an antipattern, explaining why the seemingly reasonable approach goes wrong, and showing the correct alternative. Unlike most SQL books that teach syntax, this one teaches judgment.
------|-------|-----------------------------| | 1. Logical Database Design | Schema structure mistakes | Naive Trees, ID required, comma-separated values, EAV | | 2. Physical Database Design | Storage and referential integrity | Clone schema, read-only keys, null is not null | | 3. Queries | Query formulation errors | Exact-count, random selection, querying by committee | | 4. Application Development | SQL and app integration | SQL injection, pagination |
The book's recurring message: the database is a shared, long-lived resource. Design with the next developer — and the next decade — in mind. Antipatterns are not just bugs; they are architectural decisions that compound cost over time.
Key Takeaways
-
Naive Trees store hierarchical data using a parent_id foreign key and then repeatedly query with recursive joins or application-level loops. The fix is the nested set model, closure table, or modern recursive CTEs depending on the read/write balance.
-
ID Required forces every table to have a synthetic auto-increment key even when a natural composite key is more meaningful and enforces uniqueness by fiat. Surrogate keys are not always preferable.
-
Exact Count — using
SELECT COUNT(*)for pagination or to check if data exists — forces the database to scan every matching row. UseEXISTSor bounded range counts instead. -
NULL Is Not NULL — treating NULL as a value, comparing it with
=or!=, or using it inNOT INsubqueries produces silent logical errors that are extremely hard to debug.NULLmeans "unknown," not zero or empty. -
Random Selection —
SELECT ... ORDER BY RAND()on large tables requires a full file sort. Use reservoir sampling, precomputed random offsets, or dedicated random-access tricks. -
EAV (Entity-Attribute-Value) replaces a well-designed schema with a three-column "generic" table. It trades type safety, referential integrity, and query performance for flexibility that almost always becomes a liability.
-
Comma-Separated Lists store collections in single string columns using CSV or similar formats. This breaks relational algebra, indexing, and constraint enforcement. Normalize into a child table.
-
Pagination Solutions —
LIMIT 10000, 20on large tables forces the engine to discard 10,000 rows. Use keyset pagination (seek method) with a stable sort column. -
Clone Schema — copying identical schema structures across tables instead of using inheritance, partitioning, or shared lookup tables. Common with multi-tenant or multi-region designs.
-
Read-Only Keys — declaring foreign keys but never enforcing
ON DELETE/UPDATE CASCADE, leaving orphaned rows and silently broken relationships. Foreign keys without referential integrity rules and application enforcement are worse than no foreign keys at all. -
Querying by Committee — letting every stakeholder influence the query structure, producing monstrous SQL that tries to serve every use case. Queries should serve one purpose well.
-
SQL Injection — concatenating unsanitized user input into query strings. Swappable with parameterized queries (prepared statements). This is the antipattern with the most severe security consequences.
Who Should Read
| Reader Type | Why | |---|---| | Backend and application developers | The antipatterns mirror exactly what developers write before they learn better | | Data engineers and DBAs | A systematic framework for diagnosing bad SQL and schema design | | Engineering leads and architects | A shared vocabulary: "that's an EAV antipattern" replaces long debates | | Students learning SQL | Instills good habits from the outset, before bad patterns harden | | Technical reviewers / auditors | A checklist for schema and query reviews |
Who Should Skip
- Those who have already internalized these patterns through years of production work — you know this material
- Beginners looking for a SQL syntax reference — Karwin assumes familiarity with
CREATE,SELECT, andJOIN - Readers wanting deep coverage of a specific RDBMS (Oracle, PostgreSQL, MySQL internals) — this is vendor-neutral
- Anyone looking for ORM-specific guidance — the book speaks plain SQL
Why This Book Matters
Bad schema and query design is expensive in a way that is invisible until it is too late. A single EAV table or missing cascade can silently corrupt data integrity for years. A query using ORDER BY RAND() that takes 30 seconds on a million-row table does not throw an error — it just burns CPU. Karwin's book exists because these problems are not dramatic failures; they are slow drains on productivity, performance, and trust in data.
The revised 2022 edition matters because it adds modern context: recursive CTEs as a cleaner solution to the Naive Trees problem, improved NOT EXISTS patterns, and updated framing around ORM abstractions. Karwin's conversational style and use of realistic examples make the book practical rather than academic.
If you write SQL — even occasionally — this book will change how you think about your next schema and your next query.
Related Books
| Book | Author | Connection | |------|--------|------------| | High Performance MySQL | Baron Schwartz et al. | Query optimization and InnoDB internals that Karwin's antipatterns implicitly warn against | | Designing Data-Intensive Applications | Martin Kleppmann | Broader distributed systems framing for the storage decisions underlying schema design | | Database Internals | Alex Petrov | Deep dive into B-trees, LSM trees, and storage engines behind the schema decisions | | Refactoring Databases | Ambler & Sadalage | Evolution-focused counterpart to Karwin's prevention-focused antipatterns guide | | SQL Performance Explained | Markus Winand | Index-centric companion; Karwin's bias against EAV and comma-separated lists is index-backed | | Seven Databases in Seven Weeks | Redmond & Wilson | Polyglot perspective that illuminates why "just use a relational database" sometimes creates EAV antipatterns |
Final Verdict
SQL Antipatterns is the most accessible and practical book on SQL design mistakes. It is short (~320 pages), opinionated, and rooted in real-world pain rather than academic theory. Karwin writes as a senior engineer who has reviewed enough bad schemas to know exactly where developers go wrong. The revised edition keeps it current.
Every developer who writes SQL should read this — once. Then keep it within reach.
Rating: 9/10 — The best single book for developing SQL judgment. Essential for developers, valuable for DBAs, and should be required reading before anyone is allowed to design a production schema.
content map
Antipatterns by Layer
graph TD
A["SQL Antipatterns"] --> B["Part 1: Logical DB Design"]
A --> C["Part 2: Physical DB Design"]
A --> D["Part 3: Query Formulation"]
A --> E["Part 4: App Development"]
B --> B1["Naive Trees<br/>parent_id recursion"]
B --> B2["ID Required<br/>surrogate key mandate"]
B --> B3["EAV<br/>one-size-fits-all schema"]
B --> B4["Comma-Separated Lists<br/>CSV in a column"]
C --> C1["Clone Schema<br/>duplicate table structures"]
C --> C2["Read-Only Keys<br/>FKs without cascades"]
C --> C3["NULL Is Not NULL<br/>treating null as a value"]
D --> D1["Exact Count<br/>SELECT COUNT(*) for pagination"]
D --> D2["Random Selection<br/>ORDER BY RAND()"]
D --> D3["Querying by Committee<br/>multi-purpose SQL monster"]
E --> E1["SQL Injection<br/>string concatenation"]
E --> E2["Pagination<br/>LIMIT offset on large tables"]
Each antipattern follows a structure: why the naive choice feels reasonable, how it breaks, and the pragmatic alternative.
Naive Trees: Hierarchical Data in a Flat Table
The most common way to represent a tree (categories, org charts, threaded comments) is a self-referential foreign key:
CREATE TABLE comments (
id INT PRIMARY KEY AUTO_INCREMENT,
parent_id INT NULL REFERENCES comments(id),
body TEXT
);
This works until you need to query the entire subtree of a comment. Recursive joins in MySQL before 8.0 required as many application-level round-trips as the tree was deep.
flowchart LR
A["Comment A (root)"] --> B["Comment B"]
A --> C["Comment C"]
B --> D["Comment D"]
B --> E["Comment E"]
C --> F["Comment F"]
style A fill:#e8f4fd
style B fill:#fef9e7
style D fill:#fef9e7
The fix depends on your read/write balance:
- Recursive CTE (MySQL 8+, PostgreSQL, SQL Server): readable, standards-based.
- Nested set model: encodes left/right boundaries; fast subtree reads, moderate writes.
- Closure table: a separate
pathtable; the most flexible, but requires extra maintenance. - Materialized path: stores the full path in each row; simple but fragile.
Karwin's rule: do not choose a tree model without first counting your reads against your writes.
EAV: The Anti-Relational Schema
flowchart TB
subgraph EAV_Wrong["❌ EAV Antipattern"]
direction LR
E1["products<br/>(id, name)"]
E2["attributes<br/>(id, name)"]
E3["values<br/>(product_id, attr_id, val)"]
E1:::antipattern --> E2:::antipattern
E1 --> E3:::antipattern
E2 --> E3
end
subgraph EAV_Right["✓ Relational Alternative"]
direction LR
R1["products<br/>(id, name, price, weight, country)"]
end
EAV_Wrong -.->|"type safety, integrity, performance"| EAV_Right
classDef antipipattern fill:#fce4ec,stroke:#c62828
classDef right fill:#e8f5e9,stroke:#2e7d32
The Entity-Attribute-Value (EAV) antipattern replaces a table with typed columns by three generic tables: entities, attributes, and values. It trades every relational advantage — type checking, unique constraints, foreign keys, index usage — for illusory flexibility.
Karwin argues EAV is almost always wrong:
- Values are stored as strings, requiring implicit casts.
- Foreign keys to the same lookup table become circular or impossible.
- Every query devolves into pivot logic or application-level re-assembly.
- Adding a real column later requires migrating millions of sparse EAV rows.
The real fix: use a wide table with typed columns, a JSON column (with a check constraint) for genuinely optional attributes, or a vertical partitioning strategy.
Referential Integrity: Keys Without Cascades Are Worse Than No Keys
A foreign key without ON DELETE or ON UPDATE action creates a runtime invariant the database cannot enforce:
-- Declares a relationship but enforces nothing
ALTER TABLE order_items
ADD CONSTRAINT fk_order
FOREIGN KEY (order_id) REFERENCES orders(id);
If the order is deleted, the database rejects the cascade — or, if the FK is deferrable / ignored, orphaned rows accumulate silently.
flowchart LR
subgraph Cascade_Right["✓ ON DELETE CASCADE"]
O["orders (id=101)"] -->|"ON DELETE CASCADE"| OI["order_items"]
O -.->|"DELETE 101"| DEL["Orphan-free delete"]
DEL --> OI
end
subgraph FK_But_No_Cascade["❌ FK + No Action = Orphans"]
O2["orders (id=102)"] -->|"NO ACTION"| OI2["order_items"]
O2 -.->|"DELETE 102"| FAIL["Integrity Violation<br/>or orphaned rows"]
FAIL --> OI2
end
style Cascade_Right fill:#e8f5e9
style FK_But_No_Cascade fill:#fce4ec
The fixes:
ON DELETE CASCADEfor dependent rows (order_items → orders).ON DELETE SET NULLfor optional relationships.ON DELETE RESTRICT(orNO ACTION) where orphan prevention is essential.
If DDL-level cascades are not an option, Karwin insists the application must enforce the invariant — but this is fragile and almost always worthwhile to move into the database.
Comma-Separated Lists and the Set-Valued Column
Storing a list in a string column:
-- ❌ Antipattern
CREATE TABLE posts (
id INT PRIMARY KEY,
tags VARCHAR(255) -- e.g. 'sql,antipatterns,database'
);
This breaks every relational operation: finding all posts with tag sql requires LIKE '%sql%' (false positives), enforcing tag spelling is manual, and adding composite queries with multiple tags is combinatorial.
The fix is a bridging table:
CREATE TABLE post_tags (
post_id INT NOT NULL REFERENCES posts(id),
tag_id INT NOT NULL REFERENCES tags(id),
PRIMARY KEY (post_id, tag_id)
);
Karwin notes that modern PostgreSQL and MySQL 8+ support JSON arrays — but even JSON defer the same problem: you cannot enforce foreign-key constraints or efficient joins on unindexed array elements. A normalized table is still correct.
NULL Is Not NULL: The Three-Valued Logic Trap
SQL NULL represents "unknown," not zero, not empty, not "no value." Conflating it causes silent logical errors:
| Query | What It Does | Common Bug |
|-------|-------------|-----------|
| WHERE col = NULL | Always FALSE | Developer means "is null"; should use IS NULL |
| WHERE col != 'x' | Skips rows where col IS NULL | Rows with missing data disappear invisibly |
| WHERE col NOT IN (subquery) | Evaluates to UNKNOWN if subquery contains NULL | Entire predicate returns no rows |
flowchart LR
N["NULL (unknown)"] --> EQ["col = NULL"]
EQ -->|"result"| F["FALSE (always)"]
N --> NEQ["col != 'x'"]
NEQ -->|"result"| U["UNKNOWN — filtered out silently"]
N --> NOTIN["NOT IN (subquery with NULL)"]
NOTIN -->|"result"| EMPTY["Empty result set<br/>silent failure"]
Karwin's guidance: avoid NULL when a sentinel value is clearer; when NULL is semantically legitimate, always use IS NULL / IS NOT NULL / COALESCE(), and never rely on != to find missing data.
Exact Count: SELECT COUNT(*) Is Not Free
-- ❌ Antipattern
SELECT COUNT(*) FROM orders WHERE user_id = 42;
-- OR, for pagination:
SELECT * FROM articles LIMIT 10000, 20;
COUNT(*) over a large matching set forces a full scan. LIMIT 10000, 20 asks the engine to find and discard the first 10,000 rows before returning 20. On a table indexed on created_at, both operations are O(n).
The fixes:
- Does data exist? Use
SELECT 1 FROM ... WHERE ... LIMIT 1orEXISTS(). - How many exactly? Approximate from metadata when precise count is not required.
- Pagination? Use keyset pagination:
WHERE created_at < 'last_seen_value' ORDER BY created_at DESC LIMIT 20.
flowchart LR
subgraph Offset["❌ OFFSET Pagination"]
O1["LIMIT 10000, 20"]
O1 --> Scan["Scan + discard 10,000 rows"]
Scan --> Slow["Slow — O(n)"]
end
subgraph Keyset["✓ Keyset Pagination"]
K1["WHERE created_at < '...'<br/>ORDER BY created_at DESC<br/>LIMIT 20"]
K1 --> Index["Index seek — O(log n)"]
Index --> Fast["Fast — consistent"]
end
Keyset pagination requires a stable, unique sort column. Karwin's recommendation: use it for any feed, list, or search result beyond a few hundred rows.
Random Selection: ORDER BY RAND()
-- ❌ Antipattern on a table with 1M rows
SELECT * FROM articles ORDER BY RAND() LIMIT 10;
The database must assign a random float to every row, sort the entire table, then return 10. Cost: full sort — O(n log n) with allocation per row.
The fixes depend on scale:
| Scale | Approach |
|-------|---------|
| \< 10K rows | ORDER BY RAND() is fast enough in practice |
| 10K–1M | Precompute a random column; index it; select a range |
| > 1M | Reservoir sampling (single pass, O(n) memory bounded) |
| Any | Fetch IDs with a fast random function, join back |
Karwin's principle: if a query takes proportional time to table size, it is a candidate for redesign, not caching.
SQL Injection: Concatenation as Attack Vector
The antipattern with the most severe consequences:
-- ❌ Antipattern (Python / pseudo-code)
query = "SELECT * FROM users WHERE name = '" + user_input + "'"
If user_input is '; DROP TABLE users; --, the query becomes:
SELECT * FROM users WHERE name = ''; DROP TABLE users; --'
The fix is parameterized queries (prepared statements):
# Python / psycopg2 example
cursor.execute(
"SELECT * FROM users WHERE name = %s",
(user_input,)
)
Parameters are sent separately from the query plan. The database treats user input as data, never as SQL syntax. Karwin is unambiguous: there is no circumstance where concatenating user input into a raw query is acceptable.
Querying by Committee: Every Feature in One Query
A query that tries to serve five different use cases:
-- ❌ Antipattern
SELECT u.id, u.name,
COUNT(o.id) AS order_count,
SUM(o.total) AS lifetime_value,
MAX(o.created_at) AS last_order,
p.country,
CASE WHEN p.vip THEN 'Yes' ELSE 'No' END AS is_vip,
... -- 12 more columns
FROM users u
JOIN orders o ON o.user_id = u.id
JOIN profiles p ON p.user_id = u.id
GROUP BY u.id, p.country, p.vip -- mixed group keys
HAVING COUNT(o.id) > 0
OR p.country = 'US'
This violates single-responsibility for queries. The correct approach: one purpose, one query, compose at the application layer if needed.
analysis
Strengths
- The right scope. SQL Antipatterns is short (~320 pages), focused, and vendor-neutral. Karwin does not try to teach SQL syntax or a specific RDBMS; he targets the judgment gaps that persist across every SQL database.
- Structured antipattern format. Each chapter follows a consistent, scannable pattern: the antipattern name, a realistic scenario, the "what seems reasonable" framing, the failure mode, a real-world consequence, and the correction. This makes the book effective as both a teaching tool and a retrospective checklist.
- Conversational, developer-friendly tone. Karwin writes as a senior engineer onboarding a new hire. The examples feel like real code reviewed in a pull request, not textbook contrivances.
- Covers both design and query layers. Most SQL books pick one: schema design OR query optimization. Karwin treats the developer who writes
SELECT COUNT(*)and the same developer who designs the schema behind it — this is where the compounding cost lives. - Pragmatic, not purist. The alternatives Karwin presents are practical. RabbitMQ-style recommendations, not "always use a closure table." He weighs the nested set model against recursive CTEs honestly — it is not always the best choice, and he says so.
- Revised edition adds needed context. The 2022 revision updates examples for modern SQL (recursive CTEs,
JSONcolumns with constraints) and adds ORM-specific antipatterns that were not as prominent in the 2010 original. - The NULL chapter is outstanding. Three-valued logic is a concept many developers gloss past. Karwin's treatment —
NOT INwith NULL,!=filtering silently dropping rows,IS NULLvs= NULL— converts a dry corner of the spec into a clear, memorable mental model.
Weaknesses
- Shallow on any single topic. The book is deliberately wide, not deep. A developer working through the EAV antipattern will want more: real migration scripts, partitioning strategies, and case studies from companies that escaped bad EAV decisions. Karwin points you toward Refactoring Databases for that.
- Limited database-specific guidance. Karwin stays neutral across MySQL, PostgreSQL, SQL Server, and Oracle. This is a strength for a first read, but teams on a single database want deeper guidance — e.g. the specifics of how MySQL 8 handles
ON DELETE CASCADEvs. PostgreSQL's deferrable constraints. - No exercises or worked refactorings. The book presents a corrected schema, but there are no "before and after" multi-step transformations or review problems. O'Reilly or Addison-Wesley style would include them; Pragmatic Bookshelf often does not.
- Some corrections rely on features not available in older systems. Recursive CTEs (MySQL 8+, PostgreSQL 8.4+) are Karwin's preferred alternative to naive trees. Teams on MySQL 5.7 or SQL Server 2008 will need different solutions; Karwin acknowledges this but does not always pivot quickly enough.
- ORM framing is lighter than expected. The revised edition nods at ORM-generated SQL and implicit antipatterns, but does not give ORM users (ActiveRecord, Hibernate, SQLAlchemy) enough concrete guidance on mapping bad patterns back to their framework of choice.
- Pagination compression. The keyset pagination treatment is clear but brief. Teams handling infinite scroll at scale need more: cursor semantics, duplicate handling, and the interaction between keyset pagination and full-text search filters.
Criticism
The "EAV Is Never Right" Pushback
Karwin's stance on EAV is absolute. EAV is almost always wrong is a fair statement for business applications with known attributes. But there is a context where EAV is less irrational: systems that genuinely must support arbitrary, user-defined attributes without a schema migration — some CMSes, metadata stores, and scientific data pipelines. Karwin acknowledges this limit but does not explore it deeply. Teams building attribute-lightweight, user-extensible product surfaces (like early Shopify product variants or event-driven systems with open schemas) will feel his dismissal is too categorical.
The Vendor Neutrality Tax
Karwin's neutrality means the book is evergreen — but it also means some advice ages at different speeds across databases. For example, NOT IN with NULL has different execution plans and warning behaviors in MySQL vs. PostgreSQL vs. SQL Server. Teams that want to exploit database-specific features (e.g. PostgreSQL's DISTINCT ON, MySQL's STRAIGHT_JOIN, recursive CTEs with cycle detection) will need to look elsewhere after Karwin.
The "Correction Feels Over-Simple" Impression
Karwin's "proper" schemas sometimes seem to paper over real-world complexity. A well-typed products table replacing EAV is the right answer when the attributes are known and stable — but product teams evolve requirements faster than schema migrations in some organizations. The gap between "is this really an antipattern in our context?" and "let's refactor" is larger in practice than Karwin's examples suggest.
ORM Blur
Karwin correctly notes that ORMs can generate SQL that exhibits exact-count, EAV, and random-selection antipatterns silently — but he stops short of naming specific ORM pitfalls or framework configurations. Developers using Rails migrations or Django models will recognize their own code but may not know which ORM feature to invert.
Target Audience Fit
| Reader | Fit | Reason | |---------|-----|--------| | Beginner SQL developers | ⭐⭐⭐⭐⭐ | Directly addresses the instinct-driven mistakes beginners make before they have internalized relational theory | | Intermediate backend engineers | ⭐⭐⭐⭐⭐ | Catches habits that feel justified but are compounding debt | | Senior DBAs / database architects | ⭐⭐⭐⭐ | A useful mental vocabulary for schema reviews and code reviews | | Data analysts and BI engineers | ⭐⭐⭐⭐ | The query antipatterns (exact count, random, querying by committee) are exactly the mistakes analysts make in ad-hoc SQL | | Engineering managers | ⭐⭐⭐⭐ | Gives a shared language: "this feels like EAV" resolves debates without personality | | ORM-heavy teams | ⭐⭐⭐ | Useful framing but lacks framework-specific guidance; pair with ORM documentation and code review practices | | Database vendors / product teams | ⭐⭐⭐⭐ | The antipattern catalog reads like a feature backlog for constraint enforcement and query insight tools |
Comparative Context: How It Fits in a SQL Reading Path
Getting started
|
[Karwin — SQL Antipatterns] <-- develop judgment (this book)
|
v
[Winand — SQL Performance Explained] <-- indexing and execution in depth
|
v
[Schwartz et al. — High Performance MySQL] <-- MySQL-specific internals and operations
|
v
[Ambler & Sadalage — Refactoring Databases] <-- how to safely migrate out of antipatterns
The book is at its best early. It is the book that prevents bad habits from hardening. Reading it after years of production SQL will uncover decisions you have been paying for — but the most value comes from reading it before the mistakes accumulate.
narration
Introduction
Welcome to BookAtlas. Today: SQL Antipatterns: Avoiding the Pitfalls of Database Programming. Bill Karwin, Pragmatic Bookshelf, revised edition 2022. About 320 pages. This is the book that will change how you look at every schema and every query you write.
Karwin is a software engineer, a database architect, a Caltech alum, and a conference speaker who has seen enough bad SQL to write a field guide instead of a textbook.
Why This Book Exists
Engineer: Most SQL books teach syntax. CREATE TABLE, SELECT, WHERE, JOIN. They are syntax references, not judgment references.
Skeptic: And judgment is the hard part, right?
Engineer: Exactly. You can memorize LEFT OUTER JOIN and still design a schema that silently corrupts data for years. Karwin's insight is that the mistakes are consistent — developers with no shared training keep making the same bad decisions. The book names them, explains why they feel right, and shows the correct alternative.
The EAV Antipattern: Flexibility That Becomes Liability
Skeptic: Let me guess — every developer eventually hits the moment where they say "what if the product has arbitrary attributes?" and reach for EAV?
Engineer: That is exactly the moment Karwin is writing for. Entity-Attribute-Value replaces a table with typed columns by three generic ones: entities, attributes, and values. And it breaks almost everything.
Skeptic: Break it for me.
Engineer: Type safety is gone — everything is a string. Foreign keys become circular. Finding all products with weight > 1kg requires joining to three tables and applying implicit casts. And adding a new real column later means migrating millions of sparse EAV rows that are either NULL or actual values in the same column.
Skeptic: So it is an anti-relational design.
Engineer: Exactly. Karwin's preferred alternatives are a wide table with typed columns, a JSON column with a check constraint for genuinely optional attributes, or vertical partitioning if you have genuinely different attribute sets. The key test he provides: will you ever need to constrain, index, or join against this attribute? If yes, EAV is wrong.
flowchart LR
EAV["EAV<br/>generic key-value rows"] -->|"type safety"| WIDE["Wide table<br/>typed columns"]
EAV -->|"optional attrs"| JSON["JSON column<br/>with CHECK constraint"]
EAV -->|"per-entity opts"| VERT["Vertical partitioning<br/>separate typed tables"]
Skeptic: So when is EAV actually right?
Engineer: Karwin is honest about the boundary: systems that genuinely require user-defined, dynamic attributes without schema migrations — some metadata stores or scientific data pipelines. But for a business application where requirements are known, it is almost always wrong. The flexibility is illusory because the cost of querying cleanly far exceeds the cost of adding a column.
Naive Trees and Hierarchical Data
Skeptic: Hierarchical data is everywhere — categories, org charts, threaded comments. Is the recursive join really the naive view?
Engineer: It is, and it surprises people. The most natural schema is a self-referencing parent_id:
CREATE TABLE comments (
id INT PRIMARY KEY,
parent_id INT NULL REFERENCES comments(id),
body TEXT
);
This works. It breaks when you need to find all descendants of a comment, or all ancestors. In MySQL before 8.0 you would recurse in application code — one query per level. The database has no idea about the full tree.
flowchart BT
R["A (root)"] --> B["B"]
R --> C["C"]
B --> D["D"]
B --> E["E"]
C --> F["F"]
style R fill:#e8f4fd
Skeptic: And the fix?
Engineer: It depends on your read/write balance: nested sets encode left/right boundaries and enable fast subtree reads but expensive inserts. Closure tables store every path in a separate table — the most flexible. Materialized paths store the path as a string in each row. And from MySQL 8.0 onward, recursive CTEs make this cleanest without any schema trickery.
Karwin's principle: choose tree structure by counting reads against writes. If your tree is written heavily but read lightly, nested sets are a liability. If it is read heavily and written rarely, they are a victory.
NULL Is Not NULL: The Subtle Bug Generator
Skeptic: NULL — everyone knows it is "no value," right?
Engineer: That is precisely the mistake. NULL means "unknown." Not zero, not empty, not a placeholder. Three-valued logic — TRUE, FALSE, UNKNOWN — is SQL's spec, but most developers reason in two-valued Boolean. The difference produces silent failures.
Skeptic: Why silent?
Engineer: Because NULL comparisons return UNKNOWN, which WHERE filters out without warning:
| Query | Actual Result | Developer Expects |
|-------|--------------|-------------------|
| WHERE col = NULL | Always FALSE | rows with NULL |
| WHERE col != 'x' | Skips col IS NULL | all non-'x' rows |
| WHERE x NOT IN (...) | Returns 0 rows if list contains NULL | all rows not in list |
flowchart LR
N["NULL"] --> EQ["= NULL"]
EQ -->|"FALSE always"| TRAP1["trap: never true"]
N --> NE["!= 'x'"]
NE -->|"UNKNOWN"| TRAP2["trap: invisible rows"]
N --> NOTIN["NOT IN"]
NOTIN -->|"empty result"| TRAP3["trap: silent empty set"]
Skeptic: So what does Karwin recommend?
Engineer: Avoid NULL when a sentinel value is more explicit. When NULL is correct semantically, always use IS NULL and IS NOT NULL. Use COALESCE() generously. And be deeply suspicious of any NOT IN that touches a subquery — Karwin suggests always using NOT EXISTS instead, where NULL behavior is unambiguous.
Keys Without Cascades Are Worse Than No Keys
Skeptic: I have seen teams add foreign key constraints in schema migrations but never use ON DELETE at all. Is that really worse than nothing?
Engineer: Karwin argues yes. A foreign key with default NO ACTION enforces that the referenced row exists at insert/update time — but provides no guarantee about what happens when you delete the parent. The result is one of two things:
- Application errors:
DELETE FROM orders WHERE id = 101fails becauseorder_itemsstill references it. The application was not written to handle this, and the user sees an opaque error. - Silent orphans: if the FK was defined
DEFERRABLEor if the application bypasses the constraint (some ORMs do this), rows accumulate forever with a foreign key pointing to a non-existent row.
flowchart LR
subgraph On["ON DELETE CASCADE"]
O1["parent row"] -->|"delete propagates"| C["children deleted automatically"]
end
subgraph Off["NO ACTION / no ON DELETE"]
O2["parent row"] -->|"delete blocked"| ERR["error — or orphaned children if bypassed"]
end
style On fill:#e8f5e9
style Off fill:#fce4ec
Karwin's rule: foreign keys without ON DELETE action are a sword without a sheathe. If you declare the relationship, declare the lifecycle.
Pagination: Keyset vs. Offset
Skeptic: LIMIT 10000, 20 — surely that works fine?
Engineer: It works. It just gets slower linearly with offset. On a 5-million-row table, LIMIT 100000, 20 requires scanning and discarding 100,000 rows. The database does the work whether or not you see it.
Karwin's alternative is keyset pagination (also called the seek method):
-- ✓ Keyset pagination
SELECT * FROM articles
WHERE created_at < '2026-06-04T23:59:00Z'
AND (created_at, id) < ('2026-06-04T23:59:00Z', 9999)
ORDER BY created_at DESC, id DESC
LIMIT 20;
This is an index seek — O(log n) — not a table scan. The (created_at, id) tuple handles the case where multiple rows share the same created_at, making the sort stable.
flowchart LR
subgraph OffsetP["Offset Pagination"]
LP["LIMIT 10000, 20"] --> Skip["Skip 10,000 rows"]
Skip --> Seq["Sequential scan cost"]
end
subgraph KeysetP["Keyset Pagination"]
KP["WHERE created_at < last_seen<br/>ORDER BY ...<br/>LIMIT 20"] --> Idx["Index range seek"]
Idx --> Olog["O(log n) — consistent"]
end
style KeysetP fill:#e8f5e9
style OffsetP fill:#fce4ec
Karwin's recommendation: keyset pagination for any feed, list, or search result beyond a few hundred rows. Offset pagination for the first page and admin views where bounded data makes offset cost negligible.
Closing Thoughts
Karwin closes the revised edition with a message that holds for all four parts: the database is the longest-lived component of most systems. Application code gets rewritten. Schemas persist. Queries get copied into new services. Antipatterns — once introduced — propagate precisely because they were never flagged as problems.
Read this book once. Read it again before your next schema review.
Rating: 9/10 — Essential foundational reading for anyone who writes SQL.