booklore

Naked Statistics

Stripping the Dread from the Data

sufficient

reading path: overview → analysis → narration


overview

Overview

Naked Statistics is Charles Wheelan's case that statistics is neither tedious nor optional — it is the single most important body of knowledge for any citizen who wants to read the modern world honestly. The book takes the reader on a tour of the ideas that actually drive news headlines, medical research, election forecasts, and economic policy: descriptive statistics, probability, sampling, the central limit theorem, inference, regression. Along the way Wheelan dismantles the dread that the word "statistics" triggers, replacing formulas with intuition, jargon with stories, and anxiety with the satisfaction of finally understanding why the polls were off, why the medical study is being misreported, or why the regression in the news is not the regression the journalist thinks it is.

The book follows a deliberate arc. Part One builds the descriptive toolkit — mean, median, mode, standard deviation, correlation. Part Two introduces probability and the Monty Hall problem as a case study in why intuition fails. Part Three makes the leap from description to inference: sampling, the central limit theorem, confidence intervals, hypothesis testing, and the polling machinery that turns a few thousand phone calls into a national prediction. Part Four lands at regression analysis, the workhorse of modern empirical work, and ends with a sober catalogue of how it goes wrong.

This is the second installment of Wheelan's "Naked" series, following Naked Economics (2002) and preceding Naked Money (2017). It was a New York Times bestseller, and the San Francisco Chronicle called it "brilliant, funny... the best math teacher you never had."

Executive Summary

The Two Halves of Statistics

| Half | Job | Tools | |---|---|---| | Descriptive statistics | Summarize what is true in the data you have | Mean, median, mode, standard deviation, percentiles, correlation | | Inferential statistics | Use a sample to make a defensible claim about a larger population | Central limit theorem, standard error, confidence intervals, hypothesis tests, regression |

Wheelan's central claim is that almost every public misunderstanding of data comes from conflating these two jobs — either by treating a sample like a census, or by reading a description as if it were a causal explanation.

The Inference Engine

flowchart LR
    Pop[Population<br/>the thing you<br/>actually care about] --> Sample[Random Sample<br/>the data you<br/>can actually get]
    Sample --> Stat[Sample Statistic<br/>mean, %, slope]
    Stat --> SE[Standard Error<br/>how much the<br/>statistic would<br/>wiggle across<br/>samples]
    SE --> CI[Confidence Interval<br/>stat plus or minus<br/>a margin]
    CI --> Claim[Defensible Claim<br/>about the population]
    Pop -.->|"Infer"| Claim
    Sample -.->|"Describe"| Stat

The diagram captures Wheelan's main argument: the population is invisible, the sample is all you ever see, and the entire machinery of statistics exists to bridge the two without lying about the gap.

Core Concepts Covered

  • Descriptive statistics — mean, median, mode, range, standard deviation, percentiles, correlation coefficient.
  • Probability — expected value, independent events, the law of large numbers, the Monty Hall problem.
  • Sampling and bias — why a thousand randomly chosen Americans can describe three hundred million, and the five ways samples go wrong.
  • The central limit theorem — the wizard behind every poll, every clinical trial, every quality-control chart.
  • Inference — confidence intervals, hypothesis testing, statistical significance, p-values.
  • Regression analysis — the workhorse of modern empirical work, and the seven common ways it is misused.

Key Takeaways

  1. The mean is not the middle. The mean is pulled toward extremes; the median is the true middle. When Bill Gates walks into a bar, the mean wealth of the patrons soars, but the median is unaffected. Use the median when the distribution is skewed.
  2. Standard deviation is the single most important number you are not taught. It is the universal unit for "how spread out is this data." Two SAT prep classes can have identical mean scores and wildly different value, and the standard deviation is what tells them apart.
  3. Correlation is not causation. Ice cream sales and drowning deaths rise together in summer. Neither causes the other. Wheelan devotes an entire chapter to the difference and the mental move required to hold them apart.
  4. The Monty Hall problem is real and stubborn. Switching doors doubles your probability of winning. Thousands of people, including many PhD mathematicians, refuse to believe it on first contact. The example is Wheelan's demonstration that human intuition about probability is systematically unreliable.
  5. The central limit theorem is the reason the world is knowable. Take any population — even a wildly non-normal one — draw a large enough random sample, and the distribution of sample means will look like a bell curve. This is the engine that makes polls, trials, and quality control possible.
  6. A confidence interval is a humility ritual. A poll that says "52% plus or minus 3%" is not lying about its uncertainty — it is doing the math honestly. A poll that reports "52%" with no margin is hiding the same uncertainty behind false precision.
  7. Statistical significance is a convention, not a fact. The 0.05 threshold is a habit, not a law of nature. Researchers running many tests on the same data — the multiple comparisons problem — almost always find "significant" results by chance. The famous 2009 "dead salmon" fMRI study is Wheelan's exhibit A.
  8. Regression is the microscope of the empirical world. With enough data and the right variables, regression can isolate the effect of one factor while holding others constant. But it can also be weaponized: the Women's Health Initiative showed that decades of estrogen prescriptions to older women were based on regression results that later randomized trials overturned — tens of thousands of premature deaths, by one estimate.
  9. A representative sample beats a large biased one. A small sample drawn correctly is more valuable than a huge one drawn from the wrong population. The math of inference is unforgiving of selection bias.
  10. Programs that "work" often don't, because of regression to the mean. The fire department that adopts a new procedure, the business featured on the cover of a magazine, the sports team on the cover of Sports Illustrated — they all tend to underperform afterward. The reason is not the new procedure. It is that they were selected at the peak of a random fluctuation, and random fluctuations regress.

Who Should Read

| Reader Profile | Why This Book | |---|---| | The math-anxious general reader | The book exists for you. Wheelan treats the reader as intelligent but unburdened by formal training | | Journalists, marketers, managers | Anyone who works with data and needs to stop being fooled by their own charts | | Introductory statistics students | A friendly, story-driven supplement to a textbook — explains the why behind the formulas | | Medical and health-curious readers | Will leave you far better equipped to read clinical studies and risk headlines | | Policy professionals and analysts | Wheelan's chapters on regression, program evaluation, and program evaluation are gold | | Citizens evaluating the news | After this book, you cannot unsee a polling error, a p-value, or a confound |

Who Should Avoid

| Reader Profile | Why Skip | |---|---| | Professional statisticians | The treatment is non-technical; you will be bored by chapter 2 | | Readers who need a textbook with proofs | The book trades rigor for accessibility and is not a substitute for a course | | Those looking for Bayesian methods, causal inference, or machine learning | Wheelan covers the classical frequentist toolkit, not the modern extensions | | Anyone allergic to jokes about Las Vegas, drug cartels, and HBO's The Wire | You have been warned |

Difficulty

Easy-Medium. Wheelan writes at roughly a 7th-grade reading level (per the book's own appendix, modeled on Time magazine). There are no equations in the main text; the math is moved to an appendix for readers who want it. Estimated reading time: ~7 hours.

Historical Context

Naked Statistics appeared in January 2013, between the 2012 and 2014 U.S. elections — a period in which polling, Nate Silver, and the now-ubiquitous "538" forecast turned statistical literacy into a mass political event. The book also arrived at the crest of the data-science wave: the term "data scientist" had just been declared the "sexiest job of the 21st century" by Harvard Business Review (October 2012). Hal Varian, chief economist at Google, had recently pronounced statistics "sexy." Wheelan's project — strip the dread from the data — landed in a moment of acute cultural readiness.

Wheelen's own biography shaped the project. A former correspondent for The Economist, he had already written the bestseller Naked Economics (2002; revised 2010), which used the same plain-English approach for economic ideas. Naked Statistics extended the same formula — minimal jargon, vivid examples, occasionally crude jokes — to the discipline whose abuse most often appears in news headlines: statistics.

| Book | Author | Connection | |---|---|---| | Naked Economics | Charles Wheelan | Sister volume; same voice applied to economic thinking | | The Signal and the Noise | Nate Silver | Forecasts, predictions, and why most experts are worse than a simple model | | How Not to Be Wrong | Jordan Ellenberg | Mathematical thinking in everyday life, with more proofs | | The Drunkard's Walk | Leonard Mlodinow | Probability and randomness for the general reader | | Innumeracy | John Allen Paulos | The classic short book on mathematical illiteracy | | Thinking, Fast and Slow | Daniel Kahneman | The psychology behind the cognitive errors Wheelan describes in probability | | Superforecasting | Tetlock & Gardner | What separates good statistical thinkers from bad ones | | The Black Swan | Nassim Taleb | A more radical critique of the same statistical tools | | Naked Money | Charles Wheelan | Third volume in the series — finance, banking, monetary policy |

Final Verdict

** Rating: 8.5 / 10 **

Naked Statistics succeeds at the very thing it sets out to do: it makes statistics accessible to readers who would never open a textbook. The combination of vivid stories (the SAT prep classes with identical means and wildly different spreads, the dead salmon in the fMRI machine, the Women's Health Initiative estrogen catastrophe), careful definitions, and a steady drumbeat of "now you have the tools to see through this" is genuinely liberating.

The book has clear limitations. Wheelan is faithful to the classical frequentist toolkit, and readers who want modern causal inference (potential outcomes, instrumental variables, directed acyclic graphs) will need to go elsewhere — The Book of Why by Judea Pearl and Dana Mackenzie is the natural next step. The book also does not engage with the replication crisis that was gathering force in 2013 and that has since reshaped how researchers think about p-values, preregistration, and effect sizes. A 2026 reader will notice the absence.

Despite these gaps, the book has aged well. The core ideas — descriptive statistics, probability, sampling, the central limit theorem, inference, regression — are the operating system of empirical reasoning. They are not trendy. They are not going to be overturned. And most adults in 2026 still do not understand them. Naked Statistics is the friendliest, most honest, and most useful single-volume introduction to that operating system. Read it, and you will never read a poll, a study, or a regression the same way again.


content map

Core Concepts

Descriptive Statistics — Summarizing the Crime Scene

Wheelan opens with the central metaphor: data is a crime scene, and statistics is the detective work that turns thousands of messy observations into a meaningful answer. The first set of tools are the descriptive statistics — numbers that capture, in a single value or two, what is true about a dataset.

Mean, Median, Mode

These three measures of "central tendency" answer different questions.

  • Mean — the arithmetic average. Add everything up, divide by the count. Sensitive to outliers: if ten friends earn between $40,000 and $60,000, and a billionaire joins the lunch, the mean leaps; the friends' actual financial situation has not changed.
  • Median — the true middle. Sort the values and pick the one in the center. Unaffected by extremes. When distribution is skewed — income, house prices, city populations — the median is usually the honest summary.
  • Mode — the most common value. Useful for categorical data (most common blood type, most popular first name) and useless for continuous data with no repeats.

| Use the mean when… | Use the median when… | |---|---| | Distribution is symmetric | Distribution is skewed by outliers | | You need to combine groups | You want a typical individual | | You care about totals | You care about the typical person |

Standard Deviation

The single most important number most readers were never taught. Standard deviation is the average distance of every data point from the mean, in the same units as the data. It is the universal currency for "how spread out is this data."

Two SAT prep classes both post an average score gain of 50 points. In Class A, every student gains between 45 and 55. In Class B, half the class gains 200 and half lose 100. The means are identical; the standard deviations are wildly different; the classes are wildly different in value. Mean without standard deviation is a half-truth.

Wheelan's instinct: always ask for the spread, not just the average. A weather report of "average temperature 60 degrees" is meaningless without the variance; a stock that "averages 10% annual return" without a standard deviation could be a steady eddy or a Vegas slot machine.

Percentiles and the Box

Percentiles sort a distribution into hundred equal pieces. The 90th percentile height for American men is about 6'0" — 90% of men are shorter. The interquartile range (25th to 75th percentile) is a robust, outlier-proof measure of spread.

A box plot is the visual: a box for the middle 50%, a line for the median, whiskers for the rest, dots for the outliers. Wheelan returns to the box plot as a default visual because it forces honesty about spread.


Correlation — The Most Misunderstood Number in the World

The correlation coefficient runs from -1 to +1.

  • +1 — perfect positive correlation (height and weight).
  • 0 — no linear relationship (shoe size and IQ).
  • -1 — perfect negative correlation (outside temperature and heating bills).

Wheelan's warning: correlation is necessary but never sufficient for causation. The famous examples — ice cream sales and drowning deaths, both peaking in summer; Nicolas Cage films and pool drownings, also correlated; per capita consumption of mozzarella cheese and the number of civil engineering doctorates awarded — are funny precisely because the correlations are real and the causal claims are absurd.

The mental move: a correlation establishes that two variables move together. To claim that one causes the other, you need a third piece of information — a theory, a randomized experiment, an instrumental variable, a natural experiment, a time order, or a dose-response curve — that rules out coincidence, reverse causation, and confounding.

A useful rule of thumb Wheelan never quite states but implies: if the correlation is large and the causal mechanism is obvious, correlation may be good enough. If either is weak, slow down. If the headline is "X causes Y" and the underlying study is observational, treat the headline as a hypothesis, not a finding.


Probability — Why Your Intuition Is Wrong

Expected Value

The expected value of a gamble is the average outcome if you played infinitely many times. It is not the outcome you should expect from any single play. A lottery ticket with a 1-in-100-million chance of winning $100 million has an expected value of $1 — but a 99.99999% chance of being worthless.

Casinos survive on expected value. A roulette wheel with 38 numbers, one green, pays 35-to-1 on a single number. The expected value of a $1 bet is about $0.95. The casino collects the 5% on every spin, forever, by the law of large numbers. The gambler is, on average, donating 5% per spin. The house needs only one spin to win, the gambler needs millions.

Independent Events and the Law of Large Numbers

Independent events do not have memory. A coin that has come up heads ten times in a row is no more or less likely to come up heads on the eleventh flip. The "gambler's fallacy" — believing that ten reds in a row makes black due — is the systematic error of treating independent events as if they were connected.

The law of large numbers: as the number of trials grows, the average outcome converges on the expected value. This is why the casino always wins in the long run and why a poll of a thousand people can predict an election of a hundred million.

The Monty Hall Problem

The set-up: a game show, three doors, a car behind one, goats behind the other two. You pick a door. The host, who knows where the car is, opens a different door to reveal a goat. He offers to let you switch to the remaining unopened door. Should you switch?

The answer, which Wheelan treats at length: yes, you double your chance of winning by switching. The probability your original door hides the car is 1/3 and never changes. The probability the car is behind one of the other two doors is 2/3. The host has revealed a goat behind one of those two. The 2/3 collapses onto the single remaining unopened door. Switching wins 2/3 of the time. Staying wins 1/3 of the time.

This is the most famous probability puzzle in the world. It is also correct, and it has been empirically demonstrated in dozens of psychology studies. The lesson Wheelan draws: human intuition about probability is notoriously wrong, and a careful thinker has to override it with arithmetic.


Sampling — The Miracle of Modern Empirical Work

Why a Sample Works

A random sample of about 1,000 Americans can describe a country of 300 million with about 3 percentage points of error. The reason: variability in the sample mean shrinks as 1 over the square root of the sample size. To halve the margin of error, you must quadruple the sample. To go from 3% to 0.3%, you need a sample 100 times larger — 100,000 people. Diminishing returns set in early.

The Five Biases That Wreck a Sample

Wheelan's catalogue of sample failure modes:

  1. Selection bias — the sample is not drawn from the population you want to describe. Literary Digest's 1936 poll predicted Alf Landon would beat FDR; it sampled its own subscribers and automobile club members, and missed the New Deal coalition entirely.
  2. Non-response bias — those who refuse to answer differ systematically from those who do. Phone surveys miss the poor; online panels miss the elderly; political surveys miss the disengaged.
  3. Recall bias — people misremember. "How many sexual partners have you had?" gets different numbers depending on whether the researcher asks politely or provides a private ballot.
  4. Survivorship bias — you see the winners and miss the failures. The visible planes returning from WWII combat had bullet holes everywhere except the engines; Abraham Wald's insight was that the missing planes — the ones that did not return — were the ones hit in the engines.
  5. Healthy-user bias — observational studies that compare people who take vitamins to people who do not consistently find that vitamin-takers are healthier. But vitamin-takers also tend to exercise, eat vegetables, and go to the doctor; the vitamin may be innocent.

The lesson: a small, well-drawn sample beats a large, biased one. The math of inference is unforgiving of selection effects.


The Central Limit Theorem — The Wizard Behind the Curtain

This is the most important theorem in the book. Its claim: take any population, no matter how weirdly distributed. Draw a large enough random sample, repeat the sampling process many times, and the distribution of sample means will look like a bell curve, centered on the true population mean.

The theorem does not require the population itself to be normal. It does not require large samples — usually 30 is plenty. It works for incomes, heights, dice rolls, anything with a finite variance. It is the reason polls work, the reason clinical trials work, the reason quality control works, the reason weather forecasts work.

Standard error is the standard deviation of the sampling distribution. It is the "wiggle room" in a sample mean. The standard error of a sample mean is the population standard deviation divided by the square root of the sample size. The central limit theorem tells you what the sampling distribution looks like; the standard error tells you how wide it is.

This is the machinery that turns a sample into a population claim.


Inference — From Data to Claims

Confidence Intervals

A 95% confidence interval is a range constructed by a procedure that captures the true population value 95% of the time, in the long run. It is not a probability statement about the specific interval you just computed (the true value is either in this interval or it is not, and we don't know which). It is a statement about the procedure.

The subtle bit: the 95% refers to the procedure's long-run performance, not to the specific interval's probability of being right. Most non-statisticians (and most journalists) misinterpret this constantly. The honest framing: "we are 95% confident the true value lies in this range, which is a way of saying our procedure catches the true value 95% of the time across repeated samples."

Hypothesis Testing and the p-Value

The structure of a hypothesis test:

  1. State a null hypothesis (the default assumption — usually "no effect" or "no difference").
  2. Compute a test statistic from the data.
  3. Calculate the probability of seeing a test statistic at least this extreme, if the null hypothesis is true. This is the p-value.
  4. If the p-value is below the conventional threshold (usually 0.05), reject the null and call the result "statistically significant."

The threshold of 0.05 is a convention set by Ronald Fisher in the 1920s. It is a habit, not a law of nature. It is also a one-size-fits-all number applied to wildly different research contexts, which is part of why the replication crisis hit social science in the 2010s.

The Multiple Comparisons Problem — The Dead Salmon

If you test 20 hypotheses at the 0.05 level, you expect one of them to come up significant by chance alone, even if all 20 null hypotheses are true. This is the multiple comparisons problem.

Wheelan's exhibit A is the 2009 Bennett et al. study, "Neural Correlates of Interspecies Perspective Taking in the Post-Mortem Atlantic Salmon." A dead salmon was placed in an fMRI machine and shown emotional photos. Standard analysis produced significant brain activity in the dead fish. The result was a deliberate demonstration that running thousands of comparisons on a single brain scan will turn up "activity" by chance, even in dead tissue.

The lesson: a single p-value of 0.04 in a study that ran 50 tests is worthless. Modern practice adjusts for multiple comparisons (Bonferroni, Benjamini-Hochberg, false discovery rate) — but many older studies, and many in the popular press, do not.

Reversion to the Mean

Extreme observations tend to be followed by less extreme ones. This is not a force or a correction; it is mathematics. If you select the tallest person in a room, their next measurement will on average be shorter. Not because they shrank, but because the selection process captured a random high.

The implication: any program that "selects" extreme cases for intervention will appear to work, even if the intervention does nothing. The fire department whose response times were terrible before training will improve. The sports team on the cover of Sports Illustrated will have a bad season. The CEO featured on the cover of BusinessWeek will underperform. The intervention, the jinx, the curse — all are reversion to the mean dressed up as causation.


Regression Analysis — The Workhorse

Regression is the statistical microscope of the empirical world. It isolates the relationship between an outcome variable and one or more predictor variables while holding the others constant. It is the single most-used tool in observational research, and the most-abused.

The Core Idea

Fit a line (or curve) through a cloud of data points that minimizes the sum of squared vertical distances. The slope of that line is the estimated effect of the predictor on the outcome, controlling for the other predictors in the model. A slope of 0.3 with a p-value of 0.01 says "holding everything else equal, a one-unit increase in the predictor is associated with a 0.3-unit increase in the outcome, and this association is unlikely to be due to chance."

Seven Common Mistakes

  1. Linear fit on a nonlinear relationship — fitting a straight line to a U-shaped curve yields a slope of nearly zero, missing the real pattern.
  2. Correlation is not causation — regression measures association, not cause.
  3. Reverse causation — sometimes the arrow goes the other way. Does depression cause poor sleep, or does poor sleep cause depression?
  4. Omitted variable bias — the most important and most common error. A regression of income on height that omits gender, age, and education will attribute those omitted variables' effects to height.
  5. Highly correlated predictors (multicollinearity) — including two variables that measure the same thing inflates the standard errors and makes coefficients unstable.
  6. Extrapolation beyond the data — the line may slope up within the observed range and then reverse outside it.
  7. Outliers — a few extreme points can dominate the fit and shift the line dramatically.

The Estrogen Catastrophe

The cautionary tale Wheelan returns to: in the 1990s, observational data showed that women who took estrogen after menopause had lower rates of heart disease. Millions of women were prescribed the drug. The randomized Women's Health Initiative, completed in 2002, found the opposite — estrogen increased the risk of heart disease, stroke, and breast cancer. A 2003 New York Times Magazine estimate suggested tens of thousands of premature deaths.

The lesson: a regression coefficient is a hypothesis, not a prescription. Until you have a randomized trial, the policy implications are tentative.


Program Evaluation — Isolating Cause and Effect

The cleanest way to determine causality is the randomized controlled trial. Assign subjects to treatment or control by coin flip. The only systematic difference between the two groups is the treatment. Any difference in outcomes is causal.

When randomization is impossible, the next-best tools are natural experiments (a policy that affected some people but not others for reasons unrelated to outcomes), difference-in-differences (compare trends before and after an intervention in treated and untreated groups), regression discontinuity (compare subjects just above and just below an eligibility cutoff), and instrumental variables (a third variable that shifts the treatment but affects the outcome only through the treatment).

The program evaluation chapter is Wheelan's reminder that empirical work is hard, that clean causal identification is rare, and that the honest answer to "does this work?" is often "we don't know yet — and anyone who claims otherwise without a randomized trial is selling you something."


Frameworks

graph TD
    A["THE NAKED STATISTICS TOOLKIT"] --> B["DESCRIPTIVE STATS<br/>Summarize the data<br/>you have"]
    A --> C["PROBABILITY<br/>Reason under<br/>uncertainty"]
    A --> D["INFERENCE<br/>Go from sample<br/>to population"]
    A --> E["REGRESSION<br/>Isolate relationships<br/>in complex data"]

    B --> B1["Mean / Median / Mode<br/>Standard Deviation<br/>Percentiles<br/>Correlation"]
    C --> C1["Expected Value<br/>Law of Large Numbers<br/>Monty Hall<br/>Reversion to Mean"]
    D --> D1["Sampling<br/>Central Limit Theorem<br/>Confidence Intervals<br/>Hypothesis Testing"]
    E --> E1["OLS Regression<br/>Multivariate Control<br/>Natural Experiments<br/>Program Evaluation"]

    B1 --> F["MISTAKE:<br/>Mean hides spread"]
    B1 --> G["MISTAKE:<br/>Correlation != causation"]
    C1 --> H["MISTAKE:<br/>Intuition beats math"]
    D1 --> I["MISTAKE:<br/>Biased sample"]
    D1 --> J["MISTAKE:<br/>p-hacking /<br/>multiple comparisons"]
    E1 --> K["MISTAKE:<br/>Omitted variable bias"]

    style A fill:#1a1a2e,color:#fff,stroke:#e94560
    style B fill:#16213e,color:#fff,stroke:#e94560
    style C fill:#16213e,color:#fff,stroke:#e94560
    style D fill:#16213e,color:#fff,stroke:#e94560
    style E fill:#16213e,color:#fff,stroke:#e94560
    style B1 fill:#0f3460,color:#fff,stroke:#e94560
    style C1 fill:#0f3460,color:#fff,stroke:#e94560
    style D1 fill:#0f3460,color:#fff,stroke:#e94560
    style E1 fill:#0f3460,color:#fff,stroke:#e94560
    style F fill:#533483,color:#fff,stroke:#e94560
    style G fill:#533483,color:#fff,stroke:#e94560
    style H fill:#533483,color:#fff,stroke:#e94560
    style I fill:#533483,color:#fff,stroke:#e94560
    style J fill:#533483,color:#fff,stroke:#e94560
    style K fill:#533483,color:#fff,stroke:#e94560

Mental Models

| Model | Application | |---|---| | The crime scene | Data is a crime scene; statistics is the detective work. A mean is the sketch; the standard deviation is the timeline | | Bill Gates in the bar | Outliers break the mean. Use the median when distributions are skewed | | The ice cream / drowning example | Correlation is not causation. Always ask for the missing third variable | | The casino's 5% | Expected value minus stake. A losing game played enough times is a guaranteed loss for the player and a guaranteed win for the house | | The 1,000-person poll | A sample of 1,000 can describe 300 million. Variability of the mean shrinks as 1 over the square root of sample size | | The bell curve of means | The central limit theorem. Sample means are normal even when populations are not | | The dead salmon | Run too many comparisons and you will find "significance" in a corpse. Always correct for multiple testing | | The Sports Illustrated jinx | Selection at a peak is followed by regression. Programs that "work" on extreme cases may work only by reversion to the mean | | The 95% procedure | A confidence interval is a statement about the long-run performance of a procedure, not about the specific interval you just computed | | The two engines | Description (summarize the data you have) and inference (project from sample to population) are different jobs; do not confuse them | | Omitted variable bias | The most common error in regression. If a key variable is missing, the regression will quietly attribute its effect to whatever is in the model |


Key Lessons

  1. Always ask for the spread. A mean without a standard deviation is a half-truth. Numbers like "average income" or "average response time" are incomplete without the variance.
  2. Correlation is necessary but never sufficient for causation. A correlation, a theory, and ideally a randomized experiment are the minimum for a causal claim.
  3. Your gut is wrong about probability. The Monty Hall problem, the hot-streak fallacy, the prosecutor's fallacy — all show that intuitive reasoning about uncertainty is systematically unreliable. Run the math.
  4. A small random sample beats a large biased one. The math of inference punishes selection effects without mercy. Garbage in, garbage out — but with a confidence interval attached.
  5. The central limit theorem is the reason the world is knowable. Every poll, every clinical trial, every quality-control chart rests on the fact that sample means cluster around population means in a predictable shape.
  6. Statistical significance is a convention, not a fact. The 0.05 threshold is a habit set in 1925. Many "significant" findings are noise. The dead salmon study is the cartoon version of the problem.
  7. Regression isolates associations, not causes. Without randomization or a credible natural experiment, a regression coefficient is a hypothesis, not a prescription. The estrogen catastrophe is the proof.
  8. Reversion to the mean is everywhere. Any program that selects extreme cases for intervention will appear to work. The improvement is arithmetic, not therapeutic.
  9. Confidence intervals are humility rituals. A claim with a wide margin of error is a claim that knows its own limits. A claim with no margin at all is hiding the same uncertainty behind false precision.
  10. Statistical literacy is a civic skill. The world is awash in numbers that try to influence what you buy, vote for, and believe. Understanding how they are produced and where they go wrong is the defense.

Practical Applications

Reading a poll: Find the sample size, the margin of error, and the field dates. A 2016 national poll of 500 likely voters with a 4.5% margin of error is barely informative. A 2016 national poll of 2,000 likely voters with a 2.5% margin taken three weeks before the election is informative. Ignore the headline number; compare the candidates' overlapping ranges — if they overlap, the poll cannot tell them apart.

Evaluating a medical study: Is it randomized? If not, the finding is a hypothesis. If yes, how large is the effect size? A 0.001% reduction in absolute risk may be "statistically significant" but practically meaningless. Also: who funded it, how many participants dropped out, and was the primary outcome pre-registered?

Reading a regression table in the news: Find the coefficient, the standard error, and the p-value. A coefficient of 0.30 with a standard error of 0.40 and a p-value of 0.45 is not significant — the journalist who reports "study finds X causes Y" is over-claiming. Also check: what other variables are in the model? If the obvious confounder is missing, the coefficient is suspect.

Evaluating a causal claim in the news: Before believing "X causes Y," ask three questions. (1) Is there a plausible mechanism? (2) Does the correlation hold up after controlling for the obvious confounders? (3) Is there a randomized or natural experiment that confirms the relationship? If all three are "yes," you can start to believe it.

Choosing a sample for a survey: Define the population clearly. Use a random sampling frame. Pursue non-respondents. Weight the results if some demographic groups are over- or under-represented. Report the margin of error and the response rate.

Interpreting a clinical trial result: Look at the absolute risk reduction, not just the relative one. A drug that reduces the risk of stroke by 50% sounds impressive — but if the baseline risk is 0.002% and the treated risk is 0.001%, the absolute benefit is 0.001 percentage points, and 100,000 people must be treated to prevent one stroke.

Distinguishing signal from noise in your own data: Plot the distribution. Look at the median, the standard deviation, and the outliers. Be suspicious of any single observation. Trust the procedure that produces consistent results across many samples more than any one dramatic finding.


Examples

The Literary Digest Poll of 1936: The magazine mailed 10 million ballots and received 2.3 million responses, predicting Alf Landon would beat FDR in a landslide. The actual result: FDR won 46 of 48 states. The bias: the sample was drawn from the magazine's own subscriber list and automobile club memberships — both skewed Republican and wealthy in 1936. Size did not save the poll. A small, well-drawn sample would have been more accurate.

The dead salmon in the fMRI machine: Bennett, Wolford, and Miller (2009) placed a dead Atlantic salmon in an fMRI scanner, showed it emotional photographs of humans, and ran a standard analysis. The result showed "significant" brain activity in the salmon's corpse. The point: running thousands of voxel-wise comparisons on a single brain scan guarantees false positives, even in dead tissue. The paper was a deliberate demonstration of the multiple comparisons problem and became a teaching tool in statistics courses worldwide.

The Women's Health Initiative: Decades of observational data showed that postmenopausal women taking estrogen had lower rates of heart disease. Millions of prescriptions followed. The WHI, a randomized trial of 16,000 women, found the opposite: estrogen increased the risk of heart disease, stroke, and breast cancer. The trial was halted early. A 2003 New York Times Magazine estimate suggested tens of thousands of premature deaths. The cause: observational regression without randomization, treated as a prescription.

The missing bullet holes: During WWII, the US military studied returning bombers to figure out where to add armor. The planes that came back were riddled with bullet holes — everywhere except the engines. Abraham Wald's insight: the planes that came back are the planes that survived hits to non-critical areas. The aircraft that did not return were the ones hit in the engines. The armor went on the engines. This is the canonical example of survivorship bias.

The case of the two SAT prep classes: Two prep classes both post average score gains of 50 points. Class A: every student gains between 45 and 55. Class B: half the class gains 200, half lose 100. Same mean, wildly different value. The standard deviation — the spread — is the only honest way to tell them apart.

The sports team on the cover of Sports Illustrated: A statistical analysis showed that teams and athletes featured on the Sports Illustrated cover underperform in the period immediately afterward. This "SI jinx" became a media cliché. The actual explanation: selection on the dependent variable. Being on the cover means being at an extreme high point. The next observation, on average, will be lower. The "jinx" is reversion to the mean.


Action Plan

  1. Audit the next three statistics you encounter in a news article for a sample size, a margin of error, and a measure of spread. If any are missing, treat the number as advertising.
  2. Learn to compute and interpret a confidence interval. Take a small dataset (a sample of 30 friends' incomes, say) and report the mean with a margin. Practice until the procedure is automatic.
  3. Before believing any causal claim, run the three-question test: plausible mechanism? Confounders controlled? Randomized or natural experiment? Two "no" answers means the claim is a hypothesis.
  4. For any regression you encounter, check for omitted variable bias. What other variables should be in the model? If a key one is missing, the coefficient is suspect.
  5. When evaluating a program, demand a control group. Anecdotal before-and-after comparisons are vulnerable to reversion to the mean.
  6. Treat a single p-value below 0.05 as a flag, not a finding. Ask how many comparisons were run. If the answer is "many," the result is noise.
  7. Read a research paper from start to finish at least once. The appendix, the limitations section, and the discussion of effect sizes are where the real signal lives. Most journalistic summaries skip them.
  8. Maintain a healthy suspicion of your own gut. The Monty Hall problem, the prosecutor's fallacy, the hot-streak fallacy, and the gambler's fallacy all show that intuitive probability reasoning is systematically wrong. Run the math.
  9. Build a habit of plotting before concluding. A scatter plot, a histogram, or a box plot will reveal structure that a single summary statistic hides.
  10. Recommend the book. Statistical literacy is contagious. The more people who can read a poll or a study critically, the better our public conversations about data become.

analysis

Strengths

  • Remarkable clarity. Wheelan writes at a 7th-grade reading level (per his own appendix, modeled on Time magazine) without ever being condescending. The book takes ideas that have tripped up countless textbooks and renders them in prose that any literate adult can follow.
  • The right examples. Wheelan chooses examples that are vivid, counterintuitive, and remembered — the dead salmon, the literary Digest 1936, the Women's Health Initiative estrogen catastrophe, the SAT classes with identical means and different spreads, the Monty Hall problem, the planes returning from WWII. Each one teaches a principle better than any formula would.
  • A complete arc. The book covers the full pipeline from description to inference to regression to program evaluation. A reader who finishes it has the conceptual machinery to read most empirical work in the social and medical sciences, even if they cannot reproduce the math.
  • Self-aware about its limitations. Wheelan repeatedly reminds the reader that he is stripping the math out, that the appendix has the formulas for those who want them, and that the point is intuition, not technique. This honesty about the book's scope is a strength.
  • The Monty Hall chapter is the best single treatment of a probability puzzle in print. Wheelan walks through every common objection, plays out the 100-door version that makes the answer obvious, and then lands the cognitive lesson.
  • The dead salmon as a teaching tool. Bennett, Wolford, and Miller's 2009 dead salmon fMRI paper is a perfect vehicle for the multiple comparisons problem. Wheelan deploys it well, and the image sticks.
  • Humor that works. Wheelan uses jokes (drug cartels, The Wire, steroid use in Major League Baseball) to make the reader want to keep going. The jokes are not forced. They are part of the voice.
  • Public utility. The book is one of the most successful general-audience introductions to statistical thinking ever written. It has moved a generation of journalists, marketers, managers, and curious citizens from "I don't do math" to "I can read a poll."

Weaknesses

  • Classical toolkit only. Wheelan covers the frequentist approach and stops there. There is no Bayesian inference, no causal inference framework (potential outcomes, instrumental variables, directed acyclic graphs), no discussion of machine learning, no treatment of bootstrap or permutation methods. Modern empirical work lives in those extensions, and readers who go on to take a real course will discover a much larger world.
  • No treatment of the replication crisis. The book was published in 2013, just as the replication crisis was breaking in psychology and beginning to surface in other fields. The conversation about p-hacking, publication bias, preregistration, and effect sizes is now central to statistical literacy, and Naked Statistics has no chapter on it. A 2026 reader will notice the gap.
  • The math appendix is perfunctory. Wheelan moves the math to an appendix in keeping with the "no equations in the main text" rule. But the appendix is light. Readers who want to actually do the work will need a textbook; readers who do not will leave the book with intuition but no technique.
  • The "regression" chapters are sketchy. Regression analysis is the most-used empirical tool in the world, and Wheelan devotes only two chapters to it (regression analysis + common regression mistakes). A reader who finishes those chapters will know the names of the seven mistakes but not how to actually fit a model, interpret a coefficient, or diagnose a problem. Mostly Harmless Econometrics or An Introduction to Statistical Learning is the necessary next step.
  • The final chapter is weak. The conclusion, "Five ways statistics is making the world a better place," is a feel-good coda that summarizes little and reads like padding. Some reviewers report skipping it on rereads.
  • The "Monty Hall chapter" in the second half of the book is actually the most important chapter in the book. Placing it as Chapter 5.5 rather than foregrounding it as the centerpiece understates the book's actual strength. A reader who skims the table of contents may miss the chapter that most rewards careful reading.
  • Limited engagement with non-medical empirical work. Wheelan leans heavily on medical examples (estrogen, vitamin C, the WHI). Economic, political, sociological, and ecological examples are sparser. The book would be richer with more variety.

Criticism

"Statistics is not the only antidote to statistical illiteracy"

The book occasionally presents statistical thinking as a magic fix. A reader who finishes the book may come away believing that a sound methodology is sufficient for honest empirical work. The replication crisis has shown that methodology is necessary but not sufficient — incentives, publication bias, preregistration, and replication culture all matter, and the book has nothing to say about any of them.

"The book overstates how easy it is"

A risk of any popularization is the illusion of mastery. A reader who absorbs Wheelan's intuitions can be a more critical consumer of polls and headlines, but cannot themselves run a competent analysis. The book neither warns the reader about this gap nor points to a more rigorous follow-up. The natural next book — Pearl and Mackenzie's The Book of Why for causal inference, or James, Witten, Hastie, and Tibshirani's An Introduction to Statistical Learning for the technique — is not referenced in the conclusion.

"The frequentist framing is now widely contested"

Wheelan's treatment of confidence intervals, p-values, and significance testing is the standard 1950s framing. Since 2016, the American Statistical Association has issued warnings about p-value misuse; since 2019, Nature has published an explicitly anti-p-value editorial; and a growing number of methodologists advocate moving to estimation (effect sizes with confidence intervals) over dichotomous significance tests. The book is silent on this debate. A reader who finishes it may believe the frequentist framework is settled. It is not.

"The examples have not aged uniformly"

The literary Digest, the dead salmon, the WHI, the planes over Europe are all durable. But several of the medical examples (vitamin C and colds, aspirin and heart attacks, the sat-fat story) are now more contested than Wheelan's treatment suggests. A book on applied statistics should include some examples of how the field reverses itself, and Naked Statistics does not.

"The book could have included a chapter on data visualization"

A reader who finishes the book can interpret a poll, a regression table, and a study. But the book does not teach them to read or produce a good chart. Edward Tufte's The Visual Display of Quantitative Information remains the canonical treatment, and Wheelan misses an opportunity to bridge the descriptive and visual.

Counterarguments

| Criticism | Response | |---|---| | Classical toolkit only | This is a popularization, not a textbook. Covering the modern extensions would double the length and lose the audience | | No treatment of the replication crisis | The book was published in 2013, before the crisis broke in earnest. Updating is a separate project | | Math appendix is perfunctory | The reader who wants to do the math is pointed toward the appendix. A popularization cannot also be a textbook | | Weak conclusion | The final chapter functions as inspiration, not summary. A motivated reader can summarize the book themselves | | Limited examples beyond medicine | Medical studies are the most accessible empirical work to non-specialists; broadening the examples would require more statistical background to evaluate | | Frequentist framing is contested | True, but contested does not mean wrong. The frequentist framework remains the dominant approach in most empirical fields, and Wheelan teaches it honestly |

Scientific Support

| Concept | Source | Key Finding | |---|---|---| | The dead salmon study | Bennett, Wolford, & Miller (2009) | Running 130,000 voxel-wise comparisons on a single fMRI scan turned up significant "brain activity" in a dead Atlantic salmon — a deliberate demonstration of the multiple comparisons problem | | Literary Digest 1936 | Robinson (1932); later historical analysis | The poll sampled 2.3 million responses from a population biased toward Republican, wealthy, and automobile-owning voters; the size did not compensate for the bias | | The Women's Health Initiative | Writing Group for the WHI (2002) | A randomized trial of 16,000 postmenopausal women found that combined estrogen + progestin therapy increased the risk of heart disease, stroke, and breast cancer — reversing the observational finding that had driven mass prescription | | Central limit theorem | Laplace (1810) | Sample means drawn from any population with finite variance converge to a normal distribution as sample size grows | | Survivorship bias in WWII bombers | Wald (1943) | Returning bombers had bullet holes everywhere except the engines; armor was added to the engines where the missing (non-returning) planes had been hit | | Monty Hall problem | Selvin (1975) | A letter to the American Statistician set off decades of debate; the correct answer (always switch) was demonstrated empirically by Burns & Wieth (2004) — but only after a logic walk-through | | Estrogen WHI estimate of harm | NYT Magazine (2003) estimated tens of thousands of premature deaths; the actual magnitude has been debated, but the reversal of medical practice is uncontested | | Reversion to the mean | Galton (1886); Kahneman (2011) | Extreme observations tend toward the mean on subsequent measurement; the Sports Illustrated "jinx" and similar phenomena are mathematical, not causal | | Replication crisis | Open Science Collaboration (2015) | Only 36% of psychology studies replicated; many social-science findings have not survived re-examination | | p-Value warnings | Wasserstein & Lazar (2016), Am. Stat. | The American Statistical Association's first formal warning about p-value misuse, six years after the publication of Naked Statistics |

Historical Context

Naked Statistics was published in January 2013, in a year that turned statistical literacy into a mass political event. Nate Silver's five-thirty-eight forecast had correctly predicted the outcome in 50 of 50 states in the 2012 U.S. presidential election. The blog's confident probabilistic pronouncements — "Obama has an 86% chance of winning" — became the most-discussed artifact of the campaign. The book preceded the 2014 midterms, the Brexit referendum, the 2016 Trump upset, and the 2016-2018 replication crisis in psychology — events that made statistical literacy both more important and more contested.

The broader cultural context included:

  • The 2012 Harvard Business Review declaration that "data scientist" was the "sexiest job of the 21st century."
  • Hal Varian's pronouncement that statistics had become "sexy."
  • The first wave of massive open online courses (MOOCs), many of which featured introductory statistics offerings (Daphne Koller's Stanford course, Sebastian Thrun's Udacity).
  • The mainstreaming of polling (Pew, Gallup, YouGov, Quinnipiac) as fixtures of the political news cycle.
  • A parallel explosion in observational data — fitness trackers, social media, web analytics, A/B testing — that brought statistical claims into the daily life of managers and marketers.

Wheelan was uniquely positioned to write the book. A former correspondent for The Economist, he had already written Naked Economics (2002; revised 2010), which used the same voice for economic thinking. He had spent a decade teaching the material to policy students at the University of Chicago and Dartmouth. He had seen, first hand, how badly intelligent adults mistreat the numbers in front of them — and how a few hours of careful explanation could fix it.

The book's success is also a measure of the audience's hunger. It appeared on the New York Times bestseller list and remained in print through multiple printings. The paperback (ISBN 9780393347777) is the 2014 reprint and is the most widely available edition.

Similar Books

| Book | Author | How It Compares | |---|---|---| | Naked Economics | Charles Wheelan | Sister volume; same voice applied to economic thinking. The two together form a strong general-audience education in empirical reasoning | | The Signal and the Noise | Nate Silver | Forecasting and prediction; deeper on specific case studies (weather, baseball, poker, terrorism), lighter on the underlying statistics | | How Not to Be Wrong | Jordan Ellenberg | Mathematical thinking in everyday life; more proofs and more ambitious scope; less focused on empirical work | | The Drunkard's Walk | Leonard Mlodinow | Probability and randomness; the Mlodinow book is shorter and more biographical, but covers similar ground | | Innumeracy | John Allen Paulos | The classic short book on mathematical illiteracy; older but still excellent; lighter on technique | | Thinking, Fast and Slow | Daniel Kahneman | The psychology behind the cognitive errors Wheelan describes; the Kahneman book is the deep dive on the "intuition is wrong" lesson | | Superforecasting | Tetlock & Gardner | What separates good statistical thinkers from bad ones; the most empirically rigorous popular book on calibrated reasoning | | The Black Swan | Nassim Taleb | A more radical critique of the same statistical tools; Taleb is hostile where Wheelan is gentle | | The Book of Why | Judea Pearl & Dana Mackenzie | The causal-inference framework Wheelan does not cover; the natural next book for Naked Statistics readers | | An Introduction to Statistical Learning | James, Witten, Hastie, Tibshirani | The free, accessible, modern textbook; the right follow-up for readers who want to do the work | | Naked Money | Charles Wheelan | Third volume in the series — finance, banking, monetary policy; the same voice applied to financial data | | Factfulness | Hans Rosling | A data-driven picture of the world; complements Wheelan's tools with a corrective worldview |

Long-Term Relevance

High, with caveats. The core ideas — mean, median, standard deviation, correlation, probability, the central limit theorem, inference, regression — are not trendy. They are the operating system of empirical reasoning, and they will remain so regardless of technological change.

Three shifts are worth noting:

  1. The replication crisis has reshaped the conversation. The 2015 Open Science Collaboration finding that only 36% of psychology studies replicated, the 2016 ASA warning on p-values, the 2018 Nature editorial, and the rise of preregistration, registered reports, and effect-size reporting have all changed what counts as good statistical practice. Naked Statistics does not engage with any of this. A 2026 reader who finishes the book will need a second source (the ASA statement, The 7 Deadly Sins of Psychology, or a recent methods textbook) to catch up.
  2. Causal inference has become the central concern. The 21st-century development that Wheelan does not cover is the formal language of causal reasoning — potential outcomes, instrumental variables, directed acyclic graphs, synthetic controls. This is now the standard toolkit of empirical work in economics, political science, epidemiology, and tech. The Book of Why (Pearl & Mackenzie) is the popularization; Causal Inference for Statistics, Social, and Biomedical Sciences (Imbens & Rubin) is the textbook.
  3. Data science and machine learning have absorbed the predictive half. Modern empirical work splits into "prediction" (machine learning, often with no statistical inference) and "causal identification" (formal frameworks for inferring causes from data). Wheelan's book sits between the two — strong on description and classical inference, light on both modern extensions.

The book's main vulnerability: the parts of statistics that are most abused in modern discourse (causal claims from observational data, p-hacked "significance," and uncritical reliance on dashboards) are precisely the parts that a 2026 reader most needs. A revised edition would need a new chapter on causal inference and a new chapter on the replication crisis.

That said, the book remains the best single-volume popular introduction to the descriptive and inferential core of statistics. The chapters on mean, median, and standard deviation; on the central limit theorem; on the Monty Hall problem; and on the Women's Health Initiative will be correct and valuable twenty years from now.

Final Assessment

| Dimension | Rating | Notes | |---|---|---| | Argument Quality | 8/10 | Well-constructed, examples earn their place, occasional hand-waving in the regression chapters | | Practical Utility | 9/10 | Among the most useful popularizations of statistics ever written; the intuition lasts | | Originality | 7/10 | Synthesizes standard material; the voice and the example selection are the contribution | | Readability | 9.5/10 | Outstanding; reads like The Economist for non-economists | | Rigor | 6/10 | The frequentist toolkit is presented honestly but uncritically; modern extensions are absent | | Comprehensiveness | 6/10 | Strong on the descriptive and inferential core; weak on causal inference, the replication crisis, and modern data-science practice | | Overall | 8.5/10 | A genuine contribution that has made statistical literacy accessible to a generation of readers |

Naked Statistics succeeds because Wheelan does something most statisticians cannot: he explains the discipline to non-statisticians without making them feel small. The book is a gift to the math-anxious reader, the journalist who wants to do better, the manager drowning in dashboards, the policy professional who needs to read empirical work, and the citizen who is tired of being misled by headlines. It is not the last word — The Book of Why and An Introduction to Statistical Learning are the necessary follow-ups — but it is the best first word.

The book's most enduring contribution may be the dead salmon. A single example, deployed at the right moment, has shaped how a generation of non-statisticians thinks about significance testing. That is the hallmark of a great popularization: a vivid image that teaches a principle better than the underlying math. Naked Statistics is full of those images. They are why the book is still in print, and why it will be read for years to come.


narration

🎙️ Naked Statistics — The Podcast Episode

Host: Welcome back to the show. Today we are talking about a book that has been on my list for years, and I am embarrassed that it took me this long to read it. It is called Naked Statistics: Stripping the Dread from the Data by Charles Wheelan. It is the second book in his "Naked" series, after Naked Economics, and it does for statistics what Naked Economics does for economics: it makes the subject feel like something ordinary humans can actually understand.

Here is the honest pitch. Statistics is the most important subject most people never learned. Every day you are being manipulated by numbers — polls, medical studies, regression coefficients, "studies show" — and most of us, including most journalists, do not have the tools to tell the honest work from the dishonest work. Naked Statistics gives you those tools. It does so without a single equation in the main text, with stories you will actually remember, and with jokes that sometimes land. Let me walk you through it.


🎙️ The Premise — Statistics Is the Detective Work

Wheelan opens with a metaphor that frames the whole book. He says: data is a crime scene. Something happened. There are thousands of clues scattered everywhere. Statistics is the detective work that turns those clues into a meaningful answer.

This framing matters because it captures something most people miss. Statistics is not the math. Statistics is the judgment — deciding which numbers to look at, which to ignore, which to report, which to flag, and which conclusions the data will and will not support. The math is a tool. The judgment is the work.

And here is the catch: the math is teachable, but the judgment requires understanding what the math is doing. That is the gap Naked Statistics fills. By the end of the book, you will not be able to fit a regression model. But you will know when someone else is fitting one badly, and that alone will save you from a lifetime of bad conclusions.


🎙️ The Mean Is Not the Whole Story

The first tool Wheelan introduces is the mean — the arithmetic average. And the first thing he does with it is break it.

The story: ten friends are having lunch. Their incomes range from $40,000 to $60,000. The mean income is $50,000. Now imagine Bill Gates walks into the restaurant and joins the lunch. The mean income of the group just leapt to over a billion dollars. Did anyone at the table get richer? No. The mean moved because of one extreme observation, not because anything changed for the typical person.

The lesson: the mean is sensitive to outliers. When a distribution is skewed — and income, house prices, city populations, and many other real-world quantities are skewed — the median is the honest summary. The median is the true middle: sort everyone by income, pick the one in the center. Half are above, half below. It is not pulled by Bill Gates.

This is one of the most underappreciated ideas in all of data work. The next time a politician tells you "the average American family" earns some number, ask which average they mean. If the distribution is skewed, the median is the truth.


🎙️ Standard Deviation — The Number You Were Never Taught

If the mean is half the story, the standard deviation is the other half. It is the average distance of every data point from the mean. It tells you how spread out the data is. Two datasets can have the same mean and wildly different stories, and the standard deviation is what tells them apart.

Wheelan's example is two SAT prep classes. Class A: every student gains between 45 and 55 points. Class B: half the class gains 200, half lose 100. The mean gain in both classes is 50 points. The standard deviation in Class A is tiny. The standard deviation in Class B is enormous. The classes are not equivalent, and the mean does not tell you that.

This is Wheelan's first rule of reading numbers: always ask for the spread. "Average income" is meaningless without the variance. "Average test score" is meaningless without the variance. "Average response time" is meaningless without the variance. The number without the wiggle is a half-truth, and most public numbers are half-truths.


🎙️ Correlation Is Not Causation — And You Will Not Believe How Often This Is Confused

Wheelan devotes an entire chapter to the most abused idea in all of data analysis: confusing correlation with causation. The example he uses is the canonical one, and it is funny because it is true. Ice cream sales and drowning deaths both peak in summer. They are perfectly correlated. Neither causes the other. The missing variable — the confounder — is the season.

The deeper lesson: a correlation tells you that two things move together. It does not tell you why. There are at least four possible explanations for any correlation. (1) A causes B. (2) B causes A — the reverse causation problem. (3) A and B are both caused by C, the confounder. (4) Coincidence. To claim a causal relationship, you need a fifth piece of information — a theory, a randomized experiment, a natural experiment, a time order, a dose-response curve — that rules out the other three.

The killer example Wheelan returns to is the Women's Health Initiative. For decades, observational data showed that women who took estrogen after menopause had lower rates of heart disease. The regression coefficient was large, the p-value was tiny, and millions of women were prescribed the drug. Then the WHI — a proper randomized trial of 16,000 women — was completed in 2002. It found the opposite. Estrogen increased the risk of heart disease, stroke, and breast cancer. The trial was halted early. A 2003 New York Times Magazine estimate suggested tens of thousands of premature deaths. The cause of the disaster: observational regression treated as a prescription. The correlation was real. The causal inference was wrong. People died.

The takeaway: a regression coefficient is a hypothesis, not a prescription. Until you have a randomized trial, the policy implications are tentative. This is the most important lesson in the book, and the one most worth tattooing on your forearm.


🎙️ The Monty Hall Problem — Why Your Gut Is Wrong About Probability

Now we come to the chapter Wheelan considers the heart of the book, and the one I am about to ruin by trying to explain in two minutes. The set-up: a game show, three doors, a car behind one, goats behind the other two. You pick a door. The host, who knows where the car is, opens a different door to reveal a goat. He offers to let you switch to the remaining unopened door. Should you switch?

The answer, which is correct, is yes. Switching doubles your chance of winning. The probability your original pick is right is 1/3, and it never changes. The probability the car is behind one of the other two doors is 2/3. The host has just revealed a goat behind one of those two. The 2/3 collapses onto the single remaining unopened door. Switching wins 2/3 of the time. Staying wins 1/3 of the time. This is the correct answer, and it has been confirmed by experiment, and it has been rejected by thousands of people, including a few thousand with PhDs in mathematics, on first contact.

Wheelan uses a 100-door version to make it obvious. Imagine 100 doors. You pick one. The host opens 98 doors to reveal goats. Two doors remain — yours and one other. Are you really going to stick with your original 1-in-100 guess, or are you going to switch to the door the host has deliberately avoided? The host's behavior is the information, and it points overwhelmingly to the unopened door.

The deeper lesson Wheelan draws: human intuition about probability is notoriously wrong. The gambler's fallacy (a coin that has come up heads ten times in a row is not more likely to come up tails on the next flip). The hot-streak fallacy (a basketball player who has made ten shots in a row is not more likely to make the eleventh). The prosecutor's fallacy (a defendant who matches the DNA profile of 1 in 10,000 people is not necessarily guilty because the test came up positive — there are 30,000 people in the country with that profile). All of these errors come from treating independent events as connected, or from confusing the probability of the evidence given the hypothesis with the probability of the hypothesis given the evidence.

The cure is to run the math. Every time. Even when your gut is screaming. Especially when your gut is screaming.


🎙️ The Central Limit Theorem — The Wizard Behind the Curtain

Now we get to the most important idea in the book. The central limit theorem.

Take any population, no matter how weirdly distributed. Draw a large enough random sample, repeat the sampling process many times, and the distribution of sample means will look like a bell curve, centered on the true population mean. The theorem does not require the population itself to be normal. It does not require a huge sample — usually 30 is plenty. It works for incomes, heights, dice rolls, anything with a finite variance. It is the reason polls work, the reason clinical trials work, the reason quality control works, the reason weather forecasts work.

Wheelan describes it as "the wizard behind the curtain." Without it, none of the empirical work that has shaped the modern world — from election forecasts to drug approvals to economic indicators — would be possible. With it, you can take a small sample and say something defensible about a huge population. The variability of the sample mean shrinks as 1 over the square root of the sample size. To halve the margin of error, you must quadruple the sample. To go from 3% to 0.3%, you need 100 times the data. Diminishing returns set in early.

The standard error is the standard deviation of the sampling distribution. It is the "wiggle room" in a sample mean. A poll that says "52% plus or minus 3%" is reporting a sample statistic (52%) with its standard error (3%). The plus-or-minus is the humility. A poll that reports "52%" with no margin is hiding the same humility behind false precision.

This is the engine that turns a sample into a population claim. Once you have internalized the central limit theorem, you have internalized most of what statistics has to say about the empirical world.


🎙️ Inference — From a Sample to a Defensible Claim

The central limit theorem gives you the shape. Inference is the discipline of using that shape to make a claim you can defend.

A confidence interval is a range constructed by a procedure that captures the true population value 95% of the time, in the long run. The 95% refers to the procedure, not to the specific interval you just computed. The true value is either in this interval or it is not, and we don't know which. We only know the procedure is right 95% of the time across many samples. This subtle but important distinction trips up most non-statisticians (and most journalists). Wheelan handles it well.

A p-value is the probability, assuming the null hypothesis is true, of seeing data at least as extreme as the data you observed. If that probability is below 0.05, the result is "statistically significant." The 0.05 threshold is a convention set by Ronald Fisher in the 1920s. It is a habit, not a law of nature. And when researchers run many tests on the same data, the multiple comparisons problem guarantees false positives.

Wheelan's exhibit A is the 2009 study by Bennett, Wolford, and Miller. They placed a dead Atlantic salmon in an fMRI machine, showed it emotional photographs, and ran a standard analysis. The result showed significant brain activity in the dead fish. The point: running 130,000 comparisons on a single brain scan will turn up "activity" by chance, even in dead tissue.

This is the most vivid image in the book, and it is now a teaching tool in statistics courses worldwide. The lesson is durable: a single p-value of 0.04 in a study that ran 50 tests is worthless. Always ask how many comparisons were run. Always correct for multiple testing.


🎙️ Regression Analysis — The Workhorse, And Its Dangers

The last third of the book is about regression analysis, the workhorse of modern empirical work. Regression isolates the relationship between an outcome variable and one or more predictors while holding the others constant. The slope of the fitted line is the estimated effect of the predictor on the outcome, controlling for the other variables in the model.

The chapter is short, and the book is the weaker for it. Regression is the most-used tool in the empirical world, and Wheelan gives it only two chapters. But he does catalogue the seven common ways it goes wrong: linear fit on a nonlinear relationship, correlation confused with causation, reverse causation, omitted variable bias, multicollinearity, extrapolation beyond the data, and outliers. The seventh is the deadliest. Omitted variable bias — failing to include a key confounder in the model — quietly attributes the missing variable's effect to whatever is in the model. The resulting coefficient is biased, and the bias is invisible.

The Women's Health Initiative is Wheelan's primary cautionary tale. The estrogen catastrophe is the canonical case of an omitted variable in observational research: the women who took estrogen after menopause were also, on average, healthier, wealthier, and more engaged with the healthcare system than the women who did not. The regression that attributed the lower heart disease rate to the hormone was actually attributing it to those omitted variables. The randomized trial revealed the truth. People died.

Wheelan's recommendation: treat any regression result as a hypothesis to be tested, not a conclusion to be deployed. Especially when the policy stakes are high.


🎙️ Final Thoughts

Here is what I ultimately think about Naked Statistics.

It is not the most rigorous book on statistics. If you want deep dives into the math, you need a textbook. It is not the most ambitious. If you want modern causal inference, you need The Book of Why. It is not the most philosophically sophisticated. If you want the critique of the entire empirical enterprise, you need The Black Swan.

But Naked Statistics may be the most useful book on statistics for the general reader. It is the one I would give to a math-anxious journalist. It is the one I would recommend to a manager drowning in dashboards. It is the one I would put in the hands of anyone who has ever said "I am not a numbers person" and then watched themselves manipulated by a poll or a study.

The book's central message is genuinely empowering: the numbers you encounter every day are not magic. They are the product of choices — about samples, about models, about significance thresholds, about confounders. Those choices can be good or bad. You do not need a PhD to recognize the difference. You need the mental models Wheelan provides: the mean is not the middle, the standard deviation is half the story, correlation is not causation, your gut is wrong about probability, the central limit theorem is the reason the world is knowable, and a regression coefficient is a hypothesis, not a prescription.

These are durable ideas. They will be correct ten years from now and fifty years from now. They are the operating system of empirical reasoning, and most adults in 2026 still do not have them installed.

This has been a BookAtlas narration of Naked Statistics by Charles Wheelan. Read it. Then read a poll. Then read a study. Then read a regression. You will see things you have never seen before.

Outro: If this episode helped you look at a number with a little more skepticism, share it with someone who needs it. We will be back next week with another book worth knowing.