Introduction to Probability
A Modern Introduction Built on Harvard's Celebrated Stat 110 Course
reading path: overview → analysis → narration
overview
Overview
Introduction to Probability by Joseph K. Blitzstein and Jessica Hwang is a modern, story-driven introduction to probability built from Harvard's celebrated Stat 110 lectures. First published in 2014 (ISBN 9781466575578) by Chapman and Hall / CRC Press and updated in a second edition in 2019 (ISBN 9781138369917), the book has become one of the most widely adopted probability textbooks of its generation: it is used at Harvard, Stanford, MIT, and many other institutions, and it is the basis of the popular edX online course.
The book is distinctive for three things: a story-based pedagogy that motivates every distribution and identity with a memorable narrative, an unmatched collection of story proofs that build deep intuition before formal derivation, and a steady integration of simulation in R so that students see the math in action.
| Part | Chapters | Topics |
|------|----------|-------|
| Foundations | 1-2 | Probability and counting; conditional probability and Bayes' theorem |
| Discrete Probability | 3-4 | Random variables, PMFs, expectation, classical discrete families |
| Continuous Probability | 5-6 | Densities, the Uniform universality, moments and MGFs |
| Multivariate | 7-9 | Joint, marginal, and conditional distributions; transformations; conditional expectation |
| Limit Theorems | 10 | Inequalities, LLN, CLT, Chi-Square, Student-t |
| Stochastic Processes | 11-13 | Markov chains, MCMC, Poisson processes |
Key Takeaways
- Probability is a language, not a body of formulas. The book opens with Pebble World, in which equally likely outcomes are represented as pebbles, and uses that picture relentlessly to demystify conditioning, independence, and counting.
- Conditioning is the single most important problem-solving tool. Chapter 2 hammers home that nearly every hard probability problem becomes tractable once you condition on the right event. Bayes' rule and the law of total probability flow from this idea, and Chapter 9 extends it to expectation and variance with Adam's law and Eve's law.
- Distributions are stories. The Binomial is the number of successes in $n$ independent trials. The Geometric is the number of failures before the first success (the book calls the count of trials, including the success, the First Success distribution). The Poisson counts rare events. The Normal emerges from sums. The Beta governs probabilities. The Gamma governs waiting times. The book never lets you forget the narrative behind each family.
- The Uniform is universal. Chapter 5 shows that any continuous distribution can be sampled by pushing a Uniform(0,1) through the inverse CDF, a unifying perspective that underlies simulation.
- Indicator random variables are the bridge from counting to expectation. The "fundamental bridge" $\mathbb{E}[I_A] = P(A)$ turns problems about expected counts into one-line calculations, and is the engine behind linearity-of-expectation proofs.
- Moment generating functions are the right tool for sums of independent random variables. Adding independent random variables convolves their distributions, which is hard; multiplying their MGFs is easy. The book makes this trade-off concrete and applies it to Poisson, Chi-Square, and Normal sums.
- Conditional expectation is a random variable, not a number. Once you treat $E(X \mid Y)$ as an rv, you get the tower property, Adam's law $E(X) = E(E(X \mid Y))$, its variance companion Eve's law $\mathrm{Var}(X) = E(\mathrm{Var}(X \mid Y)) + \mathrm{Var}(E(X \mid Y))$, and the foundation for modern statistics.
- The Central Limit Theorem is the bridge from finite samples to statistical inference. The book states the Lindeberg-Levy CLT cleanly and connects it to the Normal approximation of the Binomial and Poisson, the workhorse behind confidence intervals and hypothesis tests.
- Markov chains give you a general-purpose way to sample from hard distributions. The Metropolis-Hastings algorithm, presented in Chapter 12, is one of the most important algorithms of the twentieth century. The book derives it from scratch using reversibility.
- Poisson processes model arrivals in continuous time. Chapter 13 ties together the Exponential (interarrival times), the Poisson (counts in a window), and the extension to two-dimensional Poisson processes, giving a complete picture of how random events unfold in space and time.
Who Should Read
| Reader Type | Why |
|---|---|
| Undergraduates in stats, math, CS, or engineering | The clearest, most motivating first course in probability |
| Data scientists and ML practitioners | A rigorous foundation for Bayesian methods, MCMC, and inference |
| Self-learners | Stat 110's free YouTube lectures pair chapter-for-chapter with the book |
| Graduate students in non-probability fields | A more readable alternative to measure-theoretic texts |
| Quantitative finance, genetics, and physics students | The stories and applications are the best in the field |
Who Should Skip
- Readers with no calculus background — at minimum single-variable calculus and basic matrix algebra are assumed
- Those seeking a measure-theoretic treatment — the book is rigorous but stops short of formal measure theory
- Practitioners who only want recipes — the book insists on understanding, not just formulas
- Readers looking for deep coverage of statistical inference, linear models, or Bayesian computation (use a follow-up text like Bayesian Data Analysis or Statistical Inference)
Why This Book Matters
Probability is the foundation of modern statistics, data science, and machine learning. Yet most introductory textbooks present it as a wall of formulas to be memorized. Introduction to Probability does the opposite: it builds intuition first through stories, then formalizes with proofs, then grounds everything in simulation.
Joseph Blitzstein's Stat 110 lectures at Harvard have enrolled hundreds of students per year and reached millions of viewers online. The book is the printed, polished, expanded version of those lectures — co-authored with Jessica Hwang of Harvey Mudd College, whose research in stochastic processes and pedagogy strengthens the later chapters.
If you read one probability textbook, read this one.
Related Books
| Book | Author | Connection |
|------|--------|------------|
| A First Course in Probability | Sheldon Ross | Classic alternative; more theorem-proof oriented |
| Probability and Statistics | DeGroot and Schervish | Bayesian-flavored competitor with more inference |
| Statistical Inference | Casella and Berger | Standard graduate-level follow-up |
| Bayesian Data Analysis | Gelman et al. | The natural next step for Bayesian methods |
| Statistical Rethinking | Richard McElreath | Applied Bayesian inference with R |
| Naked Statistics | Charles Wheelan | Accessible non-technical prelude |
| How Not to Be Wrong | Jordan Ellenberg | Popular math with probability threads |
Final Verdict
Introduction to Probability is the rare textbook that is both rigorous and a pleasure to read. The story-based pedagogy is not a gimmick — it is the most effective way to internalize the definitions and connections that make probability intuitive. The book rewards close study and re-reading, and its R-based simulations make the abstract concrete.
Rating: 9.5/10 — The best single-volume introduction to probability in print. The default choice for any first course in probability and a worthy addition to any quantitative practitioner's shelf.
content map
Pebble World: The Foundational Picture
The book opens with the Pebble World mental model. Every random experiment is a sample space filled with equally likely pebbles. Probabilities are fractions of pebbles. Conditioning is carving out a sub-sample. Independence is having two carvings of the space intersect in proportion to the product of their sizes.
flowchart LR
subgraph S["Sample Space S (all pebbles)"]
A["Event A"]
B["Event B"]
AB["A ∩ B"]
end
S --> |"P(A) = |A|/|S|"| A
S --> |"P(B) = |B|/|S|"| B
A --> |"P(B|A) = |A∩B|/|A|"| AB
B --> |"P(A|B) = |A∩B|/|B|"| AB
style S fill:#dae8fc,stroke:#6c8ebf
style A fill:#d5e8d4,stroke:#82b366
style B fill:#ffe6cc,stroke:#d79b00
style AB fill:#e1d5e7,stroke:#9673a6
The book makes a deliberate point: even when the pebbles are not equally likely (the "non-naive" definition), the entire probability machinery still works. Pebble World is a teaching tool, not a restriction.
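The pebble picture translates directly into counting. Below is a minimal sketch, assuming a two-dice sample space chosen purely for illustration (it is not taken from the book); the helper names `prob` and `cond_prob` are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Sample space: 36 equally likely pebbles, one per ordered pair of dice rolls.
S = set(product(range(1, 7), repeat=2))

def prob(event):
    """Naive probability: the fraction of pebbles that lie in the event."""
    return Fraction(len(event), len(S))

def cond_prob(a, b):
    """P(A | B): keep only the pebbles in B, then count those also in A."""
    return Fraction(len(a & b), len(b))

A = {s for s in S if s[0] + s[1] == 8}   # the sum is 8
B = {s for s in S if s[0] % 2 == 0}      # the first die is even
C = {s for s in S if s[1] % 2 == 0}      # the second die is even

p_a = prob(A)
p_a_given_b = cond_prob(A, B)
# Independence: the intersection has probability equal to the product.
independent_bc = prob(B & C) == prob(B) * prob(C)
```

Using `Fraction` keeps every probability exact, so the independence check is an equality rather than a floating-point comparison.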
Bayes' Theorem and the Law of Total Probability
Two identities carry almost the entire book:
```
P(A | B) = P(B | A) * P(A) / P(B)
P(B) = sum_i P(B | A_i) * P(A_i)
```
(where the events A_i form a partition of the sample space)
```mermaid
flowchart TD
HYP["Hypothesis A<br/>(prior)"]
EVD["Evidence B<br/>(likelihood)"]
POST["Posterior P(A|B)"]
NORM["Normalizing constant P(B)"]
HYP --> |"P(A)"| NORM
EVD --> |"P(B|A)"| NORM
NORM --> POST
HYP --> |"P(A)"| POST
EVD --> |"P(B|A)"| POST
style HYP fill:#d5e8d4,stroke:#82b366
style EVD fill:#ffe6cc,stroke:#d79b00
style POST fill:#e1d5e7,stroke:#9673a6
style NORM fill:#f8cecc,stroke:#b85450
```
The book's signature move is conditioning as a problem-solving tool: if a problem looks hard, condition on something. Adam's law and Eve's law (Chapter 9) carry the same idea over to expectation and variance, showing how conditioning on a well-chosen rv reduces a hard calculation to a one-line identity.
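Bayes' rule and the law of total probability fit in a few lines of code. The sketch below is illustrative only: the function name `posterior` and the diagnostic-test numbers (1% prevalence, 95% sensitivity, 5% false-positive rate) are assumptions made for the example, not figures from the book.

```python
def posterior(prior, likelihoods):
    """Bayes' rule over a partition A_1, ..., A_n.

    prior[i] is P(A_i) and likelihoods[i] is P(B | A_i).
    Returns the list of posteriors P(A_i | B).
    """
    # Law of total probability: P(B) = sum_i P(B | A_i) * P(A_i)
    p_b = sum(l * p for l, p in zip(likelihoods, prior))
    return [l * p / p_b for l, p in zip(likelihoods, prior)]

# Hypothetical numbers for a diagnostic test.
prior = [0.01, 0.99]          # P(disease), P(no disease)
likelihoods = [0.95, 0.05]    # P(positive | disease), P(positive | no disease)
post = posterior(prior, likelihoods)
# post[0] is P(disease | positive); it stays well below the 95% sensitivity
# because the prior is small.
```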
Distributions Are Stories
The book organizes every named distribution around a story — a sentence that defines it without formulas.
mindmap
  root((Named Distributions))
    Discrete
      Bernoulli["1 trial, success/fail"]
      Binomial["n trials, count successes"]
      Geometric["failures before 1st success"]
      NBin["failures before r-th success"]
      Hypergeometric["sampling without replacement"]
      Poisson["rare events in time/space"]
    Continuous
      Uniform["equal density on interval"]
      Normal["sum of many small effects"]
      Exponential["waiting time for Poisson event"]
      Gamma["sum of Exponential wait times"]
      Beta["distribution on probabilities"]
    Joint
      Multinomial["n trials, k categories"]
      MNormal["generalization of Normal"]
    Limit
      ChiSq["sum of squared Normals"]
      StudentT["Normal / sqrt(ChiSq/k)"]
| Family | Story | PMF/PDF | Mean | Variance |
|---|---|---|---|---|
| Bernoulli($p$) | Single success/fail trial | $p^x (1-p)^{1-x}$ | $p$ | $p(1-p)$ |
| Binomial($n,p$) | Count successes in $n$ trials | $\binom{n}{k} p^k (1-p)^{n-k}$ | $np$ | $np(1-p)$ |
| Geometric($p$) | Failures before 1st success (the book's convention) | $(1-p)^{k} p$ | $(1-p)/p$ | $(1-p)/p^2$ |
| Poisson($\lambda$) | Rare events at rate $\lambda$ | $e^{-\lambda} \lambda^k / k!$ | $\lambda$ | $\lambda$ |
| Uniform($a,b$) | Equal density on $[a,b]$ | $1/(b-a)$ | $(a+b)/2$ | $(b-a)^2 / 12$ |
| Normal($\mu, \sigma^2$) | Sum of many small effects | $\frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / (2\sigma^2)}$ | $\mu$ | $\sigma^2$ |
| Exponential($\lambda$) | Wait for next Poisson event | $\lambda e^{-\lambda x}$ | $1/\lambda$ | $1/\lambda^2$ |
| Gamma($k, \theta$) (shape $k$, scale $\theta$) | Sum of $k$ Exponential wait times | $\frac{x^{k-1} e^{-x/\theta}}{\Gamma(k)\,\theta^k}$ | $k\theta$ | $k\theta^2$ |
| Beta($a, b$) | Distribution on probabilities | $\frac{1}{B(a,b)} x^{a-1} (1-x)^{b-1}$ | $a/(a+b)$ | $ab / ((a+b)^2(a+b+1))$ |
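The mean and variance columns can be checked directly from a PMF. This is a small sketch for two of the discrete rows (Binomial and Poisson); the helper `mean_var` and the parameter values are illustrative assumptions.

```python
from math import comb, exp, factorial

def mean_var(pmf, support):
    """Mean and variance computed by summing over a finite support."""
    mean = sum(k * pmf(k) for k in support)
    var = sum((k - mean) ** 2 * pmf(k) for k in support)
    return mean, var

n, p, lam = 10, 0.3, 2.5  # illustrative parameter values

def binom_pmf(k):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def pois_pmf(k):
    return exp(-lam) * lam**k / factorial(k)

binom_mean, binom_var = mean_var(binom_pmf, range(n + 1))  # expect np and np(1-p)
# The Poisson support is infinite; truncating at 100 leaves a negligible tail here.
pois_mean, pois_var = mean_var(pois_pmf, range(100))       # expect lam and lam
```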
The Universality of the Uniform
A foundational insight: to sample from a continuous distribution whose CDF $F$ is strictly increasing on its support, take $U \sim \text{Uniform}(0,1)$ and compute $F^{-1}(U)$; the result has CDF $F$. Conversely, if $X$ has CDF $F$, then $F(X) \sim \text{Uniform}(0,1)$. This is the engine behind much of simulation.
flowchart LR
U["U ~ Uniform(0,1)"] --> |"F⁻¹"| X["X ~ F<br/>(any continuous)"]
U2["Uniform samples"] --> |"Inverse CDF"| X2["Exponential, Normal, Beta, ..."]
style U fill:#d5e8d4,stroke:#82b366
style X fill:#dae8fc,stroke:#6c8ebf
style U2 fill:#d5e8d4,stroke:#82b366
style X2 fill:#e1d5e7,stroke:#9673a6
By the same token, the order statistics of iid Uniform(0,1) rvs have Beta distributions (the $j$-th smallest of $n$ is Beta($j$, $n-j+1$)), and the range has a similarly clean description.
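As a concrete instance of universality, the Exponential($\lambda$) CDF $F(x) = 1 - e^{-\lambda x}$ inverts to $F^{-1}(u) = -\ln(1-u)/\lambda$. The sketch below uses only Python's standard library; the function name `sample_exponential`, the seed, and the parameter values are assumptions made for the example.

```python
import math
import random

def sample_exponential(lam, rng):
    """Draw from Exponential(lam) as F^{-1}(U) with F(x) = 1 - exp(-lam * x)."""
    u = rng.random()                  # U is uniform on [0, 1)
    return -math.log(1.0 - u) / lam   # 1 - u lies in (0, 1], so the log is defined

rng = random.Random(110)  # fixed seed so the run is reproducible
lam = 2.0
draws = [sample_exponential(lam, rng) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)  # should land close to 1 / lam
```

In practice `random.expovariate` does this for you; writing the inverse CDF by hand is only meant to make the transformation visible.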
Linearity of Expectation and the Fundamental Bridge
Two of the most-used tools in the book:
Fundamental bridge: for an indicator $I_A$, $\mathbb{E}[I_A] = P(A)$.
Linearity: $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$, with no independence required.
flowchart TD
IND["Indicator I_A = 1 if A, else 0"]
PROB["P(A) = E[I_A]"]
COUNT["# of successes = sum of I's"]
EXP["E[sum] = sum of E[I] = sum of P"]
IND --> PROB --> COUNT --> EXP
style IND fill:#d5e8d4,stroke:#82b366
style PROB fill:#dae8fc,stroke:#6c8ebf
style COUNT fill:#ffe6cc,stroke:#d79b00
style EXP fill:#e1d5e7,stroke:#9673a6
This pair is the secret behind most of the book. A "how many expected matches in a deck of cards?" problem becomes: write an indicator for each card, sum, take expectation, done. No combinatorics required.
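For the card-matching question, each indicator $I_j$ (card $j$ lands in position $j$) has $\mathbb{E}[I_j] = 1/n$, so the expected number of matches is $n \cdot (1/n) = 1$ for any deck size. A quick simulation sketch, with a hypothetical helper `count_matches` and an arbitrary seed, should agree with that:

```python
import random

def count_matches(n, rng):
    """Shuffle n cards and count positions where card j sits in position j."""
    deck = list(range(n))
    rng.shuffle(deck)
    # The count is a sum of n indicator variables, one per position.
    return sum(1 for pos, card in enumerate(deck) if pos == card)

rng = random.Random(110)   # fixed seed for reproducibility
n, trials = 52, 20_000
avg_matches = sum(count_matches(n, rng) for _ in range(trials)) / trials
# Linearity of expectation predicts an average near 1, even though the
# indicators are dependent.
```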
Moment Generating Functions
The MGF of a random variable $X$ is $M_X(t) = \mathbb{E}[e^{tX}]$. The book's key observation: for independent rvs, the MGF of a sum is the product of the MGFs, which is far easier than convolving densities.
| Distribution | MGF | Why Useful |
|---|---|---|
| Normal($\mu, \sigma^2$) | $e^{\mu t + \sigma^2 t^2 / 2}$ | Sum of independent Normals is Normal |
| Poisson($\lambda$) | $e^{\lambda(e^t - 1)}$ | Sum of independent Poissons is Poisson |
| Gamma($k, \theta$) | $(1 - \theta t)^{-k}$ for $t < 1/\theta$ | Sum of independent Gammas with the same scale $\theta$ is Gamma |
| Chi-Square($k$) | $(1 - 2t)^{-k/2}$ for $t < 1/2$ | Sum of independent Chi-Squares is Chi-Square |
flowchart LR
X1["X1 ~ Poisson(λ₁)"] --> SUM["X1 + X2 + ... + Xn"]
X2["X2 ~ Poisson(λ₂)"] --> SUM
Xn["Xn ~ Poisson(λₙ)"] --> SUM
SUM --> |"MGF product"| RESULT["Poisson(λ₁ + λ₂ + ... + λₙ)"]
style SUM fill:#dae8fc,stroke:#6c8ebf
style RESULT fill:#d5e8d4,stroke:#82b366
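The Poisson conclusion can be double-checked the hard way, by convolving PMFs, which is exactly the work the MGF argument avoids. A minimal sketch with illustrative rates; the function names are hypothetical.

```python
from math import exp, factorial

def pois_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

def pmf_of_sum(k, lam1, lam2):
    """P(X1 + X2 = k) for independent Poissons, by direct convolution."""
    return sum(pois_pmf(j, lam1) * pois_pmf(k - j, lam2) for j in range(k + 1))

lam1, lam2 = 1.5, 2.0
# Largest discrepancy between the convolution and the Poisson(lam1 + lam2) PMF.
max_gap = max(
    abs(pmf_of_sum(k, lam1, lam2) - pois_pmf(k, lam1 + lam2)) for k in range(30)
)
```

The gap is at the level of floating-point rounding, matching the one-line MGF computation $e^{\lambda_1(e^t-1)} e^{\lambda_2(e^t-1)} = e^{(\lambda_1+\lambda_2)(e^t-1)}$.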
Markov Chains and Stationarity
A Markov chain is a sequence of states where the future depends on the past only through the present. The book formalizes this with a transition matrix $P$ and asks: when does the chain settle into a stationary distribution $\pi$ such that $\pi P = \pi$?
stateDiagram-v2
[*] --> Sunny
Sunny --> Sunny : 0.9
Sunny --> Rainy : 0.1
Rainy --> Sunny : 0.5
Rainy --> Rainy : 0.5
For the chain above, $\pi = (5/6, 1/6)$: in the long run, 5 out of 6 days are sunny. The book derives this from $\pi P = \pi$ and $\sum \pi_i = 1$.
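One way to see the stationary distribution emerge is to push any starting distribution through the transition matrix repeatedly. This sketch uses plain lists rather than a linear-algebra library; the function name `step` is an assumption for the example.

```python
# Transition matrix for the two-state weather chain (rows sum to 1).
P = [[0.9, 0.1],   # from Sunny: to Sunny, to Rainy
     [0.5, 0.5]]   # from Rainy: to Sunny, to Rainy

def step(dist, P):
    """One step of the chain: row vector times transition matrix."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

dist = [0.0, 1.0]  # start on a rainy day with certainty
for _ in range(200):
    dist = step(dist, P)
# dist is now numerically indistinguishable from (5/6, 1/6),
# and applying one more step leaves it unchanged.
```

Iterating works here because this chain is irreducible and aperiodic; solving $\pi P = \pi$ with $\sum \pi_i = 1$ directly gives the same answer.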
The classification of states (transient, recurrent, periodic, ergodic) and the reversibility criterion $\pi_i P_{ij} = \pi_j P_{ji}$ are presented cleanly and used to analyze the Google PageRank algorithm as a real-world example.
Markov Chain Monte Carlo (MCMC)
Some distributions are too complex to sample from directly. The Metropolis-Hastings algorithm constructs a Markov chain whose stationary distribution is the target, then runs the chain until it mixes.
flowchart TD
INIT["Start at x₀"]
PROP["Propose x' from Q(x'|x)"]
ACC["Accept with prob<br/>min(1, (π(x')Q(x|x')) / (π(x)Q(x'|x)))"]
REJ["Reject and stay at x"]
NEXT["x ← x' (or x)"]
MIX["Run long enough to mix"]
SAMP["x_m, x_{m+1}, ... are samples from π"]
INIT --> PROP
PROP --> ACC
ACC --> |"accept"| NEXT
ACC --> |"reject"| REJ
REJ --> NEXT
NEXT --> MIX --> SAMP
style INIT fill:#d5e8d4,stroke:#82b366
style ACC fill:#ffe6cc,stroke:#d79b00
style SAMP fill:#e1d5e7,stroke:#9673a6
MCMC is the engine behind modern Bayesian inference. The book gives the cleanest treatment available at this level.
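The accept/reject loop is short enough to sketch in full. The example below is a toy, not the book's code: it targets a hypothetical four-state distribution known only up to a constant, and it uses a symmetric proposal, so the acceptance ratio reduces to $\pi(x')/\pi(x)$ (the Metropolis special case of Metropolis-Hastings).

```python
import random

weights = [1.0, 2.0, 3.0, 4.0]   # unnormalized target on states 0..3 (hypothetical)
rng = random.Random(110)         # fixed seed for reproducibility

def mh_step(x):
    # Symmetric proposal on a cycle: move one state left or right.
    proposal = (x + rng.choice([-1, 1])) % len(weights)
    # Because Q is symmetric, the Q terms cancel in the acceptance ratio.
    accept_prob = min(1.0, weights[proposal] / weights[x])
    return proposal if rng.random() < accept_prob else x

x, burn_in, n_samples = 0, 1_000, 200_000
counts = [0] * len(weights)
for t in range(burn_in + n_samples):
    x = mh_step(x)
    if t >= burn_in:          # discard early draws taken before the chain mixes
        counts[x] += 1

freqs = [c / n_samples for c in counts]   # compare with 0.1, 0.2, 0.3, 0.4
```

Note that the normalizing constant (here 10) never appears in the algorithm, which is what makes the method useful for Bayesian posteriors.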
The Central Limit Theorem
The book states the Lindeberg-Levy CLT: for iid rvs $X_1, \ldots, X_n$ with mean $\mu$ and variance $\sigma^2$, the standardized sum
$Z_n = \frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma \sqrt{n}}$
converges in distribution to a standard Normal.
flowchart LR
SAMPLE["X₁, X₂, ..., Xₙ<br/>(any distribution)"] --> SUM["Sₙ = Σ Xᵢ"]
SUM --> |"standardize"| Z["Zₙ = (Sₙ - nμ) / (σ√n)"]
Z --> |"n → ∞"| N["Standard Normal N(0,1)"]
style SAMPLE fill:#d5e8d4,stroke:#82b366
style SUM fill:#dae8fc,stroke:#6c8ebf
style Z fill:#ffe6cc,stroke:#d79b00
style N fill:#e1d5e7,stroke:#9673a6
The CLT explains why the Normal appears everywhere in practice (sum of small effects) and underpins confidence intervals and hypothesis tests in statistics.
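A simulation makes the convergence tangible. The sketch below standardizes sums of Exponential(1) draws (mean 1, standard deviation 1) and compares the empirical value of $P(Z_n \le 1)$ with the standard Normal CDF; the choice of distribution, `n`, and seed are assumptions made for illustration.

```python
import math
import random

rng = random.Random(110)   # fixed seed for reproducibility
n, reps = 50, 20_000
mu, sigma = 1.0, 1.0       # mean and standard deviation of Exponential(1)

def standardized_sum():
    s = sum(rng.expovariate(1.0) for _ in range(n))
    return (s - n * mu) / (sigma * math.sqrt(n))

z = [standardized_sum() for _ in range(reps)]
frac_below_1 = sum(1 for v in z if v <= 1.0) / reps
phi_1 = 0.5 * (1 + math.erf(1 / math.sqrt(2)))   # standard Normal CDF at 1
# frac_below_1 and phi_1 should be close, even though each X_i is heavily skewed.
```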
Poisson Processes
A Poisson process with rate $\lambda$ is a continuous-time stream of events with two properties: the numbers of events in disjoint intervals are independent, and the count in an interval of length $t$ is Poisson($\lambda t$).
| Quantity | Distribution | Mean |
|---|---|---|
| Count in window of length $t$ | Poisson($\lambda t$) | $\lambda t$ |
| Time until 1st event | Exponential($\lambda$) | $1/\lambda$ |
| Time until $k$-th event | Gamma($k, 1/\lambda$) | $k / \lambda$ |
| Given count $n$ in window, event times | Order stats of $n$ iid Uniform(0,$t$) | n/a |
The book generalizes to 2D and higher, where Poisson processes become models for spatial point patterns (genetics, ecology, epidemiology).
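The link between Exponential interarrival times and Poisson counts can be checked by simulation. In this sketch the function name `count_arrivals`, the rate, the window length, and the seed are all illustrative assumptions.

```python
import random

def count_arrivals(lam, t, rng):
    """Count events in [0, t] by adding up iid Exponential(lam) interarrival times."""
    clock, count = rng.expovariate(lam), 0
    while clock <= t:
        count += 1
        clock += rng.expovariate(lam)
    return count

rng = random.Random(110)   # fixed seed for reproducibility
lam, t, reps = 3.0, 2.0, 20_000
counts = [count_arrivals(lam, t, rng) for _ in range(reps)]
mean = sum(counts) / reps
var = sum((c - mean) ** 2 for c in counts) / reps
# For a Poisson(lam * t) count, both the mean and the variance should be near lam * t.
```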
Key Lessons
- Stories are first-class objects. When you remember that the Binomial is "count of successes in $n$ trials" and not a formula, every problem becomes a recognition task.
- Condition first, compute second. Almost every hard problem becomes easy after the right conditioning step.
- Indicators turn counts into expectations. Linearity of expectation is your most powerful hammer.
- MGFs turn sums into products. Whenever a problem involves independent sums, reach for MGFs.
- Markov chains are everywhere. PageRank, MCMC, queueing, genetics — all Markov chains.
Practical Applications
- Statistics and inference: confidence intervals, hypothesis tests, regression, and the foundations of Bayesian inference
- Data science and machine learning: Bayesian methods, MCMC, expectation-maximization, hidden Markov models
- Computer science: PageRank, randomized algorithms, hashing analysis, communication channels
- Genetics and biology: population genetics, phylogenetic trees, neural spike trains
- Finance and economics: option pricing, risk models, queueing
- Physics and engineering: reliability, signal processing, stochastic simulation
analysis
Strengths
- Story-based pedagogy is transformative. The book never presents a distribution or identity without first giving the narrative that motivates it. Once you remember that the Poisson is "rare events at a constant rate," the formula is almost redundant. This is the most effective introductory probability book ever written for memory and intuition.
- Story proofs are a unique and powerful device. Blitzstein's signature technique — proving a result by constructing an intuitive story that does the work for you — turns abstract arguments into mental pictures. The birthday-problem story proof and the negative-hypergeometric / hypergeometric identity are unforgettable.
- Unmatched range of modern applications. PageRank, MCMC, genetics, information theory, the Monty Hall problem, the birthday problem, the St. Petersburg paradox — the book draws examples from everywhere probability actually shows up.
- R integration is genuine, not cosmetic. Every chapter ends with a short R section that simulates the chapter's main idea. Students see the math in action and learn a useful tool simultaneously.
- Exercises are excellent. Hundreds of carefully chosen problems, ranging from quick checks to genuine challenges. Selected solutions are freely available on Blitzstein's website.
- Lectures are free. The Stat 110 YouTube and edX videos cover the book chapter by chapter. The book plus the videos is one of the best self-study sequences in mathematics.
- The two authors complement each other. Blitzstein brings the lecture-tested pedagogy and Bayesian/MCMC expertise; Hwang brings the stochastic-processes and applications depth. The result is a balanced, modern book.
Weaknesses
- Not measure-theoretic. The book is rigorous by the standards of an undergraduate text but stops short of formal measure theory. Readers who need Lebesgue integration and sigma-algebras for graduate work must look elsewhere.
- Statistical inference is light. The book covers the Central Limit Theorem and a glimpse of Chi-Square and Student-t, but does not develop confidence intervals, hypothesis testing, or linear models in depth. A follow-up statistics text is necessary.
- Poisson processes are condensed. Chapter 13 is the shortest chapter on the longest topic. Readers who need deep coverage of renewal theory, Brownian motion, or stochastic calculus must supplement.
- Some advanced topics are marked with stars. Sections like "Using probability and expectation to prove existence" and "Geometric interpretation of conditional expectation" are flagged as optional. They are interesting but break the otherwise steady pace.
- Exercises occasionally rely on material from the Stat 110 problem sets rather than the book. Some problems are easier with the lecture videos alongside.
- The 1st edition has a handful of typos and unclear passages. The 2nd edition (2019) fixes most of these. Get the 2nd edition if possible.
Criticism
- Story proofs can feel hand-wavy to formalists. A mathematically trained reader may want the rigorous argument alongside the story. The book provides the rigorous version, but the stories sometimes take the spotlight.
- The "modern intro" framing leaves out historical context. The development of probability from Pascal and Fermat to Kolmogorov is mentioned in passing, not developed. A reader interested in the history must look elsewhere.
- The Bayesian emphasis is strong. The book is not dogmatic, but Bayes and conditioning get more airtime than frequentist inference. For a more balanced introduction, consider pairing with Casella and Berger.
- Some examples are overused. The "spike" birthday problem, the Monty Hall problem, and the 100-prisoners problem appear in nearly every probability book. The book uses them well, but they will be familiar to readers with any prior exposure.
Comparison
| Book | Author | Focus |
|------|--------|-------|
| Introduction to Probability | Blitzstein and Hwang | Story-based, modern, applications |
| A First Course in Probability | Sheldon Ross | Classical, more theorem-proof |
| Probability and Statistics | DeGroot and Schervish | Bayesian and decision-theoretic |
| Introduction to Probability Models | Sheldon Ross | Stochastic processes focus |
| Statistical Inference | Casella and Berger | Graduate-level foundations |
| Probability | Jim Pitman | Combinatorial, very story-rich alternative |
Blitzstein-Hwang is the best choice for a first course in probability and for self-study. Ross is the most popular alternative. Casella-Berger is the standard graduate follow-up.
Final Assessment
| Dimension | Rating | Notes |
|-----------|--------|-------|
| Comprehensiveness | 9/10 | Covers the full undergraduate canon plus MCMC and Poisson processes |
| Mathematical Rigor | 8/10 | Rigorous by undergraduate standards; not measure-theoretic |
| Clarity | 10/10 | The clearest probability book ever written |
| Practical Utility | 9/10 | The R sections and applications make the book immediately useful |
| Exercise Quality | 9/10 | Wide range, well-graded, free solutions available |
| Longevity | 9/10 | Core probability has not changed; the book will be standard for decades |
| Overall | 9.5/10 | The default textbook for any first course in probability |
narration
Introduction
Welcome to BookAtlas. Today: Introduction to Probability by Joseph K. Blitzstein and Jessica Hwang. Published 2014, second edition 2019, by Chapman and Hall / CRC Press. Roughly 610 pages. The book that grew out of Harvard's Stat 110 — and the modern default for learning probability.
If you have ever opened a probability textbook and been beaten into submission by a wall of formulas, this is the book you wish you had. Blitzstein and Hwang teach you probability the way it should be taught: with stories, with intuition, and with code that makes the math come alive.
Why Stories
Curious: Probability has a reputation for being dry. Why does this book feel different?
Reader: Because it refuses to treat probability as a list of formulas. Every named distribution is introduced as a story. The Binomial is the number of successes in $n$ trials. The Poisson is the count of rare events in a window of time. The Normal is the sum of many small effects. The Beta is the distribution of an unknown probability.
Once you know the story, the formula is almost redundant. And once you can recall a dozen stories cold, problems stop being exercises in formula manipulation and become a matter of recognizing which story applies.
Curious: What about the proofs? Probability proofs can be intimidating.
Reader: The book has a signature move: story proofs. Instead of grinding through algebra, Blitzstein constructs a single intuitive narrative that does the work. The classic example is proving that the distribution of waiting times in a Poisson process is the Exponential. No integration. Just a memoryless story. You finish the proof and feel like you understand why, not just that.
The Map of the Book
flowchart TD
subgraph Foundations["Ch 1-2: Foundations"]
C1["Probability and counting"]
C2["Conditional probability and Bayes"]
end
subgraph Discrete["Ch 3-4: Discrete"]
C3["Random variables and PMFs"]
C4["Expectation and the bridge"]
end
subgraph Continuous["Ch 5-6: Continuous"]
C5["Densities and the Uniform"]
C6["Moments and MGFs"]
end
subgraph Multivariate["Ch 7-9: Multivariate"]
C7["Joint, marginal, conditional"]
C8["Transformations"]
C9["Conditional expectation"]
end
subgraph Limits["Ch 10: Limits"]
C10["LLN, CLT, Chi-Square, t"]
end
subgraph Processes["Ch 11-13: Processes"]
C11["Markov chains"]
C12["MCMC (Metropolis-Hastings)"]
C13["Poisson processes"]
end
Foundations --> Discrete --> Continuous --> Multivariate --> Limits --> Processes
Reader: The arc is deliberate. Chapters 1-2 build the language and the most important problem-solving tool — conditioning. Chapters 3-6 introduce the discrete and continuous families, with the Universality of the Uniform bridging them. Chapters 7-9 handle multivariate and conditional expectation. Chapter 10 is the limit theorems that connect everything to statistics. Chapters 11-13 are the modern payoff: Markov chains, MCMC, and Poisson processes.
Curious: Why MCMC in an intro book?
Reader: Because MCMC is one of the most important algorithms of the past fifty years. If you are doing Bayesian inference on a nontrivial model, you are very likely running MCMC. If you are sampling with Stan or PyMC, you are running MCMC. Blitzstein and Hwang give you a clean derivation: build a Markov chain whose stationary distribution is the one you want to sample from, and let the chain run.
Pebble World
Reader: The first chapter introduces a picture that runs through the whole book: Pebble World. Imagine a sample space filled with pebbles. If all the pebbles are equally likely, probabilities are just fractions of pebbles. Conditioning is picking a sub-region. Independence is two regions whose intersection has the right proportion.
Curious: That sounds naive. What if the pebbles are not equally likely?
Reader: That is exactly the point of section 1.6 — the "non-naive" definition. Pebble World is a teaching tool, not a restriction. Even when probabilities are unequal, the same identities work. The picture is the anchor; the formal definition extends it.
Bayes' Theorem, Adam's Law, and Eve's Law
Curious: What is the single most useful idea in the book?
Reader: Conditioning. The book says it on page 1 of Chapter 2 and shows it on every subsequent page. Bayes' theorem, the law of total probability, Adam's law $E(X) = E(E(X \mid Y))$, and Eve's law $\mathrm{Var}(X) = E(\mathrm{Var}(X \mid Y)) + \mathrm{Var}(E(X \mid Y))$ are all conditioning in disguise.
Adam's law and Eve's law in Chapter 9 are the climax. The setting is prediction with partial information: the book shows that the conditional expectation of a random variable, given the information available, is the best prediction in the mean-squared-error sense. The math is just two lines. The intuition lasts a lifetime.
The Central Limit Theorem
Reader: Chapter 10 is the bridge from probability to statistics. The book states the Central Limit Theorem cleanly: sums of iid random variables, standardized, converge to a standard Normal. It does not matter whether the underlying distribution is Binomial, Exponential, Uniform, or anything else with finite variance.
Curious: Why does this matter?
Reader: Because the CLT is the reason the Normal distribution shows up everywhere in practice. A measurement is the sum of many small errors. A test statistic is the sum of many small effects. An average is the sum of many small contributions. The Normal emerges as the universal shape of aggregated randomness. The book also covers the Chi-Square and Student-t distributions in this chapter, which are the building blocks of confidence intervals and hypothesis tests.
The Verdict
Curious: Who is this book for?
Reader: Almost everyone learning probability for the first time. Undergraduates in statistics, mathematics, computer science, engineering, economics, physics, or any quantitative field. Self-learners working through Stat 110 online. Data scientists who want a rigorous foundation for Bayesian methods and MCMC. The prerequisite is one semester of calculus. No measure theory required.
Curious: Who is it not for?
Reader: Readers who already have a measure-theoretic background and want a graduate-level treatment. They should read Probability Theory by Varadhan or Probability with Martingales by Williams. Readers who want deep coverage of statistical inference, linear models, or causal inference will need a follow-up text. The book is also not a cookbook — practitioners who only want recipes will find the emphasis on understanding frustrating.
Curious: The 1st or 2nd edition?
Reader: The 2nd edition, almost certainly. Published 2019, it fixes typos, adds new examples and exercises, and adds online animations that pair with the chapters. The 1st edition is fine, but the 2nd is better in every measurable way.
Final Thoughts
Introduction to Probability is a masterclass in pedagogy. It takes a subject with a reputation for being abstract and makes it genuinely intuitive. The story-based approach is not a gimmick; it is the most efficient way to internalize the definitions and connections that make probability work. The R simulations make the abstract concrete. The exercises are the best in the field.
The book is also a quiet argument that the best way to learn mathematics is to learn its stories, not its formulas. That lesson generalizes far beyond probability.
This has been a BookAtlas narration of Introduction to Probability by Blitzstein and Hwang. Pair the book with the Stat 110 YouTube lectures. Work the exercises. Run the R simulations. And remember: when a probability problem feels hard, condition on something.
Thanks for listening.