Practical Statistics for Data Scientists
50+ Essential Concepts Using R and Python
sufficient
reading path: overview → analysis → narration
overview
Overview
Practical Statistics for Data Scientists (2nd ed., 2020) by Peter Bruce, Andrew Bruce, and Peter Gedeck is a curated tour of the statistical concepts that show up daily in data science work — and a deliberate exclusion of those that do not. Where a traditional statistics textbook covers 500 topics with equal weight, the Bruces distill the field to roughly 50 concepts that pay rent, each explained twice: once in R, once in Python.
The book is organized as a reference. Each section opens with a glossary of key terms, presents the concept compactly, shows working code in both languages, and warns of the misuses that working data scientists actually commit.
Executive Summary
The authors' central claim is that data scientists do not need most of what a statistics PhD covers, but they do need a small core of ideas — held deeply, applied correctly, and translated into code. That core spans seven chapters.
| Chapter | Concepts Covered | Why It Matters | |---|---|---| | 1. Exploratory Data Analysis | Location, variability, distribution, correlation, contingency | The first and most important phase of any project | | 2. Sampling Distributions | Bias, bootstrap, CIs, t/normal/binomial/Poisson | The bridge from sample to inference | | 3. Significance Testing | A/B tests, p-values, permutation, power, ANOVA, chi-square | Designing and judging real experiments | | 4. Regression & Prediction | OLS, multi-regression, factor vars, splines, GAM, diagnostics | The workhorse of supervised learning | | 5. Classification | Naive Bayes, LDA, logistic, ROC, AUC, imbalanced data | When the outcome is a class, not a number | | 6. Statistical ML | KNN, CART, bagging, random forest, boosting, regularization | Production-grade tabular models | | 7. Unsupervised Learning | PCA, K-means, hierarchical, model-based, mixed types | Pattern discovery without labels |
The 2nd edition's headline change: every R example is now mirrored in Python (NumPy, pandas, scikit-learn, statsmodels), making it a genuinely cross-stack book.
Key Takeaways
-
EDA is not optional. Tukey's exploratory mindset — look at the data before modeling it — is the highest-leverage statistical habit a data scientist can build.
-
Bootstrap first, formula second. Resampling replaces most of the algebraic distribution theory you would otherwise need to memorize. A loop and a sample function go a long way.
-
The p-value is a poor decision rule. Statistical significance says nothing about practical significance. Always pair p-values with effect sizes and confidence intervals.
-
Most "big data" problems are still small-data problems in disguise. Stratified sampling, careful cross-validation, and sensible baselines beat raw scale.
-
Regression is the gateway drug. Linear and logistic regression are interpretable, fast, and good enough surprisingly often. Learn them deeply before reaching for a neural net.
-
Trees and ensembles dominate tabular data. Random forest and gradient boosting are the right default for structured business data — not deep learning.
-
Regularization is non-negotiable. Ridge, lasso, and elastic net are how you get useful coefficients out of a wide regression.
-
Unsupervised learning is exploration, not prediction. PCA and clustering produce hypotheses for humans, not predictions for systems.
-
A/B testing is harder than it looks. Peeking, multiple comparisons, novelty effects, and underpowered designs sink most real-world experiments.
-
Cross-language fluency matters. R and Python each shine in different parts of the workflow; modern data scientists read both.
Who Should Read
| Reader Type | Why | |---|---| | Working data scientists without formal stats training | The book's exact target audience — concept-by-concept refresher | | Software engineers moving into ML | Closes the statistics gap with the minimum useful material | | Statistics-trained analysts learning Python or R | Side-by-side code in both languages accelerates the switch | | Bootcamp and self-taught learners | Concise enough to read cover to cover; deep enough to retain | | Interview prep candidates | Covers most stats-shaped questions asked in data science loops |
Who Should Skip
- Readers seeking deep statistical theory — try Casella & Berger or Wasserman instead
- Anyone wanting a deep-learning textbook — this is tabular statistics
- Pure programmers with no interest in the math behind their code
- Readers wanting a non-technical, lay-audience treatment — see Naked Statistics by Wheelan for that
Core Themes
| Theme | Description | |---|---| | Pick the useful 50 | A working data scientist needs depth on a small set, not breadth on all | | Resampling beats algebra | Bootstrap and permutation tests subsume most classical inference | | Effect size over p-value | Practical significance is the only kind that matters at work | | Tabular data needs trees | Random forest and boosting are the modern default, not regression | | Two languages, one workflow | R and Python are complements, not competitors |
Why This Book Matters
Most data scientists enter the field through code, not coursework. They can fit a model in fifteen seconds but cannot explain whether its coefficient is significant, whether its AUC is meaningfully better than baseline, or whether the A/B test their PM is staring at has any business being called a win.
This book was written to close that gap without sending the reader back to a six-semester statistics curriculum. By restricting itself to the ~50 concepts that earn their keep in production, and by showing each one in both R and Python, it became the de facto practitioner refresher for the field — the natural companion to An Introduction to Statistical Learning and a practical follow-on to Naked Statistics.
Related Books
| Book | Author | Connection | |---|---|---| | Naked Statistics | Charles Wheelan | Non-technical predecessor; read first if math-shy | | An Introduction to Statistical Learning | James, Witten, Hastie, Tibshirani | The theory companion; same modeling vocabulary | | The Elements of Statistical Learning | Hastie, Tibshirani, Friedman | The graduate-level deep dive | | R for Data Science | Hadley Wickham | Tidyverse counterpart for the R-only path | | Python for Data Analysis | Wes McKinney | pandas companion for the Python-only path | | Statistical Rethinking | Richard McElreath | Bayesian alternative to the frequentist core | | Causal Inference: The Mixtape | Scott Cunningham | Picks up where ch. 3 leaves off |
Final Verdict
Rating: 8.8/10 — The single best mid-career statistics reference for working data scientists. Compact, opinionated, and ruthlessly practical. The 2nd edition's dual R + Python treatment makes the 1st edition obsolete.
content map
The Seven-Chapter Map
flowchart TD
A["1. Exploratory<br/>Data Analysis"] --> B["2. Data and<br/>Sampling Distributions"]
B --> C["3. Statistical Experiments<br/>and Significance Testing"]
C --> D["4. Regression<br/>and Prediction"]
D --> E["5. Classification"]
E --> F["6. Statistical<br/>Machine Learning"]
F --> G["7. Unsupervised<br/>Learning"]
A -.->|"feeds every<br/>downstream phase"| F
B -.->|"underwrites every<br/>inferential claim"| C
The book moves in a deliberate arc: see the data (EDA), understand how samples behave (sampling distributions), test claims about them (significance testing), model them (regression, classification, statistical ML), and finally explore their hidden structure (unsupervised learning).
Chapter 1: Exploratory Data Analysis
EDA — coined by John Tukey in 1977 — is the first task in any data project. The Bruces argue it is also the most neglected. Their EDA toolbox:
Measures of Location
| Statistic | When to Use | Trap | |---|---|---| | Mean | Symmetric, no outliers | Not robust | | Trimmed mean | Mild outliers | Choice of trim is arbitrary | | Weighted mean | Unequal sample importance | Wrong weights → biased estimate | | Median | Skew, outliers, ordinal data | Ignores magnitude | | Mode | Categorical data | Multiple modes possible |
Measures of Variability
flowchart LR
R["Range"] --> P["Percentiles<br/>IQR"]
P --> S["Sample Standard<br/>Deviation (s)"]
S --> M["MAD<br/>Median Absolute<br/>Deviation"]
The book's recurring lesson: when in doubt, prefer robust measures. The median and MAD survive contamination that destroys the mean and standard deviation.
Exploring Distributions
- Histograms and density plots for univariate shape
- Boxplots for distribution at a glance plus outlier flagging
- Frequency tables for binned numeric data
- Bar and pie charts for categorical proportions
- Contingency tables for two-category cross-tabulation
- Scatterplots for two-variable continuous relationships
- Hexagonal binning and contour plots when scatterplots saturate
Correlation
- Pearson's r for linear association
- Spearman's ρ and Kendall's τ for rank-based association
- The correlation matrix and its correlation heatmap as a one-glance pre-modeling check
The Bruces emphasize: correlation does not imply causation, and a correlation of zero does not imply independence — only the lack of a linear relationship.
Chapter 2: Data and Sampling Distributions
flowchart LR
POP["Population<br/>(unknowable)"] -->|"sample n times"| S["Sample<br/>statistic"]
S -->|"resample with<br/>replacement"| BS["Bootstrap<br/>distribution"]
BS --> CI["Confidence<br/>Interval"]
S -->|"repeat sampling"| SD["Sampling<br/>distribution"]
SD --> SE["Standard<br/>Error"]
Core Concepts
- Random sampling beats convenience sampling every time. Bias is the systematic error a larger sample size will not fix.
- Selection bias is the silent killer of business analytics ("we looked at our best customers — they're great!").
- The sampling distribution of a statistic is the distribution of that statistic across hypothetical repeated samples. Its standard deviation is the standard error.
- The bootstrap is the practical workhorse: resample your data with replacement, recompute the statistic, repeat 1,000+ times, and read off the standard error or percentile-based confidence interval directly. No distributional assumptions required.
Useful Distributions
| Distribution | Models | Example | |---|---|---| | Normal | Continuous, symmetric, additive errors | Heights, measurement noise | | Student's t | Sample means, small n | Regression coefficients | | Binomial | Yes/no trials | Conversion rates | | Chi-square | Sum of squared normals | Independence tests | | F | Ratio of variances | ANOVA | | Poisson | Rare event counts | Server errors per hour | | Exponential | Time between events | Inter-arrival times | | Weibull | Time to failure | Survival analysis |
The book argues you do not need to memorize their formulas — you need to recognize when each one applies.
Chapter 3: Statistical Experiments and Significance Testing
flowchart TD
H["Hypothesize<br/>(null vs alternative)"] --> D["Design experiment<br/>(A/B, factorial, blocked)"]
D --> R["Randomize<br/>assignment"]
R --> C["Collect data"]
C --> T["Test<br/>(t, permutation, chi-sq, ANOVA)"]
T --> P{"p-value <br/>vs effect size"}
P -->|"Significant + meaningful"| ACT["Act"]
P -->|"One but not the other"| STOP["Reconsider"]
A/B Testing
Two groups, random assignment, one treatment, one control, one metric. The Bruces stress what most teams get wrong:
- Define the test statistic before collecting data. Otherwise every analysis is post-hoc.
- Pre-register the sample size. Computed from desired power (typically 0.80), the minimum detectable effect, and the variability of the outcome.
- Stop the test only at the planned endpoint. Peeking and stopping early inflate the false-positive rate.
Hypothesis Tests
- Null hypothesis (H₀): no effect; the status quo
- Alternative (Hₐ): one-sided or two-sided
- Type I error (α): false positive — typically 0.05
- Type II error (β): false negative — controlled via power (1 − β)
- p-value: probability of seeing data at least this extreme if H₀ were true — frequently misinterpreted
Resampling-Based Tests
The book's defining stance: the permutation test is the modern default. Combine all observations, shuffle group labels, recompute the test statistic, repeat. The p-value is the fraction of permutations that produced a statistic as extreme as the observed one. No distribution assumed; no formula needed.
Specific Tests
| Test | When | Replacement (Bootstrap/Permutation) | |---|---|---| | t-test (two-sample) | Compare two means | Permutation of group labels | | ANOVA | Compare 3+ means | Permutation; F-statistic | | Chi-square | Categorical independence | Permutation of contingency table | | Fisher's exact | Small-cell contingency | Exact enumeration | | Multi-arm bandit | Many treatments at once | Bayesian / Thompson sampling |
Multiple Testing and Power
- Bonferroni and FDR (Benjamini-Hochberg) corrections
- Degrees of freedom as the dimension you really have
- Power = 1 − β; calculate before the experiment, not after
Chapter 4: Regression and Prediction
Simple and Multiple Linear Regression
The textbook fit: minimize the sum of squared residuals (OLS). The book emphasizes interpretation over derivation:
- Coefficient: expected change in y per unit change in x, holding other predictors constant — a phrase most readers underestimate
- Standard error of a coefficient: computed via the bootstrap in practice, no need to memorize the closed form
- R² and adjusted R²: fraction of variance explained, with adjusted penalizing model complexity
- Root mean squared error (RMSE): the prediction-quality metric to report
Diagnostics
flowchart LR
F["Fit model"] --> R["Residual<br/>plots"]
R --> O["Outliers<br/>(standardized residuals)"]
R --> L["Leverage<br/>(hat values)"]
R --> C["Influence<br/>(Cook's distance)"]
R --> H["Heteroscedasticity"]
R --> N["Non-linearity"]
R --> A["Autocorrelation<br/>(Durbin-Watson)"]
A model that fits the training data perfectly but violates these checks is a model that will fail on new data.
Categorical Predictors
- One-hot encoding vs. reference (dummy) coding
- Multicollinearity and the dummy-variable trap
Beyond Linearity
| Technique | Use Case | |---|---| | Polynomial regression | Smooth curvature | | Spline regression | Local flexibility, controlled smoothness | | Generalized Additive Model (GAM) | Many predictors, each with its own non-linear shape | | Interaction terms | Effect of x depends on z |
Regularization (foreshadowing Chapter 6)
- Ridge (L2): shrinks all coefficients
- Lasso (L1): drives some coefficients to zero (feature selection)
- Elastic net: weighted combination of both
Chapter 5: Classification
flowchart LR
X["Features"] --> M["Classifier"]
M --> P["Predicted<br/>probability"]
P --> T{"Threshold"}
T -->|"≥ τ"| POS["Positive class"]
T -->|"< τ"| NEG["Negative class"]
P --> ROC["ROC / AUC<br/>(threshold-free)"]
Methods
- Naive Bayes: strong independence assumption; surprisingly competitive on text
- Linear Discriminant Analysis (LDA): Gaussian class distributions, shared covariance
- Logistic regression: the classification analogue of linear regression; coefficients = log-odds changes
- K-nearest neighbors (KNN): lazy, distance-based; needs scaling
Evaluation
| Metric | Formula | Use When | |---|---|---| | Accuracy | (TP + TN) / N | Classes balanced | | Precision | TP / (TP + FP) | Cost of false positive is high | | Recall (sensitivity) | TP / (TP + FN) | Cost of false negative is high | | F1 | Harmonic mean of P, R | Want balance | | ROC / AUC | TPR vs FPR across thresholds | Compare models threshold-free | | Lift / gains | Cumulative response vs random | Marketing, scoring |
Imbalanced Data
The Bruces dedicate substantial space to the practical reality that most business classification problems (fraud, churn, conversion) have rare positives:
- Undersample the majority class for speed
- Oversample the minority class for signal — SMOTE for synthetic samples
- Cost-sensitive learning: weight the loss
- Calibrated probability thresholds based on business cost
Chapter 6: Statistical Machine Learning
K-Nearest Neighbors
Simple, non-parametric. Scale your features first or distance is meaningless. K governs the bias-variance trade-off.
Tree Models
flowchart TD
R["Root: all data"] --> S1{"x_1 < 3.5?"}
S1 -->|"yes"| L1["Leaf:<br/>predict class A"]
S1 -->|"no"| S2{"x_2 < 7.2?"}
S2 -->|"yes"| L2["Leaf:<br/>predict class B"]
S2 -->|"no"| L3["Leaf:<br/>predict class A"]
CART (Classification and Regression Trees) recursively splits on the feature and threshold that maximally reduces impurity (Gini or entropy for classification, variance for regression). Pruning controls overfitting.
Bagging and Random Forest
- Bagging: train B trees on bootstrap samples, average predictions. Reduces variance.
- Random forest: bagging plus a random subset of features considered at each split. Decorrelates the trees and improves performance further.
- Variable importance: mean decrease in impurity or mean decrease in accuracy when each variable is permuted.
Boosting
Sequential, not parallel. Each tree focuses on the residuals of the previous ensemble.
| Algorithm | Idea | |---|---| | AdaBoost | Up-weights misclassified examples | | Gradient boosting | Fits trees to negative gradient of the loss | | XGBoost | Regularized, parallelized gradient boosting — the modern default for tabular ML |
Regularization in Practice
For both regression and classification, ridge, lasso, and elastic net are tuned via cross-validation. Lasso doubles as a feature selector. Standardize predictors first.
Chapter 7: Unsupervised Learning
Principal Component Analysis (PCA)
flowchart LR
X["High-dim<br/>features"] --> S["Standardize"]
S --> C["Covariance<br/>matrix"]
C --> E["Eigen-<br/>decomposition"]
E --> PC["Principal<br/>components"]
PC --> V["Variance<br/>explained"]
PC --> R["Reduced<br/>representation"]
PCA finds orthogonal directions of maximum variance. Use it for visualization (2-3 components), denoising, and as a preprocessing step before clustering or supervised learning.
Clustering
| Method | Assumes | Strength | Weakness | |---|---|---|---| | K-means | Spherical clusters, equal size | Fast, scalable | Must choose k; sensitive to scale | | Hierarchical (agglomerative) | None | Dendrogram visualization | O(n²) memory | | Model-based (GMM) | Gaussian mixture | Soft assignments, principled | Requires distribution assumption | | DBSCAN | Density-based | Finds arbitrary shapes; flags noise | Hyperparameter-sensitive |
Scaling and Mixed Variable Types
The chapter closes with the practical headache of clustering when your data is part-numeric, part-categorical. Gower's distance and explicit standardization of numeric features are the standard treatments.
Key Lessons Across All Chapters
- EDA is not optional, ever — and it pays for itself hundredfold by the end of the project
- The bootstrap is the single most useful technique in the book — learn it cold
- Effect sizes communicate; p-values gate-keep — report both
- Regression is interpretable; tree ensembles are accurate — pick based on your stakeholder, not your taste
- Regularization is required, not optional, the moment your predictors outnumber a small constant
- Unsupervised methods produce hypotheses, not answers — feed them to humans, not to systems
Action Plan
-
Audit your current EDA habit. Are you actually producing histograms, boxplots, and correlation matrices on every new dataset? If not, build a personal EDA template.
-
Replace one closed-form CI calculation with a bootstrap this week. Pick any metric on a real dataset. Resample 10,000 times. Read the 2.5th and 97.5th percentiles. Compare to your textbook formula.
-
Recompute the last p-value you reported, with its effect size and CI. Decide whether the practical conclusion changes. Apply that habit to every test going forward.
-
For your next tabular ML problem, start with logistic regression, then a random forest, then XGBoost. Compare via cross-validated AUC. Pick the simplest model whose performance loss is within tolerance.
-
Stop A/B tests at the planned size, not when they look significant. Pre-compute the sample size from power, MDE, and baseline variance.
-
Standardize your features before any distance-based method — KNN, K-means, PCA, hierarchical clustering, anything with a distance metric.
-
Read the book in both languages. Even if you live in one, the cross-language exposure tightens your understanding of the underlying statistics.
analysis
Strengths
- Ruthless curation. The book's biggest virtue is the decisions it makes about what to leave out. A traditional statistics text gives equal billing to topics a working data scientist will never use; the Bruces refuse. The result is a small, dense, defensible canon.
- Dual-language code. The 2nd edition's most consequential change: every R example is matched by a Python equivalent using NumPy, pandas, scikit-learn, and statsmodels. This makes the book genuinely usable in both ecosystems and ends the R-only friction of the 1st edition.
- Glossary-first structure. Every section opens with key terms defined precisely. The book functions as a reference long after the first read — flip open to "stratified sampling" or "leverage" and the answer is two paragraphs away.
- Resampling as the spine. The bootstrap and permutation tests are presented not as exotic alternatives but as the default. This frees the reader from memorizing distributional formulas they will never apply.
- Practitioner-honest warnings. Each concept is followed by the specific misuses working data scientists commit: p-hacking, peeking at A/B tests, ignoring multicollinearity, applying K-means without scaling. These are the failures the authors have actually seen.
- Excellent regression and tree chapters. Chapters 4 and 6 are the strongest in the book — clear, deep enough to be useful, shallow enough to be readable in an afternoon.
- Compact length. At ~360 pages it can plausibly be read cover to cover, unlike Hastie–Tibshirani–Friedman's 750-page Elements of Statistical Learning.
- Strong companion fit. Slotted neatly between Naked Statistics (lay introduction) and An Introduction to Statistical Learning (theory companion), it occupies the practitioner sweet spot.
Weaknesses
- Thin on theory. Readers who want to understand why the bootstrap works, or the derivation of the OLS estimator, will need a second book. The Bruces explicitly trade depth for applicability — a trade not every reader will accept.
- Light on time series and causal inference. These are covered briefly or skipped entirely. A working data scientist in marketing analytics or economics will need Forecasting: Principles and Practice or Causal Inference: The Mixtape alongside.
- Survival analysis and missing-data methods are touched lightly. Anyone working in healthcare, churn modeling, or any setting with censored outcomes will need supplementary reading.
- No deep learning. This is by design — the book is about statistics for tabular data — but readers expecting modern neural-network coverage will be surprised.
- Bayesian methods are largely absent. Frequentist throughout. Statistical Rethinking is the natural complement.
- Code is illustrative, not production-grade. Examples are pedagogical snippets; readers building real pipelines will need more on testing, packaging, and reproducibility.
- Glossary-style writing can read as terse. The compactness that makes the book a good reference can also make first reading feel encyclopedic rather than narrative.
Notable Failure Modes the Book Helps Prevent
| Mistake | Where the Book Addresses It | |---|---| | Comparing means with the t-test when distributions are skewed | Ch. 3: permutation alternative | | Stopping an A/B test early because it "looks significant" | Ch. 3: power, sample-size pre-registration | | Reporting a p-value of 0.04 as a "win" without effect size | Ch. 3: effect size + CI together | | Using accuracy on an imbalanced classification problem | Ch. 5: precision/recall, AUC, lift | | Skipping feature standardization before KNN, K-means, or PCA | Ch. 5–7: scaling as prerequisite | | Treating a high R² as proof a model will generalize | Ch. 4: train/test, cross-validation | | Choosing deep learning for a 50,000-row tabular problem | Ch. 6: XGBoost as the right default |
Comparative Position
| Book | Target Reader | Mathematical Depth | Code | |---|---|---|---| | Naked Statistics | Curious lay reader | Minimal | None | | Practical Statistics for Data Scientists | Working data scientist | Moderate | R + Python | | An Introduction to Statistical Learning | Advanced undergraduate / practitioner | Moderate-high | R (Python edition exists) | | The Elements of Statistical Learning | Graduate student / researcher | High | Minimal | | Statistical Rethinking | Bayesian curious practitioner | Moderate | R (Stan) |
The Bruces' book is the only one in this list explicitly written as a working reference for the moment when a data scientist needs to look something up, do it correctly, and ship — without re-reading half a textbook.
Final Assessment
| Dimension | Rating | Notes | |---|---|---| | Practical Utility | 9.5/10 | Directly applicable to daily data work | | Breadth | 8/10 | The intended seven chapters; deliberately skips many topics | | Depth | 7/10 | Sufficient for application; insufficient for derivation | | Code Quality | 9/10 | Cross-language, current, readable | | Readability | 8/10 | Compact and dense; reads better as reference than narrative | | Timeliness | 9/10 | 2nd edition (2020) covers XGBoost, modern scikit-learn | | Overall | 8.8/10 | The default working-statistician reference for data scientists |
The book earns its place on the short shelf of statistics titles a working data scientist should own. It will not turn the reader into a statistician, but it will close the gap between "I can fit a model" and "I can defend my conclusions" — which is the gap most data science careers actually live in.
narration
Introduction
Welcome to BookAtlas. Today: Practical Statistics for Data Scientists, second edition, by Peter Bruce, Andrew Bruce, and Peter Gedeck. Published by O'Reilly Media in 2020. About 360 pages. A working reference, not a textbook.
The Problem the Book Solves
There is a gap in the data science profession. Most data scientists enter the field through programming, not through a statistics curriculum. They learn pandas before they learn the central limit theorem. They can fit a random forest in fifteen seconds but cannot explain why its predictions are calibrated, or whether the lift over a logistic regression baseline is meaningfully better than noise.
This book exists to close that gap. Not by replacing a six- semester statistics degree, but by isolating the fifty or so concepts that actually appear in working data science, and explaining each one compactly, in both R and Python.
Who the Authors Are
Peter Bruce founded the Institute for Statistics Education, known as statistics.com, in 2002. He spent two decades discovering, by direct contact with thousands of online students, which statistical concepts working analysts actually needed and which they did not. This book is the distilled result of that experience.
His son Andrew Bruce, a principal research scientist at Amazon with a PhD in statistics from the University of Washington, contributes the depth on modeling and machine learning.
For the second edition, Peter Gedeck — a senior data scientist and chemist by training — joined to translate every R example into Python. This is the change that made the second edition displace the first.
The Seven Chapters
The book is organized into seven chapters that follow the arc of a real data project.
Chapter one is exploratory data analysis. Look at the data before modeling it. Compute medians and interquartile ranges, not just means and standard deviations, because outliers exist in every real dataset. Use boxplots and correlation matrices as your first questions to the data, not your last.
Chapter two covers sampling distributions. The bootstrap is the single most important technique in the book. Resample your data with replacement a thousand times, recompute your statistic, and read off the standard error and confidence interval directly. No formula, no distributional assumption. This replaces most of what a traditional statistics course teaches in its inference chapter.
Chapter three is statistical experiments and significance testing. The Bruces are blunt: the p-value is a poor decision rule. Pair it with effect size and confidence interval, or do not report it. They cover A/B testing, hypothesis testing, permutation tests, ANOVA, chi-square, multi-armed bandits, and power calculations.
Chapter four is regression and prediction. Linear and multiple regression, with serious attention to interpretation, diagnostics, factor variables, and the modeling decisions practitioners actually face. The chapter ends with splines, generalized additive models, and a preview of regularization.
Chapter five is classification. Naive Bayes, linear discriminant analysis, logistic regression, K-nearest neighbors. The evaluation section is extensive — accuracy, precision, recall, F1, ROC, AUC, lift — with the honest practical guidance that business classification problems are almost always imbalanced and you must adjust accordingly.
Chapter six is statistical machine learning. Tree models. Random forests. Boosting. The book's verdict, stated plainly: for tabular data, gradient boosting — and XGBoost in particular — is the modern default. Not deep learning.
Chapter seven is unsupervised learning. Principal component analysis, K-means, hierarchical clustering, model-based mixture models, and the practical headache of clustering mixed numeric and categorical data.
The Bootstrap as Spine
The book's defining technical choice is to make resampling — the bootstrap and the permutation test — the spine of statistical inference, instead of the algebraic distribution theory most textbooks lead with.
Why does this matter? Because the bootstrap works for any statistic you can compute. It does not require the data to be normal. It does not require you to remember which formula applies to a difference of medians versus a ratio of variances. You write a loop. You resample. You read off the answer.
For a data scientist whose comfort zone is code, this is liberating. You can do correct statistical inference today without first memorizing the closed-form variance of every estimator.
A/B Testing Done Right
The book's chapter on experiments deserves singling out. The Bruces walk through every failure mode a working A/B testing team commits.
Stopping the test early because it looks significant — wrong, inflates false positives.
Peeking at the dashboard daily and acting on what you see — wrong, same problem.
Running twenty tests and reporting the one that hits p less than 0.05 — wrong, that is exactly what the multiple-comparisons correction exists to prevent.
Calling a 1.2 percent lift a win without confidence intervals — wrong, you have no idea whether the true effect is positive at all.
Anyone running experiments at a product company should read chapter three before their next test, then re-read it after.
Regression Before Random Forests
Another piece of the book's practical wisdom: start with linear or logistic regression, even if you intend to end with a boosted tree.
A regression model is interpretable. A coefficient has a meaning a product manager can understand. The diagnostics tell you when the model is wrong and how. The regularization story generalizes directly to more complex methods. And surprisingly often, the regression's performance is within tolerance of the fancier model, at a fraction of the operational cost.
The Bruces present regression not as an old-fashioned starting point but as the right baseline for every supervised problem.
Trees and Boosting as the Modern Default
Chapter six's contribution is to settle, with quiet authority, the question of what works on tabular business data. The answer is gradient boosting, specifically XGBoost. Not because boosting is fashionable but because it consistently outperforms alternatives on structured data with mixed types, moderate sample sizes, and the kind of feature interactions that arise in real business problems.
The book treats deep learning's absence as a feature: the authors are not building image classifiers; they are building models for spreadsheets, and spreadsheets are still where most business value lives.
What the Book Leaves Out
The book is short, and shortness is a choice. There is no serious treatment of time series. There is almost nothing on causal inference. Survival analysis is touched lightly. Missing-data methods get less attention than they deserve. Bayesian methods are essentially absent.
These omissions are not failures. The Bruces are explicit: this is the practitioner's core. Anything beyond it lives in a more specialized book. If you work in marketing or economics, read Causal Inference: The Mixtape. If you forecast demand, read Hyndman and Athanasopoulos. If you want Bayesian, read McElreath's Statistical Rethinking.
How to Use It
This is not a book you read once and shelve. It is a book you flip open the day before a meeting where you are about to report on an A/B test, to make sure your effect size and confidence interval are properly stated. The day you start a new modeling project, to remind yourself of the regression diagnostics worth checking. The week you transition from R to Python, or Python to R, to find the equivalent function with a matching code example.
The dual-language presentation is the second edition's gift. You can read a concept in your strong language and the code in your weaker one, and both improve together.
The Verdict
If you are a working data scientist without formal statistical training, this is the single best book to fill the gap. It is compact enough to read in a week, deep enough to keep on the desk, and modern enough — covering bootstrap, XGBoost, and current scikit-learn — that it will not feel dated for years.
It will not turn you into a statistician. It will turn you into a data scientist who knows when to call one, and who can defend the bulk of their work without one.
Rating: 8.8/10. The default working-statistician reference for data science.
This has been a BookAtlas narration of Practical Statistics for Data Scientists, second edition, by Peter Bruce, Andrew Bruce, and Peter Gedeck. Thanks for listening.