Machine Learning Engineering
The Most Complete Applied AI Book on Building Reliable ML Solutions at Scale
Overview
Machine Learning Engineering (2020) by Andriy Burkov is the practical counterpart to his earlier The Hundred-Page Machine Learning Book. Where the earlier book teaches the science of ML — algorithms, math, and methods — this one teaches the engineering: the lifecycle a project must follow to deliver a model that survives contact with production.
Published on September 5, 2020 by True Positive Inc. (Burkov's own publishing imprint, distributed through Leanpub under a read-first-pay-later model), the book runs 310 pages and carries endorsements from Cassie Kozyrkov (Chief Decision Scientist at Google) and Karolis Urbonas (Head of Machine Learning at Amazon). On Goodreads it holds a 4.32/5 average across 135 ratings.
The book is organized by project phase rather than by algorithm, a deliberate choice that Burkov makes in the opening pages and that Cassie Kozyrkov flags in her foreword as the most important thing readers should learn from the table of contents alone.
| Discipline | Question it answers |
|---|---|
| Decision-making & product management | Should we use ML at all? What does success look like? |
| Domain expertise & business acumen | What does the customer actually need? |
| Data engineering & analysis | How do we get clean, labeled, well-understood data? |
| Prototype phase ML engineering | How do we find a candidate model quickly? |
| Statistics | Is the result real, or did we get lucky? |
| Production phase ML engineering | How do we serve it at scale, online or batch? |
| Reliability engineering | How do we detect failure, recover, and prevent adversaries? |
The book is a thin volume with a wide scope. It is intentionally light on code, light on math, and heavy on patterns, checklists, and decision frameworks — closer to a senior engineer's field guide than a textbook.
Key Takeaways
- ML engineering is not ML research. A research mindset optimizes for accuracy on a frozen dataset. An engineering mindset optimizes for business value, reliability, and cost over the lifetime of the system. The book opens by hammering this distinction.
- Problem framing comes first, and most projects fail here. The first chapters guide the reader through deciding whether ML is the right tool at all, defining success in business terms, and establishing baselines that a heuristic or a simple model might already beat.
- Data is the long pole. Most project time goes to data engineering and analysis, not modeling. Burkov devotes substantial space to data quality, labeling strategies, missing values, and the difference between what your data says and what your users do.
- The right model is rarely the newest model. The prototype phase emphasizes trying simple baselines (logistic regression, decision trees) before reaching for deep learning. Many production models are deliberately simple because they are easy to debug, cheap to serve, and good enough.
- Statistics is the bridge between prototype and production. A model that beats a baseline on a test set is not yet a model that will beat the baseline in production. Significance testing, confidence intervals, and effect size are not academic exercises; they are the only honest way to decide whether a new model is actually better.
- Production is a separate discipline. Online serving (REST APIs, streaming, edge) and batch serving (scheduled jobs) have different latency, throughput, and cost profiles. Burkov walks through the patterns: shadow deployments, canary rollouts, A/B testing, and how to make a model rollbackable.
- Reliability is the most important production concern. Models degrade silently. The book is unusually strong on monitoring for data drift, concept drift, and adversarial inputs; on fallback strategies when the model is wrong; and on graceful degradation when the system is unavailable. The closing chapters treat ML systems with the same care that Site Reliability Engineering applies to any other production service.
- Mistakes are inevitable. Hope is not a strategy. A recurring theme, quoted from a foreword contributor: assume the model will fail, the pipeline will break, the input distribution will shift, and an adversary will try to exploit it. The engineering work is about detection, recovery, and limiting blast radius, not preventing failure entirely.
- Domain expertise cannot be automated away. Every model sits inside a domain with its own constraints, jargon, and edge cases. The book repeatedly emphasizes that the ML engineer's job is to translate domain knowledge into features, labels, and success criteria, which is work that cannot be delegated to a model.
- Read it alongside The Hundred-Page Machine Learning Book. The two books are deliberately non-overlapping: one teaches the what of ML, this one teaches the how of shipping it. Most reviewers recommend them as a pair.
Who Should Read
| Reader Type | Why |
|---|---|
| ML engineers shipping models to production | The only concise book covering the full ML lifecycle end to end |
| Data scientists transitioning from notebooks to systems | Bridges the gap between research-style modeling and engineering-style operations |
| Software engineers adding ML to a product | Teaches the failure modes, monitoring, and reliability concerns unique to ML systems |
| Technical product managers leading ML projects | The framing and decision chapters are arguably the most valuable in the book |
| MLOps and platform engineers | A pattern catalog for the production and reliability chapters |
| Senior engineers preparing for ML interviews at large companies | Captures the "ML system design" interview canon in a single volume |
Who Should Skip
- Beginners learning ML for the first time — start with Burkov's own Hundred-Page Machine Learning Book or Géron's Hands-On Machine Learning for the modeling foundations
- Readers looking for hands-on coding tutorials — the book is deliberately code-light; pair it with Hands-On Machine Learning or Designing Machine Learning Systems for executable examples
- Practitioners needing deep dives on a single phase (e.g. only feature engineering, or only deployment) — Chip Huyen's Designing Machine Learning Systems goes deeper on data and deployment at the cost of breadth
- Researchers focused on model architecture or novel algorithms — this is engineering, not research
- Anyone wanting a vendor- or cloud-specific playbook — the book is tool-agnostic by design
Core Themes
| Theme | Description |
|---|---|
| Order of operations matters | A successful ML project must be executed in a specific sequence; doing steps out of order causes silent waste or catastrophic failure |
| Engineering over algorithms | The hard problems are not in the model; they are in the data, the pipeline, the deployment, and the operations |
| Data-centric thinking | Invest in data quality, labeling, and validation before reaching for a more complex model |
| Decision intelligence | ML is a tool for making better decisions; it must be framed by the decision it serves, not the other way around |
| Reliability by design | Models fail silently, inputs drift, adversaries exist; design for detection, recovery, and graceful degradation |
| Domain first, model second | Domain expertise shapes features, labels, success criteria, and edge cases; without it, even a strong model ships the wrong product |
| Pragmatism over novelty | The right model is the one that is good enough, debuggable, cheap to serve, and easy to retire, not the one with the best benchmark score |
Why This Book Matters
When Machine Learning Engineering was published in September 2020, the ML industry was just beginning to grapple with the operational consequences of having put thousands of models into production. The "MLOps" category was being named. Most engineering teams were learning — painfully — that a model that worked in a notebook behaved differently in production, and that nobody owned what happened after launch.
Burkov's contribution is to lay out the entire lifecycle in the order it must actually be executed, with decision frameworks for each phase. The book's influence shows up in:
- The widely-cited Cassie Kozyrkov foreword, which frames ML engineering as "decision intelligence"
- The adoption of its lifecycle structure by academic courses and industry training programs
- Its pairing with The Hundred-Page Machine Learning Book as the default two-book curriculum for ML practitioners
- The "read-first-pay-later" Leanpub model, which Burkov pioneered and which lowered the barrier to professional ML education
The book does not try to compete with Chip Huyen's Designing Machine Learning Systems (deeper on data and deployment) or with Géron's Hands-On Machine Learning (deeper on modeling with code). Its niche is breadth, brevity, and order: a single 310-page volume that walks a project from "should we use ML?" to "what do we do when the model breaks at 3 a.m.?"
Related Books
| Book | Author | Connection |
|---|---|---|
| The Hundred-Page Machine Learning Book | Andriy Burkov | The modeling companion: the what of ML, condensed |
| Designing Machine Learning Systems | Chip Huyen | Deeper treatment of data engineering, deployment, and MLOps |
| Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow | Aurélien Géron | The code-first ML tutorial that complements this engineering overview |
| Reliable Machine Learning | Todd Underwood et al. | Goes deeper on the reliability and monitoring chapters |
| Building Machine Learning Pipelines | Hapke & Nelson | Hands-on TFX and Kubeflow for the production phase |
| Machine Learning Engineering in Action | Ben Wilson | A longer, code-heavy alternative for the production-phase chapters |
Final Verdict
Machine Learning Engineering is the book you hand to a strong ML practitioner who is about to ship their first model to production — or to a senior engineer who needs to understand why ML systems behave differently from any other software system. It will not teach you to code a neural network. It will teach you whether you should, what to do before you do, and what to do when it breaks.
The book's greatest strength is its ordering. Its greatest weakness, by design, is the limited depth that its breadth imposes: a reader wanting deep treatment of feature stores, canary deployments, or drift detection will need a supplementary text. Read it for the map; read Huyen for the territory.
Rating: 8/10 — The best concise end-to-end guide to ML engineering, and the natural second book after Burkov's own Hundred-Page Machine Learning Book.
The ML Project Lifecycle
The book's organizing idea is the lifecycle: a successful ML project must execute the phases in a specific order, and skipping or reordering them is the most common cause of failure.
flowchart TD
A["1. Problem Framing<br/>(Should we use ML?)"] --> B["2. Domain & Objectives<br/>(What does success look like?)"]
B --> C["3. Data Engineering<br/>& Analysis"]
C --> D["4. Prototype Phase<br/>ML Engineering"]
D --> E["5. Statistics<br/>(Is it actually better?)"]
E --> F["6. Production Phase<br/>ML Engineering"]
F --> G["7. Reliability Engineering<br/>(Detect, recover, defend)"]
G -.->|"Drift / failure"| C
G -.->|"Retrain"| D
Cassie Kozyrkov's foreword summarizes the book in culinary terms: figure out what to cook (decision-making), understand the suppliers and customers (domain), process ingredients at scale (data), try combinations quickly (prototyping), check quality (statistics), serve millions of dishes (production), and stay top-notch when the truck brings the wrong delivery (reliability).
Phase 1 — Problem Framing
The first decision is whether to use ML at all. Burkov enumerates cases where ML is and is not the right tool:
Use ML when:
- The problem is perceptive (image, speech, video) — humans recognize patterns in high-dimensional inputs that no hand-coded rule captures
- The problem is an unstudied phenomenon with observable examples (personalized medicine, log anomaly detection, content recommendation)
- The objective is simple (yes/no, a number) and there are sufficient examples
Avoid ML when:
- Every decision must be explainable in human terms (regulated environments often require this)
- The cost of an error is unacceptably high and unrecoverable (life-critical systems, irreversible financial actions)
- The marketing value of "AI" exceeds the engineering value — a common trap that Burkov calls out explicitly
The framing chapter's signature move: define success in business terms before defining it in model terms. Revenue, retention, time saved — not accuracy, F1, or AUC.
Phase 2 — Domain Expertise and Business Acumen
ML does not eliminate the need for domain knowledge; it amplifies it. The engineer who understands the domain will:
- Choose features that encode real signal, not noise
- Recognize label leakage before it ruins a model
- Set baseline targets that a simple heuristic might already exceed
- Identify the edge cases that the test set will miss
Burkov argues that domain expertise is the single hardest capability to hire for. A strong modeler with weak domain knowledge will build a model that wins benchmarks and loses customers. A weak modeler with strong domain knowledge will at least build a system that solves the right problem.
Phase 3 — Data Engineering and Analysis
"The greatest challenges must be solved before you type `from sklearn.linear_model import LogisticRegression`, and the rest of the problem is solved after you type `model.fit(X, y)`." - Andriy Burkov
flowchart LR
subgraph Sources["Data Sources"]
S1["User Actions"]
S2["Application Logs"]
S3["External APIs"]
S4["Manual Labeling"]
end
subgraph Pipeline["Data Pipeline"]
C["Collection"]
V["Validation<br/>(schema, range, drift)"]
CL["Cleaning<br/>(missing, outliers)"]
L["Labeling<br/>(quality > quantity)"]
end
subgraph Storage["Storage"]
FS["Feature Store"]
DL["Data Lake / Warehouse"]
end
S1 --> C
S2 --> C
S3 --> C
S4 --> L
C --> V --> CL --> L
L --> FS
L --> DL
Key engineering concerns:
| Concern | What it means | Why it matters |
|---|---|---|
| Data quality | Missing values, label errors, duplicates, schema drift | Garbage in, garbage out: a model trained on bad data cannot recover |
| Labeling strategy | Who labels, how disagreements are resolved, what the ground truth actually is | The label defines the objective; ambiguous labels produce ambiguous models |
| Train/serve skew | The features the model sees in production differ from those in training | The single most common cause of production model failure |
| Data versioning | Datasets are versioned and reproducible alongside code | Without it, you cannot debug, audit, or roll back |
| Privacy and consent | PII handling, retention policies, regulatory compliance (GDPR, HIPAA) | Legal and ethical exposure, especially under European and US state laws |
Burkov devotes particular attention to the cost of labeling. Active learning, weak supervision, and crowdsourcing each have trade-offs that the engineer must understand before committing a budget.
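The data-quality concerns in the table above (missing values, duplicates, schema drift) can be checked mechanically before a batch reaches training or serving. The following is a minimal sketch rather than anything prescribed by the book: `FieldSpec`, `validate_rows`, and the example field names are hypothetical, and a real pipeline would likely use a dedicated validation library.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class FieldSpec:
    """Expected type and optional numeric range for one field (hypothetical schema)."""
    dtype: type
    min_value: float | None = None
    max_value: float | None = None
    nullable: bool = False


def validate_rows(rows: list[dict[str, Any]], schema: dict[str, FieldSpec]) -> list[str]:
    """Return human-readable problems; an empty list means the batch passed."""
    problems: list[str] = []
    seen: set[tuple] = set()
    for i, row in enumerate(rows):
        # Schema drift: fields that the schema does not know about.
        extra = set(row) - set(schema)
        if extra:
            problems.append(f"row {i}: unexpected fields {sorted(extra)}")
        for name, spec in schema.items():
            value = row.get(name)
            if value is None:
                # Missing values are only acceptable for nullable fields.
                if not spec.nullable:
                    problems.append(f"row {i}: missing value for '{name}'")
                continue
            if not isinstance(value, spec.dtype):
                problems.append(f"row {i}: '{name}' has type {type(value).__name__}")
                continue
            if spec.min_value is not None and value < spec.min_value:
                problems.append(f"row {i}: '{name}'={value} below {spec.min_value}")
            if spec.max_value is not None and value > spec.max_value:
                problems.append(f"row {i}: '{name}'={value} above {spec.max_value}")
        # Exact duplicates of an earlier row.
        key = tuple(sorted((k, repr(v)) for k, v in row.items()))
        if key in seen:
            problems.append(f"row {i}: duplicate of an earlier row")
        seen.add(key)
    return problems
```

Running the same checks on training data and on live requests is one simple way to surface train/serve skew early, since both sides are held to one schema.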
Phase 4 — Prototype Phase ML Engineering
The prototype phase exists to answer one question quickly: can a model beat a strong baseline on this problem?
flowchart LR
B["Baseline<br/>(heuristic, random, majority)"] --> S["Simple Model<br/>(linear, tree)"]
S --> C["Complex Model<br/>(GBM, neural net)"]
C --> E["Error Analysis"]
E -.->|"More data"| B
E -.->|"Better features"| S
E -.->|"Different model"| C
The progression is deliberate. Each step exists to be either accepted or rejected cheaply:
- Baseline first. A simple rule, a random predictor, or a majority-class classifier. If a model cannot beat this, the project is not ready for ML.
- Simple model second. Linear regression, logistic regression, a small decision tree. These train in seconds, are easy to debug, and often beat a complex model in production.
- Complex model only if justified. Gradient boosting, deep learning. Justified by measured lift over the simple model, not by fashion.
- Error analysis at every step. Confusion matrices, residual plots, segment-level performance. The errors are the most informative output of the prototype phase.
Burkov emphasizes that the prototype is a means, not a deliverable. The deliverable is a credible answer to "can ML solve this?", not a Jupyter notebook.
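The "baseline first" step can be sketched in a few lines. This is an illustrative example under stated assumptions, not code from the book: the helper names (`majority_class_baseline`, `beats_baseline`) and the `min_lift` threshold are hypothetical, and accuracy is used only for simplicity (on imbalanced data another metric is usually more informative).

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_class_baseline(y_train):
    """Return the most frequent training label, used as a constant prediction."""
    label, _count = Counter(y_train).most_common(1)[0]
    return label


def beats_baseline(y_train, y_test, candidate_pred, min_lift=0.02):
    """Compare a candidate model's test predictions against the majority baseline.

    min_lift is a hypothetical threshold: the minimum accuracy gain that
    would justify moving on from the baseline.
    """
    baseline_label = majority_class_baseline(y_train)
    baseline_acc = accuracy(y_test, [baseline_label] * len(y_test))
    candidate_acc = accuracy(y_test, candidate_pred)
    lift = candidate_acc - baseline_acc
    return {
        "baseline_acc": baseline_acc,
        "candidate_acc": candidate_acc,
        "lift": lift,
        "passes": lift >= min_lift,
    }
```

The same comparison can then be repeated one rung up the ladder, with the simple model as the reference that a complex model has to beat.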
Phase 5 — Statistics
A model that beats a baseline on a single test set has not been proven better. The statistics chapters cover:
- Significance testing. Is the improvement over the baseline larger than what random variation would produce? Burkov walks through the common tests (paired t-test, McNemar's test for classifiers, bootstrap confidence intervals) and the common pitfalls (multiple comparisons, peeking at test results).
- Effect size vs. statistical significance. A 0.1% accuracy improvement can be statistically significant at scale but operationally meaningless. Burkov insists on translating statistical results into business terms.
- Confidence intervals for production metrics. Latency, throughput, error rate — all should be reported with intervals, not point estimates.
- A/B testing for production rollouts. The model that wins in offline evaluation is not guaranteed to win with live users. The book covers sample size calculation, interleaving, and the long list of A/B test anti-patterns.
The recurring message: an engineer who cannot speak the language of statistics cannot honestly ship a model.
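One of the techniques named above, the bootstrap confidence interval, can be sketched with the standard library alone. The version below is a paired percentile bootstrap for the accuracy difference between two classifiers on the same test set; the function name, defaults, and seed are assumptions for illustration, not the book's procedure.

```python
import random


def paired_bootstrap_ci(y_true, pred_a, pred_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(B) - accuracy(A) on one test set.

    Returns (point_estimate, lower, upper). Test examples are resampled with
    replacement, keeping each example's two predictions paired.
    """
    n = len(y_true)
    if n == 0 or not (n == len(pred_a) == len(pred_b)):
        raise ValueError("inputs must be non-empty and equally long")
    rng = random.Random(seed)
    # Per-example difference in correctness: +1 only B right, -1 only A right, 0 otherwise.
    deltas = [int(b == t) - int(a == t) for t, a, b in zip(y_true, pred_a, pred_b)]
    stats = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples))
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(deltas) / n, lower, upper
```

If the interval includes zero, the observed improvement is compatible with no real difference; if it excludes zero but is tiny, the effect-size question raised above still applies.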
Phase 6 — Production Phase ML Engineering
flowchart TB
subgraph Serving["Serving Patterns"]
O["Online / Real-time<br/>REST, gRPC, streaming<br/>Low latency, model in memory"]
B["Batch<br/>Scheduled jobs<br/>High throughput, no SLA"]
E["Edge / Embedded<br/>On-device inference<br/>Offline, small models"]
end
subgraph Deployment["Deployment Strategies"]
SHD["Shadow"]
CAN["Canary"]
BG["Blue-Green"]
AB["A/B"]
end
subgraph Infra["Infrastructure"]
CTR["Container"]
ORC["Orchestrator"]
GW["API Gateway"]
end
O --> CTR
B --> CTR
E --> CTR
CTR --> ORC
O --> GW
SHD --> O
CAN --> O
BG --> O
AB --> O
Serving Patterns
| Pattern | Latency | Throughput | Use case |
|---|---|---|---|
| Online / real-time | ms | lower | Recommendations, fraud, search ranking |
| Batch | hours | very high | Reporting, nightly personalization, scoring |
| Streaming | ms to seconds | medium | Time-sensitive features, real-time features |
| Edge / embedded | ms | very low | Mobile keyboards, on-device vision |
Deployment Strategies
- Shadow deployment. The new model receives production traffic but its outputs are not served to users. The team compares shadow predictions to live predictions offline.
- Canary release. A small percentage of traffic is routed to the new model. Roll forward or roll back based on monitored metrics.
- Blue-green deployment. Two identical production environments; switch traffic atomically. Best for instant rollback.
- A/B testing. Random assignment of users to model variants for statistical comparison. The book is firm: A/B testing is a measurement tool, not a deployment tool — use it to learn, then use canary to ship.
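A canary release needs a stable way to decide which requests go to the new model. One common approach, shown here as a sketch with hypothetical names (`assign_variant`, the `salt` value), is to hash a user identifier into a bucket so that the same user always sees the same variant and raising the percentage only adds users to the canary group.

```python
import hashlib


def assign_variant(user_id: str, canary_percent: float, salt: str = "canary-v1") -> str:
    """Deterministically route a user to 'canary' or 'stable'.

    canary_percent is in [0, 100]. The salt is a hypothetical per-rollout
    string; changing it reshuffles which users land in the canary group.
    """
    if not 0.0 <= canary_percent <= 100.0:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # bucket in 0..9999, i.e. 0.01% steps
    return "canary" if bucket < canary_percent * 100 else "stable"
```

Rolling back is then a configuration change: setting `canary_percent` to 0 sends every user to the stable model again.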
Reproducibility and Versioning
Every artifact in the production pipeline must be versioned and reproducible:
- Code (Git)
- Data (DVC, lakeFS, or a feature store with snapshots)
- Model artifacts (MLflow, Weights & Biases, or a model registry)
- Environment (Docker images, lock files)
- Configuration (feature flags, hyperparameters)
Burkov's framing: if you cannot reproduce the model that is running in production right now, you cannot debug it, audit it, or roll it back.
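One lightweight way to tie the artifacts listed above together is a manifest that records what went into a trained model and derives an identifier from it. The sketch below uses only the standard library; `build_manifest` and its field names are hypothetical, and tools such as a model registry would normally store this information instead.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(code_commit: str, data_bytes: bytes, config: dict, environment: str) -> dict:
    """Describe one training run so the resulting model can be traced and rebuilt.

    code_commit: version-control revision of the training code.
    data_bytes:  the training data snapshot (or a serialized pointer to it).
    config:      hyperparameters and feature flags (must be JSON-serializable).
    environment: an identifier such as a container image tag.
    """
    manifest = {
        "code_commit": code_commit,
        "data_sha256": sha256_hex(data_bytes),
        "config": config,
        "environment": environment,
    }
    # Canonical JSON (sorted keys, fixed separators) keeps the id stable
    # regardless of the order in which config keys were written.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    manifest["manifest_id"] = sha256_hex(canonical.encode("utf-8"))[:16]
    return manifest
```

Two runs with identical code, data, configuration, and environment yield the same `manifest_id`; any change to one of them yields a different one, which makes "which model is in production?" an answerable question.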
Phase 7 — Reliability Engineering
The book's strongest section. Reliability treats the ML system the way Site Reliability Engineering treats any production service: assume it will fail, design for detection and recovery.
flowchart TD
subgraph Failure_Modes["Failure Modes"]
DD["Data Drift<br/>Input distribution changes"]
CD["Concept Drift<br/>P(Y|X) changes"]
AD["Adversarial Inputs<br/>Exploits, prompt injection"]
SS["System Failures<br/>Latency, downtime, OOM"]
MS["Model Staleness<br/>Trained on old data"]
end
subgraph Defenses["Defenses"]
MON["Monitoring & Alerting"]
FB["Fallback Strategies"]
GR["Graceful Degradation"]
RT["Retraining Triggers"]
SEC["Input Validation"]
end
DD --> MON
CD --> MON
AD --> SEC
SS --> GR
MS --> RT
MON --> RT
MON --> FB
Monitoring Categories
| Category | What to watch | When to alert |
|---|---|---|
| Input data | Distribution, schema, missingness | Drift beyond historical range |
| Output | Prediction distribution, confidence | Sudden shift, collapse to a single class |
| Model performance | Accuracy, precision, recall (when labels are available) | Degradation vs. baseline |
| System | Latency, throughput, error rate, memory | SLO breach |
| Business | Conversion, revenue, engagement | Movement not explained by model metrics |
| Adversarial | Suspicious input patterns, abuse signals | Anomalous rate spikes |
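For the input-data row, one widely used drift statistic is the Population Stability Index (PSI), which compares the binned distribution of a feature in production against a reference sample. The sketch below is an assumption-laden illustration, not the book's method: the equal-width binning, the `eps` floor, and the function name are choices made here.

```python
import math


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample (expected) and a production sample (actual).

    Bins are equal-width over the range of the reference sample; production
    values outside that range are clamped into the first or last bin.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / n_bins

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = max(0, min(n_bins - 1, int((v - lo) / width)))
            counts[idx] += 1
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A common rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the alert threshold should be calibrated against the feature's own historical variation, as the table suggests.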
Fallback Strategies
When the model is wrong, unavailable, or under attack, the system must do something sensible:
- Default to a heuristic. If the model is unavailable, fall back to a hand-coded rule (e.g. "most popular item").
- Default to a simpler model. If a deep model is too slow, a smaller model or a cached prediction can carry the load.
- Refuse to predict. Sometimes the correct action is to show no recommendation rather than a bad one.
- Human-in-the-loop. Route low-confidence or high-stakes predictions to a human reviewer.
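The strategies above can be combined in a thin wrapper around the model call. This is a minimal sketch with hypothetical names and a hypothetical confidence threshold: `model_predict` is assumed to return a `(label, confidence)` pair and to raise an exception when the model is unavailable.

```python
from typing import Any, Callable, Optional, Tuple


def predict_with_fallback(
    model_predict: Callable[[Any], Tuple[Any, float]],
    features: Any,
    heuristic: Callable[[Any], Any],
    min_confidence: float = 0.6,
) -> Tuple[Optional[Any], str]:
    """Return (prediction, source), where source says which path produced it.

    - Model fails or is unavailable: fall back to the hand-coded heuristic.
    - Model answers with low confidence: refuse to predict and route to a human.
    - Otherwise: serve the model's prediction.
    """
    try:
        label, confidence = model_predict(features)
    except Exception:
        # Any model-side failure (timeout, out-of-memory, bad input) degrades
        # to the heuristic instead of failing the whole request.
        return heuristic(features), "heuristic"
    if confidence < min_confidence:
        return None, "human_review"
    return label, "model"
```

Returning the source alongside the prediction lets monitoring count how often each path is taken; a rising share of `heuristic` or `human_review` responses is itself a reliability signal.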
Adversarial Considerations
Burkov devotes space to adversaries — users who intentionally craft inputs to make the model misbehave. Recommendations include input validation, rate limiting, anomaly detection on input features, and treating model outputs as untrusted inputs to downstream systems.
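Of the recommendations above, rate limiting is the easiest to illustrate in isolation. The token-bucket sketch below is one standard way to do it and is not taken from the book; the class name and parameters are hypothetical, and the caller supplies the clock value so the behavior is easy to test.

```python
class TokenBucket:
    """Per-client rate limiter: allows bursts up to `capacity`, refills at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: float, now: float = 0.0):
        if rate_per_sec <= 0 or capacity <= 0:
            raise ValueError("rate_per_sec and capacity must be positive")
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = now

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (seconds) may proceed."""
        elapsed = max(0.0, now - self.last)
        # Refill for the time that has passed, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In a service, one bucket would typically be kept per client or API key and `allow` called with a monotonic clock reading; capping query volume this way limits how quickly an adversary can probe the model.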
Communication and Stakeholder Management
A chapter many reviewers cite as underrated: the engineering work is half technical and half organizational. The book covers how to:
- Set realistic expectations with product and business stakeholders
- Communicate model performance in business terms
- Write an ML project plan that survives contact with a product roadmap
- Handle disagreement between modelers, engineers, and domain experts
- Decide when not to deploy a model
Key Lessons
- Order matters. The seven phases must happen roughly in sequence. A project that begins with model selection before data engineering will be redone; a project that begins with data engineering before problem framing will optimize the wrong thing.
- The model is the small part. Data engineering, statistics, production, and reliability consume most of the effort and most of the failures.
- Mistakes are inevitable; design for them. The model will be wrong, the pipeline will break, the input distribution will shift. The engineering question is not whether failure will happen but how fast we can detect and recover.
- Simple models win in production. A logistic regression that is interpretable, cheap to serve, and easy to debug beats a deep network on most business problems.
- Reproducibility is non-negotiable. If you cannot rebuild the model that is running in production, you cannot maintain it.
- Domain expertise is non-substitutable. The ML engineer who understands the domain will outbuild one who does not, every time.
Action Plan
- Frame the problem first. Before writing code, write a one-page document answering: what decision is this model supporting, what does success look like in business terms, and what is the baseline we need to beat?
- Audit your data pipeline. Where is data quality at risk? What happens when the schema changes? How are labels collected, audited, and versioned?
- Build a baseline before a model. A heuristic, a simple rule, a majority-class predictor. If a model cannot beat this, the project is not ready for ML.
- Establish a model registry. Every trained model is versioned, reproducible, and linked to the code, data, and config that produced it.
- Design for rollback from day one. Every deployment strategy (canary, blue-green, shadow) must include a tested path back to the previous version.
- Instrument the system for reliability. Monitor inputs, outputs, system health, and business metrics. Alert on drift, not just on crashes.
- Plan for the failure modes. For each phase, enumerate the ways it can fail and the fallback for each. Hope is not a strategy.
- Read The Hundred-Page Machine Learning Book alongside this one. One teaches the what; this one teaches the how.
Strengths
- The right scope for a 310-page book. Where Chip Huyen's Designing Machine Learning Systems runs 390 pages on roughly the same territory, Burkov covers the full lifecycle in about a fifth less space. The discipline of concision is itself a teaching tool.
- Unmatched ordering. The single most important contribution of the book is the structure itself: a project that follows the seven phases in order will avoid the most common failure modes. Cassie Kozyrkov's foreword treats the table of contents as the lesson.
- Strong on the parts most ML books skip. The problem-framing chapters, the statistics chapter, and the reliability chapters are unusually good. Few competing books treat "should we use ML at all?" with the seriousness it deserves.
- Engineering pragmatism. Burkov repeatedly argues for the simpler, more debuggable, more retirable option. The book's anti-fashion stance ("logistic regression first, deep learning only when justified") is a corrective to the field's benchmarks-above-all culture.
- Read-first-pay-later distribution. Burkov publishes through Leanpub and lets readers download the book for free and pay later if they find it valuable. This is a meaningful contribution to ML education access.
- Endorsements from the right people. Cassie Kozyrkov (Google's Chief Decision Scientist) and Karolis Urbonas (Head of ML at Amazon) blurbed the book, signaling that the engineering bar is taken seriously at the highest levels of the industry.
- Composable with the author's other book. Read alongside The Hundred-Page Machine Learning Book, the two together form a curriculum that covers both modeling and engineering in fewer than 500 pages, small enough to actually finish.
- Decision-intelligence framing. Kozyrkov's foreword plants a flag: ML is a tool for making better decisions, and the engineering work exists to serve the decision, not the other way around. This is a more honest framing than the prevailing "build the best model" mentality.
Weaknesses
- Light on concrete code. A recurring criticism in reviews: the book is heavy on patterns, light on code. There is no companion GitHub repository, no Jupyter notebooks, no end-to-end example project. Readers who learn by running code will need to pair this book with Géron or Huyen.
- Brevity sacrifices depth. Each phase gets a chapter or two, not a part. A reader who needs deep treatment of feature stores, canary deployments, or drift detection will find this book a starting point rather than a reference.
- Light on cloud and tooling specifics. The book is tool-agnostic by design, but readers building on AWS, GCP, or Azure will not find a roadmap to the managed services that implement many of these patterns. Building Machine Learning Pipelines (Hapke & Nelson) covers the TFX and Kubeflow side; Machine Learning Engineering in Action (Wilson) covers more code.
- Uneven balance between modeling and MLOps. A Polish reviewer on Goodreads put it bluntly: the book is "more for the person doing modeling, neglecting MLOps engineer topics." Burkov is more confident in the modeling-adjacent chapters (problem framing, statistics) than in the pure-infrastructure chapters (CI/CD for ML, infrastructure-as-code for pipelines).
- Writing style is functional, not literary. Several reviewers report that the prose can feel like a checklist in places. This is a feature for some readers and a weakness for others.
- No diagrams of real architectures. A few well-drawn reference architectures (e.g. "an online recommender system with a feature store, model registry, and monitoring pipeline") would have made the production chapters more concrete. The Mermaid diagrams in this BookAtlas entry fill some of that gap.
- Examples skew toward tabular business data. Readers building computer vision or NLP systems will find the data-engineering and labeling chapters less directly applicable than readers working on classical ML problems.
- No exercises or worked examples. The book is a reading experience, not a course. There is no problem set, no end-of-chapter projects, no way to test understanding without applying it at work.
Criticism
The "Surface Area" Critique
The most common negative review: the book is a survey, not a textbook. A reader expecting deep treatment of feature stores, drift detection, or model serving will find each topic covered in 10–20 pages. The book is honest about this — it is a map, not a territory. Readers who need depth should plan a follow-up reading list per chapter.
The "Modeling Bias" Critique
A nontrivial subset of Goodreads reviewers note that the book leans toward the perspective of a data scientist transitioning to engineering, not an engineer transitioning to ML. The MLOps chapters (CI/CD, IaC, container orchestration) are thinner than the modeling-adjacent chapters (problem framing, statistics, evaluation). This is partly a function of Burkov's own background (PhD in AI, team lead at Gartner) and partly a function of the field's center of gravity in 2020.
The "Where Is the Code?" Critique
Reviews that dock the book a star or two consistently mention the absence of runnable code. Hands-On Machine Learning succeeds partly because readers can clone a repository and run every example. Burkov's book does not offer this, by design — but the design choice is a weakness for some readers.
The "Already Known" Critique
A few reviewers who have shipped production ML systems report that the book contains little they did not already know. The book is more useful as a consolidation of best practices than as a source of novel techniques. For this audience, Chip Huyen's book has more to offer.
Comparison to Similar Books
| Book | Key Difference |
|---|---|
| The Hundred-Page Machine Learning Book (Burkov) | The modeling companion. Same author, same style, half the length. Read both. |
| Designing Machine Learning Systems (Huyen) | Deeper on data engineering, deployment, and MLOps; longer; more diagrams. The natural follow-up read. |
| Hands-On Machine Learning (Géron) | Code-first, framework-specific (Scikit-Learn, Keras, TF). The natural prerequisite read. |
| Building Machine Learning Pipelines (Hapke & Nelson) | Hands-on TFX and Kubeflow. Covers the implementation of the production phase this book describes. |
| Machine Learning Engineering in Action (Wilson) | A code-heavy alternative, with a longer treatment of the production phase. |
| Reliable Machine Learning (Underwood et al.) | Goes much deeper on the reliability and monitoring chapters. |
| Making Friends with Machine Learning (Kozyrkov) | Google's internal applied ML course. The structural inspiration for this book, free online. |
Historical Context
The book appeared at an inflection point. By September 2020, the industry had deployed thousands of models and was waking up to the operational reality: most models degraded within months, most data pipelines broke more often than the models themselves, and most teams had no formal process for any of it. The "MLOps" category was just being named (the first MLOps conferences and tooling startups were 2019–2020).
Burkov's contribution was to consolidate what working ML engineers already knew into a single ordered, accessible volume and to publish it under a model (read-first-pay-later) that ensured it would reach the practitioners who needed it most. The book's structure is widely echoed in subsequent training programs, course curricula, and team rituals.
Five years later, the book's prescriptions are still uncontroversial: framing before modeling, baselines before models, statistics before claims, production before celebration, reliability before scaling. Few books from 2020 in any technical field have aged this well.
Scientific Grounding
| Concept | Source | Application |
|---|---|---|
| Decision intelligence | Cassie Kozyrkov (Google) | Framing ML as decision support, not autonomous prediction |
| Data-centric AI | Andrew Ng (Stanford, 2021) | Investing in data quality over model complexity |
| CRISP-DM | Shearer (2000) | The Cross-Industry Standard Process for Data Mining, an earlier lifecycle model that influences the book |
| TDSP (Team Data Science Process) | Microsoft (2017) | Another lifecycle model with overlapping phases |
| Statistical significance testing | Fisher, Neyman-Pearson (20th c.) | Deciding whether an observed improvement is real |
| Site Reliability Engineering | Google SRE book (Beyer, Jones, Petoff, Murphy, 2016) | The reliability philosophy applied to ML systems |
| A/B testing | Kohavi, Crook, Longbotham (2009) | Online controlled experiments at scale |
| Active learning | Settles (2010) | Reducing labeling cost by intelligent sample selection |
| Data drift / concept drift | Widmer, Kubat (1996); Tsymbal (2004) | Formalizing the failure modes the reliability chapter addresses |
Final Assessment
| Dimension | Rating | Notes |
|---|---|---|
| Practical Utility | 8/10 | Directly applicable; would be 9 with more code |
| Breadth | 9/10 | The only single-book treatment of the full ML lifecycle |
| Depth | 6/10 | Intentionally shallow per phase; pair with Huyen |
| Clarity of structure | 10/10 | The lifecycle ordering is the book |
| Timeliness (2020 → 2026) | 8/10 | The prescriptions have aged remarkably well; only the tooling examples feel dated |
| Code density | 3/10 | A few pseudocode snippets; no runnable examples |
| Readability | 8/10 | Functional prose; some reviewers find it dry |
| Overall | 8/10 | The best concise end-to-end guide to ML engineering, and the natural complement to the author's Hundred-Page book |
Machine Learning Engineering fills a niche that no other single book fills: a concise, ordered, engineering-first map of the ML project lifecycle. It is not the deepest book in any single phase, but it is the only one you can hand to a new ML engineer and have them understand the full shape of the work in a week. Read it for the map; read Huyen, Géron, and the SRE book for the territory.
narration
Introduction
Welcome to BookAtlas. Today: Machine Learning Engineering by Andriy Burkov. Published September 2020 by True Positive Inc., the author's own publishing imprint, distributed through Leanpub under a read-first-pay-later model. 310 pages. Endorsed by Cassie Kozyrkov, Google's Chief Decision Scientist, and Karolis Urbonas, Head of Machine Learning at Amazon. 4.32 stars on Goodreads across 135 ratings.
This is the book that tells you what happens after you have trained the model. We have two perspectives. On one side, a senior ML engineer who has shipped a dozen models to production. On the other, a data scientist who has spent years in notebooks and is finally being asked to put something live. Let us get into it.
Why This Book Exists
ML Engineer: Most ML books stop at the point where the work actually gets hard. They teach you gradient descent, backpropagation, transformer architectures. They show you a Jupyter notebook with a clean dataset and a 95% accuracy score. Then you go to work and discover that getting the data clean takes three months, the model degrades in production within weeks, and nobody knows whose job it is to fix any of it.
Data Scientist: That has been my experience exactly. I have a stack of ML books on my desk. None of them prepared me for the question my manager actually asks: "Why is the model wrong?"
ML Engineer: That is the gap Burkov is filling. He is not trying to teach you machine learning. He assumes you already know that. He is teaching you machine learning engineering — the work of taking a model from a notebook to a system that runs at scale, in production, without silently breaking.
The Structure: Order Matters
Data Scientist: The first thing I noticed is that the book is organized by project phase, not by algorithm. Phase 1: problem framing. Phase 2: domain expertise. Phase 3: data engineering. Phase 4: prototyping. Phase 5: statistics. Phase 6: production. Phase 7: reliability.
ML Engineer: Yes. And Cassie Kozyrkov's foreword says this is the most important thing to learn from the book — not the content of the chapters, but the order. She says if you internalize the table of contents, you will save yourself from the most common failure mode in ML: doing steps out of order.
Data Scientist: What does "out of order" look like?
ML Engineer: Picking a model before you have clean data. Deploying before you have a baseline. Optimizing accuracy before you have defined success in business terms. Every ML team has done at least one of these. Burkov's argument is that if you follow the seven phases in sequence, you avoid most of the common ways these projects die.
Phase 1: Problem Framing
ML Engineer: The first chapter is the one I wish every product manager would read. Burkov spends it on a question most ML books ignore: should we use ML at all?
Data Scientist: I have been in kickoff meetings where the answer is "of course we use ML — that is why we are building this." The framing has already been done before the team is hired.
ML Engineer: That is exactly the failure mode. Burkov lists the cases where ML is the wrong tool: when every decision must be explainable, when the cost of an error is unacceptably high, when the problem can be solved with a simple rule. He is not anti-ML. He is anti-ML-by-default.
Data Scientist: And the cases where ML is right?
ML Engineer: Perceptive problems (image, speech, video), unstudied phenomena with observable examples, problems with simple objectives and lots of data. Notice that none of these include "we have a great new transformer model and we want to use it." The problem picks the technique, not the other way around.
Phase 2: Domain Expertise
Data Scientist: This is the chapter I did not expect from a book on engineering. Burkov spends a whole chapter on domain expertise. What does the customer actually need? What are the edge cases the test set will miss?
ML Engineer: This is the most underrated chapter. The thesis is that the modeler with weak domain knowledge will build a model that wins benchmarks and loses customers. The modeler with strong domain knowledge will at least solve the right problem. Domain knowledge is the hardest thing to hire for and the hardest thing to fake.
Data Scientist: I will confess: I have built models where the features were technically correct and the predictions were operationally useless. Because I did not understand what the downstream user actually needed.
ML Engineer: Exactly. The book's argument is that every model sits inside a domain with its own jargon, constraints, and history. The ML engineer's job is to translate that domain into features, labels, and success criteria. You cannot delegate that to a model.
Phase 3: Data Engineering
This is the longest section, and for good reason.
flowchart LR
A["User Actions"] --> C["Collection"]
B["Application Logs"] --> C
C --> V["Validation"]
V --> L["Labeling"]
L --> S["Storage"]
S --> M["Model Training"]
Data Scientist: Burkov has a quote I want to put on my wall: "The greatest challenges must be solved before you type from sklearn.linear_model import LogisticRegression, and the rest of the problem is solved after you type model.fit(X, y)."
ML Engineer: That is the book in one sentence. The challenge is the data — getting it, cleaning it, labeling it, keeping it versioned, ensuring that the features at training time match the features at serving time. That last one — train/serve skew — is, in my experience, the single most common cause of production model failure.
Data Scientist: What does skew look like in practice?
ML Engineer: You train on data from January through June. The model learns that "user_age" is a useful feature. By October, your production feature pipeline is passing through ages as strings, or as zero-padded integers, or with a slightly different definition of "age." The model still receives something called "user_age" but it is not the same distribution it was trained on. The predictions degrade. The team spends two weeks debugging. The root cause was a feature pipeline change nobody documented.
Data Scientist: And Burkov's prescription?
ML Engineer: Versioned feature pipelines, point-in-time correctness, ideally a feature store. The book does not insist on any specific tool, but it is firm on the discipline: if you cannot reproduce the data the model was trained on, you cannot debug, audit, or roll back the model.
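The "user_age" scenario above can be caught mechanically. The following is a minimal sketch, not something taken from the book: it assumes the team records a small schema when the training set is built and checks every serving-time row against it. The names `FeatureSpec`, `TRAINING_SCHEMA`, and `find_skew`, and the 0 to 120 range, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSpec:
    """Expected type and value range recorded at training time (hypothetical schema)."""
    dtype: type
    min_value: float
    max_value: float


# Hypothetical schema captured when the training set was built.
TRAINING_SCHEMA = {"user_age": FeatureSpec(dtype=int, min_value=0, max_value=120)}


def find_skew(row: dict, schema: dict) -> list:
    """Return a list of human-readable problems for one serving-time feature row."""
    problems = []
    for name, spec in schema.items():
        if name not in row:
            problems.append(f"{name}: missing at serving time")
            continue
        value = row[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, spec.dtype):
            problems.append(
                f"{name}: expected {spec.dtype.__name__}, got {type(value).__name__}"
            )
            continue
        if not spec.min_value <= value <= spec.max_value:
            problems.append(
                f"{name}: value {value} outside [{spec.min_value}, {spec.max_value}]"
            )
    return problems


# An age that arrives as a zero-padded string is flagged before it reaches the model.
print(find_skew({"user_age": "034"}, TRAINING_SCHEMA))
```

A check like this only covers types and ranges. A changed definition of "age" that stays within range would pass, which is one reason the discipline of versioning the feature pipeline itself still matters.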
Phase 4: Prototyping
Data Scientist: This is the part I am good at. Try a baseline, then a simple model, then a complex model. Look at the errors. Iterate.
ML Engineer: The book's contribution here is the order. Start with a heuristic. If a model cannot beat the heuristic, the project is not ready for ML. Start with linear models. If logistic regression gets within 2% of the deep network's accuracy, ship the logistic regression.
Data Scientist: Why?
ML Engineer: Because logistic regression is interpretable, debuggable, cheap to serve, and easy to retire. A deep network that is 2% more accurate on your test set might be 10x more expensive to serve, 10x harder to debug, and impossible to explain to a regulator. The book's anti-fashion stance is one of its best features.
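The selection rule the two speakers describe can be written down as a small function. This is an illustrative sketch under stated assumptions: "within 2%" is read as two percentage points of accuracy, candidates are listed simplest first, and the function name `pick_model` and the accuracy figures are made up for the example.

```python
def pick_model(candidates, heuristic_accuracy, tolerance=0.02):
    """Pick the simplest candidate whose accuracy is within `tolerance` of the best.

    candidates: list of (name, accuracy) pairs ordered from simplest to most complex.
    Returns None when no candidate beats the heuristic baseline, which the
    narration treats as a sign the project is not ready for ML.
    """
    viable = [(name, acc) for name, acc in candidates if acc > heuristic_accuracy]
    if not viable:
        return None
    best = max(acc for _, acc in viable)
    for name, acc in viable:  # simplest first
        if best - acc <= tolerance:
            return name
    return None  # unreachable: the best candidate always satisfies the check


# Hypothetical accuracies, for illustration only.
choice = pick_model(
    [("logistic_regression", 0.90), ("deep_network", 0.91)],
    heuristic_accuracy=0.80,
)
print(choice)
```

Here the deep network is one point better, so the simpler model is chosen. Accuracy is used only because the narration uses it; the same rule applies to whatever metric the team agreed on during problem framing.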
Phase 5: Statistics
Data Scientist: I will admit I learned things in this chapter. Burkov covers significance testing — paired t-test, McNemar's test for classifiers, bootstrap confidence intervals. The book is firm: a model that beats the baseline on a single test set has not been proven better. You need to show the improvement is larger than what random variation would produce.
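To make the McNemar idea concrete, here is a minimal sketch of the exact (binomial) form of the test using only the standard library. It is one common variant and not necessarily the formulation the book presents. The test looks only at examples where the two classifiers disagree: under the null hypothesis of equal error rates, each disagreement is equally likely to favor either model. The function name `mcnemar_exact` and the toy labels are assumptions.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts examples only model A gets right
    and c counts examples only model B gets right.
    """
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree, so there is no evidence either way
    # P(X <= min(b, c)) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data: model A is right on all ten examples, model B on none of them.
y_true = [1] * 10
b, c, p_value = mcnemar_exact(y_true, pred_a=[1] * 10, pred_b=[0] * 10)
print(b, c, p_value)
```

A small p-value says the difference is unlikely to be chance. It does not say the difference is large enough to matter.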
ML Engineer: And effect size, not just significance. A 0.1% accuracy improvement can be statistically significant at scale but operationally meaningless. The book insists on translating statistical results into business terms. A 0.5% lift in click-through rate at a billion impressions per day is worth more than a 5% lift at a hundred thousand.
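The click-through comparison is easy to verify with a few lines of arithmetic. The sketch below assumes both lifts are relative and that both sites share the same baseline click-through rate; the 2% baseline is an assumption for illustration and does not come from the book.

```python
def extra_clicks_per_day(impressions: int, baseline_ctr: float, relative_lift: float) -> float:
    """Additional clicks per day produced by a relative lift in click-through rate."""
    return impressions * baseline_ctr * relative_lift


BASELINE_CTR = 0.02  # assumed for illustration; not a figure from the book

large_site = extra_clicks_per_day(1_000_000_000, BASELINE_CTR, 0.005)
small_site = extra_clicks_per_day(100_000, BASELINE_CTR, 0.05)

print(f"large site: {large_site:,.0f} extra clicks/day")
print(f"small site: {small_site:,.0f} extra clicks/day")
```

The baseline cancels out of the comparison: with any shared baseline, the 0.5% lift at a billion impressions yields 1,000 times as many additional clicks as the 5% lift at a hundred thousand.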
Data Scientist: He also covers A/B testing in the production section, which most engineering books either skip or do badly.
ML Engineer: Yes, and his framing is sharp: A/B testing is a measurement tool, not a deployment tool. Use it to learn. Use canary to ship.
Phase 6: Production
flowchart TB
O["Online / Real-time"] --> S["REST, gRPC, streaming"]
B["Batch"] --> J["Scheduled jobs"]
E["Edge"] --> D["On-device"]
S --> DEP["Deployment Strategies"]
J --> DEP
D --> DEP
DEP --> SH["Shadow"]
DEP --> CA["Canary"]
DEP --> BG["Blue-Green"]
DEP --> AB["A/B"]
Data Scientist: This is where I have the most to learn. Serving patterns, deployment strategies, infrastructure. The book covers online vs. batch vs. streaming vs. edge, and the trade-offs between them.
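As one illustration of a canary deployment, the sketch below routes a fixed percentage of users to the new model by hashing the user ID. This is a common pattern and an assumption here, not a description of the book's own example; the function name `route` and the ID format are hypothetical.

```python
import hashlib


def route(user_id: str, canary_percent: int) -> str:
    """Deterministically assign a user to the 'canary' or 'stable' model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "canary" if bucket < canary_percent else "stable"


# The same user always lands on the same side for a given percentage.
print(route("user-42", canary_percent=5))
```

Hashing rather than random sampling keeps each user's experience stable across requests, and raising `canary_percent` only ever adds users to the canary group, so a rollout can be widened step by step.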
ML Engineer: And reproducibility. Every artifact in the production pipeline must be versioned: code, data, model, environment, configuration. If you cannot rebuild the model that is running in production right now, you cannot debug it, audit it, or roll it back. This is the discipline that separates a research notebook from a production system.
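One lightweight way to practice this discipline is to write a manifest alongside every trained model. The sketch below is an assumption about how that might look, not a tool the book prescribes: it fingerprints the data and model bytes and derives a short ID from the whole record. The field names, the commit string, and the byte payloads are hypothetical placeholders.

```python
import hashlib
import json


def fingerprint(payload: bytes) -> str:
    """SHA-256 hex digest of a byte payload."""
    return hashlib.sha256(payload).hexdigest()


def build_manifest(code_version, data_bytes, model_bytes, environment, config):
    """Record everything needed to identify one trained model."""
    manifest = {
        "code": code_version,
        "data_sha256": fingerprint(data_bytes),
        "model_sha256": fingerprint(model_bytes),
        "environment": environment,
        "config": config,
    }
    # Serialize with sorted keys so the same inputs always give the same ID.
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["manifest_id"] = fingerprint(canonical)[:12]
    return manifest


# Hypothetical inputs; in practice these would be read from files and the VCS.
manifest = build_manifest(
    code_version="abc1234",
    data_bytes=b"user_age,label\n34,1\n",
    model_bytes=b"serialized-model",
    environment={"python": "3.11"},
    config={"C": 1.0},
)
print(manifest["manifest_id"])
```

If any of the five artifacts changes, the ID changes with it, which gives the team a single value to log next to every prediction.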
Phase 7: Reliability
ML Engineer: This is the strongest section. Burkov treats the ML system the way Site Reliability Engineering treats any other production service: assume it will fail, design for detection and recovery.
Data Scientist: What does that look like for ML?
ML Engineer: Monitoring for data drift, concept drift, and adversarial inputs. Fallback strategies when the model is wrong: default to a heuristic, default to a simpler model, refuse to predict, route to a human. Graceful degradation when the system is unavailable. The book is unusually strong on adversarial considerations — users who intentionally craft inputs to make the model misbehave.
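The fallback chain the ML Engineer lists can be sketched as a thin wrapper around the model call. This is a minimal illustration under stated assumptions: the model is any callable returning a label and a confidence, the 0.7 threshold is arbitrary, and the names `predict_with_fallback`, `rule`, and the fraud labels are hypothetical.

```python
def predict_with_fallback(features, model, heuristic, min_confidence=0.7):
    """Return (label, source), degrading gracefully when the model cannot be trusted.

    model: callable taking features and returning (label, confidence).
    heuristic: callable taking features and returning a label.
    """
    # Refuse to predict on incomplete input and route the case to a person.
    if features is None or any(value is None for value in features.values()):
        return None, "human_review"
    try:
        label, confidence = model(features)
    except Exception:
        # The model is unavailable or crashed: fall back to the heuristic.
        return heuristic(features), "heuristic"
    if confidence >= min_confidence:
        return label, "model"
    # The model answered but is not confident enough to be trusted.
    return heuristic(features), "heuristic"


# Hypothetical model and rule for illustration.
def unsure_model(features):
    return "fraud", 0.40


def rule(features):
    return "fraud" if features["amount"] > 10_000 else "not_fraud"


print(predict_with_fallback({"amount": 25.0}, unsure_model, rule))
```

Returning the source alongside the label makes the fallback rate itself something the team can monitor: a sudden rise in "heuristic" or "human_review" responses is an early signal that the model or its inputs have changed.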
Data Scientist: The closing argument is "mistakes are inevitable; hope is not a strategy." It is a remarkable closing line for a 2020 ML book.
ML Engineer: It is the line I remember. And the book earns it by working through every failure mode in detail rather than waving at them.
The Verdict
Data Scientist: I came in expecting a checklist. I got something more useful: a map. The book does not go deep on any single phase, but it gives me the shape of the whole work. I know what I do not know, and I know what book to read next for each phase.
ML Engineer: The book's greatest achievement is the ordering. Reading it before your first production model will save you from most of the common mistakes. Reading it after your fifth production model will be a useful consolidation even if the specifics feel familiar.
Data Scientist: I will pair it with the author's other book, The Hundred-Page Machine Learning Book, on the modeling side, and with Chip Huyen's Designing Machine Learning Systems for the data and deployment depth.
ML Engineer: That is exactly the right reading list. Burkov for the map, Huyen for the territory, Géron for the code.
The book's limitations are real: it is code-light, favors breadth over depth, and has no companion repository. But for a 310-page book published under a read-first-pay-later model, it punches well above its weight. It is the book I would hand to a new ML engineer on their first day.
Final Thoughts
Machine Learning Engineering is the book that closes the loop between ML research and ML operations. It is not the deepest book in any single phase, but it is the only one you can hand to a new ML engineer and have them understand the full shape of the work in a week.
Five years after publication, its prescriptions are still the industry standard: framing before modeling, baselines before models, statistics before claims, production before celebration, reliability before scaling. Few 2020 technical books have aged this well.
Rating: 8/10 — The best concise end-to-end guide to ML engineering, and the natural second book after Burkov's own Hundred-Page Machine Learning Book.
This has been a BookAtlas narration of Machine Learning Engineering by Andriy Burkov. Thanks for listening.