Pattern Recognition and Machine Learning
sufficient
reading path: overview → analysis → narration
overview
Pattern Recognition and Machine Learning
Christopher M. Bishop · Springer, 2007 · ISBN 9780387310732 · 738 pp.
Chapter Map
flowchart TD
C1["Ch 1: Introduction"] --> C2
C2["Ch 2: Probability Distributions"] --> C3
C3["Ch 3: Linear Models for Regression"] --> C4
C4["Ch 4: Linear Models for Classification"] --> C5
C5["Ch 5: Neural Networks"] --> C6
C6["Ch 6: Kernel Methods"] --> C7
C7["Ch 7: Sparse Kernel Machines"] --> C8
C8["Ch 8: Graphical Models"] --> C9
C9["Ch 9: Mixture Models & EM"] --> C10
C10["Ch 10: Approximate Inference"] --> C11
C11["Ch 11: Sampling Methods"] --> C12
C12["Ch 12: Continuous Latent Variables"] --> C13
C13["Ch 13: Sequential Data"]
C1 --> |"motivation, Bayes basics"| C2
C2 --> |"exp. family, Gaussian"| C3
C3 --> C4
C4 --> |"logistic regression, probit"| C5
C5 --> |"backprop, error surfaces"| C6
C6 --> |"kernel trick, dual rep."| C7
C7 --> |"SVM, RVM"| C8
C8 --> |"factor graphs, d-sep."| C9
C9 --> |"Gaussian mixture, EM"| C10
C10 --> |"variational inference"| C11
C11 --> |"MCMC, Gibbs, slice"| C12
C12 --> |"PPCA, ICA, FA"| C13
C13 --> |"HMM, Kalman, particle"| END["END"]
Prerequisites
| Topic | Requirement | |-------|-------------| | Linear algebra | Essential — matrix calculus used throughout | | Multivariate calculus | Essential — gradients of log-likelihoods | | Probability theory | Essential — Bayes' theorem, distributions, expectation | | Basic statistics | Helpful — MLE, sufficient statistics | | Programming | Helpful — exercises reference MATLAB/Octave |
Key Concepts
mindmap
root((PRML))
Probability
Exponential Family
Gaussian Distribution
Beta & Gamma
Conjugate Priors
Bayesian Inference
Prior Likelihood Posterior
Posterior Over Parameters
Predictive Distribution
Evidence Approximation
Linear Models
Basis Functions
Regularized LS
Bayesian Linear Reg
Logistic Regression
Neural Networks
Activation Functions
Backpropagation
Cross-Entropy Error
Regularization
Kernel Methods
Kernel Trick
Dual Representation
Gaussian Processes
Kernel PCA
Graphical Models
Directed Graphs
Undirected Graphs
Factor Graphs
d-Separation
Mixture Models
Gaussian Mixture
EM Algorithm
Component Analysis
Sampling & Approx
MCMC
Gibbs Sampling
Variational Inference
Sequential Data
Hidden Markov Models
Kalman Filter
Particle Filter content map
Chapter 01 — Introduction
journey
title PRML Chapter 1: From Classical to Bayesian
section Classical ML
Discriminant Functions: 3: Classical ML
Decision Boundaries: 4: Classical ML
Probabilistic Approach: 5: Bridge
section Bayesian Viewpoint
Prior Probability: 6: Bayesian
Likelihood: 7: Bayesian
Posterior: 8: Bayesian
section Why Bayesian
Uncertainty Quantification: 9: Benefit
Model Comparison: 10: Benefit
1.1 Example: Polynomial Curve Fitting
The book opens with the polynomial curve fitting problem as a running motivator. Given a dataset {(x_n, t_n)}, fit a polynomial y(x, w) = sum(w_j * x^j) of degree M. The squared-error loss leads to the least-squares solution, revealing overfitting (high M) and underfitting (low M).
1.2 Probability Theory as a Unifying Framework
Bishop argues that probability theory — specifically Bayesian probability — is the correct framework for handling uncertainty in machine learning.
| Approach | Characteristic | |----------|---------------| | Frequentist | Parameters are fixed; data varies | | Bayesian | Parameters are random variables; express uncertainty via distributions |
The Bayesian formula:
P(w | D) = P(D | w) * P(w) / P(D)
↑ ↑ ↑
likelihood prior evidence
1.3 Decision Theory
Decision theory connects probabilities to actions:
- Inference: compute
P(C_k | x)— posterior class probabilities - Decision: choose class that minimizes expected loss
The reject option is introduced for cautious classification, and loss matrices formalize asymmetric misclassification costs.
1.4 Information Theory Overview
The book defines entropy, cross-entropy, and Kullback-Leibler divergence as measures of information:
KL(q || p) = ∫ q(x) ln(q(x)/p(x)) dx
Cross-entropy links to log-loss (-ln P(t | x)) — the first connection between information theory and training objectives later used heavily in neural networks.
1.5 Exercises (Selected)
| # | Topic | |---|-------| | Ex 1.1 | Sum of two dice probability | | Ex 1.2 | Beta distribution mean/variance derivation | | Ex 1.3 | Loss matrix plotting |
Chapter 02 — Probability Distributions
flowchart LR
B["Bernoulli"] --> Bin["Binomial"]
Bin --> Beta["Beta<br/>(conjugate prior)"]
M["Multinomial"] --> Dir["Dirichlet"]
G["Gaussian"] --> NG["Normal-Gamma<br/>(conjugate prior)"]
E["Exponential<br/>Family"] --> MG["Sufficient Stats"]
MG --> MLE["MLE Properties"]
C["Central Limit<br/>Theorem"] --> G
2.1 Binary Variables — Bernoulli and Beta
The Bernoulli distribution models a single binary trial:
Bern(x | μ) = μ^x * (1-μ)^(1-x)
Its conjugate prior is the Beta distribution:
Beta(μ | a, b) = Gamma(a+b)/(Gamma(a)Gamma(b)) * μ^(a-1) * (1-μ)^(b-1)
2.2 The Exponential Family
The exponential family unifies most common distributions:
p(x | η) = h(x) * exp[η^T * T(x) - A(η)]
where η is the natural parameter, T(x) the sufficient statistic, and A(η) the log-partition function.
Bernoulli, binomial, Gaussian, gamma, and Dirichlet are all members.
2.3 Multinomial and Dirichlet
The multinomial distribution generalizes Bernoulli to K categories. Its conjugate prior is the Dirichlet:
Dir(μ | α) = Gamma(Σα_k)/(Π Gamma(α_k)) * Π μ_k^(α_k-1)
2.4 The Gaussian Distribution
The univariate Gaussian:
N(x | μ, σ²) = 1/√(2πσ²) * exp(-(x-μ)²/(2σ²))
Key properties:
- Sum of independent Gaussians is Gaussian (convolution)
- Marginal and conditional of joint Gaussian are Gaussian (closure properties)
Also covers the Central Limit Theorem — explains why Gaussian distributions appear so frequently in nature.
2.5 Conjugate Priors for the Gaussian
sequenceDiagram
participant Data
participant Prior
participant Posterior
participant Predictive
Prior->>Posterior: N(μ₀, σ₀²)
Note over Posterior: Combine with data (N, x̄, σ²)
Posterior->>Predictive: N(μ_N, σ_N²)
Note over Predictive: Student-t distribution
- Known variance: Gaussian prior → Gaussian posterior
- Unknown mean & variance: Normal-Gamma prior
- Predictive distribution: Student-t distribution (heavier tails than Gaussian)
2.6 Nonparametric Methods
Brief introduction to kernel density estimation (Parzen windows) and nearest-neighbor methods as alternatives to fully parametric modeling.
2.7 Exercises (Selected)
| # | Topic | |---|-------| | Ex 2.1 | Bernoulli MLE and MAP | | Ex 2.12 | SU-L distribution derivation | | Ex 2.27 | Student-t distribution as infinite mixture |
analysis
Chapter 03 — Linear Models for Regression
flowchart TD
BF["Basis Functions<br/>(φ_j(x))"] --> LM["Linear Model<br/>y(x,w) = Σ w_j φ_j(x)"]
LM --> Loss["Least Squares Loss<br/>E_D = ½Σ(t-y)²"]
Loss --> MLE["MLE: w_ML = (ΦᵀΦ)⁻¹Φᵀt"]
MLE --> Over["Overfitting Risk"]
Reg["Regularization<br/>λ||w||²"] --> MAP["MAP Estimate"]
MAP --> BLR["Bayesian Linear Regression"]
BLR --> Pred["Predictive Posterior<br/>Student-t"]
3.1 Linear Basis Function Models
A general linear model applies fixed nonlinear basis functions φ_j(x) to the input:
y(x, w) = Σ_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x)
Common basis functions:
| Basis | Formula |
|-------|---------|
| Polynomial | φ_j(x) = x^j |
| Gaussian | φ_j(x) = exp(-(x-μ_j)²/(2s²)) |
| Sigmoidal | φ_j(x) = σ(x - μ_j) |
| Fourier | φ_j(x) = sin/cos(ω_j x) |
3.2 Maximum Likelihood and Least Squares
Assuming Gaussian noise N(t | y(x,w), β⁻¹):
E(w) = ½ Σ (t_n - w^T φ_n)²
w_ML = (Φ^T Φ)^(-1) Φ^T t
Bias-variance dilemma: as M (basis functions) increases, variance of w_ML increases while bias decreases.
3.3 Regularized Least Squares (Ridge Regression)
Adding a quadratic regularizer (equivalent to Gaussian prior on w):
E(w) = ½ ||t - Φw||² + (λ/2) ||w||²
w_MAP = (λI + Φ^T Φ)^(-1) Φ^T t
This is ridge regression. The parameter λ controls model complexity. Bishop also discusses automatic relevance determination (ARD) as a Bayesian alternative to choosing λ by cross-validation.
3.4 Bayesian Linear Regression
flowchart LR
Prior["Prior over w:<br/>N(m_N, S_N)"] --> Post["Posterior:<br/>N(m_N, S_N)"]
Data["Data D = {X, t}"] --> Post
Post --> Pred["Predictive:<br/>Student-t"]
Sub1["m_N = β S_N Φ^T t<br/>S_N = αI + βΦ^TΦ"] --> Post
Sub2["α = prior precision<br/>β = noise precision"] --> Sub1
The Bayesian approach places a prior over w (usually zero-mean Gaussian with precision α), then computes the posterior p(w | D). The predictive distribution for a new input x integrates over w:
p(t | x, D) = ∫ p(t | x, w) p(w | D) dw = St(t | μ, σ², ν)
The predictive distribution is a Student-t distribution — robust to outliers.
3.5 Evidence Framework
The evidence function p(D | α, β) is maximized to set hyperparameters α and β, without cross-validation:
ln p(D | α, β) = -E_D - E_W + const
where E_D is the data-fit term and E_W is the complexity penalty controlled by the effective number of parameters γ.
Effective number of parameters: γ = Σ_i λ_i / (λ_i + α), where λ_i are eigenvalues of β Φ^T Φ.
3.6 Limitations and Extensions
- Linear in parameters, but nonlinear via basis functions
- Scaling to large
Mrequires iterative methods (N-gradient) - Sparse Bayesian Learning (Tipping 2001) automatically drives some
α_j → ∞to prune irrelevant basis functions
3.7 Exercises (Selected)
| # | Topic | |---|-------| | Ex 3.1 | Polynomial curve fitting by hand | | Ex 3.5 | Effect of basis function width | | Ex 3.10 | Bayesian linear regression derivation |
narration
Chapter 04 — Linear Models for Classification
flowchart TD
DD["Discriminant Functions"] --> DF["g_k(x) = w_k^T x + w_k0"]
Prob["Probabilistic Approach"] --> LR["Logistic Regression"]
LR --> Binary["Binary: σ(w^T φ(x))"]
LR --> Multi["Multi-class: softmax p(C_k|x)=exp(a_k)/Σexp(a_j)"]
LR --> Probit["Probit Regression"]
CE["Cross-Entropy Error"] --> Gradient["IRLS = Newton-Raphson"]
Prob --> Disc["Generative Discriminants"]
Disc --> QDA["QDA: full Σ_k"]
Disc --> LDA["LDA: shared Σ"]
Disc --> Naive["Naive Bayes"]
4.1 Discriminant Functions
Linear discriminant: g_k(x) = w_k^T x + w_k0. The decision boundary g_k(x) = g_j(x) is a hyperplane. For K classes, at most K(K-1)/2 pairwise boundaries.
The least-squares classification approach (mapping class labels to one-hot targets) is shown to be suboptimal for classification because it is sensitive to outliers and violates class distribution assumptions.
4.2 Probabilistic Generative Models
For generative models, model p(x | C_k) and p(C_k), then apply Bayes' theorem:
p(C_k | x) = p(x | C_k) * p(C_k) / p(x)
For Gaussian class-conditional densities:
| Model | Shared Σ? | Decision Boundary | |-------|-----------|-------------------| | QDA | No | Quadratic | | LDA | Yes | Linear | | Naive Bayes | Diagonal Σ | Axis-aligned |
4.3 Logistic Regression (Discriminative)
No explicit model of p(x | C_k). Directly model p(C_k | x):
p(C_1 | x) = σ(w^T x) = 1/(1 + exp(-w^T x))
For K > 2 classes, use the softmax:
p(C_k | x) = exp(w_k^T x) / Σ_j exp(w_j^T x)
= softmax_k(w^1, ..., w^K; x)
Cross-entropy error function and its gradient:
E(w) = -Σ_n Σ_k t_{nk} ln y_{nk}
∇E = Φ^T (y - t)
Iterative Reweighted Least Squares (IRLS): Newton-Raphson update where at each step a weighted least-squares problem is solved. The Hessian is H = Φ^T R Φ where R is diagonal with entries y_k(1-y_k).
4.4 Probit Regression
Replace logistic sigmoid with probit function Φ(a) = ∫_{-∞}^{a} N(u | 0,1) du. Useful for Bayesian treatment — more efficient to integrate over a latent Gaussian.
Bayesian logistic regression uses Laplace approximation. The posterior p(w | D) is approximated as N(w | w_MAP, A^(-1)) where A is the Hessian at w_MAP. No exact conjugate prior exists.
4.5 Bayesian Treatment of Generalized Linear Models
The framework generalizes:
p(t | w) = N(t | y(x, w), σ²) → regression
p(t | w) = Bern(t | y(x, w)) → classification
log(y) = w^T φ(x) → logistic regression
Φ^(-1)(y) = w^T φ(x) → probit regression
For the Bayesian treatment of logistic/probit, the Laplace approximation and MCMC are used (MCMC covered in Chapter 11).
4.6 Exercises (Selected)
| # | Topic | |---|-------| | Ex 4.1 | LDA vs. logistic regression decision boundary | | Ex 4.3 | IRLS derivation | | Ex 4.5 | Probit link function properties |