Pattern Recognition and Machine Learning

Christopher M. Bishop · 738pp

sufficient

reading path: overview → analysis → narration

overview

Pattern Recognition and Machine Learning

Christopher M. Bishop · Springer, 2007 · ISBN 9780387310732 · 738 pp.

Chapter Map

flowchart TD
    C1["Ch 1: Introduction"] --> C2
    C2["Ch 2: Probability Distributions"] --> C3
    C3["Ch 3: Linear Models for Regression"] --> C4
    C4["Ch 4: Linear Models for Classification"] --> C5
    C5["Ch 5: Neural Networks"] --> C6
    C6["Ch 6: Kernel Methods"] --> C7
    C7["Ch 7: Sparse Kernel Machines"] --> C8
    C8["Ch 8: Graphical Models"] --> C9
    C9["Ch 9: Mixture Models & EM"] --> C10
    C10["Ch 10: Approximate Inference"] --> C11
    C11["Ch 11: Sampling Methods"] --> C12
    C12["Ch 12: Continuous Latent Variables"] --> C13
    C13["Ch 13: Sequential Data"]

    C1 --> |"motivation, Bayes basics"| C2
    C2 --> |"exp. family, Gaussian"| C3
    C3 --> C4
    C4 --> |"logistic regression, probit"| C5
    C5 --> |"backprop, error surfaces"| C6
    C6 --> |"kernel trick, dual rep."| C7
    C7 --> |"SVM, RVM"| C8
    C8 --> |"factor graphs, d-sep."| C9
    C9 --> |"Gaussian mixture, EM"| C10
    C10 --> |"variational inference"| C11
    C11 --> |"MCMC, Gibbs, slice"| C12
    C12 --> |"PPCA, ICA, FA"| C13
    C13 --> |"HMM, Kalman, particle"| END["END"]

Prerequisites

| Topic | Requirement | |-------|-------------| | Linear algebra | Essential — matrix calculus used throughout | | Multivariate calculus | Essential — gradients of log-likelihoods | | Probability theory | Essential — Bayes' theorem, distributions, expectation | | Basic statistics | Helpful — MLE, sufficient statistics | | Programming | Helpful — exercises reference MATLAB/Octave |

Key Concepts

mindmap
  root((PRML))
    Probability
      Exponential Family
      Gaussian Distribution
      Beta & Gamma
      Conjugate Priors
    Bayesian Inference
      Prior Likelihood Posterior
      Posterior Over Parameters
      Predictive Distribution
      Evidence Approximation
    Linear Models
      Basis Functions
      Regularized LS
      Bayesian Linear Reg
      Logistic Regression
    Neural Networks
      Activation Functions
      Backpropagation
      Cross-Entropy Error
      Regularization
    Kernel Methods
      Kernel Trick
      Dual Representation
      Gaussian Processes
      Kernel PCA
    Graphical Models
      Directed Graphs
      Undirected Graphs
      Factor Graphs
      d-Separation
    Mixture Models
      Gaussian Mixture
      EM Algorithm
      Component Analysis
    Sampling & Approx
      MCMC
      Gibbs Sampling
      Variational Inference
    Sequential Data
      Hidden Markov Models
      Kalman Filter
      Particle Filter

content map

Chapter 01 — Introduction

journey
  title PRML Chapter 1: From Classical to Bayesian
  section Classical ML
    Discriminant Functions: 3: Classical ML
    Decision Boundaries: 4: Classical ML
    Probabilistic Approach: 5: Bridge
  section Bayesian Viewpoint
    Prior Probability: 6: Bayesian
    Likelihood: 7: Bayesian
    Posterior: 8: Bayesian
  section Why Bayesian
    Uncertainty Quantification: 9: Benefit
    Model Comparison: 10: Benefit

1.1 Example: Polynomial Curve Fitting

The book opens with the polynomial curve fitting problem as a running motivator. Given a dataset {(x_n, t_n)}, fit a polynomial y(x, w) = sum(w_j * x^j) of degree M. The squared-error loss leads to the least-squares solution, revealing overfitting (high M) and underfitting (low M).

1.2 Probability Theory as a Unifying Framework

Bishop argues that probability theory — specifically Bayesian probability — is the correct framework for handling uncertainty in machine learning.

| Approach | Characteristic | |----------|---------------| | Frequentist | Parameters are fixed; data varies | | Bayesian | Parameters are random variables; express uncertainty via distributions |

The Bayesian formula:

P(w | D) = P(D | w) * P(w) / P(D)
           ↑            ↑       ↑
      likelihood   prior   evidence

1.3 Decision Theory

Decision theory connects probabilities to actions:

Inference: compute P(C_k | x) — posterior class probabilities
Decision: choose class that minimizes expected loss

The reject option is introduced for cautious classification, and loss matrices formalize asymmetric misclassification costs.

1.4 Information Theory Overview

The book defines entropy, cross-entropy, and Kullback-Leibler divergence as measures of information:

KL(q || p) = ∫ q(x) ln(q(x)/p(x)) dx

Cross-entropy links to log-loss (-ln P(t | x)) — the first connection between information theory and training objectives later used heavily in neural networks.

1.5 Exercises (Selected)

| # | Topic | |---|-------| | Ex 1.1 | Sum of two dice probability | | Ex 1.2 | Beta distribution mean/variance derivation | | Ex 1.3 | Loss matrix plotting |

Chapter 02 — Probability Distributions

flowchart LR
    B["Bernoulli"] --> Bin["Binomial"]
    Bin --> Beta["Beta<br/>(conjugate prior)"]
    M["Multinomial"] --> Dir["Dirichlet"]
    G["Gaussian"] --> NG["Normal-Gamma<br/>(conjugate prior)"]
    E["Exponential<br/>Family"] --> MG["Sufficient Stats"]
    MG --> MLE["MLE Properties"]
    C["Central Limit<br/>Theorem"] --> G

2.1 Binary Variables — Bernoulli and Beta

The Bernoulli distribution models a single binary trial:

Bern(x | μ) = μ^x * (1-μ)^(1-x)

Its conjugate prior is the Beta distribution:

Beta(μ | a, b) = Gamma(a+b)/(Gamma(a)Gamma(b)) * μ^(a-1) * (1-μ)^(b-1)

2.2 The Exponential Family

The exponential family unifies most common distributions:

p(x | η) = h(x) * exp[η^T * T(x) - A(η)]

where η is the natural parameter, T(x) the sufficient statistic, and A(η) the log-partition function.

Bernoulli, binomial, Gaussian, gamma, and Dirichlet are all members.

2.3 Multinomial and Dirichlet

The multinomial distribution generalizes Bernoulli to K categories. Its conjugate prior is the Dirichlet:

Dir(μ | α) = Gamma(Σα_k)/(Π Gamma(α_k)) * Π μ_k^(α_k-1)

2.4 The Gaussian Distribution

The univariate Gaussian:

N(x | μ, σ²) = 1/√(2πσ²) * exp(-(x-μ)²/(2σ²))

Key properties:

Sum of independent Gaussians is Gaussian (convolution)
Marginal and conditional of joint Gaussian are Gaussian (closure properties)

Also covers the Central Limit Theorem — explains why Gaussian distributions appear so frequently in nature.

2.5 Conjugate Priors for the Gaussian

sequenceDiagram
  participant Data
  participant Prior
  participant Posterior
  participant Predictive

  Prior->>Posterior: N(μ₀, σ₀²)
  Note over Posterior: Combine with data (N, x̄, σ²)
  Posterior->>Predictive: N(μ_N, σ_N²)
  Note over Predictive: Student-t distribution

Known variance: Gaussian prior → Gaussian posterior
Unknown mean & variance: Normal-Gamma prior
Predictive distribution: Student-t distribution (heavier tails than Gaussian)

2.6 Nonparametric Methods

Brief introduction to kernel density estimation (Parzen windows) and nearest-neighbor methods as alternatives to fully parametric modeling.

2.7 Exercises (Selected)

| # | Topic | |---|-------| | Ex 2.1 | Bernoulli MLE and MAP | | Ex 2.12 | SU-L distribution derivation | | Ex 2.27 | Student-t distribution as infinite mixture |

analysis

Chapter 03 — Linear Models for Regression

flowchart TD
    BF["Basis Functions<br/>(φ_j(x))"] --> LM["Linear Model<br/>y(x,w) = Σ w_j φ_j(x)"]
    LM --> Loss["Least Squares Loss<br/>E_D = ½Σ(t-y)²"]
    Loss --> MLE["MLE: w_ML = (ΦᵀΦ)⁻¹Φᵀt"]
    MLE --> Over["Overfitting Risk"]
    Reg["Regularization<br/>λ||w||²"] --> MAP["MAP Estimate"]
    MAP --> BLR["Bayesian Linear Regression"]
    BLR --> Pred["Predictive Posterior<br/>Student-t"]

3.1 Linear Basis Function Models

A general linear model applies fixed nonlinear basis functions φ_j(x) to the input:

y(x, w) = Σ_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x)

Common basis functions:

| Basis | Formula | |-------|---------| | Polynomial | φ_j(x) = x^j | | Gaussian | φ_j(x) = exp(-(x-μ_j)²/(2s²)) | | Sigmoidal | φ_j(x) = σ(x - μ_j) | | Fourier | φ_j(x) = sin/cos(ω_j x) |

3.2 Maximum Likelihood and Least Squares

Assuming Gaussian noise N(t | y(x,w), β⁻¹):

E(w) = ½ Σ (t_n - w^T φ_n)²
w_ML = (Φ^T Φ)^(-1) Φ^T t

Bias-variance dilemma: as M (basis functions) increases, variance of w_ML increases while bias decreases.

3.3 Regularized Least Squares (Ridge Regression)

Adding a quadratic regularizer (equivalent to Gaussian prior on w):

E(w) = ½ ||t - Φw||² + (λ/2) ||w||²
w_MAP = (λI + Φ^T Φ)^(-1) Φ^T t

This is ridge regression. The parameter λ controls model complexity. Bishop also discusses automatic relevance determination (ARD) as a Bayesian alternative to choosing λ by cross-validation.

3.4 Bayesian Linear Regression

flowchart LR
    Prior["Prior over w:<br/>N(m_N, S_N)"] --> Post["Posterior:<br/>N(m_N, S_N)"]
    Data["Data D = {X, t}"] --> Post
    Post --> Pred["Predictive:<br/>Student-t"]
    Sub1["m_N = β S_N Φ^T t<br/>S_N = αI + βΦ^TΦ"] --> Post
    Sub2["α = prior precision<br/>β = noise precision"] --> Sub1

The Bayesian approach places a prior over w (usually zero-mean Gaussian with precision α), then computes the posterior p(w | D). The predictive distribution for a new input x integrates over w:

p(t | x, D) = ∫ p(t | x, w) p(w | D) dw = St(t | μ, σ², ν)

The predictive distribution is a Student-t distribution — robust to outliers.

3.5 Evidence Framework

The evidence function p(D | α, β) is maximized to set hyperparameters α and β, without cross-validation:

ln p(D | α, β) = -E_D - E_W + const

where E_D is the data-fit term and E_W is the complexity penalty controlled by the effective number of parameters γ.

Effective number of parameters: γ = Σ_i λ_i / (λ_i + α), where λ_i are eigenvalues of β Φ^T Φ.

3.6 Limitations and Extensions

Linear in parameters, but nonlinear via basis functions
Scaling to large M requires iterative methods (N-gradient)
Sparse Bayesian Learning (Tipping 2001) automatically drives some α_j → ∞ to prune irrelevant basis functions

3.7 Exercises (Selected)

| # | Topic | |---|-------| | Ex 3.1 | Polynomial curve fitting by hand | | Ex 3.5 | Effect of basis function width | | Ex 3.10 | Bayesian linear regression derivation |

narration

Chapter 04 — Linear Models for Classification

flowchart TD
    DD["Discriminant Functions"] --> DF["g_k(x) = w_k^T x + w_k0"]
    Prob["Probabilistic Approach"] --> LR["Logistic Regression"]
    LR --> Binary["Binary: σ(w^T φ(x))"]
    LR --> Multi["Multi-class: softmax p(C_k|x)=exp(a_k)/Σexp(a_j)"]
    LR --> Probit["Probit Regression"]
    CE["Cross-Entropy Error"] --> Gradient["IRLS = Newton-Raphson"]
    Prob --> Disc["Generative Discriminants"]
    Disc --> QDA["QDA: full Σ_k"]
    Disc --> LDA["LDA: shared Σ"]
    Disc --> Naive["Naive Bayes"]

4.1 Discriminant Functions

Linear discriminant: g_k(x) = w_k^T x + w_k0. The decision boundary g_k(x) = g_j(x) is a hyperplane. For K classes, at most K(K-1)/2 pairwise boundaries.

The least-squares classification approach (mapping class labels to one-hot targets) is shown to be suboptimal for classification because it is sensitive to outliers and violates class distribution assumptions.

4.2 Probabilistic Generative Models

For generative models, model p(x | C_k) and p(C_k), then apply Bayes' theorem:

p(C_k | x) = p(x | C_k) * p(C_k) / p(x)

For Gaussian class-conditional densities:

| Model | Shared Σ? | Decision Boundary | |-------|-----------|-------------------| | QDA | No | Quadratic | | LDA | Yes | Linear | | Naive Bayes | Diagonal Σ | Axis-aligned |

4.3 Logistic Regression (Discriminative)

No explicit model of p(x | C_k). Directly model p(C_k | x):

p(C_1 | x) = σ(w^T x) = 1/(1 + exp(-w^T x))

For K > 2 classes, use the softmax:

p(C_k | x) = exp(w_k^T x) / Σ_j exp(w_j^T x)
  = softmax_k(w^1, ..., w^K; x)

Cross-entropy error function and its gradient:

E(w) = -Σ_n Σ_k t_{nk} ln y_{nk}
∇E = Φ^T (y - t)

Iterative Reweighted Least Squares (IRLS): Newton-Raphson update where at each step a weighted least-squares problem is solved. The Hessian is H = Φ^T R Φ where R is diagonal with entries y_k(1-y_k).

4.4 Probit Regression

Replace logistic sigmoid with probit function Φ(a) = ∫_{-∞}^{a} N(u | 0,1) du. Useful for Bayesian treatment — more efficient to integrate over a latent Gaussian.

Bayesian logistic regression uses Laplace approximation. The posterior p(w | D) is approximated as N(w | w_MAP, A^(-1)) where A is the Hessian at w_MAP. No exact conjugate prior exists.

4.5 Bayesian Treatment of Generalized Linear Models

The framework generalizes:

p(t | w) = N(t | y(x, w), σ²)   →  regression
p(t | w) = Bern(t | y(x, w))     →  classification
log(y) = w^T φ(x)                →  logistic regression
Φ^(-1)(y) = w^T φ(x)             →  probit regression

For the Bayesian treatment of logistic/probit, the Laplace approximation and MCMC are used (MCMC covered in Chapter 11).

4.6 Exercises (Selected)

| # | Topic | |---|-------| | Ex 4.1 | LDA vs. logistic regression decision boundary | | Ex 4.3 | IRLS derivation | | Ex 4.5 | Probit link function properties |